2021 Sep 4;83(3):32. doi: 10.1007/s00285-021-01653-8

Distinguishing level-1 phylogenetic networks on the basis of data generated by Markov processes

Elizabeth Gross 1, Leo van Iersel 2, Remie Janssen 2, Mark Jones 2, Colby Long 3, Yukihiro Murakami 2
PMCID: PMC8418599  PMID: 34482446

Abstract

Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees. These networks are able to incorporate reticulate evolutionary events such as hybridization, introgression, and lateral gene transfer. Recently, network-based Markov models of DNA sequence evolution have been introduced along with model-based methods for reconstructing phylogenetic networks. For these methods to be consistent, the network parameter needs to be identifiable from data generated under the model. Here, we show that the semi-directed network parameter of a triangle-free, level-1 network model with any fixed number of reticulation vertices is generically identifiable under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints.

Keywords: Phylogenetic networks, Identifiability, Reticulation, Markov processes

Introduction

Typically, the goal of a phylogenetic analysis is to find a tree that describes the evolutionary relationships among a set of taxa. However, because trees, as directed graphs, have acyclic skeletons, they cannot represent reticulate evolutionary events, such as hybridization, introgression, and lateral gene transfer. Recognizing this limitation, it has become increasingly common to use phylogenetic networks in order to more accurately describe the history of some sets of taxa (Bapteste et al. 2013). This increasing attention to phylogenetic networks has led to many new results about the combinatorial properties of phylogenetic networks (Huson et al. 2010; Gusfield 2014; Steel 2016, Chapter 10), as well as to new methods for inferring phylogenetic networks from biological data.

Many of these new methods for inferring phylogenetic networks are based on constructing networks from small sets of inferred trees (Baroni et al. 2005; Huber et al. 2011; Nakhleh et al. 2005; Yang et al. 2014) or adapting variants of maximum parsimony and neighbor joining (Bryant and Moulton 2004; Jin et al. 2007). Several others are model-based methods that are designed to infer various features of a species network from data generated by a network multispecies coalescent model. These include, for example, the methods implemented in Phylonet (Than et al. 2008; Wen et al. 2018) as well as SNaQ (Solís-Lemus and Ané 2016; Solís-Lemus et al. 2017) and NANUQ (Allman et al. 2019). Now that network-based Markov models of DNA sequence evolution have been developed (see e.g. Nakhleh 2011, §3.3), it seems natural to use these models in order to add other model-based techniques to the set of tools for network inference. However, in order to consistently infer a parameter using a model-based approach, that parameter must be identifiable from some feature of the model. The question of parameter identifiability is significant and has been explored for several different phylogenetic models. For example, there are numerous identifiability results for tree-based Markov models (Allman et al. 2011; Allman and Rhodes 2006; Chang 1996; Rhodes and Sullivant 2012) and there are similar results for networks that provide the theoretical justification for methods such as SNaQ (Solís-Lemus et al. 2020) and NANUQ (Baños 2019) mentioned above. In this work, we explore the identifiability of the network parameter in network-based Markov models.

Formally, network-based Markov models are parameterized families of probability distributions on n-tuples of DNA bases. The parameterization is derived by modeling the process of DNA sequence evolution along an n-leaf leaf-labelled topological network, which we call the network parameter of the model. Given an n-taxa sequence alignment, a probability distribution in a network-based Markov model specifies the probability of observing each of the possible $4^n$ site-patterns at a particular site. Indeed, in a model-based approach, an n-taxa sequence alignment is usually regarded as an observation of n independent and identically distributed site-patterns. A sequence alignment can therefore be viewed as an approximation of a probability distribution, with the probability for each site-pattern being proportional to the number of times it appears in the alignment. Given a collection, or class, of network-based Markov models, the network parameter is identifiable if any expected site pattern probability distribution $p$ in the model belongs to at most one model in the class. Identifiability, as just defined, is very strong and certainly not satisfied for any reasonable collection of models. Thus, in practice, one often aims at proving that a parameter is generically identifiable. If the network parameter of a class of models is generically identifiable then a probability distribution $p$ from one of the models almost surely belongs to no other model in the class.
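The view of an alignment as an empirical distribution on site-patterns can be sketched in a few lines. This is only an illustration: the function name `site_pattern_frequencies` and the toy alignment are hypothetical and do not come from the paper.

```python
from collections import Counter
from fractions import Fraction


def site_pattern_frequencies(alignment):
    """Return the relative frequency of each observed site-pattern.

    `alignment` is a list of equal-length DNA strings, one per taxon.
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    num_sites = lengths.pop()
    # Column k of the alignment is the site-pattern observed at site k.
    columns = ["".join(seq[k] for seq in alignment) for k in range(num_sites)]
    counts = Counter(columns)
    return {pattern: Fraction(c, num_sites) for pattern, c in counts.items()}


# Hypothetical toy alignment: 3 taxa, 4 sites.
toy_alignment = ["AAGA", "AAGA", "ACGA"]
frequencies = site_pattern_frequencies(toy_alignment)
```

Patterns that never occur in the alignment are simply absent from the returned dictionary, which corresponds to an estimated probability of zero.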

The generic identifiability of the tree and network parameters of several phylogenetic models has been shown by adopting techniques from algebraic geometry (Allman et al. 2011; Gross and Long 2017; Hollering and Sullivant 2020; Long and Kubatko 2018). These results apply to several types of mixture models, network models, and multispecies coalescent models. Even though tree-based Markov models of sequence evolution are naturally defined on rooted trees, in many of these works, the tree parameter is assumed to be an unrooted tree. The reason for this is that given an expected site pattern probability distribution from a tree-based Markov model, the location of the root of the tree is not identifiable [see, for example, Sect. 8.5 in Semple and Steel (2003) or Chapter 15 in Sullivant (2018)]. Similarly, with network-based Markov models, even though we define the models on rooted networks, we will only be able to establish generic identifiability when the network parameter is assumed to be a semi-directed network. Semi-directed networks are unrooted versions of rooted networks, which retain information about which vertices are reticulation vertices (and which edges are reticulation edges). In Gross and Long (2017), algebraic techniques were used to show that the network parameter is generically identifiable when the underlying Markov process is subject to the Jukes–Cantor (JC) transition matrix constraints and the network parameter is assumed to be a semi-directed network with exactly one cycle of length at least four. Recently, in Hollering and Sullivant (2020), this result was extended using an algebraic matroid approach to include the Kimura 2-parameter and Kimura 3-parameter constraints (K2P, K3P).

Theorem 1

(Gross and Long 2017; Hollering and Sullivant 2020) The network parameter of a network-based Markov model under the Jukes–Cantor (Gross and Long 2017), Kimura 2-parameter (Hollering and Sullivant 2020), or Kimura 3-parameter (Hollering and Sullivant 2020) constraints is generically identifiable with respect to the class of models where the network parameter is an n-leaf semi-directed network with exactly one undirected cycle of length at least four.

Still, these identifiability results only apply for networks with a single reticulation vertex. In this paper, we prove the following, extending the results to triangle-free, level-1 semi-directed networks, that is, triangle-free semi-directed networks where every undirected cycle contains a single reticulation vertex.

Theorem 2

The network parameter of a network-based Markov model under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints is generically identifiable with respect to the class of models where the network parameter is an n-leaf triangle-free, level-1 semi-directed network with $r \geq 0$ reticulation vertices.

To illustrate the implications of Theorem 2, suppose that p is an expected site pattern probability distribution that belongs to a Markov model on a rooted phylogenetic network N. If it is known that N is level-1 with triangle-free skeleton and r reticulation vertices, then from p, it is possible (almost surely) to determine the unrooted skeleton of N as well as which vertices (edges) are hybrid vertices (edges).

Our proof is largely combinatorial, as we are able to use the algebraic results for small networks obtained in Gross and Long (2017) and Hollering and Sullivant (2020), in addition to a few new ones, as building blocks. We begin in Sect. 2 by describing more precisely the models we consider as well as the algebraic approach to establishing generic identifiability. In Sect. 3, we prove a few novel results about the algebra of 4-leaf level-1 networks and collect the other required algebraic results. In Sect. 4, we prove several combinatorial properties of level-1 phylogenetic semi-directed networks that we will need to prove the main result. Finally, with these results in place, in Sect. 5, we prove Theorem 2.

Preliminaries

We begin this section by defining the graph theoretic terminology that we will use throughout the paper. Then, in Sect. 2.2, we introduce network-based Markov models on rooted networks, and in Sect. 2.3, we show that we can also define a network-based Markov model on a semi-directed network. Finally, we describe the connection between network-based Markov models and algebraic varieties and formally define what it means for two networks to be distinguishable and precisely what it means for the network parameter of a class of models to be generically identifiable.

Graph theory terminology

A (rooted binary) phylogenetic network N on a set of leaves X is a rooted acyclic directed graph with no edges in parallel such that the root has out-degree two, each vertex with out-degree zero has in-degree one, the set of vertices with out-degree zero is X, and all other vertices either have in-degree one and out-degree two, or in-degree two and out-degree one. The skeleton of a phylogenetic network is the undirected graph that is obtained from the network by removing edge directions.

A vertex is a tree vertex if it has in-degree one and out-degree two. A vertex is a reticulation vertex if it has in-degree two and out-degree one, and the edges that are directed into a reticulation vertex are called reticulation edges. Let r(N) denote the number of reticulation vertices in network N. Since N is binary, it can be shown that it has exactly 2|X|+2r(N)-1 vertices and |X|+2r(N)-1 internal vertices. A rooted phylogenetic network with no reticulation vertices is a rooted phylogenetic tree.
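The vertex counts above can be checked on a small example. The sketch below uses a hypothetical 3-leaf network with one reticulation vertex; the edge-list encoding and the names `classify_vertices` and `toy_edges` are assumptions made for illustration.

```python
from collections import defaultdict


def classify_vertices(edges):
    """Split the vertices of a rooted network into leaves and reticulations.

    `edges` is a list of directed (parent, child) pairs.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    vertices = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        vertices.update((parent, child))
    # Leaves have out-degree zero; reticulations have in-degree two.
    leaves = {v for v in vertices if outdeg[v] == 0}
    reticulations = {v for v in vertices if indeg[v] == 2 and outdeg[v] == 1}
    return vertices, leaves, reticulations


# Hypothetical network: root "rho", tree vertices "a" and "b",
# reticulation vertex "h", leaves "x", "y", "z".
toy_edges = [
    ("rho", "a"), ("rho", "b"),
    ("a", "x"), ("a", "h"),
    ("b", "h"), ("b", "z"),
    ("h", "y"),
]
vertices, leaves, reticulations = classify_vertices(toy_edges)
n, r = len(leaves), len(reticulations)
total_matches = len(vertices) == 2 * n + 2 * r - 1
internal_matches = len(vertices - leaves) == n + 2 * r - 1
```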

The level of a phylogenetic network is the maximum number of reticulation vertices in a biconnected component of the network. Of particular interest in this paper are level-1 networks, which can also be characterized as phylogenetic networks where no vertex belongs to more than one cycle in the network’s skeleton (Rossello and Valiente 2009).

More specifically, we will be concerned with a particular kind of level-1 network, in which only the reticulation edges are directed.

Definition 1

A semi-directed network is a mixed graph obtained from a phylogenetic network by undirecting all non-reticulation edges, suppressing all vertices of degree two, and identifying parallel edges.

Note that deciding whether a mixed graph, a graph with some edges directed and others undirected, is a semi-directed network can be done in quadratic time in the number of edges [Corollary 4 of Huber et al. (2019)]. The unrooted skeleton of a phylogenetic network is the skeleton of its associated semi-directed network (including leaf labels).
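The construction in Definition 1 can be sketched for the simplest situation. The code below assumes that no child of the root is a reticulation vertex, so the root is the only vertex of degree two after undirecting and no parallel edges arise; the general case needs more care. The encoding (directed edges as pairs, undirected edges as frozensets) and the function name `semi_directed` are hypothetical.

```python
from collections import defaultdict


def semi_directed(edges):
    """Return (undirected_edges, directed_edges) of the semi-directed network.

    `edges` lists the directed (parent, child) pairs of a rooted binary
    network. Assumption of this sketch: no child of the root is a
    reticulation vertex.
    """
    indeg = defaultdict(int)
    for _, child in edges:
        indeg[child] += 1
    parents = {parent for parent, _ in edges}
    (root,) = [u for u in parents if indeg[u] == 0]

    # Step 1: keep directions only on reticulation edges.
    directed = {(u, v) for u, v in edges if indeg[v] == 2}
    undirected = {frozenset((u, v)) for u, v in edges if indeg[v] != 2}

    # Step 2: suppress the root, which now has degree two.
    first_child, second_child = [v for u, v in edges if u == root]
    if (root, first_child) in directed or (root, second_child) in directed:
        raise ValueError("sketch assumes no child of the root is a reticulation")
    undirected -= {frozenset((root, first_child)), frozenset((root, second_child))}
    undirected.add(frozenset((first_child, second_child)))

    # Step 3 (identifying parallel edges) is a no-op under the assumption.
    return undirected, directed


toy_edges = [
    ("rho", "a"), ("rho", "b"),
    ("a", "x"), ("a", "h"),
    ("b", "h"), ("b", "z"),
    ("h", "y"),
]
undirected_edges, reticulation_edges = semi_directed(toy_edges)
```

For this toy input the result is a 3-cycle (triangle) network on the vertices a, b, and h, with both reticulation edges directed into h.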

In a semi-directed network, the reticulation vertices are the vertices of in-degree two and the level is defined the same as for a rooted phylogenetic network. A triangle-free level-1 semi-directed network is a level-1 semi-directed network where every cycle in the unrooted skeleton has length greater than three. We will also refer to level-1 semi-directed networks with exactly one reticulation vertex as k-cycle networks, where k is the length of the unique cycle in the unrooted skeleton.

We finish these preliminaries with one additional bit of graph theory terminology that will be useful throughout. Let $A \cup B$ be a partition of $X$ with $A, B$ non-empty. An edge $e$ in a network N separates $A$ and $B$ if every path (not necessarily directed) between any $a \in A$ and $b \in B$ contains $e$. If $e$ separates $A$ and $B$ then we call $e$ a cut-edge and we say N has an $A$-$B$ split.
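Whether an edge separates two leaf sets can be tested by deleting the edge and checking reachability in the skeleton. The following sketch does this with a breadth-first search; the function name `separates` and the quartet-tree example are hypothetical.

```python
from collections import deque


def separates(edges, e, A, B):
    """True if every path between A and B in the skeleton uses edge `e`.

    `edges` is a list of vertex pairs (directions are ignored).
    """
    removed = frozenset(e)
    adjacency = {}
    for u, v in edges:
        if frozenset((u, v)) == removed:
            continue  # delete the candidate edge
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Breadth-first search from all vertices of A at once.
    seen = set(A)
    queue = deque(A)
    while queue:
        u = queue.popleft()
        for w in adjacency.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen.isdisjoint(B)


# Quartet tree with leaves 1, 2 attached to "u" and leaves 3, 4 attached to "v".
quartet = [(1, "u"), (2, "u"), ("u", "v"), (3, "v"), (4, "v")]
internal_edge_is_split = separates(quartet, ("u", "v"), {1, 2}, {3, 4})
pendant_edge_is_split = separates(quartet, (1, "u"), {1, 2}, {3, 4})
```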

Network-based Markov models

We begin this section by describing a model of DNA sequence evolution along an n-leaf rooted binary phylogenetic network. For the description below, we assume that the network belongs to the set of tree-child networks (Cardona et al. 2007), which contains the set of level-1 networks. In a tree-child network, every internal vertex has at least one child vertex that is either a tree vertex or a leaf.

Let N be an n-leaf phylogenetic network and let $\rho$ be the root of the network. Let $S_4$ be the set of $4 \times 4$ (row) stochastic matrices and let $\Delta^d$ be the $d$-dimensional probability simplex, i.e. $\Delta^d := \{p \in \mathbb{R}^{d+1} : p \geq 0, \sum_{i=1}^{d+1} p_i = 1\} \subseteq \mathbb{R}^{d+1}$. We associate to each node $v$ of N a random variable $X_v$ with state space $\{A,G,C,T\}$, corresponding to the four DNA bases. The nodes of the network, including the interior nodes, represent taxa, and the random variable $X_v$ is meant to indicate the DNA base at the particular site being modeled in the taxon at $v$.

Now, let $\pi = (\pi_A, \pi_G, \pi_C, \pi_T) \in \Delta^3 \subseteq \mathbb{R}^4$ be the distribution at the root with $\pi_i = P(X_\rho = i)$, and associate to each edge $e = uv$ of N a $4 \times 4$ transition matrix $M^e \in S_4$ where the rows and columns are indexed by the elements of the state space. With $u$ a parent of $v$, the entry $M^e_{i,j}$ is equal to the conditional probability $P(X_v = j \mid X_u = i)$. When N is a rooted tree, the probability of observing a particular n-tuple at the leaves of N is straightforward to compute. Letting $V(N)$ be the vertex set of N, we first consider an assignment of states to the vertices of N by $\phi : V(N) \to \{A,G,C,T\}$ where $\phi(v)$ is the state of $X_v$. Then, under the assumption of a tree-based Markov model, the probability of observing the assignment $\phi$ can be computed using the distribution at the root and the transition matrices. Specifically, letting $\Sigma(N)$ be the set of edges of N, this probability is equal to

$$\pi_{\phi(\rho)} \prod_{e = uv \in \Sigma(N)} M^e_{\phi(u),\phi(v)}.$$

The probability of observing a particular assignment of states at the leaves can be obtained by marginalization, i.e. summing over all possible assignments of states to the internal nodes. In particular, if $\omega \in \{A,G,C,T\}^{|X|}$ is an assignment of states to the leaves $X$ of N and $\phi(X)$ is the restriction of $\phi$ to the entries corresponding to the leaves of N, the probability of observing $\omega$ is then

$$\sum_{(\phi : \phi(X) = \omega)} \pi_{\phi(\rho)} \prod_{e = uv \in \Sigma(N)} M^e_{\phi(u),\phi(v)}.$$
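The marginalization above can be carried out by brute force on a small tree. The sketch below assumes a uniform root distribution and Jukes–Cantor matrices with a single hypothetical off-diagonal value; the names `jukes_cantor`, `pattern_probability`, and the 3-leaf tree are illustrative only. Exact rational arithmetic is used so that the probabilities can be compared without rounding.

```python
from fractions import Fraction
from itertools import product

STATES = "AGCT"


def jukes_cantor(b):
    """4x4 Jukes-Cantor matrix: off-diagonal entries b, diagonal 1 - 3b."""
    return {(i, j): (1 - 3 * b if i == j else b) for i in STATES for j in STATES}


def pattern_probability(edges, matrices, root, leaves, omega, pi):
    """Sum the probability of every assignment phi that restricts to omega."""
    vertices = {root} | {child for _, child in edges}
    internal = sorted(vertices - set(leaves))
    fixed = dict(zip(leaves, omega))
    total = 0
    for states in product(STATES, repeat=len(internal)):
        phi = {**fixed, **dict(zip(internal, states))}
        term = pi[phi[root]]
        for u, v in edges:
            term *= matrices[(u, v)][(phi[u], phi[v])]
        total += term
    return total


# Hypothetical rooted tree: rho -> x, rho -> u, u -> y, u -> z.
tree_edges = [("rho", "x"), ("rho", "u"), ("u", "y"), ("u", "z")]
tree_leaves = ["x", "y", "z"]
uniform = {s: Fraction(1, 4) for s in STATES}
matrices = {e: jukes_cantor(Fraction(1, 10)) for e in tree_edges}

distribution = {
    "".join(omega): pattern_probability(
        tree_edges, matrices, "rho", tree_leaves, omega, uniform
    )
    for omega in product(STATES, repeat=len(tree_leaves))
}
```

The loop over internal assignments is exponential in the number of internal vertices, so this is only suitable for very small examples.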

When the rooted network N contains at least one cycle in its skeleton, there is no longer a unique path between each leaf and the root, and thus reticulation edge parameters are introduced. In this case, suppose N has $r$ reticulation vertices $v_1, \ldots, v_r$. Since each $v_i$ has in-degree two, there are two edges, $e_i^0$ and $e_i^1$, directed into $v_i$. Assign a parameter $\delta_i \in (0,1)$ to $e_i^1$ and the value $1-\delta_i$ to $e_i^0$. For $1 \leq i \leq r$, independently delete $e_i^0$, keeping $e_i^1$, with probability $\delta_i$; otherwise, delete $e_i^1$ and keep $e_i^0$. Intuitively, the parameter $\delta_i$ corresponds to the probability that a particular site was inherited along edge $e_i^1$. Encode this set of choices with a binary vector $\sigma \in \{0,1\}^r$ where a 0 in the $i$th coordinate indicates that edge $e_i^0$ was deleted. Since N is assumed to be a tree-child network, after deleting the $r$ edges, the result is a rooted n-leaf tree $T_\sigma$. Since there are four DNA bases and n leaves of the network, there are $4^n$ possible site-patterns, or assignments of states, that could be observed at the leaves of N. The probability of observing the site-pattern $\omega$ is

$$p_\omega = \sum_{\sigma \in \{0,1\}^r} \left( \prod_{i=1}^{r} \delta_i^{1-\sigma_i} (1-\delta_i)^{\sigma_i} \right) \sum_{(\phi : \phi(X) = \omega)} \pi_{\phi(\rho)} \prod_{e = uv \in \Sigma(T_\sigma)} M^e_{\phi(u),\phi(v)}. \qquad (1)$$

While seemingly complicated, the above expression is a polynomial in the numerical parameters of the model: the root distribution, the entries of the transition matrices, and the reticulation edge parameters. Thus the map defined by the network N

$$\psi_N : \theta_N \to \Delta^{4^n-1},$$

from the numerical parameter space $\theta_N := \Delta^3 \times (S_4)^{|\Sigma(N)|} \times (0,1)^r$ to the probability simplex $\Delta^{4^n-1}$ is a polynomial map. The image of the map $\psi_N$ is called the model associated to N, denoted $\mathcal{M}_N$. Note the model $\mathcal{M}_N$ is the set of all possible probability distributions obtained by fixing the network N and varying the numerical parameters. See Fig. 1 for an example of a network with its numerical parameters.
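Equation (1) describes a mixture of tree distributions, one for each choice of deleted reticulation edges. A minimal sketch of evaluating it on a hypothetical 3-leaf network with one reticulation vertex follows; the network, the parameter values, and all function names are assumptions made for illustration, with Jukes–Cantor matrices and a uniform root distribution.

```python
from fractions import Fraction
from itertools import product

STATES = "AGCT"


def jukes_cantor(b):
    """4x4 Jukes-Cantor matrix: off-diagonal entries b, diagonal 1 - 3b."""
    return {(i, j): (1 - 3 * b if i == j else b) for i in STATES for j in STATES}


def tree_probability(edges, matrices, root, leaves, omega, pi):
    """Marginal probability of the leaf pattern omega on a rooted tree."""
    vertices = {root} | {child for _, child in edges}
    internal = sorted(vertices - set(leaves))
    fixed = dict(zip(leaves, omega))
    total = 0
    for states in product(STATES, repeat=len(internal)):
        phi = {**fixed, **dict(zip(internal, states))}
        term = pi[phi[root]]
        for u, v in edges:
            term *= matrices[(u, v)][(phi[u], phi[v])]
        total += term
    return total


def network_probability(edges, ret_pairs, deltas, matrices, root, leaves, omega, pi):
    """Evaluate equation (1).

    ret_pairs[i] = (e_i^0, e_i^1) and deltas[i] is the parameter of e_i^1.
    """
    total = 0
    for sigma in product((0, 1), repeat=len(ret_pairs)):
        weight = 1
        deleted = set()
        for s, (e0, e1), d in zip(sigma, ret_pairs, deltas):
            # sigma_i = 0 means e_i^0 is deleted, which has probability delta_i.
            weight *= d if s == 0 else 1 - d
            deleted.add(e0 if s == 0 else e1)
        tree_edges = [e for e in edges if e not in deleted]
        total += weight * tree_probability(
            tree_edges, matrices, root, leaves, omega, pi
        )
    return total


# Hypothetical network with reticulation vertex "h" and leaves x, y, z.
net_edges = [("rho", "a"), ("rho", "b"), ("a", "x"), ("a", "h"),
             ("b", "h"), ("b", "z"), ("h", "y")]
net_leaves = ["x", "y", "z"]
ret_pairs = [(("a", "h"), ("b", "h"))]
deltas = [Fraction(1, 3)]
uniform = {s: Fraction(1, 4) for s in STATES}
matrices = {e: jukes_cantor(Fraction(1, 10)) for e in net_edges}

network_distribution = {
    "".join(omega): network_probability(
        net_edges, ret_pairs, deltas, matrices, "rho", net_leaves, omega, uniform
    )
    for omega in product(STATES, repeat=len(net_leaves))
}
```

After an edge is deleted the remaining tree contains a vertex with a single child; the sum over internal states handles this without suppressing the vertex.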

Fig. 1.

On the left is an example of a phylogenetic network with stochastic transition matrices assigned to each edge and reticulation parameters assigned to the two reticulation edges; we denote the edge transition matrices using $M(\beta_i)$ rather than $M^{e_i}$ to indicate the dependence on the parameter $\beta_i$. The transition matrices all satisfy the Jukes–Cantor constraints. On the right is the semi-directed network obtained by unrooting the network on the left. Each edge of the semi-directed network is labeled by a vector of Fourier parameters. Reticulation edges are represented by dashed edges

When we place no restrictions on the entries of the transition matrices (other than that they are stochastic) the underlying substitution process is known as the general Markov model. Network-based phylogenetic models with a general Markov substitution process are studied for example in Casanellas and Fernández-Sánchez (2020). However, it is quite common in phylogenetics to consider models with additional constraints, effectively reducing the dimension of the parameter space $\theta_N$. For example, in the Kimura 3-parameter DNA substitution model, the root distribution is uniform and each transition matrix is assumed to have the following form, where the rows and columns are indexed by the DNA bases A, G, C, T,

$$\begin{pmatrix} \alpha & \beta & \gamma & \delta \\ \beta & \alpha & \delta & \gamma \\ \gamma & \delta & \alpha & \beta \\ \delta & \gamma & \beta & \alpha \end{pmatrix}.$$

In the Kimura 2-parameter (K2P) and Jukes–Cantor (JC) models, additional restrictions are placed on the entries of the transition matrices ($\gamma = \delta$ for K2P and $\beta = \gamma = \delta$ for JC).
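The three sets of constraints can be written as nested constructors. In the sketch below the diagonal entry is determined by the requirement that rows sum to one; the function names `k3p`, `k2p`, and `jc` and the numerical values are hypothetical.

```python
from fractions import Fraction


def k3p(beta, gamma, delta):
    """Kimura 3-parameter matrix, rows and columns ordered A, G, C, T."""
    alpha = 1 - beta - gamma - delta  # rows of a stochastic matrix sum to 1
    return [
        [alpha, beta, gamma, delta],
        [beta, alpha, delta, gamma],
        [gamma, delta, alpha, beta],
        [delta, gamma, beta, alpha],
    ]


def k2p(beta, gamma):
    """Kimura 2-parameter matrix: K3P with gamma = delta."""
    return k3p(beta, gamma, gamma)


def jc(beta):
    """Jukes-Cantor matrix: K3P with beta = gamma = delta."""
    return k3p(beta, beta, beta)


example_k3p = k3p(Fraction(1, 10), Fraction(1, 20), Fraction(1, 40))
example_k2p = k2p(Fraction(1, 10), Fraction(1, 20))
example_jc = jc(Fraction(1, 10))
```

Each of these matrices is symmetric, a fact used later when the root of a network is relocated.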

In order to not overload the word “model,” we will refer to these restrictions on the transition matrices as constraints. For example, we will refer to the image of $\psi_N$ under the Jukes–Cantor DNA substitution model as the model associated to N under the Jukes–Cantor constraints.

We end this section on network-based Markov models by noting that there exist other natural extensions of tree-based Markov models. For example, in Francis and Moulton (2018), the authors consider a network model adapted from Thatte (2013) and are able to establish identifiability for the entire class of tree-child networks. The stronger identifiability results come at the expense of some modeling flexibility, but the difference can illustrate the possible gains that can be made by considering different processes.

Semi-directed network models

In this section, we show how to associate a model $\mathcal{M}_N$ to a phylogenetic semi-directed network $N$ for the group-based models considered in this paper. We will see that for a given set of constraints (JC, K2P, K3P), if $\mathcal{N}$ is a (rooted) phylogenetic network and $N$ is the semi-directed network obtained from $\mathcal{N}$ as in Definition 1, then $\mathcal{M}_{\mathcal{N}} = \mathcal{M}_N$. We start by showing that the model associated to a rooted network $\mathcal{N}$ does not depend on the location of the root. Then, we show that the associated model does not change if we suppress degree two vertices or remove parallel edges in the network. Thus, the phylogenetic semi-directed network $N$ contains all of the information necessary to recover $\mathcal{M}_{\mathcal{N}}$.

For a tree-based phylogenetic model under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints, we may relocate the root and suppress vertices of degree two without changing the underlying model [see, for example, Sect. 8.5 in Semple and Steel (2003) or Chapter 15 in Sullivant (2018)]. That we can relocate the root is easily observed since each of the transition matrices is symmetric and the root distribution is uniform, so that $\pi_i M_{i,j} = \pi_j M_{j,i}$. To see that we may suppress vertices of degree two without changing the model, suppose the edges $e$ and $f$ are incident to a vertex of degree two and that the Markov transition matrices $M^e$ and $M^f$ satisfy the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints. Then the transition matrix $M^e M^f$ will satisfy the same constraints, so we may suppress the vertex of degree two and assign this transition matrix to the newly created edge to obtain the same site pattern probability distribution from the model. These results imply that the location of the root of the rooted tree parameter in a tree-based Markov model cannot be identified from an expected site-pattern in the model. Or, viewed another way, these results mean that we can associate a tree-based Markov model to an unrooted tree and consider the tree parameter in a tree-based Markov model to be an unrooted tree.

A similar result holds for the network-based Markov models considered in this paper. For a fixed choice of parameters in a network model, the associated site pattern probability distribution is the weighted sum of site-pattern probability distributions from the constituent tree models. The weights are determined by the reticulation edge parameters. Since relocating the root in each of the trees does not affect the tree models, the network model will remain the same if we relocate the root of the network and redirect the edges in any way that preserves the direction of the reticulation edges. For example, in the rooted network in Fig. 1, we could suppress the existing root vertex, subdivide the edge directed into the leaf vertex labeled by z to create a new root, and then redirect edges away from the new root in a way that preserves the directions of the reticulation edges.

If a child of the root vertex is a reticulation vertex, then unrooting and suppressing the root may result in a pair of parallel reticulation edges in the semi-directed network. However, under the JC, K2P, and K3P constraints, we may identify any pair of parallel edges without altering the model. The reason for this is that the sets of transition matrices under each of these constraints are closed under convex sums. So if a network contains a pair of parallel reticulation edges with transition matrices $M^e$ and $M^f$, we can replace these edges with a single edge with transition matrix $\delta M^e + (1-\delta) M^f$ and obtain the same site-pattern probability distribution in the model, where $\delta$ is the reticulation edge parameter for the edge $e$.

Together, these arguments give us the following proposition.

Proposition 1

Let $\mathcal{N}_1$ and $\mathcal{N}_2$ be two tree-child phylogenetic networks with associated phylogenetic semi-directed networks $N_1$ and $N_2$. Under the JC, K2P, or K3P constraints, if $N_1 = N_2$ then $\mathcal{M}_{\mathcal{N}_1} = \mathcal{M}_{\mathcal{N}_2}$.

Thus, the model associated to a rooted phylogenetic network is entirely determined by the associated phylogenetic semi-directed network. We note, however, that the arguments above are specific to the JC, K2P, and K3P constraints, and similar arguments might not work for other network-based Markov models.
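The two closure properties used in this section (products, for suppressing vertices of degree two, and convex sums, for identifying parallel edges) can be checked numerically on examples. The sketch below does this for the Kimura 3-parameter form with hypothetical parameter values; it illustrates the claims on specific matrices and is not a proof. All function names are assumptions.

```python
from fractions import Fraction


def k3p(beta, gamma, delta):
    """Kimura 3-parameter matrix, rows and columns ordered A, G, C, T."""
    alpha = 1 - beta - gamma - delta
    return [
        [alpha, beta, gamma, delta],
        [beta, alpha, delta, gamma],
        [gamma, delta, alpha, beta],
        [delta, gamma, beta, alpha],
    ]


def has_k3p_form(m):
    """True if m is row-stochastic with the K3P pattern of its first row."""
    _, beta, gamma, delta = m[0]
    return sum(m[0]) == 1 and m == k3p(beta, gamma, delta)


def matmul(m1, m2):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def convex_sum(weight, m1, m2):
    """Entrywise weight * m1 + (1 - weight) * m2."""
    return [[weight * m1[i][j] + (1 - weight) * m2[i][j] for j in range(4)]
            for i in range(4)]


m_e = k3p(Fraction(1, 10), Fraction(1, 20), Fraction(1, 40))
m_f = k3p(Fraction(1, 5), Fraction(1, 8), Fraction(1, 16))

product_ok = has_k3p_form(matmul(m_e, m_f))
mixture_ok = has_k3p_form(convex_sum(Fraction(1, 3), m_e, m_f))
```

Setting the three off-diagonal parameters equal in both inputs gives the corresponding check for the Jukes–Cantor constraints.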

Proposition 1 suggests that we may regard the network parameter of a network-based Markov model as a phylogenetic semi-directed network. Given a phylogenetic semi-directed network $N$, we can determine the model $\mathcal{M}_N$ by choosing any rooted network $\mathcal{N}$ for which $N$ is the associated semi-directed network and defining $\mathcal{M}_N := \mathcal{M}_{\mathcal{N}}$. Therefore, for the rest of this paper, we will assume that the network parameter of each model is an n-leaf phylogenetic semi-directed network. Indeed, this is necessary to obtain any identifiability results, as the location of the root in a rooted network is not identifiable from an expected site pattern probability distribution in the model.

Markov models as algebraic varieties

In this paper, we prove generic identifiability using tools from combinatorics and computational algebraic geometry. In order to understand $\mathcal{M}_N = \mathrm{Im}(\psi_N)$ within an algebraic-geometric framework, we consider the complex extension of $\psi_N$, which we denote as $\psi_N^{\mathbb{C}}$.

Let $\mathbb{C}[p_\omega : \omega \in \{A,G,C,T\}^n]$ be the set of all polynomials in $4^n$ variables with coefficients in $\mathbb{C}$. The ideal associated to $\mathcal{M}_N$ is the set of polynomials that vanish on the image of $\psi_N^{\mathbb{C}}$, i.e.

$$I_N := \{ f \in \mathbb{C}[p_\omega : \omega \in \{A,G,C,T\}^n] : f(p) = 0 \text{ for all } p \in \mathrm{Im}(\psi_N^{\mathbb{C}}) \}.$$

The elements of $I_N$ are called phylogenetic invariants. Each polynomial in $I_N$ vanishes on $\mathcal{M}_N$, that is, each polynomial yields zero when we substitute the entries of any probability distribution $p \in \mathcal{M}_N$. Phylogenetic invariants are the defining polynomials of the variety $V_N$ associated to $\mathcal{M}_N$, which we will refer to as the network variety. Specifically,

$$V_N := V(I_N) = \{ p \in \mathbb{C}^{4^n} : f(p) = 0 \text{ for all } f \in I_N \}.$$

Elements of $I_N$ are polynomial relationships among the entries of $p$ that hold for all distributions $p \in \mathcal{M}_N$. If we look back at equation (1), it is reasonable to assume that such relationships may be quite complicated since each probability coordinate $p_\omega$ is parameterized by a polynomial that is the sum of $2^r 4^{n+2r-1}$ terms. Because of this, we perform a linear change of coordinates on both the parameter space and the image space called the Fourier-Hadamard transform (Evans and Speed 1993; Hendy and Penny 1996). After the transform, the invariants are expressed in the ring of q-coordinates,

$$\mathbb{C}[q_\omega : \omega \in \{A,G,C,T\}^n].$$

As an example of how the Fourier-Hadamard transform simplifies the resulting algebra, for a tree-based phylogenetic model, the parameterization of each q-coordinate is a monomial in the Fourier parameters and the phylogenetic tree ideal is generated by binomials. Working in the transformed coordinates is common when working with group-based models and it is what enables us to compute the required network invariants. While the details of the Fourier-Hadamard transform are outside the scope of this paper, we give here a brief description of how to parametrize a phylogenetic network model under the Jukes–Cantor, Kimura 2-parameter, and Kimura 3-parameter constraints. More details can be found in Sturmfels and Sullivant (2005) and Chapter 15 of Sullivant (2018).

First, we will describe how to determine the Fourier parametrization of a phylogenetic tree, $T$. As in Sturmfels and Sullivant (2005) and Sullivant (2018), we begin by identifying the four DNA bases with elements of the group $\mathbb{Z}_2 \times \mathbb{Z}_2$ as follows: $A = (0,0)$, $G = (1,0)$, $C = (0,1)$ and $T = (1,1)$. Under the Kimura 3-parameter constraints, there are then four Fourier parameters associated to each edge $e$, denoted as $a^e_A$, $a^e_G$, $a^e_C$, and $a^e_T$ (after transformation, the stochastic condition on the transition matrices forces $a^e_A = 1$). Letting $\omega$ be the site pattern $(g_1, g_2, \ldots, g_n)$, the parametrization is then given by

$$q_\omega = \begin{cases} \prod_{e \in \Sigma(T)} a^e_{\sum_{j \in Y} g_j} & \text{if } \sum_{j=1}^{n} g_j = 0, \\ 0 & \text{otherwise,} \end{cases}$$

where $\Sigma(T)$ is the set of edges of $T$ and $Y$-$Z$ is the split induced by $e$ in $T$. All addition is in the group $\mathbb{Z}_2 \times \mathbb{Z}_2$.

Notice that this is a monomial, in which there is one parameter associated to each edge of the tree $T$. In order to parametrize a phylogenetic network, we take the sum of the monomials corresponding to all $2^r$ trees created by removing reticulation edges from the network. The monomials are weighted by the corresponding reticulation edge parameters.
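The tree parametrization in Fourier coordinates can be sketched directly from the formula above. In this illustration the tree is given by the splits its edges induce, leaves are indexed from 0, and the Fourier parameter values are arbitrary placeholders; the names `q_coordinate`, `splits`, and `params` are hypothetical.

```python
from fractions import Fraction
from functools import reduce

# Identification of the DNA bases with Z2 x Z2, as in the text.
GROUP = {"A": (0, 0), "G": (1, 0), "C": (0, 1), "T": (1, 1)}


def add(g, h):
    """Addition in Z2 x Z2."""
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)


def group_sum(elements):
    return reduce(add, elements, (0, 0))


def q_coordinate(splits, params, omega):
    """Fourier coordinate q_omega of a tree.

    splits: edge -> set Y of leaf indices on one side of the edge
    params: edge -> {group element: Fourier parameter}
    """
    g = [GROUP[base] for base in omega]
    if group_sum(g) != (0, 0):
        return 0
    value = 1
    for edge, side in splits.items():
        value *= params[edge][group_sum(g[j] for j in side)]
    return value


# Quartet tree 01|23: four pendant edges and one internal edge "e4".
splits = {"e0": {0}, "e1": {1}, "e2": {2}, "e3": {3}, "e4": {0, 1}}
# Placeholder parameters with a_A = 1 on every edge.
params = {
    edge: {
        (0, 0): Fraction(1),
        (1, 0): Fraction(1, 2 + k),
        (0, 1): Fraction(1, 7 + k),
        (1, 1): Fraction(1, 12 + k),
    }
    for k, edge in enumerate(splits)
}

q_ggaa = q_coordinate(splits, params, "GGAA")
q_gaga = q_coordinate(splits, params, "GAGA")
q_aaag = q_coordinate(splits, params, "AAAG")
```

Because the group elements of a pattern with nonzero coordinate sum to zero, using the other side $Z$ of each split in place of $Y$ gives the same value.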

Generic identifiability

A model-based approach to network inference selects the model from a set of candidate models that best fits the observed data according to some criteria and returns the network parameter of this model. In our setting, the observed data are the aligned DNA sequences of the taxa under consideration, from which we construct the observed site pattern probability distribution. In the ideal setting, if we had access to infinite noiseless data generated by a network-based Markov model, then the observed site pattern distribution would be equal to an expected site pattern distribution in the model. Inferring the correct network parameter in this case would be as simple as determining which model from a set of candidate models the site pattern probability distribution belongs to. However, even in this idealized setting, it may be that the observed site pattern distribution belongs to the models corresponding to two distinct networks, making it impossible to determine which network produced the data. Thus, a desirable theoretical property for a class of network models is that each distribution in one of the models belongs to no other model, or that the network parameter be identifiable.

More formally, let $\mathfrak{N}$ be a set of leaf-labelled networks. The condition that the network parameter is identifiable with respect to a collection of models $\{\mathcal{M}_N\}_{N \in \mathfrak{N}}$ is equivalent to the condition that for all distinct $N_1, N_2 \in \mathfrak{N}$, $\mathcal{M}_{N_1} \cap \mathcal{M}_{N_2} = \emptyset$, meaning the two models do not intersect. Since this notion of identifiability is rather strong, the more practical notion of generic identifiability is more commonly explored.

Definition 2

Let $\{\mathcal{M}_N\}_{N \in \mathfrak{N}}$ be a class of phylogenetic network models. The network parameter is generically identifiable with respect to the class $\{\mathcal{M}_N\}_{N \in \mathfrak{N}}$ if given any two distinct n-leaf networks $N_1, N_2 \in \mathfrak{N}$, the set of numerical parameters in $\theta_{N_1}$ that $\psi_{N_1}$ maps into $\mathcal{M}_{N_2}$ is a set of Lebesgue measure zero.

To establish generic identifiability, we can use algebraic geometry by considering the family of irreducible algebraic varieties $\{V_N\}_{N \in \mathfrak{N}}$, where $V_N$ is the network variety associated to $N$. Generic identifiability is then closely related to the concept of distinguishability.

Definition 3

(Gross and Long 2017) Two distinct n-leaf networks $N_1$ and $N_2$ are distinguishable if $V_{N_1} \cap V_{N_2}$ is a proper subvariety of $V_{N_1}$ and $V_{N_2}$, that is, $V_{N_1} \not\subseteq V_{N_2}$ and $V_{N_2} \not\subseteq V_{N_1}$. Otherwise, they are indistinguishable.

Proposition 2

(Gross and Long 2017, Proposition 3.3) Let $\{\mathcal{M}_N\}_{N \in \mathfrak{N}}$ be a class of phylogenetic network models. The network parameter of a phylogenetic network model is generically identifiable with respect to $\{\mathcal{M}_N\}_{N \in \mathfrak{N}}$ if given any two distinct n-leaf networks $N_1, N_2 \in \mathfrak{N}$, the networks $N_1$ and $N_2$ are distinguishable.

The condition that the network parameter be generically identifiable then amounts to showing that for all distinct $N_1, N_2 \in \mathfrak{N}$, the networks $N_1$ and $N_2$ are distinguishable, or equivalently, $V_{N_1} \not\subseteq V_{N_2}$ and $V_{N_2} \not\subseteq V_{N_1}$. Proving that this condition is satisfied can then be done either by explicit computation of the ideals associated to $N_1$ and $N_2$ [as in Gross and Long (2017)], or by arguing that certain phylogenetic invariants must exist [as in Hollering and Sullivant (2020)].

Distinguishability of 4-leaf semi-directed networks

Our aim is to prove Theorem 2, by showing that any two distinct n-leaf r-reticulation triangle-free level-1 semi-directed networks are distinguishable. In order to show this, we will require a number of results concerning 4-leaf networks which we prove in Lemma 1 below.

Up to leaf relabeling, there are six different 4-leaf level-1 semi-directed networks which are depicted in Fig. 2. In Lemma 1, we assume that N1 and N2 are two distinct 4-leaf semi-directed networks. We then consider all cases where N1 and N2 are each either a quartet tree (Q), a single triangle network (Δ), a double-triangle network (DT), or a 4-cycle network (4C), and compare the resulting varieties. We only need to consider four possibilities for each of N1 and N2, because under the JC, K2P, and K3P constraints, the variety of a triangle or double-triangle semi-directed network is determined by the unrooted skeleton of the network. This can be shown by first observing that under the JC, K2P, and K3P models, the ideals of all of the 3-leaf semi-directed triangle networks are identical. The proof then follows by applying the same toric fiber product argument that is described in the remark following Proposition 4.5 in Gross and Long (2017).

Fig. 2.

All possible semi-directed level-1 networks on four leaves (up to relabeling of leaves), grouped by their unrooted skeletons

The results of Lemma 1 are summarized in Table 1 and the caption of that table contains the key to the symbols. To give a couple of examples, part (ii) of the lemma corresponds to the (2, 2) entry of the table, which indicates that the networks are distinguishable, but only if $N_1$ and $N_2$ have distinct unrooted skeletons. The results of part (iii) of the lemma are represented by the entries (4, 1) and (4, 2) (when $k_1 = 4$ and $N_1$ is a 4-cycle network) and by (2, 1) (when $k_1 = 3$ and $N_1$ is a 3-cycle, or triangle, network). These results are also represented by the entries (1, 4), (2, 4), and (1, 2) when the roles of $N_1$ and $N_2$ are reversed.

Table 1.

An overview of Lemma 1 results for two distinct 4-leaf level-1 semi-directed networks N1 and N2. The two networks are represented by the row for N1 and the column for N2, and each entry in the 4×4 grid indicates whether the two networks are distinguishable (✓), the variety of one network is not contained in that of the other (⊈ means $V_{N_1} \not\subseteq V_{N_2}$, and ⊉ means $V_{N_1} \not\supseteq V_{N_2}$), or the two networks are distinguishable if the unrooted skeletons are different ((✓))

| N1 \ N2 | Q | Δ | DT | 4C |
|---|---|---|---|---|
| Q | ✓ | ⊉ | ⊉ | ⊉ |
| Δ | ⊈ | (✓) | ⊉ | ⊉ |
| DT | ⊈ | ⊈ | (✓) | ✓ |
| 4C | ⊈ | ⊈ | ✓ | ✓ |

Lemma 1

Let N1 and N2 be distinct 4-leaf level-1 semi-directed networks. Then under the JC, K2P, or K3P constraints:

  • (i) If N1 and N2 are both trees, then N1 and N2 are distinguishable;

  • (ii) If N1 and N2 are both single-triangle networks and have different (leaf-labelled) unrooted skeletons, then N1 and N2 are distinguishable;

  • (iii) If N1 is a k1-cycle network with $k_1 \le 4$ and N2 is a tree or a k2-cycle network with k2<k1, then $V_{N_1} \not\subseteq V_{N_2}$;

  • (iv) If N1 and N2 are both 4-cycle networks, then N1 and N2 are distinguishable;

  • (v) If N1 is a double-triangle network and N2 a single-triangle network or a tree, then $V_{N_1} \not\subseteq V_{N_2}$;

  • (vi) If N1 is a double-triangle network and N2 is a 4-cycle network, then N1 and N2 are distinguishable;

  • (vii) If N1 and N2 are both double-triangle networks and have different (leaf-labelled) unrooted skeletons, then N1 and N2 are distinguishable.

See Table 1 for an overview.
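The case analysis of Lemma 1 can also be read as a small lookup on pairs of network types. The following Python sketch is only an illustration of that reading: the type labels (`"Q"`, `"TRI"`, `"DT"`, `"4C"`), the helper dictionary, and the function name `lemma1_relation` are hypothetical names introduced here, not part of the paper or its supplementary files.

```python
# Hypothetical labels for the four 4-leaf network types considered in Lemma 1:
# "Q" = quartet tree, "TRI" = single triangle, "DT" = double triangle, "4C" = 4-cycle.
# The rank is only used to orient the one-sided statements of parts (iii) and (v).
ONE_SIDED_RANK = {"Q": 0, "TRI": 1, "DT": 2, "4C": 2}


def lemma1_relation(type1: str, type2: str) -> str:
    """Summarise what Lemma 1 states for distinct networks N1 (type1) and N2 (type2)."""
    if type1 not in ONE_SIDED_RANK or type2 not in ONE_SIDED_RANK:
        raise ValueError("unknown network type")
    if type1 == type2:
        if type1 in ("Q", "4C"):
            return "distinguishable"                  # parts (i) and (iv)
        return "distinguishable if skeletons differ"  # parts (ii) and (vii)
    if {type1, type2} == {"DT", "4C"}:
        return "distinguishable"                      # part (vi)
    # Parts (iii) and (v) give a single non-containment of varieties.
    if ONE_SIDED_RANK[type1] > ONE_SIDED_RANK[type2]:
        return "V_N1 is not contained in V_N2"
    return "V_N2 is not contained in V_N1"


print(lemma1_relation("4C", "TRI"))
```

Each return value corresponds to one kind of entry in Table 1; the function records only what the lemma asserts, so stronger facts that hold under particular constraints are deliberately not encoded.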

The proof of Lemma 1 will be given below. We first outline the proof strategy. Some parts of the lemma will follow immediately from results in Gross and Long (2017) and Hollering and Sullivant (2020). In Gross and Long (2017), the proofs were obtained by computing Gröbner bases for all of the ideals involved and then comparing the ideals. However, this was only possible because the constraints considered were the Jukes–Cantor constraints, the most restrictive that we consider. In Hollering and Sullivant (2020), the authors extend the results to the K2P and K3P constraints using a method based on the theory of algebraic matroids. This method is preferable when there are fewer constraints since the Gröbner bases computations are difficult if not impossible to carry out. Here, we find the required invariants by modifying this method slightly. Specifically, we apply the random search strategy described in that paper to locate small subsets of variables that are likely to contain distinguishing invariants. We then perform our computations in a much smaller subring of the original variables. This greatly reduces the size of the required computations and allows us to generate specific invariants without computing Gröbner bases for the ideals.

In order to reduce the total number of invariants required to prove each part, we take advantage of the symmetry between networks. As an example, suppose that the statement in part (vii) is false. Then there must exist two double-triangle networks with distinct skeletons, N1 and N2, that are not distinguishable. All of the network varieties are parameterized, and hence irreducible as algebraic varieties, which means we may assume that if two networks are not distinguishable then the variety of one is contained in that of the other. Thus, without loss of generality, we may assume that $V_{N_1} \subseteq V_{N_2}$, which implies the reverse inclusion of ideals, $I_{N_2} \subseteq I_{N_1}$. Up to relabeling, every double-triangle network has the same unrooted skeleton. Thus, we can obtain any arbitrary double-triangle network $\hat{N}_2$ from N2 by permuting leaf labels. If we apply the same permutation to the leaf labels of N1, we obtain another double-triangle network $\hat{N}_1$ for which $I_{\hat{N}_2} \subseteq I_{\hat{N}_1}$. Since our choice of $\hat{N}_2$ is arbitrary, if we can show that there is a single double-triangle network with ideal not contained in the ideal of any other double-triangle network, then we arrive at a contradiction, and have thus proven part (vii). Therefore, in order to prove part (vii), it will suffice to produce a single invariant that vanishes on exactly one of the double-triangle network varieties. A similar argument applies in each of the other parts.

In order to prove some parts of the lemma, we require two or more invariants to distinguish all of the relevant networks, though all parts can be proven using some combination of just the following six polynomial invariants:

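$$
\begin{aligned}
g_1 &= q_{ATTA}\,q_{CCGG}\,q_{GATC} - q_{AAGG}\,q_{CTTC}\,q_{GCTA},\\
g_2 &= q_{CTTC} - q_{GCGC},\\
g_3 &= q_{CAGT}\,q_{GTCA}\,q_{TGAC} - q_{CACA}\,q_{GTGT}\,q_{TGAC} - q_{CAGT}\,q_{GTAC}\,q_{TGCA}\\
    &\quad + q_{CAAC}\,q_{GTGT}\,q_{TGCA} + q_{CACA}\,q_{GTAC}\,q_{TGGT} - q_{CAAC}\,q_{GTCA}\,q_{TGGT},\\
g_4 &= q_{AACC}\,q_{CGCG}\,q_{GAGA}\,q_{TAAT} - q_{AACC}\,q_{CGAT}\,q_{GAGA}\,q_{TACG}\\
    &\quad + q_{AACC}\,q_{CAGT}\,q_{GGAA}\,q_{TACG} - q_{AAAA}\,q_{CAGT}\,q_{GGCC}\,q_{TACG},\\
g_5 &= q_{AAAA}\,q_{GACT}\,q_{GCGC} - q_{AAGG}\,q_{TAAT}\,q_{TGCA},\\
g_6 &= q_{AAGG}\,q_{GATC}\,q_{TAAT} - q_{AATT}\,q_{GAAG}\,q_{TAGC}.
\end{aligned}
$$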

In the supplementary Macaulay2 (Grayson and Stillman 2002) files, available at

https://github.com/colbyelong/DistinguishingLevel1PhylogeneticNetworks,

we provide the code to verify that these polynomials vanish or do not vanish on the referenced varieties as claimed.
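One structural feature can be read directly off the expression for g3: its six terms are exactly the Leibniz expansion of a 3×3 determinant whose rows are indexed by the first two letters of each coordinate index and whose columns are indexed by the last two. The Python sketch below checks this term by term using only the standard library. It is an observation about the displayed polynomial, not a description of how the invariant was found, and it is independent of the supplementary Macaulay2 code; the names `G3`, `ROWS`, `COLS` and the helper functions are introduced here for illustration.

```python
from itertools import permutations

# g3 as listed in the text: (coefficient, indices of the three coordinates q_...).
G3 = [
    (+1, ("CAGT", "GTCA", "TGAC")),
    (-1, ("CACA", "GTGT", "TGAC")),
    (-1, ("CAGT", "GTAC", "TGCA")),
    (+1, ("CAAC", "GTGT", "TGCA")),
    (+1, ("CACA", "GTAC", "TGGT")),
    (-1, ("CAAC", "GTCA", "TGGT")),
]

ROWS = ("CA", "GT", "TG")  # first two letters of each index
COLS = ("CA", "AC", "GT")  # last two letters of each index


def sign(perm):
    """Sign of a permutation given as a tuple of images of 0..n-1."""
    n = len(perm)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def determinant_terms(rows, cols):
    """Leibniz expansion of det(q_{row+col}) as {sorted monomial: coefficient}."""
    terms = {}
    for perm in permutations(range(len(cols))):
        monomial = tuple(sorted(row + cols[p] for row, p in zip(rows, perm)))
        terms[monomial] = terms.get(monomial, 0) + sign(perm)
    return terms


def as_terms(poly):
    """Convert a list of (coefficient, monomial) pairs to the same dictionary form."""
    return {tuple(sorted(monomial)): coeff for coeff, monomial in poly}


g3_is_determinant = as_terms(G3) == determinant_terms(ROWS, COLS)
print(g3_is_determinant)
```

The comparison is purely symbolic bookkeeping on index strings; it says nothing about which varieties g3 vanishes on, which is what the supplementary files verify.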

Proof

(Proof of Lemma 1) Statement (i) is a well-known result for the JC, K2P, and K3P constraints and can be verified using the Small trees catalog (Casanellas et al. 2005). For the Jukes–Cantor constraints, (ii)-(iv) follow from Proposition 4.6, Corollary 4.8, and Corollary 4.9 in Gross and Long (2017).

To prove (ii) for the K2P and K3P constraints we require a set of invariants that vanishes on exactly one of the single-triangle networks. The set {g1} is confirmed to be such a set for both constraints in the supplementary files. Statements (iii) and (iv) are proven for the K2P and K3P models by Lemmas 28 and 29 in Hollering and Sullivant (2020).

To prove (v), we require a set of invariants that vanishes on one of the tree varieties, but on none of the double-triangle network varieties, and a set of invariants that vanishes on one of the single-triangle network varieties, but on none of the double-triangle network varieties. The set {g1} is shown to be the required set for both parts under K2P and K3P, and the set {g1,g2} works for the JC constraints.

For (vi), we must first show that there is a set of invariants that vanishes on one of the double-triangle network varieties but on none of the 4-cycle network varieties. The set {g3} works for all constraints and thus establishes that if N1 is a double-triangle and N2 is a 4-cycle network, then $V_{N_2} \not\subseteq V_{N_1}$. We prove that $V_{N_1} \not\subseteq V_{N_2}$, and hence that the networks are distinguishable, by constructing a set of invariants that vanishes on one of the 4-cycle network varieties but on none of the double-triangle network varieties. For the JC constraints, this set is {g4,g5}. For the K2P and K3P constraints, this set is {g4,g6}.

The invariant g3 also establishes (vii), since it vanishes on exactly one of the double-triangle networks under JC, K2P, and K3P.
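The logic used throughout this proof is that a polynomial vanishing on one variety but not on another rules out a containment: if f vanishes on V1 and f is non-zero at some point of V2, then V2 is not contained in V1. The following toy Python sketch illustrates only this principle on two plane curves chosen for the example; the curves, the polynomial, and all names are hypothetical and unrelated to the network varieties, whose verification is done in the supplementary Macaulay2 files.

```python
from fractions import Fraction


def curve1(t):
    """Toy parameterized variety V1: the parabola y = x^2."""
    return (t, t * t)


def curve2(t):
    """Toy parameterized variety V2: the cubic y = x^3."""
    return (t, t * t * t)


def f(point):
    """Candidate invariant f(x, y) = y - x^2."""
    x, y = point
    return y - x * x


# Exact rational parameter values avoid floating-point noise.
samples = [Fraction(k, 3) for k in range(-6, 7)]

# Vanishing at sample points is only evidence that f lies in the ideal of V1
# (here it is also clear symbolically, since t^2 - t^2 = 0).
vanishes_on_v1 = all(f(curve1(t)) == 0 for t in samples)

# A single point of V2 where f is non-zero is a proof that f is not in the
# ideal of V2, and hence that V2 is not contained in V1.
witness = next(t for t in samples if f(curve2(t)) != 0)

print(vanishes_on_v1, witness)
```

The asymmetry in the comments is the point of the example: non-vanishing can be certified by one evaluation, whereas vanishing on a whole variety requires a symbolic argument.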

We also need a result on 4-leaf networks that does not fit into Table 1. To state this result we first need some definitions concerning the type of splits in a network.

Definition 4

For networks N1 and N2, we say X-Y is a common split if X-Y is a split in both N1 and N2; it is non-trivial if $|X|,|Y| \ge 2$. Two splits X-Y in N1 and A-B in N2 are conflicting if $X \cap A$, $X \cap B$, $Y \cap A$, $Y \cap B$ are all non-empty.
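As a concrete reading of Definition 4, the sketch below represents a split as a pair of Python sets and tests the two conditions directly. The function names and the example splits on the leaf set {a, b, c, d} are illustrative choices made here.

```python
def is_nontrivial(split):
    """A split X-Y is non-trivial if both sides contain at least two leaves."""
    x, y = split
    return len(x) >= 2 and len(y) >= 2


def are_conflicting(split1, split2):
    """X-Y and A-B conflict if all four pairwise intersections are non-empty."""
    x, y = split1
    a, b = split2
    return all(side1 & side2 for side1 in (x, y) for side2 in (a, b))


# Example splits of the leaf set {a, b, c, d} (chosen for illustration).
ab_cd = (frozenset("ab"), frozenset("cd"))
ac_bd = (frozenset("ac"), frozenset("bd"))
a_bcd = (frozenset("a"), frozenset("bcd"))

print(are_conflicting(ab_cd, ac_bd), are_conflicting(ab_cd, a_bcd))
```

On four leaves, two distinct non-trivial splits always conflict, which is why the quartet splits ab|cd and ac|bd serve as the basic example.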

Lemma 2

Let N1 and N2 be distinct 4-leaf level-1 semi-directed networks. If N1,N2 have conflicting splits, then N1 and N2 are distinguishable under the JC, K2P, or K3P constraints.

Proof

Note that 4-cycles have no non-trivial splits, so we just need to compare trees, single-triangle networks, and double-triangle networks. Moreover, Table 1 shows that we only need to verify that $V_{N_1} \not\subseteq V_{N_2}$ in the following cases:

  • (i) when N1 is a tree or triangle network and N2 is a double-triangle network with a conflicting split and

  • (ii) when N1 is a tree and N2 is a triangle network with a conflicting split.

The invariant g3 can be used to verify case (i) for all three constraints. The invariant g2 can be used to verify case (ii) for JC, and g1 can be used to verify case (ii) for K2P and K3P.

Finally we require Lemma 3, which allows us to use the above small networks as building blocks to prove the claim about larger networks. To state Lemma 3, we first define the restriction of a network to a subset of leaves.

Definition 5

Let N be an n-leaf semi-directed network on $\mathcal{X}$ and let $A \subseteq \mathcal{X}$. The restriction of N to A is the semi-directed network $N|_A$ obtained by taking the union of all directed paths between leaves in A (where undirected edges are treated as bidirected) and then suppressing all degree two vertices and removing parallel edges.

Lemma 3 is essentially a one-way version of Proposition 4.3 from Gross and Long (2017), and we use a piece of the proof of that proposition below.

Lemma 3

Let N1 and N2 be distinct n-leaf semi-directed networks on $\mathcal{X}$ and let $A \subseteq \mathcal{X}$. If $V_{N_1|_A} \not\subseteq V_{N_2|_A}$, then $V_{N_1} \not\subseteq V_{N_2}$.

Proof

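Let $V_{N_1|_A} \not\subseteq V_{N_2|_A}$. Then $V_{N_1|_A} \cap V_{N_2|_A} \subsetneq V_{N_1|_A}$. In the proof of Proposition 4.3 from Gross and Long (2017), it is shown that if $V_{N_1|_A} \cap V_{N_2|_A} \subsetneq V_{N_1|_A}$, then there exists a polynomial invariant f contained in $I_{N_2} \setminus I_{N_1}$, which implies that $I_{N_2} \not\subseteq I_{N_1}$, and so $V_{N_1} \not\subseteq V_{N_2}$.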

Lemma 3 implies that in order to prove Theorem 2 it will suffice to show that for any distinct triangle-free level-1 semi-directed networks N1 and N2, there either exists a set $A \subseteq \mathcal{X}$ with |A|=4 such that $N_1|_A$ and $N_2|_A$ are distinguishable, or sets $A, B \subseteq \mathcal{X}$ with |A|=|B|=4 such that $V_{N_1|_A} \not\subseteq V_{N_2|_A}$ and $V_{N_1|_B} \not\supseteq V_{N_2|_B}$.

Combinatorial properties of triangle-free level-1 semi-directed networks

If $X|Y$ is a partition of $\mathcal{X}$ such that N contains an X-Y split, then denote by N/X the network $N|_{\{x\} \cup Y}$, for an arbitrary $x \in X$. Observe that the unrooted skeleton of N/X does not depend on the choice of x. Observe also that r(N)=r(N/X)+r(N/Y).

Observation 1

If N1 and N2 are distinct n-leaf semi-directed networks and X-Y is a common split, then either N1/X and N2/X are distinct or N1/Y and N2/Y are distinct.

The next lemma follows immediately from Lemma 3 and the definition of N/X.

Lemma 4

Let N1 and N2 be distinct n-leaf semi-directed networks on $\mathcal{X}$. Suppose $X|Y$ is a partition of $\mathcal{X}$ such that N1 and N2 both contain an X-Y split. If $V_{N_1/X} \not\subseteq V_{N_2/X}$ then $V_{N_1} \not\subseteq V_{N_2}$.

Let N be an n-leaf triangle-free level-1 semi-directed network on $\mathcal{X}$ and C a cycle in N. Let $e_1,\dots,e_s$ be the cut-edges incident to C. Then the partition induced by C is the partition $X_1|\dots|X_s$ of $\mathcal{X}$ such that $x \in X_i$ if and only if x is separated from C by $e_i$. We say $X_i$ is below the reticulation vertex if $e_i$ is the edge incident to the reticulation vertex in C. If $X_i$ is below the reticulation vertex in C then we also say that x is below the reticulation vertex for any $x \in X_i$.

We say a set of three or more leaves $\{x_1,\dots,x_t\}$ meet at a cycle C if each leaf in $\{x_1,\dots,x_t\}$ appears in a different set of the partition induced by C. We say that they induce a cycle in N if $N|_{\{x_1,\dots,x_t\}}$ is a t-cycle network. Note that if the set of leaves $\{x_1,\dots,x_t\}$ induce a cycle then they must meet at a cycle, but the converse does not hold unless one of $\{x_1,\dots,x_t\}$ is below the reticulation vertex. As an example consider the network in Fig. 4a: {a1,a2,a3} meet at the cycle C1 but do not induce a cycle, whereas {x,a1,a2,a3} meet at C1 and also induce a cycle.
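The two notions just defined can be illustrated on the partition induced by a single cycle, without building the network itself. In the Python sketch below, the partition, the position of the block below the reticulation vertex, and all function names are hypothetical. The function `induce_cycle` simply encodes the remark above that leaves meeting at a cycle induce it when one of them is below the reticulation vertex; it is a reading of that remark, not a computation of the restriction on an actual network.

```python
def block_index(blocks, leaf):
    """Index of the block of the induced partition that contains the leaf."""
    for i, block in enumerate(blocks):
        if leaf in block:
            return i
    raise KeyError(leaf)


def meet_at_cycle(blocks, leaves):
    """Three or more leaves meet at the cycle if they lie in pairwise different blocks."""
    indices = [block_index(blocks, leaf) for leaf in leaves]
    return len(indices) >= 3 and len(set(indices)) == len(indices)


def induce_cycle(blocks, below, leaves):
    """Leaves meeting at the cycle, one of which is below the reticulation vertex."""
    return meet_at_cycle(blocks, leaves) and any(leaf in blocks[below] for leaf in leaves)


# Hypothetical partition induced by one cycle; the last block is assumed to be
# the one below the reticulation vertex.
blocks = [{"a1", "a4"}, {"a2"}, {"a3"}, {"x"}]
below = 3

print(meet_at_cycle(blocks, ["a1", "a2", "a3"]), induce_cycle(blocks, below, ["a1", "a2", "a3"]))
print(induce_cycle(blocks, below, ["x", "a1", "a2", "a3"]))
```

The first call mirrors the situation of {a1,a2,a3} in the example above (meeting at the cycle without inducing it), and the second mirrors {x,a1,a2,a3}.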

Fig. 4.

Illustration of part of the proof of Lemma 6. On the left we have an example of some leaves joining a cycle C1 in N1, such that {a,b,c,d} all meet at C1 with d below the reticulation vertex, and {a1,a2,a3,x} all meet at C1 with x below the reticulation vertex. The cycles on the right are all induced 4-cycles in N1, and therefore by assumption are also induced 4-cycles in N2. As the sets {a,b,c,d} and {a1,b,c,d} differ by only 1 element, they must meet at the same cycle in N2. By repeating a similar argument, we can show that {a,b,c,d} and {a1,a2,a3,x} meet at the same cycle in N2

Observe that if $\{x_1,\dots,x_t\}$ ($t \ge 3$) meet at a cycle, then they meet in exactly one cycle in N, i.e., this cycle is unique in N. Denote this cycle by $C_N(x_1,\dots,x_t)$. (Note that $C_N(x_1,\dots,x_t)$ is not well-defined if $\{x_1,\dots,x_t\}$ do not all meet at a cycle.)

Let N1 and N2 be distinct n-leaf triangle-free level-1 semi-directed networks on $\mathcal{X}$, and let C1 be a cycle in N1 that induces a partition $A_1|\dots|A_s|X$ with X below the reticulation vertex. Let C2 be a cycle in N2 that induces a partition $B_1|\dots|B_t|X$, with X below the reticulation vertex. We say that C2 refines C1 if $B_1|\dots|B_t$ is a refinement of $A_1|\dots|A_s$, i.e., if $\bigcup_{i=1}^{s} A_i = \bigcup_{j=1}^{t} B_j$ and every pair of leaves $a, b$ that are contained in different sets in $A_1|\dots|A_s$ also appear in different sets in $B_1|\dots|B_t$. See Fig. 3.

Fig. 3.

Two triangle-free level-1 semi-directed networks N1 and N2 on taxa set {a,a1,a2,a3,a4,b,c,d,x}. The cycle C1 in N1 induces a partition {a,a1,a4}|{a2,b}|{a3}|{c}|{x,d} and the cycle C2 in N2 induces a partition {a,a1}|{a4}|{a2}|{b}|{a3}|{c}|{x,d}. The cycle C2 refines C1
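The refinement condition can be checked mechanically on the two partitions given in the caption of Fig. 3. The Python sketch below does this under one labeled assumption: following the convention of writing the block below the reticulation vertex last, the block {x,d} is taken to be that block in both cycles. The function name `refines` and the variable names are introduced here for illustration.

```python
def refines(blocks2, below2, blocks1, below1):
    """Check whether a cycle with blocks2 | below2 refines one with blocks1 | below1.

    The blocks below the reticulation vertex must coincide, the remaining
    blocks must cover the same leaves, and every block of blocks2 must lie
    inside a block of blocks1. For partitions of the same leaf set, the last
    condition is equivalent to: leaves separated by blocks1 stay separated
    by blocks2.
    """
    if below1 != below2:
        return False
    if set().union(*blocks1) != set().union(*blocks2):
        return False
    return all(any(b <= a for a in blocks1) for b in blocks2)


# Partitions from the caption of Fig. 3; {x, d} is assumed to be the block
# below the reticulation vertex in both cycles.
C1_BLOCKS = [{"a", "a1", "a4"}, {"a2", "b"}, {"a3"}, {"c"}]
C2_BLOCKS = [{"a", "a1"}, {"a4"}, {"a2"}, {"b"}, {"a3"}, {"c"}]
BELOW = {"x", "d"}

print(refines(C2_BLOCKS, BELOW, C1_BLOCKS, BELOW))
print(refines(C1_BLOCKS, BELOW, C2_BLOCKS, BELOW))
```

The relation is not symmetric: C2 splits the blocks {a,a1,a4} and {a2,b} of C1 further, so C2 refines C1 but not the other way round.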

We recall a combinatorial result from Baños (2019) on four-leaf induced cycles. We state the result using notation and terms from this paper.

Lemma 5

(Lemmas 12 and 13 of Baños (2019)) Let N be an n-leaf triangle-free level-1 semi-directed network on $\mathcal{X}$. If two distinct subsets of four leaves induce a 4-cycle, where three leaves in the two sets are the same, then the five leaves (the union of the two sets) meet at the same cycle. In other words, let $a,b,c,d,e \in \mathcal{X}$ be leaves of N such that $N|_{\{a,b,c,d\}}$ and $N|_{\{a,b,c,e\}}$ are both 4-cycle networks. Then {a,b,c,d} and {a,b,c,e} meet at the same cycle.

Lemma 6

Let N1 and N2 be distinct n-leaf triangle-free level-1 semi-directed networks on $\mathcal{X}$. Suppose that for any $a,b,c,d \in \mathcal{X}$, if $N_1|_{\{a,b,c,d\}}$ is a 4-cycle, then $N_2|_{\{a,b,c,d\}} = N_1|_{\{a,b,c,d\}}$. Then every cycle in N1 is refined by a cycle in N2.

Proof

Let C1 be a cycle in N1 that induces a partition $A_1|\dots|A_s|X$ with X below the reticulation vertex. Choose any $a_1 \in A_1$, $a_2 \in A_2$, $a_3 \in A_3$, $x \in X$. As $N_1|_{\{a_1,a_2,a_3,x\}}$ is a 4-cycle, $N_2|_{\{a_1,a_2,a_3,x\}}$ is the same 4-cycle. So let $C_2 = C_{N_2}(a_1,a_2,a_3,x)$. We claim that C2 is the desired cycle of N2 that refines C1.

To see this, first consider any $a \in A_h$, $b \in A_i$, $c \in A_j$, $d \in X$ where $1 \le h < i < j \le s$. Then $a, b, c, d$ all meet at C1 and so $C_{N_1}(a,b,c,d)$ is well-defined. Since $i, j > 1$, we can replace a with $a_1$ and have that the set of leaves $\{a_1,b,c,d\}$ also meet at C1. By similar arguments, we also have that $\{a_1,a_2,c,d\}$ meet at C1 and $\{a_1,a_2,a_3,d\}$ meet at C1. Moreover each of these sets of 4 leaves induces a cycle in N1 (as d is below the reticulation vertex in C1), and so also induces a cycle in N2. Thus we have that $N_2|_{\{a,b,c,d\}}$, $N_2|_{\{a_1,b,c,d\}}$, $N_2|_{\{a_1,a_2,c,d\}}$, $N_2|_{\{a_1,a_2,a_3,d\}}$ are all 4-cycles, and in particular $C_{N_2}(a,b,c,d)$, $C_{N_2}(a_1,b,c,d)$, $C_{N_2}(a_1,a_2,c,d)$, $C_{N_2}(a_1,a_2,a_3,d)$ are all well-defined. (See Fig. 4.) By Lemma 5, we must have that $C_{N_2}(a,b,c,d) = C_{N_2}(a_1,b,c,d) = C_{N_2}(a_1,a_2,c,d) = C_{N_2}(a_1,a_2,a_3,d) = C_{N_2}(a_1,a_2,a_3,x) = C_2$.

We thus have that for $a \in A_h$, $b \in A_i$, $c \in A_j$, $d \in X$ with $h<i<j$, the set of leaves $\{a,b,c,d\}$ all meet at C2.

Now consider any two leaves $a', b'$ such that $a', b'$ appear in different sets in $A_1|\dots|A_s|X$. By choosing additional leaves $c', d'$ from other sets, such that one of $a', b', c', d'$ is in X, we have that $C_{N_2}(a',b',c',d') = C_{N_2}(a,b,c,d)$ where $a \in A_h$, $b \in A_i$, $c \in A_j$, $d \in X$, for some $h<i<j$. Then by the above we have that $C_{N_2}(a',b',c',d') = C_2$. In particular, $a', b'$ appear in different sets in the partition induced by C2. This implies that the partition induced by C2 is a refinement of the partition induced by C1. Moreover, observe that $a'$ is below the reticulation vertex in C2 if and only if $a' \in X$ (since the only element of $\{a,b,c,d\}$ below the reticulation vertex in C2 is the one from X). Thus, the partition induced by C2 is $B_1|\dots|B_t|X$ with X below the reticulation and $B_1|\dots|B_t$ a refinement of $A_1|\dots|A_s$. Therefore, C2 refines C1.
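The repeated application of Lemma 5 in this proof works by walking from one 4-leaf set to another through sets that differ in a single leaf. The following Python sketch generates such a chain and checks that consecutive sets share exactly three leaves, which is the hypothesis Lemma 5 needs. The function names are hypothetical, and the leaf names are used as plain labels.

```python
def exchange_chain(start, target):
    """Replace the leaves of `start` by those of `target`, one position at a time."""
    chain = [tuple(start)]
    current = list(start)
    for position, leaf in enumerate(target):
        if current[position] != leaf:
            current[position] = leaf
            chain.append(tuple(current))
    return chain


def share_three_leaves(first, second):
    """Two 4-leaf sets to which Lemma 5 applies have exactly three leaves in common."""
    return len(set(first) & set(second)) == 3


chain = exchange_chain(("a", "b", "c", "d"), ("a1", "a2", "a3", "x"))
links_ok = all(share_three_leaves(s, t) for s, t in zip(chain, chain[1:]))

for leaf_set in chain:
    print(leaf_set)
print(links_ok)
```

The sketch only checks the combinatorial shape of the chain. That every set in the chain induces a 4-cycle is a separate fact, established in the proof from the position of the leaves in the partition induced by C1.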

Lemma 7

Suppose that every cycle in N1 is refined by a cycle in N2. If N2 has a non-trivial split, then either N1,N2 share a non-trivial common split or they have conflicting splits.

Proof

Let A-B be a non-trivial split in N2. Fix an arbitrary $b \in B$, and take the edge e in N1 furthest from b such that e separates b from A. If e separates A from B, then A-B is a non-trivial common split and we are done.

Otherwise, let u be the vertex in e nearer to A. If u is on a cycle, then denote this cycle by C1. Let $X_1|\dots|X_s$ be the partition induced by C1, noting by construction that $X_i \cap A = \emptyset$ for the set $X_i$ containing b (since $X_i$ is the set of leaves reachable from C1 via e). If $X_j \supseteq A$ for any j, then the corresponding edge $e_j$ leaving C1 is an edge that is further away from b than e and which separates A from b, contradicting the choice of e. So we may assume that the partition $X_1|\dots|X_s$ must subdivide A (that is, A has non-empty intersection with at least two sets $X_j, X_h$).

Furthermore $X_1|\dots|X_s$ must subdivide B, as otherwise the set $X_i$ (which contains b) contains all of B and also none of A, which would imply that A-B is a common split. So C1 is a cycle in N1 whose induced partition subdivides both A and B. As every cycle in N1 is refined by a cycle in N2, this implies that some cycle in N2 also subdivides both A and B. But this contradicts the fact that N2 contains an A-B split. (See Fig. 5a.)

Fig. 5.

Illustration of N1 in the proof of Lemma 7

If u is not on a cycle, let f and g be the other edges incident to u. By choice of e, neither f nor g can separate A from b. Thus there is at least one element $a \in A$ reachable from u via f, and at least one element $a' \in A$ reachable from u via g. As e does not separate A from B, there is at least one $b' \in B$ that is reachable from u via either f or g, say (without loss of generality) f. Then let X-Y be the split induced by f, with Y the set containing b. Observe that $a, b' \in X$ while $a', b \in Y$. Thus we have that $X \cap A$, $X \cap B$, $Y \cap A$, $Y \cap B$ are all non-empty, and so N1 and N2 have conflicting splits. (See Fig. 5b.)

Distinguishability of triangle-free level-1 networks

Theorem 2 follows as a corollary of the next lemma.

Lemma 8

Let N1 and N2 be distinct n-leaf triangle-free level-1 semi-directed networks on $\mathcal{X}$ and $r(N_1) \ge r(N_2)$. Then $V_{N_1} \not\subseteq V_{N_2}$ under the JC, K2P, and K3P constraints.

Proof

We prove the claim by induction on $n = |\mathcal{X}|$, the number of leaves in N1 and N2. For the base case, if $n \le 4$ then either r(N1)=0 or r(N1)=1. If r(N1)=0, then N1 and N2 are both trees. If r(N1)=1 and r(N2)=1, then N1 and N2 are both 4-cycles. If r(N1)=1 and r(N2)=0, then N1 is a 4-cycle network and N2 is a tree. For each of these cases, by Lemma 1, it follows that $V_{N_1} \not\subseteq V_{N_2}$. Note that we must have $r(N_1) \le 1$ and $r(N_2) \le 1$ as these networks are triangle-free. Thus, this covers all cases for $n \le 4$.

So now assume that n>4 and that the claim is true for all smaller values of n. We first show that we may assume that any set of 4 leaves that induces a 4-cycle in N1 induces the same 4-cycle in N2.

Indeed, suppose this is not the case, and consider some arbitrary $A \subseteq \mathcal{X}$ with |A|=4 such that $N_1|_A$ is a 4-cycle but $N_2|_A$ is not the same 4-cycle. If $N_2|_A$ is a different 4-cycle or a double-triangle, then by Lemma 1, $N_1|_A$ and $N_2|_A$ are distinguishable (and in particular, $V_{N_1|_A} \not\subseteq V_{N_2|_A}$).

Otherwise, $N_2|_A$ is either a tree or a 3-cycle network, and Lemma 1 implies that $V_{N_1|_A} \not\subseteq V_{N_2|_A}$. In either case, $V_{N_1|_A} \not\subseteq V_{N_2|_A}$ and hence, by Lemma 3, we have $V_{N_1} \not\subseteq V_{N_2}$.

So we may now assume that any set of 4 leaves that induces a 4-cycle in N1 induces the same 4-cycle in N2. By Lemma 6, this implies that every cycle in N1 is refined by a cycle in N2. By Lemma 7, N1 and N2 must have either a non-trivial common split or conflicting splits, or N2 must have no non-trivial split. It remains to complete the proof in these three cases.

Firstly, if N1, N2 have conflicting splits, then by Lemma 2 we have $V_{N_1} \not\subseteq V_{N_2}$, as required.

Secondly, suppose that X-Y is a non-trivial common split, and consider N1/X, N2/X, N1/Y, N2/Y as defined in the beginning of Sect. 4. Since $|X|,|Y| \ge 2$, each of these networks has fewer than n leaves. Thus by the induction hypothesis, if N1/X, N2/X are distinct and $r(N_1/X) \ge r(N_2/X)$, then $V_{N_1/X} \not\subseteq V_{N_2/X}$, from which it follows by Lemma 4 that $V_{N_1} \not\subseteq V_{N_2}$. A similar argument holds if N1/Y, N2/Y are distinct and $r(N_1/Y) \ge r(N_2/Y)$. But at least one of these cases must hold. Indeed, since $r(N_1/X)+r(N_1/Y) = r(N_1) \ge r(N_2) = r(N_2/X)+r(N_2/Y)$, it must hold that $r(N_1/X) > r(N_2/X)$, or $r(N_1/Y) > r(N_2/Y)$, or $r(N_1/X) = r(N_2/X)$ and $r(N_1/Y) = r(N_2/Y)$. If $r(N_1/X) > r(N_2/X)$ (or $r(N_1/Y) > r(N_2/Y)$) then those networks are clearly distinct. Otherwise we have $r(N_1/X) = r(N_2/X)$ and $r(N_1/Y) = r(N_2/Y)$. We must have that N1/X, N2/X are distinct or N1/Y, N2/Y are distinct, since N1 and N2 are distinct. Thus we either have that N1/X, N2/X are distinct and $r(N_1/X) \ge r(N_2/X)$, or N1/Y, N2/Y are distinct and $r(N_1/Y) \ge r(N_2/Y)$. In either case we have $V_{N_1} \not\subseteq V_{N_2}$, as required.
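The counting step behind the three-way case distinction in this paragraph can be written out as follows. This is only a restatement of the argument, using the additivity of the reticulation number over a split and the hypothesis of Lemma 8 that N1 has at least as many reticulation vertices as N2.

```latex
\begin{align*}
&\text{Suppose } r(N_1/X) \le r(N_2/X) \text{ and } r(N_1/Y) \le r(N_2/Y). \text{ Then}\\
&\qquad r(N_1) = r(N_1/X) + r(N_1/Y) \le r(N_2/X) + r(N_2/Y) = r(N_2) \le r(N_1),\\
&\text{so every inequality is an equality, and in particular}\\
&\qquad r(N_1/X) = r(N_2/X) \quad\text{and}\quad r(N_1/Y) = r(N_2/Y).
\end{align*}
```

Hence, if neither strict inequality holds, both pairs of reticulation numbers agree, which is the third case considered in the proof.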

Finally, suppose that N2 has no non-trivial split. Then N2 is an n-cycle network, that is, N2 has a single cycle and every leaf is incident to a vertex on the cycle. If r(N1)=1, then N1 and N2 are both networks with exactly one cycle of length at least four. It then follows from Theorem 1, together with Proposition 2, that N1 and N2 are distinguishable (and, in particular, $V_{N_1} \not\subseteq V_{N_2}$). If on the other hand $r(N_1) \ge 2$, then consider two cycles C1 and C2 in N1, with $X_1$ the subset of $\mathcal{X}$ below the reticulation in C1, and $X_2$ the subset of $\mathcal{X}$ below the reticulation in C2. Since C1 and C2 are different cycles, $X_1 \ne X_2$. But then this contradicts the fact that every cycle in N1 is refined by a cycle in N2, as the single cycle in N2 would have to have both $X_1$ and $X_2$ as the set of leaves below the reticulation. Thus in all cases we have either a contradiction or $V_{N_1} \not\subseteq V_{N_2}$, which completes the proof of Lemma 8.

We are now ready to prove Theorem 2, which we restate for convenience.

Theorem 2 The network parameter of a network-based Markov model under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints is generically identifiable with respect to the class of models where the network parameter is an n-leaf triangle-free, level-1 semi-directed network with $r \ge 0$ reticulation vertices.

Proof

Let $\{\mathcal{M}_N\}_{N \in \mathcal{N}}$ be a class of triangle-free, level-1 network models with a fixed number of reticulation vertices. Let $N_1, N_2 \in \mathcal{N}$ be distinct n-leaf triangle-free level-1 semi-directed networks on $\mathcal{X}$ with r(N1)=r(N2). By invoking Lemma 8 twice, we have $V_{N_1} \not\subseteq V_{N_2}$ and $V_{N_2} \not\subseteq V_{N_1}$ under the JC, K2P, and K3P constraints. By definition, N1 and N2 are distinguishable; as N1 and N2 were chosen arbitrarily from $\mathcal{N}$, it follows that the semi-directed network parameter of $\{\mathcal{M}_N\}_{N \in \mathcal{N}}$ is generically identifiable under the JC, K2P, and K3P constraints.

Discussion

We have shown that triangle-free level-1 semi-directed networks are generically identifiable under the Jukes–Cantor, Kimura 2-parameter, and Kimura 3-parameter constraints. This means that, given a long enough multiple sequence alignment that evolved on a network of this class under one of these models, this network is, with high probability, the only network from the class that coincides with the given data. Roughly speaking, this means that the data provide sufficient information to reconstruct the network. To prove this result, we employed a blend of algebraic and combinatorial methods to show that any pair of networks are geometrically distinguishable.

Previously, it had been shown that networks cannot be identified from certain substructures. For example, networks cannot be inferred from their displayed trees since more than one network can display the same set of trees (Gambette and Huber 2012; Pardi and Scornavacca 2015). Similarly, a network cannot in general be reconstructed from its collection of proper subnetworks, since two distinct networks can have exactly the same set of proper subnetworks (Huber et al. 2015). Nevertheless, for certain restricted network classes it has been shown that those networks can be uniquely reconstructed from their subnetworks (Huber and Moulton 2013; Huebler et al. 2019; van Iersel and Moulton 2014; Nipius 2020). These proofs are related to our combinatorial results, in that our proof strategy for showing network distinguishability involved careful examination of induced 4-leaf subnetworks. However, there are some fundamental differences that prevent directly using known results on building networks from subnetworks. Firstly, the existing results focus either on directed (e.g. van Iersel and Moulton (2014)) or on undirected (e.g. van Iersel and Moulton (2018)) networks. Our results, as well as the ones in Allman et al. (2019), Baños (2019), Huebler et al. (2019), provide the first combinatorial results on semi-directed networks. The main obstacle, however, was that not all 4-leaf level-1 semi-directed networks are distinguishable under the considered models. Hence, two networks can be indistinguishable even if the sets of induced subnetworks are distinct. Consequently, we had a severely restricted set of building blocks available, requiring a combination of combinatorial and algebraic techniques.

On the algebraic front, the computations reveal differences between the relationships between the network ideals under the JC constraints and the relationships between the ideals under the K2P and K3P constraints that would be interesting directions for further exploration. In Hollering and Sullivant (2020), the authors remark that the phenomenon observed in Gross and Long (2017) under the JC constraints, where each triangle network variety is contained within several of the 4-cycle network varieties, does not occur under the K2P and K3P constraints. In other words, under the K2P and K3P constraints, 4-cycle networks and triangle networks are distinguishable. In our computations for this paper, we noticed another phenomenon that seems to only hold for JC constraints. In particular, under the JC constraints, the ideals for the double-triangle networks and the 4-cycle networks are of the same dimension and are all distinct. This is somewhat surprising as one might expect the additional reticulation vertex and associated reticulation parameters of the double-triangle network to increase the dimension of the model. Our numerical computations suggest that this is another unique feature of the JC constraints. However, establishing this result rigorously may require other methods, since we were unable to compute full generating sets for the vanishing ideals of the networks under the K2P and K3P constraints.

Additionally, from an algebraic perspective, we note that adapting the random search strategy described in Hollering and Sullivant (2020) is what allowed us to find candidate subsets of variables for locating the necessary invariants to establish our main result. Something similar will likely need to be employed if these results are to be extended to other families of networks. It would be interesting to understand the relative computational costs once a candidate subset of variables is found, of either computing invariants in a subring of the original variables as we did, or of computing the linear matroid of the Jacobian with symbolic parameters as was done in Hollering and Sullivant (2020).

Finally, a major open problem, which is the larger setting for this paper, is to determine whether generic identifiability results such as these can be extended to higher level networks. We expect that finding the necessary invariants for the increased number of non-unique induced 4-leaf subnetworks will be challenging. Furthermore, the complexity of the combinatorial part of the proof will explode for higher levels. This question is open not only for the group-based models studied in this paper, but also for the general Markov model, which has just started to be studied in the context of networks (Casanellas and Fernández-Sánchez 2020).

Footnotes

Leo van Iersel, Remie Janssen, Mark Jones and Yukihiro Murakami were partly supported by the Netherlands Organization for Scientific Research (NWO), Vidi Grant 639.072.602 and Mark Jones also by the gravitation Grant NETWORKS. Elizabeth Gross was supported by the National Science Foundation (NSF), DMS-1945584.

Contributor Information

Elizabeth Gross, Email: egross@hawaii.edu.

Leo van Iersel, Email: L.J.J.vanIersel@tudelft.nl.

Remie Janssen, Email: R.Janssen-2@tudelft.nl.

Mark Jones, Email: M.E.L.Jones@tudelft.nl.

Colby Long, Email: clong@wooster.edu.

Yukihiro Murakami, Email: Y.Murakami@tudelft.nl, Email: yukimurakami07201994@gmail.com.

References

  1. Allman ES, Rhodes JA. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol. 2006;13(5):1101–1113. doi: 10.1089/cmb.2006.13.1101.
  2. Allman ES, Petrović S, Rhodes JA, Sullivant S. Identifiability of 2-tree mixtures for group-based models. IEEE/ACM Trans Comp Biol Bioinform. 2011;8(3):710–722. doi: 10.1109/TCBB.2010.79.
  3. Allman ES, Baños H, Rhodes JA. Nanuq: a method for inferring species networks from gene trees under the coalescent model. Algorithms Mol Biol. 2019;14(1):24. doi: 10.1186/s13015-019-0159-2.
  4. Baños H. Identifying species network features from gene tree quartets under the coalescent model. Bull Math Biol. 2019;81(2):494–534. doi: 10.1007/s11538-018-0485-4.
  5. Bapteste E, van Iersel L, Janke A, Kelchner S, Kelk S, McInerney JO, Morrison DA, Nakhleh L, Steel M, Stougie L, et al. Networks: expanding evolutionary thinking. Trends Genet. 2013;29(8):439–441. doi: 10.1016/j.tig.2013.05.007.
  6. Baroni M, Semple C, Steel M. A framework for representing reticulate evolution. Ann Comb. 2005;8(4):391–408. doi: 10.1007/s00026-004-0228-0.
  7. Bryant D, Moulton V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol. 2004;21(2):255–265. doi: 10.1093/molbev/msh018.
  8. Cardona G, Rosseló F, Valiente G. Comparison of tree-child phylogenetic networks. IEEE/ACM Trans Comp Biol Bioinform. 2007;6:552–569. doi: 10.1109/TCBB.2007.70270.
  9. Casanellas M, Fernández-Sánchez J (2020) Rank conditions on phylogenetic networks. arXiv preprint arXiv:2004.12988
  10. Casanellas M, Garcia LD, Sullivant S. Catalog of small trees. Cambridge: Cambridge University Press; 2005. pp. 291–304.
  11. Chang J. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci. 1996;137(1):51–73. doi: 10.1016/S0025-5564(96)00075-2.
  12. Evans S, Speed T. Invariants of some probability models used in phylogenetic inference. Ann Stat. 1993;21(1):355–377. doi: 10.1214/aos/1176349030.
  13. Francis A, Moulton V. Identifiability of tree-child phylogenetic networks under a probabilistic recombination–mutation model of evolution. J Theor Biol. 2018;446:160–167. doi: 10.1016/j.jtbi.2018.03.011.
  14. Gambette P, Huber KT. On encodings of phylogenetic networks of bounded level. J Math Biol. 2012;65(1):157–180. doi: 10.1007/s00285-011-0456-y.
  15. Grayson D, Stillman M (2002) Macaulay2, a software system for research in algebraic geometry. Available at http://www.math.uiuc.edu/Macaulay2/
  16. Gross EK, Long C. Distinguishing phylogenetic networks. SIAM J Appl Algebra Geom. 2017;2:72–93. doi: 10.1137/17M1134238.
  17. Gusfield D. ReCombinatorics: the algorithmics of ancestral recombination graphs and explicit phylogenetic networks. Cambridge: MIT Press; 2014.
  18. Hendy MD, Penny D. Complete families of linear invariants for some stochastic models of sequence evolution, with and without molecular clock assumptions. J Comput Biol. 1996;3:19–32. doi: 10.1089/cmb.1996.3.19.
  19. Hollering B, Sullivant S. Identifiability in phylogenetics using algebraic matroids. J Symb Comput. 2020. doi: 10.1016/j.jsc.2020.04.012.
  20. Huber KT, Moulton V. Encoding and constructing 1-nested phylogenetic networks with trinets. Algorithmica. 2013;66(3):714–738. doi: 10.1007/s00453-012-9659-x.
  21. Huber KT, van Iersel L, Kelk S, Suchecki R. A practical algorithm for reconstructing level-1 phylogenetic networks. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 2011;8(3):635–649. doi: 10.1109/TCBB.2010.17.
  22. Huber KT, Van Iersel L, Moulton V, Wu T. How much information is needed to infer reticulate evolutionary histories? Syst Biol. 2015;64(1):102–111. doi: 10.1093/sysbio/syu076.
  23. Huber KT, van Iersel L, Janssen R, Jones M, Moulton V, Murakami Y, Semple C (2019) Rooting for phylogenetic networks. ArXiv preprint arXiv:1906.07430
  24. Huebler S, Morris R, Rusinko J, Tao Y (2019) Constructing semi-directed level-1 phylogenetic networks from quarnets. ArXiv preprint arXiv:1910.00048
  25. Huson DH, Rupp R, Scornavacca C. Phylogenetic networks: concepts, algorithms and applications. Cambridge: Cambridge University Press; 2010.
  26. Jin G, Nakhleh L, Snir S, Tuller T. Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics. 2007;23(2):e123–e128. doi: 10.1093/bioinformatics/btl313.
  27. Long C, Kubatko L. Identifiability and reconstructibility of a modified coalescent. Bull Math Biol. 2018;81:408. doi: 10.1007/s11538-018-0456-9.
  28. Nakhleh L. Problem solving handbook in computational biology and bioinformatics. Berlin: Springer Science+Business Media, LLC; 2011. pp. 125–158.
  29. Nakhleh L, Ruths D, Wang LS (2005) Riata-HGT: a fast and accurate heuristic for reconstructing horizontal gene transfer. In: International computing and combinatorics conference. Springer, pp 84–93
  30. Nipius L (2020) Rooted binary level-3 phylogenetic networks are encoded by quarnets. Bachelor’s thesis, Delft University of Technology. http://resolver.tudelft.nl/uuid:a9c5a8d4-bc8b-4d15-bdbb-3ed35a9fb75d
  31. Pardi F, Scornavacca C. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput Biol. 2015;11(4):e1004135. doi: 10.1371/journal.pcbi.1004135.
  32. Rhodes JA, Sullivant S. Identifiability of large phylogenetic mixtures. Bull Math Biol. 2012;74(1):212–231. doi: 10.1007/s11538-011-9672-2.
  33. Rossello F, Valiente G. All that glisters is not galled. Math Biosci. 2009;221:54–59. doi: 10.1016/j.mbs.2009.06.007.
  34. Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
  35. Solís-Lemus C, Ané C. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 2016;12(3):e1005896. doi: 10.1371/journal.pgen.1005896.
  36. Solís-Lemus C, Bastide P, Ané C. PhyloNetworks: a package for phylogenetic networks. Mol Biol Evol. 2017;34(12):3292–3298. doi: 10.1093/molbev/msx235.
  37. Solís-Lemus C, Coen A, Ané C (2020) On the identifiability of phylogenetic networks under a pseudolikelihood model. ArXiv preprint arXiv:2010.01758
  38. Steel M. Phylogeny: discrete and random processes in evolution. New Delhi: SIAM; 2016.
  39. Sturmfels B, Sullivant S. Toric ideals of phylogenetic invariants. J Comput Biol. 2005;12(2):204–228. doi: 10.1089/cmb.2005.12.204.
  40. Sullivant S. Algebraic statistics. USA: American Mathematical Society; 2018.
  41. Than C, Ruths D, Nakhleh L. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary histories. BMC Bioinform. 2008;9:322. doi: 10.1186/1471-2105-9-322.
  42. Thatte BD. Reconstructing pedigrees: some identifiability questions for a recombination-mutation model. J Math Biol. 2013;66(1):37–74. doi: 10.1007/s00285-011-0503-8.
  43. van Iersel L, Moulton V. Trinets encode tree-child and level-2 phylogenetic networks. J Math Biol. 2014;68(7):1707–1729. doi: 10.1007/s00285-013-0683-5.
  44. van Iersel L, Moulton V. Leaf-reconstructibility of phylogenetic networks. SIAM J Discrete Math. 2018;32(3):2047–2066. doi: 10.1137/17M1111930.
  45. Wen D, Yu Y, Zhu J, Nakhleh L. Inferring phylogenetic networks using PhyloNet. Syst Biol. 2018;67(4):735–740. doi: 10.1093/sysbio/syy015.
  46. Yang J, Grünewald S, Xu Y, Wan XF. Quartet-based methods to reconstruct phylogenetic networks. BMC Syst Biol. 2014;8(1):21. doi: 10.1186/1752-0509-8-21.

Articles from Journal of Mathematical Biology are provided here courtesy of Springer
