Abstract
We show that many topological features of level-1 species networks are identifiable from the distribution of the gene tree quartets under the network multi-species coalescent model. In particular, every cycle of size at least 4 and every hybrid node in a cycle of size at least 5 is identifiable. This is a step toward justifying the inference of such networks which was recently implemented by Solís-Lemus and Ané. We show additionally how to compute quartet concordance factors for a network in terms of simpler networks, and explore some circumstances in which cycles of size 3 and hybrid nodes in 4-cycles can be detected.
Keywords: Coalescent Theory, Phylogenetics, Networks, Concordance factors
1. Introduction
As phylogenetic analysis of DNA data has progressed, more evidence has appeared showing that hybridization is often an important factor in evolution. As surveyed in [16], hybridization has played a very important role in the evolutionary history of plants and of some groups of fish and frogs ([7], [12], [14], [17], [20]). Other biological processes, such as introgression, lateral gene transfer and gene flow, also require moving beyond a simple tree-like view of species relationships.
Phylogenetic networks are the objects used to represent the relationships between species that admit such events ([3],[4]). These networks are often thought of as obtained from phylogenetic trees by adding additional edges, so that some nodes in the tree have two parents. Nodes with two parents, called hybrid nodes, represent species whose genome arises from two different ancestral species. Inference of phylogenetic networks from biological data presents new challenges, with methods still being developed, as shown by recent works including [2], [15], [23], [32], [29] and [31].
Another challenge in inferring evolutionary history arises from the fact that many multi-locus data sets exhibit gene tree incongruence, even without suspected hybridization. One possible reason is incomplete lineage sorting (ILS), which is described in the tree setting by the multi-species coalescent model [18]. See for example [5], [19], and [27] where ILS is explained in the biological setting.
Meng and Kubatko [15] formulated a model of gene tree production, based on the multi-species coalescent model, incorporating both hybridization and ILS. We refer to this model as the network multi-species coalescent model, which is further developed in [30], [33], and [24], among others. The model determines the probability of observing any rooted gene tree given a metric rooted phylogenetic species network.
Solís-Lemus and Ané [23] recently presented a novel statistical method, based on the network multi-species coalescent model, to infer phylogenetic networks from gene tree quartets in a pseudolikelihood framework. The quartets themselves might come from larger gene trees inferred by standard phylogenetic methods. The pseudolikelihood in this work is built on quartet frequencies, or concordance factors, extending an idea of Liu [13] from the tree setting. The pseudolikelihood approach is simpler and faster than computing the full likelihood and makes large-scale data analysis more tractable. They demonstrate positive results in reconstructing the evolutionary relationships among swordtails and platyfishes.
However, the theoretical underpinnings of the method of [23] are not complete. In using a model for statistical inference it is important to know if it is theoretically possible to uniquely recover the parameters from the data the model predicts. In more precise terms, for model-based statistical inference to have a solid basis, we need that the probability distribution for data which arises under the model uniquely determines the parameters. This is known as identifiability of the model parameters.
While [23] showed that any particular hybridization in a level-1 network with h hybridizations and n taxa can be generically detected under certain assumptions, their study never addressed the full identifiability of the network topology, only the detectability of a specific hybridization event. Although they work in the setting of level-1 networks, which is also adopted here, their arguments do not investigate network properties such as cycle sizes or the structure of the whole network. These properties are crucial for determining, for example, whether two networks with different cycle sizes, or different numbers of cycles, could produce the same set of gene tree quartet probabilities.
The primary purpose of this work is to begin to address some of these identifiability questions raised in [23]. That is, we study the question: given information on gene quartet probabilities for some unknown level-1 network, what can be determined about the topology of that network?
Although others have considered the problem of constructing large networks from small ones, these works do not seem to be applicable to the question studied here. Most of these works, including [9], [10] and [11], are primarily combinatorial in nature. In particular, these studies do not address semidirected networks, ILS through the network multi-species coalescent model, nor the types of inputs that might be obtained from biological data.
The main result of this work, Theorem 4 of Section 8, is that under the network multi-species coalescent model on level-1 networks, we can generically identify from gene quartet distributions “most” of the unrooted topological network, including all cycles of size at least 4, and hybrid nodes in the cycles of size greater than 4. “Generically” here means for all values of numerical parameters except those in a set of measure zero. The methods used are a mix of the semi-algebraic study of quartet gene tree frequencies (in terms of linear equalities and inequalities they satisfy) with combinatorial approaches to combining this knowledge for many quartets. As a side benefit the proofs suggest combinatorial methods for reconstructing networks, as opposed to just showing identifiability. However, we do not explore how such methods might be implemented in the presence of the noise that any collection of inferred gene trees will have.
Another result of this work, in Section 5, is a rigorous derivation of how gene quartet probabilities can be computed for large networks under the coalescent model. Although this parallels some of the results in [23], the arguments given here are more rigorous, as is necessary for them to form the basis of our main results. Our approach is to express quartet frequencies as convex combinations of those on simplified networks, ultimately leading to expressions in terms of trees, as is done in other situations [34]. This is different from the approach in [23] of finding networks with fewer hybridizations displaying the same gene quartet probabilities.
The outline of this work is as follows: Section 2 introduces basic definitions and establishes some terminology on graphs and networks. Section 3 sets forth insights and tools for studying the structure of level-1 networks. Section 4 reviews the network multi-species coalescent model of [15], as well as quartet concordance factors and some of their properties. In Section 5 we show how concordance factors of quartet networks can be expressed in terms of simpler networks. Section 6 introduces the “Cycle property” of concordance factors and Section 7 defines the “Big Cycle” property of concordance factors. In Section 8, the main result on topological network identifiability is proved using the Big Cycle property and in Section 9 some extended results on the “Cycle property” are shown.
2. Phylogenetic networks
We adopt standard terminology for graphs and networks, as used in phylogenetics; see for example [22] and [25]. None of the undirected, directed, or semidirected graphs considered here contain loops. If G is a directed or semidirected graph, the undirected graph of G, denoted by U(G), is the graph G with all directions omitted.
2.1. Rooted networks
To set terminology, we begin with some fundamental definitions.
Definition 1 A topological binary rooted phylogenetic network on taxon set X is a connected directed acyclic graph with vertices V and edges E, where V is the disjoint union V = {r}⊔VL ⊔VH ⊔VT and E is the disjoint union E = EH ⊔ ET, and a bijective leaf-labeling function f: VL → X with the following characteristics:
The root r has indegree 0 and outdegree 2.
A leaf v ∈ VL has indegree 1 and outdegree 0.
A tree node v ∈ VT has indegree 1 and outdegree 2.
A hybrid node v ∈ VH has indegree 2 and outdegree 1.
A hybrid edge e ∈ EH is an edge whose child is a hybrid node.
A tree edge e ∈ ET is an edge whose child is a tree node or a leaf.
Definition 2 Let be a topological binary rooted phylogenetic network with |E| = m and |EH| = 2h. A metric for is a pair (λ,γ), where λ is a function assigning a length λ(e) to each edge e ∈ E, and γ : EH → (0,1) satisfies that if two edges h1 and h2 have the same hybrid node as child, then γ(h1) + γ(h2) = 1.
If (λ,γ) is a metric for , then we refer to (,(λ,γ)) as a metric binary rooted phylogenetic network.
Note that Definition 1 differs from that of [25] in that it allows up to two edges between a pair of nodes. An edge weight λ(e) is interpreted as the time (in coalescent units) between speciation events represented by the ends of edge e. For any hybrid edge h with child v, the value γ(h) = γh is the probability that a lineage at v has its ancestral lineage in h, and is often called the hybridization parameter or inheritance probability. Since we are focusing on parameter identifiability, we will use the term hybridization parameter.
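The degree conditions of Definition 1 and the constraint on γ in Definition 2 can be checked mechanically. The following is a minimal sketch, not part of the original work: the function names, the edge-dictionary encoding and the example network are all illustrative assumptions. Edges are keyed by name so that two parallel edges between the same pair of nodes are allowed. The sketch does not test connectivity or acyclicity.

```python
from collections import Counter
import math

def classify_nodes(edges):
    """Classify nodes of a directed multigraph by (indegree, outdegree).

    `edges` maps a hypothetical edge name to a (parent, child) pair.
    Raises ValueError if a node fits none of the four types of Definition 1.
    """
    indeg, outdeg = Counter(), Counter()
    for parent, child in edges.values():
        outdeg[parent] += 1
        indeg[child] += 1
    kinds = {(0, 2): "root", (1, 0): "leaf", (1, 2): "tree", (2, 1): "hybrid"}
    node_type = {}
    for v in set(indeg) | set(outdeg):
        degrees = (indeg[v], outdeg[v])
        if degrees not in kinds:
            raise ValueError(f"node {v!r} has (in, out) degree {degrees}")
        node_type[v] = kinds[degrees]
    if sum(t == "root" for t in node_type.values()) != 1:
        raise ValueError("expected exactly one root")
    return node_type

def gamma_is_valid(edges, node_type, gamma):
    """Check that gamma lies in (0,1) and sums to 1 over each hybrid node's parents."""
    total = Counter()
    for name, (parent, child) in edges.items():
        if node_type[child] == "hybrid":
            g = gamma[name]
            if not 0.0 < g < 1.0:
                return False
            total[child] += g
    return all(math.isclose(s, 1.0) for s in total.values())

# Illustrative 3-taxon network with one hybrid node h.
example_edges = {
    "e1": ("r", "u"), "e2": ("r", "w"), "e3": ("u", "a"), "e4": ("u", "h"),
    "e5": ("w", "h"), "e6": ("w", "c"), "e7": ("h", "b"),
}
example_types = classify_nodes(example_edges)
```

Here `e4` and `e5` are the hybrid edges, so a valid γ assigns them values such as 0.3 and 0.7.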
2.2. Lowest stable ancestor
We review and show some properties of the lowest stable ancestor, a network analog of the most recent common ancestor on a tree.
Definition 3 Let be a (metric or topological) binary rooted phylogenetic network. We say that a node v is above a node u, and u is below v, if there exists a non-empty directed path in from v to u. We also say that an edge with parent node x and child y is above (below) a node v if y is above or equal to v (x is below or equal to v).
Note that since has no directed cycles, u cannot be both above and below v.
Definition 4 [25] Let be a (metric or topological) binary rooted phylogenetic network on X and let Z ⊆ X. Let D be the set of nodes which lie on every directed path from the root r of to any z ∈ Z. Then the lowest stable ancestor of Z of, denoted by LSA(Z,), is the unique node v ∈ D such that v is below all u ∈ D, u ≠ v.
When is clear from context, we write LSA(Z) for LSA(Z,). To see that LSA(Z) is well defined for any Z ⊆ X, note first that D ≠ ∅ since r ∈ D. Also, since every pair of nodes u,v ∈ D both lie on a path, we have a notion of above and below for u and v, i.e. a total order on D, and hence a minimal element.
While the definition of LSA agrees with the most recent common ancestor for trees, it is more subtle. In particular, if is a network on X, LSA(X) need not be the root of the network, as Figure 1 (left) shows. Furthermore, there can be nodes below LSA(X) which are ancestral to all of X, as Figure 2 shows.
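Definition 4 translates directly into a brute-force computation: a node lies on every directed path from the root to every z ∈ Z exactly when deleting it cuts the root off from all of Z. The sketch below follows that reading. The adjacency encoding, function names and both example networks are assumptions made for illustration, and Z is assumed to be a non-empty set of leaves reachable from the root.

```python
def reachable(children, start, removed=None):
    """Nodes reachable from `start` along directed edges, avoiding `removed`."""
    if start == removed:
        return set()
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child != removed and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def lowest_stable_ancestor(children, root, taxa):
    """Return LSA(Z) for Z = `taxa`, following Definition 4."""
    taxa = set(taxa)
    nodes = reachable(children, root)
    # D: nodes whose removal disconnects the root from every taxon in Z.
    stable = {v for v in nodes
              if not (reachable(children, root, removed=v) & taxa)}
    # D is totally ordered by "above"; the LSA is the element of D
    # from which no other element of D can be reached.
    for v in stable:
        if not ((reachable(children, v) - {v}) & stable):
            return v

# A 2-cycle above x: the root r is not LSA({a, b, c}).
two_cycle_on_top = {"r": ["p", "p"], "p": ["x"], "x": ["a", "t"], "t": ["b", "c"]}
# One hybrid node h with parents u and w.
one_hybrid = {"r": ["u", "w"], "u": ["a", "h"], "w": ["h", "c"], "h": ["b"]}
```

In the first network the call with taxa {a, b, c} returns `x` rather than the root, which mirrors the situation described for Figure 1. The approach is quadratic in the size of the network and is meant only to make the definition concrete.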
Fig. 1.
(Left) A binary rooted phylogenetic network on X, with LSA(X) the node labeled x, and (Right) its induced unrooted semidirected network. In a depiction of a rooted network, all edges are directed downward, from the root, but arrowheads are shown only on hybrid edges. For the unrooted network, all edges except hybrid ones are undirected.
Fig. 2.
A binary rooted phylogenetic network where the node labeled y is ancestral to all taxa in X but is not LSA(X). LSA(X) here is the root of the network.
Lemma 1 Let be a (metric or topological) binary rooted phylogenetic network on X with root r, and let Z ⊆ Y ⊆ X. Then
(i) the indegree of LSA(Z) is at most one for any Z ⊂ X;
(ii) at most one of the out edges of LSA(Z) is hybrid;
(iii) if Z ⊆ Y ⊆ X then LSA(Z) is below or equal to LSA(Y ).
Proof To see (i), suppose that the indegree of LSA(Z) is two. Then the outdegree would be one, and the child of LSA(Z) would be on every path from the root to any taxon in Z, contradicting the definition of LSA(Z).
For (ii), suppose the out edges of LSA(Z), e1 and e2, are both hybrid. If e1 and e2 have the same child then every path from r to any z ∈ Z would contain that node, contradicting the definition of LSA(Z).
Now denote by x1 ≠ x2 the child nodes of e1 and e2 respectively. If both x1 and x2 had parents below LSA(Z), then x1 has a parent below x2 and x2 has a parent below x1, giving a directed cycle. Thus, without loss of generality, assume x1 has parents LSA(Z) and v with v not below LSA(Z). Let z ∈ Z with z below x1. If we remove LSA(Z) from the network, there is still a path from r to z (which goes from r to v to x1 to z). This contradicts the fact that LSA(Z) is on all paths from r to any z ∈ Z.
For (iii) we observe that since Z ⊆ Y , LSA(Y ) must be equal to or above LSA(Z), since the set of paths from r to any taxon in Y contains the set of paths from r to any taxon in Z.
Lemma 2 Let be a (metric or topological) binary rooted phylogenetic network on X and let Z ⊂ X, |Z| ≥ 2. For every x ∈ Z, there is a y ∈ Z such that LSA(x,y)=LSA(Z).
Proof Let m=LSA(Z), fix x ∈ Z and let P be a path from m to x. By definition of LSA, for all y ∈ Z, LSA(x,y) is a node in P and is below or equal to m by Lemma 1. Suppose that LSA(x,y) is below m for all y ∈ Z. Let z ∈ Z be such that LSA(x,z) is above or equal to LSA(x,y) for all y ∈ Z \ {z}.
We claim that any path from m to y ∈ Z passes through LSA(x,z). Suppose there exists taxon y with path P′ from m to y that does not pass through LSA(x,z). But P′ must pass through LSA(x,y). Since LSA(x,y) is below LSA(x,z), there is a path from m to LSA(x,y) to x that does not contain LSA(x,z). This is a contradiction.
But every path from m to any y ∈ Z passes through LSA(x,z), contradicting that LSA(x,z) is below m.
By this Lemma we can characterize LSA(Z) as the highest node of the form LSA(x,y) for some x,y ∈ Z, or the highest node of that form for fixed x ∈ Z.
2.3. Unrooted networks.
Let G be a directed or semidirected graph with z a degree two node. Let x and y be the two nodes adjacent to z. Then, up to isomorphism, the subgraph on x,y and z must be one of the graphs shown on the left of Figure 3, which we denote by H. By suppressing z we mean replacing H in G by the graph to the right of it in Figure 3.
Fig. 3.
On the left are all the semidirected graphs, up to isomorphism, on a degree two node z and its adjacent vertices x and y. On the right are the corresponding graphs obtained by suppressing z.
Definition 5 Let be a binary topological rooted phylogenetic network on a set of taxa X. Then is the semidirected network obtained by 1) keeping only the edges and nodes below LSA(X); 2) removing the direction of all tree edges; 3) suppressing LSA(X). We refer to as the topological unrooted semidirected network induced from.
Figure 1 shows an example of a network and its induced. We now introduce a metric on induced from one on.
Definition 6 Let (,(λ,γ)) be a metric binary rooted phylogenetic network and let be the topological unrooted semidirected network induced from. Denote by e∗ the edge of introduced in place of the edges e1 and e2 in when LSA(X) is suppressed. Define such that λ′(e*) = λ(e1) + λ(e2) and λ′(e) = λ(e) for e ∈, e ≠ e∗. If e∗ is not hybrid, γ′ = γ, else let γ′(h) = γ(h) for all hybrid edges of other than e∗ and γ′(e*) = γ(ei), where ei is, by Lemma 1, the single hybrid edge in {e1,e2}. We refer to (,(λ′,γ′)) as the metric unrooted semidirected network induced from (,(λ,γ)).
The networks considered in this work are always induced from a rooted binary metric phylogenetic network. To simplify language, we refer to a (metric or topological) binary rooted phylogenetic network as a (metric or topological) rooted network and to an induced (metric or topological) unrooted semidirected phylogenetic network as a (metric or topological) unrooted network.
We note that not all binary semidirected graphs are topological unrooted networks, since some graphs are not compatible with suppressing the root on any rooted network. Moreover, might be induced from several rooted networks. See Figure 4.
Fig. 4.
The top graph is not a topological unrooted semidirected phylogenetic network, since its directed edges cannot be obtained by suppressing the root of any 6-taxon topological binary rooted phylogenetic network. The middle graph is the induced topological unrooted network from either of the bottom rooted networks, as well as others.
Although an unrooted network does not have a root specified, since hybrid edges are directed, the suppressed LSA(X) of must have been located ‘above’ them. Thus in, we still have a well-defined notion of which taxa are descendants of a hybrid node v. These are the taxa x such that there exists a semidirected path from v to x in. In this case we say that x descends from v.
2.4. Induced networks on subsets of taxa
Since later arguments require an understanding of the behavior of the network multi-species coalescent model on a subset of taxa, we introduce some needed definitions.
Definition 7 Let be a (metric or topological) rooted network on X and let Z ⊂ X. The induced rooted network on Z is the network obtained from by 1) retaining only edges and nodes in paths from the root to any taxon in Z; 2) suppressing all degree two nodes except the root; 3) in the case the root then has outdegree one, contracting the edge incident to the root.
Note that LSA(Z,)=LSA(Z,). If |Z| = 4 then, the induced rooted quartet network on Z, will also be denoted by to emphasize it involves only 4 taxa.
Definition 8 Let be a (metric or topological) rooted network on X and let Z ⊂ X. The induced LSA network of Z, denoted, is the rooted network obtained from by deleting everything above LSA(Z,).
In particular we note that has root LSA(Z,). If |Z| = 4 then, the induced LSA quartet network on Z, is also denoted by.
Definition 9 Let G be a semidirected graph and let x,y be two nodes in G. A trek in G from x to y is an ordered pair of semidirected paths (P1,P2) where P1 has terminal node x, P2 has terminal node y, and both P1 and P2 have starting node v. The node v is called the top of the trek, denoted top(P1,P2). A trek (P1,P2) is simple if the only common node among P1 and P2 is v.
This definition is adopted from non-phylogenetic studies of statistical models on graphs, such as [26].
Definition 10 Let be a (metric or topological) unrooted network on X and let Z ⊆ X. The induced unrooted network on a set of taxa Z is the network obtained from by retaining only edges in simple treks between pairs of taxa in Z, and then suppressing all degree two nodes.
Note that it is not immediately clear that, for a rooted network, the network obtained by unrooting and then inducing on Z is isomorphic to the one obtained by inducing on Z and then unrooting. Proposition 1 shows that these two operations, unrooting and inducing a network on a subset of taxa, commute. While this statement is intuitively plausible, its rather technical proof is given in the Appendix.
Proposition 1 Let be a (metric or topological) rooted network on X and let Z ⊆ X. Then and are isomorphic.
If |Z| = 4 then , the induced unrooted quartet network on Z, is also denoted by.
2.5. Cycles
Although rooted and unrooted networks are acyclic (in both the directed and semidirected settings), their undirected graphs may contain cycles. Thus the term ‘cycle’ may be used to unambiguously refer to cycles in the undirected graphs. We formalize this with the following definition:
Definition 11 Let be a (metric or topological, rooted or unrooted) network. A cycle in is a non-empty path from a node to itself, allowing edges to be traversed without regard to their possible direction. The size of the cycle is the number of edges in the path. A k-cycle is a cycle of size k.
By contracting or shrinking a cycle C in a graph we mean removing all edges in C and identifying all nodes in C.
3. Structure of level-1 networks
The class of all phylogenetic networks is often too large to obtain strong mathematical results ([25]), so it is common to restrict to networks that have a simpler structure, for instance, the class of level-1 phylogenetic networks.
Definition 12 Let be a (rooted or unrooted) topological network. If no two cycles in share an edge, then is level-1.
If is a level-1 network, any subnetwork or induced network of is also level-1.
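For binary networks the level-1 condition can be tested on the undirected graph with a simple bridge argument. The sketch below is an illustration, not the method of this work. It assumes a simple graph (so it does not cover 2-cycles formed by parallel edges) with maximum degree 3, and the function name and edge-list encoding are hypothetical. An edge lies on a cycle exactly when it is not a bridge, and under the degree bound the cycles are edge-disjoint exactly when the non-bridge edges form disjoint simple cycles.

```python
from collections import Counter

def is_level1(edges):
    """Level-1 test for a simple undirected graph of maximum degree 3.

    `edges` is a list of (u, v) pairs with directions already dropped.
    """
    def still_connected(skipped):
        # Can the endpoints of `skipped` reach each other without using it?
        start, goal = skipped
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for edge in edges:
                if edge == skipped or node not in edge:
                    continue
                other = edge[1] if edge[0] == node else edge[0]
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        return goal in seen

    # Non-bridge edges are exactly the edges that lie on some cycle.
    cycle_edges = [e for e in edges if still_connected(e)]
    degree = Counter()
    for u, v in cycle_edges:
        degree[u] += 1
        degree[v] += 1
    # Disjoint simple cycles: every node on a cycle edge has cycle-degree 2.
    return all(d == 2 for d in degree.values())

# A single 4-cycle p-s-q-h with pendant subtrees: level-1.
one_cycle = [("p", "s"), ("s", "q"), ("q", "h"), ("h", "p"), ("p", "a"),
             ("h", "b"), ("q", "c"), ("s", "t"), ("t", "d"), ("t", "e")]
# A 4-cycle with a chord: two cycles share the edge (1, 3), so not level-1.
shared_edge = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
```

The degree-2 criterion relies on the degree bound: in a general graph two cycles could share a node without sharing an edge, and this test would then reject a graph that satisfies Definition 12.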
Given a hybrid node v, denote the hybrid edges whose child is v by hv and . Then hv and are called the hybrid edges of v.
Lemma 3 Let be a (topological or metric, rooted or unrooted) level-1 network and let C be a cycle of. Then C contains exactly one hybrid node v, and the associated hybrid edges hv,. Furthermore, each node of is in at most one cycle and, as a result, v, hv and are in exactly one cycle of.
The proof of each statement of this Lemma, using different terminology, is given by Rossello and Valiente [21].
Proposition 2 Let be a topological level-1 rooted network on X. The structure of all the nodes and edges above LSA(X) in is a (possibly empty) chain of 2-cycles connected by edges, as depicted in Figure 5.
Fig. 5.
In a level-1 network on X, the structure between the root and m = LSA(X) is a chain of 2-cycles. The number of 2-cycles in the chain could be zero.
Proof Let m = LSA(X), and denote by r the root of. The proof is by induction on the number of the edges above m. If there are no edges above m, then m = r and the result is trivially true. By Lemma 1, one easily sees that there cannot be only 1 or 2 edges above m in a binary phylogenetic network. That is, if there were just 1 edge above m the outdegree of the root would be 1, contradicting the definition of binary phylogenetic network. Suppose there are 2 edges above m. By definition of binary phylogenetic network the outdegree of r is 2 and by definition of LSA(X) all paths from the root to x ∈ X contain m. Therefore m has indegree 2, contradicting Lemma 1 part (i).
Now assume the claim holds when there are at most k edges above m and suppose there are k + 1 edges above m. Note that r has outdegree 2 by the definition of.
Suppose that edges incident to r have different children, x and y. Note neither x nor y can be m. The outdegree of one of x or y must be 2, otherwise both would be hybrid nodes, which would require x above y and y above x. Without loss of generality suppose x has outdegree 2, and denote by e1 and e2 its out edges, and denote by e3 the edge (r,y). Since every path from r to a leaf goes through m, there are at least 3 distinct paths P1, P2, P3 from r to m, where Pi contains ei.
This contradicts the level-1 condition. Thus x = y, and the edges from r form a 2-cycle.
Now since x is a hybrid node, it has outdegree 1, with child v. Also, of the k + 1 edges above m, all but the two edges of the 2-cycle and the edge (x,v) are below v, so there are k − 2 edges above m that are also below v. Applying the inductive hypothesis to the network with edges above v removed, the result follows.
Proposition 2 applied to illustrates the structure of the common ancestry of a subset Z of taxa. When we pass to a LSA network or an induced unrooted network, we “throw away” this structure. We show in Section 5 that under the network multi-species coalescent model this structure has no effect on the formation of quartet gene trees.
Let v be a hybrid node in a level-1 (rooted or unrooted, metric or topological) network on X and let Cv be the cycle containing v. By removing the edges of Cv from we obtain a partition of X according to the connected components of the resulting graph. We refer to this partition as the v-partition and its partition sets as v-blocks.
Note that each node in Cv can be associated to a v-block. That is, a v-block Bu is associated to a node u in Cv if by removing u from the network (and therefore the edges adjacent to u), the induced partition of taxa is {Bu,X \ Bu}. We refer to the v-block Bv, whose elements descend from v, as the v-hybrid block. Two distinct v-blocks Bu,Bw are adjacent if the nodes u,w ∈ Cv are adjacent.
Let be a collection of cycles in. The partition of X obtained by removing all the edges in the cycles of is the network partition induced by and its blocks are network blocks induced by. When is the set of all cycles in of size at least k, the partition is the k-network partition and its blocks are k-network blocks. The 4-network blocks play an important role in Section 8. From now on, we refer to removing all edges of a cycle C from a network as removing the cycle C from the network.
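The v-partition and the network partition induced by a collection of cycles are both obtained the same way: delete the relevant cycle edges and read off which taxa remain connected. A minimal sketch follows, assuming an undirected simple-graph encoding with the cycle edges supplied by the caller. The function name and the 5-taxon example are illustrative only.

```python
def blocks_after_removing(edges, removed_edges, taxa):
    """Partition `taxa` by connected components once `removed_edges` are deleted.

    `edges` and `removed_edges` are iterables of (u, v) pairs; orientation is
    ignored. Returns a set of frozensets, one per block.
    """
    removed = {frozenset(e) for e in removed_edges}
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        if frozenset((u, v)) not in removed:
            adjacency[u].add(v)
            adjacency[v].add(u)
    taxa = set(taxa)
    blocks, assigned = set(), set()
    for x in taxa:
        if x in assigned:
            continue
        component, stack = {x}, [x]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        block = frozenset(component & taxa)
        blocks.add(block)
        assigned |= block
    return blocks

# A 4-cycle p-s-q-h with hybrid node h; taxa d and e form a cherry below s.
network_edges = [("p", "s"), ("s", "q"), ("q", "h"), ("h", "p"), ("p", "a"),
                 ("h", "b"), ("q", "c"), ("s", "t"), ("t", "d"), ("t", "e")]
cycle_edges = [("p", "s"), ("s", "q"), ("q", "h"), ("h", "p")]
v_blocks = blocks_after_removing(network_edges, cycle_edges, "abcde")
```

For this example the blocks are {a}, {b}, {c} and {d, e}, with {b} the hybrid block. Passing the union of the edges of several cycles as `removed_edges` gives the network partition induced by that collection.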
The following is straightforward to prove.
Lemma 4 Let be a level-1 (rooted or unrooted) topological network on X. Let be a collection of cycles in. For any two taxa a and b in different network blocks induced by, there exists a hybrid node v of some cycle in such that a and b are in different v-blocks.
If two taxa a and b are in the same network block induced by, then they are connected when all cycles in are removed. As a result they are connected when a single cycle in is removed. This comment together with Lemma 4 yields the following.
Corollary 1 Let be a level-1 (rooted or unrooted) topological network on X. Let be a collection of cycles in, with vi the hybrid node associated to Ci. The network partition induced by is the common refinement of the vi-partitions for 1 ≤ i ≤ n.
Since contracting cycles in level-1 networks does not introduce loops or multi-edges, we can define a notion of a tree of cycles which is useful for the proof of Theorem 4.
Definition 13 Let be a topological unrooted level-1 network. Let be the graph obtained from by 1) removing all pendant edges, repeatedly, until no pendant edges remain; 2) suppressing all vertices of degree two that are not part of a cycle; 3) contracting each cycle in the network obtained from steps 1 and 2. We refer to as the tree of cycles of.
In the tree of cycles of certain nodes, including all the leaves, represent a cycle of the original network. The notion of tree of cycles is different from “tree of blobs” of [8], as there is no deletion of the non-cycle edges in the tree of blobs. In Figure 6 we see an example of a tree of cycles.
Fig. 6.
(Left) A level-1 unrooted network and (Right) the tree of cycles of.
4. The network multi-species coalescent model and quartet concordance factors.
Coalescent theory models the formation of gene trees within populations of species. The coalescent model for a single population traces (backwards in time) the ancestries of a finite set of individual copies of a gene as the lineages coalesce to form ancestral lineages (see [28]). The multi-species coalescent (MSC) model is a generalization of the coalescent model, formulated by applying it to multiple populations connected to form a rooted population tree, or species tree. It is commonly used to obtain the probabilities of gene trees in the presence of incomplete lineage sorting.
Meng and Kubatko [15] extended the MSC by introducing phenomena such as hybridization or other horizontal gene transfer across the species-level and Nakhleh et al. further developed it [30, 33]. This model describes any situation in which a gene lineage may “jump” from one population to another at a specific time. The model parameters are specified by a metric binary rooted phylogenetic network as defined in Section 2. Different from models such as the structured coalescent with continuous gene flow (see [28]), the network model approach assumes the gene transfer occurs at a single point in time along hybrid edges. We refer to this extended version of the MSC as the network multi-species coalescent (NMSC) model.
The NMSC model assumes that speciation by hybridization results in what Meng and Kubatko refer to as a mosaic genome. One assumption of the NMSC model, inherited from the MSC model, is that all gene lineages present at a specific point on the species tree behave identically above this point. That is, the probability of any event conditioned on a set of lineages being present at a certain point on the species tree is invariant under permutation of those lineages. This feature is known as the exchangeability property.
Example 1 We illustrate how to compute the probability of a gene tree topology under the NMSC with an example. Suppose we have the rooted metric species network given in Figure 7. Let A,B,C and D be genes sampled from species a,b,c and d respectively. We compute the probability that a gene tree has the unrooted topology ((A,B),(C,D)) under the NMSC model.
Fig. 7.
Two gene trees within a species network with one hybrid node.
First observe that until B and C trace back to the edge with length z there cannot be a coalescent event. In that edge these lineages cannot coalesce if the gene tree ((A,B),(C,D)) is to be formed. The probability of no coalescence on this edge is e^{−z}. Now there are 4 cases, illustrated in Figure 8:
Fig. 8.
Cases 1–4 (Left-Right) of Example 1, of how lineages may behave under the NMSC model on the network of Figure 7.
1) with probability γ^2, lineages B and C enter the edge of length w;
2) with probability (1 − γ)^2, B and C enter the edge of length v;
3) with probability γ(1 − γ), B enters the edge of length w and C enters the edge of length v;
4) with probability (1 − γ)γ, B enters the edge of length v and C enters the edge of length w.
Observe that each case is now reduced to a standard MSC scenario with several samples per population (see [6]). Let Pi be the probability of observing ((A,B),(C,D)) under the MSC of case i. Then the probability of observing ((A,B),(C,D)) is e^{−z}(γ^2 P1 + (1 − γ)^2 P2 + γ(1 − γ)P3 + γ(1 − γ)P4).
Following Solís-Lemus and Ané [23], we are interested in the probability that a species network produces various gene quartets under the NMSC. This motivates the following definition.
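The final expression of Example 1 is a mixture over the four cases, scaled by the probability of no coalescence on the edge of length z. The sketch below evaluates it. The function names are hypothetical, and the four per-case MSC probabilities P1, ..., P4 are treated as inputs supplied by the caller, since they are not computed in the example.

```python
import math

def case_weights(gamma):
    """Probabilities of cases 1-4: how lineages B and C choose hybrid edges."""
    return (gamma ** 2,
            (1 - gamma) ** 2,
            gamma * (1 - gamma),
            (1 - gamma) * gamma)

def quartet_probability(z, gamma, case_probs):
    """Probability of ((A,B),(C,D)) in Example 1.

    z          -- length of the edge below the hybrid node (coalescent units)
    gamma      -- hybridization parameter of the edge of length w
    case_probs -- (P1, P2, P3, P4), assumed to be given
    """
    no_coalescence = math.exp(-z)
    mixture = sum(w * p for w, p in zip(case_weights(gamma), case_probs))
    return no_coalescence * mixture
```

The four case weights sum to 1 for any γ, which is a quick sanity check on the case analysis.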
Definition 14 Let be a metric rooted network on a taxon set X. Let A,B,C,D be genes sampled from species a,b,c,d respectively. Given a gene quartet AB|CD, the quartet concordance factor CFAB|CD is the probability under the NMSC on that a gene tree displays the quartet AB|CD, and
CFabcd = (CFAB|CD, CFAC|BD, CFAD|BC)
is the ordered triple of concordance factors of each quartet on the taxa a,b,c,d.
When a,b,c,d are clear from context, we write CF for CFabcd.
In the particular case where has no hybrid edges, so the network is a tree, it is known that the quartet concordance factors do not depend on the root placement [1]. For example let a,b,c,d be taxa and consider any root placement in the unrooted species tree with topology ab|cd and internal edge of length t. Then
CFabcd = (CFAB|CD, CFAC|BD, CFAD|BC) = (1 − (2/3)e^{−t}, (1/3)e^{−t}, (1/3)e^{−t}).   (1)
As mentioned in [23], for unrooted species networks the concordance factors do not depend on the placement of the root in the species network, as long as the root is placed in a way consistent with the direction of the hybrid edges. This fact is shown in Section 5, as we explore quartet concordance factors more thoroughly.
Definition 15 Let be a metric rooted level-1 network on X. Given a set of distinct taxa {a,b,c,d}, we define the ordering of CFabcd on as the natural decreasing order of CFAB|CD, CFAC|BD, CFAD|BC in the real line.
For example, if t > 0 the ordering of the concordance factors in equation (1) is given by CFAB|CD > CFAC|BD = CFAD|BC.
Many arguments towards the main result of this work use the ordering of CFabcd, and not its precise values.
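The tree case and the notion of ordering can be made concrete with a few lines of code. The sketch below assumes the standard multi-species coalescent expressions for a 4-taxon species tree with topology ab|cd and internal edge length t in coalescent units: probability 1 − (2/3)e^{−t} for the quartet AB|CD and (1/3)e^{−t} for each of the other two. The function names and the tie tolerance are illustrative choices.

```python
import math

def tree_quartet_cf(t):
    """Concordance factors on a species tree ab|cd with internal edge length t."""
    minor = math.exp(-t) / 3.0
    return {"AB|CD": 1.0 - 2.0 * minor, "AC|BD": minor, "AD|BC": minor}

def cf_ordering(cf, tol=1e-12):
    """Group quartets by decreasing concordance factor, merging ties.

    Returns a list of groups; quartets in the same group have CFs that
    agree up to `tol`.
    """
    items = sorted(cf.items(), key=lambda kv: -kv[1])
    groups = [[items[0][0]]]
    for (_, previous), (name, value) in zip(items, items[1:]):
        if abs(previous - value) <= tol:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups
```

For t > 0 the ordering has AB|CD strictly first and the other two quartets tied. At t = 0 all three concordance factors equal 1/3 and the ordering collapses to a single group.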
5. Computing quartet concordance factors
In this section we show how to express the concordance factors arising on an LSA quartet network as a linear combination of the concordance factors arising on quartet trees, using an approach similar to that of [29]. This enables us to see how the ordering of concordance factors reflects the network topology, and how the precise root location does not matter.
The final results of this section are largely in [23]. However, we provide formal arguments and take into consideration some matters that were left unaddressed. For example, we address the possibility that an induced 4-taxon network does not contain the root of the original network.
Let be a (metric or topological) rooted level-1 network on X and let {a,b,c,d} be a set of distinct taxa of X. Then the induced unrooted network on 4 taxa is a (metric or topological) unrooted level-1 network. By Proposition 1, is the same graph as and , where is the LSA network of Definition 8. Any cycle in induces a cycle in. A cycle C in of size k, induces a cycle in of either size k (when C does not contain LSA(a,b,c,d)) or size k − 1 (otherwise). For convenience when we refer to the size of a cycle C in we mean the size of the induced cycle in .
Lemma 5 Let be a metric unrooted level-1 quartet network. The number of k-cycles in is 0 for k ≥ 5, at most 1 for k = 4 in which case there is no 3-cycle, and at most 2 for k = 3.
Proof Suppose that the network has a cycle C = Cv of size k. Then there is an associated partition of the taxa into k v-blocks. None of these blocks can be empty, so k ≤ 4.
Suppose that there are two cycles, C1 of size k1 and C2 of size k2, with ki ≥ 3 for i = 1,2. Since the network is level-1, by removing these two cycles we induce a partition of the taxa into at least k1 + k2 − 2 blocks. None of the blocks of this partition can be empty, so k1 + k2 − 2 ≤ 4. Hence there is at most one cycle of size 4 or at most two cycles of size 3. Moreover, there cannot be a cycle of size 3 and a cycle of size 4 in the same unrooted quartet network.
Suppose that there are three cycles, a cycle C1 of size k1, C2 of size k2, and C3 of size k3 with ki ≥ 3, i = 1,2,3. By removing these three cycles we induce a partition of the taxa into at least k1+k2+k3−3 blocks, so k1+k2+k3−3 ≤ 4 which is a contradiction since ki ≥ 3.
Our arguments will depend on the number of descendants of the hybrid node of a cycle, so we introduce additional terminology. An n-cycle with exactly k taxa descending from the hybrid node is referred to as an n_k-cycle, written nk-cycle in what follows; for instance, a 32-cycle is a 3-cycle with exactly two taxa below its hybrid node. Figure 9 shows the 6 different types of 2-, 3-, and 4-cycles possible in an unrooted quartet network.
Fig. 9.
(Left) The three types of 2-cycles in an unrooted quartet network (21-,22- and a 23-cycle); (Center) The two types of 3-cycles in the unrooted quartet network (31- and a 32-cycle). (Right) The only type of 4-cycle in an unrooted quartet network (a 41-cycle). The dashed lines represent subgraphs that may contain other cycles.
Lemma 6 Let Q be a metric unrooted level-1 quartet network. Then Q cannot have two 32-cycles, or a 22-cycle and a 41-cycle.
Proof Suppose Q has two distinct 32-cycles, Cu and Cv. Suppose Cu has u-hybrid block {a,b} and u-blocks {c} and {d}. If we remove Cu from Q, by the level-1 assumption Cv is in one of the connected components. This implies that 2 of the 3 v-blocks must be contained in one of {a,b}, {c} or {d}. This is only possible if the v-hybrid block is {c,d}, and the other v-blocks are {a} and {b}. Thus Q must be as the network in Figure 10, where u is below v and v is below u, contradicting that Q is induced from a rooted network.
Fig. 10.
A graph with two 32-cycles. Each dashed edge represents a chain of 2-cycles with, possibly, other cycles.
Now suppose that Q has a 4-cycle and a 22-cycle. The 4-cycle induces 4 singleton blocks. By the level-1 condition at least one of the blocks induced by the 22-cycle has to be contained in a singleton block. That is impossible since the blocks induced by the 22-cycle have size 2.
Lemmas 5 and 6 determine all possible topological structures for unrooted quartet networks which are shown in Figure 11.
Fig. 11.
Possible structures for unrooted quartet networks. Every dashed arrow represents a chain of an arbitrary number of 2-cycles, as the one in the bottom of the Figure. The direction of these 2-cycles must be such that the obtained graph is induced from a rooted network.
5.1. Concordance factor formulas for quartet networks
Next we prove a number of “reduction” lemmas relating concordance factors for quartet networks to those for networks with fewer cycles. This allows us to express the network concordance factors as a linear combination of concordance factors of trees. The following observation is useful throughout this section.
Observation 1 Given a rooted metric species quartet network, under the NMSC model the first coalescent event (going backwards in time) determines the unrooted topology of a quartet gene tree.
As illustrated in Figure 12, in passing from a rooted network on X to the rooted network induced on Z ⊂ X, we may find that there is network structure above LSA(Z), which is a chain of 2-cycles by Proposition 2. A priori, this could have an impact on the behavior of the NMSC model on the induced network. For quartet concordance factors, however, this additional structure has no impact, and we can effectively snip it off. Formally, we have the following.
Fig. 12.
A level-1 rooted network where the root differs from the LSA(a,b,c,d).
Theorem 2 Let be a level-1 rooted metric network on X and let a,b,c,d be distinct taxa of X. Under the NMSC model, CFabcd can be computed from the LSA network.
Proof In any realization of the coalescent process, if there are fewer than 4 lineages at LSA(a,b,c,d), then a coalescent event has occurred below it and therefore the unrooted gene tree topology has already been determined. Thus we condition on 4 lineages being present at LSA(a,b,c,d).
There are 2 rooted shapes for 4-taxon gene trees, the caterpillar and the balanced tree. Regardless of the ancestral chain of 2-cycles above LSA(a,b,c,d), conditioned on one of these shapes, exchangeability of lineages under the coalescent tells us that all labeled versions of that specific shape have equal probability. While the rooted shapes might have different probabilities, since there is only 1 unrooted shape, all labellings of it must be equally probable. This is the same as if there were no ancestral cycles. Therefore CFabcd computed on the full network equals CFabcd computed on the LSA network.
This argument can be modified to apply to 5 taxa, but not 6 or more, since then there is more than 1 unrooted shape.
Let be a level-1 LSA quartet network and let Cv be a cycle in Q⊕, with hybrid node v and hybrid edges h1 and h2, where . The following notation is used throughout this section:
• Q⊕_1 denotes the rooted quartet network obtained from Q⊕ by removing h2.
• Q⊕_2 denotes the rooted quartet network obtained from Q⊕ by removing h1.
• Q⊕_0 denotes the rooted quartet network obtained from Q⊕ by contracting Cv; if the root of Q⊕ is in Cv, the node obtained in the contraction process is the root of Q⊕_0.
Note that the networks Q⊕_i, for i = 1,2, have degree 2 nodes, and thus are not binary. This does not affect the coalescent process in any way, and by suppressing such nodes we obtain a binary LSA network. In a slight abuse of notation, we use Q⊕_i to denote both of these networks, as needed in our arguments.
To compute concordance factors we often need to designate how many lineages are present at a hybrid node in a realization of the coalescent process. To handle this formally, given a rooted metric species network on X, we define the random variable Kv to be the number of lineages at node v, where Kv takes values in {1,...,lv}, where lv is the number of taxa below v. We can extend this concept to hybrid nodes in , since a hybrid node in induces an orientation of the nodes that are descending from it.
Let be a level-1 LSA quartet network and let Cv be a cycle in Q⊕, with hybrid node v, which induces a cycle in. If has size 2, then 1≤lv≤3; if has size three, then 1 ≤ lv ≤ 2; and if has size four then lv = 1. For example, let Q⊕ be the LSA network shown in the left of Figure 14 and let Cv be the cycle in Q⊕. By unrooting Q⊕ note that Cv induces a 3-cycle. Note also that Q− is isomorphic to the network in Figure 18.
Fig. 14.
An LSA quartet network Q⊕ with a cycle C that induces a 32-cycle in the unrooted quartet network, together with the graphs obtained by deleting everything below the hybrid node, disjointing, and labeling the leaves.
Fig. 18.
An unrooted quartet with a single 32-cycle.
We show that cycles in that induce 21-cycles or 23-cycles in have no impact on concordance factors. But first we state Propositions 3 and 4, proven in [1], which are useful in arguments to come.
Proposition 3 Let be a binary rooted metric species tree on X. For |X| = 4, is identifiable from the unrooted topological gene tree distribution under the multispecies coalescent model on, but is not.
Proposition 4 Proposition 3 remains valid when not binary.
Lemma 7 Let be a metric level-1 LSA quartet network and let Cv be a cycle in Q⊕ that induces a 21-cycle in. Then .
Proof Let K = Kv. Since Cv induces a 21-cycle in , P(K = 1) = 1. Then
If the root of Q⊕ is not in Cv, no lineages can coalesce on the edges that differ in and since there is only one lineage in such edges. Thus,
and the claim is established in this case.
Now suppose the root r of Q⊕ is in Cv, and Cv has nodes r, u, v, and edges (r,v), (r,u), (u,v). Without loss of generality suppose that the taxon below v is d. Since u is a tree node it has another descendant y. Note that and have the same topology, moreover, they just differ in the edge length from the root to y. Define a random variable K′, by K′ = 1 if there has been a coalescent event before a, b, and c trace back to y and K′ = 0 otherwise. If K′ = 1, the unrooted topology has been determined and thus
Also, by Proposition 4,
Thus .
Lemma 8 Let be a metric level-1 LSA quartet network and let Cv be a cycle in Q⊕, that induces a 23-cycle in . Then .
Proof Let K = Kv, so K takes values in {1,2,3}. Therefore
| (2) |
If K = 1 or 2 then at least one coalescent event has occurred, so the unrooted gene tree topology is already determined, and
The case K = 3 requires more argument. Without loss of generality suppose that the three taxa descending from v are a, b, and c. Denote by the random variable defined by if the lineage d is involved in the first coalescent event and otherwise. Thus
| (3) |
If d is in the first coalescent event, then by the exchangeability property of the NMSC each of a, b, and c is equally likely to be the other lineage involved in that event. This is the same as if the cycle was contracted, so
If d is not in the first coalescent event, this event involves only two of a,b, and c, with each pair equally likely by exchangeability. This is also the same as if the cycle was contracted, so
Thus by equations (2) and (3),.
Together, the preceding Lemmas yield the following.
Corollary 2 Let Q⊕ be a metric level-1 LSA quartet network, and consider the LSA network obtained from Q⊕ by contracting all cycles that induce 21- or 23-cycles in the unrooted network. Then the two networks have the same quartet concordance factors.
While 21- and 23-cycles have no impact on concordance factors, things are not quite so simple for other types of cycles.
Lemma 9 Let be a metric level-1 LSA quartet network and let Cv be a cycle in Q⊕, that induces a 22-cycle in. Then
Proof Let K = Kv with values in {1,2}, so that
Suppose the root r of Q⊕ is not in Cv, so Cv is also a 22-cycle in Q⊕. Note that
Thus we will express CF(Q⊕ | K = 1) in a similar fashion. If K = 1 the gene tree topology has been determined before the lineages enter v. Thus CF(| K = 1) = CF(Q⊕ | K = 1) for i ∈ {0,1,2} and
| (4) |
by summing the result holds when r is not in Cv.
Now suppose that r is in Cv, and Cv has nodes r, v, u. Without loss of generality suppose that the taxa below v are c and d. Since u is a tree node it has another descendant y. Define a random variable Ky to be the number of lineages at y. Note that K and Ky are independent, with values in {1,2}. If either K or Ky is 1, one coalescent event has occurred and the unrooted gene tree topology has been determined so CF(| K = 1 or Ky = 1) are equal for i ∈ {0,1,2}, and
| (5) |
Even though equation (5) is equal to CF(| K = 1 or Ky = 1), we express it in a similar fashion to the claimed result. Now suppose that K and Ky are both 2. Let Tc and Td be the trees shown on Figure 13. Therefore
Fig. 13.
The two trees Td and Tc in the proof of Lemma 9, obtained when K = 2, Ky = 2 and the lineages c and d trace different hybrid edges.
By Proposition 3, CF(Td | Ky = 2) = CF(Tc | Ky = 2), and in fact they equal CF(| K = 2 or Ky = 2). This is because in the contraction of the cycle identifies the nodes r, u, and v, so conditioned on K = 2, Ky = 2 we may view the coalescent process on as that in the 4-taxon tree ((a,b) : l,(c,d) : 0) where l is the length of (u,y). By Proposition 4, CF(Tc | Ky = 2) = CF(| K = 2, Ky = 2). Therefore
This together with equation (5) implies the claim.
Lemma 10 Let be a metric level-1 LSA quartet network and let Cv be a cycle in Q⊕, that induces either a 4-cycle or a 31-cycle in. Then
Proof Letting K = Kv, then P(K = 1) = 1. Thus,
It remains to consider a 32-cycle. For this case it helps to introduce new terminology. Let G be a semidirected graph and v be a node in G with indegree 2 and outdegree 0. Let hv and be the edges incident to v and let u and u′ the parent nodes in hv and respectively. We refer to disjointing hv and from v as the process of 1) deleting v from G; 2) introducing nodes w and w′; 3) introducing directed edges (u,w) and (u′,w′).
Let be a metric level-1 LSA quartet network, and Cv a cycle in Q⊕, that induces a 32-cycle in . Without loss of generality suppose that a and b are the taxa below v. Let be the network obtained from Q⊕ by 1) deleting everything below v; 2) disjointing h1 and h2 from v; 3) labeling a leaf that is currently unlabeled by a and the other unlabeled leaf by b. We construct by swapping the labels a and b in. Figure 14 depicts a particular example of this.
Lemma 11 Let be a metric level-1 LSA quartet network, Cv be a cycle in Q⊕, that induces a 32-cycle in and let K = Kv. Suppose that the two taxa below v are a and b, then
Proof By hypothesis K takes values in {1,2} and
If K = 1 the unrooted tree topology has been determined and CF(Q⊕ | K = 1) is given by the expression in equation (4). If K = 2,
Therefore,
which yields the claim.
Together, these Lemmas imply that the concordance factors of rooted quartet networks actually depend only on the unrooted network. This is formalized in the following.
Proposition 5 Let and be metric level-1 LSA quartet networks which induce the same unrooted network . Then.
Proof We prove this by induction on the number of cycles in . When there are no cycles in, Q and are trees, and by Proposition 3, . Assume now the result is true when there are fewer than k +1 cycles and that has k+1 cycles. Let Cv be a cycle in with hybrid edges h1 and h2, by Lemmas 7, 8, 9, 10, and 11, we can express the concordance factors of Q and in terms of networks with one fewer cycle. Note that these networks for Q and have the same unrooted metric structure. Thus by the induction hypothesis , for i = 0,1,2, and therefore.
Corollary 3 Let be a level-1 rooted metric network on X and let a,b,c,d be distinct taxa of X. Under the NMSC, can be computed from the unrooted network.
We indicate how to compute the concordance factors of an LSA network from the unrooted quartet network without having to introduce a root. For an unrooted metric level-1 quartet network Q, whose concordance factors CF(Q) are well defined by Corollary 3, proceed as follows:
i) Let Q′ be the graph obtained from Q by contracting all 23- and 21-cycles. By Corollary 2, CF(Q) = CF(Q′). If Q has a 4-cycle go to step (ii), otherwise go to step (iii).
ii) By Lemma 5 and Lemma 6 there are no 31-, 32- or 22-cycles in Q, and thus none in Q′. Then Q′ only has a 4-cycle, so apply Lemma 10 to Q′. Since the two networks appearing in Lemma 10 are then quartet trees, use the formula in equation (1) to complete the calculation.
iii) There are at most two 31-cycles in Q′. Choose one arbitrarily and apply Lemma 10. If the resulting networks still have a 31-cycle, apply Lemma 10 again to each of them.
iv) We have now expressed the concordance factors of Q in terms of concordance factors of unrooted quartet networks with no 21-, 23-, 31-, or 4-cycles. Apply Lemma 9 to these networks, for instance by choosing a 22-cycle with smallest graph-theoretic distance from its hybrid node to a leaf, repeating until no 2-cycle remains.
v) We now have an expression of the concordance factors of Q in terms of concordance factors of unrooted quartet networks with at most one 32-cycle. Apply Lemma 11. Then we have suppressed all cycles, and the concordance factors are now in terms of unrooted quartet trees. The formula of equation (1) completes the calculation.
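The control flow of steps (i) to (v) can be sketched as follows. This is only an outline of which result is invoked in which order, not an implementation of the lemmas themselves; the encoding of a network as a list of cycle-type strings and the name `reduction_plan` are assumptions made for illustration.

```python
from collections import Counter

VALID_TYPES = {"21", "22", "23", "31", "32", "41"}

def reduction_plan(cycle_types):
    """Ordered list of results used to reduce an unrooted quartet network.

    cycle_types: one string per cycle, e.g. ["21", "32"] (hypothetical encoding).
    """
    counts = Counter(cycle_types)
    unknown = set(counts) - VALID_TYPES
    if unknown:
        raise ValueError(f"unknown cycle types: {sorted(unknown)}")
    three_cycles = counts["31"] + counts["32"]
    # Structural constraints stated in Lemmas 5 and 6.
    if counts["41"] > 1 or three_cycles > 2 or counts["32"] > 1:
        raise ValueError("cycle combination excluded by Lemmas 5 and 6")
    if counts["41"] and (three_cycles or counts["22"]):
        raise ValueError("a 4-cycle excludes 3-cycles and 22-cycles")

    plan = []
    if counts["21"] or counts["23"]:          # step (i)
        plan.append("Corollary 2")
    if counts["41"]:                          # step (ii)
        plan.append("Lemma 10")
        plan.append("equation (1)")
        return plan
    plan.extend(["Lemma 10"] * counts["31"])  # step (iii), once per 31-cycle
    plan.extend(["Lemma 9"] * counts["22"])   # step (iv), once per 22-cycle
    if counts["32"]:                          # step (v)
        plan.append("Lemma 11")
    plan.append("equation (1)")
    return plan

print(reduction_plan(["23", "31", "22", "32"]))
```

Each entry stands for one cycle being removed; in an actual computation the corresponding lemma is applied to every network produced by the previous step.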
The use of these Lemmas and Theorem is illustrated by a few examples.
Example 2 Consider the unrooted quartet network shown in Figure 15. By Lemma 9, with , the quartet concordance factors are given by:
| (6) |
Fig. 15.
An unrooted quartet with a single 22-cycle.
Example 3 Consider the unrooted quartet network shown in Figure 16. By Lemma 10, with, the quartet concordance factors are given by:
| (7) |
Fig. 16.
An unrooted quartet with a single 31-cycle.
Example 4 Consider the unrooted quartet network shown in Figure 17. By Lemma 10, with, the quartet concordance factors are given by:
| (8) |
Fig. 17.
An unrooted quartet with a single 41-cycle.
Example 5 Consider the unrooted quartet network shown in Figure 18. Given K = 1, one coalescent event has occurred below the hybrid node, so a and b coalesced. Therefore CF(Q0 | K = 1) = (1,0,0). By Lemma 11, with , the quartet concordance factors are given by:
| (9) |
The formulas in Examples 1–5 agree with those in [23].
6. The Cycle property
In this section we focus on the ordering by magnitude of the concordance factors.
Proposition 6 Let be a metric unrooted level-1 quartet network with no 32-cycle. The ordering of CFabcd(Q) is the ordering of CFabcd(Q′) where Q′ is obtained from Q by contracting all 2-cycles and all 31-cycles.
Proof By Corollary 2, CF(Q) = CF(Q∗), where Q∗ is obtained from Q by contracting all 21- and 23-cycles. Therefore we can assume Q has no 21- or 23-cycles. If Q has a 4-cycle, it has no 31- and no 22-cycles and the claim is established.
Now suppose Q has only 22-cycles and 31-cycles. We proceed by induction on the number of cycles, with the base case of 0 cycles trivial. Assume the result is true for unrooted quartet networks with k 31- and 22-cycles and suppose Q has k + 1. Picking one cycle and applying one of Lemmas 9 or 10 to Q, we can express the concordance factors of Q as a convex combination of CF(Q0), CF(Q1) and CF(Q2). Note that Q0, Q1 and Q2 have the same topology and, by the induction hypothesis, CF(Q0), CF(Q1) and CF(Q2) have the same ordering as the concordance factors of Q′0, Q′1 and Q′2 respectively, the networks obtained after contracting all 22- and 31-cycles from Q0, Q1 and Q2. Since Q′0, Q′1, Q′2 and Q′ are trees with the same topology, their concordance factors have the same ordering by equation (1). Thus CF(Q0), CF(Q1) and CF(Q2) have the same ordering, and therefore so does CF(Q).
One consequence of Proposition 6 is that for any unrooted metric level-1 quartet network Q without a 32- or a 4-cycle, the ordering of the concordance factors is the same as the ordering of the concordance factors of a quartet tree. That is, the two smallest elements of the concordance factors are equal. When this happens we say that Q is treelike, since we could use equations (1) to find a quartet tree with appropriate edge lengths and concordance factors equal to CF(Q). However, not all unrooted quartet networks are treelike.
Example 6 Let be the unrooted 32-cycle quartet in Figure 18, where, , , and . By the equations in (9) we observe that the concordance factors are:
The fact that such a quartet network can be not treelike was identified in [24], where it was pointed out that this may cause species tree methods not to be robust to the presence of gene flow.
This motivates the following definition.
Definition 16 Let be a metric rooted level-1 network on X. We say that a set of four distinct taxa s = {a,b,c,d} satisfies the Cycle property if is not treelike, that is, if the two smallest values of are not equal.
The Cycle property is best understood geometrically. Denote by ∆2 the 2-dimensional probability simplex, the set of points in ℝ³ with nonnegative entries adding to 1. Observe that CFabcd ∈ ∆2 for any distinct taxa a,b,c,d. Figure 19 (left) depicts the simplex, where the black lines are the points at which the Cycle property is not satisfied; that is, the treelike unrooted quartet networks are those with concordance factors (x,y,z) whose two smallest entries are equal: y = z ≤ x, or x = z ≤ y, or x = y ≤ z. All points off these segments satisfy the Cycle property. For simplicity in arguments to come, note that we can interpret the concordance factors CFabcd as a function of a metric network on {a,b,c,d} with image in ∆2.
Fig. 19.
On the left a planar projection of the simplex ∆2, where the black lines represent concordance factors that are treelike. In the center, the gray segments in ∆2 represent all the concordance factors arising from unrooted quartet networks with a 32-cycle. On the right, the black lines represent the variety V ((x − z)(y − z)(x − y),x + y + z − 1), these are all concordance factors not satisfying the BC property of Definition 17
Proposition 7 Let be a metric unrooted level-1 quartet network with a 32-cycle. Then CF(Q) lies in the set defined by , y = z or , x = z or, x = y, shown on the middle of Figure 19. Furthermore, for any point (x,y,z) in this set there is such a Q with (x,y,z) = CF(Q).
Proof Let s = {a,b,c,d} be a set of four distinct taxa and suppose that the network contains only a 32-cycle, as in Figure 18. Then its concordance factors are given by the equations (9), and in particular CFAC|BD = CFAD|BC. To maximize CFAD|BC in (9), let ti → 0 for i ∈ {1,2,4} and t3 → ∞ to obtain a quadratic polynomial in γ,
whose maximum value is and it is attained at . For these values, we obtain . To minimize CFAD|BC it is enough to let t3 →∞ so .
be the open line segment with endpoints (1,0,0) and . Since is continuous in ti and γ, its image is a connected set on the line (x,y,y) containing points arbitrarily close to the endpoints of. Thus the image of is. Permuting taxon names shows every point in the set is a concordance factor for a network with a 32-cycle.
Now suppose the network has a 32-cycle with a,b descending from the hybrid node, and possibly other cycles. We may contract all 21- and 23-cycles by Corollary 2 without affecting the concordance factors. By Lemmas 9 and 10, we may suppress 22- and 31-cycles by expressing the concordance factors as a convex combination of those of networks with a 32-cycle but one fewer cycle. Thus the concordance factors are a convex combination of points on the line segment above, and hence lie on it.
In the supplementary materials of [23] it is stated that an unrooted quartet network Qabcd with a 32-cycle can always be reduced to an unrooted quartet tree with some adjustment of the edge lengths. This is not true in general; that is, when {a,b,c,d} satisfies the Cycle property the network is not treelike. However, Proposition 7 indicates that unrooted quartet networks with 32-cycles are sometimes treelike.
To conclude this section, we show that the Cycle property can give positive information about a network.
Proposition 8 Let be an unrooted level-1 quartet network on a set of taxa s = {a,b,c,d}. If s satisfies the Cycle property, the unrooted quartet network contains either a 32-cycle or a 4-cycle.
Proof Proposition 6 shows that if has neither a 32-cycle nor a 4-cycle, the concordance factors of are those of a tree.
7. The Big Cycle property
In this section we investigate how to detect 4-cycles in a network from quartet concordance factors.
Although the Cycle property gives us some information about an unrooted quartet network, it is not sufficient to tell us what the unrooted quartet network is. This is shown by the following Example, where a 4-cycle network leads to concordance factors identical to those in Example 6.
Example 7 Let be the 4-cycle unrooted quartet in Figure 17, where ,. By the equations in (8) the concordance factors are:
These agree with those of in Example 6.
This motivates the following definition.
Definition 17 Let be a metric rooted level-1 network on X. We say that a subset of four distinct taxa {a,b,c,d} ⊂ X satisfies the Big Cycle property (denoted BC) if all the entries of CFabcd are different.
Let {a,b,c,d} be a subset of taxa satisfying the BC property. We single out the unrooted quartet corresponding to the smallest entry of CFabcd.
For example, if CFAB|CD < CFAC|BD < CFAD|BC, then this quartet is ab|cd.
Note that if s satisfies the BC property then s satisfies the Cycle property but the Cycle property is weaker than the Big Cycle property.
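The two properties can be checked numerically from a triple of concordance factors. The sketch below is one possible implementation under stated assumptions: the tolerance `tol` stands in for exact equality (with estimated concordance factors a statistical test would be needed instead), and the function names are hypothetical.

```python
def satisfies_cycle_property(cf, tol=1e-9):
    """True when the two smallest entries of cf differ (the triple is not treelike)."""
    lo, mid, _ = sorted(cf)
    return abs(mid - lo) > tol

def satisfies_bc_property(cf, tol=1e-9):
    """True when all three entries of cf are pairwise different."""
    x, y, z = cf
    return abs(x - y) > tol and abs(x - z) > tol and abs(y - z) > tol

def smallest_quartet(cf, labels=("ab|cd", "ac|bd", "ad|bc"), tol=1e-9):
    """Quartet of the smallest entry, or None when the BC property fails."""
    if not satisfies_bc_property(cf, tol):
        return None
    return labels[min(range(3), key=lambda i: cf[i])]

tree_like = (0.6, 0.2, 0.2)          # two smallest entries equal
cycle_only = (5 / 12, 1 / 6, 5 / 12)  # Cycle property holds, BC property fails
big_cycle = (13 / 24, 4 / 24, 7 / 24) # all entries distinct
for cf in (tree_like, cycle_only, big_cycle):
    print(cf, satisfies_cycle_property(cf), satisfies_bc_property(cf), smallest_quartet(cf))
```

The second triple shows why the BC property is strictly stronger: its two largest entries coincide, so the two smallest differ but not all three are distinct.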
Proposition 9 Let be an unrooted level-1 quartet network on a set of taxa s = {a,b,c,d}. If s satisfies the BC property, then the unrooted quartet network contains a 4-cycle.
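Proof By Proposition 8, the unrooted quartet network contains either a 32-cycle or a 4-cycle. By Proposition 7, a network with a 32-cycle has two equal concordance factors, so the BC property rules out a 32-cycle.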
A converse of Proposition 9 also holds, provided we include an assumption of generic parameters.
Proposition 10 Consider a metric rooted level-1 network on X with |X| ≥ 4, and let {a,b,c,d} ⊂ X be such that the induced unrooted quartet network has a 4-cycle. Then {a,b,c,d} satisfies the Cycle property. Moreover, for generic numerical parameters on the network, {a,b,c,d} satisfies the BC property. That is, for all numerical parameters except those in a set of measure zero, the BC property holds.
Proof Let s = {a,b,c,d} ⊂ X be such that the induced unrooted quartet network, denoted Q in this proof, has a 4-cycle. Without loss of generality suppose that c is the descendant of the hybrid node and that the hybrid block {c} of Q is adjacent to the v-blocks containing b and d. Since Q is level-1, the only other possible cycles in Q are 21- or 23-cycles. By Corollary 2, CF(Q) = CF(Q′), where Q′ is the network obtained after contracting all cycles other than the 4-cycle. Note that Q′ is the network shown in Figure 17, and by equations (8), CF(Q′) depends only on the lengths of the non-hybrid edges in the 4-cycle and the γ parameter of the hybrid edges. Moreover, equations (8) show that {a,b,c,d} satisfies the Cycle property.
When is obtained from , the lengths of the edges of are the sum of edge lengths from. Let be the numerical parameter space for and let. Thus we can define a map such that for any metric (λ,γ) of, νs((λ,γ)) encodes the edge length of the non-hybrid edges in the 4-cycle and the γ parameter of the hybrid edges. In particular this map is linear and surjective.
With χs = (0,1)2 × [0,1], let be defined as , so η is a biholomorphic function. Defining f : χs → ∆2 by f((L1,L2,γ)) = (1 − γ)(1 − 2L1/3,L1/3,L1/3) + γ(L2/3,L2/3,1 − 2L2/3), the quartet concordance factor map can be viewed as a composition
It is straightforward to see that the image of f restricted to γ = 0 and γ = 1 is the red (skewed) and blue (vertical) segments shown on the right of Figure 20.
Fig. 20.
The function f maps the cube χs (left) to ∆2 (right). The blue facets (rear and top) of the cube are mapped by f to the blue (vertical) segment and the red facets (bottom and right) to the red (skewed) segment. The full cube is mapped onto the shaded triangle with all the concordance factor displayed by a network with a 4-cycle. The three line segments, two on the boundary of and one within the shaded triangle, are comprised of points not satisfying the BC property.
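To make the map f concrete, the following sketch evaluates it with exact rational arithmetic, using only the formula for f given in the proof. The function name `four_cycle_cf` is hypothetical, and the inputs are the coordinates (L1, L2, γ) of a point of χs, not edge lengths.

```python
from fractions import Fraction

def four_cycle_cf(L1, L2, gamma):
    """Evaluate f(L1, L2, gamma) = (1 - gamma) * (1 - 2*L1/3, L1/3, L1/3)
                                 + gamma * (L2/3, L2/3, 1 - 2*L2/3)
    exactly, for (L1, L2, gamma) in (0,1)^2 x [0,1]."""
    L1, L2, gamma = Fraction(L1), Fraction(L2), Fraction(gamma)
    if not (0 < L1 < 1 and 0 < L2 < 1 and 0 <= gamma <= 1):
        raise ValueError("parameters must lie in (0,1)^2 x [0,1]")
    tree1 = (1 - 2 * L1 / 3, L1 / 3, L1 / 3)   # image when gamma = 0
    tree2 = (L2 / 3, L2 / 3, 1 - 2 * L2 / 3)   # image when gamma = 1
    return tuple((1 - gamma) * p + gamma * q for p, q in zip(tree1, tree2))

half, quarter = Fraction(1, 2), Fraction(1, 4)
print(four_cycle_cf(half, half, quarter))  # three distinct entries
print(four_cycle_cf(half, half, half))     # first and third entries coincide
```

The second call lands on one of the segments where the BC property fails, illustrating why the converse of Proposition 9 needs the genericity assumption: such parameter choices exist but form a lower-dimensional subset of χs.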
Let V = V ((x − z)(y − z)(x − y),x + y + z − 1), that is, let V be the algebraic variety composed of the points on which (x − z)(y − z)(x − y) and x+y+z −1 are zero, as depicted on the right of Figure 19. Observe that V is the points in ∆2 that, if interpreted as concordance factors, would not satisfy the BC property.
Since f is a polynomial map whose image is not contained in V , the preimage of V under f is contained in a proper sub-variety of χs, and therefore f−1(V ) has measure zero in χs. Since η is biholomorphic, then η−1(f−1(V )) has measure zero. Since ν is linear surjective, then ν−1(η−1(f−1(V ))) has measure zero. Thus generic points in are mapped to concordance factors satisfying the BC property.
To better understand the geometry of the map f in this proof, let s = {a,b,c,d} be a subset of four distinct taxa satisfying the BC property. Figure 20 depicts the subset of χs that is mapped by f to those segments of the shaded triangle inside ∆2. The interior of χs is mapped to the interior of the shaded triangle.
The following Theorem follows immediately from Proposition 10 and Proposition 9.
Theorem 3 Let be a metric rooted level-1 network on X with |X| ≥ 4 and {a,b,c,d} ⊂ X. For generic numerical parameters, {a,b,c,d} satisfies the BC property if and only if has a 4-cycle.
Theorem 3 and Proposition 8, yield the following.
Corollary 4 Let be a metric unrooted level-1 network on X and let s = {a,b,c,d} be a set of distinct taxa in X. Then if s satisfies the Cycle property but not the BC property for generic parameters, then contains a 32-cycle.
The converse of Corollary 4 does not hold, as pointed out by Proposition 7.
If a set of 4 taxa satisfy the BC property, we can deduce some finer information about the 4-cycle on the unrooted quartet network and a larger network, as proved in the following.
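Proposition 11 Consider a metric unrooted level-1 network on X and let {a,b,c,d} ⊆ X satisfy the BC property, so that the induced quartet network contains a 4-cycle Cv. Then the quartet corresponding to the smallest entry of CFabcd is ac|bd if and only if the v-blocks containing a and c are not adjacent.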
Proof Let . Since is level-1 the only possible cycles in Q, other than Cv, are 21 and 23-cycles. Let Q′ be the network obtained after contracting all 21 and 23-cycles, so Q′ has only a four cycle. By Corollary 2, CF(Q) = CF(Q′). Example 4 shows that if the v-blocks of containing a and c are not adjacent then . Interchanging taxon labels in this example shows that when , then a and c are not adjacent.
Lemma 12 Consider a metric unrooted level-1 network on X with generic numerical parameters. There exists {a,b,c,d} ⊆ X satisfying the BC property if and only if the network contains a cycle Cv of size k ≥ 4 with one of these taxa in the hybrid block and the others in distinct v-blocks.
Proof Suppose that the network has a cycle of size k for some k ≥ 4 with hybrid node v. Choose four taxa {a,b,c,d} such that a is in the hybrid block and a, b, c and d are in distinct v-blocks. This set of taxa induces an unrooted quartet network with a 4-cycle, and so by Theorem 3 this set of taxa satisfies the BC property for generic parameters. Conversely, suppose that there exists {a,b,c,d} satisfying the BC property. By Theorem 3, the induced quartet network has a 4-cycle, so the full network has a cycle of size at least four and one of these taxa is a descendant of its hybrid node. Since the other taxa are in distinct v-blocks of the induced quartet network, they must be in distinct v-blocks of the full network.
For a level-1 metric unrooted network, let S be the collection of sets of 4 distinct taxa satisfying the BC property and VH be the set of hybrid nodes. There is a natural map ψ : S → VH, where ψ(s) = v if v is the hybrid node of the 4-cycle in the quartet network induced by s. In this case we say that s determines the hybrid node v.
Lemma 13 Let be a metric unrooted level-1 network and let {a,b,c,d} and {a,b,c,e} be subsets of the taxa satisfying the BC property. The set {a,b,c,d} determines v if and only if {a,b,c,e} determines v.
Proof Let {a,b,c,d} determine v, let {a,b,c,e} determine u, and suppose that u ≠ v. Let Cv and Cu be the cycles in the network containing v and u respectively, so Cu and Cv do not share edges. Since {a,b,c,d} satisfies the BC property, by Lemma 12, a, b, c, and d belong to different v-blocks, so that after removing the edges of Cv the taxa a, b and c are in different connected components. Since the network is level-1, Cu lies in one of these connected components. In particular, all the taxa outside that component are in the same u-block. But at least two of a, b and c lie outside it, so at least two of a, b and c are in the same u-block. This contradicts Lemma 12, so u = v.
Interestingly, under the NMSC the ordering of quartet concordance factors is insufficient to identify the hybrid node of cycles of size 4. For example, the networks shown in Figure 21 all have the same ordering of their concordance factors despite different hybrid nodes. The concordance factors for all those networks have the same values:
Fig. 21.
Four unrooted metric level-1 quartet networks with the same concordance factors.
Figure 22 shows the 4-cycle network topologies drawn in the regions of ∆2 which their concordance factors fill. In each case it does not matter which of the cycle nodes is the hybrid node; all those unrooted quartet networks define concordance factors that fill that region.
Fig. 22.
Each section of the simplex is depicted with an unrooted quartet network topology whose image under the concordance factor map fills that region, independent of the placement of the hybrid node.
8. Identifying cycles in networks
Having shown that the BC property can detect the existence of 4-cycles in networks, for generic parameters, we are poised to prove our main result. Our arguments now are mainly combinatorial.
Given a network on X, let S denote the set of 4-taxon subsets of X satisfying the BC property. Recall that for an unrooted level-1 network on X, the 4-network partition is the partition of X according to the connected components of the graph obtained after removing all cycles of size at least 4 from the network. Recall also that the blocks of this partition are referred to as 4-network blocks.
Lemma 14 Consider a metric rooted level-1 network on X. Then under the NMSC model with generic parameters the 4-network blocks of the network can be determined from the set S.
Proof If |X| ≤ 3 there is nothing to prove. The case |X| = 4 follows from Proposition 9, so we assume |X| ≥ 5. By Lemma 12, for any {a,b,c,d} ∈ S each taxon a, b, c, d must belong to a different 4-network block. For a ∈ X, let Ya be the set of taxa b ≠ a such that some set in S contains both a and b.
Then Ya is the complement of the 4-network block containing a. To see this, note that for any taxon b that does not belong to the 4-network block of a, by Lemma 4, there exists a cycle Cv of size at least 4 such that a and b are in different v-blocks. Now choose any two different taxa c and d, such that all taxa a, b, c, d are in different v-blocks and one of a, b, c or d is in the v-hybrid block. Then {a,b,c,d} ∈ S, and thus b ∈ Ya.
It follows that X \ Yx is the 4-network block containing taxon x. Since x was arbitrary, all 4-network blocks can be determined.
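The construction in this proof can be sketched in a few lines of Python. This is only an illustration, assuming that Yx collects the taxa other than x that share some set of S with x; the function name `four_network_blocks` and the toy input are hypothetical and not part of the original work.

```python
def four_network_blocks(taxa, S):
    """Partition `taxa` into 4-network blocks using only the set S.

    taxa: iterable of sortable taxon labels (the set X).
    S: iterable of 4-element sets of taxa assumed to satisfy the BC property.
    """
    taxa = set(taxa)
    S = [frozenset(s) for s in S]
    blocks, seen = [], set()
    for x in sorted(taxa):
        if x in seen:
            continue  # x already lies in a block found earlier
        # Y_x: taxa other than x that appear with x in some set of S.
        Y_x = set().union(*(s for s in S if x in s)) - {x}
        block = frozenset(taxa - Y_x)  # X \ Y_x, the block containing x
        blocks.append(block)
        seen |= block
    return blocks


# Toy input: one 4-cycle whose blocks are {a, a2}, {b}, {c}, {d}.
toy_S = [{"a", "b", "c", "d"}, {"a2", "b", "c", "d"}]
toy_blocks = four_network_blocks({"a", "a2", "b", "c", "d"}, toy_S)
```

If S is empty, the sketch returns a single block equal to X, which matches the case of a network with no cycles of size at least 4.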
Lemma 15 Consider a metric rooted level-1 network on X with a cycle Cv of size kv ≥ 4. Then for generic parameter choices, the v-blocks and the size kv can be identified from the set S. If kv ≥ 5 the v-hybrid block can also be identified.
Proof Let {a,b,c,d} ∈ S and let v be the hybrid node determined by it. By Lemma 12, each of these taxa belongs to a different v-block, and hence to a different 4-network block. Denote by A,B,C,D the v-blocks containing a,b,c and d respectively.
Let Zabc be the set of all taxa e such that {a,b,c,e} ∈ S. By Lemma 13, all such {a,b,c,e} ∈ S determine the same hybrid node v. Consider now Zbcd, Zacd and Zabd. If kv = 4, then, by the last statement of Lemma 12, Zabc = D, Zbcd = A, Zacd = B and Zabd = C, so all pairwise intersections of Zabc, Zbcd, Zacd, Zabd are empty. If kv > 4, then, again by Lemma 12, for some distinct taxa i,j,k ∈ {a,b,c,d}, Zijk is the v-hybrid block, and for any l,m,n ∈ {a,b,c,d} with {l,m,n} ≠ {i,j,k}, Zlmn = (L ∪ M ∪ N)^c, the complement of L ∪ M ∪ N in X. Note that Zijk ∩ Zlmn = ∅ since one of L,M,N is the v-hybrid block. Since Zlmn contains at least one v-block other than A, B, C or D, for any l′,m′,n′ ∈ {a,b,c,d} with {l′,m′,n′} ≠ {i,j,k}, Zlmn ∩ Zl′m′n′ ≠ ∅. Hence we can determine whether kv > 4 or kv = 4: if all pairwise intersections of Zabc, Zbcd, Zacd, Zabd are empty then kv = 4, otherwise kv > 4. If kv > 4 we can determine the hybrid block by noting which of the sets Zabc, Zbcd, Zacd, Zabd has empty intersection with every other set in this family. At this point we have determined either that kv = 4 and all v-blocks, or that kv > 4 and the hybrid block.
In the case kv > 4, without loss of generality, suppose that A is the v-hybrid block. Let y ∉ Zabc = (A ∪ B ∪ C)^c, so y is in one of A, B and C. For some u,w ∈ {a,b,c}, s′ = {y,u,w,d} ∈ S, which shows y and the taxon g ∈ {a,b,c} \ {u,w} are in the same v-block. Thus we can determine A, B and C.
Note that for any taxon x that is not in any of A, B or C, s = {a,x,b,c} ∈ S. Since s determines v, following the steps of the last paragraph identifies the v-block that contains x. Therefore all v-blocks can be determined, and thus kv as well.
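The first step of this proof, deciding whether kv = 4 or kv > 4 from the four sets Zabc, Zbcd, Zacd and Zabd, can be illustrated with a short Python sketch. The function names `z_set` and `classify_cycle` and the toy sets are hypothetical; the toy S for the 5-cycle simply follows the pattern described in the proof (every member contains a taxon of the hybrid block) and is an assumption, not data from the original work.

```python
from itertools import combinations


def z_set(triple, S):
    """Z_{ijk}: all taxa e such that {i, j, k, e} is a member of S."""
    triple = frozenset(triple)
    return {e for s in S if triple < s for e in s - triple}


def classify_cycle(quartet, S):
    """Return (size_is_4, hybrid_block) for the cycle determined by `quartet`.

    hybrid_block is None when the cycle has size 4, since the proof does
    not identify the hybrid block in that case.
    """
    S = [frozenset(s) for s in S]
    Z = {t: z_set(t, S) for t in combinations(sorted(quartet), 3)}
    # For each triple, check whether its Z-set misses every other Z-set.
    isolated = {t: all(not (Z[t] & Z[u]) for u in Z if u != t) for t in Z}
    if all(isolated.values()):
        return True, None
    # Size > 4: per the proof, exactly one Z-set is disjoint from the rest.
    (hyb,) = [t for t, flag in isolated.items() if flag]
    return False, Z[hyb]


# Toy 4-cycle with blocks {a, a2}, {b}, {c}, {d}.
S4 = [{"a", "b", "c", "d"}, {"a2", "b", "c", "d"}]
# Toy 5-cycle with hybrid block {h} and singleton blocks b1..b4.
S5 = [{"h", *t} for t in combinations(["b1", "b2", "b3", "b4"], 3)]
result4 = classify_cycle({"a", "b", "c", "d"}, S4)
result5 = classify_cycle({"h", "b1", "b2", "b3"}, S5)
```

On these toy inputs the sketch reports a 4-cycle for the first set and, for the second, a larger cycle whose hybrid block is {h}.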
Lemma 16 Consider a metric rooted level-1 network on X. Then for any hybrid node v with kv ≥ 4 the order of the v-blocks in the cycle can be determined from the ordering of the concordance factors.
Proof If kv = 4, the claim is established by Proposition 11. Now suppose that kv > 4, so by Lemma 15 we know the v-hybrid block. Let A1,...,Akv be the v-block partition with A1 the v-hybrid block. Let ai ∈ Ai be an element of the i-th v-block. By Proposition 11, A1 and Aj are adjacent if and only if for any distinct . Thus we can identify the two v-blocks adjacent to A1. Suppose that such v-blocks are Ap and Aq. We find the other v-block adjacent to Aq from for all distinct j,m ∈ {2,3,4,…,kv} \ {p,q}. This is, Aq and Aj are adjacent if and only if for any distinct and j ≠ 1,p,q. Continuing in this way, the full order of blocks around the cycle can be determined.
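The walk around the cycle described in this proof can be sketched as follows. The predicate `adjacent` is a hypothetical oracle standing in for the concordance-factor test of Proposition 11, which is not reproduced here; the sketch only shows how repeated adjacency queries yield the full cyclic order.

```python
def cycle_order(blocks, start, adjacent):
    """Order the v-blocks around a cycle, beginning at `start`.

    blocks: list of block labels.
    adjacent(p, q) -> bool: hypothetical oracle, assumed to report whether
    blocks p and q are neighbors on the cycle.
    """
    order = [start]
    remaining = [b for b in blocks if b != start]
    current = start
    while remaining:
        # Any unplaced neighbor of the current block comes next.
        nxt = next(b for b in remaining if adjacent(current, b))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order


# Toy oracle for a 5-cycle whose true cyclic order is 0, 2, 4, 1, 3.
true_edges = {frozenset(p) for p in [(0, 2), (2, 4), (4, 1), (1, 3), (3, 0)]}
order = cycle_order([0, 1, 2, 3, 4], 0, lambda p, q: frozenset((p, q)) in true_edges)
```

After the first step every block has exactly one unplaced neighbor, so the walk is forced; only the direction of travel around the cycle depends on which neighbor of the starting block is found first.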
We reach the main result.
Theorem 4 Consider a metric rooted level-1 network on X. Then under the NMSC model, for generic parameters, the collection of orderings of quartet concordance factors identifies the unrooted semidirected topological network obtained from it by contracting all 2- and 3-cycles and ignoring directions of hybrid edges in 4-cycles, while retaining directions of hybrid edges of k-cycles for k ≥ 5.
Proof We proceed by induction on the number of cycles of size at least 4. Suppose there are no such cycles. Then every induced quartet network has no cycle of size 4, and the ordering of the concordance factors determines the topology of the quartet tree obtained by contracting all 2- and 3-cycles. These quartet trees then determine the topology of the tree on X by a standard result [22].
Suppose there is exactly one cycle of size at least 4. Then there is just one hybrid node v in the network with kv ≥ 4. By Lemmas 15 and 16 we can determine the size kv of the cycle, the v-blocks and the order of the v-blocks in the cycle. If kv ≥ 5 we can identify the hybrid node v and thus identify the direction of the hybrid edges. Let Pu be a v-block where u is a node in Cv, and q ∈ X \ Pu. Consider the induced network on Pu ∪ {q} with all 2-cycles and 3-cycles contracted. Note that this network is a tree, and the quartet concordance factors for taxa in Pu ∪ {q} identify its topology. Viewing q as an outgroup of Pu, this tree induces a rooted tree on Pu. The root can then be joined with an edge to u. Doing this for all v-blocks establishes the claim.
Now suppose that the result is true for networks with l cycles of size at least 4, and that the network contains l+1 such cycles. We can first determine all 4-network blocks, as well as the v-blocks and their cycle order for every cycle of size at least 4, by Lemmas 14, 15, and 16. Following Definition 13, consider the tree of cycles of the network. A leaf of this tree arises from a cycle Cv of the network if and only if all v-blocks but one are 4-network blocks. We may therefore determine the v-blocks of some cycle Cv that is a leaf of the tree of cycles.
Let u be the vertex in Cv associated to the v-block that is not a 4-network block. Note that deleting u from the network leaves a disconnected graph with two connected components. Label the components so that the first contains all nodes of Cv except u, and let Si be the set of taxa on the i-th component, i ∈ {1,2}. Let si ∈ Si. Then for i,j ∈ {1,2}, i ≠ j, the induced network on Si ∪ {sj} has at most l cycles of size at least 4. By the induction hypothesis we can determine the semidirected topological network obtained from each of these induced networks by contracting all 2- and 3-cycles and ignoring directions of the hybrid edges in 4-cycles, while retaining directions of the hybrid edges of k-cycles for k ≥ 5. We obtain the desired network by identifying s1 in the network on S2 ∪ {s1} with s2 in the network on S1 ∪ {s2} and suppressing that node.
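The leaf criterion used in the induction step, namely that a cycle is a leaf of the tree of cycles exactly when all but one of its v-blocks are 4-network blocks, is easy to check mechanically once the blocks are known. The following Python sketch is illustrative only; the function name `leaf_cycles`, the cycle labels and the toy network with three 4-cycles arranged in a chain are assumptions made for the example.

```python
def leaf_cycles(cycles, network_blocks):
    """Return the labels of cycles that are leaves of the tree of cycles.

    cycles: dict mapping a cycle label to the list of its v-blocks.
    network_blocks: collection of 4-network blocks.
    Intended for networks with at least two cycles of size >= 4.
    """
    nb = {frozenset(b) for b in network_blocks}
    leaves = []
    for label, vblocks in cycles.items():
        # v-blocks that are not themselves 4-network blocks.
        others = [b for b in vblocks if frozenset(b) not in nb]
        if len(others) == 1:
            leaves.append(label)
    return leaves


# Toy chain of three 4-cycles on taxa a..h; "w" is the middle cycle.
toy_cycles = {
    "v": [{"a"}, {"b"}, {"c"}, {"d", "e", "f", "g", "h"}],
    "w": [{"a", "b", "c"}, {"d"}, {"e"}, {"f", "g", "h"}],
    "u": [{"f"}, {"g"}, {"h"}, {"a", "b", "c", "d", "e"}],
}
toy_network_blocks = [{t} for t in "abcdefgh"]
toy_leaves = leaf_cycles(toy_cycles, toy_network_blocks)
```

In the toy chain the two end cycles are reported as leaves, while the middle cycle has two v-blocks that are not 4-network blocks and is therefore skipped.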
Figure 23 shows a metric rooted phylogenetic network and the unrooted semidirected topological network that is identified by Theorem 4.
Fig. 23.
A rooted metric phylogenetic network (left) and the network structure (right) that can be identified by Theorem 4. The 4-cycle in the network on the right, colored gray, has 3 different candidates for the hybrid node.
The highlighted cycle in Figure 23 is a 4-cycle, and thus its hybrid node is not identified from quartet concordance factors. However, its hybrid node must be placed so that the resulting semidirected network is induced from a rooted network. Thus the node labeled x in Figure 23 cannot be the hybrid node. This illustrates that although we cannot always identify the hybrid node on 4-cycles, sometimes the structure of the resulting network restricts the possible nodes for its placement.
9. Further results on 3_2-cycles
Under some special circumstances, for example when a set of taxa satisfies the Cycle property but not the BC property, it is possible to detect more information about the topology of the network than that given in Theorem 4. For instance, some 3-cycles are identifiable under such hypotheses. In this section, we discuss these extensions briefly, as it is difficult to formulate general statements on identifiability.
Recall that a 3_2-cycle may lead to concordance factors satisfying the Cycle property, but it need not, as shown in Proposition 7. There is a full-dimensional subset of parameter space on which concordance factors indicate a 3_2-cycle and another on which they fail to. Nonetheless, the following gives a positive, but limited, identifiability result.
Proposition 12 Consider a metric rooted level-1 network on X and suppose {a,b,c,d} ⊂ X satisfies the Cycle property but not the BC property. Then under the NMSC model, for generic parameters, if there is no taxon e ∈ X such that {i,j,k,e} satisfies the BC property for any distinct i,j,k ∈ {a,b,c,d}, then the network contains a 3-cycle with at least two descendants of the hybrid node.
Proof Since {a,b,c,d} ⊂ X satisfies the Cycle property but not the BC property, by Proposition 8 there is a 3_2-cycle in the induced quartet network on {a,b,c,d}. Thus three of the taxa a,b,c,d are in distinct v-blocks in that quartet network. This implies that there exists a cycle Cv in the full network for which three of the taxa a,b,c,d are in distinct v-blocks. Since {i,j,k,e} does not satisfy the BC property for any distinct i,j,k ∈ {a,b,c,d} and any e ∈ X, Cv is not a k-cycle for k ≥ 4. Thus by Proposition 7, Cv has size 3 and at least two of a, b, c, d descend from v.
Consider an unrooted level-1 quartet network in which {a,b,c,d} satisfies the Cycle property but not the BC property. It can be shown that if, for example, the smallest entry in CFabcd is the one corresponding to the quartet AB|CD, then either a,b or c,d are in the v-hybrid block. The proof is very similar to that of Proposition 11.
Consider a network for which the structure identified by Theorem 4 is as shown in Figure 24. Observe that {a,b,c,d} satisfies the BC property by Theorem 3. If {a,e,b,d} satisfies the Cycle property, then the following proposition indicates that the hybrid node in the network shown in Figure 24 can be determined.
Fig. 24.
A network with a 4-cycle such that if {a,b,c,e} satisfies the Cycle property, the hybrid block can be detected.
Proposition 13 Consider a metric rooted level-1 network on X and let Cv be a 4-cycle in it. Let a,b,c,d ∈ X be in different v-blocks. Suppose under the NMSC model, for generic parameters, for distinct i,j,k ∈ {a,b,c,d}, there exists a taxon e ∈ X such that {i,j,k,e} satisfies the Cycle property but not the BC property. Then the v-block containing e is the v-hybrid block.
Proof Without loss of generality suppose that i = a, j = b and k = c. Note that e is not in the same v-block as d, otherwise {a,b,c,e} would satisfy the BC property. Thus e is in the same v-block as a, b or c. Without loss of generality suppose that e is in the same v-block as a. Thus {e,b,c,d} satisfies the BC property and by Theorem 4 the order of the cycle can be determined. Without loss of generality suppose that the order is as in Figure 24. By Lemma 13, {a,b,c,d} and {e,b,c,d} determine the same hybrid node v. Since {a,b,c,e} satisfies the Cycle property, Corollary 4 shows that the induced quartet network on {a,b,c,e} has a 3_2-cycle. The 4-cycle in the quartet network on {a,b,c,d} and the 3-cycle in the quartet network on {a,b,c,e} must have the same hybrid edges, otherwise the level-1 condition would be violated. Observe that the only possibility for having a 3_2-cycle is that e and a are in the hybrid block.
In [23] it is stated that one could identify the hybrid node in a 4-cycle when the number of taxa in the network is greater than 4 by using multiple concordance factors at once.
10. Discussion
In this work, we show that for generic numerical parameters, under the network multi-species coalescent model the collection of orderings of quartet concordance factors identifies the unrooted semidirected topological network obtained from the original network by contracting all 2- and 3-cycles, and ignoring the directions of hybrid edges in 4-cycles, while retaining directions of hybrid edges in larger cycles.
As mentioned in the introduction, the proof of this result suggests combinatorial methods for constructing the network from noiseless data, but the question remains open in the presence of noise. There are two challenges when noise is introduced. The first consists of detecting whether a quartet network contains a 4-cycle or not. We would never expect the empirical concordance factors to be exactly treelike. For this challenge, one could develop a statistical test to determine when concordance factors are sufficiently close to treelike to doubt the presence of a 4-cycle. The second challenge arises after determining such a test. Since the test will not be accurate all the time, some quartets will not be inferred correctly, and thus we need a method to reconstruct the network from a set containing some erroneous quartets. We leave this for future work.
Acknowledgements
The author deeply thanks John A. Rhodes and Elizabeth S. Allman for their technical assistance and suggestions during the development of this work, and the reviewers for their valuable suggestions and observations.
This research was supported in part by the National Institutes of Health grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.
11. Appendix
Here, Proposition 1 of Section 2 is proved. The argument uses the following.
Lemma 17 Consider a (metric or topological) rooted network on X and let Z ⊂ X. For any edge e below LSA(Z) with a descendant in Z, there are x,y ∈ Z such that e is in a simple trek in the network from x to y whose edges are below LSA(Z).
Proof Let x ∈ Z be below e. By Lemma 2 there exists y ∈ Z with LSA(x,y) above e.
Suppose y is not below e. Let Px be a path from LSA(x,y) to x containing e and let Py be a path from LSA(x,y) to y. Let u be the minimal node in the intersection of Px and Py. Since y is not below e, u cannot be below e. Then the subpath of Px from u to x, which contains e, and the subpath of Py from u to y form a simple trek containing e.
Now assume y is below e. Since e is below LSA(x,y), there exists a path from LSA(x,y) to one of y or x that does not pass through the child of e. Without loss of generality suppose such a path Py goes from LSA(x,y) to y.
Let Px be a path from LSA(x,y) to x that passes through e. Let A = A(Px,Py) be the set of nodes above e, common to Py and Px. Let a ∈ A be the minimal node in A.
Let B(Py,Px) be the set of nodes below e, common to Py and Px. We may assume that we choose Px and Py such that B = B(Py,Px) has minimal cardinality. If B = ∅ then the desired trek is easily constructed, with top a. So suppose B ≠∅ has minimal element b− and maximal element b+. We are going to contradict the minimality of B. Note that b+ must be the hybrid node of a cycle containing e (see Figure 25 for a graphical reference).
Fig. 25.
In gray we see the subgraph composed by P and P′, the dashed edges represent that P and P′ could intersect, the dotted segments represent just a succession of edges. In black we see the different cases of the possible edges in P∗ above b but below a.
Since b− is not LSA(x,y), there exists a path P∗ from LSA(x,y) to one of x or y that does not pass through b−. Note that P∗ has to intersect at least one of Py or Px at an internal node below b−. Let C1 be the set of nodes below b− common to P∗ and Py, and let C2 be the set of nodes below b− common to P∗ and Px. Let c be the maximal node in C1 ∪ C2. We can assume, without loss of generality, that c is in Py. This is because if, instead, c were in Px, we could construct two new paths by exchanging the portions below b−: for i,j ∈ {x,y}, i ≠ j, the path replacing Pi contains all the edges of Pi above b− and all the edges of Pj below b−. Note that one of the new paths passes through e and does not contain c, while the other does not pass through e and contains c, and the set of nodes below e common to the two paths is unchanged.
Denote by W the set of nodes in (P∗ ∩ Py) ∪ (P∗ ∩ Px) and let w be the minimal node of W above b−. Since the network is binary, w cannot be a or b+ (see Figure 25 for a graphical reference). There are 5 different cases for the location of w in the graph composed of the paths Py and Px. These are:
1. w is in Py, above b+ but below a.
2. w is in Px, above b+ but below e.
3. w is in Px, above e but below a.
4. w is in one or more of Px or Py, above a.
5. w is in one or more of Px or Py, above b− but below b+.
Figure 25 depicts in gray the graph composed of the paths Py and Px, and in black the possible subpaths of P∗ from w to c. In any of cases 1, 2 or 3 we can find a simple trek containing e, as depicted in Figure 26, by choosing the appropriate edges, and thus B was not minimal. For cases 4 and 5 there are two possibilities: (i) w is in both Py and Px; (ii) w is in only one of Py or Px. For case 4 (i), the situation is simple, and we can find a simple trek as depicted on the left in Figure 27. For case 4 (ii), we first find the node in A that is immediately above w. Then, as depicted on the left of Figure 27, we can find a simple trek.
Fig. 26.
The treks in case 1 (left), case 2 (center), and case 3 (right).
Fig. 27.
(Left) The treks in the two possibilities of case 4. (Right) The two possibilities of case 5, where the black segments represent edges that may be red and blue at the same time.
For case 5 we do not find a simple trek directly. Instead we construct two paths P1 and P2 from LSA(x,y) to x and y respectively, only one of which contains e, such that B(P1,P2) has at least one node fewer than B. For case 5 (i), we take P1 to be the same as Px, and for P2 we take the edges of Py above w, the edges below c, and the edges of P∗ between w and c. For case 5 (ii), we assume without loss of generality that w is in Px. Let b be the node in B immediately above w. Let P1 be the path containing the edges in Px that are above b, the edges in Py that are below b but above the node b′ ∈ B immediately below w, and finally the edges in Px below b′. Let P2 be the path containing the edges in Py that are above b, the edges in Px that are above a but below b, the edges in P∗ that are above c but below w, and finally the edges in Py that are below c. Figure 27 (right) depicts P1 (red) and P2 (blue) for (i) and (ii). Since B(P1,P2) has at least one node fewer than B, the assumed minimality of B is contradicted.
Proof (of Proposition 1) Let . Let M− be the graph obtained from M+ by ignoring the direction of all tree edges and then suppressing the LSA(Z,), that is, the induced unrooted network from M+. Denote by M0 the graph obtained by ignoring all directions of the tree edges in M+, so that by suppressing degree two nodes of either M− or M′ gives . Let K be the graph obtained by considering all the edges in simple treks in from x to y for all x,y ∈ Z, so that suppressing degree two nodes in K gives. Showing either M′ = K or M− = K, will prove the claim.
First we show that if LSA(Z,)≠LSA(X,) then M′ = K, by arguing that M′ and K have the same edges. Let e be an edge of M′. Since LSA(Z,)≠LSA(X,), M′ is a subgraph of and e is directed in M+. By Lemma 17, e is in a simple trek in M+ from x to y, for some x,y ∈ Z. This trek induces a simple trek in M′ from x to y, and therefore a simple trek in from x to y. Thus e is in K.
Now let e be an edge of K. Then there exists a simple trek () in from x to y, for some x,y ∈ Z containing e. Let and let T be the sequence of incident edges in from x to v conformed of edges inducing those in and . Since () is simple, T does not have repeated edges. Following T in from x to y, edges are first transversed “uphill” (in reverse direction) until there is a first “downhill” edge (u,w). The next edge in T cannot be uphill, as otherwise it would be hybrid and () would have not been a trek in . This argument applies for all consecutive edges in T until we end at y. Thus there is a simple trek () from x to y in with top u. Note that u must be below or equal to LSA(Z,) since otherwise the trek would not be simple. Moreover, P1 and P2 contain only edges in M+ and thus in M′ after the directions of the tree edges is omitted. Thus e is in M′, so K = M′.
If LSA(Z,)=LSA(X,) then M− = K follows from a straightforward modification of the previous argument to account for the suppression of LSA(Z,) in both M− and K.
References
- 1. Allman Elizabeth S., Degnan James H., and Rhodes John A. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. Journal of Mathematical Biology, 62(6):833–862, 2011.
- 2. Ané Cécile, Larget Bret, Baum David A., Smith Stacey D., and Rokas Antonis. Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24(2):412–426, 2007.
- 3. Arnold Michael L. Natural hybridization and evolution, volume 53. Oxford University Press, 1997.
- 4. Bapteste Eric, van Iersel Leo, Janke Axel, Kelchner Scot, Kelk Steven, McInerney James O., Morrison David A., Nakhleh Luay, Steel Mike, Stougie Leen, and Whitfield James. Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441, 2013.
- 5. Carstens Bryan C., Knowles L. Lacey, and Collins Tim. Estimating Species Phylogeny from Gene-Tree Probabilities Despite Incomplete Lineage Sorting: An Example from Melanoplus Grasshoppers. Systematic Biology, 56(3):400–411, 2007.
- 6. Degnan James H., Knowles L. Lacey, and Salter Kubatko Laura. Probabilities of gene trees with intraspecific sampling given a species tree. In Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, 2010.
- 7. Ellstrand N. C., Whitkus R., and Rieseberg L. H. Distribution of spontaneous plant hybrids. Proceedings of the National Academy of Sciences of the United States of America, 93(10):5090–5093, 1996.
- 8. Gusfield Dan, Bansal Vikas, Bafna Vineet, and Song Yun S. A decomposition theory for phylogenetic networks and incompatible characters. Journal of Computational Biology, 14(10):1247–1272, 2007.
- 9. Huber Katharina T., van Iersel Leo, Moulton Vincent, Scornavacca Celine, and Wu Taoyang. Reconstructing phylogenetic level-1 networks from nondense binet and trinet sets. Algorithmica, 77(1):173–200, January 2017.
- 10. Huber K. T., Moulton V., Semple C., and Wu T. Quarnet inference rules for level-1 networks. https://arxiv.org/pdf/1711.06720.pdf, 2017.
- 11. Keijsper J. C. M. and Pendavingh R. A. Reconstructing a Phylogenetic Level-1 Network from Quartets. Bulletin of Mathematical Biology, 76(10):2517–2541, 2014.
- 12. Linder C. Randal and Rieseberg Loren H. Reconstructing patterns of reticulate evolution in plants. American Journal of Botany, 91(10):1700–1708, 2004.
- 13. Liu Liang, Yu Lili, and Edwards Scott V. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology, 10(1):302, 2010.
- 14. Mallet James. Hybridization as an invasion of the genome. Trends in Ecology & Evolution, 20(5):229–237, 2005. Special issue: Invasions, guest edited by Hochberg Michael E. and Gotelli Nicholas J.
- 15. Meng Chen and Salter Kubatko Laura. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model. Theoretical Population Biology, 75(1):35–45, 2009.
- 16. Nakhleh Luay. Evolutionary Phylogenetic Networks: Models and Issues. Problem Solving Handbook in Computational Biology and Bioinformatics, pages 125–158, 2011.
- 17. Noor Mohamed A. F. and Feder Jeffrey L. Speciation genetics: evolving approaches. Nature Reviews Genetics, 7(11):851–861, 2006.
- 18. Pamilo P. and Nei M. Relationships between gene trees and species trees. Molecular Biology and Evolution, 5:568–583, 1988.
- 19. Pollard Daniel A., Iyer Venky N., Moses Alan M., and Eisen Michael B. Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genetics, 2(10):1634–1647, 2006.
- 20. Rieseberg Loren H., Baird Stuart J. E., and Gardner Keith A. Hybridization, introgression, and linkage evolution. Plant Molecular Biology, 42(1):205–224, 2000.
- 21. Rosselló Francesco and Valiente Gabriel. All that glisters is not galled. Mathematical Biosciences, 221(1):54–59, 2009.
- 22. Semple Charles and Steel Mike. Phylogenetics. Oxford University Press, 2005.
- 23. Solís-Lemus Claudia and Ané Cécile. Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting. PLoS Genetics, 12(3), 2016.
- 24. Solís-Lemus Claudia, Ané Cécile, and Yang Mengyao. Inconsistency of Species Tree Methods under Gene Flow. Systematic Biology, 65(5):843–851, 2016.
- 25. Steel Mike. Phylogeny: Discrete and Random Processes in Evolution. David Marshall, 2016.
- 26. Sullivant Seth, Talaska Kelli, and Draisma Jan. Trek separation for Gaussian graphical models. Ann. Statist., 38(3):1665–1685, 06 2010.
- 27. Syring John, Willyard Ann, Cronn Richard, and Liston Aaron. Evolutionary relationships among Pinus (Pinaceae) subsections inferred from multiple low-copy nuclear loci. American Journal of Botany, 92(12):2086–2100, 2005.
- 28. Wakeley John. Coalescent Theory: An Introduction, volume 58. Roberts and Company Publishers, 2008.
- 29. Yu Y., Dong J., Liu K. J., and Nakhleh L. Maximum likelihood inference of reticulate evolutionary histories. PNAS, 111:296–305, 11 2014.
- 30. Yu Yun, Degnan James H., and Nakhleh Luay. The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics, 8:e1002660, 2012.
- 31. Yu Yun, Than Cuong, Degnan James H., and Nakhleh Luay. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology, 60(2):138–149, 2011.
- 32. Zhang C., Ogilvie H. W., Drummond A. J., and Stadler T. Bayesian inference of species networks from multilocus sequence data. Molecular Biology and Evolution, 35:504–517, 02 2018.
- 33. Zhu J., Yu Y., and Nakhleh L. In the light of deep coalescence: Revisiting trees within networks. BMC Bioinformatics, 17:415, 2016.
- 34. Zhu S. and Degnan J. Displayed trees do not determine distinguishability under the network multispecies coalescent. Systematic Biology, 66:283–298, 2017.