Identifying species network features from gene tree quartets under the coalescent model

Hector Baños

doi:10.1007/s11538-018-0485-4

Author manuscript; available in PMC: 2020 Feb 1.

Published in final edited form as: Bull Math Biol. 2018 Aug 9;81(2):494–534. doi: 10.1007/s11538-018-0485-4

PMCID: PMC6344282 NIHMSID: NIHMS1503360 PMID: 30094772

Abstract

We show that many topological features of level-1 species networks are identifiable from the distribution of the gene tree quartets under the network multi-species coalescent model. In particular, every cycle of size at least 4 and every hybrid node in a cycle of size at least 5 is identifiable. This is a step toward justifying the inference of such networks which was recently implemented by Solís-Lemus and Ané. We show additionally how to compute quartet concordance factors for a network in terms of simpler networks, and explore some circumstances in which cycles of size 3 and hybrid nodes in 4-cycles can be detected.

Keywords: Coalescent Theory, Phylogenetics, Networks, Concordance factors

1. Introduction

As phylogenetic analysis of DNA data has progressed, more evidence has appeared showing that hybridization is often an important factor in evolution. As surveyed in [16], hybridization has played a very important role in the evolutionary history of plants and of some groups of fish and frogs ([7], [12], [14], [17], [20]). Other biological processes, such as introgression, lateral gene transfer, and gene flow, also require moving beyond a simple tree-like view of species relationships.

Phylogenetic networks are the objects used to represent the relationships between species that admit such events ([3],[4]). These networks are often thought of as obtained from phylogenetic trees by adding additional edges, so that some nodes in the tree have two parents. Nodes with two parents, called hybrid nodes, represent species whose genome arises from two different ancestral species. Inference of phylogenetic networks from biological data presents new challenges, with methods still being developed, as shown by recent works including [2], [15], [23], [32], [29] and [31].

Another challenge in inferring evolutionary history arises from the fact that many multi-locus data sets exhibit gene tree incongruence, even without suspected hybridization. One possible reason is incomplete lineage sorting (ILS), which is described in the tree setting by the multi-species coalescent model [18]. See for example [5], [19], and [27] where ILS is explained in the biological setting.

Meng and Kubatko [15] formulated a model of gene tree production, based on the multi-species coalescent model, incorporating both hybridization and ILS. We refer to this model as the network multi-species coalescent model, which is further developed in [30], [33], and [24], to mention some. The model determines the probability of observing any rooted gene tree given a metric rooted phylogenetic species network.

Solís-Lemus and Ané [23] recently presented a novel statistical method, based on the network multi-species coalescent model, to infer phylogenetic networks from gene tree quartets in a pseudolikelihood framework. The quartets themselves might come from larger gene trees inferred by standard phylogenetic methods. The pseudolikelihood in this work is built on quartet frequencies, or concordance factors, extending an idea of Liu [13] from the tree setting. The pseudolikelihood approach is simpler and faster than computing the full likelihood and makes large-scale data analysis more tractable. They demonstrate positive results in reconstructing the evolutionary relationships among swordtails and platyfishes.

However, the theoretical underpinnings of the method of [23] are not complete. In using a model for statistical inference it is important to know if it is theoretically possible to uniquely recover the parameters from the data the model predicts. In more precise terms, for model-based statistical inference to have a solid basis, we need that the probability distribution for data which arises under the model uniquely determines the parameters. This is known as identifiability of the model parameters.

While [23] showed that any particular hybridization in a level-1 network with h hybridizations, and n taxa can be generically detected under certain assumptions, their study never addressed the full identifiability of the network topology, only the detectability of a specific hybridization event. Working in the setting of level-1 networks, which is also adopted here, their arguments do not include investigations on network properties such as cycle sizes, and the structure of the whole network. These properties are crucial to determine, for example, whether two networks with different cycle sizes, or different number of cycles, could produce the same set of gene tree quartet probabilities.

The primary purpose of this work is to begin to address some of these identifiability questions raised in [23]. That is, we study the question: given information on gene quartet probabilities for some unknown level-1 network $N$ , what can be determined about the topology of $N$ ?

Although others have considered the problem of constructing large networks from small ones, these works do not seem to be applicable to the question studied here. Most of these works, including [9], [10] and [11], are primarily combinatorial in nature. In particular, these studies do not address semidirected networks, ILS through the network multi-species coalescent model, nor the types of inputs that might be obtained from biological data.

The main result of this work, Theorem 4 of Section 8, is that under the network multi-species coalescent model on level-1 networks, we can generically identify from gene quartet distributions “most” of the unrooted topological network, including all cycles of size at least 4, and hybrid nodes in the cycles of size greater than 4. “Generically” here means for all values of numerical parameters except those in a set of measure zero. The methods used are a mix of the semi-algebraic study of quartet gene tree frequencies (in terms of linear equalities and inequalities they satisfy) with combinatorial approaches to combining this knowledge for many quartets. As a side benefit the proofs suggest combinatorial methods for reconstructing networks, as opposed to just showing identifiability. However, we do not explore how such methods might be implemented in the presence of the noise that any collection of inferred gene trees will have.

Another result of this work, in Section 5, is a rigorous derivation of how gene quartet probabilities can be computed for large networks under the coalescent model. Although this parallels some of the results in [23], the arguments given here are more rigorous, as is necessary for them to form the basis of our main results. Our approach is to express quartet frequencies as convex combinations of those on simplified networks, ultimately leading to expressions in terms of trees, as is done in other situations [34]. This is different from the approach in [23] of finding networks with fewer hybridizations displaying the same gene quartet probabilities.

The outline of this work is as follows: Section 2 introduces basic definitions and establishes some terminology on graphs and networks. Section 3 sets forth insights and tools for studying the structure of level-1 networks. Section 4 reviews the network multi-species coalescent model of [15], as well as quartet concordance factors and some of their properties. In Section 5 we show how concordance factors of quartet networks can be expressed in terms of simpler networks. Section 6 introduces the “Cycle property” of concordance factors and Section 7 defines the “Big Cycle” property of concordance factors. In Section 8, the main result on topological network identifiability is proved using the Big Cycle property and in Section 9 some extended results on the “Cycle property” are shown.

2. Phylogenetic networks

We adopt standard terminology for graphs and networks, as used in phylogenetics; see for example [22] and [25]. No undirected, directed, or semidirected graph considered here contains loops. If G is a directed or semidirected graph, the undirected graph of G, denoted by U(G), is the graph G with all directions omitted.

2.1. Rooted networks

To set terminology, we begin with some fundamental definitions.

Definition 1 A topological binary rooted phylogenetic network $N^{+}$ on taxon set X is a connected directed acyclic graph with vertices V and edges E, where V is the disjoint union V = {r}⊔V_L ⊔V_H ⊔V_T and E is the disjoint union E = E_H ⊔ E_T, and a bijective leaf-labeling function f: V_L → X with the following characteristics:

The root r has indegree 0 and outdegree 2.
A leaf v ∈ V_L has indegree 1 and outdegree 0.
A tree node v ∈ V_T has indegree 1 and outdegree 2.
A hybrid node v ∈ V_H has indegree 2 and outdegree 1.
A hybrid edge e ∈ E_H is an edge whose child is a hybrid node.
A tree edge e ∈ E_T is an edge whose child is a tree node or a leaf.
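The degree conditions of Definition 1 can be checked mechanically. The following is a minimal Python sketch, assuming the network is supplied as a list of (parent, child) pairs; the names `classify`, `NODE_TYPES`, and the toy network `EDGES` are illustrative and not from the original. It only checks degrees and does not test connectivity, acyclicity, or the leaf labeling.

```python
from collections import Counter

# (indegree, outdegree) -> node type, following the list in Definition 1
NODE_TYPES = {(0, 2): "root", (1, 0): "leaf", (1, 2): "tree", (2, 1): "hybrid"}


def classify(edges):
    """Classify nodes and edges of a rooted network given as (parent, child) pairs."""
    indeg, outdeg = Counter(), Counter()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
    kinds = {}
    for v in set(indeg) | set(outdeg):
        degrees = (indeg[v], outdeg[v])
        if degrees not in NODE_TYPES:
            raise ValueError(f"node {v!r} has degrees {degrees}, not allowed here")
        kinds[v] = NODE_TYPES[degrees]
    # A hybrid edge is an edge whose child is a hybrid node; all others are tree edges.
    hybrid_edges = [(p, c) for p, c in edges if kinds[c] == "hybrid"]
    tree_edges = [(p, c) for p, c in edges if kinds[c] != "hybrid"]
    return kinds, hybrid_edges, tree_edges


# Hypothetical toy network: hybrid node h with parents u and v, taxa a, b, c.
EDGES = [("r", "u"), ("r", "v"), ("u", "h"), ("v", "h"),
         ("u", "a"), ("h", "b"), ("v", "c")]
kinds, hybrid_edges, tree_edges = classify(EDGES)
```

Because the input is a list rather than a set, two parallel edges between the same pair of nodes can be represented by repeating the pair.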

Definition 2 Let $N^{+}$ be a topological binary rooted phylogenetic network with |E| = m and |E_H| = 2h. A metric for $N^{+}$ is a pair (λ,γ), where $λ : E \to ℝ_{> 0}$ and γ : E_H → (0,1) satisfies that if two edges h₁ and h₂ have the same hybrid node as child, then γ(h₁) + γ(h₂) = 1.

If (λ,γ) is a metric for $N^{+}$ , then we refer to ( $N^{+}$ ,(λ,γ)) as a metric binary rooted phylogenetic network.
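As an illustration of Definition 2, here is a small Python sketch that tests whether a pair (λ,γ) is a metric. It assumes edges are stored in a dictionary from a hypothetical edge identifier to a (parent, child) pair, and that `gamma` is supplied exactly on the hybrid edges; the function name `is_valid_metric` is an assumption made for this example.

```python
import math
from collections import defaultdict


def is_valid_metric(edges, lam, gamma):
    """Check the conditions of Definition 2.

    edges: dict edge_id -> (parent, child)
    lam:   dict edge_id -> edge length, required for every edge
    gamma: dict edge_id -> hybridization parameter, for hybrid edges only
    """
    # Every edge needs a strictly positive length.
    if set(lam) != set(edges) or any(length <= 0 for length in lam.values()):
        return False
    by_child = defaultdict(list)
    for edge_id, g in gamma.items():
        if not 0 < g < 1:  # gamma takes values in the open interval (0, 1)
            return False
        by_child[edges[edge_id][1]].append(g)
    # The two hybrid edges sharing a child must have parameters summing to 1.
    return all(len(gs) == 2 and math.isclose(sum(gs), 1.0)
               for gs in by_child.values())


# Hypothetical toy network with one hybrid node h.
EDGES = {"e1": ("r", "u"), "e2": ("r", "v"), "h1": ("u", "h"), "h2": ("v", "h"),
         "e3": ("u", "a"), "e4": ("h", "b"), "e5": ("v", "c")}
LAM = {edge_id: 1.0 for edge_id in EDGES}
```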

Note that Definition 1 differs from that of [25] in that it allows up to two edges between a pair of nodes. An edge weight λ(e) is interpreted as the time (in coalescent units) between speciation events represented by the ends of edge e. For any hybrid edge h with child v, the value γ(h) = γ_h is the probability that a lineage at v has ancestral lineage in h and is often called the hybridization parameter or inheritance probability. Since we are focusing on parameter identifiability, we will use the term hybridization parameter.

2.2. Lowest stable ancestor

We review and show some properties of the lowest stable ancestor, a network analog of the most recent common ancestor on a tree.

Definition 3 Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network. We say that a node v is above a node u, and u is below v, if there exists a non-empty directed path in $N^{+}$ from v to u. We also say that an edge with parent node x and child y is above (below) a node v if y is above or equal to v (x is below or equal to v).

Note that since $N^{+}$ has no directed cycles, u cannot be both above and below v.

Definition 4 [25] Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network on X and let Z ⊆ X. Let D be the set of nodes which lie on every directed path from the root r of $N^{+}$ to any z ∈ Z. Then the lowest stable ancestor of Z of $N^{+}$ , denoted by LSA(Z, $N^{+}$ ), is the unique node v ∈ D such that v is below all u ∈ D, u ≠ v.

When $N^{+}$ is clear from context, we write LSA(Z) for LSA(Z, $N^{+}$ ). To see that LSA(Z) is well defined for any Z ⊆ X, note first that D ≠ ∅ since r ∈ D. Also, since every pair of nodes u,v ∈ D both lie on a path, we have a notion of above and below for u and v, i.e. a total order on D, and hence a minimal element.
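Definition 4 translates directly into a brute-force computation. The Python sketch below is one possible illustration, assuming the network is a list of (parent, child) pairs; the names `reachable` and `lsa` and both toy networks are hypothetical. A node belongs to D exactly when deleting it leaves no taxon of Z reachable from the root.

```python
from collections import defaultdict


def reachable(children, start, banned=None):
    """Nodes reachable from start by directed paths that avoid the node banned."""
    if start == banned:
        return set()
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child != banned and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def lsa(edges, root, taxa):
    """Lowest stable ancestor of taxa, computed straight from Definition 4."""
    children, nodes = defaultdict(set), {root}
    for parent, child in edges:
        children[parent].add(child)
        nodes |= {parent, child}
    # D: nodes lying on every directed path from the root to every taxon in Z.
    stable = [v for v in nodes
              if all(z not in reachable(children, root, banned=v) for z in taxa)]
    # D is totally ordered, so its lowest element has no other element of D below it.
    for v in stable:
        below_v = reachable(children, v)
        if not any(u != v and u in below_v for u in stable):
            return v


# A 2-cycle above m: LSA({a, b}) is m, not the root.
CHAIN = [("r", "h"), ("r", "h"), ("h", "m"), ("m", "a"), ("m", "b")]
# One hybrid node h with parents u and v: LSA({a, b}) is the root r.
HYBRID = [("r", "u"), ("r", "v"), ("u", "h"), ("v", "h"),
          ("u", "a"), ("h", "b"), ("v", "c")]
```

In the second network the node u is an ancestor of both a and b, yet b can also be reached through v, so u is not stable and the lowest stable ancestor of {a, b} is the root.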

While the definition of LSA agrees with the most recent common ancestor for trees, it is more subtle. In particular, if $N^{+}$ is a network on X, LSA(X) need not be the root of the network, as Figure 1 (left) shows. Furthermore, there can be nodes below LSA(X) which are ancestral to all of X, as Figure 2 shows.

Fig. 1 — (Left) A binary rooted phylogenetic network on X, with LSA(X) the node labeled x, and (Right) its induced unrooted semidirected network. In a depiction of a rooted network, all edges are directed downward, from the root, but arrowheads are shown only on hybrid edges. For the unrooted network, all edges except hybrid ones are undirected.

Fig. 2 — A binary rooted phylogenetic network where the node labeled y is ancestral to all taxa in X but is not LSA(X). LSA(X) here is the root of the network.

Lemma 1 Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network on X with root r, and let Z ⊆ Y ⊆ X. Then

(i) the indegree of LSA(Z) is at most one for any Z ⊂ X;
(ii) at most one of the out edges of LSA(Z) is hybrid;
(iii) if Z ⊆ Y ⊆ X then LSA(Z) is below or equal to LSA(Y ).

Proof To see (i), suppose that the indegree of LSA(Z) is two. Then the outdegree would be one, and the child of LSA(Z) would be in any path from the root to any taxon in Z, contradicting the definition of LSA(Z).

For (ii), suppose the out edges of LSA(Z), e₁ and e₂, are both hybrid. If e₁ and e₂ have the same child then every path from r to any z ∈ Z would contain that node, contradicting the definition of LSA(Z).

Now denote by x₁ ≠ x₂ the child nodes of e₁ and e₂ respectively. If both x₁ and x₂ had parents below LSA(Z), then x₁ has a parent below x₂ and x₂ has a parent below x₁, giving a directed cycle. Thus, without loss of generality, assume x₁ has parents LSA(Z) and v with v not below LSA(Z). Let z ∈ Z with z below x₁. If we remove LSA(Z) from $N^{+}$ there is still a path from r to z (which goes from r to v to x₁ to z). This contradicts the fact that LSA(Z) is on all paths from r to any z ∈ Z.

For (iii) we observe that since Z ⊆ Y , LSA(Y ) must be equal to or above LSA(Z) since the set of paths from r to any taxon in Y contains the set of paths from r to any taxon in Z.

Lemma 2 Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network on X and let Z ⊂ X, |Z| ≥ 2. For every x ∈ Z, there is a y ∈ Z such that LSA(x,y)=LSA(Z).

Proof Let m=LSA(Z), fix x ∈ Z and let P be a path from m to x. By definition of LSA, for all y ∈ Z, LSA(x,y) is a node in P and is below or equal to m by Lemma 1. Suppose that LSA(x,y) is below m for all y ∈ Z. Let z ∈ Z be such that LSA(x,z) is above or equal to LSA(x,y) for all y ∈ Z \ {z}.

We claim that any path from m to y ∈ Z passes through LSA(x,z). Suppose there exists taxon y with path P′ from m to y that does not pass through LSA(x,z). But P′ must pass through LSA(x,y). Since LSA(x,y) is below LSA(x,z), there is a path from m to LSA(x,y) to x that does not contain LSA(x,z). This is a contradiction.

But every path from m to any y ∈ Z passes through LSA(x,z), contradicting that LSA(x,z) is below m.

By this Lemma we can characterize LSA(Z) as the highest node of the form LSA(x,y) for some x,y ∈ Z, or the highest node of that form for fixed x ∈ Z.

2.3. Unrooted networks

Let G be a directed or semidirected graph with z a degree two node. Let x and y be the two nodes adjacent to z. Then, up to isomorphism, the subgraph on x,y and z must be one of the graphs shown on the left of Figure 3, which we denote by H. By suppressing z we mean replacing H in G by the graph to the right of it in Figure 3.

Fig. 3 — On the left are all the semidirected graphs, up to isomorphism, on a degree two node z and its adjacent vertices x and y. On the right are the corresponding graphs obtained by suppressing z.

Definition 5 Let $N^{+}$ be a binary topological rooted phylogenetic network on a set of taxa X. Then $N^{-}$ is the semidirected network obtained by 1) keeping only the edges and nodes below LSA(X); 2) removing the direction of all tree edges; 3) suppressing LSA(X). We refer to $N^{-}$ as the topological unrooted semidirected network induced from $N^{+}$ .

Figure 1 shows an example of a network $N^{+}$ and its induced $N^{-}$ . We now introduce a metric on $N^{-}$ induced from one on $N^{+}$ .

Definition 6 Let ( $N^{+}$ ,(λ,γ)) be a metric binary rooted phylogenetic network and let $N^{-}$ be the topological unrooted semidirected network induced from $N^{+}$ . Denote by e^∗ the edge of $N^{-}$ introduced in place of the edges e₁ and e₂ in $N^{+}$ when LSA(X) is suppressed. Define $λ' : E (N^{-}) \to ℝ_{> 0}$ such that λ′(e*) = λ(e₁) + λ(e₂) and λ′(e) = λ(e) for e ∈ $N^{-}$ , e ≠ e^∗. If e^∗ is not hybrid, γ′ = γ, else let γ′(h) = γ(h) for all hybrid edges of $N^{-}$ other than e^∗ and γ′(e*) = γ(e_i), where e_i is, by Lemma 1, the single hybrid edge in {e₁,e₂}. We refer to ( $N^{-}$ ,(λ′,γ′)) as the metric unrooted semidirected network induced from ( $N^{+}$ ,(λ,γ)).

The networks considered in this work are always induced from a rooted binary metric phylogenetic network. To simplify language, we refer to a (metric or topological) binary rooted phylogenetic network as a (metric or topological) rooted network and to an induced (metric or topological) unrooted semidirected phylogenetic network as a (metric or topological) unrooted network.

We note that not all binary semidirected graphs are topological unrooted networks, since some graphs are not compatible with suppressing the root on any rooted network. Moreover, $N^{-}$ might be induced from several rooted networks $N^{+}$ . See Figure 4.

Fig. 4 — The top graph is not a topological unrooted semidirected phylogenetic network, since its directed edges cannot be obtained by suppressing the root of any 6-taxon topological binary rooted phylogenetic network. The middle graph is the induced topological unrooted network from either of the bottom rooted networks, as well as others.

Although an unrooted network $N^{-}$ does not have a root specified, since hybrid edges are directed, the suppressed LSA(X) of $N^{+}$ must have been located ‘above’ them. Thus in $N^{-}$ , we still have a well-defined notion of which taxa are descendants of a hybrid node v. These are the taxa x such that there exists a semidirected path from v to x in $N^{-}$ . In this case we say that x descends from v.

2.4. Induced networks on subset of taxa

Since later arguments require an understanding of the behavior of the network multi-species coalescent model on a subset of taxa, we introduce some needed definitions.

Definition 7 Let $N^{+}$ be a (metric or topological) rooted network on X and let Z ⊂ X. The induced rooted network $N_{Z}^{+}$ on Z is the network obtained from $N^{+}$ by 1) retaining only edges and nodes in paths from the root to any taxon in Z; 2) suppressing all degree two nodes except the root; 3) in the case the root then has outdegree one, contracting the edge incident to the root.

Note that LSA(Z, $N_{Z}^{+}$ )=LSA(Z, $N^{+}$ ). If |Z| = 4 then $N_{Z}^{+}$ , the induced rooted quartet network on Z, will also be denoted by $Q_{Z}^{+}$ to emphasize it involves only 4 taxa.

Definition 8 Let $N^{+}$ be a (metric or topological) rooted network on X and let Z ⊂ X. The induced LSA network of Z, denoted $N_{Z}^{\oplus}$ , is the rooted network obtained from $N_{Z}^{+}$ by deleting everything above LSA(Z, $N^{+}$ ).

In particular we note that $N_{Z}^{\oplus}$ has root LSA(Z, $N^{+}$ ). If |Z| = 4 then $N_{Z}^{\oplus}$ , the induced LSA quartet network on Z, is also denoted by $Q_{Z}^{\oplus}$ .

Definition 9 Let G be a semidirected graph and let x,y be two nodes in G. A trek in G from x to y is an ordered pair of semidirected paths (P₁,P₂) where P₁ has terminal node x, P₂ has terminal node y, and both P₁ and P₂ have starting node v. The node v is called the top of the trek, denoted top(P₁,P₂). A trek (P₁,P₂) is simple if the only common node among P₁ and P₂ is v.

This definition is adopted from non-phylogenetic studies of statistical models on graphs, such as [26].

Definition 10 Let $N^{-}$ be a (metric or topological) unrooted network on X and let Z ⊆ X. The induced unrooted network ${(N^{-})}_{Z}$ on a set of taxa Z is the network obtained from $N^{-}$ by retaining only edges in simple treks between pairs of taxa in Z, and then suppressing all degree two nodes.

Note that it is not immediately clear that for a network $N^{+}$ , the networks ${(N^{-})}_{Z}$ and ${(N_{Z}^{+})}^{-}$ are isomorphic. Proposition 1 shows that the operations of unrooting and inducing a network on a subset of taxa commute. While this statement is intuitively plausible, its rather technical proof is in the Appendix.

Proposition 1 Let $N^{+}$ be a (metric or topological) rooted network on X and let Z ⊆ X. Then ${(N^{-})}_{Z}$ and ${(N_{Z}^{+})}^{-}$ are isomorphic.

If |Z| = 4 then ${(N^{-})}_{Z}$ , the induced unrooted quartet network on Z, is also denoted by $Q_{Z}^{-}$ .

2.5. Cycles

Although the networks $N^{+}$ , $N^{-}$ are acyclic (in both the directed and semidirected settings), their undirected graphs U( $N^{+}$ ), U( $N^{-}$ ) may contain a cycle. Thus the term ‘cycle’ may be used to refer unambiguously to cycles in the undirected graphs. We formalize this with the following definition:

Definition 11 Let $N$ be a (metric or topological, rooted or unrooted) network. A cycle in $N$ is a non-empty path from a node to itself, allowing edges to be traversed without regard to their possible direction. The size of the cycle is the number of edges in the path. A k-cycle is a cycle of size k.

By contracting or shrinking a cycle C in a graph we mean removing all edges in C and identifying all nodes in C.

3. Structure of level-1 networks

The class of all phylogenetic networks is often too large to obtain strong mathematical results ([25]), so it is common to restrict to networks that have a simpler structure, for instance, the class of level-1 phylogenetic networks.

Definition 12 Let $N$ be a (rooted or unrooted) topological network. If no two cycles in $N$ share an edge, then $N$ is level-1.

If $N$ is a level-1 network, any subnetwork or induced network of $N$ is also level-1.
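The level-1 condition of Definition 12 can be tested on the undirected graph U(N). The sketch below is one way to do so in Python, not a method from the original: it builds a breadth-first spanning forest and, for every non-tree edge, marks the tree edges on the cycle that edge closes. Two cycles share an edge exactly when some tree edge would be marked twice. The function name `is_level1` and the example edge lists are hypothetical.

```python
from collections import deque


def is_level1(edges):
    """Return True if no two cycles share an edge.

    edges: list of (u, v) pairs read as undirected; parallel edges are allowed.
    """
    adj = {}
    for i, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, i))
        adj.setdefault(v, []).append((u, i))
    parent, depth, tree = {}, {}, set()
    for start in adj:  # breadth-first spanning forest
        if start in depth:
            continue
        depth[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, i in adj[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    parent[v] = (u, i)  # parent node and id of the tree edge used
                    tree.add(i)
                    queue.append(v)
    covered = set()  # tree edges already lying on some cycle
    for i, (u, v) in enumerate(edges):
        if i in tree:
            continue
        # Walk both endpoints up to their common ancestor, marking tree edges.
        while u != v:
            if depth[u] < depth[v]:
                u, v = v, u
            u, edge_id = parent[u]
            if edge_id in covered:
                return False
            covered.add(edge_id)
    return True


# One 4-cycle with pendant edges, and two triangles sharing the edge (2, 3).
ONE_CYCLE = [("a", "x"), ("x", "y"), ("y", "z"), ("z", "w"), ("w", "x"),
             ("y", "b"), ("z", "c"), ("w", "d")]
SHARED_EDGE = [(1, 2), (2, 3), (3, 1), (2, 4), (4, 3)]
```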

Given a hybrid node v, denote the hybrid edges whose child is v by h_v and $h_{v}^{'}$ . Then h_v and $h_{v}^{'}$ are called the hybrid edges of v.

Lemma 3 Let $N$ be a (topological or metric, rooted or unrooted) level-1 network and let C be a cycle of $N$ . Then C contains exactly one hybrid node v, and the associated hybrid edges h_v, $h_{v}^{'}$ . Furthermore, each node of $N$ is in at most one cycle and, as a result, v, h_v and $h_{v}^{'}$ are in exactly one cycle of $N$ .

The proof of each statement of this Lemma, using different terminology, is given by Rossello and Valiente [21].

Proposition 2 Let $N^{+}$ be a topological level-1 rooted network on X. The structure of all the nodes and edges above LSA(X) in $N^{+}$ is a (possibly empty) chain of 2-cycles connected by edges, as depicted in Figure 5.

Fig. 5 — In a level-1 network on X, the structure between the root and m =LSA(X) is a chain of 2-cycles. The number of 2-cycles in the chain could be zero.

Proof Let m = LSA(X), and denote by r the root of $N^{+}$ . The proof is by induction on the number of the edges above m. If there are no edges above m, then m = r and the result is trivially true. By Lemma 1, one easily sees that there cannot be only 1 or 2 edges above m in a binary phylogenetic network. That is, if there were just 1 edge above m the outdegree of the root would be 1, contradicting the definition of binary phylogenetic network. Suppose there are 2 edges above m. By definition of binary phylogenetic network the outdegree of r is 2 and by definition of LSA(X) all paths from the root to x ∈ X contain m. Therefore m has indegree 2, contradicting Lemma 1 part (i).

Now assume the claim holds when there are at most k edges above m and suppose there are k + 1 edges above m. Note that r has outdegree 2 by the definition of $N^{+}$ .

Suppose that edges incident to r have different children, x and y. Note neither x nor y can be m. The outdegree of one of x or y must be 2, otherwise both would be hybrid nodes, which would require x above y and y above x. Without loss of generality suppose x has outdegree 2, and denote by e₁ and e₂ its out edges, and denote by e₃ the edge (r,y). Since every path from r to a leaf goes through m, there are at least 3 distinct paths P₁, P₂, P₃ from r to m, where P_i contains e_i.

This contradicts the level-1 condition. Thus x = y, and the edges from r form a 2-cycle.

Now since x is a hybrid node, it has outdegree 1, with child v. Also, since the two edges of the 2-cycle and the edge (x,v) are not below v, there are k − 2 edges above m that are also below v. Applying the inductive hypothesis to $N^{+}$ with edges above v removed, the result follows.

Proposition 2 applied to $N_{Z}^{+}$ illustrates the structure of the common ancestry of a subset Z of taxa. When we pass to a LSA network or an induced unrooted network, we “throw away” this structure. We show in Section 5 that under the network multi-species coalescent model this structure has no effect on the formation of quartet gene trees.

Let v be a hybrid node in a level-1 (rooted or unrooted, metric or topological) network $N$ on X and let C_v be the cycle containing v. By removing the edges of C_v from $N$ we obtain a partition of X according to the connected components of the resulting graph. We refer to this partition as the v-partition and its partition sets as v-blocks.

Note that each node in C_v can be associated to a v-block. That is, a v-block B_u is associated to a node u in C_v if by removing u from the network (and therefore the edges adjacent to u), the induced partition of taxa is {B_u,X \ B_u}. We refer to the v-block B_v, whose elements descend from v, as the v-hybrid block. Two distinct v-blocks B_u,B_w are adjacent if the nodes u,w ∈ C_v are adjacent.
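To make the v-partition concrete, here is a minimal Python sketch that removes the edges of C_v and groups taxa by connected component. It assumes the network is given as a list of undirected edges with no parallel edges and that the edges of C_v are already known; the function name `v_partition` and the toy network are illustrative only.

```python
def v_partition(edges, cycle_edges, taxa):
    """Partition taxa by the connected components left after deleting the cycle edges."""
    removed = {frozenset(e) for e in cycle_edges}
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if frozenset((u, v)) not in removed:
            adj[u].add(v)
            adj[v].add(u)
    taxa = set(taxa)
    blocks, seen = set(), set()
    for t in sorted(taxa):
        if t in seen:
            continue
        component, stack = {t}, [t]
        while stack:  # depth-first search in the graph without the cycle edges
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in component:
                    component.add(nbr)
                    stack.append(nbr)
        seen |= component
        blocks.add(frozenset(component & taxa))  # keep only the taxa of this component
    return blocks


# Hypothetical 4-cycle on x, y, z, w; the subtree hanging off w carries taxa d and e.
CYCLE = [("x", "y"), ("y", "z"), ("z", "w"), ("w", "x")]
EDGES = CYCLE + [("x", "a"), ("y", "b"), ("z", "c"),
                 ("w", "p"), ("p", "d"), ("p", "e")]
blocks = v_partition(EDGES, CYCLE, {"a", "b", "c", "d", "e"})
```

Each block in the result corresponds to one node of the cycle; here the block {d, e} is the one associated to w.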

Let $D = \{C_{1}, \dots, C_{n}\}$ be a collection of cycles in $N$ . The partition of X obtained by removing all the edges in the cycles of $D$ is the network partition induced by $D$ and its blocks are network blocks induced by $D$ . When $D$ is the set of all cycles in $N$ of size at least k, the partition is the k-network partition and its blocks are k-network blocks. The 4-network blocks play an important role in Section 8. From now on, we refer to removing all edges of a cycle C from a network $N$ as removing the cycle C from $N$ .

The following is straightforward to prove.

Lemma 4 Let $N$ be a level-1 (rooted or unrooted) topological network on X. Let $D = \{C_{1}, \dots, C_{n}\}$ be a collection of cycles in $N$ . For any two taxa a and b in different network blocks induced by $D$ , there exists a hybrid node v of some cycle in $D$ such that a and b are in different v-blocks.

If two taxa a and b are in the same network block induced by $D$ , then they are connected when all cycles in $D$ are removed. As a result they are connected when a single cycle in $D$ is removed. This comment together with Lemma 4 yields the following.

Corollary 1 Let $N$ be a level-1 (rooted or unrooted) topological network on X. Let $D = \{C_{1}, \dots, C_{n}\}$ be a collection of cycles in $N$ , with v_i the hybrid node associated to C_i. The network partition induced by $D$ is the common refinement of the v_i-partitions for 1 ≤ i ≤ n.
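The common refinement in Corollary 1 can be computed by recording, for each taxon, which block it occupies in every v_i-partition. A short Python sketch follows, assuming each partition is a list of sets over the same taxon set; the name `common_refinement` and the two example partitions are hypothetical.

```python
def common_refinement(partitions):
    """Common refinement of several partitions of the same taxon set.

    partitions: list of partitions, each given as a list of sets.
    Two taxa end up together exactly when they share a block in every partition.
    """
    taxa = set().union(*(block for partition in partitions for block in partition))
    groups = {}
    for x in taxa:
        # Signature of x: the index of its block in each partition.
        signature = tuple(
            next(i for i, block in enumerate(partition) if x in block)
            for partition in partitions
        )
        groups.setdefault(signature, set()).add(x)
    return {frozenset(group) for group in groups.values()}


# Two hypothetical v-partitions of the taxa a, ..., e.
P1 = [{"a", "b"}, {"c", "d", "e"}]
P2 = [{"a", "b", "c"}, {"d", "e"}]
refined = common_refinement([P1, P2])
```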

Since contracting cycles in level-1 networks does not introduce loops or multi-edges, we can define a notion of a tree of cycles which is useful for the proof of Theorem 4.

Definition 13 Let $N^{-}$ be a topological unrooted level-1 network. Let $T$ be the graph obtained from $N^{-}$ by 1) removing all pendant edges, repeatedly, until no pendant edges remain; 2) suppressing all vertices of degree two that are not part of a cycle; 3) contracting each cycle in the network obtained from steps 1 and 2. We refer to $T$ as the tree of cycles of $N^{-}$ .

In the tree of cycles of $N^{-}$ certain nodes, including all the leaves, represent a cycle of the original network $N^{-}$ . The notion of tree of cycles is different from “tree of blobs” of [8], as there is no deletion of the non-cycle edges in the tree of blobs. In Figure 6 we see an example of a tree of cycles.

Fig. 6 — (Left) A level-1 unrooted network $N^{-}$ and (Right) the tree of cycles of $N^{-}$ .

4. The network multi-species coalescent model and quartet concordance factors

Coalescent theory models the formation of gene trees within populations of species. The coalescent model for a single population traces (backwards in time) the ancestries of a finite set of individual copies of a gene as the lineages coalesce to form ancestral lineages (see [28]). The multi-species coalescent (MSC) model is a generalization of the coalescent model, formulated by applying it to multiple populations connected to form a rooted population tree, or species tree. It is commonly used to obtain the probabilities of gene trees in the presence of incomplete lineage sorting.

Meng and Kubatko [15] extended the MSC by introducing phenomena such as hybridization or other horizontal gene transfer across the species-level and Nakhleh et al. further developed it [30, 33]. This model describes any situation in which a gene lineage may “jump” from one population to another at a specific time. The model parameters are specified by a metric binary rooted phylogenetic network as defined in Section 2. Different from models such as the structured coalescent with continuous gene flow (see [28]), the network model approach assumes the gene transfer occurs at a single point in time along hybrid edges. We refer to this extended version of the MSC as the network multi-species coalescent (NMSC) model.

The NMSC model assumes that speciation by hybridization results in what Meng and Kubatko refer to as a mosaic genome. One assumption of the NMSC model, inherited from the MSC model, is that all gene lineages present at a specific point on the species tree behave identically above this point. That is, the probability of any event conditioned on a set of lineages being present at a certain point on the species tree is invariant under permutation of those lineages. This feature is known as the exchangeability property.

Example 1 We illustrate how to compute the probability of a gene tree topology under the NMSC with an example. Suppose we have the rooted metric species network given in Figure 7. Let A,B,C and D be genes sampled from species a,b,c and d respectively. We compute the probability that a gene tree has the unrooted topology ((A,B),(C,D)) under the NMSC model.

Fig. 7 — Two gene trees within a species network with one hybrid node.

First observe that until B and C trace back to the edge with length z there cannot be a coalescent event. In that edge these lineages cannot coalesce if the gene tree ((A,B),(C,D)) is to be formed. The probability of no coalescence on this edge is $e^{-z}$. Now there are 4 cases, illustrated in Figure 8:

Fig. 8 — Cases 1–4 (Left-Right) of Example 1, of how lineages may behave under the NMSC model on the network of Figure 7.

1) with probability γ², lineages B and C enter the edge of length w;
2) with probability (1 − γ)², B and C enter the edge of length v;
3) with probability γ(1 − γ), B enters the edge of length w and C enters the edge of length v;
4) with probability (1 − γ)γ, B enters the edge of length v and C enters the edge of length w.

Observe that each case is now reduced to a standard MSC scenario with several samples per population (see [6]). Let P_i be the probability of observing ((A,B),(C,D)) under the MSC of case i. Then the probability of observing ((A,B),(C,D)) is $e^{-z}(γ^{2}P_{1} + (1 - γ)^{2}P_{2} + γ(1 - γ)P_{3} + γ(1 - γ)P_{4})$.
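The final expression of Example 1 is a weighted sum over the four cases, scaled by the probability of no coalescence on the edge of length z. The Python sketch below evaluates it, treating the per-case MSC probabilities P₁, ..., P₄ as inputs that would have to be computed separately; the function name `example1_probability` is an assumption made for this illustration.

```python
import math


def example1_probability(z, gamma, p1, p2, p3, p4):
    """Probability of ((A,B),(C,D)) in Example 1, given the per-case MSC probabilities."""
    # Weights of cases 1-4: which hybrid edge the lineages B and C enter.
    weights = (gamma ** 2, (1 - gamma) ** 2, gamma * (1 - gamma), (1 - gamma) * gamma)
    mixture = sum(w * p for w, p in zip(weights, (p1, p2, p3, p4)))
    # Factor exp(-z): B and C do not coalesce on the edge of length z.
    return math.exp(-z) * mixture
```

The four weights sum to (γ + (1 − γ))² = 1, so if every case had probability 1 the result would reduce to the factor exp(−z).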

Following Solís-Lemus and Ané [23], we are interested in the probability that a species network produces various gene quartets under the NMSC. This motivates the following definition.

Definition 14 Let $N^{+}$ be a metric rooted network on a taxon set X. Let A,B,C,D be genes sampled from species a,b,c,d respectively. Given a gene quartet AB|CD, the quartet concordance factor CF_AB|CD is the probability under the NMSC on $N^{+}$ that a gene tree displays the quartet AB|CD, and

$$CF_{abcd} = (CF_{AB|CD}, CF_{AC|BD}, CF_{AD|BC})$$

is the ordered triple of concordance factors of each quartet on the taxa a,b,c,d.

When a,b,c,d are clear from context, we write CF for CF_abcd.

In the particular case where $N^{+}$ has no hybrid edges, so the network is a tree, it is known that the quartet concordance factors do not depend on the root placement [1]. For example let a,b,c,d be taxa and consider any root placement in the unrooted species tree with topology ab|cd and internal edge of length t. Then

C F_{a b c d} = (1 - \frac{2}{3} e^{- t}, \frac{1}{3} e^{- t}, \frac{1}{3} e^{- t}) .

(1)
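Equation (1) is simple enough to evaluate directly. The sketch below is a minimal Python version; the function name `tree_cf` is ours, and the internal edge length t is assumed to be in coalescent units.

```python
import math

def tree_cf(t):
    """CF triple (CF_AB|CD, CF_AC|BD, CF_AD|BC) of equation (1)
    for a quartet tree ab|cd with internal edge length t."""
    x = math.exp(-t)
    return (1 - 2 * x / 3, x / 3, x / 3)

print(tree_cf(1.0))   # the three entries sum to 1
print(tree_cf(0.0))   # t = 0 gives (1/3, 1/3, 1/3) up to rounding
```

As t grows the first entry tends to 1, so long internal edges make the displayed quartet dominate.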

As mentioned in [23], for unrooted species networks the concordance factors do not depend on the placement of the root in the species network, as long as the root is placed in a way consistent with the direction of the hybrid edges. This fact is shown in Section 5, as we explore quartet concordance factors more thoroughly.

Definition 15 Let $N^{+}$ be a metric rooted level-1 network on X. Given a set of distinct taxa {a,b,c,d}, we define the ordering of CF_abcd on $N^{+}$ as the natural decreasing order of CF_AB|CD, CF_AC|BD, CF_AD|BC in the real line.

For example if t > 0 the ordering of the concordance factors in equation (1) is given by

C F_{A B | C D} > C F_{A C | B D} = C F_{A D | B C} .

Many arguments towards the main result of this work use the ordering of CF_abcd, and not its precise values.
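Since only the ordering matters in many arguments, it can help to extract it mechanically. The following Python sketch does so for a concordance factor triple; the function name, the label strings and the tolerance parameter are illustrative assumptions (a tolerance is needed only when the entries are floating-point numbers).

```python
def cf_ordering(cf, labels=("AB|CD", "AC|BD", "AD|BC"), tol=1e-12):
    """Return the decreasing ordering of a CF triple as a string,
    writing '=' between entries that differ by at most tol."""
    # sorted() is stable, so tied entries keep the order of `labels`.
    ranked = sorted(zip(cf, labels), key=lambda pair: -pair[0])
    out = ranked[0][1]
    for (prev, _), (cur, name) in zip(ranked, ranked[1:]):
        out += (" = " if abs(prev - cur) <= tol else " > ") + name
    return out

print(cf_ordering((0.6, 0.2, 0.2)))  # a tree-like triple as in equation (1)
```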

5. Computing quartet concordance factors

In this section we show how to express the concordance factors arising on a LSA quartet network as a linear combination of the concordance factors arising on quartet trees using a similar approach as in [29]. This enables us to see how the ordering of concordance factors reflects the network topology, and how the precise root location does not matter.

The final results of this section are largely in [23]. However, we provide formal arguments and take into consideration some matters that were left unaddressed. For example, we address the possibility that an induced 4-taxon network does not contain the root of the original network.

Let $N^{+}$ be a (metric or topological) rooted level-1 network on X and let {a,b,c,d} be a set of distinct taxa of X. Then the induced unrooted network on 4 taxa $Q_{a b c d}^{-}$ is a (metric or topological) unrooted level-1 network. By Proposition 1, $Q_{a b c d}^{-}$ is the same graph as ${(N_{a b c d}^{+})}^{-}$ and ${(N_{a b c d}^{\oplus})}^{-}$ , where $N_{a b c d}^{\oplus}$ is the LSA network of Definition 8. Any cycle in $N_{a b c d}^{\oplus} = Q_{a b c d}^{\oplus}$ induces a cycle in $Q_{a b c d}^{-}$ . A cycle C in $Q_{a b c d}^{\oplus}$ of size k, induces a cycle in $Q_{a b c d}^{-}$ of either size k (when C does not contain LSA(a,b,c,d)) or size k − 1 (otherwise). For convenience when we refer to the size of a cycle C in $Q_{a b c d}^{\oplus}$ we mean the size of the induced cycle in $Q_{a b c d}^{-}$ .

Lemma 5 Let $Q_{a b c d}^{-}$ be a metric unrooted level-1 quartet network. The number of k-cycles in $Q_{a b c d}^{-}$ is 0 for k ≥ 5, at most 1 for k = 4 in which case there is no 3-cycle, and at most 2 for k = 3.

Proof Suppose that $Q_{a b c d}^{-}$ has a cycle C = C_v of size k. Then there is an associated partition of taxa into k v-blocks. Trivially none of these blocks can be empty, so k ≤ 4.

Suppose that there are two cycles, a cycle C₁ of size k₁ and C₂ of size k₂ with k_i ≥ 3, i = 1,2. Since $Q_{a b c d}^{-}$ is level-1, by removing these two cycles we induce a partition of the taxa into at least k₁ + k₂ − 2 blocks. None of the blocks of this partition can be empty, so k₁+k₂−2 ≤ 4. Hence there is at most one cycle of size 4 or at most two cycles of size 3. Moreover there cannot be a cycle of size 3 and a cycle of size 4 in the same unrooted quartet network.

Suppose that there are three cycles, a cycle C₁ of size k₁, C₂ of size k₂, and C₃ of size k₃ with k_i ≥ 3, i = 1,2,3. By removing these three cycles we induce a partition of the taxa into at least k₁+k₂+k₃−3 blocks, so k₁+k₂+k₃−3 ≤ 4 which is a contradiction since k_i ≥ 3.

Our arguments will depend on the number of descendants of the hybrid node of a cycle, so we introduce additional terminology. An n-cycle with exactly k taxa descending from the hybrid node is referred to as an n_k-cycle. Figure 9 shows the 6 different types of 2-, 3-, and 4-cycles possible in an unrooted quartet network.

Fig. 9 — (Left) The three types of 2-cycles in an unrooted quartet network (2₁-,2₂- and a 2₃-cycle); (Center) The two types of 3-cycles in the unrooted quartet network (3₁- and a 3₂-cycle). (Right) The only type of 4-cycle in an unrooted quartet network (a 4₁-cycle). The dashed lines represent subgraphs that may contain other cycles.

Lemma 6 Let $Q_{a b c d}^{-}$ be a metric unrooted level-1 quartet network. Then $Q_{a b c d}^{-}$ cannot have two 3₂-cycles, or a 2₂-cycle and a 4₁-cycle.

Proof Suppose $Q = Q_{a b c d}^{-}$ has two distinct 3₂-cycles, C_u and C_v. Suppose C_u has u-hybrid block {a,b} and u-blocks {c} and {d}. If we remove C_u from Q, by the level-1 assumption C_v is in one of the connected components. This implies that 2 of the 3 v-blocks must be contained in one of {a,b}, {c} or {d}. This is only possible if the v-hybrid block is {c,d}, and the other v-blocks are {a} and {b}. Thus Q must be as the network in Figure 10, where u is below v and v is below u, contradicting that Q is induced from a rooted network.

Fig. 10 — A graph with two 3₂ cycles. Each dashed edge represents a chain of 2-cycles with, possibly, other cycles.

Now suppose that Q has a 4-cycle and a 2₂-cycle. The 4-cycle induces 4 singleton blocks. By the level-1 condition at least one of the blocks induced by the 2₂-cycle has to be contained in a singleton block. That is impossible since the blocks induced by the 2₂-cycle have size 2.

Lemmas 5 and 6 determine all possible topological structures for unrooted quartet networks which are shown in Figure 11.

Fig. 11 — Possible structures for unrooted quartet networks. Every dashed arrow represents a chain of an arbitrary number of 2-cycles, as the one in the bottom of the Figure. The direction of these 2-cycles must be such that the obtained graph is induced from a rooted network.

5.1. Concordance factor formulas for quartet networks

Next we prove a number of “reduction” lemmas relating concordance factors for quartet networks to those for networks with fewer cycles. This allows us to express the network concordance factors as a linear combination of concordance factors of trees. The following observation is useful throughout this section.

Observation 1 Given a rooted metric species quartet network, under the NMSC model the first coalescent event (going backwards in time) determines the unrooted topology of a quartet gene tree.

As illustrated in Figure 12, in passing from a rooted network on X to a rooted induced network on Z ⊂ X, $N_{Z}^{+}$ , we may find there is a network structure above LSA(Z), a chain of 2-cycles by Proposition 2. A priori, this could have an impact on the behavior of the NMSC model on $N_{Z}^{+}$ . For quartet concordance factors, however, this additional structure has no impact, and we effectively snip it off. Formally, we have the following.

Fig. 12 — A level-1 rooted network where the root differs from LSA(a,b,c,d).

Theorem 2 Let $N^{+}$ be a level-1 rooted metric network on X and let a,b,c,d be distinct taxa of X. Under the NMSC model, CF_abcd can be computed from the LSA network $Q_{a b c d}^{\oplus}$ .

Proof In any realization of the coalescent process if there are fewer than 4 lineages at the LSA(a,b,c,d) in $N_{a b c d}^{+} = Q_{a b c d}^{+}$ , then a coalescent event has occurred below and therefore the unrooted gene tree topology has been determined. Thus we condition on 4 lineages being present at LSA(a,b,c,d).

There are 2 rooted shapes for 4-taxon gene trees, the caterpillar and balanced trees. Regardless of the ancestral chain of 2-cycles above LSA(a,b,c,d), conditioned on one of these shapes, exchangeability of lineages under the coalescent tells us all labeled versions of that specific shape will have equal probability. While the rooted shapes might have different probability, since there is only 1 unrooted shape, all labellings of it must be equally probable. This is the same as if there were no ancestral cycles. Therefore $C F_{a b c d} (Q_{a b c d}^{\oplus}) = C F_{a b c d} (Q_{a b c d}^{+})$ .

This argument can be modified to apply to 5 taxa, but not 6 or more, since then there is more than 1 unrooted shape.

Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a level-1 LSA quartet network and let C_v be a cycle in Q^⊕, with hybrid node v and hybrid edges h₁ and h₂, where $γ = γ_{h_{1}}$ . The following notation is used throughout this section:

• $Q_{1}^{\oplus}$ denotes the rooted quartet network obtained from Q^⊕ by removing h₂.

• $Q_{2}^{\oplus}$ denotes the rooted quartet network obtained from Q^⊕ by removing h₁.

• $Q_{0}^{\oplus}$ denotes the rooted quartet network obtained from Q^⊕ by contracting C_v; if the root of Q^⊕ is in C_v, the node obtained in the contraction process is the root of $Q_{0}^{\oplus}$ .

Note that $Q_{i}^{\oplus}$ , for i = 1,2 have degree 2 nodes, and thus are not binary. This does not affect the coalescent process in any way and by suppressing such nodes we obtain a binary LSA network. In a slight abuse of notation, we use $Q_{i}^{\oplus}$ to denote both of these networks, as needed in our arguments.

To compute concordance factors we often need to designate how many lineages are present at a hybrid node in a realization of the coalescent process. To handle this formally, given a rooted metric species network $N^{+}$ on X, we define the random variable K_v to be the number of lineages at node v, where K_v takes values in {1,...,l_v}, where l_v is the number of taxa below v. We can extend this concept to hybrid nodes in $N^{-}$ , since a hybrid node in $N^{-}$ induces an orientation of the nodes that are descending from it.

Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a level-1 LSA quartet network and let C_v be a cycle in Q^⊕, with hybrid node v, which induces a cycle $C_{v}^{'}$ in $Q_{a b c d}^{-}$ . If $C_{v}^{'}$ has size 2, then 1≤l_v≤3; if $C_{v}^{'}$ has size three, then 1 ≤ l_v ≤ 2; and if $C_{v}^{'}$ has size four then l_v = 1. For example, let Q^⊕ be the LSA network shown in the left of Figure 14 and let C_v be the cycle in Q^⊕. By unrooting Q^⊕ note that C_v induces a 3-cycle $C_{v}^{'}$ . Note also that Q⁻ is isomorphic to the network in Figure 18.

Fig. 14 — A LSA quartet Q^⊕ with a cycle C that induces a 3₂-cycle in the unrooted quartet and the graphs obtained by deleting everything below the hybrid node, disjointing, and labeling the leaves.

Fig. 18 — An unrooted quartet with a single 3₂-cycle.

We show that cycles in $Q_{a b c d}^{\oplus}$ that induce 2₁-cycles or 2₃-cycles in $Q_{a b c d}^{-}$ have no impact on concordance factors. But first we state Propositions 3 and 4, proven in [1], which are useful in arguments to come.

Proposition 3 Let $T^{+}$ be a binary rooted metric species tree on X. For |X| = 4, $T^{-}$ is identifiable from the unrooted topological gene tree distribution under the multispecies coalescent model on $T^{+}$ , but $T^{+}$ is not.

Proposition 4 Proposition 3 remains valid when $T^{+}$ is not binary.

Lemma 7 Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network and let C_v be a cycle in Q^⊕ that induces a 2₁-cycle in $Q_{a b c d}^{-}$ . Then $C F (Q^{\oplus}) = C F (Q_{0}^{\oplus})$ .

Proof Let K = K_v. Since C_v induces a 2₁-cycle in $Q_{a b c d}^{-}$ , P(K = 1) = 1. Then

C F (Q^{\oplus}) = P (K = 1) C F (Q^{\oplus} | K = 1) = P (K = 1) [γ C F (Q_{1}^{\oplus} | K = 1) + (1 - γ) C F (Q_{2}^{\oplus} | K = 1)] = γ C F (Q_{1}^{\oplus}) + (1 - γ) C F (Q_{2}^{\oplus})

If the root of Q^⊕ is not in C_v, no lineages can coalesce on the edges that differ in $Q_{1}^{\oplus}$ and $Q_{2}^{\oplus}$ since there is only one lineage in such edges. Thus,

C F (Q_{1}^{\oplus}) = C F (Q_{2}^{\oplus}) = C F (Q_{0}^{\oplus}),

and the claim is established in this case.

Now suppose the root r of Q^⊕ is in C_v, and C_v has nodes r, u, v, and edges (r,v), (r,u), (u,v). Without loss of generality suppose that the taxon below v is d. Since u is a tree node it has another descendant y. Note that $Q_{1}^{\oplus}$ and $Q_{2}^{\oplus}$ have the same topology, moreover, they just differ in the edge length from the root to y. Define a random variable K′, by K′ = 1 if there has been a coalescent event before a, b, and c trace back to y and K′ = 0 otherwise. If K′ = 1, the unrooted topology has been determined and thus

C F (Q_{1}^{\oplus} | K^{'} = 1) = C F (Q_{2}^{\oplus} | K^{'} = 1) = C F (Q_{0}^{\oplus} | K^{'} = 1) .

Also, by Proposition 4,

C F (Q_{1}^{\oplus} | K^{'} = 0) = C F (Q_{2}^{\oplus} | K^{'} = 0) = C F (Q_{0}^{\oplus} | K^{'} = 0) .

Thus $C F (Q^{\oplus}) = C F (Q_{0}^{\oplus})$ .

Lemma 8 Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network and let C_v be a cycle in Q^⊕, that induces a 2₃-cycle in $Q_{a b c d}^{-}$ . Then $C F (Q^{\oplus}) = C F (Q_{0}^{\oplus})$ .

Proof Let K = K_v, so K takes values in {1,2,3}. Therefore

C F (Q^{\oplus}) = P (K = 1) C F (Q^{\oplus} | K = 1) + P (K = 2) C F (Q^{\oplus} | K = 2) + P (K = 3) C F (Q^{\oplus} | K = 3) .

(2)

If K = 1 or 2 then at least one coalescent event has occurred, so the unrooted gene tree topology is already determined, and

C F (Q^{\oplus} | K = k) = C F (Q_{0}^{\oplus} | K = k) for k = 1, 2.

The case K = 3 requires more argument. Without loss of generality suppose that the three taxa descending from v are a, b, and c. Denote by $D$ the random variable defined by $D = 1$ if the lineage d is involved in the first coalescent event and $D = 0$ otherwise. Thus

C F (Q^{\oplus} | K = 3) = P (D = 1) C F (Q^{\oplus} | K = 3, D = 1) + P (D = 0) C F (Q^{\oplus} | K = 3, D = 0) .

(3)

If d is in the first coalescent event, by the exchangeability property of the NMSC, a, b or c are equally likely to be the other lineage involved in that event. This is the same as if the cycle was contracted, so

C F (Q^{\oplus} | K = 3, D = 1) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3}) = C F (Q_{0}^{\oplus} | K = 3, D = 1)

If d is not in the first coalescent event, this event involves only two of a,b, and c, with each pair equally likely by exchangeability. This is also the same as if the cycle was contracted, so

C F (Q^{\oplus} | K = 3, D = 0) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3}) = C F (Q_{0}^{\oplus} | K = 3, D = 0)

Thus by equations (2) and (3), $C F (Q^{\oplus}) = C F (Q_{0}^{\oplus})$ .

Together, the preceding Lemmas yield the following.

Corollary 2 Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network and let ${\tilde{Q}}^{\oplus}$ be the LSA network obtained by contracting all cycles that induce either 2₃- or 2₁-cycles in $Q_{a b c d}^{-}$ . Then $C F (Q^{\oplus}) = C F ({\tilde{Q}}^{\oplus})$ .

While 2₁- and 2₃-cycles have no impact on concordance factors, things are not quite so simple for other types of cycles.

Lemma 9 Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network and let C_v be a cycle in Q^⊕, that induces a 2₂-cycle in $Q_{a b c d}^{-}$ . Then

C F (Q^{\oplus}) = γ^{2} C F (Q_{1}^{\oplus}) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus}) + 2 γ (1 - γ) C F (Q_{0}^{\oplus}) .

Proof Let K = K_v with values in {1,2}, so that

C F (Q^{\oplus}) = P (K = 1) C F (Q^{\oplus} | K = 1) + P (K = 2) C F (Q^{\oplus} | K = 2) .

Suppose the root r of Q^⊕ is not in C_v, so C_v is also a 2₂-cycle in Q^⊕. Note that

C F (Q^{\oplus} | K = 2) = γ^{2} C F (Q_{1}^{\oplus} | K = 2) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 2) + 2 γ (1 - γ) C F (Q_{0}^{\oplus} | K = 2) .

Thus we will express CF(Q^⊕ | K = 1) in a similar fashion. If K = 1 the gene tree topology has been determined before the lineages enter v. Thus CF( $Q_{i}^{\oplus}$ | K = 1) = CF(Q^⊕ | K = 1) for i ∈ {0,1,2} and

C F (Q^{\oplus} | K = 1) = γ^{2} C F (Q_{1}^{\oplus} | K = 1) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 1) + 2 γ (1 - γ) C F (Q_{0}^{\oplus} | K = 1);

(4)

by summing the result holds when r is not in C_v.

Now suppose that r is in C_v, and C_v has nodes r, v, u. Without loss of generality suppose that the taxa below v are c and d. Since u is a tree node it has another descendant y. Define a random variable K_y to be the number of lineages at y. Note that K and K_y are independent, with values in {1,2}. If either K or K_y is 1, one coalescent event has occurred and the unrooted gene tree topology has been determined so CF( $Q_{i}^{\oplus}$ | K = 1 or K_y = 1) are equal for i ∈ {0,1,2}, and

C F (Q^{\oplus} | K = 1 or K_{y} = 1) = γ^{2} C F (Q_{1}^{\oplus} | K = 1 or K_{y} = 1) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 1 or K_{y} = 1) + 2 γ (1 - γ) C F (Q_{0}^{\oplus} | K = 1 or K_{y} = 1)

(5)

Even though equation (5) is equal to CF( $Q_{0}^{\oplus}$ | K = 1 or K_y = 1), we express it in a similar fashion to the claimed result. Now suppose that K and K_y are both 2. Let T_c and T_d be the trees shown on Figure 13. Therefore

C F (Q^{\oplus} | K = 2, K_{y} = 2) = γ^{2} C F (Q_{1}^{\oplus} | K = 2, K_{y} = 2) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 2, K_{y} = 2) + γ (1 - γ) C F (T_{c} | K_{y} = 2) + γ (1 - γ) C F (T_{d} | K_{y} = 2) .

Fig. 13 — The two trees T_d and T_c in the proof of Lemma 9, obtained when K = 2, K_y = 2 and the lineages c and d trace different hybrid edges.

By Proposition 3, CF(T_d | K_y = 2) = CF(T_c | K_y = 2), and in fact they equal CF( $Q_{0}^{\oplus}$ | K = 2, K_y = 2). This is because in $Q_{0}^{\oplus}$ the contraction of the cycle identifies the nodes r, u, and v, so conditioned on K = 2, K_y = 2 we may view the coalescent process on $Q_{0}^{\oplus}$ as that in the 4-taxon tree ((a,b) : l,(c,d) : 0) where l is the length of (u,y). By Proposition 4, CF(T_c | K_y = 2) = CF( $Q_{0}^{\oplus}$ | K = 2, K_y = 2). Therefore

C F (Q^{\oplus} | K = 2, K_{y} = 2) = γ^{2} C F (Q_{1}^{\oplus} | K = 2, K_{y} = 2) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 2, K_{y} = 2) + 2 γ (1 - γ) C F (Q_{0}^{\oplus} | K = 2, K_{y} = 2) .

This together with equation (5) implies the claim.

Lemma 10 Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network and let C_v be a cycle in Q^⊕, that induces either a 4-cycle or a 3₁-cycle in $Q_{a b c d}^{-}$ . Then

C F (Q^{\oplus}) = γ C F (Q_{1}^{\oplus}) + (1 - γ) C F (Q_{2}^{\oplus}) .

Proof Letting K = K_v, then P(K = 1) = 1. Thus,

C F (Q^{\oplus}) = P (K = 1) C F (Q^{\oplus} | K = 1) = P (K = 1) (γ C F (Q_{1}^{\oplus} | K = 1) + (1 - γ) C F (Q_{2}^{\oplus} | K = 1)) = γ C F (Q_{1}^{\oplus}) + (1 - γ) C F (Q_{2}^{\oplus}) .

It remains to consider a 3₂-cycle. For this case it helps to introduce new terminology. Let G be a semidirected graph and v be a node in G with indegree 2 and outdegree 0. Let h_v and $h_{v}^{'}$ be the edges incident to v and let u and u′ be the parent nodes in h_v and $h_{v}^{'}$ respectively. We refer to disjointing h_v and $h_{v}^{'}$ from v as the process of 1) deleting v from G; 2) introducing nodes w and w′; 3) introducing directed edges (u,w) and (u′,w′).

Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network, and C_v a cycle in Q^⊕, that induces a 3₂-cycle in $Q_{a b c d}^{-}$ . Without loss of generality suppose that a and b are the taxa below v. Let $Q_{a}^{\oplus}$ be the network obtained from Q^⊕ by 1) deleting everything below v; 2) disjointing h₁ and h₂ from v; 3) labeling a leaf that is currently unlabeled by a and the other unlabeled leaf by b. We construct $Q_{b}^{\oplus}$ by swapping the labels a and b in $Q_{a}^{\oplus}$ . Figure 14 depicts a particular example of this.

Lemma 11 Let $Q^{\oplus} = Q_{a b c d}^{\oplus}$ be a metric level-1 LSA quartet network, C_v be a cycle in Q^⊕, that induces a 3₂-cycle in $Q_{a b c d}^{-}$ and let K = K_v. Suppose that the two taxa below v are a and b, then

C F (Q^{\oplus}) = γ^{2} C F (Q_{1}^{\oplus}) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus}) + P (K = 1) 2 γ (1 - γ) C F (Q_{0}^{\oplus} | K = 1) + P (K = 2) γ (1 - γ) [C F (Q_{a}^{\oplus}) + C F (Q_{b}^{\oplus})] .

Proof By hypothesis K takes values in {1,2} and

C F (Q^{\oplus}) = P (K = 1) C F (Q^{\oplus} | K = 1) + P (K = 2) C F (Q^{\oplus} | K = 2) .

If K = 1 the unrooted tree topology has been determined and CF(Q^⊕ | K = 1) is given by the expression in equation (4). If K = 2,

C F (Q^{\oplus} | K = 2) = γ^{2} C F (Q_{1}^{\oplus} | K = 2) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 2) + γ (1 - γ) C F (Q_{a}^{\oplus}) + γ (1 - γ) C F (Q_{b}^{\oplus}) .

Therefore,

C F (Q^{\oplus}) = P (K = 1) [γ^{2} C F (Q_{1}^{\oplus} | K = 1) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 1) + 2 γ (1 - γ) C F (Q_{0}^{\oplus} | K = 1)] + P (K = 2) [γ^{2} C F (Q_{1}^{\oplus} | K = 2) + {(1 - γ)}^{2} C F (Q_{2}^{\oplus} | K = 2) + γ (1 - γ) C F (Q_{a}^{\oplus}) + γ (1 - γ) C F (Q_{b}^{\oplus})],

which yields the claim.

These Lemmas together imply that concordance factors for rooted quartet networks actually depend only on the unrooted network. This is formalized in the following.

Proposition 5 Let $Q = Q_{a b c d}^{\oplus}$ and $\tilde{Q} = {\tilde{Q}}_{a b c d}^{\oplus}$ be metric level-1 LSA quartet networks which induce the same unrooted network $Q_{a b c d}^{-} = {\tilde{Q}}_{a b c d}^{-}$ . Then $C F (Q) = C F (\tilde{Q})$ .

Proof We prove this by induction on the number of cycles in $Q_{a b c d}^{-}$ . When there are no cycles in $Q_{a b c d}^{-}$ , Q and $\tilde{Q}$ are trees, and by Proposition 3, $C F (Q) = C F (\tilde{Q})$ . Assume now the result is true when there are fewer than k +1 cycles and that $Q_{a b c d}^{-}$ has k+1 cycles. Let C_v be a cycle in $Q_{a b c d}^{-}$ with hybrid edges h₁ and h₂. By Lemmas 7, 8, 9, 10, and 11, we can express the concordance factors of Q and $\tilde{Q}$ in terms of networks with one fewer cycle. Note that these networks for Q and $\tilde{Q}$ have the same unrooted metric structure. Thus by the induction hypothesis $C F ({\tilde{Q}}_{i}) = C F (Q_{i})$ , for i = 0,1,2, and therefore $C F (\tilde{Q}) = C F (Q)$ .

Corollary 3 Let $N^{+}$ be a level-1 rooted metric network on X and let a,b,c,d be distinct taxa of X. Under the NMSC, $C F_{a b c d} = C F (Q_{a b c d}^{\oplus})$ can be computed from the unrooted network $Q_{a b c d}^{-}$ .

We indicate how to compute the concordance factors of a LSA network $Q_{a b c d}^{\oplus}$ from the unrooted quartet network $Q = Q_{a b c d}^{-}$ without having to introduce a root. For $Q = Q_{a b c d}^{-}$ an unrooted metric level-1 quartet network, where using Corollary 3 we define $C F (Q) = C F (Q_{a b c d}^{\oplus})$ :

i)
Let Q′ be the graph obtained from Q by contracting all 2₃- and 2₁-cycles. By Corollary 2, CF(Q) = CF(Q′). If Q has a 4-cycle go to step (ii), otherwise go to step (iii).
ii)
By Lemma 5 and Lemma 6 there are no 3₁-, 3₂- or 2₂-cycles in Q, and thus none in Q′. Then Q′ only has a 4-cycle so apply Lemma 10 to Q′. Since $Q_{1}^{'}$ and $Q_{2}^{'}$ are quartet trees, use the formula in equation (1) to complete the calculation.
iii)
There are at most two 3₁-cycles in Q′. Choose one arbitrarily and apply Lemma 10. If $Q_{1}^{'}$ and $Q_{2}^{'}$ still have a 3₁-cycle, apply Lemma 10 again to $Q_{1}^{'}$ and $Q_{2}^{'}$ .
iv)
We have now expressed concordance factors of Q in terms of concordance factors of unrooted quartet networks with no 2₁-, 2₃-, 3₁-, or 4-cycles. Apply Lemma 9 to these networks, by for instance choosing a 2₂-cycle with smallest graph theoretical distance from its hybrid node to a leaf, repeating until no 2-cycle remains.
v)
We now have an expression of the concordance factors of Q in terms of concordance factors of unrooted quartet networks with at most one 3₂-cycle. Apply Lemma 11. Then we have suppressed all cycles, and the concordance factors are now in terms of unrooted quartet trees. The formula of equation (1) completes the calculation.
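The order of steps (i)–(v) can be summarized as a small bookkeeping routine. The Python sketch below is only an illustration under our own conventions: cycles are represented by hypothetical type labels such as "2_1" or "3_2" rather than by an actual network data structure, and the function returns which result disposes of each cycle, without computing any concordance factors.

```python
KNOWN_TYPES = {"2_1", "2_2", "2_3", "3_1", "3_2", "4_1"}

def reduction_plan(cycles):
    """List (cycle_type, result_used) pairs in the order of steps (i)-(v)."""
    unknown = set(cycles) - KNOWN_TYPES
    if unknown:
        raise ValueError(f"unknown cycle types: {sorted(unknown)}")
    # Step (i): 2_1- and 2_3-cycles are contracted.
    plan = [(c, "contract (Corollary 2)") for c in cycles if c in ("2_1", "2_3")]
    rest = [c for c in cycles if c not in ("2_1", "2_3")]
    # Step (ii): a 4-cycle must be the only remaining cycle.
    if "4_1" in rest:
        if rest != ["4_1"]:
            raise ValueError("a 4-cycle must be the only remaining cycle (Lemmas 5 and 6)")
        return plan + [("4_1", "Lemma 10"), ("trees", "equation (1)")]
    # Steps (iii), (iv), (v), in that order.
    plan += [(c, "Lemma 10") for c in rest if c == "3_1"]
    plan += [(c, "Lemma 9") for c in rest if c == "2_2"]
    plan += [(c, "Lemma 11") for c in rest if c == "3_2"]
    return plan + [("trees", "equation (1)")]

print(reduction_plan(["2_2", "3_2", "2_1"]))
```

The routine does not check every constraint of Lemmas 5 and 6 (for example the bound on the number of 3-cycles); it only fixes the order in which the lemmas are applied.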

The use of these Lemmas and Theorem is illustrated by a few examples.

Example 2 Consider the unrooted quartet network shown in Figure 15. By Lemma 9, with $x_{i} = e^{- t_{i}}$ , the quartet concordance factors are given by:

C F_{A B | C D} = {(1 - γ)}^{2} (1 - \frac{2}{3} x_{1} x_{2} x_{3}) + 2 γ (1 - γ) (1 - \frac{2}{3} x_{1} x_{2}) + γ^{2} (1 - \frac{2}{3} x_{1} x_{2} x_{4}), C F_{A C | B D} = C F_{A D | B C} = {(1 - γ)}^{2} (\frac{1}{3} x_{1} x_{2} x_{3}) + 2 γ (1 - γ) (\frac{1}{3} x_{1} x_{2}) + γ^{2} (\frac{1}{3} x_{1} x_{2} x_{4}) .

(6)

Fig. 15 — An unrooted quartet with a single 2₂-cycle.
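To make the role of Lemma 9 in equation (6) explicit, the following Python sketch assembles the concordance factors from three quartet trees with the same topology ab|cd. The function names are ours, and the assignment of edge lengths to the trees (t₁ + t₂ + t₄ when h₂ is removed, t₁ + t₂ + t₃ when h₁ is removed, t₁ + t₂ after contraction) is an assumption read off from equation (6), not from Figure 15.

```python
import math

def tree_cf(t):
    """Equation (1) for a quartet tree ab|cd with internal edge length t."""
    x = math.exp(-t)
    return (1 - 2 * x / 3, x / 3, x / 3)

def lemma9_combine(gamma, cf1, cf2, cf0):
    """gamma^2 CF(Q1) + (1-gamma)^2 CF(Q2) + 2 gamma (1-gamma) CF(Q0), entrywise."""
    w1, w2, w0 = gamma ** 2, (1 - gamma) ** 2, 2 * gamma * (1 - gamma)
    return tuple(w1 * a + w2 * b + w0 * c for a, b, c in zip(cf1, cf2, cf0))

def example2_cf(gamma, t1, t2, t3, t4):
    cf1 = tree_cf(t1 + t2 + t4)  # assumed: tree with weight gamma^2
    cf2 = tree_cf(t1 + t2 + t3)  # assumed: tree with weight (1 - gamma)^2
    cf0 = tree_cf(t1 + t2)       # cycle contracted
    return lemma9_combine(gamma, cf1, cf2, cf0)

print(example2_cf(0.3, 0.1, 0.2, 0.3, 0.4))
```

Because all three trees share a topology, the result always has its two smallest entries equal, in line with the treelike behaviour of 2₂-cycles.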

Example 3 Consider the unrooted quartet network shown in Figure 16. By Lemma 10, with $x_{i} = e^{- t_{i}}$ , the quartet concordance factors are given by:

C F_{A B | C D} = (1 - γ) (1 - \frac{2}{3} x_{1}) + γ (1 - \frac{2}{3} x_{1} x_{2}), C F_{A C | B D} = C F_{A D | B C} = (1 - γ) (\frac{1}{3} x_{1}) + γ (\frac{1}{3} x_{1} x_{2}) .

(7)

Fig. 16 — An unrooted quartet with a single 3₁-cycle.

Example 4 Consider the unrooted quartet network shown in Figure 17. By Lemma 10, with $x_{i} = e^{- t_{i}}$ , the quartet concordance factors are given by:

C F_{A B | C D} = (1 - γ) (1 - \frac{2}{3} x_{1}) + γ (\frac{1}{3} x_{2}), C F_{A C | B D} = (1 - γ) (\frac{1}{3} x_{1}) + γ (\frac{1}{3} x_{2}), C F_{A D | B C} = (1 - γ) (\frac{1}{3} x_{1}) + γ (1 - \frac{2}{3} x_{2}) .

(8)

Fig. 17 — An unrooted quartet with a single 4₁-cycle.
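Equation (8) can be reproduced by combining two quartet trees of different topologies with Lemma 10. The Python sketch below does this; the function names and the `displayed` index convention (0 for AB|CD, 1 for AC|BD, 2 for AD|BC) are our own, and the pairing of edge lengths with trees is read off from equation (8).

```python
import math

def tree_cf(t, displayed):
    """Equation (1) for a quartet tree whose displayed quartet has index
    `displayed` (0: AB|CD, 1: AC|BD, 2: AD|BC) and internal edge length t."""
    x = math.exp(-t)
    cf = [x / 3, x / 3, x / 3]
    cf[displayed] = 1 - 2 * x / 3
    return tuple(cf)

def lemma10_combine(gamma, cf1, cf2):
    """gamma CF(Q1) + (1 - gamma) CF(Q2), entrywise."""
    return tuple(gamma * a + (1 - gamma) * b for a, b in zip(cf1, cf2))

def example4_cf(gamma, t1, t2):
    cf1 = tree_cf(t2, displayed=2)  # AD|BC tree, weight gamma
    cf2 = tree_cf(t1, displayed=0)  # AB|CD tree, weight 1 - gamma
    return lemma10_combine(gamma, cf1, cf2)

print(example4_cf(0.4, 0.3, 0.8))
```

For 0 < γ < 1 and positive edge lengths the entry for AC|BD, the quartet displayed by neither tree, is strictly the smallest of the three.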

Example 5 Consider the unrooted quartet network shown in Figure 18. Given K = 1, one coalescent event has occurred below the hybrid node, so a and b coalesced. Therefore CF(Q₀ | K = 1) = (1,0,0). By Lemma 11, with $x_{i} = e^{- t_{i}}$ , the quartet concordance factors are given by:

C F_{A B | C D} = {(1 - γ)}^{2} (1 - \frac{2}{3} x_{1} x_{2}) + 2 γ (1 - γ) (1 - x_{1} + \frac{1}{3} x_{1} x_{3}) + γ^{2} (1 - \frac{2}{3} x_{1} x_{4}), C F_{A C | B D} = C F_{A D | B C} = {(1 - γ)}^{2} (\frac{1}{3} x_{1} x_{2}) + γ (1 - γ) x_{1} (1 - \frac{1}{3} x_{3}) + γ^{2} (\frac{1}{3} x_{1} x_{4}) .

(9)
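The terms of equation (9) can be traced back to Lemma 11 numerically. In the Python sketch below, P(K = 1) = 1 − x₁ and P(K = 2) = x₁, and the trees $Q_{a}^{\oplus}$ and $Q_{b}^{\oplus}$ are taken to have topologies AC|BD and AD|BC with internal edge length t₃. These identifications, like the function names, are assumptions inferred from equation (9) rather than from Figure 18.

```python
import math

def tree_cf(t, displayed):
    """Equation (1); `displayed` indexes the tree topology
    (0: AB|CD, 1: AC|BD, 2: AD|BC)."""
    x = math.exp(-t)
    cf = [x / 3, x / 3, x / 3]
    cf[displayed] = 1 - 2 * x / 3
    return tuple(cf)

def example5_cf(gamma, t1, t2, t3, t4):
    g = gamma
    x1 = math.exp(-t1)
    p_k1, p_k2 = 1 - x1, x1        # a and b coalesce below v, or do not
    cf1 = tree_cf(t1 + t4, 0)      # term with weight gamma^2
    cf2 = tree_cf(t1 + t2, 0)      # term with weight (1 - gamma)^2
    cf0_k1 = (1.0, 0.0, 0.0)       # CF(Q0 | K = 1)
    cfa = tree_cf(t3, 1)           # assumed topology AC|BD
    cfb = tree_cf(t3, 2)           # assumed topology AD|BC
    return tuple(
        g ** 2 * q1 + (1 - g) ** 2 * q2
        + p_k1 * 2 * g * (1 - g) * q0
        + p_k2 * g * (1 - g) * (qa + qb)
        for q1, q2, q0, qa, qb in zip(cf1, cf2, cf0_k1, cfa, cfb)
    )

print(example5_cf(0.5, 0.2, 0.3, 1.0, 0.1))
```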

The formulas in Examples 1–5 agree with those in [23].

6. The Cycle property

In this section we focus on the ordering by magnitude of the concordance factors.

Proposition 6 Let $Q = Q_{a b c d}^{-}$ be a metric unrooted level-1 quartet network with no 3₂-cycle. The ordering of CF_abcd(Q) is the ordering of CF_abcd(Q′) where Q′ is obtained from Q by contracting all 2-cycles and all 3₁-cycles.

Proof By Corollary 2, CF(Q) = CF(Q^∗), where Q^∗ is obtained from Q by contracting all 2₁- and 2₃-cycles. Therefore we can assume Q has no 2₁- or 2₃-cycles. If Q has a 4-cycle, it has no 3₁- and no 2₂-cycles and the claim is established.

Now suppose Q has only 2₂-cycles and 3₁-cycles. We proceed by induction on the number of cycles, with the base case of 0 cycles trivial. Assume the result is true for unrooted quartet networks with k 3₁- and 2₂-cycles and suppose Q has k + 1. Picking one cycle and applying one of Lemmas 9 or 10 to Q, we can express the concordance factors of Q as a convex combination of CF(Q₀), CF(Q₁) and CF(Q₂). Note that Q₀, Q₁ and Q₂ have the same topology and by induction hypothesis, CF(Q₀), CF(Q₁) and CF(Q₂) have the same ordering as the concordance factors of $Q_{0}^{'}$ , $Q_{1}^{'}$ and $Q_{2}^{'}$ respectively, the networks obtained after contracting all 2₂- and 3₁-cycles from Q₀, Q₁ and Q₂. Since $Q_{0}^{'}$ , $Q_{1}^{'}$ , $Q_{2}^{'}$ and Q′ are trees with the same topology, their concordance factors have the same ordering by equation (1). Thus CF(Q₀), CF(Q₁) and CF(Q₂) have the same ordering, and therefore so does CF(Q).

One consequence of Proposition 6 is that for any unrooted metric level-1 quartet network Q without a 3₂- or a 4-cycle, the ordering of the concordance factors is the same as the ordering of the concordance factors of a quartet tree. That is, the two smallest elements of the concordance factors are equal. When this happens we say that Q is treelike, since we could use equation (1) to find a quartet tree with appropriate edge lengths and concordance factors equal to CF(Q). However, not all unrooted quartet networks are treelike.

Example 6 Let $Q_{a b c d}^{-}$ be the unrooted 3₂-cycle quartet in Figure 18, where $γ = \frac{1}{2}$ , $t_{1} = - \log (\frac{6}{7})$ , $t_{2} = - \log (\frac{6}{7})$ , $t_{3} = - \log (\frac{1}{14})$ and $t_{4} = - \log (\frac{13}{14})$ . By the equations in (9) we observe that the concordance factors are:

C F_{A B | C D} = \frac{32}{98}, C F_{A C | B D} = \frac{33}{98}, C F_{A D | B C} = \frac{33}{98} .
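These values can be checked with exact rational arithmetic. The Python sketch below evaluates the equations in (9) directly in terms of x_i = e^{−t_i}, so that x₁ = x₂ = 6/7, x₃ = 1/14 and x₄ = 13/14; the function name is ours.

```python
from fractions import Fraction as F

def cf_32_cycle(gamma, x1, x2, x3, x4):
    """Equations (9), written in terms of x_i = exp(-t_i)."""
    g = gamma
    ab_cd = ((1 - g) ** 2 * (1 - F(2, 3) * x1 * x2)
             + 2 * g * (1 - g) * (1 - x1 + F(1, 3) * x1 * x3)
             + g ** 2 * (1 - F(2, 3) * x1 * x4))
    ac_bd = ((1 - g) ** 2 * F(1, 3) * x1 * x2
             + g * (1 - g) * x1 * (1 - F(1, 3) * x3)
             + g ** 2 * F(1, 3) * x1 * x4)
    return (ab_cd, ac_bd, ac_bd)  # CF_AC|BD = CF_AD|BC in (9)

cf = cf_32_cycle(F(1, 2), F(6, 7), F(6, 7), F(1, 14), F(13, 14))
print(cf)  # Fractions are reduced, so 32/98 is shown as 16/49
```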

The fact that such a quartet network can fail to be treelike was identified in [24], where it was pointed out that this may cause species tree methods not to be robust to the presence of gene flow.

This motivates the following definition.

Definition 16 Let $N^{+}$ be a metric rooted level-1 network on X. We say that a set of four distinct taxa s = {a,b,c,d} satisfies the Cycle property if $Q_{s}^{-}$ is not treelike, that is, if the two smallest values of $C F_{s} = C F (Q_{s}^{-})$ are not equal.

The Cycle property is best understood geometrically. Denote by ∆₂ the 2-dimensional probability simplex, the set of points in $ℝ^{3}$ with nonnegative entries adding to 1. Observe that CF_abcd ∈ ∆₂ for any distinct taxa a,b,c,d. Figure 19 (left) depicts the simplex where the black lines are the points where the Cycle property is not satisfied; that is, the treelike unrooted quartet networks are those with concordance factors (x,y,z) satisfying $x > \frac{1}{3}$ , y = z or $y > \frac{1}{3}$ , x = z or $z > \frac{1}{3}$ , x = y. All points off these segments satisfy the Cycle property. For simplicity in arguments to come, note that we can interpret concordance factors, CF_abcd, as a function that depends on a metric network on {a,b,c,d} and has for image points in ∆₂.
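Definition 16 translates into a one-line test on a concordance factor triple. The Python sketch below uses exact fractions so that equality of entries is meaningful; the function name is ours, and with floating-point or estimated concordance factors a tolerance or a statistical test would be needed instead.

```python
from fractions import Fraction as F

def satisfies_cycle_property(cf):
    """True when the two smallest entries of the CF triple differ (Definition 16)."""
    smallest, second, _largest = sorted(cf)
    return smallest != second

print(satisfies_cycle_property((F(1, 2), F(1, 4), F(1, 4))))        # treelike triple
print(satisfies_cycle_property((F(32, 98), F(33, 98), F(33, 98))))  # triple of Example 6
```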

Fig. 19 — On the left a planar projection of the simplex ∆₂, where the black lines represent concordance factors that are treelike. In the center, the gray segments in ∆₂ represent all the concordance factors arising from unrooted quartet networks with a 3₂-cycle. On the right, the black lines represent the variety V ((x − z)(y − z)(x − y),x + y + z − 1), these are all concordance factors not satisfying the BC property of Definition 17

Proposition 7 Let $Q = Q_{a b c d}^{-}$ be a metric unrooted level-1 quartet network with a 3₂-cycle. Then CF(Q) lies in the set $I$ defined by $x > \frac{1}{6}$ , y = z or $y > \frac{1}{6}$ , x = z or $z > \frac{1}{6}$ , x = y, shown on the middle of Figure 19. Furthermore, for any point (x,y,z) in this set there is such a Q with (x,y,z) = CF(Q).

Proof Let s = {a,b,c,d} be a set of four distinct taxa and suppose that $Q_{s}^{-}$ contains only a 3₂-cycle, as in Figure 18. Then CF( $Q_{s}^{-}$ ) is given by the equations (9) with $x_{i} = e^{- t_{i}}$ , and in particular CF_AC|BD = CF_AD|BC. To maximize CF_AD|BC in (9), let t_i → 0 for i ∈ {1,2,4} and t₃ → ∞ to obtain a quadratic polynomial in γ,

C F_{A D | B C} \to \frac{1}{3} {(1 - γ)}^{2} + γ (1 - γ) + \frac{1}{3} γ^{2},

whose maximum value $\frac{5}{12}$ is attained at $γ = \frac{1}{2}$ . For these values, we obtain $C F (Q_{s}^{-}) \to (\frac{2}{12}, \frac{5}{12}, \frac{5}{12})$ . To minimize CF_AD|BC it is enough to let t₁ → ∞, so that x₁ → 0 in (9) and $C F (Q_{s}^{-}) \to (1, 0, 0)$ .
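The maximization over γ can be double-checked with exact arithmetic. The limiting polynomial simplifies to 1/3 + γ/3 − γ²/3, whose vertex is at γ = 1/2. The Python sketch below, with names of our choosing, confirms the value on a grid of rational points; the grid scan is a sanity check, not a proof.

```python
from fractions import Fraction as F

def limit_cf_ad_bc(gamma):
    """Limit of CF_AD|BC as t1, t2, t4 -> 0 and t3 -> infinity."""
    return F(1, 3) * (1 - gamma) ** 2 + gamma * (1 - gamma) + F(1, 3) * gamma ** 2

grid = [F(k, 1000) for k in range(1001)]   # gamma = 0, 0.001, ..., 1
best = max(grid, key=limit_cf_ad_bc)
value = limit_cf_ad_bc(best)
print(best, value, 1 - 2 * value)          # gamma, CF_AD|BC, CF_AB|CD
```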

Let $L$ be the open line segment with endpoints (1,0,0) and $(\frac{2}{12}, \frac{5}{12}, \frac{5}{12})$ . Since $C F (Q_{s}^{-})$ is continuous in t_i and γ, its image is a connected set on the line (x,y,y) containing points arbitrarily close to the endpoints of $L$ . Thus the image of $C F (Q_{s}^{-})$ is $L$ . Permuting taxon names shows every point in the set $I$ is a concordance factor for a network with a 3₂-cycle.

Now suppose $Q_{s}^{-}$ has a 3₂-cycle with a,b descending from the hybrid node, and possibly other cycles. We may contract all 2₁- and 2₃-cycles by Corollary 2 without affecting $C F (Q_{s}^{-})$ . By Lemmas 9 and 10, we may suppress 2₂- and 3₁-cycles by expressing $C F (Q_{s}^{-})$ as a convex sum of concordance factors of networks with a 3₂-cycle, but one fewer cycle. Thus $C F (Q_{s}^{-})$ is a convex sum of points in $L$ , which lies in $L$ .

In the supplementary materials of [23] it is stated that an unrooted quartet network Q_abcd with a 3₂-cycle can be always reduced to an unrooted quartet tree with some adjustment in the edge lengths. This is not true in general; that is, when {a,b,c,d} satisfies the Cycle property it is not treelike. However, Proposition 7 indicates that sometimes unrooted quartet networks with 3₂-cycles are treelike.

To conclude this section, we show the Cycle property can give positive information about a network.

Proposition 8 Let $Q_{s}^{-}$ be an unrooted level-1 quartet network on a set of taxa s = {a,b,c,d}. If s satisfies the Cycle property, the unrooted quartet network $Q_{s}^{-}$ contains either a 3₂-cycle or a 4-cycle.

Proof Proposition 6 shows that if $Q_{s}^{-}$ has neither a 3₂-cycle nor a 4-cycle, the concordance factors of $Q_{s}^{-}$ are those of a tree.

7. The Big Cycle property

In this section we investigate how to detect 4-cycles in a network from quartet concordance factors.

Although the Cycle property gives us some information about an unrooted quartet network, it is not sufficient to tell us what the unrooted quartet network is. This is shown by the following example, where a 4-cycle network leads to the same concordance factors as those in Example 6.

Example 7 Let ${\tilde{Q}}_{a b c d}^{-}$ be the 4-cycle unrooted quartet in Figure 17, where $γ = \frac{1}{2}$ , $t_{1} = - \log (\frac{48}{49}) = t_{2}$ . By the equations in (8) the concordance factors are:

C F_{A B | C D} = \frac{32}{98}, \quad C F_{A C | B D} = \frac{33}{98}, \quad C F_{A D | B C} = \frac{33}{98} .

These agree with those of $Q_{a b c d}^{-}$ in Example 6.

This motivates the following definition.

Definition 17 Let $N^{+}$ be a metric rooted level-1 network on X. We say that a subset of four distinct taxa {a,b,c,d} ⊂ X satisfies the Big Cycle property (denoted BC) if all the entries of CF_abcd are different.

Let {a,b,c,d} be a subset of taxa satisfying the BC property. Denote by $q_{a b c d}^{B C}$ the unrooted quartet corresponding to the smallest entry of CF_abcd.

For example, if CF_AB|CD < CF_AC|BD < CF_AD|BC, then $q_{a b c d}^{B C} = A B | C D$ .
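Definition 17 can be illustrated by a short sketch. It assumes exact (noiseless) concordance factors stored as rational numbers in the order (AB|CD, AC|BD, AD|BC); the function names `satisfies_bc` and `q_bc` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

# Fixed order of the three quartets; this ordering is an assumption of the sketch.
QUARTETS = ("AB|CD", "AC|BD", "AD|BC")

def satisfies_bc(cf):
    """True when all three concordance factors are pairwise different."""
    return len(set(cf)) == 3

def q_bc(cf):
    """Quartet corresponding to the smallest entry; defined only under BC."""
    if not satisfies_bc(cf):
        raise ValueError("BC property fails: two entries coincide")
    smallest = min(range(3), key=lambda i: cf[i])
    return QUARTETS[smallest]

# A triple with three different entries, and the triple of Example 7.
cf_generic = (Fraction(1, 6), Fraction(1, 3), Fraction(1, 2))
cf_example7 = (Fraction(32, 98), Fraction(33, 98), Fraction(33, 98))
```

With these inputs, `q_bc(cf_generic)` returns `"AB|CD"`, while `satisfies_bc(cf_example7)` is false because two entries coincide. Exact equality is only meaningful for theoretical concordance factors; empirical estimates would need a statistical criterion instead.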

Note that if s satisfies the BC property then s satisfies the Cycle property, but not conversely: the Cycle property is weaker than the Big Cycle property.

Proposition 9 Let $Q_{s}^{-}$ be an unrooted level-1 quartet network on a set of taxa s = {a,b,c,d}. If s satisfies the BC property, then the unrooted quartet network $Q_{s}^{-}$ contains a 4-cycle.

Proof By Proposition 8, $Q_{s}^{-}$ contains either a 3₂-cycle or a 4-cycle, and by Proposition 7, $Q_{s}^{-}$ cannot have a 3₂-cycle.

A converse of Proposition 9 also holds, provided we include an assumption of generic parameters.

Proposition 10 Let $N^{+}$ be a metric rooted level-1 network on X with |X| ≥ 4. Let {a,b,c,d} ⊂ X such that $Q_{a b c d}^{-}$ has a 4-cycle. Then {a,b,c,d} satisfies the Cycle property. Moreover, for generic numerical parameters on $N^{+}$ , {a,b,c,d} satisfies the BC property. That is, for all numerical parameters except those in a set of measure zero, the BC property holds.

Proof Let s = {a,b,c,d} ⊂ X be such that $Q_{s}^{-}$ has a 4-cycle. Without loss of generality suppose that c is the descendant of the hybrid node and the hybrid block {c} of $Q_{s}^{-}$ is adjacent to the v-blocks containing b and d. Since $N^{-}$ is level-1, the only other possible cycles in $Q_{s}^{-}$ are 2₁ or 2₃-cycles. By Corollary 2, $C F (Q_{s}^{-}) = C F (Q^{'})$ , where Q′ is the network obtained after contracting all cycles other than the 4-cycle. Note that Q′ is the network shown in Figure 17, and by equations (6), CF(Q′) depends only on the length of the non-hybrid edges in the 4-cycle and the γ parameter of the hybrid edges of $Q_{s}^{-}$ . Moreover, equations (6) show that {a,b,c,d} satisfies the Cycle property.

When $Q_{s}^{-}$ is obtained from $N^{-}$ , the lengths of the edges of $Q_{s}^{-}$ are sums of edge lengths from $N^{-}$ . Let $Θ_{N^{-}} = {(0, \infty)}^{m} \times {[0, 1]}^{h}$ be the numerical parameter space for $N^{-}$ and let $Θ_{s}^{'} = {(0, \infty)}^{2} \times [0, 1]$ . Thus we can define a map $ν_{s} : Θ_{N^{-}} \to Θ_{s}^{'}$ such that for any metric (λ,γ) of $N^{-}$ , ν_s((λ,γ)) encodes the lengths of the non-hybrid edges in the 4-cycle and the γ parameter of the hybrid edges. In particular this map is linear and surjective.

With χ_s = (0,1)² × [0,1], let $η : Θ_{s}^{'} \to χ_{s}$ be defined as $η (l_{1}, l_{2}, γ) = (e^{- l_{1}}, e^{- l_{2}}, γ)$ , so η is a biholomorphic function. Defining f : χ_s → ∆₂ by f((L₁,L₂,γ)) = (1 − γ)(1 − 2L₁/3,L₁/3,L₁/3) + γ(L₂/3,L₂/3,1 − 2L₂/3), the quartet concordance factor map can be viewed as a composition

Θ_{N^{-}} \overset{ν_{s}}{\to} Θ_{s}^{'} \overset{η}{\to} χ_{s} \overset{f}{\to} Δ_{2} .

It is straightforward to see that the image of f restricted to γ = 0 and γ = 1 is the red (skewed) and blue (vertical) segments shown on the right of Figure 20.
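The map f can be evaluated directly, which makes the two boundary cases easy to inspect. The following is a minimal sketch that transcribes the formula for f given above, using exact rational arithmetic; the sample values of L₁, L₂ and γ are arbitrary choices made for illustration.

```python
from fractions import Fraction as F

def f(L1, L2, gamma):
    """Concordance factors (AB|CD, AC|BD, AD|BC) as a mixture of two tree points."""
    tree1 = (1 - F(2, 3) * L1, L1 / 3, L1 / 3)
    tree2 = (L2 / 3, L2 / 3, 1 - F(2, 3) * L2)
    return tuple((1 - gamma) * u + gamma * w for u, w in zip(tree1, tree2))

L1, L2 = F(1, 2), F(1, 4)          # arbitrary points of (0,1)
gamma0 = f(L1, L2, F(0))           # facet gamma = 0: last two entries equal
gamma1 = f(L1, L2, F(1))           # facet gamma = 1: first two entries equal
inner = f(L1, L2, F(1, 3))         # an interior point of the cube
```

Here `gamma0` is (2/3, 1/6, 1/6) and `gamma1` is (1/12, 1/12, 5/6), so both boundary facets land on points with two equal entries, while `inner` is (17/36, 5/36, 14/36), whose entries are all different and sum to 1.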

Fig. 20 — The function f maps the cube χ_s (left) to ∆₂ (right). The blue facets (rear and top) of the cube are mapped by f to the blue (vertical) segment and the red facets (bottom and right) to the red (skewed) segment. The full cube is mapped onto the shaded triangle, which consists of all the concordance factors displayed by a network with a 4-cycle. The three line segments, two on the boundary of and one within the shaded triangle, consist of points not satisfying the BC property.

Let V = V ((x − z)(y − z)(x − y),x + y + z − 1), that is, let V be the algebraic variety composed of the points on which (x − z)(y − z)(x − y) and x+y+z −1 are zero, as depicted on the right of Figure 19. Observe that V consists of the points in ∆₂ that, if interpreted as concordance factors, would not satisfy the BC property.

Since f is a polynomial map whose image is not contained in V , the preimage of V under f is contained in a proper sub-variety of χ_s, and therefore f⁻¹(V ) has measure zero in χ_s. Since η is biholomorphic, η⁻¹(f⁻¹(V )) has measure zero. Since ν_s is linear and surjective, ν_s⁻¹(η⁻¹(f⁻¹(V ))) has measure zero. Thus generic points in $Θ_{N^{-}}$ are mapped to concordance factors satisfying the BC property.
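The preimage of V can also be described explicitly, which makes the measure-zero claim concrete. Writing f(L₁,L₂,γ) = (x,y,z), a direct computation from the definition of f gives

x - y = (1 - γ) (1 - L_{1}), \qquad y - z = - γ (1 - L_{2}), \qquad x - z = (1 - γ) (1 - L_{1}) - γ (1 - L_{2}) .

Since 0 < L₁, L₂ < 1 on χ_s, the product (x − z)(y − z)(x − y) vanishes exactly when γ = 0, γ = 1, or (1 − γ)(1 − L₁) = γ(1 − L₂). Each of these conditions defines a surface in the three-dimensional cube χ_s, and so a set of measure zero. Away from γ = 0 and γ = 1, these expressions also show that y is strictly smaller than both x and z.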

To better understand the geometry of the map f in this proof, let s = {a,b,c,d} be a subset of four distinct taxa satisfying the BC property. Figure 20 depicts the subset of χ_s that is mapped by f to those segments of the shaded triangle inside ∆₂. The interior of χ_s is mapped to the interior of the shaded triangle.

The following Theorem follows immediately from Proposition 10 and Proposition 9.

Theorem 3 Let $N^{+}$ be a metric rooted level-1 network on X with |X| ≥ 4 and {a,b,c,d} ⊂ X. For generic numerical parameters, {a,b,c,d} satisfies the BC property if and only if $Q_{a b c d}^{-}$ has a 4-cycle.

Theorem 3 and Proposition 8 yield the following.

Corollary 4 Let $N^{-}$ be a metric unrooted level-1 network on X and let s = {a,b,c,d} be a set of distinct taxa in X. If, for generic parameters, s satisfies the Cycle property but not the BC property, then $Q_{s}^{-}$ contains a 3₂-cycle.

The converse of Corollary 4 does not hold, as pointed out by Proposition 7.

If a set of 4 taxa satisfies the BC property, we can deduce some finer information about the 4-cycle in the unrooted quartet network and in a larger network, as proved in the following.

Proposition 11 Let $N^{-}$ be a metric unrooted level-1 network on X and let {a,b,c,d} ⊆ X satisfy the BC property, so $Q_{a b c d}^{-}$ contains a 4-cycle C_v. Then $q_{a b c d}^{B C} = A C | B D$ if and only if the v-blocks of $Q_{a b c d}^{-}$ containing a and c are not adjacent.

Proof Let $Q = Q_{a b c d}^{-}$ . Since $N^{-}$ is level-1 the only possible cycles in Q, other than C_v, are 2₁ and 2₃-cycles. Let Q′ be the network obtained after contracting all 2₁ and 2₃-cycles, so Q′ has only a four cycle. By Corollary 2, CF(Q) = CF(Q′). Example 4 shows that if the v-blocks of $Q_{a b c d}^{-}$ containing a and c are not adjacent then $q_{a b c d}^{B C} = A C | B D$ . Interchanging taxon labels in this example shows that when $q_{a b c d}^{B C} = A C | B D$ , then a and c are not adjacent.

Lemma 12 Let $N^{-}$ be a metric unrooted level-1 network on X with generic numerical parameters. There exists {a,b,c,d} ⊆ X satisfying the BC property if and only if $N^{-}$ contains a cycle C_v of size k ≥ 4 with one of these taxa in the hybrid block, and the others in distinct v-blocks of $N^{-}$ .

Proof Suppose that $N^{-}$ has a cycle of size k for some k ≥ 4 with hybrid node v. Choose four taxa {a,b,c,d}, such that a is in the hybrid block and a,b,c and d are in distinct v-blocks. This set of taxa induces an unrooted quartet network with a 4-cycle, and so by Theorem 3 this set of taxa satisfies the BC property for generic parameters. Suppose conversely that there exists {a,b,c,d} satisfying the BC property. By Theorem 3, $Q_{a b c d}^{-}$ has a 4-cycle, so $N^{-}$ has a cycle of size at least four and one of these taxa is a descendant of the hybrid node. Since the other taxa are in distinct v-blocks of $Q_{a b c d}^{-}$ , they must be in distinct v-blocks of $N^{-}$ .

For a level-1 metric unrooted network $N^{-}$ , let S be the collection of sets of 4 distinct taxa satisfying the BC property and V_H be the set of hybrid nodes. There is a natural map ψ : S → V_H, where ψ(s) = v if v is the hybrid node associated to the cycle of size 4 in $Q_{s}^{-}$ . In this case we say that s determines the hybrid node v.

Lemma 13 Let $N^{-}$ be a metric unrooted level-1 network and let {a,b,c,d} and {a,b,c,e} be subsets of the taxa satisfying the BC property. The set {a,b,c,d} determines v if and only if {a,b,c,e} determines v.

Proof Let {a,b,c,d} determine v, {a,b,c,e} determine u, and suppose that u ≠ v. Let C_v and C_u be the cycles in $N^{-}$ containing v and u respectively, so C_u and C_v do not share edges. Since {a,b,c,d} satisfies the BC property, by Lemma 12, a, b, c, and d belong to different v-blocks, so that in $N^{-}$ \E(C_v) the taxa a, b and c are in different connected components. Since $N^{-}$ is level-1, C_u is in one of the connected components of $N^{-}$ \E(C_v), say $K$ . In particular note that all the taxa not in $K$ are in the same u-block. But at least two of a,b and c are not in $K$ , so at least two of a, b and c are in the same u-block. This contradicts Lemma 12, so u = v.

Interestingly, under the NMSC the ordering of quartet concordance factors is insufficient to identify the hybrid node of cycles of size 4. For example, the networks shown in Figure 21 all have the same ordering of their concordance factors despite different hybrid nodes. The concordance factors for all those networks have the same values:

C F_{A B | C D} = (1 - γ) (1 - \frac{2}{3} e^{- t_{1}}) + γ (\frac{1}{3} e^{- t_{2}}), C F_{A C | B D} = (1 - γ) (\frac{1}{3} e^{- t_{1}}) + γ (\frac{1}{3} e^{- t_{2}}), C F_{A D | B C} = (1 - γ) (\frac{1}{3} e^{- t_{1}}) + γ (1 - \frac{2}{3} e^{- t_{2}}) .

Fig. 21 — Four unrooted metric level-1 quartet networks with the same concordance factors.

Figure 22 shows the 4-cycle network topologies drawn in the regions of ∆₂ which their concordance factors fill. In each case it does not matter which of the cycle nodes is the hybrid node; all those unrooted quartet networks define concordance factors that fill that region.

Fig. 22 — Each section of the simplex is depicted with an unrooted quartet network topology whose image under the concordance factor map fills that region, independent of the placement of the hybrid node.

8. Identifying cycles in networks

Having shown that the BC property can detect the existence of 4-cycles in networks, for generic parameters, we are poised to prove our main result. Our arguments now are mainly combinatorial.

Given a network $N^{+}$ on X, let S denote the set of 4-taxon subsets of X satisfying the BC property. Recall that for an unrooted level-1 network $N^{-}$ on X, the 4-network partition is the partition of X according to the connected components of the graph obtained after removing all cycles of size at least 4 from $N^{-}$ . Recall also that the blocks of such a partition are referred to as 4-network blocks.

Lemma 14 Let $N^{+}$ be a metric rooted level-1 network on X. Then under the NMSC model with generic parameters the 4-network blocks of $N^{+}$ can be determined from the set S.

Proof If |X| ≤ 3 there is nothing to prove. The case |X| = 4 follows from Proposition 9, so we assume |X| ≥ 5. By Lemma 12, for any {a,b,c,d} ∈ S each taxon a, b, c, d must belong to a different 4-network block. Let

Y_{a} = \bigcup_{s \in S, \, a \in s} s \setminus \{a\} .

Then Y_a is the complement of the 4-network block containing a. To see this, note that for any taxon b that does not belong to the 4-network block of a, by Lemma 4, there exists a cycle C_v of size at least 4 such that a and b are in different v-blocks. Now choose any two different taxa c and d, such that all taxa a, b, c, d are in different v-blocks and one of a, b, c or d is in the v-hybrid block. Then {a,b,c,d} ∈ S, and thus b ∈ Y_a.

It follows that X \ Y_x is the 4-network block containing taxon x. Since x was arbitrary, all 4-network blocks can be determined.
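The construction in this proof can be sketched on a toy example. The sketch assumes a network with a single cycle of size at least 4, for which S is modeled, following Lemma 12, as all 4-taxon sets containing one taxon of the hybrid block and three taxa from three other distinct v-blocks. The helper names `bc_quartets` and `four_network_blocks` and the taxon labels are hypothetical.

```python
from itertools import combinations, product

def bc_quartets(blocks, hybrid=0):
    """Toy model of S for one cycle: one taxon from the hybrid block plus
    one taxon from each of three other, pairwise distinct, v-blocks."""
    others = [b for i, b in enumerate(blocks) if i != hybrid]
    S = set()
    for trio in combinations(others, 3):
        for pick in product(blocks[hybrid], *trio):
            S.add(frozenset(pick))
    return S

def four_network_blocks(X, S):
    """For each taxon a, form Y_a and return the complements X minus Y_a."""
    result = set()
    for a in X:
        Y_a = set().union(*(s - {a} for s in S if a in s))
        result.add(frozenset(X - Y_a))
    return result

# A 5-cycle whose hybrid block is {h1, h2}; with a single cycle the
# 4-network blocks coincide with the v-blocks.
blocks = [{"h1", "h2"}, {"b"}, {"c1", "c2"}, {"d"}, {"e"}]
X = set().union(*blocks)
S = bc_quartets(blocks)
recovered = four_network_blocks(X, S)
```

In this example `recovered` equals the original collection of blocks, since no set in S ever contains two taxa of the same block.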

Lemma 15 Let $N^{+}$ be a metric rooted level-1 network on X with cycle C_v of size k_v ≥ 4. Then for generic parameter choices, the v-blocks and the size k_v can be identified from the set S. If k_v ≥ 5 the v-hybrid block can also be identified.

Proof Let {a,b,c,d} ∈ S and let v be the hybrid node determined by it. By Lemma 12, each of these taxa belongs to a different v-block, and hence to a different 4-network block. Denote by A,B,C,D the v-blocks containing a,b,c and d respectively.

Let Z_abc be the set of all taxa e such that {a,b,c,e} ∈ S. By Lemma 13, all such {a,b,c,e} ∈ S determine the same hybrid node v. Consider now Z_bcd, Z_acd and Z_abd. If k_v = 4, then, by the last statement of Lemma 12, Z_abc = D, Z_bcd = A, Z_acd = B and Z_abd = C, so all pairwise intersections of Z_abc, Z_bcd, Z_acd, Z_abd are empty. If k_v > 4, then, again by Lemma 12, for some distinct taxa i,j,k ∈ {a,b,c,d}, Z_ijk is the v-hybrid block, and for any l,m,n ∈ {a,b,c,d} with {l,m,n} ≠ {i,j,k}, Z_lmn = (L ∪ M ∪ N)^c. Note that Z_ijk ∩ Z_lmn = ∅ since one of L,M,N is the v-hybrid block. Since Z_lmn contains at least one v-block other than A, B, C or D, for any l′,m′,n′ ∈ {a,b,c,d}, with {l′,m′,n′} ≠ {i,j,k}, Z_lmn ∩ Z_l′m′n′ ≠ ∅. Hence we can determine whether k_v > 4 or k_v = 4: if all pairwise intersections of Z_abc, Z_bcd, Z_acd, Z_abd are empty then k_v = 4, else k_v > 4. If k_v > 4 we can determine the hybrid block, by noting which of the sets Z_abc, Z_bcd, Z_acd, Z_abd has empty intersection with every other set in this family. At this point we have determined either that k_v = 4 and all v-blocks, or that k_v > 4 and the hybrid block.

In the case k_v > 4, without loss of generality, suppose that A is the v-hybrid block. Let y ∉ Z_abc = (A ∪ B ∪ C)^c, so y is in one of A, B and C. For some u,w ∈ {a,b,c}, s′ = {y,u,w,d} ∈ S, which shows y and the taxon g ∈ {a,b,c} \ {u,w} are in the same v-block. Thus we can determine A, B and C.

Note that for any taxon x that is not in any of A, B or C, then s ={a,x,b,c} ∈ S. Since s determines v, following the steps of the last paragraph identifies the v-block that contains x. Therefore all v-blocks can be determined, and thus k_v as well.
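The first step of this proof, deciding whether k_v = 4 and locating the hybrid block when k_v > 4, can be sketched as follows. As an assumption of the sketch, S is generated from a single cycle with known v-blocks, as in Lemma 12; the helper names `bc_quartets`, `Z` and `classify` are hypothetical.

```python
from itertools import combinations, product

def bc_quartets(blocks, hybrid=0):
    """Toy model of S: one hybrid-block taxon plus three taxa from
    three other, pairwise distinct, v-blocks."""
    others = [b for i, b in enumerate(blocks) if i != hybrid]
    S = set()
    for trio in combinations(others, 3):
        for pick in product(blocks[hybrid], *trio):
            S.add(frozenset(pick))
    return S

def Z(S, trio):
    """All taxa e such that the trio together with e is an element of S."""
    trio = frozenset(trio)
    return {next(iter(s - trio)) for s in S if trio < s}

def classify(S, quartet):
    """Return (True, None) if k_v = 4, else (False, hybrid block)."""
    zs = {t: Z(S, t) for t in combinations(sorted(quartet), 3)}
    if all(not (zs[p] & zs[q]) for p, q in combinations(zs, 2)):
        return True, None
    # k_v > 4: the hybrid block is the Z set disjoint from all the others.
    for t in zs:
        if all(not (zs[t] & zs[u]) for u in zs if u != t):
            return False, zs[t]
    raise ValueError("S is not consistent with a single cycle")

four = [{"h1", "h2"}, {"b"}, {"c1", "c2"}, {"d"}]
five = four + [{"e"}]
result4 = classify(bc_quartets(four), {"h1", "b", "c1", "d"})
result5 = classify(bc_quartets(five), {"h1", "b", "c1", "d"})
```

For the 4-cycle all four Z sets are pairwise disjoint, so `result4` is `(True, None)`. For the 5-cycle the three Z sets that omit the hybrid block all contain e, and `result5` is `(False, {"h1", "h2"})`.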

Lemma 16 Let $N^{+}$ be a metric rooted level-1 network on X. Then for any hybrid node v with k_v ≥ 4 the order of the v-blocks in the cycle can be determined from the ordering of the concordance factors.

Proof If k_v = 4, the claim is established by Proposition 11. Now suppose that k_v > 4, so by Lemma 15 we know the v-hybrid block. Let A₁,...,A_kv be the v-block partition with A₁ the v-hybrid block. Let a_i ∈ A_i be an element of the i-th v-block. By Proposition 11, A₁ and A_j are adjacent if and only if $q_{a_{1} a_{j} x y}^{B C} \neq a_{1} a_{j} | x y$ for any distinct $x, y \in \{a_{2}, \dots, a_{k_{v}}\} \setminus \{a_{j}\}$ . Thus we can identify the two v-blocks adjacent to A₁. Suppose that such v-blocks are A_p and A_q. We find the other v-block adjacent to A_q from $q_{a_{1} a_{p} a_{j} a_{m}}^{B C}$ for all distinct j,m ∈ {2,3,4,…,k_v} \ {p,q}. That is, A_q and A_j are adjacent if and only if $q_{a_{1} a_{j} a_{p} x}^{B C} \neq a_{1} a_{j} | x a_{p}$ for any $x \in \{a_{2}, \dots, a_{k_{v}}\} \setminus \{a_{p}, a_{q}, a_{j}\}$ and j ≠ 1,p,q. Continuing in this way, the full order of blocks around the cycle can be determined.
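The idea behind this proof can be sketched with a toy oracle. This is a simplified all-pairs variant, not the step-by-step procedure of the proof: by Proposition 11, $q^{B C}$ pairs the two taxa whose blocks are not adjacent in the induced 4-cycle, so two blocks are adjacent on C_v exactly when they are never paired in any quartet containing the hybrid representative. The hidden cyclic order, the labels, and the names `make_oracle` and `adjacent_pairs` are assumptions made for illustration.

```python
from itertools import combinations

def make_oracle(cyclic_order):
    """Toy q^BC for one representative per v-block: return the split that
    pairs the two non-adjacent taxa of the induced 4-cycle."""
    pos = {t: i for i, t in enumerate(cyclic_order)}
    def q_bc(quartet):
        p = sorted(quartet, key=pos.get)
        return frozenset({frozenset({p[0], p[2]}), frozenset({p[1], p[3]})})
    return q_bc

def adjacent_pairs(reps, hybrid, q_bc):
    """Pairs of representatives never paired together by q^BC, over all
    quartets that contain the hybrid representative."""
    adjacent = set()
    for x, y in combinations(reps, 2):
        quartets = [frozenset(q) for q in combinations(reps, 4)
                    if hybrid in q and x in q and y in q]
        if all(frozenset({x, y}) not in q_bc(q) for q in quartets):
            adjacent.add(frozenset({x, y}))
    return adjacent

hidden_order = ["a1", "a4", "a2", "a5", "a3"]   # a1 represents the hybrid block
reps = sorted(hidden_order)
edges = adjacent_pairs(reps, "a1", make_oracle(hidden_order))
```

Here `edges` consists of the five pairs a1-a4, a4-a2, a2-a5, a5-a3 and a3-a1, which are exactly the consecutive pairs of the hidden cyclic order.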

We reach the main result.

Theorem 4 Let $N^{+}$ be a metric rooted level-1 network on X. Then under the NMSC model, for generic parameters, the collection of orderings of quartet concordance factors identifies the unrooted semidirected topological network $\tilde{N}$ obtained from $N^{-}$ by contracting all 2- and 3-cycles and ignoring the directions of hybrid edges in 4-cycles, while retaining directions of hybrid edges of k-cycles for k ≥ 5.

Proof We proceed by induction on the number of cycles of size ≥ 4. Suppose there are no such cycles. Then no induced quartet network has a cycle of size 4, and the ordering of the concordance factors determines the topology of the quartet tree obtained by contracting all 2- and 3-cycles. These then determine the topology $\tilde{N}$ by a standard result [22].

Suppose there is exactly one cycle of size at least 4. Then there is just one hybrid node v in $\tilde{N}$ with k_v ≥ 4. By Lemmas 15 and 16 we can determine the size k_v of the cycle, the v-blocks and the order of the v-blocks in the cycle. If k_v ≥ 5 we can identify the hybrid node v and thus identify the direction of the hybrid edges. Let P_u be a v-block where u is a node in C_v, and q ∈ X \ P_u. Let $K$ be the induced network on P_u ∪ {q} with all 2-cycles and 3-cycles contracted. Note that $K$ is a tree, and the quartet concordance factors for taxa in P_u ∪ {q} identify its topology. Viewing q as an outgroup of P_u induces a rooted tree on P_u. The root can then be joined with an edge to u. Doing this for all v-blocks establishes the claim.

Now suppose that the result is true for networks with l cycles of size at least 4, and $N^{-}$ contains l+1 such cycles. By Lemmas 14, 15, and 16 we can first determine all 4-network blocks and, for every cycle of size at least 4, its v-blocks and their cycle order. Following Definition 13, consider $T$ , the tree of cycles of $\tilde{N}$ . A leaf of $T$ arises from a cycle C_v on $N^{-}$ if and only if all v-blocks but one are 4-network blocks. We may therefore determine the v-blocks of some cycle C_v that is a leaf of $T$ .

Let u be the vertex in C_v associated to the v-block that is not a 4-network block. Note that $\tilde{N} \setminus \{u\}$ is a disconnected graph, with two connected components ${\tilde{N}}_{1}$ and ${\tilde{N}}_{2}$ . Let ${\tilde{N}}_{1}$ be the component containing all nodes of C_v except u, and S_i the set of taxa on ${\tilde{N}}_{i}$ , i ∈ {1,2}. Let s_i ∈ S_i. Then $N_{S_{i} \cup \{s_{j}\}}^{-}$ for i,j ∈ {1,2}, i ≠ j, has at most l cycles of size at least 4. By the induction hypothesis we can determine the semidirected topological network ${\tilde{N}}_{i}$ obtained from $N_{S_{i} \cup \{s_{j}\}}^{-}$ by contracting all 2- and 3-cycles and ignoring the directions of the hybrid edges in 4-cycles, while retaining directions of the hybrid edges of k-cycles for k ≥ 5. We obtain $\tilde{N}$ by identifying s₁ in ${\tilde{N}}_{2}$ with s₂ in ${\tilde{N}}_{1}$ and suppressing that node.

Figure 23 shows a phylogenetic metric rooted network $N^{+}$ and $\tilde{N}$ , the unrooted semidirected topological network which is identified by Theorem 4.

Fig. 23 — A rooted metric phylogenetic network $N^{+}$ (left) and the network structure $\tilde{N}$ (right) that can be identified by Theorem 4. The 4-cycle on the network in the right, colored gray, has 3 different candidates for the hybrid node.

The highlighted cycle in Figure 23 is a 4-cycle, so its hybrid node is not identified from quartet concordance factors. However, its hybrid node has to be such that $\tilde{N}$ is induced from a rooted network. Thus the node labeled x in Figure 23 cannot be the hybrid node. This illustrates that although we cannot always identify the hybrid node on 4-cycles, sometimes the structure of the resulting network $\tilde{N}$ restricts the possible nodes for its placement.

9. Further results on 3₂-cycles

Under some special circumstances, for example when a set of taxa satisfies the Cycle property but not the BC property, it is possible to detect further information about the topology of the network than that given in Theorem 4. For instance, some 3-cycles are identifiable under such hypotheses. In this section, we discuss these extensions briefly, as it is difficult to formulate general statements on identifiability.

Recall that a 3₂-cycle may lead to concordance factors satisfying the Cycle property, but it need not, as shown in Proposition 7. There is a full-dimensional subset of parameter space on which concordance factors indicate a 3₂-cycle and another on which they fail to. Nonetheless, the following gives a positive, but limited, identifiability result.

Proposition 12 Let $N^{+}$ be a metric rooted level-1 network on X and suppose {a,b,c,d} ⊂ X satisfies the Cycle property but not the BC property. Then under the NMSC model, for generic parameters, if there is no taxon e ∈ X such that {i,j,k,e} satisfies the BC property for any distinct i,j,k ∈ {a,b,c,d} then $N^{-}$ contains a 3-cycle with at least two descendants of the hybrid node.

Proof Since {a,b,c,d} ⊂ X satisfies the Cycle property but not the BC property, by Proposition 8, there is a 3₂-cycle in $Q_{a b c d}^{-}$ . Thus three taxa of a,b,c,d are in distinct v-blocks in $Q_{a b c d}^{-}$ . This implies that there exists a cycle C_v in $N^{-}$ where three taxa of a,b,c,d are in distinct v-blocks. Since {i,j,k,e} does not satisfy the BC property for any distinct i,j,k ∈ {a,b,c,d}, C_v is not a k-cycle for k ≥ 4. Thus by Proposition 7, C_v has size 3 and at least two of a, b, c, d descend from v.

Let $Q_{a b c d}^{-}$ be an unrooted level-1 quartet network where {a,b,c,d} satisfies the Cycle property but not the BC property. It can be shown that if, for example, the smallest entry in CF_abcd is the one corresponding to the quartet AB|CD, then either a,b or c,d are in the v-hybrid block. This proof is very similar to that of Proposition 11.

Let $N^{+}$ be a network such that $\tilde{N}$ (the network obtained from $N^{+}$ as in Theorem 4) is as shown in Figure 24. Observe that {a,b,c,d} satisfies the BC property by Theorem 3. If {a,e,b,d} satisfies the Cycle property, then the following Proposition indicates the hybrid node in the network shown in Figure 24 can be determined.

Fig. 24 — A network $\tilde{N}$ with a 4-cycle such that if {a,b,c,e} satisfies the Cycle property, the hybrid block can be detected.

Proposition 13 Let $N^{+}$ be a metric rooted level-1 network on X and let C_v be a 4-cycle in $N^{-}$ . Let a,b,c,d ∈ X be in different v-blocks in $N^{-}$ . Suppose under the NMSC model, for generic parameters, for distinct i,j,k ∈ {a,b,c,d}, there exists a taxon e ∈ X such that {i,j,k,e} satisfies the Cycle property but not the BC property. Then the v-block containing e is the v-hybrid block.

Proof Without loss of generality suppose that i = a, j = b and k = c. Note that e is not in the same v-block as d, otherwise {a,b,c,e} would satisfy the BC property. Thus e is in the same v-block as a, b or c. Without loss of generality suppose that e is in the same v-block as a. Thus {e,b,c,d} satisfies the BC property and by Theorem 4 the order of the cycle can be determined. Without loss of generality suppose that the order is the one in Figure 24. By Lemma 13, {a,b,c,d} and {e,b,c,d} determine the same hybrid node v. Since {a,b,c,e} satisfies the Cycle property, Corollary 4 shows $Q_{a b c e}^{-}$ has a 3₂-cycle. The 4-cycle in $Q_{a b c d}^{-}$ and the 3-cycle in $Q_{a b c e}^{-}$ must have the same hybrid edges, otherwise the level-1 condition would be violated. Observe that the only possibility for $Q_{a b c e}^{-}$ having a 3₂-cycle is if e and a are in the hybrid block.

In [23] it is stated that one could identify the hybrid node in a 4-cycle when the number of taxa in the network is greater than 4 by using multiple concordance factors at once.

10. Discussion

In this work, we show that for generic numerical parameters, under the network multi-species coalescent model the collection of orderings of quartet concordance factors identifies the unrooted semidirected topological network obtained from $N^{-}$ by contracting all 2- and 3-cycles, and ignoring the directions of hybrid edges in 4-cycles, while retaining directions of hybrid edges in larger cycles.

As mentioned in the introduction, the proof of this result suggests combinatorial methods for constructing the network under noiseless data, but the question remains open in the presence of noise. There are two challenges when noise is introduced. The first one consists of detecting whether a quartet network contains a 4-cycle or not. We would never expect the empirical concordance factors to be exactly treelike. For this challenge, one could develop a statistical test to determine when concordance factors are sufficiently close to treelike to doubt the presence of a 4-cycle. The second challenge arises after determining such a test. Since the test will not be accurate all the time, some quartets will not be inferred correctly and thus we need a method to reconstruct the network with some erroneous quartets. We leave this for future work.

Acknowledgements

The author deeply thanks John A. Rhodes and Elizabeth S. Allman for their technical assistance and suggestions during the development of this work, and the reviewers for their valuable suggestions and observations.

This research was supported in part by the National Institutes of Health grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

11. Appendix

Here, Proposition 1 of Section 2 is proved. The argument uses the following.

Lemma 17 Let $N^{+}$ be a (metric or topological) rooted network on X and let Z ⊂ X. For any edge e below LSA(Z), with a descendant in Z, there are x,y ∈ Z such that e is in a simple trek in $N^{+}$ from x to y whose edges are below LSA(Z).

Proof Let x ∈ Z be below e. By Lemma 2 there exists y ∈ Z with LSA(x,y) above e.

Suppose y is not below e. Let P_x be a path from LSA(x,y) to x containing e and let P_y be a path from LSA(x,y) to y. Let u be the minimal node in the intersection of P_x and P_y. Since y is not below e, u cannot be below e. Then the subpath of P_x from u to x, which contains e, and the subpath of P_y from u to y form a simple trek containing e.

Now assume y is below e. Since e is below LSA(x,y), there exists a path from LSA(x,y) to one of y or x that does not pass through the child of e. Without loss of generality suppose such a path P_y goes from LSA(x,y) to y.

Let P_x be a path from LSA(x,y) to x that passes through e. Let A = A(P_x,P_y) be the set of nodes above e, common to P_y and P_x. Let a ∈ A be the minimal node in A.

Let B(P_y,P_x) be the set of nodes below e, common to P_y and P_x. We may assume that we choose P_x and P_y such that B = B(P_y,P_x) has minimal cardinality. If B = ∅ then the desired trek is easily constructed, with top a. So suppose B ≠∅ has minimal element b⁻ and maximal element b⁺. We are going to contradict the minimality of B. Note that b⁺ must be the hybrid node of a cycle containing e (see Figure 25 for a graphical reference).

Fig. 25 — In gray we see the subgraph composed by P and P′, the dashed edges represent that P and P′ could intersect, the dotted segments represent just a succession of edges. In black we see the different cases of the possible edges in P^∗ above b but below a.

Since b⁻ is not LSA(x,y), there exists a path P^∗ from LSA(x,y) to one of x or y that does not pass through b⁻. Note that P^∗ has to intersect at least one of P_y or P_x at an internal node below b⁻. Let C₁ be the set of nodes below b⁻, common to P^∗ and P_y and let C₂ be the set of nodes below b⁻, common to P^∗ and P_x. Let c be the maximal node in C₁ ∪C₂. We can assume, without loss of generality, that c is in P_y. This is because if instead c were in P_x, we can construct paths $P_{x}^{'}$ and $P_{y}^{'}$ where $P_{i}^{'}$ contains all the edges in P_i above b⁻ and all edges of P_j below b⁻ for i,j ∈ {x,y}, i ≠ j. Note that $P_{x}^{'}$ passes through e and does not contain c, while $P_{y}^{'}$ does not pass through e, contains c, and $B = B (P_{y}^{'}, P_{x}^{'})$ .

Denote by W the set of nodes in (P* ∩ P_y) ∪ (P* ∩ P_x) and let w be the minimal node of W above b⁻. Since $N^{+}$ is binary, w cannot be a or b⁺ (see Figure 25 for a graphical reference). There are 5 different cases of the location of w in the network composed by the paths P_y and P_x. These are

1. w is in P_y, above b⁺ but below a.
2. w is in P_x, above b⁺ but below e.
3. w is in P_x, above e but below a.
4. w is in one or more of P_x or P_y, above a.
5. w is in one or more of P_x or P_y, above b⁻ but below b⁺.

Figure 25 depicts in gray the graph composed by the paths P_y and P_x, and in black we see the possible subpaths of P^∗ from w to c. In any of cases 1, 2 or 3 we can find a simple trek containing e as depicted in Figure 26 by choosing the appropriate edges, and thus B was not minimal. For cases 4 and 5 there are two possibilities: (i) w is in both P_y and P_x; (ii) w is only in one of P_y or P_x. For case 4 (i), the situation is simple, and we can find a simple trek as depicted on the left in Figure 27. For case 4 (ii), we first find the node in A that is right above w. Then as depicted on the left of Figure 27 we can find a simple trek.

Fig. 26 — The treks in case 1 (left), case 2 (center), and case 3 (right).

Fig. 27 — (Left) The treks in the two possibilities of case 4. (Right) The two possibilities of case 5, where the black segments represent possible edges red and blue at the same time.

For case 5 we do not find a simple trek directly; instead we construct two paths P₁ and P₂ from LSA(x,y) to x, y respectively, only one of which contains e, with at least one fewer node in B(P₁,P₂) than in B. For case 5 (i), we just take P₁ to be the same as P_x and for P₂ we consider the same edges that are in P_y above w, the edges below c, and the edges in P^∗ between w and c. For case 5 (ii), we assume without loss of generality that w is in P_x. Let b be the node in B right above w. Let P₁ be the path containing the edges in P_x that are above b, the edges in P_y that are below b but above the node b′ ∈ B right below w, and at last the edges in P_x below b′. Let P₂ be the path containing the edges in P_y that are above b, the edges in P_x that are above a but below b, the edges in P^∗ that are above c but below w and at last the edges in P_y that are below c. Figure 27 (right) depicts P₁ (red) and P₂ (blue) for (i) and (ii). Since B(P₁,P₂) has at least one fewer node than B and we assumed B to be minimal, the minimality of B is contradicted.

Proof (of Proposition 1) Let $M^{+} = N_{Z}^{\oplus}$ . Let M⁻ be the graph obtained from M⁺ by ignoring the direction of all tree edges and then suppressing the LSA(Z, $N^{+}$ ), that is, the induced unrooted network from M⁺. Denote by M′ the graph obtained by ignoring all directions of the tree edges in M⁺, so that suppressing degree two nodes of either M⁻ or M′ gives ${(N_{Z}^{+})}^{-}$ . Let K be the graph obtained by considering all the edges in simple treks in $N^{-}$ from x to y for all x,y ∈ Z, so that suppressing degree two nodes in K gives ${(N^{-})}_{Z}$ . Showing either M′ = K or M⁻ = K will prove the claim.

First we show that if LSA(Z, $N^{+}$ )≠LSA(X, $N^{+}$ ) then M′ = K, by arguing that M′ and K have the same edges. Let e be an edge of M′. Since LSA(Z, $N^{+}$ )≠LSA(X, $N^{+}$ ), M′ is a subgraph of $N^{-}$ and e is directed in M⁺. By Lemma 17, e is in a simple trek in M⁺ from x to y, for some x,y ∈ Z. This trek induces a simple trek in M′ from x to y, and therefore a simple trek in $N^{-}$ from x to y. Thus e is in K.

Now let e be an edge of K. Then there exists a simple trek ( $\bar{P_{1}}, \bar{P_{2}}$ ) in $N^{-}$ from x to y, for some x,y ∈ Z containing e. Let $v = top (\bar{P_{1}}, \bar{P_{2}})$ and let T be the sequence of incident edges in $N^{+}$ from x to y composed of edges inducing those in $\bar{P_{1}}$ and $\bar{P_{2}}$ . Since ( $\bar{P_{1}}, \bar{P_{2}}$ ) is simple, T does not have repeated edges. Following T in $N^{+}$ from x to y, edges are first traversed “uphill” (in reverse direction) until there is a first “downhill” edge (u,w). The next edge in T cannot be uphill, as otherwise it would be hybrid and ( $\bar{P_{1}}, \bar{P_{2}}$ ) would not have been a trek in $N^{-}$ . This argument applies for all consecutive edges in T until we end at y. Thus there is a simple trek (P₁, P₂) from x to y in $N^{+}$ with top u. Note that u must be below or equal to LSA(Z, $N^{+}$ ) since otherwise the trek would not be simple. Moreover, P₁ and P₂ contain only edges in M⁺ and thus in M′ after the directions of the tree edges are omitted. Thus e is in M′, so K = M′.

If LSA(Z, $N^{+}$ )=LSA(X, $N^{+}$ ) then M⁻ = K follows from a straightforward modification of the previous argument to account for the suppression of LSA(Z, $N^{+}$ ) in both M⁻ and K.
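
The uphill/downhill step in the proof above can be illustrated with a minimal sketch. It is not part of the paper: the function names (`step_directions`, `is_up_then_down`, `top_of`) and the toy network are hypothetical, and a trek is represented here simply as a sequence of nodes from x to y rather than as a pair of directed paths. The sketch labels each step of a walk in a rooted network as uphill or downhill and checks that no uphill step follows a downhill one, in which case the node reached after the last uphill step plays the role of the top.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (parent, child) of the rooted network


def step_directions(path: List[str], edges: Set[Edge]) -> List[str]:
    """Label each consecutive step of `path` as 'up' (against an edge)
    or 'down' (along an edge)."""
    labels = []
    for a, b in zip(path, path[1:]):
        if (b, a) in edges:
            labels.append("up")
        elif (a, b) in edges:
            labels.append("down")
        else:
            raise ValueError(f"{a} and {b} are not adjacent")
    return labels


def is_up_then_down(path: List[str], edges: Set[Edge]) -> bool:
    """True if every uphill step precedes every downhill step."""
    seen_down = False
    for label in step_directions(path, edges):
        if label == "down":
            seen_down = True
        elif seen_down:  # an uphill step after a downhill one
            return False
    return True


def top_of(path: List[str], edges: Set[Edge]) -> str:
    """Node reached after the initial run of uphill steps.
    Meaningful only when is_up_then_down(path, edges) holds."""
    labels = step_directions(path, edges)
    k = 0
    while k < len(labels) and labels[k] == "up":
        k += 1
    return path[k]


# Toy rooted network (an assumption for illustration): h is a hybrid node
# with parents u and v, and leaves a, x, b.
toy_edges = {("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
             ("v", "h"), ("v", "b"), ("h", "x")}

good = ["a", "u", "h", "x"]      # up, down, down
bad = ["a", "u", "h", "v", "b"]  # up, down, up, down
```

In this toy network the walk `bad` goes uphill along the hybrid edge (v, h) after already going downhill, which is exactly the situation the proof rules out, whereas `good` is uphill-then-downhill with top u.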

References

1. Allman Elizabeth S., Degnan James H., and Rhodes John A. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. Journal of Mathematical Biology, 62(6):833–862, 2011.
2. Ané Cécile, Larget Bret, Baum David A., Smith Stacey D., and Rokas Antonis. Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24(2):412–426, 2007.
3. Arnold Michael L. Natural hybridization and evolution, volume 53. Oxford University Press, 1997.
4. Bapteste Eric, van Iersel Leo, Janke Axel, Kelchner Scot, Kelk Steven, McInerney James O., Morrison David A., Nakhleh Luay, Steel Mike, Stougie Leen, and Whitfield James. Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441, 2013.
5. Carstens Bryan C., Lacey Knowles L, and Collins Tim. Estimating Species Phylogeny from Gene-Tree Probabilities Despite Incomplete Lineage Sorting: An Example from Melanoplus Grasshoppers. Systematic Biology, 56(3):400–411, 2007.
6. Degnan James H, Lacey Knowles L, and Salter Kubatko Laura. Probabilities of gene trees with intraspecific sampling given a species tree. In Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, 2010.
7. Ellstrand NC, Whitkus R, and Rieseberg LH. Distribution of spontaneous plant hybrids. Proceedings of the National Academy of Sciences of the United States of America, 93(10):5090–5093, 1996.
8. Gusfield Dan, Bansal Vikas, Bafna Vineet, and Song Yun S. A decomposition theory for phylogenetic networks and incompatible characters. Journal of Computational Biology, 14(10):1247–1272, 2007.
9. Huber Katharina T., van Iersel Leo, Moulton Vincent, Scornavacca Celine, and Wu Taoyang. Reconstructing phylogenetic level-1 networks from nondense binet and trinet sets. Algorithmica, 77(1):173–200, January 2017.
10. Huber KT, Moulton V, Semple C, and Wu T. Quarnet inference rules for level-1 networks. https://arxiv.org/pdf/1711.06720.pdf, 2017.
11. Keijsper JCM and Pendavingh RA. Reconstructing a Phylogenetic Level-1 Network from Quartets. Bulletin of Mathematical Biology, 76(10):2517–2541, 2014.
12. Linder C. Randal and Rieseberg Loren H. Reconstructing patterns of reticulate evolution in plants. American Journal of Botany, 91(10):1700–1708, 2004.
13. Liu Liang, Yu Lili, and Edwards Scott V. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology, 10(1):302, 2010.
14. Mallet James. Hybridization as an invasion of the genome. Trends in Ecology & Evolution, 20(5):229–237, 2005. Special issue: Invasions, guest edited by Hochberg Michael E. and Gotelli Nicholas J.
15. Meng Chen and Salter Kubatko Laura. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model. Theoretical Population Biology, 75(1):35–45, 2009.
16. Nakhleh Luay. Evolutionary Phylogenetic Networks: Models and Issues. Problem Solving Handbook in Computational Biology and Bioinformatics, pages 125–158, 2011.
17. Noor Mohamed A. F. and Feder Jeffrey L. Speciation genetics: evolving approaches. Nature Reviews Genetics, 7(11):851–861, 2006.
18. Pamilo P and Nei M. Relationships between gene trees and species trees. Molecular Biology and Evolution, 5:568–583, 1988.
19. Pollard Daniel A., Iyer Venky N., Moses Alan M., and Eisen Michael B. Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genetics, 2(10):1634–1647, 2006.
20. Rieseberg Loren H., Baird Stuart J.E., and Gardner Keith A. Hybridization, introgression, and linkage evolution. Plant Molecular Biology, 42(1):205–224, 2000.
21. Rosselló Francesco and Valiente Gabriel. All that glisters is not galled. Mathematical Biosciences, 221(1):54–59, 2009.
22. Semple Charles and Steel Mike. Phylogenetics. Oxford University Press, 2005.
23. Solís-Lemus Claudia and Ané Cécile. Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting. PLoS Genetics, 12(3), 2016.
24. Solís-Lemus Claudia, Ané Cécile, and Yang Mengyao. Inconsistency of Species Tree Methods under Gene Flow. Systematic Biology, 65(5):843–851, 2016.
25. Steel Mike. Phylogeny: Discrete and Random Processes in Evolution. David Marshall, 2016.
26. Sullivant Seth, Talaska Kelli, and Draisma Jan. Trek separation for Gaussian graphical models. Ann. Statist, 38(3):1665–1685, 06 2010.
27. Syring John, Willyard Ann, Cronn Richard, and Liston Aaron. Evolutionary relationships among Pinus (Pinaceae) subsections inferred from multiple low-copy nuclear loci. American Journal of Botany, 92(12):2086–2100, 2005.
28. Wakeley John. Coalescent Theory: An Introduction, volume 58. Roberts and Company Publishers, 2008.
29. Yu Y, Dong J, Liu KJ, and Nakhleh L. Maximum likelihood inference of reticulate evolutionary histories. PNAS, 111:296–305, 11 2014.
30. Yu Yun, Degnan James H., and Nakhleh Luay. The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics, 8:e1002660, 2012.
31. Yu Yun, Than Cuong, Degnan James H., and Nakhleh Luay. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology, 60(2):138–149, 2011.
32. Zhang C, Ogilvie HW, Drummond AJ, and Stadler T. Bayesian inference of species networks from multilocus sequence data. Molecular Biology and Evolution, 35:504–517, 02 2018.
33. Zhu J, Yu Y, and Nakhleh L. In the light of deep coalescence: Revisiting trees within networks. BMC Bioinformatics, 17:415, 2016.
34. Zhu S and Degnan J. Displayed trees do not determine distinguishability under the network multispecies coalescent. Systematic Biology, 66:283–298, 2017.

