Author manuscript; available in PMC: 2020 Feb 1.
Published in final edited form as: Bull Math Biol. 2018 Aug 9;81(2):494–534. doi: 10.1007/s11538-018-0485-4

Identifying species network features from gene tree quartets under the coalescent model

Hector Baños 1
PMCID: PMC6344282  NIHMSID: NIHMS1503360  PMID: 30094772

Abstract

We show that many topological features of level-1 species networks are identifiable from the distribution of the gene tree quartets under the network multi-species coalescent model. In particular, every cycle of size at least 4 and every hybrid node in a cycle of size at least 5 is identifiable. This is a step toward justifying the inference of such networks which was recently implemented by Solís-Lemus and Ané. We show additionally how to compute quartet concordance factors for a network in terms of simpler networks, and explore some circumstances in which cycles of size 3 and hybrid nodes in 4-cycles can be detected.

Keywords: Coalescent Theory, Phylogenetics, Networks, Concordance factors

1. Introduction

As phylogenetic analysis of DNA data has progressed, more evidence has appeared showing that hybridization is often an important factor in evolution. As surveyed in [16], hybridization has played a very important role in the evolutionary history of plants and of some groups of fish and frogs ([7], [12], [14], [17], [20]). Other biological processes, such as introgression, lateral gene transfer and gene flow, also require moving beyond a simple tree-like view of species relationships.

Phylogenetic networks are the objects used to represent the relationships between species that admit such events ([3],[4]). These networks are often thought of as obtained from phylogenetic trees by adding additional edges, so that some nodes in the tree have two parents. Nodes with two parents, called hybrid nodes, represent species whose genome arises from two different ancestral species. Inference of phylogenetic networks from biological data presents new challenges, with methods still being developed, as shown by recent works including [2], [15], [23], [32], [29] and [31].

Another challenge in inferring evolutionary history arises from the fact that many multi-locus data sets exhibit gene tree incongruence, even without suspected hybridization. One possible reason is incomplete lineage sorting (ILS), which is described in the tree setting by the multi-species coalescent model [18]. See for example [5], [19], and [27] where ILS is explained in the biological setting.

Meng and Kubatko [15] formulated a model of gene tree production, based on the multi-species coalescent model, incorporating both hybridization and ILS. We refer to this model as the network multi-species coalescent model, which is further developed in [30], [33], and [24], to mention some. The model determines the probability of observing any rooted gene tree given a metric rooted phylogenetic species network.

Solís-Lemus and Ané [23] recently presented a novel statistical method, based on the network multi-species coalescent model, to infer phylogenetic networks from gene tree quartets in a pseudolikelihood framework. The quartets themselves might come from larger gene trees inferred by standard phylogenetic methods. The pseudolikelihood in this work is built on quartet frequencies, or concordance factors, extending an idea of Liu [13] from the tree setting. The pseudolikelihood approach is simpler and faster than computing the full likelihood and makes large-scale data analysis more tractable. They demonstrate positive results in reconstructing the evolutionary relationships among swordtails and platyfishes.

However, the theoretical underpinnings of the method of [23] are not complete. In using a model for statistical inference it is important to know if it is theoretically possible to uniquely recover the parameters from the data the model predicts. In more precise terms, for model-based statistical inference to have a solid basis, we need that the probability distribution for data which arises under the model uniquely determines the parameters. This is known as identifiability of the model parameters.

While [23] showed that any particular hybridization in a level-1 network with h hybridizations and n taxa can be generically detected under certain assumptions, their study never addressed the full identifiability of the network topology, only the detectability of a specific hybridization event. Working in the setting of level-1 networks, which is also adopted here, their arguments do not investigate network properties such as cycle sizes or the structure of the whole network. These properties are crucial to determine, for example, whether two networks with different cycle sizes, or different numbers of cycles, could produce the same set of gene tree quartet probabilities.

The primary purpose of this work is to begin to address some of these identifiability questions raised in [23]. That is, we study the question: given information on gene quartet probabilities for some unknown level-1 network N, what can be determined about the topology of N?

Although others have considered the problem of constructing large networks from small ones, these works do not seem to be applicable to the question studied here. Most of these works, including [9], [10] and [11], are primarily combinatorial in nature. In particular, these studies do not address semidirected networks, ILS through the network multi-species coalescent model, nor the types of inputs that might be obtained from biological data.

The main result of this work, Theorem 4 of Section 8, is that under the network multi-species coalescent model on level-1 networks, we can generically identify from gene quartet distributions “most” of the unrooted topological network, including all cycles of size at least 4, and hybrid nodes in the cycles of size greater than 4. “Generically” here means for all values of numerical parameters except those in a set of measure zero. The methods used are a mix of the semi-algebraic study of quartet gene tree frequencies (in terms of linear equalities and inequalities they satisfy) with combinatorial approaches to combining this knowledge for many quartets. As a side benefit the proofs suggest combinatorial methods for reconstructing networks, as opposed to just showing identifiability. However, we do not explore how such methods might be implemented in the presence of the noise that any collection of inferred gene trees will have.

Another result of this work, in Section 5, is a rigorous derivation of how gene quartet probabilities can be computed for large networks under the coalescent model. Although this parallels some of the results in [23], the arguments given here are more rigorous, as is necessary for them to form the basis of our main results. Our approach is to express quartet frequencies as convex combinations of those on simplified networks, ultimately leading to expressions in terms of trees, as is done in other situations [34]. This is different from the approach in [23] of finding networks with fewer hybridizations displaying the same gene quartet probabilities.

The outline of this work is as follows: Section 2 introduces basic definitions and establishes some terminology on graphs and networks. Section 3 sets forth insights and tools for studying the structure of level-1 networks. Section 4 reviews the network multi-species coalescent model of [15], as well as quartet concordance factors and some of their properties. In Section 5 we show how concordance factors of quartet networks can be expressed in terms of simpler networks. Section 6 introduces the “Cycle property” of concordance factors and Section 7 defines the “Big Cycle” property of concordance factors. In Section 8, the main result on topological network identifiability is proved using the Big Cycle property and in Section 9 some extended results on the “Cycle property” are shown.

2. Phylogenetic networks

We adopt standard terminology for graphs and networks, as used in phylogenetics; see for example [22] and [25]. No undirected, directed, or semidirected graph considered here contains loops. If G is a directed or semidirected graph, the undirected graph of G, denoted by U(G), is the graph G with all directions omitted.

2.1. Rooted networks

To set terminology, we begin with some fundamental definitions.

Definition 1 A topological binary rooted phylogenetic network $N^+$ on taxon set $X$ is a connected directed acyclic graph with vertices $V$ and edges $E$, where $V$ is the disjoint union $V = \{r\} \sqcup V_L \sqcup V_H \sqcup V_T$ and $E$ is the disjoint union $E = E_H \sqcup E_T$, and a bijective leaf-labeling function $f: V_L \to X$ with the following characteristics:

  1. The root $r$ has indegree 0 and outdegree 2.

  2. A leaf $v \in V_L$ has indegree 1 and outdegree 0.

  3. A tree node $v \in V_T$ has indegree 1 and outdegree 2.

  4. A hybrid node $v \in V_H$ has indegree 2 and outdegree 1.

  5. A hybrid edge $e \in E_H$ is an edge whose child is a hybrid node.

  6. A tree edge $e \in E_T$ is an edge whose child is a tree node or a leaf.
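The degree conditions of Definition 1 can be checked mechanically. The following is a minimal sketch, assuming the network is supplied as a list of (parent, child) pairs; the function names and the small 3-taxon example are hypothetical and only illustrate the definition. The sketch does not test connectivity or acyclicity.

```python
from collections import defaultdict


def classify_nodes(edges, root):
    """Label each node as root, leaf, tree or hybrid by its (indegree, outdegree).

    `edges` is a list of (parent, child) pairs; parallel edges are allowed.
    Raises ValueError if a node fits none of the four types of Definition 1.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
    kinds = {}
    for v in set(indeg) | set(outdeg):
        degs = (indeg[v], outdeg[v])
        if v == root and degs == (0, 2):
            kinds[v] = "root"
        elif degs == (1, 0):
            kinds[v] = "leaf"
        elif degs == (1, 2):
            kinds[v] = "tree"
        elif degs == (2, 1):
            kinds[v] = "hybrid"
        else:
            raise ValueError(f"node {v!r} has degrees {degs}")
    return kinds


def hybrid_edges(edges, kinds):
    """Hybrid edges are exactly the edges whose child is a hybrid node."""
    return [(p, c) for p, c in edges if kinds[c] == "hybrid"]


# Hypothetical example: taxa a, b, c with a single hybrid node h above b.
EXAMPLE_EDGES = [("r", "u"), ("r", "w"), ("u", "a"), ("u", "h"),
                 ("w", "h"), ("w", "c"), ("h", "b")]
EXAMPLE_KINDS = classify_nodes(EXAMPLE_EDGES, "r")
```

In this example the two edges entering h are the hybrid edges, and every other edge is a tree edge.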

Definition 2 Let $N^+$ be a topological binary rooted phylogenetic network with $|E| = m$ and $|E_H| = 2h$. A metric for $N^+$ is a pair $(\lambda,\gamma)$, where $\lambda: E \to \mathbb{R}_{>0}$ and $\gamma: E_H \to (0,1)$ satisfies that if two edges $h_1$ and $h_2$ have the same hybrid node as child, then $\gamma(h_1) + \gamma(h_2) = 1$.

If (λ,γ) is a metric for N+, then we refer to (N+,(λ,γ)) as a metric binary rooted phylogenetic network.
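As an illustration of Definition 2, the sketch below checks whether a candidate pair of edge lengths and hybridization parameters is a metric. The representation (edge identifiers as dictionary keys, and a `child_of` map for hybrid edges) is an assumption made for this example, chosen so that parallel hybrid edges remain distinguishable.

```python
import math
from collections import defaultdict


def is_metric(child_of, lengths, gammas):
    """Check the conditions of Definition 2.

    child_of: dict mapping each hybrid edge id to its child (a hybrid node)
    lengths:  dict mapping every edge id to its length lambda(e)
    gammas:   dict mapping every hybrid edge id to gamma(e)
    """
    # Edge lengths must be strictly positive.
    if any(length <= 0 for length in lengths.values()):
        return False
    # Hybridization parameters must lie strictly between 0 and 1.
    if any(not (0 < g < 1) for g in gammas.values()):
        return False
    # The two hybrid edges entering the same hybrid node must sum to 1.
    total = defaultdict(float)
    count = defaultdict(int)
    for edge, g in gammas.items():
        total[child_of[edge]] += g
        count[child_of[edge]] += 1
    return all(count[v] == 2 and math.isclose(total[v], 1.0) for v in total)


# Hypothetical values for a network with one hybrid node "h".
CHILD_OF = {"h1": "h", "h2": "h"}
LENGTHS = {"h1": 0.5, "h2": 1.2, "t1": 0.8}
```

With these values, `is_metric(CHILD_OF, LENGTHS, {"h1": 0.3, "h2": 0.7})` holds, while replacing 0.7 by 0.6 violates the sum condition.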

Note that Definition 1 differs from that of [25] in that it allows up to two edges between a pair of nodes. An edge weight $\lambda(e)$ is interpreted as the time (in coalescent units) between speciation events represented by the ends of edge $e$. For any hybrid edge $h$ with child $v$, the value $\gamma(h) = \gamma_h$ is the probability that a lineage at $v$ has its ancestral lineage in $h$, and is often called the hybridization parameter or inheritance probability. Since we are focusing on parameter identifiability we will use the term hybridization parameter.

2.2. Lowest stable ancestor

We review and show some properties of the lowest stable ancestor, a network analog of the most recent common ancestor on a tree.

Definition 3 Let N+ be a (metric or topological) binary rooted phylogenetic network. We say that a node v is above a node u, and u is below v, if there exists a non-empty directed path in N+ from v to u. We also say that an edge with parent node x and child y is above (below) a node v if y is above or equal to v (x is below or equal to v).

Note that since N+ has no directed cycles, u cannot be both above and below v.

Definition 4 [25] Let $N^+$ be a (metric or topological) binary rooted phylogenetic network on $X$ and let $Z \subseteq X$. Let $D$ be the set of nodes which lie on every directed path from the root $r$ of $N^+$ to any $z \in Z$. Then the lowest stable ancestor of $Z$ on $N^+$, denoted by $\mathrm{LSA}(Z,N^+)$, is the unique node $v \in D$ such that $v$ is below all $u \in D$, $u \neq v$.

When $N^+$ is clear from context, we write LSA(Z) for $\mathrm{LSA}(Z,N^+)$. To see that LSA(Z) is well defined for any $Z \subseteq X$, note first that $D \neq \emptyset$ since $r \in D$. Also, since every pair of nodes $u,v \in D$ both lie on a path, we have a notion of above and below for $u$ and $v$, i.e. a total order on $D$, and hence a minimal element.
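Definition 4 translates directly into a brute-force computation: a node lies on every root-to-taxon path exactly when deleting it leaves no taxon of Z reachable from the root. The sketch below follows that reading; the function names and the two small networks are hypothetical, and no attempt is made at efficiency.

```python
def reachable(children, start, removed=None):
    """Nodes reachable from `start` by directed paths that avoid `removed`."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v == removed or v in seen:
            continue
        seen.add(v)
        stack.extend(children.get(v, ()))
    return seen


def lsa(edges, root, taxa):
    """Lowest stable ancestor of `taxa`, computed from Definition 4."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    nodes = {root} | {child for _, child in edges}
    taxa = set(taxa)
    # D: nodes whose deletion disconnects the root from every taxon in Z.
    stable = {v for v in nodes
              if not (reachable(children, root, removed=v) & taxa)}
    # D is totally ordered; its lowest element has no other element of D below it.
    for v in stable:
        if not ((reachable(children, v) - {v}) & stable):
            return v


# Hypothetical network with a 2-cycle above x, so LSA({a, b}) is x, not the root.
CHAIN = [("r", "h"), ("r", "h"), ("h", "x"), ("x", "a"), ("x", "b")]
# Hypothetical network in which b sits below a hybrid node with parents u and w.
HYBRID = [("r", "u"), ("r", "w"), ("u", "a"), ("u", "h"),
          ("w", "h"), ("w", "c"), ("h", "b")]
```

In the second network, the paths from r to b pass through either u or w, so neither is stable and LSA({a, b}) is the root.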

While the definition of LSA agrees with the most recent common ancestor for trees, it is more subtle. In particular, if N+ is a network on X, LSA(X) need not be the root of the network, as Figure 1 (left) shows. Furthermore, there can be nodes below LSA(X) which are ancestral to all of X, as Figure 2 shows.

Fig. 1.


(Left) A binary rooted phylogenetic network on X, with LSA(X) the node labeled x, and (Right) its induced unrooted semidirected network. In a depiction of a rooted network, all edges are directed downward, from the root, but arrowheads are shown only on hybrid edges. For the unrooted network, all edges except hybrid ones are undirected.

Fig. 2.


A binary rooted phylogenetic network where the node labeled y is ancestral to all taxa in X but is not LSA(X). LSA(X) here is the root of the network.

Lemma 1 Let $N^+$ be a (metric or topological) binary rooted phylogenetic network on $X$ with root $r$, and let $Z \subseteq Y \subseteq X$. Then

  (i) the indegree of LSA(Z) is at most one for any $Z \subseteq X$;

  (ii) at most one of the out edges of LSA(Z) is hybrid;

  (iii) if $Z \subseteq Y \subseteq X$ then LSA(Z) is below or equal to LSA(Y).

Proof To see (i), suppose that the indegree of LSA(Z) is two. Then LSA(Z) is a hybrid node, so its outdegree is one, and its child would lie on every path from the root to any taxon in Z while being below LSA(Z), contradicting the definition of LSA(Z).

For (ii), suppose the out edges of LSA(Z), $e_1$ and $e_2$, are both hybrid. If $e_1$ and $e_2$ have the same child then every path from $r$ to any $z \in Z$ would contain that node, contradicting the definition of LSA(Z).

Now denote by $x_1 \neq x_2$ the child nodes of $e_1$ and $e_2$ respectively. If both $x_1$ and $x_2$ had parents below LSA(Z), then $x_1$ has a parent below $x_2$ and $x_2$ has a parent below $x_1$, giving a directed cycle. Thus, without loss of generality, assume $x_1$ has parents LSA(Z) and $v$, with $v$ not below LSA(Z). Let $z \in Z$ with $z$ below $x_1$. If we remove LSA(Z) from $N^+$ there is still a path from $r$ to $z$ (which goes from $r$ to $v$ to $x_1$ to $z$). This contradicts the fact that LSA(Z) is on all paths from $r$ to any $z \in Z$.

For (iii) we observe that since $Z \subseteq Y$, LSA(Y) must be equal to or above LSA(Z), since the set of paths from $r$ to any taxon in $Y$ contains the set of paths from $r$ to any taxon in $Z$.

Lemma 2 Let $N^+$ be a (metric or topological) binary rooted phylogenetic network on $X$ and let $Z \subseteq X$, $|Z| \geq 2$. For every $x \in Z$, there is a $y \in Z$ such that LSA(x,y) = LSA(Z).

Proof Let $m$ = LSA(Z), fix $x \in Z$ and let $P$ be a path from $m$ to $x$. By definition of LSA, for all $y \in Z$, LSA(x,y) is a node in $P$ and is below or equal to $m$ by Lemma 1. Suppose that LSA(x,y) is below $m$ for all $y \in Z$. Let $z \in Z$ be such that LSA(x,z) is above or equal to LSA(x,y) for all $y \in Z \setminus \{z\}$.

We claim that any path from $m$ to $y \in Z$ passes through LSA(x,z). Suppose there exists a taxon $y$ with a path $P'$ from $m$ to $y$ that does not pass through LSA(x,z). But $P'$ must pass through LSA(x,y). Since LSA(x,y) is below LSA(x,z), there is a path from $m$ to LSA(x,y) to $x$ that does not contain LSA(x,z). This is a contradiction.

But every path from $m$ to any $y \in Z$ passes through LSA(x,z), contradicting that LSA(x,z) is below $m$.

By this Lemma we can characterize LSA(Z) as the highest node of the form LSA(x,y) for some $x,y \in Z$, or the highest node of that form for fixed $x \in Z$.

2.3. Unrooted networks.

Let G be a directed or semidirected graph with z a degree two node. Let x and y be the two nodes adjacent to z. Then, up to isomorphism, the subgraph on x,y and z must be one of the graphs shown on the left of Figure 3, which we denote by H. By suppressing z we mean replacing H in G by the graph to the right of it in Figure 3.

Fig. 3.


On the left are all the semidirected graphs, up to isomorphism, on a degree two node z and its adjacent vertices x and y. On the right are the corresponding graphs obtained by suppressing z.

Definition 5 Let $N^+$ be a binary topological rooted phylogenetic network on a set of taxa $X$. Then $N^-$ is the semidirected network obtained by 1) keeping only the edges and nodes below LSA(X); 2) removing the direction of all tree edges; 3) suppressing LSA(X). We refer to $N^-$ as the topological unrooted semidirected network induced from $N^+$.

Figure 1 shows an example of a network $N^+$ and its induced $N^-$. We now introduce a metric on $N^-$ induced from one on $N^+$.

Definition 6 Let $(N^+,(\lambda,\gamma))$ be a metric binary rooted phylogenetic network and let $N^-$ be the topological unrooted semidirected network induced from $N^+$. Denote by $e^*$ the edge of $N^-$ introduced in place of the edges $e_1$ and $e_2$ in $N^+$ when LSA(X) is suppressed. Define $\lambda': E(N^-) \to \mathbb{R}_{>0}$ such that $\lambda'(e^*) = \lambda(e_1) + \lambda(e_2)$ and $\lambda'(e) = \lambda(e)$ for $e \in E(N^-)$, $e \neq e^*$. If $e^*$ is not hybrid, $\gamma' = \gamma$; otherwise let $\gamma'(h) = \gamma(h)$ for all hybrid edges $h$ of $N^-$ other than $e^*$ and $\gamma'(e^*) = \gamma(e_i)$, where $e_i$ is, by Lemma 1, the single hybrid edge in $\{e_1,e_2\}$. We refer to $(N^-,(\lambda',\gamma'))$ as the metric unrooted semidirected network induced from $(N^+,(\lambda,\gamma))$.

The networks considered in this work are always induced from a rooted binary metric phylogenetic network. To simplify language, we refer to a (metric or topological) binary rooted phylogenetic network as a (metric or topological) rooted network and to an induced (metric or topological) unrooted semidirected phylogenetic network as a (metric or topological) unrooted network.

We note that not all binary semidirected graphs are topological unrooted networks, since some graphs are not compatible with suppressing the root on any rooted network. Moreover, $N^-$ might be induced from several rooted networks $N^+$. See Figure 4.

Fig. 4.


The top graph is not a topological unrooted semidirected phylogenetic network, since its directed edges cannot be obtained by suppressing the root of any 6-taxon topological binary rooted phylogenetic network. The middle graph is the induced topological unrooted network from either of the bottom rooted networks, as well as others.

Although an unrooted network $N^-$ does not have a root specified, since hybrid edges are directed, the suppressed LSA(X) of $N^+$ must have been located ‘above’ them. Thus in $N^-$, we still have a well-defined notion of which taxa are descendants of a hybrid node $v$. These are the taxa $x$ such that there exists a semidirected path from $v$ to $x$ in $N^-$. In this case we say that $x$ descends from $v$.

2.4. Induced networks on subset of taxa

Since later arguments require an understanding of the behavior of the network multi-species coalescent model on a subset of taxa, we introduce some needed definitions.

Definition 7 Let $N^+$ be a (metric or topological) rooted network on $X$ and let $Z \subseteq X$. The induced rooted network $N_Z^+$ on $Z$ is the network obtained from $N^+$ by 1) retaining only edges and nodes in paths from the root to any taxon in $Z$; 2) suppressing all degree two nodes except the root; 3) in the case the root then has outdegree one, contracting the edge incident to the root.

Note that $\mathrm{LSA}(Z,N_Z^+) = \mathrm{LSA}(Z,N^+)$. If $|Z| = 4$ then $N_Z^+$, the induced rooted quartet network on $Z$, will also be denoted by $Q_Z^+$ to emphasize it involves only 4 taxa.

Definition 8 Let $N^+$ be a (metric or topological) rooted network on $X$ and let $Z \subseteq X$. The induced LSA network of $Z$, denoted $N_Z^\oplus$, is the rooted network obtained from $N_Z^+$ by deleting everything above $\mathrm{LSA}(Z,N^+)$.

In particular we note that $N_Z^\oplus$ has root $\mathrm{LSA}(Z,N^+)$. If $|Z| = 4$ then $N_Z^\oplus$, the induced LSA quartet network on $Z$, is also denoted by $Q_Z^\oplus$.

Definition 9 Let G be a semidirected graph and let x,y be two nodes in G. A trek in G from x to y is an ordered pair of semidirected paths (P1,P2) where P1 has terminal node x, P2 has terminal node y, and both P1 and P2 have starting node v. The node v is called the top of the trek, denoted top(P1,P2). A trek (P1,P2) is simple if the only common node among P1 and P2 is v.

This definition is adopted from non-phylogenetic studies of statistical models on graphs, such as [26].

Definition 10 Let $N^-$ be a (metric or topological) unrooted network on $X$ and let $Z \subseteq X$. The induced unrooted network $(N^-)_Z$ on a set of taxa $Z$ is the network obtained from $N^-$ by retaining only edges in simple treks between pairs of taxa in $Z$, and then suppressing all degree two nodes.

Note that it is not immediately clear that for a network $N^+$, the networks $(N^-)_Z$ and $(N_Z^+)^-$ are isomorphic. Proposition 1 shows that the operations of unrooting and inducing a network on a subset of taxa commute. While this statement is intuitively plausible, its rather technical proof is in the Appendix.

Proposition 1 Let $N^+$ be a (metric or topological) rooted network on $X$ and let $Z \subseteq X$. Then $(N^-)_Z$ and $(N_Z^+)^-$ are isomorphic.

If $|Z| = 4$ then $(N^-)_Z$, the induced unrooted quartet network on $Z$, is also denoted by $Q_Z^-$.

2.5. Cycles

Although the networks $N^+$ and $N^-$ are acyclic (in the directed and semidirected settings, respectively), their undirected graphs $U(N^+)$ and $U(N^-)$ may contain cycles. Thus the term ‘cycle’ may be used unambiguously to refer to cycles in the undirected graphs. We formalize this with the following definition:

Definition 11 Let N be a (metric or topological, rooted or unrooted) network. A cycle in N is a non-empty path from a node to itself, allowing edges to be traversed without regard to their possible direction. The size of the cycle is the number of edges in the path. A k-cycle is a cycle of size k.

By contracting or shrinking a cycle C in a graph we mean removing all edges in C and identifying all nodes in C.

3. Structure of level-1 networks

The class of all phylogenetic networks is often too large to obtain strong mathematical results ([25]), so it is common to restrict to networks that have a simpler structure, for instance, the class of level-1 phylogenetic networks.

Definition 12 Let N be a (rooted or unrooted) topological network. If no two cycles in N share an edge, then N is level-1.

If N is a level-1 network, any subnetwork or induced network of N is also level-1.

Given a hybrid node $v$, denote the hybrid edges whose child is $v$ by $h_v$ and $h'_v$. Then $h_v$ and $h'_v$ are called the hybrid edges of $v$.

Lemma 3 Let N be a (topological or metric, rooted or unrooted) level-1 network and let C be a cycle of N. Then C contains exactly one hybrid node $v$, and the associated hybrid edges $h_v, h'_v$. Furthermore, each node of N is in at most one cycle and, as a result, $v$, $h_v$ and $h'_v$ are in exactly one cycle of N.

The proof of each statement of this Lemma, using different terminology, is given by Rossello and Valiente [21].

Proposition 2 Let N+ be a topological level-1 rooted network on X. The structure of all the nodes and edges above LSA(X) in N+ is a (possibly empty) chain of 2-cycles connected by edges, as depicted in Figure 5.

Fig. 5.


In a level-1 network on X, the structure between the root and m = LSA(X) is a chain of 2-cycles. The number of 2-cycles in the chain could be zero.

Proof Let m = LSA(X), and denote by r the root of $N^+$. The proof is by induction on the number of edges above m. If there are no edges above m, then m = r and the result is trivially true. By Lemma 1, one easily sees that there cannot be only 1 or 2 edges above m in a binary phylogenetic network. That is, if there were just 1 edge above m the outdegree of the root would be 1, contradicting the definition of binary phylogenetic network. Suppose there are 2 edges above m. By definition of binary phylogenetic network the outdegree of r is 2 and by definition of LSA(X) all paths from the root to $x \in X$ contain m. Therefore m has indegree 2, contradicting Lemma 1 part (i).

Now assume the claim holds when there are at most k edges above m and suppose there are k + 1 edges above m. Note that r has outdegree 2 by the definition of $N^+$.

Suppose that edges incident to r have different children, x and y. Note neither x nor y can be m. The outdegree of one of x or y must be 2, otherwise both would be hybrid nodes, which would require x above y and y above x. Without loss of generality suppose x has outdegree 2, and denote by e1 and e2 its out edges, and denote by e3 the edge (r,y). Since every path from r to a leaf goes through m, there are at least 3 distinct paths P1, P2, P3 from r to m, where Pi contains ei.

This contradicts the level-1 condition. Thus x = y, and the edges from r form a 2-cycle.

Now since x is a hybrid node, it has outdegree 1, with child v. Of the k + 1 edges above m, the two edges of the 2-cycle and the edge from x to v are not below v, so there are k − 2 edges above m that are also below v. Applying the inductive hypothesis to N+ with edges above v removed, the result follows.

Proposition 2 applied to NZ+ illustrates the structure of the common ancestry of a subset Z of taxa. When we pass to a LSA network or an induced unrooted network, we “throw away” this structure. We show in Section 5 that under the network multi-species coalescent model this structure has no effect on the formation of quartet gene trees.

Let $v$ be a hybrid node in a level-1 (rooted or unrooted, metric or topological) network N on X and let $C_v$ be the cycle containing $v$. By removing the edges of $C_v$ from N we obtain a partition of X according to the connected components of the resulting graph. We refer to this partition as the v-partition and its partition sets as v-blocks.

Note that each node in $C_v$ can be associated to a v-block. That is, a v-block $B_u$ is associated to a node $u$ in $C_v$ if by removing $u$ from the network (and therefore the edges adjacent to $u$), the induced partition of taxa is $\{B_u, X \setminus B_u\}$. We refer to the v-block $B_v$, whose elements descend from $v$, as the v-hybrid block. Two distinct v-blocks $B_u, B_w$ are adjacent if the nodes $u, w \in C_v$ are adjacent.
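The v-partition can be computed by deleting the cycle edges and collecting the taxa in each connected component. The sketch below assumes the network is given as an undirected edge list (edge directions are ignored) and that the cycle has size at least 3, so that no parallel edges occur; the function name and the 5-taxon example with a 4-cycle are hypothetical.

```python
def v_partition(edges, cycle_edges, taxa):
    """Partition `taxa` by the connected components left after removing a cycle.

    edges:       list of (u, w) pairs, treated as undirected
    cycle_edges: the edges of the cycle C_v, in either orientation
    taxa:        set of leaf labels
    """
    removed = {frozenset(e) for e in cycle_edges}
    adjacency = {}
    for u, w in edges:
        adjacency.setdefault(u, set())
        adjacency.setdefault(w, set())
        if frozenset((u, w)) not in removed:
            adjacency[u].add(w)
            adjacency[w].add(u)
    taxa = set(taxa)
    blocks, seen = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search for the component containing `start`.
        component, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in component:
                continue
            component.add(x)
            stack.extend(adjacency[x] - component)
        seen |= component
        if component & taxa:
            blocks.add(frozenset(component & taxa))
    return blocks


# Hypothetical unrooted network: a 4-cycle v-p-q-s with taxa hanging off each node.
NETWORK = [("v", "p"), ("p", "q"), ("q", "s"), ("s", "v"),
           ("v", "a"), ("p", "b"), ("q", "c"), ("s", "t"), ("t", "d"), ("t", "e")]
CYCLE = [("v", "p"), ("p", "q"), ("q", "s"), ("s", "v")]
BLOCKS = v_partition(NETWORK, CYCLE, {"a", "b", "c", "d", "e"})
```

Here the four v-blocks are {a}, {b}, {c} and {d, e}, one for each node of the cycle, and {a} is the v-hybrid block if v is the hybrid node.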

Let $D = \{C_1, \dots, C_n\}$ be a collection of cycles in N. The partition of X obtained by removing all the edges in the cycles of D is the network partition induced by D and its blocks are network blocks induced by D. When D is the set of all cycles in N of size at least k, the partition is the k-network partition and its blocks are k-network blocks. The 4-network blocks play an important role in Section 8. From now on, we refer to removing all edges of a cycle C from a network N as removing the cycle C from N.

The following is straightforward to prove.

Lemma 4 Let N be a level-1 (rooted or unrooted) topological network on X. Let $D = \{C_1, \dots, C_n\}$ be a collection of cycles in N. For any two taxa a and b in different network blocks induced by D, there exists a hybrid node $v$ of some cycle in D such that a and b are in different v-blocks.

If two taxa a and b are in the same network block induced by D, then they are connected when all cycles in D are removed. As a result they are connected when a single cycle in D is removed. This comment together with Lemma 4 yields the following.

Corollary 1 Let N be a level-1 (rooted or unrooted) topological network on X. Let $D = \{C_1, \dots, C_n\}$ be a collection of cycles in N, with $v_i$ the hybrid node associated to $C_i$. The network partition induced by D is the common refinement of the $v_i$-partitions for $1 \leq i \leq n$.

Since contracting cycles in level-1 networks does not introduce loops or multi-edges, we can define a notion of a tree of cycles which is useful for the proof of Theorem 4.

Definition 13 Let N be a topological unrooted level-1 network. Let T be the graph obtained from N by 1) removing all pendant edges, repeatedly, until no pendant edges remain; 2) suppressing all vertices of degree two that are not part of a cycle; 3) contracting each cycle in the network obtained from steps 1 and 2. We refer to T as the tree of cycles of N.

In the tree of cycles of N certain nodes, including all the leaves, represent a cycle of the original network N. The notion of tree of cycles is different from the “tree of blobs” of [8], as there is no deletion of the non-cycle edges in the tree of blobs. In Figure 6 we see an example of a tree of cycles.

Fig. 6.


(Left) A level-1 unrooted network N and (Right) the tree of cycles of N.

4. The network multi-species coalescent model and quartet concordance factors.

Coalescent theory models the formation of gene trees within populations of species. The coalescent model for a single population traces (backwards in time) the ancestries of a finite set of individual copies of a gene as the lineages coalesce to form ancestral lineages (see [28]). The multi-species coalescent (MSC) model is a generalization of the coalescent model, formulated by applying it to multiple populations connected to form a rooted population tree, or species tree. It is commonly used to obtain the probabilities of gene trees in the presence of incomplete lineage sorting.

Meng and Kubatko [15] extended the MSC by introducing phenomena such as hybridization or other horizontal gene transfer across the species-level and Nakhleh et al. further developed it [30, 33]. This model describes any situation in which a gene lineage may “jump” from one population to another at a specific time. The model parameters are specified by a metric binary rooted phylogenetic network as defined in Section 2. Different from models such as the structured coalescent with continuous gene flow (see [28]), the network model approach assumes the gene transfer occurs at a single point in time along hybrid edges. We refer to this extended version of the MSC as the network multi-species coalescent (NMSC) model.

The NMSC model assumes that speciation by hybridization results in what Meng and Kubatko refer to as a mosaic genome. One assumption of the NMSC model, inherited from the MSC model, is that all gene lineages present at a specific point on the species tree behave identically above this point. That is, the probability of any event conditioned on a set of lineages being present at a certain point on the species tree is invariant under permutation of those lineages. This feature is known as the exchangeability property.

Example 1 We illustrate how to compute the probability of a gene tree topology under the NMSC with an example. Suppose we have the rooted metric species network given in Figure 7. Let A,B,C and D be genes sampled from species a,b,c and d respectively. We compute the probability that a gene tree has the unrooted topology ((A,B),(C,D)) under the NMSC model.

Fig. 7.


Two gene trees within a species network with one hybrid node.

First observe that until B and C trace back to the edge with length z there cannot be a coalescent event. In that edge these lineages cannot coalesce if the gene tree ((A,B),(C,D)) is to be formed. The probability of no coalescence on this edge is $e^{-z}$. Now there are 4 cases, illustrated in Figure 8:

Fig. 8.


Cases 1–4 (Left-Right) of Example 1, of how lineages may behave under the NMSC model on the network of Figure 7.

  1) with probability $\gamma^2$, lineages B and C enter the edge of length w;

  2) with probability $(1-\gamma)^2$, B and C enter the edge of length v;

  3) with probability $\gamma(1-\gamma)$, B enters the edge of length w and C enters the edge of length v;

  4) with probability $(1-\gamma)\gamma$, B enters the edge of length v and C enters the edge of length w.

Observe that each case is now reduced to a standard MSC scenario with several samples per population (see [6]). Let $P_i$ be the probability of observing ((A,B),(C,D)) under the MSC of case $i$. Then the probability of observing ((A,B),(C,D)) is $e^{-z}\left(\gamma^2 P_1 + (1-\gamma)^2 P_2 + \gamma(1-\gamma)P_3 + \gamma(1-\gamma)P_4\right)$.
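The final expression of Example 1 is a weighted sum of the four case probabilities, scaled by the probability of no coalescence on the edge of length z. The sketch below evaluates that expression; the function name is hypothetical, and the case probabilities $P_1,\dots,P_4$ are treated as inputs since their values depend on the edge lengths of Figure 7, which are not derived here.

```python
import math


def example1_probability(z, gamma, p1, p2, p3, p4):
    """Evaluate exp(-z) * (g^2 P1 + (1-g)^2 P2 + g(1-g) P3 + (1-g)g P4)."""
    weights = (
        gamma ** 2,             # case 1: B and C both enter the edge of length w
        (1 - gamma) ** 2,       # case 2: B and C both enter the edge of length v
        gamma * (1 - gamma),    # case 3: B enters w, C enters v
        (1 - gamma) * gamma,    # case 4: B enters v, C enters w
    )
    # The four cases are exhaustive, so their weights sum to 1.
    assert math.isclose(sum(weights), 1.0)
    mixture = sum(w * p for w, p in zip(weights, (p1, p2, p3, p4)))
    return math.exp(-z) * mixture
```

For instance, with z = 0 and all four case probabilities equal to 1 the function returns 1 for any hybridization parameter, which reflects that the four weights form a probability distribution.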

Following Solís-Lemus and Ané [23], we are interested in the probability that a species network produces various gene quartets under the NMSC. This motivates the following definition.

Definition 14 Let $N^+$ be a metric rooted network on a taxon set $X$. Let $A,B,C,D$ be genes sampled from species $a,b,c,d$ respectively. Given a gene quartet $AB|CD$, the quartet concordance factor $CF_{AB|CD}$ is the probability under the NMSC on $N^+$ that a gene tree displays the quartet $AB|CD$, and

$$CF_{abcd} = \left(CF_{AB|CD},\ CF_{AC|BD},\ CF_{AD|BC}\right)$$

is the ordered triple of concordance factors of each quartet on the taxa $a,b,c,d$.

When $a,b,c,d$ are clear from context, we write $CF$ for $CF_{abcd}$.

In the particular case where N+ has no hybrid edges, so the network is a tree, it is known that the quartet concordance factors do not depend on the root placement [1]. For example let a,b,c,d be taxa and consider any root placement in the unrooted species tree with topology ab|cd and internal edge of length t. Then

$$CF_{abcd} = \left(1 - \tfrac{2}{3}e^{-t},\ \tfrac{1}{3}e^{-t},\ \tfrac{1}{3}e^{-t}\right). \qquad (1)$$
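Equation (1) is simple enough to evaluate directly. The following sketch (the function name is hypothetical) returns the triple of concordance factors for a species tree with topology ab|cd and internal edge length t in coalescent units.

```python
import math


def tree_quartet_cf(t):
    """Concordance factors (CF_AB|CD, CF_AC|BD, CF_AD|BC) from equation (1)."""
    minor = math.exp(-t) / 3.0          # each discordant quartet
    return (1.0 - 2.0 * minor, minor, minor)
```

The three entries always sum to 1. At t = 0 they are all equal to 1/3, and as t grows the first entry approaches 1.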

As mentioned in [23], for unrooted species networks the concordance factors do not depend on the placement of the root in the species network, as long as the root is placed in a way consistent with the direction of the hybrid edges. This fact is shown in Section 5, as we explore quartet concordance factors more thoroughly.

Definition 15 Let $N^+$ be a metric rooted level-1 network on $X$. Given a set of distinct taxa $\{a,b,c,d\}$, we define the ordering of $CF_{abcd}$ on $N^+$ as the natural decreasing order of $CF_{AB|CD}$, $CF_{AC|BD}$, $CF_{AD|BC}$ in the real line.

For example if t > 0 the ordering of the concordance factors in equation (1) is given by

$$CF_{AB|CD} > CF_{AC|BD} = CF_{AD|BC}.$$

Many arguments towards the main result of this work use the ordering of $CF_{abcd}$, and not its precise values.

5. Computing quartet concordance factors

In this section we show how to express the concordance factors arising on an LSA quartet network as a linear combination of the concordance factors arising on quartet trees, using an approach similar to that of [29]. This enables us to see how the ordering of concordance factors reflects the network topology, and why the precise root location does not matter.

The final results of this section are largely in [23]. However, we provide formal arguments and take into consideration some matters that were left unaddressed. For example, we address the possibility that an induced 4-taxon network does not contain the root of the original network.

Let N+ be a (metric or topological) rooted level-1 network on X and let {a,b,c,d} be a set of distinct taxa of X. Then the induced unrooted network on 4 taxa, Qabcd, is a (metric or topological) unrooted level-1 network. By Proposition 1, Qabcd is the same graph as the unrooted version of the induced rooted network and the unrooted version of the LSA network of Definition 8. Any cycle in the LSA quartet network induces a cycle in the unrooted network Qabcd. A cycle C of size k in the LSA quartet network induces a cycle in the unrooted Qabcd of size either k (when C does not contain LSA(a,b,c,d)) or k − 1 (otherwise). For convenience, when we refer to the size of a cycle C in the LSA quartet network we mean the size of the induced cycle in the unrooted Qabcd.

Lemma 5 Let Qabcd be a metric unrooted level-1 quartet network. The number of k-cycles in Qabcd is 0 for k ≥ 5, at most 1 for k = 4 in which case there is no 3-cycle, and at most 2 for k = 3.

Proof Suppose that Qabcd has a cycle C = Cv of size k. Then there is an associated partition of taxa into k v-blocks. Trivially none of these blocks can be empty, so k ≤ 4.

Suppose that there are two cycles, a cycle C1 of size k1 and C2 of size k2 with ki ≥ 3, i = 1,2. Since Qabcd is level-1, by removing these two cycles we induce a partition of the taxa into at least k1 + k2 − 2 blocks. None of the blocks of this partition can be empty, so k1+k2−2 ≤ 4. Hence there is at most one cycle of size 4 or at most two cycles of size 3. Moreover there cannot be a cycle of size 3 and a cycle of size 4 in the same unrooted quartet network.

Suppose that there are three cycles, a cycle C1 of size k1, C2 of size k2, and C3 of size k3 with ki ≥ 3, i = 1,2,3. By removing these three cycles we induce a partition of the taxa into at least k1+k2+k3−3 blocks, so k1+k2+k3−3 ≤ 4 which is a contradiction since ki ≥ 3.

Our arguments will depend on the number of descendants of the hybrid node of a cycle, so we introduce additional terminology. An n-cycle with exactly k taxa descending from the hybrid node is referred to as an $n_k$-cycle; for instance, a 32-cycle (that is, a $3_2$-cycle) is a 3-cycle with two taxa below its hybrid node. Figure 9 shows the 6 different types of 2-, 3-, and 4-cycles possible in an unrooted quartet network.

Fig. 9.

(Left) The three types of 2-cycles in an unrooted quartet network (21-,22- and a 23-cycle); (Center) The two types of 3-cycles in the unrooted quartet network (31- and a 32-cycle). (Right) The only type of 4-cycle in an unrooted quartet network (a 41-cycle). The dashed lines represent subgraphs that may contain other cycles.

Lemma 6 Let Qabcd be a metric unrooted level-1 quartet network. Then Qabcd cannot have two 32-cycles, or a 22-cycle and a 41-cycle.

Proof Suppose Q=Qabcd has two distinct 32-cycles, Cu and Cv. Suppose Cu has u-hybrid block {a,b} and u-blocks {c} and {d}. If we remove Cu from Q, by the level-1 assumption Cv is in one of the connected components. This implies that 2 of the 3 v-blocks must be contained in one of {a,b}, {c} or {d}. This is only possible if the v-hybrid block is {c,d}, and the other v-blocks are {a} and {b}. Thus Q must be as the network in Figure 10, where u is below v and v is below u, contradicting that Q is induced from a rooted network.

Fig. 10.

A graph with two 32-cycles. Each dashed edge represents a chain of 2-cycles with, possibly, other cycles.

Now suppose that Q has a 4-cycle and a 22-cycle. The 4-cycle induces 4 singleton blocks. By the level-1 condition at least one of the blocks induced by the 22-cycle has to be contained in a singleton block. That is impossible since the blocks induced by the 22-cycle have size 2.

Lemmas 5 and 6 determine all possible topological structures for unrooted quartet networks which are shown in Figure 11.

Fig. 11.

Possible structures for unrooted quartet networks. Every dashed arrow represents a chain of an arbitrary number of 2-cycles, as the one in the bottom of the Figure. The direction of these 2-cycles must be such that the obtained graph is induced from a rooted network.

5.1. Concordance factor formulas for quartet networks

Next we prove a number of “reduction” lemmas relating concordance factors for quartet networks to those for networks with fewer cycles. This allows us to express the network concordance factors as a linear combination of concordance factors of trees. The following observation is useful throughout this section.

Observation 1 Given a rooted metric species quartet network, under the NMSC model the first coalescent event (going backwards in time) determines the unrooted topology of a quartet gene tree.

As illustrated in Figure 12, in passing from a rooted network on X to the rooted induced network $N_Z^+$ on $Z \subset X$, we may find there is a network structure above LSA(Z), a chain of 2-cycles by Proposition 2. A priori, this could have an impact on the behavior of the NMSC model on $N_Z^+$. For quartet concordance factors, however, this additional structure has no impact, and we effectively snip it off. Formally, we have the following.

Fig. 12.

A level-1 rooted network where the root differs from the LSA(a,b,c,d).

Theorem 2 Let N+ be a level-1 rooted metric network on X and let a,b,c,d be distinct taxa of X. Under the NMSC model, CFabcd can be computed from the LSA network Qabcd.

Proof In any realization of the coalescent process, if there are fewer than 4 lineages at LSA(a,b,c,d) in Nabcd+ = Qabcd+, then a coalescent event has occurred below and therefore the unrooted gene tree topology has been determined. Thus we condition on 4 lineages being present at LSA(a,b,c,d).

There are 2 rooted shapes for 4-taxon gene trees, the caterpillar and balanced trees. Regardless of the ancestral chain of 2-cycles above LSA(a,b,c,d), conditioned on one of these shapes, exchangeability of lineages under the coalescent tells us all labeled versions of that specific shape will have equal probability. While the rooted shapes might have different probability, since there is only 1 unrooted shape, all labellings of it must be equally probable. This is the same as if there were no ancestral cycles. Therefore CFabcd(Qabcd)=CFabcd(Qabcd+).

This argument can be modified to apply to 5 taxa, but not 6 or more, since then there is more than 1 unrooted shape.

Let Q=Qabcd be a level-1 LSA quartet network and let Cv be a cycle in Q, with hybrid node v and hybrid edges h1 and h2, where $\gamma = \gamma_{h_1}$. The following notation is used throughout this section:

Q1 denotes the rooted quartet network obtained from Q by removing h2.

Q2 denotes the rooted quartet network obtained from Q by removing h1.

Q0 denotes the rooted quartet network obtained from Q by contracting Cv; if the root of Q is in Cv, the node obtained in the contraction process is the root of Q0.

Note that Qi, for i = 1,2 have degree 2 nodes, and thus are not binary. This does not affect the coalescent process in any way and by suppressing such nodes we obtain a binary LSA network. In a slight abuse of notation, we use Qi to denote both of these networks, as needed in our arguments.

To compute concordance factors we often need to designate how many lineages are present at a hybrid node in a realization of the coalescent process. To handle this formally, given a rooted metric species network N+ on X, we define the random variable Kv to be the number of lineages at node v, where Kv takes values in {1,...,lv}, where lv is the number of taxa below v. We can extend this concept to hybrid nodes in N, since a hybrid node in N induces an orientation of the nodes that are descending from it.

Let Q=Qabcd be a level-1 LSA quartet network and let Cv be a cycle in Q, with hybrid node v, which induces a cycle Cv in the unrooted Qabcd. If Cv has size 2, then 1 ≤ lv ≤ 3; if Cv has size 3, then 1 ≤ lv ≤ 2; and if Cv has size 4, then lv = 1. For example, let Q be the LSA network shown on the left of Figure 14 and let Cv be the cycle in Q. By unrooting Q note that Cv induces a 3-cycle Cv. Note also that the unrooted Q is isomorphic to the network in Figure 18.

Fig. 14.

An LSA quartet network Q with a cycle C that induces a 32-cycle in the unrooted quartet network, and the graphs obtained by deleting everything below the hybrid node, disjointing, and labeling the leaves.

Fig. 18.

An unrooted quartet with a single 32-cycle.

We show that cycles in Qabcd that induce 21-cycles or 23-cycles in Qabcd have no impact on concordance factors. But first we state Propositions 3 and 4, proven in [1], which are useful in arguments to come.

Proposition 3 Let T+ be a binary rooted metric species tree on X. For |X| = 4, the unrooted metric tree T− is identifiable from the unrooted topological gene tree distribution under the multispecies coalescent model on T+, but T+ is not.

Proposition 4 Proposition 3 remains valid when T+ is not binary.

Lemma 7 Let Q=Qabcd be a metric level-1 LSA quartet network and let Cv be a cycle in Q that induces a 21-cycle in Qabcd. Then CF(Q)=CF(Q0).

Proof Let K = Kv. Since Cv induces a 21-cycle in Qabcd, P(K = 1) = 1. Then

$CF(Q) = P(K=1)\,CF(Q \mid K=1) = P(K=1)\,[\gamma\,CF(Q_1 \mid K=1) + (1-\gamma)\,CF(Q_2 \mid K=1)] = \gamma\,CF(Q_1) + (1-\gamma)\,CF(Q_2).$

If the root of Q is not in Cv, no lineages can coalesce on the edges that differ in Q1 and Q2 since there is only one lineage in such edges. Thus,

CF(Q1)=CF(Q2)=CF(Q0),

and the claim is established in this case.

Now suppose the root r of Q is in Cv, and Cv has nodes r, u, v, and edges (r,v), (r,u), (u,v). Without loss of generality suppose that the taxon below v is d. Since u is a tree node it has another descendant y. Note that Q1 and Q2 have the same topology, moreover, they just differ in the edge length from the root to y. Define a random variable K′, by K′ = 1 if there has been a coalescent event before a, b, and c trace back to y and K′ = 0 otherwise. If K′ = 1, the unrooted topology has been determined and thus

$CF(Q_1 \mid K'=1) = CF(Q_2 \mid K'=1) = CF(Q_0 \mid K'=1).$

Also, by Proposition 4,

$CF(Q_1 \mid K'=0) = CF(Q_2 \mid K'=0) = CF(Q_0 \mid K'=0).$

Thus CF(Q)=CF(Q0).

Lemma 8 Let Q=Qabcd be a metric level-1 LSA quartet network and let Cv be a cycle in Q, that induces a 23-cycle in Qabcd. Then CF(Q)=CF(Q0).

Proof Let K = Kv, so K takes values in {1,2,3}. Therefore

CF(Q)=P(K=1)CF(Q|K=1)+P(K=2)CF(Q|K=2)+P(K=3)CF(Q|K=3). (2)

If K = 1 or 2 then at least one coalescent event has occurred, so the unrooted gene tree topology is already determined, and

$CF(Q \mid K=k) = CF(Q_0 \mid K=k)$ for k = 1,2.

The case K = 3 requires more argument. Without loss of generality suppose that the three taxa descending from v are a, b, and c. Denote by D the random variable defined by D=1 if the lineage d is involved in the first coalescent event and D=0 otherwise. Thus

CF(Q|K=3)=P(D=1)CF(Q|K=3,D=1)+P(D=0)CF(Q|K=3,D=0). (3)

If d is in the first coalescent event, by the exchangeability property of the NMSC, a, b or c are equally likely to be the other lineage involved in that event. This is the same as if the cycle were contracted, so

$CF(Q \mid K=3, D=1) = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) = CF(Q_0 \mid K=3, D=1).$

If d is not in the first coalescent event, this event involves only two of a, b, and c, with each pair equally likely by exchangeability. This is also the same as if the cycle were contracted, so

$CF(Q \mid K=3, D=0) = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) = CF(Q_0 \mid K=3, D=0).$

Thus by equations (2) and (3), CF(Q)=CF(Q0).
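The value (1/3, 1/3, 1/3) in this proof can be checked by enumeration. By Observation 1 the first coalescent event fixes the unrooted quartet, and under exchangeability every admissible pair of lineages is equally likely to coalesce first. The following Python sketch counts the outcomes; the helper names `quartet_of` and `exchangeable_cf` are illustrative and are not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations

LINEAGES = "ABCD"

def quartet_of(pair):
    """Unrooted quartet fixed once the two lineages in `pair` coalesce first."""
    rest = "".join(sorted(set(LINEAGES) - set(pair)))
    return "|".join(sorted(["".join(sorted(pair)), rest]))

def exchangeable_cf(condition=lambda pair: True):
    """Quartet probabilities when the first coalescing pair is uniform
    among the pairs allowed by `condition`."""
    pairs = [p for p in combinations(LINEAGES, 2) if condition(p)]
    cf = {"AB|CD": Fraction(0), "AC|BD": Fraction(0), "AD|BC": Fraction(0)}
    for p in pairs:
        cf[quartet_of(p)] += Fraction(1, len(pairs))
    return cf

# D = 1: lineage d takes part in the first coalescent event.
print(exchangeable_cf(lambda p: "D" in p))
# D = 0: the first event involves only two of a, b, c.
print(exchangeable_cf(lambda p: "D" not in p))
```

In both conditional cases each of the three quartets is produced by exactly one of the three admissible pairs, so every entry equals 1/3.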

Together, the preceding Lemmas yield the following.

Corollary 2 Let Q=Qabcd be a metric level-1 LSA quartet network and let Q˜ be the LSA network obtained by contracting all cycles that induce either a 23-cycle or a 21-cycle in Qabcd. Then CF(Q)=CF(Q˜).

While 21- and 23-cycles have no impact on concordance factors, things are not quite so simple for other types of cycles.

Lemma 9 Let Q=Qabcd be a metric level-1 LSA quartet network and let Cv be a cycle in Q that induces a 22-cycle in Qabcd. Then

$CF(Q) = \gamma^2\,CF(Q_1) + (1-\gamma)^2\,CF(Q_2) + 2\gamma(1-\gamma)\,CF(Q_0).$

Proof Let K = Kv with values in {1,2}, so that

CF(Q)=P(K=1)CF(Q|K=1)+P(K=2)CF(Q|K=2).

Suppose the root r of Q is not in Cv, so Cv is also a 22-cycle in Q. Note that

$CF(Q \mid K=2) = \gamma^2\,CF(Q_1 \mid K=2) + (1-\gamma)^2\,CF(Q_2 \mid K=2) + 2\gamma(1-\gamma)\,CF(Q_0 \mid K=2).$

Thus we will express CF(Q | K = 1) in a similar fashion. If K = 1 the gene tree topology has been determined before the lineages enter v. Thus CF(Qi | K = 1) = CF(Q | K = 1) for i ∈ {0,1,2} and

$CF(Q \mid K=1) = \gamma^2\,CF(Q_1 \mid K=1) + (1-\gamma)^2\,CF(Q_2 \mid K=1) + 2\gamma(1-\gamma)\,CF(Q_0 \mid K=1);$ (4)

by summing the result holds when r is not in Cv.

Now suppose that r is in Cv, and Cv has nodes r, v, u. Without loss of generality suppose that the taxa below v are c and d. Since u is a tree node it has another descendant y. Define a random variable Ky to be the number of lineages at y. Note that K and Ky are independent, with values in {1,2}. If either K or Ky is 1, one coalescent event has occurred and the unrooted gene tree topology has been determined so CF(Qi| K = 1 or Ky = 1) are equal for i ∈ {0,1,2}, and

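$CF(Q \mid K=1 \text{ or } K_y=1) = \gamma^2\,CF(Q_1 \mid K=1 \text{ or } K_y=1) + (1-\gamma)^2\,CF(Q_2 \mid K=1 \text{ or } K_y=1) + 2\gamma(1-\gamma)\,CF(Q_0 \mid K=1 \text{ or } K_y=1).$ (5)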

Even though equation (5) is equal to CF(Q0| K = 1 or Ky = 1), we express it in a similar fashion to the claimed result. Now suppose that K and Ky are both 2. Let Tc and Td be the trees shown on Figure 13. Therefore

$CF(Q \mid K=2, K_y=2) = \gamma^2\,CF(Q_1 \mid K=2, K_y=2) + (1-\gamma)^2\,CF(Q_2 \mid K=2, K_y=2) + \gamma(1-\gamma)\,CF(T_c \mid K_y=2) + \gamma(1-\gamma)\,CF(T_d \mid K_y=2).$

Fig. 13.

The two trees Td and Tc in the proof of Lemma 9, obtained when K = 2, Ky = 2 and the lineages c and d trace different hybrid edges.

By Proposition 3, CF(Td | Ky = 2) = CF(Tc | Ky = 2), and in fact they equal CF(Q0 | K = 2, Ky = 2). This is because in Q0 the contraction of the cycle identifies the nodes r, u, and v, so conditioned on K = 2, Ky = 2 we may view the coalescent process on Q0 as that in the 4-taxon tree ((a,b) : l,(c,d) : 0), where l is the length of (u,y). By Proposition 4, CF(Tc | Ky = 2) = CF(Q0 | K = 2, Ky = 2). Therefore

$CF(Q \mid K=2, K_y=2) = \gamma^2\,CF(Q_1 \mid K=2, K_y=2) + (1-\gamma)^2\,CF(Q_2 \mid K=2, K_y=2) + 2\gamma(1-\gamma)\,CF(Q_0 \mid K=2, K_y=2).$

This together with equation (5) implies the claim.

Lemma 10 Let Q=Qabcd be a metric level-1 LSA quartet network and let Cv be a cycle in Q that induces either a 4-cycle or a 31-cycle in Qabcd. Then

$CF(Q) = \gamma\,CF(Q_1) + (1-\gamma)\,CF(Q_2).$

Proof Letting K = Kv, then P(K = 1) = 1. Thus,

$CF(Q) = P(K=1)\,CF(Q \mid K=1) = P(K=1)\,\bigl(\gamma\,CF(Q_1 \mid K=1) + (1-\gamma)\,CF(Q_2 \mid K=1)\bigr) = \gamma\,CF(Q_1) + (1-\gamma)\,CF(Q_2).$
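Lemmas 9 and 10 both express CF(Q) as a convex combination, a fact that is used repeatedly in Section 6. For Lemma 10 the weights γ and 1 − γ plainly sum to 1; for Lemma 9 the check is a one-line expansion:

```latex
\gamma^2 + (1-\gamma)^2 + 2\gamma(1-\gamma)
  = \bigl(\gamma + (1-\gamma)\bigr)^2 = 1,
\qquad \gamma \in [0,1].
```

Since each weight is also nonnegative, CF(Q) lies in the convex hull of CF(Q1), CF(Q2) and CF(Q0).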

It remains to consider a 32-cycle. For this case it helps to introduce new terminology. Let G be a semidirected graph and v be a node in G with indegree 2 and outdegree 0. Let hv and h′v be the edges incident to v and let u and u′ be the parent nodes in hv and h′v respectively. We refer to disjointing hv and h′v from v as the process of 1) deleting v from G; 2) introducing nodes w and w′; 3) introducing directed edges (u,w) and (u′,w′).

Let Q=Qabcd be a metric level-1 LSA quartet network, and Cv a cycle in Q that induces a 32-cycle in Qabcd. Without loss of generality suppose that a and b are the taxa below v. Let Qa be the network obtained from Q by 1) deleting everything below v; 2) disjointing h1 and h2 from v; 3) labeling a leaf that is currently unlabeled by a and the other unlabeled leaf by b. We construct Qb by swapping the labels a and b in Qa. Figure 14 depicts a particular example of this.

Lemma 11 Let Q=Qabcd be a metric level-1 LSA quartet network, let Cv be a cycle in Q that induces a 32-cycle in Qabcd, and let K = Kv. Suppose that the two taxa below v are a and b. Then

$CF(Q) = \gamma^2\,CF(Q_1) + (1-\gamma)^2\,CF(Q_2) + P(K=1)\,2\gamma(1-\gamma)\,CF(Q_0 \mid K=1) + P(K=2)\,\gamma(1-\gamma)\,[CF(Q_a) + CF(Q_b)].$

Proof By hypothesis K takes values in {1,2} and

CF(Q)=P(K=1)CF(Q|K=1)+P(K=2)CF(Q|K=2).

If K = 1 the unrooted tree topology has been determined and CF(Q | K = 1) is given by the expression in equation (4). If K = 2,

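$CF(Q \mid K=2) = \gamma^2\,CF(Q_1 \mid K=2) + (1-\gamma)^2\,CF(Q_2 \mid K=2) + \gamma(1-\gamma)\,CF(Q_a) + \gamma(1-\gamma)\,CF(Q_b).$

Therefore,

$CF(Q) = P(K=1)\,\bigl[\gamma^2\,CF(Q_1 \mid K=1) + (1-\gamma)^2\,CF(Q_2 \mid K=1) + 2\gamma(1-\gamma)\,CF(Q_0 \mid K=1)\bigr] + P(K=2)\,\bigl[\gamma^2\,CF(Q_1 \mid K=2) + (1-\gamma)^2\,CF(Q_2 \mid K=2) + \gamma(1-\gamma)\,CF(Q_a) + \gamma(1-\gamma)\,CF(Q_b)\bigr],$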

which yields the claim.

These Lemmas together imply that concordance factors for rooted quartet networks actually depend only on the unrooted network. This is formalized in the following.

Proposition 5 Let Q=Qabcd and Q˜=Q˜abcd be metric level-1 LSA quartet networks which induce the same unrooted network Qabcd=Q˜abcd. Then CF(Q)=CF(Q˜).

Proof We prove this by induction on the number of cycles in Qabcd. When there are no cycles in Qabcd, Q and Q˜ are trees, and by Proposition 3, CF(Q)=CF(Q˜). Assume now the result is true when there are fewer than k + 1 cycles and that Qabcd has k + 1 cycles. Let Cv be a cycle in Qabcd with hybrid edges h1 and h2. By Lemmas 7, 8, 9, 10, and 11, we can express the concordance factors of Q and Q˜ in terms of networks with one fewer cycle. Note that these networks for Q and Q˜ have the same unrooted metric structure. Thus by the induction hypothesis CF(Q˜i)=CF(Qi), for i = 0,1,2, and therefore CF(Q˜)=CF(Q).

Corollary 3 Let N+ be a level-1 rooted metric network on X and let a,b,c,d be distinct taxa of X. Under the NMSC, CFabcd=CF(Qabcd) can be computed from the unrooted network Qabcd.

We indicate how to compute the concordance factors of an LSA network Qabcd from the unrooted quartet network Q=Qabcd without having to introduce a root. For Q=Qabcd an unrooted metric level-1 quartet network, for which Corollary 3 lets us define CF(Q)=CF(Qabcd), proceed as follows:

  • i) Let Q′ be the graph obtained from Q by contracting all 23- and 21-cycles. By Corollary 2, CF(Q) = CF(Q′). If Q has a 4-cycle go to step (ii), otherwise go to step (iii).

  • ii) By Lemma 5 and Lemma 6 there are no 31-, 32- or 22-cycles in Q, and thus none in Q′. Then Q′ only has a 4-cycle, so apply Lemma 10 to Q′. Since Q1 and Q2 are quartet trees, use the formula in equation (1) to complete the calculation.

  • iii) There are at most two 31-cycles in Q′. Choose one arbitrarily and apply Lemma 10. If Q1 and Q2 still have a 31-cycle, apply Lemma 10 again to Q1 and Q2.

  • iv) We have now expressed the concordance factors of Q in terms of concordance factors of unrooted quartet networks with no 21-, 23-, 31-, or 4-cycles. Apply Lemma 9 to these networks, for instance by choosing a 22-cycle with smallest graph-theoretic distance from its hybrid node to a leaf, repeating until no 2-cycle remains.

  • v) We now have an expression of the concordance factors of Q in terms of concordance factors of unrooted quartet networks with at most one 32-cycle. Apply Lemma 11. Then we have suppressed all cycles, and the concordance factors are now in terms of unrooted quartet trees. The formula of equation (1) completes the calculation.
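The recursive structure of steps (i) to (v) can be sketched in Python under a simplifying assumption: the reductions have already been chosen and recorded as a nested expression, rather than being read off a graph. The node tags `"tree"`, `"lemma10"` and `"lemma9"`, and the functions `tree_cf` and `evaluate`, are hypothetical conventions introduced only for this illustration. Lemma 11 is left out because it additionally requires P(K = 1) and the networks Qa and Qb.

```python
import math

def tree_cf(split, t):
    """Equation (1). `split` selects the displayed quartet:
    0 for AB|CD, 1 for AC|BD, 2 for AD|BC; t is the internal edge length."""
    x = math.exp(-t)
    cf = [x / 3] * 3
    cf[split] = 1 - 2 * x / 3
    return tuple(cf)

# Mixing weights, in the order of the child networks stored in a node.
WEIGHTS = {
    "lemma10": lambda g: (g, 1 - g),                              # Q1, Q2
    "lemma9": lambda g: (g ** 2, (1 - g) ** 2, 2 * g * (1 - g)),  # Q1, Q2, Q0
}

def evaluate(node):
    """Evaluate a reduction expression.
    ("tree", split, t)            -> equation (1)
    ("lemma10", gamma, Q1, Q2)    -> Lemma 10
    ("lemma9", gamma, Q1, Q2, Q0) -> Lemma 9
    """
    kind = node[0]
    if kind == "tree":
        return tree_cf(node[1], node[2])
    if kind not in WEIGHTS:
        raise ValueError(f"unknown node kind: {kind!r}")
    gamma, children = node[1], node[2:]
    weights = WEIGHTS[kind](gamma)
    parts = [evaluate(child) for child in children]
    return tuple(sum(w * p[i] for w, p in zip(weights, parts)) for i in range(3))

# A 4-cycle as in step (ii): Q1 displays AD|BC and Q2 displays AB|CD.
four_cycle = ("lemma10", 0.5, ("tree", 2, 0.3), ("tree", 0, 0.3))
print(evaluate(four_cycle))
```

Because every node is a convex combination of its children, the three entries of the result always sum to 1.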

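The use of these Lemmas and Theorem is illustrated by a few examples.

Example 2 Consider the unrooted quartet network shown in Figure 15. By Lemma 9, with $x_i = e^{-t_i}$, the quartet concordance factors are given by:

$CF_{AB|CD} = (1-\gamma)^2\left(1 - \tfrac{2}{3}x_1x_2x_3\right) + 2\gamma(1-\gamma)\left(1 - \tfrac{2}{3}x_1x_2\right) + \gamma^2\left(1 - \tfrac{2}{3}x_1x_2x_4\right),$
$CF_{AC|BD} = CF_{AD|BC} = (1-\gamma)^2\left(\tfrac{1}{3}x_1x_2x_3\right) + 2\gamma(1-\gamma)\left(\tfrac{1}{3}x_1x_2\right) + \gamma^2\left(\tfrac{1}{3}x_1x_2x_4\right).$ (6)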

Fig. 15.

An unrooted quartet with a single 22-cycle.

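Example 3 Consider the unrooted quartet network shown in Figure 16. By Lemma 10, with $x_i = e^{-t_i}$, the quartet concordance factors are given by:

$CF_{AB|CD} = (1-\gamma)\left(1 - \tfrac{2}{3}x_1\right) + \gamma\left(1 - \tfrac{2}{3}x_1x_2\right),$
$CF_{AC|BD} = CF_{AD|BC} = (1-\gamma)\left(\tfrac{1}{3}x_1\right) + \gamma\left(\tfrac{1}{3}x_1x_2\right).$ (7)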

Fig. 16.

An unrooted quartet with a single 31-cycle.

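Example 4 Consider the unrooted quartet network shown in Figure 17. By Lemma 10, with $x_i = e^{-t_i}$, the quartet concordance factors are given by:

$CF_{AB|CD} = (1-\gamma)\left(1 - \tfrac{2}{3}x_1\right) + \gamma\left(\tfrac{1}{3}x_2\right),$
$CF_{AC|BD} = (1-\gamma)\left(\tfrac{1}{3}x_1\right) + \gamma\left(\tfrac{1}{3}x_2\right),$
$CF_{AD|BC} = (1-\gamma)\left(\tfrac{1}{3}x_1\right) + \gamma\left(1 - \tfrac{2}{3}x_2\right).$ (8)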

Fig. 17.

An unrooted quartet with a single 41-cycle.

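Example 5 Consider the unrooted quartet network shown in Figure 18. Given K = 1, one coalescent event has occurred below the hybrid node, so a and b coalesced. Therefore CF(Q0 | K = 1) = (1,0,0). By Lemma 11, with $x_i = e^{-t_i}$, the quartet concordance factors are given by:

$CF_{AB|CD} = (1-\gamma)^2\left(1 - \tfrac{2}{3}x_1x_2\right) + 2\gamma(1-\gamma)\left(1 - x_1 + \tfrac{1}{3}x_1x_3\right) + \gamma^2\left(1 - \tfrac{2}{3}x_1x_4\right),$
$CF_{AC|BD} = CF_{AD|BC} = (1-\gamma)^2\left(\tfrac{1}{3}x_1x_2\right) + \gamma(1-\gamma)\,x_1\left(1 - \tfrac{1}{3}x_3\right) + \gamma^2\left(\tfrac{1}{3}x_1x_4\right).$ (9)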

The concordance factors in Examples 1–5 agree with those in [23].
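For readers who want to experiment with equations (6) to (9), the following Python sketch transcribes them with $x_i = e^{-t_i}$ passed in directly. The function names (`cf_22`, `cf_31`, `cf_4`, `cf_32`) and the helpers `tree` and `combine` are illustrative, not taken from the paper or from [23]. Exact rational inputs are used so that the entries can be checked to sum to 1 without rounding.

```python
from fractions import Fraction

THIRD = Fraction(1, 3)

def tree(x):
    """Equation (1) with x = exp(-t): quartet tree ab|cd."""
    return (1 - 2 * THIRD * x, THIRD * x, THIRD * x)

def combine(weights, cfs):
    """Weighted sum of concordance factor triples."""
    return tuple(sum(w * cf[i] for w, cf in zip(weights, cfs)) for i in range(3))

def cf_22(g, x1, x2, x3, x4):
    """Equation (6): single 22-cycle (Figure 15)."""
    return combine([(1 - g) ** 2, 2 * g * (1 - g), g ** 2],
                   [tree(x1 * x2 * x3), tree(x1 * x2), tree(x1 * x2 * x4)])

def cf_31(g, x1, x2):
    """Equation (7): single 31-cycle (Figure 16)."""
    return combine([1 - g, g], [tree(x1), tree(x1 * x2)])

def cf_4(g, x1, x2):
    """Equation (8): single 4-cycle (Figure 17)."""
    ad_bc_tree = tree(x2)[::-1]  # (x2/3, x2/3, 1 - 2*x2/3)
    return combine([1 - g, g], [tree(x1), ad_bc_tree])

def cf_32(g, x1, x2, x3, x4):
    """Equation (9): single 32-cycle (Figure 18)."""
    major = ((1 - g) ** 2 * (1 - 2 * THIRD * x1 * x2)
             + 2 * g * (1 - g) * (1 - x1 + THIRD * x1 * x3)
             + g ** 2 * (1 - 2 * THIRD * x1 * x4))
    minor = ((1 - g) ** 2 * THIRD * x1 * x2
             + g * (1 - g) * x1 * (1 - THIRD * x3)
             + g ** 2 * THIRD * x1 * x4)
    return (major, minor, minor)

g = Fraction(1, 3)
xs = [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(4, 5)]
for cf in (cf_22(g, *xs), cf_31(g, xs[0], xs[1]), cf_4(g, xs[0], xs[1]), cf_32(g, *xs)):
    print(cf, sum(cf) == 1)
```

Floating-point inputs also work, since `Fraction` mixes with `float` in arithmetic, at the cost of exactness.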

6. The Cycle property

In this section we focus on the ordering by magnitude of the concordance factors.

Proposition 6 Let Q=Qabcd be a metric unrooted level-1 quartet network with no 32-cycle. The ordering of CFabcd(Q) is the ordering of CFabcd(Q′) where Q′ is obtained from Q by contracting all 2-cycles and all 31-cycles.

Proof By Corollary 2, CF(Q) = CF(Q˜), where Q˜ is obtained from Q by contracting all 21- and 23-cycles. Therefore we can assume Q has no 21- or 23-cycles. If Q has a 4-cycle, it has no 31- and no 22-cycles and the claim is established.

Now suppose Q has only 22-cycles and 31-cycles. We proceed by induction on the number of cycles, with the base case of 0 cycles trivial. Assume the result is true for unrooted quartet networks with k 31- and 22-cycles and suppose Q has k + 1. Picking one cycle and applying one of Lemmas 9 or 10 to Q, we can express the concordance factors of Q as a convex combination of CF(Q0), CF(Q1) and CF(Q2). Note that Q0, Q1 and Q2 have the same topology and, by the induction hypothesis, CF(Q0), CF(Q1) and CF(Q2) have the same ordering as the concordance factors of Q0′, Q1′ and Q2′ respectively, the networks obtained after contracting all 22- and 31-cycles from Q0, Q1 and Q2. Since Q0′, Q1′, Q2′ and Q′ are trees with the same topology, their concordance factors have the same ordering by equation (1). Thus CF(Q0), CF(Q1) and CF(Q2) have the same ordering, and therefore so does CF(Q).

One consequence of Proposition 6 is that for any unrooted metric level-1 quartet network Q without a 32- or a 4-cycle, the ordering of the concordance factors is the same as the ordering of the concordance factors of a quartet tree. That is, the two smallest elements of the concordance factors are equal. When this happens we say that Q is treelike, since we could use equations (1) to find a quartet tree with appropriate edge lengths and concordance factors equal to CF(Q). However, not all unrooted quartet networks are treelike.

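Example 6 Let Qabcd be the unrooted 32-cycle quartet in Figure 18, where $\gamma = \tfrac{1}{2}$, $t_1 = -\log(\tfrac{6}{7})$, $t_2 = -\log(\tfrac{6}{7})$, $t_3 = -\log(\tfrac{1}{14})$ and $t_4 = -\log(\tfrac{13}{14})$. By the equations in (9) we observe that the concordance factors are:

$CF_{AB|CD} = \tfrac{32}{98},\quad CF_{AC|BD} = \tfrac{33}{98},\quad CF_{AD|BC} = \tfrac{33}{98}.$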

The fact that such a quartet network can be not treelike was identified in [24], where it was pointed out that this may cause species tree methods not to be robust to the presence of gene flow.

This motivates the following definition.

Definition 16 Let N+ be a metric rooted level-1 network on X. We say that a set of four distinct taxa s = {a,b,c,d} satisfies the Cycle property if Qs is not treelike, that is, if the two smallest values of $CF_s = CF(Q_s)$ are not equal.

The Cycle property is best understood geometrically. Denote by $\Delta^2$ the 2-dimensional probability simplex, the set of points in $\mathbb{R}^3$ with nonnegative entries adding to 1. Observe that $CF_{abcd} \in \Delta^2$ for any distinct taxa a,b,c,d. Figure 19 (left) depicts the simplex where the black lines are the points where the Cycle property is not satisfied; that is, the treelike unrooted quartet networks are those with concordance factors (x,y,z) satisfying x > 1/3, y = z or y > 1/3, x = z or z > 1/3, x = y. All points off these segments satisfy the Cycle property. For simplicity in arguments to come, note that we can interpret concordance factors, $CF_{abcd}$, as a function that depends on a metric network on {a,b,c,d} and takes values in $\Delta^2$.
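Definition 16 translates directly into a test on a concordance factor triple. The sketch below is a minimal Python version; the function name `satisfies_cycle_property` and the tolerance argument `tol` are assumptions made here. The tolerance is only a convenience for inexact inputs and is not part of the definition.

```python
from fractions import Fraction

def satisfies_cycle_property(cf, tol=0):
    """Definition 16: True when the two smallest concordance factors differ
    (by more than tol)."""
    smallest, second, _ = sorted(cf)
    return abs(second - smallest) > tol

# Example 6: the smallest value 32/98 is not tied, so the network is not treelike.
example6 = (Fraction(32, 98), Fraction(33, 98), Fraction(33, 98))
print(satisfies_cycle_property(example6))

# A quartet tree with exp(-t) = 3/4: the two smallest values are tied.
tree_like = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))
print(satisfies_cycle_property(tree_like))
```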

Fig. 19.

On the left a planar projection of the simplex $\Delta^2$, where the black lines represent concordance factors that are treelike. In the center, the gray segments in $\Delta^2$ represent all the concordance factors arising from unrooted quartet networks with a 32-cycle. On the right, the black lines represent the variety $V((x-z)(y-z)(x-y),\ x+y+z-1)$; these are all concordance factors not satisfying the BC property of Definition 17.

Proposition 7 Let Q=Qabcd be a metric unrooted level-1 quartet network with a 32-cycle. Then CF(Q) lies in the set I defined by x > 1/6, y = z or y > 1/6, x = z or z > 1/6, x = y, shown in the middle of Figure 19. Furthermore, for any point (x,y,z) in this set there is such a Q with (x,y,z) = CF(Q).

Proof Let s = {a,b,c,d} be a set of four distinct taxa and suppose that Qs contains only a 32-cycle, as in Figure 18. Then CF(Qs) is given by the equations (9) with $x_i = e^{-t_i}$, and in particular $CF_{AC|BD} = CF_{AD|BC}$. To maximize $CF_{AD|BC}$ in (9), let ti → 0 for i ∈ {1,2,4} and t3 → ∞ to obtain a quadratic polynomial in γ,

$CF_{AD|BC} \to \tfrac{1}{3}(1-\gamma)^2 + \gamma(1-\gamma) + \tfrac{1}{3}\gamma^2,$

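whose maximum value is 5/12, attained at γ = 1/2. For these values, we obtain $CF(Q_s) \to \left(\tfrac{2}{12}, \tfrac{5}{12}, \tfrac{5}{12}\right)$. To minimize $CF_{AD|BC}$ it is enough to let t1 → ∞, since every term of $CF_{AD|BC}$ in (9) contains the factor $x_1$, so that $CF(Q_s) \to (1,0,0)$.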
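The maximization of the quadratic in γ can be made explicit. Writing q(γ) for the limiting value of the smaller concordance factors (the symbol q is introduced here only for this computation):

```latex
\begin{aligned}
q(\gamma) &= \tfrac{1}{3}(1-\gamma)^2 + \gamma(1-\gamma) + \tfrac{1}{3}\gamma^2
           = \tfrac{1}{3}\bigl(1 + \gamma - \gamma^2\bigr),\\
q'(\gamma) &= \tfrac{1}{3}(1 - 2\gamma) = 0 \iff \gamma = \tfrac{1}{2},
\qquad q\bigl(\tfrac{1}{2}\bigr) = \tfrac{1}{3}\cdot\tfrac{5}{4} = \tfrac{5}{12}.
\end{aligned}
```

Since the three concordance factors sum to 1, the remaining entry tends to 1 − 2 · 5/12 = 2/12.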

Let L be the open line segment with endpoints (1,0,0) and (2/12, 5/12, 5/12). Since CF(Qs) is continuous in ti and γ, its image is a connected set on the line (x,y,y) containing points arbitrarily close to the endpoints of L. Thus the image of CF(Qs) is L. Permuting taxon names shows every point in the set I is a concordance factor for a network with a 32-cycle.

Now suppose Qs has a 32-cycle with a,b descending from the hybrid node, and possibly other cycles. We may contract all 21- and 23-cycles by Corollary 2 without affecting CF(Qs). By Lemmas 9 and 10, we may suppress 22- and 31-cycles by expressing CF(Qs) as a convex sum of concordance factors of networks with a 32-cycle, but one fewer cycle. Thus CF(Qs) is a convex sum of points in L, which lies in L.

In the supplementary materials of [23] it is stated that an unrooted quartet network Qabcd with a 32-cycle can always be reduced to an unrooted quartet tree with some adjustment in the edge lengths. This is not true in general; that is, when {a,b,c,d} satisfies the Cycle property it is not treelike. However, Proposition 7 indicates that sometimes unrooted quartet networks with 32-cycles are treelike.

To conclude this section, we show the Cycle property can give positive information about a network.

Proposition 8 Let Qs be an unrooted level-1 quartet network on a set of taxa s = {a,b,c,d}. If s satisfies the Cycle property, the unrooted quartet network Qs contains either a 32-cycle or a 4-cycle.

Proof Proposition 6 shows that if Qs has neither a 32-cycle nor a 4-cycle, the concordance factors of Qs are those of a tree.

7. The Big Cycle property

In this section we investigate how to detect 4-cycles in a network from quartet concordance factors.

Even though the Cycle property gives us some information about an unrooted quartet network, it is not sufficient to tell us what the unrooted quartet network is. This is shown by the following Example, where a 4-cycle network leads to the same concordance factor values as those in Example 6.

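Example 7 Let Q˜abcd be the 4-cycle unrooted quartet in Figure 17, where $\gamma = \tfrac{1}{2}$ and $t_1 = t_2 = -\log(\tfrac{48}{49})$. By the equations in (8) the concordance factors are:

$CF_{AB|CD} = \tfrac{33}{98},\quad CF_{AC|BD} = \tfrac{32}{98},\quad CF_{AD|BC} = \tfrac{33}{98}.$

Up to a permutation of the taxon labels, these agree with those of Qabcd in Example 6.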
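The comparison between Examples 6 and 7 can be reproduced with exact arithmetic. The Python sketch below implements equation (8) only (the function name `cf_4cycle` is an illustrative choice) and compares the resulting triple with the values reported in Example 6 as sorted lists, since the point of the comparison is which values occur rather than which quartet label carries them.

```python
from fractions import Fraction

def cf_4cycle(gamma, x1, x2):
    """Equation (8) with x_i = exp(-t_i), ordered (AB|CD, AC|BD, AD|BC)."""
    third = Fraction(1, 3)
    return (
        (1 - gamma) * (1 - 2 * third * x1) + gamma * third * x2,
        (1 - gamma) * third * x1 + gamma * third * x2,
        (1 - gamma) * third * x1 + gamma * (1 - 2 * third * x2),
    )

# Example 7: gamma = 1/2 and exp(-t1) = exp(-t2) = 48/49.
example7 = cf_4cycle(Fraction(1, 2), Fraction(48, 49), Fraction(48, 49))
# Values reported for the 32-cycle network of Example 6.
example6 = (Fraction(32, 98), Fraction(33, 98), Fraction(33, 98))

print(example7)
print(sorted(example7) == sorted(example6))  # prints True
```

Both networks therefore produce one value 32/98 and two values 33/98, so the Cycle property alone cannot separate a 32-cycle from a 4-cycle.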

This motivates the following definition.

Definition 17 Let N+ be a metric rooted level-1 network on X. We say that a subset of four distinct taxa {a,b,c,d} ⊂ X satisfies the Big Cycle property (denoted BC) if all the entries of CFabcd are different.

Let {a,b,c,d} be a subset of taxa satisfying the BC property. Denote by $q^{BC}_{abcd}$ the unrooted quartet corresponding to the smallest entry of $CF_{abcd}$.

For example, if $CF_{AB|CD} < CF_{AC|BD} < CF_{AD|BC}$, then $q^{BC}_{abcd} = AB|CD$.
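Definition 17 and the quartet associated with the smallest entry can be sketched as a single Python function. The name `bc_quartet` and the tolerance `tol` are assumptions made for this illustration; with exact concordance factors the tolerance can stay at 0.

```python
QUARTETS = ("AB|CD", "AC|BD", "AD|BC")

def bc_quartet(cf, tol=0.0):
    """Return the quartet with the smallest CF when all three entries are
    pairwise distinct (BC property, Definition 17); otherwise return None."""
    a, b, c = cf
    if min(abs(a - b), abs(a - c), abs(b - c)) <= tol:
        return None  # a tie: the BC property fails
    return QUARTETS[min(range(3), key=lambda i: cf[i])]

print(bc_quartet((0.2, 0.3, 0.5)))    # all distinct, smallest is the first entry
print(bc_quartet((0.4, 0.3, 0.3)))    # tie between the last two entries
```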

Note that if s satisfies the BC property then s satisfies the Cycle property but the Cycle property is weaker than the Big Cycle property.

Proposition 9 Let Qs be an unrooted level-1 quartet network on a set of taxa s = {a,b,c,d}. If s satisfies the BC property, then the unrooted quartet network Qs contains a 4-cycle.

Proof By Proposition 8, Qs contains either a 32-cycle or a 4-cycle, and by Proposition 7, Qs cannot have a 32-cycle.

A converse of Proposition 9 also holds, provided we include an assumption of generic parameters.

Proposition 10 Let N+ be a metric rooted level-1 network on X with |X| ≥ 4. Let {a,b,c,d} ⊂ X be such that Qabcd has a 4-cycle. Then {a,b,c,d} satisfies the Cycle property. Moreover, for generic numerical parameters on N+, {a,b,c,d} satisfies the BC property. That is, for all numerical parameters except those in a set of measure zero, the BC property holds.

Proof Let s = {a,b,c,d} ⊂ X be such that Qs has a 4-cycle. Without loss of generality suppose that c is the descendant of the hybrid node and the hybrid block {c} of Qs is adjacent to the v-blocks containing b and d. Since N is level-1, the only other possible cycles in Qs are 21- or 23-cycles. By Corollary 2, CF(Qs)=CF(Q′), where Q′ is the network obtained after contracting all cycles other than the 4-cycle. Note that Q′ is the network shown in Figure 17, and by equations (8), CF(Q′) depends only on the lengths of the non-hybrid edges in the 4-cycle and the γ parameter of the hybrid edges of Qs. Moreover, equations (8) show that {a,b,c,d} satisfies the Cycle property.

When Qs is obtained from N, the lengths of the edges of Qs are sums of edge lengths from N. Let $\Theta_N = (0,\infty)^m \times [0,1]^h$ be the numerical parameter space for N and let $\Theta_s = (0,\infty)^2 \times [0,1]$. Thus we can define a map $\nu_s : \Theta_N \to \Theta_s$ such that for any metric (λ,γ) of N, $\nu_s((\lambda,\gamma))$ encodes the edge lengths of the non-hybrid edges in the 4-cycle and the γ parameter of the hybrid edges. In particular this map is linear and surjective.

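With $\chi_s = (0,1)^2 \times [0,1]$, let $\eta : \Theta_s \to \chi_s$ be defined as $\eta(l_1,l_2,\gamma) = (e^{-l_1}, e^{-l_2}, \gamma)$, so η is a biholomorphic function. Defining $f : \chi_s \to \Delta^2$ by $f(L_1,L_2,\gamma) = (1-\gamma)\left(1 - \tfrac{2}{3}L_1,\ \tfrac{1}{3}L_1,\ \tfrac{1}{3}L_1\right) + \gamma\left(\tfrac{1}{3}L_2,\ \tfrac{1}{3}L_2,\ 1 - \tfrac{2}{3}L_2\right)$, the quartet concordance factor map can be viewed as a composition

$\Theta_N \xrightarrow{\ \nu_s\ } \Theta_s \xrightarrow{\ \eta\ } \chi_s \xrightarrow{\ f\ } \Delta^2.$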

It is straightforward to see that the image of f restricted to γ = 0 and γ = 1 is the red (skewed) and blue (vertical) segments shown on the right of Figure 20.

Fig. 20.

The function f maps the cube χs (left) to $\Delta^2$ (right). The blue facets (rear and top) of the cube are mapped by f to the blue (vertical) segment and the red facets (bottom and right) to the red (skewed) segment. The full cube is mapped onto the shaded triangle with all the concordance factors displayed by a network with a 4-cycle. The three line segments, two on the boundary of and one within the shaded triangle, consist of points not satisfying the BC property.

Let $V = V((x-z)(y-z)(x-y),\ x+y+z-1)$, that is, let V be the algebraic variety composed of the points on which $(x-z)(y-z)(x-y)$ and $x+y+z-1$ are zero, as depicted on the right of Figure 19. Observe that V consists of the points in $\Delta^2$ that, if interpreted as concordance factors, would not satisfy the BC property.

Since f is a polynomial map whose image is not contained in V , the preimage of V under f is contained in a proper sub-variety of χs, and therefore f−1(V ) has measure zero in χs. Since η is biholomorphic, then η−1(f−1(V )) has measure zero. Since ν is linear surjective, then ν−1(η−1(f−1(V ))) has measure zero. Thus generic points in ΘN are mapped to concordance factors satisfying the BC property.

To better understand the geometry of the map f in this proof, let s = {a,b,c,d} be a subset of four distinct taxa satisfying the BC property. Figure 20 depicts the subset of χs that is mapped by f to those segments of the shaded triangle inside $\Delta^2$. The interior of χs is mapped to the interior of the shaded triangle.
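The genericity statement can be explored numerically. The following Python sketch implements η and f as defined in the proof and samples parameters at random; the sampling ranges, the seed, the tolerance and the helper `all_distinct` are arbitrary choices made for this illustration, and the experiment is a sanity check rather than a proof.

```python
import math
import random

def eta(l1, l2, gamma):
    """eta(l1, l2, gamma) = (exp(-l1), exp(-l2), gamma)."""
    return (math.exp(-l1), math.exp(-l2), gamma)

def f(L1, L2, gamma):
    """Concordance factors of the 4-cycle network, ordered (AB|CD, AC|BD, AD|BC)."""
    first = (1 - 2 * L1 / 3, L1 / 3, L1 / 3)
    second = (L2 / 3, L2 / 3, 1 - 2 * L2 / 3)
    return tuple((1 - gamma) * p + gamma * q for p, q in zip(first, second))

def all_distinct(cf, tol=1e-9):
    """BC property up to a numerical tolerance."""
    a, b, c = cf
    return min(abs(a - b), abs(a - c), abs(b - c)) > tol

rng = random.Random(0)
samples = [
    f(*eta(rng.uniform(0.01, 3.0), rng.uniform(0.01, 3.0), rng.uniform(0.01, 0.99)))
    for _ in range(1000)
]
share = sum(all_distinct(cf) for cf in samples) / len(samples)
print(share)

# A non-generic point: L1 = L2 and gamma = 1/2 give CF_AB|CD = CF_AD|BC.
print(all_distinct(f(48 / 49, 48 / 49, 0.5)))
```

Ties occur only on a lower-dimensional subset of the cube, for instance where (1 − γ)(1 − L1) = γ(1 − L2), which is why randomly sampled points are expected to satisfy the BC property.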

The following Theorem follows immediately from Proposition 10 and Proposition 9.

Theorem 3 Let N+ be a metric rooted level-1 network on X with |X| ≥ 4 and {a,b,c,d} ⊂ X. For generic numerical parameters, {a,b,c,d} satisfies the BC property if and only if Qabcd has a 4-cycle.

Theorem 3 and Proposition 8, yield the following.

Corollary 4 Let N be a metric unrooted level-1 network on X and let s = {a,b,c,d} be a set of distinct taxa in X. Then if s satisfies the Cycle property but not the BC property for generic parameters, then Qs contains a 32-cycle.

The converse of Corollary 4 does not hold, as pointed out by Proposition 7.

If a set of 4 taxa satisfies the BC property, we can deduce some finer information about the 4-cycle in the unrooted quartet network and in a larger network, as proved in the following.

Proposition 11 Let N be a metric unrooted level-1 network on X and let {a,b,c,d} ⊆ X satisfy the BC property, so Qabcd contains a 4-cycle Cv. Then q^BC_abcd = AC|BD if and only if the v-blocks of Qabcd containing a and c are not adjacent.

Proof Let Q = Qabcd. Since N is level-1, the only possible cycles in Q, other than Cv, are 2₁- and 2₃-cycles. Let Q′ be the network obtained after contracting all 2₁- and 2₃-cycles, so Q′ has only a 4-cycle. By Corollary 2, CF(Q) = CF(Q′). Example 4 shows that if the v-blocks of Qabcd containing a and c are not adjacent then q^BC_abcd = AC|BD. Interchanging taxon labels in this example shows that when q^BC_abcd = AC|BD, the v-blocks containing a and c are not adjacent.

Lemma 12 Let N be a metric unrooted level-1 network on X with generic numerical parameters. There exists {a,b,c,d} ⊆ X satisfying the BC property if and only if N contains a cycle Cv of size k ≥ 4 such that one of these taxa is in the hybrid block and the others are in distinct v-blocks of N.

Proof Suppose that N has a cycle of size k for some k ≥ 4 with hybrid node v. Choose four taxa {a,b,c,d} such that a is in the hybrid block and a, b, c and d are in distinct v-blocks. This set of taxa induces an unrooted quartet network with a 4-cycle, and so by Theorem 3 this set of taxa satisfies the BC property for generic parameters. Suppose conversely that there exists {a,b,c,d} satisfying the BC property. By Theorem 3, Qabcd has a 4-cycle, so N has a cycle of size at least four and one of these taxa is a descendant of the hybrid node. Since the other taxa are in distinct v-blocks of Qabcd, they must be in distinct v-blocks of N.

For a level-1 metric unrooted network N, let S be the collection of sets of 4 distinct taxa satisfying the BC property and VH be the set of hybrid nodes. We observe that there is a natural map ψ : S → VH, where for any s ∈ S, ψ(s) = v if v is the hybrid node associated to the cycle of size 4 in Qs. In this case we say that s determines the hybrid node v.

Lemma 13 Let N be a metric unrooted level-1 network and let {a,b,c,d} and {a,b,c,e} be subsets of the taxa satisfying the BC property. The set {a,b,c,d} determines v if and only if {a,b,c,e} determines v.

Proof Let {a,b,c,d} determine v, {a,b,c,e} determine u, and suppose that u ≠ v. Let Cv and Cu be the cycles in N containing v and u respectively, so Cu and Cv do not share edges. Since {a,b,c,d} satisfies the BC property, by Lemma 12, a, b, c, and d belong to different v-blocks, so that in N\E(Cv) the taxa a, b and c are in different connected components. Since N is level-1, Cu is in one of the connected components of N\E(Cv), say K. In particular, note that all the taxa not in K are in the same u-block. But at least two of a, b and c are not in K, so at least two of a, b and c are in the same u-block. This contradicts Lemma 12, so u = v.

Interestingly, under the NMSC the ordering of quartet concordance factors is insufficient to identify the hybrid node of cycles of size 4. For example, the networks shown in Figure 21 all have the same ordering of their concordance factors despite different hybrid nodes. The concordance factors for all those networks have the same values:

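$$
\begin{aligned}
CF_{AB|CD} &= (1-\gamma)\left(1-\tfrac{2}{3}e^{-t_1}\right)+\gamma\left(\tfrac{1}{3}e^{-t_2}\right),\\
CF_{AC|BD} &= (1-\gamma)\left(\tfrac{1}{3}e^{-t_1}\right)+\gamma\left(\tfrac{1}{3}e^{-t_2}\right),\\
CF_{AD|BC} &= (1-\gamma)\left(\tfrac{1}{3}e^{-t_1}\right)+\gamma\left(1-\tfrac{2}{3}e^{-t_2}\right).
\end{aligned}
$$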
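The three expressions above can be evaluated directly. The sketch below is only an illustration: the function name `four_cycle_cfs` and the parameter values are arbitrary choices, and γ is assumed to lie strictly between 0 and 1 with t1, t2 > 0. Under those assumptions the three values sum to 1 and the entry for AC|BD is the strictly smallest one, whichever cycle node is the hybrid.

```python
import math


def four_cycle_cfs(gamma, t1, t2):
    """Evaluate the three concordance factors given by the formulas above.

    gamma is the inheritance probability (assumed 0 < gamma < 1); t1 and t2
    are the branch-length parameters appearing in the formulas (assumed > 0).
    """
    e1, e2 = math.exp(-t1), math.exp(-t2)
    return {
        "AB|CD": (1 - gamma) * (1 - 2 * e1 / 3) + gamma * (e2 / 3),
        "AC|BD": (1 - gamma) * (e1 / 3) + gamma * (e2 / 3),
        "AD|BC": (1 - gamma) * (e1 / 3) + gamma * (1 - 2 * e2 / 3),
    }


# Arbitrary example values, chosen only for illustration.
cfs = four_cycle_cfs(gamma=0.3, t1=0.5, t2=1.2)
smallest = min(cfs, key=cfs.get)
print(cfs, smallest)
```

Since the same three numbers arise for every placement of the hybrid node in Figure 21, no function of these values alone can distinguish the four networks.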

Fig. 21.

Four unrooted metric level-1 quartet networks with the same concordance factors.

Figure 22 shows the 4-cycle network topologies drawn in the regions of Δ² which their concordance factors fill. In each case it does not matter which of the cycle nodes is the hybrid node; all those unrooted quartet networks define concordance factors that fill that region.

Fig. 22.

Each section of the simplex is depicted with an unrooted quartet network topology whose image under the concordance factor map fills that region, independent of the placement of the hybrid node.

8. Identifying cycles in networks

Having shown that the BC property can detect the existence of 4-cycles in networks, for generic parameters, we are poised to prove our main result. Our arguments now are mainly combinatorial.

Given a network N+ on X, let S denote the set of 4-taxon subsets of X satisfying the BC property. Recall that for an unrooted level-1 network N on X, the 4-network partition is the partition of X according to the connected components of the graph obtained after removing all cycles of size at least 4 from N. Recall also that the blocks of this partition are referred to as 4-network blocks.

Lemma 14 Let N+ be a metric rooted level-1 network on X. Then under the NMSC model with generic parameters the 4-network blocks of N+ can be determined from the set S.

Proof If |X| ≤ 3 there is nothing to prove. The case |X| = 4 follows from Proposition 9, so we assume |X| ≥ 5. By Lemma 12, for any {a,b,c,d} ∈ S each taxon a, b, c, d must belong to a different 4-network block. Let

$$
Y_a = \bigcup_{\{s \in S \,\mid\, a \in s\}} s \setminus \{a\}.
$$

Then Ya is the complement of the 4-network block containing a. To see this, note that for any taxon b that does not belong to the 4-network block of a, by Lemma 4, there exists a cycle Cv of size at least 4 such that a and b are in different v-blocks. Now choose any two different taxa c and d such that a, b, c, d are all in different v-blocks and one of a, b, c or d is in the v-hybrid block. Then {a,b,c,d} ∈ S, and thus b ∈ Ya.

It follows that X \ Yx is the 4-network block containing taxon x. Since x was arbitrary, all 4-network blocks can be determined.
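The construction in this proof is directly computable once S is known. The sketch below is a toy illustration only: `toy_bc_quartets` is a hypothetical stand-in for real data that builds S for a network with a single cycle by applying the characterization of Lemma 12, and `four_network_blocks` then recovers each block as the complement of Ya. All names and the example taxa are assumptions made for the illustration.

```python
from itertools import combinations, product


def toy_bc_quartets(v_blocks, hybrid_block):
    """Build S for a hypothetical network with a single cycle of size >= 4.

    Following Lemma 12, a 4-taxon set is taken to satisfy the BC property
    exactly when its taxa lie in distinct v-blocks and one of them lies in
    the v-hybrid block. This generator is only a stand-in for real data.
    """
    others = [b for b in v_blocks if b != hybrid_block]
    S = set()
    for trio in combinations(others, 3):
        # Pick one taxon from the hybrid block and one from each other block.
        for pick in product(hybrid_block, *trio):
            S.add(frozenset(pick))
    return S


def four_network_blocks(taxa, S):
    """Recover the 4-network blocks as X minus Y_a for each taxon a."""
    taxa = frozenset(taxa)
    blocks = set()
    for a in taxa:
        y_a = set()
        for s in S:
            if a in s:
                y_a |= s - {a}  # union of s \ {a} over all s in S containing a
        blocks.add(taxa - y_a)
    return blocks


# Toy 5-cycle whose v-blocks are {a}, {b}, {c1, c2}, {d}, {e}; {a} is hybrid.
v_blocks = [frozenset({"a"}), frozenset({"b"}), frozenset({"c1", "c2"}),
            frozenset({"d"}), frozenset({"e"})]
S = toy_bc_quartets(v_blocks, hybrid_block=v_blocks[0])
recovered = four_network_blocks(set().union(*v_blocks), S)
print(recovered)
```

In this single-cycle toy case every v-block is also a 4-network block, so the recovered partition coincides with the list of v-blocks.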

Lemma 15 Let N+ be a metric rooted level-1 network on X with cycle Cv of size kv ≥ 4. Then for generic parameter choices, the v-blocks and the size kv can be identified from the set S. If kv ≥ 5 the v-hybrid block can also be identified.

Proof Let {a,b,c,d} ∈ S and let v be the hybrid node determined by it. By Lemma 12, each of these taxa belongs to a different v-block, and hence to a different 4-network block. Denote by A,B,C,D the v-blocks containing a,b,c and d respectively.

Let Zabc be the set of all taxa e such that {a,b,c,e} ∈ S. By Lemma 13, all such {a,b,c,e} ∈ S determine the same hybrid node v. Consider now Zbcd, Zacd and Zabd. If kv = 4, then, by the last statement of Lemma 12, Zabc = D, Zbcd = A, Zacd = B and Zabd = C, so all pairwise intersections of Zabc, Zbcd, Zacd, Zabd are empty. If kv > 4, then, again by Lemma 12, for some distinct taxa i,j,k ∈ {a,b,c,d}, Zijk is the v-hybrid block, and for any l,m,n ∈ {a,b,c,d} with {l,m,n} ≠ {i,j,k}, Zlmn = (L ∪ M ∪ N)^c. Note that Zijk ∩ Zlmn = ∅ since one of L, M, N is the v-hybrid block. Since Zlmn contains at least one v-block other than A, B, C or D, for any l′,m′,n′ ∈ {a,b,c,d} with {l′,m′,n′} ≠ {i,j,k}, Zlmn ∩ Zl′m′n′ ≠ ∅. Hence we can determine whether kv > 4 or kv = 4: if all pairwise intersections of Zabc, Zbcd, Zacd, Zabd are empty then kv = 4, else kv > 4. If kv > 4 we can determine the hybrid block by noting which of the sets Zabc, Zbcd, Zacd, Zabd has empty intersection with every other set in this family. At this point we have determined either that kv = 4 and all v-blocks, or that kv > 4 and the hybrid block.

In the case kv > 4, without loss of generality, suppose that A is the v-hybrid block. Let y ∉ Zabc = (A ∪ B ∪ C)^c, so y is in one of A, B or C. For some u,w ∈ {a,b,c}, s′ = {y,u,w,d} ∈ S, which shows that y and the taxon g ∈ {a,b,c} \ {u,w} are in the same v-block. Thus we can determine A, B and C.

Note that for any taxon x that is not in A, B or C, s = {a,x,b,c} ∈ S. Since s determines v, following the steps of the last paragraph identifies the v-block that contains x. Therefore all v-blocks can be determined, and thus kv as well.
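The first step of this proof, deciding between kv = 4 and kv > 4 and locating the v-hybrid block in the latter case, can be sketched as follows. This is a toy illustration under stated assumptions: `toy_bc_quartets` is a hypothetical generator that builds S for a network with a single cycle using the characterization of Lemma 12, and the function names and example taxa are not from the paper.

```python
from itertools import combinations, product


def toy_bc_quartets(v_blocks, hybrid_block):
    """Hypothetical S for a single-cycle network, built via Lemma 12."""
    others = [b for b in v_blocks if b != hybrid_block]
    S = set()
    for trio in combinations(others, 3):
        for pick in product(hybrid_block, *trio):
            S.add(frozenset(pick))
    return S


def z_set(S, trio):
    """Z_ijk: all taxa e such that {i, j, k, e} is in S."""
    trio = frozenset(trio)
    return {e for s in S if trio < s for e in s - trio}


def size_four_or_hybrid_block(S, quartet):
    """Return (True, None) if k_v = 4, else (False, v-hybrid block).

    `quartet` must be an element of S. The four sets Z_abc, Z_abd, Z_acd,
    Z_bcd are pairwise disjoint exactly when k_v = 4; otherwise the unique
    set disjoint from all the others is the v-hybrid block.
    """
    z = {t: z_set(S, t) for t in combinations(sorted(quartet), 3)}
    isolated = [t for t in z
                if all(z[t].isdisjoint(z[u]) for u in z if u != t)]
    if len(isolated) == len(z):
        return True, None
    (t,) = isolated
    return False, z[t]


# Toy 5-cycle with singleton v-blocks; {a} is the v-hybrid block.
blocks5 = [frozenset({x}) for x in "abcde"]
S5 = toy_bc_quartets(blocks5, blocks5[0])
print(size_four_or_hybrid_block(S5, {"a", "b", "c", "d"}))

# Toy 4-cycle in which the v-block D contains two taxa.
blocks4 = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"}),
           frozenset({"d1", "d2"})]
S4 = toy_bc_quartets(blocks4, blocks4[0])
print(size_four_or_hybrid_block(S4, {"a", "b", "c", "d1"}))
```

The remaining steps of the proof, which assign every other taxon to its v-block, reuse the same Z sets and are omitted from the sketch.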

Lemma 16 Let N+ be a metric rooted level-1 network on X. Then for any hybrid node v with kv ≥ 4 the order of the v-blocks in the cycle can be determined from the ordering of the concordance factors.

Proof If kv = 4, the claim is established by Proposition 11. Now suppose that kv > 4, so by Lemma 15 we know the v-hybrid block. Let A1,...,Akv be the v-block partition with A1 the v-hybrid block. Let ai ∈ Ai be an element of the i-th v-block. By Proposition 11, A1 and Aj are adjacent if and only if q^BC_{a1 aj x y} ≠ a1aj|xy for any distinct x,y ∈ {a2,…,akv} \ {aj}. Thus we can identify the two v-blocks adjacent to A1. Suppose that such v-blocks are Ap and Aq. We find the other v-block adjacent to Aq from {q^BC_{a1 ap aj am}} for all distinct j,m ∈ {2,3,4,…,kv} \ {p,q}. That is, Aq and Aj are adjacent if and only if q^BC_{a1 aj ap x} ≠ a1aj|xap for any x ∈ {a2,…,akv} \ {ap,aq,aj} and j ≠ 1,p,q. Continuing in this way, the full order of blocks around the cycle can be determined.
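The ordering procedure in this proof can be sketched as a greedy walk around the cycle. The code below is a toy illustration, not the paper's implementation: each v-block is represented by a single label, and `make_toy_oracle` simulates q^BC from a known circular order by returning the split that pairs the two non-adjacent blocks, as in Proposition 11. With real data the oracle would instead be read off the ordering of the concordance factors. All function names are hypothetical.

```python
from itertools import combinations


def split(a, b, c, d):
    """The quartet split ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})


def make_toy_oracle(true_order):
    """Simulate q^BC for one cycle from a known circular order of blocks."""
    pos = {blk: i for i, blk in enumerate(true_order)}

    def q_bc(quartet):
        # Restrict the circular order to the quartet: w, x, y, z.
        w, x, y, z = sorted(quartet, key=pos.get)
        # The non-adjacent pairs are {w, y} and {x, z} (Proposition 11).
        return split(w, y, x, z)

    return q_bc


def cycle_order(blocks, hybrid, q_bc):
    """Recover the circular order of the v-blocks, starting at the hybrid."""
    others = [b for b in blocks if b != hybrid]
    # A block j is adjacent to the hybrid block iff q^BC never pairs them.
    neighbours = [
        j for j in others
        if all(q_bc({hybrid, j, x, y}) != split(hybrid, j, x, y)
               for x, y in combinations([o for o in others if o != j], 2))
    ]
    p, q = neighbours
    chain = [p, hybrid, q]
    unplaced = [b for b in others if b not in (p, q)]
    while unplaced:
        # The next block after the chain end is the unplaced j that stays
        # adjacent to the hybrid in every quartet {hybrid, j, p, x}.
        nxt = next(
            j for j in unplaced
            if all(q_bc({hybrid, j, p, x}) != split(hybrid, j, x, p)
                   for x in unplaced if x != j)
        )
        chain.append(nxt)
        unplaced.remove(nxt)
    return chain


def circular_edges(order):
    """Adjacent pairs of a circular order, for comparing two orders."""
    n = len(order)
    return {frozenset({order[i], order[(i + 1) % n]}) for i in range(n)}


truth = ["A1", "A4", "A2", "A5", "A3"]  # hypothetical circular order
found = cycle_order(sorted(truth), "A1", make_toy_oracle(truth))
print(found)
```

The recovered list is a rotation or reflection of the true circular order, which is all that an unrooted cycle determines; `circular_edges` compares two orders up to that ambiguity.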

We reach the main result.

Theorem 4 Let N+ be a metric rooted level-1 network on X. Then under the NMSC model, for generic parameters, the collection of orderings of quartet concordance factors identifies the unrooted semidirected topological network Ñ obtained from N by contracting all 2- and 3-cycles, and ignoring the directions of hybrid edges in 4-cycles, while retaining directions of hybrid edges of k-cycles for k ≥ 5.

Proof We proceed by induction on the number of cycles of size ≥ 4. Suppose there are no such cycles. Then no induced quartet network will have a cycle of size 4, and the ordering of the concordance factors determines the topology of the quartet tree obtained by contracting all 2- and 3-cycles. These then determine the topology Ñ by a standard result [22].

Suppose there is exactly one cycle of size at least 4. Then there is just one hybrid node v in Ñ with kv ≥ 4. By Lemmas 15 and 16 we can determine the size kv of the cycle, the v-blocks and the order of the v-blocks in the cycle. If kv ≥ 5 we can identify the hybrid node v and thus identify the direction of the hybrid edges. Let Pu be a v-block where u is a node in Cv, and q ∈ X \ Pu. Let K be the induced network on Pu ∪ {q} with all 2-cycles and 3-cycles contracted. Note that K is a tree, and the quartet concordance factors for taxa in Pu ∪ {q} identify its topology. Viewing q as an outgroup of Pu induces a rooted tree on Pu. The root can then be joined with an edge to u. Doing this for all v-blocks establishes the claim.

Now suppose that the result is true for networks with l cycles of size at least 4, and N contains l+1 such cycles. We can first determine all 4-network blocks, and the v-blocks and their cycle order for every cycle of size at least 4, by Lemmas 14, 15, and 16. Following Definition 13, consider T, the tree of cycles of Ñ. A leaf of T arises from a cycle Cv on N if and only if all v-blocks but one are 4-network blocks. We may therefore determine the v-blocks of some cycle Cv that is a leaf of T.
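The leaf criterion used in this step is a simple set comparison. The sketch below assumes the 4-network blocks and the v-blocks of each cycle have already been computed (by Lemmas 14 and 15); the function name `leaf_cycles`, the cycle labels and the example network, a chain of three 4-cycles, are hypothetical.

```python
def leaf_cycles(four_network_blocks, v_blocks_by_cycle):
    """Cycles for which all v-blocks but one are 4-network blocks."""
    known = set(four_network_blocks)
    leaves = []
    for cycle, v_blocks in v_blocks_by_cycle.items():
        not_4nb = [b for b in v_blocks if b not in known]
        if len(not_4nb) == 1:
            leaves.append(cycle)
    return leaves


fs = frozenset
# Hypothetical chain of three 4-cycles C1 - C2 - C3 on taxa a..h.
fnb = [fs({t}) for t in "abcdefgh"]
v_blocks_by_cycle = {
    "C1": [fs("a"), fs("b"), fs("c"), fs("defgh")],
    "C2": [fs("abc"), fs("g"), fs("h"), fs("def")],
    "C3": [fs("d"), fs("e"), fs("f"), fs("abcgh")],
}
print(leaf_cycles(fnb, v_blocks_by_cycle))
```

In this example the middle cycle C2 has two v-blocks that are not 4-network blocks, so only the two end cycles qualify as leaves of the tree of cycles.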

Let u be the vertex in Cv associated to the v-block that is not a 4-network block. Note that Ñ\{u} is a disconnected graph, with two connected components Ñ1 and Ñ2. Let Ñ1 be the component containing all nodes of Cv except u, and Si the set of taxa on Ñi, i ∈ {1,2}. Let si ∈ Si. Then N_{Si ∪ {sj}} for i,j ∈ {1,2}, i ≠ j, has at most l cycles of size at least 4. By the induction hypothesis we can determine the semidirected topological network Ñi obtained from N_{Si ∪ {sj}} by contracting all 2- and 3-cycles, and ignoring the directions of the hybrid edges in 4-cycles, while retaining directions of the hybrid edges of k-cycles for k ≥ 5. We obtain Ñ by identifying s1 in Ñ2 with s2 in Ñ1 and suppressing that node.

Figure 23 shows a metric rooted phylogenetic network N+ and Ñ, the unrooted semidirected topological network which is identified by Theorem 4.

Fig. 23.

A rooted metric phylogenetic network N+ (left) and the network structure Ñ (right) that can be identified by Theorem 4. The 4-cycle in the network on the right, colored gray, has 3 different candidates for the hybrid node.

The cycle colored in green is a 4-cycle, and thus its hybrid node is not identified from quartet concordance factors. However, its hybrid node has to be such that Ñ is induced from a rooted network. Thus the node labeled x in Figure 23 cannot be the hybrid node. This illustrates that although we cannot always identify the hybrid node on 4-cycles, sometimes the structure of the resulting network Ñ restricts the possible nodes for its placement.

9. Further results on 3₂-cycles

Under some special circumstances, for example when a set of taxa satisfies the Cycle property but not the BC property, it is possible to detect further information about the topology of the network than that given in Theorem 4. For instance, some 3-cycles are identifiable under such hypotheses. In this section, we discuss these extensions briefly, as it is difficult to formulate general statements on identifiability.

Recall that a 3₂-cycle may lead to concordance factors satisfying the Cycle property, but it need not, as shown in Proposition 7. There is a full-dimensional subset of parameter space on which concordance factors indicate a 3₂-cycle and another on which they fail to. Nonetheless, the following gives a positive, but limited, identifiability result.

Proposition 12 Let N+ be a metric rooted level-1 network on X and suppose {a,b,c,d} ⊂ X satisfies the Cycle property but not the BC property. Then under the NMSC model, for generic parameters, if there is no taxon e ∈ X such that {i,j,k,e} satisfies the BC property for any distinct i,j,k ∈ {a,b,c,d} then N contains a 3-cycle with at least two descendants of the hybrid node.

Proof Since {a,b,c,d} ⊂ X satisfies the Cycle property but not the BC property, by Proposition 8 there is a 3₂-cycle in Qabcd. Thus three taxa of a,b,c,d are in distinct v-blocks in Qabcd. This implies that there exists a cycle Cv in N where three taxa of a,b,c,d are in distinct v-blocks. Since {i,j,k,e} does not satisfy the BC property for any distinct i,j,k ∈ {a,b,c,d}, this implies Cv is not a k-cycle for k ≥ 4. Thus by Proposition 7, Cv has size 3 and at least two of a, b, c, d descend from v.

Let Qabcd be an unrooted level-1 quartet network where {a,b,c,d} satisfies the Cycle property but not the BC property. It can be shown that if, for example, the smallest entry in CFabcd is the one corresponding to the quartet AB|CD, then either a,b or c,d are in the v-hybrid block. This proof is very similar to that of Proposition 11.

Let N+ be a network such that Ñ (the network obtained from N+ as in Theorem 4) is as shown in Figure 24. Observe that {a,b,c,d} satisfies the BC property by Theorem 3. If {a,e,b,d} satisfies the Cycle property, then the following proposition indicates that the hybrid node in the network shown in Figure 24 can be determined.

Fig. 24.

A network Ñ with a 4-cycle such that if {a,b,c,e} satisfies the Cycle property, the hybrid block can be detected.

Proposition 13 Let N+ be a metric rooted level-1 network on X and let Cv be a 4-cycle in N. Let a,b,c,d ∈ X be in different v-blocks in N. Suppose under the NMSC model, for generic parameters, for distinct i,j,k ∈ {a,b,c,d}, there exists a taxon e ∈ X such that {i,j,k,e} satisfies the Cycle property but not the BC property. Then the v-block containing e is the v-hybrid block.

Proof Without loss of generality suppose that i = a, j = b and k = c. Note that e is not in the same v-block as d, otherwise {a,b,c,e} would satisfy the BC property. Thus e is in the same v-block as a, b or c. Without loss of generality suppose that e is in the same v-block as a. Thus {e,b,c,d} satisfies the BC property and by Theorem 4 the order of the cycle can be determined. Without loss of generality suppose that the order is as in Figure 24. By Lemma 13, {a,b,c,d} and {e,b,c,d} determine the same hybrid node v. Since {a,b,c,e} satisfies the Cycle property but not the BC property, Corollary 4 shows Qabce has a 3₂-cycle. The 4-cycle in Qabcd and the 3-cycle in Qabce must have the same hybrid edges, otherwise the level-1 condition would be violated. Observe that the only possibility for Qabce having a 3₂-cycle is if e and a are in the hybrid block.

In [23] it is stated that one could identify the hybrid node in a 4-cycle when the number of taxa in the network is greater than 4 by using multiple concordance factors at once.

10. Discussion

In this work, we show that for generic numerical parameters, under the network multi-species coalescent model the collection of orderings of quartet concordance factors identifies the unrooted semidirected topological network obtained from N by contracting all 2- and 3-cycles, and ignoring the directions of hybrid edges in 4-cycles, while retaining directions of hybrid edges in larger cycles.

As mentioned in the introduction, the proof of this result suggests combinatorial methods for constructing the network from noiseless data, but the question remains open in the presence of noise. There are two challenges when noise is introduced. The first one consists of detecting whether a quartet network contains a 4-cycle or not. We would never expect the empirical concordance factors to be exactly treelike. For this challenge, one could develop a statistical test to determine when concordance factors are sufficiently close to treelike to doubt the presence of a 4-cycle. The second challenge arises after determining such a test. Since the test will not be accurate all the time, some quartets will not be inferred correctly, and thus we need a method to reconstruct the network with some erroneous quartets. We leave this for future work.

Acknowledgements

The author deeply thanks John A. Rhodes and Elizabeth S. Allman for their technical assistance and suggestions during the development of this work, and the reviewers for their valuable suggestions and observations.

This research was supported in part by the National Institutes of Health grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

11. Appendix

Here, Proposition 1 of Section 2 is proved. The argument uses the following.

Lemma 17 Let N+ be a (metric or topological) rooted network on X and let Z ⊆ X. For any edge e below LSA(Z) with a descendant in Z, there are x,y ∈ Z such that e is in a simple trek in N+ from x to y whose edges are below LSA(Z).

Proof Let x ∈ Z be below e. By Lemma 2 there exists y ∈ Z with LSA(x,y) above e.

Suppose y is not below e. Let Px be a path from LSA(x,y) to x containing e and let Py be a path from LSA(x,y) to y. Let u be the minimal node in the intersection of Px and Py. Since y is not below e, u cannot be below e. Then the subpath of Px from u to x, which contains e, and the subpath of Py from u to y form a simple trek containing e.

Now assume y is below e. Since e is below LSA(x,y), there exists a path from LSA(x,y) to one of y or x that does not pass through the child of e. Without loss of generality suppose such a path Py goes from LSA(x,y) to y.

Let Px be a path from LSA(x,y) to x that passes through e. Let A = A(Px,Py) be the set of nodes above e, common to Py and Px. Let a ∈ A be the minimal node in A.

Let B(Py,Px) be the set of nodes below e, common to Py and Px. We may assume that we choose Px and Py such that B = B(Py,Px) has minimal cardinality. If B = ∅ then the desired trek is easily constructed, with top a. So suppose B ≠∅ has minimal element b and maximal element b+. We are going to contradict the minimality of B. Note that b+ must be the hybrid node of a cycle containing e (see Figure 25 for a graphical reference).

Fig. 25.

In gray is the subgraph composed of P and P′; the dashed edges indicate that P and P′ could intersect, and the dotted segments represent a succession of edges. In black are the different cases of the possible edges in P above b but below a.

Since b is not LSA(x,y), there exists a path P from LSA(x,y) to one of x or y that does not pass through b. Note that P has to intersect at least one of Py or Px at an internal node below b. Let C1 be the set of nodes below b common to P and Py, and let C2 be the set of nodes below b common to P and Px. Let c be the maximal node in C1 ∪ C2. We can assume, without loss of generality, that c is in Py. This is because if instead c were in Px, we could construct paths P′x and P′y where P′i contains all the edges in Pi above b and all the edges of Pj below b, for i,j ∈ {x,y}, i ≠ j. Note that P′x passes through e and does not contain c, while P′y does not pass through e, contains c, and B = B(P′y,P′x).

Denote by W the set of nodes in (P* ∩ Py) ∪ (P* ∩ Px) and let w be the minimal node of W above b. Since N+ is binary, w cannot be a or b+ (see Figure 25 for a graphical reference). There are 5 different cases of the location of w in the network composed by the paths Py and Px. These are

  1. w is in Py, above b+ but below a.

  2. w is in Px, above b+ but below e.

  3. w is in Px, above e but below a.

  4. w is in one or more of Px or Py, above a.

  5. w is in one or more of Px or Py, above b but below b+.

Figure 25 depicts in gray the graph composed of the paths Py and Px, and in black the possible subpaths of P from w to c. In any of cases 1, 2 or 3 we can find a simple trek containing e as depicted in Figure 26 by choosing the appropriate edges, and thus B was not minimal. For cases 4 and 5 there are two possibilities: (i) w is in both Py and Px; (ii) w is in only one of Py or Px. For case 4 (i), the situation is simple, and we can find a simple trek as depicted on the left in Figure 27. For case 4 (ii), we first find the node in A that is right above w. Then, as depicted on the left of Figure 27, we can find a simple trek.

Fig. 26.

The treks in case 1 (left), case 2 (center), and case 3 (right).

Fig. 27.

(Left) The treks in the two possibilities of case 4. (Right) The two possibilities of case 5, where the black segments represent edges that may be red and blue at the same time.

For case 5 we do not find a simple trek directly; instead we construct two paths P1 and P2 from LSA(x,y) to x and y respectively, only one of which contains e, with at least one less node in B(P1,P2) than in B. For case 5 (i), we just take P1 to be the same as Px, and for P2 we consider the same edges that are in Py above w, the edges below c, and the edges in P between w and c. For case 5 (ii), we assume without loss of generality that w is in Px. Let b be the node in B right above w. Let P1 be the path containing the edges in Px that are above b, the edges in Py that are below b but above the node b′ ∈ B right below w, and lastly the edges in Px below b′. Let P2 be the path containing the edges in Py that are above b, the edges in Px that are above a but below b, the edges in P that are above c but below w, and lastly the edges in Py that are below c. Figure 27 (right) depicts P1 (red) and P2 (blue) for (i) and (ii). Since B(P1,P2) has at least one less node than B and we assumed B to have minimal cardinality, the minimality of B is contradicted.

Proof (of Proposition 1) Let M+ = N+_Z. Let M be the graph obtained from M+ by ignoring the direction of all tree edges and then suppressing LSA(Z,N+), that is, the induced unrooted network from M+. Denote by M′ the graph obtained by ignoring all directions of the tree edges in M+, so that suppressing degree-two nodes of either M or M′ gives the unrooted network induced from N+_Z. Let K be the graph obtained by considering all the edges in simple treks in N from x to y for all x,y ∈ Z, so that suppressing degree-two nodes in K gives N_Z, the network induced on Z from N. Showing either M′ = K or M = K will prove the claim.

First we show that if LSA(Z,N+) ≠ LSA(X,N+) then M′ = K, by arguing that M′ and K have the same edges. Let e be an edge of M′. Since LSA(Z,N+) ≠ LSA(X,N+), M′ is a subgraph of N and e is directed in M+. By Lemma 17, e is in a simple trek in M+ from x to y, for some x,y ∈ Z. This trek induces a simple trek in M′ from x to y, and therefore a simple trek in N from x to y. Thus e is in K.

Now let e be an edge of K. Then there exists a simple trek (P̄1,P̄2) in N from x to y, for some x,y ∈ Z, containing e. Let v = top(P̄1,P̄2) and let T be the sequence of incident edges in N+ from x to v composed of the edges inducing those in P̄1 and P̄2. Since (P̄1,P̄2) is simple, T does not have repeated edges. Following T in N+ from x to y, edges are first traversed “uphill” (in reverse direction) until there is a first “downhill” edge (u,w). The next edge in T cannot be uphill, as otherwise it would be hybrid and (P̄1,P̄2) would not have been a trek in N. This argument applies for all consecutive edges in T until we end at y. Thus there is a simple trek (P1,P2) from x to y in N+ with top u. Note that u must be below or equal to LSA(Z,N+) since otherwise the trek would not be simple. Moreover, P1 and P2 contain only edges in M+ and thus in M′ after the directions of the tree edges are omitted. Thus e is in M′, so K = M′.

If LSA(Z,N+) = LSA(X,N+) then M = K follows from a straightforward modification of the previous argument to account for the suppression of LSA(Z,N+) in both M and K.

References

  • 1. Allman Elizabeth S., Degnan James H., and Rhodes John A. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. Journal of Mathematical Biology, 62(6):833–862, 2011.
  • 2. Ané Cécile, Larget Bret, Baum David A., Smith Stacey D., and Rokas Antonis. Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24(2):412–426, 2007.
  • 3. Arnold Michael L. Natural hybridization and evolution, volume 53. Oxford University Press, 1997.
  • 4. Bapteste Eric, van Iersel Leo, Janke Axel, Kelchner Scot, Kelk Steven, McInerney James O., Morrison David A., Nakhleh Luay, Steel Mike, Stougie Leen, and Whitfield James. Networks: expanding evolutionary thinking. Trends in Genetics, 29(8):439–441, 2013.
  • 5. Carstens Bryan C., Knowles L. Lacey, and Collins Tim. Estimating Species Phylogeny from Gene-Tree Probabilities Despite Incomplete Lineage Sorting: An Example from Melanoplus Grasshoppers. Systematic Biology, 56(3):400–411, 2007.
  • 6. Degnan James H., Knowles L. Lacey, and Kubatko Laura Salter. Probabilities of gene trees with intraspecific sampling given a species tree. In Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, 2010.
  • 7. Ellstrand NC, Whitkus R, and Rieseberg LH. Distribution of spontaneous plant hybrids. Proceedings of the National Academy of Sciences of the United States of America, 93(10):5090–5093, 1996.
  • 8. Gusfield Dan, Bansal Vikas, Bafna Vineet, and Song Yun S. A decomposition theory for phylogenetic networks and incompatible characters. Journal of Computational Biology, 14(10):1247–1272, 2007.
  • 9. Huber Katharina T., van Iersel Leo, Moulton Vincent, Scornavacca Celine, and Wu Taoyang. Reconstructing phylogenetic level-1 networks from nondense binet and trinet sets. Algorithmica, 77(1):173–200, January 2017.
  • 10. Huber KT, Moulton V, Semple C, and Wu T. Quarnet inference rules for level-1 networks. https://arxiv.org/pdf/1711.06720.pdf, 2017.
  • 11. Keijsper JCM and Pendavingh RA. Reconstructing a Phylogenetic Level-1 Network from Quartets. Bulletin of Mathematical Biology, 76(10):2517–2541, 2014.
  • 12. Linder C. Randal and Rieseberg Loren H. Reconstructing patterns of reticulate evolution in plants. American Journal of Botany, 91(10):1700–1708, 2004.
  • 13. Liu Liang, Yu Lili, and Edwards Scott V. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology, 10(1):302, 2010.
  • 14. Mallet James. Hybridization as an invasion of the genome. Trends in Ecology & Evolution, 20(5):229–237, 2005. Special issue: Invasions, guest edited by Hochberg Michael E. and Gotelli Nicholas J.
  • 15. Meng Chen and Kubatko Laura Salter. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model. Theoretical Population Biology, 75(1):35–45, 2009.
  • 16. Nakhleh Luay. Evolutionary Phylogenetic Networks: Models and Issues. Problem Solving Handbook in Computational Biology and Bioinformatics, pages 125–158, 2011.
  • 17. Noor Mohamed A. F. and Feder Jeffrey L. Speciation genetics: evolving approaches. Nature Reviews Genetics, 7(11):851–861, 2006.
  • 18. Pamilo P and Nei M. Relationships between gene trees and species trees. Molecular Biology and Evolution, 5:568–583, 1988.
  • 19. Pollard Daniel A., Iyer Venky N., Moses Alan M., and Eisen Michael B. Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genetics, 2(10):1634–1647, 2006.
  • 20. Rieseberg Loren H., Baird Stuart J.E., and Gardner Keith A. Hybridization, introgression, and linkage evolution. Plant Molecular Biology, 42(1):205–224, 2000.
  • 21. Rosselló Francesco and Valiente Gabriel. All that glisters is not galled. Mathematical Biosciences, 221(1):54–59, 2009.
  • 22. Semple Charles and Steel Mike. Phylogenetics. Oxford University Press, 2005.
  • 23. Solís-Lemus Claudia and Ané Cécile. Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting. PLoS Genetics, 12(3), 2016.
  • 24. Solís-Lemus Claudia, Ané Cécile, and Yang Mengyao. Inconsistency of Species Tree Methods under Gene Flow. Systematic Biology, 65(5):843–851, 2016.
  • 25. Steel Mike. Phylogeny: Discrete and Random Processes in Evolution. David Marshall, 2016.
  • 26. Sullivant Seth, Talaska Kelli, and Draisma Jan. Trek separation for Gaussian graphical models. Ann. Statist, 38(3):1665–1685, 06 2010.
  • 27. Syring John, Willyard Ann, Cronn Richard, and Liston Aaron. Evolutionary relationships among Pinus (Pinaceae) subsections inferred from multiple low-copy nuclear loci. American Journal of Botany, 92(12):2086–2100, 2005.
  • 28. Wakeley John. Coalescent Theory: An Introduction, volume 58. Roberts and Company Publishers, 2008.
  • 29. Yu Y, Dong J, Liu KJ, and Nakhleh L. Maximum likelihood inference of reticulate evolutionary histories. PNAS, 111:296–305, 11 2014.
  • 30. Yu Yun, Degnan James H., and Nakhleh Luay. The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics, 8:e1002660, 2012.
  • 31. Yu Yun, Than Cuong, Degnan James H., and Nakhleh Luay. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology, 60(2):138–149, 2011.
  • 32. Zhang C, Ogilvie HW, Drummond AJ, and Stadler T. Bayesian inference of species networks from multilocus sequence data. Molecular Biology and Evolution, 35:504–517, 02 2018.
  • 33. Zhu J, Yu Y, and Nakhleh L. In the light of deep coalescence: Revisiting trees within networks. BMC Bioinformatics, 17:415, 2016.
  • 34. Zhu S and Degnan J. Displayed trees do not determine distinguishability under the network multispecies coalescent. Systematic Biology, 66:283–298, 2017.
