Identifiability of species network topologies from genomic sequences using the logDet distance

Elizabeth S Allman; Hector Baños; John A Rhodes

doi:10.1007/s00285-022-01734-2

Author manuscript; available in PMC: 2022 Jun 13.

Published in final edited form as: J Math Biol. 2022 Apr 7;84(5):35. doi: 10.1007/s00285-022-01734-2

PMCID: PMC9192096 NIHMSID: NIHMS1806814 PMID: 35385988

Abstract

Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees. It applies to both unlinked site data, such as for SNPs, and to sequence data in which many contiguous sites may have evolved on a common tree, such as concatenated gene sequences. Thus under standard stochastic models statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.

Keywords: species network, identifiability, logDet, phylogenetic inference

1. Introduction

As genomic-scale sequencing has become increasingly common, attention in phylogenetics has shifted from inferring trees of evolutionary relationships for individual genetic loci from a set of species to inferring relationships between the species themselves. A substantial complication is that population genetic processes within species, as modeled by the Multispecies Coalescent (MSC) model, can lead to individual gene trees having quite different topological structures than the tree relating the species overall. If the evolutionary history of the species also involved hybridization or other forms of horizontal gene flow, so that a species network is a more suitable depiction of relationships, the relationship of gene trees to the network, as modeled by the Network Multispecies Coalescent (NMSC) model, is even more complex.

Inference of species networks, through a combined NMSC and sequence substitution model, can be performed in a Bayesian framework [Zhang et al., 2017, Wen and Nakhleh, 2018] but computational demands severely limit both the number of taxa and the number of genetic loci considered. Other methods take a faster two-stage approach, first inferring gene trees which are treated as “data” for a second inference of a species network. Approaches include maximum pseudolikelihood using either rooted triples (PhyloNet) or quartets (SNaQ) displayed on the gene trees [Yu and Nakhleh, 2015, Solís-Lemus and Ané, 2016], or the faster, distance-based analysis built on gene quartets of NANUQ [Allman et al., 2019a]. Still, the first stage of these approaches, the inference of individual gene trees, can be a major computational burden. Avoiding such gene tree inference, and passing more directly from sequences to an inferred network, could substantially reduce total computational time in data analysis pipelines.

The goal of this paper is to show that most topological features of a level-1 species network can be identified from logDet intertaxon distances computed from aligned genomic-scale sequences. In particular this can be done without partitioning the sequences by genes, under a combined model of the NMSC and a mixture of general time-reversible (GTR) substitution processes on gene trees. While the main result, that the logDet distances retain enough information to recover most of the species network, despite having lost information on individual genes, is a theoretical one, it points the way toward faster algorithms for practical inference. In particular, since the computation of logDet distances requires little effort, it suggests that a distance-based approach similar to NANUQ’s, but avoiding individual gene tree inference, may offer substantially faster analyses than current methods.

The model of sequence evolution underlying our result accounts not only for base substitutions along each gene tree, but also for variation in gene trees due to their formation under a coalescent process combined with hybridization or similar gene transfer. Our model extends to networks the mixture of coalescent mixtures model on species trees of Allman et al. [2019b], which itself extended the coalescent mixture introduced by Chifman and Kubatko [2015]. More specifically, for a fixed species network, gene trees are formed under the Network Multispecies Coalescent model [Meng and Kubatko, 2009, Yu et al., 2011, Zhu et al., 2016] for each site independently. GTR substitution parameters for base evolution on each site’s tree are then independently chosen from some distribution, leading to a site pattern distribution. These site distributions are finally combined to give a site pattern distribution for genomic sequences. (As discussed in Section 2, this distribution also applies to a more realistic model in which multisite genes with a single substitution process have lengths chosen independently from some distribution.) While this pattern frequency distribution thus reflects the substitution processes on all the gene trees, information about pattern frequencies arising on any individual gene tree is hidden.

The logDet distance was first introduced in the context of a single class general Markov model of sequence evolution on a single gene tree [Steel, 1994, Lockhart et al., 1994], and has been used both to obtain gene tree identifiability results and for inference of individual gene trees. Considering genomic sequences, Liu and Edwards [2009], and independently Dasarathy et al. [2015], showed that for a Jukes-Cantor substitution model and an ultrametric species tree, the Jukes-Cantor distances obtained under the coalescent mixture model still allowed for consistent inference of topological species trees. By passing to the logDet distance, Allman et al. [2019b] extended this result to the more realistic mixture of coalescent mixtures model, showing that the logDet distance allowed for consistent inference of a topological species tree, assuming it is ultrametric in generations. This study builds on all these works on gene and species tree models, but considers level-1 species networks on which all extant species are equidistant from the root.

Passing from species trees to networks is a substantial step, however, and our approach is strongly motivated by the approach taken by Baños [2019] in studying identifiability of features of unrooted level-1 topological species networks from gene tree quartet concordance factors (probabilities of the different quartet topologies displayed on gene trees). In the ultrametric setting of this work, we show that logDet distances computed from genomic sequences suffice to determine 4-cycles on undirected rooted triple networks, and then that this 4-cycle information for different rooted triples can be combined to determine all cycles of size 4 or more, and even all hybrid nodes in those cycles of size 5 or more. We do not obtain information on 2- or 3-cycles, so our results closely parallel those of Baños [2019], despite the rather different source of information.

There are a number of other theoretical works in the literature on determining phylogenetic networks from limited information. For instance, Jansson and Sung [2006] investigate determining a level-1 network from the rooted triple trees it displays, Huber et al. [2017, 2018] discuss how knowledge of trinets (induced 3-taxon directed rooted networks) and quarnets (induced 4-taxon undirected unrooted networks) determines larger networks, and van Iersel et al. [2020] explore determination of networks from distances. However, the question of how, or whether, these results can be applied to biological data is not addressed, and the setting of these works is not directly applicable to obtaining our results.

Other works [Gross and Long, 2018, Gross et al., 2020, Hollering and Sullivant, 2021] use algebraic approaches to show that certain types of level-1 networks can be identified from joint pattern frequency arrays under group-based models of sequence evolution such as the Jukes-Cantor and Kimura models. In addition to their restriction on sequence evolution models, these works do not incorporate a coalescent process. That is, all sequence sites are assumed to have evolved on one of the finitely-many trees displayed on the network. Since the absence of a coalescent process is a limiting case of our coalescent-based model, our results allowing for mixtures of more general sequence evolution models extend those results in the ultrametric case. Algebraic study of a network model combined with the general Markov model, again with no coalescent process, was also conducted by Casanellas and Fernández-Sánchez [2020].

This paper proceeds as follows. Section 2 defines the networks and models under consideration, as well as the logDet distance. For most of the paper we restrict to a model of unlinked sites, only later passing to a model allowing concatenated genes whose sites evolve on the same gene tree. Section 3 uses combinatorial arguments to show how information on undirected rooted triple networks can be used to determine features of a larger directed network from which they are induced. Expected frequencies of site patterns for sequences produced by the mixture of coalescent mixtures model are studied in Section 4, and shown to be expressible as convex combinations of pattern frequencies from simpler networks. In Section 5 we show that the ordering by magnitude of logDet distances for triples of taxa tells us about the induced rooted triple species network, and by combining this with the result of Section 3 we obtain our main identifiability result, Theorem 1. Section 6 discusses two variations on our main result that are implied by it. The first is to a model with genes of linked sites that evolve on a common tree. The second is to a non-coalescent model, in which all gene trees must be displayed on the species network. Section 7 further studies the logDet distances from a rooted triple network, in order to better understand what triples of distances can arise under the mixture of coalescent mixtures model. We conclude in Section 8 with an outline of how these results can be developed into a practical inference algorithm.

2. Networks and models

2.1. Phylogenetic Networks

Although there are many variations on the notion of a phylogenetic network in the literature, we adopt ones appropriate to the Network Multispecies Coalescent (NMSC) model. This model, which describes the formation of trees of gene lineages in the presence of both incomplete lineage sorting and hybridization, will be further developed in the next subsection. First, we focus on setting forth combinatorial aspects of the networks.

Definition 1 [Solís-Lemus and Ané, 2016, Baños, 2019] A topological binary rooted phylogenetic network $N^{+}$ on taxon set X is a connected directed acyclic graph with vertices $V = V (N^{+})$ and edges $E = E (N^{+})$ , where V is a disjoint union V = {r} ⊔ V_L ⊔ V_H ⊔ V_T and E is a disjoint union E = E_H ⊔ E_T, with a bijective leaf-labeling function f : V_L → X with the following characteristics:

The root r has indegree 0 and outdegree 2.
A leaf v ∈ V_L has indegree 1 and outdegree 0.
A tree node v ∈ V_T has indegree 1 and outdegree 2.
A hybrid node v ∈ V_H has indegree 2 and outdegree 1.
A hybrid edge e = (v, w) ∈ E_H is an edge whose child node w is hybrid.
A tree edge e = (v, w) ∈ E_T is an edge whose child node w is either a tree node or a leaf.

When ∣X∣ = 3 or 4, we refer to $N^{+}$ as a rooted triple network or a rooted quartet network, respectively.
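The degree conditions of Definition 1 can be checked mechanically. The following is a minimal sketch, assuming a network is stored as a plain list of (parent, child) pairs; the function names and the example network are hypothetical, and only the degree rules are tested (not acyclicity, connectivity, or the leaf labeling).

```python
from collections import defaultdict

# Allowed (indegree, outdegree) pairs from Definition 1.
NODE_KINDS = {(0, 2): "root", (1, 0): "leaf", (1, 2): "tree", (2, 1): "hybrid"}

def classify_nodes(edges):
    """Label every node of a directed graph as root, leaf, tree, or hybrid.

    `edges` is a list of (parent, child) pairs. Only the degree conditions
    are checked; acyclicity, connectivity, and leaf labels are not.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
    kinds = {}
    for node in set(indeg) | set(outdeg):
        degrees = (indeg[node], outdeg[node])
        if degrees not in NODE_KINDS:
            raise ValueError(f"node {node!r} has invalid degrees {degrees}")
        kinds[node] = NODE_KINDS[degrees]
    if sum(kind == "root" for kind in kinds.values()) != 1:
        raise ValueError("expected exactly one root")
    return kinds

def hybrid_edges(edges, kinds):
    """Hybrid edges are exactly those whose child node is hybrid."""
    return [(u, v) for (u, v) in edges if kinds[v] == "hybrid"]

# Hypothetical rooted triple network on {a, b, c} with one 4-cycle r-u-h-v.
EXAMPLE_EDGES = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
                 ("v", "h"), ("v", "c"), ("h", "b")]
```

In this example the node h has indegree 2 and outdegree 1, so the two edges entering it are the hybrid edges and all remaining edges are tree edges.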

The vertices, and edges, of $N^{+}$ are partially ordered by the directedness of the graph. For instance, a node u is below a node v, and v is above u, if there exists a non-empty directed path in $N^{+}$ from v to u. The root is thus above all other nodes.

A metric notion of the network above incorporates some of the parameters of the NMSC model. This introduces edge lengths, measured in generations throughout this article, as well as probabilities that a gene lineage at a hybrid node follows one or the other hybrid edge as it traces back in time toward the network root. Since we focus on binary networks, only hybrid edges are allowed to have length 0, to model possibly instantaneous jumping of a lineage from one population to another.

Definition 2 A metric binary rooted phylogenetic network $(N^{+}, \{ℓ_{e}\}_{e \in E}, \{γ_{e}\}_{e \in E_{H}})$ is a topological binary rooted phylogenetic network together with an assignment of weights or lengths ℓ_e to all edges and hybridization parameters γ_e to all hybrid edges, subject to the following restrictions:

The length ℓ_e of a tree edge e ∈ E_T is positive.
The length ℓ_e of a hybrid edge e ∈ E_H is non-negative.
The hybridization parameters γ_e and γ_e′ for a pair of hybrid edges e, e′ ∈ E_H with the same child hybrid node are positive and sum to 1.

A metric network of this sort is said to be ultrametric if every directed path from the root to a leaf has the same total length. This is equivalent to requiring the ultrametricity of all trees displayed on the network. An example of a simple ultrametric network is shown in Figure 1 (Right).
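Ultrametricity as just defined is straightforward to test on a small network. The sketch below assumes edge lengths are stored in a dictionary keyed by (parent, child); the names and the example lengths are hypothetical, and the tolerance `tol` is an arbitrary choice for floating-point comparison.

```python
def root_to_leaf_lengths(lengths, root):
    """Return the total length of every directed root-to-leaf path.

    `lengths` maps each edge (parent, child) to its length in generations.
    """
    children = {}
    for (parent, child), length in lengths.items():
        children.setdefault(parent, []).append((child, length))
    totals = []
    stack = [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        if node not in children:          # no outgoing edges: a leaf
            totals.append(dist)
        else:
            for child, length in children[node]:
                stack.append((child, dist + length))
    return totals

def is_ultrametric(lengths, root, tol=1e-9):
    """True if all root-to-leaf path lengths agree up to `tol`."""
    totals = root_to_leaf_lengths(lengths, root)
    return max(totals) - min(totals) <= tol

# Hypothetical 4-cycle rooted triple network; both paths through the
# hybrid node h have the same length as the tree paths.
EXAMPLE_LENGTHS = {("r", "u"): 1.0, ("r", "v"): 1.0, ("u", "a"): 2.0,
                   ("v", "c"): 2.0, ("u", "h"): 0.5, ("v", "h"): 0.5,
                   ("h", "b"): 1.5}
```

Because paths are enumerated through both hybrid edges, the check covers every tree displayed on the network, in line with the equivalence noted above.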

Fig. 1 — (Left) An ultrametric species network $N^{+}$ with time t in generations before the present, hybrid edges h and h′ shown in red, and population functions N_e(t) on each edge depicted by widths of “tubes.” The edge lengths τ are measured on the t-axis between the dashed lines indicating speciation and hybridization events. The dashed red/blue boundary represents a hybrid node, the top dashed line the root of the network, and other dashed lines tree nodes. (Right) A schematic of the same species network, which does not show population sizes. Hybridization parameters γ and γ′ are omitted from both drawings.

On directed networks there are several analogs [Steel, 2016] of the most recent common ancestor of a set of taxa on a tree. The following is the most useful in this work.

Definition 3 [Steel, 2016] Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network on a set of taxa X and let Z ⊆ X. Let D be the set of nodes which lie on every directed path from the root r of $N^{+}$ to any z ∈ Z. Then the lowest stable ancestor of Z on $N^{+}$ , denoted $LSA (Z, N^{+})$ , is the unique node v ∈ D such that v is below all u ∈ D with u ≠ v. The lowest stable ancestor (LSA) of a network on X is LSA(X).
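Definition 3 can be applied directly on small networks by enumerating paths. The following sketch assumes an edge list of (parent, child) pairs in which parallel edges (as in a 2-cycle) are simply listed twice; the function names and the example are hypothetical, and the path enumeration is exponential in general, so it is meant only for illustration.

```python
def all_paths(children, node, target):
    """All directed paths from `node` to `target`, each as a list of nodes."""
    if node == target:
        return [[node]]
    paths = []
    for child in children.get(node, []):
        for tail in all_paths(children, child, target):
            paths.append([node] + tail)
    return paths

def lowest_stable_ancestor(edges, root, taxa):
    """Lowest node lying on every directed path from `root` to any taxon."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    paths = [p for z in taxa for p in all_paths(children, root, z)]
    # D: nodes common to every root-to-taxon path.
    stable = set(paths[0]).intersection(*[set(p) for p in paths[1:]])
    # Nodes of D appear in root-to-leaf order along any path; take the last.
    return [v for v in paths[0] if v in stable][-1]

# Hypothetical network: a single 2-cycle (two parallel edges r -> h)
# sits above the rooted tree ((a, b), c).
CHAIN_EDGES = [("r", "h"), ("r", "h"), ("h", "m"), ("m", "n"),
               ("m", "c"), ("n", "a"), ("n", "b")]
```

Here the LSA of {a, b, c} is the node m below the 2-cycle rather than the root r, which illustrates why the LSA and the root of a network can differ.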

Phylogenetic networks as defined here have no cycles in the usual sense for a directed graph. The term cycle will thus be used to refer to a collection of edges that form a cycle when all edges are undirected. A cycle must contain at least two hybrid edges sharing a hybrid node, and may contain any non-negative number of tree edges. The class of networks we focus on is those in which cycles are separated, in the following sense.

Definition 4 A rooted binary phylogenetic network $N^{+}$ is said to be level-1 if no two distinct cycles in $N^{+}$ share an edge.

Although this is not the standard definition of level-1 [Rosselló and Valiente, 2009], in the setting of binary networks it is equivalent.

Each cycle on a level-1 phylogenetic network contains exactly one hybrid node and two hybrid edges with that node as a child. Thus there is a one-to-one correspondence between cycles and the hybrid nodes they contain. A cycle composed of n edges, 2 of which are hybrid, is called an n-cycle. If the cycle’s hybrid node has k leaf descendants, it is an n_k-cycle.

Passing from a large network to one on a subset of the taxa is similar to the process for trees.

Definition 5 Suppressing a node with both in- and out-degree 1 in a directed phylogenetic network means replacing it and its two incident edges with a single edge from its parent to its child. For a metric network, the new edge is assigned a length equal to the sum of lengths of the two replaced. If the outedge was hybrid, the new edge is also hybrid and retains the hybridization parameter.

Similarly, suppressing a node of degree 2 between two undirected edges means replacing it and its two incident edges with a single undirected edge.

Definition 6 Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network on X and let Y ⊂ X. The induced rooted network $N_{Y}^{+}$ on Y is the network obtained from $N^{+}$ by retaining nodes and edges in every path from the root r on $N^{+}$ to any y ∈ Y, and then suppressing all nodes with in- and out-degree 1. We then say $N^{+}$ displays $N_{Y}^{+}$ .

We also need the notion of a rooted undirected network, in which all edges have been undirected but the root retained. Note that if a rooted network is a tree, knowledge of the root alone is enough to recover the direction of every edge, so this notion is not useful in that setting. If cycles are present, knowledge of the root determines only the direction of every cut edge (an edge whose deletion results in a graph with two connected components), and edges directly descended from cut edges. Knowing the root and all hybrid nodes in an undirected level-1 network does, however, determine the full directed network.

Several other notions of networks induced from a directed one are needed.

Definition 7 Let $N^{+}$ be a (metric or topological) binary rooted phylogenetic network on X.

[Baños, 2019] The LSA network $N^{\oplus}$ induced from $N^{+}$ is the network on X obtained by deleting all edges and nodes above $LSA (X, N^{+})$ , and designating $LSA (X, N^{+})$ as the root node.
The undirected LSA network $N^{⊖}$ is the rooted network obtained from the LSA network $N^{\oplus}$ by undirecting all edges.
[Baños, 2019] The unrooted semidirected network $N^{-}$ is the unrooted network obtained from the LSA network $N^{\oplus}$ by undirecting all tree edges and suppressing the root, but retaining directions of hybrid edges.

For a binary level-1 network $N^{+}$ , the only possible structure above the LSA has the form of a (possibly empty) chain of 2-cycles [Baños, 2019], an example of which is shown in Figure 2. The LSA network $N^{\oplus}$ is obtained by simply deleting that chain.

Fig. 2 — A rooted network $N^{+}$ whose LSA network $N^{\oplus}$ is the rooted tree ((a, b), c), but which has a chain of 2-cycles above LSA(a, b, c).

Note that the terminology of “n_k-cycles” can be applied to LSA networks $N^{\oplus}$ , as hybrid edges retain their direction. On undirected LSA networks $N^{⊖}$ , however, “n-cycle” can still be applied, but “n_k-cycle” generally cannot.

Definition 8 By suppressing a cycle C in a topological level-1 network we mean deleting all edges in C, identifying all nodes in C, and if the resulting node is of degree 2 suppressing it. If the network is rooted and this results in the root becoming a degree-1 node, then the resulting edge below the root is also deleted, with its child becoming the root.

Suppressing an n-cycle in a binary level-1 network results in a non-binary network when n ≥ 4. However if only 2- and 3-cycles are suppressed, the result is binary.

2.2. Coalescent Model on Networks

The formation of gene trees within a species network, as ancestral lineages of sampled loci from extant taxa join together moving backwards in time, is given a mechanistic description by the Network Multispecies Coalescent Model (NMSC) [Meng and Kubatko, 2009, Yu et al., 2011, Zhu et al., 2016].

Parameters of the NMSC for a set of taxa X include a metric rooted binary phylogenetic network $(N^{+}, \{ℓ_{e}\}, \{γ_{e}\})$ on X, with edge lengths ℓ_e in generations. In addition, for each edge e = (u, v) fix a function $N_{e} : [0, ℓ_{e}) \to \mathbb{R}^{> 0}$ giving the (haploid) population size along the edge, where N_e(0) is the population size at the child node v and N_e(t) is the population at time t units above it. Finally, let $N_{r} : [0, \infty) \to \mathbb{R}^{> 0}$ be an additional population size function for an infinite length ‘edge’ ancestral to the root r of the network. The N_e need not be constant nor equal, although those are common assumptions in other works. As did Allman et al. [2019b], we make the biologically-plausible technical assumptions that the functions N_e are bounded, and that all 1/N_e(t) are integrable over finite intervals.

Figure 1 (Left) depicts an example species network that is ultrametric in generations, with hybrid edges h and h′, and population functions N_e on each edge depicted by time-varying widths of the network edges. The edge lengths ℓ_e are measured on the t-axis between the horizontal lines indicating speciation and hybridization events. Figure 1 (Right) gives a schematic of the same species network, without a depiction of population functions.

The standard Kingman coalescent models the formation of gene trees, with edge lengths in generations, within a single population edge e, with pairs of lineages coalescing independently as they trace backward in time, at instantaneous rate 1/N_e(t). The multispecies coalescent model (MSC) extends this to a tree of populations, by using the standard coalescent on each edge, as well as an infinite length edge above the root, allowing multiple gene lineages to enter a population from its descendant ones at a tree node. The NMSC extends this further, so that lineages reaching hybrid nodes randomly enter one or the other hybrid edge above them, with the choice determined independently according to the hybridization parameter probabilities. Thus the NMSC parameters $(N^{+}, \{ℓ_{e}\}, \{γ_{e}\})$ and {N_e} determine a distribution of rooted metric gene trees. The structure of the NMSC also ensures that the distributions of gene trees obtained by marginalization to a subset Y of taxa are the same as the distributions obtained from the NMSC on the displayed network $N_{Y}^{+}$ .
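As a small worked instance of the coalescent rate described above (a sketch for illustration, not a formula quoted from the text): since a pair of lineages in edge e coalesces at instantaneous rate 1/N_e(t), the probability that two lineages present at the bottom of e have not coalesced by the top of the edge should be

```latex
P(\text{no coalescence on } e)
  = \exp\left( -\int_{0}^{\ell_e} \frac{1}{N_e(t)} \, dt \right),
\qquad \text{which for constant } N_e(t) \equiv N \text{ reduces to } \exp\left(-\ell_e / N\right).
```

For example, an edge with ℓ_e = N generations gives a probability of e^{−1} ≈ 0.37 that the pair leaves the edge uncoalesced. The integrability assumption on 1/N_e is what keeps the exponent finite.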

2.3. Sequence substitution models on gene trees

The k-state general time-reversible model (GTR) for sequence evolution is a continuous-time Markov process on a metric gene tree. Gene tree edge lengths are in substitution units, and sequences are composed of k possible states, or bases. Model parameters are a k × k instantaneous rate matrix Q together with a k-state distribution π, with non-negative entries summing to 1, satisfying the following:

off-diagonal entries of Q are positive,
row sums of Q are 0,
trace Q = −1,
πQ = 0,
diag(π)Q is symmetric.
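The five conditions above can be illustrated with a small construction. The sketch below assumes the common parameterization of a GTR rate matrix by a symmetric matrix of exchangeabilities together with π; that parameterization, the function names, and the numerical tolerance are choices made here for illustration and are not taken from the text.

```python
def gtr_rate_matrix(pi, exch):
    """Build a rate matrix with Q[i][j] proportional to exch[i][j] * pi[j].

    `exch` is assumed symmetric with positive off-diagonal entries; its
    diagonal is ignored. The result is rescaled so that trace Q = -1.
    """
    k = len(pi)
    Q = [[exch[i][j] * pi[j] if i != j else 0.0 for j in range(k)]
         for i in range(k)]
    for i in range(k):
        Q[i][i] = -sum(Q[i])              # make each row sum to 0
    scale = -sum(Q[i][i] for i in range(k))
    return [[q / scale for q in row] for row in Q]

def is_gtr(Q, pi, tol=1e-12):
    """Check the five listed conditions on (Q, pi) up to tolerance `tol`."""
    k = len(pi)
    idx = range(k)
    return (
        all(Q[i][j] > 0 for i in idx for j in idx if i != j)
        and all(abs(sum(row)) < tol for row in Q)
        and abs(sum(Q[i][i] for i in idx) + 1.0) < tol
        and all(abs(sum(pi[i] * Q[i][j] for i in idx)) < tol for j in idx)
        and all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < tol
                for i in idx for j in idx)
    )
```

With uniform π and all exchangeabilities equal, this construction gives the Jukes-Cantor matrix normalized to trace −1, with off-diagonal entries 1/12 and diagonal entries −1/4.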

In the ultrametric framework for our species networks, we introduce an additional time-dependent but lineage-independent rate scalar μ(t) for Q, where t is measured in generations from leaves to the root and beyond, and μ(t) has units of substitutions/generation. We assume μ is piecewise-continuous, μ(t) > 0 for all t ≥ 0 so that the mutation process never stops, and $\int_{0}^{\infty} μ (t) d t = \infty$ so that the total amount of possible mutation is unbounded. Following Allman et al. [2019b], this substitution model is denoted by GTR+μ.

For any node u on a gene tree, let t_u denote the distance, in generations, to that node from its descendant leaves. The states at a single site in sequences at the taxa at the leaves on the gene tree are then determined as follows: A state is randomly chosen at the root of the tree from the distribution π. For each edge e = (u, v) descendant from a node u the site undergoes random state changes with rates μ(t)Q for times t ∈ [t_v, t_u] to obtain states at the child nodes. The full substitution process on the edge is thus described by the Markov matrix

M_{e} = \exp \left( \int_{t_{v}}^{t_{u}} μ (t) \, d t \; Q \right) .

A similar process is then repeated for those nodes’ children, and so on, until states at the taxa have been determined.
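The edge matrix M_e can be approximated numerically. The sketch below is one possible approach in pure Python, assuming a midpoint rule for the integral of μ and a scaling-and-squaring Taylor series for the matrix exponential; the function names, step counts, and the Jukes-Cantor example are assumptions made here for illustration.

```python
import math

def integrate(mu, t_v, t_u, steps=1000):
    """Midpoint-rule approximation of the integral of mu over [t_v, t_u]."""
    h = (t_u - t_v) / steps
    return h * sum(mu(t_v + (i + 0.5) * h) for i in range(steps))

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, squarings=10, terms=20):
    """exp(A) by scaling and squaring with a truncated Taylor series."""
    n = len(A)
    scaled = [[a / 2 ** squarings for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        # term becomes scaled^m / m!
        term = [[x / m for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def edge_markov_matrix(Q, mu, t_v, t_u):
    """M_e = exp(s * Q), where s is the integral of mu over [t_v, t_u]."""
    s = integrate(mu, t_v, t_u)
    return mat_exp([[s * q for q in row] for row in Q])

# Jukes-Cantor rate matrix normalized so that trace Q = -1.
JC_Q = [[-0.25 if i == j else 1 / 12 for j in range(4)] for i in range(4)]
```

For this Jukes-Cantor matrix the exponential has the closed form with diagonal entries 1/4 + (3/4)e^{−s/3}, which gives a convenient check; for example, `edge_markov_matrix(JC_Q, lambda t: 0.01, 0.0, 300.0)` corresponds to s = 3.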

2.4. Mixture of coalescent mixtures

The model we focus on is the m-class mixture of coalescent mixtures [Allman et al., 2019b] extended from a tree to an ultrametric network. This model has as parameters an ultrametric species network $(N^{+}, \{ℓ_{e}\}, \{γ_{e}\})$ , population size functions {N_e}, a finite collection $\{(Q_{i}, π_{i}; μ_{i})\}_{i = 1}^{m}$ of GTR+μ parameters for the m classes, and a vector λ of m positive class size parameters summing to 1.

Sequence data is generated as follows: For each site:

a gene tree T is sampled according to the NMSC model on $(N^{+}, \{ℓ_{e}\}, \{γ_{e}\})$ with population sizes {N_e},
class i is sampled from the distribution λ to determine parameters (Q_i, π_i; μ_i),
the bases for each x ∈ X are sampled under the GTR+μ process on T with parameters (Q_i, π_i; μ_i).

This model is denoted by $M = M (θ)$ where

θ = ((N^{+}, \{ℓ_{e}\}, \{γ_{e}\}), \{N_{e}\}, λ, \{(Q_{i}, π_{i}; μ_{i})\}) .

Sampling n independent sites from this model produces k-state aligned sequences of n unlinked sites. As usual in phylogenetics, these are summarized through counts of site patterns across the sequences in an ∣X∣-dimensional k × k × ⋯ × k array. Marginalizations of this array to 2-dimensions give pairwise k × k site pattern count matrices that compare only the sequences for two taxa in X.
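One way to read the per-site sampling scheme is as a two-level mixture over gene trees and substitution classes. In the sketch below, the symbol p for a site pattern, the expectation notation, and the conditional probability symbol are introduced here for illustration, and the gene tree and class are taken to be drawn independently, as described in the Introduction:

```latex
P_{\theta}(p) \;=\; \sum_{i=1}^{m} \lambda_i \,
  \mathbb{E}_{T \sim \mathrm{NMSC}}
  \left[ P\left(p \mid T,\, Q_i, \pi_i, \mu_i\right) \right]
```

The entries of the expected site pattern array are then these probabilities, and the pairwise k × k matrices are their marginalizations to two taxa.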

In the tree context, two extensions of this model were discussed by Allman et al. [2019b]. For the first, the model assumption of one independently drawn gene tree for each site is modified to a more realistic one for genomic sequences in which all sites for a genetic locus share a gene tree. If the lengths (in number of sites) of the loci are independent identically distributed draws from some distribution, then the expected site pattern distribution for such a model is unchanged from that determined by $M$ . Only the rate of convergence, as the number of sampled genes grows, of frequencies of sampled site patterns to the asymptotic distribution will be slowed. This model is considered in Section 6, as its analysis follows easily from that for unlinked sites.

Another extension in the tree setting of Allman et al. [2019b] allowed for relaxing the ultrametric condition while retaining strong results on identifiability from the logDet distances. In that extension, the scalar rate function was allowed to be edge dependent as long as a certain symmetry condition on mixture components resulted in ultrametricity in substitution units “on average” across gene trees. While a similar model extension in the network setting seems likely to lead to similar results, it is not explored here, as the technical complications are greater than in the tree case.

2.5. LogDet distance

The fundamental tool we use to study relationships of taxa under the mixture of coalescent mixtures model $M$ is the logDet distance between a pair of aligned sequences. It is computed as follows: For taxa a, b ∈ X, let ${\hat{F}}^{a b}$ be a k × k matrix of empirical relative site-pattern frequencies, obtained by normalizing the site pattern count matrix for a and b, so that its entries sum to 1. Thus the ij entry of ${\hat{F}}^{a b}$ is the proportion of sites in the sequences exhibiting base i for a and base j for b. With ${\hat{f}}_{a}$ and ${\hat{f}}_{b}$ the vectors of row and column sums of ${\hat{F}}^{a b}$ , which give the proportions of various bases in the sequences for a and b, let ${\hat{g}}_{a}$ and ${\hat{g}}_{b}$ be the products of the entries of ${\hat{f}}_{a}$ and ${\hat{f}}_{b}$ , respectively. Then the empirical logDet distance is

{\hat{d}}_{LD} (a, b) = - \frac{1}{k} \left( \ln \det ({\hat{F}}^{a b}) - \frac{1}{2} \ln ({\hat{g}}_{a} {\hat{g}}_{b}) \right) \qquad (1)

Under most phylogenetic models, including the mixture of coalescent mixtures model, individual site patterns in sequences are assumed to be independent and identically distributed. By the weak law of large numbers, ${\hat{F}}^{a b}$ computed from a sample will converge in probability to its expected value F^ab as the sequence length goes to ∞. By the continuous mapping theorem, e.g. [van der Vaart, 1998], the empirical logDet distance thus converges in probability to the logDet distance computed by the same formula from the expected F^ab, a quantity we refer to as the theoretical logDet distance and denote by d_LD(a, b).
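The empirical logDet distance of equation (1) is inexpensive to compute from two aligned sequences. The following is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, sites containing characters outside the alphabet (such as gaps) are simply skipped, and the code raises an error when the determinant is not positive, a case equation (1) does not address.

```python
import math

def pattern_frequencies(seq_a, seq_b, alphabet="ACGT"):
    """k x k matrix of relative site-pattern frequencies for two sequences."""
    k = len(alphabet)
    index = {base: i for i, base in enumerate(alphabet)}
    counts = [[0.0] * k for _ in range(k)]
    n = 0
    for x, y in zip(seq_a, seq_b):
        if x in index and y in index:     # skip gaps and ambiguity codes
            counts[index[x]][index[y]] += 1
            n += 1
    return [[c / n for c in row] for row in counts]

def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if A[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det

def logdet_distance(F):
    """Equation (1); math.log raises ValueError if det(F) <= 0."""
    k = len(F)
    f_a = [sum(row) for row in F]                               # row sums
    f_b = [sum(F[i][j] for i in range(k)) for j in range(k)]    # column sums
    g_a, g_b = math.prod(f_a), math.prod(f_b)
    return -(math.log(determinant(F)) - 0.5 * math.log(g_a * g_b)) / k
```

For two identical sequences in which every base occurs, the frequency matrix is diagonal and the distance is 0. Since only pairwise count matrices are needed, all distances for a taxon set can be obtained in a single pass over the alignment for each pair, without reference to gene boundaries.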

3. Rooted Networks from Undirected Rooted Triple Networks

The goal of this section is to establish Proposition 1, a combinatorial result indicating features of a topological level-1 rooted n-taxon network that can be recovered from its induced undirected rooted triple networks with 2- and 3-cycles suppressed. This is a rooted analog of a key result of Baños [2019] relating unrooted semidirected networks and their induced undirected quartet networks. Later sections of this paper focus on identifying these rooted triple networks under the model $M$ .

There are several possible routes to Proposition 1. One approach would be to follow the argument of the quartet analog, with modifications throughout due to the rooted setting. Another would be to imitate the alternate proof of the quartet result given by Allman et al. [2019a], based on an extension of the intertaxon quartet distance of Rhodes [2019], but instead using the rooted triple distance also introduced in that work. The argument presented here is shorter than these approaches, as it leverages information about undirected rooted triple networks to obtain information about undirected quartet networks, and then applies the theory of Baños [2019].

The following result, extracted from the proof of Theorem 4 of Baños [2019], will be used. In it, and throughout this work, by a network modulo 2- and 3-cycles we mean the network obtained by suppressing all 2- and 3-cycles. Similarly, modulo directions of edges in 4-cycles means that all edges in 4-cycles are undirected. As a result, which of the edges in a 4-cycle are hybrid, and therefore which node is hybrid, is not indicated.

Lemma 1 ([Baños, 2019]) Let $N^{+}$ be a level-1 rooted binary topological phylogenetic network on X. Let Q be the set of undirected quartet networks obtained from those displayed on $N^{+}$ by unrooting, suppressing all cycles of size 2 and 3, and undirecting all edges. Then modulo 2- and 3-cycles and directions of edges in 4-cycles, the semidirected unrooted network $N^{-}$ is determined by Q.

In order to apply this to rooted triples, we first recall some combinatorial properties of rooted triple and quartet networks.

Lemma 2 ([Baños, 2019]) Let $Q^{-}$ be a level-1 unrooted semidirected binary quartet network. Then $Q^{-}$ has no k-cycles for k ≥ 5, and at most one 4-cycle. If $Q^{-}$ has a 4-cycle, then it has neither 3- nor 2₂-cycles. If there is no 4-cycle, then there are at most two 3-cycles, with at most one of these a 3₂-cycle.

Lemma 2 can be used to characterize possible cycles in a rooted triple network, by attaching an outgroup at the root. More specifically, by attaching an outgroup o to the root of an n-taxon network on taxa X with o ∉ X we mean identifying the root r of the network with the node r on an edge (r, o) and undirecting all tree edges. This gives an (n + 1)-taxon unrooted semidirected network. The rooted triple networks displayed on the original network are then in one-to-one correspondence with induced semidirected quartet networks containing o on the new network. This construction yields the following.

Corollary 1 Let $N^{+}$ be a level-1 binary rooted triple network. Then $N^{+}$ has no k-cycles for k ≥ 5, and at most one 4-cycle, in which case there are no 3- or 2₂-cycles. If there is no 4-cycle, then there are at most two 3-cycles, with at most one of these a 3₂-cycle.

Considering a rooted quartet network $Q^{+}$ , and the impact of passing to its associated unrooted semidirected quartet network $Q^{-}$ , Lemma 2 also immediately yields the following.

Corollary 2 Let $Q^{+}$ be a level-1 rooted binary quartet network. Then $Q^{+}$ has no k-cycles for k ≥ 6, and has at most one 5-cycle or one 4-cycle, but not both.

We now catalog the rooted quartet networks with 4- or 5-cycles, modulo smaller cycles.

Lemma 3 Let $Q^{+}$ be a level-1 binary rooted quartet network with one 4-cycle or one 5-cycle. Then modulo 2- and 3-cycles and up to taxon relabeling, the LSA network $Q^{\oplus}$ is one of those shown in Figure 3. Thus $Q^{+}$ displays either 1, 2, or 3 rooted triples with a 4-cycle.

Fig. 3 — All rooted directed topological quartet networks with a single 4- or 5-cycle, and no other cycles, up to relabeling of taxa. Networks in the top row display exactly one rooted triple with a 4-cycle, those in the middle row display two, and those in the bottom row display three.

Proof Let $Q^{+}$ be a rooted level-1 network on {a, b, c, d} with a cycle C of size 4 or 5. By Corollary 2, C is the only cycle of size greater than 3. Figure 3 shows the topologies, up to taxon relabeling, of all the rooted quartet networks with a 4- or 5-cycle and no 2- or 3-cycles, as determined by enumerating all possible locations for adding hybrid edges to a rooted 4-taxon tree. The top row of Figure 3 shows the quartet networks with exactly one displayed rooted triple, on {a, b, c}, having a 4-cycle. The middle row shows the networks with exactly two displayed rooted triples, on {a, b, c} and {a, b, d}, having a 4-cycle. The bottom row shows those with exactly three displayed rooted triples, on {a, b, c}, {a, b, d}, and {a, c, d}, having a 4-cycle.

Now we proceed to the main result of this section.

Proposition 1 Let $N^{+}$ be a level-1 rooted binary topological phylogenetic network on X. Let S be the set of undirected rooted triple networks obtained from those displayed on $N^{+}$ by suppressing all cycles of size 2 and 3 and undirecting all edges. Then modulo 2- and 3-cycles and directions of edges in 4-cycles, the LSA network $N^{\oplus}$ is determined by S.

Proof We first build a set of rooted quartet networks from S. Let {a, b, c, d} ⊆ X and let S_abcd ⊆ S be the set of undirected rooted triple networks on any three elements of {a, b, c, d}, so ∣S_abcd∣ = 4. By Corollary 2 and Lemma 3, there are k = 0, 1, 2, or 3 elements of S_abcd with a 4-cycle. We consider each possibility in turn, showing that we can determine the undirected rooted quartet network $N_{a b c d}^{⊖}$ modulo 2- and 3-cycles.

If k = 0, all rooted triples in S_abcd are trees and since $N_{a b c d}^{+}$ has no 4- or 5-cycles by Lemma 3, the undirected LSA network $N_{a b c d}^{⊖}$ modulo 2- and 3-cycles is a tree. By a well-known result for trees [Semple and Steel, 2005], S_abcd determines $N_{a b c d}^{⊖}$ modulo 2- and 3-cycles.

If k = 1, then modulo 2- and 3-cycles and relabeling of taxa, $N_{a b c d}^{+}$ is isomorphic to one of the networks in the top row of Figure 3. But for these networks, if a, b, c are the taxa in the rooted triple network with a 4-cycle, then the rooted 4-taxon network is obtained by attaching d as an outgroup to it. Thus $N_{a b c d}^{⊖}$ is determined modulo 2- and 3-cycles.

If k = 2, $N_{a b c d}^{+}$ is isomorphic, modulo 2- and 3-cycles and relabeling, to one of the networks in the middle row of Figure 3. Note that for all those rooted quartet networks, the displayed rooted triple networks with 4-cycles are on {a, b, c} and {a, b, d}, and the 4-taxon network can be obtained from either of these by replacing c or d with a cherry on {c, d}, thus determining $N_{a b c d}^{⊖}$ modulo 2- and 3-cycles.

If k = 3, $N_{a b c d}^{+}$ is isomorphic, modulo 2- and 3-cycles and relabeling, to one of the networks in the bottom row of Figure 3. In both of these, there is exactly one taxon, a, that is in all three rooted triple networks with 4-cycles, and there is exactly one taxon, c, that has graph-theoretic distance 3 from a in exactly one of the two rooted triples with 4-cycles it appears in. Thus we can determine which taxon is a, and which is c. For the remaining pair b, d, if there is a taxon that is at distance 4 from a in both 4-cycle rooted triple networks it appears in, then the 4-taxon network is the one shown on the left, and that taxon is d. Otherwise, the network is the one shown on the right. In this case there is exactly one rooted triple network on a and c which has its third taxon at distance 2 from the root, and this determines b. Thus we obtain the rooted 4-taxon network $N_{a b c d}^{\oplus}$ modulo 2- and 3-cycles, and hence $N_{a b c d}^{⊖}$ modulo 2- and 3-cycles.
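The bookkeeping behind this case analysis can be sketched in Python. The sketch assumes a hypothetical input format in which the 4-cycle status of each displayed rooted triple network is already known; the function name `count_four_cycle_triples` and the row labels are illustrative and not part of the original argument.

```python
from itertools import combinations

# Row of Figure 3 matched to each admissible value of k (k = 0 is the tree case).
FIGURE_3_ROW = {
    0: "tree (no 4- or 5-cycle)",
    1: "top row",
    2: "middle row",
    3: "bottom row",
}


def count_four_cycle_triples(quartet, triples_with_4cycle):
    """Return (k, row) for a 4-taxon set.

    quartet             : iterable of four distinct taxon labels
    triples_with_4cycle : set of frozensets, each a 3-taxon subset whose
                          rooted triple network in S has a 4-cycle
                          (hypothetical input format)
    """
    taxa = sorted(set(quartet))
    if len(taxa) != 4:
        raise ValueError("expected four distinct taxa")
    # The four 3-element subsets index the networks in S_abcd.
    triples = [frozenset(t) for t in combinations(taxa, 3)]
    k = sum(1 for t in triples if t in triples_with_4cycle)
    if k not in FIGURE_3_ROW:
        # k = 4 is ruled out by Corollary 2 and Lemma 3.
        raise ValueError("k = 4 cannot occur for a level-1 network")
    return k, FIGURE_3_ROW[k]
```

For example, passing the triples {a, b, c} and {a, b, d} as the ones with 4-cycles selects the middle row, after which the quartet network is rebuilt by replacing c or d with a cherry on {c, d}.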

With all rooted 4-taxon networks $N_{a b c d}^{⊖}$ modulo 2- and 3-cycles determined, we attach an outgroup o to all, giving the collection of all 5-taxon unrooted networks including o, modulo 2- and 3-cycles, induced from the unrooted network $N^{'}$ formed by attaching o to the root of $N^{+}$ . But the unrooted 4-taxon networks displayed on these 5-taxon ones form the collection of all 4-taxon undirected networks (possibly including o) modulo 2- and 3-cycles displayed on $N^{'}$ .

Lemma 1 now determines $N^{'}$ modulo 2- and 3-cycles, with directions of cut edges and edges in cycles of size ≥ 5, though not in 4-cycles. Rooting $N^{'}$ by the outgroup o we recover the topology of $N^{\oplus}$ modulo 2- and 3-cycles and directions of edges in 4-cycles.

4. Expected pattern frequencies as convex sums

The theoretical logDet distance between taxa depends on the matrix of expected relative site-pattern frequencies F^xy in aligned sequences for taxa x, y, under the mixture of coalescent mixtures model $M (θ)$ . The goal of this section is to show that F^xy on a level-1 ultrametric rooted triple network can be expressed as a convex combination of frequency matrices for networks with no cycles below the LSA of the taxa. In this way, we reduce the computation of F^xy to its computation on simpler networks. This is complicated somewhat by the fact that the convex combination may have terms which are expected pattern frequencies conditioned on a pair of lineages coalescing below a certain node in a network.

The lemmas that follow often involve modifying a network $N^{+}$ by removing a hybrid edge, to obtain a new network $N_{i}^{+}$ . If one hybrid edge in a cycle is removed, the hybrid node is then suppressed as the other hybrid edge is joined to the descendant tree edge and given the induced length and population size. We retain all other edge lengths and population sizes, as well as hybrid parameters for unaffected cycles. The parameters for the substitution process describing sequence evolution on gene trees are also retained. If θ denotes the full set of parameters associated to $N^{+}$ , then θ_i denotes the full set of parameters associated to $N_{i}^{+}$ in this way. Notation such as F^xy(θ) or F^xy(θ_i) denotes the dependence of F^xy on the parameters θ or θ_i, which include the network $N^{+}$ or $N_{i}^{+}$ .

The most straightforward network simplifications occur when the hybrid node of a cycle has a single descendant leaf, as depicted by the example 2₁-, 3₁- and 4₁-cycles in Figure 4.

Fig. 4 — Examples of level-1 rooted triple networks with 2₁-, 3₁-, and 4₁-cycles. While multiple 2₁-cycles may be present along any pendant edge shown here in dashes, there can be at most two 3₁-cycles, whose hybrid nodes are located on a dashed pendant edge. At most one 4₁-cycle can be present. Site-pattern frequency matrices from the model $M$ on rooted triple networks with these types of cycles are convex combinations of such matrices for 1, 2, or 4 networks without those cycles, as shown by Lemmas 4 and 5.

Lemma 4 (Removing 2₁-cycles) Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} and let C be a 2₁-cycle in $N^{+}$ with hybrid edges h₁, h₂. Let $N_{1}^{+}$ be the network obtained from $N^{+}$ by removing h₂. Then, under the model $M$ for any x, y ∈ {a, b, c},

F^{x y} (θ) = F^{x y} (θ_{1}) .

Proof Since the hybrid node of C has only one descendant, the combined coalescent and substitution process on $N^{+}$ can be expressed as a linear combination of those processes on $N_{1}^{+}$ and $N_{2}^{+}$ , where $N_{2}^{+}$ is obtained from $N^{+}$ by removing h₁, weighted by γ₁ = γ(h₁), γ₂ = γ(h₂). That is, for any x, y ∈ {a, b, c},

F^{x y} (θ) = γ_{1} F^{x y} (θ_{1}) + γ_{2} F^{x y} (θ_{2}) .

But $N_{1}^{+}$ and $N_{2}^{+}$ only differ by h₁ and h₂ which have the same length, though possibly different population sizes. However, since only one lineage can be present in the population for those edges, those population sizes have no impact in model $M$ , so F^xy(θ₂) = F^xy(θ₁). Since γ₁ + γ₂ = 1, the claim follows.

If a network $N^{+}$ has multiple 2₁-cycles, then applying Lemma 4 repeatedly gives $F^{x y} (θ) = F^{x y} (\tilde{θ})$ , where $\tilde{θ}$ denotes the parameters associated to ${\tilde{N}}^{+}$ , a rooted network with no 2₁-cycles obtained from $N^{+}$ by deleting one hybrid edge in each of the 2₁-cycles on $N^{+}$ .

Lemma 5 (Decomposing 3₁- and 4₁-cycles) Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} and let C be either a 3₁- or a 4₁-cycle on $N^{+}$ . Let h₁, h₂ be the hybrid edges of C with γ_i = γ(h_i). Let $N_{i}^{+}$ be the network obtained from $N^{+}$ by removing h_j, j ≠ i. Then, under the model $M$ for any x, y ∈ {a, b, c},

F^{x y} (θ) = γ_{1} F^{x y} (θ_{1}) + γ_{2} F^{x y} (θ_{2}) .

Proof Since the hybrid node of C has only one descendant, we can express the combined coalescent and substitution process on $N^{+}$ as a linear combination of the processes on the $N_{i}^{+}$ with coefficients γ_i, i = 1, 2.

A level-1 rooted triple network may have one 4₁-cycle, one 3₁-cycle, or two 3₁-cycles. In the last case, Lemma 5 may be applied twice, to express the pattern frequency matrix under the model as a convex combination of four such matrices for networks with no 3₁-cycles.
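As a minimal sketch of the last case, the following Python snippet computes the convex weights that arise when Lemma 5 is applied once per cycle. The function name `convex_weights` and the input format (one pair of hybridization parameters per cycle) are assumptions made for illustration only.

```python
from itertools import product


def convex_weights(gammas):
    """Weights obtained by applying Lemma 5 once per 3_1- or 4_1-cycle.

    gammas : list of (gamma_1, gamma_2) pairs, one per cycle, each summing to 1
    Returns a dict mapping a tuple of retained-edge indices (0 or 1 per cycle)
    to the weight of the corresponding simplified network.
    """
    for g1, g2 in gammas:
        if min(g1, g2) < 0 or abs(g1 + g2 - 1.0) > 1e-12:
            raise ValueError("each pair must be non-negative and sum to 1")
    weights = {}
    for choice in product((0, 1), repeat=len(gammas)):
        w = 1.0
        for pair, kept in zip(gammas, choice):
            w *= pair[kept]  # probability of following the retained hybrid edge
        weights[choice] = w
    return weights
```

With two 3₁-cycles this yields four weights, each a product of one γ from each cycle, and the weights sum to 1, so the resulting expression is a convex combination.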

With Lemma 4 this shows that computation of the matrix of relative site-pattern frequencies of a level-1 ultrametric rooted triple network $N^{+}$ reduces to cases where there are no 2₁-, 3₁-, or 4₁-cycles. The effects of 2₂- and 3₂-cycles are more complicated, however, as a coalescent event may or may not occur below the hybrid nodes of such cycles.

The following definition facilitates studying the impact of such cycles. In it a node p may be either an existing node or a new node introduced along an edge of a network, with appropriate division of the original edge length and population function. Although strictly speaking this second case passes out of the class of binary networks, we allow this only to simplify reference to intermediate states of the coalescent process.

Definition 9 Let K_p(θ) be the random variable giving the number of lineages at node $p \in V (N^{+})$ under the NMSC. With X_p ⊆ X denoting the set of taxa below p, K_p(θ) has sample space {1, 2,…, ∣X_p∣}.

When θ is clear from context we write K_p = K_p(θ). We also use the notation $F_{∣ K_{p} = m}^{x y} (θ)$ to denote the joint distribution of site patterns conditioned on K_p = m under the model $M$ with parameters θ.

Lemma 6 (Decomposing 2₂-cycles) Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} without 2₁- or 3₁-cycles. Suppose, as depicted in Figure 5, C is a 2₂-cycle on $N^{+}$ , with edges h₁, h₂ from node q to hybrid node p, hybridization parameters γ_i = γ(h_i), leaf descendants a, b of p, and no cycles below p. Denote by $N_{i}^{+}$ , i = 1, 2 the network obtained from $N^{+}$ by removing h_j, j ≠ i and by $N_{0}^{+}$ the network obtained from $N^{+}$ by deleting all edges and nodes below q and attaching edges (q, a) and (q, b) of appropriate length so that $N_{0}^{+}$ is ultrametric. Then, under the model $M$ for any x, y ∈ {a, b, c},

F^{x y} (θ) = γ_{1}^{2} F^{x y} (θ_{1}) + γ_{2}^{2} F^{x y} (θ_{2}) + P (K_{p} = 2) 2 γ_{1} γ_{2} F^{x y} (θ_{0}) + P (K_{p} = 1) 2 γ_{1} γ_{2} F_{∣ K_{p} = 1}^{x y} (θ_{1}) .

Fig. 5 — (Top) A rooted level-1 ultrametric network on {a, b, c}, with the 2₂-cycle closest to LSA(a, b) shown. (Bottom) The networks $N_{1}^{+}$ , $N_{2}^{+}$ , and $N_{0}^{+}$ obtained from $N^{+}$ , respectively, as described in Lemma 6. Note that there may be additional cycles along the dashed lines, with hybrid nodes above node q and taxon c.

Proof Since the structure of the model for $N^{+}$ , $N_{1}^{+}$ , and $N_{2}^{+}$ is identical below p, we may also use K_p to denote K_p(θ₁) and K_p(θ₂). Thus

F^{x y} (θ) = P (K_{p} = 2) F_{∣ K_{p} = 2}^{x y} (θ) + P (K_{p} = 1) F_{∣ K_{p} = 1}^{x y} (θ) = P (K_{p} = 2) [γ_{1}^{2} F_{∣ K_{p} = 2}^{x y} (θ_{1}) + γ_{2}^{2} F_{∣ K_{p} = 2}^{x y} (θ_{2}) + 2 γ_{1} γ_{2} F^{x y} (θ_{0})] + P (K_{p} = 1) F_{∣ K_{p} = 1}^{x y} (θ) .

(2)

But since $F_{∣ K_{p} = 1}^{x y} (θ) = F_{∣ K_{p} = 1}^{x y} (θ_{i})$ for i = 1, 2 by the argument used for Lemma 4, and the identity $1 = γ_{1}^{2} + γ_{2}^{2} + 2 γ_{1} γ_{2}$ ,

F_{∣ K_{p} = 1}^{x y} (θ) = γ_{1}^{2} F_{∣ K_{p} = 1}^{x y} (θ_{1}) + γ_{2}^{2} F_{∣ K_{p} = 1}^{x y} (θ_{2}) + 2 γ_{1} γ_{2} F_{∣ K_{p} = 1}^{x y} (θ_{1}) .

Substituting this into equation (2) and using P(K_p = 1) + P(K_p = 2) = 1 yields the claim.
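The algebra of this proof can be checked numerically on a single matrix entry. The sketch below is only a sanity check under stated assumptions: the function name `lemma6_sides` and all input values are hypothetical stand-ins for entries of the conditional frequency matrices, not quantities computed from a model.

```python
def lemma6_sides(g1, p2, f1_k2, f2_k2, f_k1, f0):
    """Evaluate both sides of the Lemma 6 identity for one matrix entry.

    g1           : gamma_1 (gamma_2 = 1 - g1)
    p2           : P(K_p = 2), so P(K_p = 1) = 1 - p2
    f1_k2, f2_k2 : entry of F^{xy}_{|K_p=2}(theta_i) for i = 1, 2
    f_k1         : entry of F^{xy}_{|K_p=1}, shared by theta, theta_1, theta_2
    f0           : entry of F^{xy}(theta_0)
    Returns (equation_2_value, lemma_6_value, sum_of_weights).
    """
    g2, p1 = 1.0 - g1, 1.0 - p2
    # Right-hand side of equation (2).
    eq2 = p2 * (g1**2 * f1_k2 + g2**2 * f2_k2 + 2 * g1 * g2 * f0) + p1 * f_k1
    # Unconditional entries on N_1 and N_2 by the law of total probability.
    f1 = p2 * f1_k2 + p1 * f_k1
    f2 = p2 * f2_k2 + p1 * f_k1
    # Right-hand side of the statement of Lemma 6.
    lemma = (g1**2 * f1 + g2**2 * f2
             + p2 * 2 * g1 * g2 * f0
             + p1 * 2 * g1 * g2 * f_k1)
    weights = (g1**2, g2**2, p2 * 2 * g1 * g2, p1 * 2 * g1 * g2)
    return eq2, lemma, sum(weights)
```

The two values agree for any inputs, and the four coefficients sum to 1, which is what makes the expression in Lemma 6 a convex combination.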

Note that while $N_{1}^{+}$ and $N_{2}^{+}$ of Lemma 6 have the same topology and edge lengths, the hybrid edges h₁, h₂ may have different population sizes. Thus F^xy(θ₁) ≠ F^xy(θ₂) is possible. This is in contrast to the argument on removing 2₁-cycles in Lemma 4, in which hybrid edge population sizes did not play a role.

Since a level-1 3-taxon rooted network cannot have a 2₂-cycle above a 3₂-cycle, Lemma 6 can be applied recursively to the $N_{i}^{+}$ , i ∈ {1, 2} to eliminate all 2₂-cycles. Thus the remaining complication to producing an expression for F^xy(θ) as a convex combination of such matrices for networks without 2₁-, 3₁-, or 2₂-cycles is the presence of terms of the form $F_{∣ K_{p} = 1}^{x y} (θ^{'})$ where $N^{' +}$ has cherry {a, b} and neither 2₁- nor 3₁-cycles. Such terms are handled with the following.

Lemma 7 (Decomposing 2₂- and 3₂-cycles conditioned on coalescence) Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} on which {a, b} form a cherry, with no 2₁-, 3₁-, or 4₁-cycles, and at least one 2₂- or 3₂-cycle. (See Figure 6.) Let p be the hybrid node parental to the common parent of a, b. Let ${\tilde{N}}^{+}$ be the network obtained from $N^{+}$ by removing one hybrid edge from each 2₂-cycle.

Fig. 6 — Networks $N^{+}$ meeting the hypothesis of Lemma 7, with at least one 2₂- or 3₂-cycle, and possibly 2₃-cycles. In both figures the dashed internal edge represents a possible chain of 2₂-cycles, and the dashed edge above the LSA a possible chain of 2₃-cycles. Note that a network with a 3₂-cycle may also have no 2₂-cycles (not shown), in which case p would be the 3₂-cycle’s hybrid node.

If $N^{+}$ has no 3₂-cycle, then

F_{∣ K_{p} = 1}^{x y} (θ) = F_{∣ K_{p} = 1}^{x y} (\tilde{θ}) .

If $N^{+}$ has a 3₂-cycle, with hybrid edges h₁, h₂ and hybridization parameters γ_i = γ(h_i), then let ${\tilde{N}}_{i}^{+}$ be the network obtained from ${\tilde{N}}^{+}$ by removing h_j, j ≠ i. Then

F_{∣ K_{p} = 1}^{x y} (θ) = γ_{1} F_{∣ K_{p} = 1}^{x y} ({\tilde{θ}}_{1}) + γ_{2} F_{∣ K_{p} = 1}^{x y} ({\tilde{θ}}_{2}) .

Proof Conditioned on K_p = 1, there is only one lineage in any population above p and below the hybrid node of a 3₂-cycle, if such a cycle is present, or the LSA otherwise. Thus, as in the proof of Lemma 4, no 2₂-cycle will have any effect on the joint distribution. If there is no 3₂-cycle on $N^{+}$ this yields the claim. If there is a 3₂-cycle, since only one lineage reaches the hybrid node of the 3₂-cycle, we obtain the claim as in the proof of Lemma 5.

Lemma 8 (Decomposing 3₂-cycles) Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} with no cycles below its LSA except a 3₂-cycle C. Let p denote the hybrid node of C, and h₁, h₂ the hybrid edges with hybridization parameters γ_i = γ(h_i) and lengths y, z, as depicted at the top of Figure 7. Let $N_{1}^{+}$ , $N_{2}^{+}$ , $N_{3}^{+}$ , and $N_{4}^{+}$ be the networks derived from $N^{+}$ shown at the bottom of Figure 7. Then, under the model $M$ , for any x, y ∈ {a, b, c}, with K_p = K_p(θ),

F^{x y} (θ) = γ_{1}^{2} F^{x y} (θ_{1}) + γ_{2}^{2} F^{x y} (θ_{2}) + P (K_{p} = 2) γ_{1} γ_{2} (F^{x y} (θ_{3}) + F^{x y} (θ_{4})) + P (K_{p} = 1) γ_{1} γ_{2} (F_{∣ K_{p} = 1}^{x y} (θ_{1}) + F_{∣ K_{p} = 1}^{x y} (θ_{2})) .

Fig. 7 — (Top) A rooted level-1 ultrametric network with a 3₂-cycle, and (Bottom) the networks $N_{1}^{+}$ , $N_{2}^{+}$ , $N_{3}^{+}$ , and $N_{4}^{+}$ used in Lemma 8. Although only topology and branch lengths are shown, population size parameters for each edge of $N_{i}^{+}$ are obtained from the corresponding ones of $N^{+}$ .

Proof Observe that

F^{x y} (θ) = P (K_{p} = 2) F_{∣ K_{p} = 2}^{x y} (θ) + P (K_{p} = 1) F_{∣ K_{p} = 1}^{x y} (θ) = P (K_{p} = 2) [γ_{1}^{2} F_{∣ K_{p} = 2}^{x y} (θ_{1}) + γ_{2}^{2} F_{∣ K_{p} = 2}^{x y} (θ_{2}) + γ_{1} γ_{2} F^{x y} (θ_{3}) + γ_{1} γ_{2} F^{x y} (θ_{4})] + P (K_{p} = 1) F_{∣ K_{p} = 1}^{x y} (θ) .

(3)

Since $F_{∣ K_{p} = 1}^{x y} (θ) = γ_{1} F_{∣ K_{p} = 1}^{x y} (θ_{1}) + γ_{2} F_{∣ K_{p} = 1}^{x y} (θ_{2})$ and γ₁ + γ₂ = 1,

F_{∣ K_{p} = 1}^{x y} (θ) = γ_{1}^{2} F_{∣ K_{p} = 1}^{x y} (θ_{1}) + γ_{2}^{2} F_{∣ K_{p} = 1}^{x y} (θ_{2}) + γ_{1} γ_{2} (F_{∣ K_{p} = 1}^{x y} (θ_{1}) + F_{∣ K_{p} = 1}^{x y} (θ_{2})) .

Using this and P(K_p = 1) + P(K_p = 2) = 1 in equation (3) yields the claim.

5. Theoretical logDet distances

In this section, we show that, under the mixture of coalescent mixtures model $M$ on an ultrametric level-1 rooted triple network, the theoretical logDet distances between taxa determine most topological features of the network. The previous section established that the pattern frequency matrices for the model on such networks can be expressed as convex combinations of those on simpler networks (possibly subject to conditioning), whose only cycles are 2₃-cycles located above LSA(a, b, c), such as depicted in Figure 2. The following algebraic lemma is key to drawing conclusions about the determinants of such linear combinations of matrices.

Lemma 9 ([Allman et al., 2019b], Lemma 3.1) Suppose for each i, F_i and G_i are κ × κ symmetric positive definite matrices such that y^TF_iy ≥ y^TG_iy for every $y \in R^{κ}$ with the inequality strict for some y and some i. For α_i ≥ 0, let

F = \sum_{i = 1}^{m} α_{i} F_{i}, G = \sum_{i = 1}^{m} α_{i} G_{i} .

Then

det F > det G .
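A small numerical illustration of Lemma 9, using 2 × 2 matrices chosen by hand, is sketched below. The matrices and weights are hypothetical; they are selected only so that the hypotheses are easy to verify (F₁ − G₁ is positive semidefinite and nonzero, and F₂ = G₂).

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def convex_sum(alphas, mats):
    """Entrywise sum of alpha_i * M_i for 2 x 2 matrices."""
    return [[sum(a * m[i][j] for a, m in zip(alphas, mats)) for j in range(2)]
            for i in range(2)]


# Illustrative symmetric positive definite matrices (not from the paper).
F_LIST = [[[3.0, 0.0], [0.0, 1.0]], [[1.0, 0.5], [0.5, 1.0]]]
G_LIST = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 0.5], [0.5, 1.0]]]
ALPHAS = [0.3, 0.7]

# y^T F_1 y - y^T G_1 y = y_0^2 >= 0, strict for y = (1, 0); F_2 = G_2.
det_f = det2(convex_sum(ALPHAS, F_LIST))
det_g = det2(convex_sum(ALPHAS, G_LIST))
print(det_f > det_g)
```

Here the strict inequality in the quadratic forms occurs for an index i whose weight α_i is positive, which is the situation in which the lemma is applied later in this section.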

Analyzing the pattern frequency matrix for networks with 2₃-cycles above LSA(a, b, c) requires a detailed look at the coalescent process in such a chain of 2-cycles. For a simple case, assume lineages x and y enter the single cycle chain depicted in Figure 8. Population functions N₁, N₂, N₃, and N₄ are fixed for each edge, where for convenience, we shift domains from the convention in Section 2.2 so that N₁ is defined on [0, t₀), N₂, N₃ on [t₀, t₁), and N₄ on [t₁, ∞).

Fig. 8 — A 2-cycle and adjacent tree edges in a species network, depicted (Left) with pipes whose widths represent population sizes, and (Right) as a schematic.

The probability density c(t) for time to coalescence of the lineages x, y entering at the bottom node (t = 0) can be calculated piecewise as follows: For t ∈ [0, t₀),

c (t) = \frac{1}{N_{1} (t)} exp (- \int_{0}^{t} \frac{1}{N_{1} (τ)} d τ),

(4)

as given by Allman et al. [2019b].

For t ∈ [t₀, t₁),

c (t) = p_{0} (γ^{2} c_{2} (t) + (1 - γ)^{2} c_{3} (t))

where $p_{0} = 1 - \int_{0}^{t_{0}} c (t) d t$ is the probability of no coalescence before t₀, and for i = 2, 3

c_{i} (t) = \frac{1}{N_{i} (t)} exp (- \int_{t_{0}}^{t} \frac{1}{N_{i} (τ)} d τ) .

Finally, for t ∈ [t₁, ∞), with $p_{1} = 1 - \int_{0}^{t_{1}} c (t) d t$ the probability of no coalescence before t₁,

c (t) = p_{1} \frac{1}{N_{4} (t)} exp (- \int_{t_{1}}^{t} \frac{1}{N_{4} (τ)} d τ) .

It is straightforward to extend this analysis of c(t) to a chain with an arbitrary number of 2-cycles. Since we will not need an explicit formula for the distribution of coalescent times for two lineages entering such a chain of 2-cycles, we omit a complete derivation, and only state the properties of it that we use.
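For the single 2-cycle case, the piecewise density can be written out explicitly when the population functions are constant. The Python sketch below makes that simplifying assumption (constant N₁, …, N₄, which the paper does not require), and the function name `chain_density` is illustrative.

```python
import math


def chain_density(t, t0, t1, gamma, n1, n2, n3, n4):
    """Density c(t) for two lineages entering one 2-cycle.

    Assumes constant population sizes n1..n4 on the four edges, so each
    integral of 1/N reduces to elapsed time divided by N.
    """
    if t < 0:
        return 0.0
    if t < t0:
        # Equation (4) with constant N_1.
        return math.exp(-t / n1) / n1
    p0 = math.exp(-t0 / n1)  # no coalescence before t0
    if t < t1:
        s = t - t0
        # Both lineages must follow the same hybrid edge to coalesce.
        return p0 * (gamma**2 * math.exp(-s / n2) / n2
                     + (1.0 - gamma)**2 * math.exp(-s / n3) / n3)
    d = t1 - t0
    # p1 = 1 - integral of c over [0, t1), evaluated in closed form.
    p1 = p0 * (1.0
               - gamma**2 * (1.0 - math.exp(-d / n2))
               - (1.0 - gamma)**2 * (1.0 - math.exp(-d / n3)))
    return p1 * math.exp(-(t - t1) / n4) / n4
```

Under these assumptions the three pieces carry probability 1 − p₀, p₀ − p₁, and p₁, so the density integrates to 1, and it is strictly positive for every t ≥ 0 when 0 < γ < 1.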

Formally, a chain of 2-cycles is a species network with leaf a₀, internal vertices b₁, a₁, b₂, a₂,…, a_n, with root r = a_n, tree edges e_i = (b_i, a_{i−1}), and hybrid edges $e_{i}^{'} = (a_{i}, b_{i})$ , $e_{i}^{″} = (a_{i}, b_{i})$ , together with edge lengths, piecewise-continuous population size functions on each edge, including above the root, and hybrid parameters $γ_{i}^{'}$ , $γ_{i}^{″} = 1 - γ_{i}^{'}$ for each pair of hybrid edges $e_{i}^{'}$ , $e_{i}^{″}$ .

Using the technical assumptions given in Subsection 2.2, it is straightforward to deduce the following.

Lemma 10 Consider a fixed chain of 2-cycles with leaf a₀. Let $c : [0, \infty) \to R^{\geq 0}$ denote the probability density function under the NMSC for the time T of coalescence of two lineages entering the chain at a₀. Then c(t) is piecewise continuous, and c(t) > 0 for all t ∈ [0, ∞).

The next three technical lemmas generalize Lemmas 4.1, 4.4, and 4.5 of Allman et al. [2019b] from a tree to a network setting. These culminate in Proposition 2 below, which justifies the application of Lemma 9.

Lemma 11 Let $c : [0, \infty) \to R^{\geq 0}$ be the probability density function under the NMSC for the time T of coalescence of two lineages entering a chain of 2-cycles, and for times t₂ > t₁ ≥ 0 let c_i be the conditional density given T ≥ t_i. Then the cumulative distribution functions for c₁ and c₂ satisfy

C_{1} (t) \geq C_{2} (t),

with the inequality strict on some interval.

Proof Since 0 = c₂(t) ≤ c₁(t) for all t ≤ t₂, the inequality is immediate for t ≤ t₂. Since using Lemma 10 we have c₁(t) > c₂(t) = 0 for t ∈ (t₁, t₂), the inequality is strict on a subinterval.

For t ≥ t₂, let $J = \int_{t_{1}}^{t_{2}} c_{1} (t) d t$ and $I (t) = \int_{t_{2}}^{t} c_{1} (s) d s$ , so

C_{1} (t) - C_{2} (t) = J + I (t) - \frac{I (t)}{1 - J} = J - \frac{J}{1 - J} I (t) .

Differentiating and using Lemma 10 shows C₁(t) − C₂(t) is decreasing for t > t₂. Since C₁(t) − C₂(t) → 0 as t → ∞, this implies C₁(t) − C₂(t) ≥ 0, as claimed.

Lemma 12 Let c₁, c₂ be probability density functions on [0, ∞), with cumulative distribution functions C₁, C₂, such that C₁(t) ≥ C₂(t) for all t, with the inequality strict on some interval. Let $s (t) = \int_{0}^{t} μ (x) d x$ for a positive, piecewise-continuous μ on [0, ∞) such that s(∞) = ∞. For λ ≤ 0 let

f (λ, μ, C_{i}) = \int_{0}^{\infty} exp (2 λ s (t)) c_{i} (t) d t .

Then if λ = 0,

f (0, μ, C_{1}) = f (0, μ, C_{2}) = 1 ,

while for λ < 0

f (λ, μ, C_{1}) > f (λ, μ, C_{2}) .

Proof For λ = 0 we find $f (0, μ, C_{i}) = \int_{0}^{\infty} c_{i} (t) d t = 1$ .

If λ < 0, integrating by parts yields

f (λ, μ, C_{i}) = [exp (2 λ s (t)) C_{i} (t)]_{t = 0}^{\infty} - 2 λ \int_{0}^{\infty} μ (t) exp (2 λ s (t)) C_{i} (t) d t = - 2 λ \int_{0}^{\infty} μ (t) exp (2 λ s (t)) C_{i} (t) d t .

Thus

f (λ, μ, C_{1}) - f (λ, μ, C_{2}) = - 2 λ \int_{0}^{\infty} μ (t) exp (2 λ s (t)) (C_{1} (t) - C_{2} (t)) d t .

As the integrand is non-negative, and positive on some interval, the claim for λ < 0 follows.
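Lemmas 11 and 12 can be illustrated in the simplest setting. The sketch below assumes a single population of constant size N and a constant rate μ, so that s(t) = μt and the coalescent time conditioned on T ≥ x is x plus an exponential variable with mean N. These assumptions, and the function names, are made only for illustration; the paper's setting of a chain of 2-cycles with variable population sizes is more general.

```python
import math


def f_closed(lam, mu, n, x):
    """Closed form of f(lam, mu, C_x) under the constant-size assumption.

    With s(t) = mu * t and T - x exponential with mean n, the integral of
    exp(2 * lam * s(t)) against the conditional density equals
    exp(2 * lam * mu * x) / (1 - 2 * lam * mu * n) for lam <= 0.
    """
    return math.exp(2.0 * lam * mu * x) / (1.0 - 2.0 * lam * mu * n)


def f_numeric(lam, mu, n, x, span=40.0, steps=40000):
    """Midpoint-rule approximation of the same integral on [x, x + span * n]."""
    h = span * n / steps
    total = 0.0
    for k in range(steps):
        t = x + (k + 0.5) * h
        density = math.exp(-(t - x) / n) / n  # conditional density c_x(t)
        total += math.exp(2.0 * lam * mu * t) * density
    return total * h
```

Conditioning on a later time gives a pointwise smaller cumulative distribution function, and the closed form shows directly that f is then smaller when λ < 0 and equal to 1 when λ = 0.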

Lemma 13 Consider a GTR substitution model with rate matrix Q ≠ 0, a scalar-valued rate function μ(t) satisfying the assumptions of Subsection 2.3, and a cumulative distribution function C(t) for the time T to coalescence of 2 lineages in a population.

Let F(x) = F(Q, μ, C, x) be the expected site-pattern frequency array for two lineages that enter a population at time 0 and undergo substitutions at rate μ(t)Q conditioned on T ≥ x. For x < x₁ let $\tilde{F} (x, x_{1}) = \tilde{F} (Q, μ, C, x, x_{1})$ be the expected site-pattern frequency array for two lineages that enter a population at time 0 and undergo substitutions at rate μ(t)Q conditioned on x < T < x₁.

Then for all $0 \neq y \in R^{k}$ the functions y^TF(x)y and $y^{T} \tilde{F} (x, x_{1}) y$ are positive-valued and decreasing in x. Moreover there exists a y for which both are strictly decreasing, and for which if x₀ < x₁ ≤ x₂

y^{T} \tilde{F} (x_{0}, x_{1}) y > y^{T} F (x_{2}) y .

Proof Let c_x(t) denote the conditional probability density function for the coalescent time T given T > x. With $s (t) = \int_{0}^{t} μ (τ) d τ$ , the Markov matrix describing the substitution process on a single lineage from time 0 to time t is

M (μ, Q, t) = exp (s (t) Q) .

Thus using time-reversibility of the substitution process, with π the stationary distribution for Q,

F (x) = diag (π) \int_{0}^{\infty} (M (μ, Q, t))^{2} c_{x} (t) d t .

Here the square of the Markov matrix accounts for substitutions in the two lineages before coalescence.

Now S⁻¹QS is diagonal for a matrix S = diag(π)^{−1/2}U with U orthogonal, and Q’s eigenvalues satisfy 0 = λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_k with at least one λ_i < 0 (Lemma 2.2 of Allman et al. [2019b]). Thus diagonalizing the Markov matrix yields

U^{T} diag (π)^{- 1 ∕ 2} F (x) diag (π)^{- 1 ∕ 2} U = \int_{0}^{\infty} Λ_{M (μ, Q, t)} c_{x} (t) d t

where Λ_{M(μ,Q,t)} is diagonal with entries exp(2s(t)λ_i). The diagonal entries of this integral are thus

\int_{0}^{\infty} exp (2 s (t) λ_{i}) c_{x} (t) d t .

But Lemmas 11 and 12 show this is positive, decreasing in x, and strictly decreasing for some i. This establishes the claims about F, by choosing y to be any eigenvector of Q whose eigenvalue is negative to obtain a strictly decreasing function.

The corresponding claims about $\tilde{F}$ are given by the same argument with the cumulative distribution function C replaced by the conditional distribution function given the coalescent time T < x₁, that is, with

{\tilde{C}}_{x_{1}} (t) = \begin{cases} C (t) ∕ C (x_{1}) & if t \leq x_{1} \\ 1 & if t > x_{1} \end{cases} .

Finally, since for every t the function ${\tilde{C}}_{x_{1}} (t)$ is decreasing in x₁, then for any y and x₀, a similar diagonalization argument and again using Lemma 12 shows the function $y^{T} \tilde{F} (x_{0}, x_{1}) y$ is decreasing in x₁. Thus if x₀ < x₁ ≤ x₂, then

y^{T} \tilde{F} (x_{0}, x_{1}) y \geq lim_{x_{1} \to \infty} y^{T} \tilde{F} (x_{0}, x_{1}) y = y^{T} F (x_{0}) y \geq y^{T} F (x_{2}) y .

Moreover, if y is an eigenvector of Q whose eigenvalue is negative, then strict inequality holds.

Proposition 2 Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} whose LSA network has topology ((a, b), c), but above $LSA ({a, b, c}, N^{+})$ there is possibly a chain of 2-cycles. Then, under a coalescent mixture model on $N^{+}$ with fixed parameters μ(t), {N_e}, Q, π, the relative site-pattern frequency matrices F^ab, F^bc, and F^ac are symmetric positive definite, with F^ac = F^bc, and satisfy

y^{T} F^{a b} y \geq y^{T} F^{a c} y

for every $y \in R^{k}$ , with the inequality strict for some y. Moreover, the same statements hold when the arrays F^xy are replaced by $F_{∣ K_{p} = 1}^{x y}$ with p a node placed above the parent of a, b and below the parent of c.

Proof Let x₁ be the length of the pendant edges to a and b, and x₂ the length of the pendant edge to c, so x₂ > x₁. Then applying Lemma 13 for an appropriately chosen distribution C(t) of coalescent times so that

F^{a b} = F (x_{1}), F^{a c} = F^{b c} = F (x_{2}),

the result is immediate.

Let x_p denote the distance from a or b to p, so x₁ < x_p < x₂. Then conditioning on K_p = 1, in the notation of Lemma 13 we have

F_{∣ K_{p} = 1}^{a b} = \tilde{F} (x_{1}, x_{p}), F_{∣ K_{p} = 1}^{a c} = F_{∣ K_{p} = 1}^{b c} = F^{b c} = F (x_{2}),

so again Lemma 13 yields the claim.
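The first part of Proposition 2 can be made concrete in a special case. The sketch below assumes the Jukes-Cantor model (a GTR model with uniform π, off-diagonal rates 1/3, and nonzero eigenvalue −4/3), a constant rate μ, a constant population size N on every edge, and no 2-cycles above the root. Under these assumptions F(x) has a closed form; the function names are illustrative.

```python
import math


def expected_decay(x, mu, n):
    """E[exp(-(8/3) * mu * T) | T >= x] when T - x is exponential with mean n."""
    return math.exp(-8.0 * mu * x / 3.0) / (1.0 + 8.0 * mu * n / 3.0)


def jc_pattern_matrix(x, mu=1.0, n=1.0):
    """4 x 4 matrix F(x) = diag(pi) * E[exp(2 * s(T) * Q)] for Jukes-Cantor.

    exp(u * Q) has diagonal 1/4 + (3/4) exp(-4u/3) and off-diagonal
    1/4 - (1/4) exp(-4u/3); here u = 2 * mu * T and pi is uniform.
    """
    e = expected_decay(x, mu, n)
    diag, off = (1.0 + 3.0 * e) / 16.0, (1.0 - e) / 16.0
    return [[diag if i == j else off for j in range(4)] for i in range(4)]


def quad_form(y, m):
    """The quadratic form y^T m y for a 4 x 4 matrix m."""
    return sum(y[i] * m[i][j] * y[j] for i in range(4) for j in range(4))
```

Taking F^ab = F(x₁) and F^ac = F(x₂) with x₁ < x₂, the vector y = (1, −1, 0, 0), an eigenvector of Q with negative eigenvalue, gives a strict inequality between the quadratic forms, while y = (1, 1, 1, 1) gives equality, since the entries of each matrix sum to 1.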

We now turn from considering a coalescent mixture model, with a single substitution model class, to the mixture of coalescent mixtures $M$ .

Lemma 14 Let $N^{+}$ be a level-1 ultrametric rooted triple network on {a, b, c} with no 4-cycle. Suppose {a, b} form a cherry in the tree topology obtained from suppressing all cycles of $N^{+}$ . Then, under the mixture of coalescent mixtures model $M$ on $N^{+}$ , F^ac(θ) = F^bc(θ).

Proof By Lemmas 4 and 5, we may assume $N^{+}$ has neither a 2₁- nor a 3₁-cycle, so there are no cycles below the parent of a, b. By the ultrametricity of the network, a and b are exchangeable under the combined coalescent and substitution model for each substitution model class, and therefore for the model $M$ .

This result is used to show that logDet distances from rooted triple networks with only 2- and 3₁-cycles satisfy the same equality and inequality relationships as those from trees.

Proposition 3 (No 4₁-cycles or 3₂-cycles) Let $N^{+}$ be a level-1 ultrametric rooted triple network on {a, b, c} with neither a 4-cycle nor a 3₂-cycle. Let $T = ((a, b), c)$ be the tree topology obtained after suppressing all cycles in $N^{+}$ . Under the mixture of coalescent mixtures model $M$ on $N^{+}$ the theoretical logDet distances satisfy

d_{L D} (a, c) = d_{L D} (b, c) > d_{L D} (a, b) .

Proof Under the model $M$ , the frequencies of bases at any taxon are identical, given by the same convex combination of the base frequency vectors π_i for substitution classes i. Thus the value of ln(g_ug_v) in the definition of the logDet distance, equation (1), is identical for every pair of distinct taxa x, y ∈ {a, b, c}. It thus suffices to show

det F^{a b} (θ) > det F^{a c} (θ) = det F^{b c} (θ) .

Lemma 14 gives the equality. By Lemmas 4, 5, and 6, we can express F^xy(θ) as a convex combination of relative site-pattern frequency matrices, possibly conditioned on K_p = 1, of networks of the form of the tree $T$ joined to a (possibly empty) chain of 2-cycles above $T$ ’s root, such as depicted in Figure 2. By Proposition 2 each of those matrices for coalescent mixture models satisfy the hypotheses of Lemma 9. Lemma 9 thus yields the claim for mixtures of coalescent mixtures by considering a convex combination across both the networks and substitution model classes.
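The determinant comparison can be illustrated on the tree ((a, b), c) with a mixture of two substitution classes. The sketch below assumes Jukes-Cantor substitutions in each class, constant class-specific rates μ, and a constant population size N on every edge; these choices, the class weights, and the function names are hypothetical and serve only as a numerical example.

```python
import math


def jc_matrix(x, mu, n=1.0):
    """Jukes-Cantor pattern frequency matrix for lineages coalescing after x."""
    e = math.exp(-8.0 * mu * x / 3.0) / (1.0 + 8.0 * mu * n / 3.0)
    return [[(1.0 + 3.0 * e) / 16.0 if i == j else (1.0 - e) / 16.0
             for j in range(4)] for i in range(4)]


def mixture_matrix(x, classes, n=1.0):
    """Convex combination over classes, given as (weight, mu) pairs."""
    mats = [(alpha, jc_matrix(x, mu, n)) for alpha, mu in classes]
    return [[sum(alpha * m[i][j] for alpha, m in mats) for j in range(4)]
            for i in range(4)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    size, value = len(a), 1.0
    for c in range(size):
        piv = max(range(c, size), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            value = -value  # a row swap flips the sign
        value *= a[c][c]
        for r in range(c + 1, size):
            factor = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= factor * a[c][k]
    return value
```

With pendant edge lengths x₁ < x₂, the matrix built at x₁ plays the role of F^ab and the one built at x₂ the role of F^ac = F^bc, and the determinant of the former is strictly larger, consistent with the conclusion of Lemma 9.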

A weaker result, without the inequality, applies to networks with 3₂-cycles.

Proposition 4 (3₂-cycle) Let $N^{+}$ be a level-1 ultrametric rooted triple network on {a, b, c} with a 3₂-cycle. Let $T = ((a, b), c)$ be the tree topology obtained after suppressing all cycles in $N^{+}$ . Then under the mixture of coalescent mixtures model $M$ on $N^{+}$ , the theoretical logDet distances satisfy

d_{L D} (a, c) = d_{L D} (b, c) .

Proof From Lemma 14, F^ac(θ) = F^bc(θ), so the result follows as in the previous proof.

Proposition 3, and the arguments leading to it, show that the equality and inequality relationships of logDet distances between only 3 taxa carry no signal of either 2- or 3₁-cycles. Proposition 4, however, leaves open the possibility that for a network with a 3₂-cycle the smallest distance may not necessarily correspond to the taxa which are neighbors after 2- and 3-cycles are suppressed. This suggests that the presence of a 3₂-cycle might be detectable, at least under some circumstances. In Section 7 we return to this issue, providing a more in-depth analysis of triples of logDet distances.

Proposition 5 (4₁-cycle) Let $N^{+}$ be a level-1 ultrametric rooted triple network on {a, b, c} with a 4-cycle, such that contracting all cycles except the 4-cycle and then deleting one of its hybrid edges gives the trees ((a, b), c) and ((a, c), b). (See Figure 9.) Then under the mixture of coalescent mixtures model $M$ on $N^{+}$ , the theoretical logDet distances satisfy

d_{L D} (b, c) > d_{L D} (a, b) and d_{L D} (b, c) > d_{L D} (a, c) .

Fig. 9 — (Top) Three topologically-distinct rooted triple networks with a 4-cycle displaying the trees ((a, b), c) and ((a, c), b). (Bottom) The undirected rooted topology shared by them.

Moreover, if all other parameters are fixed, then for generic values of the hybridization parameters,

d_{L D} (a, b) \neq d_{L D} (a, c) .

Proof As in Proposition 3, to establish these inequalities for the logDet distance, it is enough to show

det F^{b c} (θ) < det F^{a b} (θ) and det F^{b c} (θ) < det F^{a c} (θ) .

(5)

From Lemmas 4 and 5, for x, y ∈ {a, b, c}

F^{x y} (θ) = γ_{1} F^{x y} (θ_{1}) + γ_{2} F^{x y} (θ_{2})

where θ_1 and θ_2 denote parameters on networks $N_{1}^{+}$ and $N_{2}^{+}$, which have the structure of the trees ((a, b), c) and ((a, c), b) with chains of 2-cycles possibly attached above their roots. Proposition 2 implies that for each GTR substitution model class

y^{T} F^{a b} (θ_{1}) y \geq y^{T} F^{b c} (θ_{1}) y = y^{T} F^{a c} (θ_{1}) y and y^{T} F^{a c} (θ_{2}) y \geq y^{T} F^{a b} (θ_{2}) y = y^{T} F^{b c} (θ_{2}) y,

for every $y \in R^{k}$ , with the inequalities strict for some choices of y. From this and Lemma 9 we obtain the inequalities (5).

To see d_LD(a, b) ≠ d_LD(a, c) for generic hybridization parameters, first observe that these distances extend to analytic functions of the γ on all of $C$ . To show the inequality for generic γ, it is enough to show there exists one specific choice of $γ \in C$ for which they are not equal. First consider a choice on the boundary of the parameter space, by letting γ_e = 1, γ_e′ = 0 for every pair e, e′ of hybrid edges with a common child so that the model reduces to one on the tree ((a, c), b). In this case Theorem 1 of Allman et al. [2019b] establishes the inequality. Continuity implies that there are then choices of 0 < γ_e < 1, where the model does not degenerate to one on a tree, for which these distances are also not equal.

Assuming generic parameter values, Proposition 5 combined with earlier results implies that the presence of a 4-cycle is indicated by three distinct logDet distances computed from expected pattern frequencies. However, the three networks at the top of Figure 9 all satisfy the hypothesis of Proposition 5, but using equalities and inequalities of logDet distances we cannot distinguish them. We can only identify their undirected version as depicted in the bottom of Figure 9.

Nonetheless, the combinatorial result of Proposition 1 yields information on larger cycles and their hybrid nodes by first using logDet distances to determine undirected rooted triple networks. This gives our main result.

Theorem 1 Let $N^{+}$ be a binary level-1 ultrametric network on X with ∣X∣ ≥ 3. Let $\tilde{N}$ denote the topological LSA network $N^{\oplus}$ modulo 2- and 3-cycles and directions of edges in 4-cycles. Then for generic hybridization parameters under the mixture of coalescent mixtures model $M$ on $N^{+}$ , $\tilde{N}$ is identifiable from the theoretical logDet distances for pairs of taxa.

Proof Propositions 3, 4, and 5 imply that for generic parameters the three logDet distances for any choice of 3 taxa are distinct if, and only if, the induced rooted triple network has a 4-cycle. Moreover, the unrooted topology of the 4-cycle is determined by the largest of the three distances. Thus the set S of Proposition 1 is determined, yielding the result.
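The decision rule used in this proof can be sketched in a few lines of Python. This is only an illustration for theoretical (exact) distances: the function name `classify_triple` and the tolerance `tol` are hypothetical and do not appear in the paper, and with empirical distances a statistical test, rather than a fixed tolerance, would be needed.

```python
def classify_triple(d_ab, d_ac, d_bc, tol=1e-12):
    """Classify a triple of theoretical logDet distances on taxa a, b, c.

    Returns ("4-cycle", pair) when the three distances are pairwise
    distinct, where pair holds the two taxa with the largest distance
    (the taxa that are sisters in neither displayed tree), and
    ("no 4-cycle", None) otherwise.  tol is a hypothetical numerical
    tolerance, not part of the statement of Theorem 1.
    """
    dists = {("a", "b"): d_ab, ("a", "c"): d_ac, ("b", "c"): d_bc}
    values = sorted(dists.values())
    # Pairwise distinct exactly when consecutive sorted values differ.
    distinct = all(hi - lo > tol for lo, hi in zip(values, values[1:]))
    if not distinct:
        return ("no 4-cycle", None)
    far_pair = max(dists, key=dists.get)
    return ("4-cycle", far_pair)
```

For example, distances with d_LD(b, c) strictly largest and the other two unequal match the pattern of Proposition 5, while any triple with two equal distances is consistent with Propositions 3 and 4.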

An example of a rooted level-1 network and the structure that we have shown to be identifiable from logDet distances under the model $M$ is given in Figure 10. On the left is a level-1 rooted phylogenetic network with cycles of various sizes, and on the right the partially directed network that could be inferred from it for generic parameters.

Fig. 10 — (Left) A rooted binary level-1 network and (Right) that part of its structure that Theorem 1 identifies from logDet distances under the model $M$ for generic parameters. Both 2- and 3-cycles are lost, as are the directions of 4-cycle edges, and hence knowledge of the hybrid nodes in 4-cycles. Directed edges in cycles of size greater than 4 are identifiable.

6. Modifying the model

In this section we show how our results apply to two variants of the model used throughout earlier sections. In the first, we no longer require that sites be independent, allowing instead finite subsets of sites (e.g., modeling individual genes) evolving on common gene trees. In the second, we consider a limiting case of the model, in which gene lineages entering a population have an immediate common ancestor, without any delay from a coalescent process. Other variants, such as one combining the features of the two considered here, could be treated similarly.

6.1. Variant 1: A model for unlinked genes

The first model variation allows for unlinked genetic loci, each composed of linked sites evolving on a common gene tree. This is a relaxation of the model assumption in Section 2 that sites be unlinked. The original model only properly applies to unlinked SNP data, while this variant allows for concatenated gene sequences. We require only that the length of each locus be a random draw from some length distribution with finite mean, independent of the topology of the gene tree.

To formalize this, let g be a probability mass function supported on $N$ , with mean $m = \sum_{n = 1}^{\infty} g (n) n < \infty$ . The model description in Section 2 is modified so that sequence data is generated as follows: For each gene,

a gene tree T is sampled according to the NMSC model on $(N^{+}, {ℓ_{e}}, {γ_{e}})$ with population sizes {N_e},
class i is sampled from the distribution λ to determine parameters (Q_i, π_i; μ_i), and gene length n is sampled according to g, and
for n independent sites, the bases for each extant taxon x ∈ X are sampled under the GTR+μ process on T with parameters (Q_i, π_i; μ_i).

All sites are then summarized by a site pattern frequency array, so that information as to which sites evolved on the same gene tree is lost.

To show that Theorem 1 applies to this model, we need only show that the expected pattern frequency array for two taxa, ${\tilde{F}}^{a b}$ , under this model, is the same as the expectation, F^ab, under the model of Section 2. Let ${\tilde{F}}_{∣ T}^{a b}$ and $F_{∣ T}^{a b}$ denote expected pattern frequencies conditioned on a particular gene tree T. Then with dT denoting the probability measure for gene trees under the NMSC with the given parameters,

{\tilde{F}}^{a b} = \int_{T} {\tilde{F}}_{∣ T}^{a b} d T = \int_{T} (\frac{1}{m} \sum_{n = 1}^{\infty} g (n) n F_{∣ T}^{a b}) d T = \int_{T} (\frac{1}{m} \sum_{n = 1}^{\infty} g (n) n) F_{∣ T}^{a b} d T = \int_{T} F_{∣ T}^{a b} d T = F^{a b} .

Note that in applications of the theory developed here, empirical frequency arrays produced from gene sequences are likely to converge more slowly to their expected values than for those produced from SNP data, due to the linkage of sites. The argument above suggests that enough genes are needed so that the variation in gene length averages out over each possible gene tree.
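The role of the independence assumption in the computation above can be illustrated numerically. The following is a minimal sketch, assuming a finite set of gene tree classes in place of the integral over gene trees; the function name `pooled_site_frequencies` and the toy 2 × 2 arrays are hypothetical. It computes the ratio of expected pattern counts to expected site counts, which is what pooled frequencies approach as the number of genes grows.

```python
import numpy as np

def pooled_site_frequencies(tree_probs, tree_arrays, length_pmfs):
    """Expected pattern frequency array when sites are pooled across genes.

    tree_probs[i]  : probability of gene tree class i
    tree_arrays[i] : pattern frequency array F_{|T} for that class
    length_pmfs[i] : dict mapping gene length n to its probability,
                     for genes evolving on tree class i
    """
    num = np.zeros_like(tree_arrays[0], dtype=float)
    den = 0.0
    for p, F, g in zip(tree_probs, tree_arrays, length_pmfs):
        mean_len = sum(n * q for n, q in g.items())
        num += p * mean_len * F   # expected pattern counts from class i
        den += p * mean_len       # expected number of sites from class i
    return num / den

# Toy inputs (hypothetical values chosen only for illustration).
F1 = np.array([[0.4, 0.1], [0.1, 0.4]])
F2 = np.array([[0.3, 0.2], [0.2, 0.3]])
probs = [0.6, 0.4]
unlinked = probs[0] * F1 + probs[1] * F2

# Length distribution independent of the gene tree: same array as unlinked sites.
g = {100: 0.5, 300: 0.5}
independent = pooled_site_frequencies(probs, [F1, F2], [g, g])

# Length distribution depending on the gene tree: the equality can fail.
dependent = pooled_site_frequencies(probs, [F1, F2], [{100: 1.0}, {300: 1.0}])
```

In this toy case `independent` agrees with `unlinked`, while `dependent` weights the second tree class more heavily because its genes are longer.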

6.2. Variant 2: A non-coalescent model

The second model variation we consider is a non-coalescent model for an ultrametric level-1 species network, in which gene trees must be displayed on the species network. One can think of this as simply requiring immediate coalescence of gene lineages when they enter a common population. Population size parameters are thus no longer relevant, but all other features of the model of Section 2 are retained.

This model is similar to the non-coalescent model considered by Gross et al. [2020], who used algebraic and combinatorial arguments to obtain an identifiability result for most features of a level-1 species network topology assuming generic numerical parameters. However, we impose one more restrictive assumption, namely that the network be ultrametric. On the other hand, we considerably relax their assumptions on the sequence substitution model, from a requirement of a single Jukes-Cantor or Kimura process to the mixture of GTR processes used throughout this paper.

Informally, to produce immediate coalescence of gene lineages in a coalescent model, one can simply take a limit as the population sizes approach 0. Small population size produces bottlenecks, which encourage rapid coalescence of lineages. In general, results obtained under the coalescent model will still apply under a non-coalescent model, provided the arguments respect taking such a limit.

To sketch how this applies in our arguments, first fix all population sizes N_e on edges in a species network to have a common value N. Note that population size plays no role in any of our arguments before those of Section 5, except through probabilities such as P(K_p = 1) and P(K_p = 2) which appear in formulas in Section 4 but are not computed there. Thus all results through Section 4 remain valid.

As N → 0⁺, the density function c(t) of equation (4) for the time to coalescence of two lineages in a population is easily seen to approach δ₀, a point mass at t = 0. Thus with probability 1 lineages coalesce immediately upon entering a common population. While this observation can be traced through the remaining lemmas of Section 5 (making some modifications to their presentation), it is simpler to give a direct proof of the following analog of Proposition 2.

Proposition 6 Let $N^{+}$ be a binary level-1 ultrametric rooted triple network on {a, b, c} whose LSA network has topology ((a, b), c), but above $LSA ({a, b, c}, N^{+})$ there is possibly a chain of 2-cycles. Then, under a non-coalescent model on $N^{+}$ with fixed parameters μ(t), Q, π, the relative site-pattern frequency matrices F^ab, F^bc, and F^ac are symmetric positive definite, with F^ac = F^bc, and satisfy

y^{T} F^{a b} y \geq y^{T} F^{a c} y

for every $y \in R^{k}$ , with the inequality strict for some y.

Proof Let x₁ be the length of the pendant edges to a, b and x₂ the length of the pendant edge to c. With $s (t) = \int_{0}^{t} μ (τ) d τ$ , the Markov matrix describing the substitution process on a single lineage from time 0 to time t is

M (μ, Q, t) = exp (s (t) Q) .

Thus using time-reversibility of the substitution process

F^{a b} = diag (π) M (μ, Q, x_{1})^{2} = diag (π) exp (2 s (x_{1}) Q)

F^{a c} = F^{b c} = diag (π) M (μ, Q, x_{2})^{2} = diag (π) exp (2 s (x_{2}) Q) .

Since Q is a GTR rate matrix, the result follows by diagonalization, as in Lemma 13.
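The diagonalization argument can be checked numerically. The sketch below assumes a constant rate μ, so that s(x) = μx, and uses arbitrary illustrative values for π, the exchangeabilities, and the edge lengths; the function names are hypothetical. It uses the fact that for a GTR rate matrix diag(π)^{1/2} Q diag(π)^{-1/2} is symmetric, so the matrix exponential can be computed from a symmetric eigendecomposition.

```python
import numpy as np

def gtr_rate_matrix(pi, r):
    """GTR rate matrix with Q_ij = r_ij * pi_j for i != j (r symmetric)."""
    Q = r * pi[np.newaxis, :]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows sum to zero
    return Q

def pattern_matrix(pi, Q, s):
    """F = diag(pi) exp(2 s Q), via the symmetric matrix similar to Q."""
    root = np.sqrt(pi)
    S = (root[:, None] * Q) / root[None, :]      # D^{1/2} Q D^{-1/2}
    w, V = np.linalg.eigh((S + S.T) / 2.0)       # symmetrize against round-off
    exp_S = (V * np.exp(2.0 * s * w)) @ V.T      # exp(2 s S)
    return root[:, None] * exp_S * root[None, :] # D^{1/2} exp(2 s S) D^{1/2}

# Illustrative parameter values (assumptions, not taken from the paper).
pi = np.array([0.1, 0.2, 0.3, 0.4])
r = np.array([[0.0, 1.0, 2.0, 1.0],
              [1.0, 0.0, 1.0, 3.0],
              [2.0, 1.0, 0.0, 1.0],
              [1.0, 3.0, 1.0, 0.0]])
Q = gtr_rate_matrix(pi, r)
mu, x1, x2 = 1.0, 0.2, 0.5                       # x1 < x2 by ultrametricity

F_ab = pattern_matrix(pi, Q, mu * x1)
F_ac = pattern_matrix(pi, Q, mu * x2)
diff_eigs = np.linalg.eigvalsh(F_ab - F_ac)
```

Here the eigenvalues of F^ab − F^ac are nonnegative up to round-off, with at least one strictly positive, which is the quadratic-form inequality of Proposition 6 for this choice of parameters.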

The remainder of the arguments of Section 5 apply unchanged, to yield an analog of Theorem 1. Note that while population sizes are no longer model parameters, all other parameters are unchanged in the limit.

Remark 1 In general, results under the MSC and NMSC models yield results for simpler non-coalescent models in the limit as population sizes decrease to 0. For instance, without considering a site substitution process Baños [2019] and Allman et al. [2019a] show that most features of a level-1 network can be identified from the frequencies of displayed gene quartet trees under the NMSC. Letting all population sizes → 0 then gives that most features of a level-1 network can be identified from the frequencies of its displayed quartet trees.

7. Normalized triples of logDet distances

In Section 5, we obtained linear equalities and inequalities that the logDet distances between three taxa must satisfy if they are related by various level-1 rooted networks. Combined with the combinatorial result of Section 3, these are sufficient for proving the identifiability claim that is the main focus of this work. However, it is worthwhile to seek a more complete characterization of what distances are achievable by various network topologies. In particular, with an eye toward practical application, any tighter characterizations would enable stronger testing for network topology from the empirical distances.

Here we conduct a partial investigation, characterizing not the triple of theoretical logDet distances that may be produced on rooted 3-taxon networks, but rather the normalized triple obtained by dividing the distances by their sum. The triple of distances forms a point in the non-negative octant $(R_{\geq 0})^{3}$ , while the normalized triple gives a point in the 2-dimensional simplex. Thus plots can be made with the normalized distances that are analogous to the simplex plots for visualizing gene quartet concordance factors [Baños, 2019, Mitchell et al., 2019, Allman et al., 2021]. Just as simplex plots of concordance factors aid in understanding genomic data sets, we anticipate that the 2-simplex visualization of the normalized logDet distance triples will be similarly useful.
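As a small illustration of the normalization just described, the following Python sketch divides a triple of distances by its sum and maps the result to planar coordinates for a simplex plot. The function names and the placement of the simplex vertices in the plane are arbitrary choices made here for illustration, not conventions taken from the paper.

```python
import math

def normalize_triple(d_ab, d_ac, d_bc):
    """Divide a triple of nonnegative distances by its sum."""
    total = d_ab + d_ac + d_bc
    if total <= 0:
        raise ValueError("at least one distance must be positive")
    return (d_ab / total, d_ac / total, d_bc / total)

def simplex_xy(ell):
    """Planar coordinates of a normalized triple (l_ab, l_ac, l_bc).

    Assumed vertex placement: l_ab = 1 at (0, 0), l_ac = 1 at (1, 0),
    and l_bc = 1 at (1/2, sqrt(3)/2).
    """
    l_ab, l_ac, l_bc = ell
    x = l_ac + 0.5 * l_bc
    y = (math.sqrt(3.0) / 2.0) * l_bc
    return (x, y)
```

With this placement the triple of equal distances maps to the centroid of the triangle, and triples with two equal coordinates lie on the lines through the centroid and a vertex.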

We begin with the logDet triples from 3-taxon trees.

Proposition 7 Let ℓ = (ℓ_ab, ℓ_ac, ℓ_bc) with 0 < ℓ_ab ≤ ℓ_ac = ℓ_bc be a triple of positive numbers summing to 1. Then there exists an ultrametric rooted tree with topology ((a, b), c) and GTR substitution model parameters such that the normalized theoretical logDet distances of sequences generated under the coalescent mixture model are ℓ.

Proof Consider the metric species tree ((a:0, b:0):x/2, c:x/2), and constant population sizes ϵ > 0 on all edges. Fix a single substitution model, say the Jukes-Cantor model, for sequence generation. Since small population sizes ϵ result in rapid coalescence with arbitrarily high probability, by taking ϵ sufficiently small one can show the expected frequency array can be made arbitrarily close to that which would arise if all gene trees exactly matched the species tree. Thus the theoretical logDet distances can be made arbitrarily close to d_LD(a, b) = 0 and d_LD(a, c) = d_LD(b, c) = x, which normalizes to (0, 1/2, 1/2).

The unresolved species tree (a:x/2, b:x/2, c:x/2), regardless of the choice of population functions on the edges, yields, by exchangeability of the taxa, a triple of equal logDet distances, which normalizes to (1/3, 1/3, 1/3).

While the two trees above have 0-length edges and hence are non-binary, perturbations to binary trees with positive length edges can produce normalized logDet distances that are arbitrarily close.

Since the normalized logDet distances are continuous functions of parameters, the parameter space is connected, and the image of the normalized distances lies in a line segment by Proposition 3, the claim follows.

We turn now to networks with a single cycle.

Proposition 8 Let ℓ = (ℓ_ab, ℓ_ac, ℓ_bc) with 0 < ℓ_ab ≤ ℓ_ac < ℓ_bc be a triple of positive numbers summing to 1. Then there exists a binary ultrametric rooted network on taxa a, b, c with a single 4-cycle and GTR substitution model parameters such that the normalized theoretical logDet distances of sequences generated under a single-class coalescent mixture model are ℓ.

Proof The 4-cycle network we construct is shown in Figure 11, with t₀, t₁ measured in generations, and the hybrid edges of length 0. Consider a single constant population size N > 0 for all populations over the tree and above the root, and a Jukes-Cantor substitution process with constant rate μ > 0. We will choose values for t₀, t₁ > 0, γ ∈ [1/2, 1) so that the normalized distances for the coalescent mixture model with this single substitution process are given by ℓ.

Fig. 11 — The 4-cycle network, with times in generations, constructed in Proposition 8. Hybridization parameters are γ, 1 − γ, and hybrid edges have length 0.

Recall that if M(t) denotes the Jukes-Cantor Markov matrix for a substitution process over time t with rate 1, then the common value of all its off-diagonal entries is

f (t) = \frac{1}{4} (1 - e^{- \frac{4}{3} t}) .

With D = diag(1/4, 1/4, 1/4, 1/4), the Jukes-Cantor pattern frequency array is DM(t), and the logDet distance (equal to Jukes-Cantor distance) is

t = f^{- 1} (f (t)) = - \frac{3}{4} \log (1 - 4 f (t)) .

Note that f is an increasing function.
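The identity between the logDet and Jukes-Cantor distances for the array DM(t) can be verified numerically. In the sketch below the normalization of the logDet distance (dividing by the number of states k and subtracting half the log of the product of the marginal frequencies) is an assumption made here so that the Jukes-Cantor case returns t; it is not quoted from this section, and the function names are hypothetical.

```python
import numpy as np

def jc_markov(t):
    """Jukes-Cantor Markov matrix M(t) for elapsed time t at rate 1."""
    off = 0.25 * (1.0 - np.exp(-4.0 * t / 3.0))      # f(t) from the text
    return np.full((4, 4), off) + (1.0 - 4.0 * off) * np.eye(4)

def logdet_distance(F):
    """LogDet distance of a k x k pattern frequency array F.

    Assumed normalization:
    -(1/k) * (ln|det F| - (1/2) * (sum ln row sums + sum ln column sums)).
    """
    k = F.shape[0]
    _, logabsdet = np.linalg.slogdet(F)
    marginals = np.log(F.sum(axis=1)).sum() + np.log(F.sum(axis=0)).sum()
    return -(logabsdet - 0.5 * marginals) / k

def jc_distance(off_diagonal):
    """Inverse of f: -(3/4) log(1 - 4 f)."""
    return -0.75 * np.log(1.0 - 4.0 * off_diagonal)

t = 0.7
M = jc_markov(t)
F = 0.25 * M                      # D M(t) with D = diag(1/4, 1/4, 1/4, 1/4)
d_logdet = logdet_distance(F)
d_jc = jc_distance(M[0, 1])
```

Under these assumptions both `d_logdet` and `d_jc` recover the value of t used to build M(t), up to floating-point error.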

From equation 4.1 of Allman et al. [2019b], for a coalescent mixture Jukes-Cantor model on an ultrametric tree with uniform population size N and mutation rate μ, sequences for two taxa x, y whose MRCA is at time t before the present have expected pattern frequency array

F (t) = D M (2 t μ) \tilde{M} (μ, N),

where $\tilde{M} (μ, N)$ is a Markov matrix of Jukes-Cantor form describing the expected additional substitutions due to the coalescent model delaying lineages merging until some time above the MRCA. The logDet distance between x, y is then the same as the Jukes-Cantor distance, which is computed to be

d_{L D} (x, y) = 2 t μ + β

where β = β(μ, N) > 0 can be explicitly computed from $\tilde{M} (μ, N)$ , though we will not do so here. Since β is continuous and β(μ, N) → 0 as N → 0 and β(μ, N) → ∞ as N → ∞, it follows that β takes on all positive values.

Now by Lemma 5 on the 4-cycle network of Figure 11 the expected pattern frequency array for a, b is

γ F (t_{0}) + (1 - γ) F (t_{1}) = D M_{a b} \tilde{M} (μ, N)

where

M_{a b} = γ M (2 t_{0} μ) + (1 - γ) M (2 t_{1} μ)

has the usual Jukes-Cantor form, with off-diagonal entries

f_{a b} = γ f (2 t_{0} μ) + (1 - γ) f (2 t_{1} μ) .

This shows

d_{L D} (a, b) = f^{- 1} (f_{a b}) + β .

A similar calculation shows

d_{L D} (a, c) = f^{- 1} (f_{a c}) + β,

where

f_{a c} = γ f (2 t_{1} μ) + (1 - γ) f (2 t_{0} μ) .

The expected pattern frequency array for the b, c sequences is F(t₁), so

d_{L D} (b, c) = f^{- 1} (f_{b c}) + β

where

f_{b c} = f (2 t_{1} μ) .

We now determine parameters which produce the normalized triple of distances ℓ. Fixing values of μ, N determines a fixed value of β > 0. Next, choose some value m so that

f (m ℓ_{a b} - β) > \frac{1}{8},

which can be done since $f : R^{> 0} \to (0, 1 ∕ 4)$ is surjective and increasing. Then, with x_ij = f(mℓ_ij − β), because ℓ_ab ≤ ℓ_ac < ℓ_bc we have

\frac{1}{8} < x_{a b} \leq x_{a c} < x_{b c} < \frac{1}{4} .

Let x₀ = x_ab + x_ac − x_bc, so $0 < x_{0} < \frac{1}{4}$ . Determine t₀ by f(2t₀μ) = x₀, and γ ∈ [1/2, 1) by

γ = \frac{x_{b c} - x_{a b}}{2 x_{b c} - x_{a b} - x_{a c}}, so 1 - γ = \frac{x_{b c} - x_{a c}}{2 x_{b c} - x_{a b} - x_{a c}} .

Then choose t₁ by f(2t₁μ) = x_bc.

To verify that these parameter choices give the desired normalized triple of distances, the expected distance between a, b is

d_{L D} (a, b) = f^{- 1} (γ f (2 t_{0} μ) + (1 - γ) f (2 t_{1} μ)) + β = f^{- 1} (γ x_{0} + (1 - γ) x_{b c}) + β = f^{- 1} (x_{a b}) + β = m ℓ_{a b} .

Similarly, we see d_LD(a, c) = mℓ_ac. Finally we have

d_{L D} (b, c) = f^{- 1} (f (2 t_{1} μ)) + β = f^{- 1} (x_{b c}) + β = m ℓ_{b c} .
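The parameter choices in this proof can be traced through numerically. The sketch below treats β as a given positive number (the proof does not compute β(μ, N) explicitly, so its value here is an assumption), and the function names and the `margin` used to pick m are hypothetical. It uses the fact that f(u) > 1/8 exactly when u > (3/4) ln 2.

```python
import math

def f(t):
    """Off-diagonal entry of the Jukes-Cantor Markov matrix M(t)."""
    return 0.25 * (1.0 - math.exp(-4.0 * t / 3.0))

def f_inv(x):
    """Inverse of f, defined for 0 <= x < 1/4."""
    return -0.75 * math.log(1.0 - 4.0 * x)

def four_cycle_parameters(ell, beta, mu=1.0, margin=0.1):
    """Choose m, gamma, t0, t1 as in the proof, for 0 < l_ab <= l_ac < l_bc."""
    l_ab, l_ac, l_bc = ell
    # f(u) > 1/8 iff u > (3/4) ln 2, so make m * l_ab - beta exceed that bound.
    m = (0.75 * math.log(2.0) + beta + margin) / l_ab
    x_ab, x_ac, x_bc = (f(m * l - beta) for l in ell)
    x0 = x_ab + x_ac - x_bc
    gamma = (x_bc - x_ab) / (2.0 * x_bc - x_ab - x_ac)
    t0 = f_inv(x0) / (2.0 * mu)       # from f(2 t0 mu) = x0
    t1 = f_inv(x_bc) / (2.0 * mu)     # from f(2 t1 mu) = x_bc
    return m, gamma, t0, t1

def four_cycle_distances(gamma, t0, t1, beta, mu=1.0):
    """Theoretical logDet distances on the 4-cycle network of the proof."""
    f0, f1 = f(2.0 * t0 * mu), f(2.0 * t1 * mu)
    d_ab = f_inv(gamma * f0 + (1.0 - gamma) * f1) + beta
    d_ac = f_inv(gamma * f1 + (1.0 - gamma) * f0) + beta
    d_bc = f_inv(f1) + beta
    return d_ab, d_ac, d_bc

ell = (0.2, 0.3, 0.5)                 # target normalized triple
beta = 0.1                            # assumed value of beta(mu, N)
m, gamma, t0, t1 = four_cycle_parameters(ell, beta)
d = four_cycle_distances(gamma, t0, t1, beta)
normalized = tuple(x / sum(d) for x in d)
```

For this target the construction gives γ in [1/2, 1) and 0 < t₀ < t₁, and `normalized` recovers ℓ up to floating-point error.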

Note that even if ℓ_ac = ℓ_bc, the argument of Proposition 8 can be modified slightly by taking γ = 1 in the analytic continuation of the parameterization. However, that choice of the hybridization parameter essentially means that in place of a 4-cycle network we have a tree.

Finally, we consider a network with a 3₂-cycle. While Proposition 4 shows the normalized triples of theoretical logDet distances lie on the same line as those for a tree, we establish they need not be restricted to the same line segment of tree-like distances. However, we do not completely characterize the extent of the segment they fill out.

Proposition 9 Let ℓ = (ℓ_ab, ℓ_ac, ℓ_bc) with ℓ_ac = ℓ_bc be a triple of positive numbers summing to 1 with $0 < ℓ_{a b} < \frac{1}{2}$ . Then there exists a binary ultrametric rooted network on taxa {a, b, c} with a single 3₂-cycle whose leaf-descendants are a, b and GTR substitution model parameters such that the normalized theoretical logDet distances of sequences generated under the coalescent mixture model are ℓ.

Proof We construct several 3₂-cycle species networks of the form shown in Figure 12, with edge lengths t_i = ℓ(e_i). In making choices of numerical parameters, since the network is ultrametric we view t₁, t₃, t₅, t₇ as independent, determining t₂, t₄, t₆. The population sizes on edges e_i for 3 ≤ i ≤ 8 are constants N_i, with the sizes on terminal edges irrelevant. The hybridization parameters are 1 − γ and γ on edges e₄ and e₅ respectively. We also fix a single Jukes-Cantor substitution process with any constant rate μ > 0.

Fig. 12 — A 3₂-network, with numbered edges, as used in Proposition 9. The hybridization parameter on edge e₅ is γ, and on e₄ is 1 − γ.

By Proposition 4, for any choices of the t_i, N_i, γ, the theoretical logDet distances will satisfy d_LD(a, c) = d_LD(b, c), so the normalized theoretical logDet distance triple lies on a line. Since the parameter space is connected, it is enough to show that

\frac{d_{L D} (a, b)}{2 d_{L D} (a, c) + d_{L D} (a, b)}

(6)

is arbitrarily close to 0 for some choice of the parameters, and arbitrarily close to 1/2 for others, to conclude that the rescaled expected distances give all the described triples.

To make expression (6) near 0, we choose parameters with t₁ and N₃ sufficiently small so that with high probability the a, b lineages coalesce quickly. Specifically, let t₃ = 1, and fix any positive values for t₅, t₇ and N_i for i ≠ 3. Now for any ϵ > 0, as N₃ → 0⁺, the probability of lineages from a, b coalescing on e₃ within ϵ of entering it approaches 1. Using this, it is straightforward to show that as N₃ → 0⁺ the expected pattern frequency array for a, b approaches that for the JC model on a 2-taxon tree of total length 2t₁. This then implies that d_LD(a, b) → 2μt₁ as N₃ → 0⁺. On the other hand, for all values of N₃ > 0 one can show d_LD(a, c) > 2μ(t₁ + 2). Thus for sufficiently small choices of t₁ and N₃, we can make d_LD(a, b)/(2d_LD(a, c) + d_LD(a, b)) as close to 0 as desired.

To produce a value of expression (6) near 1/2 is more subtle. We choose parameters so that a, b lineages are likely to enter e₅, but if they both do they are then unlikely to coalesce in it, and coalescence of any pair of lineages in e₇ is likely to occur quickly. First set t₅ = 0, t₇ = 1 and N₈ arbitrary. For any t₁, t₃ and γ, by choosing N₃ = N₄ = N₅ sufficiently large, the probability that the a, b lineages coalesce on e₃, e₄, or e₅ can be made arbitrarily small, so that if they coalesce below the root with (conditional) probability approaching 1 they must do so on e₇. This requires that both the a, b lineages follow e₅, which occurs with probability γ². If lineages a, c coalesce below the root, they must do so on e₇, requiring the a lineage to follow e₅, which occurs with probability γ. By picking N₇ sufficiently small, the probability that two lineages in edge e₇ coalesce near the lower end can be made close to 1. All this shows that once t₁, t₃ and γ are chosen, by appropriate choices of the N_i we can ensure the expected frequency arrays for a, b and a, c are arbitrarily close to

γ^{2} F (t_{1} + t_{3}) + (1 - γ^{2}) G (t_{1} + t_{3} + 1, N_{8})

and

γ F (t_{1} + t_{3}) + (1 - γ) G (t_{1} + t_{3} + 1, N_{8}),

respectively, where F(t) is the expected pattern frequency array for two samples at distance 2t and G(t, N) is the expected array under the coalescent for 2 lineages which enter a common population of size N at time t. Further picking sufficiently small values for t₁, t₃, the pattern frequency arrays for a, b and a, c can be made arbitrarily close to

γ^{2} \frac{1}{4} I + (1 - γ^{2}) G (1, N_{8})

and

γ \frac{1}{4} I + (1 - γ) G (1, N_{8}),

respectively. Thus for any γ the theoretical distance can be made arbitrarily close to the distance computed from the above arrays. Using the formulas defined in the proof of Proposition 8, we find these distances are

d_{L D} (a, b) = f^{- 1} ((1 - γ^{2}) δ)

and

d_{L D} (a, c) = f^{- 1} ((1 - γ) δ)

where δ > 0 is the off-diagonal entry of G(1, N₈). Thus once γ is specified, by choosing t₁, t₃, N₃ = N₄ = N₅, N₇ we can ensure expression (6) is arbitrarily close to

\frac{\log (1 - 4 δ (1 - γ^{2}))}{2 \log (1 - 4 δ (1 - γ)) + \log (1 - 4 δ (1 - γ^{2}))} .

(7)

Applying L’Hôpital’s rule shows the limit of expression (7) as $γ \to 1$ is $\frac{1}{2}$ . Thus for any ϵ > 0, by first choosing γ near 1 so that the expression (7) is within ϵ/2 of 1/2, and then choosing t₁ = t₃, N₃ = N₄ = N₅, N₇ so that expression (6) is within ϵ/2 of expression (7), we obtain the desired result.
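The limit of expression (7) is easy to check numerically. The short Python sketch below evaluates the expression for values of γ approaching 1, assuming 0 < δ < 1/4 so that all logarithms are defined; the function name `expression_7` and the chosen value of δ are illustrative only.

```python
import math

def expression_7(gamma, delta):
    """Evaluate expression (7) for 0 <= gamma < 1 and 0 < delta < 1/4."""
    # log1p(-x) = log(1 - x), which is more accurate when x is close to 0.
    num = math.log1p(-4.0 * delta * (1.0 - gamma ** 2))
    den = 2.0 * math.log1p(-4.0 * delta * (1.0 - gamma)) + num
    return num / den

delta = 0.1                                   # assumed illustrative value
values = [expression_7(g, delta) for g in (0.0, 0.9, 0.99, 0.999, 0.9999)]
```

At γ = 0 the expression equals 1/3, corresponding to three equal distances, and the later entries of `values` approach 1/2 as γ approaches 1.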

The results of this section, combined with those of Section 5 are summarized by Figure 13, which indicates the various regions of the simplex which normalized logDet triples fill, according to whether the network has a 4-cycle, a 3₂-cycle, or neither.

Fig. 13 — The regions of the simplex filled by normalized triples of logDet distances under the model $M$ on a 3-taxon network. The networks shown are those obtained by suppressing all cycles other than 4- and 3₂-cycles, and then undirecting the 4-cycle edges. Normalized logDet distances are ordered as (ℓ_ab, ℓ_ac, ℓ_bc). Networks with 3₂-cycles fill the solid line segments in the center simplex, but it is unknown whether they may also produce points in the dashed line segments.

Note that the possibility that a 3₂-cycle (as depicted in the center of Figure 13) leads to a triple of normalized logDet distances lying on an extension of the corresponding line segment for the tree topology displayed on the networks (as depicted to the right of the figure) echoes a number of similar results arising in studies of network inference under the coalescent from gene tree data. For unrooted quartets, these include the works of Solís-Lemus et al. [2016], Baños [2019] and Allman et al. [2019a], and for rooted triples Long and Kubatko [2018] and Jiao and Yang [2020]. In essence, all these results indicate that the coalescent can lead to anomalous gene trees, in the sense that the most frequent gene tree topology may not match that of the trees displayed on the species network, even though all such displayed trees have the same topology.

8. Conclusion

Theorem 1 states that most topological features of an ultrametric level-1 network can be identified from theoretical logDet distances under a fairly general model of sequence evolution with incomplete lineage sorting. It more generally implies network identifiability from pattern frequency arrays, since logDet distances are functions of these. In particular, individual gene trees, or even sequences partitioned into genes, are not required for network identifiability.

While identifiability is a theoretical question about the model, it has important implications for data analysis. Indeed, it is a key requirement for a statistically consistent inference procedure to exist. While our method of proof of identifiability, using the logDet distance, suggests using that distance as a basis for an inference procedure, others might be developed as well.

In subsequent work, we will explore using the logDet distance in a procedure for level-1 network inference following the framework of NANUQ [Allman et al., 2019a]. In outline, for each triple of taxa, the location of the normalized triple of logDet distances in simplex plots such as those of Figure 13 can indicate whether the rooted triple has a 4-cycle or not. A triple near the lines through the centroid can, through some statistical test, be judged unlikely to have arisen from a 4-cycle, while those farther away are judged to have arisen from a 4-cycle. Then, modifying the rooted triple distance of Rhodes [2019] to a network setting, similarly to how NANUQ modified the quartet distance, an intertaxon distance can be computed from the results of these statistical tests. Rules for relating a splits graph for the expected rooted triple distance to the original network will be developed. When applied to the splits graph constructed by NeighborNet from the empirically-derived distance, this should lead to consistent network inference. Since individual gene trees are never inferred, this will potentially give a much faster data analysis pipeline than the current version of NANUQ, which is built on quartet concordance factors across gene trees.

9. Acknowledgements

This work was supported by the National Institutes of Health [R01 GM117590], under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences, and [2P20GM103395], an NIGMS Institutional Development Award (IDeA), and by the National Science Foundation [2051760]. H.B. was also partially supported by the Moore-Simons Project on the Origin of the Eukaryotic Cell, Simons Foundation grant 735923LPI (DOI: https://doi.org/10.46714/735923LPI) awarded to Andrew J. Roger and Edward Susko.

Contributor Information

Elizabeth S. Allman, Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, 99775, USA

Hector Baños, Department of Biochemistry & Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, Nova Scotia, CANADA; Department of Mathematics & Statistics, Faculty of Science, Dalhousie University, Halifax, Nova Scotia, CANADA.

John A. Rhodes, Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, 99775, USA

References

Allman ES, Baños H, and Rhodes JA. NANUQ: A method for inferring species networks from gene trees under the coalescent model. Algorithms Mol. Biol, 14(24):1–25, 2019a. [DOI] [PMC free article] [PubMed] [Google Scholar]
Allman ES, Long C, and Rhodes JA. Species tree inference from genomic sequences using the log-det distance. SIAM J. Appl. Algebra Geometry, 3:107–127, 2019b. [DOI] [PMC free article] [PubMed] [Google Scholar]
Allman ES, Mitchell JD, and Rhodes JA. Gene tree discord, simplex plots, and statistical tests under the coalescent. Syst. Biol, 2021. Advance article 10.1093/sysbio/syab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
Baños H. Identifying species network features from gene tree quartets. Bulletin of Mathematical Biology, 81:494–534, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Casanellas M and Fernández-Sánchez J. Rank conditions on phylogenetic networks. arXiv:2004.12988, to appear in Research Perspectives CRM Barcelona, Spring 2019, vol. 10, in Trends in Mathematics Springer-Birkhauser, 2020. [Google Scholar]
Chifman J and Kubatko L. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. Journal of Theoretical Biology, 374:35–47, 2015. [DOI] [PubMed] [Google Scholar]
Dasarathy G, Nowak R, and Roch S. Data requirement for phylogenetic inference from multiple loci: A new distance method. IEEE/ACM Trans. Comput. Biol. and Bioinf, 12(2):422–432, 2015. [DOI] [PubMed] [Google Scholar]
Gross E and Long C. Distinguishing phylogenetic networks. SIAM Journal on Applied Algebra and Geometry, 2(1):72–93, 2018. doi: 10.1137/17M1134238. [DOI] [Google Scholar]
Gross E, van Iersel L, Janssen R, Jones M, Long C, and Murakami Y. Distinguishing level-1 phylogenetic networks on the basis of data generated by Markov processes. arXiv:2007.08782, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hollering B and Sullivant S. Identifiability in phylogenetics using algebraic matroids. Journal of Symbolic Computation, 104:142–158, 2021. ISSN 0747-7171. doi: 10.1016/j.jsc.2020.04.012. [DOI] [Google Scholar]
Huber KT, van Iersel L, Moulton V, Scornavacca C, and Wu T. Reconstructing phylogenetic level-1 networks from nondense binet and trinet sets. Algorithmica, 77(1):173–200, 2017. doi: 10.1007/s00453-015-0069-8. URL 10.1007/s00453-015-0069-8. [DOI] [Google Scholar]
Huber KT, Moulton V, Semple C, and Wu T. Quarnet inference rules for level-1 networks. Bulletin of Mathematical Biology, 80(8):2137–2153, 2018. doi: 10.1007/s11538-018-0450-2. URL 10.1007/s11538-018-0450-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jansson J and Sung W-K. Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science, 363(1):60–68, 2006. ISSN 0304-3975. doi: 10.1016/j.tcs.2006.06.022. Computing and Combinatorics. [DOI] [Google Scholar]
Jiao X and Yang Z. Defining species when there is gene flow. Syst. Biol, 70(1):108–119, July 2020. ISSN 1063-5157. doi: 10.1093/sysbio/syaa052. URL 10.1093/sysbio/syaa052. [DOI] [PubMed] [Google Scholar]
Liu L and Edwards SV. Phylogenetic analysis in the anomaly zone. Systematic Biology, 58(4):452–460, 2009. [DOI] [PubMed] [Google Scholar]
Lockhart PJ, Steel MA, Hendy MD, and Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol, 11:605–612, 1994.
Long C and Kubatko L. The effect of gene flow on coalescent-based species-tree inference. Syst. Biol, 67(5):770–785, March 2018. ISSN 1063-5157. doi: 10.1093/sysbio/syy020.
Meng C and Kubatko LS. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model. Theoretical Population Biology, 75(1):35–45, 2009. ISSN 00405809. doi: 10.1016/j.tpb.2008.10.004.
Mitchell JD, Allman ES, and Rhodes JA. Hypothesis testing near singularities and boundaries. Electron. J. Statist, 13(1):2150–2193, 2019.
Rhodes JA. Topological metrizations of trees, and new quartet methods of tree inference. IEEE/ACM Trans. Comput. Biol. Bioinform, early access, 2019. doi: 10.1109/TCBB.2019.2917204.
Rosselló F and Valiente G. All that glisters is not galled. Mathematical Biosciences, 221(1):54–59, 2009. ISSN 00255564. doi: 10.1016/j.mbs.2009.06.007.
Semple C and Steel M. Phylogenetics. Oxford University Press, 2005. ISBN 0 19 850942 1.
Solís-Lemus C and Ané C. Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting. PLoS Genetics, 12(3), 2016. ISSN 15537404. doi: 10.1371/journal.pgen.1005896.
Solís-Lemus C, Yang M, and Ané C. Inconsistency of species tree methods under gene flow. Syst. Biol, 65(5):843–851, May 2016. ISSN 1063-5157. doi: 10.1093/sysbio/syw030.
Steel M. Phylogeny: Discrete and Random Processes in Evolution. SIAM, Philadelphia, 2016. ISBN 9781611974478.
Steel MA. Recovering a tree from the leaf colourations it generates under a Markov model. Applied Mathematics Letters, 7(2):19–24, 1994.
van der Vaart AW. Asymptotic Statistics. Cambridge University Press, 1998.
van Iersel L, Moulton V, and Murakami Y. Reconstructibility of unrooted level-k phylogenetic networks from distances. Advances in Applied Mathematics, 120:102075, 2020.
Wen D and Nakhleh L. Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Systematic Biology, 67(3):439–457, 2018.
Yu Y and Nakhleh L. A maximum pseudo-likelihood approach for phylogenetic networks. BMC Genomics, 16(Suppl 10):S10, 2015. ISSN 1471-2164. doi: 10.1186/1471-2164-16-S10-S10.
Yu Y, Than C, Degnan JH, and Nakhleh L. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology, 60(2):138–149, 2011.
Zhang C, Ogilvie HA, Drummond AJ, and Stadler T. Bayesian inference of species networks from multilocus sequence data. Molecular Biology and Evolution, 35(2):504–517, December 2017.
Zhu J, Yu Y, and Nakhleh L. In the light of deep coalescence: revisiting trees within networks. BMC Bioinformatics, 5:271–282, 2016.
