A Metric on the Space of Partly Reduced Phylogenetic Networks

Juan Wang

doi:10.1155/2016/7534258

2016 Jun 23;2016:7534258.

PMCID: PMC4935902 PMID: 27419137

Abstract

Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, such as recombination between genes, hybridization between lineages, and horizontal gene transfer. Researchers have designed several measures for computing the dissimilarity between two phylogenetic networks, and each measure has been proven to be a metric on a particular class of phylogenetic networks. However, none of the existing measures is a metric on the space of partly reduced phylogenetic networks. In this paper, we provide a polynomial-time computable metric, the d _e-distance, on the space of partly reduced phylogenetic networks.

1. Introduction

Phylogenies reveal the history of evolutionary events of a group of species, and they are central to comparative analysis methods for testing hypotheses in evolutionary biology [1]. Computing the distance between a pair of phylogenies is very important for understanding the evolutionary history of species.

A metric d on a space S satisfies four properties for all a, b, c ∈ S:

(I)
d(a, b) ≥ 0 (nonnegative property);
(II)
d(a, b) = 0 if and only if a = b (separation property);
(III)
d(a, b) = d(b, a) (symmetry property);
(IV)
d(a, b) + d(b, c) ≥ d(a, c) (triangle inequality).
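As an illustrative aside, the four properties can be checked by brute force on any finite collection of objects. The sketch below is not part of the original paper; the function name `is_metric` and the toy distance functions are hypothetical and serve only to make properties (I) to (IV) concrete.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Brute-force check of properties (I)-(IV) on a finite list of distinct points."""
    for a, b in product(points, repeat=2):
        if d(a, b) < -tol:                        # (I) nonnegativity
            return False
        if (abs(d(a, b)) <= tol) != (a == b):     # (II) separation
            return False
        if abs(d(a, b) - d(b, a)) > tol:          # (III) symmetry
            return False
    for a, b, c in product(points, repeat=3):
        if d(a, b) + d(b, c) < d(a, c) - tol:     # (IV) triangle inequality
            return False
    return True

points = [0, 1, 3, 7]                             # hypothetical toy space
abs_ok = is_metric(points, lambda a, b: abs(a - b))
sq_ok = is_metric(points, lambda a, b: (a - b) ** 2)
print(abs_ok, sq_ok)
```

On this toy space the absolute difference passes all four checks, whereas the squared difference fails the triangle inequality (for 0, 1, 3 we have 1 + 4 < 9).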

Phylogenetic networks can represent reticulate evolutionary events, such as recombination between genes, hybridization between lineages, and horizontal gene transfer [2–5]. For the comparison of phylogenetic networks, there are many metrics on restricted subclasses of networks, including the tripartition metric on the space of tree-child phylogenetic networks [6–9], the μ-distance on the space of tree-sibling phylogenetic networks [10], and the m-distance on the space of reduced phylogenetic networks [11]. Later, the m-distance was also proved to be a metric on the space of tree-child phylogenetic networks, semibinary tree-sibling time consistent phylogenetic networks, and multilabeled phylogenetic trees [12–15].

For any rooted phylogenetic network N, we can obtain its reduced version by removing from N all nodes in maximal convergent sets (discussed below) and all nodes with indegree 1 and outdegree 1. The reduced versions of all rooted phylogenetic networks form the space of reduced phylogenetic networks (the m-distance, defined by Nakhleh, is a metric on this space). In this paper, we discuss the partly reduced version of a phylogenetic network, which is obtained by removing the nodes in only a part of the convergent sets and all nodes with indegree 1 and outdegree 1. The partly reduced versions of all rooted phylogenetic networks form the space of partly reduced phylogenetic networks. We then introduce a novel metric on this space. The space is not the space of all rooted phylogenetic networks, but it is the largest space on which a polynomial-time computable metric has been defined so far. It has been proved [16, 17] that the isomorphism problem for rooted phylogenetic networks is graph isomorphism-complete. Unless the graph isomorphism problem belongs to P, there is no hope of defining a polynomial-time computable metric on the space of all rooted phylogenetic networks. The aim of this paper is therefore to find a larger space on which a polynomial-time computable metric can be defined, so that the space is closer to the space of all rooted phylogenetic networks.

2. Preliminaries

Let N = (V, E) be a directed acyclic graph, or DAG for short. We denote the indegree of a node u as indeg(u) and the outdegree of u as outdeg(u). We will say that a node u is a tree node if indeg(u) ≤ 1. In particular, u is a root of N if indeg(u) = 0. If a single root exists, we will say that the DAG is rooted. We will say that a node u is a reticulate node if indeg(u) ≥ 2. A tree node u is a leaf if outdeg(u) = 0. A node u is called an internal node if outdeg(u) ≥ 1. For a DAG N = (V, E), we will say that v is a child of u if (u, v) ∈ E; in this case, we will also say that u is a parent of v. Note that any tree node has a single parent, except for the root of the graph. Whenever there is a directed path from a node u to v, we will say that v is a descendant of u or u is an ancestor of v.

The height of a node is the length of a longest path starting at the node and ending in a leaf. The absence of cycles implies that the nodes of a DAG N can be stratified by means of their heights: the nodes of height 0 are the leaves; if a node has height a > 0, then all its children have heights that are smaller than a and at least one of them has height exactly a − 1.

The depth of a node is the length of a longest path starting at the root and ending in the node. Similarly, the absence of cycles implies that the nodes of a DAG N can also be stratified according to their depths: the node of depth 0 is the root; if a node has depth b > 0, then all its parents have depths that are smaller than b and at least one of them has depth exactly b − 1.
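The two stratifications above can be computed directly from their recursive characterizations. The following is a minimal sketch, assuming a DAG stored as a dictionary that maps each node to the list of its children; the representation, the function name `heights_and_depths`, and the toy network are illustrative assumptions, not taken from the paper.

```python
def heights_and_depths(children):
    """children: dict node -> list of children (leaves map to []).
    Returns (height, depth) dicts using longest paths to a leaf / from the root."""
    parents = {u: [] for u in children}
    for u, kids in children.items():
        for v in kids:
            parents[v].append(u)

    height, depth = {}, {}

    def h(u):
        # height 0 for leaves, otherwise 1 + the largest height among the children
        if u not in height:
            height[u] = 0 if not children[u] else 1 + max(h(v) for v in children[u])
        return height[u]

    def dep(u):
        # depth 0 for the root, otherwise 1 + the largest depth among the parents
        if u not in depth:
            depth[u] = 0 if not parents[u] else 1 + max(dep(p) for p in parents[u])
        return depth[u]

    for u in children:
        h(u)
        dep(u)
    return height, depth

# Hypothetical toy network: h is a reticulate node with parents a and b.
toy = {"r": ["a", "b"], "a": ["h", "1"], "b": ["h", "2"],
       "h": ["3"], "1": [], "2": [], "3": []}
height, depth = heights_and_depths(toy)
print(height["r"], depth["3"])
```

The memoized recursion visits each edge a constant number of times, so both maps are obtained in time linear in the size of the DAG (for networks shallow enough for Python's recursion limit).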

Let 𝒳 be a set of taxa. A rooted phylogenetic network N on 𝒳 is a rooted DAG such that

(i)
no tree node has outdeg 1;
(ii)
its leaves are labeled by 𝒳 by a bijective mapping f.

We use the notation N = ((V, E), f) (or N = (V, E)) for the rooted phylogenetic network N and the notation V _N for its leaf set.

Definition 1 . —

Two rooted phylogenetic networks N ₁ = ((V ₁, E ₁), f ₁) and N ₂ = ((V ₂, E ₂), f ₂) are isomorphic if and only if there is a bijection G from V ₁ to V ₂ such that

(i)
(u, v) is an edge in E ₁ if and only if (G(u), G(v)) is an edge in E ₂;

(ii)
f ₁(w) = f ₂(G(w)) for all w ∈ V _N₁.

Moret et al. (2004) discussed the concept of reduced phylogenetic networks from a reconstruction standpoint. Subsequently, we briefly review the concept of reduced phylogenetic networks and introduce a new definition of partly reduced phylogenetic networks. In the following section, we present a metric on the space of all partly reduced phylogenetic networks. First we review the concept of a maximal convergent set that has been given in [7, 11].

Definition 2 . —

Given a network N = (V, E), we say that a set U of internal nodes in V is convergent if |U | ≥ 2 and

every leaf reachable from some node in U is reachable from all nodes in U.

If there is no convergent set U ₀ containing U except U itself, we say that U is a maximal convergent set.

Here the leaf set reachable from the nodes in a convergent set U is called the leaf set of U.
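Definition 2 amounts to requiring that all nodes of U be internal and reach exactly the same set of leaves. A minimal sketch of this test follows, assuming the network is given as a dictionary from each node to its list of children; the helper names and the toy network are hypothetical.

```python
def reachable_leaves(children, u):
    """Leaves reachable from u along directed paths."""
    seen, stack, leaves = set(), [u], set()
    while stack:
        x = stack.pop()
        if x in seen:
            continue
        seen.add(x)
        if not children[x]:
            leaves.add(x)
        stack.extend(children[x])
    return leaves

def is_convergent(children, U):
    """Definition 2: |U| >= 2, all nodes internal, and identical reachable leaf sets."""
    U = list(set(U))
    if len(U) < 2 or any(not children[u] for u in U):
        return False
    first = reachable_leaves(children, U[0])
    return all(reachable_leaves(children, u) == first for u in U[1:])

# Hypothetical toy network: h (reticulate) and its child c reach the same leaves.
toy = {"r": ["a", "b"], "a": ["h", "1"], "b": ["h", "2"],
       "h": ["c"], "c": ["3", "4"], "1": [], "2": [], "3": [], "4": []}
print(is_convergent(toy, {"h", "c"}), is_convergent(toy, {"a", "b"}))
```

In the toy network, {h, c} is convergent because both nodes reach exactly the leaves 3 and 4, while {a, b} is not, since leaf 1 is reachable only from a.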

We will take Figure 1 as an example in the following. The two networks N ₁, N ₂ on {1, 2, 3, 4, x} are adapted from refinements (1) and (2) in Table 1 in [11].

Figure 1: Networks N ₁ and N ₂ from refinements (1) and (2) in Table 1 in [11]. H1 and H2 (resp., h1 and h2) are the reticulate nodes; A~G (resp., a~g) as well as the root R (resp., r) are the internal tree nodes in network N ₁ (resp., N ₂).

Example 3 . —

Consider the networks in Figure 1. The set {H1, H2, G} is the only maximal convergent set of N ₁ and the set {h1, h2, g} is the only maximal convergent set of N ₂.

For a phylogenetic network N = ((V, E), f) on 𝒳, the reduced version of N can be obtained by the following reduction procedures:

(1)
For each maximal pendant subtree (i.e., the maximal clade that includes no reticulate nodes) t, rooted at node r _t, create a new node h _t and an edge (p _t, h _t), where p _t is the parent of r _t, delete the edge (p _t, r _t) and the subtree t, and label h _t as t. Then we denote the resulting network as N ₀.

(2)
Repeat the following two steps on N ₀ until no change occurs:

(I)
For each maximal convergent set U with leaf set L _U⊆V _N₀, remove all nodes and edges on the paths from a node in U to the parent of a leaf in L _U, including all nodes in U and excluding the parents of the leaves in L _U. For each edge (p, v), where p lies outside the deleted set and v lies inside the deleted set, replace it with the set of edges {(p, q): q is the parent of a leaf in L _U}.

(II)
For each node w in the network, with indeg(w) = outdeg(w) = 1, remove the edges (u, w), (w, v) and the node w, add an edge (u, v), where u is the parent of w and v is the child of w. Repeat this step until no such node can be removed.

(3)
Replace each leaf labeled by the subtree t by its root r _t.
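Step (II) of procedure (2) is the most mechanical part of the reduction and can be sketched in a few lines. The code below is an illustration only, assuming the network is given as a set of directed edges; the function name `suppress_unary_nodes` is hypothetical, and because edges are kept in a set, a newly created edge that duplicates an existing one is merged with it.

```python
def suppress_unary_nodes(edges):
    """Repeatedly remove nodes w with indeg(w) = outdeg(w) = 1,
    replacing the edges (u, w), (w, v) by the single edge (u, v)."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        nodes = {x for e in edges for x in e}
        for w in nodes:
            ins = [e for e in edges if e[1] == w]
            outs = [e for e in edges if e[0] == w]
            if len(ins) == 1 and len(outs) == 1:
                u, v = ins[0][0], outs[0][1]
                edges -= {ins[0], outs[0]}
                edges.add((u, v))
                changed = True
                break          # degrees changed; rescan from scratch
    return edges

# Hypothetical toy input: the chain r -> a -> b -> 1 collapses to r -> 1.
toy_edges = {("r", "a"), ("a", "b"), ("b", "1"), ("r", "2")}
reduced = suppress_unary_nodes(toy_edges)
print(sorted(reduced))
```

The rescan after every removal mirrors the instruction to repeat the step until no such node can be removed; it favors clarity over efficiency.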

Figure 2 shows the results of applying the reduction procedures to the network N. For the networks in Figure 1, their reduced versions are the same (see Figure 3). The reduced versions of all rooted phylogenetic networks form the space of reduced phylogenetic networks. Nakhleh has introduced a polynomial-time computable metric on this space [11]. In order to enlarge the space in which a polynomial-time computable metric can be defined, we will introduce a new metric and a new space that contains the space of reduced phylogenetic networks.

Figure 2: The rooted phylogenetic network N on {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. N ₀, N ₁, and N ₂ are the networks obtained from N by applying the three reduction procedures, respectively.

Figure 3: The reduced version of the networks in Figure 1.

Definition 4 . —

Given a network N = (V, E), let 𝒫(v) be the set of parents of a node v in V. We say that U ⊂ V is a superconvergent set if

(i)
U is a convergent set;

(ii)
𝒫(u ₁) = 𝒫(u ₂) for any two nodes u ₁, u ₂ ∈ U;

(iii)
𝒫(u) is a convergent set for a node u ∈ U, if |𝒫(u)| ≥ 2.
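The three conditions of Definition 4 translate into a direct test. The sketch below assumes a network stored as a dictionary from each node to its list of children; all names and the toy network (which is not the network of Figure 4) are hypothetical.

```python
def leaves_below(children, u):
    """Leaves reachable from u along directed paths."""
    seen, stack, out = set(), [u], set()
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            if not children[x]:
                out.add(x)
            stack.extend(children[x])
    return out

def is_convergent(children, U):
    U = list(U)
    if len(U) < 2 or any(not children[u] for u in U):
        return False
    return len({frozenset(leaves_below(children, u)) for u in U}) == 1

def is_superconvergent(children, U):
    parents = {u: set() for u in children}
    for u, kids in children.items():
        for v in kids:
            parents[v].add(u)
    U = set(U)
    if not is_convergent(children, U):                    # condition (i)
        return False
    parent_sets = {frozenset(parents[u]) for u in U}
    if len(parent_sets) != 1:                             # condition (ii)
        return False
    P = next(iter(parent_sets))
    return len(P) < 2 or is_convergent(children, P)       # condition (iii)

# Hypothetical toy network: H and J share the parents a, b and the child c.
toy = {"r": ["a", "b", "2"], "a": ["H", "J"], "b": ["H", "J"],
       "H": ["c"], "J": ["c"], "c": ["1"], "1": [], "2": []}
print(is_superconvergent(toy, {"H", "J"}), is_superconvergent(toy, {"a", "H"}))
```

Here {H, J} satisfies all three conditions, whereas {a, H} is convergent but fails condition (ii) because the two nodes have different parent sets.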

Example 5 . —

The set {H, J} is the only superconvergent set for any one network in Figure 4, while the networks in Figure 1 have no superconvergent set.

Figure 4: Networks N ₁ and N ₂ are not isomorphic.

We will obtain the new reduction procedures, called partial reduction procedures, from the above reduction procedures by just processing superconvergent sets rather than maximal convergent sets in step (I) of step (2). After applying the partial reduction procedures to a rooted phylogenetic network N, the partly reduced version of N is obtained. The partly reduced versions of all rooted phylogenetic networks form the space of partly reduced phylogenetic networks. This space contains the space of reduced phylogenetic networks, but they are not identical. Next we will introduce a polynomial-time computable metric for the partly reduced phylogenetic networks.

We begin with the notion of node semiequivalence. For the sake of simplicity, we will hereafter refer to the rooted phylogenetic networks as the networks.

3. A Metric

Definition 6 . —

Given a network N = ((V, E), f), we say that two nodes u, v ∈ V (not necessarily different) are semiequivalent, denoted by u≜v, if

(i)
u, v ∈ V _N and f(u) = f(v) or

(ii)
node u has k (≥1) children u ₁, u ₂,…, u _k; node v has k children v ₁, v ₂,…, v _k, and u _i≜v _i for 1 ≤ i ≤ k.

By the definition, it follows that the semiequivalence of nodes is an equivalence relation; that is, it is reflexive, symmetric, and transitive, and the semiequivalent nodes must have the same height.
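Because semiequivalence is an equivalence relation, condition (ii) holds exactly when the children of u and the children of v have the same multiset of semiequivalence classes. This suggests computing the classes bottom-up by giving each node a signature. The sketch below is one possible realization under that observation, not the procedure used in the paper; the dictionary representation and all names are assumptions.

```python
def semiequivalence_classes(children, label):
    """Return dict node -> class id; two nodes get the same id
    exactly when they are semiequivalent in the sense of Definition 6."""
    table = {}      # signature -> class id
    cls = {}

    def visit(u):
        if u not in cls:
            if children[u]:
                # internal node: sorted multiset of the children's class ids
                sig = ("node", tuple(sorted(visit(v) for v in children[u])))
            else:
                # leaf: identified by its taxon label
                sig = ("leaf", label[u])
            cls[u] = table.setdefault(sig, len(table))
        return cls[u]

    for u in children:
        visit(u)
    return cls

# Hypothetical toy network in which H1 and H2 both have the single child c.
toy = {"r": ["a", "b"], "a": ["H1", "H2", "1"], "b": ["H1", "H2", "2"],
       "H1": ["c"], "H2": ["c"], "c": ["3", "4"],
       "1": [], "2": [], "3": [], "4": []}
labels = {x: x for x in ("1", "2", "3", "4")}
cls = semiequivalence_classes(toy, labels)
print(cls["H1"] == cls["H2"], cls["a"] == cls["b"])
```

In the toy network H1 and H2 fall into one class, while a and b do not, because their leaf children carry different labels.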

Example 7 . —

Consider the network N ₁ in Figure 1. For any node u ∈ V ₁∖{H1, H2}, u is only semiequivalent to u itself, while the nodes H1 and H2 are semiequivalent.

Property 1 . —

If u ₁, u ₂,…, u _k are semiequivalent from the network N = ((V, E), f), then u ₁, u ₂,…, u _k are the same nodes or there are the nodes v ₁ (u ₁ or a descendant of u ₁), v ₂ (u ₂ or a descendant of u ₂),…, v _k (u _k or a descendant of u _k) such that v ₁, v ₂,…, v _k have the same children. See Figure 5.

Figure 5: The topology relation of semiequivalent nodes.

Proof —

We use induction on the height a of u ₁ to prove it. If a = 0, then u ₁, u ₂,…, u _k are all the same leaf, because the leaves are labeled bijectively. Thus, in this case, the property holds. Assume that the result holds when a ≤ n, and let a = n + 1. Then the children of u ₁, u ₂,…, u _k are semiequivalent, respectively (let the children of u _i be a _i1, a _i2,…, a _il for 1 ≤ i ≤ k; then a _1j, a _2j,…, a _kj are semiequivalent for 1 ≤ j ≤ l), and their height is at most n by the property of node height. By the induction hypothesis, the children of u ₁, u ₂,…, u _k satisfy the property. The descendants of the children of u ₁, u ₂,…, u _k are descendants of u ₁, u ₂,…, u _k. Thus, the property holds.

Definition 8 . —

Given a network N = (V, E), we say that two nodes u, v ∈ V (not necessarily different) are equivalent, denoted by u ≡ v, if u≜v, and

(i)
u, v are the root or

(ii)
node u has l (≥1) parents u ₁, u ₂,…, u _l; node v has l parents v ₁, v ₂,…, v _l, and u _i ≡ v _i for 1 ≤ i ≤ l.

For any node u in N, it is equivalent to itself. The equivalence of nodes is also an equivalence relation. The equivalent nodes have the same height and depth.
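Equivalence refines semiequivalence by additionally matching the parents, so it can be computed top-down once the bottom-up classes are known. The sketch below illustrates this with node signatures; it is an assumption-laden illustration (dictionary representation, hypothetical names and toy networks), not the paper's procedure.

```python
def equivalence_classes(children, label):
    """Return dict node -> class id; equal ids mean equivalent nodes (Definition 8)."""
    parents = {u: [] for u in children}
    for u, kids in children.items():
        for v in kids:
            parents[v].append(u)

    semi, semi_table = {}, {}
    eq, eq_table = {}, {}

    def down(u):   # bottom-up semiequivalence class
        if u not in semi:
            if children[u]:
                sig = ("node", tuple(sorted(down(v) for v in children[u])))
            else:
                sig = ("leaf", label[u])
            semi[u] = semi_table.setdefault(sig, len(semi_table))
        return semi[u]

    def up(u):     # top-down: own semiequivalence class plus the parents' classes
        if u not in eq:
            sig = (down(u), tuple(sorted(up(p) for p in parents[u])))
            eq[u] = eq_table.setdefault(sig, len(eq_table))
        return eq[u]

    for u in children:
        up(u)
    return eq

labels = {x: x for x in ("1", "2", "3", "4")}
# Hypothetical toy network A: H1 and H2 share both parents and the child c.
net_a = {"r": ["a", "b"], "a": ["H1", "H2", "1"], "b": ["H1", "H2", "2"],
         "H1": ["c"], "H2": ["c"], "c": ["3", "4"],
         "1": [], "2": [], "3": [], "4": []}
# Hypothetical toy network B: H1 and H2 are semiequivalent but have different parents.
net_b = {"r": ["a", "b", "H2"], "a": ["H1", "1"], "b": ["H1", "H2", "2"],
         "H1": ["c"], "H2": ["c"], "c": ["3", "4"],
         "1": [], "2": [], "3": [], "4": []}
eq_a = equivalence_classes(net_a, labels)
eq_b = equivalence_classes(net_b, labels)
print(eq_a["H1"] == eq_a["H2"], eq_b["H1"] == eq_b["H2"])
```

In network A the nodes H1 and H2 are equivalent; in network B they remain semiequivalent but are not equivalent, since their parent sets cannot be matched.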

Example 9 . —

Consider the network N ₁ in Figure 1. For any node u ∈ V ₁, it is equivalent to itself. Consider the network N ₁ in Figure 4. For any node u ∈ V ₁∖{H, J}, it is equivalent to itself, while the nodes H and J are equivalent to each other.

Property 2 . —

If u ₁, u ₂,…, u _k are equivalent in the network N = ((V, E), f), then u ₁, u ₂,…, u _k are the same nodes or there are the nodes p ₁ (u ₁ or an ancestor of u ₁), p ₂ (u ₂ or an ancestor of u ₂),…, p _k (u _k or an ancestor of u _k) such that p ₁, p ₂,…, p _k have the same parents. See Figure 6.

Figure 6: The topology relation of equivalent nodes.

Proof —

We use induction on the depth b of u ₁ to prove it. If b = 0, then u ₁, u ₂,…, u _k are the unique root node. Thus, in this case, the property holds. We assume that the result is tenable when b ≤ n, and let b = n + 1. Then the parents of u ₁, u ₂,…, u _k are equivalent, respectively (let the parents of u _i be a _i1, a _i2,…, a _il for 1 ≤ i ≤ k; then a _1j, a _2j,…, a _kj are equivalent for 1 ≤ j ≤ l), and their depth is at most n by the property of node depth. By the induction hypothesis, the parents of u ₁, u ₂,…, u _k satisfy the property. The ancestors of the parents of u ₁, u ₂,…, u _k are the ancestors of u ₁, u ₂,…, u _k. Thus, the property holds.

In this paper, we are mainly concerned with comparing networks; the notion of node semiequivalence and equivalence will be extended to nodes from two different networks, as established in the semiequivalence and equivalence mapping of Definitions 10 and 13, respectively.

Given a set V, we use P(V) to denote the set of all subsets of V.

Definition 10 . —

Let N ₁ = ((V ₁, E ₁), f ₁) and N ₂ = ((V ₂, E ₂), f ₂) be two networks on 𝒳. We define the semiequivalence mapping between N ₁ and N ₂, h : V ₁ → P(V ₂), such that v ∈ h(u), for u ∈ V ₁ and v ∈ V ₂, if

(i)
u ∈ V _N₁, v ∈ V _N₂, and f ₁(u) = f ₂(v) or

(ii)
node u has k (≥1) children u ₁, u ₂,…, u _k; node v has k children v ₁, v ₂,…, v _k, and v _i ∈ h(u _i) for 1 ≤ i ≤ k.

Further, while the inequality |h(u)| ≤ 1 holds for every node u in phylogenetic trees, it does not always hold in general phylogenetic networks.

Example 11 . —

Consider the networks in Figure 1. h is a semiequivalence mapping between N ₁ and N ₂. For the reticulate nodes H1 and H2 in N ₁, h(H1) = {h1, h2} and h(H2) = {h1, h2}. For the other nodes in N ₁, h(A) = {a}, h(B) = {b},…, h(G) = {g}, h(1) = {1},…, h(4) = {4}, h(x) = {x}, and h(R) = {r}.

Theorem 12 . —

Let N ₁ = ((V ₁, E ₁), f ₁) and N ₂ = ((V ₂, E ₂), f ₂) be two networks on 𝒳, and let u ₁, u ₂ be two nodes in V ₁ and h a semiequivalence mapping between N ₁ and N ₂. Assume that h(u ₁) ≠ ∅ and h(u ₂) ≠ ∅. Then, u ₁≜u ₂ if and only if v ₁≜v ₂, for v ₁ ∈ h(u ₁) and v ₂ ∈ h(u ₂).

Proof —

For the “only if” direction, let v ₁ ∈ h(u ₁), v ₂ ∈ h(u ₂), and u ₁≜u ₂. Obviously, u ₁, u ₂, v ₁, and v ₂ have the same height a. Then, we use induction on such height a to prove v ₁≜v ₂. In particular, if a = 0, that is, u ₁, u ₂ ∈ V _N₁, and f ₁(u ₁) = f ₁(u ₂), then v ₁, v ₂ ∈ V _N₂ and f ₂(v ₁) = f ₁(u ₁) = f ₁(u ₂) = f ₂(v ₂). Thus, in this case, v ₁≜v ₂. We assume that the result is tenable when a ≤ n, and let a = n + 1. We assume that node u ₁ has k children p ₁, p ₂,…, p _k. Due to u ₁≜u ₂, it follows that node u ₂ has k children q ₁, q ₂,…, q _k, and p _i≜q _i (1 ≤ i ≤ k). Due to v ₁ ∈ h(u ₁) and v ₂ ∈ h(u ₂), it follows that v ₁ has k children w ₁, w ₂,…, w _k, and w _i ∈ h(p _i) (1 ≤ i ≤ k), v ₂ has k children y ₁, y ₂,…, y _k, and y _i ∈ h(q _i) (1 ≤ i ≤ k). The height of p _i, q _i, w _i, and y _i is at most n. By the induction hypothesis, w _i≜y _i. Thus, v ₁≜v ₂.

For the “if” direction, let v ₁ ∈ h(u ₁), v ₂ ∈ h(u ₂), and v ₁≜v ₂. Similarly, we also use induction on the same height a of u ₁, u ₂, v ₁, and v ₂ to prove u ₁≜u ₂. If a = 0, that is, v ₁, v ₂ ∈ V _N₂, and f ₂(v ₁) = f ₂(v ₂), then u ₁, u ₂ ∈ V _N₁ and f ₁(u ₁) = f ₂(v ₁) = f ₂(v ₂) = f ₁(u ₂). Thus, in this case, u ₁≜u ₂. We assume that the result is tenable when a ≤ n, and let a = n + 1. We assume that node v ₁ has k children w ₁, w ₂,…, w _k. Since v ₁≜v ₂, node v ₂ has k children y ₁, y ₂,…, y _k, and w _i≜y _i (1 ≤ i ≤ k). Since v ₁ ∈ h(u ₁) and v ₂ ∈ h(u ₂), u ₁ has k children p ₁, p ₂,…, p _k, and w _i ∈ h(p _i) (1 ≤ i ≤ k), u ₂ has k children q ₁, q ₂,…, q _k, and y _i ∈ h(q _i) (1 ≤ i ≤ k). The height of p _i, q _i, w _i, and y _i is at most n by the property of node height. By the induction hypothesis, p _i≜q _i. Thus, u ₁≜u ₂.

Theorem 12 tells us that the semiequivalence mapping keeps the semiequivalence of nodes. Thus, all nodes in h(u) are semiequivalent. Sometimes we use h(u) to denote an arbitrary node in the set. We say that the nodes in h(u) are semiequivalent with u.

Definition 13 . —

Let N ₁ = ((V ₁, E ₁), f ₁) and N ₂ = ((V ₂, E ₂), f ₂) be two networks on 𝒳. We define the equivalence mapping between N ₁ and N ₂, g : V ₁ → P(V ₂), such that v ∈ g(u), for u ∈ V ₁ and v ∈ V ₂, if v ∈ h(u), and

(i)
u, v are the roots or

(ii)
node u has l (≥1) parents u ₁, u ₂,…, u _l; node v has l parents v ₁, v ₂,…, v _l, and v _i ∈ g(u _i), for 1 ≤ i ≤ l,

where h is a semiequivalence mapping between N ₁ and N ₂.

Example 14 . —

Consider the networks in Figure 1. Let h be the semiequivalence mapping between N ₁ and N ₂ discussed in Example 11, and let g be the equivalence mapping between N ₁ and N ₂ defined in Definition 13. For any node u ∈ V ₁∖{H1, H2, G, x}, g(u) = h(u), while g(v) = ∅ when v ∈ {H1, H2, G, x}.

Theorem 15 . —

Let N ₁ = ((V ₁, E ₁), f ₁) and N ₂ = ((V ₂, E ₂), f ₂) be two networks on 𝒳, and let u ₁, u ₂ be two nodes in V ₁. g is an equivalence mapping between N ₁ and N ₂. Assume that g(u ₁) ≠ ∅ and g(u ₂) ≠ ∅. Then, u ₁ ≡ u ₂ if and only if v ₁ ≡ v ₂, for v ₁ ∈ g(u ₁) and v ₂ ∈ g(u ₂).

Proof —

Let v ₁ ∈ g(u ₁), v ₂ ∈ g(u ₂). Then v ₁ ∈ h(u ₁), v ₂ ∈ h(u ₂) based on Definition 13. For the “only if” direction, let u ₁ ≡ u ₂. We can deduce that v ₁≜v ₂ according to Theorem 12, and u ₁, u ₂ and v ₁ and v ₂ have the same depth b. Then, we use induction on b to prove that v ₁ ≡ v ₂. If b = 0, that is, u ₁, u ₂ are the unique root node of N ₁, then v ₁, v ₂ are the unique root node of N ₂. Thus, in this case, v ₁ ≡ v ₂. We assume that the result is tenable when b ≤ n, and let b = n + 1. We assume that node u ₁ has l parents p ₁, p ₂,…, p _l. Due to u ₁ ≡ u ₂, node u ₂ has l parents q ₁, q ₂,…, q _l, and p _i ≡ q _i (1 ≤ i ≤ l). Due to v ₁ ∈ g(u ₁) and v ₂ ∈ g(u ₂), v ₁ has l parents w ₁, w ₂,…, w _l, and w _i ∈ g(p _i) (1 ≤ i ≤ l), v ₂ has l parents y ₁, y ₂,…, y _l, and y _i ∈ g(q _i) (1 ≤ i ≤ l). The depth of p _i, q _i, w _i, and y _i is at most n by the property of node depth. By the induction hypothesis, w _i ≡ y _i. Thus, v ₁ ≡ v ₂.

For the “if” direction, let v ₁ ∈ g(u ₁), v ₂ ∈ g(u ₂), and v ₁ ≡ v ₂. We can deduce first that u ₁≜u ₂ according to Theorem 12. Similarly, we also use induction on the same depth b of u ₁, u ₂ and v ₁, v ₂ to prove that u ₁ ≡ u ₂. If b = 0, that is, v ₁, v ₂ are the unique root node of N ₂, then u ₁, u ₂ are the unique root node of N ₁. Thus, in this case, u ₁ ≡ u ₂. We assume that the result is tenable when b ≤ n, and let b = n + 1. We assume that node v ₁ has l parents w ₁, w ₂,…, w _l. Due to v ₁ ≡ v ₂, node v ₂ has l parents y ₁, y ₂,…, y _l, and w _i ≡ y _i (1 ≤ i ≤ l). Due to v ₁ ∈ g(u ₁) and v ₂ ∈ g(u ₂), u ₁ has l parents p ₁, p ₂,…, p _l, and w _i ∈ g(p _i) (1 ≤ i ≤ l), u ₂ has l parents q ₁, q ₂,…, q _l, and y _i ∈ g(q _i) (1 ≤ i ≤ l). The depth of p _i, q _i, w _i, and y _i is at most n. So, by the induction hypothesis, p _i ≡ q _i. Thus, u ₁ ≡ u ₂.

Theorem 15 tells us that the equivalence mapping keeps the equivalence of nodes. Thus, all nodes in g(u) are equivalent. Sometimes we use g(u) to denote an arbitrary node in the set. We say that the nodes in g(u) are equivalent to u.

Lemma 16 . —

Let N = ((V, E), f) be a network and u, v ∈ V two equivalent nodes. Then u, v belong to a superconvergent set.

Proof —

This lemma is obtained easily from Properties 1 and 2.

Lemma 17 . —

Let N = ((V, E), f) be a partly reduced phylogenetic network. Then u ₁≢u ₂ for any two nodes u ₁, u ₂ ∈ V.

Proof —

From the partial reduction procedures of the network, we have that all superconvergent sets in a partly reduced network have been deleted.

Given two networks N ₁ = (V ₁, E ₁) and N ₂ = (V ₂, E ₂), assume that V ₁ = {v ₁, v ₂,…, v _p}. The set of unique nodes of N ₁, denoted by L(N ₁), is defined by the following process. First let L(N ₁) = ∅. Then, for each node u ∈ V ₁, if there exists no node u′ ∈ L(N ₁) such that u′ ≡ u, add u to L(N ₁). We define L(N ₂) in a similar way. Further, for each node v _i ∈ L(N ₁), we define e _N₁(v _i) = |{v ∈ V ₁ : v ≡ v _i}|, and we define e _N₂(u _i) similarly for each node u _i ∈ L(N ₂). We define e(∅) = 0 for any network N. When the context is clear, we drop the subscript of e. We are now in a position to define the measure on pairs of partly reduced phylogenetic networks.

Definition 18 . —

Let N ₁ = (V ₁, E ₁) and N ₂ = (V ₂, E ₂) be two phylogenetic networks on 𝒳. Then d _e(N ₁, N ₂) equals

$\frac{1}{2}\left[\sum_{v \in L(N_{1})} \max\{0,\, e(v) - e(v')\} + \sum_{u \in L(N_{2})} \max\{0,\, e(u) - e(u')\}\right],$ (1)

where v′(u′) is a node in L(N ₂)(L(N ₁)) that is equivalent to v(u), and if no such equivalent node exists, then v′(u′) = ∅.
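Once each network has been summarized as a multiset of equivalence classes, formula (1) is a short computation. The sketch below assumes that the classes have already been given keys that are comparable across the two networks (that is, equivalent nodes of N ₁ and N ₂ receive the same key); the keys "A" to "D" and the function name are hypothetical.

```python
from collections import Counter

def d_e_from_multisets(m1, m2):
    """Formula (1): m1 and m2 map a class key to its multiplicity e(.).
    A class absent from a network has multiplicity 0, matching e(empty) = 0."""
    s1 = sum(max(0, e - m2.get(k, 0)) for k, e in m1.items())
    s2 = sum(max(0, e - m1.get(k, 0)) for k, e in m2.items())
    return (s1 + s2) / 2

# Hypothetical multiplicities, chosen only to exercise every case of the formula.
m1 = Counter({"A": 1, "B": 2, "C": 1})
m2 = Counter({"A": 1, "B": 1, "D": 1})
dist = d_e_from_multisets(m1, m2)
print(dist)
```

For these toy multisets the first sum is 0 + 1 + 1 = 2 and the second is 0 + 0 + 1 = 1, so the value is 1.5.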

Lemma 19 . —

If d _e(N ₁, N ₂) = 0 for two networks N ₁ = (V ₁, E ₁) and N ₂ = (V ₂, E ₂), then |V ₁ | = |V ₂|.

Proof —

Let g ₁ : V ₁ → P(V ₂) and g ₂ : V ₂ → P(V ₁) be two equivalence mappings from Definition 13. Since d _e(N ₁, N ₂) = 0, for all v ₁ ∈ L(N ₁) we have |g ₁(v ₁)| > 0 and e(v ₁) = e(g ₁(v ₁)), where, in the latter expression, g ₁(v ₁) denotes the node in L(N ₂) that is equivalent to the nodes in g ₁(v ₁). Likewise, for all v ₂ ∈ L(N ₂) we have |g ₂(v ₂)| > 0 and e(v ₂) = e(g ₂(v ₂)), where g ₂(v ₂) denotes the node in L(N ₁) that is equivalent to the nodes in g ₂(v ₂). From this and Theorem 15, we have that |V ₁ | = ∑_{v₁∈L(N₁)} e(v ₁) = ∑_{v₁∈L(N₁)} e(g ₁(v ₁))≤|V ₂| (due to g ₁(v ₁) ∈ V ₂) and |V ₂ | = ∑_{v₂∈L(N₂)} e(v ₂) = ∑_{v₂∈L(N₂)} e(g ₂(v ₂))≤|V ₁| (due to g ₂(v ₂) ∈ V ₁). Thus |V ₁ | = |V ₂|.

Theorem 20 . —

Let N ₁ = (V ₁, E ₁) and N ₂ = (V ₂, E ₂) be two partly reduced networks. Then, N ₁ and N ₂ are isomorphic if and only if d _e(N ₁, N ₂) = 0.

Proof —

Suppose first that d _e(N ₁, N ₂) = 0, and let g : V ₁ → P(V ₂) be an equivalence mapping, as given in Definition 13. From Lemma 19, it follows that |V ₁ | = |V ₂| and e(v) = e(g(v)) for all v ∈ L(N ₁). From Lemmas 16 and 17, we have that g(v ₁) is defined and unique for each v ₁ ∈ V ₁. We now prove that if (u, v) ∈ E ₁, then (u ₀, v ₀) ∈ E ₂, where v ₀ = g(v) and u ₀ = g(u). Given that v ₀ = g(v), that is, v and v ₀ are equivalent, this implies that v ₀ and v have equivalent parents. Since u ₀ = g(u) is defined and unique, u ₀ is a parent of v ₀. Thus, (u ₀, v ₀) ∈ E ₂. This shows that the mapping g is bijective and preserves the labels of the leaves and the edges of the networks. Thus, N ₁ and N ₂ are isomorphic.

The converse implication is obvious.

From the definition of the measure, the symmetry property follows immediately.

Lemma 21 . —

For any pair of networks N ₁ and N ₂, one has d _e(N ₁, N ₂) = d _e(N ₂, N ₁).

The measure d _e(N ₁, N ₂) can be viewed as half of the symmetric difference of two multisets on the same set of elements, where the multiplicity of element u in N ₁ is e _N₁(u) and similarly for N ₂. Since the symmetric difference defines a metric on multisets [12], we have the following triangle inequality.

Lemma 22 . —

Let N ₁, N ₂, and N ₃ be three networks. Then, d _e(N ₁, N ₂) + d _e(N ₂, N ₃) ≥ d _e(N ₁, N ₃).
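The multiset argument behind Lemma 22 can be spelled out in two lines. In the sketch below, x ranges over the equivalence classes occurring in any of the three networks, and e_{N_i}(x) is the multiplicity of x in N _i (0 if x does not occur); this indexing by classes is our own shorthand and assumes that the classes are identified consistently across the networks. The two sums in formula (1) together give the absolute difference of multiplicities, so that

$$d_e(N_i, N_j) = \frac{1}{2} \sum_{x} \left| e_{N_i}(x) - e_{N_j}(x) \right| .$$

Applying the triangle inequality for real numbers to each term then yields

$$d_e(N_1, N_3) = \frac{1}{2} \sum_{x} \left| e_{N_1}(x) - e_{N_3}(x) \right| \le \frac{1}{2} \sum_{x} \Big( \left| e_{N_1}(x) - e_{N_2}(x) \right| + \left| e_{N_2}(x) - e_{N_3}(x) \right| \Big) = d_e(N_1, N_2) + d_e(N_2, N_3).$$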

From Theorem 20 and Lemmas 21 and 22, we have the following main result.

Theorem 23 . —

The measure d _e is a metric on the space of partly reduced phylogenetic networks.

Proof —

It follows from Theorem 20 and Lemmas 21 and 22 and the fact that max{0, e(v) − e(v′)} ≥ 0.

Let N ₁ = (V ₁, E ₁) and N ₂ = (V ₂, E ₂) be two phylogenetic networks. For a node u in N ₁, we refer to its semiequivalent (equivalent) nodes from N ₁ as internal semiequivalent (equivalent) nodes and to its semiequivalent (equivalent) nodes from N ₂ as external semiequivalent (equivalent) nodes. When computing the distance between two networks, we first compute the internal and external equivalent nodes for every node in the two networks; subsequently, we obtain the distance between the two networks by formula (1). The maximum value of d _e(N ₁, N ₂) is (|V ₁ | +|V ₂|)/2, attained when no node in N ₁ or in N ₂ has an external equivalent node.

In order to show the results of the distance computed by formula (1), we give an example as follows.

Example 24 . —

Consider the networks in Figure 1. N ₁, N ₂ are two different networks on {1, 2, 3, 4, x}. However, in [11], they are indistinguishable and their m-distance [11] is 0. Now, we compute the d _e-distance between them: d _e(N ₁, N ₂) = 4 (see Example 14).
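The value in Example 24 can be traced through formula (1). The computation below assumes, from Examples 9 and 14, that every node of N ₁ is equivalent only to itself (so each multiplicity e equals 1) and that exactly the four nodes H1, H2, G, and x have no equivalent node in N ₂; it further assumes that, correspondingly, exactly the four nodes h1, h2, g, and x of N ₂ have no equivalent node in N ₁. Every matched node contributes max{0, 1 − 1} = 0, and every unmatched node contributes max{0, 1 − 0} = 1, so

$$d_e(N_1, N_2) = \frac{1}{2}\Big[\, \underbrace{4 \cdot \max\{0,\, 1 - 0\}}_{H1,\ H2,\ G,\ x} + \underbrace{4 \cdot \max\{0,\, 1 - 0\}}_{h1,\ h2,\ g,\ x} \,\Big] = \frac{1}{2}(4 + 4) = 4 .$$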

4. Computational Aspects

From the definition of semiequivalent nodes, whether in the same network or in two different networks, the semiequivalent nodes can be computed by a bottom-up technique. Similarly, the equivalent nodes can be computed by a top-down technique. Let N ₁ = ((V ₁, E ₁), f ₁) and N ₂ = ((V ₂, E ₂), f ₂) be two phylogenetic networks. For a pair of nodes u and v, whether in the same network or in different networks, Algorithm 1 gives pseudocode that decides whether they are semiequivalent, Algorithm 2 gives pseudocode that decides whether they are equivalent, and Algorithm 3 gives pseudocode that computes the d _e-distance for a pair of networks. Here ISE abbreviates the set of internal semiequivalent nodes, ESE the set of external semiequivalent nodes, IE the set of internal equivalent nodes, and EE the set of external equivalent nodes. If two nodes u and v from the same network are semiequivalent, then we add u to the ISE of v and add v to the ISE of u. Each such decision costs at most O(n ³) time, where n = max(|V ₁ | , |V ₂|). Since there are O(n ²) pairs of nodes, it takes O(n ⁵) time in total to find all internal and external semiequivalent nodes for every node in the two networks. In a similar way, it also takes O(n ⁵) time to find all internal and external equivalent nodes for every node in the two networks. Subsequently, we spend O(n) time evaluating formula (1). In conclusion, it costs O(n ⁵) time in total to compute the distance between two networks, where n is the larger of their node counts.

Algorithm 1 — Deciding semiequivalence for two nodes u and v.

Algorithm 2 — Deciding equivalence for two nodes u and v.

Algorithm 3 — Computing the d _e-distance for N ₁ = (V ₁, E ₁) and N ₂ = (V ₂, E ₂).
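The following sketch shows one possible way to realize the bottom-up and top-down computations described in this section. It is not a transcription of Algorithms 1 to 3: instead of deciding semiequivalence and equivalence pair by pair, it assigns every node a signature and interns the signatures in tables shared by both networks, so that nodes of N ₁ and N ₂ receive the same class id exactly when they are equivalent. The dictionary representation, the function names, and the toy networks are assumptions made for illustration.

```python
from collections import Counter

def class_multiset(children, label, semi_table, eq_table):
    """Counter: equivalence-class id -> number of nodes in the class (the value e)."""
    parents = {u: [] for u in children}
    for u, kids in children.items():
        for v in kids:
            parents[v].append(u)
    semi, eq = {}, {}

    def down(u):   # bottom-up pass: semiequivalence class
        if u not in semi:
            if children[u]:
                sig = ("node", tuple(sorted(down(v) for v in children[u])))
            else:
                sig = ("leaf", label[u])
            semi[u] = semi_table.setdefault(sig, len(semi_table))
        return semi[u]

    def up(u):     # top-down pass: equivalence class
        if u not in eq:
            sig = (down(u), tuple(sorted(up(p) for p in parents[u])))
            eq[u] = eq_table.setdefault(sig, len(eq_table))
        return eq[u]

    return Counter(up(u) for u in children)

def d_e(net1, net2):
    """net = (children, label). Evaluates formula (1) on the class multisets."""
    semi_table, eq_table = {}, {}      # shared so that ids agree across networks
    m1 = class_multiset(*net1, semi_table, eq_table)
    m2 = class_multiset(*net2, semi_table, eq_table)
    return sum(abs(m1[k] - m2[k]) for k in set(m1) | set(m2)) / 2

# Two hypothetical toy trees on the taxa {1, 2, 3}.
labels = {"1": "1", "2": "2", "3": "3"}
n1 = ({"r": ["a", "3"], "a": ["1", "2"], "1": [], "2": [], "3": []}, labels)
n2 = ({"r": ["a", "1"], "a": ["2", "3"], "1": [], "2": [], "3": []}, labels)
same, different = d_e(n1, n1), d_e(n1, n2)
print(same, different)
```

In the toy example the two roots are not semiequivalent, so no node of one tree has an equivalent node in the other and the distance reaches its maximum (5 + 5)/2 = 5, whereas a network compared with itself gives 0.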

5. Conclusion

In [11], Nakhleh introduced a polynomial-time computable m-distance on the space of reduced phylogenetic networks. In order to enlarge the space of phylogenetic networks that can be compared, we devised a polynomial-time computable d _e-distance on the space of partly reduced phylogenetic networks, which can be viewed as half of the symmetric difference of two multisets on the same set of elements. To our knowledge, this space is the largest space that has a polynomial-time computable metric. The d _e-distance is also a metric on the space of reduced phylogenetic networks, which is included in the space of partly reduced phylogenetic networks. In general, for two phylogenetic networks, their d _e-distance is larger than their m-distance. From [12], we have that the d _e-distance is also a metric on the space of tree-child phylogenetic networks, semibinary tree-sibling time consistent phylogenetic networks, and multilabeled phylogenetic trees. However, the d _e-distance is not a metric on the space of all rooted phylogenetic networks; for example, the two phylogenetic networks in Figure 4 have d _e-distance 0, but they are not isomorphic.

d _e-distance can also apply to computing the dissimilarity for other types of networks, such as spiking neural networks [18–20], which will be a direction of further research.

Acknowledgments

This work was supported by the Natural Science Foundation of Inner Mongolia province of China (2015BS0601).

Competing Interests

The author declares that they have no competing interests.

References

1.Pagel M. Inferring the historical patterns of biological evolution. Nature. 1999;401(6756):877–884. doi: 10.1038/44766. [DOI] [PubMed] [Google Scholar]
2.Wang J., Guo M., Liu X., et al. Lnetwork: an efficient and effective method for constructing phylogenetic networks. Bioinformatics. 2013;29(18):2269–2276. doi: 10.1093/bioinformatics/btt378. [DOI] [PubMed] [Google Scholar]
3.Wang J., Guo M., Xing L., Che K., Liu X., Wang C. BIMLR: a method for constructing rooted phylogenetic networks from rooted phylogenetic trees. Gene. 2013;527(1):344–351. doi: 10.1016/j.gene.2013.06.036. [DOI] [PubMed] [Google Scholar]
4.Wang J. A new algorithm to construct phylogenetic networks from trees. Genetics and Molecular Research. 2014;13(1):1456–1464. doi: 10.4238/2014.March.6.4. [DOI] [PubMed] [Google Scholar]
5.Wang J., Guo M.-Z., Xing L. L. FastJoin, an improved neighbor-joining algorithm. Genetics and Molecular Research. 2012;11(3):1909–1922. doi: 10.4238/2012.july.19.10. [DOI] [PubMed] [Google Scholar]
6.Nakhleh L., Warnow J. S. T., Linder C. R., Moret B. M., Tholse A. Towards the development of computational tools for evaluating phylogenetic network reconstruction methods. Proceedings of the 18th Pacific Symposium on Biocomputing; January 2003; Kauai, Hawaii, USA. [PubMed] [Google Scholar]
7.Moret B. M. E., Nakhleh L., Warnow T., et al. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2004;1(1):13–23. doi: 10.1109/tcbb.2004.10. [DOI] [PubMed] [Google Scholar]
8.Baroni M., Semple C., Steel M. A framework for representing reticulate evolution. Annals of Combinatorics. 2004;8(4):391–408. doi: 10.1007/s00026-004-0228-0. [DOI] [Google Scholar]
9.Cardona G., Rosselló F., Valiente G. Tripartitions do not always discriminate phylogenetic networks. Mathematical Biosciences. 2008;211(2):356–370. doi: 10.1016/j.mbs.2007.11.003. [DOI] [PubMed] [Google Scholar]
10.Cardona G., Llabrés M., Rosselló F., Valiente G. A distance metric for a class of tree-sibling phylogenetic networks. Bioinformatics. 2008;24(13):1481–1488. doi: 10.1093/bioinformatics/btn231. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Nakhleh L. A metric on the space of reduced phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010;7(2):218–222. doi: 10.1109/TCBB.2009.2. [DOI] [PubMed] [Google Scholar]
12.Cardona G., Llabrès M., Rossellò F., Valiente G. On Nakhleh's metric for reduced phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009;6(4):629–638. doi: 10.1109/TCBB.2009.33. [DOI] [PubMed] [Google Scholar]
13.Zou Q., Hu Q., Guo M., Wang G. HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics. 2015;31(15):2475–2481. doi: 10.1093/bioinformatics/btv177.
14.Zou Q., Li J., Song L., Zeng X., Wang G. Similarity computation strategies in the microRNA-disease network: a survey. Briefings in Functional Genomics. 2016;15(1):55–64. doi: 10.1093/bfgp/elv024.
15.Zou Q., Li X.-B., Jiang W.-R., Lin Z.-Y., Li G.-L., Chen K. Survey of MapReduce frame operation in bioinformatics. Briefings in Bioinformatics. 2014;15(4):637–647. doi: 10.1093/bib/bbs088.
16.Cardona G., Llabrés M., Rosselló F., Valiente G. The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete. The Scientific World Journal. 2014;2014:6. doi: 10.1155/2014/254279.
17.Booth K. S., Colbourn C. J. Problems polynomially equivalent to graph isomorphism. http://cs.uwaterloo.ca/research/tr/1977/CS-77-04.pdf.
18.Chowhan S., Kulkarni U. V., Shinde G. N. Iris recognition using modified fuzzy hypersphere neural network with different distance measures. International Journal of Advanced Computer Science and Applications. 2011;2(6). doi: 10.14569/ijacsa.2011.020619.
19.Van Schaik A. Building blocks for electronic spiking neural networks. Neural Networks. 2001;14(6-7):617–628. doi: 10.1016/S0893-6080(01)00067-3.
20.Graham B. J., Northmore D. P. M. A spiking neural network model of midbrain visuomotor mechanisms that avoids objects by estimating size and distance monocularly. Neurocomputing. 2007;70(10–12):1983–1987. doi: 10.1016/j.neucom.2006.10.102.

