Abstract
Faith’s Phylogenetic Diversity (PD) on rooted phylogenetic trees satisfies the so-called strong exchange property that guarantees that, for every two sets of leaves of different cardinalities, a leaf can always be moved from the larger set to the smaller set in such a way that the sum of the PD values does not decrease. This strong exchange property entails a simple polynomial-time greedy solution to the PD optimization problem on rooted phylogenetic trees. In this paper we obtain an exchange property for the rooted Phylogenetic Subnet Diversity (rPSD) on rooted phylogenetic networks, which involves a more complicated exchange of leaves. We derive from it a polynomial-time greedy solution to the rPSD optimization problem on rooted semibinary level-2 phylogenetic networks.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00285-024-02142-4.
Keywords: Phylogenetic network, Level-k network, Phylogenetic subnet diversity, Phylogenetic subnet diversity optimization problem
Introduction
Over the last few centuries, human activity has caused the destruction of natural habitats at an unprecedented pace, resulting in a major episode of biodiversity extinction (Kolbert 2014). Urgent action is required to combat extinction and preserve biodiversity, but there are challenges, including a lack of funding and uncertainties about conservation strategies. Consequently, there has been an increasing need to provide criteria for defining priorities and proposing variables that allow quantification of biodiversity.
The traditional approach to assessing biodiversity based on species counts, species richness, and number of endemic species has limitations. For instance, this type of data is so heterogeneous that it can be difficult to compare across different sites and times (Gaston 1996). The approach based on lists of threatened species also has its drawbacks: for example, changes in the composition of these lists may represent changes in knowledge of species status rather than changes in the status itself (Possingham et al. 2002). Finally, measures of biodiversity based solely on species have been criticized for treating all species as equal, without regard to their functional roles in the ecosystem or their evolutionary history (Faith 1992).
A feature of species that may influence their biodiversity value is their evolutionary distinctness. A species with few close living evolutionary relatives is considered more worthy of protection than a species with many close genetically and phenotypically similar relatives (McNeely et al. 1990). At the beginning of the 1990s, the qualitative value afforded to evolutionarily distinct species was replaced by quantitative measures of phylogenetic distinctness. One of the first published measures of biodiversity based on phylogenetic information was Faith’s phylogenetic diversity, PD (Faith 1992). The PD value of a set of species placed in the leaves of a phylogenetic tree is defined as the total weight (i.e., the sum of the branch lengths) of the spanning tree connecting the root and these leaves. In its original formulation, the branch lengths represented the number of changes in phenotypic characters, and PD measured the diversity of phenotypic characters in a set of species. In the current usual interpretation of phylogenetic trees, branch lengths represent evolutionary time, which is assumed to be positively correlated with character variation.
Since its introduction, PD has been widely studied and applied (Pellens and Grandcolas 2016). One of its most useful properties, both from the formal and the applicability point of view, is the possibility of efficiently finding and characterizing all subsets of species in a phylogenetic tree of a given size with maximal PD value by means of a very simple greedy algorithm (Pardi and Goldman 2005; Steel 2005); for a recent application to the analysis of SARS-CoV-2 phylogeny, see Zhukova et al. (2021). The basis of this result is the so-called strong exchange property, which states that for every pair of sets of leaves $X, X'$ with $|X'| < |X|$, we can always move a leaf from $X$ to $X'$ without decreasing the sum of the PD values.
Faith’s PD is defined on evolutionary histories modelled by means of phylogenetic trees. But phylogenetic trees can only cope with speciation events due to mutations, where each species other than the universal common ancestor has only one parent in the evolutionary history (its parent in the tree). It is clearly understood now that other speciation events, which cannot be properly represented by means of single arcs in a tree, play an important role in evolution (Doolittle 1999). These are reticulate events, like genetic recombinations, hybridizations, or lateral gene transfers, where a species is the result of the interaction between several parent species. This has led to the introduction of phylogenetic networks as models of phylogenetic histories that allow these reticulate events to be included (Huson et al. 2010). Faith’s PD has been extended to split networks (Spillner et al. 2008) and to rooted phylogenetic networks (Wicke and Fischer 2018; Bordewich et al. 2022); as a matter of fact, several generalizations to rooted phylogenetic networks have been proposed, the most natural of which is the rooted Phylogenetic Subnet Diversity, rPSD, introduced by Wicke and Fischer (2018) and renamed AllPaths-PD by Bordewich et al. (2022).
It has been proved that the PD optimization problem can be solved efficiently on circular split networks using integer programming (Chernomor et al. 2016; Spillner et al. 2008), as well as (for rPSD) on the simplest class of non-tree rooted phylogenetic networks, the so-called galled trees, by reducing it to sets of linear size of minimum-cost flow problems (Bordewich et al. 2009, 2022). It is also known that these optimization problems are in general NP-hard on rooted phylogenetic networks (Bordewich et al. 2022) and on split networks (Chernomor et al. 2016).
In this paper we focus on the extension of the greedy optimization algorithm for PD on phylogenetic trees to rPSD on rooted phylogenetic networks. As we have mentioned, the greedy algorithm on phylogenetic trees is a consequence of the strong exchange property for PD that guarantees that, given two sets of leaves of different cardinalities, we can always move some element from the larger set to the smaller one without lowering the sum of the PD values. It is easy to check that this strong exchange property for rPSD is no longer valid even on galled trees (Bordewich et al. 2022). So, our first main contribution is its generalization to rPSD through a more involved exchange of leaves than simply moving one leaf from one set to another.
Our exchange property then allows us to strengthen the result of Bordewich et al. on galled trees, by proving that every rPSD-optimal set of m leaves in a galled tree is always obtained from an rPSD-optimal set of $m-1$ leaves by either optimally adding a leaf or optimally replacing a leaf by a pair of leaves. It also allows us to give polynomial-time greedy solutions for the rPSD problem on semibinary level-2 networks and semi-3-ary level-1 networks, the next complexity level of rooted phylogenetic networks (see §2.1 for the definitions). On the negative side, we have not been able to deduce from it a greedy algorithm for semibinary level-3 or semi-4-ary level-1 networks, and the problem for these more general classes remains open.
This paper is organized as follows. In Sect. 2 we define the concepts necessary to understand this work, including a generalization of the Phylogenetic Diversity due to Wicke and Fischer (2018), together with its properties and an example. Section 3 contains the main result of this manuscript, Theorem 1, and Sect. 4 presents some of its applications to galled trees and to semi-d-ary level-k networks, for particular instances of d and k. We end in Sect. 5 with some concluding remarks. The proof of Theorem 1, together with two required lemmas, can be found in the Appendix, and proofs of additional results can be found in the Supplementary Material.
Preliminaries
Phylogenetic networks
Let $\Sigma$ be a finite set of labels. By a phylogenetic network on $\Sigma$ we understand a rooted directed acyclic simple graph where each node of in-degree $\geq 2$ has out-degree exactly 1 and whose leaves (i.e., its nodes of out-degree 0) are bijectively labeled by $\Sigma$ (Huson et al. 2010). A phylogenetic tree is simply a phylogenetic network without nodes of in-degree $\geq 2$. Let us point out here that, although the usual definition of phylogenetic tree and network forbids, for reconstructibility reasons, the existence of elementary nodes, that is, of nodes of in-degree $\leq 1$ and out-degree 1, we shall allow their existence in order to simplify some statements and proofs.
Let N be a phylogenetic network. We shall denote its root (i.e., its only node of in-degree 0) by r and its sets of nodes and arcs by V(N) and E(N), respectively, and we shall always identify its leaves with their corresponding labels. Given two nodes u, v in N, we say that v is a child of u, and also that u is a parent of v, when $(u, v) \in E(N)$. A node in N is of tree type, or a tree node, when its in-degree is $\leq 1$, and a reticulation when its in-degree is $\geq 2$ (and hence, its out-degree is 1). We shall say that N is semi-d-ary when all its reticulations have in-degree $\leq d$ (semibinary, when $d = 2$), and that N is binary when it is semibinary and all its internal tree nodes have out-degree 2.
We shall denote a (directed) path in N from a node u to a node v by . The intermediate nodes of a path are the nodes involved in it other than u and v. For every , we say that v is a descendant of u, and also that u is an ancestor of v, when there exists a path , and that v is a descendant of an arc when it is a descendant of its end u. In particular, every node is an ancestor, and a descendant, of itself. If v is a descendant of u and , we shall say that it is a proper descendant of u. A set of nodes is independent when no node in it is a proper descendant of any other node in it.
For every , its cluster (or simply C(v) when N is clear from the context), is the set of (labels of) the descendant leaves of v, and the subnetwork of N rooted at v is the subgraph of N induced by the set of all descendants of v. is a phylogenetic network on C(v) with root v.
For every , we shall denote the set of all nodes in N that are ancestors of nodes in X by . Given an arc , we shall make the abuse of notation of writing to mean that e has some descendant in X, that is, that .
A subgraph of a phylogenetic network N is biconnected when it is connected (as an undirected graph) and it remains connected after removing any node from it together with all arcs incident to this node. Every node and every arc in N are biconnected subgraphs. A biconnected component of N is a maximal biconnected subgraph, and we shall call a biconnected component with more than 2 nodes a blob. Every blob has one, and only one, node that is an ancestor of all its nodes; we call it its split node. Every node in a blob with no child inside is a reticulation (should it be of tree type, removing its parent would disconnect ); we call such reticulations the exit reticulations of , and the rest of its reticulations, internal. Every node in has some descendant exit reticulation.
A phylogenetic network is level-k (Jansson and Sung 2006) when every biconnected component contains at most k reticulations. Thus, a level-0 network is a phylogenetic tree. A semibinary level-1 network is also called a galled tree (Gusfield et al. 2004); the phylogenetic network in Fig. 1 is a galled tree.
A phylogenetic network N is weighted when it is endowed with a weight mapping . The total weight of a subgraph of a weighted phylogenetic network is the sum of the weights of all arcs in the subgraph. In particular, the weight of a path is the sum of the weights of its arcs. All phylogenetic networks (and trees) appearing from now on in this paper are assumed to be weighted, usually without any further notice.
The rooted phylogenetic diversity on phylogenetic trees
Given a finite set $\Sigma$, we shall denote henceforth its set of subsets by $\mathcal{P}(\Sigma)$ and, for every $k \geq 0$, the set of all its subsets of cardinality k by $\mathcal{P}_k(\Sigma)$.
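Given a weighted phylogenetic tree T on $\Sigma$, Faith’s rooted Phylogenetic Diversity (Faith 1992) is the set function $\mathrm{PD}_T$ sending each $X \subseteq \Sigma$ to the total weight of the subtree induced by the ancestors of nodes in X:
$$\mathrm{PD}_T(X) = \sum_{e \in E(T)\,:\, e \text{ has a descendant in } X} w(e),$$
where $w(e)$ denotes the weight of the arc $e$.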
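This function on phylogenetic trees satisfies the following strong exchange property, introduced by Steel (2005) for unrooted phylogenetic trees: for every phylogenetic tree T on $\Sigma$ and for every $X, X' \subseteq \Sigma$ such that $|X'| < |X|$, there exists some $x \in X \setminus X'$ such that
$$\mathrm{PD}_T(X \setminus \{x\}) + \mathrm{PD}_T(X' \cup \{x\}) \geq \mathrm{PD}_T(X) + \mathrm{PD}_T(X').$$
For a proof of this fact in the rooted case, see Steel (2016, §6.4.1).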
This strong exchange property for $\mathrm{PD}_T$ is the key ingredient in the proof that the simple Algorithm 1 given below produces, for every $k \leq |\Sigma|$, the family of all $\mathrm{PD}_T$-optimal subsets of $\Sigma$ of cardinality k, that is, of all sets of k leaves with maximum $\mathrm{PD}_T$ value. For this proof in the unrooted case, see Steel (2005); the proof in the rooted case is similar: cf. §6.4.1 in Steel (2016). In particular, given a phylogenetic tree T on $\Sigma$, this algorithm provides a polynomial solution to the problem of finding the maximum $\mathrm{PD}_T$ value among all subsets of $\Sigma$ of cardinality k, and a subset of that cardinality reaching this maximum.
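The greedy construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 1: the tree, its weights, and the names `PARENT`, `LEAVES`, `pd` and `greedy_pd_families` are hypothetical. Each family is built from the previous one by adding a single leaf that maximizes the PD value.

```python
# Hypothetical rooted tree, stored as child -> (parent, weight of the arc parent -> child).
PARENT = {
    "u": ("r", 2),
    "a": ("u", 1),
    "b": ("u", 3),
    "c": ("r", 4),
}
LEAVES = ("a", "b", "c")


def pd(leaf_set):
    """Total weight of the arcs lying on a path from the root to some leaf of leaf_set."""
    used = set()  # nodes whose incoming arc has already been counted
    for leaf in leaf_set:
        node = leaf
        # Climb towards the root; stop early when the rest of the path is already counted.
        while node in PARENT and node not in used:
            used.add(node)
            node = PARENT[node][0]
    return sum(PARENT[node][1] for node in used)


def greedy_pd_families(leaves, value):
    """families[k] holds the best k-subsets obtained by adding one leaf to a member of families[k-1]."""
    families = [{frozenset()}]
    for _ in range(len(leaves)):
        candidates = {x | {leaf} for x in families[-1] for leaf in leaves if leaf not in x}
        best = max(value(c) for c in candidates)
        families.append({c for c in candidates if value(c) == best})
    return families


families = greedy_pd_families(LEAVES, pd)
```

The families themselves can be exponentially large when many ties occur, so in practice one may prefer to keep a single member per cardinality; the sketch keeps all of them only to mirror the statement about the family of all optimal sets.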
The rooted phylogenetic subnet diversity
Wicke and Fischer (2018) proposed several generalizations of Faith’s rooted Phylogenetic Diversity function to phylogenetic networks. One of them, and possibly the most straightforward, is the rooted Phylogenetic Subnet Diversity: the set function $\mathrm{rPSD}_N$ sending each $X \subseteq \Sigma$ to the total weight of the subgraph induced by the ancestors of nodes in X:
$$\mathrm{rPSD}_N(X) = \sum_{e \in E(N)\,:\, e \text{ has a descendant in } X} w(e),$$
where $w(e)$ denotes the weight of the arc $e$.
It is clear that if N is a phylogenetic tree, then $\mathrm{rPSD}_N = \mathrm{PD}_N$. When N is clear from the context, we shall omit the subscript N and simply write $\mathrm{rPSD}$.
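As an illustration of this definition, the following Python sketch computes rPSD on a small weighted galled tree. The network, its weights, and the names `ARCS`, `ancestors` and `rpsd` are assumptions made for this example; it is not the network of Fig. 1. The computation collects all ancestors of the chosen leaves and adds up the weights of the arcs whose end is one of them.

```python
# Hypothetical weighted galled tree: the root r has children a and b,
# which are the two parents of the reticulation h; the leaves are 1, 2 and 3.
ARCS = {
    ("r", "a"): 10, ("r", "b"): 10,
    ("a", "h"): 1, ("b", "h"): 1,
    ("a", "1"): 5, ("b", "2"): 5, ("h", "3"): 1,
}


def ancestors(nodes):
    """All nodes having some descendant in `nodes` (every node is an ancestor of itself)."""
    parents = {}
    for (u, v) in ARCS:
        parents.setdefault(v, []).append(u)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def rpsd(leaf_set):
    """Total weight of the subgraph induced by the ancestors of leaf_set."""
    up = ancestors(leaf_set)
    # An arc (u, v) has a descendant in leaf_set exactly when its end v is an ancestor of the set.
    return sum(w for (u, v), w in ARCS.items() if v in up)


values = {name: rpsd(set(name)) for name in ("1", "3", "12", "13", "123")}
```

In this toy network the leaf below the reticulation is reached through both parents, so its rPSD value counts both paths from the root.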
Example 1
On the phylogenetic network N depicted in Fig. 1,
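For every phylogenetic network N on $\Sigma$, $\mathrm{rPSD}_N$ is:
(i) Monotone nondecreasing: For every $X \subseteq X' \subseteq \Sigma$, $\mathrm{rPSD}_N(X) \leq \mathrm{rPSD}_N(X')$.
(ii) Subadditive: For every $X, X' \subseteq \Sigma$, $\mathrm{rPSD}_N(X \cup X') \leq \mathrm{rPSD}_N(X) + \mathrm{rPSD}_N(X')$.
(iii) Submodular: For every $X \subseteq X' \subseteq \Sigma$ and for every $x \in \Sigma \setminus X'$, $\mathrm{rPSD}_N(X' \cup \{x\}) - \mathrm{rPSD}_N(X') \leq \mathrm{rPSD}_N(X \cup \{x\}) - \mathrm{rPSD}_N(X)$.
(i) and (ii) are clear. As to (iii), it is proved by Bordewich et al. (2022).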
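The three properties can be checked by exhaustive enumeration on small examples. The sketch below does so for a table of values `W`, which is assumed here to list the rPSD values of a hypothetical three-leaf galled tree; the table and the function names are illustrative only, and submodularity is tested in its diminishing-returns form.

```python
from itertools import combinations

LABELS = ("1", "2", "3")
# Hypothetical rPSD values of a small weighted galled tree, indexed by leaf set.
W = {
    frozenset(): 0,
    frozenset({"1"}): 15, frozenset({"2"}): 15, frozenset({"3"}): 23,
    frozenset({"1", "2"}): 30, frozenset({"1", "3"}): 28, frozenset({"2", "3"}): 28,
    frozenset({"1", "2", "3"}): 33,
}


def subsets(labels):
    """All subsets of labels, as frozensets."""
    return [frozenset(c) for k in range(len(labels) + 1) for c in combinations(labels, k)]


def is_monotone(w, labels):
    return all(w[x] <= w[y] for x in subsets(labels) for y in subsets(labels) if x <= y)


def is_subadditive(w, labels):
    return all(w[x | y] <= w[x] + w[y] for x in subsets(labels) for y in subsets(labels))


def is_submodular(w, labels):
    # Diminishing returns: a new leaf gains at least as much on the smaller set x as on y.
    return all(
        w[x | {s}] - w[x] >= w[y | {s}] - w[y]
        for y in subsets(labels)
        for x in subsets(labels) if x <= y
        for s in labels if s not in y
    )


checks = (is_monotone(W, LABELS), is_subadditive(W, LABELS), is_submodular(W, LABELS))
```

Such a check visits all pairs of subsets, so it is only practical for a handful of leaves.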
On the negative side, need not satisfy the strong exchange property, even for the simplest non-tree networks N. Indeed, consider again the binary galled tree N depicted in Fig. 1. Take and . Then
Therefore, there is no such that
As a consequence, an -optimal set of cardinality k of a phylogenetic network N need not contain any -optimal set of cardinality . Consider again the galled tree depicted in Fig. 1. Its only set of two labels with largest value is and its only set of three labels with largest value is .
So, Algorithm 1 cannot be used to produce -optimal sets of a given cardinality as it stands. Actually, Bordewich et al. (2022) prove that, given a phylogenetic network N on and an integer k, the problem of finding the maximum value on is NP-hard. On the positive side, these authors also prove that this problem can be solved in polynomial time on binary galled trees.
A general exchange property
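Let $\Sigma$ be a finite set and $W$ a real-valued function on the set of subsets of $\Sigma$. Given $X, X' \subseteq \Sigma$ such that $|X'| < |X|$, a W-improving pair for $(X, X')$ is a pair of sets (A, B), with $A \subseteq X \setminus X'$, $B \subseteq X' \setminus X$, and $|B| < |A|$, such that
$$W\big((X \setminus A) \cup B\big) + W\big((X' \setminus B) \cup A\big) \geq W(X) + W(X').$$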
To simplify the notation, given , and , we shall denote henceforth by .
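To make the definition of a W-improving pair concrete, the following Python sketch enumerates all such pairs for a given pair of sets by brute force. The table `W` is a hypothetical example (values of rPSD on an assumed three-leaf galled tree), and `improving_pairs` is an illustrative name, not part of any library.

```python
from itertools import combinations

# Hypothetical values of W, indexed by leaf set.
W = {
    frozenset(): 0,
    frozenset({"1"}): 15, frozenset({"2"}): 15, frozenset({"3"}): 23,
    frozenset({"1", "2"}): 30, frozenset({"1", "3"}): 28, frozenset({"2", "3"}): 28,
    frozenset({"1", "2", "3"}): 33,
}


def improving_pairs(w, x, x_prime):
    """All pairs (A, B) with A in x - x_prime, B in x_prime - x and |B| < |A|
    whose exchange does not decrease the sum of the two values."""
    pairs = []
    for a_size in range(1, len(x - x_prime) + 1):
        for a in combinations(sorted(x - x_prime), a_size):
            for b_size in range(a_size):  # enforces |B| < |A|
                for b in combinations(sorted(x_prime - x), b_size):
                    a_set, b_set = frozenset(a), frozenset(b)
                    new_x = (x - a_set) | b_set
                    new_x_prime = (x_prime - b_set) | a_set
                    if w[new_x] + w[new_x_prime] >= w[x] + w[x_prime]:
                        pairs.append((a_set, b_set))
    return pairs


X, X_PRIME = frozenset({"1", "2"}), frozenset({"3"})
found = improving_pairs(W, X, X_PRIME)
```

With these assumed values, no single leaf can be moved from X to X' without lowering the sum, and the only improving pair exchanges both leaves of X for the leaf of X'. This is the kind of behaviour that motivates exchanges more involved than moving one leaf.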
Given a set
we shall say that satisfies the exchange property with respect to when every pair of sets with has a W-improving pair in . So, Steel’s strong exchange property for phylogenetic trees mentioned in §2.2 says that, for every phylogenetic tree T on , satisfies the exchange property with respect to
As we have seen, this is no longer true for on galled trees. The main result in this paper, Theorem 1, says that satisfies, on every semi-d-ary level-k phylogenetic network on , the exchange property with respect to a larger family of pairs of subsets whose description only depends on k and d. These families are, when ,
and, when ,
From now on, when the set of labels need not be made explicit, we shall omit it from the notation of these families.
Given k and d, the cardinalities of these families of sets are polynomial in : and
As we announced above, the main result in this section is the following theorem. Since its proof is quite long and technical, in order not to lose the thread of the manuscript we postpone it until Appendix A at the end of the paper.
Theorem 1
If N is a semi-d-ary level-k phylogenetic network, satisfies the exchange property with respect to .
The family cannot be improved, because there are semi-d-ary level-k phylogenetic networks N and pairs of sets of leaves with having no -improving pair (A, B) with . The next example describes one such network for ; it is straightforward to generalize it to the semi-d-ary setting for any
Example 2
Consider the binary level-k phylogenetic network N on depicted in Fig. 2. Assume that all its arcs e have weight .
Let and . Let us check that, for every (A, B) such that , , and ,
and that the equality holds only when . This will imply that the only -improving pair for in is itself.
Let:
be the arcs in ; that is, and those beginning in .
; that is, the arcs ending in .
Then,
Now, on the one hand, if and
and then
because for every the arc (or if ) does not belong to and therefore .
On the other hand, if ,
and then
where, arguing as above, the inequality is an equality exactly when .
We close this section with a refinement of Theorem 1 for level-1 networks. The proof is similar, and we provide it in Section 2 of the Supplementary file.
Corollary 1
If N is a semi-d-ary level-1 phylogenetic network on , satisfies the exchange property with respect to
Moreover, if have an improving pair , then there exists a blob in N with exit reticulation H and split node v such that , , and .
Applications
In this section we apply Theorem 1 to the study of -optimal subsets for low values of the level of N and the in-degree of its reticulations. Throughout this section, let N be a phylogenetic network on a set of cardinality n and . We shall use the following notation:
- For every m, let be the family of -optimal subsets of of cardinality m:
An optimal sequence of N is a sequence with each .
- For every and , for every , and for every ,
- is the family of subsets of of cardinality of the form (that is, with , , , and ).
- are the members of with largest value.
- is the family of subsets of of cardinality of the form (that is, with , , , and ).
- are the members of with largest value.
- Finally, for every and , for every , we describe the family of subsets of of cardinality (resp. ) of the form (resp. ) with , , with largest value obtained from each :
- .
- .
We begin with galled trees. As we have already mentioned, it was proved in Bordewich et al. (2022, Cor 4.6) that the optimization problem for can be solved in polynomial time on galled trees. The next proposition strengthens this result by providing a recursive construction of the -optimal sets for these networks.
Proposition 1
Let N be a galled tree. Then, for every ,
Proof
Let and . By Theorem 1, there exists some , with and , such that
(1)
Since , we have that and , and then, being and optimal in and , respectively,
(2)
Combining these inequalities with (1) we obtain
Then, the inequalities (2) must be equalities, from which we deduce that:
, and thus .
, and thus .
Since the choice of the optimal sets was arbitrary, we conclude that
as stated.
Remark 1
Notice that along the proof of the previous proposition we have proved that, in a galled tree, for every and , there exists some pair , with and , such that and .
Proposition 1 implies that, on a galled tree, the members of are those obtained from members of by either optimally adding a leaf or optimally replacing a leaf by two leaves. This result yields the simple greedy polynomial time Algorithm 2 computing the family of optimal sets in increasing order of m that extends the greedy Algorithm 1 for phylogenetic trees.
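The recursive construction described in this paragraph can be sketched as follows. This is a hedged illustration and not a transcription of Algorithm 2: the function name `galled_tree_greedy` and the table `W` (values of rPSD on an assumed three-leaf galled tree) are hypothetical. Each family is obtained from the previous one by keeping the best sets among those produced by adding a leaf or by replacing one leaf by two.

```python
from itertools import combinations

# Hypothetical rPSD values of a small weighted galled tree, indexed by leaf set.
W = {
    frozenset(): 0,
    frozenset({"1"}): 15, frozenset({"2"}): 15, frozenset({"3"}): 23,
    frozenset({"1", "2"}): 30, frozenset({"1", "3"}): 28, frozenset({"2", "3"}): 28,
    frozenset({"1", "2", "3"}): 33,
}


def galled_tree_greedy(labels, value):
    """families[m] = best m-subsets reachable from families[m-1] by the two moves below."""
    labels = frozenset(labels)
    families = [{frozenset()}]
    for m in range(1, len(labels) + 1):
        candidates = set()
        for y in families[m - 1]:
            outside = sorted(labels - y)
            # Move 1: add one leaf.
            candidates.update(y | {s} for s in outside)
            # Move 2: replace one leaf of y by a pair of leaves outside y.
            for removed in y:
                for pair in combinations(outside, 2):
                    candidates.add((y - {removed}) | frozenset(pair))
        best = max(value(c) for c in candidates)
        families.append({c for c in candidates if value(c) == best})
    return families


families = galled_tree_greedy(("1", "2", "3"), W.__getitem__)
```

In this assumed example the optimal pair of leaves does not contain the optimal single leaf, so the second move is needed to reach it.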
Remark 2
Proposition 1 also implies that, on a galled tree, the members of each are obtained from members of by removing a leaf or replacing a pair of leaves by a leaf in such a way that the value of decreases the least.
To move up in the complexity ladder of phylogenetic networks, it is convenient to introduce a notation that allows a more compact description of the arguments of the type used in the previous proposition. Given a semi-d-ary level-k phylogenetic network N and an optimal sequence of it, we shall write, for every and for every ,
to mean that there exists an -improving pair for and such that . When we need to emphasize an improving pair (A, B), we shall write “ by an improving pair (A, B)”. In addition, we shall write to mean that and .
Remark 3
By Theorem 1, given any optimal sequence Y of a semi-d-ary level-k phylogenetic network and , there always exists some such that .
The proof of the next lemma, which we leave to the reader, is similar to that of Proposition 1; actually, that proposition is a direct consequence of this lemma for .
Lemma 1
Let N be a phylogenetic network and Y an optimal sequence of N. If and , then and .
In particular, if and , then and .
Corollary 2
Let N be a phylogenetic network and Y an optimal sequence of N. If there exists a closed -chain of length
then, for each ,
Proof
The closed chain ensures that all the inequalities in
are equalities, and the result follows by applying Lemma 1 to each
It is time to move one step up in the complexity ladder of phylogenetic networks. Recall that
and in particular, for every , .
Proposition 2
If N is a semibinary level-2 or a semi-3-ary level-1 network, then:
for every .
for every .
Proof
Let Y be an optimal sequence of N and fix . Then, by Theorem 1,
(3)
for some or .
Thus, in both cases we have that
which, by the arbitrary choice of Y and m, concludes the proof.
Point (a) in the last proposition tells us that if N is semibinary level-2 or semi-3-ary level-1, all members of each are obtained either from members of by optimally adding a leaf, optimally replacing a leaf by a pair of leaves, or optimally replacing a pair of leaves by a triple of leaves (this possibility need not be considered in the semi-3-ary level-1 case by Corollary 1), or from members of by optimally replacing a leaf by a triple of leaves. This proves the correctness of the polynomial time greedy Algorithm 3 to compute the family of optimal sets for such a network N in increasing order of m (as we have mentioned, if N is semi-3-ary level-1, the sets in the loop need not be computed).
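The four kinds of moves listed in this paragraph can be turned into a greedy loop along the following lines. The sketch is an assumption-laden illustration, not a transcription of Algorithm 3: the network, its weights, and the names `ARCS`, `MOVES`, `rpsd` and `greedy_families` are hypothetical, and the network is a small semi-3-ary level-1 example rather than one of those in Fig. 3.

```python
from itertools import combinations

# Hypothetical semi-3-ary level-1 network: the root r has children a, b, c,
# and all three are parents of the reticulation h; the leaves are 1, 2, 3, 4.
ARCS = {
    ("r", "a"): 10, ("r", "b"): 10, ("r", "c"): 10,
    ("a", "h"): 1, ("b", "h"): 1, ("c", "h"): 1,
    ("a", "1"): 5, ("b", "2"): 5, ("c", "3"): 5, ("h", "4"): 1,
}
LABELS = ("1", "2", "3", "4")

# (how many cardinalities back, leaves removed, leaves added):
# add a leaf, replace 1 by 2, replace 2 by 3 (from size m-1), replace 1 by 3 (from size m-2).
MOVES = [(1, 0, 1), (1, 1, 2), (1, 2, 3), (2, 1, 3)]


def rpsd(leaf_set):
    """Total weight of the arcs that have some descendant in leaf_set."""
    up, stack = set(leaf_set), list(leaf_set)
    while stack:
        node = stack.pop()
        for (u, v) in ARCS:
            if v == node and u not in up:
                up.add(u)
                stack.append(u)
    return sum(w for (u, v), w in ARCS.items() if v in up)


def greedy_families(labels, value, moves=MOVES):
    """families[m] = best m-subsets reachable from earlier families by the allowed moves."""
    labels = frozenset(labels)
    families = [{frozenset()}]
    for m in range(1, len(labels) + 1):
        candidates = set()
        for back, n_removed, n_added in moves:
            if back > m:
                continue
            for y in families[m - back]:
                for removed in combinations(sorted(y), n_removed):
                    for added in combinations(sorted(labels - y), n_added):
                        candidates.add((y - frozenset(removed)) | frozenset(added))
        best = max(value(c) for c in candidates)
        families.append({c for c in candidates if value(c) == best})
    return families


families = greedy_families(LABELS, rpsd)
```

In this toy network the best single leaf is the one below the reticulation, whereas the best triple avoids it, so the optimal triple is only reached through a replacement move. For a fixed list of moves, each step examines a number of candidates that is polynomial in the number of leaves for each member of the previous families.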
Example 3
Consider the phylogenetic networks in Fig. 3. The network on the left is semi-3-ary level-1, and the one on the right is a semibinary level-2 network obtained by blowing up each reticulation of the left-hand network into a pair of connected reticulations of in-degree 2. In both networks, we have the following optimal sets of leaves:
Then, in both networks,
Now, if we move one more step further in the complexity ladder, the structure of the optimal sets is no longer so simple.
Proposition 3
If N is a semibinary level-3 or a semi-4-ary level-1 network, then, for every , at least one of the following assertions is true:
and .
,
where (k, d) is (3, 2) or (1, 4), depending on the type of network.
Proof
To begin with, notice that
and therefore . To simplify the notation, we shall abbreviate by simply . Observe that j can only go from 1 to 3.
Let Y be an optimal sequence of N and fix . To ease the task of the reader, we sketch the flow of the proof in Fig. 4; all implications leading to (a) or (b) are due to Cor. 2.
By Theorem 1,
(4)
for some .
If , then and we conclude as in (1) in the proof of Proposition 2 that and .
- If , then . Applying Theorem 1 again,
for some .- If or , and we conclude as in (2) in the proof of Proposition 2 that and .
- When , we have and we can only deduce that and .
Summarizing, we only have two possibilities:
- On the one hand, in the cases (1), (2.a), (3.a.i), and (3.b),
- On the other hand, in the cases (2.b) and (3.a.ii),
By the arbitrary choice of Y and m, this concludes the proof.
A similar result holds for (k, d) such that . We give its proof in Section 3 of the Supplementary file.
Proposition 4
If N is a semi-5-ary level-1 or a semi-3-ary level-2 network, then, for every , at least one of the following assertions is true:
and .
,
where or (1, 5), depending on the type of network.
So, while we could give a greedy optimization algorithm for semibinary level-2 networks or semi-3-ary level-1 networks, an analogous argument fails for more complex networks. The reason why Propositions 3 and 4 are not sufficient to provide such a greedy algorithm is that we would require their assertion (a), or a similar expression, to be true for all m. For any m where only assertion (b) holds, we do not have enough information about to be able to ensure that it can be obtained from previous optimal sets.
Remark 4
A close analysis of the proof of Proposition 3, using Corollary 2 in its full strength, shows that we actually have a more general result: for every optimal sequence Y of N and for every , at least one of the following conditions holds (the labels correspond to the cases in the proof):
and .
- , , and
- and , or
- and .
and .
- , , , , and
- and , or
- and .
and .
- , , and
- and , or
- and .
Unfortunately, the extra information obtained in this way is still not enough to prove the correctness of a greedy -optimization algorithm for the networks considered in that proposition. A similar situation appears in the context of Proposition 4.
But we must point out that we have not been able to find any semibinary level-3 or any semi-4-ary level-1 network for which for some m. Similarly, we have not been able to find any semi-5-ary level-1 or any semi-3-ary level-2 network for which for some m. So, it might be possible that the greedy algorithm also works in these cases, since we have not discovered a counterexample that disproves its correctness for these types of networks. In Section 4 of the Supplementary file we provide several examples that illustrate our search for a counterexample. More examples can be found in the second author’s PhD Thesis (Riera 2023).
Conclusions
PD on phylogenetic trees satisfies the strong exchange property that guarantees that, for every two sets of leaves of different cardinalities, a leaf can always be moved from the larger set to the smaller one without decreasing the sum of the PD values. But rPSD no longer satisfies this exchange property, even for galled trees. In this paper we have generalized this exchange property to rPSD on phylogenetic networks of bounded level and bounded in-degree of reticulations, showing that a similar result holds if we allow more involved exchanges of subsets of leaves. Our final goal was to use this generalized exchange property to find a polynomial-time greedy algorithm for the optimization of rPSD on phylogenetic networks of bounded level and in-degree of reticulations. We have only partially achieved this goal. We have shown that the generalized exchange property entails such a greedy algorithm for semibinary level-2 networks and semi-3-ary level-1 networks (and sheds new light on the structure of the families of rPSD-optimal sets on galled trees), but it cannot be used, as it stands, to obtain such an algorithm on more complex networks. However, we have not been able to find examples of semibinary level-3 networks or semi-4-ary level-1 networks where the greedy algorithm fails: it is simply that the generalized exchange property alone seems not to be enough to prove its correctness.
Finally, it is important to point out that just like the optimization problem itself, testing counterexamples is computationally expensive, too. While the greedy algorithm runs in polynomial time, finding whether can be obtained from some or not still requires calculating by brute force, and testing whether the exchange property holds for a certain subset of where also requires testing all subsets . All these operations are exponential, hence trying even slightly larger examples can dramatically increase the runtime of the test.
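The cost of the brute-force computations mentioned above can be made tangible with a short sketch. The function below finds the optimal sets of a given cardinality by evaluating every subset of that size; the names `brute_force_optimal_family`, `PENDANT` and `value` are hypothetical, and the value function is an assumed toy example (a star tree hanging from a stem of weight 2), not one of the networks studied in the paper.

```python
from itertools import combinations

# Hypothetical value function: a stem of weight 2 shared by all leaves,
# plus one pendant arc per leaf with the weights below.
PENDANT = {"a": 3, "b": 1, "c": 4, "d": 1, "e": 5}


def value(leaf_set):
    return (2 if leaf_set else 0) + sum(PENDANT[s] for s in leaf_set)


def brute_force_optimal_family(labels, m, value):
    """Return the optimal m-subsets and the number of subsets that had to be evaluated."""
    evaluated = [frozenset(c) for c in combinations(sorted(labels), m)]
    best = max(value(x) for x in evaluated)
    return {x for x in evaluated if value(x) == best}, len(evaluated)


family, n_evaluated = brute_force_optimal_family(PENDANT, 2, value)
```

The number of evaluated subsets is the binomial coefficient of the number of leaves and m, which is why exhaustive checks of this kind quickly become impractical as the number of leaves grows.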
Acknowledgements
This research was partially supported by the grants PID2021-126114NB-C44 and PGC2018-096956-B-C43, funded by MCIU/AEI/10.13039/501100011033 and by “ERDF/EU.”
Appendix A: Proof of Theorem 1
We begin by stating two auxiliary lemmas. From now on, we shall call a semi-d-ary k-blob any blob with k reticulations, all of them of in-degree $\leq d$. Given such a semi-d-ary k-blob and a non-empty subset of its exit reticulations, the first lemma provides a sharp upper bound for the cardinality of any independent set of nodes V of the blob whose members have no descendant exit reticulation outside this subset. This bound will entail the bound for the cardinality of A in the definition of the families of pairs used in Theorem 1. We give the proof of this lemma in Section 1 of the Supplementary file.
Lemma 2
Let be a semi-d-ary k-blob with l exit reticulations and a non-empty subset of its exit reticulations of cardinality . Then, for every independent set of nodes V of without descendant exit reticulations outside ,
The constructions explained in the proof of this lemma easily show that the bound it provides is sharp, in the sense that, for every with and , there are semi-d-ary k-blobs with l exit reticulations and subsets of exit reticulations containing an independent set of nodes V without descendant exit reticulations outside of cardinality : cf. Fig. 5.
Remark 5
By Lemma 2, if is a semi-d-ary blob without internal reticulations, if is a subset of its exit reticulations of cardinality , and if V is an independent set of nodes in without descendant exit reticulations outside , then . A close analysis of the proof of that lemma easily shows that the upper bound is achieved when all the reticulations in have in-degree d and the set V contains, for every , exactly d nodes whose only reticulate descendant is H. Of course, such sets do not always exist: for instance, when contains a node that is a parent of two different exit reticulations.
The second auxiliary lemma extracts a key technical step in the proof of Theorem 1. This lemma provides an analog of the exchange property for sets of ancestors of multisets of nodes of a blob. More precisely, we prove that if are multisets of nodes of a semi-d-ary k-blob with and satisfying some extra conditions (those under which we shall apply the lemma in the proof of the main theorem) then there exist a subset A of X disjoint with and a submultiset B of disjoint with X whose cardinalities satisfy the restrictions defining the family and such that if we replace X and by and , the set of nodes that are simultaneously ancestors of nodes in both sets does not decrease.
We use in this lemma some standard notation for multisets X: denotes the multiplicity of an element v in X; denotes the support of X, that is, the set of elements v such that ; we say that X is a set when all its multiplicities are , and then we identify it with its support; a submultiset Y of X is full when for every ; and the cardinality of X is . We shall also use the notation when X, S, T are multisets with and .
This lemma also uses some basic properties of -notation. Some simple results in this regard are that, for any two sets A, B, and , and that if , then . Moreover, given a multiset A, we define as , without taking into account the multiplicities of the elements of A.
Lemma 3
Let be a semi-d-ary k-blob and two multisets of nodes of with and satisfying the following two further conditions:
-
(i)
For each , if , then and .
-
(ii)
Each exit reticulation of belongs to X or .
Then
(5)
for some set and some full submultiset B of with such that and , or , or and .
Proof
First, we introduce some notation.
- Let be the set of exit reticulations of , and let
Let , and . By (ii), . For each , let be the set of nodes whose only descendant exit reticulation is H. Since every node in has some descendant exit reticulation, . Observe that if .
Let be the set . By (i), .
Let be the full submultiset of supported on .
The inequality implies that , too. Indeed:
(6)
We shall consider three cases; in all of them we shall choose a subset and a full submultiset satisfying the requirements in the statement and we shall prove that they satisfy Eqn. (5).
(a) If there exists some with a proper descendant in X, then and hence . In this case, taking and we have that
(b) Assume that no has any proper descendant in X and that . This implies that and that , as any would have some proper descendant in .
In this case, there exists an such that . Indeed, assume that for every there existed some node without any descendant in . Then, each would belong to . Since the sets are pairwise disjoint, the nodes would be pairwise different, forming a subset of of cardinality , which cannot exist because .
Take then and . If we can prove that , then we will have
So, let . There are two possibilities:
- If v has some descendant in , then the latter will have a descendant in , which will also be a descendant of v.
- If v has no descendant in , then
(c) Assume finally that and that no has any proper descendant in X. This last condition implies that the set of nodes is independent and all their descendant exit reticulations belong to . Then, by Lemma 2 we have that
(7)
In particular,
(8)
Now, on the one hand, if , take and . By Eqns. (6) and (8), they satisfy the required conditions in the statement, and
which implies .
On the other hand, if , then all inequalities in the sequence (7) as well as the inequality are equalities. The equality implies that the blob has no reticulation other than those in . Moreover, since the first inequality in (7) is an equality, reaches the maximum number of possible independent nodes in . Then, as noted in Remark 5, it must happen for each that and .
Now, since , there must exist some with . Take and B the multiset with and . We have that and hence the pair (A, B) satisfies the requirements in the statement. As to Eqn. (5), notice that
Now, and, by assumption, the elements of A have no proper descendant in X, which implies
Moreover, since , we have that . Therefore
as we wanted to prove.
Theorem 1
If N is a semi-d-ary level-k phylogenetic network, satisfies the exchange property with respect to .
Proof
The case is Steel’s strong exchange property for phylogenetic trees (Steel 2016, §6.4.1). So, we shall focus on the case .
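For intuition about the tree case, the strong exchange property can be checked by brute force on a small example. The sketch below is only an illustration under stated assumptions: the weighted rooted tree `ARCS`, its leaf set `LEAVES`, and the helper names are hypothetical, and the property is taken in the form given in the abstract (a leaf can be moved from the larger set to the smaller one without decreasing the sum of the two PD values).

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, arc weight). The root "r" has no entry.
ARCS = {
    "u": ("r", 1),
    "v": ("r", 2),
    "a": ("u", 3),
    "b": ("u", 1),
    "c": ("v", 2),
    "d": ("v", 4),
}
LEAVES = ["a", "b", "c", "d"]


def pd(leaf_set):
    """Rooted PD: total weight of the arcs on root-to-leaf paths for leaf_set."""
    used = set()
    for leaf in leaf_set:
        node = leaf
        while node in ARCS:
            used.add(node)  # each arc is identified by its head
            node = ARCS[node][0]
    return sum(ARCS[head][1] for head in used)


def subsets(items):
    """All subsets of items, as frozensets."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def strong_exchange_holds():
    """Check every pair (X, Y) with |Y| < |X|: some leaf x in X but not in Y
    satisfies PD(X - x) + PD(Y + x) >= PD(X) + PD(Y)."""
    for X in subsets(LEAVES):
        for Y in subsets(LEAVES):
            if len(Y) >= len(X):
                continue
            if not any(
                pd(X - {x}) + pd(Y | {x}) >= pd(X) + pd(Y) for x in X - Y
            ):
                return False
    return True
```

The exhaustive loop is exponential in the number of leaves, so it is meant only as a sanity check on toy inputs, not as a substitute for the proof.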
Without any loss of generality, we can assume that every tree node in N is at most bifurcating, in the sense that the out-degree of each tree node is at most 2 (recall that we do not forbid out-degree 1 tree nodes in our networks). Indeed, let first be the phylogenetic network obtained from N as follows: for every node v that is the split node of more than one blob and for each such blob rooted at v, add a new split node to the blob and a new arc with weight 0. is still semi-d-ary and level-k, no node in it is the split node of more than one blob, and for every . Now, let be the phylogenetic network obtained from as follows: for every tree node v with children , replace in N the subgraph supported on by a bifurcating tree with root v and leaves and all its arcs except those ending in of weight 0: the arc ending in each inherits the original weight of ; if any node had any entering arcs other than , we keep them with their weights. Since v was the split node of at most one blob, no blob increases its level from to , and therefore is still semi-d-ary and level-k, and for every .
So, in the rest of this proof we shall suppose that N is at-most-bifurcating and in particular that no node in N is the split node of more than one blob.
We shall proceed by induction on the number of arcs of the network. A phylogenetic network with is a phylogenetic tree consisting of a single leaf, where the stated exchange property trivially holds. Now, let N be an at-most-bifurcating semi-d-ary level-k phylogenetic network with arcs, and let us suppose that the thesis in the statement holds for all at-most-bifurcating semi-d-ary level-k phylogenetic networks with fewer than arcs.
Let with . If the exchange property is trivially satisfied taking and , so we assume from now on that . Now consider the tree of blobs T of N (Gusfield et al. 2007), obtained by collapsing each blob in N into its split node. Then, T is a phylogenetic tree with the same root r as N, , and, for every , its cluster in T and in N are the same; let us denote it by C(v). Since and , the set of nodes v in T such that and is nonempty: it contains the root r.
We shall consider four cases.
(a) Assume that T contains some node such that and . Since , is in N a tree node such that the arc ending in it does not belong to any blob, which implies that it is a cut arc. Let and let be the network obtained from N by removing and the arc and, if is a reticulation node, appending to it a dummy leaf child (not labelled in ) through an arc of weight 0; cf. Figure 6. By the induction hypothesis, satisfies the thesis in the statement.
Now, for every , if , then , and if , then
(Throughout this proof, given a network with set of leaves and a set Z, we write to denote actually . So, for instance, and in the expressions above actually mean and , respectively.)
Since , by the induction hypothesis there exist and such that and
(9)
Since , and , and thus, in particular,
Notice also that because .
Assume first that , so that and . Then,
By the same argument, using that and , we also have that
Therefore, by Eqn. (9),
Assume now that . Then, by the definition of , the set A must be a singleton and then , because, by assumption, . Then, arguing as above, we have that
Similarly, if ,
while if (and using that ),
In either case, by Eqn. (9) we have again
(b) Assume now that the only node v in T such that and is the root r, and that r is not the split node of any blob in N. Then, each child v of r in N is also its child in T and thus, if , then . But since , r must have some child such that and hence such that and ; and then, since , r must have a second child and . For each , let and let be the subnetwork of N rooted at . The sets of leaves of are disjoint and therefore, for each ,
where, for each , if and otherwise.
Let and take and . Then, and , , , and . Therefore,
and, similarly,
Hence, in this case,
(c) Assume finally that the only node v in T such that and is the root r, and that r is the split node of a (single) blob . We distinguish two subcases.
(c.1) If contains some exit reticulation H with no descendant in , and if are the parents of H, then let be the phylogenetic network obtained from N by removing the subnetwork , adding new leaves with dummy labels outside , and replacing each arc by an arc with weight 0; cf. Figure 7. is still at-most-bifurcating, semi-d-ary, and level-k, and it has fewer than arcs (we have removed the arcs in ). Therefore, by the induction hypothesis, it satisfies the thesis in the statement. Let be its set of labels. Then, since, by assumption, , there exist and such that and
Since for every , we conclude that
(c.2) Finally, assume that all the exit reticulations of the blob rooted at r have descendants in X or . Let be the set of nodes of that have a child outside of ; if , we shall denote its child outside of by . Notice that:
(its two children must belong to the blob);
the exit reticulations of belong to ;
since reticulations have out-degree 1, the internal reticulations of do not belong to ;
for every , and thus, by the current assumption, if then .
For each let be the subnetwork of N rooted at v consisting of , v and the arc .
For each , we shall denote by the multiset of nodes of supported on
and with multiplicities . Since the subnetworks , with , have pairwise disjoint sets of leaves and the union of their sets of leaves is , we have that and
(10)
So, ; by the current assumption, every exit reticulation belongs to ; and if , then . Therefore, the multisets , satisfy the hypotheses of Lemma 3, which implies the existence of a set and a multiset of nodes of such that:
(1) ; thus, if , and .
(2) and, for every , .
(3) and , or , or and .
(4) .
Let
Then, and with and . In particular, by property (3), . We shall prove that
Before doing so, let us point out some facts that we shall use. First, notice that and , because for every
Moreover
(11)
(12)
Indeed, as to Eqn. (11), for every
and in particular
because for every .
A similar argument, using that, for every , and that , proves Eqn. (12).
We can proceed now to prove the desired inequality
By Eqn. (10),
(13)
where
(14)
because if , then and if , then ; and
(15)
because if , then , and if , then .
Therefore, combining Eqns. (13) to (15), we obtain
A similar argument proves that
Thus,
if, and only if,
Finally, this last inequality holds because
where step is due to
and, by property (4) of and (and, again, (11) and (12)),
This completes the proof of case (c.2).
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Declarations
Conflict of interest
The authors of this article declare that they have no financial conflict of interest with the content of this article.
Footnotes
A class of undirected graphs that generalize unrooted trees and do not describe evolutionary histories but simply evolutionary relationships.
A subclass of split networks widely used because they are the output of popular programs like PhyloNet (Yu et al. 2014) or SplitsTree4 (Huson and Bryant 2006).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Bordewich M, Semple C, Spillner A (2009) Optimizing phylogenetic diversity across two trees. Appl Math Lett 22:638–641
- Bordewich M, Semple C, Wicke K (2022) On the complexity of optimising variants of phylogenetic diversity on phylogenetic networks. Theoret Comput Sci 917:66–80
- Chernomor O, Klaere S et al (2016) Split diversity: measuring and optimizing biodiversity using phylogenetic split networks. In: Pellens and Grandcolas (2016), pp 173–195
- Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284:2124–2128
- Faith D (1992) Conservation evaluation and phylogenetic diversity. Biol Cons 61:1–10
- Gaston KJ (1996) Species richness: measures and measurements. In: Gaston KJ (ed) Biodiversity: a biology of numbers and differences. Blackwell Science, pp 77–113
- Gusfield D, Eddhu S, Langley C (2004) Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J Bioinform Comput Biol 2:173–213
- Gusfield D, Bansal V et al (2007) A decomposition theory for phylogenetic networks and incompatible characters. J Comput Biol 14:1247–1272
- Huson D, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23:254–267
- Huson D, Rupp R, Scornavacca C (2010) Phylogenetic networks: concepts, algorithms and applications. Cambridge University Press
- Jansson J, Sung W-K (2006) Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoret Comput Sci 363:60–68
- Kolbert E (2014) The sixth extinction: an unnatural history. Henry Holt and Company
- McNeely JA, Miller KR et al (1990) Conserving the world’s biological diversity. In: International Union for Conservation of Nature and Natural Resources
- Pardi F, Goldman N (2005) Species choice for comparative genomics: being greedy works. PLoS Genet 1:e71
- Pellens R, Grandcolas P (eds) (2016) Biodiversity conservation and phylogenetic systematics: preserving our evolutionary heritage in an extinction crisis. Springer Nature
- Possingham HP, Andelman S et al (2002) Limits to the use of threatened species lists. Trends Ecol Evol 17:503–507
- Riera G (2023) Theoretical models and computational techniques for the analysis of microbial communities. PhD Thesis, UIB
- Spillner A, Nguyen BT, Moulton V (2008) Computing phylogenetic diversity for split systems. IEEE/ACM Trans Comput Biol Bioinf 5:235–244
- Steel M (2005) Phylogenetic diversity and the greedy algorithm. Syst Biol 54:527–529
- Steel M (2016) Phylogeny: discrete and random processes in evolution
- Wicke K, Fischer M (2018) Phylogenetic diversity and biodiversity indices on phylogenetic networks. Math Biosci 298:80–90
- Yu Y, Dong J, Liu KJ (2014) Bayesian estimation of species networks from multilocus data. Mol Biol Evol 31:1032–1043
- Zhukova A, Blassel L et al (2021) Origin, evolution and global spread of SARS-CoV-2. CR Biol 344:57–75