Abstract
Given a collection $\mathcal{C}$ of subsets of a finite set X, we say that $\mathcal{C}$ is phylogenetically flexible if, for any collection R of rooted phylogenetic trees whose leaf sets comprise the collection $\mathcal{C}$, R is compatible (i.e. there is a rooted phylogenetic X-tree that displays each tree in R). We show that $\mathcal{C}$ is phylogenetically flexible if and only if it satisfies a Hall-type inequality condition of being ‘slim’. Using submodularity arguments, we show that there is a polynomial-time algorithm for determining whether or not $\mathcal{C}$ is slim. This ‘slim’ condition reduces to a simpler inequality in the case where all of the sets in $\mathcal{C}$ have size 3, a property we call ‘thin’. Thin sets were recently shown to be equivalent to the existence of an (unrooted) tree for which the median function provides an injective mapping to its vertex set; we show here that the unrooted tree in this representation can always be chosen to be a caterpillar tree. We also characterise when a collection $\mathcal{C}$ of subsets of size 2 is thin (in terms of the flexibility of total orders rather than phylogenies) and show that this holds if and only if an associated bipartite graph is a forest. The significance of our results for phylogenetics is in providing precise and efficiently verifiable conditions under which supertree methods that require consistent inputs of trees can be applied to any input trees on given subsets of species.
Keywords: Phylogenetic tree, Set systems, Partial taxon coverage, Bipartite graph, Hall’s marriage theorem, Submodularity
Introduction
In phylogenomics, biologists often encounter the following problem: given a collection of different subsets of species, the corresponding phylogenetic trees (each one reconstructed from the genomic data available for the corresponding subset) cannot be consistently combined into a single phylogenetic tree for all the species. When this occurs, various heuristic and somewhat ad hoc ‘supertree’ methods (such as ‘matrix representation with parsimony’) are often applied to provide some estimate of a parent tree (Felsenstein 2004). However, when the collection $\mathcal{C}$ of subsets of species has sufficiently sparse overlap (in a sense we will make precise shortly), any phylogenetic tree assignment for $\mathcal{C}$ will lead to a set of trees that can be consistently combined into a parent tree. Figure 1i provides an example of this.
In this paper, we investigate the conditions under which the existence of a consistent parent tree can be guaranteed regardless of the tree structure for each subset. Here ‘parent’ tree means that the leaf set of the tree is the union of the leaf sets of the input trees. For example, given a set of input trees, if there is a parent tree that displays each tree, then a simple, fast and well-known algorithm due to Aho et al. (1981) constructs such a tree in a canonical way. However, this method will fail to return any phylogenetic tree when presented with input trees that are incompatible (i.e. cannot be displayed by any parent tree). In this paper, we characterise when such a method will always be safe to use on any set of input trees, given the sets of taxa that form the leaf sets of those trees. Thus, we consider as input just subsets of species and develop mathematical characterisations and algorithms for this combinatorial question in the special case where each subset has a fixed (small) size. Later in the paper, we consider how the results extend to more general set systems. Our approach throughout is to reduce certain combinatorial questions in phylogenetics to the study of systems of inequalities involving linear expressions and related submodularity properties.
In the discussion section, we mention a further biological context where the results may be relevant. Note that there are many reasons why phylogenetic trees are constructed on different subsets of species; a particularly topical one is that the genes used to estimate a given phylogeny may only be present (or have been sequenced) in a given subset of the species, and these subsets vary from gene to gene (Sanderson et al. 2011).
Our work is motivated in part by a remarkable combinatorial result by Grünewald (2012) involving unrooted binary trees. In that paper, a set $\mathcal{P}$ of binary trees having leaves labelled from some set X is said to be ‘slim’ if for every non-empty subset $\mathcal{P}'$ of $\mathcal{P}$, the number of leaves appearing in at least one tree in $\mathcal{P}'$ is at least the total number of interior edges of the trees in $\mathcal{P}'$ plus 3. Theorem 1.1 of Grünewald (2012) then states that for any such slim collection $\mathcal{P}$ there is a tree with leaf set X that ‘displays’ each of the trees in $\mathcal{P}$. In particular, this leads to the rather striking consequence that ‘the property of being slim only depends on the involved leaf sets of the trees and not on which phylogenetic tree is chosen for a fixed leaf set’ (Grünewald 2012, p. 324). In this paper, we explore this notion further, and by working with rooted trees (rather than unrooted ones) we are able to establish precise characterisations of the analogous ‘slim’ property.
Our work is also partly motivated by results from Dress and Steel (2009) where slim-type properties also arise in a tree-based setting, but for a quite different question involving ‘median’ vertices. To explain this, given a tree $T = (V, E)$ and a subset S of V of size 3, say $S = \{x, y, z\}$, consider the path in T connecting x, y, the path connecting x, z and the path connecting y, z. There is a unique vertex that is shared by these three paths, the median vertex of S in T, denoted $\mathrm{med}_T(S)$. In Dress and Steel (2009), the authors show that ‘slim’-type properties characterise when a set of triples from X can be realised as providing an encoding of the interior vertices of an (unrooted) tree with leaf set X. (An extension of this to sets of subsets of X of size greater than 3 is also described.) In this paper, we extend this result further by showing that the tree that provides this encoding can be chosen to have a particular special type of structure (a ‘caterpillar’).
The phylogenetic combinatorics of subsets of a species set is a topic that has also been explored recently in the setting of ‘phylogenetic decisiveness’ (Steel and Sanderson 2010). However, the questions that we consider here are quite different from that setting; rather than requiring a dense overlap of the species subsets in the phylogenetic decisiveness setting, here we investigate sparse overlap.
We begin with some definitions. Throughout this paper, X will denote a fixed finite set.
Thin Set Systems
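Suppose $\mathcal{C}$ is a non-empty subset of $\binom{X}{k}$ (the collection of subsets of X of size k). Let $\bigcup \mathcal{C} = \bigcup_{c \in \mathcal{C}} c$ (i.e. the set of elements of X that appear in at least one set in $\mathcal{C}$) and define the excess of $\mathcal{C}$, denoted $\mathrm{exc}(\mathcal{C})$, by:
$$\mathrm{exc}(\mathcal{C}) = \left|\bigcup \mathcal{C}\right| - |\mathcal{C}| - (k - 1).$$
We say that $\mathcal{C}$ is thin if, for all non-empty subsets $\mathcal{C}'$ of $\mathcal{C}$, we have:
$$\mathrm{exc}(\mathcal{C}') \geq 0.$$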
This notion appears in related but slightly different settings, namely for the leaf sets of unrooted trees in Grünewald (2012), in the median representation of sets of triples in Dress and Steel (2009), and as sparse triplet covers in Grünewald et al. (2017).
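For small collections, the thin condition can be checked directly by enumerating subcollections. The following Python sketch does so by brute force. It assumes that the excess of a subcollection $\mathcal{C}'$ of k-element sets is $|\bigcup \mathcal{C}'| - |\mathcal{C}'| - (k-1)$, the form consistent with Lemma 1(i); the function names are illustrative and not taken from the paper. The running time is exponential in the number of sets, so this is only an executable restatement of the definition.

```python
from itertools import combinations


def excess(sets, k):
    """Excess |union| - |sets| - (k - 1) of a non-empty collection of k-sets."""
    union = set().union(*sets)
    return len(union) - len(sets) - (k - 1)


def is_thin(collection, k):
    """Brute-force test: every non-empty subcollection has excess >= 0."""
    sets = [frozenset(s) for s in collection]
    if any(len(s) != k for s in sets) or len(set(sets)) != len(sets):
        raise ValueError("expected distinct sets, each of size k")
    return all(
        excess(sub, k) >= 0
        for r in range(1, len(sets) + 1)
        for sub in combinations(sets, r)
    )


# Two triples sharing two elements: union has 4 elements, 4 - 2 - 2 = 0.
print(is_thin(["abc", "bcd"], 3))
# Three triples on only four elements: 4 - 3 - 2 < 0, so not thin.
print(is_thin(["abc", "abd", "acd"], 3))
```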
In the following lemma, recall that a collection $(S_1, \ldots, S_m)$ of (not necessarily distinct) sets has a system of distinct representatives if one can select an element $x_i \in S_i$ for each i so that the elements $x_1, \ldots, x_m$ are all distinct. For a non-empty subset $\mathcal{C}$ of $\binom{X}{k}$ and for $x \in X$, let $n_{\mathcal{C}}(x)$ be the number of elements in $\mathcal{C}$ that contain x.
Lemma 1
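Let $\mathcal{C}$ be a non-empty subset of $\binom{X}{k}$, and $k \geq 2$. If $\mathcal{C}$ is thin, then the following properties hold:
(i) $|\mathcal{C}| \leq n - k + 1$, where $n = |\bigcup \mathcal{C}|$.
(ii) For some $x \in \bigcup \mathcal{C}$, $n_{\mathcal{C}}(x) \leq k - 1$.
(iii) For any subset B of X of size $k - 1$, the collection of sets $(c - B : c \in \mathcal{C})$ has a system of distinct representatives.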
Proof
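Part (i) follows from the defining condition for thin upon taking $\mathcal{C}' = \mathcal{C}$.
Part (ii) can be established by the following double-counting argument. Suppose that there is no element $x \in \bigcup \mathcal{C}$ with $n_{\mathcal{C}}(x) \leq k - 1$, so that $n_{\mathcal{C}}(x) \geq k$ for all $x \in \bigcup \mathcal{C}$. Let $N = \sum_{x \in \bigcup \mathcal{C}} n_{\mathcal{C}}(x)$. We then have:
$$N \geq kn, \quad (1)$$
where $n = |\bigcup \mathcal{C}|$. On the other hand:
$$N = \sum_{c \in \mathcal{C}} |c| = k|\mathcal{C}| \leq k(n - k + 1), \quad (2)$$
where the inequality is from Part (i). Combining (1) and (2) gives $kn \leq k(n - k + 1)$ and, so, $k \leq 1$. Since $k \geq 2$, this is a contradiction, and (ii) follows.
For Part (iii), consider the union of any l sets $c_1 - B, \ldots, c_l - B$, where $l \geq 1$ and $c_1, \ldots, c_l$ are distinct elements of $\mathcal{C}$. (Note that these sets may have different sizes and a set may occur more than once.) Since $\mathcal{C}$ is thin, $|\bigcup_{i} c_i| \geq l + k - 1$, and so, since B has size $k - 1$, $|\bigcup_{i} (c_i - B)| \geq l$. Since the inequality holds for all $l \geq 1$, Hall’s marriage theorem (Hall 1935) ensures that $(c - B : c \in \mathcal{C})$ has a system of distinct representatives.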
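For the first part of this paper, we will deal with the case where $k = 3$. However, the main theorem in this setting (Theorem 1) will be used in Sect. 4 to derive a result for the more general case where the sets have different sizes. When $k = 3$, notice that if $|\mathcal{C}'| = 1$, then $\mathrm{exc}(\mathcal{C}') = 0$, and if $|\mathcal{C}'| = 2$, then $\mathrm{exc}(\mathcal{C}') \geq 0$ (two distinct triples have at least four elements in their union); so it suffices, in the definition of thin, to consider subsets of $\mathcal{C}$ of size at least 3.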
A simple way to generate a thin set is to take any ordered sequence of subsets of X of size 3 with the property that each member contains at least one element of X that is not present in any earlier member of the sequence. However, not all thin sets can be obtained in this way. For example, consider the collection of four subsets of . This collection of subsets is thin, yet these four sets cannot be ordered so as to satisfy the property described.
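The distinction between thin collections and those obtainable from such an ordered sequence can be explored computationally. The sketch below is illustrative only: the function names and the two example collections are chosen here and are not necessarily the ones used in the paper. It tests thinness by brute force (assuming the excess for triples is $|\bigcup \mathcal{C}'| - |\mathcal{C}'| - 2$) and searches for a suitable ordering by repeatedly peeling off a set that has an element contained in no other remaining set; such a set can always be placed last.

```python
from itertools import combinations


def is_thin_triples(triples):
    """Brute force: |union| - |sub| - 2 >= 0 for every non-empty subcollection."""
    sets = [frozenset(t) for t in triples]
    return all(
        len(frozenset().union(*sub)) - len(sub) - 2 >= 0
        for r in range(1, len(sets) + 1)
        for sub in combinations(sets, r)
    )


def fresh_element_ordering(triples):
    """Return an ordering in which each set has an element absent from all
    earlier sets, or None if no such ordering exists."""
    remaining = [frozenset(t) for t in triples]
    peeled = []  # sets in reverse order of the final sequence
    while remaining:
        for i, s in enumerate(remaining):
            others = set().union(*(t for j, t in enumerate(remaining) if j != i))
            if s - others:  # s owns an element no other remaining set has
                peeled.append(remaining.pop(i))
                break
        else:
            return None  # every element of every remaining set is shared
    return peeled[::-1]


chain = ["abc", "bcd", "cde"]          # obtainable from an ordered sequence
four = ["abc", "ade", "bdf", "cef"]    # every element lies in exactly two sets
print(is_thin_triples(chain), fresh_element_ordering(chain) is not None)
print(is_thin_triples(four), fresh_element_ordering(four) is not None)
```

In the second collection every element lies in exactly two sets, so no set can come last in such an ordering, although the brute-force check reports the collection as thin.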
Phylogenetic Trees and Flexible Sets
Following Semple and Steel (2003), a rooted phylogenetic tree T is a rooted tree having a set L(T) of labelled leaves (vertices of out-degree 0) and for which every non-leaf vertex is unlabelled and has out-degree at least 2. We let $\rho_T$, or more briefly $\rho$, denote the root vertex of T, which has in-degree 0. In case each non-leaf vertex has out-degree exactly 2, we say that T is binary. If L(T) = X, we will also say that T is a rooted phylogenetic X-tree. We let $\mathring{V}(T)$ denote the set of interior (i.e. non-leaf) vertices of T. Similarly, an unrooted phylogenetic tree T is an unrooted tree having a set L(T) of labelled leaves (vertices of degree 1) and for which every non-leaf vertex is unlabelled and has degree at least 3. In case each non-leaf vertex has degree exactly 3, we say that T is binary. If L(T) = X, we will also say that T is an unrooted phylogenetic X-tree.
A rooted triple is a rooted binary phylogenetic tree on three leaves, and we denote such a tree as ab|c if it has leaf set $\{a, b, c\}$ with leaf c adjacent to the root. A rooted phylogenetic X-tree T is said to display the rooted triple ab|c if some subdivision of the tree ab|c is a subgraph of T.
A cherry in a (rooted or unrooted) phylogenetic tree is a pair of leaves that is adjacent to the same vertex. A rooted (respectively, unrooted) caterpillar tree on X is a rooted (resp. unrooted) binary phylogenetic X-tree for which the number of cherries is at most 1 (respectively, 2).
These notions are illustrated in Fig. 2.
A set R of rooted triples chosen from X is said to be compatible if there is a rooted phylogenetic X-tree T that displays each rooted triple in R (in which case, we say that T displays R). Note that if R is compatible, then T can always be chosen to be a binary tree and R can contain at most one tree for any triplet (i.e. at most one of ab|c, ac|b, and bc|a can be present in R).
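Suppose that we have a set R of rooted triples with leaves chosen from X. We will let ||R|| denote the subset of $\binom{X}{3}$ consisting of the leaf sets of the trees in R. We say that a non-empty subset $\mathcal{C}$ of $\binom{X}{3}$ is phylogenetically flexible if every set R of rooted triples for which $\|R\| = \mathcal{C}$ holds (with exactly one rooted triple chosen for each set in $\mathcal{C}$, since two distinct rooted triples on the same leaf set are never compatible) is compatible. An example to illustrate this notion is provided in Fig. 1.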
The following observation that phylogenetic flexibility is hereditary is straightforward to check.
Lemma 2
Suppose $\mathcal{C}$ is a non-empty subset of $\binom{X}{3}$ that is phylogenetically flexible. If $\mathcal{C}'$ is a non-empty subset of $\mathcal{C}$, then $\mathcal{C}'$ is phylogenetically flexible.
Characterisation Result
We can now state our first main result.
Theorem 1
Suppose that $\mathcal{C}$ is a non-empty subset of $\binom{X}{3}$. Then $\mathcal{C}$ is phylogenetically flexible if and only if $\mathcal{C}$ is thin.
The ‘if’ direction of Theorem 1 can be established by applying Theorem 1.1 of Grünewald (2012); however, we give a shorter and more direct proof of this direction here (as well as establishing the converse). We begin with some preliminary results, which are required for the argument.
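Let T be a rooted phylogenetic tree with leaf set X in which every interior vertex other than the root has degree three. We say that a rooted triple xy|z supports a vertex v in T if xy|z is displayed by T and $v = \mathrm{lca}_T(x, y)$, the least common ancestor of x and y in T.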
For a set R of rooted triples on X, put $L(R) = \bigcup_{t \in R} L(t)$. Furthermore, for a non-empty subset S of X, let [R, S] be the graph with vertex set S and with an edge $\{a, b\}$ if and only if there exists a rooted triple $ab|c \in R$ for at least one element $c \in S$. By Bryant and Steel (1995, Theorem 2), R is compatible if and only if the graph [R, S] is disconnected for all subsets S of X of size at least 2.
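The graph [R, S] underlies the recursive procedure of Aho et al. (1981): if [R, S] is connected for some S with at least two elements, then R is incompatible; otherwise one recurses on the connected components. The following Python sketch implements this test under the assumption that a rooted triple ab|c is encoded as the tuple `(a, b, c)` and that an edge {a, b} is added only when a, b and c all lie in S. The encoding and function names are illustrative choices.

```python
def components(rooted_triples, S):
    """Connected components of [R, S], computed with a small union-find."""
    parent = {v: v for v in S}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b, c in rooted_triples:          # (a, b, c) encodes ab|c
        if a in S and b in S and c in S:
            parent[find(a)] = find(b)       # edge {a, b}
    groups = {}
    for v in S:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def is_compatible(rooted_triples, leaves):
    """True if some rooted tree on `leaves` displays every triple."""
    leaves = set(leaves)
    if len(leaves) <= 1:
        return True
    parts = components(rooted_triples, leaves)
    if len(parts) == 1:                     # [R, S] connected: no valid split
        return False
    return all(is_compatible(rooted_triples, part) for part in parts)


print(is_compatible([("a", "b", "c"), ("c", "d", "a")], "abcd"))
print(is_compatible([("a", "b", "c"), ("b", "c", "a")], "abc"))
```

The first call has components {a, b} and {c, d}, so the triples are compatible; in the second, the edges {a, b} and {b, c} connect all three leaves, so they are not.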
Lemma 3
Suppose that T is a rooted binary phylogenetic X-tree, that R is a set of rooted triples with , and that each rooted triple supports a unique (interior non-root) vertex in T. Then, the graph [R, X] has precisely two connected components.
Proof
For , let be the leaf set of the rooted subtree of T with root v. We claim that for every such v, the graph induced by [R, X] on is connected. The lemma then follows immediately by considering the graphs induced by [R, X] on for u and w the children of the root of T.
To prove the claim, for u (a child of the root of T), we consider the following set:
where v is said to be belowu if u lies on the path from to v. Note that since , there must exist a child u of such that and also there exists some vertex below or equal to u such that . We use induction on for in . If , then both children of v are leaves and the lemma holds because if , then, by assumption, there exists a rooted triple in R of the form r|pq for some that supports v. Hence, there is an edge in [R, X], and therefore, the graph induced by [R, X] on is connected.
Now suppose that v is an internal vertex of T below or equal to u such that . Then at least one of the two children and of v is not a leaf of T. Without loss of generality, we may assume that is that child. Therefore, and so, by induction, the graph induced by [R, X] on is connected. If is not a leaf of T, then the same arguments as before imply that the graph induced by [R, X] on is also connected. If is a leaf of T, then the graph [R, X] on is a vertex and therefore is (trivially) connected. Since, by assumption, there exists a rooted triple in R that supports v, there is an edge in [R, X] with and . Hence, the graph induced by [R, X] on is connected.
Proof of Theorem 1
We first establish the ‘if’ direction. Suppose that $\mathcal{C}$ is thin, and let R be a set of rooted triples with leaves chosen from X with $\|R\| = \mathcal{C}$. We show that any such choice of R is compatible.
We will establish the compatibility of R via the aforementioned characterisation that R is compatible if and only if [R, S] is disconnected for all subsets S of X of size at least 2. To that end, let S be a subset of X of size at least two.
Notice that $[R, S] = [R_S, S]$, where $R_S$ is the subset of those rooted triples in R that have all three of their leaves in S. If $R_S$ is empty, then [R, S] has no edges and is therefore disconnected, so suppose that $R_S$ is non-empty and let $\mathcal{C}_S = \|R_S\|$. Since $\mathcal{C}$ is thin, we have $\mathrm{exc}(\mathcal{C}_S) \geq 0$, in other words:
$$\left|\bigcup \mathcal{C}_S\right| - |\mathcal{C}_S| \geq 2. \quad (3)$$
Now (i) the number of vertices of [R, S] is |S| and $|S| \geq |\bigcup \mathcal{C}_S|$; (ii) the number of edges of [R, S] is at most $|R_S| = |\mathcal{C}_S|$. Thus, by Inequality (3), the number of vertices of [R, S] minus the number of edges of this graph is at least 2. But any finite graph with this property must be disconnected (a connected graph on m vertices has at least m − 1 edges). Since this holds for all subsets S of X of size at least two, it follows that R is compatible.
We turn now to the ‘only if’ direction.
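We use induction on $|\mathcal{C}|$. If $|\mathcal{C}| = 1$, then $\mathcal{C}$ is clearly thin. So, suppose the ‘only if’ direction holds for all $\mathcal{C}$ with $|\mathcal{C}| \leq m$, some $m \geq 1$, and let $\mathcal{C} \subseteq \binom{X}{3}$ be such that $|\mathcal{C}| = m + 1$. Without loss of generality, we may assume that $\bigcup \mathcal{C} = X$.
Suppose that $\mathcal{C}'$ is a non-empty proper subset of $\mathcal{C}$. By Lemma 2, $\mathcal{C}'$ is phylogenetically flexible. Hence by induction, $\mathcal{C}'$ is thin. Thus, $\mathrm{exc}(\mathcal{C}') \geq 0$. To show that $\mathcal{C}$ is thin, it therefore suffices to prove that $\mathrm{exc}(\mathcal{C}) \geq 0$.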
Suppose for the purposes of obtaining a contradiction that . Let and set . Then, as is thin by induction,
4 |
Hence and, so, . Thus, .
Now, since is thin, there exists a (unrooted) phylogenetic tree with leaf set X, and all vertices in of degree 3, for which the map is one to one (Dress and Steel 2009) (see also Sect. 3). We claim that the map must in fact be bijective. Suppose that this is not the case. Then, there exists some such that , for all . Hence, and, so, . But then , which is impossible as is an integer. Hence, is a bijection as claimed.
Now, root the tree T by inserting a root vertex into an edge which separates x, y from z, when the edge is removed from T. Let be a set of rooted triples induced by the map (for each element in maps to some so that we get a rooted triple with leaf set which supports v in the rooted version of T) with and . Since is a bijection, satisfies the conditions of Lemma 3 for the rooted version of T. Hence, the graph has two connected components, one that contains x, y in its vertex set and the other that contains z.
Now consider the set of rooted triples . Then , [R, X] is connected and so R is not compatible, and . But this is impossible, since is phylogenetically flexible.
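Theorem 1 can be sanity-checked on small examples by exhaustive search. The sketch below is a brute-force illustration, not the argument used in the proof: it enumerates all $3^{|\mathcal{C}|}$ ways of assigning a rooted triple to each set and tests compatibility directly via the characterisation that [R, S] must be disconnected for every subset S of size at least 2. The encoding of ab|c as `(a, b, c)`, the excess formula $|\bigcup \mathcal{C}'| - |\mathcal{C}'| - 2$ and all function names are assumptions made for illustration.

```python
from itertools import combinations, product


def is_thin_triples(triples):
    sets = [frozenset(t) for t in triples]
    return all(
        len(frozenset().union(*sub)) - len(sub) - 2 >= 0
        for r in range(1, len(sets) + 1)
        for sub in combinations(sets, r)
    )


def is_connected(vertices, edges):
    """Depth-first search over an undirected edge list."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            for u, w in ((a, b), (b, a)):
                if u == v and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen == vertices


def is_compatible(rooted_triples, leaves):
    """R is compatible iff [R, S] is disconnected for all S with |S| >= 2."""
    leaves = sorted(leaves)
    for r in range(2, len(leaves) + 1):
        for subset in combinations(leaves, r):
            S = set(subset)
            edges = [(a, b) for a, b, c in rooted_triples if {a, b, c} <= S]
            if is_connected(S, edges):
                return False
    return True


def is_flexible(triples):
    """Check every assignment of a rooted triple to each leaf set."""
    sets = [tuple(sorted(t)) for t in triples]
    leaves = set().union(*sets)
    choices = [[(x, y, z), (x, z, y), (y, z, x)] for x, y, z in sets]
    return all(is_compatible(R, leaves) for R in product(*choices))


for collection in (["abc", "ade", "bdf", "cef"], ["abc", "abd", "acd"]):
    print(collection, is_thin_triples(collection), is_flexible(collection))
```

For the second collection the assignment bc|a, ab|d, cd|a yields a connected graph on {a, b, c, d}, so the collection is neither thin nor flexible.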
The following corollary of Theorem 1 is now immediate from Lemma 1(i).
Corollary 1
If a non-empty subset $\mathcal{C}$ of $\binom{X}{3}$ is phylogenetically flexible, then $|\mathcal{C}| \leq n - 2$, where $n = |\bigcup \mathcal{C}|$.
We end this section by considering how many trees can display a set of rooted triples R when ||R|| is phylogenetically flexible. It might be suspected that since the overlap between the leaf sets of the trees in R is sparse, the number of trees displaying R would need to be large. Indeed, this is sometimes the case; for example, suppose that the leaf sets in R are all disjoint, so the total number of leaves is given by $n = 3k$, where $k = |R|$. In this case, the number N of rooted binary trees on n leaves that display some set of rooted triples with the same leaf sets as R satisfies:
$$N \geq 3^{k} = 3^{n/3}, \quad (5)$$
which grows exponentially with n. The proof of Eq. (5) is to observe that each of the $3^k$ ways to select a rooted triple for each of the k triples in ||R|| provides a set of rooted triples that is displayed by at least one rooted phylogenetic tree [by the algorithm from Aho et al. (1981)] and hence by at least one rooted binary tree, and these rooted binary trees are pairwise distinct, since any two of them display a different rooted triple for at least one triple in ||R||.
At the other extreme, if R has the maximum possible size for a phylogenetically flexible set on n leaves (namely $n - 2$ by Corollary 1), then it is possible for there to be just a single rooted phylogenetic tree that displays R; this is stated more precisely in the next proposition.
Proposition 1
-
(i)
For every rooted binary phylogenetic X-tree T on leaves, there exists a set of rooted triples for which (a) T is the only phylogenetic X-tree that displays and (b) is thin.
-
(ii)
There exist phylogenetically flexible sets of triples of size on n leaves () for which each assignment of a tree structure to these triples leads to a set of rooted triples that can be displayed by more than one rooted phylogenetic tree.
Proof
(i) We use induction on n. For , we can write , in which case satisfies Conditions (a) and (b). Suppose now that Proposition 1 holds for where and that T is a rooted binary phylogenetic X-tree with leaves. Select a pair of leaves a, b that are adjacent to the same vertex (say v) of T (i.e. is a cherry of T), let vertex u be the parent of vertex v in T, and let c be any leaf of T present in the component of (the graph obtained by deleting u from T) that contains neither the root, nor the leaves a, b. Put and let be the rooted binary phylogenetic -tree obtained from T by deleting leaf a and its incident edge and suppressing the resulting vertex of degree 2. Since has n leaves, the induction hypothesis ensures that there is a set of rooted triples for which is the only phylogenetic -tree that displays and that is thin. If we now let , then is a set of rooted triples and satisfies Conditions (a) and (b) for the tree T. This establishes the induction step and thereby the proposition.
(ii) Let . In this case, is a thin (and therefore phylogenetically flexible) set of size . Now, for , it can be checked that any assignment of a tree structure to these triples leads to a set of rooted triples that can be displayed by more than one rooted phylogenetic tree.
Median Characterisations
Given a phylogenetic tree T with leaf set X and a set $s \in \binom{X}{3}$, let $\mathrm{med}_T(s)$ refer to the vertex that is the unique median vertex of T for the three elements of s.
The following result was established in Dress and Steel (2009, Theorem 1.1). Suppose that $\mathcal{C}$ is a subset of $\binom{X}{3}$ with $\bigcup \mathcal{C} = X$. The following are equivalent:
(i) $\mathcal{C}$ is thin.
(ii) There exists a binary unrooted phylogenetic X-tree T for which the function $s \mapsto \mathrm{med}_T(s)$ from the elements s of $\mathcal{C}$ to the set of interior vertices of T is one to one.
When (ii) holds, we say that T provides a median representation of . Figure 3i illustrates how this equivalence applies.
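To make the notion of a median representation concrete, the following Python sketch computes median vertices in an unrooted tree stored as an adjacency dictionary and checks whether the median map is one to one on a given collection of triples. The caterpillar tree, its vertex names (`u1`, `u2`, `u3`) and the example triples are hypothetical and are not the ones shown in Fig. 3.

```python
from collections import deque


def build_tree(edges):
    """Adjacency dictionary of an undirected tree given as an edge list."""
    tree = {}
    for u, v in edges:
        tree.setdefault(u, set()).add(v)
        tree.setdefault(v, set()).add(u)
    return tree


def path(tree, start, end):
    """Vertices on the unique path from start to end (breadth-first search)."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == end:
            break
        for w in tree[v]:
            if w not in previous:
                previous[w] = v
                queue.append(w)
    vertices, v = [], end
    while v is not None:
        vertices.append(v)
        v = previous[v]
    return vertices


def median(tree, x, y, z):
    """The unique vertex lying on all three pairwise paths."""
    common = set(path(tree, x, y)) & set(path(tree, x, z)) & set(path(tree, y, z))
    (m,) = common
    return m


def is_median_representation(tree, triples):
    medians = [median(tree, *sorted(t)) for t in triples]
    return len(set(medians)) == len(medians)


# Hypothetical caterpillar on leaves a..e with interior path u1 - u2 - u3.
caterpillar = build_tree([("a", "u1"), ("b", "u1"), ("u1", "u2"), ("c", "u2"),
                          ("u2", "u3"), ("d", "u3"), ("e", "u3")])
print(median(caterpillar, "a", "c", "d"))
print(is_median_representation(caterpillar, ["abc", "acd", "cde"]))
print(is_median_representation(caterpillar, ["abc", "abd"]))
```

Here the triples {a, b, c}, {a, c, d} and {c, d, e} have medians `u1`, `u2` and `u3` respectively, whereas {a, b, c} and {a, b, d} share the median `u1`.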
We now strengthen this result from Dress and Steel (2009) by showing that the tree T can always be chosen to be an unrooted caterpillar tree. For example, for the thin collection of sets considered in Fig. 3, we may select the caterpillar tree shown in Fig. 3ii.
Theorem 2
Suppose $\mathcal{C}$ is a non-empty subset of $\binom{X}{3}$, where $\bigcup \mathcal{C} = X$. If $\mathcal{C}$ is thin, then there exists an unrooted caterpillar tree T with leaf set X for which the function $s \mapsto \mathrm{med}_T(s)$ is one to one.
Proof
We adapt the proof of (3) $\Rightarrow$ (2) of Dress and Steel (2009, Theorem 1.1) and use induction on the size of X. If $|X| = 3$, the theorem clearly holds in view of Lemma 1(i). Let us suppose that it holds whenever $|X| \leq m$, for some $m \geq 3$. Let X be such that $|X| = m + 1$. By Lemma 1(ii), we may assume that one of the following two cases holds:
(A) There is an element x of X with $n_{\mathcal{C}}(x) = 1$.
(B) There is an element x of X with $n_{\mathcal{C}}(x) = 2$.
In case (A), there is some triple such that for we have that is thin. Put . By induction, there is an unrooted caterpillar tree with leaf set and the function is one to one. Now we can create a tree T by inserting an edge where u is a new vertex subdividing an interior edge of on the path between a and b. The resulting tree T is clearly a unrooted caterpillar tree on X and is one to one. This establishes the induction step in this case.
In Case (B), there is an element x in X with . Then there exist two distinct triples each of which contains x. We consider the following two possible cases: (i) and (ii) .
Case (i):. In this case, there exist with such that and . Since is thin, it follows that
is also thin. Put . Then, by induction, there is a unrooted caterpillar tree with leaf set and is one to one.
Consider the leaf of . Let denote the vertex adjacent to . As is an unrooted caterpillar tree, it suffices to consider the following two subcases:
Subcase (a): The leaves a and b are on the same side of relative to (i.e. they are in the same connected component of as ). Without loss of generality, assume that the distance from a to in is less than or equal to distance from b to b in . Note that in this case is the vertex in that is adjacent to a. Now create a tree T with leaf set X by inserting a new vertex u and a new edge into such that is an edge on the path connecting and a. The tree T is again an unrooted caterpillar tree on X. Furthermore, is one to one since (i) is one to one, and (ii) and the median of in T corresponds to the median vertex of in and therefore is a vertex of T that is different from any other median vertex of an element in .
Subcase (b): The leaves a and b are on different sides of relative to . Note that in this case, . Now create a tree T with leaf set X by inserting a new vertex u and a new edge into such that is an edge on the path connecting and b. T is then clearly an unrooted caterpillar tree on X. Since and the median of in T corresponds to the median vertex of in , the same arguments as in the previous case imply that is one to one.
Case (ii):. In this case, there exist pairwise distinct elements in X such that and . We may assume that does not contain both and as, otherwise, the claim follows from Case (B)(i)(a). By symmetry, we can assume without loss of generality that is not in . Let . Then since is thin it follows that is thin. Put . Then, by induction, there is an unrooted caterpillar tree with leaf set and is one to one.
Consider the leaf . As is a caterpillar tree on , we can again consider two subcases ((a) and (b)), the first of which involves two further subcases:
Case (a): The leaves a and b are on the same side of relative to . Without loss of generality, assume that the distance from a to in is less than or equal to distance from b to in . We now have two subcases to consider for this subcase:
Case (a1): The leaf is on the same side of the caterpillar as a and b relative to . If a and b are on the same side of relative , then the same arguments as in the Case (B)(i)(a) apply with playing the role of . If a and b are on different sides of relative , then the same arguments apply as in Case (B)(i)(b) with playing the role of .
Case (a2): The leaf is on a different side of the caterpillar from a and b relative to . Now create a tree T on X by inserting a new vertex u and a new edge into such that with the vertex adjacent with we have that is an edge on the path connecting and . Then, T is clearly a unrooted caterpillar tree with leaf set X. Since is u and the median of in T corresponds to the median of in , the same arguments as in Case (B)(i)(a) imply that is one to one.
Case (b): The leaves a and b are on different sides of relative to . If lies on the same side of as a relative and and lie on different sides of relative a, then the same arguments as in the Case (B)(i)(a) apply with playing the role of . In all other cases, the same arguments as in the Case (B)(i)(b) apply with playing the role of
The Case $k = 2$
The concept of phylogenetic flexibility does not directly carry over to the case where $k = 2$, since in this case, there is just a single rooted phylogenetic tree. Instead, we use a stronger notion of tree structure (namely, total order) to obtain an analogue of Theorem 1.
We say that a non-empty subset $\mathcal{C}$ of $\binom{X}{2}$ is total-order flexible if every choice of a total order on the set s, for each $s \in \mathcal{C}$, is compatible with a total order on X. More formally, for every $\{x, y\} \in \mathcal{C}$, if we declare that either $x < y$ or $y < x$, then for any such selection of choices (one for each $s \in \mathcal{C}$), there is a total order on X that agrees with these inequalities. For example, $\{\{a, b\}, \{b, c\}\}$ is total-order flexible but $\{\{a, b\}, \{b, c\}, \{a, c\}\}$ is not, since the orderings $a < b$, $b < c$, $c < a$ are not compatible with any total order on a, b, c. The following result is the analogue of Theorem 1 for the case where $k = 2$.
Theorem 3
Suppose that $\mathcal{C}$ is a non-empty subset of $\binom{X}{2}$. Then $\mathcal{C}$ is thin if and only if $\mathcal{C}$ is total-order flexible.
Proof
We first show that if $\mathcal{C}$ is not thin, then $\mathcal{C}$ is not total-order flexible. Suppose that $\mathcal{C}$ is not thin. Then there exists a non-empty subset $\mathcal{C}'$ of $\mathcal{C}$ for which $\mathrm{exc}(\mathcal{C}') < 0$, that is, $|\mathcal{C}'| \geq |\bigcup \mathcal{C}'|$. Let G be the graph that has vertex set $\bigcup \mathcal{C}'$ and edge set $\mathcal{C}'$. Since G has at least as many edges as vertices, this graph has a connected component that contains a cycle. If the edges of this cycle are $\{x_1, x_2\}, \{x_2, x_3\}, \ldots, \{x_r, x_1\}$, then the total orders $x_1 < x_2$, $x_2 < x_3$, ..., $x_r < x_1$ on these pairs are not compatible with any total order on X (since transitivity would imply that $x_1 < x_1$).
We now show that the thin property implies total-order flexibility by using induction on $|\mathcal{C}|$. The result clearly holds for $|\mathcal{C}| = 1$, so suppose that the result holds for subsets of $\binom{X}{2}$ of size $m \geq 1$ and that $\mathcal{C}$ is a thin set of size $m + 1$. By Lemma 1(ii), there is an element x in X that is present in precisely one set, say $\{x, y\}$, in $\mathcal{C}$. Let $\mathcal{C}'$ be the set obtained from $\mathcal{C}$ by deleting $\{x, y\}$. Then $\mathcal{C}'$ is thin and, since $|\mathcal{C}'| = m$, the induction hypothesis implies that $\mathcal{C}'$ is total-order flexible. Then any choice of a total order on the set s for each $s \in \mathcal{C}'$ is compatible with a total order on $X - \{x\}$ (recall that $x \notin \bigcup \mathcal{C}'$ by the choice of x). If we now introduce a total order on $\{x, y\}$, then we can extend the total order to X by placing x before y if $\{x, y\}$ is ordered as x, y, and placing x after y otherwise.
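The two directions of this proof suggest a simple computational check. For pairs, thinness amounts to the graph with edge set $\mathcal{C}$ having no cycle, and a compatible total order (when one exists) can be produced by topological sorting. The Python sketch below illustrates this with the standard-library `graphlib` module (Python 3.9 or later); the function names and the encoding of a choice x < y as the tuple `(x, y)` are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import product


def is_forest(pairs):
    """True if the graph with edge set `pairs` has no cycle (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in pairs:
        rx, ry = find(x), find(y)
        if rx == ry:
            return False        # this edge closes a cycle
        parent[rx] = ry
    return True


def total_order(oriented_pairs, elements):
    """A total order extending every (lo, hi) constraint, or None."""
    sorter = TopologicalSorter({x: () for x in elements})
    for lo, hi in oriented_pairs:
        sorter.add(hi, lo)      # lo must come before hi
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


def is_total_order_flexible(pairs):
    """Try both orientations of every pair."""
    elements = sorted({x for p in pairs for x in p})
    for flips in product([False, True], repeat=len(pairs)):
        oriented = [(y, x) if f else (x, y) for (x, y), f in zip(pairs, flips)]
        if total_order(oriented, elements) is None:
            return False
    return True


path_pairs = [("a", "b"), ("b", "c")]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(is_forest(path_pairs), is_total_order_flexible(path_pairs))
print(is_forest(triangle), is_total_order_flexible(triangle))
```

The exhaustive orientation check is exponential and is included only to compare against the forest test on small inputs.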
We now present some characterisations for when a non-empty set is thin. We begin with an analogue of Dress and Steel (2009, Theorem 1.1), which was stated in the last section.
Given a rooted tree T with leaf set X and a set $s \in \binom{X}{2}$, let $\mathrm{lca}_T(s)$ refer to the unique vertex of T that is the least common ancestor of the elements in the set s.
Theorem 4
Suppose that $\mathcal{C}$ is a subset of $\binom{X}{2}$ with $\bigcup \mathcal{C} = X$. The following are equivalent:
(i) $\mathcal{C}$ is thin.
(ii) There exists a rooted binary phylogenetic X-tree T for which the function $s \mapsto \mathrm{lca}_T(s)$ from the elements of $\mathcal{C}$ to the set of interior vertices of T is one to one.
(iii) As for (ii) but with T a rooted caterpillar tree.
Proof
(iii) $\Rightarrow$ (ii) is trivial.
(i) (iii) Suppose is thin. We use induction on the size of X. If , then clearly (iii) holds. Therefore, suppose it holds whenever , for some . Let .
By Lemma 1(ii), there is some x with . Let . It is then straightforward to see that there is some pair with . Clearly is thin as is thin and either or .
Assume first that , where . By induction, there is a rooted caterpillar tree with leaf set and root for which the function is one to one. Now, we can create a new rooted tree T with root by adding two new edges and to T where is a new vertex that is not in . T is then clearly a rooted caterpillar tree on X; every vertex in has out-degree 2 and is one to one. This establishes the induction step, and so (iii) holds.
Assume next that , where . By induction, there exists a rooted caterpillar tree on . Let denote the root of . Let T be rooted caterpillar tree obtained from via the following two-step process. First, add a new root and a new edge to . In the resulting tree subdivide e by a vertex c and add the edges and . Clearly, is one to one. This establishes again the induction step, and so (iii) holds too in this case.
(ii) (i) Suppose x is an element which is not in . Given a non-empty subset of , let .
Suppose a rooted phylogenetic tree T on X satisfies the conditions in Part (ii) of the theorem. Add a new leaf x that is not in to T by adding the edge and regard the resulting tree as an unrooted phylogenetic tree on . In , the map from to the internal vertices of is then one to one. Hence, by Dress and Steel (2009, Theorem 1.1), is thin, and thus, for any non-empty subset of , we have:
It immediately follows that is thin.
Interestingly, we can give an alternative characterisation of thin subsets of $\binom{X}{2}$ in terms of bipartite graphs.
We first recall some results from matching theory. For a graph G and v a vertex in G, we let $\deg_G(v)$ denote the degree of v in G. Given a bipartite graph $G = (A \cup B, E)$ and a non-empty set $Y \subseteq A$, we let $N_G(Y)$ denote the set of vertices in B that are adjacent to some vertex in Y, and we define the surplus of Y to be:
$$\sigma_G(Y) = |N_G(Y)| - |Y|.$$
We also define the surplus of G, $\sigma(G)$, to be the minimum surplus over all non-empty subsets of A. We say that a bipartite graph $G = (A \cup B, E)$ has positive surplus (as viewed from A) if $\sigma(G) > 0$. The following result is from Lovász and Plummer (1986, Theorem 1.3.8).
Theorem 5
A bipartite graph $G = (A \cup B, E)$ has positive surplus (as viewed from A) if and only if G contains a forest F such that $\deg_F(a) = 2$ for all $a \in A$.
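We now apply this result to the setting of thin sets. Let $\mathcal{C}$ be a non-empty collection of non-empty subsets of X. We associate a bipartite graph $G_{\mathcal{C}}$ to $\mathcal{C}$ that has the vertex set $\mathcal{C} \cup X$ and the edge set given by containment (i.e. $\{c, x\}$ is an edge in $G_{\mathcal{C}}$ if and only if $x \in c$ with $c \in \mathcal{C}$ and $x \in X$). Thus, we are representing our set $\mathcal{C}$ by a bipartite graph $G_{\mathcal{C}} = (A \cup B, E)$ with $A = \mathcal{C}$, $B = X$, and E given by containment.
For $\mathcal{C} \subseteq \binom{X}{2}$, $\mathcal{C}$ is thin if and only if $G_{\mathcal{C}}$ has positive surplus (as viewed from $\mathcal{C}$), and every vertex of $G_{\mathcal{C}}$ in $\mathcal{C}$ has degree exactly 2; so, by Theorem 5, the following corollary is straightforward.
Corollary 2
Suppose that $\mathcal{C}$ is a subset of $\binom{X}{2}$ with $\bigcup \mathcal{C} = X$. Then $\mathcal{C}$ is thin if and only if $G_{\mathcal{C}}$ is a forest.
For $k \geq 3$, it might also be interesting to characterise those graphs $G_{\mathcal{C}}$ for which $\mathcal{C}$ is thin.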
The General Case (Slim Set Systems)
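Suppose we have a non-empty collection $\mathcal{C}$ of subsets of X, each of size at least 3. Consider the modified notion of excess, denoted $\mathrm{exc}'(\mathcal{C})$ and defined as follows:
$$\mathrm{exc}'(\mathcal{C}) = \left|\bigcup \mathcal{C}\right| - 2 - \sum_{c \in \mathcal{C}} (|c| - 2).$$
Notice that when $\mathcal{C} \subseteq \binom{X}{3}$ this notion of excess agrees with the earlier one.
Given a non-empty collection $\mathcal{C}$ of subsets of X, each of size at least 3, we say that $\mathcal{C}$ is slim if for every non-empty subset $\mathcal{C}'$ of $\mathcal{C}$, we have $\mathrm{exc}'(\mathcal{C}') \geq 0$. The next result relates slim to thin; the two notions coincide when $k = 3$; however, slim is a more restrictive notion than thin when $\mathcal{C} \subseteq \binom{X}{k}$, for $k > 3$. Note, however, that (unlike the thin property) the slim property does not require the sets in $\mathcal{C}$ to all have the same size.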
Lemma 4
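Suppose that $\mathcal{C} \subseteq \binom{X}{k}$ with $k \geq 3$. If $\mathcal{C}$ is slim, then $\mathcal{C}$ is thin. Moreover, for $k = 3$, $\mathcal{C}$ is thin if and only if $\mathcal{C}$ is slim.
Proof
If $|c| = k$ for each $c \in \mathcal{C}$, then $\mathrm{exc}'(\mathcal{C}') = |\bigcup \mathcal{C}'| - (k - 2)|\mathcal{C}'| - 2$; therefore, $\mathcal{C}$ is slim if and only if $|\bigcup \mathcal{C}'| \geq (k - 2)|\mathcal{C}'| + 2$ for every non-empty subset $\mathcal{C}'$ of $\mathcal{C}$, whereas $\mathcal{C}$ is thin if and only if $|\bigcup \mathcal{C}'| \geq |\mathcal{C}'| + k - 1$ for every such $\mathcal{C}'$. We now impose the assumption that $k \geq 3$. First, if $k > 3$, then the required inequality $(k - 2)|\mathcal{C}'| + 2 \geq |\mathcal{C}'| + k - 1$ is equivalent to the condition that $|\mathcal{C}'| \geq 1$, which holds by the assumption that $\mathcal{C}'$ is non-empty. Thus if $\mathcal{C}$ is slim, it is also thin. Moreover, when $k = 3$, the inequality becomes an equality; in this case, $\mathcal{C}$ is slim if and only if it is thin.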
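The relationship in Lemma 4 can be illustrated numerically. The following Python sketch tests the slim and thin conditions by brute force, assuming the modified excess is $|\bigcup \mathcal{C}'| - 2 - \sum_{c \in \mathcal{C}'} (|c| - 2)$ and the thin excess is $|\bigcup \mathcal{C}'| - |\mathcal{C}'| - (k - 1)$. It is exponential in $|\mathcal{C}|$ and is not the polynomial-time submodularity-based method mentioned in the abstract; the function names and example collections are illustrative.

```python
from itertools import combinations


def nonempty_subcollections(sets):
    for r in range(1, len(sets) + 1):
        yield from combinations(sets, r)


def modified_excess(sets):
    """|union| - 2 - sum(|c| - 2) for a non-empty collection of sets."""
    union = set().union(*sets)
    return len(union) - 2 - sum(len(s) - 2 for s in sets)


def is_slim(collection):
    sets = [frozenset(s) for s in collection]
    if any(len(s) < 3 for s in sets):
        raise ValueError("every set must have size at least 3")
    return all(modified_excess(sub) >= 0 for sub in nonempty_subcollections(sets))


def is_thin(collection, k):
    sets = [frozenset(s) for s in collection]
    return all(
        len(set().union(*sub)) - len(sub) - (k - 1) >= 0
        for sub in nonempty_subcollections(sets)
    )


mixed = ["abcd", "def"]        # sets of different sizes: 6 - 2 - (2 + 1) = 1
quads = ["abcd", "abce"]       # k = 4: thin (5 - 2 - 3 = 0), not slim (5 - 2 - 4 < 0)
print(is_slim(mixed))
print(is_thin(quads, 4), is_slim(quads))
```

The second example shows a collection of 4-element sets that is thin but not slim, in line with the statement that slim is the more restrictive notion when $k > 3$.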
We can extend the notion of phylogenetic flexibility introduced in Sect. 1 to arbitrary collections of subsets of X as follows. We first need to extend the earlier definitions of ‘display’ and ‘compatibility’ from sets of rooted triples to arbitrary collections of rooted trees, as follows.
Given a rooted phylogenetic X-tree T and a rooted binary phylogenetic tree $T'$ with leaf set $X' \subseteq X$, T is said to display $T'$ if T contains a subdivision of $T'$ as a (directed) subtree. [This is equivalent to the condition that each rooted triple displayed by $T'$ is also displayed by T (Bryant and Steel 1995).] A set R of rooted binary phylogenetic trees is said to be compatible if there is a rooted phylogenetic tree T that displays each of the trees in R.
For a set R of rooted binary phylogenetic trees, let ||R|| denote the collection of their leaf sets. Thus, ||R|| is a set of sets. Given a non-empty collection $\mathcal{C}$ of subsets of X, each of size at least 3, we say that $\mathcal{C}$ is phylogenetically flexible if every set R of rooted binary phylogenetic trees for which $||R|| = \mathcal{C}$ holds is compatible.
This notion agrees with the earlier notion of phylogenetic flexibility in the case where each set in $\mathcal{C}$ has size exactly 3. Moreover, as before, we can assume without loss of generality that the tree T (in the definition) is binary.
The following result is a strengthening of our earlier Theorem 1; one direction follows from that theorem, and the other direction is a consequence of a result from Grünewald (2012) (which dealt with unrooted trees).
Theorem 6
Suppose that $\mathcal{C}$ is a collection of sets, each of size at least 3. Then $\mathcal{C}$ is phylogenetically flexible if and only if $\mathcal{C}$ is slim.
Proof
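We first establish the ‘only if’ direction. Suppose that $\mathcal{C}$ is phylogenetically flexible. For each set $s \in \mathcal{C}$, select two elements $a_s, b_s \in s$ and let:
$$A(s) = \big\{\{a_s, b_s, x\} : x \in s \setminus \{a_s, b_s\}\big\},$$
and for any non-empty subset $\mathcal{C}'$ of $\mathcal{C}$ let
$$A(\mathcal{C}') = \bigcup_{s \in \mathcal{C}'} A(s).$$
Thus, A(s) is a set of $|s| - 2$ triples, and $A(\mathcal{C}')$ is also a set of triples.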
Claim 1
$A(s) \cap A(s') = \emptyset$ for each pair of distinct sets $s, s' \in \mathcal{C}$.
To see this, suppose that a triple, say $\{a, b, c\}$, lies in A(s) and $A(s')$ for two distinct elements s and $s'$ of $\mathcal{C}$. Then, we can select a rooted binary phylogenetic tree $T_s$ with leaf set s that displays the rooted triple ab|c and select a rooted binary phylogenetic tree $T_{s'}$ with leaf set $s'$ that displays the rooted triple ac|b. But no rooted binary phylogenetic tree can display both $T_s$ and $T_{s'}$ (since such a tree would also simultaneously display two different rooted triples with leaf set $\{a, b, c\}$). This contradicts the assumption that $\mathcal{C}$ is phylogenetically flexible, so such a shared triple in $A(s) \cap A(s')$ cannot exist. This establishes Claim 1.
Claim 2
$A(\mathcal{C})$ is phylogenetically flexible.
To see this, suppose that for each triple $t \in A(\mathcal{C})$, we have an associated rooted triple $r_t$ with leaf set t. We need to show that there is a rooted binary phylogenetic tree that displays $\{r_t : t \in A(\mathcal{C})\}$. Observe that A(s) is thin for each $s \in \mathcal{C}$, since if A is a non-empty subset of A(s) of size k (say), then $|\bigcup A| = k + 2$, and so $\mathrm{exc}(A) = 0$. Theorem 1 (the ‘if’ direction) then ensures that for each s in $\mathcal{C}$, there is a rooted phylogenetic tree $T_s$ with leaf set s that displays $\{r_t : t \in A(s)\}$. Moreover, since $\mathcal{C}$ is phylogenetically flexible, there is a rooted binary phylogenetic tree T that displays $T_s$ for each $s \in \mathcal{C}$. It follows that the tree T displays $\{r_t : t \in A(\mathcal{C})\}$, and so $A(\mathcal{C})$ is phylogenetically flexible, as claimed.
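Claim 2 implies that $A(\mathcal{C})$ is thin, by Theorem 1 (the ‘only if’ direction). We now show that this implies that $\mathcal{C}$ is slim. Let $\mathcal{C}'$ be a non-empty subset of $\mathcal{C}$ and consider $A(\mathcal{C}')$. Since $\bigcup A(\mathcal{C}') = \bigcup \mathcal{C}'$, we have:
$$\Big|\bigcup \mathcal{C}'\Big| = \Big|\bigcup A(\mathcal{C}')\Big| \geq |A(\mathcal{C}')| + 2, \tag{6}$$
where the inequality holds because $A(\mathcal{C})$ (and thereby its subset $A(\mathcal{C}')$) is thin.
Now:
$$|A(\mathcal{C}')| = \sum_{s \in \mathcal{C}'} |A(s)| = \sum_{s \in \mathcal{C}'} (|s| - 2),$$
where the first equality holds by Claim 1. Combining this last equation with Inequality (6) gives:
$$\Big|\bigcup \mathcal{C}'\Big| \geq \sum_{s \in \mathcal{C}'} (|s| - 2) + 2,$$
which shows that $\mathcal{C}$ is slim as claimed.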
This establishes the ‘only if’ direction. Notice in doing so that we have used both directions of Theorem 1 in different places in this proof.
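The counting step in this direction of the proof can be traced on a small example. The sketch below assumes the triple family for a set s consists of the triples $\{a_s, b_s, x\}$ with x ranging over the remaining elements of s; taking $a_s, b_s$ to be the two smallest elements is an arbitrary choice made here for illustration, and the function name `triples_of` is hypothetical.

```python
from itertools import combinations


def triples_of(s):
    """Return the family {{a, b, x} : x in s minus {a, b}} for a set s with |s| >= 3.

    The two fixed elements a, b are taken to be the two smallest elements of s.
    """
    a, b, *rest = sorted(s)
    return {frozenset({a, b, x}) for x in rest}


# Hypothetical collection with sets of different sizes.
collection = [{1, 2, 3, 4}, {4, 5, 6}]
families = [triples_of(s) for s in collection]

# Pairwise disjointness of the families (the property asserted in Claim 1).
pairwise_disjoint = all(f.isdisjoint(g) for f, g in combinations(families, 2))

# All triples together, and the elements they cover.
all_triples = set().union(*families)
covered = frozenset().union(*all_triples)
```

Here each family has |s| - 2 members and the families are pairwise disjoint, so the total number of triples equals the sum of |s| - 2 over the collection, while the triples cover exactly the same elements as the original sets.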
We turn now to the ‘if’ direction. Given $\mathcal{C}$, select a new element, say x, that is not present in any of the sets in $\mathcal{C}$, and add this to each of the sets in $\mathcal{C}$ to produce a set $\mathcal{C}^{+}$. Notice that if $\mathcal{C}$ is slim, then $\mathcal{C}^{+}$ satisfies the property that for each non-empty subset $\mathcal{C}'$ of $\mathcal{C}^{+}$, we have:
$$\Big|\bigcup \mathcal{C}'\Big| - 3 \geq \sum_{s \in \mathcal{C}'} (|s| - 3).$$
It follows from Theorem 1.1 of Grünewald (2012) that for any assignment of unrooted binary phylogenetic trees with leaf sets that correspond to the sets in $\mathcal{C}^{+}$, there is a binary phylogenetic tree that displays each of these unrooted trees. Suppose now that we have an assignment of rooted binary phylogenetic trees having leaf sets that correspond to the sets in $\mathcal{C}$. By attaching x as a leaf adjacent to the root of each of these trees, we obtain an assignment of unrooted binary phylogenetic trees with leaf sets that correspond to the sets in $\mathcal{C}^{+}$. Hence, by the result just stated, there is an unrooted binary phylogenetic tree $T'$ that displays each of these unrooted trees. If we now let T be the rooted binary phylogenetic tree obtained from $T'$ by deleting the leaf x and rooting the resulting tree on the vertex adjacent to x, then T displays the original assignment of rooted binary phylogenetic trees. Since this holds for all possible assignments of rooted phylogenetic trees to the sets in $\mathcal{C}$, it follows that $\mathcal{C}$ is phylogenetically flexible.
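The effect of adding the new element x to every set can be checked numerically. The sketch below assumes that the slim condition is $|\bigcup \mathcal{C}'| - 2 \geq \sum (|s| - 2)$ and that the corresponding condition for the enlarged collection has the same shape with 2 replaced by 3; the function names and the label `"x"` are illustrative only.

```python
from itertools import combinations


def satisfies_offset_inequality(collection, offset):
    """Check |union(C')| - offset >= sum(|s| - offset) for all non-empty C'."""
    sets = [frozenset(s) for s in collection]
    for r in range(1, len(sets) + 1):
        for sub in combinations(sets, r):
            lhs = len(frozenset().union(*sub)) - offset
            rhs = sum(len(s) - offset for s in sub)
            if lhs < rhs:
                return False
    return True


def add_new_element(collection, x="x"):
    """Add an element x, assumed absent from every set, to each set."""
    if any(x in s for s in collection):
        raise ValueError("x must not occur in any set of the collection")
    return [frozenset(s) | {x} for s in collection]


# Hypothetical examples: one slim collection and one that is not slim.
slim_example = [{1, 2, 3}, {3, 4, 5}]
non_slim_example = [{1, 2, 3, 4}, {2, 3, 4, 5}]
```

Adding x raises both the size of every union and the size of every set by one, so the inequality with offset 2 on the original collection holds exactly when the inequality with offset 3 holds on the enlarged collection.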
Polynomial-Time Algorithms for Thin and Slim
Given a finite set S, a function $f: 2^S \to \mathbb{R}$ is called submodular if for all $A, B \subseteq S$:
$$f(A \cup B) + f(A \cap B) \leq f(A) + f(B).$$
Submodular functions play an important role in optimisation and matroid theory (see, e.g. Lovász and Plummer 1986; Bixby and Cunningham 1995; Welsh 1995). In this section, we exploit these connections to show that there are polynomial-time algorithms to decide whether sets are thin or slim.
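Suppose that $\mathcal{C}$ is a subset of $2^X$. For $\mathcal{C}' \subseteq \mathcal{C}$, we define
$$f_{\mathcal{C}}(\mathcal{C}') = \Big|\bigcup \mathcal{C}'\Big| - |\mathcal{C}'|$$
and
$$g_{\mathcal{C}}(\mathcal{C}') = \Big|\bigcup \mathcal{C}'\Big| - \sum_{s \in \mathcal{C}'} (|s| - 2).$$
Note that $g_{\mathcal{C}}(\emptyset) = 0$ (since the summation term is then empty). Although the following result is straightforward to show by using results concerning submodular functions in the literature, for completeness we give a direct proof.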
Theorem 7
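For a non-empty subset $\mathcal{C}$ of $2^X$, the functions $f_{\mathcal{C}}$ and $g_{\mathcal{C}}$ are submodular.
Proof
Suppose that $\mathcal{C}$ is a non-empty subset of $2^X$ and that $\mathcal{C}_1, \mathcal{C}_2 \subseteq \mathcal{C}$ are non-empty (if either is empty, the submodular inequality holds with equality). Clearly, $\bigcup (\mathcal{C}_1 \cup \mathcal{C}_2) = (\bigcup \mathcal{C}_1) \cup (\bigcup \mathcal{C}_2)$ and $\bigcup (\mathcal{C}_1 \cap \mathcal{C}_2) \subseteq (\bigcup \mathcal{C}_1) \cap (\bigcup \mathcal{C}_2)$. Hence:
$$\Big|\bigcup (\mathcal{C}_1 \cup \mathcal{C}_2)\Big| + \Big|\bigcup (\mathcal{C}_1 \cap \mathcal{C}_2)\Big| \leq \Big|\bigcup \mathcal{C}_1\Big| + \Big|\bigcup \mathcal{C}_2\Big|.$$
The fact that $f_{\mathcal{C}}$ is submodular now follows, since $|\mathcal{C}_1 \cup \mathcal{C}_2| + |\mathcal{C}_1 \cap \mathcal{C}_2| = |\mathcal{C}_1| + |\mathcal{C}_2|$ and thus, by the above inequality, we have:
$$f_{\mathcal{C}}(\mathcal{C}_1 \cup \mathcal{C}_2) + f_{\mathcal{C}}(\mathcal{C}_1 \cap \mathcal{C}_2) \leq f_{\mathcal{C}}(\mathcal{C}_1) + f_{\mathcal{C}}(\mathcal{C}_2).$$
Similarly, $g_{\mathcal{C}}$ is submodular, since
$$\sum_{s \in \mathcal{C}_1 \cup \mathcal{C}_2} (|s| - 2) + \sum_{s \in \mathcal{C}_1 \cap \mathcal{C}_2} (|s| - 2) = \sum_{s \in \mathcal{C}_1} (|s| - 2) + \sum_{s \in \mathcal{C}_2} (|s| - 2),$$
and therefore:
$$g_{\mathcal{C}}(\mathcal{C}_1 \cup \mathcal{C}_2) + g_{\mathcal{C}}(\mathcal{C}_1 \cap \mathcal{C}_2) \leq g_{\mathcal{C}}(\mathcal{C}_1) + g_{\mathcal{C}}(\mathcal{C}_2).$$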
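The submodular inequality can be verified exhaustively on a small ground collection. The sketch below assumes the two set functions are "size of the union minus the number of sets" and "size of the union minus the sum of (|s| - 2)"; the names `f`, `g` and `is_submodular` are illustrative. A function that is not submodular (the square of the number of sets) is included for contrast.

```python
from itertools import combinations


def powerset(items):
    """Yield every subset of items (including the empty one) as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def union_size(sub):
    """Number of elements covered by a sub-collection (0 if it is empty)."""
    return len(frozenset().union(*sub))


def f(sub):
    return union_size(sub) - len(sub)


def g(sub):
    return union_size(sub) - sum(len(s) - 2 for s in sub)


def is_submodular(func, ground):
    """Check func(A | B) + func(A & B) <= func(A) + func(B) for all A, B."""
    subsets = list(powerset(ground))
    return all(
        func(a | b) + func(a & b) <= func(a) + func(b)
        for a in subsets
        for b in subsets
    )


# Hypothetical ground collection with sets of mixed sizes.
ground = [frozenset(s) for s in ({1, 2, 3}, {3, 4, 5}, {1, 5, 6}, {2, 4, 6, 7})]
```

The check enumerates all pairs of sub-collections, so it is only practical for very small ground collections; it is meant as a sanity check rather than as part of any algorithm.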
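For $\mathcal{C} \subseteq 2^X$, we define:
$$\mu_f(\mathcal{C}) = \min\{f_{\mathcal{C}}(\mathcal{C}') : \emptyset \neq \mathcal{C}' \subseteq \mathcal{C}\} \quad \text{and} \quad \mu_g(\mathcal{C}) = \min\{g_{\mathcal{C}}(\mathcal{C}') : \emptyset \neq \mathcal{C}' \subseteq \mathcal{C}\}.$$
Lemma 5
(i) Suppose that $\mathcal{C} \subseteq \binom{X}{k}$ where $k \geq 2$. Then $\mathcal{C}$ is thin if and only if $\mu_f(\mathcal{C}) \geq k - 1$.
(ii) Suppose $\mathcal{C} \subseteq 2^X$ such that each element in $\mathcal{C}$ has size at least three. Then $\mathcal{C}$ is slim if and only if $\mu_g(\mathcal{C}) \geq 2$.
Proof
(i) $\mathcal{C}$ is thin if and only if $|\bigcup \mathcal{C}'| - |\mathcal{C}'| \geq k - 1$ for all non-empty $\mathcal{C}' \subseteq \mathcal{C}$ if and only if $\mu_f(\mathcal{C}) \geq k - 1$.
(ii) $\mathcal{C}$ is slim if and only if $|\bigcup \mathcal{C}'| - \sum_{s \in \mathcal{C}'} (|s| - 2) \geq 2$ for all non-empty $\mathcal{C}' \subseteq \mathcal{C}$ if and only if $\mu_g(\mathcal{C}) \geq 2$.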
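The two criteria above reduce the thin and slim tests to minimising a set function over non-empty sub-collections. The sketch below performs that minimisation by exhaustive search, which takes exponential time and is not the polynomial-time method referred to in the text; it assumes the thresholds k - 1 (thin) and 2 (slim) and uses hypothetical function names. Returning a minimising sub-collection gives a witness when a test fails.

```python
from itertools import combinations


def union_size(sub):
    return len(frozenset().union(*sub))


def union_minus_count(sub):
    """Size of the union minus the number of sets."""
    return union_size(sub) - len(sub)


def union_minus_weighted(sub):
    """Size of the union minus the sum of (|s| - 2)."""
    return union_size(sub) - sum(len(s) - 2 for s in sub)


def minimise(collection, func):
    """Return (minimum value, a minimising non-empty sub-collection)."""
    sets = [frozenset(s) for s in collection]
    best = None
    for r in range(1, len(sets) + 1):
        for sub in combinations(sets, r):
            value = func(sub)
            if best is None or value < best[0]:
                best = (value, sub)
    return best


def is_thin(collection, k):
    return minimise(collection, union_minus_count)[0] >= k - 1


def is_slim(collection):
    return minimise(collection, union_minus_weighted)[0] >= 2


# Hypothetical example: two 4-sets sharing three elements.
quads = [{1, 2, 3, 4}, {2, 3, 4, 5}]
slim_min, slim_witness = minimise(quads, union_minus_weighted)
```

For `quads` the weighted minimum is attained by the full pair of sets, which is therefore a witness that the collection is not slim even though it is thin.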
The following result (Lovász 1983, Theorem 4.4) is originally due to Grötschel et al. (1981) (see also Lovász and Plummer 1986, pp. 417–418).
Theorem 8
Let f be a submodular function defined on the subsets of some finite set S. A set minimising f over all non-empty subsets of S can then be found in polynomial time.
In light of this theorem and Theorem 7, it follows that we can determine the minima $\mu_f(\mathcal{C})$ and $\mu_g(\mathcal{C})$ defined before Lemma 5 for a given set $\mathcal{C}$ in polynomial time. Therefore, by Lemma 5, we can determine whether or not a given set $\mathcal{C}$ for which each element has size at least three is thin or slim in polynomial time.
Note that although this shows that polynomial-time algorithms exist for determining whether or not a set is thin or slim, these are likely to be impracticable (Lovász and Plummer 1986, pp. 417–418). However, for the case of determining whether or not a set is thin, a more explicit algorithm can be given. More specifically, in Fritzilas et al. (2013, Theorem 2) a polynomial-time algorithm is presented for computing the surplus of a bipartite graph G. Since for a set $\mathcal{C} \subseteq \binom{X}{k}$, the surplus of the bipartite graph $G(\mathcal{C})$ as defined in Sect. 3.1 is equal to the minimum $\mu_f(\mathcal{C})$ of $|\bigcup \mathcal{C}'| - |\mathcal{C}'|$ over all non-empty $\mathcal{C}' \subseteq \mathcal{C}$, we can therefore apply this algorithm to determine if $\mathcal{C}$ is thin. It would be interesting to find an explicit algorithm for determining whether a set is slim.
Theorem 7 has another consequence that relates to phylogenetics. Recall that a patchwork is a non-empty collection $\mathcal{P}$ of sets that satisfies the property: if $A, B \in \mathcal{P}$ and $A \cap B \neq \emptyset$, then $A \cap B, A \cup B \in \mathcal{P}$. A combinatorial theory of patchworks, relevant to phylogenetics, was developed in Böcker and Dress (2001). Patchworks were also referred to as ‘intersecting families’ in earlier work by Lovász (1983, p. 240). The following is a generalisation of Dress and Steel (2009, Lemma 1.2), and the proof follows a similar argument to that result.
Corollary 3
If $\mathcal{C}$ is slim, then the collection of non-empty subsets $\mathcal{C}'$ of $\mathcal{C}$ such that $\mathrm{exc}'(\mathcal{C}') = 0$ forms a patchwork.
Proof
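Suppose that two members $\mathcal{C}_1, \mathcal{C}_2$ of this collection satisfy $\mathcal{C}_1 \cap \mathcal{C}_2 \neq \emptyset$. For each non-empty $\mathcal{C}' \subseteq \mathcal{C}$, notice that $\mathrm{exc}'(\mathcal{C}') = |\bigcup \mathcal{C}'| - \sum_{s \in \mathcal{C}'} (|s| - 2) - 2$ differs only by the additive constant 2 from the second of the two functions in Theorem 7 and so, by the submodularity property from Theorem 7, we have:
$$\mathrm{exc}'(\mathcal{C}_1 \cup \mathcal{C}_2) + \mathrm{exc}'(\mathcal{C}_1 \cap \mathcal{C}_2) \leq \mathrm{exc}'(\mathcal{C}_1) + \mathrm{exc}'(\mathcal{C}_2), \tag{7}$$
noting that $\mathrm{exc}'(\mathcal{C}_1 \cap \mathcal{C}_2)$ is well defined by the condition that $\mathcal{C}_1 \cap \mathcal{C}_2 \neq \emptyset$. Since $\mathrm{exc}'(\mathcal{C}_1) = \mathrm{exc}'(\mathcal{C}_2) = 0$, Inequality (7) gives:
$$0 \geq \mathrm{exc}'(\mathcal{C}_1 \cup \mathcal{C}_2) + \mathrm{exc}'(\mathcal{C}_1 \cap \mathcal{C}_2).$$
It follows that the terms $\mathrm{exc}'(\mathcal{C}_1 \cup \mathcal{C}_2)$ and $\mathrm{exc}'(\mathcal{C}_1 \cap \mathcal{C}_2)$ on the right of this inequality must both be zero since $\mathcal{C}_1 \cup \mathcal{C}_2$ and $\mathcal{C}_1 \cap \mathcal{C}_2$ are non-empty subsets of the slim set $\mathcal{C}$ and so each has non-negative excess. Thus, $\mathcal{C}_1 \cup \mathcal{C}_2$ and $\mathcal{C}_1 \cap \mathcal{C}_2$ both belong to the collection, as required.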
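The patchwork property can be observed directly on small slim collections. The sketch below assumes that the family in question consists of the non-empty sub-collections whose modified excess $|\bigcup \mathcal{C}'| - 2 - \sum (|s| - 2)$ equals zero; the function names and the example collection are hypothetical.

```python
from itertools import combinations


def modified_excess(sub):
    """|union(C')| - 2 - sum(|s| - 2) for a non-empty sub-collection C'."""
    return len(frozenset().union(*sub)) - 2 - sum(len(s) - 2 for s in sub)


def nonempty_subcollections(collection):
    """Yield every non-empty sub-collection as a frozenset of frozensets."""
    sets = [frozenset(s) for s in collection]
    for r in range(1, len(sets) + 1):
        for sub in combinations(sets, r):
            yield frozenset(sub)


def is_slim(collection):
    return all(modified_excess(sub) >= 0 for sub in nonempty_subcollections(collection))


def tight_family(collection):
    """Sub-collections whose modified excess is exactly zero."""
    return {sub for sub in nonempty_subcollections(collection) if modified_excess(sub) == 0}


def is_patchwork(family):
    """Closed under union and intersection of members that intersect."""
    return all(
        (a | b) in family and (a & b) in family
        for a in family
        for b in family
        if a & b
    )


# Hypothetical slim collection of triples.
example = [{1, 2, 3}, {1, 2, 4}, {4, 5, 6}, {6, 7, 8}]
tight = tight_family(example)
```

In this example the tight sub-collections are the four single sets together with the pair {1, 2, 3}, {1, 2, 4}, and this family is closed under unions and intersections of overlapping members.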
Discussion
When an evolutionary biologist compares a number of trees on different, but overlapping, leaf sets, it is typically very rare that these trees are found to be compatible, due mainly to errors in the estimation of phylogenetic trees. Thus, in cases where the trees are compatible, this fact alone may provide the biologist with some heightened confidence in the accuracy of the input trees. However, such confidence should clearly depend, in part, on the pattern of taxon coverage. In the extreme case where the subsets of species on which the input trees were built form a phylogenetically flexible collection, it is clear that compatibility provides absolutely no hint of accuracy of the input trees, since any trees that had been considered for those subsets would be compatible. For applications, it might therefore be useful to quantify how close to ‘phylogenetically flexible’ a given pattern of taxon coverage is.
Our results also suggest a second possible future research direction. Since submodular functions are connected to matroid theory, are there relevant connections between thin/slim sets and matroids? Other matroid structures in phylogenetics have recently been described, in different contexts, by Dress et al. (2014) and Hellmuth and Seemann (2017).
Acknowledgements
We thank the organisers of the Algebraic and Combinatorial Phylogenetics Workshop (Barcelona, 26–30 June 2017), where some of the ideas in this paper were conceived, and the London Mathematical Society for supporting the visit of KTH and VM to MS in New Zealand. We also thank the two anonymous reviewers of this paper for numerous helpful suggestions.
Contributor Information
Katharina T. Huber, Email: K.Huber@uea.ac.uk
Vincent Moulton, Email: V.Moulton@uea.ac.uk
Mike Steel, Email: mike.steel@canterbury.ac.nz
References
- Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981;10:405–421. doi: 10.1137/0210030.
- Bixby RE, Cunningham WH, et al. Matroid optimization and algorithms. In: Graham RL, et al., editors. Handbook of combinatorics. New York: Elsevier; 1995. pp. 551–609.
- Böcker S, Dress AWM. Patchworks. Adv Math. 2001;157:1–21. doi: 10.1006/aima.1999.1912.
- Bryant DJ, Steel M. Extension operations on sets of leaf-labelled trees. Adv Appl Math. 1995;16(4):425–453. doi: 10.1006/aama.1995.1020.
- Dress A, Steel M. A Hall-type theorem for triplet set systems based on medians in trees. Appl Math Lett. 2009;22:1789–1792. doi: 10.1016/j.aml.2009.07.001.
- Dress A, Huber KT, Steel M. A matroid associated with a phylogenetic tree. Discret Math Theor Comput Sci. 2014;16(2):41–56.
- Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2004.
- Fritzilas E, Milanič M, Monnot J, Rios-Solis YA. Resilience and optimization of identifiable bipartite graphs. Discret Appl Math. 2013;161(4):593–603. doi: 10.1016/j.dam.2012.01.005.
- Grötschel M, Lovász L, Schrijver A. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica. 1981;1:169–197. doi: 10.1007/BF02579273.
- Grünewald S. Slim sets of binary trees. J Comb Theory A. 2012;119:323–330. doi: 10.1016/j.jcta.2011.09.007.
- Grünewald S, Huber KT, Moulton V, Steel M. Combinatorial properties of triplet covers for binary trees. arXiv:1707.07908; 2017.
- Hall P. On representatives of subsets. J Lond Math Soc. 1935;10(1):26–30. doi: 10.1112/jlms/s1-10.37.26.
- Hellmuth M, Seemann CR. The matroid structure of representative triple sets and triple-closure computation. arXiv:1707.01667; 2017.
- Lovász L. Submodular functions and convexity. In: Mathematical programming the state of the art. Springer; 1983. pp. 235–257.
- Lovász L, Plummer MD. Matching theory. New York: Elsevier; 1986.
- Sanderson MJ, McMahon MM, Steel M. Terraces in phylogenetic tree space. Science. 2011;333:448–450. doi: 10.1126/science.1206357.
- Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
- Steel M, Sanderson MJ. Characterizing phylogenetically decisive taxon coverage. Appl Math Lett. 2010;23:82–86. doi: 10.1016/j.aml.2009.08.009.
- Welsh DJA, et al. Matroids: fundamental concepts. In: Graham RL, et al., editors. Handbook of combinatorics. New York: Elsevier; 1995. pp. 481–550.