Abstract
Clusters of paralogous genes such as the famous HOX cluster of developmental transcription factors tend to evolve by stepwise duplication of their members, often involving unequal crossing over. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate the phylogenetic relationships. As a consequence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. In this contribution we show that the expansion of gene clusters by unequal crossing over as proposed by Walter Gehring leads to distinctive patterns of genetic distances, namely a subclass of circular split systems. Furthermore, when the gene cluster was left undisturbed by genome rearrangements, the shortest Hamiltonian paths with respect to genetic distances coincide with the genomic order. This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms.
Electronic supplementary material
The online version of this article (10.1007/s00285-017-1197-3) contains supplementary material, which is available to authorized users.
Keywords: Evolution of gene clusters, Non-homologous recombination, Unequal crossing over, Phylogenetic combinatorics, Kalmanson metrics, Hamiltonian path problems
Introduction
The genomes of higher eukaryotes typically contain many families of genes with similar DNA sequence. These usually encode similar proteins and share similar function. Their sequence similarity indicates that they have evolved from a single original ancestor by means of multiple rounds of duplication. Such paralogous genes are often, but by no means always, located at the same genomic locus, where they form a gene cluster. In many cases clustered genes are not tied together functionally and the clusters can disintegrate by genome rearrangement without detrimental effects.
However, some gene clusters are evolutionarily old and have retained a very particular organization of their member genes for hundreds of millions of years. Among the best characterized gene clusters are the globin gene clusters, which encode major players in the transport of oxygen within the bloodstream (Maniatis et al. 1980) and the homeobox Hox gene clusters, which play a crucial role in the early stages of animal development (Garcia-Fernàndez 2005). In vertebrates, the latter show very low levels of repeats and unrelated open reading frames, and the genes in paralogous clusters share the same order and orientation. Experimental work demonstrated that the consolidated arrangement is crucial and constrained due to the necessity of a coordinated regulation orchestrated by enhancer sequences outside the cluster (Hardison et al. 1997; Montavon and Duboule 2013).
The details of the molecular mechanisms and evolutionary forces that govern the expansion of clusters of paralogous genes are by no means completely understood. Walter J. Gehring, a developmental biologist famous for his studies of the Hox gene cluster in Drosophila melanogaster, interpreted the fact that the three Hox genes (abd-B, abd-A, and Ubx) appear in a tandem arrangement as evidence for gene duplication by “unequal crossing over”. He proposed that the current Hox cluster expanded from two Hox genes by a series of unequal crossing over events between highly similar but mispaired paralogous genes (Gehring 1998). In this scenario, a new paralog is created as a hybrid of its left and right neighbors as indicated in Fig. 1.
The local gene duplication model constitutes an alternative explanation. Again, unequal crossing over is a molecular mechanism resulting in the duplication. However, in this scenario the crossing over occurs between genes and thus results in the creation of a faithful copy of the complete gene. Diversification, subfunctionalization, or neofunctionalization then drives the subsequent divergence of the paralogous sequences (Ohno 1970; Force et al. 1999).
Gehring noted that terminal genes in a Hox cluster are not subject to changes by crossing over and that the genes in the middle of the cluster are more similar to the consensus sequence than more distal genes. The paralogs in a cluster most similar to a given gene tend to be its neighbors. A recent analysis of the genetic distances, i.e., a suitably transformed measure of sequence similarity (Nei 1972) between Hox genes, furthermore, showed that the shortest Hamiltonian path with respect to the genetic distance follows the genomic order of the cluster (Höner zu Siederdissen et al. 2015). We ask here if and how these observations can be explained by Gehring’s model and the local gene duplication model.
The analysis of the history of a gene family is usually based on the inference of a phylogenetic tree of the paralogous genes in question. However, this is a difficult task and often remains unsuccessful, in particular for the deep branches since several effects conspire to erase the phylogenetic signal. Saturation of the phylogenetic signal limits the power of reconstruction in particular for old events and events separated by relatively short time scales.
Genomic elements that are very similar in sequence and in close proximity, as is the case in clusters of paralogous genes, are particularly prone to gene conversion and other mechanisms of concerted evolution (Carson and Scherer 2009; Noonan et al. 2004). Last but not least, in Gehring’s model the very process that introduces new members involves unequal crossing over within genes, thus producing a non-tree-like structure of genetic distances to begin with.
The purpose of this contribution is two-fold. First, we investigate the consequences of Gehring’s model for gene cluster expansion and show that while the resulting genetic distances are not additive trees, they form a special class of Kalmanson (circular decomposable) metrics (Kalmanson 1975), which we term type R metrics. Circular decomposable metrics are intimately related to weakly compatible split systems (Bandelt and Dress 1992) that admit a circular order (Christopher et al. 1996; Chepoi and Fichet 1998). Our interest in circular orderings in this context is far from accidental: There is a large body of literature that explores these connections in detail (Farach 1997; Dress et al. 2000; Kleinman et al. 2013), and circular decomposable metrics and their associated split systems also form the basis for the most important and widely used practical methods for reconstructing phylogenetic networks: NeighborNet (Bryant et al. 2004, 2007) and Qnet (Grünewald et al. 2007, 2009). Second, we will see that in the absence of extreme selective pressure type R metrics have the Robinson property, which ensures that the Hamiltonian path with the shortest genetic distance between genes is co-linear with the genomic order in the gene cluster. We then use this result to distinguish gene clusters that likely have evolved under Gehring’s model and retained synteny from those that have a different origin or were subject to a rearrangement of their gene order. The contribution is organized as follows: In the following section we survey background material that sets the stage for a detailed analysis of Gehring’s model, which we formalize in Sect. 3 in terms of type R metrics. We then proceed to compare the mathematical model with both simulated and real-life data. A short concluding section, finally, summarizes our findings and points to open questions.
Trees, metrics, and Hamiltonian paths
In this section we introduce the notation and provide some mathematical background information on the connection between tree metrics and Hamiltonian paths. The material presented in this section is mostly “folklore” and included primarily as an introduction to the more formal development of the following sections. Proofs are included in this section for completeness where we are not aware of any convenient references.
Gene duplications and genomic gene order
We consider a family X of paralogous genes whose evolutionary history is given by the tree T (with vertex set V, leaf set $X \subseteq V$, and edge set E) and strictly positive branch lengths $\ell: E \to \mathbb{R}^+$. The corresponding genetic distance function is given by
$$d_T(x,y) = \sum_{e \in \wp(x,y)} \ell(e) \qquad (1)$$
where $\wp(x,y)$ denotes the unique path connecting x and y in T. We write $d_{\max} = \max_{x,y \in X} d_T(x,y)$ for the maximal distance between two leaves. It is important here that the genetic distance is additive in the branch lengths and thus proportional to divergence time (Nei 1972). Distance measures counting differences in sequence alignments therefore need to be suitably transformed to additive measures, see e.g. Jukes and Cantor (1969).
Let $\pi: X \to \{1, \dots, |X|\}$ be a bijection. In other words, $\pi$ defines an ordering of X so that $x <_\pi y$ iff $\pi(x) < \pi(y)$. A special ordering is the arrangement of the genes on the genome, i.e., the genomic order.
A circular (or cyclic) ordering (Megiddo 1976) is a ternary relation $[i,j,k]$ on a set X that satisfies the following five conditions for all $i, j, k, l \in X$:
$[i,j,k]$ implies i, j, k are pairwise distinct. (irreflexive)
$[i,j,k]$ implies $[k,i,j]$. (cyclic)
$[i,j,k]$ implies $\neg[k,j,i]$. (antisymmetric)
$[i,j,k]$ and $[i,k,l]$ implies $[i,j,l]$. (transitive)
If i, j, k are pairwise distinct then $[i,j,k]$ or $[k,j,i]$. (total)
A pair of points (p, q) is adjacent in a total circular order on V if there is no $h \in V$ such that $[p,h,q]$. Circular orderings can be linearized by cutting them at any point, resulting in a linear order with the cut point as its minimal (or maximal) element (Novák 1984). We will write, by abuse of notation, $i < j < k$ to mean $[i,j,k]$ together with a suitable linearization, i.e., a cut between k and i.
It is well known that trees are planar graphs. Fix a planar embedding of T. It defines, up to orientation, a unique circular ordering of the leaf set X, see e.g. Semple and Steel (2003) for more details. Any linearization of this circular order defines a linear order, which we will refer to as a T-order, see Fig. 2.
Consider a tree T with leaf set X and fix a particular circular order on X. Let $E'$ be the set of edges connecting consecutive leaves with respect to this order, and denote by $T'$ the auxiliary graph with the same vertices as T and an edge set extended by $E'$. Thus $T'$ is a Halin graph (Halin 1971) whenever the chosen order is a T-order. A necessary condition for the order to be a T-order therefore is that $T'$ is a planar graph.
Clearly, if the gene family originated exclusively by tandem duplications, then the genomic order is a T-order for the gene phylogeny T. On the other hand, if a block containing two or more genes is duplicated as a unit, then the genomic order and the tree are discordant as shown in Fig. 3. Every duplication scenario in which a block of more than a single gene is duplicated at least once must contain this situation as a subgraph, and thus the complete bipartite graph $K_{3,3}$ shown in the right panel of Fig. 3 is a minor of the auxiliary graph. $K_{3,3}$ and the complete graph $K_5$ on five vertices are the two minimal obstructions to planarity, see Makarychev (1997) and the references therein. Thus it follows that the genomic order is not a T-order whenever the evolutionary scenario involves larger block duplications. We remark that gene loss may erase this signature of block duplications. For instance, the loss of node (leaf) 2 or 3 in Fig. 3 leads back to a T-order.
From trees to Hamiltonian paths
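For an arbitrary order $\pi$, write $x_1, x_2, \dots, x_n$ for the elements of X listed in the order $\pi$. We define the length function
$$L(\pi) = \sum_{i=1}^{n-1} d(x_i, x_{i+1}) \qquad (2)$$
$L(\pi)$ can be interpreted as the length of the Hamiltonian path defined by the ordering $\pi$ in the complete graph with vertex set X and edge lengths $d(x,y)$.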
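As an illustration of the length function, the following minimal Python sketch evaluates the path length of an order and finds the shortest Hamiltonian paths by brute force. The function names and the toy four-leaf tree metric (pendant edges 1, 2, 3, 4 and an internal edge 5) are assumptions made for this example only; exhaustive enumeration is feasible only for very small gene families.

```python
from itertools import permutations

def path_length(order, d):
    """Length of the Hamiltonian path that visits the genes in the given order."""
    return sum(d[a][b] for a, b in zip(order, order[1:]))

def shortest_paths(d):
    """Enumerate all orders; return the minimal length and the orders attaining it."""
    best, argbest = float("inf"), []
    for order in permutations(range(len(d))):
        if order[0] > order[-1]:        # a path and its reversal have equal length
            continue
        length = path_length(order, d)
        if length < best - 1e-12:
            best, argbest = length, [order]
        elif abs(length - best) <= 1e-12:
            argbest.append(order)
    return best, argbest

# Hypothetical additive tree metric: cherries (0,1) and (2,3),
# pendant edge lengths 1, 2, 3, 4 and one internal edge of length 5.
d = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]

best, orders = shortest_paths(d)
```

In this toy example the total tree length is 15 and the largest leaf distance is 11, so the minimum is 2 · 15 − 11 = 19. It is attained only by the order 1, 0, 2, 3, which follows the outline of the tree and ends in the two most distant leaves.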
Theorem 1
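Let d be the additive tree metric associated with the tree T and its non-degenerate length function $\ell$. Then the length $L(\pi)$ of the Hamiltonian path defined by an order $\pi$ is minimal if and only if (i) $\pi$ is a T-order and (ii) the first element $x_1$ and the last element $x_n$ of $\pi$ satisfy $d(x_1, x_n) = d_{\max}$, the maximal distance between two leaves.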
Proof
We use the abbreviation $\ell(T) = \sum_{e \in E} \ell(e)$ for the total length of T. Throughout, $x_1, \dots, x_n$ denotes the leaves listed in the order $\pi$, $L(\pi)$ the length of the corresponding Hamiltonian path, $\wp(x,y)$ the path connecting x and y in T, and $d_{\max}$ the maximal distance between two leaves.
Claim 1
Every order $\pi$ satisfies $L(\pi) \ge 2\ell(T) - d_{\max}$.
Denote by $W_\pi$ the closed walk $\wp(x_1,x_2), \wp(x_2,x_3), \dots, \wp(x_{n-1},x_n), \wp(x_n,x_1)$. Its length is $L(\pi) + d(x_n,x_1)$. Since $W_\pi$ connects any two leaves, it contains all edges of T. Furthermore, since T contains no cycle, $W_\pi$ must leave each subtree that it enters along the same edge. Thus $W_\pi$ covers every edge at least twice. Hence $L(\pi) + d(x_n,x_1) \ge 2\ell(T)$. Since $W_\pi$ contains exactly one path more than the Hamiltonian path, and the longest possible path has length $d_{\max}$, the claim follows.
Claim 2
If $\pi$ is a T-order, then $L(\pi) = 2\ell(T) - d(x_n,x_1)$.
By construction, $W_\pi$ associated with a T-order $\pi$ is the closed walk defined by the “outline” of the tree, cf. Fig. 2. Any such walk covers each edge of T exactly twice, once when entering and once when leaving a given subtree. This construction is well known in the literature, see e.g. (Moret et al. 2002, Theorem 5). The claim follows directly from $L(\pi) + d(x_n,x_1) = 2\ell(T)$.
Fix an arbitrary leaf 1 as the root of T and as the start and end point of $W_\pi$, and denote by n the last leaf visited for the first time along $W_\pi$. Furthermore, for every edge e, T(e) denotes the connected component of $T \setminus e$ that does not contain 1.
Claim 3
If $W_\pi$ covers every edge of T exactly twice then the leaves contained within every subtree T(e) form an interval in $\pi$.
It suffices to note that $W_\pi$ enters and leaves the subtree T(e) only through e. If the edge is covered exactly twice, all leaves of T(e), and only the leaves of T(e), are visited along $W_\pi$ between the first and the second traversal of e.
It follows that, for each edge $e = \{u, v\}$ where $u \notin T(e)$ and $v \in T(e)$, that is, where v is the endpoint of e farther from the root, there is a linear ordering of the children $v_1$, $v_2$, through $v_k$ of v so that the subtrees $T(\{v, v_i\})$ are traversed by $W_\pi$ in this order. Consequently, there is a planar layout of T so that the leaves 1 through n are arranged in the order of traversal. In other words, if $W_\pi$ traverses T so that every edge is covered exactly twice, then T has a planar embedding so that $W_\pi$ travels along its outline and visits consecutive leaves in the order in which they appear on the outline of the tree.
Hence there is a T-ordering following the outline of T if and only if the corresponding closed walk covers every edge of T exactly twice. Now suppose that $\pi$ is not a T-ordering. By closure of the walk, each edge must be covered an even number of times by $W_\pi$, so that $W_\pi$ without the return path from n to 1 covers at least one edge thrice, thus $L(\pi) > 2\ell(T) - d_{\max}$.
Simulating distance matrices for gene duplications
We show here that genetic distance matrices for models of gene duplications can be simulated directly. This has advantages over the more usual approach of simulating sequence evolution. In particular we can, in this manner, separate the stochastic noise that may lead to deviations from additive tree metrics.
Lemma 1
Let $d$ be an additive tree metric on X and let $\delta_x \ge 0$ for $x \in X$ be arbitrary. Then $d'$ defined as $d'_{xy} = d_{xy} + \delta_x + \delta_y$ for $x \ne y$ is again an additive tree metric.
Proof
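A metric d is an additive tree metric if and only if every 4-tuple satisfies the “4-point condition” (Buneman 1974; Cunningham 1978; Dobson 1974; Simões-Pereira 1969), which stipulates that any four leaves can be renamed $u, v, x, y$ such that
$$d_{uv} + d_{xy} \le d_{ux} + d_{vy} = d_{uy} + d_{vx} \qquad (3)$$
Using the definition of $d'$, i.e., adding $\delta_p + \delta_q$ to the distance between any two distinct leaves p and q, immediately yields
$$d'_{uv} + d'_{xy} = d_{uv} + d_{xy} + \Delta \le d_{ux} + d_{vy} + \Delta = d'_{ux} + d'_{vy} = d'_{uy} + d'_{vx}, \qquad \Delta = \delta_u + \delta_v + \delta_x + \delta_y .$$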
Hence we can propagate time by an increment $\Delta t$ simply by adding $(r_x + r_y)\Delta t$ to $d_{xy}$, where $r_x$ is the rate of evolution of taxon x. A duplication of gene x introduces a new gene z that, at the time of the duplication event, is identical to x. It is therefore represented in the distance matrix by simply duplicating the row and column x, i.e., by setting $d_{zu} = d_{xu}$ for all $u \in X$ and $d_{zz} = 0$. The procedure is summarized in Algorithm 1.
A rate (and possibly a new rate ) needs to be chosen. Assuming a constant rate of duplication, we set and choose one of the leaves at random for duplication. Instead of appending the new leaf to the end of the matrix, we insert it explicitly before or after x so that the order of the rows and columns explicitly encodes the genomic order. Duplicating a larger block of rows and columns can immediately be used to simulate the block duplications of any number of adjacent genes.
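The two operations described above, propagating time and duplicating a row and column, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Algorithm 1: the fixed time step `dt`, the uniform rate distribution, and all function names are assumptions made for this example. The helper `is_additive` checks the 4-point condition, i.e., that the two largest of the three pairwise sums coincide for every quadruple.

```python
import random
from itertools import combinations

def evolve(d, rates, dt):
    """Propagate time by dt: each off-diagonal d[x][y] grows by (r_x + r_y) * dt."""
    n = len(d)
    return [[0.0 if x == y else d[x][y] + (rates[x] + rates[y]) * dt
             for y in range(n)] for x in range(n)]

def duplicate(d, x):
    """Insert an identical copy of gene x directly after x (copy row and column x)."""
    rows = [row[:x + 1] + [row[x]] + row[x + 1:] for row in d]
    return rows[:x + 1] + [rows[x][:]] + rows[x + 1:]

def simulate(n_genes, rng, dt=1.0):
    """Grow a cluster by tandem duplications; row order encodes genomic order."""
    d, rates = [[0.0]], [rng.uniform(0.5, 1.5)]    # assumed rate distribution
    while len(d) < n_genes:
        x = rng.randrange(len(d))                   # gene chosen for duplication
        d = duplicate(d, x)
        rates.insert(x + 1, rng.uniform(0.5, 1.5))  # new rate for the copy
        d = evolve(d, rates, dt)
    return d

def is_additive(d, tol=1e-9):
    """4-point condition: the two largest pairwise sums agree for every quadruple."""
    for p, q, r, s in combinations(range(len(d)), 4):
        sums = sorted([d[p][q] + d[r][s], d[p][r] + d[q][s], d[p][s] + d[q][r]])
        if sums[2] - sums[1] > tol:
            return False
    return True

matrix = simulate(7, random.Random(1))
```

Because the copy is inserted next to its parent, the index order of the resulting matrix is the simulated genomic order, which makes it straightforward to compare with orders inferred from the distances alone.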
Lemma 2
Every additive tree metric can be constructed by Algorithm 1.
Proof
If is an additive tree metric, then there is a unique additive tree T with edge lengths representing . Suppose for the moment that T is binary. Then it has at least one “cherry”, i.e., a pair of leaves separated by only a single interior vertex, say . It is easy to check that every cherry in T must satisfy
4 |
If is a cherry, then the distances in T from p and q to their last common ancestor are and , both of which are non-negative as a consequence of the triangle inequality. The reduced distance matrix on defined by for , represents T with the cherry replaced by its last common ancestor, hence it is again an additive distance matrix.
Repeating this construction we arrive at a single vertex after steps. Each step identifies a leaf p that is duplicated and the extensions and of p and its copy q. Note that we have set for all . This reflects that the stepwise elongation of the trees’ branches modeled in Algorithm 1 can be subdivided arbitrarily between duplication events that affect a particular branch. Here we simply choose to add the entire length immediately after each duplication event. Thus the construction in this proof backtraces a particular sequence of duplication events in Algorithm 1.
The case of non-binary trees is easily incorporated by observing that every such tree can be represented as a binary tree in which internal branches of length 0 are allowed.
Type R distance matrices
Construction and recognition
The model so far corresponds to a mechanism in which unequal crossing over occurs only between the genes of interest. We can, however, also model events in which the genes themselves are recombined. Instead of assuming that the newly introduced gene z is a true copy of x, we now assume that z is a recombinant of two adjacent genes x and y. The product is inserted between x and y.
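Since z is composed of two parts, of relative sizes $a$ and $1-a$, $0 \le a \le 1$, that are identical to x and y, respectively, we have
$$d_{xz} = (1-a)\,d_{xy}, \qquad d_{zy} = a\,d_{xy}, \qquad d_{zu} = a\,d_{xu} + (1-a)\,d_{yu} \ \text{ for } u \ne x, y, z \qquad (5)$$
After the duplication event, each gene evolves independently with its own rate, so that the genetic distance between p and q again grows by $\delta_p + \delta_q$, i.e.,
$$d'_{pq} = d_{pq} + \delta_p + \delta_q \ \text{ for } p \ne q \qquad (6)$$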
Definition 1
A distance matrix is of type R if it is constructed by repeated application of Eqs. (5) and (6).
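A single construction step can be sketched as follows. The sketch assumes that Eq. (5) has the convex-combination form suggested by the text (a fraction a of z is identical to x, the remainder to y) and that Eq. (6) adds a non-negative increment per gene; the function names and the numerical values are hypothetical.

```python
def recombine(d, x, a):
    """Insert recombinant z between the adjacent genes x and y = x + 1.

    Assumed form of Eq. (5): d(z,u) = a*d(x,u) + (1-a)*d(y,u), which also
    yields d(x,z) = (1-a)*d(x,y) and d(z,y) = a*d(x,y).
    """
    y = x + 1
    dz = [a * d[x][u] + (1.0 - a) * d[y][u] for u in range(len(d))]
    rows = [row[:y] + [dz[u]] + row[y:] for u, row in enumerate(d)]
    zrow = dz[:y] + [0.0] + dz[y:]
    return rows[:y] + [zrow] + rows[y:]

def diverge(d, delta):
    """Assumed form of Eq. (6): every off-diagonal d[p][q] grows by delta[p] + delta[q]."""
    n = len(d)
    return [[0.0 if p == q else d[p][q] + delta[p] + delta[q]
             for q in range(n)] for p in range(n)]

# Two ancestral genes at distance 4; z takes a quarter of its sequence from x.
step1 = recombine([[0.0, 4.0], [4.0, 0.0]], 0, 0.25)
step2 = diverge(step1, [0.5, 0.1, 0.5])
```

Setting `a` to 1 or 0 turns the step into a plain duplication of x or y, so the same routine covers the tree-like case.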
Clearly, every additive tree metric, and thus also every phylogenetic tree resulting from tandem duplications, is of type R by virtue of setting $a = 0$ (or $a = 1$) in every duplication step. In particular, therefore, for $|X| \le 3$ every distance matrix is of type R. For $|X| \ge 4$, however, it is not obvious whether a type R matrix can be recognized efficiently. The evolutionary history therefore must not include e.g. simultaneous duplications of two or more genes (as in the example of Fig. 3), or genome rearrangements. As we argued above, such events will not only violate the type R condition but will in general also interfere with the circular decomposability of the split system.
In order to characterize type R distances, we start by observing
7 |
since .
For , consider the following expression for .
8 |
The key observation is that this expression is independent of u. Thus, for , there are distinct leaves u, v distinct from so that , which can be rearranged as and hence, after a short calculation,
9 |
Note that this equation must be satisfied for all , hence it restricts the space of type R distance matrices to a submanifold for all .
Once a has been computed, f(a) can also be computed explicitly. Now consider the following system of equations
10 |
The first line uses the definition of f(a) above, the second and third line are rearrangements of and , resp. multiplying the second and third line by a and and adding up the three equations yields . We can now compute from
11 |
Finally, and are obtained from
12 |
In summary, therefore, we can obtain, for , complete information on the relative arrangement of the parents x and y and their recombinant offspring z. If or in Eq. (9) then z is a copy of x or y, resp. In this case we cannot determine from Eq. (11) since . By construction, however, we can just remove z from the matrix to obtain the ancestral state.
It remains to determine the values of for . This turns out to be not so trivial, since is, in contrast to , , and , not uniquely determined by the last unequal crossing over event in Gehring’s model.
To see this more clearly, let us first consider the case . It is well known that every metric on four points can be represented as a “box graph” as shown in Fig. 4. The box dimensions can be computed from and . The key ingredients thus are the three different pairs of distances emphasized by parentheses. For more details see Nieselt-Struwe (1997). Now let us start from an arbitrary distance matrix on and construct z as a recombinant. In the following, we will use abbreviations for the three pairs of distance sums, thus
13 |
Using the definitions of , , and we can compute
14 |
using again the triangle inequality. The terms and correspond to twice the sides of the box in the quadruple graph, shown in Fig. 4; note that they are independent of , , , and . We obtain a tree whenever the box degenerates to a line, i.e., if or .
In the general case, the length of the edge incident with leaf u becomes , where the minimum runs over all different from 0, since we have a box as in Fig. 4 for every quadruple of leaves. It follows that the contribution that measures the divergence of sequences between duplication events cannot be determined. Intuitively, this comes from the fact that distances are modified by contributions deriving from the independent evolution of two leaves. This term is added to after every duplication event. This contribution cannot be divided unambiguously between the individual steps, in complete analogy to the situation for additive tree metrics in the previous section.
Hence we can set for every and assume the entire length of stems from previous events. This yields the recursive Algorithm 2 for recognizing type R distance matrices. It requires O(|V|) decomposition steps, each of which needs in the worst case computations to identify the triple (x, y, z) corresponding to the last duplication event. Note that it suffices to consider . If or , then z was obtained as a faithful copy of x or y, resp., and hence it can just be dropped. If a candidate triple is found, the previous distance matrix is computed in quadratic time. Thus Algorithm 2 runs in time.
For the remaining distance matrix is represented by a unique box as in Fig. 4, which implies a unique circular order of the remaining four nodes, say u, x, y, z. The fourth node therefore must be the result of unequal crossing over of two nodes that are placed at diagonally opposite corners of the box. Therefore (u, y : x), (x, z : y), (y, u : z), and (z, x : u) are equivalent.
Linear type R matrices
Definition 2
A type R distance matrix is called linear (with order ) if, starting from , in each vertex addition step the two parents x and y are adjacent and their offspring z is placed between x and y.
Algorithm 2 identifies triples (x, y : z) so that z was obtained as a recombinant of x and y, i.e., that z is located between x and y together with a possible temporal order of these events. It is difficult in general to determine whether a linear order exists that is compatible with an arbitrary collection of betweenness triples: the so-called Betweenness Sorting Problem is NP complete (Opatrny 1979; Chor and Sudan 1998). Here, however, we have much more information. We call a type R matrix generic if for every z both parents are uniquely defined. We say that (u, v : w) is a successor of (x, y : z) if or . A triple without a successor is a leaf triple.
With a leaf triple (x, y : z) we can associate the path . If a triple (x, y : z) has only one successor, say , we set . If it has two successors, these are of the form and , and we set . This is, the paths corresponding to the two “intervals” and are joined at the common vertex z. By construction of type R matrices, each triple has at most one predecessor, hence the path is uniquely and completely defined for every triple. A triple (x, y : z) has no predecessor only if x and y are two of the three ancestral nodes. There are at most two such triples by construction of linear type R matrices, which necessarily have one node in common. The paths are joined at this common node. The type R matrix is linear if the final concatenation result is a single path, in which each node appears exactly once. By construction, z is located between x and y for all triples (x, y : z), i.e., the final path encodes the desired linear order of the nodes.
Representing the paths as lists, joining at their end points can be performed in constant time. Any triple (x, y : z) can be a left or right successor to another triple on (x, y), accept a left successor on (x, z), or accept a right successor on (z, y). For each triple, joining to already processed triples and/or generating references for later triples can be achieved in O(1) utilizing the tuple connectors of the triples themselves as keys in associative arrays (one per connection type), e.g. using a quadratic array or (sparse) hash-maps. The successor/predecessor relation between the O(n) triples can therefore be established in linear time if the triples that account for duplications are already known. Thus, linearity of a type R matrix can be checked in linear time (see Algorithm 3 in the Appendix).
This algorithm can also be extended to the non-generic case. Instances with or duplications result in (x : z) relations with unknown second flanking gene, which can cause several problems. While the algorithm above can always find one linear configuration, this is no longer unique in the non-generic case. Any pair obtained as “clones” from the same parent have no defined order among themselves, unless a later triple with can resolve it (see Fig. 5). Hence, the predecessor–successor relationship is no longer binary, but rather any gene might relate to an unlimited number of perfect copies. This requires careful indexing on individual genes, as listing gene tuples would create exponential growth of open references.
Let us now turn to the connection of type R matrices and circular orders.
Definition 3
A distance matrix $d$ satisfies the Kalmanson condition if there is a circular order of the points so that the inequality
$$\max\{d_{ij} + d_{kl},\; d_{il} + d_{jk}\} \le d_{ik} + d_{jl} \qquad (15)$$
holds for every four points so that $i < j < k < l$.
If satisfies Eq. (15) then the corresponding Travelling Salesman Problem (TSP) is solved by the unit permutations, i.e., (Kalmanson 1975). Equivalently, if is a circular ordering of the taxa set V and the permutation of V associated with an arbitrary linearization of , then is Kalmanson iff
16 |
for . In this case in Eq. (2) is a shortest Hamiltonian cycle for .
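For small examples the Kalmanson condition can be checked directly by testing all quadruples. The sketch below uses the standard form of the condition (for i < j < k < l the sum of the two "crossing" distances d_ik + d_jl is at least as large as each of the two other pairings); the function name and the toy data, points on a line with absolute differences as distances, are assumptions made for illustration.

```python
from itertools import combinations

def is_kalmanson(d, order, tol=1e-9):
    """Check the Kalmanson inequalities for all quadruples taken in the given order."""
    for i, j, k, l in combinations(order, 4):   # tuples keep the order's sequence
        cross = d[i][k] + d[j][l]
        if d[i][j] + d[k][l] > cross + tol:
            return False
        if d[i][l] + d[j][k] > cross + tol:
            return False
    return True

# Toy example: five points on a line, distances are absolute differences.
points = [0, 1, 3, 6, 10]
d = [[abs(p - q) for q in points] for p in points]
```

The check takes time proportional to the fourth power of the number of points, which is adequate for gene clusters with a few dozen members.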
With each circular ordering we can associate a set of splits, i.e., non-trivial bipartitions of the set X of taxa. if and only if (i) , (ii) , (iii) there is and so that (a) for all and holds and and (b) and . We write
17 |
with i, j taken for the splits of , where is again an arbitrary linearization of . A metric is called circular decomposable (Bandelt and Dress 1992) if there is a circular ordering (with a corresponding permutation ), and , so that
18 |
where the split pseudometric of a split is defined as 1 if the split separates x and y, and 0 otherwise. Such expressions are known as “Crofton formulas” (Chepoi and Fichet 1998). The isolation indices of the splits can be computed as
19 |
It is shown by Christopher et al. (1996) and Chepoi and Fichet (1998) that a metric satisfies the Kalmanson condition if and only if it is circular decomposable. These can be represented as so-called split graphs and computed efficiently using the NeighborNet algorithm (Bryant et al. 2004, 2007).
As shown in (Levy and Pachter 2011, Theorem 37) the solution of the TSP on a generic circular decomposable metric is unique. Thus, one can use the TSP solutions directly for finding circular orderings to be used in NeighborNet (Korostensky and Gonnet 2000; Bryant et al. 2004, 2007). Note that this is not true for the special case of additive tree metrics.
Theorem 2
Every linear type R distance matrix satisfies the Kalmanson condition.
Proof
We only need to show that the distance matrix on is Kalmanson provided the distance matrix on X is Kalmanson. Suppose z is the recombinant of j and . In the general case we have , since by circularity of the ordering it does not matter whether we duplicate i, j, k, or l. In addition to the general case we have to consider the special cases with and/or . The proof repeatedly makes use of the simple observation that .
We assume that the Kalmanson inequalities hold for all quadruples in X with an appropriate circular order. For the general case we have, by substituting the definition of the distances involving the recombinant vertex z,
In the fourth line we use that the Kalmanson inequality holds for and by assumption, the last line used the definition of . Analogous computations for the three special cases (omitting the analog of the second and third line above) yield:
We conclude that all quadruples involving z satisfy the Kalmanson inequality provided the distances form a Kalmanson metric on V: we have used the Kalmanson conditions for as well as the triangle inequality in our proof. As the distances that do not involve the new offspring z remain unchanged by the construction principle of type R matrices, we conclude that the distances on also satisfy the Kalmanson inequalities.
Robinsonian distances and Hamiltonian paths
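The basic idea of converting a TSP into a shortest Hamiltonian path problem is folklore. One simply adds a dummy node 0 between 1 and n with $d_{0i} = c$ for all $i$ and $c$ large enough. Then a shortest Hamiltonian path will use 0 as an endpoint to avoid using 2c in the solution. The resulting expanded distance matrix on $V \cup \{0\}$ is circular decomposable if and only if the Kalmanson conditions also hold for quadruples involving the dummy node, i.e., if and only if
$$\max\{d_{0i} + d_{jk},\; d_{0k} + d_{ij}\} \le d_{0j} + d_{ik} \qquad (20)$$
holds for all $0 < i < j < k$. Since $d_{0i} = d_{0j} = d_{0k} = c$ this simplifies to the condition
$$d_{ik} \ge \max\{d_{ij},\, d_{jk}\} \ \text{ for all } i < j < k \qquad (21)$$
A dissimilarity d is called Robinsonian if there is a permutation $\pi$ so that
$$d_{\pi(i)\pi(k)} \ge \max\{d_{\pi(i)\pi(j)},\, d_{\pi(j)\pi(k)}\} \ \text{ for all } i < j < k \qquad (22)$$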
The so-called serialization problem (Robinson 1951; Liiv 2010) of linearly ordering objects is solved by the order $\pi$ for Robinsonian dissimilarities. The following result appears to be folklore; we have not found a simple direct proof.
Lemma 3
If d is Robinsonian with respect to the order $\pi$, then $\pi$ is a shortest Hamiltonian path.
Proof
W.l.o.g. we assume . Consider an arbitrary permutation . Then there is a bijection between the adjacencies with respect to and the adjacencies with respect to so that . To see this we argue by induction. For the statement is trivial. In general is either (1) the extension of a permutation on by one of the adjacencies [1, n] or , or (2) is obtained by inserting n into the adjacency with and In case (1) is the extension of by or . In case (2) we obtain from by replacing with and adding . The Robinson condition (21) implies for and hence , i.e., is a shortest Hamiltonian path.
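The statement of Lemma 3 can be illustrated numerically on a small example. The sketch below assumes the usual reading of the Robinson condition (for positions i < j < k the outer distance is at least as large as both inner distances) and compares the path along the Robinson order with a brute-force minimum; the 4 × 4 matrix and the function names are made up for this illustration.

```python
from itertools import permutations

def is_robinsonian(d, order, tol=1e-9):
    """True if d(i,k) >= max(d(i,j), d(j,k)) for all positions i < j < k of the order."""
    m = [[d[a][b] for b in order] for a in order]
    n = len(m)
    return all(m[i][k] + tol >= max(m[i][j], m[j][k])
               for i in range(n)
               for j in range(i + 1, n)
               for k in range(j + 1, n))

def path_length(order, d):
    """Length of the Hamiltonian path that visits the points in the given order."""
    return sum(d[a][b] for a, b in zip(order, order[1:]))

# Hypothetical dissimilarity that is Robinsonian for the order 0, 1, 2, 3
# but is not simply a set of points on a line.
d = [[0, 2, 3, 5],
     [2, 0, 2, 4],
     [3, 2, 0, 3],
     [5, 4, 3, 0]]

identity = (0, 1, 2, 3)
shortest = min(path_length(p, d) for p in permutations(range(4)))
```

Here the path along the Robinson order has length 7, and no other order of the four points is shorter.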
The Robinson property also plays an important role in cluster analysis, where it characterizes certain generalizations of hierarchies (Diday 1986; Kleinman et al. 2013; Préa and Fortin 2014). So-called quadripolar Robinson dissimilarities that also satisfy the Kalmanson condition are studied in some detail by Critchley (1994).
Lemma 4
Suppose satisfies Eq. (21) on V. Then the distance matrix on obtained by inserting the recombinant node z between adjacent parents and also satisfies the Robinson condition Eq. (21).
Proof
Suppose is the new node derived from parents . Then for and we have and . Thus . The special case , yields: and thus . An analogous computation works for and . Finally, for and we have, by construction and .
It is important to note that the choice of can destroy the inequality: From we cannot conclude that . Hence, very uneven evolution rates or a mechanism that makes the “middle” genes in a gene cluster evolve much faster can destroy the betweenness conditions. The Robinson condition should be satisfied at least in very good approximation if the evolution rates of the offspring are not too different. Gene conversion, which effectively reduces distances, should make it even easier to satisfy Eq. (21).
Simulations and application to real-life data
Inference of gene order from distance data
The theory outlined above predicts that “well-behaved” gene clusters, i.e., those that (i) evolved by duplication of single genes only and (ii) did not experience rearrangements, should be Robinsonian. In other words, the shortest Hamiltonian path with respect to the genetic distances between its constituents should be co-linear with the genomic order. It is therefore of interest to study the length distribution of Hamiltonian paths. Associating a pseudo-energy $L(\pi)$ with a path/permutation $\pi$ we may construct a probabilistic model where $p(\pi) \propto \exp(-\beta L(\pi))$ with an “inverse temperature” parameter $\beta$. Höner zu Siederdissen et al. (2014) and Höner zu Siederdissen et al. (2015) showed that this model is tractable by a variation of the well-known exponential-time dynamic programming approach to the Travelling Salesman Problem (Bellman 1962). In brief, the ensemble (p, A, q) of paths starting in p, ending in q and running through all elements of A is of the form . Using a variant of algebraic dynamic programming on sets, this simple decomposition can be used to compute the posterior probabilities of adjacencies in the ensemble of Boltzmann-weighted paths as well as the posterior probabilities of vertices p and q to be endpoints of a Hamiltonian path. Further details on the method are discussed by Höner zu Siederdissen et al. (2014) and Höner zu Siederdissen et al. (2015). It is implemented in the Gene Cluster Evolution Determined Order software package Gene-CluEDO.
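For very small clusters the Boltzmann-weighted ensemble of Hamiltonian paths can be explored by plain enumeration. The sketch below is not the dynamic programming scheme of Gene-CluEDO; it is a brute-force stand-in that assumes path weights of the form exp(−β · length) and computes the posterior probability that two genes are adjacent. The function name and the example matrix are hypothetical.

```python
import math
from itertools import permutations

def adjacency_probabilities(d, beta):
    """Posterior adjacency probabilities under weights exp(-beta * path length).

    Brute-force enumeration over all n! directed paths; each undirected path is
    counted twice in both numerator and denominator, so the ratio is unaffected.
    """
    n = len(d)
    z = 0.0
    prob = [[0.0] * n for _ in range(n)]
    for order in permutations(range(n)):
        steps = list(zip(order, order[1:]))
        w = math.exp(-beta * sum(d[a][b] for a, b in steps))
        z += w
        for a, b in steps:
            prob[a][b] += w
            prob[b][a] += w
    return [[p / z for p in row] for row in prob]

# Hypothetical Robinsonian dissimilarity on four genes in genomic order 0..3.
d = [[0, 2, 3, 5],
     [2, 0, 2, 4],
     [3, 2, 0, 3],
     [5, 4, 3, 0]]

P = adjacency_probabilities(d, beta=5.0)
```

At large β the probability mass concentrates on the shortest path, so genomic neighbours obtain adjacency probabilities close to 1; at small β the ensemble approaches the uniform distribution over all paths. Every path has n − 1 adjacencies, so the probabilities of all unordered pairs always sum to n − 1.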
Since the genetic distance matrix is expected to have the Kalmanson properties the NeighborNet (Bryant et al. 2004, 2007) algorithm can be used as an alternative method to infer the expected gene order. The consistency theorem for NeighborNet (Bryant et al. 2004, 2007) in particular guarantees that the correct order will be obtained for ideal input data, i.e., input data that satisfies the Kalmanson condition. In practice, NeighborNet has turned out to be rather resilient to noise. Hence, it can be expected to produce good approximations to the gene order also for imperfect, noisy input data. Concurrence of Gene-CluEDO and NeighborNet can thus be used as support for the correctness of the reconstructed order, see Fig. 6.
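Whether a distance matrix satisfies the Kalmanson condition for a proposed circular order can be verified by brute force over all quadruples. The sketch below assumes the usual formulation: for any four elements i, j, k, l in circular order, both d(i,j) + d(k,l) and d(i,l) + d(j,k) are at most d(i,k) + d(j,l). The function name and tolerance are illustrative, and this check is independent of NeighborNet itself.

```python
from itertools import combinations

def is_kalmanson(d, order, tol=1e-9):
    """Check the Kalmanson inequalities for all quadruples along `order`."""
    for a, b, c, e in combinations(range(len(order)), 4):
        i, j, k, l = order[a], order[b], order[c], order[e]
        crossing = d[i][k] + d[j][l]  # the two "crossing" pairs
        if d[i][j] + d[k][l] > crossing + tol:
            return False
        if d[i][l] + d[j][k] > crossing + tol:
            return False
    return True

# Toy example: a line metric is Kalmanson in its natural order.
pos = [0.0, 1.0, 2.5, 4.0]
d = [[abs(p - q) for q in pos] for p in pos]
print(is_kalmanson(d, [0, 1, 2, 3]))
print(is_kalmanson(d, [0, 2, 1, 3]))
```

For noisy empirical distances the tolerance would have to be chosen relative to the scale of the data.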
Simple simulation of gene cluster evolution
In order to test whether sequence evolution indeed approximates type R distances we generated artificial amino acid sequence data starting from a random initial sequence of length N. For the data reported here we use and a uniform distribution of the 21 amino acids (including selenocysteine). The initial sequence is copied identically; then both copies are independently mutated with a position-wise rate to generate the two initial parents. Subsequently, recombinant offspring are produced in an iterative fashion.
In each step, first a recombinant sequence z is produced from two adjacent parents x and y so that z is placed between x and y. To model unequal crossing over in Gehring’s model we randomly choose a breakpoint position k and produce z as a concatenation of y[1, k] and . In the first step, the initial sequence is simply copied. We also consider the case where the breakpoint is outside the “gene”, i.e., instead of producing a recombinant sequence z we use a copy of x or y with probability . If , we obtain the limit of tree-like evolution. The second part of each iteration step consists of independent mutations applied to all sequences. To this end, we replace with probability the amino acid in each sequence position by a randomly chosen alternative. The per site mutation rate must be chosen large enough to ensure a measurable divergence in each step. On the other hand, the sequence divergence should not saturate after n duplication-mutation steps, i.e., the expected total number of mutations per site should not substantially exceed 1. Thus .
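The simulation procedure described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for the reported data. The adjacent parent pair is drawn uniformly at random, the recombinant takes its prefix from y and its suffix from x, and the names `mu` (per-site mutation probability) and `p_copy` (probability of an intergenic breakpoint) are hypothetical.

```python
import random

# 20 standard amino acids plus U for selenocysteine
ALPHABET = "ACDEFGHIKLMNPQRSTVWYU"

def mutate(seq, mu, rng):
    """Replace each residue with probability mu by a different random residue."""
    out = []
    for aa in seq:
        if rng.random() < mu:
            aa = rng.choice([b for b in ALPHABET if b != aa])
        out.append(aa)
    return "".join(out)

def simulate_cluster(n_steps, length, mu, p_copy, rng):
    """Grow a gene cluster by repeated duplication and mutation rounds."""
    ancestor = "".join(rng.choice(ALPHABET) for _ in range(length))
    # Two independently mutated copies of the ancestor are the initial parents.
    cluster = [mutate(ancestor, mu, rng), mutate(ancestor, mu, rng)]
    for _ in range(n_steps):
        i = rng.randrange(len(cluster) - 1)  # assumption: uniform choice of pair
        x, y = cluster[i], cluster[i + 1]
        if rng.random() < p_copy:
            z = rng.choice([x, y])           # breakpoint outside the gene: plain copy
        else:
            k = rng.randrange(1, length)     # breakpoint inside the gene
            z = y[:k] + x[k:]                # assumption: prefix of y, suffix of x
        cluster.insert(i + 1, z)             # offspring is placed between its parents
        cluster = [mutate(s, mu, rng) for s in cluster]
    return cluster

cluster = simulate_cluster(n_steps=5, length=50, mu=0.02, p_copy=0.0,
                           rng=random.Random(0))
print(len(cluster))
```

Setting `p_copy=1.0` gives the tree-like limit, since every new gene is then an unrecombined copy of one neighbour.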
Since we do not simulate insertions and deletions, the sequences are already properly aligned. In order to obtain an approximately additive distance matrix from the simulated sequences we use the Jukes–Cantor transformation (Jukes and Cantor 1969) to account for multiple mutations hitting the same site. We used emboss 6.6. (Rice et al. 2000) for this purpose. Fig. 6 shows data for a simulation with only local gene duplications in (a) and with unequal crossing over in Gehring’s model in each step in (b)–(e).
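As an illustration of the distance correction, the sketch below computes the fraction of differing sites between two aligned sequences and applies the Jukes–Cantor formula generalized to an alphabet of k states, d = -b ln(1 - p/b) with b = (k - 1)/k. The default k = 20 is an assumption made for protein data. The function names are illustrative and the code does not claim to reproduce the exact computation performed by emboss.

```python
import math

def p_distance(a, b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p, n_states=20):
    """Jukes-Cantor corrected distance for an alphabet with n_states letters."""
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf  # saturated: the correction is undefined
    return -b * math.log(1.0 - p / b)

p = p_distance("ACDEFGHIKL", "ACDEFGHIKV")
print(p, jukes_cantor(p))
```

The corrected distance always exceeds the raw fraction of differences, and the gap grows as the sequences approach saturation.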
For tree-like evolution, neither the order reconstructed by Gene-CluEDO nor the circular order inferred by NeighborNet matches the gene order in the cluster. The reason is that in this case many orders, namely all outlines of any planar embedding of the tree, fit the data equally well. The simulated sequence data by construction contain stochastic noise that breaks this symmetry in a random manner. More precisely, distances empirically inferred from sequences will satisfy the equality in Eq. (3) only approximately. As a consequence, the tree edge belonging to the split xy|uv will be expanded to a narrow box as in Fig. 4. It is completely up to the noise whether the second split is xu|yv or xv|yu, and thus whether the circular order is x, u, v, y or x, v, u, y.
In contrast, both the Gene-CluEDO order and the NeighborNet circular order reproduce the gene order in the cluster in the vast majority of simulations with unequal crossing over in Gehring’s model. The choice of the mutation rates makes little difference as long as the genetic distances between the sequences are not saturated.
An exception is Fig. 6c, where NeighborNet “misplaces” sequence 1. A detailed analysis of the data shows that both 3 and 9 are unequal crossing over products involving 1; by chance, however, the breakpoint was located so that only a tiny fraction of 1 was included in 3 and 9. The example thus contains an “almost tree-like” step, which does not retain sufficient ordering information.
Analysis of gene clusters
Pairwise distances
In the following we illustrate the application of the theoretical results to the analysis of several gene clusters. To this end, we retrieved the amino acid sequence data of the annotated proteins from the NCBI database, constructed and—where necessary—manually curated sequence alignments, and used these to compute the matrices of pairwise genetic distances that are taken as input by both Gene-CluEDO and NeighborNet. Details on the data sources are compiled in the Online Supplement.
Multiple sequence alignments were computed with T-Coffee (Notredame et al. 2000). Since highly variable regions in the proteins mostly introduce noise into the alignment and the subsequent reconstruction of the phylogenetic network, we removed highly variable alignment columns using noisy (Dress et al. 2008). From the processed alignment we then computed the evolutionary distances interpreting gap characters as additional characters. The resulting raw distances are transformed into evolutionary distances using the Jukes–Cantor correction (Jukes and Cantor 1969). For the lancelet Hox cluster we obtained an extremely gap-rich alignment. We therefore constructed an alternative alignment using the block-based dialign approach (Al Ait et al. 2013), which identifies a chain of significant local alignments. We retained only the alignment blocks with a non-zero significance score.
Hox gene cluster
We already showed in previous work (Höner zu Siederdissen et al. 2015) that the Hamiltonian path method implemented in Gene-CluEDO can be applied to investigating the ancient evolution of Hox gene clusters. Cephalochordates harbour the largest known single Hox gene clusters, comprising 15 members (Pascual-Anaya et al. 2012). The Hox gene clusters are known to have expanded independently in the major deuterostome lineages (Pascual-Anaya et al. 2013), making them a particularly interesting model system for testing Gehring’s model. The results of this analysis are shown in Fig. 7. Overall, the amphioxus cluster behaves as expected. In line with the analysis of Hox clusters from the coelacanth (Höner zu Siederdissen et al. 2015), both Gene-CluEDO and NeighborNet reproduce the genomic arrangement. There are a few notable deviations: Both methods report a reversed ordering of HOX1 and HOX2. A blastp search, however, confirmed that the sequences of these two genes unambiguously belong to the HOX1 and HOX2 paralog groups that are present in all deuterostomes. We suspect that adaptive evolution of one of these genes may be responsible for the observed discrepancy. NeighborNet also shows HOX11 and HOX12 in reverse order. The splits involved in establishing this ordering have very small weights, however, suggesting that this reversal is not significant.
We conclude, therefore, that the evolution of the HOX gene cluster most likely followed Gehring’s model. Another aspect supporting this conclusion is the placement of splits in the network created by NeighborNet. The genes are placed in a nearly perfect circle around the center of the network. Comparing its topology to the topologies of the clusters created by simulating Gehring’s model, we can see high similarity in the network structures (see Fig. 6). The source data can be seen in Supplemental Table 1.
PSG gene cluster
The pregnancy-specific glycoproteins (PSG) play an important role in the immune system during pregnancy (Chang et al. 2013). They form a well-defined subfamily of the carcinoembryonic antigen (CEA) gene family, which in turn belongs to the immunoglobulin gene superfamily. The PSG family forms a cluster that has independently expanded in some mammalian lineages, most prominently in rodents and primates. Here we analyzed the human PSG gene cluster, which contains ten PSG genes. Five CEACAM pseudogenes are interspersed in the cluster. The results of this analysis are shown in Fig. 8.
The data show two remarkable properties. First, consistent with evolutionarily recent duplications, the PSG genes are very similar to each other. Second, the orders inferred with Gene-CluEDO and NeighborNet do not match the genomic order. In fact only three (Gene-CluEDO) or four (NeighborNet) genes appear in the order of their genomic positions. The data are therefore not consistent with the prediction from Gehring’s model.
Two aspects provide possible explanations. Zid and Drouin (2013) proposed that gene conversions in the primate PSG gene cluster are under purifying selection. Chang et al. (2013) proposed that a high number of unequal crossing over events had occurred in primate evolution. A very large number of duplicates, however, may reduce the selection pressure on single gene copies such that gene loss is no longer lethal. This may lead to missing genes and to large differences in evolution rates of individual copies. The latter may account for a violation of the Robinson property, and thus for deviations between the observed genomic gene order and the order inferred by Gene-CluEDO from the genetic distances. An observation that supports these explanations is that PSG11 and PSG2 stand out amongst the other genes as relatively divergent (see NeighborNet plot). Possibly genes that could close this gap were lost due to unequal crossing over. The source data can be seen in Supplemental Table 2.
α-Rhox gene cluster
The Rhox genes (MacLean and Wilkinson 2010) are expressed both during embryogenesis and in adult reproductive tissues. In the mouse they are located in a single cluster on the X chromosome comprising 33 genes in three subclusters (α, β, and γ). The Rhox cluster is notable for its unusually rapid evolution. Here we included 23 well annotated genes of the α-Rhox cluster, after removing the pseudogene rHox3d, the highly diverged rHox1 sequence, as well as rHox3b, for which no translation is reported in the NCBI database.
Figure 9 shows that the data set is divided into three groups. All rHox2 genes are in one group (left), all rHox3 genes form the second group (bottom) and all rHox4 genes build the third group (top right). These groups are clearly separated from each other. The α-Rhox gene cluster clearly has not evolved conforming to Gehring’s model. As described e.g. by MacLean et al. (2006), the basic unit of tandem duplications is a block comprising an rHox2, rHox3, and rHox4 gene. Subsequent gene losses further restructured the cluster. In addition the cluster was subject to an inversion. Our analysis does not contradict this scenario. The source data can be seen in Supplemental Table 3.
ADH gene cluster
The alcohol dehydrogenase (ADH) family exists in a wide range of taxa, from bacteria to plants and humans (Oota et al. 2007). Their main function in animals is to break down alcohols that are otherwise toxic. Most members of this gene family appear in a well-studied gene cluster. The human ADH gene cluster comprises seven genes, one each belonging to classes 2–5 as well as three paralogs of class 1 ADHs. Here, we find three elements in the cluster that also group together in the results of Gene-CluEDO and NeighborNet, shown in Fig. 10.
As the genes are relatively similar to each other, genetic distances are small. The reconstructed cycle order inferred with both Gene-CluEDO and NeighborNet is the same as the genomic gene order. Gene-CluEDO identified ADH1A and ADH6 as the extreme ends in terms of genetic distance. These two genes are located adjacent to each other in the middle of the cluster. This may be an artefact of the small distances, since ADH5 and ADH7, for instance, have more or less the same distance to the split point inferred by Gene-CluEDO.
Our analysis thus suggests that the cluster evolved in line with Gehring’s model. The order is perfectly reconstructed. Based on the observation that different exons of the genes resulted in different maximum parsimony trees, Oota et al. (2007) argued that the ADH1 genes have not been subject to gene conversion. This observation is also consistent with the assumption of unequal crossing over within the gene as the mechanism underlying the duplications: in this scenario, duplicate genes are composed of parts of two distinct genes with different evolutionary histories. Gene duplication following Gehring’s model therefore provides an explanation for the differences in exon-specific tree reconstructions as observed for ADH gene clusters. The source data can be seen in Supplemental Table 4.
Conclusions
In this contribution we have investigated in some detail a model of gene cluster evolution that goes beyond identical tandem copies. Based on Walter Gehring’s ideas, we saw that unequal crossing over events produce genes that are hybrids of their adjacent genes. The distances between the members of a gene cluster therefore are not expected to be tree-like. Instead they form a distinctive subclass of circular decomposable (Kalmanson) distances, which we have termed here type R. As a consequence, the genomic gene order matches the circular order associated with the Kalmanson-type genetic distance matrix. The NeighborNet algorithm (Bryant et al. 2004), a commonly used tool for the inference of phylogenetic networks, readily infers this order. This provides a simple method to check whether a gene cluster evolves according to Gehring’s model or not. To better characterize type R distances, we showed that they are recognizable in polynomial time and that the sequence of unequal crossing over events can be inferred from a given type R distance matrix.
Additive tree metrics, which arise if the crossing over breakpoints are located between genes, are a special case of type R distances. In this case, the circular order is ambiguous since an arbitrary decision can be made at each interior vertex of the phylogenetic tree. More precisely, all planar embeddings of the phylogenetic tree yield a valid circular order.
The genetic distances of gene clusters evolving according to Gehring’s model of unequal crossing over within genes also satisfy the Robinson condition, at least as long as selective pressures and thus evolutionary rates on paralogous members are not too different. This implies that shortest Hamiltonian paths with respect to the genetic distance should be co-linear with the genomic order of genes. Numerical simulations show that this type of co-linearity can be used to distinguish clusters that evolve through unequal crossing over within genes from clusters where unequal crossing over occurs (mostly) between genes. The tree-like evolution in the latter case yields equivalent solutions of the shortest Hamiltonian path problem, again corresponding to arbitrary planar embeddings of the tree. Small amounts of noise in the data then typically yield optimal solutions that differ substantially from co-linearity with the genomic arrangement.
We tested these ideas using well-studied gene clusters as examples. The Hox cluster of the lancelet, for instance, essentially follows Gehring’s paradigm. This is also true to a certain extent for the ADH gene cluster. Other clusters, such as the cluster of rodent Rhox genes or the PSG immunoglobulins, however, show little or no indication of unequal crossing over within genes, and drastic deviations from co-linearity between gene orders inferred from genetic distances and their actual genomic arrangements.
The work presented here focused on the mathematical foundations and the demonstration that genetic distance matrices are informative about the mode of gene cluster evolution. Several open problems remain, in particular related to practical applications. The recognition algorithm (Algorithm 2) requires an exact type R structure. Since the conditions for a metric to be type R involve equalities, an empirically determined distance matrix generically will not be type R due to noise. This raises the question of how a best-fitting type R matrix can be identified, and how the deviation from a type R matrix should be quantified most appropriately. Together with the approximation of a type R matrix it would be useful to compute the most likely sequence of unequal crossing over events.
If a gene cluster evolves according to Gehring’s model with appreciable levels of gene conversion, we expect that values of a inferred from Eq. (9) are typically bounded away from 0 or 1. This implies that the type R matrix should be decidedly non-tree like. It remains a question for future work how the distribution of a values relates to well-established measures of deviations from tree-likeness, such as the parameters proposed by statistical geometry (Nieselt-Struwe 1997). Similarly, it remains an open problem how to properly quantify the deviation of a given metric from type R structure. As with circular decomposable metrics, there does not seem to be an easy-to-compute measure. The split-prime part of the metric, i.e., the unique component that is not decomposable in split metrics (Bandelt and Dress 1992), might serve at least as a first approximation for this purpose. These issues, however, extend beyond the scope of the present contribution and will require a more systematic analysis of a larger number of gene clusters.
In this contribution we have considered only the special case in which unequal crossing over is restricted to adjacent genes. This assumption does not cover all cases of biological interest, as the case of the Rhox cluster shows: there, the unit of duplication is a sequence of three genes. It will be interesting to see whether unequal crossing over events that lead to the duplication of larger subclusters lead to similar mathematical structures, and whether such events could be inferred from a careful analysis of the genetic distance matrix.
Acknowledgements
Open access funding provided by Max Planck Society. SJP gratefully acknowledges a series of conversations with Walter Gehring that sparked the idea for this project, which unfortunately was realized only after he passed away. The simulations and the analysis of real-life gene clusters reported here were the topic of a bioinformatics computer lab course at U. Leipzig in the winter term 2016/17. The following students contributed their observations: Adarelys Andrades, Yves Annanias, Marius Brunnert, Alexander Engler, Maik Fröbe, Christian Heide, Felix Helfer, Ulrike Klotz, Stefan Krämer, Sebastian Luhnburg, Florian Mäschle, Markus Michaelis, Michael Rode, Jeremias Schebera, Alexander Scholz, Stephan Thönes, Kathleen Wende, Marcel Winter, Jan Witte, Sophie Wolf, Anastasia Wolschewski. Part of this work was funded by the German Federal Ministry of Education and Research within the Project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B) and the Deutsche Forschungsgemeinschaft (DFG STA 850/19-2 within SPP 1738).
Predecessor relation of crossover events
Algorithm 3 utilizes the associative properties of gene identifiers, which allow constant-time mapping between pairs of genes and recombinant triples. If triples are derived from a linear type R matrix, they form a natural binary tree that is to be established by the algorithm, where each triple (x, y : z) can be a left or right successor to another triple on (x, y), accept a left successor on (x, z), or accept a right successor on (z, y).
Each triple is added in turn, checking for connections to already added triples using associative arrays (maps) for each connection type. If a connected triple was already added, an open entry is found in the corresponding map; otherwise a new entry is added to the corresponding inverse map. For instance, an added left successor needs to look for an open predecessor. If a single tree was created, the linear order of genes can be found by traversing the tree. If no linear order exists, multiple trees will be created, as necessary connectors are either never added to an open map or have been removed, since entries in open maps are only used for a single connection.
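The relation between recombinant triples and the linear gene order can be illustrated with a simplified sketch. It is not Algorithm 3 itself. Instead of maintaining open maps incrementally, it stores all triples in one dictionary keyed by the parent pair and then performs the in-order traversal. The representation of a triple (x, y : z) as a Python tuple `(x, y, z)` with x to the left of y, and the function name, are assumptions made for illustration.

```python
def linear_order(triples):
    """Return the linear gene order implied by the triples, or None if the
    triples do not form a single binary tree."""
    child = {}
    for x, y, z in triples:
        if (x, y) in child:
            raise ValueError("two triples on the same parent pair")
        child[(x, y)] = z
    # Pairs on which a successor triple may attach: (x, z) left, (z, y) right.
    created = set()
    for (x, y), z in child.items():
        created.add((x, z))
        created.add((z, y))
    roots = [pair for pair in child if pair not in created]
    if len(roots) != 1:
        return None  # several trees (or none): no single linear order
    visited = 0

    def inner(a, b):
        """Genes strictly between a and b, in order (in-order traversal)."""
        nonlocal visited
        z = child.get((a, b))
        if z is None:
            return []
        visited += 1
        if visited > len(child):
            raise ValueError("cyclic triples")
        return inner(a, z) + [z] + inner(z, b)

    x, y = roots[0]
    order = [x] + inner(x, y) + [y]
    return order if visited == len(child) else None

# c arose between a and b; then d between a and c, and e between c and b.
order = linear_order([("a", "b", "c"), ("a", "c", "d"), ("c", "b", "e")])
print(order)
```

When the triples fall apart into several trees the function returns `None`, which corresponds to the case in which no linear order exists.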
Data sources for the analysis of gene clusters
Data sources relating to the gene clusters analyzed in this contribution are listed in the Electronic Supplemental Material. In addition, machine readable data, including alignments of all sequences used to compute genetic distances, are compiled for download at http://www.bioinf.uni-leipzig.de/publications/supplements/17-012.
Footnotes
Sources: http://hackage.haskell.org/package/Gene-CluEDO; binaries: https://github.com/choener/Gene-CluEDO/releases.
Contributor Information
Sonja J. Prohaska, Email: sonja@bioinf.uni-leipzig.de
Peter F. Stadler, Email: studla@bioinf.uni-leipzig.de
References
- Al Ait L, Yamak Z, Morgenstern B. DIALIGN at GOBICS—multiple sequence alignment using various sources of external information. Nucleic Acids Res. 2013;41:W3–W7. doi: 10.1093/nar/gkt283.
- Bandelt HJ, Dress AWM. A canonical decomposition theory for metrics on a finite set. Adv Math. 1992;92:47. doi: 10.1016/0001-8708(92)90061-O.
- Bellman R. Dynamic programming treatment of the travelling salesman problem. J ACM. 1962;9:61–63. doi: 10.1145/321105.321111.
- Bryant D, Moulton V, Spillner A. NeighborNet: an agglomerative method for the construction of planar phylogenetic networks. Mol Biol Evol. 2004;21:255–265. doi: 10.1093/molbev/msh018.
- Bryant D, Moulton V, Spillner A. Consistency of the NeighborNet algorithm. Alg Mol Biol. 2007;2:8. doi: 10.1186/1748-7188-2-8.
- Buneman P. A note on the metric property of trees. J Comb Theory Ser B. 1974;17:48–50. doi: 10.1016/0095-8956(74)90047-1.
- Carson AR, Scherer SW. Identifying concerted evolution and gene conversion in mammalian gene pairs lasting over 100 million years. BMC Evol Biol. 2009;9:156. doi: 10.1186/1471-2148-9-156.
- Chang CL, Semyonov J, Cheng PJ, Huang SY, Park JI, Tsai HJ, Lin CY, Grützner F, Soong YK, Cai JJ, et al. Widespread divergence of the CEACAM/PSG genes in vertebrates and humans suggests sensitivity to selection. PLoS ONE. 2013;8:e61701. doi: 10.1371/journal.pone.0061701.
- Chepoi V, Fichet B. A note on circular decomposable metrics. Geom Dedic. 1998;69:237–240. doi: 10.1023/A:1004907919611.
- Chor B, Sudan M. A geometric approach to betweenness. SIAM J Discrete Math. 1998;11:511–523. doi: 10.1137/S0895480195296221.
- Christopher G, Farach M, Trick M (1996) The structure of circular decomposable metrics. In: Diaz J, Serna M (eds) Algorithms ESA’96, Lecture notes in computer science. Springer, New York, pp 406–418
- Critchley F. On quadripolar Robinson dissimilarity matrices. In: Diday E, Lechevallier Y, Schader M, Bertrand P, Burtschy B, editors. New approaches in classification and data analysis. Heidelberg: Springer; 1994. pp. 93–101.
- Cunningham P. Free trees and bidirectional trees as representations of psychological distance. J Math Psychol. 1978;17:165–188. doi: 10.1016/0022-2496(78)90029-9.
- Diday E. Orders and overlapping clusters in pyramids. In: De Leeuw J, Heiser WJ, Meulman JJ, Critchley F, editors. Multidimensional data analysis. Leiden: DSWO Press; 1986. pp. 201–234.
- Dobson AJ. Unrooted trees for numerical taxonomy. J Appl Probab. 1974;11:32–42. doi: 10.2307/3212580.
- Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF. Noisy: identification of problematic columns in multiple sequence alignments. Alg Mol Biol. 2008;3:7. doi: 10.1186/1748-7188-3-7.
- Dress AWM, Huber KT, Moulton V. An exceptional split geometry. Ann Comb. 2000;4:1–11. doi: 10.1007/PL00001271.
- Farach M. Recognizing circular decomposable metrics. J Comput Biol. 1997;4:157–162. doi: 10.1089/cmb.1997.4.157.
- Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999;151:1531–1545. doi: 10.1093/genetics/151.4.1531.
- Garcia-Fernàndez J. The genesis and evolution of homeobox gene clusters. Nat Rev Genet. 2005;6:881–892. doi: 10.1038/nrg1723.
- Gehring WJ. Master control genes in development and evolution: the homeobox story. New Haven: Yale University Press; 1998.
- Grünewald S, Moulton V, Spillner A. Consistency of the QNet algorithm for generating planar split networks from weighted quartets. Discrete Appl Math. 2009;157:2325–2334. doi: 10.1016/j.dam.2008.06.038.
- Grünewald S, Forslund K, Dress AWM, Moulton V. QNet: an agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol. 2007;24:532–538. doi: 10.1093/molbev/msl180.
- Halin R. Studies on minimally n-connected graphs. In: Welsh DJA, editor. Combinatorial mathematics and its applications. London: Academic; 1971. pp. 129–136.
- Hardison R, Slightom JL, Gumucio DL, Goodman M, Stojanovic N, Miller W. Locus control regions of mammalian β-globin gene clusters: combining phylogenetic analyses and experimental results to gain functional insights. Gene. 1997;205:73–94. doi: 10.1016/S0378-1119(97)00474-5.
- Höner zu Siederdissen C, Prohaska SJ, Stadler PF (2014) Dynamic programming for set data types. In: Campos S (ed) Advances in bioinformatics and computational biology: BSB 2014, vol 8826 of Lect. Notes Comp. Sci., pp 57–64
- Höner zu Siederdissen C, Prohaska SJ, Stadler PF. Algebraic dynamic programming over general data structures. BMC Bioinform. 2015;16(19):S2. doi: 10.1186/1471-2105-16-S19-S2.
- Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, editor. Mammalian protein metabolism. New York: Academic; 1969. pp. 21–132.
- Kalmanson K. Edgeconvex circuits and the traveling salesman problem. Can J Math. 1975;27:1000–1010. doi: 10.4153/CJM-1975-104-6.
- Kleinman A, Harel M, Pachter L. Affine and projective tree metric theorems. Ann Comb. 2013;17:205–228. doi: 10.1007/s00026-012-0173-2.
- Korostensky C, Gonnet G. Using traveling salesman problem algorithms for evolutionary tree construction. Bioinformatics. 2000;16:619–627. doi: 10.1093/bioinformatics/16.7.619.
- Levy D, Pachter L. The neighbor-net algorithm. Adv Appl Math. 2011;47:240–258. doi: 10.1016/j.aam.2010.09.002.
- Liiv I. Seriation and matrix reordering methods: an historical overview. Stat Anal Data Min. 2010;3:70–91.
- MacLean JA, II, Wilkinson MF. The Rhox genes. Reproduction. 2010;140:195–213. doi: 10.1530/REP-10-0100.
- MacLean JA, Lorenzetti D, Hu Z, Salerno WJ, Miller J, Wilkinson MF. Rhox homeobox gene cluster: recent duplication of three family members. Genesis. 2006;44:122–129. doi: 10.1002/gene.20193.
- Makarychev Y. A short proof of Kuratowski’s graph planarity criterion. J Graph Theory. 1997;25:129–131. doi: 10.1002/(SICI)1097-0118(199706)25:2<129::AID-JGT4>3.0.CO;2-O.
- Maniatis T, Fritsch EF, Lauer J, Lawn RM. The molecular genetics of human hemoglobins. Ann Rev Genet. 1980;14:145–178. doi: 10.1146/annurev.ge.14.120180.001045.
- Megiddo N. Partial and complete cyclic orders. Bull Am Math Soc. 1976;82:274–276. doi: 10.1090/S0002-9904-1976-14020-7.
- Montavon T, Duboule D. Chromatin organization and global regulation of Hox gene clusters. Phil Trans R Soc B. 2013;368:20120367. doi: 10.1098/rstb.2012.0367.
- Moret BME, Tang J, Wang LS, Warnow T. Steps toward accurate reconstructions of phylogenies from gene-order data. J Comp Syst Sci. 2002;65:508–525. doi: 10.1016/S0022-0000(02)00007-7.
- Nei M. Genetic distance between populations. Am Nat. 1972;106:283–292. doi: 10.1086/282771.
- Nieselt-Struwe K. Graphs in sequence spaces: a review of statistical geometry. Biophys Chem. 1997;66:111–131. doi: 10.1016/S0301-4622(97)00064-1.
- Noonan JP, Grimwood J, Schmutz J, Dickson M, Myers RM. Gene conversion and the evolution of protocadherin gene cluster diversity. Genome Res. 2004;14:354–366. doi: 10.1101/gr.2133704.
- Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042.
- Novák V. Cuts in cyclically ordered sets. Czech Math J. 1984;34:322–333.
- Ohno S. Evolution by gene duplication. Berlin: Springer; 1970.
- Oota H, Dunn CW, Speed WC, Pakstis AJ, Palmatier MA, Kidd JR, Kidd KK. Conservative evolution in duplicated genes of the primate class I ADH cluster. Gene. 2007;392:64–76. doi: 10.1016/j.gene.2006.11.008.
- Opatrny J. Total ordering problem. SIAM J Comput. 1979;8:111–114. doi: 10.1137/0208008.
- Pascual-Anaya J, Adachi N, Álvarez S, Kuratani S, D’Aniello S, Garcia-Fernàndez J. Broken colinearity of the amphioxus Hox cluster. EvoDevo. 2012;3:28. doi: 10.1186/2041-9139-3-28.
- Pascual-Anaya J, D’Aniello S, Kuratani S, Garcia-Fernàndez J. Evolution of Hox gene clusters in deuterostomes. BMC Dev Biol. 2013;13:26. doi: 10.1186/1471-213X-13-26.
- Préa P, Fortin D. An optimal algorithm to recognize Robinsonian dissimilarities. J Classif. 2014;31:1–35. doi: 10.1007/s00357-014-9152-0.
- Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. doi: 10.1016/S0168-9525(00)02024-2.
- Robinson WS. A method for chronologically ordering archaeological deposits. Am Antiq. 1951;16:293–301. doi: 10.2307/276978.
- Semple C, Steel MA. Phylogenetics. Oxford: Oxford University Press on Demand; 2003.
- Simões-Pereira JMS. A note on the tree realizability of a distance matrix. J Combin Theory. 1969;6:303–310. doi: 10.1016/S0021-9800(69)80092-X.
- Zid M, Drouin G. Gene conversions are under purifying selection in the carcinoembryonic antigen immunoglobulin gene families of primates. Genomics. 2013;102:301–309. doi: 10.1016/j.ygeno.2013.07.003.