Abstract
Evolutionary processes have been described not only in biology but also for a wide range of human cultural activities including languages and law. In contrast to the evolution of DNA or protein sequences, the detailed mechanisms giving rise to the observed evolution-like processes are not or only partially known. The absence of a mechanistic model of evolution implies that it remains unknown how the distances between different taxa have to be quantified. Considering distortions of metric distances, we first show that poor choices of the distance measure can lead to incorrect phylogenetic trees. Based on the well-known fact that phylogenetic inference requires additive metrics, we then show that the correct phylogeny can be computed from a distance matrix $\mathbf{D}$ if there is a monotonic, subadditive function $\zeta$ such that $\zeta^{-1}(\mathbf{D})$ is additive. The required metric-preserving transformation $\zeta$ can be computed as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process remains elusive.
Keywords: Cultural evolution, Phylogenetic tree, Additive metric, Metric-preserving functions
Introduction
At the most abstract level, evolution can be seen as a consequence of the generation of variation and selection. Since selection acts to remove entities from the system, the system will eventually “die out” unless selection is counteracted by some form of reproduction. Sustained evolution thus necessarily operates on populations of entities. The history of an evolutionary process can be recorded in the form of a directed graph: Dress et al. (2010b) considered the set comprising “all organisms that ever lived on earth” arranged into a graph $G$ with arcs (directed edges) connecting a node $x$ to a node $y$ whenever $x$ was a “parent” of $y$, defined in a rather loose sense as having contributed directly to the genetic make-up of $y$. These arcs encode not only father and mother in sexually reproducing populations, but also horizontal gene transfer, hybridization, the incorporation of retroviruses into the genome, etc. Since arcs encode ancestry, $G$ is acyclic.
The very same construction applies to many other systems that are perceived as evolutionary. For example, in the evolution of languages one may consider the mutual influences of speakers or, even more fine grained, individual utterances as the basic entities (Croft 2000; Pagel 2009). The same is true for the transmission of cultural techniques, designs, and conventions (Mesoudi et al. 2006). Well-studied cases include the transmission of texts (Greg 1950), in particular manuscripts, and text reuse, i.e., the borrowing of parts of a corpus, with or without modifications, in the process of creating a new text, see, e.g., Seo and Croft (2008). Similarly, the revisions of the law as dissenting interpretations can be seen in this manner (Roe 1996). The common ground of these and presumably many other systems is that a limited set of entities at some point or interval in time “informs” limited sets of entities in their (usually immediate) future.
The key result of Dress et al. (2010b) is that several types of clusters on the subset of organisms that are currently alive can be defined from the structure of the graph $G$. Many of these form hierarchies and therefore define a tree. These clusters naturally take on the role of taxa, and the corresponding trees consequently are meaningful representations of the phylogenetic relationships among these taxa. The same interpretation is meaningful, as we argued above, also for many (but presumably not all) aspects of human cultural endeavors. Notions of cultural evolution (see, e.g., Flannery (1972), Mesoudi et al. (2006)) are therefore more than a convenient metaphor. Instead, for a given system of interest, one has to ask whether or not the corresponding graph shares key features with the one obtained from conceptualizing biological evolution. There is no a priori reason to assume, for instance, that $G$ always gives rise to the tree-like abstraction that is at the heart of biological evolution. This is an inherently empirical question that needs to be answered for each “evolutionary” system under consideration. Human languages, for instance, are a prime example of an aspect of human activity that closely conforms to biological evolution.
The key point here is that a phylogenetic structure is an emergent phenomenon of the underlying evolutionary process; it requires that there exists a level of aggregation in $G$ that produces clusters adhering to an (essentially) hierarchical structure. Although Dress et al. (2010b) provide a formal justification for phylogenetic reconstruction with their analysis of the graph $G$, their work does not attempt to provide a practical procedure to identify the relevant clusters, i.e., the taxa. After all, these are defined in terms of the graph $G$, which of course is not directly observable. In fact, usually not even the set of extant entities will be known completely, as we will have to be content with a subset of available data.
In general, neither the “true nature” of the elementary entities nor a complete description for each of them is available to us. Instead, we have to be content with measured representations. For instance, in molecular phylogenetics, it is customary to represent a taxon by a set of sequences (usually representing single copy protein coding genes) obtained from one or more individuals. Morphological approaches in phylogenetics use a list of characters such as features of bones or organs to represent a typical individual. The impact of the choice of representation on the results of phylogenetic reconstructions has long been recognized in morphological phylogenetics and has been the subject of a long-standing debate, see, e.g., Wiens (2001).
The fundamental assumption that is made in any type of similarity-based phylogenetic analysis is that similarity of representations reflects evolutionary relatedness, i.e., proximity in $G$, and therefore also makes it possible to identify the hierarchical cluster systems that are defined in terms of $G$. This is well established, of course, in the case of molecular phylogenetics, where a detailed model of sequence evolution is available (Jukes and Cantor 1969; Tavaré 1986; Arenas 2015). Similarly, permutation distances directly count genomic rearrangement events (Hannenhalli and Pevzner 1995). The connection is much less clear for morphological phylogenetics, where the choice and even the concept of “character” is under debate, see, e.g., Wagner (2001), Wagner and Stadler (2003) for a formal discussion. In many cases, it seems difficult to construct a theory that links distance or similarity measures directly to an underlying evolutionary process. This is the case for instance in phylogenetic applications of distances between RNA secondary structures (Siebert and Backofen 2005) or the use of distance measures based on data compression (Cilibrasi and Vitanyi 2005; RajaRajeswari and Viswanadha Raju 2017).
Phylogenetic methods have also been employed in the humanities. Relationships among languages, for instance, can be captured by using cognates (i.e., words with a common origin) as characters, see, e.g., Gray et al. (2011), Holman and Wichmann (2017). Recently, sophisticated statistical approaches that model, e.g., the importance of sound change have been used to reconstruct language trees, see, e.g., Bhattacharya et al. (2018) for a recent overview. In stemmatics, differences between editions or manuscripts serve as characters from which the relationships, e.g., between the many different versions (O’Hara and Robinson 1993; Barbrook et al. 1998; Marmerola et al. 2016) can be reconstructed. Occasionally, material artefacts are considered. Tëmkin and Eldredge (2007) used phylogenetic methods to study the history of certain musical instruments. A broader perspective of phylogenetic approaches in cultural evolution is discussed, e.g., by Mesoudi et al. (2006), Steele et al. (2010) or Howe and Windram (2011).
It is a well-known fact in sequence analysis that not all (reasonable) distance measures lead to faithful reconstructions of phylogenies. It is a well-established practice, in fact, to correct for back-mutations, i.e., to transform raw counts of diverged sequence positions, i.e., the Hamming or Levenshtein distances, into distance measures that can be interpreted as numbers of evolutionary events or divergence times. Depending on the level of insight into the data, the simple Jukes–Cantor model (Jukes and Cantor 1969) or one of the many much more elaborate models (Tavaré 1986; Arenas 2015) is used for this purpose. In the field of alignment-free sequence analysis, on the other hand, the focus is on the efficient computation of dissimilarity measures, without overt concern for the measure’s connection to a dynamical model of evolution (Vinga and Almeida 2003). It has been observed, however, that the distance measures that do well in a phylogenetic context also correlate very well with model-based distances (Edgar 2004; Haubold et al. 2009; Leimeister and Morgenstern 2014). We suspect that this reflects the fact that a particular subclass of metrics, the so-called additive metrics, conveys complete phylogenetic information, see “Distance-based phylogenetics” section. We therefore make a strong assumption throughout this contribution:
Assumption A
Given a complete and correct model of the evolutionary dynamics on a suitably constructed space $X$, there is an additive metric distance measure $t$ on $X$ that measures the cumulative change along each lineage.
An immediate consequence is that phylogenetic relationships can be reconstructed unambiguously if $t$ is known. There is, of course, no reason to think that Assumption A holds in real life. In particular, it is certainly violated by all processes that lead to reticulate patterns in evolution, such as incomplete lineage sorting, horizontal gene transfer, and hybridization (Gontier 2015). The purpose of this contribution, therefore, is to ask how much (or how little) we need to know about the “true” metric $t$ to be able to infer the correct phylogenetic tree $T$. More precisely, we investigate here the consequence of distorted distance measurements: Suppose that instead of $t$ we can infer from the data only a “deformed” dissimilarity measure $d = \zeta(t)$, where $\zeta$ is an unknown function about which only some qualitative features can be known. We then ask: How much information about $t$, and thus the underlying phylogenetic tree, does $d$ still convey?
Distance-based phylogenetics
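A map $d: X\times X\to\mathbb{R}$ is a metric if it satisfies, for all $x, y, z\in X$:
- (M0) $d(x,x) = 0$.
- (M1) If $d(x,y) = 0$ then $x = y$.
- (M2) $d(x,y) = d(y,x)$.
- (M3) $d(x,z) \le d(x,y) + d(y,z)$.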
Distance measures can be used for clustering and thus serve as a means of extracting hierarchical, i.e., tree-like, structures on a set of data.
The basis of distance-based phylogenetic methods is additive metrics, i.e., metrics that are representations of edge-weighted trees. Consider a tree $T$ with leaf set $X$ and a length function $\ell$ defined on the edges of $T$. Recall that every pair of leaves $x$ and $y$ is connected by a unique path in $T$. The length of this path, i.e., the sum of its edge lengths, defines the distance $d_T(x,y)$. Additive metrics are those that derive from a tree in this manner. A famous theorem (Buneman 1974; Cunningham 1978; Dobson 1974; Simões-Pereira 1969) shows that additive metrics are characterized by the four-point condition: A metric $d$ is additive if and only if, for any four points $u, v, x, y\in X$,
- (MA) $d(u,v) + d(x,y) \le \max\{d(u,x) + d(v,y),\ d(u,y) + d(v,x)\}$.
The appearance of additive metrics in evolutionary processes can be justified rigorously for specific models. For example, Markovian processes on strings of fixed length lead to distances that can be estimated directly from the data: Denote by $f_{xy}(\alpha,\beta)$ the fraction of characters in which $x$ has state $\alpha$ and $y$ has state $\beta$; for each pair $(x,y)$ these fractions can be arranged in a matrix $F_{xy}$. Steel (1994) showed that (the expected values of) $d_{xy} = -\ln\det F_{xy}$ form an additive metric. Well-known results from phylogenetic combinatorics show that given an additive metric, the tree and its edge lengths can be reconstructed readily, see, e.g., the work of Apresjan (1966), Imrich and Stockiĭ (1972), Buneman (1974), Dress (1984), Bandelt and Dress (1992), Dress et al. (2010a). The well-known neighbor-joining algorithm (Saitou and Nei 1987), a special case of a large class of agglomerative clustering algorithms, furthermore solves this problem efficiently and was shown to always compute the correct tree when presented with an additive metric, see the survey by Gascuel and Steel (2006) and the references therein. Additivity of the underlying metric is also assumed in a recent generalization of phylogenetic trees that allows data points to appear not only as leaves but also as interior vertices of the reconstructed tree (Telles et al. 2013).
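The relation between edge-weighted trees and the four-point condition can be checked numerically. The following is a minimal sketch, not part of the original analysis: the five-leaf tree, its edge lengths, and the helper names `tree_metric` and `is_additive` are hypothetical. It computes leaf-to-leaf path lengths and tests whether, for every quadruple, the two largest of the three pairwise distance sums coincide, which is equivalent to the four-point condition on distinct points.

```python
from itertools import combinations

def tree_metric(edges, leaves):
    """Path-length distances between the leaves of an edge-weighted tree."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    d = {}
    for s in leaves:
        dist, stack = {s: 0.0}, [s]
        while stack:                      # depth-first walk starting at leaf s
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        for t in leaves:
            d[s, t] = dist[t]
    return d

def is_additive(d, points, tol=1e-9):
    """Four-point condition on distinct points: the two largest sums are equal."""
    for x, y, u, v in combinations(points, 4):
        s = sorted([d[x, y] + d[u, v], d[x, u] + d[y, v], d[x, v] + d[y, u]])
        if s[2] - s[1] > tol:
            return False
    return True

# Hypothetical five-leaf tree; "p", "q", "r" are interior vertices.
edges = [("a", "p", 1.0), ("b", "p", 2.0), ("p", "q", 0.5),
         ("c", "q", 1.5), ("q", "r", 1.0), ("d", "r", 0.7), ("e", "r", 2.2)]
leaves = ["a", "b", "c", "d", "e"]
t = tree_metric(edges, leaves)
print(is_additive(t, leaves))             # True: a tree metric is additive

t_bad = dict(t)
t_bad["a", "c"] = t_bad["c", "a"] = t["a", "c"] + 0.3   # perturb one distance
print(is_additive(t_bad, leaves))         # False: the condition is violated
```

The check assumes that the input already is a metric; the degenerate cases of the four-point condition with repeated points reduce to the triangle inequality and are not tested here.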
A stronger condition than additivity is ultrametricity, which is characterized by the strong triangle inequality
- (MU) $d(x,z) \le \max\{d(x,y),\ d(y,z)\}$.
Condition (MU) means that all triangles are “isosceles with a short base”, i.e., two sides of each triangle have equal length and the third side is no longer than these two. Ultrametrics appear in phylogenetics under the assumption of the strong clock hypothesis, i.e., constant evolutionary rates (Dress et al. 2007). Dating of the internal nodes (Britton et al. 2007) transforms an (additive) phylogeny into an ultrametric tree. Ultrametrics are a special case of additive metrics.
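As an illustration of the "isosceles with a short base" reading of (MU), the sketch below tests whether the two longest sides of every triangle are equal. The two three-leaf distance tables are hypothetical: the first corresponds to a clock-like tree, the second to the same topology with one pendant edge lengthened, which is still additive (every metric on three points is) but no longer ultrametric.

```python
from itertools import combinations

def is_ultrametric(d, points, tol=1e-9):
    """Strong triangle inequality: in every triangle the two longest sides are equal."""
    for x, y, z in combinations(points, 3):
        s = sorted([d[x][y], d[y][z], d[x][z]])
        if s[2] - s[1] > tol:
            return False
    return True

# Hypothetical clock-like tree ((a,b),c): a and b split at height 0.5, root at height 1.5.
clock = {"a": {"a": 0, "b": 1, "c": 3},
         "b": {"a": 1, "b": 0, "c": 3},
         "c": {"a": 3, "b": 3, "c": 0}}
# Same topology, but the pendant edge leading to b is longer by 1.0 (no clock).
no_clock = {"a": {"a": 0, "b": 2, "c": 3},
            "b": {"a": 2, "b": 0, "c": 4},
            "c": {"a": 3, "b": 4, "c": 0}}

pts = ["a", "b", "c"]
print(is_ultrametric(clock, pts))      # True
print(is_ultrametric(no_clock, pts))   # False
```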
Real-life data sets, unfortunately, almost never satisfy the four-point condition. As a remedy, Sattath and Tversky (1977) and Fitch (1981) suggested considering a “split relation” on pairs of objects, often referred to as quadruples, defined by
$$xy\,|\,uv \iff d(x,y) + d(u,v) < \min\{d(x,u) + d(y,v),\ d(x,v) + d(y,u)\} \qquad (1)$$
The split relation has been studied extensively and, under certain additional conditions, can provide sufficient information for reconstructing phylogenetic trees (Bandelt and Dress 1986) or at least phylogenetic networks (Bandelt and Dress 1992; Grünewald et al. 2009). The approximation of a given metric by additive metrics or ultrametrics given some measure of the goodness of fit has also received quite a bit of attention (Farach et al. 1996; Agarwala et al. 1998; Apostolico et al. 2013).
Here, we ask under which conditions distance data that may deviate from additivity in a systematic manner still yield a phylogenetically (more or less) correct split relation. This is different from the inference problems mentioned above: Our task is not to minimize a uniform error functional but to deal with systematic distortions of the distance measurements. In order to formalize the problem setting, we assume that the evolutionary process under consideration (operating on a space $X$) generates an additive metric $t$. The catch is that we have no knowledge of $t$ and we cannot directly access $X$. We can, however, obtain partial knowledge from representations. That is, there is a function $\varphi: X\to Y$. The construction of the representation in $Y$ depends on our theory of what is important about the evolving system. In molecular phylogenetics, $Y$ may be chosen to be a space of sequences. In classical, morphology-based phylogenetics, the elements of $Y$ are character-based descriptions of animals; attempts to use molecular structures for phylogenetic purposes might use RNA secondary structures or labeled graph representations of protein 3D structures; a historical linguist might choose word lists or grammatical features.
Once we have decided on representations, we can turn to measuring (dis)similarities between them. The concrete choice of a distance measure $d_Y$ of course again depends on the theoretical conception of the underlying evolutionary process. We can easily reinterpret $d_Y$ as a distance measure on $X$ by setting
$$d(x,y) := d_Y(\varphi(x), \varphi(y)) \qquad (2)$$
It is easy to see that $d$ is a metric whenever $d_Y$ is a metric and $\varphi$ is injective, i.e., whenever our representation is good enough to distinguish objects in $X$. There is no a priori reason to make this assumption, however. Consider, for example, RNA secondary structures as a function of the primary sequences. This map is highly redundant (Schuster et al. 1994); for example, most tRNAs share the standard clover-leaf structure despite very different sequences and divergence times that pre-date the common ancestor of all extant life forms (Eigen et al. 1989); distances between secondary structures therefore do not reflect all evolutionary processes. Formally, $d$ is not a metric but only a pseudometric in this case: It does not satisfy axiom (M1) any longer. We will ignore this complication here and assume for simplicity that $d$ is a metric.
The metric $d$ is of interest for phylogenetic purposes if it quantifies evolutionary divergence in a meaningful way. That is, we are concerned with the information about the underlying additive metric $t$ that can be extracted from $d$. Without additional assumptions on the relationship between $d$ and $t$, however, nothing much can be said. At the very least, our representation should be good enough to recognize whether one of two objects $y$ or $z$ has diverged further from a given reference point $x$ than the other. Hence, we assume that for all $x, y, z\in X$:
- (m0) $t(x,y) < t(x,z)$ implies $d(x,y) \le d(x,z)$.
In the absence of at least this very weak form of monotonicity, we cannot really hope to recover information about $t$ from measuring $d$. To our knowledge, property (m0) has not received much attention in the past. The following, stronger condition, however, has been considered extensively:
- (m1) $t(x,y) < t(u,v)$ implies $d(x,y) \le d(u,v)$
for all $x, y, u, v\in X$. This property is known as (strong) monotonicity (Kruskal 1964) and lies at the heart of non-metric multi-dimensional scaling, a set of techniques that aim at approximating dissimilarity data by a Euclidean metric (Borg and Groenen 2005). A commonly used criterion is to minimize the violations of condition (m1). It is interesting to note in this context that, given any input metric $d$, there is always a Euclidean metric that is connected with $d$ by strong monotonicity, provided the embedding space is of sufficiently high dimension (Agarwal et al. 2007). In our context, it will be interesting to investigate whether there is an analogous result for additive metrics.
If we insist, in addition, that ties are preserved, i.e., that $t(x,y) = t(u,v)$ is equivalent to $d(x,y) = d(u,v)$, then there exists an increasing function $\zeta$ such that $d = \zeta(t)$. In the following, we will consider this (more restrictive) setting in some detail.
Metric-preserving functions
Definition 1
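A function $\zeta: \mathbb{R}_0^+\to\mathbb{R}_0^+$ is metric-preserving if for every metric $d: X\times X\to\mathbb{R}_0^+$ the function $\zeta(d)$ is also a metric on $X$.
Consider the following properties:
- (Z1) $\zeta(a) = 0$ if and only if $a = 0$ (amenable)
- (Z2) $\zeta(a+b) \le \zeta(a) + \zeta(b)$ (subadditive)
- (Z3) $\zeta$ is non-decreasing.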
A theorem by Kelley (1955, p. 131) states that (Z1), (Z2), and (Z3) together are sufficient conditions for $\zeta$ to be metric-preserving. One can show, furthermore, that (Z1) and (Z2) are necessary (Corazza 1999). Property (Z3) is sufficient but not necessary, as shown by several examples of metric-preserving functions that fail to be non-decreasing (Doboš 1998; Corazza 1999). A necessary and sufficient condition (Wilson 1935; Borsik and Doboš 1981; Das 1989) is that $\zeta$ is amenable, (Z1), and satisfies
- (Z*) $\zeta(c) \le \zeta(a) + \zeta(b)$ whenever $|a-b| \le c \le a+b$.
It can also be shown that any concave amenable function is metric preserving (Doboš 1998). If $d = \zeta(t)$ satisfies (m0), then (Z3) holds. We therefore restrict ourselves to amenable, subadditive, non-decreasing functions. Furthermore, we assume for convenience that $\zeta$ is continuous.
We say that $\zeta$ is a.m.-preserving (ultrametric-preserving) if $\zeta(t)$ is an additive metric whenever $t$ is an additive metric (ultrametric). It was shown recently that a function preserves ultrametricity if and only if it is amenable (Z1) and non-decreasing (Z3) (Pongsriiam and Termwuttipong 2014). In Appendix 1, we prove:
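The three properties (amenable, subadditive, non-decreasing) are easy to probe numerically for a candidate function. The sketch below is only a finite-grid check, not a proof; the function name `satisfies_Z`, the grid, and the two example functions are assumptions chosen for illustration. A saturating concave function passes all three tests, whereas squaring is amenable and increasing but not subadditive and therefore breaks the triangle inequality.

```python
import math
from itertools import product

def satisfies_Z(zeta, grid, tol=1e-12):
    """Check amenability, subadditivity and monotonicity on a finite grid."""
    z1 = zeta(0.0) == 0.0 and all(zeta(a) > 0 for a in grid if a > 0)
    z2 = all(zeta(a + b) <= zeta(a) + zeta(b) + tol for a, b in product(grid, grid))
    pts = sorted(grid)
    z3 = all(zeta(a) <= zeta(b) + tol for a, b in zip(pts, pts[1:]))
    return z1, z2, z3

grid = [0.1 * k for k in range(51)]             # sample points in [0, 5]
saturating = lambda a: 1.0 - math.exp(-a)        # concave and amenable
square = lambda a: a * a                         # amenable, increasing, not subadditive

print(satisfies_Z(saturating, grid))   # (True, True, True)
print(satisfies_Z(square, grid))       # (True, False, True)

# Squaring breaks the triangle inequality for three collinear points 0, 1, 2:
d01, d12, d02 = 1.0, 1.0, 2.0
print(square(d02) <= square(d01) + square(d12))   # False, since 4 > 2
```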
Lemma 1
If $\zeta$ is a.m.-preserving, then it is also ultrametric-preserving.
This implies in particular that an a.m.-preserving function is non-decreasing. It will not come as a surprise that nonlinear distortions do not preserve additivity.
Theorem 1
If $\zeta$ is a.m.-preserving, then $\zeta(x) = \alpha x + \beta$ holds for all $x > 0$ with $\alpha, \beta \ge 0$.
A proof can be found in Appendix. The importance of this theorem lies in the fact that any nonlinear distortion of the metric t necessarily destroys additivity and thus, depending on the algorithm employed, may result in the reconstruction of an incorrect phylogeny.
Given the importance of the split relation, it is natural to ask whether, or under what conditions, at least this relation is preserved. The example in Fig. 1 shows, however, that the relation is not necessarily preserved under transformations satisfying (Z1), (Z2), and (Z3). The example of Fig. 1 is reminiscent of the effect of long branch attraction (LBA) in parsimony-based methods (Felsenstein 1978; Bergsten 2005), which can also be understood as the consequence of underestimating the impact of homoplasy, i.e., “back-mutations.”
Fig. 1.
Metric-preserving transformations do not preserve the split relation. The distance matrix corresponds to the tree in the middle and, according to Eq. (1), satisfies . The function satisfies (Z1), (Z2), (Z3) and is smooth. The transformed distance matrix is represented by the networks shown on the r.h.s. (computed with SplitsTree (Huson and Bryant 2006)). Here, is the distance pair with the shortest distance sum, i.e., it corresponds to the quadruple . This split corresponds to the longer one of the two side lengths of the box
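The effect can be reproduced with a small numerical experiment. The sketch below is not the example of Fig. 1; the quartet tree, its edge lengths, and the saturating function $1 - e^{-s}$ are assumptions chosen for illustration. The pairing with the smallest distance sum is taken as the split of the quadruple. On the additive distances the true split ab|cd is found, while after the transformation the two long branches a and c are grouped together.

```python
import math

def split(d, q):
    """Pairing of the quadruple q = (w, x, y, z) with the smallest distance sum."""
    w, x, y, z = q
    sums = {f"{w}{x}|{y}{z}": d[w, x] + d[y, z],
            f"{w}{y}|{x}{z}": d[w, y] + d[x, z],
            f"{w}{z}|{x}{y}": d[w, z] + d[x, y]}
    return min(sums, key=sums.get)

# Hypothetical quartet tree ab|cd with long pendant edges leading to a and c.
long_, short, inner = 3.0, 0.1, 0.1
t = {("a", "b"): long_ + short, ("c", "d"): long_ + short,
     ("a", "c"): 2 * long_ + inner, ("b", "d"): 2 * short + inner,
     ("a", "d"): long_ + short + inner, ("b", "c"): long_ + short + inner}

zeta = lambda s: 1.0 - math.exp(-s)      # assumed distortion; amenable, subadditive, increasing
d = {pair: zeta(value) for pair, value in t.items()}

q = ("a", "b", "c", "d")
print(split(t, q))   # ab|cd, the split of the underlying tree
print(split(d, q))   # ac|bd, the long branches attract each other
```

With these numbers the three distance sums of the transformed data are roughly 1.910, 1.257, and 1.918, so the incorrect pairing wins by a wide margin.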
Multiple features
A reasonable approach to devise a distance measure for a set of objects is to use a representation in terms of a collection of features, i.e., to consider a product space $Y = \prod_i Y_i$ with distance measures $d_{Y_i}$ independently defined for each of the features. Each feature can be seen as an independent representation, $\varphi_i: X\to Y_i$, and thus, we may reinterpret the $d_{Y_i}$ as different distance measures on $X$, i.e., $d_i(x,y) := d_{Y_i}(\varphi_i(x), \varphi_i(y))$ with $x, y\in X$. In this setting, it seems most natural to assume that $d_i$ is just a pseudometric.
It is well known that any nonnegative linear combination $d = \sum_i a_i d_i$ of pseudometrics with $a_i \ge 0$ is again a pseudometric. To avoid trivial cases, assume $a_i > 0$. Then, $d$ is a metric whenever $x \ne y$ implies that there is a feature $i$ such that $d_i(x,y) > 0$. The most general ways to combine metrics are given by the generalized metric-preserving transforms, i.e., functions $\zeta: (\mathbb{R}_0^+)^n\to\mathbb{R}_0^+$ with the property that $\zeta(d_1,\dots,d_n)$ is a metric whenever each $d_i$, $1\le i\le n$, is a metric (Das 1989). These functions have a characterization that naturally generalizes (Z1) and (Z*) to multiple arguments.
Theorem 2
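If $\zeta: (\mathbb{R}_0^+)^n\to\mathbb{R}_0^+$ transforms additive metrics $d_1,\dots,d_n$ consistent with the same underlying tree $T$ into a metric that is again compatible with $T$, then $\zeta(d_1,\dots,d_n) = d' + d''$ where
- (i) $d' = \sum_i a_i d_i$ with $a_i \ge 0$,
- (ii) $d'' = \sum_i b_i \delta_i$ is a nonnegative linear combination where $\delta_i$ is the standard discrete metric applied to the $i$th component, i.e., the $i$th argument of $\zeta$.
- (iii) for each $i$, at least one of $a_i$ and $b_i$ is nonzero.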
Proof
Suppose all component metrics are discrete except for $d_i$, $1\le i\le n$. Then, $\zeta$ is linear with nonnegative slope for $d_i > 0$ as an immediate consequence of Theorem 1, i.e., condition (i) is necessary. Theorem 1 furthermore implies that the contribution for each feature $i$ is necessarily of the form $a_i d_i + b_i \delta_i$ with $a_i, b_i \ge 0$. To ensure that we have a metric, each constituent must be a metric, i.e., at least one of $a_i$ and $b_i$ must be nonzero. □
In essence, Theorem 1 characterizes the distance measures that are “good” for phylogenetic purposes: These are exactly the ones that are linear combinations of distance measures that themselves are additive. In particular, therefore, alignment-free phylogenetic methods are guaranteed to work only when their distance measure approximates an additive measure, or, equivalently, when they approximate a distance for which a transformation to an additive distance is known (and used for the phylogenetic reconstruction).
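A small experiment illustrates why nonnegative linear combinations, possibly with a discrete-metric term, are harmless while a nonlinear distortion is not. The sketch below is only an illustration under stated assumptions: the quartet tree ab|cd, both sets of edge lengths, the coefficients, and the helper names are hypothetical. It measures the largest gap between the two biggest pairwise sums over all quadruples, which vanishes exactly when the four-point condition holds on distinct points.

```python
from itertools import combinations

def four_point_gap(d, points):
    """Largest difference between the two biggest pairwise sums over all quadruples."""
    gap = 0.0
    for x, y, u, v in combinations(points, 4):
        s = sorted([d(x, y) + d(u, v), d(x, u) + d(y, v), d(x, v) + d(y, u)])
        gap = max(gap, s[2] - s[1])
    return gap

def quartet_metric(pa, pb, pc, pd, inner):
    """Additive metric of the quartet tree ab|cd with the given pendant and interior lengths."""
    pend = {"a": pa, "b": pb, "c": pc, "d": pd}
    side = {"a": 0, "b": 0, "c": 1, "d": 1}
    def d(x, y):
        if x == y:
            return 0.0
        return pend[x] + pend[y] + (inner if side[x] != side[y] else 0.0)
    return d

# Two "features": additive metrics on the same topology with different edge lengths.
t1 = quartet_metric(1.0, 2.0, 0.5, 1.5, 1.0)
t2 = quartet_metric(0.3, 0.2, 2.0, 0.1, 0.4)
delta = lambda x, y: 0.0 if x == y else 1.0          # standard discrete metric

combo = lambda x, y: 2.0 * t1(x, y) + 0.5 * t2(x, y) + 0.7 * delta(x, y)
squashed = lambda x, y: t1(x, y) ** 0.5               # nonlinear, yet still a metric

pts = ["a", "b", "c", "d"]
print(four_point_gap(combo, pts))      # numerically zero: additivity is kept
print(four_point_gap(squashed, pts))   # clearly positive: additivity is lost
```

The discrete-metric term adds the same constant to every off-diagonal distance, which amounts to lengthening each pendant edge by half that constant and therefore leaves the tree topology untouched.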
Inferring transformations
The theoretical considerations above lead to the conclusion that the key problem for phylogenetic inference from data without a completely understood underlying model is to find monotonic transformations that make the original data as additive as possible before applying distance-based phylogenetic methods. It is important to realize that this is not the same problem as extracting the additive part of a given metric using, e.g., split decomposition. To see this, consider the metric distance matrix
| 3 |
The transformation recovers the additive metric of Fig. 1 (up to small rounding errors) and thus recovers the tree in Fig. 1. Its split decomposition, on the other hand, yields the network on the r.h.s. of the figure with isolation indices and . Any reasonable methods for fitting an additive tree thus will pick up the a quadruple with the from these distances.
Consider now a function $A$ that, given a metric distance matrix as input, produces a “best-fitting” additive metric distance matrix of the same dimension as output. More formally, denote by $\mathfrak{M}_n$ the set of all metrics on $n$ points, and let $A: \mathfrak{M}_n\to\mathfrak{M}_n$.
Definition 2
A function $A: \mathfrak{M}_n\to\mathfrak{M}_n$ is a.m.-consistent if the following conditions are satisfied:
- (i) If $\mathbf{D}\in\mathfrak{M}_n$ then $A(\mathbf{D})$ is an additive metric.
- (ii) If $\mathbf{D}$ is an additive metric, then $A(\mathbf{D}) = \mathbf{D}$.
The neighbor-joining algorithm (Saitou and Nei 1987) is a well-known example of an a.m.-consistent function (Gascuel and Steel 2006). Another example is the non-prime part of the split decomposition (Bandelt and Dress 1992). Given a distance matrix $\mathbf{D}$ and an a.m.-consistent function $A$, a natural measure for the deviation from additivity is $\lVert\mathbf{D} - A(\mathbf{D})\rVert$ with some matrix norm $\lVert\cdot\rVert$. In particular, $\lVert\mathbf{D} - A(\mathbf{D})\rVert = 0$ if and only if $\mathbf{D}$ is an additive metric.
Let us now return to Assumption A and characterize distances that derive from additive metrics in a simple manner:
Lemma 2
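Let $\mathbf{D}$ be a metric distance matrix, let $A$ be an a.m.-consistent function, suppose $\zeta$ is invertible, increasing, and subadditive, and let $\lVert\cdot\rVert$ be a matrix norm. Then, there is an additive distance matrix $\mathbf{T}$ with $\mathbf{D} = \zeta(\mathbf{T})$ if and only if $\lVert\mathbf{D} - \zeta(A(\zeta^{-1}(\mathbf{D})))\rVert = 0$.
Proof
Invertibility of $\zeta$ implies that $\mathbf{D} = \zeta(\mathbf{T})$ is equivalent to $\mathbf{T} = \zeta^{-1}(\mathbf{D})$. Now $A(\mathbf{T}) = \mathbf{T}$ if and only if $\mathbf{T}$ is additive. Using invertibility of $\zeta$ again, this is in turn equivalent to $\mathbf{D} = \zeta(A(\zeta^{-1}(\mathbf{D})))$. Since the matrix norm vanishes only for the 0-matrix, the Lemma follows. □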
Lemma 2 immediately suggests searching for $\zeta$ by minimizing the error functional
$$\Delta(\zeta) := \lVert\mathbf{D} - \zeta(A(\zeta^{-1}(\mathbf{D})))\rVert \qquad (4)$$
By Lemma 2, $\mathbf{D}$ derives from an additive metric if and only if a $\zeta$ exists for which the error functional of Eq. (4) vanishes. Otherwise, we obtain an approximately additive source metric $\zeta^{-1}(\mathbf{D})$ that then serves as the best available input for phylogenetic reconstruction. In this case, the values of the error functional as well as the estimate of $\zeta$ that is found by minimizing it will in general depend on both the a.m.-consistent function $A$ and the matrix norm $\lVert\cdot\rVert$.
As a proof of principle, we first produced an artificial distance matrix by transforming the distances of a randomly generated tree with 100 leaves using the Jukes–Cantor rule (Jukes and Cantor 1969) corresponding to a four-letter alphabet and scaling the mutation rate such that back-mutations play a role but distances are not completely saturated. We then make the assumption that the measured data might depend on the unknown additive scale via a stretched exponential transformation of the form
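The workflow suggested by Lemma 2 can be sketched in a few lines. The code below is a toy illustration under several assumptions that are not taken from the text: neighbor joining stands in for the a.m.-consistent function, the Frobenius norm is used as matrix norm, the one-parameter family `zeta(t, b) = b (1 - exp(-t / b))` is a Jukes–Cantor-like saturation curve, and the five-leaf tree, the true value `b = 0.75`, and the search grid are hypothetical. The discrepancy between the data and the back-transformed neighbor-joining fit is evaluated on the grid and the minimizer is reported.

```python
import math

def neighbor_joining(D):
    """Plain neighbor joining; returns the tree as a list of (u, v, length) edges."""
    n = len(D)
    nodes = list(range(n))
    d = {(i, j): D[i][j] for i in nodes for j in nodes}
    edges, new = [], n
    while len(nodes) > 2:
        m = len(nodes)
        r = {i: sum(d[i, k] for k in nodes) for i in nodes}
        pairs = [(i, j) for a, i in enumerate(nodes) for j in nodes[a + 1:]]
        i, j = min(pairs, key=lambda p: (m - 2) * d[p] - r[p[0]] - r[p[1]])
        li = 0.5 * d[i, j] + (r[i] - r[j]) / (2 * (m - 2))
        edges += [(i, new, li), (j, new, d[i, j] - li)]
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:                      # distances from the new interior vertex
            d[new, k] = d[k, new] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        d[new, new] = 0.0
        nodes, new = rest + [new], new + 1
    edges.append((nodes[0], nodes[1], d[nodes[0], nodes[1]]))
    return edges

def tree_distances(edges, n):
    """Leaf-to-leaf path lengths (leaves are 0..n-1) of an edge-weighted tree."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    out = []
    for s in range(n):
        dist, stack = {s: 0.0}, [s]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        out.append([dist[k] for k in range(n)])
    return out

def A(D):
    """Stand-in for an a.m.-consistent map: leaf metric of the neighbor-joining tree."""
    return tree_distances(neighbor_joining(D), len(D))

def zeta(t, b):                              # assumed saturating family
    return b * (1.0 - math.exp(-t / b))

def zeta_inv(d, b):
    return -b * math.log(1.0 - d / b)

def discrepancy(D, b):
    """Frobenius norm of D - zeta(A(zeta^{-1}(D))); infinite if zeta^{-1} is undefined."""
    n = len(D)
    if max(max(row) for row in D) >= b:
        return math.inf
    T = [[zeta_inv(D[i][j], b) for j in range(n)] for i in range(n)]
    AT = A(T)
    return math.sqrt(sum((D[i][j] - zeta(AT[i][j], b)) ** 2
                         for i in range(n) for j in range(n)))

# Hypothetical tree: leaves 0..4, interior vertices 5..7; data generated with b = 0.75.
true_edges = [(0, 5, 0.10), (1, 5, 0.30), (5, 6, 0.20), (2, 6, 0.25),
              (6, 7, 0.15), (3, 7, 0.40), (4, 7, 0.05)]
T_true = tree_distances(true_edges, 5)
D = [[zeta(x, 0.75) for x in row] for row in T_true]

grid = [0.75, 0.8, 1.0, 1.5, 3.0]
best = min(grid, key=lambda b: discrepancy(D, b))
print(best, discrepancy(D, best))
```

A grid search is used only to keep the sketch short; any scalar optimizer could take its place. Neighbor joining may return negative edge lengths on strongly non-additive input, which this sketch does not guard against.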
| 5 |
with unknown parameters , , and . Figure 2 (top) shows that the correct values of and can be inferred by using Eq. (4) to minimize the discrepancy . In “Appendix 2,” we show more formally that the parameter is arbitrary and hence cannot be inferred. Intuitively, this follows from the fact that only scales the time axis and hence constitutes a purely additive transformation of the distance, which is canceled in Eq. (4) by the application of .
Fig. 2.
Empirical estimation of a transformation . Top: The relevant parameters and of the stretched exponential transform Eq. (5) can be estimated with the help of Eq. (4). Plotting as a function of the parameters and in Eq. (5) shows that the minimal discrepancy is indeed found at the theoretical values and used to generate the transformed distance matrix corresponding to a tree with 100 leaves. The color scale on the r.h.s. of the panel refers to ln. Below: The two small panels show the effect of increasing levels of measurement noise (left: , right: , see “Appendix 2” for details)
Real-life distance data of course are not perfectly additive. We therefore simulated sequence data by introducing substitutions independently at each sequence position according to a first order Markov process along all edges of a given phylogenetic tree. In order to tune the level of noise, we considered different linear combinations of the theoretical and the simulated data, see “Appendix 2” for details. We found that the estimation of $\zeta$ via Eq. (4) works well for small levels of sampling noise. For large noise levels, however, there are systematic biases. These appear to depend strongly on the choice of the matrix norm $\lVert\cdot\rVert$. Clearly, a better understanding of the numerical problems associated with this inference problem will be necessary before the conceptually simple workflow proposed here can be applied to real-life data.
Discussion and conclusions
It has been realized already in the early days of computational phylogenetics that suitable transformation of distance data, e.g., using the Jukes–Cantor transformation, can increase the additivity and thus conceivably improve the quality of phylogenetic reconstructions (Vach 1992). A main insight in this contribution is that it is, at least in principle, possible to infer the correct distance transformation from the measured data only. As a consequence, the correct inference of phylogenetic relationships is possible not only for additive distances but also for the large class of distances that arise from additive metrics with a monotonic metric-preserving function.
At the same time, our results suggest that there are limits to phylogenetic inference. Whenever the available data cannot be transformed into an additive metric (at least approximately, i.e., up to measurement noise), there seems little hope to justify the interpretation of the results of hierarchical clustering (which of course can be performed on any kind of distance or similarity data) as a phylogeny. It is important to note, however, that our discussion has focused on metric-preserving functions, i.e., “uniform” transformations of the distance data. It is entirely possible to employ more general schemes that further extend the realm of phylogenetically meaningful data. For instance, the results of “Multiple features” section show that for data comprising multiple types of descriptors, distances extracted from the different subclasses $c$ can be transformed with different functions $\zeta_c$. Such an approach might even be useful to distinguish phylogenetically informative from problematic classes of features.
On a more conceptual level, our results show that detailed mechanistic models of the underlying evolutionary process are not logically necessary for phylogenetic inference. It is, in fact, sufficient that the measured distance data can be transformed to an additive metric by means of a monotonic metric-preserving function. This is not to say that a mechanistic understanding of the process is not useful or desirable. After all, a mechanistic model will, at the very least, typically imply the functional form of the transformation function $\zeta$. The inference of $\zeta$ from real-world data remains an important open problem. The issue to be explored is not only the limiting effect of measurement noise and inherent deviations from additivity due to horizontal gene transfer, incomplete lineage sorting, etc., but also numerical issues such as the fact that, in large trees, a substantial fraction of all pairwise distances takes values very close to the diameter of the tree. This seems to cause a particular susceptibility to measurement noise. Systematic simulation studies well beyond the scope of this contribution will be required to address these issues.
A potential alternative to Eq. (4) is the minimization of some measure of tree-likeness for the transformed matrix $\zeta^{-1}(\mathbf{D})$. Attractive candidates are the corresponding parameters of statistical geometry (Eigen et al. 1988; Nieselt-Struwe 1997) and the related “$\delta$-plots” advocated by Holland et al. (2002). It is not obvious, however, how these measures react to the changes in scale invariably introduced by $\zeta^{-1}$. This issue does not arise in the context of Eq. (4) because the effects cancel due to the appearance of both $\zeta$ and $\zeta^{-1}$.
It is interesting to note that our results also provide an a posteriori explanation for the observation that alignment-free methods work best in phylogenetic applications when the distances correlate well with alignment-based distances (Haubold et al. 2009; Morgenstern et al. 2017; Thankachan et al. 2017). It will be interesting to see whether other types of distances, such as compression distances (Kocsor et al. 2006; Penner et al. 2011), admit a transformation that makes them approximately additive.
Finally, several mathematical questions arise naturally from the results presented here. First, is it possible to replace condition (m1) by weaker requirements, such as (m0)? More generally, to what extent can arbitrary rate variations be accommodated? We know, of course, that they are harmless in an underlying additive metric, but what is the most general distortion that can be accommodated? Complementarily, it will be of interest to characterize the functions that preserve circular (Kalmanson 1975) and weakly decomposable metrics (Bandelt and Dress 1992), respectively.
Acknowledgements
Open access funding provided by Max Planck Society. This work was supported by travel funding from the ASU-SFI Center for Biosocial Complex Systems.
Appendix 1: Proofs
Proof of Lemma 1
Since every ultrametric is additive, an a.m.-preserving function must transform every ultrametric into an additive metric. Being a function, it transforms isosceles triangles into isosceles triangles and, in particular, preserves equilateral triangles.
Consider the set of ultrametrics on 4 points satisfying . The four isosceles triangles are . Therefore, and If the -transformed additive metric satisfies then these four triangles still have short base. Recall that is an ultrametric if and only if every triangle is isosceles with short base or equilateral. Therefore, is again an ultrametric. Otherwise, suppose holds with respect to the transformed metric. Additivity then implies i.e., , a contradiction. The same result is obtained assuming . In the degenerate case, no quadruple exists and thus . Since and can vary independently of each other, must be constant, and thus, is the trivial discrete metric, which is also an ultrametric. Hence, is ultrametric-preserving on any subset of four points and thus in particular also preserves ultrametricity of all triangles. □
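The triangle characterization of ultrametrics used in this proof can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names and the example matrix are hypothetical, and the test simply verifies that in every triangle the two largest sides coincide.

```python
from itertools import combinations
import math


def is_ultrametric(d, tol=1e-9):
    """Check that in every triangle the two largest sides are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        sides = sorted([d[i][j], d[i][k], d[j][k]])
        if sides[2] - sides[1] > tol:
            return False
    return True


def apply_offdiag(d, f):
    """Apply f to all off-diagonal entries; keep the diagonal at 0."""
    n = len(d)
    return [[0.0 if i == j else f(d[i][j]) for j in range(n)] for i in range(n)]


# Hypothetical ultrametric on four points (clusters {0,1} and {2,3}).
U = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]

U_sqrt = apply_offdiag(U, math.sqrt)            # non-decreasing distortion
U_inverse = apply_offdiag(U, lambda x: 1.0 / x)  # decreasing distortion
```

In this example the non-decreasing square root keeps every triangle isosceles with a short base, whereas the decreasing map x ↦ 1/x turns the short bases into long ones and destroys ultrametricity, which is consistent with the requirement that such functions be non-decreasing.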
Proof of Theorem 1
The discrete metric is additive; hence, any function that is constant on is a.m.-preserving. As a consequence of Lemma 1 and Pongsriiam and Termwuttipong (2014), we know that any a.m.-preserving function is amenable and non-decreasing. In the following, we therefore assume that is amenable, not constant on , and non-decreasing.
Consider the set of additive metrics on four points satisfying . Then, for some sufficiently small, the metric defined by except and is again additive. Thus, , a constant. It is easy to see that and can be chosen arbitrarily (first choose an isolation index for such that and then pick and sufficiently small). Thus, for every and sufficiently small , we have . Let us fix a and consider the partial function . Suppose is not constant. Then there is a point . Since we know that is constant in a neighborhood of , we have . By construction for all . But is also constant in an open neighborhood of , which has a non-empty intersection with . Thus, , a contradiction. Renaming the arguments, there is a function such that
(6)
for all and .
Replacing by for yields , while substituting with yields . Substituting by and adding the resulting equations leads to and thus eventually . Taken together, we have . Replacing by / shows and thus for all . That is, for all . Since is non-decreasing, we see that is also non-decreasing. Therefore, holds for all and all with . Using the well-known fact that is dense in , we conclude that holds for all . In particular, we have . Substituting this into Eq. (6) and setting yields . Setting and rearranging the terms finally yields
(7)
for all . The theorem now follows by observing that both the slope h(1) and the intercept must be nonnegative since is amenable and non-decreasing. □
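As a numerical illustration of the statement just proved, the following sketch checks the four-point condition before and after transforming a small additive metric. It is a minimal example under stated assumptions: the helper names, the example matrix, and the particular slope and intercept are hypothetical, and the check covers only quadruples of distinct points of an input that is already a metric.

```python
from itertools import combinations
import math


def is_additive(d, tol=1e-9):
    """Four-point condition: the two largest of the three pairwise
    distance sums of every quartet must coincide."""
    for x, y, u, v in combinations(range(len(d)), 4):
        sums = sorted([
            d[x][y] + d[u][v],
            d[x][u] + d[y][v],
            d[x][v] + d[y][u],
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True


def transform(d, f):
    """Apply f to all off-diagonal entries; keep the diagonal at 0."""
    n = len(d)
    return [[0.0 if i == j else f(d[i][j]) for j in range(n)] for i in range(n)]


# Hypothetical additive metric: pendant edges 1, 2, 3, 4 and inner edge 1
# on the quartet tree ((0,1),(2,3)).
D = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]

D_affine = transform(D, lambda x: 2.0 * x + 3.0)  # nonnegative slope and intercept
D_sqrt = transform(D, math.sqrt)                   # monotone but nonlinear
```

Here `D` and `D_affine` satisfy the four-point condition, because the intercept adds the same amount to each of the three sums, while `D_sqrt` does not. The square root is still metric-preserving, so the example also shows that preserving metrics is a weaker property than preserving additive metrics.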
Appendix 2: On the example of Fig. 2
In Fig. 2, we considered distance data generated from an additive tree using a transformation of the form Eq. (5), which has the inverse , with parameters , , fixed at some values , , , which we pretend not to know. Transforming them with the correct value but arbitrary choices of and yields transformed distances
(8)
The coefficients and b appear only in the multiplicative factor and this does not affect additivity of the metric because the function must satisfy for input matrices close to that are almost additive. It follows that the scaling factor of the time axis cannot be inferred by minimizing the discrepancy in Eq. (4). This does not matter for phylogenetic reconstruction, however, because the scaled distance matrix corresponds to the same phylogenetic tree as . In contrast, choosing an exponent causes a nonlinear distortion and thus causes a nonzero discrepancy in data. It is also easy to see that any choice of also causes a nonzero discrepancy, and hence, can be inferred.
In order to construct a data set with tunable levels of sampling error, we used the tree as “scaffold” to simulate the evolution of four-letter sequences of length for 100 time units with a per-site substitution rate of . Denote by the empirically determined scaled Hamming distances for a particular instance of the simulated sequences. By construction, the expected distance matrix for this model is with and . Hence, the sampling variance can be tuned by using linear combinations of and . We used convex combinations of the form . Note that the limit corresponds to sequences of infinite length, which allow an arbitrarily accurate estimation of the expected distances.
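The construction described above can be sketched in a few lines of Python. This is not the simulation code used for Fig. 2: the scaffold tree, the branch lengths, the substitution rate, and the mixing parameter `lam` are hypothetical choices, and the substitution process is assumed to be the symmetric four-letter (Jukes–Cantor type) model, for which the expected Hamming distance after time t at per-site rate μ is (3/4)(1 − exp(−4μt/3)).

```python
import math
import random

ALPHABET = "ACGT"

# Path lengths between leaves of a hypothetical scaffold tree of height 100:
# two inner branches of length 30 and four pendant branches of length 70.
TREE_DIST = [
    [0, 140, 200, 200],
    [140, 0, 200, 200],
    [200, 200, 0, 140],
    [200, 200, 140, 0],
]


def p_diff(t, mu):
    """Probability that a site differs after time t in the symmetric model."""
    return 0.75 * (1.0 - math.exp(-4.0 * mu * t / 3.0))


def evolve(seq, t, mu, rng):
    """Evolve a sequence along one branch of length t."""
    p = p_diff(t, mu)
    return "".join(
        rng.choice([c for c in ALPHABET if c != ch]) if rng.random() < p else ch
        for ch in seq
    )


def simulate_leaves(n_sites, mu, rng):
    """Simulate the four leaf sequences of the scaffold tree."""
    root = "".join(rng.choice(ALPHABET) for _ in range(n_sites))
    left = evolve(root, 30.0, mu, rng)
    right = evolve(root, 30.0, mu, rng)
    return [
        evolve(left, 70.0, mu, rng),
        evolve(left, 70.0, mu, rng),
        evolve(right, 70.0, mu, rng),
        evolve(right, 70.0, mu, rng),
    ]


def hamming_matrix(seqs):
    """Scaled Hamming distances (fraction of differing sites)."""
    n = len(seqs)
    return [
        [sum(a != b for a, b in zip(seqs[i], seqs[j])) / len(seqs[i])
         for j in range(n)]
        for i in range(n)
    ]


def expected_matrix(mu):
    """Expected scaled Hamming distances for the scaffold tree."""
    return [[p_diff(t, mu) for t in row] for row in TREE_DIST]


def mix(expected, empirical, lam):
    """Convex combination: lam = 0 is noise-free, lam = 1 is purely empirical."""
    return [
        [(1.0 - lam) * e + lam * h for e, h in zip(row_e, row_h)]
        for row_e, row_h in zip(expected, empirical)
    ]
```

A call such as `mix(expected_matrix(0.005), hamming_matrix(simulate_leaves(1000, 0.005, random.Random(1))), 0.5)` then yields a distance matrix with half of the sampling deviation of the raw Hamming distances. The direction of the parametrization (which end of the interval is noise-free) is an assumption of this sketch.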
References
- Agarwal S, Wills J, Cayton L, Lanckriet G, Kriegman D, Belongie S (2007) Generalized non-metric multidimensional scaling. In: Meila M, Shen X (eds) Proceedings of the eleventh international conference on artificial intelligence and statistics, vol 2 of proceedings of machine learning research, pp 11–18. San Juan, PR
- Agarwala R, Bafna V, Farach M, Paterson M, Thorup M. On the approximability of numerical taxonomy (fitting distances by tree metrics) SIAM J Comput. 1998;28:1073–1085. doi: 10.1137/S0097539795296334.
- Apostolico A, Comin M, Dress AWM, Parida L. Ultrametric networks: a new tool for phylogenetic analysis. Algorithms Mol Biol. 2013;8:7. doi: 10.1186/1748-7188-8-7.
- Apresjan JD. An algorithm for constructing clusters from a distance matrix. Mashinnyi perevod prikladnaja lingvistika. 1966;9:3–18.
- Arenas M. Trends in substitution models of molecular evolution. Front Genet. 2015;6:319. doi: 10.3389/fgene.2015.00319.
- Bandelt HJ, Dress AWM. Reconstructing the shape of a tree from observed dissimilarity data. Adv Math. 1986;7:309–343. doi: 10.1016/0196-8858(86)90038-2.
- Bandelt HJ, Dress AWM. A canonical decomposition theory for metrics on a finite set. Adv Math. 1992;92:47–105. doi: 10.1016/0001-8708(92)90061-O.
- Barbrook AC, Howe CJ, Blake NB, Robinson P. The phylogeny of The Canterbury Tales. Nature. 1998;394:839. doi: 10.1038/29667.
- Bergsten J. A review of long-branch attraction. Cladistics. 2005;21:163–193. doi: 10.1111/j.1096-0031.2005.00059.x.
- Bhattacharya T, Retzlaff N, Blasi D, Croft W, Cysouw M, Hruschka D, Maddieson I, Müller L, Smith E, Stadler PF, Starostin G, Youn H. Studying language evolution in the age of big data. J Lang Evol. 2018 doi: 10.1093/jole/lzy004.
- Borg I, Groenen P. Modern multidimensional scaling: theory and applications. 2. Heidelberg: Springer; 2005.
- Borsik Y, Doboš J. Functions whose composition with every metric is a metric. Mathematica Slovaca. 1981;31:3–12.
- Britton T, Anderson CL, Jacquet D, Lundqvist S, Bremer K. Estimating divergence times in large phylogenetic trees. Syst Biol. 2007;56:741–752. doi: 10.1080/10635150701613783.
- Buneman P. Note on the metric properties of trees. J Comb Theory B. 1974;17:48–50. doi: 10.1016/0095-8956(74)90047-1.
- Cilibrasi R, Vitanyi P. Clustering by compression. IEEE Trans Inf Theory. 2005;51:1523–1545. doi: 10.1109/TIT.2005.844059.
- Corazza P. Introduction to metric-preserving functions. Am Math Mon. 1999;106:309–323. doi: 10.1080/00029890.1999.12005048.
- Croft W. Explaining language change: an evolutionary approach. Harlow: Pearson Education; 2000.
- Cunningham P. Free trees and bidirectional trees as representations of psychological distance. J Math Psychol. 1978;17:165–188. doi: 10.1016/0022-2496(78)90029-9.
- Das PP. Metricity preserving transforms. Pattern Recogn Lett. 1989;10:73–76. doi: 10.1016/0167-8655(89)90069-X.
- Doboš J. Metric preserving functions. Košice: Univerzita P. J. Šafárika v Košiciach; 1998.
- Dobson AJ. Unrooted trees for numerical taxonomy. J Appl Probab. 1974;11:32–42. doi: 10.2307/3212580.
- Dress AWM. Trees, tight extensions of metric spaces, and the cohomological dimension of certain groups: a note on combinatorial properties of metric spaces. Adv Math. 1984;53:321–402. doi: 10.1016/0001-8708(84)90029-X.
- Dress AWM, Huber K, Moulton V. Some uses of the farris transform in mathematics and phylogenetics—a review. Ann Comb. 2007;11:1–37. doi: 10.1007/s00026-007-0302-5.
- Dress A, Huber KT, Koolen J, Moulton V, Spillner A. An algorithm for computing cutpoints in finite metric spaces. J Classif. 2010;27:158–172. doi: 10.1007/s00357-010-9055-7.
- Dress A, Moulton V, Steel M, Wu T. Species, clusters and the ‘tree of life’: a graph-theoretic perspective. J Theor Biol. 2010;265:535–542. doi: 10.1016/j.jtbi.2010.05.031.
- Edgar RC. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform. 2004;5:113. doi: 10.1186/1471-2105-5-113.
- Eigen M, Winkler-Oswatitsch R, Dress AWM. Statistical geometry in sequence space: a method of quantitative comparative sequence analysis. Proc Natl Acad Sci USA. 1988;85:5913–5917. doi: 10.1073/pnas.85.16.5913.
- Eigen M, Lindemann BF, Tietze M, Winkler-Oswatitsch R, Dress AWM, von Haeseler A. How old is the genetic code? Statistical geometry of tRNA provides an answer. Science. 1989;244:673–679. doi: 10.1126/science.2497522.
- Farach M, Kannan S, Warnow T. A robust model for finding optimal evolutionary trees. Algorithmica. 1996;13:155–179. doi: 10.1007/BF01188585.
- Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst Biol. 1978;27:401–410. doi: 10.1093/sysbio/27.4.401.
- Fitch WM. A non-sequential method for constructing trees and hierarchical classifications. J Mol Evol. 1981;18:30–37. doi: 10.1007/BF01733209.
- Flannery KV. The cultural evolution of civilizations. Ann Rev Ecol Syst. 1972;3:399–426. doi: 10.1146/annurev.es.03.110172.002151.
- Gascuel O, Steel M. Neighbor-joining revealed. Mol Biol Evol. 2006;23:1997–2000. doi: 10.1093/molbev/msl072.
- Gontier N. Reticulate evolution: symbiogenesis, lateral gene transfer, hybridization and infectious heredity. Cham: Springer; 2015.
- Gray RD, Atkinson QD, Greenhill SJ. Language evolution and human history: what a difference a date makes. Philos Trans R Soc Lond B Biol Sci. 2011;366:1090–1100. doi: 10.1098/rstb.2010.0378.
- Greg WW. The rationale of copy-text. Stud Bibliogr. 1950;3:19–36.
- Grünewald S, Moulton V, Spillner A. Consistency of the QNet algorithm for generating planar split networks from weighted quartets. Discrete Appl Math. 2009;157:2325–2334. doi: 10.1016/j.dam.2008.06.038.
- Hannenhalli S, Pevzner PA (1995) Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proceedings of IEEE 36th annual foundations of computer science, pp 581–592. IEEE
- Haubold B, Pfaffelhuber P, Domazet-Lošo M, Wiehe T. Estimating mutation distances from unaligned genomes. J Comput Biol. 2009;16:1487–1500. doi: 10.1089/cmb.2009.0106.
- Holland BR, Huber KT, Dress AWM, Moulton V. δ plots: a tool for analyzing phylogenetic distance data. Mol Biol Evol. 2002;19:2051–2059. doi: 10.1093/oxfordjournals.molbev.a004030.
- Holman EW, Wichmann S. New evidence from linguistic phylogenetics identifies limits to punctuational change. Syst Biol. 2017;66:604–610. doi: 10.1093/sysbio/syx031.
- Howe CJ, Windram HF. Phylomemetics-evolutionary analysis beyond the gene. PLoS Biol. 2011;9:e1001069. doi: 10.1371/journal.pbio.1001069.
- Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006;23:254–267. doi: 10.1093/molbev/msj030.
- Imrich W, Stockiĭ On optimal embeddings of metrics in graphs. Sibirsk Mat Z. 1972;13:558–565.
- Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, editor. Mammalian protein metabolism III. New York: Academic Press; 1969. pp. 21–132.
- Kalmanson K. Edgeconvex circuits and the traveling salesman problem. Can J Math. 1975;27:1000–1010. doi: 10.4153/CJM-1975-104-6.
- Kelley JL. General topology. New York: Van Nostrand; 1955.
- Kocsor A, Kertész-Farkas A, Kaján L, Pongor S. Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics. 2006;22:407–412. doi: 10.1093/bioinformatics/bti806.
- Kruskal JB. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964;29:1–27. doi: 10.1007/BF02289565.
- Leimeister CA, Morgenstern B. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014;30:2000–2008. doi: 10.1093/bioinformatics/btu331.
- Marmerola GD, Oikawa MA, Dias Z, Goldenstein S, Rocha A. On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS ONE. 2016;11:e0167822. doi: 10.1371/journal.pone.0167822.
- Mesoudi A, Whiten A, Laland KN. Towards a unified science of cultural evolution. Behav Brain Sci. 2006;29:329–347. doi: 10.1017/S0140525X06009083.
- Morgenstern B, Schöbel S, Leimeister CA. Phylogeny reconstruction based on the length distribution of k-mismatch common substrings. Algorithms Mol Biol. 2017;12:27. doi: 10.1186/s13015-017-0118-8.
- Nieselt-Struwe K. Graphs in sequence spaces: a review of statistical geometry. Biophys Chem. 1997;66:111–131. doi: 10.1016/S0301-4622(97)00064-1.
- O’Hara RJ, Robinson PM. Computer-assisted methods of stemmatic analysis. Occas Pap Canterb Tales Proj. 1993;1:53–74.
- Pagel M. Human language as a culturally transmitted replicator. Nat Rev Genet. 2009;10:405–415. doi: 10.1038/nrg2560.
- Penner O, Grassberger P, Paczuski M. Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies. PLoS ONE. 2011;6:e14373. doi: 10.1371/journal.pone.0014373.
- Pongsriiam P, Termwuttipong I. Remarks on ultrametrics and metric-preserving functions. Abstr Appl Anal. 2014;2014:163258. doi: 10.1155/2014/163258.
- RajaRajeswari P, Viswanadha Raju S. Phylogenetic trees construction with compressed DNA sequences using GENBIT COMPRESS tool. Ann Data Sci. 2017;4:105–121. doi: 10.1007/s40745-016-0098-4.
- Roe MJ. Chaos and evolution in law and economics. Harv Law Rev. 1996;109:641–668. doi: 10.2307/1342067.
- Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
- Sattah S, Tversky A. Additive similarity trees. Psychometrika. 1977;42:319–345. doi: 10.1007/BF02293654.
- Schuster P, Fontana W, Stadler PF, Hofacker IL. From sequences to shapes and back: a case study in RNA secondary structures. Proc R Soc Lond B. 1994;255:279–284. doi: 10.1098/rspb.1994.0040.
- Seo J, Croft WB (2008) Local text reuse detection. In: Chua TS, Leong MK, Myaeng SH, Oard DW, Sebastiani F (eds) Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp 571–578. ACM, New York
- Siebert S, Backofen R (2005) A new distance measure of RNA ensembles and its application to phylogenetic tree construction. In: Computational intelligence in bioinformatics and computational biology, CIBCB ’05. IEEE
- Simões-Pereira JMS. A note on the tree realizability of a distance matrix. J Comb Theory. 1969;6:303–310. doi: 10.1016/S0021-9800(69)80092-X.
- Steel MA. Recovering a tree from the leaf colourations it generates under a Markov model. Appl Math Lett. 1994;7:19–24. doi: 10.1016/0893-9659(94)90024-8.
- Steele J, Jordan P, Cochrane E. Evolutionary approaches to cultural and linguistic diversity. Philos Trans R Soc Lond B Biol Sci. 2010;365:3781–3785. doi: 10.1098/rstb.2010.0202.
- Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci. 1986;17:57–86.
- Telles GP, Almeida NF, Minghim R, Walter MEMT. Live phylogeny. J Comput Biol. 2013;20:30–37. doi: 10.1089/cmb.2012.0219.
- Tëmkin I, Eldredge N. Phylogenetics and material cultural evolution. Curr Anthropol. 2007;48:146–153. doi: 10.1086/510463.
- Thankachan SV, Chockalingam SP, Liu Y, Krishnan A, Aluru S. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinform. 2017;18:238. doi: 10.1186/s12859-017-1658-0.
- Vach W. The Jukes–Cantor transformation and additivity of estimated genetic distances. In: Schader M, editor. Analyzing and modeling data and knowledge. Berlin: Springer; 1992. pp. 141–150.
- Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioinformatics. 2003;19:513–523. doi: 10.1093/bioinformatics/btg005.
- Wagner GP (ed) (2001) The character concept in evolutionary biology. Academic Press, San Diego
- Wagner GP, Stadler PF. Quasi-independence, homology and the unity of type: a topological theory of characters. J Theor Biol. 2003;220:505–527. doi: 10.1006/jtbi.2003.3150.
- Wiens JJ. Character analysis in morphological phylogenetics: problems and solutions. Syst Biol. 2001;50:689–699. doi: 10.1080/106351501753328811.
- Wilson WA. On certain types of continuous transformations of metric spaces. Am J Math. 1935;57:62–68. doi: 10.2307/2372019.


