Combinatorics of least-squares trees

Radu Mihaescu; Lior Pachter

doi:10.1073/pnas.0802089105

Proc Natl Acad Sci U S A. 2008 Sep 8;105(36):13206–13211. doi: 10.1073/pnas.0802089105

PMCID: PMC2533170 PMID: 18779558

Abstract

A recurring theme in the least-squares approach to phylogenetics has been the discovery of elegant combinatorial formulas for the least-squares estimates of edge lengths. These formulas have proved useful for the development of efficient algorithms, and have also been important for understanding connections among popular phylogeny algorithms. For example, the selection criterion of the neighbor-joining algorithm is now understood in terms of the combinatorial formulas of Pauplin for estimating tree length. We highlight a phylogenetically desirable property that weighted least-squares methods should satisfy, and provide a complete characterization of methods that satisfy the property. The necessary and sufficient condition is a multiplicative four-point condition that the variance matrix needs to satisfy. The proof is based on the observation that the Lagrange multipliers in the proof of the Gauss–Markov theorem are tree-additive. Our results generalize and complete previous work on ordinary least squares, balanced minimum evolution, and the taxon-weighted variance model. They also provide a time-optimal algorithm for computation.

Keywords: phylogenetics, tree additivity, independence of irrelevant paths, minimum evolution, semimultiplicative maps

The least-squares approach to phylogenetics was first suggested by Cavalli-Sforza and Edwards (1) and Fitch and Margoliash (2).

Definition 1. (Pair-edge incidence matrix) A phylogenetic X-tree is a semi-labeled tree with leaves labeled by elements of a set X (see ref. 3 for basic definitions). Given such a tree T with edge set E and |X| = n, the pair-edge incidence matrix of T is the $\binom{n}{2} \times |E|$ matrix

$$S_T(ij, e) = \begin{cases} 1 & \text{if } e \text{ lies on the path from } i \text{ to } j \text{ in } T, \\ 0 & \text{otherwise.} \end{cases}$$

Definition 2. (Tree-additive map) Let T be a phylogenetic X-tree. A dissimilarity map D is T-additive if for some vector l ∈ R^|E|,

$$D = S_T l.$$
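As a concrete illustration of Definitions 1 and 2, the following Python sketch builds the pair-edge incidence matrix of a four-leaf tree and evaluates the T-additive map D = S_T l. The tree, the leaf names a, b, c, d, the internal node names u, v, and the edge lengths are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical quartet tree ab|cd: leaves a, b attach to u; leaves c, d attach to v.
EDGES = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
LEAVES = ["a", "b", "c", "d"]

def path_edges(edges, start, goal):
    """Return the indices of the edges on the unique path from start to goal."""
    adjacency = {}
    for index, (u, v) in enumerate(edges):
        adjacency.setdefault(u, []).append((v, index))
        adjacency.setdefault(v, []).append((u, index))
    stack = [(start, None, [])]
    while stack:
        node, parent, used = stack.pop()
        if node == goal:
            return used
        for neighbor, index in adjacency[node]:
            if neighbor != parent:
                stack.append((neighbor, node, used + [index]))
    raise ValueError("start and goal are not connected")

def pair_edge_incidence(edges, leaves):
    """Rows are leaf pairs, columns are edges; an entry is 1 if the edge is on the path."""
    pairs = list(combinations(leaves, 2))
    rows = []
    for i, j in pairs:
        on_path = set(path_edges(edges, i, j))
        rows.append([1 if k in on_path else 0 for k in range(len(edges))])
    return pairs, rows

pairs, S = pair_edge_incidence(EDGES, LEAVES)
lengths = [1, 2, 3, 4, 5]  # one illustrative length per edge, in EDGES order
D = [sum(s * l for s, l in zip(row, lengths)) for row in S]  # D = S_T l
for pair, row, dist in zip(pairs, S, D):
    print(pair, row, dist)
```

Each row of `S` marks the edges on one leaf-to-leaf path, so the T-additive distance of a pair is simply the sum of the lengths of the marked edges.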

Problem 1. [Ordinary least squares (OLS) (1)] Find the phylogenetic X-tree T and T-additive map D̂ that minimize

$$\sum_{\{i,j\}} (D_{ij} - \hat{D}_{ij})^2,$$

where the sum is over all pairs of elements of X.

For a fixed tree, the solution of Problem 1 is a linear algebra problem (Theorem 3). However, Rzhetsky and Nei (4) showed that the OLS edge lengths could instead be computed by using elegant and efficient combinatorial formulas. Their result was based on an observation of Vach (5), namely that OLS edge lengths obey the desirable independence of irrelevant pairs (IIP) property [our choice of terminology is inspired by social choice theory (6)]. Let T be a phylogenetic X-tree and e an edge in T. A linear edge length estimator for e is a linear function from dissimilarity maps to the real numbers, i.e., l̂_e = Σ_{ij} p_ij D_ij. We say that such an estimator satisfies the IIP property if p_ij = 0 when the path from i to j in T (denoted i,j) does not contain either of e's endpoints.

In other words, the IIP property is equivalent to the statement that the sufficient statistic for the least-squares estimator of the length of e is a projection of the dissimilarity map onto the coordinates given by pairs of leaves whose joining path contains at least one endpoint of e. It has been shown that this crucial property, motivated by the Markovian tree models used in phylogenetics, is satisfied not only by OLS estimators, but also by specific instances of weighted least squares (WLS) estimators (e.g., ref. 17).

Problem 2. (WLS) Let T be a phylogenetic X-tree and D be a dissimilarity map. Find the T-additive map D̂ that minimizes

$$\sum_{\{i,j\}} \frac{1}{V_{ij}} (D_{ij} - \hat{D}_{ij})^2.$$

The variance–covariance matrix for WLS is the $\binom{n}{2} \times \binom{n}{2}$ diagonal matrix V whose diagonal entries are the V_ij. Note that V can also be regarded as a dissimilarity map and we will do so in this paper. WLS for trees was first suggested in refs. 2 and 8, with the former proposing specifically V_ij = D_ij².

Theorem 3. (Least-squares solution) The solution to Problem 2 is given by D̂ = S_T l̂, where

$$\hat{l} = (S_T^t V^{-1} S_T)^{-1} S_T^t V^{-1} D. \tag{4}$$

We note that the OLS problem reduces to the case V = I. The statistical significance of the variance matrix together with a statistical interpretation of Theorem 3 is provided in the next section.
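The least-squares solution of Theorem 3 can be evaluated directly on a small example. The sketch below is a minimal illustration, assuming a diagonal V passed as a list and a hypothetical quartet tree whose incidence matrix is written out by hand; the helper names `solve` and `wls_edge_lengths` are not from the paper. Exact rational arithmetic is used so that the results can be compared without rounding.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def wls_edge_lengths(S, V, D):
    """Return (S^t V^-1 S)^-1 S^t V^-1 D for a diagonal V given as a list."""
    n_pairs, n_edges = len(S), len(S[0])
    w = [Fraction(1) / Fraction(v) for v in V]
    U = [[sum(w[r] * S[r][a] * S[r][b] for r in range(n_pairs))
          for b in range(n_edges)] for a in range(n_edges)]
    rhs = [sum(w[r] * S[r][a] * D[r] for r in range(n_pairs))
           for a in range(n_edges)]
    return solve(U, rhs)

# Pair-edge incidence matrix of a hypothetical quartet tree ab|cd.
# Rows: ab, ac, ad, bc, bd, cd. Columns: leaf edges of a, b, internal edge, leaf edges of c, d.
S = [[1, 1, 0, 0, 0],
     [1, 0, 1, 1, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 0, 1],
     [0, 0, 0, 1, 1]]

additive_D = [3, 8, 9, 9, 10, 9]   # equals S l for l = (1, 2, 3, 4, 5)
noisy_D = [3, 8, 9, 9, 10, 10]     # last entry perturbed by 1
ols = wls_edge_lengths(S, [1] * 6, noisy_D)                # V = I
wls = wls_edge_lengths(S, [1, 2, 3, 4, 5, 6], additive_D)  # arbitrary positive weights
print([str(x) for x in ols])
print([str(x) for x in wls])
```

When D is exactly T-additive, every choice of positive weights returns the generating edge lengths; the weights only matter once D deviates from additivity.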

It follows from Eq. 4 that the lengths of the edges in a weighted least-squares tree are linear combinations of the entries of the dissimilarity map. A natural question is therefore which variance matrices V result in edge length estimators that satisfy the IIP property? Our main result is an answer to this question in the form of a characterization (Theorem 6): A WLS model is IIP if and only if the variance matrix is semimultiplicative. We show that such matrices are good approximations to the variances resulting from popular distance estimation procedures. Moreover, we provide combinatorial formulas that describe the WLS edge lengths under semimultiplicative variances (Lemma 3) and show that they lead to optimal algorithms for computing the lengths (Theorem 8).

The key idea that leads to our results is a connection between Lagrange multipliers arising in the proof of the Gauss–Markov theorem and a weak form of the tree metric theorem of phylogenetics that provides a combinatorial characterization of tree-additive maps (Remark 1). This explains many isolated results in the literature on least squares in phylogenetics; in fact, as we show in The Multiplicative Model and Other Corollaries, almost all the known theorems and algorithms about least-squares estimates of edge lengths follow from our results.

Best Linear Unbiased Estimator (BLUE) Trees

The foundation of least-squares theory in statistics is the Gauss–Markov theorem. This theorem states that the BLUE for a linear combination of the edge lengths, when the errors have zero expectation, is a least-squares estimator. We explain this theorem in the context of Problem 2.

Lemma 1. For any phylogenetic X-tree T, the matrix S_T is full rank.

Proof: We show that for any e ∈ E, the vector f_e^t = (0,…,1,…,0) of size |E|, with a 1 in the eth position and 0 elsewhere, lies in the row span of S_T. Choose any i,j,k,l ∈ X such that the paths i,j and k,l in T do not intersect, and the intersection of the paths i,k and j,l is exactly the edge e. Note that

$$f_e^t = \frac{1}{2}\left((S_T)_{ik} + (S_T)_{jl} - (S_T)_{ij} - (S_T)_{kl}\right),$$

where (S_T)_ik is the row of S_T corresponding to the taxon pair i,k.
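The row combination used in this proof can be checked by hand on the smallest case. The snippet below is an illustrative check on a hypothetical quartet tree with cherries {i, j} and {k, l}, where e is the internal edge; it evaluates ½((S_T)_ik + (S_T)_jl − (S_T)_ij − (S_T)_kl) and recovers the standard basis vector of e.

```python
from fractions import Fraction

# Rows of S_T for a hypothetical quartet tree with cherries {i, j} and {k, l}.
# Columns: leaf edge of i, leaf edge of j, internal edge e, leaf edge of k, leaf edge of l.
S = {
    ("i", "j"): [1, 1, 0, 0, 0],
    ("i", "k"): [1, 0, 1, 1, 0],
    ("i", "l"): [1, 0, 1, 0, 1],
    ("j", "k"): [0, 1, 1, 1, 0],
    ("j", "l"): [0, 1, 1, 0, 1],
    ("k", "l"): [0, 0, 0, 1, 1],
}

half = Fraction(1, 2)
# Leaf-edge columns cancel; only the internal edge, counted twice, survives.
f_e = [half * (ik + jl - ij - kl)
       for ik, jl, ij, kl in zip(S["i", "k"], S["j", "l"], S["i", "j"], S["k", "l"])]
print([str(x) for x in f_e])
```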

Theorem 4. (Gauss–Markov Theorem) Suppose that D is a random dissimilarity map of the form D = S_Tl + ε, where T is a tree, and ε is a vector of random variables satisfying E(ε) = 0 and Var(ε) = V, where V is an invertible variance–covariance matrix for ε.

Let M(S_T) be the linear space generated by the rows of S_T and f^t ∈ M(S_T). Then f^t l̂ = p^tD (where l̂ is given by Eq. 4) has minimum variance among the linear unbiased estimators of f^t l.

Proof: Observe that the problem of finding p is equivalent to solving a constrained optimization problem:

$$\min_{p} \; p^t V p \quad \text{subject to} \quad S_T^t p = f.$$

The first condition specifies that the goal is to minimize the variance; the second constraint encodes the requirement that the estimator is unbiased. Using Lagrange multipliers, it is easy to see that the minimum variance unbiased estimator of f^tl is the unique vector p satisfying

$$V p = S_T \mu, \tag{7}$$

$$S_T^t p = f \tag{8}$$

for some vector μ ∈ R^|E| of Lagrange multipliers. In other words,

$$\mu = U^{-1} f, \qquad p = V^{-1} S_T U^{-1} f,$$

where U = S^t_TV⁻¹S_T.

The Gauss–Markov Theorem can also be proved directly by using linear algebra, but the Lagrange multiplier proof has two advantages: First, it provides a description of p different from Eq. 4 that is simpler and more informative. Second, the technique is general and can be used in many similar settings to find minimum-variance unbiased estimators. Hayes and Haslett (9) provide pedagogical arguments in favor of Lagrange multipliers for interpreting least-squares coefficients and discuss the origins of this approach in applied statistics (10).
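To make the Lagrange description concrete, the sketch below computes the multipliers μ = U⁻¹f with U = S^t_TV⁻¹S_T, the Lagrange tree Λ = S_Tμ, and the coefficients p = V⁻¹Λ for the internal edge of a hypothetical quartet tree under OLS (V = I). The tree and the helper names are assumptions made for illustration only.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def lagrange_solution(S, V, f):
    """Return (mu, Lambda, p) with mu = U^-1 f, Lambda = S mu, p = V^-1 Lambda."""
    n_pairs, n_edges = len(S), len(S[0])
    w = [Fraction(1) / Fraction(v) for v in V]
    U = [[sum(w[r] * S[r][a] * S[r][b] for r in range(n_pairs))
          for b in range(n_edges)] for a in range(n_edges)]
    mu = solve(U, f)
    lagrange_tree = [sum(s * m for s, m in zip(row, mu)) for row in S]
    p = [w[r] * lagrange_tree[r] for r in range(n_pairs)]
    return mu, lagrange_tree, p

# Hypothetical quartet ab|cd. Rows: ab, ac, ad, bc, bd, cd.
# Columns: leaf edges of a, b, the internal edge, leaf edges of c, d.
S = [[1, 1, 0, 0, 0],
     [1, 0, 1, 1, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 0, 1],
     [0, 0, 0, 1, 1]]
f = [0, 0, 1, 0, 0]  # selects the internal edge

mu, lagrange_tree, p = lagrange_solution(S, [1] * 6, f)
# Unbiasedness constraint: S^t p should reproduce f.
unbiased = [sum(S[r][a] * p[r] for r in range(len(S))) for a in range(len(f))]
print([str(x) for x in mu])
print([str(x) for x in p])
```

The multipliers play the role of (possibly negative) edge lengths of the Lagrange tree: here the four leaf edges get −1/4 and the internal edge gets 3/4.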

In phylogenetics, Theorem 4 and its proof are useful because for each edge e, the vector f_e in the standard basis for M(S_T) is associated with a vector p such that p^t D is the best linear unbiased estimator for the length of e. Similarly, the tree length is estimated from f_T^t = (1,1,…,1), which is also in M(S_T). The condition of Eq. 7 is particularly interesting because it says that there exists some T-additive map Λ = S_Tμ = Vp, whose (possibly negative) edge lengths are given by the Lagrange multipliers μ.

The following theorem provides a combinatorial characterization of tree-additive maps and, hence, of the Lagrange tree Λ.

Definition 3. (Weak four-point condition) A dissimilarity map D satisfies the weak four-point condition if for any i,j,k,l ∈ X, two of the following three linear forms are equal:

$$D_{ij} + D_{kl}, \qquad D_{ik} + D_{jl}, \qquad D_{il} + D_{jk}.$$

Theorem 5. A dissimilarity map D is tree-additive if and only if it satisfies the weak four-point condition.

Theorem 5 was first proved in ref. 11. For a recent exposition, see Corollary 7.6.8 of ref. 3, where it is derived by using the theory of group-valued dissimilarity maps. We note that the pair of equal quantities in the four-point condition defines the topology of a quartet. Furthermore, the topology of the tree is defined uniquely by the topologies of all its quartets. We again refer the reader to ref. 3 for details.
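The weak four-point condition is straightforward to test numerically. The following sketch is an illustration under the assumption that D is stored as a dictionary keyed by unordered pairs with exact (integer or rational) entries; the function name `weak_four_point` is hypothetical. Only quadruples of distinct taxa are enumerated, because the condition holds trivially when two of the four indices coincide.

```python
from itertools import combinations

def weak_four_point(D, taxa):
    """Check that, for every quartet, at least two of the three pairing sums coincide."""
    def d(x, y):
        return D[frozenset((x, y))]
    for i, j, k, l in combinations(taxa, 4):
        sums = [d(i, j) + d(k, l), d(i, k) + d(j, l), d(i, l) + d(j, k)]
        if len(set(sums)) == 3:  # all three sums differ: the condition fails
            return False
    return True

taxa = ["a", "b", "c", "d"]
# Additive on a hypothetical quartet ab|cd with edge lengths (1, 2, 3, 4, 5).
additive = {frozenset(p): v for p, v in [
    (("a", "b"), 3), (("a", "c"), 8), (("a", "d"), 9),
    (("b", "c"), 9), (("b", "d"), 10), (("c", "d"), 9)]}
# Same map with D_bc perturbed, so the three sums become 12, 18 and 19.
perturbed = dict(additive)
perturbed[frozenset(("b", "c"))] = 10

print(weak_four_point(additive, taxa), weak_four_point(perturbed, taxa))
```

For the additive map the two equal sums are D_ac + D_bd and D_ad + D_bc, which identifies the quartet topology ab|cd.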

The Lagrange equations (Eqs. 7 and 8) together with Theorem 5 form the mathematical basis for our results:

Remark 1. The condition of Eq. 7 specifies that Vp must be a T-additive map. It follows that Vp satisfies the weak four-point condition. In other words, Eq. 7 amounts to a combinatorial characterization of Vp, and hence p. The condition of Eq. 8 imposes a normalization requirement on p. Together these conditions are useful for finding p and also for understanding its combinatorial properties.

The structure of the Lagrange tree in the case of OLS is the middle quartet of the tree shown in Fig. 1. It immediately reveals interesting properties of the estimator. For example the fact that it is a tree on four taxa implies the IIP property. The content of ref. 12, Appendix 2 is that for tree length estimation under the balanced minimum evolution model, the Lagrange tree is the star tree. In fact, we will see that most of the known combinatorial results about least-squares estimates of edge and tree lengths can be explained by Remark 1 and interpreted in terms of the structure of the Lagrange tree.

Main Theorem

Our main result is a characterization of IIP WLS estimators. In the sections that follow, we will see that the IIP property for WLS is not only biologically desirable but also statistically motivated and algorithmically convenient. We begin by introducing some notation and concepts that are necessary for stating our main theorem.

Definition 4. (Clade) A clade of a phylogenetic X-tree T is a subset A ⊂ X such that there exists an edge in T whose removal induces the partition {A,X\A}. We also use clade to mean the induced topology T|_A.

Given a dissimilarity map D and a variance matrix V, we set

$$D_{AB} = \sum_{a \in A,\, b \in B} V_{ab}^{-1} D_{ab}, \qquad Z_{AB} = \sum_{a \in A,\, b \in B} V_{ab}^{-1},$$

where A,B are disjoint clades. If e₁,…,e_k ∈ E(T) form a path with ends determining disjoint clades A and B, then D_{e₁…e_k} and Z_{e₁…e_k} represent D_AB and Z_AB, respectively. For a single edge e defining clades A,B, the quantities D_e and Z_e represent D_AB and Z_AB.

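Note that if e* is an edge in a tree T, then Eqs. 7 and 8 imply that setting D = Λ above, where Λ is the Lagrange tree for any WLS edge length estimate l̂(e*), gives

$$\Lambda_e = \Lambda_{AB} = f_{e^*}(e) = \begin{cases} 1 & \text{if } e = e^*, \\ 0 & \text{otherwise,} \end{cases}$$

where A,B are the clades defined by edge e.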

Definition 5. (Semimultiplicative map) A dissimilarity map D is multiplicative with respect to disjoint clades A,B if for any a₁,a₂ ∈ A and b₁,b₂ ∈ B

$$D_{a_1 b_1} D_{a_2 b_2} = D_{a_1 b_2} D_{a_2 b_1}.$$

We say that D is semimultiplicative with respect to T if it is multiplicative with respect to any pair of disjoint clades A,B, not defined by the same edge of T.

The following lemma is left as an exercise.

Lemma 2. D is semimultiplicative if and only if, for a diagonal variance matrix V with entries given by V_ij = D_ij for all pairs i,j, every clade A of T has the property that for any A′⊂A, and any clade B disjoint from A and induced by a different edge,

$$\frac{Z_{\{x\}A'}}{Z_{\{x\}A}} = \xi_{A'A}^{B} \tag{13}$$

for all x ∈ B, where ξ_A′A^B does not depend on x.

In fact, A satisfies Eq. 13 for all relevant B if and only if Eq. 13 holds for the two clades disjoint from A and defined by the two edges adjacent to the edge defining A.

The semimultiplicative condition is slightly weaker than log(V) being tree-additive. Removing the requirement that the clades A,B are defined by different edges of T results in the multiplicative analog of the four-point condition. By Theorem 5, this is equivalent to $V_{ij} = \prod_{e \in i,j} w(e)^{-1}$ for some w : E(T) → R₊. Such dissimilarity maps are called tree-multiplicative, and are studied in ref. 13.
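A small numerical check may help to fix the multiplicative four-point condition. The sketch below builds a tree-multiplicative variance on a hypothetical quartet tree from arbitrary positive edge weights and tests the condition of Definition 5 for the clades {a, b} and {c, d}. For contrast it also tests a Fitch–Margoliash-style variance V_ij = D_ij² built from an additive D, which fails this particular check. Because {a, b} and {c, d} are defined by the same edge, this illustrates the full multiplicative condition rather than semimultiplicativity. All names and numbers are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def is_multiplicative(V, A, B):
    """Check V[a1,b1] * V[a2,b2] == V[a1,b2] * V[a2,b1] for all a1, a2 in A and b1, b2 in B."""
    def v(x, y):
        return V[frozenset((x, y))]
    return all(v(a1, b1) * v(a2, b2) == v(a1, b2) * v(a2, b1)
               for a1, a2 in product(A, repeat=2)
               for b1, b2 in product(B, repeat=2))

# Hypothetical quartet ab|cd; "e" is the internal edge, other keys are leaf edges.
weights = {"a": 2, "b": 3, "e": 5, "c": 7, "d": 11}
lengths = {"a": 1, "b": 2, "e": 3, "c": 4, "d": 5}
paths = {("a", "b"): "ab", ("c", "d"): "cd",
         ("a", "c"): "aec", ("a", "d"): "aed",
         ("b", "c"): "bec", ("b", "d"): "bed"}

V_mult, V_fm = {}, {}
for pair, path in paths.items():
    inverse = Fraction(1)
    for edge in path:
        inverse *= weights[edge]            # V_ij^-1 is the product of w(e) along the path
    V_mult[frozenset(pair)] = 1 / inverse
    distance = sum(lengths[edge] for edge in path)
    V_fm[frozenset(pair)] = Fraction(distance) ** 2   # V_ij = D_ij^2

print(is_multiplicative(V_mult, ["a", "b"], ["c", "d"]))
print(is_multiplicative(V_fm, ["a", "b"], ["c", "d"]))
```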

Theorem 6. (Characterization of IIP WLS estimators) A WLS edge length estimator for an edge in a tree T has the IIP property if and only if the variance matrix is semimultiplicative with respect to T.
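The IIP property can be observed directly on a small tree. The sketch below is an illustrative check, not part of the proof: it uses a hypothetical five-leaf caterpillar tree, computes the coefficients p = V⁻¹S_T(S^t_TV⁻¹S_T)⁻¹f for a leaf edge and for an internal edge, and inspects the coefficients of pairs whose path avoids both endpoints of the target edge. Two variance models that depend only on the number of edges k on the path are tried: OLS (V = 1) and V = 2^k, which is tree-multiplicative with w_e = 1/2.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical five-leaf caterpillar tree; u, v, w are internal nodes.
EDGES = [("1", "u"), ("2", "u"), ("u", "v"), ("3", "v"),
         ("v", "w"), ("4", "w"), ("5", "w")]
LEAVES = ["1", "2", "3", "4", "5"]

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def path_nodes(edges, start, goal):
    """Return the nodes on the unique path from start to goal."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    stack = [[start]]
    while stack:
        nodes = stack.pop()
        if nodes[-1] == goal:
            return nodes
        for nxt in adjacency[nodes[-1]]:
            if nxt not in nodes:
                stack.append(nodes + [nxt])
    raise ValueError("start and goal are not connected")

def coefficients(edges, leaves, variance, target):
    """Return {pair: p_ij} for the WLS estimator of the edge with index `target`."""
    pairs = list(combinations(leaves, 2))
    S, w = [], []
    for i, j in pairs:
        nodes = path_nodes(edges, i, j)
        steps = [set(step) for step in zip(nodes, nodes[1:])]
        S.append([1 if set(edge) in steps else 0 for edge in edges])
        w.append(Fraction(1, variance(len(steps))))  # variance depends on path length only
    m = len(edges)
    U = [[sum(w[r] * S[r][a] * S[r][b] for r in range(len(pairs)))
          for b in range(m)] for a in range(m)]
    mu = solve(U, [1 if k == target else 0 for k in range(m)])
    return {pair: w[r] * sum(S[r][k] * mu[k] for k in range(m))
            for r, pair in enumerate(pairs)}

def irrelevant_pairs(edges, leaves, target):
    """Pairs whose path contains neither endpoint of the target edge."""
    ends = set(edges[target])
    return [(i, j) for i, j in combinations(leaves, 2)
            if not ends & set(path_nodes(edges, i, j))]

models = {"OLS": lambda k: 1, "V = 2^k": lambda k: 2 ** k}
results = {}
for name, variance in models.items():
    for target in (0, 2):  # leaf edge (1, u) and internal edge (u, v)
        p = coefficients(EDGES, LEAVES, variance, target)
        results[name, target] = [p[pair] for pair in irrelevant_pairs(EDGES, LEAVES, target)]
print(results)
```

Both models are semimultiplicative, so by Theorem 6 every coefficient collected in `results` should vanish.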

The proof of the theorem reduces to the WLS solution for the length of an edge in a tree with at most eight leaves (edge e* in Fig. 1):

Proposition 7. Let T be the phylogenetic X-tree shown in Fig. 1. The Lagrange tree Λ = S_Tμ for the WLS problem of estimating the length of the edge e* satisfies the property that μ₁ = −μ₂, μ₃ = −μ₄, μ₅ = −μ₆ and μ₇ = −μ₈. Furthermore, these Lagrange multipliers and the remaining ones μ₉,…,μ₁₃ can be computed by solving μ = (S^t_TV⁻¹S_T)⁻¹f_e*.

Proof: Using the notation of Fig. 1, with the convention that the edge labeled by μ_i is e_i, it follows from Eq. 8 that Λ_{e_i} = 0 for i = 1,2,9. But Λ_{e_i} = Λ_{e_ie_j} + Λ_{e_ie_k} for {i,j,k} = {1,2,9}, which implies that Λ_{e_ie_j} = 0 for all i,j ∈ {1,2,9}. Therefore Λ_{e₁e₂} = Λ_AB = V_AB⁻¹(μ₁+μ₂) = 0. The arguments for the pairs e₃,e₄; e₅,e₆; and e₇,e₈ are identical, and the result follows. The complete solution μ for a given V is obtained from μ = (S^t_TV⁻¹S_T)⁻¹f_e*, which reduces to the inversion of a 13 × 13 matrix.

Note that the proof only uses the fact that e₁, e₂ are adjacent leaf edges not adjacent to e*. The conclusion μ_e₁=−μ_e₂ will hold identically in any tree for a pair of edges of this type.

Proof of Theorem 6: We begin by showing that if V is semimultiplicative, then the WLS edge length estimators have the IIP property. This calculation involves showing that for any phylogenetic X-tree T and edge e* ∈ T, the Lagrange tree for e* is the tree in Fig. 1, where A,B,C,D,E,F,G,H are clades with the property that their intraclade Lagrange multipliers are zero.

Let e₁,…,e_k, with k ≤ 8, be the edges of T such that either d(e*,e_i) = 2 or d(e*,e_i) < 2 and e_i is a leaf edge. For i ∈ {1,…,k}, let C_i be the clade defined by e_i such that e* ∉ C_i. Let T^/e* be the phylogenetic X^/e*-tree, where X^/e* = {C₁,…,C_k}, with topology induced by T in the natural way (see Fig. 1). Let V^/e* be the diagonal variance matrix on pairs of nodes in X^/e* given by V_{C_iC_j}^/e*=Z_{C_iC_j}⁻¹.

We let μ^/e* be the Lagrange multipliers and Λ^/e* be the Lagrange tree for the problem of estimating the WLS edge length of e* from variance V^/e* and topology T^/e*. Let l̂^/e*(e*) = (Λ^/e*)^t(V^/e*)⁻¹D^/e* be the resulting estimator given the distance matrix D^/e* on X^/e*, defined by D_{C_iC_j}^/e* = D_{C_iC_j}Z_{C_iC_j}⁻¹.

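Lemma 3. If V is semimultiplicative, the T-additive map given by Λ = S_Tμ with

$$\mu_e = \begin{cases} \mu_e^{/e^*} & \text{if } e \in E(T^{/e^*}), \\ 0 & \text{otherwise} \end{cases}$$

satisfies the Lagrange equations for T and V. Thus μ are the Lagrange multipliers for l̂(e*) and l̂(e*) = l̂^/e*(e*).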

Proof of Lemma 3: It is an easy exercise to check that for all e ∈ E(T^/e*), we have Z_e^/e*=Z_e and Λ_e^/e*=Λ_e, where Z^/e* is an analog of Z for variance V^/e* and topology T^/e*. This implies that Λ_e = f_e*(e) for all e ∈ E(T^/e*), i.e. the Lagrange equation (Eq. 8) is satisfied for e ∈ E(T^/e*).

Now consider edge e ∈ E(C₁). We need to verify that Λ_e = 0. Because Λ_ij = 0 for all i,j ∈ C₁, Λ_e = Λ_{e…e₂} + Λ_{e…e₉}. Now for all i ∈ C₁ and j ∈ C₂, Λ_ij = μ₁^/e* + μ₂^/e* = 0, so Λ_{e…e₂} = 0. Finally, let A′⊂A be a clade defined by e and let A″ be the clade defined by e₉ that does not intersect A. The fact that V is semimultiplicative implies (Lemma 2) that for any taxon x ∈ A″

$$\frac{Z_{\{x\}A'}}{Z_{\{x\}A}} = \xi_{A'A},$$

where ξ_A′A does not depend on the taxon x. This implies Λ_{e…e₉} = Λ_{e₁…e₉}ξ_A′A^C₁ = Λ_{e₁…e₉}^/e*ξ_A′A^C₁ = 0 by the proof of Proposition 7. This shows that Λ_e = 0 for e ∈ E(T)\E(T^/e*) and shows that μ are the Lagrange multipliers corresponding to l̂(e*).

Also, Λ_uv = Λ_{C_iC_j}^/e* for all u ∈ C_i, v ∈ C_j, which easily implies

$$\Lambda^t V^{-1} D = (\Lambda^{/e^*})^t (V^{/e^*})^{-1} D^{/e^*},$$

which is in turn equivalent to l̂(e*) = l̂^/e*(e*).

Because μ_e = 0 for all e ∉ T^/e*, it is enough to show that Λ^/e* satisfies the IIP property. This follows from Proposition 7. Therefore, V has the IIP property with respect to T, i.e. Λ_ij = 0 for all i,j ∈ X such that i,j does not contain an endpoint of e*. This concludes the proof for the “if” part of Theorem 6.

For the “only if” direction, we will prove by induction that Eq. 13 is satisfied by all clades A of T, and thus the variance V is semimultiplicative with respect to T. The base case is provided by clades formed by a single leaf, for which Eq. 13 holds vacuously.

For the induction step, suppose clades A and B both satisfy Eq. 13, and that they are defined by adjacent edges e_A and e_B (see Fig. 2). Let e_C be the other edge adjacent to e_A and e_B and let C = X\(A ∪ B) be the clade it defines. We would like to prove that the clade (A ∪ B) also satisfies Eq. 13. If |C| = 1, this holds vacuously. We may therefore assume that there exist two more edges e₁,e₂ incident with e_C. Let C_i⊂C be the clade defined by e_i, for i = 1,2. It suffices to prove that (A ∪ B) satisfies Eq. 13 with respect to C₁ and C₂. Notice that A and B already satisfy Eq. 13 with respect to C₁ and C₂. Therefore it is enough to show that $\frac{Z_{\{x\}A}}{Z_{\{x\}(A \cup B)}} = \xi_{A(A \cup B)}^{C_1}$ is the same for all x ∈ C₁, and similarly for C₂.

Now consider the problem of estimating l̂(e_A). Let μ be the corresponding Lagrange multipliers and Λ = S_Tμ be the Lagrange tree they define. By the IIP property, Λ defines an identically zero tree-additive map on the clade C. Therefore the edge lengths corresponding to this map are all zero. This implies μ_e = 0 for all e ∈ E(C), e ≠ e₁,e₂, and also μ_e₁ + μ_e₂ = 0.

Let A₁,…,A_k, with k ≤ 4, and B₁,…,B_t, with t ≤ 2, be the subclades of A and B, respectively, corresponding to nodes of T^/e_A. Then for any x ∈ C₁, y ∈ A_i, and z ∈ B_j, Λ_xy = Λ_{C₁A_i}^/e_A does not depend on x,y, and Λ_xz = Λ_{C₁B_j}^/e_A does not depend on x,z.

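Now pick leaf x ∈ C₁ and let e be the leaf edge adjacent to it. Then Λ_e = 0. Because all Lagrange multipliers are 0 inside the clade C₁, Λ_e = Λ_{e…e₁} = Λ_{e…e₂} + Λ_{e…e_C}. Because μ_e₁ + μ_e₂ = 0 and all Lagrange multipliers in C₁ and C₂ are 0, Λ_{e…e₂} = 0. Thus Λ_{e…e_C} = Λ_{x}A + Λ_{x}B = 0. Equivalently,

$$\sum_{i=1}^{k} \Lambda_{C_1A_i}^{/e_A} Z_{\{x\}A_i} + \sum_{j=1}^{t} \Lambda_{C_1B_j}^{/e_A} Z_{\{x\}B_j} = 0.$$

Because A and B satisfy Eq. 13 with respect to C₁, each Z_{x}A_i is a multiple of Z_{x}A, and each Z_{x}B_j is a multiple of Z_{x}B, by factors that do not depend on x. This imposes a nontrivial linear equation on Z_{x}A and Z_{x}B whose coefficients do not depend on x. Thus,

$$\frac{Z_{\{x\}A}}{Z_{\{x\}(A \cup B)}} = \frac{Z_{\{x\}A}}{Z_{\{x\}A} + Z_{\{x\}B}}$$

does not depend on x.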

An Optimal Algorithm for WLS Edge Lengths

Theorem 8. (Computing WLS edge lengths) Let D be a dissimilarity map and V an IIP variance matrix. The set of all WLS edge length estimates for a tree T can be computed in O(n²) time, where n is the number of leaves in T.

Proof: Consider a given edge e*. Preserving the notation of the previous section, let C₁,…,C_k be the clades of T corresponding to the vertices of T^/e*. By Lemma 3 we have:

$$\hat{l}(e^*) = (f^{/e^*})^t \left( (S^{/e^*})^t (V^{/e^*})^{-1} S^{/e^*} \right)^{-1} (S^{/e^*})^t (V^{/e^*})^{-1} D^{/e^*},$$

where S^/e* is the pair-edge incidence matrix of T^/e*, and f^/e* is the standard basis vector corresponding to e* in T^/e*.

Once D^/e* and V^/e* are known, by Proposition 7, all the above steps take constant time to compute: The dominant computation is the inversion of a matrix of size at most 13 × 13. But for any edge e*, the elements of V^/e* and D^/e* are Z_AB⁻¹ and D_AB/Z_AB for clades A,B of T, separated by at least two edges. Thus it only remains to show that we can compute D_AB and Z_AB, for all pairs of disjoint clades A,B, in O(n²) time.

We define the height of a clade to be the distance between its root and its farthest leaf, where the root is taken to be the closest endpoint of the edge defining the clade. Thus the height of a clade formed by just one leaf is 0. Now consider the configuration in Fig. 3. The clades A,B,C are all pairwise disjoint and A and B are adjacent. It is easy to see that A ∪ B forms a clade for which

$$D_{A \cup B, C} = D_{AC} + D_{BC}, \qquad Z_{A \cup B, C} = Z_{AC} + Z_{BC}.$$

Fig. 3. — Configuration of the dynamic programming recursion for computing WLS edge lengths. A, B, and A ∪ B are clades, and C is a clade disjoint from A ∪ B. The oval in the middle represents the rest of the tree.

Thus, we need only constant time to compute D_{A∪B,C} and Z_{A∪B,C} if D_AC, Z_AC, D_BC, and Z_BC are known. There are O(n) clades and, thus, O(n²) pairs of disjoint clades. We can compute D_AB and Z_AB for all pairs A,B through a simple dynamic program: We start with pairs of clades of height 0 (leaves), for which the values of D and Z are trivially given by D and V⁻¹. After round 2t of the algorithm, we will know D_AB and Z_AB for all disjoint pairs A,B of height at most t. After round 2t + 1 we know D_AB and Z_AB for all disjoint pairs A,B of height t + 1 and t, respectively.

The algorithm is optimal because its running time is proportional to the size of the input.
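The recursion behind this dynamic program can be sketched in a few lines. The code below is a simplified illustration, not the round-by-round algorithm of the proof: it assumes the unnormalized sums D_AB = Σ D_ab/V_ab and Z_AB = Σ 1/V_ab (the reading consistent with V^/e* = Z⁻¹ and D^/e* = D_AB/Z_AB), represents a clade as a leaf name or a nested pair of clades, and memoizes on pairs of clades so that each pair is combined from its two sub-pairs in constant time. The clade encoding, the function names, and the numbers are assumptions made for the example.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations

def leaves_of(clade):
    """Leaves of a clade given as a leaf name or a nested pair of clades."""
    if isinstance(clade, str):
        return [clade]
    left, right = clade
    return leaves_of(left) + leaves_of(right)

def make_pair_sums(D, V):
    """Return a memoized function computing (D_AB, Z_AB) for disjoint clades A, B."""
    @lru_cache(maxsize=None)
    def pair_sums(A, B):
        if isinstance(A, str) and isinstance(B, str):
            key = frozenset((A, B))
            return D[key] / V[key], 1 / V[key]   # base case: a pair of leaves
        if isinstance(A, str):                   # make A the clade that gets split
            A, B = B, A
        left, right = A
        d_left, z_left = pair_sums(left, B)
        d_right, z_right = pair_sums(right, B)
        return d_left + d_right, z_left + z_right  # sums over a union are additive
    return pair_sums

taxa = ["a", "b", "c", "d", "e"]
pairs = [frozenset(p) for p in combinations(taxa, 2)]
D = {p: Fraction(k + 1) for k, p in enumerate(pairs)}  # arbitrary illustrative values
V = {p: Fraction(k + 2) for k, p in enumerate(pairs)}  # arbitrary positive variances

A, C = ("a", "b"), (("c", "d"), "e")
pair_sums = make_pair_sums(D, V)
d_ac, z_ac = pair_sums(A, C)

# Brute-force sums over all leaf pairs, for comparison.
brute_d = sum(D[frozenset((x, y))] / V[frozenset((x, y))]
              for x in leaves_of(A) for y in leaves_of(C))
brute_z = sum(1 / V[frozenset((x, y))]
              for x in leaves_of(A) for y in leaves_of(C))
print(d_ac, z_ac)
```

The quantities needed for the reduced problem are then 1/Z_AB and D_AB/Z_AB, each obtained in constant time from the cached values.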

The Multiplicative Model and Other Corollaries

In this section, we begin by giving formulas for the WLS edge lengths using a tree-multiplicative variance model, i.e., $V_{ij} = \prod_{e \in i,j} w_e^{-1}$ for some w : E(T) → R₊. Throughout the section, e* ∈ E(T) denotes the edge for which the WLS length is being computed. If e* is an internal edge, then A,B,C,D are disjoint clades induced by adjacent edges. In the case that e* is adjacent to a leaf, that leaf is labeled i, and the adjacent clades are A,B.

Note that in Proposition 9, Z_AB and D_AB are as defined in the previous section; for all the other examples, however, these quantities are redefined locally.

Proposition 9. The WLS edge length of an internal edge e* under dissimilarity map D and tree-multiplicative variance V is

graphic file with name zpq03608-4046-m25.jpg

if e* is adjacent to a leaf then the WLS length is

At first glance, these formulas may seem surprising, but the derivation is straightforward after solving for the Lagrange multipliers. By Lemma 3, it is enough to solve for the Lagrange multipliers in the tree T^/e*, for (multiplicative) variance V^/e*. The proof of the above proposition is left as an instructive exercise for the reader. It rests on the following analog of Lemma 3 for tree-multiplicative variances.

Proposition 10. The Lagrange tree for the WLS estimation of a single edge length under multiplicative variance is a quartet tree.

We now present a number of previous results about least squares that can be interpreted (and in some cases completed) by using Theorems 6 and 8, and Proposition 9. All the models we discuss here are special cases of the multiplicative variance model and our statements can be easily proven by substituting the appropriate form of V into Eqs. 21 and 22 and modifying the expressions for Z and D accordingly.

OLS. This is the first, and most studied, model for least-squares edge and tree length estimation. It corresponds to a variance matrix equal to the identity matrix.

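Corollary 11. [Rzhetsky and Nei (4)] The OLS estimate l̂(e*) = p^tD = f_{e*}^t(S^t_TS_T)⁻¹S^t_TD for the length of an internal edge e*, with clades A,B adjacent to one endpoint of e* and C,D adjacent to the other, is given by

$$\hat{l}(e^*) = \frac{1}{2}\left[\gamma (D_{AC} + D_{BD}) + (1 - \gamma)(D_{AD} + D_{BC}) - (D_{AB} + D_{CD})\right], \qquad \gamma = \frac{n_B n_C + n_A n_D}{(n_A + n_B)(n_C + n_D)},$$

where n_A, n_B, n_C, and n_D are the number of leaves in the clades A,B,C, and D, and $D_{AC} = \frac{1}{n_A n_C} \sum_{a \in A, c \in C} D_{ac}$.

If e* is a leaf edge, l̂(e*) is given by

$$\hat{l}(e^*) = \frac{1}{2}\left(D_{Ai} + D_{Bi} - D_{AB}\right).$$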

Our algorithm for computing edge lengths (Theorem 8) reduces, in the case of OLS, to that of ref. 14. It has the same optimal running time as the algorithms in refs. 5, 15 and 16.
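As a sanity check on the combinatorial description of OLS edge lengths, the sketch below compares two computations of the length of an internal edge on a hypothetical six-leaf tree: the direct solution of the normal equations S^t_TS_Tl = S^t_TD, and the classical closed form in terms of average distances between the four clades around the edge, with weight γ = (n_Bn_C + n_An_D)/((n_A + n_B)(n_C + n_D)) on D_AC + D_BD. The tree, the clade labels, the symbol γ, and the arbitrary integer dissimilarities are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical six-leaf tree; the target edge (u, v) separates clades A, B from C, D.
EDGES = [("1", "x"), ("2", "x"), ("x", "u"), ("3", "u"), ("u", "v"),
         ("4", "v"), ("v", "y"), ("5", "y"), ("6", "y")]
LEAVES = ["1", "2", "3", "4", "5", "6"]
TARGET = 4
A, B, C, D = ["1", "2"], ["3"], ["4"], ["5", "6"]

def solve(M_in, b):
    """Solve the square system M_in x = b exactly by Gauss-Jordan elimination."""
    n = len(M_in)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(M_in, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def path_nodes(edges, start, goal):
    """Return the nodes on the unique path from start to goal."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    stack = [[start]]
    while stack:
        nodes = stack.pop()
        if nodes[-1] == goal:
            return nodes
        for nxt in adjacency[nodes[-1]]:
            if nxt not in nodes:
                stack.append(nodes + [nxt])
    raise ValueError("start and goal are not connected")

def ols_by_normal_equations(edges, leaves, dist, target):
    """OLS length of edge `target` from (S^t S)^-1 S^t D."""
    pairs = list(combinations(leaves, 2))
    S = []
    for i, j in pairs:
        nodes = path_nodes(edges, i, j)
        steps = [set(step) for step in zip(nodes, nodes[1:])]
        S.append([1 if set(edge) in steps else 0 for edge in edges])
    m = len(edges)
    U = [[sum(row[a] * row[b] for row in S) for b in range(m)] for a in range(m)]
    rhs = [sum(row[a] * dist[frozenset(pair)] for row, pair in zip(S, pairs))
           for a in range(m)]
    return solve(U, rhs)[target]

def ols_by_formula(dist, A, B, C, D):
    """Closed form using average distances between the four clades around the edge."""
    def avg(X, Y):
        total = sum(dist[frozenset((x, y))] for x in X for y in Y)
        return Fraction(total) / (len(X) * len(Y))
    nA, nB, nC, nD = len(A), len(B), len(C), len(D)
    gamma = Fraction(nB * nC + nA * nD, (nA + nB) * (nC + nD))
    return Fraction(1, 2) * (gamma * (avg(A, C) + avg(B, D))
                             + (1 - gamma) * (avg(A, D) + avg(B, C))
                             - (avg(A, B) + avg(C, D)))

# Arbitrary (non-additive) integer dissimilarities, chosen only for illustration.
dist = {frozenset(pair): (7 * k * k + 3) % 17 + 1
        for k, pair in enumerate(combinations(LEAVES, 2))}
direct = ols_by_normal_equations(EDGES, LEAVES, dist, TARGET)
closed_form = ols_by_formula(dist, A, B, C, D)
print(direct, closed_form)
```

The clade sizes here are deliberately unbalanced (γ = 5/9), so the agreement is not an artifact of the symmetric case γ = 1/2.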

Balanced minimum evolution (BME)

The BME model was introduced by Pauplin in ref. 17. The motivation was that in the computation of l̂(e*) in the OLS model, the distances D_ac and D_bd can receive different total weight from D_ad and D_bc, where a ∈ A, b ∈ B, c ∈ C, and d ∈ D. Pauplin therefore suggested an alternative model where all clades are weighted equally. One then defines recursively:

D_{a}{b} = D_ab for leaves a and b
D_AB = (D_A′B + D_A″B)/2 for disjoint clades A and B such that A′ and A″ are the two clades of A pointing away from B.

Corollary 12. (Pauplin's edge formula) The WLS edge lengths with variance model $V_{ij} \propto 2^{|i,j|}$, where |i,j| is the number of edges on the path i,j, are given by $\hat{l}(e^*) = \frac{1}{4} (D_{AC} + D_{BD} + D_{AD} + D_{BC}) - \frac{1}{2} (D_{AB} + D_{CD})$ for internal edges and $\hat{l}(e^*) = \frac{1}{2} (D_{Ai} + D_{Bi}) - \frac{1}{2} D_{AB}$ for edges adjacent to leaves.

This corresponds to the multiplicative variance model with w_e = 0.5 for all edges e and follows easily from Proposition 9. As far as we are aware, this proof that the formulas given by Pauplin for edge lengths are in fact the WLS edge weights under the variance model described above has not been previously stated.

This result accompanies the connection between Pauplin's tree length formula and WLS tree length under the BME model established by Desper and Gascuel in ref. 12. They proved the following:

Corollary 13. [Desper and Gascuel (12)] The tree length estimator given by $\hat{l} = \sum_{a,b} D_{ab}\, 2^{1-|a,b|}$ is the minimum variance tree length estimator for the BME model. It is also identical to the one given by the coefficients p^t = f^t(S^t_TV⁻¹S_T)⁻¹S^t_TV⁻¹.

Proof: The second part of the corollary follows trivially from Theorem 4. For the first part, notice that $p_{ab} = 2^{1-|a,b|}$, therefore p_ab V_ab is the uniform vector and thus defines a T-additive map corresponding to the star topology (equal-length leaf edges and zero-length internal edges). Finally, $\sum_{i,j} S_{ij,e}\, p_{ij} = 1$ for every edge e follows from an easy counting argument.
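Pauplin's formulas are easy to verify on a tree-additive input, where the estimates must reproduce the generating lengths. The sketch below is a minimal check on a hypothetical quartet ab|cd with edge lengths (1, 2, 3, 4, 5): it evaluates the tree length Σ D_ab 2^(1−|a,b|), with |a,b| the number of edges on the path, and the internal edge formula ¼(D_AC + D_BD + D_AD + D_BC) − ½(D_AB + D_CD) with single-leaf clades.

```python
from fractions import Fraction

# Additive distances on a hypothetical quartet ab|cd with leaf edges a=1, b=2, c=4, d=5
# and internal edge of length 3; path_len counts the edges between the two leaves.
D = {("a", "b"): 3, ("a", "c"): 8, ("a", "d"): 9,
     ("b", "c"): 9, ("b", "d"): 10, ("c", "d"): 9}
path_len = {("a", "b"): 2, ("a", "c"): 3, ("a", "d"): 3,
            ("b", "c"): 3, ("b", "d"): 3, ("c", "d"): 2}

# Pauplin's tree length: each pair is weighted by 2^(1 - number of edges on its path).
tree_length = sum(Fraction(D[pair]) * Fraction(2) ** (1 - path_len[pair]) for pair in D)

# Balanced internal edge estimate with A={a}, B={b}, C={c}, D={d}.
internal = (Fraction(1, 4) * (D["a", "c"] + D["b", "d"] + D["a", "d"] + D["b", "c"])
            - Fraction(1, 2) * (D["a", "b"] + D["c", "d"]))

# Balanced leaf edge estimate for leaf a, with adjacent clades {b} and {c, d}.
d_a_cd = Fraction(D["a", "c"] + D["a", "d"], 2)
d_b_cd = Fraction(D["b", "c"] + D["b", "d"], 2)
leaf_a = Fraction(1, 2) * (D["a", "b"] + d_a_cd) - Fraction(1, 2) * d_b_cd

print(tree_length, internal, leaf_a)
```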

The taxon-weighted variance model. Another well known WLS model was introduced in ref. 18. Under this model, V_ij⁻¹=t_it_j for some t₁,…,t_n ∈ R₊. In the tree-multiplicative model, this corresponds to setting w_e = 1 for internal edges and w_e = t_i when e is the edge adjacent to leaf i. Ref. 18 gives a beautiful proof for the statistical consistency of this model (which implies statistical consistency of OLS), and also provides an O(n²) algorithm for computing the WLS edge lengths. However, the algorithm is based on a recursive agglomeration scheme, and an explicit formula for the edge lengths based on the values of D is not given. Such a formula follows from Theorem 6:

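Corollary 14. For e* an internal edge of T, the WLS edge length l̂(e*) is given by

$$\hat{l}(e^*) = \frac{1}{2}\left[\gamma (D_{AC} + D_{BD}) + (1 - \gamma)(D_{AD} + D_{BC}) - (D_{AB} + D_{CD})\right], \qquad \gamma = \frac{T_B T_C + T_A T_D}{(T_A + T_B)(T_C + T_D)},$$

where $T_X = \sum_{x \in X} t_x$ and $D_{XY} = \sum_{x \in X, y \in Y} \frac{t_x t_y}{T_X T_Y} D_{xy}$.

If e* is adjacent to a leaf,

$$\hat{l}(e^*) = \frac{1}{2}\left(D_{Ai} + D_{Bi} - D_{AB}\right).$$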

Final Remarks

An important question is whether the variance matrices required for the IIP property to hold are realistic for problems where branch lengths are estimated by using standard evolutionary models. We argue that although variances of distances estimated by using maximum likelihood are not multiplicative, they are approximately so for large distances.

We illustrate this for the Jukes–Cantor model (19). Because the estimated distance between two sequences can be infinite with small, but nonzero probability, the expected distance and its variance are infinite. However, the large sample “δ approximation” in the following proposition is commonly used in practice.

Proposition 15. (20,21) Let Y be the fraction of different nucleotides between two sequences of length n, generated from the Jukes–Cantor process with branch length δ. Then the expected value of the empirical distance $D = -\frac{3}{4} \log\left(1 - \frac{4}{3} Y\right)$ is δ and its variance is

Because the branch lengths for an evolutionary model are tree-additive, this shows that a tree-multiplicative model for variances is very reasonable: For large δ, $\mathrm{Var}(D) \simeq \frac{9}{16 n} e^{\frac{8}{3} \delta}$, which is tree-multiplicative. This result can be extended to more general models (22).
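The step from tree-additive branch lengths to multiplicative variances can be illustrated numerically. The sketch below is an illustration only: it implements the Jukes–Cantor distance from the proposition and takes the large-δ form quoted above at face value, as a constant times e^(8δ/3). It then checks the multiplicative four-point condition V_ac V_bd = V_ad V_bc on a hypothetical quartet with additive branch lengths. The constant cancels in this check, so the conclusion does not depend on its value. Function names and branch lengths are assumptions.

```python
import math

def jc_distance(y):
    """Jukes-Cantor distance D = -3/4 log(1 - 4/3 Y) for a mismatch fraction Y < 3/4."""
    return -0.75 * math.log(1.0 - 4.0 * y / 3.0)

def large_delta_variance(delta, scale):
    """Large-delta shape scale * exp(8/3 * delta); the text quotes scale = 9/(16 n)."""
    return scale * math.exp(8.0 * delta / 3.0)

# Hypothetical branch lengths on a quartet ab|cd; "e" is the internal edge.
length = {"a": 0.5, "b": 0.7, "e": 0.9, "c": 1.1, "d": 1.3}

def delta(x, y):
    """Tree-additive distance between x in {a, b} and y in {c, d}."""
    return length[x] + length["e"] + length[y]

n = 1000                      # illustrative sequence length
scale = 9.0 / (16.0 * n)      # constant as quoted in the text
V = {(x, y): large_delta_variance(delta(x, y), scale) for x in "ab" for y in "cd"}

lhs = V["a", "c"] * V["b", "d"]
rhs = V["a", "d"] * V["b", "c"]
print(lhs, rhs, jc_distance(0.1))
```

Both products equal scale² · exp((8/3)(l_a + l_b + l_c + l_d + 2 l_e)), which is why additivity of δ translates into multiplicativity of the approximate variance.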

Theorem 6 sheds light on the Fitch–Margoliash model. The assumption that the variance V_ij=Var(D_ij)∝D_ij² results in a WLS estimator that is not IIP because V is not semimultiplicative. This means that for generic dissimilarity maps, the Fitch–Margoliash least-squares estimates of edge lengths will depend on irrelevant distance estimates. On the other hand, Corollary 12 is useful for interpreting the neighbor-joining algorithm. In fact, it is possible to show that the edge length estimates of ref. 23 are precisely the Pauplin formulas (and hence least-squares formulas) for the trees at each agglomerative step.

Our optimal algorithm for weighted least-squares edge length estimates for multiplicative matrices is similar in spirit to some of the algorithms in ref. 15. In fact, we believe that all the fast algorithms for WLS edge lengths can be understood within a single framework. The unifying concept is the observation that they all essentially estimate the Lagrange tree, either via a top-down or bottom-up approach. We defer a detailed discussion.

Finally, a key issue is that of consistency of the minimum evolution principle for weighted least-squares tree length, under specific forms of variance matrices assigned to all trees (18, 24). An obvious question is what classes of semimultiplicative variance matrices result in consistent tree estimates and, moreover, which semimultiplicative variance matrices give identical tree-length estimates. In the latter direction, we note that although it follows from Theorem 4 that for any V and f there is a unique BLUE p for f^t l, the converse of this statement is not true. In fact, for some tree topologies, it is even possible that the OLS tree length is equal to the BME WLS tree length (for example, for five-taxon trees). This means that by minimizing the tree length, some information about the variance is being discarded. This may be viewed as a weakness rather than as a strength of minimum evolution frameworks. A full discussion of this topic is beyond the scope of this article.

ACKNOWLEDGMENTS.

R.M. was supported by a National Science Foundation (NSF) Graduate Fellowship and partially by the Fannie and John Hertz foundation. L.P. was supported in part by NSF Grant CCF-0347992.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

References

1.Cavalli-Sforza L, Edwards A. Phylogenetic analysis models and estimation procedures. Evolution. 1967;32:550–570. doi: 10.1111/j.1558-5646.1967.tb03411.x. [DOI] [PubMed] [Google Scholar]
2.Fitch WM, Margoliash E. Construction of phylogenetic trees. Science. 1967;155:279–284. doi: 10.1126/science.155.3760.279. [DOI] [PubMed] [Google Scholar]
3.Semple C, Steel M. Phylogenetics. Oxford: Oxford Univ Press; 2003. [Google Scholar]
4.Rzhetsky A, Nei M. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol Biol Evol. 1993;10:1073–1095. doi: 10.1093/oxfordjournals.molbev.a040056. [DOI] [PubMed] [Google Scholar]
5.Vach W. In: Least squares approximation of additive trees in Conceptual and Numerical Analysis of Data. Opitz O, editor. Heidelberg: Springer; 1989. pp. 230–238. [Google Scholar]
6.Ray P. Independence of irrelevant alternatives. Econometrica. 1973;41:987–991. [Google Scholar]
7.Semple C, Steel M. Cyclic permutations and evolutionary trees. Appl Math. 2003;32:669–680. [Google Scholar]
8.Hartigan JA. Representation of similarity matrices by trees. J Am Stat Assoc. 1967;62:1140–1158. [Google Scholar]
9.Hayes K, Haslett J. Simplifying general least squares. Am Stat. 1999;53:376–381. [Google Scholar]
10.Matheron G. Les Variables Regionalisés et Leur Estimation. Paris: Mason; 1962. [Google Scholar]
11. Patrinos AN, Hakimi SL. The distance matrix of a graph and its tree realization. Q J Appl Math. 1972;30:255–269.
12. Desper R, Gascuel O. Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting. Mol Biol Evol. 2004;21:587–598. doi: 10.1093/molbev/msh049.
13. Gill J, Linusson S, Moulton V, Steel M. A regular decomposition of the edge-product space of phylogenetic trees. Adv Appl Math. 2008;41:158–176.
14. Desper R, Vingron M. Tree fitting: Topological recognition from ordinary least-squares edge length estimates. J Classification. 2002;19:87–112.
15. Bryant D, Waddell P. Rapid evaluation of least squares and minimum evolution criteria on phylogenetic trees. Mol Biol Evol. 1998;15:1346–1359.
16. Gascuel O. Concerning the NJ algorithm and its unweighted version, UNJ. In: Mathematical Hierarchies and Biology. Vol V. American Mathematical Society; 1997. pp. 149–170.
17. Pauplin Y. Direct calculation of a tree length using a distance matrix. J Mol Evol. 2000;51:41–47. doi: 10.1007/s002390010065.
18. Denis F, Gascuel O. On the consistency of the minimum evolution principle of phylogenetic inference. Dis Appl Math. 2003;127:63–77.
19. Jukes TH, Cantor C. Evolution of protein molecules. In: Munro HN, editor. Mammalian Protein Metabolism. New York: Academic Press; 1969. pp. 21–32.
20. Bulmer M. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol Biol Evol. 1991;8:868–883.
21. Nei M, Jin L. Variances of the average numbers of nucleotide substitutions within and between populations. Mol Biol Evol. 1989;6:290–300. doi: 10.1093/oxfordjournals.molbev.a040547.
22. Felsenstein J. Inferring Phylogenies. Sunderland, MA: Sinauer; 2003.
23. Saitou N, Nei M. The neighbor joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
24. Willson SJ. Consistent formulas for estimating the total lengths of trees. Dis Appl Math. 2005;148:214–239.