A Variance-Components Model for Distance-Matrix Phylogenetic Reconstruction

Walter R Gilks; Tom MW Nye; Pietro Lio

doi:10.2202/1544-6115.1574

2011 Mar 30;10(1):Article 16.


PMCID: PMC3535883

Abstract

Phylogenetic trees describe evolutionary relationships between related organisms (taxa). One approach to estimating phylogenetic trees supposes that a matrix of estimated evolutionary distances between taxa is available. Agglomerative methods have been proposed in which closely related taxon-pairs are successively combined to form ancestral taxa. Several of these computationally efficient agglomerative algorithms involve steps to reduce the variance in estimated distances. We propose an agglomerative phylogenetic method which focuses on statistical modeling of variance components in distance estimates. We consider how these variance components evolve during the agglomerative process. Our method simultaneously produces two topologically identical rooted trees, one tree having branch lengths proportional to elapsed time, and the other having branch lengths proportional to underlying evolutionary divergence. The method models two major sources of variation which have been separately discussed in the literature: noise, reflecting inaccuracies in measuring divergences, and distortion, reflecting randomness in the amounts of divergence in different parts of the tree. The methodology is based on successive hierarchical generalized least-squares regressions. It involves only means, variances and covariances of distance estimates, thereby avoiding full distributional assumptions. Exploitation of the algebraic structure of the estimation leads to an algorithm with computational complexity comparable to the leading published agglomerative methods. A parametric bootstrap procedure allows full uncertainty in the phylogenetic reconstruction to be assessed. Software implementing the methodology may be freely downloaded from StatTree.

Keywords: agglomerative method, distance matrix, generalized least-squares regression, phylogenetic tree, variance-components model

1 Introduction

Phylogenetic trees describe evolutionary relationships between related organisms, or taxa. Currently, the most sophisticated statistical methods for reconstructing phylogenetic trees depend on probabilistic models of DNA or protein sequence evolution, estimated using maximum likelihood or Markov chain Monte Carlo; see for example Felsenstein (2004). These approaches require many evolutionary assumptions but concentrate on small-scale events in sequences such as point mutations or deletions. Such methods do not accommodate large-scale genomic rearrangements or supplementary non-sequence data such as phenotype or habitat. Moreover, their need to explore the vastness of tree space may render them computationally infeasible for problems involving very large numbers of taxa.

For large or complex phylogenetic data, methods based on distance matrices, such as NJ (Saitou and Nei, 1987), BioNJ (Gascuel, 1997), Weighbor (Bruno, Socci, and Halpern, 2000), MVR (Gascuel, 2000) and Fastme (Desper and Gascuel, 2004), offer greater flexibility and computational efficiency. Such methods involve a distance metric which scores the dissimilarity between each pair of taxa. The resulting inter-taxon distance matrix is then processed by an agglomerative algorithm which progressively combines pairs of taxa, forming ancestral taxa and estimating branch lengths in the tree, concomitantly reducing the size of the working distance matrix. In general, for more than 3 taxa, the imputed phylogeny cannot reproduce observed distances exactly. Thus, a distance matrix contains uncertainty or noise, and might be consistent with many alternative phylogenies. Ideally, this uncertainty should be taken into account during the agglomeration in estimating branch lengths, in making topological decisions and in monitoring the amounts of information available for those decisions.

Each of the published agglomerative methods addresses the issue of uncertainty in its own style. NJ employs ordinary least-squares (OLS) and Weighbor uses weighted least squares (WLS) estimates of branch lengths, which might be appropriate when distance estimates are uncorrelated (see, for example, Felsenstein (1987)). However, phylogenetic distances are often derived from DNA or protein sequence data, based on simple models of sequence evolution. Such models imply correlated distance estimates with variances which increase exponentially with branch length (Nei, Stephens, and Saitou, 1985, Bulmer, 1991). For such distances, Gascuel (1997) developed BioNJ, introducing the idea of estimating branch lengths by minimizing a sum of branch-length variances at each agglomeration. Rather different optimization criteria involving variances of branch-lengths are employed by Weighbor, MVR and Fastme.

To estimate phylogenetic trees, correlations between distances should be taken into account (Chakraborty, 1977). For this, Hasegawa, Kishino, and Yano (1985) and Bulmer (1991) have proposed the use of generalized least-squares (GLS). However, these non-agglomerative methods require estimation of the full tree in a single computationally intensive step, and hence are unsuitable for exploring large phylogenies. On the other hand, for computational efficiency, agglomerative methods such as NJ, Weighbor and Fastme employ OLS or WLS, and thus do not take formal account of distance correlations. The full implementation of MVR correctly takes account of induced distance correlations, but greater computational efficiency is obtained if they are ignored (Gascuel, 2000).

Some phylogenetic methods are ultrametric. An ultrametric tree assumes a molecular clock beating out a constant pace of evolution in all branches. If branch lengths are drawn proportional to the amount of evolution occurring along them, then an ultrametric tree may be drawn with a physical time axis such that each ancestral taxon is aligned at its extant time, and each leaf taxon at the present time, as in Figure 1(a). Non-ultrametric trees, on the other hand, allow evolution to deviate from time-proportionality, so that leaf taxa become unaligned, as in Figure 1(b). Such varying amounts of evolution may be due to positive or negative selection pressures, to interspecific differences in generation times, to random variation in the rate of evolution (Thorne, Kishino, and Painter, 1998, Kishino, Thorne, and Bruno, 2001), or simply to the randomness of evolutionary events under neutral selection. Currently, the practitioner must decide whether to conduct an ultrametric or a non-ultrametric analysis. However, underlying any non-ultrametric divergence tree must lie a topologically identical but ultrametric time tree in which branch lengths are drawn proportional to elapsed physical time. This assumes that each speciation event occurs at a specific (but generally unknown) physical time. Thus an ultrametric time-tree and a non-ultrametric divergence tree can be simultaneously true. They estimate different aspects of evolution, sharing the same topology but differing in branch lengths.

Here we introduce a new distance-based agglomerative phylogenetic method, based on a variance-components model involving only means, variances and covariances of distances. Thus our method takes account of distance correlations but avoids making full probabilistic evolutionary assumptions. Our method simultaneously produces both a non-ultrametric divergence tree and its underlying ultrametric time tree. Consequently, without needing to specify an outgroup, our divergence tree is rooted. The true divergence tree may, to a greater or lesser extent, conform to a molecular clock assumption; our method allows the degree of this non-ultrametricity to be estimated from the data.

Our approach uses a sequence of staged GLS regressions to build the estimated tree, each stage formally incorporating variances and covariances in distances observed or estimated at the previous stage. The GLS computations involve inversion of square matrices with dimension of order O(m²), where m is the number of taxa. If tackled head-on, each GLS regression would have computational complexity of order O(m⁶). However, thorough exploitation of the algebraic structure of these matrices leads to a GLS algorithm with computational complexity of order O(m), and the full phylogenetic reconstruction has computational complexity on a par with the most efficient distance-based algorithms. A parametric bootstrap procedure allows full phylogenetic uncertainty to be evaluated. Our methodology is implemented in StatTree, which may be freely downloaded from http://www.mas.ncl.ac.uk/~ntmwn/stattree. Examples of the application of StatTree can be found at http://www.cl.cam.ac.uk/~pl219/GilksNyeLioSAGMB.html.

Figure 1: Illustrating (a) an ultrametric tree with time axis recording years before present (bp) and leaves aligned at the present time; and (b) a non-ultrametric tree of the same topology.

2 Method

2.1 Time, divergence and distance

We distinguish between physical time t, divergence d, and distance d̂.

Time: Let t_j denote the time before present (bp) that taxon j was extant. Let k denote the most recent common ancestor of taxa j and ℓ. Then the total physical time over which taxa j and ℓ have independently evolved is t_jℓ = (t_k − t_j) + (t_k − t_ℓ) = 2t_k − t_j − t_ℓ.
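As a small worked illustration of the definition of t_jℓ, the following sketch evaluates it for hypothetical taxon times. The function name and the numerical values are ours, chosen only for illustration.

```python
def independent_evolution_time(t_j, t_l, t_k):
    """Total time over which taxa j and l have evolved independently,
    given their extant times t_j, t_l and the extant time t_k of their
    most recent common ancestor (all measured before present)."""
    if t_k < max(t_j, t_l):
        raise ValueError("an ancestor cannot be more recent than its descendants")
    return (t_k - t_j) + (t_k - t_l)


# Hypothetical values: ancestor extant 5 time units before present.
both_present_day = independent_evolution_time(0.0, 0.0, 5.0)
one_ancient = independent_evolution_time(2.0, 0.0, 5.0)
print(both_present_day, one_ancient)
```

For two present-day taxa the result is simply twice the age of their common ancestor; an ancient taxon shortens its own side of the path.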

Divergence: This is the concept of branch length, which we define here to be the gross amount of change in sequence or other variables, summed along a path in the evolutionary tree. Each change makes a positive contribution to divergence, the amount being determined in some way appropriate to the problem in hand. Thus a change immediately followed by its reversal both contribute positively. Since such reversed changes are generally undetectable, divergences are intrinsically unobservable. Let d_jk = d_kj denote the divergence between any two taxa j and k. We expect d_jk to be equal on average to t_jk (subsuming an arbitrary scaling of time), with some random variation around this mean.

Distance: Although divergences are unobservable, we may be able to estimate them from available sequence or other data. We use the term distance and the notation d̂_jk to denote a measurement or estimate of d_jk. We expect d̂_jk to be equal on average to d_jk, with some random variation around this mean reflecting measurement imprecision.

The distinction between distance and divergence is seen more sharply when considering tree-additivity. Divergences are tree-additive, that is, for any three taxa j,k,ℓ where k lies on the path in the phylogenetic tree from j to ℓ, we have

d_jℓ = d_jk + d_kℓ. (1)

However, tree-additivity is generally only an approximate property of distances, that is, d̂_jℓ ≈ d̂_jk + d̂_kℓ.

For present-day taxa, we assume sequence or other data are available from which distances may be calculated directly. However, such data are in general unavailable for ancient taxa, so distances between them must be computed indirectly. Indeed, agglomerative methods recursively compute distances between ancient taxa on the basis of previously computed distances between more recent taxa (see below). That is, divergences are estimated recursively on the basis of previous divergence estimates. Thus, as we consider taxa j and k increasingly far back in time, distances d̂_jk will tend to become increasingly imprecise estimates of their underlying divergences d_jk. On the other hand, the number of pairs of present-day taxa descending from j and k and therefore contributing to the calculation of d̂_jk will concomitantly increase, which should tend to reduce imprecision (although there has been some debate on this issue; see Zwickl and Hillis (2002)). Therefore, as the agglomeration proceeds, it is unclear how much imprecision accumulates in the working distance matrix. It would be desirable to know whether, or under what conditions, imprecision accumulates or dissipates in agglomerative methods.

As noted above, current agglomerative methods employ a variety of strategies to take account of imprecision in distances. Here we consider the question of imprecision through a more coherent statistical-modeling framework. We show how imprecision can be estimated and monitored as the tree reconstruction proceeds; how it can be factored into the tree reconstruction; and how all this can be done without introducing unacceptable computational overheads.

2.2 Agglomeration

In common with most other distance-matrix methods, our approach is agglomerative. We begin with a set of currently extant taxa (or other genetic objects) of interest, and a matrix of distances between them. Recursively, we choose two taxa to be replaced by a new taxon representing their last common ancestor (LCA). This taxon pair is chosen to minimize the estimated total tree length, according to the neighbor-joining principle of Saitou and Nei (1987) (see Section 2.11). Distances from the new taxon to each of the remaining taxa are then computed. This agglomerative process continues until a single taxon remains.

We start the agglomerative process at stage 0 with m⁽⁰⁾ taxa all extant at time 0, and a distance matrix containing n⁽⁰⁾ = m⁽⁰⁾(m⁽⁰⁾ − 1)/2 elements in its upper triangle. There will be m⁽⁰⁾ agglomerative stages. At a generic stage i ≥ 0 of the agglomerative process, we consider a set of m⁽ⁱ⁾ taxa, labeled 1⁽ⁱ⁾, 2⁽ⁱ⁾, ..., m⁽ⁱ⁾. These taxa may have been extant at different times; we use t_j⁽ⁱ⁾ to denote the extant time of taxon j⁽ⁱ⁾. At this stage we will have a working distance matrix between these m⁽ⁱ⁾ taxa, stored for convenience in a vector d̂⁽ⁱ⁾ of length n⁽ⁱ⁾ = m⁽ⁱ⁾(m⁽ⁱ⁾ − 1)/2, whose elements are indexed by taxon-pairs in upper-triangular row order, which we refer to as standard pair order, as follows:

(1,2), (1,3), ..., (1,m⁽ⁱ⁾), (2,3), ..., (2,m⁽ⁱ⁾), ..., (m⁽ⁱ⁾−1,m⁽ⁱ⁾). (2)
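The standard pair order can be made concrete with a short sketch. The helper names below are ours and are not part of StatTree; the closed-form index is just the usual count of upper-triangle entries preceding row j.

```python
from itertools import combinations


def standard_pair_order(m):
    """Taxon pairs (j, k) with 1 <= j < k <= m, in upper-triangular row order."""
    return list(combinations(range(1, m + 1), 2))


def pair_index(j, k, m):
    """0-based position of pair (j, k), j < k, within the standard pair order."""
    pairs_in_earlier_rows = (j - 1) * m - j * (j - 1) // 2
    return pairs_in_earlier_rows + (k - j - 1)


pairs = standard_pair_order(4)
print(pairs)
print(len(pairs) == 4 * (4 - 1) // 2)
```

Storing the upper triangle as a vector in this order means a distance can be looked up by taxon labels without ever forming the full matrix.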

Our principal tasks at stage i are to select, on the basis of d̂⁽ⁱ⁾, the two taxa to be agglomerated; to determine the extant time of their LCA; and to calculate a vector d̂⁽ⁱ⁺¹⁾ containing distances between the resulting m⁽ⁱ⁺¹⁾ = m⁽ⁱ⁾ − 1 taxa. Other quantities to be estimated are described below. The LCA represents a new internal taxon in the tree under construction, as illustrated in Figure 2.

2.3 Divergence relationships

To simplify the presentation, without loss of generality, we assume until further notice that the two taxa to be agglomerated at stage i are 1⁽ⁱ⁾ and 2⁽ⁱ⁾, as in Figure 2. Then, at stage i+1, we label the new LCA as taxon 1⁽ⁱ⁺¹⁾, and relabel each remaining taxon as follows:

stage	taxon label
i	3⁽ⁱ⁾	4⁽ⁱ⁾	...	k ⁽ⁱ⁾	...	m ⁽ⁱ⁾
i+1	2⁽ⁱ⁺¹⁾	3⁽ⁱ⁺¹⁾	...	(k–1)⁽ⁱ⁺¹⁾	...	m ⁽ⁱ⁺¹⁾


Figure 2: Illustrating a generic stage i of the agglomerative process, showing (a) the ultrametric time tree with its time axis, and (b) the corresponding non-ultrametric divergence tree whose edge lengths are proportional to the amount of evolution taking place along them. The m⁽ⁱ⁾ = 4 current taxa are denoted by black circles, and the LCA of taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾, which are to be agglomerated next to form taxon 1⁽ⁱ⁺¹⁾, is denoted by a white circle. Taxon 1⁽ⁱ⁺¹⁾ will replace taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾ at the next stage i+1. Broken lines indicate earlier stages of the agglomeration.

From our assumption (1) of tree-additivity of divergences, we may then write

d_{12}⁽ⁱ⁾ = a_1⁽ⁱ⁾ + a_2⁽ⁱ⁾,
d_{jk}⁽ⁱ⁾ = a_j⁽ⁱ⁾ + d_{1,k−1}⁽ⁱ⁺¹⁾, for j = 1,2 and k = 3,...,m⁽ⁱ⁾,
d_{jk}⁽ⁱ⁾ = d_{j−1,k−1}⁽ⁱ⁺¹⁾, for 3 ≤ j < k ≤ m⁽ⁱ⁾, (3)

where d_{jk}⁽ⁱ⁾ denotes the divergence between taxa j⁽ⁱ⁾ and k⁽ⁱ⁾, and where a_1⁽ⁱ⁾ and a_2⁽ⁱ⁾ denote the divergences along the edges from taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾ to their LCA, taxon 1⁽ⁱ⁺¹⁾. The first condition in (3) states that the divergence between the two taxa being agglomerated is equal to the sum of the divergences from each of them to taxon 1⁽ⁱ⁺¹⁾, their LCA. The second condition states that the divergence between either of these two taxa and any other taxon is the sum of the divergences from each of them to taxon 1⁽ⁱ⁺¹⁾ (noting the change in taxon labels at stage i+1, described above). The final condition states that the divergence between any other pair of taxa remains unaltered (again noting the change in taxon labels).

Equation (3) may be expressed more concisely in vector notation:

d⁽ⁱ⁾ = A⁽ⁱ⁾S⁽ⁱ⁾a⁽ⁱ⁾ + B⁽ⁱ⁾d⁽ⁱ⁺¹⁾, (4)

where d⁽ⁱ⁾ is the n⁽ⁱ⁾-vector of divergences d_{jk}⁽ⁱ⁾ stored in standard pair order (2); a⁽ⁱ⁾ is the 2-vector (a_1⁽ⁱ⁾, a_2⁽ⁱ⁾)^T, where ^T denotes transposition; and A⁽ⁱ⁾, S⁽ⁱ⁾ and B⁽ⁱ⁾ are the following (n⁽ⁱ⁾ × m⁽ⁱ⁾), (m⁽ⁱ⁾ × 2) and (n⁽ⁱ⁾ × n⁽ⁱ⁺¹⁾) binary matrices. Matrix S⁽ⁱ⁾ contains a 1 in its (1,1) and (2,2) elements, and 0 elsewhere. The rows of A⁽ⁱ⁾ and B⁽ⁱ⁾ correspond to taxon pairs of stage i in standard pair order. Each column k = 1,...,m⁽ⁱ⁾ of A⁽ⁱ⁾ indicates which taxon-pairs of stage i contain taxon k⁽ⁱ⁾. The columns of B⁽ⁱ⁾ correspond to taxon pairs of stage i+1 in standard pair order, the column for pair (j⁽ⁱ⁺¹⁾,k⁽ⁱ⁺¹⁾) indicating which taxon pairs of stage i are connected on the tree via the path connecting taxa j⁽ⁱ⁺¹⁾ and k⁽ⁱ⁺¹⁾. For example, from Figure 2 we have

taxon pair	A⁽ⁱ⁾ (columns 1⁽ⁱ⁾ 2⁽ⁱ⁾ 3⁽ⁱ⁾ 4⁽ⁱ⁾)	B⁽ⁱ⁾
(1⁽ⁱ⁾,2⁽ⁱ⁾)	1 1 0 0	0 0 0
(1⁽ⁱ⁾,3⁽ⁱ⁾)	1 0 1 0	1 0 0
(1⁽ⁱ⁾,4⁽ⁱ⁾)	1 0 0 1	0 1 0
(2⁽ⁱ⁾,3⁽ⁱ⁾)	0 1 1 0	1 0 0
(2⁽ⁱ⁾,4⁽ⁱ⁾)	0 1 0 1	0 1 0
(3⁽ⁱ⁾,4⁽ⁱ⁾)	0 0 1 1	0 0 1

where the row and column labels of A are shown, the row labels of B are as for A, and the column labels of B correspond to taxon pairs (1⁽ⁱ⁺¹⁾, 2⁽ⁱ⁺¹⁾), (1⁽ⁱ⁺¹⁾, 3⁽ⁱ⁺¹⁾) and (2⁽ⁱ⁺¹⁾, 3⁽ⁱ⁺¹⁾).
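The verbal definitions of A, S and B can be checked with a minimal sketch for the four-taxon case of Figure 2. It assumes, as our reading of the three tree-additivity conditions, that the stage-i divergences are obtained as A S a + B d⁽ⁱ⁺¹⁾. The function names and the numerical divergences are hypothetical.

```python
from itertools import combinations


def design_matrices(m):
    """Binary matrices A (n x m), S (m x 2) and B (n x n_next) for the
    agglomeration of taxa 1 and 2 among m taxa."""
    pairs = list(combinations(range(1, m + 1), 2))       # stage i pairs
    next_pairs = list(combinations(range(1, m), 2))      # stage i+1 pairs
    A = [[int(taxon in pair) for taxon in range(1, m + 1)] for pair in pairs]
    S = [[int(row == col) for col in range(2)] for row in range(m)]

    def relabel(taxon):
        # Taxa 1 and 2 become the new taxon 1; taxon k >= 3 becomes k - 1.
        return 1 if taxon <= 2 else taxon - 1

    B = [[int((relabel(j), relabel(k)) == q) for q in next_pairs] for j, k in pairs]
    return A, S, B


def stage_divergences(A, S, B, a, d_next):
    """Evaluate A S a + B d_next using plain lists."""
    Sa = [sum(S[t][c] * a[c] for c in range(2)) for t in range(len(S))]
    return [sum(x * y for x, y in zip(A[r], Sa)) + sum(x * y for x, y in zip(B[r], d_next))
            for r in range(len(A))]


A, S, B = design_matrices(4)
a = [0.25, 0.5]              # hypothetical edge divergences to the LCA
d_next = [1.0, 2.0, 1.5]     # hypothetical stage i+1 divergences
d = stage_divergences(A, S, B, a, d_next)
print(d)
```

The row of B for pair (1,2) is zero because that pair's divergence is carried entirely by the two edges to the LCA.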

2.4 Statistical assumptions

We now state our statistical assumptions concerning divergences d⁽ⁱ⁾ and distances d̂⁽ⁱ⁾. Our approach is moment-based: we argue entirely in terms of conditional expectations, variances and covariances, and thus avoid the need to make explicit distributional assumptions, unlike Weighbor, for example, which assumes Gaussian distances.

We adopt the convention that all expectations, variances and covariances implicitly condition on the true but unknown tree topology and taxon times, including the assertion (for now) that the LCA of taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾ is the most recent of any ith-stage taxon pair. However, where expectations and variances condition on divergences, we will make this explicit.

2.4.1 Non-ultrametricity

Our first set of assumptions concerns the relationship between the divergences and times which we consider at each stage i ≥ 0 of the algorithm. Let u⁽ⁱ⁾ be the 2-vector containing the time-intervals from taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾ to their LCA, taxon 1⁽ⁱ⁺¹⁾:

u⁽ⁱ⁾ = t_1⁽ⁱ⁺¹⁾1₂ − S⁽ⁱ⁾^T t⁽ⁱ⁾, (5)

where 1_k denotes a k-vector of 1s and t⁽ⁱ⁾ = (t_1⁽ⁱ⁾, ..., t_m⁽ⁱ⁾)^T with m = m⁽ⁱ⁾. Conditioning on d⁽ⁱ⁺¹⁾ and (implicitly) on the taxon times, we assume:

E[a⁽ⁱ⁾ | d⁽ⁱ⁺¹⁾] = u⁽ⁱ⁾, (6)
Var[a⁽ⁱ⁾ | d⁽ⁱ⁺¹⁾] = ν diag(u⁽ⁱ⁾), (7)

where ν ≥ 0 is a scalar constant and diag(u⁽ⁱ⁾) denotes the diagonalization of vector u⁽ⁱ⁾. (It may seem strange to condition here on d⁽ⁱ⁺¹⁾ which we do not know, but formally d⁽ⁱ⁺¹⁾ will take the role of a parameter vector in a regression model, as described below.) Thus, we assert that the variance of the divergence along a path is proportional to that path’s time-length, and that divergences along non-overlapping paths are uncorrelated.

For example, suppose we have a DNA sequence of length n bases, where each base has one of the four states {A,C,G,T}. If we assume the Jukes–Cantor model of sequence evolution (Jukes and Cantor, 1969), in which events (state-changes) occur independently and with equal probability according to a Poisson process at rate λ, then the total number of events (including reversals of previous state-changes) in the sequence in a time-interval u will be Poisson with mean λnu. Defining divergence a to be the total number of events per base in time-interval u, scaling time so that λ = 1, and setting ν = n⁻¹, leads to divergence a having mean u and variance νu, in agreement with (6,7). This simple example helps to provide some intuition about ν. However, our methodology is not tied to such simple evolutionary processes. With suitable scaling of time and choice of ν, any homogeneous independent-increments process will satisfy (6,7), including more general sequence evolution models (Kimura, 1980, Hasegawa et al., 1985, Lanave, Preparata, Saccone, and Serio, 1984), with or without rate-variation across sites (Yang, 1993). The assumptions might be equally applicable to much more complex processes including large-scale genomic rearrangements; see, for example, Wang and Warnow (2005). Our moment-based approach has the advantage that details of the stochastic process need not be specified beyond (6,7).
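The Poisson example can be checked numerically. The sketch below simulates the event count by accumulating exponential waiting times, which is one standard way to sample a Poisson process using only the standard library. The sequence length, time interval, seed and replicate count are arbitrary choices of ours.

```python
import random
from statistics import mean, variance


def simulate_divergence(n_sites, u, rng, rate=1.0):
    """Events per site during a time interval of length u, when events over
    the whole sequence follow a Poisson process of total rate rate * n_sites."""
    events = 0
    clock = rng.expovariate(rate * n_sites)   # waiting time to the first event
    while clock <= u:
        events += 1
        clock += rng.expovariate(rate * n_sites)
    return events / n_sites


rng = random.Random(1)
n_sites, u = 100, 0.5
nu = 1.0 / n_sites
draws = [simulate_divergence(n_sites, u, rng) for _ in range(5000)]
sample_mean, sample_var = mean(draws), variance(draws)
print(sample_mean, u)          # sample mean should be close to u
print(sample_var, nu * u)      # sample variance should be close to nu * u
```

With λ = 1 the simulated divergence has mean close to u and variance close to νu, so longer sequences (smaller ν) give divergences that track elapsed time more tightly.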

Equation (7) allows divergences to depart from a strict molecular clock assumption. Thus the divergence tree is not constrained to be ultrametric, unlike the time tree. We call this kind of variation distortion. The amount of molecular clock distortion is controlled by parameter ν. From (6,7), setting ν = 0 would assert that a⁽ⁱ⁾ = u⁽ⁱ⁾ for all i, so branch lengths in the divergence tree would be identical to those in the time tree and the divergence tree would then be ultrametric, obeying a strict molecular clock. In contrast, setting ν > 0 confers some elasticity in the branches of the divergence tree, as illustrated in Figure 1(b), with larger ν conferring greater elasticity. In the simple Poisson process example above, we have ν = n⁻¹. However, our methodology does not require ν to be set externally, as we provide methodology for estimating it (see Section 2.12).

Taking expectations and variances over a⁽ⁱ⁾ in equation (4) using (6,7), it follows immediately that divergences have the following conditional moments, conditioning on d⁽ⁱ⁺¹⁾, the part of the divergence tree that remains after agglomerating taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾:

E[d⁽ⁱ⁾ | d⁽ⁱ⁺¹⁾] = A⁽ⁱ⁾S⁽ⁱ⁾u⁽ⁱ⁾ + B⁽ⁱ⁾d⁽ⁱ⁺¹⁾, (8)
Var[d⁽ⁱ⁾ | d⁽ⁱ⁺¹⁾] = ν A⁽ⁱ⁾S⁽ⁱ⁾ diag(u⁽ⁱ⁾) S⁽ⁱ⁾^T A⁽ⁱ⁾^T. (9)

A recursive expression for the marginal variance of d⁽ⁱ⁾ then follows from (8,9):

Var[d⁽ⁱ⁾] = ν A⁽ⁱ⁾S⁽ⁱ⁾ diag(u⁽ⁱ⁾) S⁽ⁱ⁾^T A⁽ⁱ⁾^T + B⁽ⁱ⁾ Var[d⁽ⁱ⁺¹⁾] B⁽ⁱ⁾^T. (10)

In particular, (10) implies that

Var[d_{12}⁽ⁱ⁾] = ν(u_1⁽ⁱ⁾ + u_2⁽ⁱ⁾), (11)
Cov[d_{12}⁽ⁱ⁾, d_{1k}⁽ⁱ⁾] = ν u_1⁽ⁱ⁾, for k ≥ 3, (12)
Cov[d_{12}⁽ⁱ⁾, d_{jk}⁽ⁱ⁾] = 0, for 3 ≤ j < k, (13)

where u_k⁽ⁱ⁾ denotes the kth element of u⁽ⁱ⁾. This illustrates several consequences of our assumptions: that the variation in divergence along a tree path is proportional to the time-length of that path (11); that the covariation of divergences along two overlapping paths is proportional to their shared path length (12); and that divergences along non-overlapping paths are uncorrelated (13). Such relationships were first proposed by Chakraborty (1977), and underpin BioNJ.

Although our assumptions (6,7) accommodate a great variety of models, they do not formally allow for evolving rates of evolution (Thorne et al., 1998, Kishino et al., 2001), as this would induce correlations between non-overlapping paths in the tree.

2.4.2 Non-tree-additivity

Our second set of assumptions concerns the relationship between distances and divergences. We begin by considering the observed distances d̂⁽⁰⁾ between m⁽⁰⁾ currently extant taxa. There are n⁽⁰⁾ elements in d̂⁽⁰⁾, but the corresponding (rooted) divergence tree contains only (2m⁽⁰⁾ − 2) branches, so for m⁽⁰⁾ > 4 it will generally be impossible to find a divergence tree whose branches exactly replicate the distances in d̂⁽⁰⁾. We refer to this lack of fit to tree-additivity as noise, and model it as follows:

E[d̂⁽⁰⁾ | d⁽⁰⁾] = d⁽⁰⁾, (14)
Var[d̂⁽⁰⁾ | d⁽⁰⁾] = σ²I_n, with n = n⁽⁰⁾, (15)

where I_k denotes the (k × k) identity matrix. The amount of noise is controlled by variance parameter σ². Setting σ² = 0 implies strict tree-additivity in distances. Thus we assert that observed distances are unbiased, uncorrelated, equivariant estimates of the tree-additive divergences underlying them. We emphasize that this is not an abandonment of previous work expounding the genesis of correlations between distances (Chakraborty, 1977, Hasegawa et al., 1985, Bulmer, 1991), since these relate to the marginal distribution of distances whereas (14,15) relate to their conditional distribution, conditioning on underlying divergences d⁽⁰⁾. Equations (14,15) refer only to the relationship between distances and the divergences they measure, not to the relationship between divergences and time intervals, which we have already discussed in Section 2.4.1. In particular, (15) does not imply an underlying star-tree topology. As Chakraborty (1977) and others have shown, correlations between distances arise precisely through their overlapping paths in the divergence tree. Below we show how this plays out within the framework of equations (6,7,14,15). The independent, equivariant, distance errors assumed by NJ are represented here by (15), but the branch-length dependent variances and covariances assumed by BioNJ correspond here to equation (9). There is no conflict in simultaneously assuming both sources of variation, as we do.

At each subsequent agglomerative stage i > 0, our algorithm produces a vector of inter-taxon distances d̂⁽ⁱ⁾, estimating underlying divergences d⁽ⁱ⁾. As we show by induction in Section 2.8, assumptions (6,7,14,15), together with the estimation procedures described below, result in the following conditional moments at each stage i ≥ 0, conditioning on unknown divergences d⁽ⁱ⁾:

E[d̂⁽ⁱ⁾ | d⁽ⁱ⁾] = d⁽ⁱ⁾, (16)
Var[d̂⁽ⁱ⁾ | d⁽ⁱ⁾] = σ²V⁽ⁱ⁾, (17)

where V⁽ⁱ⁾ is a symmetric, positive-definite (n⁽ⁱ⁾ × n⁽ⁱ⁾) matrix of a particular algebraic structure. Thus, at each stage, our model implies that distances are unbiased measures of divergence, with conditional variances and covariances which do not depend on the lengths of the divergences they measure. At stage i = 0, equation (15) asserts that V⁽⁰⁾ is the identity matrix, so that distances at stage 0 have a common variance and are uncorrelated. However, these simple variance-covariance properties necessarily disappear at later stages, because divergences estimated at stage i > 0 jointly depend on imprecisely estimated divergences from previous stages, thereby altering variances of divergence estimates and inducing correlations between them, ultimately affecting subsequent divergence estimates. NJ, BioNJ and Fastme take no account of such cumulative effects of estimation; Weighbor takes account of estimation effects on variances, but ignores effects on covariances; whilst the full, computationally intensive, version of MVR (Gascuel, 2000) correctly accounts for such effects in its minimum-variance criterion. Our methodology likewise takes full account of these estimation effects to preserve statistical efficiency, but without sacrificing computational efficiency.

2.4.3 Combining distortion and noise

It follows from equations (6,7,16,17), marginalizing over d⁽ⁱ⁾ while retaining the conditioning on d⁽ⁱ⁺¹⁾, that distances have the following conditional variance property:

Var[d̂⁽ⁱ⁾ | d⁽ⁱ⁺¹⁾] = σ²W⁽ⁱ⁾, (18)

where

W⁽ⁱ⁾ = V⁽ⁱ⁾ + θ A⁽ⁱ⁾S⁽ⁱ⁾ diag(u⁽ⁱ⁾) S⁽ⁱ⁾^T A⁽ⁱ⁾^T, (19)

where θ = ν/σ². Notice the differences between equations (17) and (18). They differ on the left because in (17) we condition on d⁽ⁱ⁾, while in (18) we condition only on d⁽ⁱ⁺¹⁾, a smaller condition as it involves less of the tree. They differ on the right because the marginalization has introduced an extra variance component. However, we emphasize that these are not alternative models; they are different aspects of the same model.

Equation (19) succinctly demonstrates that our model contains both noise (controlled by σ²) and distortion (controlled by ν). The first term in (19) represents the noise in d̂⁽ⁱ⁾ as a measurement of d⁽ⁱ⁾; and the second term represents the effects of a_1⁽ⁱ⁾ and a_2⁽ⁱ⁾ being independent evolutionary distortions of the same underlying time-interval u⁽ⁱ⁾.
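To see how the two variance components combine, the following sketch builds the matrix for four taxa at stage 0. It assumes that (19) has the form W = V + θ A S diag(u) S^T A^T, which is our reading of the noise and distortion terms described above, and it uses hypothetical values for u and θ. All names are ours.

```python
from itertools import combinations


def distortion_term(m, u):
    """The n x n matrix A S diag(u) S^T A^T when taxa 1 and 2 are agglomerated."""
    pairs = list(combinations(range(1, m + 1), 2))
    # Row of A S for a pair: does the pair contain taxon 1? taxon 2?
    M = [[int(1 in pair), int(2 in pair)] for pair in pairs]
    C = [[sum(r[c] * u[c] * q[c] for c in range(2)) for q in M] for r in M]
    return pairs, C


def combined_variance(V, C, theta):
    """Noise term V plus theta times the distortion term C."""
    n = len(V)
    return [[V[r][c] + theta * C[r][c] for c in range(n)] for r in range(n)]


u = [2.0, 3.0]       # hypothetical time intervals to the LCA
theta = 0.5          # hypothetical ratio nu / sigma^2
pairs, C = distortion_term(4, u)
V0 = [[float(r == c) for c in range(len(pairs))] for r in range(len(pairs))]
W = combined_variance(V0, C, theta)
for pair, row in zip(pairs, W):
    print(pair, row)
```

Pairs that share taxon 1 or taxon 2 pick up a covariance proportional to the shared edge, while the pair (3,4) keeps its stage-0 variance and stays uncorrelated with the rest.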

2.5 Regression model

From (4,5,6,16), we may write

d̂⁽ⁱ⁾ = A⁽ⁱ⁾S⁽ⁱ⁾a⁽ⁱ⁾ + B⁽ⁱ⁾d⁽ⁱ⁺¹⁾ + ε⁽ⁱ⁾, (20)
a⁽ⁱ⁾ = t_1⁽ⁱ⁺¹⁾1₂ − S⁽ⁱ⁾^T t⁽ⁱ⁾ + ε′⁽ⁱ⁾, (21)

where the error terms ε⁽ⁱ⁾ and ε′⁽ⁱ⁾ accommodate departures of d̂⁽ⁱ⁾ and a⁽ⁱ⁾ from their means. Equations (20,21) correspond to a random-effects regression model where d̂⁽ⁱ⁾ is regressed on A⁽ⁱ⁾S⁽ⁱ⁾ and B⁽ⁱ⁾ with regression parameters a⁽ⁱ⁾ and d⁽ⁱ⁺¹⁾, and the random effects a⁽ⁱ⁾ are regressed on 1₂ with parameter t_1⁽ⁱ⁺¹⁾ and a fixed offset −S⁽ⁱ⁾^T t⁽ⁱ⁾. The error terms ε⁽ⁱ⁾ and ε′⁽ⁱ⁾ have zero mean and are uncorrelated, with variance-covariance matrices

Var[ε⁽ⁱ⁾] = σ²V⁽ⁱ⁾, (22)
Var[ε′⁽ⁱ⁾] = ν diag(u⁽ⁱ⁾), (23)

from (7,17). For now, we assume that variance parameters σ² and ν are known; we will relax this assumption in Section 2.12.

2.6 Generalized Least Squares

The regression parameters in (20,21) may be estimated by GLS. (For a general introduction to GLS, see Mardia, Kent, and Bibby (1979); for its application in non-agglomerative phylogenetics, see Hasegawa et al. (1985), Bulmer (1991), Susko (2003)). The GLS objective function at stage i, based on (20–23), is the Mahalanobis distance:

where, from (20,21),

We refer to (24) as the conditional objective function at stage i. Parameter estimates t̂_1⁽ⁱ⁺¹⁾, â⁽ⁱ⁾ and d̂⁽ⁱ⁺¹⁾ are obtained by minimizing (24) with respect to t_1⁽ⁱ⁺¹⁾, a⁽ⁱ⁾ and d⁽ⁱ⁺¹⁾, for given σ², ν and t⁽ⁱ⁾. Note that t⁽ⁱ⁾ is known from the previous stage; that is, we set t⁽ⁱ⁾ = t̂⁽ⁱ⁾ if i > 0, or t⁽ⁱ⁾ = 0 otherwise.

It is instructive to consider an alternative approach to estimation, derived by substituting equation (21) into (20) to give:

d̂⁽ⁱ⁾ = t_1⁽ⁱ⁺¹⁾A⁽ⁱ⁾S⁽ⁱ⁾1₂ − A⁽ⁱ⁾S⁽ⁱ⁾S⁽ⁱ⁾^T t⁽ⁱ⁾ + B⁽ⁱ⁾d⁽ⁱ⁺¹⁾ + ε*⁽ⁱ⁾, (27)

where

ε*⁽ⁱ⁾ = ε⁽ⁱ⁾ + A⁽ⁱ⁾S⁽ⁱ⁾ε′⁽ⁱ⁾. (28)

Equation (27) represents a marginal regression model, in which d̂⁽ⁱ⁾ is regressed on A⁽ⁱ⁾S⁽ⁱ⁾1₂ and B⁽ⁱ⁾, with regression parameters t_1⁽ⁱ⁺¹⁾ and d⁽ⁱ⁺¹⁾ and a fixed offset −A⁽ⁱ⁾S⁽ⁱ⁾S⁽ⁱ⁾^T t⁽ⁱ⁾. From (22,23,25,26), the error term ε*⁽ⁱ⁾ has zero mean and variance

Var[ε*⁽ⁱ⁾] = σ²W⁽ⁱ⁾, (29)

where W⁽ⁱ⁾ is given by (19). From (27–29), we can construct a marginal GLS objective function:

and estimate parameters t_1⁽ⁱ⁺¹⁾ and d⁽ⁱ⁺¹⁾ by minimizing (30) with respect to t_1⁽ⁱ⁺¹⁾ and d⁽ⁱ⁺¹⁾. It can be shown, using the Sherman–Morrison–Woodbury matrix-inversion identity (see, for example, Golub and Van Loan (1996)), that minimizing the conditional objective function (24) over a⁽ⁱ⁾ yields the marginal objective function (30), and hence the conditional and marginal GLS approaches lead to identical estimates of t_1⁽ⁱ⁺¹⁾ and d⁽ⁱ⁺¹⁾ for given σ², ν and t⁽ⁱ⁾.
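The Sherman–Morrison–Woodbury identity itself can be illustrated on a small case. The sketch below is not the paper's derivation: it only shows, at stage 0 where the noise matrix is the identity and under the assumed form W = I + θ M diag(u) M^T with M = A S, that the identity reduces the inversion of the n × n matrix W to the inversion of a 2 × 2 matrix. Names and numerical values are hypothetical.

```python
from itertools import combinations


def woodbury_inverse_stage0(m, u, theta):
    """Return W = I + theta * M diag(u) M^T and its inverse, the latter via
    W^{-1} = I - M (diag(1 / (theta * u)) + M^T M)^{-1} M^T."""
    pairs = list(combinations(range(1, m + 1), 2))
    n = len(pairs)
    M = [[float(1 in pair), float(2 in pair)] for pair in pairs]
    # 2 x 2 inner matrix K = diag(1 / (theta * u)) + M^T M
    K = [[sum(M[r][a] * M[r][b] for r in range(n)) for b in range(2)] for a in range(2)]
    for a in range(2):
        K[a][a] += 1.0 / (theta * u[a])
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    K_inv = [[K[1][1] / det, -K[0][1] / det], [-K[1][0] / det, K[0][0] / det]]
    W = [[float(r == c) + theta * sum(M[r][a] * u[a] * M[c][a] for a in range(2))
          for c in range(n)] for r in range(n)]
    W_inv = [[float(r == c) - sum(M[r][a] * K_inv[a][b] * M[c][b]
                                  for a in range(2) for b in range(2))
              for c in range(n)] for r in range(n)]
    return W, W_inv


W, W_inv = woodbury_inverse_stage0(4, [2.0, 3.0], 0.5)
n = len(W)
product = [[sum(W[r][k] * W_inv[k][c] for k in range(n)) for c in range(n)] for r in range(n)]
max_error = max(abs(product[r][c] - float(r == c)) for r in range(n) for c in range(n))
print(max_error)   # should be at rounding-error level
```

The sketch requires θ > 0 and strictly positive u; with ν = 0 the distortion term vanishes and W is simply the noise matrix.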

2.7 Parameter estimates

Minimization of the GLS objective function w.r.t. a⁽ⁱ⁾ and d⁽ⁱ⁺¹⁾ leads to the following GLS estimates:

where

and

suppressing index ⁽ⁱ⁾ in the notation Q_tt,...,P_a.

Now d̂⁽ⁱ⁺¹⁾ in (32) estimates divergence vector d⁽ⁱ⁺¹⁾ imprecisely, so in the terminology of Section 2.1 it is a vector of distances. Indeed, d̂⁽ⁱ⁺¹⁾ is the distance vector which will be input to the next stage i+1 of our algorithm. Equation (32) admits the possibility that elements of this distance vector may be negative. Although numerical methods for non-negative GLS are available at greater computational cost, they frustrate the algebraic approach which is particularly valuable at subsequent stages of our analysis (see Sections 2.10 and 2.12). We deal with negative branch lengths in a minimally perturbing post-processing stage, described in Section 2.11.

The following variances of the parameter estimates may be derived from (24) and (30), according to standard GLS theory:

It can be shown that the matrices to be inverted in equations (31–39) are all positive-definite, provided V⁽ⁱ⁾ is positive-definite and no elements of u⁽ⁱ⁾ are negative. Noting (40), it follows by induction that V⁽ⁱ⁾ is positive-definite for all i.

2.8 Progression

For the next stage i+1 of the agglomeration, we need to know the matrix V⁽ⁱ⁺¹⁾ assumed in (17). This is provided by equation (38), giving:

Thus V⁽ⁱ⁺¹⁾ depends on V⁽ⁱ⁾ through equations (19,36,40). These equations suggest that, from one stage to the next, the variance-covariance structure of distances becomes increasingly intricate, as a result of regression parameter estimates at each stage being recursively built upon correlated and imprecise regression parameter estimates from the previous stage. This happens even though in (15) we assumed conditionally independent distances initially. Despite this loss of independence, we show below that V⁽ⁱ⁺¹⁾ remains within a relatively simple class of structured covariance matrices. This is a central result of this paper.

We assumed above that equations (16,17) hold for given i, and consequently we have from the unbiasedness of GLS that E[d̂⁽ⁱ⁺¹⁾ | d⁽ⁱ⁺¹⁾] = d⁽ⁱ⁺¹⁾, and from (38,40) that Var[d̂⁽ⁱ⁺¹⁾ | d⁽ⁱ⁺¹⁾] = σ²V⁽ⁱ⁺¹⁾. Therefore, (16,17) hold with i replaced by i+1. From (14,15), we see that (16,17) hold for i = 0. Hence, by induction, (16,17) hold for all i ≥ 0.

2.9 Dependence of variance on mean

Estimates t̂_1⁽ⁱ⁺¹⁾ and d̂⁽ⁱ⁺¹⁾ in (31,32) assume W⁽ⁱ⁾ is known. However, equations (6,7) specify that the variance of a⁽ⁱ⁾ depends on its mean, and hence W⁽ⁱ⁾ depends on the unknown regression parameter t_1⁽ⁱ⁺¹⁾ through (5). To deal with this, we propose an iterative solution, assuming variance parameters σ² and ν are known. Estimation of σ² and ν is considered in Section 2.12. At each stage i, beginning with an initial guess for t_1⁽ⁱ⁺¹⁾, we iterate the calculations for u⁽ⁱ⁾, W⁽ⁱ⁾, y⁽ⁱ⁾, t̂_1⁽ⁱ⁺¹⁾ and d̂⁽ⁱ⁺¹⁾ in (5,19,31–36) until convergence.

2.10 Computation

We have seen that we may proceed from one stage i to the next i+1 by computing the distance vector d̂⁽ⁱ⁺¹⁾ from (32) and its variance structure V⁽ⁱ⁺¹⁾ from (40). However, both (32) and (40) appear computationally formidable, requiring inversions of matrices of dimension n⁽ⁱ⁺¹⁾ × n⁽ⁱ⁺¹⁾. Such calculations, if tackled directly by dense matrix inversion, would have time-complexity of order (n⁽ⁱ⁺¹⁾)³. Fortunately, we may exploit the algebraic structure of these matrices to perform each inversion in only O(m⁽ⁱ⁺¹⁾) operations.

As we describe below, these more efficient calculations involve, at stage i, two nonnegative vectors w⁽ⁱ⁾ and s⁽ⁱ⁾, each of length m⁽ⁱ⁾, defined recursively as follows:

where Inline graphic is a scalar depending only on and m⁽⁰⁾ (see equation (B5) in Appendix B). At the initial stage, we set w⁽⁰⁾ = 1, the vector of ones, and s⁽⁰⁾ = 0, the zero vector. Recursion (41) implies that taxon weight w_j⁽ⁱ⁾ is the number of leaves of the tree (i.e. extant taxa) descended from taxon j⁽ⁱ⁾. In particular, it follows that

∑_j w_j⁽ⁱ⁾ = m⁽⁰⁾, where the sum runs over the taxa j = 1, ..., m⁽ⁱ⁾ at stage i.

It is a remarkable fact, and a central result of this paper, that the complicated recursion for V⁽ⁱ⁾ through (19,36,40) reduces at each stage i to the following simple form (see Appendix B):

where Inline graphic diagonalizes s⁽ⁱ⁾ and diagonalizes an n⁽ⁱ⁾-vector p⁽ⁱ⁾ whose elements are products of pairs of taxon weights in standard order (2), thus:

Interpreting (43), the variance in d̂⁽ⁱ⁾ can be thought of as arising from two uncorrelated sources of variation. The first source is a collection of n⁽ⁱ⁾ uncorrelated random variables, one for each taxon pair, with variances inversely proportional to taxon weights, reflecting the pooling of distance information with each agglomeration. Thus we refer to p⁽ⁱ⁾ as a vector of pooling coefficients. The second source is a much smaller set of m⁽ⁱ⁾ uncorrelated random variables, one for each taxon. The variance of the jth random variable in this second set is proportional to s_j⁽ⁱ⁾ and arises through agglomerations leading to taxon j⁽ⁱ⁾. This is not purely a consequence of distortion; positive s_j⁽ⁱ⁾ arise even when ν = 0. This source impacts on V⁽ⁱ⁾ only where taxon-pairs share a taxon. Thus we refer to s⁽ⁱ⁾ as a vector of sharing coefficients. A particular consequence of (43) is that inter-taxon distances are uncorrelated where they have no taxon in common. Thus, for example, d̂₁₂⁽ⁱ⁾ and d̂₃₄⁽ⁱ⁾ are uncorrelated whereas d̂₁₂⁽ⁱ⁾ and d̂₁₃⁽ⁱ⁾ are not.
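The bookkeeping behind the taxon weights and pooling coefficients can be illustrated with a short Python sketch. It rests on assumptions not stated explicitly here: that an agglomeration gives the new taxon the sum of the two merged weights (consistent with weights counting descended leaves), that the new taxon is listed first, and that the "standard order" of taxon pairs is lexicographic. The function names are illustrative only and are not taken from StatTree.

```python
from itertools import combinations


def merge_weights(weights, j, k):
    """Agglomerate taxa j and k: the new taxon gets the sum of their weights.

    Assumption: the new taxon is placed first, followed by the untouched taxa.
    """
    merged = weights[j] + weights[k]
    rest = [w for idx, w in enumerate(weights) if idx not in (j, k)]
    return [merged] + rest


def pooling_coefficients(weights):
    """Products of taxon weights for every taxon pair (lexicographic pair order)."""
    return [weights[a] * weights[b]
            for a, b in combinations(range(len(weights)), 2)]


# Four extant taxa, each of weight 1; join the first two.
w0 = [1, 1, 1, 1]
w1 = merge_weights(w0, 0, 1)
p1 = pooling_coefficients(w1)
```

After the join, the weights still sum to the number of extant taxa, and there is one pooling coefficient per remaining taxon pair.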

Equation (43) leads to substantial computational simplifications, in particular in computing d̂⁽ⁱ⁺¹⁾ in (32). We can show that distances at stage i which do not involve taxa 1⁽ⁱ⁾ or 2⁽ⁱ⁾ (i.e. most of d̂⁽ⁱ⁾) are passed unaltered to the next stage. That is, the stage-(i+1) estimate of the distance between taxa j⁽ⁱ⁾ and k⁽ⁱ⁾ equals d̂_jk⁽ⁱ⁾ for 3 ≤ j < k ≤ m⁽ⁱ⁾. This intuitive result is critical in reducing the time-complexity of our algorithm.

2.11 Building the tree

We have seen above how we may proceed from each stage i to the next, i + 1, by agglomerating taxa 1⁽ⁱ⁾ and 2⁽ⁱ⁾. Of course, by appropriate taxon relabeling, we can agglomerate any two taxa j⁽ⁱ⁾ and k⁽ⁱ⁾ using this methodology. We now consider which pair of taxa to agglomerate.

We begin at stage i = 0 with observed distances d̂⁽⁰⁾, weights w⁽⁰⁾ = 1 and sharing-coefficients s⁽⁰⁾ = 0. At each generic stage i ≥ 0, we agglomerate that pair of taxa which would minimize the estimated tree-length, according to the neighbor-joining principle of Saitou and Nei (1987). That is, we choose (j⁽ⁱ⁾, k⁽ⁱ⁾) to minimise:

(m⁽ⁱ⁾ − 2) d̂_jk⁽ⁱ⁾ − ∑_l d̂_jl⁽ⁱ⁾ − ∑_l d̂_kl⁽ⁱ⁾,    (45)

which is the simplified NJ criterion of Studier and Keppler (1988). We then add the LCA of taxa j⁽ⁱ⁾ and k⁽ⁱ⁾ to our tree reconstruction, with edges of length û⁽ⁱ⁾ in the time tree and â⁽ⁱ⁾ in the divergence tree computed from the agglomeration of j⁽ⁱ⁾ and k⁽ⁱ⁾, using the formulae in the foregoing sections (with appropriate taxon relabeling). We then proceed to the next stage, armed only with d̂⁽ⁱ⁺¹⁾, w⁽ⁱ⁺¹⁾ and s⁽ⁱ⁺¹⁾.
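The pair-selection step can be sketched in Python as follows. The sketch assumes the usual Studier and Keppler form of the criterion, (m − 2) d_jk − ∑_l d_jl − ∑_l d_kl, applied to a full symmetric distance matrix with zero diagonal; the function name `select_join` is illustrative and not part of StatTree. Ties are broken in favour of the first pair encountered.

```python
from itertools import combinations


def select_join(dist):
    """Return ((j, k), score) minimising (m - 2) * d[j][k] - R[j] - R[k].

    dist is a symmetric m x m list of lists with zeros on the diagonal,
    and R[j] is the sum of distances from taxon j to all other taxa.
    """
    m = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_score = None, float("inf")
    for j, k in combinations(range(m), 2):
        score = (m - 2) * dist[j][k] - row_sums[j] - row_sums[k]
        if score < best_score:  # strict inequality: first minimiser wins ties
            best_pair, best_score = (j, k), score
    return best_pair, best_score


# Additive distances on a four-taxon tree with cherries (0, 1) and (2, 3).
example = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
pair, score = select_join(example)
```

In this example the two true cherries tie for the minimum, as expected for a four-taxon tree, and the first of them is returned.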

Thus, from leaves to root, the complete time tree is built entirely from edges of lengths û⁽ⁱ⁾, and the complete divergence tree is built entirely from edges of lengths â⁽ⁱ⁾. After completing both trees, any negative branch length u < 0 in the time tree is removed by setting it to zero and simultaneously adding −u to the sister branch and +u to the parent branch (if any). These local adjustments, which are applied recursively from tips to root, preserve the ultrametric property of the time tree. The same form of adjustment is applied to the divergence tree.
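The local adjustment for a single negative branch can be checked with a small Python sketch. It handles one branch only, given the lengths of its sister and parent branches; the recursion from tips to root and the tree data structure are omitted, and the function name is illustrative rather than taken from StatTree.

```python
def fix_negative_branch(u, sister, parent=None):
    """Remove a negative branch length u by shifting it to the sister and parent.

    Returns the adjusted (u, sister, parent) lengths. Root-to-tip path lengths
    through both sibling branches are unchanged whenever a parent branch exists.
    """
    if u >= 0:
        return u, sister, parent
    new_parent = None if parent is None else parent + u  # parent shortened by |u|
    return 0.0, sister - u, new_parent                   # sister lengthened by |u|


before = {"u": -0.2, "sister": 0.5, "parent": 1.0}
new_u, new_sister, new_parent = fix_negative_branch(**before)
```

Note that the parent branch may itself become negative after the adjustment, which is why the procedure is applied recursively from tips to root.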

2.12 Estimating variance parameters

In general, variance parameters σ² and θ = ν/σ² will be unknown. Here we propose an iterative scheme for their estimation, starting with initial guesses σ̂²₀, θ̂₀. Within each iteration h = 1,2,..., using current estimates σ̂²_h–1, θ̂_h–1, we proceed through all agglomerative stages i = 0,1,..., m⁽⁰⁾ − 2 to recompute the entire time and divergence trees and their associated quantities. These results are then substituted into estimating equations, derived in Appendix A, which may be solved iteratively for σ² and θ, giving updates σ̂²_h, θ̂_h. This process of recomputing the tree and updating σ² and θ is continued until convergence.

2.13 Putting it together

Putting together the methods of the foregoing sections, the complete algorithm is as follows.

1. Initialize σ̂²₀, θ̂₀, and set h = 0.
2. Set w⁽⁰⁾ = 1 and s⁽⁰⁾ = 0.
3. For each stage i = 0,1,...,m⁽⁰⁾ − 2,
   (a) for each possible join (j⁽ⁱ⁾,k⁽ⁱ⁾), calculate β̂^(i,j,k) and Ŵ^(i,j,k) via the iterative method described in Section 2.9;
   (b) choose the join (j⁽ⁱ⁾,k⁽ⁱ⁾) with smallest (45), as described in Section 2.11, and calculate the edge lengths û⁽ⁱ⁾ and â⁽ⁱ⁾, d̂⁽ⁱ⁺¹⁾, w⁽ⁱ⁺¹⁾ and s⁽ⁱ⁺¹⁾ accordingly.
4. Calculate σ̂²_h, θ̂_h, using the estimating equations (A7,A8) derived in Appendix A.
5. If σ̂²_h ≉ σ̂²_h–1 or θ̂_h ≉ θ̂_h–1, increment h and return to Step 2.
6. Any negative branch lengths in the time or divergence trees are dealt with by the adjustment procedure of Section 2.11.
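The outer loop of Steps 1 to 5 has the shape of a fixed-point iteration, which can be sketched generically in Python. This is not the StatTree implementation: `build_trees` and `update_parameters` are hypothetical callables standing in for the agglomeration of Steps 2 and 3 and for the estimating equations of Step 4, and the absolute tolerance `tol` is an assumed stand-in for the approximate-equality test of Step 5.

```python
def iterate_variance_parameters(build_trees, update_parameters,
                                sigma2, theta, tol=1e-8, max_iter=100):
    """Alternate tree building and parameter updates until both parameters settle.

    build_trees(sigma2, theta) -> any object summarising the fitted trees
    update_parameters(fit)     -> (new_sigma2, new_theta)
    Returns (sigma2, theta, number_of_iterations).
    """
    for h in range(1, max_iter + 1):
        fit = build_trees(sigma2, theta)                 # Steps 2-3 (hypothetical)
        new_sigma2, new_theta = update_parameters(fit)   # Step 4 (hypothetical)
        converged = (abs(new_sigma2 - sigma2) <= tol
                     and abs(new_theta - theta) <= tol)  # Step 5
        sigma2, theta = new_sigma2, new_theta
        if converged:
            return sigma2, theta, h
    raise RuntimeError("variance parameters did not converge")


# Toy stand-ins with a known fixed point at (2.0, 4.0), for illustration only.
toy_build = lambda s2, th: (s2, th)
toy_update = lambda fit: (0.5 * fit[0] + 1.0, 0.5 * fit[1] + 2.0)
s2_hat, theta_hat, n_iter = iterate_variance_parameters(toy_build, toy_update, 1.0, 1.0)
```

The toy callables only demonstrate the control flow; in the actual method each pass rebuilds both trees from scratch with the current parameter estimates.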

2.14 Consistency

Both NJ and BioNJ are consistent in the sense that the estimated topology will be correct if distances converge to divergences (which would happen, for example, if the length of the multiple sequence alignment from which distances are calculated tends to infinity). The same property holds for our methodology, the proof of which we now sketch.

Suppose distances are arbitrarily close to divergences: d̂⁽⁰⁾ ≈ d⁽⁰⁾. If we set θ̂₀ arbitrarily large at Step 1 of the algorithm in Section 2.13, then the second term in the objective function (24) will be negligible. It can then be shown by induction that at each stage i, we minimise (24) by setting the unknown parameters a⁽ⁱ⁾, d⁽ⁱ⁺¹⁾ equal to their true values under the true topology. Hence the true topology will be recovered at Step 3 of the algorithm. Then at Step 4, we will again obtain arbitrarily large θ̂₀, so no changes will occur in the parameter or topology estimates in the next iteration of the algorithm, and Step 5 will confirm convergence.

2.15 Sampling full uncertainty

Although equations (38,39) allow us to evaluate uncertainty in divergence estimates, they do not allow us to assess uncertainty in the estimated tree topology. To better understand the full uncertainty in the phylogenetic reconstruction, we propose the following simple parametric bootstrap procedure.

First, a vector of divergences d⁽⁰⁾ between the m⁽⁰⁾ extant taxa is read off from the complete estimated divergence tree, as constructed in Section 2.11. Second, a bootstrapped vector of distances d̂^(0,boot) is generated by sampling independently each element d̂_jk^(0,boot) from a distribution with mean d_jk⁽⁰⁾ and variance σ̂², for consistency with (16,17). A natural choice for this would be a Gamma distribution, which is preferable to a Gaussian, as it produces only positive distances. However, alternative distributions could be explored to evaluate their impact. Third, using d̂^(0,boot) instead of d̂⁽⁰⁾, a bootstrapped version of the time-tree, divergence-tree, σ̂² and ν̂, is estimated using the methods described above. Many bootstrapped versions may be generated by repeating steps two and three. The frequency of different tree topologies, histograms of σ̂² and ν̂, and any other statistical summaries of interest, may then be computed from the collection of bootstrapped versions.
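Sampling from a Gamma distribution with a prescribed mean and variance only requires converting the two moments to shape and scale parameters: shape = mean²/variance and scale = variance/mean. The following minimal Python sketch of the second bootstrap step uses the standard-library `random.gammavariate`; the function names are illustrative and are not part of StatTree, and the divergences are assumed to be strictly positive.

```python
import random


def gamma_with_moments(mean, variance, rng=random):
    """Draw one Gamma variate with the given mean and variance (both > 0)."""
    shape = mean * mean / variance
    scale = variance / mean
    return rng.gammavariate(shape, scale)


def bootstrap_distances(divergences, sigma2, rng=random):
    """Perturb each divergence independently: mean d_jk, variance sigma2."""
    return [gamma_with_moments(d, sigma2, rng) for d in divergences]


# Example: three inter-taxon divergences and a noise variance of 0.01.
rng = random.Random(0)
boot = bootstrap_distances([0.4, 0.7, 1.0], 0.01, rng)
```

Passing an explicit `random.Random` instance makes a bootstrap run reproducible; every draw is positive by construction.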

3 Simulations

To demonstrate key features of our approach, we present results obtained from StatTree on simulated data. Distance matrices were randomly generated according to two different schemes. In the first, distance matrices were obtained by distorting trees with fixed topology, and then adding noise. This will be referred to as ‘direct simulation’ below. In the second scheme, referred to as ‘sequence-based simulation’, sequence alignments were randomly generated from some fixed trees and distance matrices were then estimated from the alignments. For both schemes, trees were constructed from the distance matrices and compared with the original underlying tree.

3.1 Direct simulation

1. A base ultrametric time tree was selected together with a choice of parameters σ² and ν. Two different topologies for the time tree, each with m⁽⁰⁾ = 64 leaf taxa, were used. These topologies are illustrated in Figure 3, and represent extremes of the kind of branching behaviour that can arise in rooted bifurcating trees. We will refer to these as the balanced and unbalanced topologies.
2. The selected time tree was then distorted, as follows, to obtain a divergence tree with inter-leaf divergences d⁽⁰⁾. The length of each edge on the divergence tree was assigned a value independently sampled from a Gamma distribution with mean equal to the edge length u on the time tree and variance νu, ensuring consistency with equations (6) and (7).
3. A distance matrix was then obtained from the divergence tree and perturbed according to the process described in Section 2.15: each inter-leaf distance was assigned a value independently sampled from a Gamma distribution with mean equal to the corresponding inter-leaf divergence and variance σ². This process ensures consistency with equations (16) and (17).

Figure 3: Two topologies for the underlying time trees used in the simulations. The tree on the left is referred to as having a *balanced* topology, while the tree on the right is referred to as *unbalanced*. These represent two extremes of the pattern of branching that can be exhibited in rooted bifurcating trees. All internal branches in the trees are taken to have equal length, and the trees were scaled so that the maximum distance between two taxa was 1.0. For both topologies, trees with 16 and 64 taxa were used, as described in the text.

Various values of σ² and ν were used to simulate distance matrices, and these values are specified as follows. The standard deviation σ is best expressed as a proportion of the mean inter-taxon distance in the time tree. For example, if the mean inter-taxon distance in the time tree was δ we might take σ = 0.2 × δ and refer to this as a value of 20% for σ. The two basic time trees shown in Figure 3 have quite different mean inter-taxon distances, and so by specifying σ as a proportion of the mean we ensure that distance matrices are obtained from the different topologies in a comparable way. Similarly, the parameter ν needs to be specified in a way that is consistent between different trees. If time trees with a fixed topology are scaled to have different overall lengths and then distorted to obtain divergence trees, it is necessary to vary ν with the scale of the tree in order to produce a collection of divergence trees with a similar visual appearance and level of distortion. This is achieved by fixing a proportion p, and taking

ν = p² × δ,

where δ is the mean inter-taxon distance in the time tree. This ensures that when a branch of length u on a time tree is distorted, the ratio of the standard deviation to the branch length is given by

√(νu) / u = p × √(δ/u).

This is a dimensionless quantity that is proportional to p. Fixing ν in this way for different tree topologies, tree scales, and numbers of taxa ensures that the results are comparable. The proportion p is usually expressed as a percentage in what follows, and for brevity we will write ‘ν = 20%’ to mean ν = 0.2² × δ, for example.
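As a worked check of this convention, the short Python sketch below converts a percentage p and a mean inter-taxon distance δ into ν, using the stated meaning of ‘ν = 20%’ (ν = 0.2² × δ), and evaluates the relative standard deviation of a distorted branch of length u, using the distortion variance νu from Step 2. The function names and the numerical values of δ and u are illustrative assumptions.

```python
import math


def distortion_variance(p, delta):
    """nu = p^2 * delta, with p a proportion and delta the mean inter-taxon distance."""
    return p * p * delta


def relative_sd(p, delta, u):
    """Standard deviation of a distorted branch divided by its length u."""
    nu = distortion_variance(p, delta)
    return math.sqrt(nu * u) / u


delta = 0.5                                # assumed mean inter-taxon distance
nu_20 = distortion_variance(0.2, delta)    # 'nu = 20%'
ratio_at_delta = relative_sd(0.2, delta, u=delta)
ratio_short = relative_sd(0.2, delta, u=delta / 4)
```

A branch whose length equals δ has relative standard deviation exactly p, while a branch one quarter as long has twice that, reflecting the √(δ/u) factor.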

To assess the performance of our algorithm, distance matrices were simulated using the scheme in Steps 1–3 above, and phylogenies were estimated according to the steps specified in Section 2.13, as implemented in our software StatTree. Specifically, values of 5%, 10% and 20% for σ and ν were used in the simulations. For each of the 9 corresponding combinations of parameters, 100 distance matrices were generated using the balanced topology on 64 taxa, and another 100 distance matrices were generated using the unbalanced topology on 64 taxa. Tree construction with estimation of σ² and ν was then carried out for each of the simulated distance matrices. Means and standard deviations of the estimated parameters across each set of 100 simulations were calculated to assess the accuracy of estimation. In addition, each estimated tree was compared to the true underlying tree via two scores that measure the topological accuracy of the estimated tree. These scores are described in the next paragraph. For each set of parameter values the average scores were calculated for the set of 100 simulations. Trees were also constructed with the BioNJ algorithm and compared to the true tree in the same way. The results are given in Tables 1 and 2.

Two topological scores, the Robinson–Foulds and quartet distances, were used to measure accuracy of estimated trees. Both are reported here as proportions of shared features, so higher values indicate closer agreement. The Robinson–Foulds distance (Robinson and Foulds, 1981) is the proportion of bi-partitions that two trees share (each branch in each tree, upon being cut, induces one such bi-partition of the leaves). The Robinson–Foulds distance is very sensitive, in that for certain topologies a single incorrectly positioned taxon can result in a score of zero. For this reason a second score, the quartet distance (Brodal, Fagerberg, and Pedersen, 2004), was also used. The quartet distance between two trees is defined as the proportion of subsets of 4 taxa for which the trees induce the same topology.
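The sensitivity of the bi-partition score can be seen in a small Python sketch. Trees are represented here as nested tuples of leaf labels, only non-trivial bi-partitions are counted, and the score is taken as the shared fraction of the first tree's bi-partitions; these representation choices and function names are assumptions for illustration, not the scoring code used for the tables.

```python
def bipartitions(tree):
    """Non-trivial leaf bi-partitions of a nested-tuple tree.

    Each bi-partition is stored as the side that does not contain the
    smallest leaf label, so the two trees can be compared directly.
    """
    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            splits.add(side)
        return below

    visit(tree)
    return splits


def shared_bipartition_score(true_tree, estimated_tree):
    """Fraction of the true tree's non-trivial bi-partitions found in the estimate."""
    a, b = bipartitions(true_tree), bipartitions(estimated_tree)
    return len(a & b) / len(a) if a else 1.0


# An unbalanced six-taxon tree, and the same tree with taxon A moved to the other end.
true_tree = ((((("A", "B"), "C"), "D"), "E"), "F")
moved_a = ((((("B", "C"), "D"), "E"), "F"), "A")
score_moved = shared_bipartition_score(true_tree, moved_a)
score_same = shared_bipartition_score(true_tree, true_tree)
```

Moving the single taxon A from one end of the unbalanced tree to the other leaves no bi-partition in common, even though the relationships among the remaining five taxa are untouched.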

An immediate observation from Tables 1 and 2 is that σ is estimated with a relatively high degree of accuracy while the estimates for ν show considerable bias and variability. The distortion parameter ν is generally over-estimated, and the bias is greater for the unbalanced topology. This bias seems not to be directly related to the topological accuracy of tree estimates: descending any of the columns in the tables generally shows an approximately constant level of bias while the topological accuracy decreases. As σ and ν increase, the topological accuracy of the estimated trees decreases. This effect is present for both topologies, though the unbalanced topology is generally harder to estimate.

The topological scores T̄_RF,T̄_Q for trees built with our method are compared, in Tables 1 and 2, with the scores for trees built using BioNJ, based on the same distance matrices. For the unbalanced topology, the Robinson–Foulds score T̄_RF can be very low and the quartet score T̄_Q is more useful for making comparisons. Our method produces trees that are at least as accurate as those of BioNJ in nearly all cases, with quartet scores that are never lower, and the contrast between the two methods increases with σ and ν. Even though BioNJ makes no assumptions about the rate of evolution in each branch of the tree, the topological accuracy of its estimates decreases as the level of distortion in the underlying tree increases. Weighbor was not used for these simulations since it requires a parameter that corresponds to the length of the sequence alignment from which the distance matrix was estimated. No such parameter was available for the direct simulations, but Weighbor was incorporated into the sequence-based simulations described below.

	Method	Statistic	ν = 5%	ν = 10%	ν = 20%
σ = 5%	StatTree	σ̄ (s.d.)	5.0 (0.1)	5.0 (0.1)	5.0 (0.1)
		ν̄ (s.d.)	11.1 (2.7)	14.2 (2.2)	22.7 (2.4)
		T̄_RF,T̄_Q	99.6, 100	98.9, 100	94.8, 99.6
	BioNJ	T̄_RF,T̄_Q	99.6, 100	99.1, 100	94.7, 99.5
σ = 10%	StatTree	σ̄ (s.d.)	10.0 (0.2)	10.0 (0.2)	10.0 (0.2)
		ν̄ (s.d.)	12.3 (3.4)	15.5 (3.2)	23.6 (3.0)
		T̄_RF,T̄_Q	92.1, 99.8	90.7, 99.7	86.0, 99.1
	BioNJ	T̄_RF,T̄_Q	92.1, 99.8	90.8, 99.6	84.1, 98.2
σ = 20%	StatTree	σ̄ (s.d.)	19.9 (0.4)	19.9 (0.4)	20.1 (0.4)
		ν̄ (s.d.)	12.7 (2.8)	15.1 (3.3)	24.2 (4.3)
		T̄_RF,T̄_Q	76.0, 98.4	72.0, 97.9	65.2, 95.5
	BioNJ	T̄_RF,T̄_Q	73.4, 97.4	68.7, 95.5	61.4, 88.2

Table 1: Results from StatTree of direct simulations for an underlying tree with the balanced topology on 64 taxa, for several true values of parameters σ² and ν. The means σ̄ and ν̄ of the parameter estimates (in units of %) over 100 simulations are given, together with their standard deviations in parentheses. Also given for each case is the average Robinson–Foulds percentage score, T̄_RF, and the quartet percentage score T̄_Q. For comparison, the average scores are also given for trees estimated from the same distance matrices by BioNJ. In each case, the method with the better score is indicated in bold.

	Method	Statistic	ν = 5%	ν = 10%	ν = 20%
σ = 5%	StatTree	σ̄ (s.d.)	5.1 (0.1)	5.1 (0.1)	5.0 (0.3)
		ν̄ (s.d.)	26.1 (14.3)	29.1 (11.2)	32.1 (9.8)
		T̄_RF,T̄_Q	38.1, 89.7	33.2, 87.1	23.2, 79.0
	BioNJ	T̄_RF,T̄_Q	36.3, 89.2	31.8, 86.8	22.0, 78.1
σ = 10%	StatTree	σ̄ (s.d.)	10.3 (0.3)	10.2 (0.2)	10.2 (0.4)
		ν̄ (s.d.)	22.9 (10.9)	24.2 (9.5)	28.1 (9.1)
		T̄_RF,T̄_Q	12.7, 77.6	11.9, 75.1	11.2, 68.0
	BioNJ	T̄_RF,T̄_Q	12.1, 77.3	11.5, 74.9	10.5, 66.7
σ = 20%	StatTree	σ̄ (s.d.)	20.4 (0.5)	20.4 (0.5)	20.4 (0.8)
		ν̄ (s.d.)	22.5 (5.5)	26.2 (5.7)	30.5 (7.4)
		T̄_RF,T̄_Q	2.0, 57.5	2.0, 55.1	2.9, 48.9
	BioNJ	T̄_RF,T̄_Q	1.7, 57.2	2.2, 53.4	3.3, 45.2

Table 2: Results from StatTree of direct simulations for an underlying tree with the unbalanced topology on 64 taxa, for several true values of parameters σ² and ν. The means σ̄ and ν̄ of the parameter estimates (in units of %) over 100 simulations are given, together with their standard deviations in parentheses. Also given for each case is the average Robinson–Foulds percentage score, T̄_RF, and the quartet percentage score T̄_Q. For comparison, the average scores are also given for trees estimated from the same distance matrices by BioNJ. In each case, the method with the better score is indicated in bold.

3.2 Sequence-based simulations

Our algorithm was also evaluated using distance matrices estimated from sequence alignments, using a scheme very similar to that used to evaluate BioNJ by Gascuel (1997). Distance matrices were randomly generated in the following way.

1. A base tree of divergences was selected. Two different trees were used: one with a balanced topology and the other unbalanced, as illustrated in Figure 4 below. These were obtained by randomly distorting ultrametric balanced and unbalanced trees on 16 taxa with ν = 10%.
2. DNA sequence alignments were generated from the base tree using the program SeqGen (Rambaut and Grassly, 1997) according to the scheme specified by Gascuel (1997). Kimura’s two-parameter model of substitution (Kimura, 1980) was used with a transition/transversion ratio of 2.0. Standard parameter values were used in SeqGen to specify rate variation between sites. Sequence alignments with 300 and 600 sites were simulated. In addition, three different scales for the underlying base tree were used: low, medium and high substitution rates with (respectively) a maximum of 0.1, 0.5 and 1.0 substitutions per site.
3. Distance matrices were estimated from each alignment using the program DnaDist (part of Phylip), using the same parameter values as SeqGen, although the transition/transversion ratio was estimated each time.

One hundred distance matrices were generated in this way for each of the two base trees, three scale factors, and two sequence lengths, totaling 12 different simulation conditions. Trees were constructed for each distance matrix using our algorithm (as specified in Section 2.13), BioNJ, and Weighbor. The Robinson–Foulds score and quartet score were calculated to evaluate the topological accuracy of the estimated trees. The accuracy of estimated branch lengths was measured using the following statistic:

where d̂_jk is the estimated divergence between taxa j,k and d_jk is the true divergence. Table 3 shows the results.

Table 3 shows that all three algorithms are broadly comparable and generally achieve a high degree of topological accuracy. The best topological scores and branch-length scores occur for the longer sequences and the balanced trees, where BioNJ performs slightly better than StatTree. The more difficult scenarios, where the sequences are short and the rate of substitution is low or the topology unbalanced, tend to favor StatTree. Taking these results together with those of Section 3.1, it appears that StatTree offers some advantages in harder inference problems where the signal-to-noise ratio in the distance matrix is low.

4 Application to HIV evolution

Figure 4: Balanced and unbalanced trees used for sequence-based simulations.

Rapidly evolving viral species, such as influenza and HIV, have become important laboratory tools for studying evolution in real time, providing a test-bed for evolutionary theories and inference methodologies. The high genetic diversity of HIV results from about 10⁵ mutations per day per infected individual. Longitudinal clinical studies can provide information on within-patient HIV evolution under varying treatment regimes, with potential to reveal genetic bottlenecks and resurgence of rare wild-type variants post-therapy. Here we reanalyze data from a recent study by Salazar-Gonzalez, Salazar, Keele, Learn, Giorgi, Li, et al. (2009), in which a total of 121 full-length HIV-1 genome sequences were obtained from samples taken longitudinally in 12 infected individuals. The authors reported a rooted, multi-furcating, maximum-likelihood phylogeny of these sequences, based on a bespoke mathematical model of HIV sequence evolution (Keele, Giorgi, Salazar-Gonzalez, Decker, Pham, Salazar, et al., 2008).

It is of interest to determine not only the tree describing the evolution of these sampled HIV genomes, but also the physical times of their diversification. For these tasks our methodology, producing both a divergence tree and its underlying time tree, is directly relevant. Our reanalysis of these data was based on a distance matrix for all 121 genomes, constructed using the program DnaDist of Phylip assuming Kimura’s two-parameter model of base substitution (Kimura, 1980). The results from StatTree are shown in Figure 5. Similar results (not shown) were obtained with alternative substitution models. Within-subject branch lengths were all, with one exception, very short, and so for concision are not shown. The one exception is the subject identified in the figure as ZM.247, for whom two viral clades, ZM.247a and ZM.247b, are shown.

As in the original study (Salazar-Gonzalez et al., 2009), the divergence tree in Figure 5b demonstrates clear between-subject differences in clonal lineages, the generally short internal branches suggesting only small differences between their common ancestors. The major exception concerns the three Zambian subjects, whose viral strains are quite distinct from those of the USA subjects. The clear separation between the two viral subclades of subject ZM.247 is interpreted by Salazar-Gonzalez et al. (2009) as evidence of simultaneous infection from an individual carrying both strains. The major difference between our analysis and that of Salazar-Gonzalez et al. (2009) is that in the latter the tree is rooted between the Zambian patients and the rest.

		Length 300		Length 600
Scale	Method	Balanced	Unbalanced	Balanced	Unbalanced
		T̄_RF,T̄_Q,B̄	T̄_RF,T̄_Q,B̄	T̄_RF,T̄_Q,B̄	T̄_RF,T̄_Q,B̄
Low	StatTree	94.8, 94.1, 76.3	62.8, 77.3, 61.8	98.8, 98.5, 77.2	78.1, 88.0, 61.5
	BioNJ	95.3, 93.7, 77.8	62.8, 76.9, 64.3	99.2, 98.6, 77.7	78.8, 88.5, 64.2
	Weighbor	95.2, 93.7, 77.8	59.3, 74.0, 64.3	98.8, 97.9, 77.7	76.9, 86.8, 64.2
Medium	StatTree	99.2, 99.2, 21.0	73.2, 84.2, 14.9	100, 100, 21.5	85.3, 92.0, 14.5
	BioNJ	99.5, 99.3, 22.3	73.3, 84.7, 18.1	100, 100, 22.6	85.7, 92.1, 18.5
	Weighbor	99.2, 98.9, 22.4	70.7, 82.4, 18.1	100, 100, 22.6	85.2, 91.8, 18.5
High	StatTree	98.6, 97.8, 2.8	72.3, 83.7, 22.9	100, 99.9, 1.8	82.8, 90.8, 15.5
	BioNJ	98.8, 97.9, 2.1	71.8, 83.1, 2.2	100, 99.9, 1.2	83.6, 91.4, 1.1
	Weighbor	98.6, 97.5, 2.1	70.2, 81.7, 2.3	100, 99.9, 1.2	82.5, 90.7, 1.1

Table 3: Results of sequence-based simulations generated under balanced and unbalanced tree topologies. For each tree, 100 distance matrices were generated for each combination of two different sequence lengths (300 and 600 bp) and three different scales (corresponding to maxima of 0.1, 0.5 and 1.0 substitutions per site). The table reports the average of the Robinson–Foulds percentage score T̄_RF, quartet percentage score T̄_Q, and branch length score B̄, for each of StatTree, BioNJ and Weighbor. In each case, the method with the better score is indicated in bold.

The corresponding time tree in Figure 5a estimates the times of diversification of these viral lineages. This time tree suggests relatively long periods of shared HIV ancestry between some of the USA patients: an insight not available from the divergence tree. Moreover, the relatively short estimated time since divergence of the two subclades of subject ZM.247 might call into question the authors’ explanation for their existence. These observations underline the potential utility of estimating time-trees.

Figure 5: (a) Time and (b) divergence trees of 121 HIV-1 genomes sampled from 12 subjects studied by Salazar-Gonzalez et al. (2009), estimated by StatTree. Within-subject viral clades are shown as tips. Tip-label prefixes denote subjects’ geographical location (AL = Alabama; NC = North Carolina; NY = New York; ZM = Zambia).

5 Discussion

We have developed a variance-components model for phylogenetic reconstruction from distance matrices, incorporating several novel features. Two components of variance are distinguished and incorporated: noise (as modeled by NJ) reflecting the lack-of-fit of distances to an underlying divergence tree, and distortion (as modeled by BioNJ) reflecting the departure of this divergence tree from its underlying ultrametric time tree. Unlike previous methods, our approach exploits the coherence of both trees, simultaneously estimating them through a computationally efficient procedure which recursively descends through the phylogeny. Variances and covariances of distances and divergences are propagated from one agglomerative stage to the next, and taken into account in building the tree.

Focussing only on means, variances and covariances of distances, our methodology avoids the need for full parametric assumptions about evolutionary and other processes involved. The advantage is that inferences do not depend on the validity of substitution models, thus admitting the possibility of straightforwardly incorporating non-standard sources of phylogenetic information other than sequence data.

As shown in Section 3, even small phylogenies contain enough data to estimate the noise component σ accurately, but information on the distortion component ν is weak and estimates tend to be positively biased. Our simulation results show that StatTree performs comparably with Weighbor and BioNJ, with StatTree slightly outperforming BioNJ when distance matrices have a low signal-to-noise ratio, such as when sequences and branch lengths are short and the topology is unbalanced.

There is scope for further development of the methodology, in particular for a more robust estimation of ν. Generalizations of the methodology might allow for non-binary trees and for different amounts of noise in different parts of the tree. A more substantial advance would be required to accommodate evolution in the rate of evolution (Thorne et al., 1998, Kishino et al., 2001), which induces correlations in divergence between non-overlapping branches.

Software and applications

We have implemented the methodology in StatTree, which may be freely downloaded from http://www.mas.ncl.ac.uk/~ntmwn/stattree. Illustrative applications of the methodology to bacterial, viral and mitochondrial datasets can be found at http://www.cl.cam.ac.uk/~pl219/GilksNyeLioSAGMB.html.

Appendix A Estimating equations for σ² and ν

Our estimating equations for σ² and ν are derived from a global objective function M_G comprising the marginal objective function Inline graphic in (30) summed over all stages i:

where the dependence on θ is through W⁽ⁱ⁾ given in (19). The estimated marginal residuals ε̂*⁽ⁱ⁾ in (A1) are given by

from (27), for the optimal taxon-pair agglomeration (j⁽ⁱ⁾,k⁽ⁱ⁾), as described in Section 2.11.

We construct our estimating equations by setting partial derivatives of (A1) equal to their expectations (see, for example, Crowder (2001)). These partial derivatives are:

using (19), where tr denotes trace and the residuals {ε̂*⁽ⁱ⁾} are considered fixed. Now, it can be shown that IE Inline graphic = σ²P⁽ⁱ⁾W⁽ⁱ⁾ where P⁽ⁱ⁾ is the matrix projecting into the residual space of the marginal regression model (27):

where X⁽ⁱ⁾ = [A⁽ⁱ⁾S⁽ⁱ⁾1₂,B⁽ⁱ⁾]. Therefore, taking expectations in (A3,A4),

Equating observed (A3,A4) and expected (A5,A6) partial derivatives, we obtain the following estimating equations:

which may be solved iteratively for σ² and θ.

Appendix B Efficient computation

Using equations (31–40,A7,A8) directly would be computationally prohibitive. However, the algebraic structure of the vectors and matrices involved admits substantial computational savings, as we now describe.

The key result, in Lemma B.1 below, is an efficient formula for computing Inline graphic in (31,32), thus avoiding the double inversion of large matrices involved in . Lemma B.3 then shows that the V⁽ⁱ⁾ matrices stay within the class defined by (43). Lemmas B.1–B.3 open the way for efficient computation of equations (31–40,A7,A8).

We first introduce the following scalars, notationally suppressing their dependence on i:

Then we have the following results:

Lemma B.1 If W ⁽ⁱ⁾ satisfies (19) and V ⁽ⁱ⁾ satisfies (43), then

and

where , τ_D and ζ_D are the diagonalizations of (44), (τ₁,τ₂,τ₃,..., Inline graphic ) and (), respectively.

Proof Equation (B1) follows immediately from (19,43). Using the Sherman–Morrison–Woodbury matrix-inversion identity (see, for example, Golub and Van Loan (1996)), we obtain

where e_D is the diagonalization of (τ₁r₁, τ₂r₂,..., Inline graphic ). Elementary manipulations eventually lead to:

where ẽ_D is the diagonalization of (h,τ₃r₃,..., Inline graphic ) and w⁽ⁱ⁺¹⁾ is given by (41). Equation (B2) may then be obtained by again applying the Sherman–Morrison-Woodbury identity, but this time to (B3) and in the reverse direction. □
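The computational benefit of this kind of identity can be illustrated in its simplest form, the rank-one (Sherman–Morrison) case for a diagonal matrix plus an outer product: (D + c v vᵀ)⁻¹ b = D⁻¹b − c D⁻¹v (vᵀD⁻¹b) / (1 + c vᵀD⁻¹v). The Python sketch below solves such a system in time linear in its dimension. It is a generic illustration only; the matrices arising in Lemma B.1 have a richer structure, and the function names are not taken from StatTree.

```python
def solve_diag_plus_rank_one(diag, v, c, b):
    """Solve (D + c * v v^T) x = b in O(n), where D = diag(diag).

    Uses the Sherman-Morrison identity; requires non-zero diagonal entries
    and 1 + c * v^T D^{-1} v != 0.
    """
    dinv_b = [bi / di for bi, di in zip(b, diag)]
    dinv_v = [vi / di for vi, di in zip(v, diag)]
    denom = 1.0 + c * sum(vi * wi for vi, wi in zip(v, dinv_v))
    coeff = c * sum(vi * yi for vi, yi in zip(v, dinv_b)) / denom
    return [yi - coeff * wi for yi, wi in zip(dinv_b, dinv_v)]


def apply_diag_plus_rank_one(diag, v, c, x):
    """Compute (D + c * v v^T) x, used here to check the solution."""
    v_dot_x = sum(vi * xi for vi, xi in zip(v, x))
    return [di * xi + c * vi * v_dot_x for di, vi, xi in zip(diag, v, x)]


diag, v, c, b = [2.0, 3.0, 4.0], [1.0, 1.0, 1.0], 0.5, [1.0, 2.0, 3.0]
x = solve_diag_plus_rank_one(diag, v, c, b)
residual = [ri - bi for ri, bi in zip(apply_diag_plus_rank_one(diag, v, c, x), b)]
```

No dense matrix is ever formed: both the solve and the check use only vector operations, which is the source of the savings over direct inversion.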

Lemma B.2 With the assumptions of Lemma B.1, expression (40) simplifies to

where and are the diagonalizations of p ⁽ⁱ⁺¹⁾ and s ⁽ⁱ⁺¹⁾ given by (44) and (42), where in (42) we define:

Proof The result may be shown through elementary but extremely lengthy and laborious manipulations, using the results of Lemma B.1. □

Lemma B.3 Expression (43) for V⁽ⁱ⁾ holds for all i = 0,1,...,m⁽⁰⁾.

Proof By induction from Lemma B.2, noting from (15) that (43) holds for i = 0. □

Contributor Information

Walter R Gilks, University of Leeds and Rothamsted Research.

Tom M.W. Nye, University of Newcastle upon Tyne.

Pietro Lio, University of Cambridge.

References

Brodal, G., R. Fagerberg, and C. N. S. Pedersen (2004). Computing the quartet distance between evolutionary trees in time O(n log² n). Algorithmica 38(2), 377–395.
Bruno, W., N. Socci, and A. Halpern (2000). Weighted Neighbour Joining: a likelihood-based approach to distance-based phylogeny reconstruction. Molecular Biology and Evolution 17, 189–197.
Bulmer, M. (1991). Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Molecular Biology and Evolution 8, 868–883.
Chakraborty, R. (1977). Estimation of time of divergence from phylogenetic studies. Canadian Journal of Genetics and Cytology 19, 217–223.
Crowder, M. (2001). On repeated measures analysis with misspecified covariance structure. Journal of the Royal Statistical Society, Series B 63, 55–62.
Desper, R. and O. Gascuel (2004). Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting. Molecular Biology and Evolution 21(3), 587–598.
Felsenstein, J. (1987). Estimation of hominoid phylogeny from a DNA hybridization data set. Journal of Molecular Evolution 26, 123–131.
Felsenstein, J. (2004). Inferring Phylogenies. Massachusetts: Sinauer Associates, Inc.
Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14, 685–695.
Gascuel, O. (2000). Data model and classification by trees: the minimum variance reduction (MVR) method. Journal of Classification 17, 67–99.
Golub, G. and C. Van Loan (1996). Matrix Computations (3rd ed.). Baltimore: The Johns Hopkins University Press.
Hasegawa, M., H. Kishino, and T. Yano (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22, 160–174.
Jukes, T. and C. Cantor (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian Protein Metabolism, Volume III, pp. 21–132. New York: Academic Press.
Keele, B., E. Giorgi, J. Salazar-Gonzalez, J. Decker, K. Pham, M. Salazar, et al. (2008). Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc. Natl. Acad. Sci. USA 105, 7552–7557.
Kimura, M. (1980). A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16, 111–120.
Kishino, H., J. Thorne, and W. Bruno (2001). Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution 18, 352–361.
Lanave, C., G. Preparata, C. Saccone, and G. Serio (1984). A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20, 86–93.
Mardia, K., J. Kent, and J. Bibby (1979). Multivariate Analysis. New York: Academic Press.
Nei, M., J. Stephens, and N. Saitou (1985). Methods for computing the standard errors of branching points in an evolutionary tree and their applications to molecular data from human and apes. Molecular Biology and Evolution 2, 66–85.
Rambaut, A. and N. Grassly (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences 13(3), 235–238.
Robinson, D. and L. Foulds (1981). Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147.
Saitou, N. and M. Nei (1987). The neighbour-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425.
Salazar-Gonzalez, J., M. Salazar, B. Keele, G. Learn, E. Giorgi, H. Li, et al. (2009). Genetic identity, biological phenotype, and evolutionary pathways of transmitted/founder viruses in acute and early HIV-1 infection. Journal of Experimental Medicine 206(6), 1273–1289.
Studier, J. and K. Keppler (1988). A note on the neighbor-joining method of Saitou and Nei. Molecular Biology and Evolution 5, 729–731.
Susko, E. (2003). Confidence regions and hypothesis tests using generalized least squares. Molecular Biology and Evolution 20, 862–868.
Thorne, J., H. Kishino, and I. Painter (1998). Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution 15, 1647–1657.
Wang, L.-S. and T. Warnow (2005). Distance-based genome rearrangement phylogeny. In O. Gascuel (Ed.), Mathematics of Evolution & Phylogeny, Chapter 13, pp. 353–383. Oxford University Press.
Yang, Z. (1993). Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution 10, 1396–1401.
Zwickl, D. and D. Hillis (2002). Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology 51(4), 588–598.

