Abstract
Phylogenetic trees describe evolutionary relationships between related organisms (taxa). One approach to estimating phylogenetic trees supposes that a matrix of estimated evolutionary distances between taxa is available. Agglomerative methods have been proposed in which closely related taxon-pairs are successively combined to form ancestral taxa. Several of these computationally efficient agglomerative algorithms involve steps to reduce the variance in estimated distances. We propose an agglomerative phylogenetic method which focuses on statistical modeling of variance components in distance estimates. We consider how these variance components evolve during the agglomerative process. Our method simultaneously produces two topologically identical rooted trees, one tree having branch lengths proportional to elapsed time, and the other having branch lengths proportional to underlying evolutionary divergence. The method models two major sources of variation which have been separately discussed in the literature: noise, reflecting inaccuracies in measuring divergences, and distortion, reflecting randomness in the amounts of divergence in different parts of the tree. The methodology is based on successive hierarchical generalized least-squares regressions. It involves only means, variances and covariances of distance estimates, thereby avoiding full distributional assumptions. Exploitation of the algebraic structure of the estimation leads to an algorithm with computational complexity comparable to the leading published agglomerative methods. A parametric bootstrap procedure allows full uncertainty in the phylogenetic reconstruction to be assessed. Software implementing the methodology may be freely downloaded from StatTree.
Keywords: agglomerative method, distance matrix, generalized least-squares regression, phylogenetic tree, variance-components model
1 Introduction
Phylogenetic trees describe evolutionary relationships between related organisms, or taxa. Currently, the most sophisticated statistical methods for reconstructing phylogenetic trees depend on probabilistic models of DNA or protein sequence evolution, estimated using maximum likelihood or Markov chain Monte Carlo; see for example Felsenstein (2004). These approaches require many evolutionary assumptions but concentrate on small-scale events in sequences such as point mutations or deletions. Such methods do not accommodate large-scale genomic rearrangements or supplementary non-sequence data such as phenotype or habitat. Moreover, their need to explore the vastness of tree space may render them computationally infeasible for problems involving very large numbers of taxa.
For large or complex phylogenetic data, methods based on distance matrices, such as NJ (Saitou and Nei, 1987), BioNJ (Gascuel, 1997), Weighbor (Bruno, Socci, and Halpern, 2000), MVR (Gascuel, 2000) and Fastme (Desper and Gascuel, 2004), offer greater flexibility and computational efficiency. Such methods involve a distance metric which scores the dissimilarity between each pair of taxa. The resulting inter-taxon distance matrix is then processed by an agglomerative algorithm which progressively combines pairs of taxa, forming ancestral taxa and estimating branch lengths in the tree, concomitantly reducing the size of the working distance matrix. In general, for more than 3 taxa, the imputed phylogeny cannot reproduce observed distances exactly. Thus, a distance matrix contains uncertainty or noise, and might be consistent with many alternative phylogenies. Ideally, this uncertainty should be taken into account during the agglomeration in estimating branch lengths, in making topological decisions and in monitoring the amounts of information available for those decisions.
Each of the published agglomerative methods addresses the issue of uncertainty in its own style. NJ employs ordinary least-squares (OLS) and Weighbor uses weighted least squares (WLS) estimates of branch lengths, which might be appropriate when distance estimates are uncorrelated (see, for example, Felsenstein (1987)). However, phylogenetic distances are often derived from DNA or protein sequence data, based on simple models of sequence evolution. Such models imply correlated distance estimates with variances which increase exponentially with branch length (Nei, Stephens, and Saitou, 1985, Bulmer, 1991). For such distances, Gascuel (1997) developed BioNJ, introducing the idea of estimating branch lengths by minimizing a sum of branch-length variances at each agglomeration. Rather different optimization criteria involving variances of branch-lengths are employed by Weighbor, MVR and Fastme.
To estimate phylogenetic trees, correlations between distances should be taken into account (Chakraborty, 1977). For this, Hasegawa, Kishino, and Yano (1985) and Bulmer (1991) have proposed the use of generalized least-squares (GLS). However, these non-agglomerative methods require estimation of the full tree in a single computationally intensive step, and hence are unsuitable for exploring large phylogenies. On the other hand, for computational efficiency, agglomerative methods such as NJ, Weighbor and Fastme employ OLS or WLS, and thus do not take formal account of distance correlations. The full implementation of MVR correctly takes account of induced distance correlations, but greater computational efficiency is obtained if they are ignored (Gascuel, 2000).
Some phylogenetic methods are ultrametric. An ultrametric tree assumes a molecular clock beating out a constant pace of evolution in all branches. If branch lengths are drawn proportional to the amount of evolution occurring along them, then an ultrametric tree may be drawn with a physical time axis such that each ancestral taxon is aligned at its extant time, and each leaf taxon at the present time, as in Figure 1(a). Non-ultrametric trees, on the other hand, allow evolution to deviate from time-proportionality, so that leaf taxa become unaligned, as in Figure 1(b). Such varying amounts of evolution may be due to positive or negative selection pressures, to interspecific differences in generation times, to random variation in the rate of evolution (Thorne, Kishino, and Painter, 1998, Kishino, Thorne, and Bruno, 2001), or simply to the randomness of evolutionary events under neutral selection. Currently, the practitioner must decide whether to conduct an ultrametric or a non-ultrametric analysis. However, underlying any non-ultrametric divergence tree must lie a topologically identical but ultrametric time tree in which branch lengths are drawn proportional to elapsed physical time. This assumes that each speciation event occurs at a specific (but generally unknown) physical time. Thus an ultrametric time-tree and a non-ultrametric divergence tree can be simultaneously true. They estimate different aspects of evolution, sharing the same topology but differing in branch lengths.
Here we introduce a new distance-based agglomerative phylogenetic method, based on a variance-components model involving only means, variances and covariances of distances. Thus our method takes account of distance correlations but avoids making full probabilistic evolutionary assumptions. Our method simultaneously produces both a non-ultrametric divergence tree and its underlying ultra-metric time tree. Consequently, without needing to specify an outgroup, our divergence tree is rooted. The true divergence tree may, to a greater or lesser extent, conform to a molecular clock assumption; our method allows the degree of this non-ultrametricity to be estimated from the data.
Our approach uses a sequence of staged GLS regressions to build the estimated tree, each stage formally incorporating variances and covariances in distances observed or estimated at the previous stage. The GLS computations involve inversion of square matrices with dimension of order $O(m^2)$, where $m$ is the number of taxa. If tackled head-on, each GLS regression would have computational complexity of order $O(m^6)$. However, thorough exploitation of the algebraic structure of these matrices leads to a GLS algorithm with computational complexity of order $O(m)$, and the full phylogenetic reconstruction has computational complexity on a par with the most efficient distance-based algorithms. A parametric bootstrap procedure allows full phylogenetic uncertainty to be evaluated. Our methodology is implemented in StatTree, which may be freely downloaded from http://www.mas.ncl.ac.uk/~ntmwn/stattree. Examples of the application of StatTree can be found at http://www.cl.cam.ac.uk/~pl219/GilksNyeLioSAGMB.html.
Figure 1: Illustrating (a) an ultrametric tree with time axis recording years before present (bp) and leaves aligned at the present time; and (b) a non-ultrametric tree of the same topology.
2 Method
2.1 Time, divergence and distance
We distinguish between physical time t, divergence d, and distance d̂.
Time: Let tj denote the time before present (bp) that taxon j was extant. Let k denote the most recent common ancestor of taxa j and ℓ. Then the total physical time over which taxa j and ℓ have independently evolved is tjℓ = (tk − tj) + (tk − tℓ) = 2tk − tj − tℓ.
Divergence: This is the concept of branch length, which we define here to be the gross amount of change in sequence or other variables, summed along a path in the evolutionary tree. Each change makes a positive contribution to divergence, the amount being determined in some way appropriate to the problem in hand. Thus a change immediately followed by its reversal both contribute positively. Since such reversed changes are generally undetectable, divergences are intrinsically unobservable. Let djk = dkj denote the divergence between any two taxa j and k. We expect djk to be equal on average to tjk (subsuming an arbitrary scaling of time), with some random variation around this mean.
Distance: Although divergences are unobservable, we may be able to estimate them from available sequence or other data. We use the term distance and the notation d̂jk to denote a measurement or estimate of djk. We expect d̂jk to be equal on average to djk, with some random variation around this mean reflecting measurement imprecision.
The distinction between distance and divergence is seen more sharply when considering tree-additivity. Divergences are tree-additive, that is, for any three taxa j,k,ℓ where k lies on the path in the phylogenetic tree from j to ℓ, we have
$$d_{j\ell} = d_{jk} + d_{k\ell}. \tag{1}$$
However, tree-additivity is generally only an approximate property of distances, that is, d̂jℓ ≈ d̂jk + d̂kℓ.
For present-day taxa, we assume sequence or other data are available from which distances may be calculated directly. However, such data are in general unavailable for ancient taxa, so distances between them must be computed indirectly. Indeed, agglomerative methods recursively compute distances between ancient taxa on the basis of previously computed distances between more recent taxa (see below). That is, divergences are estimated recursively on the basis of previous divergence estimates. Thus, as we consider taxa j and k increasingly far back in time, distances d̂jk will tend to become increasingly imprecise estimates of their underlying divergences djk. On the other hand, the number of pairs of present-day taxa descending from j and k and therefore contributing to the calculation of d̂jk will concomitantly increase, which should tend to reduce imprecision (although there has been some debate on this issue; see Zwickl and Hillis (2002)). Therefore, as the agglomeration proceeds, it is unclear how much imprecision accumulates in the working distance matrix. It would be desirable to know whether, or under what conditions, imprecision accumulates or dissipates in agglomerative methods.
As noted above, current agglomerative methods employ a variety of strategies to take account of imprecision in distances. Here we consider the question of imprecision through a more coherent statistical-modeling framework. We show how imprecision can be estimated and monitored as the tree reconstruction proceeds; how it can be factored into the tree reconstruction; and how all this can be done without introducing unacceptable computational overheads.
2.2 Agglomeration
In common with most other distance-matrix methods, our approach is agglomerative. We begin with a set of currently extant taxa (or other genetic objects) of interest, and a matrix of distances between them. Recursively, we choose two taxa to be replaced by a new taxon representing their last common ancestor (LCA). This taxon pair is chosen to minimize the estimated total tree length, according to the neighbor-joining principle of Saitou and Nei (1987) (see Section 2.11). Distances from the new taxon to each of the remaining taxa are then computed. This agglomerative process continues until a single taxon remains.
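For orientation, the classical neighbor-joining selection rule can be sketched as follows. This is the textbook Q-criterion of Saitou and Nei (1987), shown only as a reference point; it is not necessarily the exact selection rule of Section 2.11, and the function name `nj_pick_pair` and the toy distance matrix are illustrative assumptions.

```python
def nj_pick_pair(D):
    """Return the pair (j, k), j < k, minimizing the classical NJ Q-criterion.

    D is a symmetric list-of-lists distance matrix with zero diagonal.
    Taxa are indexed from 0 here (an assumption of this sketch).
    """
    m = len(D)
    row_sums = [sum(row) for row in D]
    best, best_q = None, float("inf")
    for j in range(m):
        for k in range(j + 1, m):
            # Q(j,k) = (m - 2) d(j,k) - sum_l d(j,l) - sum_l d(k,l)
            q = (m - 2) * D[j][k] - row_sums[j] - row_sums[k]
            if q < best_q:  # strict inequality: ties go to the first pair found
                best, best_q = (j, k), q
    return best

# Toy tree-additive distances for the topology ((0,1),(2,3)).
D = [
    [0, 3, 7, 8],
    [3, 0, 6, 7],
    [7, 6, 0, 5],
    [8, 7, 5, 0],
]
pair = nj_pick_pair(D)
```

With four taxa the two cherries (0,1) and (2,3) necessarily tie on this criterion, so the sketch simply returns the first of them.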
We start the agglomerative process at stage 0 with $m^{(0)}$ taxa all extant at time 0, and a distance matrix containing $n^{(0)} = m^{(0)}(m^{(0)} - 1)/2$ elements in its upper triangle. There will be $m^{(0)}$ agglomerative stages. At a generic stage $i \ge 0$ of the agglomerative process, we consider a set of $m^{(i)}$ taxa, labeled $1^{(i)}, 2^{(i)}, \ldots, m^{(i)}$. These taxa may have been extant at different times; we use $t_j^{(i)}$ to denote the extant time of taxon $j^{(i)}$. At this stage we will have a working distance matrix between these $m^{(i)}$ taxa, stored for convenience in a vector $\hat{d}^{(i)}$ of length $n^{(i)} = m^{(i)}(m^{(i)} - 1)/2$, whose elements are indexed by taxon-pairs in upper-triangular row order, which we refer to as standard pair order, as follows:
$$\hat{d}^{(i)} = \bigl(\hat{d}_{12}^{(i)},\ \hat{d}_{13}^{(i)},\ \ldots,\ \hat{d}_{1m^{(i)}}^{(i)},\ \hat{d}_{23}^{(i)},\ \ldots,\ \hat{d}_{m^{(i)}-1,\,m^{(i)}}^{(i)}\bigr)^T. \tag{2}$$
Our principal tasks at stage i are to select, on the basis of d̂(i), the two taxa to be agglomerated; to determine the extant time of their LCA; and to calculate a vector d̂(i+1) containing distances between the resulting m(i+1) = m(i) − 1 taxa. Other quantities to be estimated are described below. The LCA represents a new internal taxon in the tree under construction, as illustrated in Figure 2.
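The standard pair order can be made concrete with a short sketch. The helper names `standard_pair_order` and `pair_index` are illustrative assumptions, not part of StatTree; taxon labels are 1-based as in the text, while positions in the vector are 0-based.

```python
from itertools import combinations

def standard_pair_order(m):
    """List taxon pairs (j, k), 1 <= j < k <= m, in upper-triangular row order."""
    return list(combinations(range(1, m + 1), 2))

def pair_index(j, k, m):
    """0-based position of pair (j, k), j < k, within the standard pair order.

    Rows 1..j-1 of the upper triangle contribute (j-1)*m - (j-1)*j/2 pairs;
    pair (j, k) is then the (k - j)th entry of row j.
    """
    return (j - 1) * m - (j - 1) * j // 2 + (k - j - 1)

pairs = standard_pair_order(4)
```

For four taxa this gives the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), so a distance vector of length 6 can be addressed without storing the full matrix.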
2.3 Divergence relationships
To simplify the presentation, without loss of generality, we assume until further notice that the two taxa to be agglomerated at stage i are 1(i) and 2(i), as in Figure 2. Then, at stage i+1, we label the new LCA as taxon 1(i+1), and relabel each remaining taxon as follows:
| stage | taxon label | | | | | |
|---|---|---|---|---|---|---|
| i | 3(i) | 4(i) | ... | k(i) | ... | m(i) |
| i+1 | 2(i+1) | 3(i+1) | ... | (k–1)(i+1) | ... | m(i+1) |
Figure 2: Illustrating a generic stage i of the agglomerative process, showing (a) the ultrametric time tree with its time axis, and (b) the corresponding non-ultrametric divergence tree whose edge lengths are proportional to the amount of evolution taking place along them. The m(i) = 4 current taxa are denoted by black circles, and the LCA of taxa 1(i) and 2(i), which are to be agglomerated next to form taxon 1(i+1), is denoted by a white circle. Taxon 1(i+1) will replace taxa 1(i) and 2(i) at the next stage i+1. Broken lines indicate earlier stages of the agglomeration.
From our assumption (1) of tree-additivity of divergences, we may then write
$$
\begin{aligned}
d_{12}^{(i)} &= a_1^{(i)} + a_2^{(i)}, \\
d_{jk}^{(i)} &= a_j^{(i)} + d_{1,k-1}^{(i+1)}, && j = 1,2;\ k = 3,\ldots,m^{(i)}, \\
d_{jk}^{(i)} &= d_{j-1,k-1}^{(i+1)}, && 3 \le j < k \le m^{(i)},
\end{aligned}
\tag{3}
$$
where $d_{jk}^{(i)}$ denotes the divergence between taxa $j^{(i)}$ and $k^{(i)}$, and where $a_1^{(i)}$ and $a_2^{(i)}$ denote the divergences along the edges from taxa $1^{(i)}$ and $2^{(i)}$ to their LCA, taxon $1^{(i+1)}$. The first condition in (3) states that the divergence between the two taxa being agglomerated is equal to the sum of the divergences from each of them to taxon $1^{(i+1)}$, their LCA. The second condition states that the divergence between either of these two taxa and any other taxon is the sum of the divergences from each of them to taxon $1^{(i+1)}$ (noting the change in taxon labels at stage i+1, described above). The final condition states that the divergence between any other pair of taxa remains unaltered (again noting the change in taxon labels).
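Equation (3) may be expressed more concisely in vector notation:
$$d^{(i)} = A^{(i)} S^{(i)} a^{(i)} + B^{(i)} d^{(i+1)}, \tag{4}$$
where $d^{(i)}$ is the $n^{(i)}$-vector of divergences $d_{jk}^{(i)}$ stored in standard pair order (2); $a^{(i)}$ is the 2-vector $(a_1^{(i)}, a_2^{(i)})^T$, where $T$ denotes transposition; and $A^{(i)}$, $S^{(i)}$ and $B^{(i)}$ are the following $(n^{(i)} \times m^{(i)})$, $(m^{(i)} \times 2)$ and $(n^{(i)} \times n^{(i+1)})$ binary matrices. Matrix $S^{(i)}$ contains a 1 in its (1,1) and (2,2) elements, and 0 elsewhere. The rows of $A^{(i)}$ and $B^{(i)}$ correspond to taxon pairs of stage i in standard pair order. Each column $k = 1,\ldots,m^{(i)}$ of $A^{(i)}$ indicates which taxon-pairs of stage i contain taxon $k^{(i)}$. The columns of $B^{(i)}$ correspond to taxon pairs of stage i+1 in standard pair order, the column for pair $(j^{(i+1)}, k^{(i+1)})$ indicating which taxon pairs of stage i are connected on the tree via the path connecting taxa $j^{(i+1)}$ and $k^{(i+1)}$. For example, from Figure 2 we have
$$
A =
\begin{array}{c|cccc}
 & 1^{(i)} & 2^{(i)} & 3^{(i)} & 4^{(i)} \\ \hline
(1^{(i)},2^{(i)}) & 1 & 1 & 0 & 0 \\
(1^{(i)},3^{(i)}) & 1 & 0 & 1 & 0 \\
(1^{(i)},4^{(i)}) & 1 & 0 & 0 & 1 \\
(2^{(i)},3^{(i)}) & 0 & 1 & 1 & 0 \\
(2^{(i)},4^{(i)}) & 0 & 1 & 0 & 1 \\
(3^{(i)},4^{(i)}) & 0 & 0 & 1 & 1
\end{array}
\qquad
B =
\begin{pmatrix}
0 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}
$$
where the row and column labels of A are shown, the row labels of B are as for A, and the column labels of B correspond to taxon pairs (1(i+1), 2(i+1)), (1(i+1), 3(i+1)) and (2(i+1), 3(i+1)).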
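The construction of these binary matrices from their verbal definitions can be sketched in a few lines. The code below assumes, as in the text, that taxa 1 and 2 are the pair being agglomerated; all function names and the toy numbers for a(i) and d(i+1) are illustrative assumptions. It then checks the vector form of tree-additivity for the four-taxon case of Figure 2.

```python
from itertools import combinations

def pairs(m):
    """Taxon pairs (j, k), j < k, in standard pair order (1-based labels)."""
    return list(combinations(range(1, m + 1), 2))

def build_A(m):
    """n x m matrix: 1 where the row's taxon pair contains the column's taxon."""
    return [[1 if t in p else 0 for t in range(1, m + 1)] for p in pairs(m)]

def build_S(m):
    """m x 2 matrix with 1 in its (1,1) and (2,2) elements, 0 elsewhere."""
    return [[1 if r == c else 0 for c in range(2)] for r in range(m)]

def build_B(m):
    """n(i) x n(i+1) matrix linking stage-i pairs to stage-(i+1) pairs."""
    new_pairs = pairs(m - 1)

    def relabel(t):
        # Taxa 1 and 2 merge into new taxon 1; taxon k >= 3 becomes k - 1.
        return 1 if t <= 2 else t - 1

    B = []
    for (j, k) in pairs(m):
        target = (relabel(j), relabel(k))  # (1, 1) for pair (1, 2): no column matches
        B.append([1 if q == target else 0 for q in new_pairs])
    return B

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(X, v):
    return [sum(x * y for x, y in zip(row, v)) for row in X]

m = 4
A, S, B = build_A(m), build_S(m), build_B(m)
a = [0.3, 0.5]            # toy edge divergences from taxa 1 and 2 to their LCA
d_next = [1.0, 1.2, 0.9]  # toy stage-(i+1) divergences in standard pair order
d = [x + y for x, y in zip(matvec(matmul(A, S), a), matvec(B, d_next))]
```

The resulting vector `d` holds the stage-i divergences in standard pair order: its first entry is the sum of the two edge divergences, and its last entry is the stage-(i+1) divergence between the two untouched taxa.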
2.4 Statistical assumptions
We now state our statistical assumptions concerning divergences d(i) and distances d̂(i). Our approach is moment-based: we argue entirely in terms of conditional expectations, variances and covariances, and thus avoid the need to make explicit distributional assumptions, unlike Weighbor, for example, which assumes Gaussian distances.
We adopt the convention that all expectations, variances and covariances implicitly condition on the true but unknown tree topology and taxon times, including the assertion (for now) that the LCA of taxa 1(i) and 2(i) is the most recent of any ith-stage taxon pair. However, where expectations and variances condition on divergences, we will make this explicit.
2.4.1 Non-ultrametricity
Our first set of assumptions concerns the relationship between the divergences and times which we consider at each stage i ≥ 0 of the algorithm. Let $u^{(i)}$ be the 2-vector containing the time-intervals from taxa $1^{(i)}$ and $2^{(i)}$ to their LCA, taxon $1^{(i+1)}$:
$$u^{(i)} = t_1^{(i+1)} 1_2 - t^{(i)}, \tag{5}$$
where $1_k$ denotes a k-vector of 1s and $t^{(i)} = (t_1^{(i)}, t_2^{(i)})^T$. Conditioning on $d^{(i+1)}$ and (implicitly) on the taxon times, we assume:
$$\mathbb{E}\bigl[a^{(i)} \mid d^{(i+1)}\bigr] = u^{(i)}, \tag{6}$$
$$\operatorname{Var}\bigl[a^{(i)} \mid d^{(i+1)}\bigr] = \nu \operatorname{diag}\bigl(u^{(i)}\bigr), \tag{7}$$
where ν ≥ 0 is a scalar constant and $\operatorname{diag}(u^{(i)})$ denotes the diagonalization of vector $u^{(i)}$. (It may seem strange to condition here on $d^{(i+1)}$ which we do not know, but formally $d^{(i+1)}$ will take the role of a parameter vector in a regression model, as described below.) Thus, we assert that the variance of the divergence along a path is proportional to that path’s time-length, and that divergences along non-overlapping paths are uncorrelated.
For example, suppose we have a DNA sequence of length n bases, where each base has one of the four states {A,C,G,T}. If we assume the Jukes–Cantor model of sequence evolution (Jukes and Cantor, 1969), in which events (state-changes) occur independently and with equal probability according to a Poisson process at rate λ per base, then the total number of events (including reversals of previous state-changes) in the sequence in a time-interval u will be Poisson with mean λnu. Defining divergence a to be the total number of events per base in time-interval u, scaling time so that λ = 1, and setting ν = 1/n, leads to divergence a having mean u and variance νu, in agreement with (6,7). This simple example helps to provide some intuition about ν. However, our methodology is not tied to such simple evolutionary processes. With suitable scaling of time and choice of ν, any homogeneous independent-increments process will satisfy (6,7), including more general sequence evolution models (Kimura, 1980, Hasegawa et al., 1985, Lanave, Preparata, Saccone, and Serio, 1984), with or without rate-variation across sites (Yang, 1993). The assumptions might be equally applicable to much more complex processes including large-scale genomic rearrangements; see, for example, Wang and Warnow (2005). Our moment-based approach has the advantage that details of the stochastic process need not be specified beyond (6,7).
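The moment calculation in this example can be checked numerically. The following is an illustrative simulation only, assuming the Poisson event process described above with λ = 1 per base; the function name, the sequence length of 100 bases, the interval u = 0.5 and the number of replicates are all assumptions made for the sketch.

```python
import random
import statistics

def simulate_divergence(n_sites, u, rng):
    """Events per base in a time-interval u, for a Poisson process of rate 1 per base.

    The whole sequence experiences events at total rate n_sites, so we count
    exponential waiting times that fit inside the interval [0, u].
    """
    total_rate = float(n_sites)
    events = 0
    t = rng.expovariate(total_rate)  # waiting time to the first event
    while t <= u:
        events += 1
        t += rng.expovariate(total_rate)
    return events / n_sites

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_sites, u = 100, 0.5
draws = [simulate_divergence(n_sites, u, rng) for _ in range(4000)]
sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
nu = 1.0 / n_sites  # the value of nu suggested by the example
```

Under these assumptions the sample mean should lie close to u and the sample variance close to νu, which is the behaviour that (6,7) require of the divergence along a single edge.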
Equation (7) allows divergences to depart from a strict molecular clock assumption. Thus the divergence tree is not constrained to be ultrametric, unlike the time tree. We call this kind of variation distortion. The amount of molecular clock distortion is controlled by parameter ν. From (6,7), setting ν = 0 would assert that a(i) = u(i) for all i, so branch lengths in the divergence tree would be identical to those in the time tree and the divergence tree would then be ultrametric, obeying a strict molecular clock. In contrast, setting ν > 0 confers some elasticity in the branches of the divergence tree, as illustrated in Figure 1(b), with larger ν conferring greater elasticity. In the simple Poisson process example above, we have ν = 1/n. However, our methodology does not require ν to be set externally, as we provide methodology for estimating it (see Section 2.12).
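Taking expectations and variances over a(i) in equation (4) using (6,7), it follows immediately that divergences have the following conditional moments, conditioning on d(i+1), the part of the divergence tree that remains after agglomerating taxa 1(i) and 2(i):
$$\mathbb{E}\bigl[d^{(i)} \mid d^{(i+1)}\bigr] = A^{(i)} S^{(i)} u^{(i)} + B^{(i)} d^{(i+1)}, \tag{8}$$
$$\operatorname{Var}\bigl[d^{(i)} \mid d^{(i+1)}\bigr] = \nu\, A^{(i)} S^{(i)} \operatorname{diag}\bigl(u^{(i)}\bigr) S^{(i)T} A^{(i)T}. \tag{9}$$
A recursive expression for the marginal variance of d(i) then follows from (8,9):
$$\operatorname{Var}\bigl[d^{(i)}\bigr] = \nu\, A^{(i)} S^{(i)} \operatorname{diag}\bigl(u^{(i)}\bigr) S^{(i)T} A^{(i)T} + B^{(i)} \operatorname{Var}\bigl[d^{(i+1)}\bigr] B^{(i)T}. \tag{10}$$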
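In particular, (10) implies that
$$\operatorname{Var}\bigl[d_{12}^{(i)}\bigr] = \nu\bigl(u_1^{(i)} + u_2^{(i)}\bigr), \tag{11}$$
$$\operatorname{Cov}\bigl[d_{12}^{(i)}, d_{k\ell}^{(i)}\bigr] = \nu\, u_k^{(i)}, \qquad k = 1,2;\ \ell = 3,\ldots,m^{(i)}, \tag{12}$$
$$\operatorname{Cov}\bigl[d_{12}^{(i)}, d_{j\ell}^{(i)}\bigr] = 0, \qquad 3 \le j < \ell \le m^{(i)}, \tag{13}$$
where $u_k^{(i)}$ denotes the kth element of $u^{(i)}$. This illustrates several consequences of our assumptions: that the variation in divergence along a tree path is proportional to the time-length of that path (11); that the covariation of divergences along two overlapping paths is proportional to their shared path length (12); and that divergences along non-overlapping paths are uncorrelated (13). Such relationships were first proposed by Chakraborty (1977), and underpin BioNJ.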
Although our assumptions (6,7) accommodate a great variety of models, they do not formally allow for evolving rates of evolution (Thorne et al., 1998, Kishino et al., 2001), as this would induce correlations between non-overlapping paths in the tree.
2.4.2 Non-tree-additivity
Our second set of assumptions concerns the relationship between distances and divergences. We begin by considering the observed distances d̂(0) between m(0) currently extant taxa. There are n(0) elements in d̂(0), but the corresponding (rooted) divergence tree contains only (2m(0) − 2) branches, so for m(0) > 4 it will generally be impossible to find a divergence tree whose branches exactly replicate the distances in d̂(0). We refer to this lack of fit to tree-additivity as noise, and model it as follows:
$$\mathbb{E}\bigl[\hat{d}^{(0)} \mid d^{(0)}\bigr] = d^{(0)}, \tag{14}$$
$$\operatorname{Var}\bigl[\hat{d}^{(0)} \mid d^{(0)}\bigr] = \sigma^2 I_{n^{(0)}}, \tag{15}$$
where Ik denotes the (k × k) identity matrix. The amount of noise is controlled by variance parameter σ2. Setting σ2 = 0 implies strict tree-additivity in distances. Thus we assert that observed distances are unbiased, uncorrelated, equivariant estimates of the tree-additive divergences underlying them. We emphasize that this is not an abandonment of previous work expounding the genesis of correlations between distances (Chakraborty, 1977, Hasegawa et al., 1985, Bulmer, 1991), since these relate to the marginal distribution of distances whereas (14,15) relate to their conditional distribution, conditioning on underlying divergences d(0). Equations (14,15) refer only to the relationship between distances and the divergences they measure, not to the relationship between divergences and time intervals, which we have already discussed in Section 2.4.1. In particular, (15) does not imply an underlying star-tree topology. As Chakraborty (1977) and others have shown, correlations between distances arise precisely through their overlapping paths in the divergence tree. Below we show how this plays out within the framework of equations (6,7,14,15). The independent, equivariant, distance errors assumed by NJ are represented here by (15), but the branch-length dependent variances and covariances assumed by BioNJ correspond here to equation (9). There is no conflict in simultaneously assuming both sources of variation, as we do.
At each subsequent agglomerative stage i > 0, our algorithm produces a vector of inter-taxon distances d̂(i), estimating underlying divergences d(i). As we show by induction in Section 2.8, assumptions (6,7,14,15), together with the estimation procedures described below, result in the following conditional moments at each stage i ≥ 0, conditioning on unknown divergences d(i):
$$\mathbb{E}\bigl[\hat{d}^{(i)} \mid d^{(i)}\bigr] = d^{(i)}, \tag{16}$$
$$\operatorname{Var}\bigl[\hat{d}^{(i)} \mid d^{(i)}\bigr] = \sigma^2 V^{(i)}, \tag{17}$$
where V(i) is a symmetric, positive-definite (n(i) × n(i)) matrix of a particular algebraic structure. Thus, at each stage, our model implies that distances are unbiased measures of divergence, with conditional variances and covariances which do not depend on the lengths of the divergences they measure. At stage i = 0, equation (15) asserts that V(0) is the identity matrix, so that distances at stage 0 have a common variance and are uncorrelated. However, these simple variance-covariance properties necessarily disappear at later stages, because divergences estimated at stage i > 0 jointly depend on imprecisely estimated divergences from previous stages, thereby altering variances of divergence estimates and inducing correlations between them, ultimately affecting subsequent divergence estimates. NJ, BioNJ and Fastme take no account of such cumulative effects of estimation; Weighbor takes account of estimation effects on variances, but ignores effects on covariances; whilst the full, computationally intensive, version of MVR (Gascuel, 2000) correctly accounts for such effects in its minimum-variance criterion. Our methodology likewise takes full account of these estimation effects to preserve statistical efficiency, but without sacrificing computational efficiency.
2.4.3 Combining distortion and noise
It follows from equations (6,7,16,17), marginalizing over d(i) while retaining the conditioning on d(i+1), that distances have the following conditional variance property:
$$\operatorname{Var}\bigl[\hat{d}^{(i)} \mid d^{(i+1)}\bigr] = \sigma^2 W^{(i)}, \tag{18}$$
where
$$W^{(i)} = V^{(i)} + \theta\, A^{(i)} S^{(i)} \operatorname{diag}\bigl(u^{(i)}\bigr) S^{(i)T} A^{(i)T}, \tag{19}$$
where θ = ν/σ². Notice the differences between equations (17) and (18). They differ on the left because in (17) we condition on d(i), while in (18) we condition only on d(i+1), a smaller condition as it involves less of the tree. They differ on the right because the marginalization has introduced an extra variance component. However, we emphasize that these are not alternative models; they are different aspects of the same model.
Equation (19) succinctly demonstrates that our model contains both noise (controlled by σ²) and distortion (controlled by ν). The first term in (19) represents the noise in d̂(i) as a measurement of d(i); and the second term represents the effects of $a_1^{(i)}$ and $a_2^{(i)}$ being independent evolutionary distortions of the same underlying time-interval u(i).
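A small numerical sketch may help to show how the two variance components combine. It assumes that the combined structure takes the form W = V + θ A S diag(u) Sᵀ Aᵀ, which is what the description of the noise and distortion terms suggests; this form, the function name and the toy inputs are assumptions of the sketch rather than statements about StatTree.

```python
from itertools import combinations

def marginal_variance_structure(V, u, theta, m):
    """Return W = V + theta * A S diag(u) S^T A^T for taxa 1 and 2 being joined.

    V is an n x n list-of-lists in standard pair order; u = [u_1, u_2].
    Entry (p, q) of the added term sums u_t over those t in {1, 2} that
    belong to both taxon pairs p and q.
    """
    prs = list(combinations(range(1, m + 1), 2))
    W = [row[:] for row in V]  # copy, so V itself is left untouched
    for r, p in enumerate(prs):
        for c, q in enumerate(prs):
            shared = sum(u[t - 1] for t in (1, 2) if t in p and t in q)
            W[r][c] += theta * shared
    return W

# Toy case: three taxa at stage 0, so V is the identity matrix.
V0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = marginal_variance_structure(V0, u=[0.2, 0.4], theta=0.5, m=3)
```

In this toy case the distances for pairs (1,3) and (2,3) remain uncorrelated, because their only shared taxon is not one of the two being agglomerated, whereas each becomes correlated with the distance for pair (1,2).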
2.5 Regression model
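From (4,5,6,16), we may write
$$\hat{d}^{(i)} = A^{(i)} S^{(i)} a^{(i)} + B^{(i)} d^{(i+1)} + \varepsilon^{(i)}, \tag{20}$$
$$a^{(i)} = t_1^{(i+1)} 1_2 - t^{(i)} + \varepsilon'^{(i)}, \tag{21}$$
where the error terms ε(i) and ε′(i) accommodate departures of d̂(i) and a(i) from their means. Equations (20,21) correspond to a random-effects regression model where d̂(i) is regressed on A(i) and B(i) with regression parameters a(i) and d(i+1), and the random effects a(i) are regressed on $1_2$ with parameter $t_1^{(i+1)}$ and a fixed offset $-t^{(i)}$. The error terms ε(i) and ε′(i) have zero mean and are uncorrelated with variance-covariance matrices
$$\operatorname{Var}\bigl[\varepsilon^{(i)}\bigr] = \sigma^2 V^{(i)}, \tag{22}$$
$$\operatorname{Var}\bigl[\varepsilon'^{(i)}\bigr] = \nu \operatorname{diag}\bigl(u^{(i)}\bigr) \tag{23}$$
from (7,17). For now, we assume that variance parameters σ² and ν are known; we will relax this assumption in Section 2.12.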
2.6 Generalized Least Squares
The regression parameters in (20,21) may be estimated by GLS. (For a general introduction to GLS, see Mardia, Kent, and Bibby (1979); for its application in non-agglomerative phylogenetics, see Hasegawa et al. (1985), Bulmer (1991), Susko (2003)). The GLS objective function at stage i, based on (20–23), is the Mahalanobis distance:
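where, from (20,21),
$$\varepsilon^{(i)} = \hat{d}^{(i)} - A^{(i)} S^{(i)} a^{(i)} - B^{(i)} d^{(i+1)}, \tag{25}$$
$$\varepsilon'^{(i)} = a^{(i)} - t_1^{(i+1)} 1_2 + t^{(i)}. \tag{26}$$
We refer to (24) as the conditional objective function at stage i. Parameter estimates $\hat{t}_1^{(i+1)}$, â(i), d̂(i+1) are obtained by minimizing (24) with respect to $t_1^{(i+1)}$, a(i), d(i+1), for given σ², ν and t(i). Note that t(i) is known from the previous stage; that is, we set t(i) = t̂(i) if i > 0, or t(i) = 0 otherwise.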
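It is instructive to consider an alternative approach to estimation, derived by substituting equation (21) into (20) to give:
$$\hat{d}^{(i)} = t_1^{(i+1)} A^{(i)} S^{(i)} 1_2 - A^{(i)} S^{(i)} t^{(i)} + B^{(i)} d^{(i+1)} + \varepsilon^{*(i)}, \tag{27}$$
where
$$\varepsilon^{*(i)} = \varepsilon^{(i)} + A^{(i)} S^{(i)} \varepsilon'^{(i)}. \tag{28}$$
Equation (27) represents a marginal regression model, in which d̂(i) is regressed on $A^{(i)} S^{(i)} 1_2$ and $B^{(i)}$, with regression parameters $t_1^{(i+1)}$ and $d^{(i+1)}$ and a fixed offset $-A^{(i)} S^{(i)} t^{(i)}$. From (22,23,25,26), the error term ε*(i) has zero mean and variance
$$\operatorname{Var}\bigl[\varepsilon^{*(i)}\bigr] = \sigma^2 W^{(i)}, \tag{29}$$
where W(i) is given by (19). From (27–29), we can construct a marginal GLS objective function: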
and estimate parameters $t_1^{(i+1)}$ and $d^{(i+1)}$ by minimizing (30) w.r.t. $t_1^{(i+1)}$ and $d^{(i+1)}$. It can be shown, using the Sherman–Morrison–Woodbury matrix-inversion identity (see, for example, Golub and Van Loan (1996)), that the conditional and marginal GLS approaches lead to identical estimates of $t_1^{(i+1)}$ and $d^{(i+1)}$ for given σ², ν and t(i).
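For readers less familiar with GLS, the generic estimator can be sketched as below. This is the textbook closed form (XᵀW⁻¹X)⁻¹XᵀW⁻¹y, solved directly by Gaussian elimination; it is not the structured, low-complexity algorithm developed in this paper, and the function names and toy data are illustrative assumptions.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, M[r])) + [float(b[r])] for r in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

def gls(X, y, W):
    """GLS estimate (X^T W^-1 X)^-1 X^T W^-1 y, using linear solves, not inverses."""
    cols = list(zip(*X))                     # columns of the design matrix
    Winv_cols = [solve(W, c) for c in cols]  # columns of W^-1 X
    Winv_y = solve(W, y)                     # W^-1 y
    normal = [[sum(a * b for a, b in zip(ci, wj)) for wj in Winv_cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, Winv_y)) for ci in cols]
    return solve(normal, rhs)

# Toy regression with a correlated error structure W and noise-free responses.
X = [[1, 0], [1, 1], [1, 2]]
W = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 4.0]]
y = [2.0, 5.0, 8.0]  # equals X @ [2, 3] exactly
beta_hat = gls(X, y, W)
```

In the marginal model the design matrix would have the column $A S 1_2$ followed by the columns of $B$, and the response would be the distance vector with the fixed offset removed. A direct solve of this kind has cubic cost in the number of taxon pairs, which is the motivation for the structured computations described later.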
2.7 Parameter estimates
Minimization of the objective function w.r.t. a(i) and d(i+1) leads to the following GLS estimates, suppressing index (i) in the notation Qtt,...,Pa:
Now d̂(i+1) in (32) estimates divergence vector d(i+1) imprecisely, so in the terminology of Section 2.1 it is a vector of distances. Indeed, d̂(i+1) is the distance vector which will be input to the next stage i+1 of our algorithm. Equation (32) admits the possibility that elements of this distance vector may be negative. Although numerical methods for non-negative GLS are available at greater computational cost, they frustrate the algebraic approach which is particularly valuable at subsequent stages of our analysis (see Sections 2.10 and 2.12). We deal with negative branch lengths in a minimally perturbing post-processing stage, described in Section 2.11.
The following variances of the parameter estimates may be derived from (24) and (30), according to standard GLS theory:
It can be shown that the matrices to be inverted in equations (31–39) are all positive-definite, provided V(i) is positive-definite and no elements of u(i) are negative. Noting (40), it follows by induction that V(i) is positive-definite for all i.
2.8 Progression
For the next stage i+1 of the agglomeration, we need to know the matrix V(i+1) assumed in (17). This is provided by equation (38), giving:
Thus V(i+1) depends on V(i) through equations (19,36,40). These equations suggest that, from one stage to the next, the variance-covariance structure of distances becomes increasingly intricate, as a result of regression parameter estimates at each stage being recursively built upon correlated and imprecise regression parameter estimates from the previous stage. This happens even though in (15) we assumed conditionally independent distances initially. Despite this loss of independence, we show below that V(i+1) remains within a relatively simple class of structured covariance matrices. This is a central result of this paper.
We assumed above that equations (16,17) hold for given i, and consequently we have from the unbiasedness of GLS that E[d̂(i+1) | d(i+1)] = d(i+1), and from (38,40) that Var[d̂(i+1) | d(i+1)] = σ²V(i+1). Therefore, (16,17) hold with i replaced by i+1. From (14,15), we see that (16,17) hold for i = 0. Hence, by induction, (16,17) hold for all i ≥ 0.
2.9 Dependence of variance on mean
Estimates $\hat{t}_1^{(i+1)}$ and d̂(i+1) in (31,32) assume W(i) is known. However, equations (6,7) specify that the variance of a(i) depends on its mean, and hence W(i) depends on unknown regression parameter $t_1^{(i+1)}$ through (5). To deal with this, we propose an iterative solution, assuming variance parameters σ² and ν are known. Estimation of σ² and ν is considered in Section 2.12. At each stage i, beginning with an initial guess for $t_1^{(i+1)}$, we iterate the calculations for u(i), W(i), y(i), $\hat{t}_1^{(i+1)}$ and d̂(i+1) in (5,19,31–36) until convergence.
2.10 Computation
We have seen that we may proceed from one stage i to the next i+1 by computing the distance vector d̂(i+1) from (32) and its variance structure V(i+1) from (40). However, both (32) and (40) appear computationally formidable, requiring inversions of matrices of dimension $n^{(i+1)} \times n^{(i+1)}$. Such calculations, if tackled directly, would have time-complexity of order $O\bigl((n^{(i+1)})^3\bigr)$. Fortunately, we may exploit the algebraic structure of these matrices to perform each inversion in only $O(m^{(i+1)})$.
As we describe below, these more efficient calculations involve, at stage i, two nonnegative vectors w(i) and s(i), each of length m(i), defined recursively by equations (41) and (42). The recursion involves a scalar (see equation (B5) in Appendix B) whose value depends on m(0). At the initial stage, we set w(0) to the vector of ones and s(0) to the zero vector.
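Recursion (41) implies that the weight of taxon j(i) is the number of leaves of the tree (i.e. extant taxa) descended from it. In particular, it follows that the weights at every stage sum to the number of extant taxa: $\sum_{j=1}^{m^{(i)}} w^{(i)}_j = m^{(0)}$.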
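The leaf-counting interpretation of the taxon weights can be illustrated with a minimal sketch. It assumes only what the text states, namely that an agglomerated taxon inherits the leaves of the two taxa it replaces; the function name `merge_weights` and the choice to place the new taxon first are illustrative and not taken from StatTree.

```python
def merge_weights(weights, j, k):
    """Return the weight vector after agglomerating taxa j and k (0-based).

    Assumption: the new taxon's weight is the sum of the two merged weights,
    i.e. the number of leaves descended from it.
    """
    merged = weights[j] + weights[k]
    rest = [w for idx, w in enumerate(weights) if idx not in (j, k)]
    return [merged] + rest  # new taxon placed first (an arbitrary labeling choice)


weights = [1] * 6           # stage 0: every extant taxon is its own single leaf
history = [weights]
while len(weights) > 2:
    weights = merge_weights(weights, 0, 1)
    history.append(weights)

for stage, w in enumerate(history):
    print(stage, w, sum(w))  # the sum stays equal to the number of extant taxa
```

Whichever pair is merged at each stage, the total weight is conserved, which is the property used in the text.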

It is a remarkable fact, and a central result of this paper, that the complicated recursion for V(i) through (19,36,40) reduces at each stage i to a simple form, equation (43) (see Appendix B). This form is built from two diagonal matrices: one diagonalizes s(i), and the other diagonalizes an n(i)-vector p(i) whose elements are products of pairs of taxon weights in standard order (2), thus:
$$p^{(i)}_{jk} = w^{(i)}_j \, w^{(i)}_k, \qquad 1 \le j < k \le m^{(i)}. \tag{44}$$
Interpreting (43), the variance in d̂(i) can be thought of as arising from two uncorrelated sources of variation. The first source is a collection of n(i) uncorrelated random variables, one for each taxon pair, with variances inversely proportional to taxon weights, reflecting the pooling of distance information with each agglomeration. Thus we refer to p(i) as a vector of pooling coefficients. The second source is a much smaller set of m(i) uncorrelated random variables, one for each taxon. The variance of the jth random variable in this second set is proportional to the jth element of s(i) and arises through agglomerations leading to taxon j(i). This is not purely a consequence of distortion; positive elements of s(i) arise even when ν = 0. This source impacts on V(i) only where taxon-pairs share a taxon. Thus we refer to s(i) as a vector of sharing coefficients. A particular consequence of (43) is that inter-taxon distances are uncorrelated where they have no taxon in common. Thus, for example, the estimated distance between taxa 1(i) and 2(i) is uncorrelated with that between taxa 3(i) and 4(i), but not with that between taxa 1(i) and 3(i).
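The two-source interpretation can be made concrete with a small sketch. The form used below, a diagonal pooling term 1/(w_j w_k) plus a sharing term contributed by each taxon the two pairs have in common, is an assumption chosen to match the verbal description above; it is not a transcription of equation (43), and the function name `structured_covariance` is hypothetical.

```python
from itertools import combinations


def structured_covariance(w, s):
    """Build a covariance matrix over taxon pairs under the assumed structure.

    Assumed form: V = diag(1/p) + A diag(s) A^T, where p holds products of
    taxon weights and A is the pair-by-taxon incidence matrix.
    """
    pairs = list(combinations(range(len(w)), 2))  # one entry per taxon pair
    V = []
    for a in pairs:
        row = []
        for b in pairs:
            shared = set(a) & set(b)
            value = sum(s[t] for t in shared)        # sharing: only common taxa contribute
            if a == b:
                value += 1.0 / (w[a[0]] * w[a[1]])   # pooling: diagonal only
            row.append(value)
        V.append(row)
    return pairs, V


w = [2, 1, 1, 3]            # illustrative taxon weights
s = [0.5, 0.0, 0.0, 0.2]    # illustrative sharing coefficients
pairs, V = structured_covariance(w, s)
idx = {pair: n for n, pair in enumerate(pairs)}

print(V[idx[(0, 1)]][idx[(2, 3)]])  # no taxon in common
print(V[idx[(0, 1)]][idx[(0, 2)]])  # taxon 0 in common
```

Under this assumed structure the matrix is fully determined by two short vectors, which is what makes the cheaper computations described in the text plausible.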
Equation (43) leads to substantial computational simplifications, in particular in computing d̂(i+1) in (32). We can show that distances at stage i which do not involve taxa 1(i) or 2(i) (i.e. most of d̂(i)) are passed unaltered to the next stage. That is, the estimated distance between taxa j(i) and k(i) is unchanged at stage i+1 for 3 ≤ j < k ≤ m(i). This intuitive result is critical in terms of reducing the time-complexity of our algorithm.
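A short count shows why "most" is apt. It assumes, as the text implies, that n(i) is the number of unordered taxon pairs; the labels "unaltered" and "affected" are ours.

```latex
n^{(i)} = \binom{m^{(i)}}{2}, \qquad
\text{unaltered} = \binom{m^{(i)}-2}{2}, \qquad
\text{affected} = \binom{m^{(i)}}{2} - \binom{m^{(i)}-2}{2} = 2m^{(i)} - 3 .
% Example with m^{(i)} = 64: n^{(i)} = 2016, unaltered = 1891, affected = 125.
```

So the number of distances involving taxa 1(i) or 2(i) grows only linearly in m(i), while the total number of distances grows quadratically.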
2.11 Building the tree
We have seen above how we may proceed from each stage i to the next, i + 1, by agglomerating taxa 1(i) and 2(i). Of course, by appropriate taxon relabeling, we can agglomerate any two taxa j(i) and k(i) using this methodology. We now turn our attention to consider which pair of taxa to agglomerate.
We begin at stage i = 0 with observed distances d̂(0), weights w(0) and sharing-coefficients s(0). At each generic stage i ≥ 0, we agglomerate that pair of taxa which would minimize the estimated tree-length, according to the neighbor-joining principle of Saitou and Nei (1987). That is, we choose (j(i), k(i)) to minimise:
$$(m^{(i)} - 2)\,\hat d^{(i)}_{jk} - \sum_{l \ne j} \hat d^{(i)}_{jl} - \sum_{l \ne k} \hat d^{(i)}_{kl}, \tag{45}$$
which is the simplified NJ criterion of Studier and Keppler (1988). We then add the LCA of taxa j(i) and k(i) to our tree reconstruction, with edges of length û(i) in the time tree and â(i) in the divergence tree computed from the agglomeration of j(i) and k(i), using the formulae in the foregoing sections (with appropriate taxon relabeling). We then proceed to the next stage, armed only with d̂(i+1), w(i+1) and s(i+1).
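The pair-selection step can be sketched as follows, using the Studier and Keppler form of the criterion in its commonly quoted version, (m − 2)·d_jk − R_j − R_k with R_j the sum of distances from taxon j. The function name `nj_pair` and the example matrix are illustrative; the matrix is additive on the tree ((A:1,B:2):1,(C:3,D:4)), so either cherry attains the minimum.

```python
def nj_pair(d):
    """Return (pair, value) minimising (m-2)*d[j][k] - R[j] - R[k] over j < k.

    d is a symmetric distance matrix (list of lists) with zero diagonal.
    Ties are resolved in favour of the first pair encountered.
    """
    m = len(d)
    row_sums = [sum(row) for row in d]   # R[j]: total distance from taxon j
    best_pair, best_q = None, float("inf")
    for j in range(m):
        for k in range(j + 1, m):
            q = (m - 2) * d[j][k] - row_sums[j] - row_sums[k]
            if q < best_q:
                best_pair, best_q = (j, k), q
    return best_pair, best_q


# Additive distances for taxa A, B, C, D (indices 0..3)
d = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
pair, q = nj_pair(d)
print(pair, q)
```

This sketch scans all pairs, which costs O(m²) per stage once the row sums are known; it does not include the relabeling or the variance-aware estimation of the sections above.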
Thus, from leaves to root, the complete time tree is built entirely from edges of lengths û(i), and the complete divergence tree is built entirely from edges of lengths â(i). After completing both trees, any negative branch length u < 0 in the time tree is removed by setting it to zero and simultaneously adding −u to the sister branch and +u to the parent branch (if any). These local adjustments, which are applied recursively from tips to root, preserve the ultrametric property of the time tree. The same form of adjustment is applied to the divergence tree.
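The adjustment can be sketched on a small tree. The `Node` class and function names below are hypothetical and not drawn from StatTree; the sketch assumes a bifurcating tree in which each node stores the length of the branch to its parent, and it checks ultrametricity through equal root-to-leaf path lengths.

```python
class Node:
    """Tree node; `length` is the branch joining the node to its parent."""

    def __init__(self, name, length=0.0, children=()):
        self.name, self.length, self.parent = name, length, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def fix_negative_branches(node):
    """Post-order pass (tips to root) applying the local adjustment."""
    for child in node.children:
        fix_negative_branches(child)
    parent = node.parent
    if parent is not None and node.length < 0:
        u = node.length
        node.length = 0.0
        for sister in parent.children:
            if sister is not node:
                sister.length -= u        # add -u (> 0) to the sister branch
        if parent.parent is not None:
            parent.length += u            # add +u (< 0) to the parent branch, if any


def root_to_leaf_depths(node, depth=0.0):
    """Map each leaf name to its path length from `node`."""
    if not node.children:
        return {node.name: depth}
    depths = {}
    for child in node.children:
        depths.update(root_to_leaf_depths(child, depth + child.length))
    return depths


# Ultrametric example in which the branch above P is negative
a, b = Node("A", 0.3), Node("B", 0.3)
p = Node("P", -0.1, [a, b])
c = Node("C", 0.2)
q = Node("Q", 0.5, [p, c])
d = Node("D", 0.7)
root = Node("R", 0.0, [q, d])

before = root_to_leaf_depths(root)
fix_negative_branches(root)
after = root_to_leaf_depths(root)
print(before, after)
```

Because a shortened parent branch is visited after its children, any negativity the adjustment pushes upward is handled later in the same pass.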
2.12 Estimating variance parameters
In general, variance parameters σ2 and θ = ν/σ2 will be unknown. Here we propose an iterative scheme for their estimation, starting with initial guesses $\hat\sigma^2_0$, $\hat\theta_0$. Within each iteration h = 1,2,..., using current estimates $\hat\sigma^2_{h-1}$, $\hat\theta_{h-1}$, we proceed through all agglomerative stages i = 0,1,..., m(0) − 2 to recompute the entire time and divergence trees and their associated quantities. These results are then substituted into estimating equations, derived in Appendix A, which may be solved iteratively for σ2 and θ, giving updates $\hat\sigma^2_h$, $\hat\theta_h$. This process of recomputing the tree and updating σ2 and θ is continued until convergence.
2.13 Putting it together
Putting together the methods of the foregoing sections, the complete algorithm is as follows.
1. Initialize $\hat\sigma^2_0$, $\hat\theta_0$, and set h = 0.
2. Set w(0) to the vector of ones and s(0) to the zero vector.
3. For each stage i = 0,1,...,m(0)−2,
   (a) for each possible join (j(i),k(i)), calculate β̂(i,j,k) and Ŵ(i,j,k) via the iterative method described in Section 2.9;
   (b) choose the join (j(i),k(i)) with smallest (45), as described in Section 2.11, and calculate the new edge lengths, d̂(i+1), w(i+1) and s(i+1) accordingly.
4. Calculate $\hat\sigma^2_h$, $\hat\theta_h$, using the estimating equations (A7,A8) derived in Appendix A.
5. If $\hat\sigma^2_h \not\approx \hat\sigma^2_{h-1}$ or $\hat\theta_h \not\approx \hat\theta_{h-1}$, increment h and return to Step 2.
6. Any negative branch lengths in the time or divergence trees are dealt with by the adjustment procedure of Section 2.11.
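The control flow of Steps 1 to 5 can be sketched as an outer fixed-point loop. Everything here is an illustrative skeleton: `build_trees` and `update_parameters` are hypothetical callables standing in for Steps 2 to 3 and Step 4 respectively, the relative tolerance is an assumed convergence test, and the toy stand-ins at the bottom exist only to exercise the loop.

```python
def fit_variance_parameters(build_trees, update_parameters,
                            sigma2_0, theta_0, rel_tol=1e-8, max_iter=500):
    """Alternate tree building and parameter updates until both estimates settle.

    build_trees(sigma2, theta) -> object summarising the fitted trees (hypothetical)
    update_parameters(trees)   -> (sigma2, theta) from estimating equations (hypothetical)
    """
    sigma2, theta = sigma2_0, theta_0
    for h in range(1, max_iter + 1):
        trees = build_trees(sigma2, theta)                 # Steps 2-3
        new_sigma2, new_theta = update_parameters(trees)   # Step 4
        converged = (abs(new_sigma2 - sigma2) <= rel_tol * abs(sigma2)
                     and abs(new_theta - theta) <= rel_tol * abs(theta))
        sigma2, theta = new_sigma2, new_theta
        if converged:                                      # Step 5
            return sigma2, theta, trees, h
    raise RuntimeError("no convergence within max_iter iterations")


# Toy stand-ins with a known fixed point at (2, 4); not a phylogenetic model
def toy_build(sigma2, theta):
    return (sigma2, theta)


def toy_update(trees):
    s, t = trees
    return 0.5 * s + 1.0, 0.5 * t + 2.0


sigma2, theta, _, n_iter = fit_variance_parameters(toy_build, toy_update, 1.0, 1.0)
print(sigma2, theta, n_iter)
```

Passing the two stages in as callables keeps the loop independent of how the trees are actually built, which is the only design point the sketch is meant to show.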
2.14 Consistency
Both NJ and BioNJ are consistent in the sense that the estimated topology will be correct if distances converge to divergences (which would happen, for example, if the length of the multiple sequence alignment from which distances are calculated tends to infinity). The same property holds for our methodology, the proof of which we now sketch.
Suppose distances are arbitrarily close to divergences: d̂(0) ≈ d(0). If we set θ̂0 arbitrarily large at Step 1 of the algorithm in Section 2.13, then the second term in the objective function (24) will be negligible. It can then be shown by induction that at each stage i, we minimise (24) by setting the unknown parameters a(i), d(i+1) equal to their true values under the true topology. Hence the true topology will be recovered at Step 3 of the algorithm. Then at Step 4, we will again obtain arbitrarily large θ̂0, so no changes will occur in the parameter or topology estimates in the next iteration of the algorithm, and Step 5 will confirm convergence.
2.15 Sampling full uncertainty
Although equations (38,39) allow us to evaluate uncertainty in divergence estimates, they do not allow us to assess uncertainty in the estimated tree topology. To better understand the full uncertainty in the phylogenetic reconstruction, we propose the following simple parametric bootstrap procedure.
First, a vector of divergences d(0) between the m(0) extant taxa is read off from the complete estimated divergence tree, as constructed in Section 2.11. Second, a bootstrapped vector of distances d̂(0,boot) is generated by sampling each element independently from a distribution with mean equal to the corresponding element of d(0) and variance σ̂2, for consistency with (16,17). A natural choice for this would be a Gamma distribution, which is preferable to a Gaussian, as it produces only positive distances. However, alternative distributions could be explored to evaluate their impact. Third, using d̂(0,boot) instead of d̂(0), a bootstrapped version of the time-tree, divergence-tree, σ̂2 and ν̂ is estimated using the methods described above. Many bootstrapped versions may be generated by repeating steps two and three. The frequency of different tree topologies, histograms of σ̂2 and ν̂, and any other statistical summaries of interest, may then be computed from the collection of bootstrapped versions.
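The second step can be sketched with the standard library alone. A Gamma distribution with mean μ and variance v has shape μ²/v and scale v/μ; the function names and the example divergences below are illustrative, and the tree re-estimation of the third step is not shown.

```python
import random


def gamma_with_mean_variance(mean, variance, rng):
    """Draw from a Gamma distribution with the given mean and variance."""
    shape = mean * mean / variance   # k = mean^2 / variance
    scale = variance / mean          # theta = variance / mean
    return rng.gammavariate(shape, scale)


def bootstrap_distances(divergences, sigma2, rng):
    """One bootstrapped distance vector: independent draws, common variance sigma2."""
    return [gamma_with_mean_variance(d, sigma2, rng) for d in divergences]


rng = random.Random(1)                 # seeded for reproducibility
divergences = [0.5, 0.8, 1.0]          # illustrative inter-leaf divergences
sample = bootstrap_distances(divergences, 0.01, rng)
print(sample)
```

Repeating the call with the same `rng` yields further independent replicates, each of which would then be passed through the full tree-building procedure.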
3 Simulations
To demonstrate key features of our approach, we present results obtained from StatTree on simulated data. Distance matrices were randomly generated according to two different schemes. In the first, distance matrices were obtained by distorting trees with fixed topology, and then adding noise. This will be referred to as ‘direct simulation’ below. In the second scheme, referred to as ‘sequence-based simulation’, sequence alignments were randomly generated from some fixed trees and distance matrices were then estimated from the alignments. For both schemes, trees were constructed from the distance matrices and compared with the original underlying tree.
3.1 Direct simulation
1. A base ultrametric time tree was selected together with a choice of parameters σ2 and ν. Two different topologies for the time tree, each with m(0) = 64 leaf taxa, were used. These topologies are illustrated in Figure 3, and represent extremes of the kind of branching behaviour that can arise in rooted bifurcating trees. We will refer to these as the balanced and unbalanced topologies.
2. The selected time tree was then distorted, as follows, to obtain a divergence tree and its inter-leaf divergences. The length of each edge on the divergence tree was assigned a value independently sampled from a Gamma distribution with mean equal to the edge length u on the time tree and variance νu, ensuring consistency with equations (6) and (7).
3. A distance matrix was then obtained from the divergence tree and perturbed according to the process described in Section 2.15: each inter-leaf distance was assigned a value independently sampled from a Gamma distribution with mean equal to the corresponding inter-leaf divergence and variance σ2. This process ensures consistency with equations (16) and (17).
Figure 3: Two topologies for the underlying time trees used in the simulations. The tree on the left is referred to as having a balanced topology, while the tree on the right is referred to as unbalanced. These represent two extremes of the pattern of branching that can be exhibited in rooted bifurcating trees. All internal branches in the trees are taken to have equal length, and the trees were scaled so that the maximum distance between two taxa was 1.0. For both topologies, trees with 16 and 64 taxa were used, as described in the text.
Various values of σ2 and ν were used to simulate distance matrices, and these values are specified as follows. The standard deviation σ is best expressed as a proportion of the mean inter-taxon distance in the time tree. For example, if the mean inter-taxon distance in the time tree was δ we might take σ = 0.2 × δ and refer to this as a value of 20% for σ. The two basic time trees shown in Figure 3 have quite different mean inter-taxon distances, and so by specifying σ as a proportion of the mean we ensure that distance matrices are obtained from the different topologies in a comparable way. Similarly, the parameter ν needs to be specified in a way that is consistent between different trees. If time trees with a fixed topology are scaled to have different overall lengths and then distorted to obtain divergence trees, it is necessary to vary ν with the scale of the tree in order to produce a collection of divergence trees with a similar visual appearance and level of distortion. This is achieved by fixing a proportion p, and taking
$$\nu = p^2 \delta,$$
where δ is the mean inter-taxon distance in the time tree. This ensures that when a branch of length u on a time tree is distorted, the ratio of the standard deviation to the branch length is given by
$$\frac{\sqrt{\nu u}}{u} = p\sqrt{\frac{\delta}{u}}.$$
This is a dimensionless quantity that is proportional to p. Fixing ν in this way for different tree topologies, tree scales, and numbers of taxa ensures that the results are comparable. The proportion p is usually expressed as a percentage in what follows, and for brevity we will write ‘ν = 20%’ to mean ν = 0.2² × δ, for example.
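The parameterization of ν and the edge distortion of Step 2 can be sketched as follows, assuming ν = p²δ as implied by the example ‘ν = 20%’. A Gamma distribution with mean u and variance νu has shape u/ν and scale ν. The value of δ and the edge lengths are made up for illustration.

```python
import random


def nu_from_proportion(p, delta):
    """Distortion parameter for proportion p and mean inter-taxon distance delta."""
    return p * p * delta


def distort_edge(u, nu, rng):
    """Gamma draw with mean u and variance nu*u (shape u/nu, scale nu)."""
    return rng.gammavariate(u / nu, nu)


def sd_ratio(u, nu):
    """Standard deviation of the distorted edge divided by its time-tree length."""
    return (nu * u) ** 0.5 / u


delta = 0.6                            # hypothetical mean inter-taxon distance
nu = nu_from_proportion(0.2, delta)    # 'nu = 20%'
rng = random.Random(7)
edge_lengths = [0.1, 0.25, 0.5]        # hypothetical time-tree edges
distorted = [distort_edge(u, nu, rng) for u in edge_lengths]
print(nu, distorted, [sd_ratio(u, nu) for u in edge_lengths])
```

The printed ratios shrink as the edges lengthen, so short edges are distorted proportionally more than long ones for a given p.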
To assess the performance of our algorithm, distance matrices were simulated using the scheme in Steps 1–3 above, and phylogenies were estimated according to the steps specified in Section 2.13, as implemented in our software StatTree. Specifically, values of 5%, 10% and 20% for σ and ν were used in the simulations. For each of the 9 corresponding combinations of parameters, 100 distance matrices were generated using the balanced topology on 64 taxa, and another 100 distance matrices were generated using the unbalanced topology on 64 taxa. Tree construction with estimation of σ2 and ν was then carried out for each of the simulated distance matrices. Means and standard deviations of the estimated parameters across each set of 100 simulations were calculated to assess the accuracy of estimation. In addition, each estimated tree was compared to the true underlying tree via two scores that measure the topological accuracy of the estimated tree. These scores are described in the next paragraph. For each set of parameter values the average scores were calculated for the set of 100 simulations. Trees were also constructed with the BioNJ algorithm and compared to the true tree in the same way. The results are given in Tables 1 and 2.
Two topological scores, the Robinson–Foulds and quartet distances, were used to measure accuracy of estimated trees. The Robinson–Foulds distance (Robinson and Foulds, 1981) is the proportion of bi-partitions that two trees share (each branch in each tree, upon being cut, induces one such bi-partition of the leaves). The Robinson–Foulds distance is very sensitive, in that for certain topologies a single incorrectly positioned taxon can result in a score of zero. For this reason a second score, the quartet distance (Brodal, Fagerberg, and Pedersen, 2004), was also used. The quartet distance between two trees is defined as the proportion of subsets of 4 taxa for which the trees induce the same topology.
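Both scores, as defined above, can be computed by brute force on small trees. The sketch below represents a tree as nested tuples of leaf labels, treats it as unrooted, and counts only non-trivial bi-partitions; normalising the shared count by the larger split set is our assumption, and the quartet loop is the naive O(n⁴) enumeration rather than the fast algorithm of the cited reference.

```python
from itertools import combinations


def leaf_sets(tree, out):
    """Return the leaf set of `tree`, appending every clade's leaf set to `out`."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(sub, out) for sub in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bi-partitions, each stored as the side without a reference leaf."""
    clades = []
    leaves = leaf_sets(tree, clades)
    ref = min(leaves)
    result = set()
    for clade in clades:
        side = leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            result.add(side)
    return leaves, result


def rf_score(t1, t2):
    """Proportion of shared bi-partitions (trees need at least one such split)."""
    _, s1 = splits(t1)
    _, s2 = splits(t2)
    return len(s1 & s2) / max(len(s1), len(s2))


def quartet_topology(quartet, tree_splits):
    """Pairing induced on four leaves, or None if the quartet is unresolved."""
    for side in tree_splits:
        inside = quartet & side
        if len(inside) == 2:
            return frozenset([inside, quartet - inside])
    return None


def quartet_score(t1, t2):
    """Proportion of 4-leaf subsets on which the two trees agree."""
    leaves, s1 = splits(t1)
    _, s2 = splits(t2)
    quartets = [frozenset(q) for q in combinations(sorted(leaves), 4)]
    same = sum(quartet_topology(q, s1) == quartet_topology(q, s2) for q in quartets)
    return same / len(quartets)


true_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
estimate = ((("A", "C"), "B"), ("D", ("E", "F")))   # B and C swapped
print(rf_score(true_tree, estimate), quartet_score(true_tree, estimate))
```

In this example one misplaced taxon removes a third of the shared bi-partitions but only a fifth of the quartets, in line with the remark that the quartet score is the less sensitive of the two.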
An immediate observation from Tables 1 and 2 is that σ is estimated with a relatively high degree of accuracy while the estimates for ν show considerable bias and variability. The distortion parameter ν is generally over-estimated, and the bias is greater for the unbalanced topology. This bias seems not to be directly related to the topological accuracy of tree estimates: descending any of the columns in the tables generally shows an approximately constant level of bias while the topological accuracy decreases. As σ and ν increase, the topological accuracy of the estimated trees decreases. This effect is present for both topologies, though the unbalanced topology is generally harder to estimate.
The topological scores T̄RF, T̄Q for trees built with our method are compared, in Tables 1 and 2, with the scores for trees built using BioNJ, based on the same distance matrices. For the unbalanced topology, the Robinson–Foulds score T̄RF can be very low and the quartet score T̄Q is more useful for making comparisons. Our method generally produces more accurate trees, and the contrast between the two methods increases with σ and ν. Even though BioNJ makes no assumptions about the rate of evolution in each branch of the tree, the topological accuracy of its estimates decreases as the level of distortion in the underlying tree increases. Weighbor was not used for these simulations since it requires a parameter that corresponds to the length of the sequence alignment from which the distance matrix was estimated. No such parameter was available for the direct simulations, but Weighbor was incorporated into the sequence-based simulations described below.

| σ | Method | Statistic | ν = 5% | ν = 10% | ν = 20% |
|---|---|---|---|---|---|
| σ = 5% | StatTree | σ̄ (s.d.) | 5.0 (0.1) | 5.0 (0.1) | 5.0 (0.1) |
| | | ν̄ (s.d.) | 11.1 (2.7) | 14.2 (2.2) | 22.7 (2.4) |
| | | T̄RF, T̄Q | 99.6, 100 | 98.9, 100 | 94.8, 99.6 |
| | BioNJ | T̄RF, T̄Q | 99.6, 100 | 99.1, 100 | 94.7, 99.5 |
| σ = 10% | StatTree | σ̄ (s.d.) | 10.0 (0.2) | 10.0 (0.2) | 10.0 (0.2) |
| | | ν̄ (s.d.) | 12.3 (3.4) | 15.5 (3.2) | 23.6 (3.0) |
| | | T̄RF, T̄Q | 92.1, 99.8 | 90.7, 99.7 | 86.0, 99.1 |
| | BioNJ | T̄RF, T̄Q | 92.1, 99.8 | 90.8, 99.6 | 84.1, 98.2 |
| σ = 20% | StatTree | σ̄ (s.d.) | 19.9 (0.4) | 19.9 (0.4) | 20.1 (0.4) |
| | | ν̄ (s.d.) | 12.7 (2.8) | 15.1 (3.3) | 24.2 (4.3) |
| | | T̄RF, T̄Q | 76.0, 98.4 | 72.0, 97.9 | 65.2, 95.5 |
| | BioNJ | T̄RF, T̄Q | 73.4, 97.4 | 68.7, 95.5 | 61.4, 88.2 |

Table 1: Results from StatTree of direct simulations for an underlying tree with the balanced topology on 64 taxa, for several true values of parameters σ2 and ν. The means σ̄ and ν̄ of the parameter estimates (in units of %) over 100 simulations are given, together with their standard deviations in parentheses. Also given for each case is the average Robinson–Foulds percentage score, T̄RF, and the quartet percentage score T̄Q. For comparison, the average scores are also given for trees estimated from the same distance matrices by BioNJ. In each case, the method with the better score is indicated in bold.
| σ | Method | Statistic | ν = 5% | ν = 10% | ν = 20% |
|---|---|---|---|---|---|
| σ = 5% | StatTree | σ̄ (s.d.) | 5.1 (0.1) | 5.1 (0.1) | 5.0 (0.3) |
| | | ν̄ (s.d.) | 26.1 (14.3) | 29.1 (11.2) | 32.1 (9.8) |
| | | T̄RF, T̄Q | 38.1, 89.7 | 33.2, 87.1 | 23.2, 79.0 |
| | BioNJ | T̄RF, T̄Q | 36.3, 89.2 | 31.8, 86.8 | 22.0, 78.1 |
| σ = 10% | StatTree | σ̄ (s.d.) | 10.3 (0.3) | 10.2 (0.2) | 10.2 (0.4) |
| | | ν̄ (s.d.) | 22.9 (10.9) | 24.2 (9.5) | 28.1 (9.1) |
| | | T̄RF, T̄Q | 12.7, 77.6 | 11.9, 75.1 | 11.2, 68.0 |
| | BioNJ | T̄RF, T̄Q | 12.1, 77.3 | 11.5, 74.9 | 10.5, 66.7 |
| σ = 20% | StatTree | σ̄ (s.d.) | 20.4 (0.5) | 20.4 (0.5) | 20.4 (0.8) |
| | | ν̄ (s.d.) | 22.5 (5.5) | 26.2 (5.7) | 30.5 (7.4) |
| | | T̄RF, T̄Q | 2.0, 57.5 | 2.0, 55.1 | 2.9, 48.9 |
| | BioNJ | T̄RF, T̄Q | 1.7, 57.2 | 2.2, 53.4 | 3.3, 45.2 |
Table 2: Results from StatTree of direct simulations for an underlying tree with the unbalanced topology on 64 taxa, for several true values of parameters σ2 and ν. The means σ̄ and ν̄ of the parameter estimates (in units of %) over 100 simulations are given, together with their standard deviations in parentheses. Also given for each case is the average Robinson–Foulds percentage score, T̄RF, and the quartet percentage score T̄Q. For comparison, the average scores are also given for trees estimated from the same distance matrices by BioNJ. In each case, the method with the better score is indicated in bold.
3.2 Sequence-based simulations
Our algorithm was also evaluated using distance matrices estimated from sequence alignments, using a scheme very similar to that used to evaluate BioNJ by Gascuel (1997). Distance matrices were randomly generated in the following way.
1. A base tree of divergences was selected. Two different trees were used: one with a balanced topology and the other unbalanced, as illustrated in Figure 4 below. These were obtained by randomly distorting ultrametric balanced and unbalanced trees on 16 taxa with ν = 10%.
2. DNA sequence alignments were generated from the base tree using the program SeqGen (Rambaut and Grassly, 1997) according to the scheme specified by Gascuel (1997). Kimura’s two-parameter model of substitution (Kimura, 1980) was used with a transition/transversion ratio of 2.0. Standard parameter values were used in SeqGen to specify rate variation between sites. Sequence alignments with 300 and 600 sites were simulated. In addition, three different scales for the underlying base tree were used: low, medium and high substitution rates with (respectively) a maximum of 0.1, 0.5 and 1.0 substitutions per site.
3. Distance matrices were estimated from each alignment using the program DnaDist (part of Phylip), using the same parameter values as SeqGen, although the transition/transversion ratio was estimated each time.
One hundred distance matrices were generated in this way for each of the two base trees, three scale factors, and two sequence lengths, totaling 12 different simulation conditions. Trees were constructed for each distance matrix using our algorithm (as specified in Section 2.13), BioNJ, and Weighbor. The Robinson–Foulds score and quartet score were calculated to evaluate the topological accuracy of the estimated trees. The accuracy of estimated branch lengths was measured using the following statistic:
where d̂jk is the estimated divergence between taxa j,k and djk is the true divergence. Table 3 shows the results.
Table 3 shows that all three algorithms are broadly comparable and generally achieve a high degree of topological accuracy. The best topological scores and branch-length scores occur for the longer sequences and the balanced trees, where BioNJ performs slightly better than StatTree. The more difficult scenarios, where the sequences are short and the rate of substitution is low or the topology unbalanced, tend to favor StatTree. Taking these results together with those of Section 3.1, it appears that StatTree offers some advantages in harder inference problems where the signal-to-noise ratio in the distance matrix is low.
4 Application to HIV evolution
Rapidly evolving viral species, such as influenza and HIV, have become important laboratory tools for studying evolution in real time, providing a test-bed for evolutionary theories and inference methodologies. The high genetic diversity of HIV results from about 105 mutations per day per infected individual. Longitudinal clinical studies can provide information on within-patient HIV evolution under varying treatment regimes, with potential to reveal genetic bottlenecks and resurgence of rare wild-type variants post-therapy. Here we reanalyze data from a recent study by Salazar-Gonzalez, Salazar, Keele, Learn, Giorgi, Li, et al. (2009), in which a total of 121 full-length HIV-1 genome sequences were obtained from samples taken longitudinally in 12 infected individuals. The authors reported a rooted, multi-furcating, maximum-likelihood phylogeny of these sequences, based on a bespoke mathematical model of HIV sequence evolution (Keele, Giorgi, Salazar-Gonzalez, Decker, Pham, Salazar, et al., 2008).
Figure 4: Balanced and unbalanced trees used for sequence-based simulations.
It is of interest to determine not only the tree describing the evolution of these sampled HIV genomes, but also the physical times of their diversification. For these tasks our methodology, producing both a divergence tree and its underlying time tree, is directly relevant. Our reanalysis of these data was based on a distance matrix for all 121 genomes, constructed using program DnaDist of Phylip assuming Kimura’s two-parameter model of base substitution (Kimura, 1980). The results from StatTree are shown in Figure 5. Similar results (not shown) were obtained with alternative substitution models. Within-subject branch lengths were all, with one exception, very short, and so for concision are not shown. The one exception is the subject identified in the figure as ZM.247, for whom two viral clades, ZM.247a and ZM.247b, are shown.
As in the original study (Salazar-Gonzalez et al., 2009), the divergence tree in Figure 5b demonstrates clear between-subject differences in clonal lineages, the generally short internal branches suggesting only small differences between their common ancestors. The major exception concerns the three Zambian subjects, whose viral strains are quite distinct from those of the USA subjects. The clear separation between the two viral subclades of subject ZM.247 is interpreted by Salazar-Gonzalez et al. (2009) as evidence of simultaneous infection from an individual carrying both strains. The major difference between our analysis and that of Salazar-Gonzalez et al. (2009) is that in the latter the tree is rooted between the Zambian patients and the rest.

| Scale | Method | Length 300, Balanced | Length 300, Unbalanced | Length 600, Balanced | Length 600, Unbalanced |
|---|---|---|---|---|---|
| | | T̄RF, T̄Q, B̄ | T̄RF, T̄Q, B̄ | T̄RF, T̄Q, B̄ | T̄RF, T̄Q, B̄ |
| Low | StatTree | 94.8, 94.1, 76.3 | 62.8, 77.3, 61.8 | 98.8, 98.5, 77.2 | 78.1, 88.0, 61.5 |
| | BioNJ | 95.3, 93.7, 77.8 | 62.8, 76.9, 64.3 | 99.2, 98.6, 77.7 | 78.8, 88.5, 64.2 |
| | Weighbor | 95.2, 93.7, 77.8 | 59.3, 74.0, 64.3 | 98.8, 97.9, 77.7 | 76.9, 86.8, 64.2 |
| Medium | StatTree | 99.2, 99.2, 21.0 | 73.2, 84.2, 14.9 | 100, 100, 21.5 | 85.3, 92.0, 14.5 |
| | BioNJ | 99.5, 99.3, 22.3 | 73.3, 84.7, 18.1 | 100, 100, 22.6 | 85.7, 92.1, 18.5 |
| | Weighbor | 99.2, 98.9, 22.4 | 70.7, 82.4, 18.1 | 100, 100, 22.6 | 85.2, 91.8, 18.5 |
| High | StatTree | 98.6, 97.8, 2.8 | 72.3, 83.7, 22.9 | 100, 99.9, 1.8 | 82.8, 90.8, 15.5 |
| | BioNJ | 98.8, 97.9, 2.1 | 71.8, 83.1, 2.2 | 100, 99.9, 1.2 | 83.6, 91.4, 1.1 |
| | Weighbor | 98.6, 97.5, 2.1 | 70.2, 81.7, 2.3 | 100, 99.9, 1.2 | 82.5, 90.7, 1.1 |

Table 3: Results of sequence-based simulations generated under balanced and unbalanced tree topologies. For each tree, 100 distance matrices were generated for each combination of two different sequence lengths (300 and 600 bp) and three different scales (corresponding to maxima of 0.1, 0.5 and 1.0 substitutions per site). The table reports the average of the Robinson–Foulds percentage score T̄RF, quartet percentage score T̄Q, and branch length score B̄, for each of StatTree, BioNJ and Weighbor. In each case, the method with the better score is indicated in bold.
The corresponding time tree in Figure 5a estimates the times of diversification of these viral lineages. This time tree suggests relatively long periods of shared HIV ancestry between some of the USA patients: an insight not available from the divergence tree. Moreover, the relatively short estimated time since divergence of the two subclades of subject ZM.247 might call into question the authors’ explanation for their existence. These observations underline the potential utility of estimating time-trees.
Figure 5: (a) Time and (b) divergence trees of 121 HIV-1 genomes sampled from 12 subjects studied by Salazar-Gonzalez et al. (2009), estimated by StatTree. Within-subject viral clades are shown as tips. Tip-label prefixes denote subjects’ geographical location (AL = Alabama; NC = North Carolina; NY = New York; ZM = Zambia).
5 Discussion
We have developed a variance-components model for phylogenetic reconstruction from distance matrices, incorporating several novel features. Two components of variance are distinguished and incorporated: noise (as modeled by NJ) reflecting the lack-of-fit of distances to an underlying divergence tree, and distortion (as modeled by BioNJ) reflecting the departure of this divergence tree from its underlying ultrametric time tree. Unlike previous methods, our approach exploits the coherence of both trees, simultaneously estimating them through a computationally efficient procedure which recursively descends through the phylogeny. Variances and covariances of distances and divergences are propagated from one agglomerative stage to the next, and taken into account in building the tree.
Focussing only on means, variances and covariances of distances, our methodology avoids the need for full parametric assumptions about evolutionary and other processes involved. The advantage is that inferences do not depend on the validity of substitution models, thus admitting the possibility of straightforwardly incorporating non-standard sources of phylogenetic information other than sequence data.
As shown in Section 3, even small phylogenies contain enough data to estimate the noise component σ accurately, but information on the distortion component ν is weak and estimates tend to be positively biased. Our simulation results show that StatTree performs comparably with Weighbor and BioNJ, with StatTree slightly outperforming BioNJ when distance matrices have a low signal-to-noise ratio, such as when sequences and branch lengths are short and the topology unbalanced.
There is scope for further development of the methodology, in particular for a more robust estimation of ν. Generalizations of the methodology might allow for non-binary trees and for different amounts of noise in different parts of the tree. A more substantial advance would be required to accommodate evolution in the rate of evolution (Thorne et al., 1998, Kishino et al., 2001), which induces correlations in divergence between non-overlapping branches.
Software and applications
We have implemented the methodology in StatTree, which may be freely downloaded from http://www.mas.ncl.ac.uk/~ntmwn/stattree. Illustrative applications of the methodology to bacterial, viral and mitochondrial datasets can be found at http://www.cl.cam.ac.uk/~pl219/GilksNyeLioSAGMB.html.
Appendix A Estimating equations for σ2 and ν
Our estimating equations for σ2 and ν are derived from a global objective function MG, given in (A1), comprising the marginal objective function in (30) summed over all stages i, where the dependence on θ is through W(i) given in (19). The estimated marginal residuals ε̂*(i) in (A1) are obtained from (27), for the optimal taxon-pair agglomeration (j(i),k(i)), as described in Section 2.11.
We construct our estimating equations by setting partial derivatives of (A1) equal to their expectations (see, for example, Crowder (2001)). These partial derivatives, (A3,A4), are obtained using (19), with tr denoting trace and the residuals {ε̂*(i)} considered fixed.
Now, it can be shown that the expected outer product of the residuals ε̂*(i) is σ2P(i)W(i), where P(i) is the matrix projecting into the residual space of the marginal regression model (27), with X(i) = [A(i)S(i)^{1/2}, B(i)]. Taking expectations in (A3,A4) therefore gives (A5,A6). Equating observed (A3,A4) and expected (A5,A6) partial derivatives, we obtain the estimating equations (A7,A8), which may be solved iteratively for σ2 and θ.
Appendix B Efficient computation
Using equations (31–40,A7,A8) directly would be computationally prohibitive. However, the algebraic structure of the vectors and matrices involved admit substantial computational savings, as we now describe.
The key result, in Lemma B.1 below, is an efficient formula for computing the matrix required in (31,32), thus avoiding the double inversion of large matrices that would otherwise be involved. Lemma B.3 then shows that the V(i) matrices stay within the class defined by (43). Lemmas B.1–B.3 open the way for efficient computation of equations (31–40,A7,A8).
We first introduce the following scalars, notationally suppressing their dependence on i:
Then we have the following results:
Lemma B.1 If W(i) satisfies (19) and V(i) satisfies (43), then equations (B1) and (B2) hold. The diagonal matrices appearing in these equations, among them τD and ζD, are the diagonalizations of (44), of the vector (τ1,τ2,τ3,...) and of a third vector, respectively.
Proof Equation (B1) follows immediately from (19,43). Using the Sherman–Morrison–Woodbury matrix-inversion identity (see, for example, Golub and Van Loan (1996)), we obtain an expression in which eD is the diagonalization of (τ1r1, τ2r2,...). Elementary manipulations eventually lead to a further expression in which ẽD is the diagonalization of (h,τ3r3,...) and w(i+1) is given by (41). Equation (B2) may then be obtained by again applying the Sherman–Morrison–Woodbury identity, but this time to (B3) and in the reverse direction. □
Lemma B.2 With the assumptions of Lemma B.1, expression (40) simplifies to an expression in the diagonalizations of p(i+1) and s(i+1), given by (44) and (42), where the scalar appearing in (42) is defined by equation (B5).
Proof The result may be shown through elementary but extremely lengthy and laborious manipulations, using the results of Lemma B.1. □
Lemma B.3 Expression (43) for V(i) holds for all i = 0,1,...,m(0).
Proof By induction from Lemma B.2, noting from (15) that (43) holds for i = 0. □
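The kind of saving that the Sherman–Morrison–Woodbury identity offers can be seen in its simplest, rank-one case. The sketch below is a generic illustration, not the derivation of Lemma B.1: it solves (D + uuᵀ)x = b for a diagonal D in linear time, whereas a direct solve would be cubic. The function name and test values are illustrative.

```python
def solve_diag_plus_rank_one(d, u, b):
    """Solve (diag(d) + u u^T) x = b in O(n) using the Sherman-Morrison identity.

    (D + u u^T)^{-1} b = D^{-1} b - D^{-1} u (u^T D^{-1} b) / (1 + u^T D^{-1} u)
    Requires nonzero d[i] and a nonzero denominator.
    """
    dinv_b = [bi / di for bi, di in zip(b, d)]
    dinv_u = [ui / di for ui, di in zip(u, d)]
    denom = 1.0 + sum(ui * zi for ui, zi in zip(u, dinv_u))
    coeff = sum(ui * yi for ui, yi in zip(u, dinv_b)) / denom
    return [yi - coeff * zi for yi, zi in zip(dinv_b, dinv_u)]


d = [2.0, 3.0, 4.0]
u = [1.0, 1.0, 1.0]
b = [1.0, 2.0, 3.0]
x = solve_diag_plus_rank_one(d, u, b)

# Check by multiplying back: (D + u u^T) x should reproduce b
ux = sum(ui * xi for ui, xi in zip(u, x))
residual = [di * xi + ui * ux - bi for di, xi, ui, bi in zip(d, x, u, b)]
print(x, residual)
```

The same idea extends to low-rank corrections of higher rank, where only a small matrix of the correction's rank has to be inverted.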
Contributor Information
Walter R Gilks, University of Leeds and Rothamsted Research.
Tom M.W. Nye, University of Newcastle upon Tyne.
Pietro Lio, University of Cambridge.
References
- Brodal, G., R. Faberberg, and P. C.N.S. (2004). Computing the quartet distance between evolutionary trees in time O(nlog2n). Algorithmica 38(2), 377–395. [Google Scholar]
- Bruno, W., N. Socci, and A. Halpern (2000). Weighted Neighbour Joining: a likelihood-based approach to distance-based phylogeny reconstruction. Molecular Biology and Evolution 17, 189–197. [DOI] [PubMed] [Google Scholar]
- Bulmer, M. (1991). Use of the method of generalized least squ ares in reconstructing phylogenies from sequence data. Molecular Biology and Evolution 8, 868–883. [Google Scholar]
- Chakraborty, R. (1977). Estimation of time of divergence from phylogenetic studies. Canadian Journal of Genetics and Cytology 19, 217–223. [DOI] [PubMed] [Google Scholar]
- Crowder, M. (2001). On repeated measures analysis with misspecified covariance structure. Journal of the Royal Statistical Society, Series B 63, 55–62. [Google Scholar]
- Desper, R. and O. Gascuel (2004). Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted leastsquares tree fitting. Molecular Biology and Evolution 21(3), 587–598. [DOI] [PubMed] [Google Scholar]
- Felsenstein, J. (1987). Estimation of hominoid phylogeny from a DNA hybridization data set. Journal of Molecular Evolution 26, 123–131.
- Felsenstein, J. (2004). Inferring Phylogenies. Massachusetts: Sinauer Associates, Inc.
- Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14, 685–695.
- Gascuel, O. (2000). Data model and classification by trees: the minimum variance reduction (MVR) method. Journal of Classification 17, 67–99.
- Golub, G. and C. Van Loan (1996). Matrix Computations (3rd ed.). Baltimore: The Johns Hopkins University Press.
- Hasegawa, M., H. Kishino, and T. Yano (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22, 160–174.
- Jukes, T. and C. Cantor (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian Protein Metabolism, Volume III, pp. 21–132. New York: Academic Press.
- Keele, B., E. Giorgi, J. Salazar-Gonzalez, J. Decker, K. Pham, M. Salazar, et al. (2008). Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc. Natl. Acad. Sci. USA 105, 7552–7557.
- Kimura, M. (1980). A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16, 111–120.
- Kishino, H., J. Thorne, and W. Bruno (2001). Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution 18, 352–361.
- Lanave, C., G. Preparata, C. Saccone, and G. Serio (1984). A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20, 86–93.
- Mardia, K., J. Kent, and J. Bibby (1979). Multivariate Analysis. New York: Academic Press.
- Nei, M., J. Stephens, and N. Saitou (1985). Methods for computing the standard errors of branching points in an evolutionary tree and their applications to molecular data from human and apes. Molecular Biology and Evolution 2, 66–85.
- Rambaut, A. and N. Grassly (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences 13(3), 235–238.
- Robinson, D. and L. Foulds (1981). Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147.
- Saitou, N. and M. Nei (1987). The neighbour-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425.
- Salazar-Gonzalez, J., M. Salazar, B. Keele, G. Learn, E. Giorgi, H. Li, et al. (2009). Genetic identity, biological phenotype, and evolutionary pathways of transmitted/founder viruses in acute and early HIV-1 infection. Journal of Experimental Medicine 206(6), 1273–1289.
- Studier, J. and K. Keppler (1988). A note on the neighbor-joining method of Saitou and Nei. Molecular Biology and Evolution 5, 729–731.
- Susko, E. (2003). Confidence regions and hypothesis tests using generalized least squares. Molecular Biology and Evolution 20, 862–868.
- Thorne, J., H. Kishino, and I. Painter (1998). Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution 15, 1647–1657.
- Wang, L.-S. and T. Warnow (2005). Distance-based genome rearrangement phylogeny. In O. Gascuel (Ed.), Mathematics of Evolution & Phylogeny, Chapter 13, pp. 353–383. Oxford University Press.
- Yang, Z. (1993). Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution 10, 1396–1401.
- Zwickl, D. and D. Hillis (2002). Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology 51(4), 588–598.