Proceedings of the National Academy of Sciences of the United States of America. 1998 May 26;95(11):5899–5905. doi: 10.1073/pnas.95.11.5899

Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution

Xun Gu, Wen-Hsiung Li
PMCID: PMC34493  PMID: 9600890

Abstract

Estimation of evolutionary distances has always been a major issue in the study of molecular evolution because evolutionary distances are required for estimating the rate of evolution in a gene, the divergence dates between genes or organisms, and the relationships among genes or organisms. Other closely related issues are the estimation of the pattern of nucleotide substitution, the estimation of the degree of rate variation among sites in a DNA sequence, and statistical testing of the molecular clock hypothesis. Mathematical treatments of these problems are considerably simplified by the assumption of a stationary process in which the nucleotide compositions of the sequences under study have remained approximately constant over time, and there now exist fairly extensive studies of stationary models of nucleotide substitution, although some problems remain to be solved. Nonstationary models are much more complex, but significant progress has been recently made by the development of the paralinear and LogDet distances. This paper reviews recent studies on the above issues and reports results on correcting the estimation bias of evolutionary distances, the estimation of the pattern of nucleotide substitution, and the estimation of rate variation among the sites in a sequence.

Keywords: substitution models, rate heterogeneity, stationarity, nonstationarity, estimation bias


Evolutionary distances (usually designated by d) such as the number of nucleotide substitutions between two DNA sequences (K) are basic quantities in the study of molecular evolution because they are required for computing the rate of evolution in a DNA or protein sequence, for inferring the evolutionary relationships among genes or organisms, and for estimating the divergence dates between taxa or genes (1–9). For these purposes, however, it is essential to obtain reliable estimates of evolutionary distances. Indeed, if the evolutionary distances are not accurately estimated, all distance matrix methods of tree reconstruction may be misleading (5, 6, 8). Because accurate estimation of evolutionary distances requires a realistic model of nucleotide substitution, much effort has been made to develop general models of nucleotide substitution (4, 8).

If the process of nucleotide substitution is stationary, i.e., if the nucleotide compositions of the sequences under study have been approximately constant over time, then fairly general models of nucleotide substitution can be developed. For the stationary, time-reversible model (the SR model), Lanave et al. (10), Gu and Li (11), and others (12–14) have developed methods for estimating K. This model includes many other models as special cases (see below). Moreover, Gu and Li (11) have recently extended the SR model to include rate variation among sites, i.e., the SRV model, in which SRV stands for stationary, time-reversible, and rate-variable.

When nucleotide frequencies change with time so that stationarity does not hold, phylogenetic reconstruction using distances estimated under a stationary model can be misleading because it tends to group together sequences of similar nucleotide compositions irrespective of their true evolutionary relationships (15–18). Nonstationarity greatly complicates the mathematics. Fortunately, significant progress has been made with the development of the paralinear (19) and LogDet distances (17, 20). However, both methods assume a uniform rate among sites, and so methods for dealing with rate heterogeneity remain to be developed.

An issue related to the estimation of evolutionary distances is the estimation of the pattern of nucleotide substitution. This pattern can be reliably estimated under stationarity (21–23) but is difficult to estimate under nonstationarity. Another problem closely related to distance estimation is how to estimate the degree of rate variation among sites (24–29). Many methods have been proposed for this purpose under a specific distribution (e.g., a gamma distribution). However, how to estimate rate heterogeneity without assuming a specific distribution has been unclear (30). These issues will be considered in this paper.

A further issue is that estimation bias usually occurs when the sequence length is short so that stochastic effects are strong. Although the bias tends to become trivial as the sequence length increases, it is desirable to correct the bias because in practice many sequences studied are actually very short (31, 32).

The purpose of this article is to review recent studies on the above issues and to present our results.

Stationary Models

The SR Model.

Assume that nucleotide substitution follows a stationary Markov process (10–14). Denote A, G, T, and C as 1, 2, 3, and 4, respectively. Let R be the rate matrix whose ij-th element r_ij is the rate of change from nucleotide i to nucleotide j (i ≠ j; i, j = 1, 2, 3, 4); the diagonal elements are given by r_ii = −Σ_{j≠i} r_ij. Then the matrix of transition probabilities P for t time units is given by P = e^{Rt}, where the ij-th element P_ij is the probability of transition from nucleotide i to nucleotide j after t evolutionary time units.

The substitution process is reversible in time if and only if π_i r_ij = π_j r_ji, where π_i is the equilibrium frequency of nucleotide i. The preceding relation implies that the off-diagonal elements of R can be expressed as

graphic file with name M1.gif

Therefore, the SR model is a nine-parameter model and includes many models as special cases, e.g., the models of Jukes and Cantor (33), Kimura (34), Tajima and Nei (35), Hasegawa et al. (21), and Tamura and Nei (22). The SR model has been studied by many authors (10–14, 23, 36).
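To make the parameter count concrete, the following Python sketch builds a rate matrix that satisfies the reversibility condition and then checks it numerically. It assumes the usual general time-reversible parameterization r_ij = π_j s_ij with s_ij = s_ji, which may differ in notation from the expression used in the text; the function names and the numerical values are hypothetical. Six symmetric terms s_ij plus three free frequencies give the nine parameters.

```python
def build_reversible_rate_matrix(pi, s):
    """Return R with r_ij = pi_j * s_ij (i != j) and rows summing to zero.

    pi: equilibrium frequencies (assumed to sum to 1)
    s:  symmetric matrix of exchangeability terms (diagonal ignored)
    """
    n = len(pi)
    r = [[pi[j] * s[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        r[i][i] = -sum(r[i])  # r_ii = -sum_{j != i} r_ij
    return r


def is_time_reversible(pi, r, tol=1e-12):
    """Check the detailed-balance condition pi_i r_ij = pi_j r_ji for all i, j."""
    n = len(pi)
    return all(
        abs(pi[i] * r[i][j] - pi[j] * r[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Illustrative values only: order A, G, T, C as in the text.
pi = [0.1, 0.2, 0.3, 0.4]
s = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 4.0, 5.0],
    [2.0, 4.0, 0.0, 6.0],
    [3.0, 5.0, 6.0, 0.0],
]
R = build_reversible_rate_matrix(pi, s)
print(is_time_reversible(pi, R))
```

Making s asymmetric breaks the check, which is one way to construct a time-irreversible test case.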

Consider two sequences (designated by 1 and 2) that have evolved from O, their common ancestor, t time units ago (Fig. 1). Under stationarity, time-reversibility means that the substitution process from the common ancestor O to sequences 1 and 2 is equivalent to the substitution process from 1 through O to 2 (or from 2 through O to 1), whose transition probability matrix for 2t time units is given by

\[ P = e^{2tR} \tag{1} \]

Let λ_k (k = 1, 2, 3, 4) be the k-th eigenvalue of the rate matrix R; one of them is zero, say λ_4 = 0. Let z_k be the k-th eigenvalue of P. Eq. 1 implies z_k = e^{2tλ_k}. Gu and Li (11) showed that the evolutionary distance defined by the average number of substitutions per site (i.e., K = −2t Σ_{i=1}^{4} π_i r_ii) is given by

\[ d = -\sum_{k=1}^{3} c_k \ln z_k \tag{2} \]

where constants ck are determined by the eigenmatrix of P. Eq. 2 is generally valid since all eigenvalues zk are real under the SR model (11, 37). For example, under the Jukes-Cantor model (33), z1 = z2 = z3 = 1 − 4p/3 and c1 = c2 = c3 = 1/4 so that Eq. 2 is reduced to d = −(3/4)ln(1 − 4p/3), where p is the proportion of nucleotide differences between the two sequences.
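The Jukes-Cantor reduction can be checked numerically. The sketch below takes Eq. 2 to be d = −Σ c_k ln z_k over the three non-unit eigenvalues, the form that reproduces the Jukes-Cantor expression quoted above; the function names are hypothetical.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3) for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the logarithm to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_from_eigenvalues(z, c):
    """General form d = -sum_k c_k ln z_k over the non-unit eigenvalues z_k."""
    return -sum(ck * math.log(zk) for zk, ck in zip(z, c))


p = 0.3
z = [1.0 - 4.0 * p / 3.0] * 3  # z1 = z2 = z3 under Jukes-Cantor
c = [0.25] * 3                 # c1 = c2 = c3 = 1/4
print(jc_distance(p), distance_from_eigenvalues(z, c))  # the two values agree
```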

Figure 1. Two DNA sequences diverged t time units ago.

The SR distance can be estimated from the data matrix J, whose ij-th element (J_ij) is the frequency of sites at which the nucleotides in the two sequences are i and j, respectively. By time-reversibility, we have J_ij = π_i P_ij. Therefore, the ij-th element of P (for 2t time units) can be estimated by P̂_ij = J_ij/π_i (i, j = 1, … , 4), where π_i and J_ij are easily obtained from the sequence data. Let matrix P̂ consist of P̂_ij. Its eigenvalues ẑ_k (k = 1, 2, 3) can be computed by a standard algorithm, and the constants are given by c_k = −Σ_{i=1}^{4} Σ_{j≠i} π_i u_ik v_kj (k = 1, 2, 3), where u_ik and v_kj are the elements of the corresponding eigenmatrix U and its inverse matrix V, respectively. For details, see Saccone et al. (38), Gu and Li (11), and Li and Gu (39). The sampling variance of d and the variance-covariance matrix for more than two DNA sequences can be found in Gu and Li (11).
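The estimation steps above can be sketched in plain Python. This is an illustration rather than the authors' implementation, and it makes two substitutions that should be read as assumptions: the data matrix is symmetrized before use (so that the estimated frequencies agree between the two sequences), and the logarithm of P̂ is obtained from the power series ln P = −Σ_{n≥1} (I − P)^n / n instead of an eigen-decomposition. With K defined as −Σ_i π_i (2tR)_ii and 2tR = ln P, the distance is then −Σ_i π_i (ln P̂)_ii. All function names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_log_series(p, tol=1e-13, max_terms=10000):
    """ln P = -sum_{n>=1} (I - P)^n / n; converges when all eigenvalues of P lie in (0, 2)."""
    n = len(p)
    m = [[(1.0 if i == j else 0.0) - p[i][j] for j in range(n)] for i in range(n)]
    power = [row[:] for row in m]  # holds (I - P)^term
    log_p = [[0.0] * n for _ in range(n)]
    for term in range(1, max_terms + 1):
        for i in range(n):
            for j in range(n):
                log_p[i][j] -= power[i][j] / term
        if max(abs(x) for row in power for x in row) / term < tol:
            return log_p
        power = mat_mul(power, m)
    raise ArithmeticError("series did not converge; the sequences may be too divergent")


def sr_distance(j):
    """Estimate K from a 4x4 matrix of site-pattern frequencies (entries sum to 1)."""
    n = len(j)
    sym = [[0.5 * (j[a][b] + j[b][a]) for b in range(n)] for a in range(n)]  # assumption
    pi = [sum(row) for row in sym]                                 # estimated frequencies
    p_hat = [[sym[a][b] / pi[a] for b in range(n)] for a in range(n)]  # P_ij = J_ij / pi_i
    log_p = mat_log_series(p_hat)                                  # estimate of 2tR
    return -sum(pi[a] * log_p[a][a] for a in range(n))


# Jukes-Cantor-like data with 30% differing sites: 0.175 on the diagonal, 0.025 elsewhere.
p = 0.3
J = [[(1 - p) / 4 if a == b else p / 12 for b in range(4)] for a in range(4)]
print(sr_distance(J))  # matches -(3/4) ln(1 - 4p/3)
```

The series converges slowly when an eigenvalue of P̂ is close to zero, which is also the regime in which the distance itself has a large sampling variance.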

Eq. 2 can be used to define many additive distances by choosing appropriate constants ck (Table 1), e.g., the number of nucleotide substitutions per site (K), the number of transitional substitutions per site (A), the number of transversional substitutions per site (B), and the number of substitutions from nucleotides i to j (Dij). These distance measures are useful for phylogenetic analysis and molecular clock testing.

Table 1. The constants c_k in the general SR or SRV distance

Distance   c_k (k = 1, 2, 3)
K          −Σ_{i=1}^{4} Σ_{j≠i} π_i u_ik v_kj
A          −Σ_{i=1}^{4} Σ_{j≠i∈Ts} π_i u_ik v_kj
B          −Σ_{i=1}^{4} Σ_{j≠i∈Tv} π_i u_ik v_kj
D_ij       −π_i u_ik v_kj

K is the number of substitutions per site; A is the number of transitional substitutions per site; B is the number of transversional substitutions per site; and D_ij is the number of substitutions from nucleotides i to j per site. The subscripts j≠i∈Ts and j≠i∈Tv mean that the differences between nucleotides i and j are transitional and transversional, respectively.

The SRV Model.

Rate variation among sites can be incorporated into the SR model by assuming r_ij = a_ij u, where a_ij is a constant and u varies according to a gamma distribution

\[ f(u) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, u^{\alpha-1} e^{-\beta u} \tag{3} \]

with mean ū = α/β; α is the shape parameter and determines the degree of rate variation. Under this model, the (mean) transition probability matrix P for 2t time units is given by

\[ P = \left[ I - \frac{2t\bar{R}}{\alpha} \right]^{-\alpha} \tag{4} \]

where I is the identity matrix and the mean rate matrix R̄ = ūA, where matrix A consists of a_ij (11). From Eq. 4, one can show that the k-th eigenvalue of P is given by

\[ z_k = \left( 1 - \frac{2\lambda_k t}{\alpha} \right)^{-\alpha} \tag{5} \]

where λ_k is the k-th eigenvalue of R̄. It follows that the evolutionary distance under the SRV model is given by

\[ d = \alpha \sum_{k=1}^{3} c_k \left( z_k^{-1/\alpha} - 1 \right) \tag{6} \]

The constants ck are determined in the same manner as above (Table 1). Note that Eq. 4 reduces to Eq. 1 and Eq. 6 to Eq. 2 as α → ∞, i.e., the substitution rate is uniform among sites.
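The limiting behaviour can be checked numerically. The sketch below assumes that each eigenvalue satisfies z_k = (1 − 2λ_k t/α)^(−α), so that the gamma-corrected distance is d = α Σ c_k (z_k^(−1/α) − 1), and compares it with the uniform-rate form d = −Σ c_k ln z_k. The function names and the Jukes-Cantor constants used for the example are illustrative.

```python
import math


def srv_distance(z, c, alpha):
    """Gamma-corrected distance d = alpha * sum_k c_k (z_k^(-1/alpha) - 1)."""
    if alpha <= 0.0:
        raise ValueError("the gamma shape parameter alpha must be positive")
    return alpha * sum(ck * (zk ** (-1.0 / alpha) - 1.0) for zk, ck in zip(z, c))


def sr_distance_from_eigenvalues(z, c):
    """Uniform-rate distance d = -sum_k c_k ln z_k."""
    return -sum(ck * math.log(zk) for zk, ck in zip(z, c))


z = [0.6] * 3   # eigenvalues for 30% observed differences under Jukes-Cantor
c = [0.25] * 3
for alpha in (0.5, 1.0, 2.0, 1e6):
    print(alpha, srv_distance(z, c, alpha))
print("uniform", sr_distance_from_eigenvalues(z, c))
```

For the same eigenvalues, a smaller α gives a larger distance, in line with the observation that ignoring rate variation underestimates the distance.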

Furthermore, Eq. 6 can be generalized to any distribution f(u) for the rate variation among sites. Let G(s) = ∫_0^∞ e^{su} f(u) du be the moment-generating function of f(u). Gu and Li (11) showed that z_k = G(2λ_k t), k = 1, 2, 3, 4. Thus, the general additive distance is given by

\[ d = -\sum_{k=1}^{3} c_k\, G^{-1}(z_k) \tag{7} \]

where G^{−1} is the inverse function of the moment-generating function G. For example, consider the invariant + gamma model (26, 40, 41): (i) for a given site, the probability of being invariable (i.e., u = 0) is θ, whereas the probability of being variable is 1 − θ; and (ii) among the sites that are variable, the substitution rate follows a gamma distribution. By applying Eq. 7, one can show that the evolutionary distance under the invariant + gamma distribution is given by

graphic file with name M9.gif 8

For other distributions, see Waddell et al. (30).

Bias-Corrected SR and SRV Distances.

Our computer simulation has shown that when the sequence length is short the SR and SRV methods tend to overestimate the evolutionary distance. The bias can be corrected as follows.

Let d̂ be an estimate of the SR or SRV distance. We use the first three terms of the Taylor expansion to obtain an approximate expression of E[d̂]. For the SR model,

graphic file with name M10.gif 9

Therefore, the bias-corrected SR distance is given by

graphic file with name M11.gif 10

where δ is defined as

graphic file with name M12.gif 11

and Var(ẑk) can be obtained by the method of Gu and Li (11).

The bias-corrected distance under the SRV model also can be written as Eq. 10, except that δ is replaced by

graphic file with name M13.gif 12

Computer Simulation.

Extensive computer simulations on the performance of the SR and SRV methods have been conducted in this study and in Rodriguez et al. (14), Zharkikh (31), and Gu and Li (11). The results can be summarized as follows.

(i) When the sequence length (L) is long and the rate of substitution is uniform among sites, the SR method performs well, whereas simpler methods [e.g., Kimura’s two-parameter method (34)] give biased estimates if some assumptions of the method are violated (11, 14, 31). Because the actual substitution pattern of DNA evolution may be complex, the SR method is preferred when the sequences are long, say, longer than 1,000 bp.

(ii) The SR method may give large biases when the sequence length is short (say, L ≤ 200), but the biases can be substantially reduced by the bias-corrected SR distance (Table 2). As L becomes longer than 2,000 bp, the estimation bias virtually decreases to zero. The same comment applies to the SRV method (Table 3).

Table 2. The mean of distances (d) over simulation replicates estimated by the bias-corrected SR method and the SR method

Model   L = 200          L = 500          L = 2,000
(1) d = 0.5
JC      0.506 (0.516)    0.503 (0.506)    0.501 (0.502)
K2P     0.507 (0.517)    0.502 (0.506)    0.501 (0.502)
TN      0.508 (0.516)    0.504 (0.507)    0.501 (0.502)
TmN     0.505 (0.516)    0.505 (0.509)    0.501 (0.502)
SR      0.509 (0.517)    0.503 (0.506)    0.501 (0.502)
NR1     0.509 (0.517)    0.503 (0.507)    0.501 (0.502)
NR2     0.510 (0.517)    0.505 (0.509)    0.501 (0.502)
(2) d = 1.0
JC      1.036 (1.082)    1.013 (1.029)    1.005 (1.008)
K2P     1.072 (1.093)    1.008 (1.038)    1.003 (1.009)
TN      1.046 (1.089)    1.015 (1.037)    1.006 (1.010)
TmN     1.061 (1.085)    1.016 (1.050)    1.005 (1.012)
SR      1.049 (1.085)    1.006 (1.038)    1.005 (1.009)
NR1     1.057 (1.090)    1.009 (1.044)    1.005 (1.011)
NR2     1.071 (1.094)    1.015 (1.055)    1.006 (1.012)

The value presented in each case is the mean of d estimated by the bias-corrected SR method and the value in parentheses by the (uncorrected) SR method. Simulation models: JC, the Jukes–Cantor model (33). K2P, Kimura’s two parameter model (34): the transition/transversion ratio is 4. For TN (Tajima and Nei, Ref. 35), TmN (Tamura and Nei, Ref. 22), SR, and the two time-irreversible models (NR1 and NR2), see Gu and Li (11) for a detailed description. 

Table 3. The mean of distance (d) estimated by the SRV method and the bias-corrected SRV method

L     True d   SR + V model     NR2 + V model
(1) α = 0.5
200   0.3      0.317 (0.325)    0.320 (0.334)
200   0.5      0.520 (0.552)    0.555 (0.574)
200   1.0      1.068 (1.179)    1.193 (1.303)
500   0.3      0.303 (0.307)    0.305 (0.310)
500   0.5      0.508 (0.517)    0.510 (0.520)
500   1.0      1.027 (1.061)    1.037 (1.077)
(2) α = 1.0
200   0.3      0.312 (0.318)    0.313 (0.319)
200   0.5      0.513 (0.528)    0.531 (0.544)
200   1.0      1.038 (1.126)    1.063 (1.149)
500   0.3      0.306 (0.309)    0.306 (0.309)
500   0.5      0.508 (0.514)    0.502 (0.507)
500   1.0      1.013 (1.037)    1.022 (1.053)
(3) α = 2.0
200   0.3      0.307 (0.311)    0.308 (0.312)
200   0.5      0.514 (0.526)    0.514 (0.524)
200   1.0      1.046 (1.132)    1.060 (1.146)
500   0.3      0.300 (0.302)    0.300 (0.301)
500   0.5      0.503 (0.508)    0.502 (0.507)
500   1.0      1.012 (1.034)    1.015 (1.043)

The value presented in each case is the mean of d estimated by the bias-corrected SRV method, and the value in parentheses by the (uncorrected) SRV method. See the note of Table 2 for details. 

(iii) The SR method performs well even when DNA sequence evolution is not time-reversible (see models NR1 and NR2 in Table 2). Therefore, the assumption of time-reversibility, which simplifies the estimation problem considerably, may not have serious effects on distance estimation.

(iv) When the substitution rate varies among sites, the evolutionary distance can be seriously underestimated by the SR method; note that this bias is systematic and cannot be eliminated by increasing sequence length. As shown in Table 3, the SRV method performs well and the estimation bias vanishes when L is long.

(v) The methods developed by Gu and Li (11) for estimating sampling variance under the SR and SRV models appear to be reliable except when L < 200 and d > 1.0.

(vi) The mean squared error defined by MSE = bias^2 + Var(d) is useful for comparing the relative performance of two methods because for a simple method, the sampling variance tends to be smaller but the bias tends to be larger (11). For example, using this criterion, Gu and Li (11) found that SR is superior to JC when L > 500 bp and that SRV is always superior to SR when the substitution rate varies among sites.

Estimating the Pattern of Nucleotide Substitution.

The pattern of nucleotide substitution can be measured by the off-diagonal elements of the rate matrix R. For simplicity, these elements are usually rescaled, and here, we define the pattern of nucleotide substitution as R* = 2tR. Consider two DNA sequences (Fig. 1) under the SR model. Denote the diagonal matrix of the eigenvalues of P = e^{2tR} by diag(z_1, z_2, z_3, z_4). By matrix theory, we have P = U diag(z_1, z_2, z_3, z_4) U^{−1}, where U is the eigenmatrix of P. Then, the substitution pattern R* = 2tR = ln P can be expressed as

\[ R^{*} = U\,\mathrm{diag}(\ln z_1, \ln z_2, \ln z_3, \ln z_4)\,U^{-1} \tag{13} \]

Therefore, using the same procedure, we can estimate the evolutionary distance and the pattern of nucleotide substitution simultaneously. In the same manner, under the SRV model, one can show that the pattern of nucleotide substitution can be estimated by

graphic file with name M15.gif 14

where λ*_k = α(z_k^{−1/α} − 1) (see also ref. 42).

It is known that estimation of the pattern of nucleotide substitution can be significantly improved by using n > 2 sequences, but the estimation procedure becomes complex because it needs to consider the phylogenetic tree of the sequences, which may be unknown. The following simple method does not require knowledge of the tree topology. For a given pair of sequences i and j, which diverged t_ij time units ago, the transition probability matrix under the SR model is P^{(ij)} = e^{2 t_ij R}. By multiplying P^{(ij)} over all pairs of sequences, we have

\[ \prod_{i<j} P^{(ij)} = e^{2\tau R} \tag{15} \]

where τ = Σi<j tij. Similarly, under the SRV model, one can show that

graphic file with name M17.gif 16

Therefore, when the transition probability matrix for each pair of sequences has been estimated, which is denoted by P̂^{(ij)}, we first compute P̂(2τ) = ∏_{i<j} P̂^{(ij)}. Then, under the SR or SRV model, the substitution pattern R* = 2τR for n sequences can be estimated by an approach similar to that for the case of two sequences. The sampling variances for the estimated substitution pattern can be obtained by the analytical method developed by Gu and Li (11) or by a simple resampling technique (e.g., bootstrapping).

When many sequences are considered for estimating the substitution pattern, the time scale τ in Eq. 16 can be very large, resulting in some elements in R* larger than one. Because we are more concerned with the relative rates among the types of nucleotide substitutions, it is better to provide a normalized substitution pattern. A simple normalization procedure is to compute P̂(2τ) = [∏_{i<j} (P̂^{(ij)})^{w_ij}]^{1/M}, where M = n(n − 1)/2 and the weight w_ij = 1/d_ij.

A General Measure of Rate Variation Among Sites

Gu et al. (26) suggested a normalized measure (ρ) for evaluating the relative strength of the rate variation among sites:

\[ \rho = \frac{CV^{2}}{1 + CV^{2}} \tag{17} \]

where CV = √Var(u)/ū; Var(u) and ū are the variance and mean of the evolutionary rate (u) for any distribution f(u). As ρ varies from 0 to 1, the rate heterogeneity increases from a uniform rate over sites (ρ = 0 or CV = 0) to the maximum heterogeneity (ρ = 1 or CV = ∞). Therefore, ρ can directly reflect rate heterogeneity, and unlike the shape parameter α of the gamma distribution, it does not depend on a specific distribution.

In the following we describe a simple method for estimating ρ without assuming a specific model for the rate variation among sites. We assume (i) at each site nucleotide substitution follows a Poisson process, and (ii) the evolutionary rate u varies among sites according to the distribution f(u). Let X be the number of substitutions at a nucleotide site with rate u. Then, the first two conditional moments of X are given by E[X|u] = uT and E[X^2|u] = uT + (uT)^2, respectively, where T is the total evolutionary time. It follows that the first two (unconditional) moments of X over all sites are E[X] = E[E(X|u)] = TE[u] and E[X^2] = E[E[X^2|u]] = TE[u] + T^2 E[u^2], respectively, where E[u] and E[u^2] are the first two moments of f(u). Let m = E[X] and V = E[X^2] − m^2, and let ū = E[u] and Var(u) = E[u^2] − ū^2. One can show that m = ūT and V = ūT + Var(u)T^2, and so CV = √(V − m)/m. Therefore, the parameter ρ is given by

\[ \rho = \frac{V - m}{V - m + m^{2}} \tag{18} \]

To estimate ρ from sequence data, we need to know the number of substitutions at each site. Conventionally, this number is inferred by the parsimony method (43) when the phylogenetic tree is known. However, the parsimony method tends to underestimate the true number of substitutions (29, 44). Gu and Zhang (29) solved this problem by using a combination of ancestral sequence inference and maximum likelihood estimation. Let X̂_i be the number of substitutions at the ith site estimated by Gu and Zhang’s method (29). Then, m̂ = Σ_{i=1}^{L} X̂_i/L and V̂ = Σ_{i=1}^{L} X̂_i^2/L − m̂^2 (L is the sequence length) so that ρ̂ can be easily obtained from Eq. 18 without knowing the distribution f(u).
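The moment calculation can be sketched as follows. The per-site counts are taken as given (however they were inferred), the relation CV^2 = (V − m)/m^2 derived above is combined with ρ = CV^2/(1 + CV^2), and a negative excess variance is truncated at zero. The truncation and the function name are assumptions of this sketch.

```python
def estimate_rho(counts):
    """Estimate rho from per-site substitution counts.

    counts: estimated number of substitutions at each site (length L)
    """
    length = len(counts)
    if length == 0:
        raise ValueError("at least one site is required")
    m = sum(counts) / length                          # sample mean
    v = sum(x * x for x in counts) / length - m * m   # sample variance
    if m == 0.0:
        raise ValueError("no substitutions observed; rho is undefined")
    # Under a Poisson process with uniform rate, V = m; the excess V - m
    # reflects rate variation among sites. Clamp at 0 (assumption) so that
    # sampling noise cannot produce a negative estimate.
    cv_squared = max(v - m, 0.0) / (m * m)
    return cv_squared / (1.0 + cv_squared)


# Hypothetical counts for a short alignment.
print(estimate_rho([0, 0, 0, 4]))  # most sites unchanged, one fast site: high rho
print(estimate_rho([0, 0, 2, 2]))  # variance equal to the mean: rho = 0
```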

The biological meaning of ρ can be easily understood by using the following simple model. Let v be the mutation rate at a site. For invariant sites, the substitution rate is 0, and for the other sites, the rate is hv, where 0 < h ≤ 1. The average substitution rate of the gene is therefore u = (1 − θ)hv, where θ is the frequency of invariable sites. It is easy to show that CV = √(θ/(1 − θ)) and ρ = θ. Thus, the substitution rate can be expressed as

\[ u = (1 - \rho)\,hv \tag{19} \]

This formula predicts a negative correlation between substitution rate and the rate variation among sites, which has been observed by J. Zhang and X. Gu (unpublished results).
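The step from the two-class model to ρ = θ can be written out directly. This is a short derivation under the stated model only (rate 0 with probability θ, rate hv with probability 1 − θ), using ρ = CV^2/(1 + CV^2), which is the form that yields ρ = θ here.

```latex
\begin{align*}
E[u] &= (1-\theta)\,hv, \qquad E[u^{2}] = (1-\theta)\,(hv)^{2},\\
\mathrm{Var}(u) &= (1-\theta)(hv)^{2} - (1-\theta)^{2}(hv)^{2} = \theta(1-\theta)\,(hv)^{2},\\
CV^{2} &= \frac{\mathrm{Var}(u)}{E[u]^{2}} = \frac{\theta}{1-\theta},\\
\rho &= \frac{CV^{2}}{1+CV^{2}} = \frac{\theta/(1-\theta)}{1/(1-\theta)} = \theta .
\end{align*}
```

Substituting θ = ρ into the mean rate (1 − θ)hv gives the negative relation between the substitution rate and ρ noted above.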

Nonstationary Models

LogDet and Paralinear Distances.

The paralinear (19) and LogDet (17, 20) distances have been proposed to deal with nonstationarity. They are based on the most general model of nucleotide substitution. Historically, these methods can be traced back to Barry and Hartigan (13) and Cavender and Felsenstein (45).

Consider the evolution of two sequences (Fig. 1). Denote the diagonal matrix of nucleotide frequencies at node k (k = 0, 1, 2) by F^{(k)} = diag(f_1^{(k)}, f_2^{(k)}, f_3^{(k)}, f_4^{(k)}), where the subscript j refers to nucleotide j. Let J be the data matrix as defined previously. Then, the paralinear distance (between sequences 1 and 2) is defined as

\[ d_{12} = -\frac{1}{4} \ln \frac{\det J}{\sqrt{\det F^{(1)} \det F^{(2)}}} \tag{20} \]

where det( ) means the determinant of a matrix, and for a diagonal matrix, we have det[F^{(k)}] = ∏_{i=1}^{4} f_i^{(k)}, k = 1, 2 (19). A related measure is the LogDet distance (17, 20), which is defined as

\[ d_{12} = -\frac{1}{4} \ln \det J - \ln 4 \tag{21} \]

In Eq. 21, the constant −ln 4 is added because it does not change any property of the original LogDet distance but makes the biological interpretation easier (32). The paralinear and LogDet distances have the following properties:

(i) Both distances are based on the most general model of nucleotide substitution, i.e., the 12-parameter model (17, 19, 20, 31). Moreover, they are valid even if the rate matrix R varies among lineages. Therefore, in the case where the assumption of a uniform substitution rate among sites holds, the paralinear and LogDet distances are very useful for phylogenetic reconstruction when nucleotide frequencies are nonstationary (19, 20, 32).

(ii) For the neighbor-joining method and related methods, the two distance measures give the same tree topology (32). However, there are some differences between the two distances. First, the paralinear distance between two sequences is the sum of “paralinear” lengths of the branches involved. Thus, the branch lengths under a given tree can be well estimated from the paralinear distance matrix by the least-squares method. In contrast, this property does not hold for the LogDet distance. Second, the LogDet distance is particularly useful for testing the molecular clock hypothesis under nonstationarity, whereas the paralinear distance is not suitable for this purpose (see Eqs. 27 and 28).

(iii) The biological interpretation of the two distances can be described as follows. Let μ^{(k)} = −Σ_{i=1}^{4} r_ii^{(k)}/4 be the arithmetic mean rate in lineage k (k = 1, 2), and μ = (μ^{(1)} + μ^{(2)})/2. Gu and Li (32) showed that the expected paralinear distance (Eq. 20) is given by

\[ d = 2\mu t + \frac{1}{8} \sum_{i=1}^{4} \ln \frac{f_i^{(1)} f_i^{(2)}}{\left( f_i^{(0)} \right)^{2}} \tag{22} \]

and the expected LogDet distance (Eq. 21) is given by

\[ d = 2\mu t - \frac{1}{4} \sum_{i=1}^{4} \ln \left( 4 f_i^{(0)} \right) \tag{23} \]

Note that, when the nucleotide frequency is stationary, Eq. 22 reduces to d = 2μt, which is the expected number of substitutions between the two sequences and is equivalent to the SR distance with c_k = 1/4 (Eq. 2). Eq. 23 reduces to d = 2μt if f_i^{(0)} = 1/4, i = 1, … , 4.

(iv) The approximate sampling variance of the paralinear distance is given by

graphic file with name M28.gif 24

and that of the LogDet distance is given by

graphic file with name M29.gif 25

where L is the sequence length and M_ij is the ij-th element of M = J^{−1} (13, 20, 32). For more than two sequences, the method for computing the variance-covariance matrix of the two distances has been developed by Gu and Li (32).
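Both distances can be computed directly from the data matrix, as the sketch below illustrates. It assumes the forms d = −(1/4) ln[det J / √(det F^(1) det F^(2))] for the paralinear distance and d = −(1/4) ln det J − ln 4 for the LogDet distance, with the nucleotide frequencies of sequences 1 and 2 taken as the row and column sums of J. The function names are hypothetical, and the sampling variances are not covered.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def paralinear_distance(j):
    """-(1/4) ln[det J / sqrt(det F1 * det F2)], with F1, F2 diagonal frequency matrices."""
    n = len(j)
    f1 = [sum(row) for row in j]                                  # sequence 1: row sums
    f2 = [sum(j[i][k] for i in range(n)) for k in range(n)]       # sequence 2: column sums
    return -0.25 * math.log(determinant(j) / math.sqrt(math.prod(f1) * math.prod(f2)))


def logdet_distance(j):
    """-(1/4) ln det J - ln 4."""
    return -0.25 * math.log(determinant(j)) - math.log(4.0)


# Stationary example with equal base frequencies: both distances should coincide.
p = 0.3
J = [[(1 - p) / 4 if a == b else p / 12 for b in range(4)] for a in range(4)]
print(paralinear_distance(J), logdet_distance(J))
```

With short sequences det J can be zero or negative, in which case the logarithm is undefined and `math.log` raises an error; handling of that case is left out here.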

Bias-Corrected Paralinear and LogDet Distances.

Because the data matrix J and the nucleotide frequencies can be directly estimated from the sequence data, the estimation of paralinear and LogDet distances is simple (19, 20). However, our simulation study has revealed that the true (paralinear or LogDet) distance can be overestimated when the sequences are short (32), a situation similar to that of the SR/SRV distances. Gu and Li (32) obtained the following bias-corrected paralinear or LogDet distance:

graphic file with name M30.gif 26

where d̂ and Var(d̂) are the estimates of the “standard” paralinear or LogDet distance and the sampling variance, respectively (see Eqs. 20, 21, 24, 25).

The performance of the bias-corrected distances has been examined by extensive computer simulation (32). We considered two DNA sequences (Fig. 1) that evolve under a very general model: in one lineage the nucleotide substitution follows a time-reversible model (TR) and in the other lineage it follows a time-irreversible model (NR). The rate matrices of TR and NR are designed to be very different, and the equilibrium GC% is 70% in TR but only 17% in NR (see ref. 32 for details). Moreover, the initial GC% at node O (Fig. 1) is set to 15%, 50%, or 70% in three cases. Our simulation results indicate that, when the sequence length is short, the bias-corrected paralinear or LogDet distance performs considerably better than the uncorrected method (Table 4).

Table 4. Statistical performances of the bias-corrected paralinear distance

Initial GC%   L       d̄       d̂_c (bias)      d̂ (bias)
t = 0.5
50%           200     0.486   0.488 (0.4%)    0.497 (2.3%)
50%           500     0.486   0.489 (0.6%)    0.492 (1.2%)
50%           2,000   0.486   0.487 (0.2%)    0.488 (0.4%)
70%           200     0.555   0.556 (0.2%)    0.572 (3.1%)
70%           500     0.555   0.557 (0.4%)    0.563 (1.4%)
70%           2,000   0.555   0.555 (0.0%)    0.557 (0.4%)
15%           200     0.607   0.599 (1.3%)    0.637 (4.9%)
15%           500     0.607   0.602 (0.8%)    0.613 (1.0%)
15%           2,000   0.607   0.609 (0.3%)    0.611 (0.7%)
t = 0.8
50%           200     0.770   0.766 (0.5%)    0.791 (2.7%)
50%           500     0.770   0.768 (0.3%)    0.777 (0.9%)
50%           2,000   0.770   0.770 (0.0%)    0.772 (0.3%)
70%           200     0.858   0.842 (1.9%)    0.890 (3.7%)
70%           500     0.858   0.854 (0.5%)    0.868 (1.2%)
70%           2,000   0.858   0.859 (0.1%)    0.862 (0.5%)
15%           200     0.926   0.880 (5.0%)    0.986 (6.5%)
15%           500     0.926   0.918 (0.9%)    0.946 (1.2%)
15%           2,000   0.926   0.925 (0.1%)    0.930 (0.5%)

L is the sequence length; d̄ is the true value of the paralinear distance; d̂_c and d̂ are the means of d estimated by the bias-corrected and uncorrected paralinear distances, respectively. The percentage values in parentheses are the biases of d̂_c (i.e., |d̂_c − d̄|/d̄ × 100%) and d̂ (i.e., |d̂ − d̄|/d̄ × 100%), respectively.

Testing the Molecular Clock Hypothesis Under Nonstationarity.

The relative rate test (2) can be described as follows. Consider three species as shown in Fig. 2, where species 3 is an outgroup. To test whether the evolutionary rate in lineage O1 is the same as that in lineage O2 (i.e., the molecular clock hypothesis), one tests whether or not the difference D = d_13 − d_23 is significantly different from zero. Wu and Li (2), Gu and Li (46), Muse and Weir (47), Tajima (48), and others have developed tests for the case of stationarity. When the nucleotide frequencies are nonstationary, D ≠ 0 can arise from differences in nucleotide frequencies between the two sequences. Gu and Li (32) showed that this problem can be avoided by using the LogDet distance; that is,

\[ D = d_{13} - d_{23} = \left( \mu^{(1)} - \mu^{(2)} \right) t \tag{27} \]

where t is the divergence time between species 1 and 2 (Fig. 2). To test whether D is significantly different from zero, one can estimate the sampling variance of D, V(D) = V(d_13) + V(d_23) − 2 Cov(d_13, d_23), by the method of Gu and Li (32). When the sequence is long, the statistic Z = D/√V(D) follows approximately the standard normal distribution (2). Actually, this new relative rate test can be easily generalized to the two-cluster test of Li and Bousquet (49) and Takezaki et al. (50), who considered the case of stationarity (Gu and Li, unpublished data).
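The test statistic is simple to compute once the two distances and their sampling (co)variances are available. The sketch below assumes those quantities are supplied by the caller; it does not implement the variance estimation of Gu and Li (32). The function name and the numerical inputs are hypothetical, and the two-sided P value uses the normal approximation stated above.

```python
import math


def relative_rate_test(d13, d23, var13, var23, cov):
    """Return (D, Z, two-sided P) for the relative rate test.

    D = d13 - d23, V(D) = V(d13) + V(d23) - 2 Cov(d13, d23), Z = D / sqrt(V(D)).
    """
    var_d = var13 + var23 - 2.0 * cov
    if var_d <= 0.0:
        raise ValueError("V(D) must be positive")
    diff = d13 - d23
    z = diff / math.sqrt(var_d)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # P(|N(0,1)| >= |z|)
    return diff, z, p_value


# Hypothetical LogDet distances to the outgroup and their sampling (co)variances.
D, Z, P = relative_rate_test(0.30, 0.25, 0.0004, 0.0003, 0.0002)
print(D, Z, P)
```

A positive covariance between d_13 and d_23, which arises from the shared branch leading to the outgroup, reduces V(D) and makes the test more sensitive than treating the two distances as independent.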

Figure 2. The phylogeny used for molecular clock testing.

On the other hand, if d_ij is measured by the paralinear distance, one can show that D′ = d_13 − d_23 is given by

graphic file with name M33.gif 28

Obviously, D′ is affected by differences in nucleotide frequencies and thus not suitable for testing the molecular clock hypothesis.

Discussion

In the above, we discussed the estimation of evolutionary distances and related issues under three models of nucleotide substitution: the SR model (10–14, 36), the SRV model (11), and the nonstationary model (13, 17, 19, 20, 32, 45). The conclusions can be summarized as follows. (i) Under stationarity, the evolutionary distances and the pattern of nucleotide substitution can be estimated under the SR or SRV model. (ii) When the nucleotide frequencies are nonstationary, the paralinear or LogDet distances should be used. However, although both distances lead to the same tree topology, the branch lengths of a tree can be appropriately estimated only from the paralinear distances, whereas the molecular clock hypothesis should be tested by the LogDet distance. (iii) The proposed bias-corrected methods for the SR/SRV and paralinear/LogDet distances are useful when the sequences are shorter than 500 bp. (iv) A general measure for the rate variation among sites is proposed, which does not depend on any specific distribution of rates.

In principle, the SR/SRV and paralinear/LogDet distances can be easily extended to more complex models in which the dimension of the rate matrix R is >4 (51–55). Two interesting cases are the amino acid-based model (a general 20 × 20 model) and the codon-based model (a general 61 × 61 model). However, our preliminary simulation showed that, even for the amino acid-based model, these distances are subject to large sampling variances unless the sequence is very long, say, longer than 2,000 amino acids; the sampling variance would be much larger for the codon-based model. Indeed, because there are too many unknown parameters, the distances cannot be estimated accurately. Thus, one should be cautious when applying these methods to analyze amino acid sequence data.

We suggested using ρ (related to the coefficient of variation CV) as a general measure of rate heterogeneity. However, Waddell et al. (30) questioned its usefulness because they found that, for a given sequence data set, the estimated CV value differs under different assumptions of rate distribution. This dilemma has now been removed because we have developed a method for estimating ρ (or CV) that does not require any specific model of rate distribution. Apparently, the discrepancy found by Waddell et al. (30) is caused by sampling errors or the unsuitability of the model.

When the nucleotide frequencies are not stationary, the paralinear and LogDet methods provide concise and elegant distance measures for phylogenetic inference and molecular clock testing. However, how to incorporate the effect of rate heterogeneity among sites into these two distances is a problem that remains to be solved.

Acknowledgments

This study was supported by National Institutes of Health Grants GM 30998 (to W.H.L.) and GM 20293 (to Masatoshi Nei, Pennsylvania State University).

ABBREVIATIONS

SR, stationary time reversible; SRV, SR rate-variable; NR, time-irreversible; TR, time-reversible.

References

  1. Li W H, Wu C I, Luo C C. In: Molecular Evolutionary Genetics. MacIntyre R J, editor. New York: Plenum; 1985. pp. 1–94.
  2. Wu C I, Li W H. Proc Natl Acad Sci USA. 1985;82:1741–1745. doi: 10.1073/pnas.82.6.1741.
  3. Saitou N, Nei M. Mol Biol Evol. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
  4. Nei M. Molecular Evolutionary Genetics. New York: Columbia Univ. Press; 1987.
  5. Nei M. Annu Rev Genet. 1996;30:371–403. doi: 10.1146/annurev.genet.30.1.371.
  6. Felsenstein J. Annu Rev Genet. 1988;22:521–565. doi: 10.1146/annurev.ge.22.120188.002513.
  7. Doolittle R F, Feng D F, Tsang S, Cho G, Little E. Science. 1996;271:470–477. doi: 10.1126/science.271.5248.470.
  8. Li W H. Molecular Evolution. Sunderland, MA: Sinauer; 1997.
  9. Gu X. Mol Biol Evol. 1997;14:861–866. doi: 10.1093/oxfordjournals.molbev.a025827.
  10. Lanave C, Preparata G, Saccone C, Serio G. J Mol Evol. 1984;20:86–93. doi: 10.1007/BF02101990.
  11. Gu X, Li W H. Proc Natl Acad Sci USA. 1996;93:4671–4676. doi: 10.1073/pnas.93.10.4671.
  12. Tavare S. Lect Math Life Sci. 1986;17:57–86.
  13. Barry D, Hartigan J A. Biometrics. 1987;43:261–276.
  14. Rodriguez F, Oliver J F, Marin A, Medina J R. J Theor Biol. 1990;142:485–501. doi: 10.1016/s0022-5193(05)80104-3.
  15. Hasegawa M, Hashimoto T. Nature. 1993;361:23. doi: 10.1038/361023b0.
  16. Sogin M L, Hinkle G, Leipe D D. Nature. 1993;362:795. doi: 10.1038/362795a0.
  17. Steel M A. Appl Math Lett. 1994;7:19–24.
  18. Galtier N, Gouy M. Proc Natl Acad Sci USA. 1995;92:11317–11321. doi: 10.1073/pnas.92.24.11317.
  19. Lake J A. Proc Natl Acad Sci USA. 1994;91:1455–1459. doi: 10.1073/pnas.91.4.1455.
  20. Lockhart P J, Steel M A, Hendy M D, Penny D. Mol Biol Evol. 1994;11:605–612. doi: 10.1093/oxfordjournals.molbev.a040136.
  21. Hasegawa M, Kishino H, Yano T. J Mol Evol. 1985;22:160–174. doi: 10.1007/BF02101694.
  22. Tamura K, Nei M. Mol Biol Evol. 1993;10:512–526. doi: 10.1093/oxfordjournals.molbev.a040023.
  23. Yang Z. J Mol Evol. 1994;39:105–111. doi: 10.1007/BF00178256.
  24. Uzzell T, Corbin K W. Science. 1971;172:1089–1096. doi: 10.1126/science.172.3988.1089.
  25. Yang Z. Mol Biol Evol. 1993;10:1396–1401. doi: 10.1093/oxfordjournals.molbev.a040082.
  26. Gu X, Fu X Y, Li W H. Mol Biol Evol. 1995;12:546–557. doi: 10.1093/oxfordjournals.molbev.a040235.
  27. Sullivan J K, Holsinger K E, Simon C. Mol Biol Evol. 1995;12:988–1001. doi: 10.1093/oxfordjournals.molbev.a040292.
  28. Kelly C, Rice J. Math Biosci. 1996;133:85–109. doi: 10.1016/0025-5564(95)00083-6.
  29. Gu X, Zhang J. Mol Biol Evol. 1997;14:1106–1113. doi: 10.1093/oxfordjournals.molbev.a025720.
  30. Waddell P J, Penny D, Moore T. Mol Phylogenet Evol. 1997;8:33–50. doi: 10.1006/mpev.1997.0405.
  31. Zharkikh A. J Mol Evol. 1994;39:315–329. doi: 10.1007/BF00160155.
  32. Gu X, Li W H. Mol Biol Evol. 1996;13:1375–1383. doi: 10.1093/oxfordjournals.molbev.a025584.
  33. Jukes T H, Cantor C R. In: Mammalian Protein Metabolism. Munro H N, editor. New York: Academic; 1969. pp. 21–123.
  34. Kimura M. J Mol Evol. 1980;16:111–120. doi: 10.1007/BF01731581.
  35. Tajima F, Nei M. Mol Biol Evol. 1984;1:269–285. doi: 10.1093/oxfordjournals.molbev.a040317.
  36. Steel M, Szekely L, Hendy M. J Comp Biol. 1994;1:153–163. doi: 10.1089/cmb.1994.1.153.
  37. Keilson J. Markov Chain Models: Rarity and Exponentiality. New York: Springer; 1979.
  38. Saccone C, Lanave C, Pesole G, Preparata G. Methods Enzymol. 1990;183:570–583. doi: 10.1016/0076-6879(90)83037-a.
  39. Li W H, Gu X. Methods Enzymol. 1996;266:449–459. doi: 10.1016/s0076-6879(96)66028-5.
  40. Miyamoto M M, Fitch W M. Syst Biol. 1996;45:568–575. doi: 10.1093/sysbio/45.4.568.
  41. Tourasse N, Gouy M. Mol Biol Evol. 1997;14:287–298. doi: 10.1093/oxfordjournals.molbev.a025764.
  42. Yang Z, Kumar S. Mol Biol Evol. 1996;13:650–659. doi: 10.1093/oxfordjournals.molbev.a025625.
  43. Fitch W M. Syst Zool. 1971;20:406–416.
  44. Wakeley J. J Mol Evol. 1993;37:613–623. doi: 10.1007/BF00182747.
  45. Cavender J A, Felsenstein J. J Classification. 1987;4:57–71.
  46. Gu X, Li W H. Mol Phylogenet Evol. 1992;234:185–192. doi: 10.1016/1055-7903(92)90017-b.
  47. Muse S V, Weir B S. Genetics. 1992;132:269–276. doi: 10.1093/genetics/132.1.269.
  48. Tajima F. Genetics. 1993;135:599–607. doi: 10.1093/genetics/135.2.599.
  49. Li P, Bousquet J. Mol Biol Evol. 1992;9:1185–1189. doi: 10.1093/oxfordjournals.molbev.a040779.
  50. Takezaki N, Rzhetsky A, Nei M. Mol Biol Evol. 1995;12:823–833. doi: 10.1093/oxfordjournals.molbev.a040259.
  51. Dayhoff M O. Atlas of Protein Sequence and Structure. Vol. 5. Silver Spring, MD: Natl. Biomed. Res. Found.; 1978.
  52. Schoniger M, von Haeseler A. Mol Phylogenet Evol. 1994;3:240–247. doi: 10.1006/mpev.1994.1026.
  53. Goldman N, Yang Z. Mol Biol Evol. 1994;11:725–736. doi: 10.1093/oxfordjournals.molbev.a040153.
  54. Muse S V, Gaut B S. Mol Biol Evol. 1994;11:715–724. doi: 10.1093/oxfordjournals.molbev.a040152.
  55. Rzhetsky A. Genetics. 1995;141:771–783. doi: 10.1093/genetics/141.2.771.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
