Author manuscript; available in PMC: 2009 Sep 7.
Published in final edited form as: J Theor Biol. 2008 May 10;254(1):164–167. doi: 10.1016/j.jtbi.2008.04.034

Improved variance estimators for one- and two-parameter models of nucleotide substitution

Hsiuying Wang 1, Yun-Huei Tzeng 2, Wen-Hsiung Li 2,3,§
PMCID: PMC2580800  NIHMSID: NIHMS68203  PMID: 18571203

Abstract

The current variance estimators for Jukes and Cantor’s one-parameter model and Kimura’s two-parameter model tend to seriously underestimate the true variances when the proportion of nucleotide differences between the two sequences under study is not small. In this paper, we developed improved variance estimators, using a higher order Taylor expansion and empirical methods. The new estimators outperform the conventional estimators and provide accurate estimates of the true variances.

Keywords: substitution model, variance estimator, Taylor expansion, empirical formulas

1. Introduction

A basic process in the evolution of DNA sequences is the substitution of one nucleotide for another during evolution. The substitution of one allele for another in a population generally takes thousands of years or longer to complete, so the process cannot be directly observed. To detect evolutionary changes in a DNA sequence, we need to compare two sequences that have descended from a common ancestral sequence.

If two sequences of length L differ from each other at X sites, the proportion of differences, X/L, is referred to as the observed or uncorrected divergence. When the degree of divergence between the two sequences compared is small, the chance for more than one substitution to have occurred at a site is negligible, and the number of observed differences between the two sequences is close to the actual number of substitutions. However, if the degree of divergence is substantial, the observed number of differences is likely to be smaller than the actual number of substitutions due to multiple hits at the same site. Many methods have been proposed to correct for multiple hits (Holmquist, 1971; Jukes and Cantor, 1969; Kaplan and Risko, 1982; Kimura, 1980; Kimura, 1981; Lanave et al., 1984). The simplest and most frequently used models are Jukes and Cantor’s (1969) one-parameter model and Kimura’s (1980) two-parameter model.

Jukes and Cantor’s one-parameter model assumes that substitutions occur with equal probability, say α, among the four nucleotide types. Since the time of divergence between two sequences is usually unknown, we cannot estimate α directly. Instead, we compute K, the number of substitutions per site since the time of divergence between the two sequences. In the one-parameter model case, K = 2(3αt), where 3αt is the expected number of substitutions per site in a single lineage. Jukes and Cantor (1969) derived the following formula:

$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\hat{p}\right) \tag{1}$$

where p̂ = X/L is the observed proportion of different nucleotides between the two sequences. The following approximate estimator of the sampling variance was derived by Kimura and Ohta (1972) and is commonly used in the literature:

$$V(K) = \frac{\hat{p} - \hat{p}^2}{L\left(1 - \frac{4}{3}\hat{p}\right)^2} \tag{2}$$
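As an illustration of formulas (1) and (2), the following is a minimal Python sketch. The function names are hypothetical (they do not come from the paper), and the sequence comparison assumes two aligned, equal-length strings without gaps.

```python
import math


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the sites X at which two aligned sequences differ (assumes no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def jc69_distance(x: int, length: int) -> float:
    """Formula (1): K = -(3/4) ln(1 - (4/3) p_hat), with p_hat = X / L."""
    p_hat = x / length
    if p_hat >= 0.75:
        raise ValueError("p_hat must be below 3/4 for the logarithm to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def jc69_variance_conventional(x: int, length: int) -> float:
    """Formula (2): V(K) = (p_hat - p_hat^2) / (L (1 - (4/3) p_hat)^2)."""
    p_hat = x / length
    return (p_hat - p_hat ** 2) / (length * (1.0 - 4.0 * p_hat / 3.0) ** 2)
```

For example, `jc69_distance(30, 100)` evaluates formula (1) for 30 differences in 100 sites, and `jc69_variance_conventional(30, 100)` gives the corresponding conventional variance.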

In the case of the two-parameter model (Kimura, 1980), the differences between two sequences are classified into transitions and transversions. Let P̂ = X1/L and Q̂ = X2/L be the observed proportions of transitional and transversional differences between the two sequences, respectively, where X1 and X2 are the numbers of transitional and transversional differences between the two sequences. Then the number of nucleotide substitutions per site between the two sequences, K2, is estimated by

$$K_2 = \frac{1}{2}\ln\left(\frac{1}{1 - 2\hat{P} - \hat{Q}}\right) + \frac{1}{4}\ln\left(\frac{1}{1 - 2\hat{Q}}\right). \tag{3}$$

The sampling variance is approximately given by

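$$V(K_2) = \frac{1}{L}\left[\hat{P}\left(\frac{1}{1 - 2\hat{P} - \hat{Q}}\right)^2 + \hat{Q}\left(\frac{1}{2 - 4\hat{P} - 2\hat{Q}} + \frac{1}{2 - 4\hat{Q}}\right)^2 - \left(\frac{\hat{P}}{1 - 2\hat{P} - \hat{Q}} + \frac{\hat{Q}}{2 - 4\hat{P} - 2\hat{Q}} + \frac{\hat{Q}}{2 - 4\hat{Q}}\right)^2\right]. \tag{4}$$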
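Formulas (3) and (4) can be sketched in Python as follows. The function names and the `TRANSITIONS` set are hypothetical helpers introduced for this illustration, and the classifier assumes aligned sequences containing only A, C, G and T.

```python
import math

# Purine-purine (A<->G) and pyrimidine-pyrimidine (C<->T) changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_differences(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (X1, X2): the numbers of transitional and transversional differences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    x1 = x2 = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            x1 += 1
        else:
            x2 += 1  # every other mismatch is counted as a transversion
    return x1, x2


def k2p_distance(p_hat: float, q_hat: float) -> float:
    """Formula (3): K2 = (1/2) ln(1/(1-2P-Q)) + (1/4) ln(1/(1-2Q))."""
    return 0.5 * math.log(1.0 / (1.0 - 2.0 * p_hat - q_hat)) + 0.25 * math.log(
        1.0 / (1.0 - 2.0 * q_hat)
    )


def k2p_variance_conventional(p_hat: float, q_hat: float, length: int) -> float:
    """Formula (4), written as [a^2 P + b^2 Q - (aP + bQ)^2] / L."""
    a = 1.0 / (1.0 - 2.0 * p_hat - q_hat)
    b = 1.0 / (2.0 - 4.0 * p_hat - 2.0 * q_hat) + 1.0 / (2.0 - 4.0 * q_hat)
    return (a * a * p_hat + b * b * q_hat - (a * p_hat + b * q_hat) ** 2) / length
```

Here `a` and `b` are shorthand for the two coefficients inside the bracket of (4), so that the three terms of the bracket read a²P̂, b²Q̂ and (aP̂ + bQ̂)².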

Since the above two variance estimators underestimate the true variances in most circumstances, we derive improved estimators for estimating the sampling variances, using a higher order Taylor expansion and empirical methods. Our simulation results show that the new estimators outperform the conventional variance estimators and provide accurate estimates of the sampling variances.

2. Methods

Because (1) involves the log function, its variance is not easy to calculate directly. We therefore expand the log function in a Taylor series at X = Lp, where p is the true proportion of differences. Expanding to second order, we have

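$$-\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{X}{L}\right) \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) + \left(\frac{X}{L} - p\right)\frac{1}{1 - \frac{4}{3}p} + \left(\frac{X}{L} - p\right)^2\frac{2}{3\left(1 - \frac{4}{3}p\right)^2}. \tag{5}$$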
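A quick numerical check of expansion (5) is sketched below in Python. The function names are hypothetical; the code simply compares the exact value of (1) with the first- and second-order approximations around an assumed true proportion p.

```python
import math


def jc_exact(x_over_l: float) -> float:
    """Exact value of -(3/4) ln(1 - (4/3) X/L)."""
    return -0.75 * math.log(1.0 - 4.0 * x_over_l / 3.0)


def jc_taylor1(x_over_l: float, p: float) -> float:
    """First-order expansion around p (constant and linear terms of (5))."""
    g = 1.0 - 4.0 * p / 3.0
    return -0.75 * math.log(g) + (x_over_l - p) / g


def jc_taylor2(x_over_l: float, p: float) -> float:
    """Second-order expansion around p, as in (5)."""
    g = 1.0 - 4.0 * p / 3.0
    d = x_over_l - p
    return jc_taylor1(x_over_l, p) + d * d * 2.0 / (3.0 * g * g)


if __name__ == "__main__":
    p = 0.3
    for x_over_l in (0.28, 0.30, 0.32, 0.35):
        exact = jc_exact(x_over_l)
        print(x_over_l, exact - jc_taylor1(x_over_l, p), exact - jc_taylor2(x_over_l, p))
```

The printed differences show how much of the first-order error the quadratic term removes as X/L moves away from p.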

From the formula

$$\mathrm{Var}(Y) = E(Y^2) - (EY)^2,$$

where Y is a random variable, the variance of K can be expressed as

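$$\mathrm{Var}(K) = E\left[\left(-\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{X}{L}\right)\right)^2\right] - \left[E\left(-\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{X}{L}\right)\right)\right]^2. \tag{6}$$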

From (5), the first term in (6) is

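$$E\left[\left(-\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{X}{L}\right)\right)^2\right] = \frac{9}{16}\ln^2\left(1 - \frac{4}{3}p\right) + \frac{p(1-p)}{L}\frac{1}{\left(1 - \frac{4}{3}p\right)^2} + \frac{4}{9\left(1 - \frac{4}{3}p\right)^4}E\left(\frac{X}{L} - p\right)^4 - \frac{3}{2}\frac{p(1-p)}{L}\ln\left(1 - \frac{4}{3}p\right)\frac{2}{3\left(1 - \frac{4}{3}p\right)^2} + o\left(\frac{1}{L^2}\right). \tag{7}$$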

From (5), the second term in (6) is

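$$\left[E\left(-\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{X}{L}\right)\right)\right]^2 = \frac{9}{16}\ln^2\left(1 - \frac{4}{3}p\right) - \frac{3}{2}\frac{p(1-p)}{L}\ln\left(1 - \frac{4}{3}p\right)\frac{2}{3\left(1 - \frac{4}{3}p\right)^2} + \frac{4}{9\left(1 - \frac{4}{3}p\right)^4}\frac{p^2(1-p)^2}{L^2} + o\left(\frac{1}{L^2}\right). \tag{8}$$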

From (7), (8) and the fact

$$E\left(\frac{X}{L} - p\right)^4 = \frac{p(1-p)\left(1 - 6p(1-p) + 3Lp(1-p)\right)}{L^3} \approx \frac{3p^2(1-p)^2}{L^2},$$

we have

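$$\mathrm{Var}\left(-\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{X}{L}\right)\right) \approx \frac{p(1-p)}{L\left(1 - \frac{4}{3}p\right)^2} + \frac{8p^2(1-p)^2}{9L^2\left(1 - \frac{4}{3}p\right)^4}. \tag{9}$$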
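The fourth-moment fact used in this derivation can be checked numerically. The Python sketch below assumes that X follows a binomial distribution with L trials and success probability p (the sampling model implicit in the derivation); the function names are hypothetical.

```python
import math


def fourth_moment_exact(n: int, p: float) -> float:
    """E[(X/n - p)^4] for X ~ Binomial(n, p), summed directly over the pmf."""
    total = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        total += pmf * (k / n - p) ** 4
    return total


def fourth_moment_closed_form(n: int, p: float) -> float:
    """p(1-p)(1 - 6p(1-p) + 3np(1-p)) / n^3."""
    v = p * (1.0 - p)
    return v * (1.0 - 6.0 * v + 3.0 * n * v) / n ** 3


def fourth_moment_leading_term(n: int, p: float) -> float:
    """Large-n approximation 3 p^2 (1-p)^2 / n^2."""
    return 3.0 * (p * (1.0 - p)) ** 2 / n ** 2
```

Comparing the three functions for a moderate n (for example n = 200, p = 0.3) shows the closed form agreeing with the direct sum and the leading term approaching both as n grows.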

Our simulation study showed that when p is small, the variance estimator (9) provides a better estimator for the true variance than the estimator (2).
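Estimator (9) is the conventional formula (2) plus a positive second-order term, which the following Python sketch makes explicit. The function names are hypothetical, and in practice p would be replaced by the observed proportion p̂.

```python
def variance_first_order(p: float, length: int) -> float:
    """First term of (9), identical in form to the conventional estimator (2)."""
    g = 1.0 - 4.0 * p / 3.0
    return p * (1.0 - p) / (length * g ** 2)


def variance_second_order(p: float, length: int) -> float:
    """Estimator (9): first-order term plus 8 p^2 (1-p)^2 / (9 L^2 (1 - 4p/3)^4)."""
    g = 1.0 - 4.0 * p / 3.0
    correction = 8.0 * p ** 2 * (1.0 - p) ** 2 / (9.0 * length ** 2 * g ** 4)
    return variance_first_order(p, length) + correction
```

Because the correction is of order 1/L², its relative contribution shrinks as the sequence length grows.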

Thus, when p is small, we can directly use the estimator (9) as an improved estimator for the variance. However, when p is not small, the estimator (9) does not approximate the true variance well enough because some higher order terms are no longer negligible. Therefore, we use (9) to propose the following form of a new estimator

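$$a(\hat{p})\frac{\hat{p}(1-\hat{p})}{L\left(1 - \frac{4}{3}\hat{p}\right)^2} + b(\hat{p})\frac{8\hat{p}^2(1-\hat{p})^2}{9L^2\left(1 - \frac{4}{3}\hat{p}\right)^4}, \tag{10}$$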

for the one-parameter model, where a(p̂) and b(p̂) can be derived empirically by simulation, so that the new estimator can approximate the true variance more accurately than formula (9).

For the two-parameter model, we expand the function

$$f(X_1, X_2) = -\frac{1}{2}\ln\left(1 - \frac{2X_1}{L} - \frac{X_2}{L}\right) - \frac{1}{4}\ln\left(1 - \frac{2X_2}{L}\right)$$

in (3) at X1 = LP and X2 = LQ by using the Taylor expansion to the second order. Then, we have

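$$\begin{aligned} f(X_1, X_2) \approx{} & -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q) + (X_1 - PL)\frac{1}{L(1 - 2P - Q)} + (X_2 - QL)\frac{1}{2L}\left(\frac{1}{1 - 2P - Q} + \frac{1}{1 - 2Q}\right) \\ & + \frac{1}{2}\left\{(X_1 - PL)^2\frac{1}{L^2}\frac{2}{(1 - 2P - Q)^2} + 2(X_1 - PL)(X_2 - QL)\frac{1}{L^2}\frac{1}{(1 - 2P - Q)^2} + (X_2 - QL)^2\frac{1}{L^2}\left(\frac{1}{2(1 - 2P - Q)^2} + \frac{1}{(1 - 2Q)^2}\right)\right\}. \end{aligned} \tag{11}$$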

From the formula

$$\mathrm{Var}(f(X_1, X_2)) = E\left(f^2(X_1, X_2)\right) - \left(E f(X_1, X_2)\right)^2$$

and tedious calculations, we obtain

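$$V(K_2) \approx \frac{1}{L}\left[P\left(\frac{1}{1 - 2P - Q}\right)^2 + Q\left(\frac{1}{2 - 4P - 2Q} + \frac{1}{2 - 4Q}\right)^2 - \left(\frac{P}{1 - 2P - Q} + \frac{Q}{2 - 4P - 2Q} + \frac{Q}{2 - 4Q}\right)^2\right] + S \tag{12}$$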

where

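$$\begin{aligned} S = -\Big[ & 16P^4\left(3 - 36Q + 132Q^2 - 200Q^3 + 108Q^4\right) \\ & + (-1 + Q)^2 Q\left(-12 + 89Q - 272Q^2 + 424Q^3 - 336Q^4 + 108Q^5\right) \\ & + 32P^3\left(-3 + 39Q - 168Q^2 + 332Q^3 - 308Q^4 + 108Q^5\right) \\ & + 8P^2\left(8 - 115Q + 574Q^2 - 1402Q^3 + 1820Q^4 - 1208Q^5 + 324Q^6\right) \\ & + 8P\left(-2 + 33Q - 191Q^2 + 562Q^3 - 942Q^4 + 916Q^5 - 484Q^6 + 108Q^7\right)\Big] \Big/ \left(8L^2(1 - 2Q)^4(-1 + 2P + Q)^4\right). \end{aligned}$$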
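Because S is long, a small function is less error-prone than hand evaluation. The Python sketch below is an assumption-laden transcription: the signs of the coefficients (an overall leading minus and alternating signs inside each polynomial) are as read here and were checked only against the special cases P = 0 and Q = 0 of a direct second-order expansion. The names `poly` and `s_term` are hypothetical.

```python
def poly(coeffs, q: float) -> float:
    """Evaluate c0 + c1*q + c2*q^2 + ... by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * q + c
    return result


def s_term(p: float, q: float, length: int) -> float:
    """Second-order term S of (12); coefficient signs are an assumption of this sketch."""
    numerator = (
        16.0 * p ** 4 * poly([3, -36, 132, -200, 108], q)
        + (q - 1.0) ** 2 * q * poly([-12, 89, -272, 424, -336, 108], q)
        + 32.0 * p ** 3 * poly([-3, 39, -168, 332, -308, 108], q)
        + 8.0 * p ** 2 * poly([8, -115, 574, -1402, 1820, -1208, 324], q)
        + 8.0 * p * poly([-2, 33, -191, 562, -942, 916, -484, 108], q)
    )
    denominator = 8.0 * length ** 2 * (1.0 - 2.0 * q) ** 4 * (-1.0 + 2.0 * p + q) ** 4
    return -numerator / denominator
```

With Q = 0 the expression collapses to (2P − 8P² + 12P³ − 6P⁴) / (L²(1 − 2P)⁴), which is a convenient sanity check on any transcription of the coefficients.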

By an argument similar to that for the one-parameter model, we propose, on the basis of (12), the following form of a new estimator

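$$c(\hat{P}, \hat{Q})\frac{1}{L}\left[\hat{P}\left(\frac{1}{1 - 2\hat{P} - \hat{Q}}\right)^2 + \hat{Q}\left(\frac{1}{2 - 4\hat{P} - 2\hat{Q}} + \frac{1}{2 - 4\hat{Q}}\right)^2 - \left(\frac{\hat{P}}{1 - 2\hat{P} - \hat{Q}} + \frac{\hat{Q}}{2 - 4\hat{P} - 2\hat{Q}} + \frac{\hat{Q}}{2 - 4\hat{Q}}\right)^2\right] + d(\hat{P}, \hat{Q})\hat{S} \tag{13}$$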

for the two-parameter model, where Ŝ is the estimator of S obtained by replacing P and Q in S by P̂ and Q̂, respectively.

3. Results and Discussion

Given the forms of (10) and (13), we employ an empirical method to find suitable a(p̂), b(p̂), c(P̂, Q̂) and d(P̂, Q̂) such that the new estimators are close to the true variances. Many choices of a(p̂), b(p̂), c(P̂, Q̂) and d(P̂, Q̂) lead to better estimators for the variances of the one- and two-parameter models.

To obtain general formulas for a(p̂) and b(p̂) in the one-parameter model, we first use simulation to profile the relation between the true variance and the estimator (9), and then adopt a model selection method to derive a(p̂) and b(p̂). We initially fix a(p̂) = b(p̂) = 1 to obtain the new estimators. Because the difference between the true variances and the new estimators increases exponentially as b(p̂) increases, we assume that the coefficient terms in (10) are functions of p̂ and use nonlinear regression to obtain approximation formulas for a(p̂) and b(p̂). Although there are many possible choices of a(p̂) and b(p̂), we choose those that perform well for all the sequence lengths L in our simulation. The derivation of the coefficient terms c(P̂, Q̂) and d(P̂, Q̂) in (13) for the two-parameter model is similar to that for the one-parameter model.

From the above simulations, we propose

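$$V^*(K) = 0.6\,e^{9\hat{p}}\,\frac{\hat{p}(1-\hat{p})}{L\left(1 - \frac{4}{3}\hat{p}\right)^2} + \frac{8}{9}\frac{\hat{p}^2(1-\hat{p})^2}{L^2\left(1 - \frac{4}{3}\hat{p}\right)^4} \tag{14}$$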

and

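$$V^*(K_2) = \frac{0.56\,e^{10(\hat{P} + \hat{Q})}}{L}\left[\hat{P}\left(\frac{1}{1 - 2\hat{P} - \hat{Q}}\right)^2 + \hat{Q}\left(\frac{1}{2 - 4\hat{P} - 2\hat{Q}} + \frac{1}{2 - 4\hat{Q}}\right)^2 - \left(\frac{\hat{P}}{1 - 2\hat{P} - \hat{Q}} + \frac{\hat{Q}}{2 - 4\hat{P} - 2\hat{Q}} + \frac{\hat{Q}}{2 - 4\hat{Q}}\right)^2\right] + \hat{S} \tag{15}$$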

to be the new estimators of the variances for the one- and two-parameter models, respectively.
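A minimal Python sketch of the proposed estimators (14) and (15) follows. The function names are hypothetical, and for (15) the value of Ŝ is passed in as the argument `s_hat`, on the assumption that it has been computed separately from the expression for S.

```python
import math


def jc69_variance_new(p_hat: float, length: int) -> float:
    """Formula (14): 0.6 e^{9 p_hat} times the first-order term, plus the second-order term."""
    g = 1.0 - 4.0 * p_hat / 3.0
    first = p_hat * (1.0 - p_hat) / (length * g ** 2)
    second = 8.0 * p_hat ** 2 * (1.0 - p_hat) ** 2 / (9.0 * length ** 2 * g ** 4)
    return 0.6 * math.exp(9.0 * p_hat) * first + second


def k2p_variance_new(p_hat: float, q_hat: float, length: int, s_hat: float) -> float:
    """Formula (15): 0.56 e^{10(P+Q)} / L times the bracket of (4), plus S_hat."""
    a = 1.0 / (1.0 - 2.0 * p_hat - q_hat)
    b = 1.0 / (2.0 - 4.0 * p_hat - 2.0 * q_hat) + 1.0 / (2.0 - 4.0 * q_hat)
    bracket = a * a * p_hat + b * b * q_hat - (a * p_hat + b * q_hat) ** 2
    return 0.56 * math.exp(10.0 * (p_hat + q_hat)) * bracket / length + s_hat
```

Note that the empirical factor 0.6e^{9p̂} is below 1 when p̂ is smaller than ln(1/0.6)/9, roughly 0.057, so for very similar sequences the first term of (14) is smaller than the conventional estimator (2).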

To test the performance of formulas (14) and (15), we generate DNA sequences by using the evolver program in the PAML package (Yang, 1997). Several combinations of parameter values are used to generate different data sets: sequence length (L = 500, 1000 and 5000) and the expected number of nucleotide substitutions per site (0.1 to 0.7). For each data set, we generate 1000 pairs of sequences and calculate their corresponding K values from formula (1). We then calculate the sample variance of these 1000 values of K and use it as the true variance for that data set. A similar simulation procedure is used for Kimura’s two-parameter model, with the transition/transversion ratio set to 1, 2 and 5.
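The replicate-and-take-the-sample-variance procedure can be sketched in Python as below. This is not the evolver pipeline used in the paper: as a simplifying assumption, each replicate draws the number of differing sites directly from a binomial distribution with p = (3/4)(1 − e^{−4K/3}), so the resulting variances need not match the "true variance" values reported in Table 1. The function name and seed are hypothetical.

```python
import math
import random
import statistics


def simulate_jc69_estimates(k_true: float, length: int, replicates: int, seed: int = 0):
    """Return K estimates from formula (1) over independent binomial replicates."""
    rng = random.Random(seed)
    # Expected proportion of differing sites under the one-parameter model.
    p = 0.75 * (1.0 - math.exp(-4.0 * k_true / 3.0))
    estimates = []
    for _ in range(replicates):
        x = sum(1 for _ in range(length) if rng.random() < p)
        p_hat = x / length
        if p_hat >= 0.75:
            continue  # formula (1) is undefined here, so the replicate is skipped
        estimates.append(-0.75 * math.log(1.0 - 4.0 * p_hat / 3.0))
    return estimates


estimates = simulate_jc69_estimates(k_true=0.5, length=500, replicates=1000)
sample_variance = statistics.variance(estimates)
print(len(estimates), statistics.mean(estimates), sample_variance)
```

The sample variance printed at the end plays the role of the reference value against which a variance estimator would be compared under this simplified sampling model.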

Table 1 and Table 2 compare the new estimators (14) and (15) with the conventional estimators (2) and (4). For the one-parameter model, when the number of substitutions per site is low, the conventional estimator is not far from the true variance. For example, when the expected number of nucleotide substitutions per site is 0.1, the conventional estimator underestimates the true variance within a tolerable range. However, as the divergence increases, its performance deteriorates. When the divergence is greater than 0.2, the conventional estimator seriously underestimates the true variance for all the sequence lengths studied.

Table 1.

Comparison of the conventional estimator V(K) and the new estimator V*(K) for the one-parameter model

| Sequence length (L) | Expected number of substitutions per site | True variance | V(K) | V*(K) |
|---|---|---|---|---|
| 500 | 0.1 | 0.000362595 | 0.000219929 | 0.000311769 |
| | 0.2 | 0.001404189 | 0.000494168 | 0.001479758 |
| | 0.3 | 0.004145225 | 0.000830774 | 0.004766365 |
| | 0.4 | 0.010529535 | 0.001247488 | 0.012617401 |
| | 0.5 | 0.025591656 | 0.001776986 | 0.029761526 |
| | 0.6 | 0.061907183 | 0.002434374 | 0.063117721 |
| | 0.7 | 0.141074137 | 0.003261173 | 0.123747284 |
| 1000 | 0.1 | 0.000196551 | 0.000110974 | 0.0001567 |
| | 0.2 | 0.000716157 | 0.000247913 | 0.000735775 |
| | 0.3 | 0.00203738 | 0.000416255 | 0.002360282 |
| | 0.4 | 0.005212835 | 0.000625914 | 0.00626774 |
| | 0.5 | 0.013088052 | 0.000886346 | 0.014583643 |
| | 0.6 | 0.03068909 | 0.001209791 | 0.030645552 |
| | 0.7 | 0.073055629 | 0.001617717 | 0.059929941 |
| 5000 | 0.1 | 3.84488E-05 | 2.21349E-05 | 3.08939E-05 |
| | 0.2 | 0.000145207 | 4.93066E-05 | 0.000143897 |
| | 0.3 | 0.000403367 | 8.27823E-05 | 0.000460616 |
| | 0.4 | 0.000997212 | 0.000124208 | 0.001215598 |
| | 0.5 | 0.002500406 | 0.000175767 | 0.002822049 |
| | 0.6 | 0.005804204 | 0.000240054 | 0.005940681 |
| | 0.7 | 0.013515512 | 0.000320543 | 0.011582702 |

Table 2.

Comparison of the conventional estimator V2 and the new estimator V2* for the two-parameter model when the ratio of transition/transversion, k, is set to be 1, 2 or 5. d denotes the expected number of substitutions per site. All variance values are given in units of 10⁻³.

| L | d | True variance (k = 1) | V2 (k = 1) | V2* (k = 1) | True variance (k = 2) | V2 (k = 2) | V2* (k = 2) | True variance (k = 5) | V2 (k = 5) | V2* (k = 5) |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.1 | 0.4 | 0.2 | 0.3 | 0.4 | 0.2 | 0.3 | 0.4 | 0.2 | 0.3 |
| | 0.2 | 1.4 | 0.5 | 1.7 | 1.6 | 0.5 | 1.7 | 2.0 | 0.5 | 1.7 |
| | 0.3 | 4.2 | 0.8 | 5.7 | 4.7 | 0.9 | 5.8 | 7.4 | 0.9 | 6.1 |
| | 0.4 | 10 | 1.3 | 16 | 13 | 1.3 | 17 | 23 | 1.5 | 18 |
| | 0.5 | 26 | 1.8 | 41 | 35 | 1.9 | 42 | 72 | 2.3 | 45 |
| | 0.6 | 64 | 2.5 | 90 | 91 | 2.6 | 95 | 224 | 3.3 | 104 |
| | 0.7 | 149 | 3.3 | 186 | 237 | 3.6 | 194 | 678 | 4.8 | 220 |
| 1000 | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.1 | 0.2 | 0.2 | 0.1 | 0.2 |
| | 0.2 | 0.7 | 0.2 | 0.8 | 0.8 | 0.3 | 0.8 | 1.0 | 0.3 | 0.9 |
| | 0.3 | 2.1 | 0.4 | 2.8 | 2.4 | 0.4 | 2.9 | 3.6 | 0.5 | 3.0 |
| | 0.4 | 5.3 | 0.6 | 8.0 | 6.6 | 0.6 | 8.2 | 11 | 0.7 | 8.7 |
| | 0.5 | 13 | 0.9 | 20 | 18 | 0.9 | 20 | 36 | 1.1 | 22 |
| | 0.6 | 31 | 1.2 | 44 | 42 | 1.3 | 45 | 106 | 1.6 | 49 |
| | 0.7 | 75 | 1.6 | 89 | 104 | 1.8 | 92 | 300 | 2.3 | 102 |
| 5000 | 0.1 | 0.03 | 0.02 | 0.03 | 0.04 | 0.02 | 0.03 | 0.04 | 0.02 | 0.03 |
| | 0.2 | 0.1 | 0.05 | 0.2 | 0.2 | 0.05 | 0.2 | 0.2 | 0.05 | 0.2 |
| | 0.3 | 0.4 | 0.08 | 0.6 | 0.5 | 0.08 | 0.6 | 0.7 | 0.09 | 0.6 |
| | 0.4 | 1.0 | 0.1 | 1.5 | 1.2 | 0.1 | 1.6 | 2.1 | 0.1 | 1.6 |
| | 0.5 | 2.5 | 0.2 | 3.8 | 3.1 | 0.2 | 3.9 | 6.1 | 0.2 | 4.1 |
| | 0.6 | 5.8 | 0.2 | 8.4 | 7.9 | 0.3 | 8.6 | 17 | 0.3 | 9.2 |
| | 0.7 | 14 | 0.3 | 17 | 19 | 0.3 | 18 | 51 | 0.5 | 19 |

As seen from Table 1, the improved estimator can accurately estimate the true variance for the case where the expected number of nucleotide substitutions per site is 0.1 or 0.2. When the expected number of nucleotide substitutions per site is greater than 0.2, the improved estimator provides a much better estimator for the variance compared with the conventional one.
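The comparison can be made concrete by expressing each estimator as a ratio to the true variance. The Python sketch below does this for the L = 500 rows of Table 1; the variable names are hypothetical, and the numbers are copied from the table.

```python
# (expected substitutions per site, true variance, V(K), V*(K)) for L = 500, from Table 1.
TABLE1_L500 = [
    (0.1, 0.000362595, 0.000219929, 0.000311769),
    (0.2, 0.001404189, 0.000494168, 0.001479758),
    (0.3, 0.004145225, 0.000830774, 0.004766365),
    (0.4, 0.010529535, 0.001247488, 0.012617401),
    (0.5, 0.025591656, 0.001776986, 0.029761526),
    (0.6, 0.061907183, 0.002434374, 0.063117721),
    (0.7, 0.141074137, 0.003261173, 0.123747284),
]

ratios = []
for d, true_var, conventional, improved in TABLE1_L500:
    # A ratio of 1 means the estimator equals the simulated true variance.
    ratios.append((d, conventional / true_var, improved / true_var))

for d, ratio_conventional, ratio_improved in ratios:
    print(f"d={d:.1f}  V/true={ratio_conventional:.3f}  V*/true={ratio_improved:.3f}")
```

A ratio well below 1 indicates underestimation, so the two columns of the printout show directly how far each estimator is from the simulated reference at each divergence level.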

For the two-parameter model, Table 2 provides the simulation results for different transition/transversion ratios. It can be seen that the improved estimator outperforms the conventional estimator.

Although more sophisticated methods for estimating the number of nucleotide substitutions per site between two sequences (K) are available, the one- and two-parameter methods are still very widely used. In addition, the two-parameter method is used in Li et al. (1985), Li (1993), and Ina (1995) for estimating the number of substitutions per synonymous site and the number of substitutions per nonsynonymous site, and the method of Li (1993) is commonly used in the current literature. Therefore, accurate estimation of the variance of K for the one- and two-parameter methods is desirable. An alternative method used in the literature to improve the variance estimator is the bootstrap approach. However, this approach does not yield a closed form for the variance, so it requires heavier computation than the improved variance estimators derived in this paper. Our estimators have closed forms, so they can be easily applied or included in a computational package such as MEGA4.

In conclusion, the proposed new variance estimators provide substantial improvements in variance estimation. A computer program for the present variance estimations is available from the author upon request and will be made available for online calculation at a website in the near future.

Acknowledgments

This study was supported by Academia Sinica, Taiwan and NIH grant GM30998.


References

  1. Holmquist R. Theoretical foundations for a quantitative approach to paleogenetics. Part I: DNA. J Mol Evol. 1971;1:115–133. doi: 10.1007/BF01659159.
  2. Ina Y. New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J Mol Evol. 1995;40:190–226. doi: 10.1007/BF00167113.
  3. Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, editor. Mammalian Protein Metabolism. New York: Academic Press; 1969. pp. 21–132.
  4. Kaplan N, Risko K. A method for estimating rates of nucleotide substitution using DNA sequence data. Theor Popul Biol. 1982;21:318–328. doi: 10.1016/0040-5809(82)90021-1.
  5. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16:111–120. doi: 10.1007/BF01731581.
  6. Kimura M. Estimation of evolutionary distances between homologous nucleotide sequences. Proc Natl Acad Sci U S A. 1981;78:454–458. doi: 10.1073/pnas.78.1.454.
  7. Kimura M, Ohta T. On the stochastic model for estimation of mutational distance between homologous proteins. J Mol Evol. 1972;2:87–90. doi: 10.1007/BF01653945.
  8. Lanave C, Preparata G, Saccone C, Serio G. A new method for calculating evolutionary substitution rates. J Mol Evol. 1984;20:86–93. doi: 10.1007/BF02101990.
  9. Li WH. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993;36:96–99. doi: 10.1007/BF02407308.
  10. Li WH, Wu CI, Luo CC. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985;2:150–174. doi: 10.1093/oxfordjournals.molbev.a040343.
  11. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555.
