PLoS ONE. 2010 Jul 30;5(7):e11230. doi: 10.1371/journal.pone.0011230

Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models

Sergei Kosakovsky Pond, Wayne Delport, Spencer V Muse, Konrad Scheffler
Editor: Thomas Mailund
PMCID: PMC2912764  PMID: 20689581

Abstract

Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and on estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard F3×4 estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the F3×4 approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the F3×4-style estimators.

Introduction

Virtually all codon models in wide use today (see [1], [2] for recent reviews) are members of the class of finite-state, continuous-time reversible Markov chains, each defined by an instantaneous rate matrix $Q$. Transition matrices for finite amounts of time are found via the matrix exponential of $Q$, so the probability that a position initially occupied by codon $X$ is occupied by codon $Y$ after $t$ units of time is $P_{XY}(t) = \left(e^{Qt}\right)_{XY}$ (throughout the manuscript we will use upper-case letters to index codons and lower-case letters to index nucleotides). If $M$ is a model in this class, the individual entries of its rate matrix can be written in the canonical form $q_{XY} = r_{XY}\,\phi_{XY}$. The $r_{XY}$ can be thought of as “rate parameters” that govern the relative rates of substitutions between different codons, while the parameters $\phi_{XY}$ induce the equilibrium frequencies of the codons. The choice of $\phi_{XY}$ is the primary distinction between the two popular families of codon models: MG (introduced in [3]) and GY (introduced in [4]). How to best estimate the $\phi_{XY}$ (or, more precisely, how to estimate the model parameters that actually determine the $\phi_{XY}$) from sequence alignments is the focus of this note. In order to frame this discussion we need to define what we mean by empirical frequencies, model parameters and equilibrium frequencies (Figure 1). Given an observed alignment, the position-specific empirical nucleotide frequencies, $\hat{\pi}_n^p$, where $n$ is a nucleotide ($n \in \{A, C, G, T\}$) and $p$ the codon position ($p \in \{1, 2, 3\}$), can be estimated directly by counts from the data, and the empirical codon frequencies, $\hat{\pi}_{ijk}$, can be estimated by counts as well (the latter gives rise to the F61 codon frequency estimator [4]).
Either of these estimates can be used to set model parameters; however, typical alignments contain too little information to estimate empirical codon frequencies directly with a sufficient degree of confidence. Rather, the empirical nucleotide frequencies are used to set the nucleotide frequency parameters, $f_n^p$, and, by multiplication of their constituents, the codon frequency parameters, $f_i^1 f_j^2 f_k^3$. For example, in the original MG94 model of codon evolution [3], the equilibrium frequency of codon $ijk$ is given by $\pi_{ijk} = f_i f_j f_k / N$, where $N = 1 - f_T f_A f_A - f_T f_A f_G - f_T f_G f_A$. A common extension of this model, referred to as MG94 F3×4, allows the three codon positions to have their own nucleotide frequency parameters and leads to equilibrium codon frequencies expressed as:

$$\pi_{ijk} = \frac{f_i^1 f_j^2 f_k^3}{N} \qquad (1)$$

In this expression the superscripts indicate the codon position, and the equation for $N$ is modified in the obvious way. If we set all of the model nucleotide frequency parameters to be equal, i.e. $f_n^p = 1/4$, the result is equal equilibrium frequencies for all codons, i.e. $\pi_{ijk} = 1/61$ for all $ijk$. This vector of codon equilibrium frequencies allows us to easily tabulate, via marginalization, the equilibrium frequencies of each nucleotide at each position:

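$$
\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
\pi_n^1 & 16/61 & 16/61 & 16/61 & 13/61 \\
\pi_n^2 & 14/61 & 16/61 & 15/61 & 16/61 \\
\pi_n^3 & 14/61 & 16/61 & 15/61 & 16/61
\end{array}
\qquad (2)
$$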
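The marginalization behind equation (2) can be checked by direct enumeration. The following is a minimal sketch, assuming the universal genetic code (stop codons TAA, TAG, TGA); the function names are illustrative and are not taken from HyPhy or any other package.

```python
from fractions import Fraction
from itertools import product

NUCLEOTIDES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def sense_codons():
    """Return the 61 codons that are not stop codons."""
    codons = ("".join(c) for c in product(NUCLEOTIDES, repeat=3))
    return [c for c in codons if c not in STOP_CODONS]


def f3x4_codon_frequencies(f):
    """Equation (1): f[p][n] is the parameter for nucleotide n at position p (0-based)."""
    weights = {c: f[0][c[0]] * f[1][c[1]] * f[2][c[2]] for c in sense_codons()}
    scale = sum(weights.values())  # N = 1 - (total weight of the stop codons)
    return {c: w / scale for c, w in weights.items()}


def positional_nucleotide_frequencies(pi):
    """Marginalize codon frequencies to nucleotide frequencies at each position."""
    marginals = [{n: 0 for n in NUCLEOTIDES} for _ in range(3)]
    for codon, freq in pi.items():
        for position, nucleotide in enumerate(codon):
            marginals[position][nucleotide] += freq
    return marginals


# All twelve frequency parameters set to 1/4, using exact rational arithmetic.
uniform = [{n: Fraction(1, 4) for n in NUCLEOTIDES} for _ in range(3)]
pi = f3x4_codon_frequencies(uniform)
marginals = positional_nucleotide_frequencies(pi)

print(pi["ATG"])          # 1/61
print(marginals[0]["T"])  # 13/61
print(marginals[1]["A"])  # 14/61
```

Exact fractions are used so that the departure from 1/4 is visible without floating-point noise.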

Figure 1. Relationships between empirical frequencies, frequency parameters and equilibrium frequencies in codon models.


Note that there are only 13 occurrences of T in the first position, 14 of A in the second position, etc., because the model explicitly disallows the stop codons (TAG, TAA, TGA), as is standard for all other codon models. The finding from this exercise is that when one sets all the $f_n^p = 1/4$, each of the codon equilibrium frequencies, $\pi_{ijk}$, takes the anticipated value of $1/61$. However, remarkably, the equilibrium nucleotide frequencies generated by this model are not the anticipated $1/4$. For instance, the equilibrium frequency of T at the first position is $\pi_T^1 = 13/61 \approx 0.213$. Traditionally, the empirical nucleotide frequencies are used to set nucleotide frequency parameters, and it is therefore assumed that the induced equilibrium nucleotide frequencies are equal to those observed in the alignment. However, given that the nucleotide composition of stop codons is not accounted for, this practice is flawed, because in general $\pi_n^p \neq f_n^p$. The conflation of frequency parameters ($f_n^p$) and equilibrium nucleotide frequencies ($\pi_n^p$) results in incorrect estimates of equilibrium nucleotide (and codon) frequencies, as demonstrated in (2) above. This phenomenon is not restricted to the MG family of models. It is simple to demonstrate the exact same behavior for the GY family of models, again because the nucleotide frequency parameters in the rate matrix are incorrectly set equal to empirical nucleotide frequencies. We show that the traditional identification of frequency parameters with observed nucleotide frequencies leads to a cascade of problems. Model frequency parameters are estimated with bias, which leads to biased estimation of the equilibrium codon frequencies, which in turn leads to compensatory biased estimation of the substitution rate parameters.
We propose two alternatives, a corrected empirical estimator and a maximum likelihood frequency parameterization, show that neither suffers from this bias, and therefore advocate their use in codon models.

Materials and Methods

To ensure clarity of presentation, we first carefully introduce the necessary notation (summarized in Figure 1). For a given substitution model, let $\pi_{ijk}$ be the frequency of sense codon $ijk$ (with $ijk$ ranging over the 61 sense codons) in its equilibrium distribution, and let $\pi_n^p$, $n \in \{A, C, G, T\}$, $p \in \{1, 2, 3\}$, be the equilibrium frequency of nucleotide $n$ in codon position $p$. When necessary, we will indicate specific models via a superscript (i.e., MG or GY). The position-specific nucleotide equilibrium frequencies, $\pi_n^p$, are uniquely determined by the codon equilibrium frequencies, $\pi_{ijk}$, through marginalization; e.g. $\pi_T^1$ is simply the sum of frequencies of the 13 sense codons that have a T in their first position, as in equation (2).

These equilibrium frequencies, of both nucleotides and codons, have traditionally been assumed equal to empirical frequencies observed in a sequence alignment, $\hat{\pi}_n^p$ or $\hat{\pi}_{ijk}$, and used to set model parameters. If the specified model is correct, $\hat{\pi}_n^p$ converges to $\pi_n^p$ and $\hat{\pi}_{ijk}$ to $\pi_{ijk}$ as the sequence length increases. (However, note that this result requires that the evolutionary process itself be at equilibrium; many important biological mechanisms, notably directional positive selection, are likely to disrupt equilibrium; see [5]–[7].)

Because the simple example in equation (2) demonstrated that the model nucleotide frequency parameters and the equilibrium nucleotide frequencies are not synonymous, we strive to obtain an expression that relates the equilibrium nucleotide frequencies to the model nucleotide frequency parameters, $f_n^p$, and by extension to the observed empirical frequencies. Even though the MG and GY models treat equilibrium codon frequencies differently, it is a fortunate coincidence that in either case the $\pi_{ijk}$ have identical forms when written in terms of $f_n^p$. Given twelve MG nucleotide frequency parameters, only nine of which are independent because $\sum_n f_n^p = 1$ for each position $p$, the equilibrium frequency of codon $ijk$ induced by their values is as in equation (1).

By using $\hat{\pi}_n^p$ to directly estimate $f_n^p$ in equation (1), one obtains the popular F3×4 estimator of codon equilibrium frequencies, by far the most common estimator used in the literature for both MG and GY classes of models. The statistical and computational appeal of F3×4 lies in its use of only 9 nucleotide parameters to describe 61 codon frequencies. However, the key shortcut, direct estimation of nucleotide frequency parameters with empirical nucleotide frequencies from the data, is flawed. The empirical nucleotide frequencies are unbiased estimates of the true equilibrium frequencies; unfortunately, the model parameters they are being used to estimate are something different. Thus, a fundamental problem with current practices is that use of the F3×4 estimators with either MG or GY models leads to biased estimates of the $f_n^p$, and in turn the $\pi_{ijk}$. As we will show below, the problems do not end there, and lead to biased estimation of other model parameters.
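As a concrete illustration of the shortcut described above, the sketch below counts position-specific nucleotides in an in-frame alignment and plugs the resulting empirical frequencies directly into the product form of equation (1). The function names and the toy sequences are hypothetical, and the universal genetic code is assumed; this is not the HyPhy implementation.

```python
from collections import Counter
from itertools import product

NUCS = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code


def empirical_positional_frequencies(sequences):
    """Count nucleotides at codon positions 1-3 (indices 0-2) of in-frame sequences."""
    counts = [Counter() for _ in range(3)]
    for seq in sequences:
        for index, char in enumerate(seq.upper()):
            if char in NUCS:  # gaps and ambiguity codes are skipped
                counts[index % 3][char] += 1
    freqs = []
    for position_counts in counts:
        total = sum(position_counts.values())
        freqs.append({n: position_counts[n] / total for n in NUCS})
    return freqs


def f3x4(freqs):
    """Plug-in codon frequencies: product of positional frequencies, stop codons removed."""
    weights = {}
    for i, j, k in product(NUCS, repeat=3):
        codon = i + j + k
        if codon not in STOPS:
            weights[codon] = freqs[0][i] * freqs[1][j] * freqs[2][k]
    scale = sum(weights.values())  # plays the role of N in equation (1)
    return {codon: w / scale for codon, w in weights.items()}


toy_alignment = ["ATGGCCAAA", "ATGGCTAAG"]  # hypothetical in-frame sequences
freqs = empirical_positional_frequencies(toy_alignment)
codon_freqs = f3x4(freqs)
print(freqs[0]["A"], len(codon_freqs))
```

Only the three tables of four nucleotide frequencies are estimated from the data, which is what makes the estimator cheap; the point of the surrounding text is that these plug-in values are not the parameters that reproduce the observed nucleotide composition.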

We first present two approaches for correcting these estimation errors. The obvious, but more computationally demanding method is to estimate the Inline graphic by maximum likelihood along with other model parameters. We dub this approach Inline graphic. Theory suggests that estimates from this methodology will have all the desirable properties of maximum likelihood estimation. Maximum likelihood estimation of these values has been available in some software packages, e.g. in HyPhy [8], for a number of years, but to our knowledge it has rarely been used.

The second strategy, described here for the first time, relies on finding an expression for the induced equilibrium frequency of nucleotide $n$ at codon position $p$ ($\pi_n^p$) as a function of $f_n^p$. Since the $f_n^p$ define codon equilibrium frequencies (equation 1), we can readily obtain such equations by marginalization:

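$$
\pi_n^1 = \frac{f_n^1 \left(1 - \sum_{njk \in S} f_j^2 f_k^3\right)}{N}, \qquad
\pi_n^2 = \frac{f_n^2 \left(1 - \sum_{ink \in S} f_i^1 f_k^3\right)}{N}, \qquad
\pi_n^3 = \frac{f_n^3 \left(1 - \sum_{ijn \in S} f_i^1 f_j^2\right)}{N}
\qquad (3)
$$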

Here, $N$ is simply scaling for the absence of stop codons: $N = 1 - \sum_{ijk \in S} f_i^1 f_j^2 f_k^3$, and $S$ defines the set of stop codons. The corrected F3×4, or CF3×4, estimator equates $\pi_n^p$ with observed nucleotide frequencies $\hat{\pi}_n^p$, and then solves the nonlinear system (3) for $f_n^p$ to obtain estimates of the latter. Because $\sum_n \pi_n^p = 1$ for each position $p$, the above system reduces to 9 nonlinear equations relating 9 independent observed statistics ($\hat{\pi}_n^p$, e.g. for $n \in \{A, C, G\}$) to 9 independent model parameters $f_n^p$. We were unable to obtain a closed-form solution to the system, but it can be easily solved numerically at a negligible computational cost.
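One way to solve the system numerically is sketched below. It is an illustration under stated assumptions, not the solver used in HyPhy: it uses a simple fixed-point iteration (rescale each parameter by the ratio of observed to induced frequency, then renormalize each position), assumes the universal genetic code, and assumes every observed frequency is strictly positive. The function names are hypothetical.

```python
from itertools import product

NUCS = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}  # assumption: universal genetic code
SENSE = [c for c in ("".join(t) for t in product(NUCS, repeat=3)) if c not in STOPS]


def induced_frequencies(f):
    """Positional nucleotide frequencies induced by parameters f (marginalized sense codons)."""
    weights = {c: f[0][c[0]] * f[1][c[1]] * f[2][c[2]] for c in SENSE}
    scale = sum(weights.values())  # N = 1 - (total weight of the stop codons)
    pi = [dict.fromkeys(NUCS, 0.0) for _ in range(3)]
    for codon, w in weights.items():
        for position, nucleotide in enumerate(codon):
            pi[position][nucleotide] += w / scale
    return pi


def solve_corrected_parameters(observed, tol=1e-12, max_iter=1000):
    """Find f whose induced frequencies match `observed` (all entries must be > 0)."""
    f = [dict(position) for position in observed]  # start at the uncorrected plug-in values
    for _ in range(max_iter):
        induced = induced_frequencies(f)
        new_f = []
        for p in range(3):
            raw = {n: f[p][n] * observed[p][n] / induced[p][n] for n in NUCS}
            total = sum(raw.values())
            new_f.append({n: v / total for n, v in raw.items()})  # keep sum-to-one
        delta = max(abs(new_f[p][n] - f[p][n]) for p in range(3) for n in NUCS)
        f = new_f
        if delta < tol:
            break
    return f


# Observed frequencies taken from equation (2): induced by all parameters equal to 1/4.
observed = [
    {"A": 16 / 61, "C": 16 / 61, "G": 16 / 61, "T": 13 / 61},
    {"A": 14 / 61, "C": 16 / 61, "G": 15 / 61, "T": 16 / 61},
    {"A": 14 / 61, "C": 16 / 61, "G": 15 / 61, "T": 16 / 61},
]
corrected = solve_corrected_parameters(observed)
print(corrected[0]["T"])  # close to 0.25, whereas the plug-in value is 13/61
```

Any general-purpose root finder could be substituted for the iteration; the fixed-point form is used here only because it needs no external libraries and preserves the sum-to-one constraint at every step.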

We conducted simulations to further investigate the effects of biases in the equilibrium frequencies on parameters typically estimated using phylogenetic models. We generated two-sequence codon alignments with uniform codon frequency composition (Inline graphic). We used Inline graphic as substitution bias parameters in the MG94xREV model [9], and set the nonsynonymous/synonymous substitution rate ratio Inline graphic to Inline graphic. The two sequences were Inline graphic divergent on average, and the length of the alignment, Inline graphic, was one of Inline graphic, Inline graphic or Inline graphic codons. Inline graphic replicates were generated for each value of Inline graphic. We compared the fits of Inline graphic, Inline graphic and Inline graphic on simulated data sets, and furthermore compared simulated to inferred parameter estimates with each of the three frequency parameterizations. In addition to the simulated data, we fitted all three frequency parameterizations to a sample of Inline graphic alignments from the carefully curated Pandit database [10]. All alignments were chosen to contain between 10 and 20 sequences and at least 200 reliably aligned codon sites. Given that each estimator has the same number of independent parameters (Inline graphic), an improvement in log-likelihood under one of the models is considered as evidence in favor of the better fitting model, e.g. under the BIC [11] criterion. All new estimators for the MG94 class of models are implemented in HyPhy.
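The reasoning behind comparing log-likelihoods directly can be written out using the standard definition of BIC [11]. The symbols $k$ (number of free parameters), $n$ (sample size) and $\hat{L}$ (maximized likelihood) are introduced here for illustration only.

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
\quad\Longrightarrow\quad
\mathrm{BIC}_1 - \mathrm{BIC}_2 = -2\left(\ln \hat{L}_1 - \ln \hat{L}_2\right)
\quad \text{when } k_1 = k_2 \text{ and both models are fitted to the same data.}
```

With equal parameter counts the penalty terms cancel, so the model with the higher log-likelihood always has the lower BIC.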

Results and Discussion

We simulated data with a uniform codon frequency composition and fitted all three frequency parameterizations for alignments of various sequence lengths. The suboptimal nature of the Inline graphic estimator is immediately apparent from Figure 2a, where the improvement in Inline graphic scores of the model equipped with the corrected estimator Inline graphic is shown. For all replicates, the Inline graphic estimator yielded better Inline graphic, with median improvements of Inline graphic, Inline graphic, and Inline graphic (for Inline graphic, and Inline graphic codons respectively), or approximately Inline graphic likelihood points per codon site. Note that as the sample size increased, the estimators from (3) effectively matched the performance of the maximum likelihood estimator (Figure 2b). Even more importantly, the use of the Inline graphic frequency estimator led to biased inference of other model parameters. Maximum likelihood estimates of some substitution rates were biased under the Inline graphic, and the bias was progressively more pronounced with increasing sample size (Figure 2c). Indeed, for Inline graphic, a simple likelihood ratio test rejected the (true) null of Inline graphic at Inline graphic for all Inline graphic replicates. Biased MLEs of the substitution rate parameter Inline graphic is a result of the under/overestimates of Inline graphic and Inline graphic using Inline graphic. Similar results were seen for the other Inline graphic. To our relief, the maximum likelihood estimate (MLE) for the Inline graphic ratio was not noticeably affected even for the largest sample size (mean Inline graphic, median Inline graphic, IQR Inline graphic under Inline graphic; mean Inline graphic, median Inline graphic, IQR Inline graphic under Inline graphic, Figure 2d).

Figure 2. Comparison of frequency parameterizations fitted to simulated alignments.


The top row (A,B) shows the comparison of Inline graphic scores on simulated data obtained with different corrected frequency estimates; C) Bias in the estimate of the substitution rate Inline graphic in near-asymptotic regime (Inline graphic) is apparent under Inline graphic, but does not exist for the other two estimators; D) variance of the Inline graphic estimate for Inline graphic is reduced with increasing sample size.

For the Pandit alignments Inline graphic values were, of course, higher for the models estimated using Inline graphic than for those using Inline graphic. However, the magnitudes of the differences were impressive (median Inline graphic, IQR Inline graphic, max Inline graphic). The Inline graphic estimator improved the Inline graphic score of the Inline graphic estimator for over Inline graphic(Inline graphic) of the alignments by a median of Inline graphic points; in the remaining cases the median decrease in Inline graphic score was Inline graphic points. As with the simulated data, the MLEs of Inline graphic were largely unaffected by the choice of frequency estimators (but there were some datasets where the difference was large), while some substitution rate estimates appeared biased (Figure 3). For example, the estimates of Inline graphic were strongly linearly correlated between Inline graphic and Inline graphic methods (Inline graphic), but the regression line was estimated as Inline graphic, which recapitulates the downward bias observed on simulated data (if the estimates were unbiased, we would expect an intercept of zero and slope of one).

Figure 3. The effect of the frequency estimator on the inference of Inline graphic and Inline graphic (relative to the Inline graphic rate) substitution rate from Inline graphic alignments sampled from the Pandit database [10].


The estimate of Inline graphic under Inline graphic is biased downwards relative to Inline graphic.

We have demonstrated through simulations that the almost universally used F3×4 estimator of equilibrium frequencies in codon substitution models is biased, and we have pointed out how a misinterpretation of standard codon model parameters is responsible for these biases. Although this bias appears to have little effect on estimation of “composite” parameters such as the nonsynonymous/synonymous rate ratio ($\omega$) and branch lengths (results not shown), it has considerable damaging effects on the estimation of substitution rate parameters in the instantaneous rate matrix. This problem will become acutely relevant as researchers pursue finer-scale studies of the evolutionary process, such as developing substitution models with protein residue-dependent codon substitution rates [12], [13]. Since the computational burden of the F3×4 estimator is virtually identical to that of our proposed CF3×4 estimator, which in turn is only marginally faster than the maximum likelihood estimator, we recommend the use of either of the alternatives offered in this manuscript over the F3×4 estimator. Our current recommendation is to obtain CF3×4 estimates and use them to initialize the optimization procedure for the maximum likelihood estimator to speed up convergence.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This research was supported by the Joint Division of Mathematical Sciences/National Institute of General Medical Sciences Mathematical Biology Initiative through Grant NSF-0714991, the National Institutes of Health, AI47745 and by a University of California, San Diego Center for AIDS Research/NIAID Developmental Award to S.L.K.P. (AI36214). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol. 2009;26:255–271. doi: 10.1093/molbev/msn232.
  • 2. Delport W, Scheffler K, Seoighe C. Models of coding sequence evolution. Brief Bioinform. 2009;10:97–109. doi: 10.1093/bib/bbn049.
  • 3. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11:715–724. doi: 10.1093/oxfordjournals.molbev.a040152.
  • 4. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11:725–736. doi: 10.1093/oxfordjournals.molbev.a040153.
  • 5. Seoighe C, Ketwaroo F, Pillay V, Scheffler K, Wood N, et al. A model of directional selection applied to the evolution of drug resistance in HIV-1. Mol Biol Evol. 2007;24:1025–1031. doi: 10.1093/molbev/msm021.
  • 6. Kosakovsky Pond SL, Poon AFY, Leigh Brown AJ, Frost SDW. A maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza A virus. Mol Biol Evol. 2008;25:1809–1824. doi: 10.1093/molbev/msn123.
  • 7. Lacerda M, Scheffler K, Seoighe C. Epitope discovery with phylogenetic hidden Markov models. Mol Biol Evol. 2010. doi: 10.1093/molbev/msq008.
  • 8. Kosakovsky Pond SL, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21:676–679. doi: 10.1093/bioinformatics/bti079.
  • 9. Kosakovsky Pond SL, Muse SV. Site-to-site variation of synonymous substitution rates. Mol Biol Evol. 2005;22:2375–2385. doi: 10.1093/molbev/msi232.
  • 10. Whelan S, de Bakker PIW, Quevillon E, Rodriguez N, Goldman N. Pandit: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. 2006;34:D327–D331. doi: 10.1093/nar/gkj087.
  • 11. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
  • 12. Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Mol Biol Evol. 2007;24:1464–1479. doi: 10.1093/molbev/msm064.
  • 13. Conant GC, Stadler PF. Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol. 2009;26:1155–1161. doi: 10.1093/molbev/msp031.

Articles from PLoS ONE are provided here courtesy of PLOS
