Abstract
Motivation: In a nucleotide or amino acid sequence, not all sites evolve at the same rate, due to differing selective constraints at each site. Currently in computational molecular evolution, models incorporating rate heterogeneity always share two assumptions. First, the rate of evolution at each site is assumed to be independent of every other site. Second, the values of these rates are assumed to be drawn from a known prior distribution. Although often assumed to be small, the actual effect of these assumptions has not been previously quantified in the literature.
Results: Herein we describe an algorithm to simultaneously infer the set of n−1 relative rates that parameterize the likelihood of an n-site alignment. Unlike previous work (a) these relative rates are completely identifiable and distinct from the branch-length parameters, and (b) a far more general class of rate priors can be used, and their effects quantified. Although described in a Bayesian framework, we discuss a future maximum likelihood extension.
Conclusions: Using both synthetic data and alignments from the Myc, Max and p53 protein families, we find that inferring relative rather than absolute rates has several advantages. First, both empirical likelihoods and Bayes factors show strong preference for the relative-rate model, with a mean Δ ln P=−0.458 per alignment site. Second, the computed likelihoods and Bayes factors were essentially independent of the relative-rate prior, indicating that good estimates of the posterior rate distribution are not required a priori. Third, a novel finding is that rates can be accurately inferred even when up to ≈4 substitutions per site have occurred. Thus biologically relevant putative hypervariable sites can be identified as easily as conserved sites. Lastly, our model treats rates and tree branch-lengths as completely identifiable, allowing for the first time coherent simultaneous inference of branch-lengths and site-specific evolutionary rates.
Availability: Source code for the utility described is available under a BSD-style license at http://www.fernandes.org/txp/article/9/site-specific-relative-evolutionary-rates.
Contact: andrew@fernandes.org
Supplementary information: Supplementary data is available at Bioinformatics online.
1 BACKGROUND
In a nucleotide or amino acid sequence, the rate of evolution at a given site is expected to vary according to the specific selective constraints at that site. Thus we expect a priori that not all sites evolve at the same rate (Corbin and Uzzell, 1970). Sites that are under strong selective constraints should be relatively highly conserved, while sites under lesser selective pressure should be more variable. In essence, the observed evolutionary rate corresponds to the level of purifying selection at that site (Kimura, 1983). We know that, in a phylogenetic analysis, not accounting for this rate heterogeneity can yield misleading results (Felsenstein, 2001; Yang, 1994, 1996; Yang and Kumar, 1996). Therefore, correctly modeling rate heterogeneity is important both for correct phylogenetic reconstruction and the discrimination of conserved from non-conserved sites.
Traditionally, given data D consisting of a fixed n-site alignment and tree topology, the likelihood of observing D given rates r = [r_1, r_2, …, r_n] and branch-lengths t = [t_1, t_2, …, t_m] is
$$P(D \mid r, t) = \prod_{i=1}^{n} P_i(r_i t) \tag{1}$$
where P_i denotes the likelihood of site i. In this model the likelihood of each site is independent of every other site. Furthermore, the set of rates r and branch-lengths t are not completely identifiable for any dataset because P_i(r_i t) = P_i(r_i s^{−1} · s t) for any s > 0. In other words, halving the rates and doubling the branch-lengths yields the same likelihood.
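The rate/branch-length trade-off can be checked numerically. The sketch below is illustrative only: it assumes a two-sequence Jukes-Cantor model (not the wag model used later in this paper), and the names `jc69_site_likelihood`, `alignment_likelihood`, `pattern`, `rates`, `t` and `s` are hypothetical.

```python
import math


def jc69_site_likelihood(same: bool, rate: float, branch: float) -> float:
    """Likelihood of one site in a two-sequence alignment under Jukes-Cantor."""
    d = rate * branch  # expected substitutions per site
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability the state is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other state
    # The equilibrium frequency of the first sequence's state is 1/4.
    return 0.25 * (p_same if same else p_diff)


def alignment_likelihood(pattern, rates, branch):
    """Product of independent site likelihoods, as in equation (1)."""
    like = 1.0
    for same, r in zip(pattern, rates):
        like *= jc69_site_likelihood(same, r, branch)
    return like


pattern = [True, True, False, True]   # True: the two sequences agree at the site
rates = [0.2, 0.5, 2.5, 0.8]
t = 0.3
s = 2.0

original = alignment_likelihood(pattern, rates, t)
# Halve every rate and double the branch length: the products r_i * t are unchanged.
rescaled = alignment_likelihood(pattern, [r / s for r in rates], s * t)
```

Because only the products r_i t enter the likelihood, `original` and `rescaled` agree to rounding error, whereas rescaling the rates alone changes the likelihood.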
One of the first attempts to use one rate per site to estimate the overall likelihood was made by Swofford et al. (1996), using maximum likelihood, in the DNArates program. However, Felsenstein (2001, 2004) subsequently cautioned that the ‘one rate-parameter per site’ model may lead to an ill-conditioned maximum likelihood model since the number of model parameters increases linearly with the number of alignment sites.
To regularize the likelihood calculation (1), Uzzell and Corbin (1971), followed by Nei et al. (1976), assumed that rather than being fixed, each rate was drawn from a known prior distribution. Each rate was further assumed independent of every other rate. The likelihood of each site could then be integrated independently over all possible rates. More formally, they calculated
$$P(D \mid t) = \prod_{i=1}^{n} \int_{\mathbb{R}_+} P_i(r_i t)\, f(r_i)\, dr_i \tag{2}$$
where f(ri) denotes the density of the rate prior and ℝ+ denotes the non-negative reals. The unit mean gamma distribution was historically used for f because it often yields analytically tractable models. For calculations that are less amenable to analytic results, the discrete gamma approximation, first popularized by Yang (1994), has become the de facto standard rate prior in molecular evolution.
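As an illustration of why the unit-mean gamma prior is analytically convenient, the sketch below evaluates the per-site integral in (2) for a toy case in two ways: by brute-force midpoint quadrature and via the gamma moment generating function. It assumes a two-sequence Jukes-Cantor site with identical states; the function names, the integration bounds and the step count are hypothetical choices, not part of the original method.

```python
import math


def gamma_unit_mean_density(r, alpha):
    """Density of a gamma distribution with shape alpha and mean 1."""
    return alpha ** alpha * r ** (alpha - 1.0) * math.exp(-alpha * r) / math.gamma(alpha)


def jc69_identical_site(d):
    """Two-sequence Jukes-Cantor likelihood of a site with identical states."""
    return 0.25 * (0.25 + 0.75 * math.exp(-4.0 * d / 3.0))


def integrated_site_likelihood(t, alpha, upper=40.0, steps=40000):
    """Midpoint-rule approximation of the per-site integral in (2)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        total += jc69_identical_site(r * t) * gamma_unit_mean_density(r, alpha)
    return total * h


def closed_form(t, alpha):
    """The same integral, using E[exp(-c r)] = (1 + c / alpha) ** (-alpha)."""
    return 0.25 * (0.25 + 0.75 * (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha))
```

For example, `integrated_site_likelihood(0.5, 2.0)` and `closed_form(0.5, 2.0)` should agree closely; for realistic trees no such closed form is available, which is where the discrete gamma approximation is typically used.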
Unfortunately, the assumptions inherent in (2) result in two undesirable, yet unavoidable, consequences. First, enforcing a unit mean constraint on the prior f does not constrain the posterior rates in any useful manner, as can be seen in sample calculations from recent versions of phyml (Guindon and Gascuel, 2003) or mrbayes (Ronquist and Huelsenbeck, 2003). Thus rates and branch lengths remain mathematically unidentifiable in this model. The rate4site program by Pupko et al. (2002) takes the re-normalization approach suggested by Meyer and von Haeseler (2003) whereby rates and branch lengths are estimated by alternately inferring rates given the lengths, then the lengths given the rates. At each step, the rates are re-normalized to have a unit mean. The consequences of inferring rates (and phylogeny) without rigorously dealing with rate/time non-identifiability have not been quantified or formally investigated. A more detailed discussion of this issue can be found in the Supplementary Material.
The second undesirable consequence inherent in (2) is that the actual distribution f must be either specified or estimated. Most frequently, a unit-mean gamma distribution is assumed, and the shape parameter α of that distribution is estimated. Several attempts at addressing the shortcomings of the unit-mean gamma rate prior have been undertaken, most notably with Gu et al. (1995) who augmented the gamma distribution with an estimated proportion of invariant sites (where ri = 0). Mayrose et al. (2005a) advocated using a mixture of gamma distributions, while Pond and Frost (2005) used more general parameterized distributions.
Again, the consequences of inferring rates (and phylogeny) under the influence of these priors are not known. For instance, we may compare a model using one gamma prior with another using a two-gamma mixture. If the two-gamma mixture does not yield a significantly better model, we may erroneously conclude that the single-gamma model is a ‘good’ approximation of the ‘true’ set of rates. In fact, this observation only supports the conclusion that under the class of n-gamma mixture priors, n = 1 is sufficient. For instance, since all gamma distributions have exponentially decreasing tails, this class of priors does not include models with heavy tails. To properly assess the effect of the prior, the class of rate priors should ideally be as large (in some sense) as possible.
In order to quantify the impact of (2) on rate inference, we inferred site-specific rates for both synthetic data and alignments from the Myc, Max and p53 protein families. Specifically, we infer rates r = [r_1, r_2, …, r_n] under the constraint that
$$\sum_{i=1}^{n} r_i = n \tag{3}$$
Computationally, (3) is much more stringent than the constraint that the distribution f in (2) have unit mean. The constraint lets us model the n rates via n−1 relative-rate parameters. Relative rates are advantageous compared to absolute rates because relative rates are completely identifiable from branch lengths in likelihood calculations. This advantage comes at a price, however, in that it becomes non-trivial to integrate likelihoods over the space of all rates, subject to constraint (3).
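One way to see how n rates under a unit-mean constraint reduce to n−1 free parameters is the sketch below. It uses a softmax-style (additive log-ratio) map with the last site as the reference; this particular parameterization and the name `rates_from_free_parameters` are illustrative assumptions, not the parameterization used by the authors.

```python
import math


def rates_from_free_parameters(phi):
    """Map n-1 unconstrained reals to n positive rates whose mean is 1."""
    logits = list(phi) + [0.0]           # last coordinate fixed as the reference
    m = max(logits)                      # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    n = len(weights)
    # Scale the point on the unit simplex so that the rates sum to n.
    return [n * w / total for w in weights]
```

For instance, `rates_from_free_parameters([0.3, -1.2, 2.0])` returns four positive rates that sum to 4, whatever the three free values are.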
As written, constraint (3) implies that rates are modeled as fixed effects, and not with the random-effects model more commonly assumed by traditional maximum likelihood models. If a random-effects model is preferred, (3) could be re-written to imply that ∑_i r_i ∼ M(μ) for some distribution M with mean rate μ. In doing so, however, we lose identifiability between rates and branch-lengths, and drastically reduce the stability and convergence rate of our algorithm (data not shown).
2 RESULTS AND DISCUSSION
A Markov Chain Monte Carlo (MCMC) approach was used to integrate (1) under constraint (3) over all possible relative rates. Both simulated and real data were used to compare our relative-rate model with the best unit-mean gamma, independent-rate model. A Bayesian framework was adopted for three reasons. First, previous work suggested that empirical Bayesian methods were significantly better than likelihood methods when inferring site-specific rates (Mayrose et al., 2004). Second, unlike the independent-rate assumption, constraint (3) precludes the use of simplifying numeric approaches such as Gaussian quadrature (Fernandes and Atchley, 2006) to integrate over all possible rates. Lastly, since the relative-rate model is not nested in the absolute-rate model, comparing their model fits via likelihood is not trivial. Instead, Bayes factors (Kass and Raftery, 1995), which implicitly correct for differences in parameter dimension, are used for comparison. Throughout, we have assumed without loss of generality that branch lengths are fixed while inferring rates. Since rates and branch lengths are completely independent in our model, it is implicit that lengths could be simultaneously inferred in parallel with rates.
2.1 Rate priors
Our method is based on Bayesian techniques and thus requires specification of a relative-rate prior distribution; we assume implicitly that parameters must have well-defined posterior sampling distributions. As we will discuss, this prior is markedly different from those used for absolute-rate models. Furthermore, a rate prior is also required for maximum likelihood inference. To see why, recall that as long as a site is not completely conserved, P_i(r_i t) approaches a positive, non-zero constant as r_i t → ∞. Thus if f(r_i) in (2) were constant, the integral of their product would be infinite. In fact, the likelihood P_i(r_i t) is, in general, not a density with respect to r_i. Therefore, even in a maximum likelihood setting a rate prior is required to regularize the likelihood function. Non-integrable likelihood functions can sometimes be regularized with straightforward methods, such as in the case of Gaussian mixture models (Wasserman, 2000). Unfortunately, under the independent-rates assumption, such regularizations are not possible. Furthermore, it is difficult to quantify the precise effect that a family of priors will have on the final inference. For more discussion of this topic, see the Supplementary Material.
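The divergence can be made concrete with a worked example. Assuming, purely for illustration, a two-sequence Jukes-Cantor model (not the model used in this paper) and a site at which the two sequences differ, the site likelihood tends to a positive constant, so its integral against a flat prior over [0, R] grows without bound as R increases:

```latex
P_i(r_i t) = \tfrac{1}{4}\left(\tfrac{1}{4} - \tfrac{1}{4}\, e^{-4 r_i t / 3}\right)
\;\xrightarrow[\; r_i t \to \infty \;]{}\; \tfrac{1}{16},
\qquad
\int_0^{R} P_i(r_i t)\, dr_i
= \tfrac{1}{16}\left[ R - \tfrac{3}{4t}\left(1 - e^{-4 R t / 3}\right) \right]
\;\xrightarrow[\; R \to \infty \;]{}\; \infty .
```

The leading term R/16 shows that the integral diverges linearly, which is why some form of regularization is needed.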
Often the required regularization constraint is ‘hidden’ within a method. For example, a well known early study by Kelly and Rice (1996) describes a purportedly ‘priorless’ rate inference procedure. In reality, their posterior rate distribution is estimated by using the moment generating function, which in itself is estimated through the eigenvalues of the infinitesimal rate (mutation) matrix. However, since the rate matrix itself is constrained to have a unit-mean rate, the moments of their posterior rate distribution are automatically implicitly constrained, analogously to constraint (3).
Although seemingly a subtle change, constraint (3) changes the situation significantly. Rather than integrating over the infinite domain r ∈ ℝ^n_+ (the n-dimensional orthant of non-negative reals), we now integrate over the finite domain r ∈ 𝕊n, the (n − 1)-dimensional unit simplex. Examples of familiar, low-dimensional simplexes are shown in Supplementary Figure A1. The non-informative, and in this case maximum entropy, prior f(r_i) = 1 becomes perfectly admissible. Such a simple prior may not be the optimal choice, however; there is a large literature describing the selection of priors based on systematic and formal rules (Berger, 2006; Kass and Wasserman, 1996). Denoting θ_i = r_i/n as the scaled relative rate, the most common priors over the domain 𝕊n are P(θ) ∝ (θ_1 θ_2 ⋯ θ_n)^{−δ}, δ ∈ [0,1). δ = 1/2 yields Jeffreys’ prior (Jeffreys, 1946), while δ → 1 yields Jaynes’ invariant Haar-measure prior (Jaynes, 1968; Syversveen, 1998). Unfortunately, both of these priors are based on examination of the multinomial likelihood function and are not appropriate for inferring rates. For instance, they imply that r_i → n is just as probable as r_i → 0, even though it is biologically assumed that very high mutation rates (hundreds of times the mean rate) are quite unlikely. In fact, we found that all formulaic recipes for the construction of objective priors (Berger, 2006; Bernardo and Ramon, 1998; Bernardo and Smith, 1994; Kass and Wasserman, 1996) failed when applied to phylogenetic likelihoods since these likelihoods (a) are not densities with respect to r, assuming independence or (b) have variance increasing linearly with n, assuming relative rates.
Therefore, we chose to investigate inferential differences resulting from the use of two different priors based on intuitively reasonable assumptions. First, the uniform prior P(r_i) ∝ 1 was selected as an appropriate comparison for a ‘prior-less’ maximum likelihood-type situation. Second, the unit-exponential P(r_i) ∝ exp(−r_i) was selected to represent the idea that very high substitution rates are anticipated to be unlikely. Note that the relative-rate unit-exponential prior is not conceptually or computationally identical to assuming f(r_i) = exp(−r_i) in (2) due to the action of constraint (3).
2.2 Simulation study
To assess the behavior of our method, we inferred the rates of a synthetic dataset designed to mimic an experimentally ideal situation. Our synthetic dataset consisted of 100 sequences of 2000 sites with no gaps. All descendants were taken to be t = 1 time-units away from the ancestor, and the ancestral sequences were drawn from the wag (Whelan and Goldman, 2001) equilibrium density. Rates were equally log-spaced from 10^{−3} to just under 10, with a mean of exactly 1. The prior was unit-exponential per site. A box-plot of the posterior rate distributions is shown in Figure 1. The solid sigmoidal curve denotes the site rate mean, smoothed across adjacent sites. The dotted line denotes the original rate of the simulation. Although not shown, virtually identical results were attained under the uniform prior, with no discernible qualitative differences between plots.
Fig. 1.
Synthetic data consisted of 100 sequences of 2000 sites with no gaps. All descendants were taken to be t = 1 time-units away from the ancestor, and the ancestral sequences were drawn from the wag equilibrium density. Rates were equally log-spaced from 10^{−3} to just under 10, with a mean of exactly 1. The prior was unit-exponential per rate. The solid sigmoidal curve denotes the site rate mean, smoothed across adjacent sites. The dotted line denotes the original rate of the simulation. Although not shown, virtually identical results were attained under the uniform prior.
When rates are low, few substitutions are observed, leading to two effects on inference. First, given only 100 sequences, there is no observed difference between, say, a rate of 10^{−3} and 10^{−2.3}. At each of these rates, it is unlikely that even one substitution has occurred. Therefore, given a constraint that the mean rate equals one, highly conserved sites will have their rates biased upwards. Figure 1 shows that significant departures of the mean estimated rate from the true rate occur when r_i ≲ 10^{−1.6}. Second, the variance of the estimated rate becomes large as the rate decreases, again as shown in the figure box-plots. This increased variance can be understood by using the analogy of estimating the rate parameter of a Poisson process when the observed event is rare. In the Poisson case, the Fisher information for the logarithm of the rate is proportional to the expected number of events, which by assumption is small. Hence, on the logarithmic scale of the figure, the variance of the estimated rate of a conserved position will be large.
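The Poisson analogy can be written out explicitly. The symbols below are assumptions introduced for illustration: N is the number of events observed over an exposure T for a process with rate λ.

```latex
N \sim \mathrm{Poisson}(\lambda T), \qquad
\hat{\lambda} = \frac{N}{T}, \qquad
I(\lambda) = \frac{T}{\lambda}, \qquad
I(\log \lambda) = \lambda T = \mathrm{E}[N],
\qquad
\frac{\mathrm{Var}[\hat{\lambda}]}{\lambda^{2}} = \frac{1}{\lambda T} = \frac{1}{\mathrm{E}[N]} .
```

The relative (log-scale) variance of the estimate is therefore the reciprocal of the expected number of events, so it grows without bound as events become rare.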
For rates between ≈10^{−1.6} and ≈10^{0.60} ≈ 4.0 the mean inferred rate is almost completely coincident with the actual rate. We found the magnitude of the upper bound rate (4.0) surprising, since it implied that evolutionary rates could be accurately inferred even when, on average, four substitution events occurred between every observed sequence in the test data. Prior experience with other biological datasets led us to expect that such a high substitution rate would be indistinguishable from complete randomization (r_i → ∞). Figure 1 shows that for a suitable dataset there is a considerable, discernible difference between high substitution rates and randomization. We hypothesize that most substitution events given by amino acid evolution models substitute amino acids primarily within the same ‘similarity’ class: aliphatic, aromatic, charged and so on. Since estimating rates considers substitution both within and between amino acid similarity classes, with enough data our method appears able to accurately estimate the rate even when multiple substitutions occur. In other words, over short times isoleucine will frequently substitute with leucine, but over long times a substitution to glutamine is highly informative as to the true underlying rate. As compared with the lower range, the middle range of substitution rates appears to have significantly less variance associated with it.
At greater than ri≈4.0, Figure 1 shows that the sequences do become randomized with respect to each other, overwhelming even inter-class substitution events. Rather than estimate an excessively large rate, however, constraint (3) appears to bias the inferred rate downward. Thus, the model appears to be self-limiting with respect to high evolutionary rates without the a priori assumption of an exponential rate prior. Note that although the variance of high-rate parameters appears to be relatively small in the figure, the logarithmic scaling of the ordinate implies a larger variance than is visually evident.
2.3 Model comparison
For given fixed alignment and phylogenetic tree data D, both Maximum Likelihood (ML) and Bayesian estimations of the posterior rate distribution were performed. Alignments were initially computed with T-Coffee (Notredame et al., 2000) and then refined by inspection. Phylogenetic trees were inferred by phyml (Guindon and Gascuel, 2003) using an optimized gamma model of rate heterogeneity. The wag substitution matrix (Whelan and Goldman, 2001) was used throughout.
2.4 Protein families
Three proteins from two distinct families were studied to compare our relative-rate model to the more traditional independent-rate model. Specifically, Myc and Max, and two variants of p53 alignments were selected due to our familiarity with these families.
The Myc-Max-Mad network of basic-Helix-Loop-Helix (bHLH) transcription factor proteins is essential for control of cell growth, proliferation, differentiation and apoptosis. Myc is a well-established oncogene whose deregulated expression is responsible for a wide range of human cancers (Grandori et al., 2000; Luscher, 2001). A comprehensive analysis of phylogeny and conservation in the bHLH-leucine-zipper (bHLHz) domain of a diverse set of Myc and Max homologs was performed by Atchley and Fernandes (2005) and is utilized herein.
In contrast, p53 belongs to the β-sandwich-domain family of DNA-binding transcription factors (Berardi et al., 1999; Rudolph and Gergen, 2001) and is structurally independent of the bHLHz family. A detailed phylogenetic study of the p53 family has been presented by Fernandes and Atchley (2008). To mimic the situation where relatively few, closely related proteins are available for study, a subset of the p53 sequences, denoted p53R, was also analyzed.
2.5 Bayes factors
There is no straightforward procedure to contrast maximum likelihood and Bayesian models, but we and others have found that Bayes factors (Kass and Raftery, 1995) can be used to construct intuitively meaningful and statistically valid comparisons. Taking an approach similar to MrBayes, we start with the independent site, gamma rate-prior model M_I and use Bayes factors to compare it to our relative-rate model M_R. Given model M_I, data D, a set of n independent-rate parameters r = [r_1, r_2, …, r_n], a shape parameter α, a likelihood model P(D ∣ r, α, M_I) and prior distributions P(r ∣ α, M_I) and P(α), Bayes’ Theorem allows us to calculate
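$$P(D \mid M_I) = \iint P(D \mid r, \alpha, M_I)\, P(r \mid \alpha, M_I)\, P(\alpha)\, dr\, d\alpha \;\le\; \iint P(D \mid r, \alpha, M_I)\, P(r \mid \alpha, M_I)\, \delta(\alpha - \alpha_{\max})\, dr\, d\alpha = P(D \mid \alpha_{\max}, M_I)$$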
where δ denotes Dirac's delta function and α_max is the ML estimate of α. Since rates are independent under M_I, the first integration is standard and straightforward. The second integration over α acknowledges that we cannot know the ‘correct’ value of α exactly. Following standard Bayesian theory then, we draw it from some prior distribution P(α). Thus P(D ∣ M_I) will be maximal only if α is known precisely a priori and can only decrease as uncertainty about α increases. Thus, we use P(D ∣ α_max, M_I) as a ‘best case’ conservative estimate of P(D ∣ M_I). The ratio of P(D ∣ M_I) to P(D ∣ M_R), known as the Bayes factor, indicates the relative weight of evidence supporting competing models M_I or M_R given the data.
Estimating P(D ∣ M_R) from the MCMC samples of the posterior likelihood is numerically challenging (Kass and Raftery, 1995). To estimate it, we utilized the stabilized harmonic mean estimator (Satagopan et al., 2000) as provided by the model_p program of BaliPhy (Redelings and Suchard, 2005; Suchard and Redelings, 2006). Comparative results are shown in Table 1. There, ‘Max ln L’ denotes the maximum likelihood solution, ‘ln P(M)’ denotes the Bayesian probability of the relative-rate model, ‘Max ln P’ denotes the maximum probability found for the relative-rate model during MCMC simulations and ‘Min Δln (P)’ denotes the minimum possible log-probability difference between the relative- and absolute-rate hypotheses; in other words, the smallest possible Bayes factor. More negative values indicate stronger support for the relative-rate model. Confidence intervals on ln P(M) are shown, along with the calculated Δln P/site. The latter is shown so that comparisons can be drawn between alignments of greatly different lengths.
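To illustrate the kind of calculation involved, the sketch below computes the plain harmonic mean estimate of a log marginal likelihood from sampled log-likelihoods, working in log space to avoid underflow. It is deliberately the unstabilized estimator, not the stabilized variant of Satagopan et al. (2000) and not the code of the model_p program; the function name is hypothetical.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods given their logarithms.

    Uses 1 / P(D) ~ (1 / N) * sum_i 1 / L_i, evaluated with log-sum-exp.
    """
    negated = [-x for x in log_likelihoods]
    m = max(negated)  # factor out the largest term for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in negated))
    return math.log(len(negated)) - log_sum
```

The estimate is dominated by the smallest sampled likelihoods, which is one reason the plain estimator has high variance and stabilized versions are preferred in practice.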
Table 1.
Comparison of model fit likelihoods and posterior probabilities for the independent- and relative-rate models
The rows are described in the main text. The mean ΔlnP/site=−0.458 (not including the synthetic data) and strongly implies preference for the relative-rate model.
For all examples studied, the minimum Bayes factor strongly supported the relative-rate model over site independence, with log-differences ranging in magnitude from ≈47 to ≈3371. According to Jeffreys’ scale (Jeffreys, 1961; Kass and Raftery, 1995) where differences of 2–10 are considered decisive, this represents overwhelming evidence in support of the relative model. Using long sampling times, the widths of the ln P(M) confidence intervals were shortened to be insignificant compared to the magnitude of the Bayes factor. Since the magnitude of phylogenetic likelihoods tends to scale linearly with the number of alignment sites, the Δln P/site was also calculated for each dataset. The values, ranging from −0.316 to −0.596, indicate that even short alignments of about 10 sites would overwhelmingly favor the relative-rate model.
The next most intriguing result displayed in Table 1 is the complete insensitivity of the model probability to changes in the rate prior. Bayes factors are known to sometimes display extraordinary sensitivity to choice of prior (Kass and Greenhouse, 1989; Kass and Raftery, 1995). For the relative-rate model, however, no significant differences were detectable between the uniform and unit-exponential relative-rate prior: all differences were less than half the width of the model probability confidence interval. Again, we emphasize that the unit-exponential prior of the relative-rate model is not comparable to a unit-exponential independent-rate model. Although posterior probabilities are not significantly different between priors, a detailed comparison of the posterior densities would be required to recommend either as a suitable default.
2.6 Gamma shapes
Although the posterior rate distribution for the relative-rate model cannot be approximated by the independent gamma model, Figure 2 shows the distribution of ‘best fit’ gamma shape parameters across MCMC samples. Black circles denote the maximum likelihood shape parameter solution, while short horizontal bars indicate the mean shape parameter, along with quartiles and ranges. The ML shape parameter was always found to be outside the interquartile range of possible shapes. In the case of p53 and the synthetic datasets, the differences were substantial and indicate that the relative-rate posterior is significantly different than that implied by the independent-rate model.
Fig. 2.
Boxplots show the approximate distribution of estimated shape parameters from the MCMC integration; narrow internal lines show the mean shape estimate. Filled circles show the estimated shape parameter for the same system under maximum likelihood. In all cases the posterior shape distribution appears significantly different than that found by maximum likelihood.
Simply comparing ML shapes to ‘best approximating’ relative-rate shapes, however, fails to capture just how significantly different the posterior rate distributions are between ℝ^n_+ and 𝕊n. For instance, the best linear unbiased (BLU) estimator of central tendency in ℝ^n_+ is the arithmetic mean. For 𝕊n the geometric mean (Pawlowsky-Glahn and Egozcue, 2002) is preferable. Furthermore, since the rates in 𝕊n are by definition non-independent, the relative-rate posterior cannot be summarized by a scalar statistic.
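The two notions of central tendency can be compared directly. The sketch below computes the componentwise arithmetic mean and the closed (re-normalized) geometric mean of a set of points on the unit simplex; the function names and the sample values in the note that follows are hypothetical.

```python
import math


def closure(x, total=1.0):
    """Rescale positive values so that they sum to `total`."""
    s = sum(x)
    return [total * v / s for v in x]


def arithmetic_center(samples):
    """Componentwise arithmetic mean of simplex-valued samples."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def geometric_center(samples):
    """Componentwise geometric mean, re-closed onto the unit simplex."""
    n = len(samples)
    geo = [math.exp(sum(math.log(v) for v in col) / n) for col in zip(*samples)]
    return closure(geo)
```

For the two samples `[0.01, 0.99]` and `[0.5, 0.5]`, the arithmetic center of the first component is 0.255, whereas the geometric center is roughly 0.09: the geometric mean gives far more weight to components that are close to the simplex boundary.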
Figure 3 illustrates just how different posterior rate estimates can be by comparing their best unit-gamma approximations. Shown are histograms of the posterior rate distribution for the relative-rate model taking either (a) the arithmetic or (b) geometric mean of all MCMC samples, along with best approximating shapes. The differences in distributions are striking, especially for the illustrated Max and p53R datasets. Also shown are (c) the best approximating shapes for the maximum likelihood (independent-rate) model and (d) the mean relative-rate shape. In other words, the histograms show the mean rate of all MCMC samples, while the remaining curves show the best gamma distribution between MCMC samples (the mean of the sample shapes versus the shape of the mean rates). These figures support the idea that under the relative-rate model, the resultant posterior is inherently multivariate and cannot be correctly summarized by statistics of the marginals.
Fig. 3.
The inferred distribution of rates for Max, showing the across-sample arithmetic and geometric means, as well as the best fit unit-gamma distribution shape parameter approximations between MCMC samples and of the final posterior mean.
3 CONCLUSIONS
Current models in molecular evolution almost universally assume site independence to model rate heterogeneity. The unstated assumption is that as the number of sites increases, the independence model will become asymptotically more correct. Our results indicate that no simple relationship exists between the independent- and relative-rate models. The independent-rate model is conceptually simple although it requires considerable parameterization or foreknowledge of the rate prior and complicates branch length inference, requiring numerous regularization assumptions. The relative-rate model automatically encompasses a much greater class of rate prior without parameterization, results in better model fits, allows simultaneous branch-length inference, but is somewhat more computationally complex.
To see that no simple relationship exists between the models, consider the angle between the surface of 𝕊n and one of its bordering coordinate hyperplanes. Simple geometric arguments show that the radial angle between these subspaces is arcsin(√((n − 1)/n)). As n → ∞ this angle approaches π/2, implying that as the dimension increases, the relative-rate model becomes orthogonal to any independent-rate model (minus one site). Although not a formal argument, this observation suggests that the relative-rate model cannot be easily approximated via an independent-rate model.
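The geometric claim can be checked from first principles. The sketch below assumes the angle in question is the dihedral angle between the hyperplane containing the simplex (all coordinates summing to one) and a coordinate hyperplane (one coordinate equal to zero), computed from their unit normals; the function name is hypothetical.

```python
import math


def simplex_face_angle(n):
    """Angle between the plane sum(x) = 1 and the plane x_1 = 0 in n dimensions."""
    # Unit normals are (1, ..., 1) / sqrt(n) and (1, 0, ..., 0);
    # their inner product is 1 / sqrt(n).
    return math.acos(1.0 / math.sqrt(n))
```

For n = 2 the angle is π/4, and it increases monotonically towards π/2 as n grows; it equals arcsin(√((n − 1)/n)) for every n ≥ 1.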
Comparisons with rate4site (Pupko et al., 2002) show similar marginal rate posteriors (data not shown). Differences are primarily observed when the marginal mean rate is either small or large. Thus, if it is known a priori that a given rate prior is appropriate for a given dataset, there may be no compelling reason to use the more accurate relative-rate model. It has been shown, however, that mixtures of gamma distributions often provide substantial model improvements in many situations (Mayrose et al., 2005a, 2005b). If little is known about the actual underlying rate distribution, then the relative-rate model is preferable since it does not require parameter estimation.
Again, we emphasize that although the marginal distributions computed by the independent- and relative-rate models often appeared qualitatively similar, Figures 2 and 3 show that the intrinsic correlation present in the relative-rate model makes between-model comparisons of marginal distributions virtually meaningless.
3.1 Posterior summarization
As shown in the figures, characterizing the rate posterior is not trivial. The Dirichlet distribution is often used as a summary distribution on the unit simplex, and can be readily fit to the posterior (Minka, 2003). However, Aitchison (1986) argues that the restrictive Dirichlet covariance structure makes it surprisingly unsuitable for describing distributions on 𝕊n. Variants of the log-normal distribution are the preferred alternative. This alternative may hypothetically be used to study rate-heterogeneity covariance.
3.2 Hypervariability
Since substitution rates ≤4 substitutions per unit time appear to be resolvable if there is enough data, we postulate that hypervariability can be meaningfully defined for sites where the posterior rate is significantly and substantially greater than one. More investigation into the biological relevance of these sites is needed. In particular, preliminary observations indicate that some hypervariable sites are identifiable as homologous sites ‘sandwiched’ between conserved residues. However, other sites consist primarily of gaps, which are treated somewhat like an indeterminate amino acid in standard likelihood calculations. Therefore, hypervariability may be biologically relevant in some situations, but not others.
3.3 ML formulation
Although presented in a Bayesian context, a maximum likelihood approach could be accommodated in an unconstrained optimization framework via compositional data analysis (Aitchison, 1986). Specifically, the isometric log-ratio (ILR) transformation (Egozcue et al., 2003) can be used to construct a diffeomorphism between 𝕊n and ℝ^{n−1} under the standard Euclidean metric, with Jacobian J ∝ (θ_1 θ_2 ⋯ θ_n)^{−1}. From an information-theoretic view, the ILR transformation uses this Jacobian as the invariant Haar measure on 𝕊n and is equivalent to the use of Jaynes’ prior (Jaynes, 1968). Thus rather than adaptively integrating over 𝕊n as MCMC strives to do, it should be possible to find the most likely point θ ∈ 𝕊n via unconstrained optimization, and hence find r = nθ.
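A minimal sketch of an ILR transformation and its inverse is given below, so that an unconstrained optimizer could work on the transformed coordinates. It assumes one standard orthonormal (Helmert-type) basis; other bases are equally valid, and the function names are hypothetical. An n-part composition maps to n−1 unconstrained coordinates.

```python
import math


def ilr(theta):
    """Isometric log-ratio coordinates of a composition with positive parts."""
    logs = [math.log(v) for v in theta]
    coords = []
    running = 0.0
    for i in range(1, len(theta)):
        running += logs[i - 1]
        # Contrast the mean log of the first i parts with part i + 1.
        coords.append(math.sqrt(i / (i + 1)) * (running / i - logs[i]))
    return coords


def ilr_inverse(coords):
    """Map unconstrained coordinates back to a point on the unit simplex."""
    d = len(coords) + 1
    clr = [0.0] * d
    for i, z in enumerate(coords, start=1):
        coef = z * math.sqrt(i / (i + 1))
        for j in range(i):
            clr[j] += coef / i
        clr[i] -= coef
    m = max(clr)  # subtract the maximum before exponentiating
    weights = [math.exp(c - m) for c in clr]
    total = sum(weights)
    return [w / total for w in weights]
```

Given a maximizer `z` found in the unconstrained space, the rates would then follow as `[n * v for v in ilr_inverse(z)]` with `n = len(z) + 1`.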
4 METHODS
It has been suggested that MCMC sampling over 𝕊n can be done by utilizing Dirichlet-distributed proposals (Larget and Simon, 1999). Our experience suggests otherwise: when n is large, sampling efficiency using Dirichlet proposals becomes intolerably low. The efficiency becomes particularly bad as θ∈𝕊n approaches the boundary. Unfortunately, such approaches to the boundary are common, as they occur for all conserved sites.
To understand why Dirichlet sampling is inefficient, suppose we are given the current Markov chain state as θ. A new state θ′ is selected via
θ′ ∼ Dirichlet(sθ),    (4)
where s is a scalar scale factor. Under this parameterization, E[θ′]=θ and Var[θ′] scales approximately as 1/s. As θ approaches the simplex boundary, s must become very large to avoid inflating the sampling variance of θ′. A large value of s, however, implies that θ′−θ must be small. Small MCMC sample differences imply long autocorrelation times, and hence intolerably inefficient sampling.
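The scaling argument can be checked numerically. The sketch below assumes the proposal takes the form θ′ ∼ Dirichlet(sθ) with ∑iθi=1, which is consistent with the stated mean and variance; under that assumption the variance of component i is θi(1−θi)/(s+1). The function names are hypothetical.

```python
import numpy as np

def proposal_moments(theta, s):
    """Exact mean and variance of theta' ~ Dirichlet(s * theta), with sum(theta) == 1."""
    theta = np.asarray(theta, dtype=float)
    mean = theta.copy()
    var = theta * (1.0 - theta) / (s + 1.0)
    return mean, var

def relative_step(theta, s):
    """Proposal standard deviation of each component relative to its current value."""
    _, var = proposal_moments(theta, s)
    return np.sqrt(var) / np.asarray(theta, dtype=float)

# One component sits close to the simplex boundary, as for a conserved site
theta = np.array([0.001, 0.333, 0.666])
for s in (10.0, 1000.0, 100000.0):
    print(s, relative_step(theta, s))
```

The near-boundary component has by far the largest relative step at any fixed s, so an s large enough to control it leaves the remaining components almost stationary.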
Instead, we developed a two-step MCMC sampling procedure with much higher sampling efficiency. If each marginal θi ∼ Exponential(1), then θ/∑iθi ∼ Dirichlet(1,1,…,1) is a standard result (Devroye, 1986). Therefore, given a current state θ, a new state θ′ can be generated by the following procedure:
1. For each component θi of θ, a new component θ′i is sampled via MCMC such that the stationary distribution of θ′i is unit-exponential.
2. A secondary MCMC step is performed using θ′i/∑iθ′i and the phylogenetic likelihood function.
3. Repeat, using n individual θi parameters to hold the ‘state’ of the n−1 relative rates.
Thus the proposal function itself is first sampled via MCMC, and the resulting point is used to sample the relevant posterior. The procedure works because the acceptance or rejection of a given step is always, by definition, independent of the previous state. Furthermore, the sum of the state variables is statistically independent of each normalized component θi/∑jθj (Devroye, 1986). The efficiency of the algorithm is quite high, as the exponential scaling of the marginals ensures that the new sample scales optimally along each dimension of the simplex.
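One possible reading of this procedure is sketched below; it is not the released implementation. The inner kernel is assumed to be a log-scale random-walk Metropolis step, which is one of many kernels that leave the unit-exponential distribution invariant. A toy multinomial log-likelihood stands in for the phylogenetic likelihood. All function names, the step size `sigma` and the toy likelihood are assumptions made for illustration.

```python
import numpy as np

def exp_kernel_step(theta, rng, sigma=0.3):
    """Step 1: update each component with a Metropolis move that leaves Exp(1) invariant."""
    prop = theta * np.exp(sigma * rng.standard_normal(theta.size))
    # Log-scale random walk: target ratio exp(theta - prop) times Jacobian prop / theta
    log_a = (theta - prop) + np.log(prop) - np.log(theta)
    accept = np.log(rng.random(theta.size)) < log_a
    return np.where(accept, prop, theta)

def toy_log_likelihood(p, counts):
    """Stand-in for the phylogenetic likelihood (assumption): multinomial in p."""
    return float(np.sum(counts * np.log(p)))

def two_step_sampler(counts, n_iter, seed=0):
    """Return n_iter samples of the normalized state theta / sum(theta)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    theta = rng.exponential(1.0, size=counts.size)  # n unnormalized state variables
    loglik = toy_log_likelihood(theta / theta.sum(), counts)
    samples = np.empty((n_iter, counts.size))
    for t in range(n_iter):
        cand = exp_kernel_step(theta, rng)
        cand_ll = toy_log_likelihood(cand / cand.sum(), counts)
        # Step 2: the inner kernel is reversible with respect to the exponential
        # prior, so only the likelihood ratio enters the acceptance test
        if np.log(rng.random()) < cand_ll - loglik:
            theta, loglik = cand, cand_ll
        samples[t] = theta / theta.sum()
    return samples

samples = two_step_sampler([5, 3, 2], n_iter=20000, seed=1)
```

With the toy likelihood the normalized samples should approximate a Dirichlet(counts+1) posterior, which gives a simple check that the two steps combine into a valid sampler. Because the inner moves are multiplicative, components near zero take proportionally small steps.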
ACKNOWLEDGEMENTS
A. D. F. would like to thank Lindi M. Wahl and Gregory B. Gloor for funding and mentorship. Data processing and analysis were done with R (R Development Core Team, 2008).
Funding: National Institutes of Health (GM45344); North Carolina State University; the Alexander von Humboldt Stiftung; the Canadian Institutes of Health Research and the Natural Sciences and Engineering Research Council of Canada.
Conflict of Interest: none declared.
REFERENCES
- Aitchison J. Monographs on Statistics and Applied Probability. New York, London: Chapman and Hall; 1986. The Statistical Analysis of Compositional Data.
- Atchley WR, Fernandes AD. Sequence signatures and the probabilistic identification of proteins in the myc-max-mad network. Proc. Natl Acad. Sci. USA. 2005;102:6401–6406. doi: 10.1073/pnas.0408964102.
- Berardi MJ, et al. The Ig fold of the core binding factor αRunt domain is a member of a family of structurally and functionally related Ig-fold DNA-binding domains. Structure. 1999;7:1247–1256. doi: 10.1016/s0969-2126(00)80058-1.
- Berger J. The case for objective bayesian analysis. Bayesian Anal. 2006;1:385–402.
- Bernardo JM, Ramon JM. An introduction to bayesian reference analysis: inference on the ratio of multinomial parameters. J. R. Stat. Soc. D. 1998;47:101–135.
- Bernardo JM, Smith A. Bayesian Theory. New York: John Wiley and Sons; 1994.
- Corbin K, Uzzell T. Natural selection and mutation rates in mammals. Am. Nat. 1970;104:37–53.
- Devroye L. Non-uniform random variate generation. 1986. Available at http://cg.scs.carleton.ca/~luc/rnbookindex.html (last accessed August 11, 2008).
- Egozcue JJ, et al. Isometric logratio transformations for compositional data analysis. Math. Geol. 2003;35:279–300.
- Felsenstein J. Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol. 2001;53:447. doi: 10.1007/s002390010234.
- Felsenstein J. Inferring Phylogenies. Sunderland, MA: Sinauer Associates; 2004.
- Fernandes AD, Atchley WR. Gaussian quadrature formulae for arbitrary positive measures. Evol. Bioinform. 2006;2:261–269.
- Fernandes AD, Atchley WR. Biochemical and functional evidence of p53 homology is inconsistent with molecular phylogenetics for distant sequences. J. Mol. Evol. 2008;67:51–67. doi: 10.1007/s00239-008-9124-2.
- Grandori C, et al. The myc/max/mad network and the transcriptional control of cell behavior. Annu. Rev. Cell Dev. Biol. 2000;16:653–699. doi: 10.1146/annurev.cellbio.16.1.653.
- Gu X, et al. Maximum-likelihood-estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 1995;12:546–557. doi: 10.1093/oxfordjournals.molbev.a040235.
- Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 2003;52:696–704. doi: 10.1080/10635150390235520.
- Jaynes ET. Prior probabilities. IEEE T. Syst. Sci. Cyb. 1968;4:227–241.
- Jeffreys H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A. 1946;186:453–461. doi: 10.1098/rspa.1946.0056.
- Jeffreys H. Theory of Probability. 3rd edn. Oxford: Clarendon Press; 1961.
- Kass R, Greenhouse J. Comments on “investigating therapies of potentially great benefit: ECMO”. Stat. Sci. 1989;4:310–317.
- Kass R, Raftery A. Bayes factors. J. Am. Stat. Assoc. 1995;90:773–795.
- Kass RE, Wasserman L. The selection of prior distributions by formal rules. J. Am. Stat. Assoc. 1996;91:1343–1370.
- Kelly C, Rice J. Modeling nucleotide evolution: a heterogeneous rate analysis. Math. Biosci. 1996;133:85–109. doi: 10.1016/0025-5564(95)00083-6.
- Kimura M. The neutral theory of molecular evolution. In: Nei M, Koehn R, editors. Evolution of Genes and Proteins. Sunderland, MA: Sinauer Associates; 1983. pp. 208–233.
- Larget B, Simon D. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 1999;16:750–759.
- Luscher B. Function and regulation of the transcription factors of the myc/max/mad network. Gene. 2001;277:1–14. doi: 10.1016/s0378-1119(01)00697-7.
- Mayrose I, et al. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol. Biol. Evol. 2004;21:1781–1791. doi: 10.1093/molbev/msh194.
- Mayrose I, et al. A gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics. 2005a;21:151–158. doi: 10.1093/bioinformatics/bti1125.
- Mayrose I, et al. Site-specific evolutionary rate inference: taking phylogenetic uncertainty into account. J. Mol. Evol. 2005b;60:345–353. doi: 10.1007/s00239-004-0183-8.
- Meyer S, von Haeseler A. Identifying site-specific substitution rates. Mol. Biol. Evol. 2003;20:182–189. doi: 10.1093/molbev/msg019.
- Minka TP. Estimating a dirichlet distribution. Technical report. 2003. Microsoft Research. Available at http://research.microsoft.com/~minka/papers/dirichlet/ (last accessed August 11, 2008).
- Nei M, et al. Infinite allele model with varying mutation rate. Proc. Natl Acad. Sci. USA. 1976;73:4164–4168. doi: 10.1073/pnas.73.11.4164.
- Notredame C, et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042.
- Pawlowsky-Glahn V, Egozcue J. BLU estimators and compositional data. Math. Geol. 2002;34:259–274.
- Pond SLK, Frost SDW. A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol. 2005;22:223–234. doi: 10.1093/molbev/msi009.
- Pupko T, et al. Rate4site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18(Suppl. 1):S71–S77. doi: 10.1093/bioinformatics/18.suppl_1.s71.
- R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
- Redelings BD, Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 2005;54:401–418. doi: 10.1080/10635150590947041.
- Ronquist F, Huelsenbeck JP. Mrbayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. doi: 10.1093/bioinformatics/btg180.
- Rudolph MJ, Gergen JP. DNA-binding by Ig-fold proteins. Nat. Struct. Mol. Biol. 2001;8:384–386. doi: 10.1038/87531.
- Satagopan J, et al. Technical Report 382. University of Washington; 2000. Easy estimation of normalizing constants and Bayes factors from posterior simulation: stabilizing the harmonic mean estimator.
- Suchard MA, Redelings BD. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–2048. doi: 10.1093/bioinformatics/btl175.
- Swofford D, et al. Phylogenetic inference. In: Hillis D, editor. Molecular Systematics. 2nd edn. Sinauer: Sunderland, Massachusetts; 1996. pp. 407–514.
- Syversveen AR. Technical Report 3/98. Institutt for Matematiske Fag.; 1998. Noninformative Bayesian priors. interpretation and problems with construction and applications.
- Uzzell T, Corbin K. Fitting discrete probability distributions to evolutionary events. Science. 1971;172:1089–1096. doi: 10.1126/science.172.3988.1089.
- Wasserman L. Asymptotic inference for mixture models using data-dependent priors. J. R. Stat. Soc. B. 2000;62:159–180.
- Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 2001;18:691–699. doi: 10.1093/oxfordjournals.molbev.a003851.
- Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 1994;39:306–314. doi: 10.1007/BF00160154.
- Yang Z. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 1996;11:367–372. doi: 10.1016/0169-5347(96)10041-0.
- Yang ZH, Kumar S. Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites. Mol. Biol. Evol. 1996;13:650–659. doi: 10.1093/oxfordjournals.molbev.a025625.