Abstract
The invariable site plus $\Gamma$ model (I+$\Gamma$) is widely used to model rate heterogeneity among alignment sites in maximum likelihood and Bayesian phylogenetic analyses. The proof that the I+continuous $\Gamma$ model is identifiable (model parameters can be inferred correctly given enough data) has increased the credibility of its application to phylogeny reconstruction. However, most phylogenetic software implements the I+discrete $\Gamma$ model, whose identifiability is likely but unproven. How well the parameters of the I+discrete $\Gamma$ model are estimated is still disputed. In particular, the correlation between the fraction of invariable sites and the fraction of sites with a slow evolutionary rate is discussed as being problematic. We show that optimization heuristics as implemented in frequently used phylogenetic software (PhyML, RAxML, IQ-TREE, and MrBayes) cannot always reliably estimate the shape parameter, the proportion of invariable sites, and the tree length. Here, we propose an improved optimization heuristic that accurately estimates the three parameters. While research efforts mainly focus on tree search methods, our results signify the equal importance of verifying and developing effective estimation methods for complex models of sequence evolution.
Keywords: Gamma model, invariable sites, maximum likelihood, phylogenetic inference, rate heterogeneity among sites
In model-based phylogenetic analysis, the invariable site plus $\Gamma$ model (Yang 1994; Gu et al. 1995), hereafter referred to as I+$\Gamma$, is widely used to model rate heterogeneity among sites, because it often fits the data better than the $\Gamma$ model or the invariable-sites model alone (Sullivan and Swofford 1997). Thus, the I+$\Gamma$ model is frequently selected by MODELTEST (Posada and Crandall 1998). The I+$\Gamma$ model has two parameters: the proportion of invariable sites ($p_{\mathrm{inv}}$) and the shape parameter ($\alpha$) of the $\Gamma$ distribution. A small $\alpha$ ($\alpha < 1$) indicates strong rate heterogeneity, whereas a large $\alpha$ ($\alpha > 1$) corresponds to weak rate heterogeneity. Under certain conditions, $p_{\mathrm{inv}}$ and $\alpha$ compete with each other for the same phylogenetic signal. For example, the $\Gamma$ distribution already accounts for sites with low rates; this interferes with $p_{\mathrm{inv}}$ and causes a correlation between the two parameters, which makes their reliable estimation difficult (Sullivan et al. 1999; Mayrose et al. 2005). Despite this interference, it has been shown that the I+continuous $\Gamma$ model is identifiable for “all but members of the F81 family of rate matrices on any phylogeny with more than two distinct interspecies distances” (Rogers 2001; Allman and Rhodes 2008; Chai and Housworth 2011). Since the I+continuous $\Gamma$ model is identifiable, reliable parameter estimation for this model should be possible for sufficiently long multiple sequence alignments.
However, most phylogenetic software implements only the I+discrete $\Gamma$ model (Yang 1994) as an approximation to the continuous $\Gamma$ model because of its computational efficiency. The competition between $p_{\mathrm{inv}}$ and $\alpha$ discussed above is based on the analysis of the discrete $\Gamma$ distribution. These results have led to the suggestion to discourage the use of the I+discrete $\Gamma$ model (Yang 2006; Jia et al. 2014; Stamatakis 2014).
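To make the discretization concrete, the following Python sketch computes the category rates of a discrete $\Gamma$ distribution with mean 1, using the mean-per-category approach described by Yang (1994). It is an illustration only and is not taken from any of the programs discussed here. The function names (`reg_lower_gamma`, `gamma_quantile`, `discrete_gamma_rates`) are hypothetical, and the incomplete gamma function is evaluated with a plain power series that is adequate only for moderate arguments.

```python
import math


def reg_lower_gamma(shape, x):
    """Regularized lower incomplete gamma function P(shape, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape))


def gamma_quantile(prob, shape):
    """Quantile of a Gamma(shape, rate=1) distribution, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, hi) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability categories of Gamma(alpha, rate=alpha)."""
    # Category boundaries, expressed on the scale of a Gamma(alpha, rate=1) variable.
    cuts = [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # The mean within a category follows from the Gamma(alpha + 1) distribution function.
    cdf = [0.0] + [reg_lower_gamma(alpha + 1.0, y) for y in cuts] + [1.0]
    return [k * (cdf[i + 1] - cdf[i]) for i in range(k)]


rates = discrete_gamma_rates(0.5, k=4)
print([round(r, 4) for r in rates])
```

By construction the category rates average to 1. For a shape parameter of 0.5 and four categories, the slowest category has a mean rate of about 0.033, which illustrates how the discrete $\Gamma$ distribution itself already represents nearly invariable sites.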
On the other hand, the identifiability of the I+discrete $\Gamma$ model is likely, but unproven (Chai and Housworth 2011), and it is unclear how accurately popular phylogenetic software estimates the parameters of the I+discrete $\Gamma$ model.
Thus, we used simulations to assess the accuracy of the I+discrete $\Gamma$ estimators implemented in three maximum likelihood (ML) phylogenetic programs, RAxML (Stamatakis 2014), PhyML (Guindon et al. 2010), and IQ-TREE (Nguyen et al. 2015), and in one Bayesian inference program, MrBayes (Ronquist et al. 2012). More precisely, we simulated 100,000-bp long alignments along three balanced trees of 6, 24, and 96 taxa. The lengths of the alignments ensure the recovery of the correct tree topology. The three trees have uniform branch lengths of 0.1 substitutions per site, except for one internal branch on the 6-taxon tree whose length equals 0.2 to allow for three distinct distances between the sequences, as required for identifiability in the continuous case (Chai and Housworth 2011). We assumed the K2P model (Kimura 1980) with a transition/transversion ratio of 2.0 and the rate heterogeneity model I+discrete $\Gamma$ with four rate categories. For each tree and each pair ($\alpha$, $p_{\mathrm{inv}}$), we simulated 100 alignments using Seq-Gen (Rambaut and Grassly 1997). We used RAxML version 8.2.2, PhyML version 20141029, IQ-TREE version 1.3.7, and MrBayes version 3.2.6 compiled with the BEAGLE library (Ayres et al. 2012) to infer the invariable proportion, the shape parameter, and the tree length from the simulated alignments. For RAxML, PhyML, and IQ-TREE, we used the default options.
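As a quick plausibility check on the simulation setup, the expected tree lengths can be worked out from the branch lengths alone. The sketch below assumes that the three trees are unrooted and fully resolved, so that a tree with n taxa has 2n - 3 branches; this assumption is ours, but it reproduces the expected tree length of 18.9 reported for the 96-taxon tree. The function names are hypothetical.

```python
def unrooted_branch_count(n_taxa):
    # An unrooted, fully resolved (binary) tree with n taxa has 2n - 3 branches.
    return 2 * n_taxa - 3


def expected_tree_length(n_taxa, branch_length=0.1, extra_length=0.0):
    # Sum of all branch lengths; extra_length accounts for branches longer than the default.
    return unrooted_branch_count(n_taxa) * branch_length + extra_length


tree_lengths = {
    6: expected_tree_length(6, extra_length=0.1),  # one internal branch is 0.2 instead of 0.1
    24: expected_tree_length(24),
    96: expected_tree_length(96),
}
for n_taxa, length in tree_lengths.items():
    print(n_taxa, round(length, 1))
```

Under these assumptions, the simulated tree lengths are 1.0, 4.5, and 18.9 substitutions per site for the 6-, 24-, and 96-taxon trees.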
For MrBayes, we used the default priors, that is, a uniform distribution on the interval [0,1] for $p_{\mathrm{inv}}$, an exponential distribution with mean 1.0 for $\alpha$, a nonclocklike uniform Dirichlet distribution for branch lengths, and a $\Gamma$ distribution with a mean of 10 for tree lengths (Unconstrained:GammaDir(1.0,0.1,1.0,1.0)). The sequential version of MrBayes was run with four chains (one cold and three heated chains) and one million MCMC generations. In total, 1489 (16.5%) nonconvergent MrBayes runs, in which the effective sample size (ESS) of $p_{\mathrm{inv}}$, $\alpha$, or the tree length was smaller than 100, were repeated with five million generations. However, 52 of these extended reruns were stopped after 4 weeks without completing all five million generations. We note that 207 of the 1489 reruns still did not converge. MrBayes estimates were then summarized as the mean of the posterior distribution with the default burn-in of 25%.
Current Phylogenetic Programs Do Not Produce Accurate Estimates for the I+Discrete $\Gamma$ Model
Figure 1 displays the averages of the estimated shape parameter, of the estimated invariable fraction, and of the estimated tree length produced by PhyML, RAxML, IQ-TREE, and MrBayes for the 100 alignments simulated from each parameter combination. A program is called accurate if these three averages deviate no more than 10% from the true values.
Figure 1.
The averages of the estimated shape parameter, of the estimated invariable fraction, and of the estimated tree length produced by PhyML, RAxML, IQ-TREE, and MrBayes for the 100 alignments simulated from each parameter combination. The averages are highlighted according to their differences from the true values: inaccurate (more than 25% deviation, red in online version), moderately inaccurate (10% to 25% deviation, yellow in online version), and accurate (less than 10% deviation, green in online version). For a true $p_{\mathrm{inv}}$ of 0, where a percentage deviation is undefined, the estimated $p_{\mathrm{inv}}$ is classified by separate thresholds.
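The accuracy classes used in Figure 1 can be written as a small function. The sketch below is a minimal illustration of the 10% and 25% thresholds on the relative deviation; the function name `classify_estimate` is hypothetical, and the case of a true value of zero is deliberately left unhandled because a relative deviation is undefined there.

```python
def classify_estimate(average_estimate, true_value):
    """Classify an averaged estimate by its relative deviation from the true value."""
    if true_value == 0:
        raise ValueError("relative deviation is undefined for a true value of zero")
    deviation = abs(average_estimate - true_value) / abs(true_value)
    if deviation <= 0.10:
        return "accurate"
    if deviation <= 0.25:
        return "moderately inaccurate"
    return "inaccurate"


print(classify_estimate(1.05, 1.0))
print(classify_estimate(0.8, 1.0))
print(classify_estimate(177.0, 18.9))
```

The last call uses the average tree length of 177.0 that IQ-TREE estimated for a 96-taxon setting with an expected tree length of 18.9, which falls into the inaccurate class.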
None of the tested programs estimated all parameter combinations accurately. The problem is especially pronounced for the 6-taxon alignments. For extreme rate heterogeneity, MrBayes and PhyML recovered the true $\alpha$, $p_{\mathrm{inv}}$, and tree length for 9/10 and 5/10 parameter combinations, respectively, whereas the average estimates from IQ-TREE and RAxML were inaccurate. For strong rate heterogeneity, the degree of inaccuracy differed unsystematically among the programs. On the one hand, IQ-TREE and MrBayes accurately estimated the parameters in four and six settings, respectively. On the other hand, RAxML and PhyML could not accurately estimate the three parameters for any of the ten parameter combinations. For medium rate variation, only IQ-TREE produced accurate estimates for all settings. All other programs exhibited varying degrees of inaccuracy.
For the 24- and 96-taxon alignments, we observed an increase in the number of accurate estimates for all programs. These results corroborate a previous study (Sullivan et al. 1999) showing that increased taxon sampling leads to more reliable estimates. However, under extreme rate heterogeneity, only MrBayes estimated all parameter sets accurately. We note that our measure of accuracy correlates well with the Bayesian coverage probabilities, that is, the frequency with which the true parameter values are included in the 95% credible interval of the estimates (Supplementary Fig. S1 available on Dryad at https://doi.org/10.5061/dryad.4j5c7). The 207 (2.3%) nonconvergent MrBayes runs (effective sample size of $\alpha$ or $p_{\mathrm{inv}}$ smaller than 100) partly overlap with the 6-taxon parameter combinations for which MrBayes was not accurate. Hence, nonconvergence is a predictor of difficult settings but does not fully explain the inaccuracy of MrBayes (Fig. 1).
We also observed that inaccurate estimates of $\alpha$ and $p_{\mathrm{inv}}$ could sometimes lead to tree lengths that deviate substantially from the simulated lengths. For instance, for one parameter combination on the 96-taxon tree (expected tree length = 18.9), IQ-TREE estimated an average tree length of 177.0, which is about nine times the simulated tree length. The other programs also sometimes produced tree lengths that were considerably longer than the simulated ones.
In terms of computing time, PhyML, RAxML, and IQ-TREE needed 62,441, 12,563, and 7,675 CPU hours, respectively, for all analyses. MrBayes needed 740,681 CPU hours to complete one million MCMC generations and was thus 96.5 times slower than the fastest ML program. We note that this is only a lower bound on the effective time one needs to wait for MrBayes results, because 16.5% of the MrBayes runs did not converge after one million MCMC generations. These runs were repeated with five times more generations, which led to substantially more computation.
Multiple Local Optima on the Likelihood Surface Cause Inaccuracy
Because the tested programs performed quite differently with respect to the accuracy of parameter estimation, the number of taxa cannot be the only explanation. We suspected that the optimization heuristics implemented in these programs drive the accuracy. Examining the likelihood surfaces for many simulated alignments revealed a common feature: the parameter space has two distinct peaks of high log-likelihood (Fig. 2), one global peak close to the true parameters and one suboptimal peak with a slightly lower log-likelihood, separated from the true parameters by a flat valley. In this particular instance, MrBayes and PhyML found the true parameters, whereas RAxML and IQ-TREE were trapped in the local maximum (this is not necessarily the case for other instances). In fact, whether the global or the local optimum is detected depends on the starting values of the numerical optimization routines.
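The dependence on starting values can be illustrated with a toy example. The following sketch uses a hypothetical one-dimensional surface with two peaks, not the I+discrete $\Gamma$ likelihood, and a deliberately simple local search; the names `toy_surface` and `local_search` are ours.

```python
def toy_surface(x):
    # Hypothetical score with a global peak at x = 0.2 (height 0.0)
    # and a slightly lower local peak at x = 0.7 (height -0.01).
    return -min((x - 0.2) ** 2, (x - 0.7) ** 2 + 0.01)


def local_search(f, x, step=0.05, tol=1e-8):
    """Greedy hill climbing: move to the better neighbor, otherwise halve the step."""
    while step > tol:
        best = max((x - step, x, x + step), key=f)
        if best == x:
            step /= 2.0
        else:
            x = best
    return x


from_left = local_search(toy_surface, 0.1)
from_right = local_search(toy_surface, 0.9)
print(round(from_left, 3), round(from_right, 3))
```

Started at 0.1, the search climbs to the global peak near 0.2; started at 0.9, it stops at the lower peak near 0.7 and never crosses the valley between the two.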
Figure 2.
The likelihood surface for one simulated alignment as a function of $\alpha$ and $p_{\mathrm{inv}}$.
To summarize, we compared for each simulated alignment the log-likelihood of the estimates with the log-likelihood obtained for the true parameters. Table 1 shows how often the true parameter combination produced a higher likelihood than the parameters inferred by MrBayes, PhyML, RAxML, and IQ-TREE. These fractions are particularly high for the 6-taxon tree and for the ML inference programs. Most ML phylogenetic programs use general-purpose numerical methods to find $\alpha$ and $p_{\mathrm{inv}}$ (e.g., Brent 1973). These methods are evidently not well adapted to the complex likelihood surface (Fig. 2), which explains the poor overall performance of the ML programs (Fig. 1).
Table 1.
Percentage of alignments where the true simulation parameters result in higher log-likelihoods than the inferred parameters from the programs for three simulation scenarios (6-, 24-, and 96-taxon trees)
| Program | 6-taxon tree (%) | 24-taxon tree (%) | 96-taxon tree (%) |
|---|---|---|---|
| MrBayes | 34.7 | 6.6 | 5.6 |
| PhyML | 60.2 | 21.9 | 45.0 |
| RAxML | 89.2 | 37.7 | 49.5 |
| IQ-TREE | 36.0 | 34.1 | 44.0 |
Effective Optimization Heuristic Produces Accurate Estimates
As a remedy, we propose an alternative optimization heuristic that employs the Expectation-Maximization (EM) algorithm (Dempster et al. 1977) to estimate $p_{\mathrm{inv}}$. We assume a discrete $\Gamma$ distribution with $k$ rate categories. Under the I+discrete $\Gamma$ model, the site rates follow a discrete mixture model consisting of $k+1$ categories with rates $r_0, r_1, \ldots, r_k$, where $r_0 = 0$ represents invariable sites and $r_1, \ldots, r_k$ are the $k$ rates determined from the shape parameter $\alpha$ of the discrete $\Gamma$ distribution (Yang 1994). Given a tree topology, the optimization heuristic does the following:
1. Choose initial values for $\alpha$ and $p_{\mathrm{inv}}$.
2. Optimize branch lengths by the Newton–Raphson method.
3. Optimize substitution model parameters by the Broyden–Fletcher–Goldfarb–Shanno algorithm.
4. For each alignment site $i$ compute its posterior probability $q_i$ of being invariable ($i = 1, \ldots, n$, where $n$ is the number of alignment sites):
$$q_i = \frac{w_0 L_i(r_0)}{\sum_{j=0}^{k} w_j L_i(r_j)},$$
where $L_i(r_j)$ is the likelihood of site $i$ having rate $r_j$ and $w_0 = p_{\mathrm{inv}}$, $w_j = (1 - p_{\mathrm{inv}})/k$ ($j = 1, \ldots, k$).
5. Update $p_{\mathrm{inv}} = \frac{1}{n} \sum_{i=1}^{n} q_i$.
6. Optimize $\alpha$ by the Brent method.
7. If the log-likelihood improvement is greater than a predefined $\epsilon$ value, go back to Step 2. Otherwise, stop the parameter optimization.
Steps 4 and 5 correspond to the E- and M-step of the EM algorithm, respectively. To avoid being stuck in local optima, we repeat this optimization procedure from ten starting values of $p_{\mathrm{inv}}$ evenly spaced between 0 and the fraction of constant sites observed in the alignment. The initial value of $\alpha$ is always set to 1.0.
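The E- and M-step for the invariable proportion (Steps 4 and 5) can be sketched in a few lines of Python. This is a simplified illustration, not the IQ-TREE-EM implementation. It assumes that the per-site likelihoods are held fixed, whereas the full heuristic re-optimizes branch lengths, substitution model parameters, and the shape parameter between EM updates. The function names and the toy likelihood values are hypothetical.

```python
def em_step(p_inv, lik_invariable, lik_variable):
    """One EM update of the invariable proportion.

    lik_invariable[i]: likelihood of site i at rate 0 (zero for non-constant sites).
    lik_variable[i]:   likelihood of site i averaged over the Gamma rate categories.
    """
    # E-step: posterior probability of each site being invariable.
    posteriors = [
        p_inv * l0 / (p_inv * l0 + (1.0 - p_inv) * lv)
        for l0, lv in zip(lik_invariable, lik_variable)
    ]
    # M-step: the new proportion is the mean posterior probability.
    return sum(posteriors) / len(posteriors)


def estimate_p_inv(lik_invariable, lik_variable, start=0.5, tol=1e-10, max_iter=10000):
    """Iterate EM updates until the change in p_inv falls below tol."""
    p_inv = start
    for _ in range(max_iter):
        updated = em_step(p_inv, lik_invariable, lik_variable)
        if abs(updated - p_inv) < tol:
            return updated
        p_inv = updated
    return p_inv


# Toy data: six constant sites and four variable sites (values are made up).
lik_invariable = [0.25] * 6 + [0.0] * 4
lik_variable = [0.10] * 6 + [0.01] * 4
p_hat = estimate_p_inv(lik_invariable, lik_variable)
print(round(p_hat, 4))
```

For these toy values the estimate converges to one third, which is below the fraction of constant sites (0.6) because constant sites can also arise under a nonzero rate. Note that a starting value of exactly 0 is a fixed point of the update when all other parameters are held fixed, so this sketch uses an interior starting value.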
We implemented the new optimization heuristic in IQ-TREE, now called IQ-TREE-EM (IQ-TREE version 1.4.3), and repeated the previous simulations. Figure 3 shows that IQ-TREE-EM successfully recovered the true parameters for all but one parameter combination on the 6-taxon tree, where the average estimates deviated slightly from the true values.
Figure 3.
The averages of the estimated shape parameter, of the estimated invariable fraction, and of the estimated tree length produced by IQ-TREE-EM for the 100 alignments simulated from each parameter combination. The highlighting is explained in Fig. 1.
Also, the percentage of instances where the estimated log-likelihoods were lower than the log-likelihood for the true parameters dropped considerably (0.06% 6-taxon tree, 0.0% 24-taxon tree, and 0.03% 96-taxon tree; compare also with Table 1).
This increase in accuracy comes at the cost of an increased total computing time by a factor of 1.3 compared to IQ-TREE.
Thus, we conclude that the inaccurate parameter estimation of the I+discrete $\Gamma$ model shown for the tested phylogenetic programs is caused by ineffective optimization methods.
Impact on Real Data
To investigate the impact of accuracy on real data for ML estimates, we analyzed 70 DNA and 45 protein TreeBase alignments (Nguyen et al. 2015). We applied the GTR+I+$\Gamma$4 and LG+I+$\Gamma$4 models for DNA and protein data, respectively. Among the 115 alignments, we detected 15 (5 DNA and 10 protein) alignments where the $\alpha$ and $p_{\mathrm{inv}}$ estimated by PhyML, RAxML, or IQ-TREE deviated more than 10% from those estimated by IQ-TREE-EM (Fig. 4; Supplementary Table S1 available on Dryad). The estimates by PhyML and IQ-TREE deviated from those by IQ-TREE-EM only for one and two alignments, respectively. However, the RAxML estimates of $\alpha$ and $p_{\mathrm{inv}}$ differed dramatically from those of IQ-TREE-EM, PhyML, and IQ-TREE for all 15 alignments. Interestingly, RAxML systematically overestimated $\alpha$ and $p_{\mathrm{inv}}$ for all 5 DNA alignments and underestimated them for all 10 protein alignments (with estimates sometimes very close to zero).
Figure 4.
Estimation of $\alpha$ and $p_{\mathrm{inv}}$ (one per panel) for TreeBase alignments using IQ-TREE-EM (x-axis) and IQ-TREE (circle), PhyML (cross), and RAxML (triangle). Dashed lines show the boundaries of 10% deviation from the IQ-TREE-EM estimates. Points above the upper dashed lines indicate overestimation compared with IQ-TREE-EM, whereas points under the lower dashed lines indicate underestimation.
Discussion
Our simulations revealed a major issue for parameter estimation of the I+discrete $\Gamma$ model as implemented in phylogenetic software. Despite using very long alignments, none of the tested programs recovered the true $\alpha$, $p_{\mathrm{inv}}$, and tree length for all parameter combinations. Often, the estimates deviated heavily from the true values, and different programs estimated different values for the same evolutionary parameters, although all programs inferred the true tree. Our further analysis of 115 TreeBase alignments showed that PhyML, IQ-TREE, and IQ-TREE-EM estimates generally agree with each other except for two alignments. However, we identified 15 (13%) alignments where RAxML systematically overestimated $\alpha$ and $p_{\mathrm{inv}}$ for DNA and underestimated them for protein, compared with the other programs. The reasons for that behavior are unclear and deserve further analysis. While this result may not be extrapolated to other data sets, phylogenetic software should benefit from the more robust optimization described for IQ-TREE-EM.
We showed that the estimation heuristics implemented in popular phylogenetic programs cause such inaccurate estimates and that the I+$\Gamma$ model per se is not problematic. The relatively good performance of MrBayes is likely attributable to the Bayesian sampling of the parameter space but comes at the cost of excessive computing time.
With IQ-TREE-EM, we provided an alternative optimization heuristic for ML methods that allows accurate estimation of the parameters of the I+discrete $\Gamma$ model. IQ-TREE-EM combines two optimization techniques: the multiple starting point strategy and the EM algorithm. We note that the EM algorithm alone will not achieve this accuracy (Supplementary Fig. S2 available on Dryad). Thus, while the former makes it possible to escape local optima, the latter helps to speed up the optimization by using an analytical formula for $p_{\mathrm{inv}}$. This new approach effectively infers the true evolutionary parameters for long alignments. Thus, it is tempting to speculate that the GTR+I+discrete $\Gamma$ model is also identifiable, as shown for the GTR+I+continuous $\Gamma$ model (Chai and Housworth 2011).
Our observations show that as models of sequence evolution become more and more complex (e.g., Dirichlet rate and other mixture models), tailored numerical optimization methods are necessary to achieve accurate estimates of evolutionary parameters. It is not enough to recover the true tree, if one wants to understand how evolutionary forces shaped contemporary genomes. The effect of wrong parameter estimates for the substitution model on the total tree length is sometimes dramatic (see Fig. 1). This may in turn bias downstream analysis such as divergence time dating, inference of site-specific evolutionary rates, and ancestral sequence reconstruction, which are sensitive to the parameter estimates. Thus, one should critically scrutinize the heuristics implemented in popular programs. A more thorough evaluation of phylogenetic inference programs allowing for very complicated models of sequence evolution is necessary, but beyond the scope of this article.
Finally, we would like to point out that we only addressed the accurate computation of $\alpha$ and $p_{\mathrm{inv}}$ for the widely used I+discrete $\Gamma$ model. We do not discuss the biological interpretation of $p_{\mathrm{inv}}$. The estimate of $p_{\mathrm{inv}}$ depends very much on the multiple sequence alignment at hand, and $p_{\mathrm{inv}}$ may change if we enlarge the alignment. Thus, drawing an absolute conclusion from $p_{\mathrm{inv}}$ is in any case questionable.
Acknowledgements
The authors would like to thank Heiko A. Schmidt for fruitful discussions, two anonymous reviewers, Fredrik Ronquist, and Edward Susko for constructive and helpful comments on an earlier version of the manuscript. The computational results presented have been achieved using the Vienna Scientific Cluster 3 (VSC-3).
Supplementary Material
Data available from the Dryad Digital Repository: http://doi.org/10.5061/dryad.4j5c7.
Funding
This work was supported by the Austrian Science Fund – FWF (Grant Nos I-2805-B29 and I-1824-B22).
References
- Allman E.S., Rhodes J.A. 2008. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci. 211:18–33.
- Ayres D.L., Darling A., Zwickl D.J., Beerli P., Holder M.T., Lewis P.O., Huelsenbeck J.P., Ronquist F., Swofford D.L., Cummings M.P., Rambaut A., Suchard M.A. 2012. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 61:170–173.
- Brent R.P. 1973. Algorithms for minimization without derivatives. Englewood Cliffs, New Jersey: Prentice-Hall; p. 1–195.
- Chai J., Housworth E.A. 2011. On Rogers' proof of identifiability for the GTR+Γ+I model. Syst. Biol. 60:713–718.
- Dempster A.P., Laird N.M., Rubin D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39:1–38.
- Gu X., Fu Y.X., Li W.H. 1995. Maximum-likelihood-estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546–557.
- Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59:307–321.
- Jia F.Z., Lo N., Ho S.Y.W. 2014. The impact of modelling rate heterogeneity among sites on phylogenetic estimates of intraspecific evolutionary rates and timescales. PLoS One 9:e95722.
- Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide-sequences. J. Mol. Evol. 16:111–120.
- Mayrose I., Friedman N., Pupko T. 2005. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics 21:ii151–ii158.
- Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32:268–274.
- Posada D., Crandall K.A. 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14:817–818.
- Rambaut A., Grassly N.C. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosci. 13:235–238.
- Rogers J.S. 2001. Maximum likelihood estimation of phylogenetic trees is consistent when substitution rates vary according to the invariable sites plus gamma distribution. Syst. Biol. 50:713–722.
- Ronquist F., Teslenko M., van der Mark P., Ayres D.L., Darling A., Hohna S., Larget B., Liu L., Suchard M.A., Huelsenbeck J.P. 2012. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61:539–542.
- Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
- Sullivan J., Swofford D.L. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mamm. Evol. 4:77–86.
- Sullivan J., Swofford D.L., Naylor G.J.P. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol. 16:1347–1356.
- Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.
- Yang Z. 2006. Computational molecular evolution. New York: Oxford University Press; p. 113–114.




