Complex Models of Sequence Evolution Require Accurate Estimators as Exemplified with the Invariable Site Plus Gamma Model

Lam-Tung Nguyen; Arndt von Haeseler; Bui Quang Minh

Syst Biol. 2017 Nov 27;67(3):552–558. doi: 10.1093/sysbio/syx092


Editor: Edward Susko

PMCID: PMC6204645 PMID: 29186593

Abstract

The invariable site plus $\Gamma$ model (I+$\Gamma$) is widely used to model rate heterogeneity among alignment sites in maximum likelihood and Bayesian phylogenetic analyses. The proof that the I+$\Gamma$ continuous model is identifiable (model parameters can be inferred correctly given enough data) has increased the credibility of its application to phylogeny reconstruction. However, most phylogenetic software implements the I+$\Gamma$ discrete model, whose identifiability is likely but unproven. How well the parameters of the I+$\Gamma$ discrete model are estimated is still disputed. In particular, the correlation between the fraction of invariable sites and the fraction of sites with a slow evolutionary rate is considered problematic. We show that optimization heuristics as implemented in frequently used phylogenetic software (PhyML, RAxML, IQ-TREE, and MrBayes) cannot always reliably estimate the shape parameter, the proportion of invariable sites, and the tree length. Here, we propose an improved optimization heuristic that accurately estimates the three parameters. While research efforts mainly focus on tree search methods, our results signify the equal importance of verifying and developing effective estimation methods for complex models of sequence evolution.

Keywords: Gamma model, invariable sites, maximum likelihood, phylogenetic inference, rate heterogeneity among sites

In model-based phylogenetic analysis, the invariable site plus $\Gamma$ model (Yang 1994; Gu et al. 1995), hereafter referred to as I+$\Gamma$, is widely used to model rate heterogeneity among sites, because it often fits the data better than the $\Gamma$ model or the invariable-sites model alone (Sullivan and Swofford 1997). Thus, the I+$\Gamma$ model is frequently selected by MODELTEST (Posada and Crandall 1998). The I+$\Gamma$ model has two parameters: the proportion of invariable sites ($p_{inv}$) and the shape parameter ($\alpha$) of the $\Gamma$ distribution. A small $\alpha$ indicates strong rate heterogeneity, whereas a large $\alpha$ corresponds to weak rate heterogeneity. Under certain conditions $p_{inv}$ and $\alpha$ compete with each other for the same phylogenetic signal. For example, the $\Gamma$ distribution already accounts for sites with low rates; this interferes with $p_{inv}$ and causes a correlation between the parameters, making reliable estimation of those parameters difficult (Sullivan et al. 1999; Mayrose et al. 2005). Despite this interference, it has been shown that the I+$\Gamma$ continuous model is identifiable for “all but members of the F81 family of rate matrices on any phylogeny with more than two distinct interspecies distances” (Rogers 2001; Allman and Rhodes 2008; Chai and Housworth 2011). Since the I+$\Gamma$ continuous model is identifiable, reliable parameter estimation for this model should be possible for sufficiently long multiple sequence alignments.
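To make the two parameters concrete, the following is a minimal sketch of how a discrete I+$\Gamma$ rate mixture can be built. It assumes the mean-rate discretization of Yang (1994) with equally probable categories and a $\Gamma$ distribution with mean 1; the function names are illustrative and are not taken from any of the programs discussed here.

```python
import math

def reg_lower_gamma(a, x, tol=1e-14, max_terms=10000):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_quantile(p, alpha):
    """Quantile of Gamma(shape=alpha, rate=alpha), i.e. mean 1, by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equally probable categories."""
    cuts = [0.0] + [gamma_quantile(j / k, alpha) for j in range(1, k)]
    # For a mean-1 Gamma, the partial mean up to c equals P(alpha + 1, alpha * c).
    partial = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (partial[j + 1] - partial[j]) for j in range(k)]

def i_gamma_mixture(alpha, p_inv, k=4):
    """Rates and weights of the k + 1 categories (category 0 is invariable)."""
    rates = [0.0] + discrete_gamma_rates(alpha, k)
    weights = [p_inv] + [(1.0 - p_inv) / k] * k
    return rates, weights

rates, weights = i_gamma_mixture(alpha=0.5, p_inv=0.2)
print(rates)
print(weights)
```

With a small $\alpha$ the slowest $\Gamma$ category has a rate close to zero, so it behaves much like the invariable category; this is one way to see why the two parameters can compete for the same signal.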

However, most phylogenetic software implements only the I+$\Gamma$ discrete model (Yang 1994) as an approximation to the continuous model because of its computational efficiency. The competition between $p_{inv}$ and $\alpha$ discussed above is based on analysis of the discrete $\Gamma$ distribution. These results have led some authors to discourage the use of the I+$\Gamma$ discrete model (Yang 2006; Jia et al. 2014; Stamatakis 2014).

On the other hand, the identifiability of the I+$\Gamma$ discrete model is likely but unproven (Chai and Housworth 2011), and it is unclear how accurately popular phylogenetic software estimates the parameters of the I+$\Gamma$ discrete model.

Thus, we used simulations to assess the accuracy of the I+$\Gamma$ discrete estimators implemented in three maximum likelihood (ML) phylogenetic programs, RAxML (Stamatakis 2014), PhyML (Guindon et al. 2010), and IQ-TREE (Nguyen et al. 2015), and in one Bayesian inference program, MrBayes (Ronquist et al. 2012). More precisely, we simulated 100,000-bp alignments along three balanced trees of 6, 24, and 96 taxa. The lengths of the alignments ensure the recovery of the correct tree topology. The three trees have uniform branch lengths of 0.1 substitutions per site, except for one internal branch on the 6-taxon tree whose length equals 0.2 to allow for three distinct distances between the sequences, as required for identifiability in the continuous case (Chai and Housworth 2011). We assumed the K2P model (Kimura 1980) with a transition/transversion ratio of 2.0 and the I+$\Gamma$ discrete rate heterogeneity model with four rate categories. For each tree and each pair ($\alpha$, $p_{inv}$), we simulated 100 alignments using Seq-Gen (Rambaut and Grassly 1997). We used RAxML version 8.2.2, PhyML version 20141029, IQ-TREE version 1.3.7, and MrBayes version 3.2.6 compiled with the BEAGLE library (Ayres et al. 2012) to infer the invariable proportion, the shape parameter, and the tree length from the simulated alignments. For RAxML, PhyML, and IQ-TREE, we used the default options.
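As a quick check of the simulation design, the expected tree lengths follow from the branch count of an unrooted, fully resolved tree, which is $2n-3$ for $n$ taxa. The sketch below assumes the simulated trees are of this kind; the helper names are illustrative.

```python
def n_branches(n_taxa):
    """Number of branches in an unrooted, fully resolved tree."""
    return 2 * n_taxa - 3

def expected_tree_length(n_taxa, branch_len=0.1, extra_len=0.0):
    """Sum of branch lengths; extra_len accounts for any longer branch."""
    return n_branches(n_taxa) * branch_len + extra_len

# The 6-taxon tree has one internal branch of length 0.2 instead of 0.1.
for n, extra in [(6, 0.1), (24, 0.0), (96, 0.0)]:
    print(n, round(expected_tree_length(n, extra_len=extra), 1))
```

Under these assumptions the 96-taxon tree has 189 branches and a total length of 18.9 substitutions per site.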

For MrBayes we used the default priors, that is, a uniform distribution on the interval [0,1] for $p_{inv}$, an exponential distribution with mean 1.0 for $\alpha$, a nonclocklike uniform Dirichlet distribution for branch lengths, and a $\Gamma$ distribution with mean 10 for tree lengths (Unconstrained:GammaDir(1.0,0.1,1.0,1.0)). The sequential version of MrBayes was run with four chains (one cold and three heated chains) and one million MCMC generations. One thousand four hundred and eighty-nine (16.5%) nonconvergent MrBayes runs, in which the effective sample size (ESS) for $\alpha$, $p_{inv}$, or tree length was smaller than 100, were repeated with five million generations. However, 52 of the extended reruns were stopped after 4 weeks without completing all five million generations. We note that 207 of the 1489 reruns still did not converge. MrBayes estimates are summarized as the mean of the posterior distribution with the default burn-in of 25%.

Current Phylogenetic Programs Do Not Produce Accurate Estimates for the I+$\Gamma$ Discrete Model

Figure 1 displays the averages of the estimated shape parameter, the averages of the estimated invariable fraction, and the averages of the estimated tree length produced by PhyML, RAxML, IQ-TREE, and MrBayes for the 100 alignments simulated from each parameter combination. A program is called accurate if these three estimated averages deviate no more than 10% from the true values.
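The accuracy criterion can be written down in a few lines. This is a minimal sketch, assuming the 10% threshold is applied to the relative deviation of the average estimate; a relative deviation is undefined when the true value is zero, so the sketch raises an error in that case rather than guessing how such a setting would be treated. The function name is illustrative.

```python
from statistics import mean

def is_accurate(estimates, true_value, rel_tol=0.10):
    """True if the average estimate is within rel_tol of the true value."""
    if true_value == 0:
        raise ValueError("relative deviation is undefined for a true value of 0")
    return abs(mean(estimates) - true_value) <= rel_tol * abs(true_value)

# Hypothetical estimates of a shape parameter whose true value is 0.5.
print(is_accurate([0.52, 0.48, 0.50], 0.5))
print(is_accurate([0.60, 0.62, 0.58], 0.5))
```

A program would count as accurate for a parameter combination only if this check passes for the shape parameter, the invariable fraction, and the tree length together.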

None of the tested programs estimated all parameter combinations accurately. The problem is especially pronounced for the 6-taxon alignments. For extreme rate heterogeneity, MrBayes and PhyML recovered the true $\alpha$, $p_{inv}$, and tree length for 9/10 and 5/10 parameter combinations, respectively, whereas the average estimates from IQ-TREE and RAxML were inaccurate. For strong rate heterogeneity, the degree of inaccuracy differed unsystematically among the programs. On the one hand, IQ-TREE and MrBayes accurately estimated the parameters in four and six settings, respectively. On the other hand, RAxML and PhyML could not accurately estimate the three parameters for any of the ten parameter combinations. For medium rate variation, only IQ-TREE produced accurate estimates for all settings. All other programs exhibited varying degrees of inaccuracy.

For the 24- and 96-taxon alignments we observed an increase in the number of accurate estimates for all programs. These results corroborate a previous study (Sullivan et al. 1999) showing that increased taxon sampling leads to more reliable estimates. However, under extreme rate heterogeneity, only MrBayes estimated all parameter sets accurately. We note that our measure of accuracy correlates well with the Bayesian coverage probabilities, that is, the frequency with which true parameter values are included in the 95% credible interval of the estimates (Supplementary Fig. S1 available on Dryad at https://doi.org/10.5061/dryad.4j5c7). Two hundred and seven (2.3%) nonconvergent MrBayes runs (effective sample size of $\alpha$ or $p_{inv}$ smaller than 100) partly overlap with cases where MrBayes was not accurate for the 6-taxon simulations (three combinations of $\alpha$ and $p_{inv}$). Hence, nonconvergence is a predictor of difficult settings but does not fully explain the inaccuracy of MrBayes (Fig. 1).

We also observed that inaccurate estimates of $\alpha$ and $p_{inv}$ could sometimes lead to tree lengths that substantially deviate from the simulated lengths. For instance, for the 96-taxon alignments simulated with one combination of $\alpha$ and $p_{inv}$ (expected tree length 18.9), IQ-TREE estimated an average tree length of 177.0, more than nine times the simulated tree length. The other programs also sometimes produced tree lengths that were considerably longer than the simulated ones.

In terms of computing time, PhyML, RAxML, and IQ-TREE needed 62,441, 12,563, and 7,675 CPU hours for all analyses, respectively. MrBayes needed 740,681 CPU hours to complete one million MCMC generations; thus it is 96.5 times slower than the fastest ML program. We note that this is only a lower bound on the effective time one needs to wait for MrBayes results, because 16.5% of MrBayes runs did not converge after one million MCMC generations. These runs were repeated with five times more generations, which led to substantially more computation.

Multiple Local Optima on the Likelihood Surface Cause Inaccuracy

Because the tested programs performed quite differently with respect to the accuracy of parameter estimation, the number of taxa cannot be the only explanation. We suspected that the optimization heuristics implemented in these programs drive the accuracy. Examining the likelihood surfaces for many simulated alignments revealed a common feature: the parameter space has two distinct peaks of high log-likelihood (Fig. 2), one global peak close to the true parameters and one suboptimal peak with slightly lower log-likelihood, separated by a flat valley from the true parameters. In this particular instance, MrBayes and PhyML found the true parameters, whereas RAxML and IQ-TREE were trapped in the local maximum (this is not necessarily the case for other instances). In fact, whether the global or the local optimum is detected depends on the starting values of the numerical optimization routines.

Figure 2. The likelihood surface for one simulated alignment as a function of $\alpha$ and $p_{inv}$.
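The dependence on starting values can be illustrated with a toy example. The surface below is a made-up one-dimensional function with a global peak at 0.25 and a lower local peak at 0.75; it is not the I+$\Gamma$ likelihood, and the simple hill climber stands in for a generic local optimizer rather than for the routine used by any particular program.

```python
def toy_surface(x):
    """Toy bimodal surface: global peak at 0.25, lower local peak at 0.75."""
    return -100.0 * min((x - 0.25) ** 2, (x - 0.75) ** 2 + 0.01)

def hill_climb(f, x0, step=0.1, tol=1e-9):
    """Local search: move to a strictly better neighbour, else halve the step."""
    x = x0
    while step > tol:
        best = max((x, x - step, x + step), key=f)  # ties keep the current x
        if best == x:
            step /= 2.0
        else:
            x = best
    return x

from_left = hill_climb(toy_surface, 0.1)   # ends at the global peak
from_right = hill_climb(toy_surface, 0.9)  # trapped at the local peak
print(from_left, toy_surface(from_left))
print(from_right, toy_surface(from_right))

# Restarting from several points and keeping the best result finds the global peak.
starts = [i / 9 for i in range(10)]
best = max((hill_climb(toy_surface, s) for s in starts), key=toy_surface)
print(best)
```

Both runs converge, but only the one started on the left reaches the higher peak; a single run gives no indication of which peak it has found.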

To summarize, we compared for each simulated alignment the log-likelihoods of the estimates with the log-likelihoods obtained for the true parameters. Table 1 shows how often the true parameter combination produced a higher likelihood than the parameters inferred by MrBayes, PhyML, RAxML, and IQ-TREE. These fractions are particularly high for the 6-taxon tree and for the ML inference programs. Most ML phylogenetic programs use general-purpose numerical methods to find $\alpha$ and $p_{inv}$ (e.g., Brent 1973). These methods are evidently not well adapted to the complex likelihood surface (Fig. 2), which explains the poor overall performance of the ML programs (Fig. 1).

Table 1.

Percentage of alignments where the true simulation parameters result in higher log-likelihoods than the inferred parameters from the programs for three simulation scenarios (6-, 24-, and 96-taxon trees)

Program	6-taxon tree (%)	24-taxon tree (%)	96-taxon tree (%)
MrBayes	34.7	6.6	5.6
PhyML	60.2	21.9	45.0
RAxML	89.2	37.7	49.5
IQ-TREE	36.0	34.1	44.0
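The quantity reported in Table 1 can be computed as follows. This is a minimal sketch with made-up log-likelihood values; the function name is illustrative, and a strict comparison is assumed.

```python
def pct_true_better(lnl_true, lnl_est):
    """Percentage of alignments where the true parameters give a higher log-likelihood."""
    worse = sum(1 for t, e in zip(lnl_true, lnl_est) if t > e)
    return 100.0 * worse / len(lnl_true)

# Hypothetical log-likelihoods for four alignments.
lnl_at_true_params = [-10.0, -20.0, -30.0, -40.0]
lnl_at_estimates = [-11.0, -19.0, -30.0, -45.0]
print(pct_true_better(lnl_at_true_params, lnl_at_estimates))  # prints 50.0
```

An ML estimate should never have a lower likelihood than the true parameters, so any nonzero percentage for an ML program indicates that the optimizer stopped short of the maximum.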


Effective Optimization Heuristic Produces Accurate Estimates

As a remedy, we propose an alternative optimization heuristic that employs the Expectation-Maximization (EM) algorithm (Dempster et al. 1977) to estimate $p_{inv}$. We assume a discrete $\Gamma$ distribution with $k$ rate categories. Under the I+$\Gamma$ discrete model, the site rates follow a discrete mixture model consisting of $k+1$ categories with rates $r_0, r_1, \ldots, r_k$, where $r_0 = 0$ represents invariable sites and $r_1, \ldots, r_k$ are the rates determined from the shape parameter $\alpha$ of the discrete $\Gamma$ distribution (Yang 1994). Given a tree topology, the optimization heuristic does the following:

1. Choose initial values for $\alpha$ and $p_{inv}$.
2. Optimize branch lengths by the Newton–Raphson method.
3. Optimize substitution model parameters by the Broyden–Fletcher–Goldfarb–Shanno algorithm.
4. For each alignment site $i$ compute its posterior probability $p_i$ of being invariable ($i = 1, \ldots, n$, where $n$ is the number of alignment sites):

$$p_i = \frac{w_0 L_{i0}}{\sum_{j=0}^{k} w_j L_{ij}},$$

where $L_{ij}$ is the likelihood of site $i$ having rate $r_j$ and $w_0 = p_{inv}$, $w_j = (1 - p_{inv})/k$ ($j = 1, \ldots, k$).
5. Update $p_{inv} = \frac{1}{n} \sum_{i=1}^{n} p_i$.
6. Optimize $\alpha$ by the Brent method.
7. If the log-likelihood improvement is greater than a predefined value, go back to Step 2. Otherwise, stop the parameter optimization.

Steps 4 and 5 correspond to the E- and M-step of the EM algorithm, respectively. To avoid being stuck in local optima, we repeat this optimization procedure from ten starting values of $p_{inv}$ evenly spaced between 0 and the fraction of constant sites observed in the alignment. The initial value of $\alpha$ is always set to 1.0.
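The E- and M-step for the invariable proportion can be sketched as follows. This is a minimal illustration, not the IQ-TREE-EM code: the per-site likelihoods are made-up numbers that are held fixed, whereas in the full heuristic they would be recomputed after the branch lengths, substitution parameters, and shape parameter are re-optimized (Steps 2, 3, and 6). All names are illustrative.

```python
import math

def e_step(p_inv, lik_inv, lik_gamma):
    """Posterior probability that each site is invariable (Step 4)."""
    post = []
    for l0, lg in zip(lik_inv, lik_gamma):
        k = len(lg)
        site_lik = p_inv * l0 + (1.0 - p_inv) / k * sum(lg)
        post.append(p_inv * l0 / site_lik)
    return post

def m_step(post):
    """New invariable proportion: the mean posterior over sites (Step 5)."""
    return sum(post) / len(post)

def log_likelihood(p_inv, lik_inv, lik_gamma):
    return sum(math.log(p_inv * l0 + (1.0 - p_inv) / len(lg) * sum(lg))
               for l0, lg in zip(lik_inv, lik_gamma))

def em_p_inv(p_inv, lik_inv, lik_gamma, eps=1e-12, max_iter=1000):
    """Iterate E- and M-steps until the log-likelihood stops improving."""
    lnl = log_likelihood(p_inv, lik_inv, lik_gamma)
    for _ in range(max_iter):
        p_inv = m_step(e_step(p_inv, lik_inv, lik_gamma))
        new_lnl = log_likelihood(p_inv, lik_inv, lik_gamma)
        if new_lnl - lnl <= eps:
            break
        lnl = new_lnl
    return p_inv

# Toy data: 70 constant sites (could be invariable) and 30 variable sites,
# for which the likelihood under rate 0 is exactly 0.
lik_inv = [0.25] * 70 + [0.0] * 30
lik_gamma = [[0.2, 0.15, 0.1, 0.05]] * 70 + [[0.001, 0.002, 0.003, 0.004]] * 30

print(em_p_inv(0.1, lik_inv, lik_gamma))
```

For these toy numbers the log-likelihood is $70 \log(1 + p_{inv}) + 30 \log(1 - p_{inv})$ plus a constant, which is maximized at $p_{inv} = 0.4$, and the iteration converges to that value. Note that a start at exactly $p_{inv} = 0$ is a fixed point of this update, so in this sketch a nonzero starting value is needed for the EM step to move.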

We implemented the new optimization heuristic in IQ-TREE, now called IQ-TREE-EM (IQ-TREE version 1.4.3), and repeated the previous simulations. Figure 3 shows that IQ-TREE-EM successfully recovered the true parameters for all but one parameter combination (on the 6-taxon tree), where the average estimates slightly deviated from the true values.

Figure 3. The averages of the estimated shape parameter, the averages of the estimated invariable fraction, and the averages of the estimated tree length produced by IQ-TREE-EM for the 100 alignments simulated from each parameter combination. The highlighting is explained in Fig. 1.

Also, the percentage of instances where the estimated log-likelihoods were lower than the log-likelihood for the true parameters dropped considerably (0.06% for the 6-taxon tree, 0.0% for the 24-taxon tree, and 0.03% for the 96-taxon tree; compare with Table 1).

This increase in accuracy comes at the cost of an increased total computing time by a factor of 1.3 compared to IQ-TREE.

Thus, we conclude that the inaccurate parameter estimation of the I+$\Gamma$ discrete model shown for the tested phylogenetic programs is caused by ineffective optimization methods.

Impact on Real Data

To investigate the impact of accuracy on real data for ML estimates, we analyzed 70 DNA and 45 protein TreeBase alignments (Nguyen et al. 2015). We applied the GTR+I+$\Gamma$4 and LG+I+$\Gamma$4 models for DNA and protein data, respectively. Among the 115 alignments, we detected 15 (5 DNA and 10 protein) alignments where the $\alpha$ and $p_{inv}$ estimated by PhyML, RAxML, or IQ-TREE deviated more than 10% from those estimated by IQ-TREE-EM (Fig. 4; Supplementary Table S1 available on Dryad). The estimates by PhyML and IQ-TREE deviated from those by IQ-TREE-EM only for one and two alignments, respectively. However, the RAxML estimates of $\alpha$ and $p_{inv}$ differed dramatically from those of IQ-TREE-EM, PhyML, and IQ-TREE for all 15 alignments. Interestingly, RAxML systematically overestimated $\alpha$ and $p_{inv}$ for all 5 DNA alignments and underestimated them for all 10 protein alignments (with estimates sometimes very close to zero).

Figure 4. Estimation of $\alpha$ (left) and $p_{inv}$ (right) for TreeBase alignments using IQ-TREE-EM ($x$-axis) and IQ-TREE (circle), PhyML (cross), and RAxML (triangle). Dashed lines show the boundaries of 10% deviation from the IQ-TREE-EM estimates. Points above the upper dashed lines indicate overestimation compared with IQ-TREE-EM, whereas points under the lower dashed lines indicate underestimation.

Discussion

Our simulations revealed a major issue for parameter estimation of the I+$\Gamma$ discrete model as implemented in phylogenetic software. Despite using very long alignments, none of the tested programs recovered the true $\alpha$, $p_{inv}$, and tree length for all parameter combinations. Often, the estimates deviated heavily from the true values, and different programs estimated different values for the same evolutionary parameters, although all programs inferred the true tree. Our further analysis of 115 TreeBase alignments showed that PhyML, IQ-TREE, and IQ-TREE-EM estimates generally agree with each other except for two alignments. However, we identified 15 (13%) alignments where RAxML systematically overestimated $\alpha$ and $p_{inv}$ for DNA and underestimated them for protein, compared with the other programs. The reasons for this behavior are unclear and deserve further analysis. While this result may not extrapolate to other data sets, phylogenetic software should benefit from the more robust optimization described for IQ-TREE-EM.

We showed that the estimation heuristics implemented in popular phylogenetic programs cause such inaccurate estimates and that the I+$\Gamma$ model per se is not problematic. The relatively good performance of MrBayes is likely attributable to the Bayesian sampling of the parameter space but comes at the cost of excessive computing time.

With IQ-TREE-EM, we provided an alternative optimization heuristic for ML methods that allows accurate estimation of the parameters of the I+$\Gamma$ discrete model. IQ-TREE-EM combines two optimization techniques: the multiple starting point strategy and the EM algorithm. We note that the EM algorithm alone will not achieve this accuracy (Supplementary Fig. S2 available on Dryad). The former allows the search to escape local optima, whereas the latter speeds up the optimization by using an analytical formula for $p_{inv}$. This new approach effectively infers the true evolutionary parameters for long alignments. Thus, it is tempting to speculate that the GTR+I+$\Gamma$ discrete model is also identifiable, as shown for the GTR+I+$\Gamma$ continuous model (Chai and Housworth 2011).

Our observations show that as models of sequence evolution become more and more complex (e.g., Dirichlet rate and other mixture models), tailored numerical optimization methods are necessary to achieve accurate estimates of evolutionary parameters. It is not enough to recover the true tree if one wants to understand how evolutionary forces shaped contemporary genomes. The effect of wrong parameter estimates for the substitution model on the total tree length is sometimes dramatic (see Fig. 1). This may in turn bias downstream analyses such as divergence time dating, inference of site-specific evolutionary rates, and ancestral sequence reconstruction, which are sensitive to the parameter estimates. Thus, one should critically scrutinize the heuristics implemented in popular programs. A more thorough evaluation of phylogenetic inference programs allowing for very complicated models of sequence evolution is necessary, but beyond the scope of this article.

Finally, we would like to point out that we only addressed the accurate computation of $\alpha$ and $p_{inv}$ for the widely used I+$\Gamma$ discrete model. We do not discuss the biological interpretation of $p_{inv}$. The estimate of $p_{inv}$ depends very much on the multiple sequence alignment at hand and may change if we enlarge the alignment. Thus, drawing an absolute conclusion from $p_{inv}$ is in any case questionable.

Acknowledgements

The authors would like to thank Heiko A. Schmidt for fruitful discussions, two anonymous reviewers, Fredrik Ronquist, and Edward Susko for constructive and helpful comments on an earlier version of the manuscript. The computational results presented have been achieved using the Vienna Scientific Cluster 3 (VSC-3).

Supplementary Material

Data available from the Dryad Digital Repository: http://doi.org/10.5061/dryad.4j5c7.

Funding

This work was supported by the Austrian Science Fund – FWF (Grant Nos I-2805-B29 and I-1824-B22).

References

Allman E.S., Rhodes J.A. 2008. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci. 211:18–33.
Ayres D.L., Darling A., Zwickl D.J., Beerli P., Holder M.T., Lewis P.O., Huelsenbeck J.P., Ronquist F., Swofford D.L., Cummings M.P., Rambaut A., Suchard M.A. 2012. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 61:170–173.
Brent R.P. 1973. Algorithms for minimization without derivatives. Englewood Cliffs, New Jersey: Prentice-Hall; p. 1–195.
Chai J., Housworth E.A. 2011. On Rogers' proof of identifiability for the GTR Gamma I model. Syst. Biol. 60:713–718.
Dempster A.P., Laird N.M., Rubin D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39:1–38.
Gu X., Fu Y.X., Li W.H. 1995. Maximum-likelihood-estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546–557.
Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59:307–321.
Jia F.Z., Lo N., Ho S.Y.W. 2014. The impact of modelling rate heterogeneity among sites on phylogenetic estimates of intraspecific evolutionary rates and timescales. PLoS One 9:e95722.
Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide-sequences. J. Mol. Evol. 16:111–120.
Mayrose I., Friedman N., Pupko T. 2005. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics 21:ii151–ii158.
Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32:268–274.
Posada D., Crandall K.A. 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14:817–818.
Rambaut A., Grassly N.C. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosci. 13:235–238.
Rogers J.S. 2001. Maximum likelihood estimation of phylogenetic trees is consistent when substitution rates vary according to the invariable sites plus gamma distribution. Syst. Biol. 50:713–722.
Ronquist F., Teslenko M., van der Mark P., Ayres D.L., Darling A., Hohna S., Larget B., Liu L., Suchard M.A., Huelsenbeck J.P. 2012. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61:539–542.
Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
Sullivan J., Swofford D.L. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mamm. Evol. 4:77–86.
Sullivan J., Swofford D.L., Naylor G.J.P. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol. 16:1347–1356.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306–314.
Yang Z. 2006. Computational molecular evolution. New York: Oxford University Press; p. 113–114.
