Abstract
Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previous versions, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.
Keywords: Bayes factor, Bayesian inference, MCMC, model averaging, model choice
Bayesian Markov chain Monte Carlo (MCMC) methods quickly gained in popularity after they were introduced in statistical phylogenetics in the late 1990s (Mau and Newton 1997, Yang and Rannala 1997, Larget and Simon 1999, Mau et al. 1999). This was due to the inherent advantages of the approach but also to the availability of easy-to-use software packages, such as MrBayes (Huelsenbeck and Ronquist 2001). Originally, MrBayes only supported simple phylogenetic models, but the model space expanded considerably in version 3.0 (Ronquist and Huelsenbeck 2003). In addition to a wide range of models on binary, “standard” (morphology), nucleotide and amino acid data, version 3.0 also supported mixed models. The latter allow different data partitions to be combined in the same model, with parameters linked or unlinked across partitions according to user specifications. MrBayes 3.0 was apparently the first statistical phylogenetics package to support such models (Rannala and Yang 2008).
Bayesian phylogenetic inference using MCMC has developed in leaps and bounds since the release of MrBayes 3.0. In particular, the relative ease with which complex models can be tackled using the MCMC machinery has led to an explosion in the development of probabilistic evolutionary models (for a review, see Ronquist and Deans 2010). We have also seen the appearance of better MCMC algorithms and more sophisticated convergence diagnostics for phylogenetic models, and methods for Bayesian model choice have improved considerably.
With this note, we announce the official release of version 3.2 of MrBayes. Version 3.2 was originally intended as a relatively modest expansion of version 3.1, which added convergence diagnostics to the original features in version 3.0. Over the years, however, a number of significant new features were added to version 3.2, and large parts of the program were rewritten. When we now officially release version 3.2, it is every bit as significant in the evolution of the program as the release of version 3.0 almost a decade ago.
DESCRIPTION OF NEW FEATURES
Convergence
The phylogenetics community has come to accept as good practice that Bayesian MCMC results be accompanied by a critical assessment of convergence. Arguably, the best way of accomplishing this is to compare samples obtained from independent MCMC analyses. It is typically the tree samples that are most divergent in phylogenetic analyses, and we therefore introduced the average standard deviation of split frequencies (ASDSF) in MrBayes to allow quantitative assessment of the similarity among such samples.
ASDSF is calculated by comparing split or clade frequencies across multiple independent MCMC runs that ideally should be started from different randomly chosen starting trees (Lakner et al. 2008). ASDSF should approach 0.0 as runs converge to the same distribution. The frequencies of rare splits or clades are difficult to estimate accurately and these groupings are usually of marginal interest. Therefore, it may be advantageous to exclude them from the diagnostic. MrBayes allows the user to set a cutoff frequency (default value 0.10); all splits or clades that occur at or above that frequency in at least one of the runs are incorporated in the ASDSF.
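The calculation described above can be illustrated with a minimal Python sketch. The function name, the dictionary-based input format, and the use of the sample standard deviation (denominator n - 1) are assumptions made for illustration; they are not taken from the MrBayes source code.

```python
from math import sqrt


def asdsf(split_freqs_per_run, cutoff=0.10):
    """Average standard deviation of split frequencies across runs.

    split_freqs_per_run: one dict per run, mapping a split label to the
    frequency of that split in the run's tree sample. A split missing
    from a run is treated as having frequency 0.0 in that run.
    """
    n_runs = len(split_freqs_per_run)
    if n_runs < 2:
        raise ValueError("ASDSF needs at least two independent runs")

    all_splits = set().union(*split_freqs_per_run)
    std_devs = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in split_freqs_per_run]
        # Skip rare splits: keep a split only if it reaches the cutoff
        # frequency in at least one of the runs.
        if max(freqs) < cutoff:
            continue
        mean = sum(freqs) / n_runs
        # Assumption: sample variance (n - 1 in the denominator).
        var = sum((f - mean) ** 2 for f in freqs) / (n_runs - 1)
        std_devs.append(sqrt(var))

    return sum(std_devs) / len(std_devs) if std_devs else 0.0


# Hypothetical split frequencies from two runs.
runs = [
    {"AB|CDE": 0.90, "ABC|DE": 0.50, "AC|BDE": 0.02},
    {"AB|CDE": 0.70, "ABC|DE": 0.50, "AC|BDE": 0.04},
]
print(asdsf(runs))
```

In this toy input, the third split never reaches the default cutoff of 0.10 and is therefore ignored, so the diagnostic is the average over the two remaining splits.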
To allow users to monitor MCMC progress, MrBayes can run several analyses in parallel and report the average (ASDSF) or maximum standard deviation of split frequencies at regular intervals. More detailed diagnostics can be obtained using the “sump” and “sumt” commands after the run has completed. They include ASDSF across runs for each of the sampled clades in addition to the potential scale reduction factor (PSRF; Gelman and Rubin 1992) for branch lengths, node times, and substitution model parameters. PSRF compares the variance within and between runs and should approach 1.0 as runs converge. MrBayes 3.2 also reports the effective sample size, widely used for single-run convergence diagnostics.
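As an illustration of how the PSRF compares within-run and between-run variance, the following Python sketch implements the basic form of the Gelman and Rubin (1992) diagnostic for a single scalar parameter. The function name and input layout are hypothetical, and the exact variant computed by MrBayes may differ in detail.

```python
from math import sqrt
from statistics import mean, variance


def psrf(runs):
    """Basic potential scale reduction factor for one scalar parameter.

    runs: list of equally long lists, one list of sampled values per
    independent run (burn-in already removed).
    """
    n = len(runs[0])
    if len(runs) < 2 or any(len(r) != n for r in runs):
        raise ValueError("need at least two runs of equal length")

    # W: average of the within-run sample variances.
    within = mean([variance(r) for r in runs])
    # B / n: sample variance of the run means.
    between_over_n = variance([mean(r) for r in runs])
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between_over_n
    return sqrt(pooled / within)


# Two runs sampling clearly different regions give a PSRF far above 1.
print(psrf([[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]))
```

When the runs sample the same distribution, the between-run term becomes negligible and the value approaches 1.0 as the number of samples grows.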
MrBayes 3.2 also introduces several new features intended to improve MCMC convergence rates. A number of new tree proposal mechanisms have been added, including subtree-swapping moves and extending subtree-pruning-and-regrafting moves, and the default mix of proposals has been optimized (Lakner et al. 2008). MrBayes 3.2 further includes a completely new type of tree proposal that is guided using parsimony scores. The details of the parsimony-biased proposals will be presented elsewhere; however, preliminary empirical results show that they can improve the speed of convergence by an order of magnitude on some problems (see also Höhna and Drummond 2012). For nontree proposals, MrBayes 3.2 implements auto-tuning that automatically adjusts tuning parameters such that a target acceptance frequency is reached (Roberts and Rosenthal 2009). As in previous versions, MrBayes supports Metropolis coupling (heated chains) to accelerate convergence. To simplify monitoring of convergence, MrBayes 3.2 prints ASDSF values, acceptance rates of moves, and acceptance rates of swaps between Metropolis-coupled chains to a separate file with a “.mcmc” suffix during runs.
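The general idea behind auto-tuning can be sketched in Python as follows. This is a toy example in the spirit of the batch-wise adaptation discussed by Roberts and Rosenthal (2009), not the MrBayes implementation: the standard normal target, the sliding-window proposal, the batch size, and the step-size schedule are all assumptions chosen for illustration.

```python
import math
import random


def autotune_window(target=0.25, n_batches=300, batch_size=100, seed=1):
    """Tune the width of a sliding-window proposal on a N(0, 1) target.

    After every batch of iterations, the log of the window width is
    nudged up if the batch acceptance rate exceeded the target and down
    otherwise. The nudge shrinks with the batch number, so adaptation
    diminishes over time.
    """
    rng = random.Random(seed)
    log_width = 0.0  # start with a window of width 1.0
    x = 0.0
    batch_rates = []
    for batch in range(1, n_batches + 1):
        width = math.exp(log_width)
        accepted = 0
        for _ in range(batch_size):
            proposal = x + width * (rng.random() - 0.5)
            # Log ratio of standard normal densities (symmetric proposal).
            log_ratio = 0.5 * (x * x - proposal * proposal)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                x = proposal
                accepted += 1
        rate = accepted / batch_size
        batch_rates.append(rate)
        step = batch ** -0.5  # diminishing adaptation
        log_width += step if rate > target else -step
    return math.exp(log_width), batch_rates


width, rates = autotune_window()
print(width, sum(rates[-100:]) / 100)
```

A wider window lowers the acceptance rate and a narrower one raises it, so the tuning parameter settles where the observed acceptance rate is close to the target.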
Faster and More Convenient Computation
Much of the computational effort in a phylogenetic MCMC analysis is spent calculating likelihoods. To improve speed, MrBayes 3.2 now employs streaming single-instruction-multiple-data extensions (SSE) for all likelihood calculations. SSE instructions are supported by most current CPUs and provide low-level parallelization of arithmetic operations. Importantly, MrBayes 3.2 also supports the use of the BEAGLE library for likelihood calculations (Ayres et al. 2012). With BEAGLE, the likelihood calculations can be farmed out to one or more graphics processing units (GPUs) on compatible hardware, resulting in significant speedups for codon and amino acid models in particular. BEAGLE can also be used for likelihood computation on the CPU.
MrBayes 3.2 does not support multithreading, but it does implement the message passing interface (MPI) for efficient parallel processing across large computer clusters (Altekar et al. 2004). On many hardware platforms, including Mac OS and Linux, it is possible to use the MPI-enabled Unix version of MrBayes to take advantage of multiple cores. However, MPI parallelization is across chains, which means that the maximum number of cores or processors that can be used by MrBayes is the same as the total number of heated and nonheated chains across all simultaneous runs. For instance, two runs of four chains each would be maximally accelerated on a system with eight processors or cores. The MPI version can be combined with BEAGLE to further expand the opportunity for computational parallelization.
Finally, to facilitate long runs, MrBayes 3.2 implements checkpointing across all models. At a frequency determined by the user, all parameter samples are printed to a “.ckp” file. If desired, the analysis can later be restarted from the checkpoint file, and the final results will appear as if the run had never been stopped.
New Models
Many phylogenetic hypotheses concern the structure of the phylogenetic tree. To facilitate such analyses, MrBayes 3.2 implements three types of constraints on the tree: hard, negative, and partial. A hard constraint forces a split or clade to be present in all trees sampled in the MCMC analysis, whereas a negative constraint forces a split or clade to be absent. Unlike hard and negative constraints, a partial constraint (or backbone constraint) can leave the position of some taxa indeterminate. The indeterminate taxa are allowed to appear on either side of the specified split if the tree is unrooted, or either within or outside the specified clade if the tree is rooted. Several hard, negative, and partial constraints can be combined into complicated priors on the shape of the tree. However, constraints are either on or off; they cannot be associated with probabilities in the current version.
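The three constraint types can be made concrete with a small Python sketch for rooted trees. The representation of a tree as a set of clades (each a frozenset of taxon names) and the function names are hypothetical; the sketch only shows the logic of each constraint type, not how MrBayes enforces constraints during sampling.

```python
def satisfies_hard(clades, taxa):
    """Hard constraint: the clade must be present in the tree."""
    return frozenset(taxa) in clades


def satisfies_negative(clades, taxa):
    """Negative constraint: the clade must be absent from the tree."""
    return frozenset(taxa) not in clades


def satisfies_partial(clades, inside, outside):
    """Partial (backbone) constraint on a rooted tree.

    Some clade must contain every taxon in `inside` and none of the
    taxa in `outside`. Taxa listed in neither set are free to fall
    within or outside that clade.
    """
    inside, outside = frozenset(inside), frozenset(outside)
    return any(inside <= clade and not (outside & clade) for clade in clades)


# Clades of the rooted tree ((A,B),(C,(D,E))), including the root clade.
tree = {
    frozenset("AB"),
    frozenset("DE"),
    frozenset("CDE"),
    frozenset("ABCDE"),
}

print(satisfies_hard(tree, "AB"))            # clade (A,B) is present
print(satisfies_negative(tree, "AC"))        # clade (A,C) is absent
print(satisfies_partial(tree, "CD", "A"))    # C and D group without A
```

In the last call, taxa B and E are indeterminate: the constraint is satisfied by the clade (C,(D,E)) regardless of where they fall.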
Unlike previous versions, MrBayes 3.2 supports relaxed clock models and dating. Three different relaxed clock models are available: the Compound Poisson Process (CPP; Huelsenbeck et al. 2000), the Thorne–Kishino 2002 (TK02; Thorne and Kishino 2002), and the Independent Gamma Rate (IGR; Lepage et al. 2007) models.
The CPP model is a discrete autocorrelated model, in which rate multipliers appear on the tree according to a Poisson process. The MrBayes implementation uses a lognormal distribution for the rate multipliers instead of the modified gamma distribution proposed originally (Huelsenbeck et al. 2000). It also includes novel algorithms to allow sampling across tree space since the original paper only dealt with fixed trees.
The TK02 model is a continuous autocorrelated model. In the particular version we implemented (Thorne and Kishino 2002), the rate of a descendant node is drawn from a lognormal distribution, the mean of which is the same as the ancestral rate and the variance of which is proportional to the length of the branch (measured in expected substitutions per site at the base rate of the clock).
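A single rate draw under this description can be sketched in Python as follows. The sketch reads the stated mean and variance as moments of the rate itself (rather than of its logarithm), which is an assumption, and converts them to the parameters of the underlying normal distribution; the function names and the symbol `nu` for the proportionality constant are hypothetical.

```python
import math
import random


def lognormal_params(mean, variance):
    """Convert the mean and variance of a lognormal variable into the
    mean (mu) and standard deviation (sigma) of its logarithm."""
    sigma2 = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)


def draw_descendant_rate(ancestral_rate, branch_length, nu, rng):
    """Draw the rate at a descendant node.

    The draw has expectation `ancestral_rate` and variance
    `nu * branch_length`, with branch_length measured in expected
    substitutions per site at the base rate of the clock.
    """
    mu, sigma = lognormal_params(ancestral_rate, nu * branch_length)
    return rng.lognormvariate(mu, sigma)


rng = random.Random(42)
print(draw_descendant_rate(ancestral_rate=1.0, branch_length=0.1, nu=0.5, rng=rng))
```

Because the variance grows with branch length, rates on short branches stay close to the ancestral rate, which is what makes the model autocorrelated.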
The IGR model is a continuous uncorrelated model. First published as the “white noise” model (Lepage et al. 2007), it is similar to the uncorrelated gamma model (Drummond et al. 2006) but is mathematically more elegant in that it truly lacks time structure. In the IGR model, effective branch lengths are drawn from a gamma distribution, in which the mean is the same as, and the variance proportional to, the branch length.
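The gamma distribution described above is fully determined by its mean and variance, as the following Python sketch shows. The function names and the name `igr_var` for the proportionality constant are assumptions made for illustration.

```python
import random


def igr_gamma_params(branch_length, igr_var):
    """Shape and scale of a gamma distribution with mean equal to
    branch_length and variance equal to igr_var * branch_length."""
    shape = branch_length / igr_var  # mean**2 / variance
    scale = igr_var                  # variance / mean
    return shape, scale


def draw_effective_branch_length(branch_length, igr_var, rng):
    """Draw one effective branch length under the sketch above."""
    shape, scale = igr_gamma_params(branch_length, igr_var)
    return rng.gammavariate(shape, scale)


rng = random.Random(11)
draws = [draw_effective_branch_length(0.1, 0.05, rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # close to the branch length, 0.1
```

Since each draw depends only on the length of its own branch, the rates implied on neighboring branches are independent, in contrast to the autocorrelated CPP and TK02 models.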
Dating can be achieved in MrBayes 3.2 by calibrating interior or tip nodes in the tree; calibrated interior nodes need to be associated with hard constraints to be valid. Calibration points can be either fixed or associated with uncertainty. The birth–death prior model on clock trees has been expanded to incorporate recent progress in the understanding of the linear constant birth–death process with complete sampling (Gernhard 2008), with random incomplete sampling (Stadler 2009), or with clustered or diversified sampling (Höhna et al. 2011). The tree moves on clock and relaxed clock trees have also been improved considerably over those that were available in previous versions.
Bayesian phylogenetic inference of species trees from multiple gene trees was first accomplished in the Bayesian estimation of species trees (BEST) software using a complex computational machinery, in which MrBayes was one of the components (Edwards et al. 2007, Liu and Pearl 2007). Despite later improvements to BEST, the analyses remained slow and computationally demanding. The multispecies coalescent model has now been fully integrated in MrBayes 3.2, and several of the original algorithms have been rewritten to speed up the calculations.
Model Averaging and Model Choice
It is standard practice today to select a substitution model for Bayesian phylogenetic inference using a priori model selection procedures (Goldman 1993, Posada 1998, Posada 2008, Suchard et al. 2001). An alternative is to use Bayesian model jumping during the MCMC simulation to integrate out the uncertainty concerning the correct substitution model (Huelsenbeck et al. 2004). The latter procedure is now implemented in MrBayes 3.2. Rather than selecting a substitution model before the analysis, the user can now sample across all 203 possible time-reversible rate matrices according to their posterior probability. The model-jumping approach is available in all models where a four-by-four nucleotide model is a component, including doublet and codon models in addition to the ordinary nucleotide models.
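The figure of 203 follows from counting the ways in which the six exchangeability rates of a time-reversible nucleotide model can be grouped into classes that share a rate, which is the number of partitions of a set of six elements. The Python sketch below enumerates these partitions as restricted growth strings; the function name and the zero-based labeling are illustrative choices.

```python
def restricted_growth_strings(n):
    """Yield every partition of n items as a restricted growth string.

    Item i receives the label of its class; a new class always takes
    the smallest unused label, so each partition appears exactly once.
    """
    def extend(prefix, n_classes):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(n_classes + 1):
            yield from extend(prefix + [label], max(n_classes, label + 1))

    yield from extend([], 0)


# The six exchangeability rates of a time-reversible nucleotide model.
RATES = ("AC", "AG", "AT", "CG", "CT", "GT")
models = list(restricted_growth_strings(len(RATES)))

print(len(models))  # 203
# (0, 0, 0, 0, 0, 0): all six rates equal.
# (0, 1, 0, 0, 1, 0): one rate for transitions (AG, CT), one for transversions.
# (0, 1, 2, 3, 4, 5): six distinct rates, the general time-reversible model.
```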
Bayesian model choice using Bayes factors is rapidly gaining in popularity. Since earlier versions, MrBayes has reported the harmonic mean of the likelihoods from the MCMC sample, which can be used as a rough estimate of the model likelihood from which the Bayes factor is calculated (Newton and Raftery 1994). However, there are now considerably more accurate, albeit computationally more demanding, methods (Lartillot and Philippe 2006). Of these, MrBayes 3.2 implements the recently proposed stepping stone method (Xie et al. 2011) that uses MCMC to sample from a series of so-called power posterior distributions connecting the posterior distribution with the prior distribution. The samples across these distributions are then used to estimate the model likelihood. The stepping stone algorithm in MrBayes 3.2 uses the full MCMC machinery, including convergence diagnostics and Metropolis coupling, and can be applied to any model available in the program. For instance, it can be used to test various topological hypotheses or substitution models against each other.
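The core of the stepping stone estimator can be sketched in a few lines of Python. The sketch assumes that log-likelihood values have already been sampled from each power posterior; to keep it self-contained, it draws them directly from a toy conjugate normal model in which every power posterior is known in closed form, instead of running MCMC. The function names, the toy model, and the particular schedule of powers are assumptions chosen for illustration and do not reflect the MrBayes implementation.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def stepping_stone_log_ml(betas, loglik_samples):
    """Stepping stone estimate of the log marginal likelihood.

    betas: increasing powers from 0.0 (prior) to 1.0 (posterior).
    loglik_samples[k]: log-likelihoods of draws from the power
    posterior with power betas[k], for every k except the last.
    """
    total = 0.0
    for k in range(len(betas) - 1):
        delta = betas[k + 1] - betas[k]
        # Each ratio of normalizing constants is estimated from draws
        # taken at the lower of the two powers.
        total += log_mean_exp([delta * ll for ll in loglik_samples[k]])
    return total


def toy_example(y=1.0, n_steps=20, n_draws=5000, seed=7):
    """Prior theta ~ N(0, 1), one observation y ~ N(theta, 1)."""
    rng = random.Random(seed)
    # Assumed schedule: powers concentrated near the prior.
    betas = [(k / n_steps) ** (1 / 0.3) for k in range(n_steps + 1)]
    samples = []
    for beta in betas[:-1]:
        # The power posterior is N(beta * y / (1 + beta), 1 / (1 + beta)).
        mean, sd = beta * y / (1 + beta), (1 + beta) ** -0.5
        draws = [rng.gauss(mean, sd) for _ in range(n_draws)]
        samples.append(
            [-0.5 * math.log(2 * math.pi) - 0.5 * (y - t) ** 2 for t in draws]
        )
    estimate = stepping_stone_log_ml(betas, samples)
    exact = -0.5 * math.log(4 * math.pi) - y * y / 4  # y ~ N(0, 2)
    return estimate, exact


print(toy_example())
```

In the toy model the marginal likelihood is available analytically, so the estimate can be compared directly with the exact value.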
More Output Options
MrBayes 3.2 provides more extensive output options than previous versions. The user can now request sampling of site rates, site selection coefficients, site positive selection probabilities, and ancestral states of particular nodes. A wide range of tree statistics, including the mean and variance of split or clade frequencies, node times, and branch rates, are now added as annotations to the consensus tree by the “sumt” command and can be displayed using FigTree and compatible tree viewers.
BENCHMARK AND BIOLOGICAL EXAMPLES
Benchmark data on the GPU-accelerated code are provided by Ayres et al. (2012). A number of example data sets are distributed with the program, and tutorials illustrating most of the new features are included in the program manual. Many of the dating features in MrBayes 3.2 are discussed in some detail and used in an empirical context in Ronquist et al. (2012).
AVAILABILITY
MrBayes 3.2 is freely available under the GNU General Public License version 3.0. The program web site (http://www.mrbayes.net) provides download links to both source code for compilation on Unix systems and to convenient installers for Windows and Mac OS systems. The installers include both MrBayes and the required BEAGLE libraries, but the BEAGLE libraries can also be installed separately using the BEAGLE installer, available at http://beagle-lib.googlecode.com. The program comes with a manual and example files. Further help is available on the program web site, which also provides instructions for reporting bugs and signing up for the MrBayes e-mail list. Instructions for accessing the MrBayes source code repository can be found at http://sourceforge.net/projects/mrbayes/develop.
FUNDING
The development of version 3.2 of MrBayes would not have been possible without generous support from the Swedish Research Council [2008-5629 to F.R.]; the National Institutes of Health [GM-069801 to J.P.H. and GM-086887, HG-006139 to M.A.S.]; and the National Science Foundation [DEB-0445453 to J.P.H., DEB-0949121 and DEB-0936214 to B.L., and DBI-0755048 to D.L.A.]. Incorporation of the BEST algorithms and support for the BEAGLE library was greatly facilitated by a workshop in October 2010 sponsored by the Mathematical Biosciences Institute at Ohio State University [NSF-DMS-0931642], hosted by Dennis Pearl and Marty Golubitsky.
Acknowledgments
F.R., with the assistance of M.T. and P.v.d.M., did most of the programming for version 3.2, whereas J.P.H., assisted by F.R., was responsible for the software architecture and initial code base. D.L.A., A.D., and M.A.S. helped with the BEAGLE integration and the related performance testing. L.L. assisted in the incorporation of the BEST algorithms, whereas B.L. and S.H. contributed to the implementation of particular models. We would like to thank Chris Anderson for additional assistance with the BEST algorithms. We would also like to express our deep gratitude to the many MrBayes users, who have generously contributed to the project by submitting bug reports, bug fixes, feature requests, and other comments on the software. David Posada, Leonardo Martins, and Jeremy Brown provided constructive criticism that helped improve the manuscript.
References
- Altekar G, Dwarkadas S, Huelsenbeck J. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics. 2004;20:407–425. doi: 10.1093/bioinformatics/btg427.
- Ayres DL, Darling A, Zwickl DJ, Beerli P, Holder MT, Lewis PO, Huelsenbeck JP, Ronquist F, Swofford DL, Cummings MP, Rambaut A, Suchard MA. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst. Biol. 2012;61:170–173. doi: 10.1093/sysbio/syr100.
- Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4:e88. doi: 10.1371/journal.pbio.0040088.
- Edwards SV, Liu L, Pearl DK. High-resolution species trees without concatenation. Proc. Natl. Acad. Sci. U.S.A. 2007;104:5936–5941. doi: 10.1073/pnas.0607004104.
- Gelman A, Rubin D. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992;7:457–472.
- Gernhard T. The conditioned reconstructed process. J. Theor. Biol. 2008;253:769–778. doi: 10.1016/j.jtbi.2008.04.005.
- Goldman N. Statistical tests of models of DNA substitution. J. Mol. Evol. 1993;36:182–198. doi: 10.1007/BF00166252.
- Höhna S, Drummond AJ. Guided tree topology proposal for Bayesian phylogenetic inference. Syst. Biol. 2012;61:1–11. doi: 10.1093/sysbio/syr074.
- Höhna S, Stadler T, Ronquist F, Britton T. Inferring speciation and extinction rates under different species sampling schemes. Mol. Biol. Evol. 2011;28:2577–2589. doi: 10.1093/molbev/msr095.
- Huelsenbeck J, Larget B, Swofford D. A compound Poisson process for relaxing the molecular clock. Genetics. 2000;154:1879–1892. doi: 10.1093/genetics/154.4.1879.
- Huelsenbeck JP, Larget B, Alfaro ME. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol. 2004;21:1123–1133. doi: 10.1093/molbev/msh123.
- Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17:754–755. doi: 10.1093/bioinformatics/17.8.754.
- Lakner C, van der Mark P, Huelsenbeck J, Larget B, Ronquist F. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol. 2008;57:86–103. doi: 10.1080/10635150801886156.
- Larget B, Simon D. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 1999;16:750–759.
- Lartillot N, Philippe H. Computing Bayes factors using thermodynamic integration. Syst. Biol. 2006;55:195–207. doi: 10.1080/10635150500433722.
- Lepage T, Bryant D, Philippe H, Lartillot N. A general comparison of relaxed molecular clock models. Mol. Biol. Evol. 2007;24:2669–2680. doi: 10.1093/molbev/msm193.
- Liu L, Pearl DK. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol. 2007;56:504–514. doi: 10.1080/10635150701429982.
- Mau B, Newton MA. Phylogenetic inference for binary data on dendograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. 1997;6:122–131.
- Mau B, Newton MA, Larget B. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics. 1999;55:1–12. doi: 10.1111/j.0006-341x.1999.00001.x.
- Newton M, Raftery A. Approximate Bayesian inference with the weighted likelihood bootstrap. J. R. Stat. Soc. B Stat. Methodol. 1994;56:3–48.
- Posada D. Modeltest: testing the model of DNA substitution. Bioinformatics. 1998;14:817–818. doi: 10.1093/bioinformatics/14.9.817.
- Posada D. jModelTest: phylogenetic model averaging. Mol. Biol. Evol. 2008;25:1253–1256. doi: 10.1093/molbev/msn083.
- Rannala B, Yang Z. Phylogenetic inference using whole genomes. Annu. Rev. Genomics Hum. Genet. 2008;9:217–231. doi: 10.1146/annurev.genom.9.081307.164407.
- Roberts G, Rosenthal J. Examples of adaptive MCMC. J. Comput. Graph. Stat. 2009;18:349–367.
- Ronquist F, Deans AR. Bayesian phylogenetics and its influence on insect systematics. Annu. Rev. Entomol. 2010;55:189–206. doi: 10.1146/annurev.ento.54.110807.090529.
- Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. doi: 10.1093/bioinformatics/btg180.
- Ronquist F, Klopfstein S, Vilhelmsen L, Schulmeister S, Murray DL, Rasnitsyn AP. Forthcoming. A total-evidence approach to dating with fossils, applied to the early radiation of the Hymenoptera. Syst. Biol. 2012 doi: 10.1093/sysbio/sys058.
- Stadler T. On incomplete sampling under birth-death models and connections to the sampling-based coalescent. J. Theor. Biol. 2009;261:58–66. doi: 10.1016/j.jtbi.2009.07.018.
- Suchard MA, Weiss RE, Sinsheimer JS. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 2001;18:1001–1013. doi: 10.1093/oxfordjournals.molbev.a003872.
- Thorne JL, Kishino H. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. 2002;51:689–702. doi: 10.1080/10635150290102456.
- Xie W, Lewis PO, Fan Y, Kuo L, Chen M-H. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol. 2011;60:150–160. doi: 10.1093/sysbio/syq085.
- Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 1997;14:717–724. doi: 10.1093/oxfordjournals.molbev.a025811.