Systematic Biology. 2022 Feb 17;71(4):901–916. doi: 10.1093/sysbio/syac010

StarBeast3: Adaptive Parallelized Bayesian Inference under the Multispecies Coalescent

Jordan Douglas 1, Cinthy L Jiménez-Silva 1, Remco Bouckaert 1
Editor: Rayna Bell
PMCID: PMC9248896  PMID: 35176772

Abstract

As genomic sequence data become increasingly available, inferring the phylogeny of the species as that of concatenated genomic data can be enticing. However, this approach makes for a biased estimator of branch lengths and substitution rates and an inconsistent estimator of tree topology. Bayesian multispecies coalescent (MSC) methods address these issues. This is achieved by constraining a set of gene trees within a species tree and jointly inferring both under a Bayesian framework. However, this approach comes at the cost of increased computational demand. Here, we introduce StarBeast3—a software package for efficient Bayesian inference under the MSC model via Markov chain Monte Carlo. We gain efficiency by introducing cutting-edge proposal kernels and adaptive operators, and StarBeast3 is particularly efficient when a relaxed clock model is applied. Furthermore, gene-tree inference is parallelized, allowing the software to scale with the size of the problem. We validated our software and benchmarked its performance using three real and two synthetic data sets. Our results indicate that StarBeast3 is up to one-and-a-half orders of magnitude faster than StarBeast2, and therefore more than two orders faster than *BEAST, depending on the data set and on the parameter, and can achieve convergence on large data sets with hundreds of genes. StarBeast3 is open-source and is easy to set up with a friendly graphical user interface. [Adaptive; Bayesian inference; BEAST 2; effective population sizes; high performance; multispecies coalescent; parallelization; phylogenetics.]


Existing methods for testing macroevolutionary and macroecological questions have not kept pace with the explosion of next-generation sequence data now available (Blom et al. 2016b; Bragg et al. 2017; Stenson et al. 2017). Despite burgeoning databases of within- and between-species genomic diversity (Blom et al. 2016b; Bragg et al. 2017; Stenson et al. 2017), it is still common practice to ignore the gene-tree discordance that underlies any species phylogeny inferred from multilocus sequences and instead infer species ancestry based on concatenated sequence data taken to represent all underlying gene histories (Degnan and Rosenberg 2009; Heled and Drummond 2010; Jones 2017; Ogilvie et al. 2017; Rannala and Yang 2017). While this approach can perform well for inferring topologies when branches are long and incomplete lineage sorting (ILS) is absent, these conditions are rarely met.

Species trees inferred from concatenated sequences are often topologically incorrect (Degnan and Rosenberg 2009; Heled and Drummond 2010; Ogilvie et al. 2017), provide biased estimates for branch lengths and substitution rates (Kubatko et al. 2011; Ogilvie et al. 2016; Mendes and Hahn 2016), and underestimate uncertainty in tree topology, resulting in an unjustified degree of confidence in the wrong tree (Heled and Drummond 2010; Ogilvie et al. 2017). Such biases are exacerbated by subsampling of incongruent genes (Edwards et al. 2016; Mendes and Hahn 2016) and hold even for deep splits in the tree (Oliver 2013). These are crucial concerns in themselves and, more generally, can lead to biased estimates and erroneous inferences about fundamental evolutionary and ecological processes that require accurate phylogenetic trees, such as rates of speciation and extinction (Cadena et al. 2011; Rowe et al. 2011; Pepper et al. 2013), rates of substitution in DNA sequences (Bouckaert et al. 2013) and morphological characters (Pepper et al. 2013), species ancestry and ancestral age estimation (Mitchell et al. 2014), geographical history and origins (Lemey et al. 2009; Bouckaert 2016), and species delimitation (Yang and Rannala 2010; Grummer et al. 2013; Leaché et al. 2014; Yang and Rannala 2014).

The multispecies coalescent (MSC; Maddison 1997; Edwards 2009; Liu et al. 2009) is an approach designed to minimize these potential biases by modeling macroevolution as a distribution of gene trees constrained by a species tree (Degnan and Rosenberg 2009; Heled and Drummond 2010; Jones 2017; Ogilvie et al. 2017; Rannala and Yang 2017). In doing so, the MSC provides a more biologically realistic framework for phylogenetic inference that captures the process of ILS underlying most multilocus phylogenies. Furthermore, by explicitly modeling both species and gene trees, the MSC can address questions that cannot be addressed under a concatenation approach—such as automatic species delimitation (Fujita et al. 2012), with important implications for biodiversity assessment and conservation (Bickford et al. 2007).

A number of software packages have implemented the MSC in various ways (see review by Liu et al. 2015). Our work at the Centre for Computational Evolution at the University of Auckland has led the development of *BEAST (STARBeast; Heled and Drummond 2010) and StarBeast2 (Ogilvie et al. 2017), both full Bayesian MSC frameworks for species-tree estimation from multilocus sequence data, and UglyTrees for visualizing these models (Douglas 2020). By explicitly modeling the MSC and avoiding the biases associated with concatenation methods (Heled and Drummond 2010; Ogilvie et al. 2016; Ogilvie et al. 2017), an analysis using either of these software packages can significantly improve the conclusions drawn from data.

However, despite some advances in computational efficiency of the full Bayesian MSC (Jones 2017; Ogilvie et al. 2017; Rannala and Yang 2017), these complex models remain computationally intractable for large next-generation sequence data sets of hundreds of sequenced loci across hundreds of individuals (i.e., Inline graphic Inline graphic samples Inline graphic loci). As a result, existing applications of the approach have tended to consider smaller data sets (Kang et al. 2014; Blom et al. 2016a) or to ignore much of the available data (Blom et al. 2016b; Bragg et al. 2017; Stenson et al. 2017), which reduces accuracy and increases uncertainty in species-tree estimates (Song et al. 2012; Ogilvie et al. 2017). One approach to this problem has been the development of much simpler summary coalescent methods that utilize distributions of estimated gene-tree topologies as input to rapidly process large data sets (Liu et al. 2015). These include the rooted triplet method MP-EST (Liu et al. 2010) and the quartet method ASTRAL (Mirarab et al. 2014). However, summary coalescent methods are sensitive to gene-tree errors (Mirarab and Warnow 2015; Xi et al. 2015) and produce trees in coalescent units, and thus time and population size estimates used by downstream analyses are confounded.

Here, we aim to perform Bayesian inference on large data sets using the Markov chain Monte Carlo (MCMC) algorithm as our workhorse. As illustrated in Figure 1, the number of parameters involved is quite large, as is the accompanying state space. We develop a set of new MCMC proposals to explore state space in a much more efficient way than previous implementations and demonstrate we can handle data sets several times faster than *BEAST and StarBeast2. The resulting software package StarBeast3 is available as an open-source BEAST 2 package (Bouckaert et al. 2019).

Figure 1.

Depiction of the multispecies coalescent model, with Inline graphic gene trees constrained within a single species tree Inline graphic with Inline graphic species. In this depiction, node heights (age) run along the y-axis and species-tree node widths are proportional to effective population sizes (arbitrary units). The relative molecular substitution rate of each species-tree branch is proportional to line thickness. Tree was built from a Gopher data set (Belfiore et al. 2008) and visualized using UglyTrees (Douglas 2020).

Methods

The MSC

Our objective is to develop efficient methods in a Bayesian framework for analyzing models where there is a phylogeny, Inline graphic, such as a species or language tree, that forms a constraint on a set of Inline graphic trees Inline graphic, such as gene trees. Each taxon within Inline graphic is assigned to a single taxon within Inline graphic, from some fixed individual-to-species mapping function (Fig. 1). Species tree Inline graphic consists of a topology Inline graphic and divergence times Inline graphic, as does the set of gene trees Inline graphic.

All trees are assumed to be binary rooted time trees, where branch lengths describe the passing of time from the root of the tree down to the tips. Taxon node heights are assumed to be fixed and are typically extant (with height 0). Each gene tree Inline graphic consists of Inline graphic nodes and Inline graphic branches for taxon count Inline graphic, while Inline graphic consists of Inline graphic nodes and Inline graphic branches, including a root branch, for species count Inline graphic. Gene-tree taxa are associated with data Inline graphic, for example, nucleotide sequences or cognate data. Let Inline graphic be a set of model parameters, for instance, those related to the speciation or nucleotide substitution processes. Consider the posterior density function Inline graphic:

graphic file with name Equation1.gif (1)

The MSC model is therefore hierarchical. Inline graphic can follow a range of tree prior distributions Inline graphic, such as the Yule (Yule 1925) or birth–death models (Nee et al. 1994). In contrast, each gene tree Inline graphic is assumed to follow the MSC process (Degnan and Rosenberg 2009; Heled and Drummond 2010; Jones 2017; Ogilvie et al. 2017; Rannala and Yang 2017), under which species-tree branches are associated with independently and identically distributed (effective) population sizes Inline graphic which govern the coalescent process of Inline graphic, where Inline graphic. Gene trees are thus assumed to be contained within Inline graphic (Fig. 1).

Site evolution is assumed to follow a continuous-time Markov process (Felsenstein 1981) under some substitution model and clock model:

graphic file with name Equation2.gif (2)

Inline graphic can adopt a range of molecular substitution models, such as the HKY nucleotide evolution model (Hasegawa et al. 1985) or the WAG amino acid evolution model (Whelan and Goldman 2001). Tree Inline graphic has relative molecular substitution rate Inline graphic. Branches in Inline graphic are associated with substitution rates Inline graphic, which govern the rate of site evolution of Inline graphic along the respective branch, where Inline graphic (Fig. 1). Branch rates Inline graphic are assumed to be independently and identically distributed under a log-normal distribution with standard deviation Inline graphic (i.e., the MSC relaxed clock model; Drummond et al. 2006; Ogilvie et al. 2017). Lastly, the clock rate Inline graphic can be estimated when accompanied by time-calibration data, such as ancient fossil records (Sauquet et al. 2011; Heled and Drummond 2012; Ballesteros and Sharma 2019), or left fixed when no such data are available. Overall, the total substitution rate of any given branch in Inline graphic is the product of Inline graphic, Inline graphic, and a subset of the elements in Inline graphic (weighted by their coverage of the gene-tree branch; Ogilvie et al. 2017).

In this article, we develop tools that allow the MSC to be applied to large data sets using complex models of evolution. Although we focus on MSC models, we anticipate that in the future other models of the form expressed in Eq. (1) will be developed, for example, models that allow some lateral gene transfer and therefore allow some gene-tree branches to cross species boundaries in the species tree. We design a number of MCMC operators which generate proposals that explore the state space more efficiently—using a Gibbs sampler for population sizes, a combination of Bactrian (Yang and Rodríguez 2013; Thawornwattana et al. 2018) and adaptable variance multivariate normal (Baele et al. 2017) proposal kernels, a parallel operator for sampling gene trees and substitution model parameters, and an MCMC operator which selects other operators based on their exploration efficiency (Douglas et al. 2021b). Moreover, in the special case of the multispecies relaxed clock model (Ogilvie et al. 2017), we introduce methods for operating on the species tree, the gene trees, and the clock model simultaneously (Zhang and Drummond 2020; Douglas et al. 2021b).

Effective Population Size Gibbs Operator

The StarBeast2 (Ogilvie et al. 2017) and DISSECT (Jones et al. 2015) packages can integrate out effective population sizes Inline graphic when using an inverse gamma distributed prior on Inline graphic, based on a technique introduced by Liu et al. (2008) and detailed by Jones (2017). This approach greatly reduces the state space. However, as a consequence, the posterior (Eq. 1) can no longer be decomposed into a product of components over individual gene trees:

graphic file with name Equation3.gif (3)

Thus, the technique is not suitable for gene-tree operator parallelization, and therefore, we estimate Inline graphic instead.

Suppose that Inline graphic, for species-tree branch Inline graphic, follows an inverse gamma prior distribution Inv-Inline graphic, where the shape Inline graphic is fixed at 2 and therefore the scale Inline graphic is the expected value (because Inline graphic). Following the results of Jones (2017), the posterior of Inline graphic follows an inverse gamma distribution Inv-Inline graphic, such that Inline graphic and Inline graphic where Inline graphic is the total number of coalescent events of all gene trees in branch Inline graphic and Inline graphic. Here, Inline graphic is the ploidy of gene Inline graphic, Inline graphic the size of the Inline graphicth coalescent interval for gene Inline graphic in branch Inline graphic, and Inline graphic the number of lineages of gene tree Inline graphic at the tip-side of branch Inline graphic (so that Inline graphic is the number of lineages at the start of the Inline graphicth coalescent interval for Inline graphic).

Instead of integrating Inline graphic, our Inline graphic operator samples from the posterior. All Inline graphic elements in Inline graphic are proposed simultaneously. As demonstrated later, this turns out to be more efficient than standard Inline graphic random walk operators, with the added advantage of sampling effective population sizes (which may be a parameter of interest), as well as the ability to parallelize gene-tree proposals. The same technique could also be applied to implementations that do integrate this term out, in order to periodically sample and log Inline graphic.
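
The conjugate update described above can be sketched in a few lines of Python. This is a minimal illustration and not the StarBeast3 implementation: the function names, the tuple layout for each gene, and the variable names `q` and `gamma` are assumptions made here, and the update rule (shape plus number of coalescent events, scale plus the ploidy-scaled sum of interval length times number of lineage pairs) is our reading of the inverse gamma result attributed to Jones (2017).

```python
import random
from math import comb


def posterior_inv_gamma(alpha, beta, genes):
    """Return the (shape, scale) of the inverse gamma posterior for one branch.

    genes: iterable of (ploidy, n_tip, intervals, n_coalescences), where
    n_tip is the lineage count at the tip-side of the branch and
    intervals[k] is the duration of the k-th coalescent interval.
    """
    q = 0        # total coalescent events in this branch, over all genes
    gamma = 0.0  # ploidy-scaled sum of (interval length x lineage pairs)
    for ploidy, n_tip, intervals, n_coalescences in genes:
        q += n_coalescences
        # The k-th interval starts with n_tip - k lineages.
        gamma += sum(c * comb(n_tip - k, 2)
                     for k, c in enumerate(intervals)) / ploidy
    return alpha + q, beta + gamma


def sample_population_size(alpha, beta, genes, rng=random):
    """Gibbs step: draw a population size directly from its posterior."""
    shape, scale = posterior_inv_gamma(alpha, beta, genes)
    # If X ~ Gamma(shape, scale=1/b) then 1/X ~ Inv-Gamma(shape, scale=b).
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)
```

Because the draw comes straight from the conditional posterior, it is always accepted, which is what distinguishes this step from a random walk on the same parameter.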

Bactrian Operators for Trees

The step size of a proposal kernel should be such that the proposed state Inline graphic is sufficiently far from the current state Inline graphic to explore vast areas of parameter space, but not so far that the proposal is rejected too often (Gelman et al. 1997). The Bactrian distribution (Yang and Rodríguez 2013; Thawornwattana et al. 2018) has minimal probability mass around the center, and a higher concentration flanking the center, akin to the humps of a Bactrian camel (Fig. 2; left). This distribution is a preferred alternative to standard uniform- or normal-distributed random walk kernels, as it places minimal probability on step sizes that are too large or too small, and has successfully improved phylogenetic inference in previous studies (Yang and Rodríguez 2013; Zhang and Drummond 2020; Douglas et al. 2021b).
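
As a rough illustration of the kernel shape described above, the following Python sketch draws from a Bactrian distribution built as an equal mixture of two normal distributions centred at -m and +m. The function names are hypothetical, and the default m = 0.95 is an assumption chosen for illustration rather than a value taken from this article.

```python
import math
import random


def bactrian_sample(rng, m=0.95):
    """Draw from a zero-mean, unit-variance Bactrian kernel.

    The kernel is an equal mixture of N(-m, 1 - m^2) and N(+m, 1 - m^2),
    so little probability mass sits near zero.
    """
    centre = m if rng.random() < 0.5 else -m
    return rng.gauss(centre, math.sqrt(1.0 - m * m))


def bactrian_random_walk(x, step_size, rng, m=0.95):
    """Propose x' = x + step_size * b. The kernel is symmetric, so the
    Hastings ratio of this move is 1."""
    return x + step_size * bactrian_sample(rng, m)
```

The mixture has variance m^2 + (1 - m^2) = 1, so `step_size` plays the same role as the standard deviation of a normal random walk and can be tuned in the same way.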

Figure 2.

Depiction of random walks Inline graphic under varying proposal kernels. Left: The random walk occurs from the origin between the two modes, where the vertical axis shows the probability density function of the kernel Inline graphic (Yang and Rodríguez 2013). Right: A 2D random walk on inversely correlated parameters Inline graphic with different domains (Baele et al. 2017). Contours describe the joint probability density function Inline graphic under a transformed multivariate normal distribution learned during MCMC.

In this article, we apply Bactrian proposals to trees. The standard set of tree node height proposals in BEAST 2 consists of a Inline graphic operator which embarks all nodes in the tree on a random walk (in log-space), a Inline graphic operator which does so for only the root of a tree, an Inline graphic operator which changes species/gene node heights and various continuous parameters simultaneously (Drummond et al. 2002), a Inline graphic operator which slides a node up or down branches (Hohna et al. 2008), and constant distance operators when a relaxed clock model is applied (Zhang and Drummond 2020). Each operator would normally draw a random variable from a uniform distribution, but here we instead use a Bactrian distribution and apply appropriate transformations. We also introduce the Inline graphic operator, which transforms parameters with lower- and upper-bounds (such as tree node heights) by applying a Bactrian random walk in their real-space transformations.
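
A bounded parameter can be moved with a Bactrian random walk by first mapping it onto the real line. The sketch below uses a logit transform of the scaled value; the transform choice, the function names, and the default m = 0.95 are assumptions of this example and may differ in detail from the operator in the software.

```python
import math
import random


def bactrian_sample(rng, m=0.95):
    """Equal mixture of N(-m, 1 - m^2) and N(+m, 1 - m^2)."""
    centre = m if rng.random() < 0.5 else -m
    return rng.gauss(centre, math.sqrt(1.0 - m * m))


def interval_proposal(x, lower, upper, step_size, rng):
    """Return (x_new, log_hastings_ratio) for lower < x < upper."""
    # Map x to the real line: y = logit((x - lower) / (upper - lower)).
    y = math.log((x - lower) / (upper - x))
    y_new = y + step_size * bactrian_sample(rng)
    # Map back; the result always lies strictly between the bounds
    # (up to floating-point rounding for extreme y_new).
    x_new = lower + (upper - lower) / (1.0 + math.exp(-y_new))
    # The walk is symmetric in y, so only the Jacobian of the transform
    # contributes: dx/dy = (x - lower) * (upper - x) / (upper - lower).
    log_hr = (math.log((x_new - lower) * (upper - x_new))
              - math.log((x - lower) * (upper - x)))
    return x_new, log_hr
```

For a tree node, `lower` and `upper` would be the height of the oldest child and the height of the parent, so the proposal can never produce a negative branch length.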

Adaptive Variance Multivariate Normal Operator

An adaptive variance multivariate normal (AVMN) operator (Baele et al. 2017) provides proposals for a set of real-space parameters by learning the posterior throughout the run of the MCMC algorithm and approximating it as a multivariate normal distribution to capture correlations between parameters (Fig. 2; right). The space spanned by such a set of continuous parameters may need to be transformed (in order to satisfy the assumption that all parameters lie in real-space), by applying a log-transformation to parameters with positive domains (such as substitution rates), or a log-constrained sum transformation to multivariate parameters with unit sums (such as nucleotide frequencies), for instance. AVMN has been demonstrated to be more efficient in estimating phylogenetic parameters than standard random walk or scale operators (Baele et al. 2017; Bouckaert 2020; Douglas et al. 2021b).

Consider a single gene tree Inline graphic and its substitution model Inline graphic, consisting of substitution rates and nucleotide frequencies for instance. Performing a single proposal for any single parameter would require a full recalculation of the tree likelihood Inline graphic (see the peeling algorithm of Felsenstein 1981). Therefore, proposing all site model parameters Inline graphic simultaneously can reduce the number of likelihood calculations required and thus lower the computational runtime.
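
The two ingredients of an AVMN-style move, learning a covariance matrix from past samples and proposing all parameters jointly in transformed space, can be sketched as follows. This is a simplified illustration that handles only positive parameters through a log transform; the class and function names are hypothetical, and details of the published operator such as the adaptation schedule and the constrained-sum transform are left out.

```python
import math
import random


class RunningCovariance:
    """Welford-style running mean and covariance of real-space samples."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, y):
        self.n += 1
        before = [yi - mi for yi, mi in zip(y, self.mean)]
        self.mean = [mi + d / self.n for mi, d in zip(self.mean, before)]
        after = [yi - mi for yi, mi in zip(y, self.mean)]
        for i, bi in enumerate(before):
            for j, aj in enumerate(after):
                self.m2[i][j] += bi * aj

    def covariance(self):
        return [[v / (self.n - 1) for v in row] for row in self.m2]


def cholesky(a):
    """Lower-triangular factor L of a positive definite matrix, a = L L^T."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def avmn_propose(x, cov, scale, rng):
    """Jointly propose positive parameters x; returns (x_new, log_hastings)."""
    y = [math.log(v) for v in x]              # log transform to real space
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in y]
    y_new = [yi + scale * sum(low[i][k] * z[k] for k in range(i + 1))
             for i, yi in enumerate(y)]
    x_new = [math.exp(v) for v in y_new]
    log_hr = sum(y_new) - sum(y)              # Jacobian of the log transform
    return x_new, log_hr
```

Proposing every site model parameter in one move means the tree likelihood is recomputed once per joint proposal rather than once per parameter.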

Parallel Gene-Tree Operator

During MCMC, operators are typically sampled proportionally to fixed weights (or proposal probabilities), to ensure the chain is ergodic. Here, we present an alternative method, where a single gene tree Inline graphic and its substitution model Inline graphic is selected, and Inline graphic operators are sequentially sampled and applied to Inline graphic and Inline graphic, before returning to the full parameter space. This is equivalent to running a small MCMC chain of Inline graphic steps—applying only gene tree and substitution model operators on Inline graphic and Inline graphic—and then accepting the resulting Inline graphic and Inline graphic afterwards with probability 1, as if it were a single Gibbs sampling operation (Geman and Geman 1984).

Observe that because only Inline graphic and its associated parameters change, part of Eq. (1) can be rewritten as:

graphic file with name Equation4.gif (4)

Thus, the posterior distribution can be decomposed into the product of contributions of individual gene trees and their substitution models. Assuming that substitution model parameters Inline graphic are distinct for each gene tree Inline graphic, an Inline graphic-step MCMC chain could be run for each of Inline graphic and Inline graphic for Inline graphic in parallel, and the resulting Inline graphic and Inline graphic each accepted with probability 1, as if two Gibbs operators were sequentially applied. Because the posterior density for Inline graphic is proportional to Inline graphic and that of Inline graphic proportional to Inline graphic then provided that any shared parameters (such as Inline graphic, Inline graphic, and Inline graphic) are not being operated on, these two Inline graphic-step MCMC chains can run in parallel.

Where there are Inline graphic threads available, the Inline graphic gene trees are split into Inline graphic groups (assuming Inline graphic). The Inline graphic sets of Inline graphic-step MCMC chains are run in parallel and the resulting gene trees Inline graphic are accepted into the main MCMC chain. Here, we introduce a parallel operator Inline graphic. This operator partitions gene trees into Inline graphic threads and operates on their topologies, node heights, and substitution models. Tree node height proposals employ the Bactrian kernel where applicable (Fig. 2), and substitution model proposals invoke the AVMN kernel (Fig. 2). The chain length Inline graphic of each thread is learned during MCMC (Fig. 3).
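
The partition-and-run pattern described above can be sketched with a toy model in which each gene tree and its site model are collapsed into a single number. Everything in this block is hypothetical scaffolding: the scalar state, the Gaussian random walk, the seeding scheme, and the function names are assumptions for illustration only. Python threads also do not give true CPU parallelism for pure-Python work, so this shows the control flow rather than the speed-up.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def partition(n_genes, n_threads):
    """Split gene indices 0..n_genes-1 into n_threads near-equal groups."""
    return [list(range(t, n_genes, n_threads)) for t in range(n_threads)]


def short_chain(state, log_target, n_steps, rng):
    """Run a small Metropolis chain on one gene and return its end state."""
    x, lp = state, log_target(state)
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, 0.5)
        lp_new = log_target(x_new)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = x_new, lp_new
    return x


def parallel_gene_operator(states, log_targets, n_steps, n_threads, seed):
    """Run one short chain per gene, one group of genes per thread.

    Shared parameters are held fixed inside log_targets, so the chains
    do not interact and their end states can all be accepted.
    """
    def run_group(group):
        out = {}
        for g in group:
            rng = random.Random(seed * 1_000_003 + g)  # one stream per gene
            out[g] = short_chain(states[g], log_targets[g], n_steps, rng)
        return out

    new_states = list(states)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for result in pool.map(run_group, partition(len(states), n_threads)):
            for g, x in result.items():
                new_states[g] = x
    return new_states  # accepted into the main chain with probability 1
```

Seeding one random stream per gene makes the result independent of how genes are assigned to threads, which is convenient for testing but is a design choice of this sketch.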

Figure 3.

Optimization of gene-tree parallel operator chain lengths. Top: The time limit of each parallel MCMC chain is randomized on each call so that the overhead (intercept) and time-per-proposal (slope) can be learned as a linear regression model. Bottom: The linear regression model is applied, and parallel MCMC chain lengths are set such that the slowest thread attains the user-specified target overhead (i.e., the bottom thread has attained 20% overhead in the example above).
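
The chain-length tuning in Figure 3 can be illustrated with an ordinary least-squares fit. The sketch assumes that the runtime of one call is intercept + slope × chain length and that "overhead" means the fraction intercept / (intercept + slope × chain length); both the definition and the function names are assumptions of this example rather than details given in the text.

```python
def fit_line(chain_lengths, runtimes):
    """Least-squares fit of runtime = intercept + slope * chain_length."""
    n = len(chain_lengths)
    mean_x = sum(chain_lengths) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in chain_lengths)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(chain_lengths, runtimes))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def chain_length_for_overhead(intercept, slope, target_overhead):
    """Chain length at which the fixed cost is target_overhead of the total.

    Solves intercept / (intercept + slope * m) = target_overhead for m.
    """
    m = intercept * (1.0 - target_overhead) / (slope * target_overhead)
    return max(1, round(m))
```

With a fixed cost of 2 time units and 0.5 units per proposal, a 20% overhead target gives a chain length of 16, since 2 / (2 + 0.5 × 16) = 0.2. In the scheme described by the caption, this calculation would be driven by the slowest thread.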

Since each small MCMC chain for a thread can be considered a single Gibbs proposal, for Inline graphic threads in principle Inline graphic steps should be added to the main chain. If the operator is selected just before logging a state, in principle some threads may need to be disregarded before logging in order to maintain exactly equal intervals in the trace log. Due to the low frequency at which the operator is selected, and the logging intervals being orders of magnitude larger than the number of threads, this does not appear to be a problem in practice.

Species Tree Relaxed Clock Model Operators

The constant distance operator family exploits the negative correlations between divergence times and branch substitution rates by proposing both terms simultaneously (Zhang and Drummond 2020). This technique has yielded a parameter convergence rate of one to two orders of magnitude faster, particularly for large data sets that come with peaked posterior distributions (Douglas et al. 2021b). Under the MSC relaxed clock model used by StarBeast2, the branch rate of gene-tree branch Inline graphic is the length-weighted branch rate Inline graphic of all species-tree branches that contain Inline graphic (Ogilvie et al. 2017). Moreover, effective population sizes Inline graphic are positively correlated with divergence times, so this correlation could also be readily exploited.

Extending the work of Zhang and Drummond (2020), we introduce the Inline graphic operator. This operator proposes a node height Inline graphic for species-tree internal node Inline graphic, the three branch rates (elements of Inline graphic) and population sizes (elements of Inline graphic) incident to Inline graphic, and heights for all gene-tree non-leaf nodes that are contained within these three incident branches (Fig. 4). Inline graphic is embarked on a Bactrian random walk (Yang and Rodríguez 2013) to give Inline graphic, then Inline graphic and the node heights in Inline graphic are proposed such that all genetic distances are conserved following the change in Inline graphic, and Inline graphic is proposed such that the positive correlation between itself and the branch lengths incident to Inline graphic is respected (see Algorithm S1).
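
The compensation applied to the three incident branches can be sketched as follows. The sketch covers only the species-tree side of the move: it rescales each rate so that rate × length is unchanged and each population size so that the ratio of length to population size is unchanged. The dictionary layout and function name are hypothetical, and the gene-tree node heights and the Hastings ratio handled by Algorithm S1 are deliberately left out.

```python
def constant_distance_proposal(t_new, heights, rates, pop_sizes):
    """Rescale rates and population sizes after moving one species node.

    heights:   {'node', 'parent', 'left', 'right'} -> node height
    rates:     {'node', 'left', 'right'} -> rate of the branch above that node
    pop_sizes: {'node', 'left', 'right'} -> population size of that branch
    """
    t, t_parent = heights["node"], heights["parent"]
    if not max(heights["left"], heights["right"]) < t_new < t_parent:
        raise ValueError("new height must lie between children and parent")
    old_len = {"node": t_parent - t,
               "left": t - heights["left"],
               "right": t - heights["right"]}
    new_len = {"node": t_parent - t_new,
               "left": t_new - heights["left"],
               "right": t_new - heights["right"]}
    # Keep the genetic distance (rate x length) of every branch constant.
    new_rates = {b: rates[b] * old_len[b] / new_len[b] for b in old_len}
    # Keep the ratio (length / population size) of every branch constant.
    new_pops = {b: pop_sizes[b] * new_len[b] / old_len[b] for b in old_len}
    return new_rates, new_pops
```

Raising the node height lengthens the two child branches and shortens the parent branch, so the child rates fall and child population sizes rise while the opposite happens on the parent branch.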

Figure 4.

An example of an Inline graphic proposal, acting on species node Inline graphic and its two children Inline graphic and Inline graphic. First, the height of Inline graphic (Inline graphic) is increased to Inline graphic. Then, the relative substitution rates of branches Inline graphic (Inline graphic) and Inline graphic (Inline graphic) are decreased to Inline graphic and Inline graphic, and Inline graphic is increased to Inline graphic. These compensations in branch length ensure that the genetic distance of each branch (Inline graphic, Inline graphic, and Inline graphic) is maintained. The thicknesses of the species node lines are proportional to these substitution rates. Finally, the effective population sizes of Inline graphic and Inline graphic are increased to Inline graphic and Inline graphic, while that of Inline graphic is decreased to Inline graphic. These compensations in node height ensure that the ratio between branch length and branch population size is maintained. Species node widths are proportional to their effective population size. During this operation, gene-tree nodes always remain constrained by the species tree. The figure was generated using UglyTrees (Douglas 2020).

Previously, we introduced the narrow exchange rate (Inline graphic) operator (Douglas et al. 2021b). This operator combines the simple Inline graphic operator (i.e., a proposal which swaps a subtree with its uncle subtree; Drummond et al. 2002) with the Inline graphic operator (Zhang and Drummond 2020), by applying a small topological change to the tree and then recomputing branch substitution rates such that evolutionary distances are preserved. We demonstrated that this operator aided the traversal of tree topology space more on longer alignments than on shorter ones.

Here, we combine this work with the Inline graphic operator implemented by Ogilvie et al. (2017), which is based on work by Jones (2017) and Rannala and Yang (2017), and introduce the coordinated narrow exchange rate (Inline graphic) operator. This operator exchanges a species-tree node with its uncle node, adjusts gene-tree topologies Inline graphic to preserve compatibility with Inline graphic, and proposes three nearby branch rates in Inline graphic to preserve genetic distances (Algorithm S2).

Adaptive Operator Weighting

Previously, we developed the Inline graphic Inline graphic operator (Douglas et al. 2021b). This operator learns the weights (or proposal probabilities) behind a set of suboperators during MCMC, by rewarding operators which bring about large changes to parameter Inline graphic in short computational runtime, with respect to some distance function: Euclidean distance when Inline graphic is real, and RNNI distance (Collienne and Gavryushkin 2021) when Inline graphic is a tree topology. This approach can account for the scenario where an operator’s performance is conditional on the data set. When a data set contains very little signal with respect to a certain parameter Inline graphic and its prior distribution, then resampling that parameter from its prior distribution using the Inline graphic operator may be more efficient than embarking Inline graphic on a random walk, for instance (Douglas et al. 2021b). In contrast, data sets with more signal are likely to favor smarter operators which account for correlations in the posterior distribution, such as the constant distance or Inline graphic operators (Zhang and Drummond 2020; Douglas et al. 2021b).
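
The idea of rewarding operators for large moves in little time can be sketched as below. The scoring rule used here (accumulated squared Euclidean distance of accepted moves divided by accumulated runtime) and the small `floor` probability that keeps every operator selectable are assumptions of this sketch; the class and method names are hypothetical and the published operator may differ in its details.

```python
import random


class AdaptiveSampler:
    """Choose among suboperators in proportion to distance moved per second."""

    def __init__(self, names, floor=0.01):
        self.names = list(names)
        self.floor = floor  # probability mass shared equally by all operators
        self.sq_dist = {n: 0.0 for n in self.names}
        self.runtime = {n: 0.0 for n in self.names}

    def record(self, name, before, after, accepted, seconds):
        """Log one proposal; rejected proposals cost time but move nothing."""
        self.runtime[name] += seconds
        if accepted:
            self.sq_dist[name] += sum((a - b) ** 2
                                      for a, b in zip(after, before))

    def probabilities(self):
        scores = {n: (self.sq_dist[n] / self.runtime[n]
                      if self.runtime[n] > 0 else 0.0)
                  for n in self.names}
        total = sum(scores.values())
        k = len(self.names)
        if total == 0.0:
            return {n: 1.0 / k for n in self.names}
        return {n: self.floor / k + (1.0 - self.floor) * scores[n] / total
                for n in self.names}

    def choose(self, rng):
        probs = self.probabilities()
        weights = [probs[n] for n in self.names]
        return rng.choices(self.names, weights=weights)[0]
```

For a tree topology the squared Euclidean distance would be replaced by a tree distance such as RNNI, which is not implemented here.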

Here, we have applied the Inline graphic Inline graphic operator to seven areas of parameter space: the species and gene-tree node heights (Inline graphic and Inline graphic), the relaxed clock model rates Inline graphic and standard deviation Inline graphic, the mean effective population size Inline graphic, the species-tree birth rate Inline graphic (assuming a Yule speciation model; Yule 1925), and the species-tree topology Inline graphic. These operator schemes are detailed in Tables 1 and 2.

Table 1.

StarBeast3 operator scheme, assuming a Yule tree prior on the species tree with birth rate Inline graphic

Operator | Weight | Reference
Species tree
Inline graphic | 30 | Ogilvie et al. (2017)
Inline graphic | 30 | Ogilvie et al. (2017), Jones (2017)
Inline graphic | 15 | Ogilvie et al. (2017), Jones (2017)
Inline graphic | 15 | Hohna et al. (2008)
Inline graphic | 15 | Drummond et al. (2002)
Inline graphic | 15 | Drummond et al. (2002)
Inline graphic | 15 |
Inline graphic | | Drummond et al. (2002)
Inline graphic | | Ogilvie et al. (2017)
Inline graphic | | Douglas et al. (2021b)
Inline graphic | | Species Tree Relaxed Clock Model Operators
Inline graphic | 3 |
Inline graphic | 3 |
Inline graphic | 3 | Bactrian Operators for Trees
Inline graphic | 100 |
Inline graphic | |
Inline graphic | |
Inline graphic | | Bactrian Operators for Trees
Inline graphic | | Species Tree Relaxed Clock Model Operators
Inline graphic | | Ogilvie et al. (2017), Jones (2017)
Inline graphic | | Ogilvie et al. (2017), Jones (2017)
Inline graphic | | Drummond et al. (2002)
Gene trees/site models
Inline graphic | 3.42 | Table 2
Tree hyperparameters
Inline graphic | 50 | Effective Population Size Gibbs Operator
Inline graphic | 5 |
Inline graphic | |
Inline graphic | | Bouckaert et al. (2019)
Inline graphic | | Douglas et al. (2021b)
Inline graphic | 5 |
Inline graphic | |
Inline graphic | | Bouckaert et al. (2019)
Inline graphic | | Douglas et al. (2021b)
Relaxed clock model
Inline graphic | 30 |
Inline graphic | |
Inline graphic | | Species Tree Relaxed Clock Model Operators
Inline graphic | | Douglas et al. (2021b)
Inline graphic | 5 |
Inline graphic | |
Inline graphic | | Douglas et al. (2021b)

Notes: The Inline graphic operator weight was set such that it is sampled 1% of the time. Further operator details can be found in Drummond and Bouckaert (2015).

Inline graphic Bactrian kernel applied to random walk (Yang and Rodríguez 2013).

Table 2.

StarBeast3 parallel operator scheme for gene trees and their associated site models (assumed to be an HKY model with transition–transversion ratio Inline graphic and nucleotide frequencies Inline graphic)

Operator | Weight | Reference
Inline graphic | | Parallel Gene-Tree Operator
Gene trees
Inline graphic | 15 | Drummond et al. (2002)
Inline graphic | 15 | Drummond et al. (2002)
Inline graphic | 15 | Drummond et al. (2002)
Inline graphic | 10 | Hohna et al. (2008)
Inline graphic | 30 |
Inline graphic | 10 |
Inline graphic | 10 | Bactrian Operators for Trees
Inline graphic | 100 |
Inline graphic | |
Inline graphic | |
Inline graphic | | Hohna et al. (2008)
Inline graphic | | Bouckaert (2021)
Site models
Inline graphic Inline graphic | 5 | Baele et al. (2017)
Inline graphic Inline graphic | 0.5 |
Inline graphic | 0.5 |
Inline graphic | 0.5 |

Notes: Each operator is applicable to a single gene tree Inline graphic or its site model Inline graphic. Inline graphic generates proposals for the site model and complete set of tree node heights simultaneously. Operator weights are normalized into proposal probabilities within a single MCMC chain called by Inline graphic. Further operator details can be found in Drummond and Bouckaert (2015).

Inline graphic Bactrian kernel applied to random walk (Yang and Rodríguez 2013).

Results

In this section, we first validate the correctness of StarBeast3 through a well-calibrated simulation study. Then, we demonstrate that StarBeast3 is efficient at doing Bayesian inference on large data sets compared with StarBeast2. We did not compare to *BEAST directly, since it does not provide relaxed clock models on species trees, but note that Ogilvie et al. (2017) benchmarked StarBeast2 against *BEAST for strict clocks and found StarBeast2 to be an order of magnitude faster than *BEAST, so any gain over StarBeast2 implies an even larger gain over *BEAST.

Validation

In order to validate the correctness of StarBeast3, we performed two well-calibrated simulation studies. These were achieved by simulating nucleotide alignments (of two varying sizes) using parameters directly sampled from the prior distribution, and then recovering the posterior estimates of these parameters by doing Bayesian inference on the simulated alignments using StarBeast3. For each study, the 95%-coverage of each parameter was approximately 95% (meaning that the true parameter value was within the 95% highest posterior density interval approximately 95% of the time). Therefore, these experiments provide confidence in StarBeast3’s correctness and are presented in Figure 5 and Section S4 of Supplementary material available on Dryad at http://dx.doi.org/10.5061/dryad.f1vhhmgzk.
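
The coverage statistic used in such a study is simple to compute, and a binomial calculation indicates how far from 95% an observed coverage can drift by chance. The sketch below is illustrative only; the function names are hypothetical and the binomial range is a standard calculation added here, not a result reported in the article.

```python
from math import comb


def coverage(true_values, hpd_intervals):
    """Fraction of simulations whose interval contains the true value."""
    hits = sum(lo <= v <= hi
               for v, (lo, hi) in zip(true_values, hpd_intervals))
    return hits / len(true_values)


def binomial_central_range(n, p=0.95, mass=0.95):
    """Range of hit counts left after trimming at most (1 - mass) / 2
    probability from each tail of a Binomial(n, p) distribution."""
    tail = (1.0 - mass) / 2.0
    pmf = [comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    lo, cum = 0, 0.0
    while cum + pmf[lo] <= tail:   # trim the lower tail
        cum += pmf[lo]
        lo += 1
    hi, cum = n, 0.0
    while cum + pmf[hi] <= tail:   # trim the upper tail
        cum += pmf[hi]
        hi -= 1
    return lo, hi
```

For 100 simulations, `binomial_central_range(100)` returns the pair (90, 99), so a correctly implemented sampler would be expected to cover the true value in roughly 90 to 99 of the 100 replicates.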

Figure 5.

Well-calibrated simulation study analyzing Inline graphic species, Inline graphic taxa, and Inline graphic genes. One-hundred simulations were performed to recover the coverage between “true” simulated values and their estimates under the posterior distribution. 95% highest posterior density (HPD) intervals of parameters are represented by vertical lines. Each line represents a single simulation, and is colored blue when the true value was contained within the 95% interval, or red otherwise. The top of each plot shows the coverage of each parameter (i.e., the number of MCMC simulations for which the “true” parameter value was contained within the 95% HPD).

Performance Benchmarking

We evaluated the performance of StarBeast3 for its ability to achieve multispecies coalescent parameter convergence in a Bayesian framework, compared with that of StarBeast2. Although determining whether an MCMC chain has converged is a nontrivial problem, the effective sample size (ESS) can serve as a useful metric. Thus, we computed the number of effective samples generated per hour (ESS/h) across multiple replicates of MCMC, using three real and two simulated data sets (Table 3). The ESS of any parameter should be over 200 in order to estimate its posterior distribution (Tracer; Rambaut et al. 2018). To allow both software packages to perform at their best, effective population sizes were analytically integrated out by StarBeast2, but were estimated by StarBeast3. This section provides a general comparison of StarBeast3 and StarBeast2; the performances of individual operators can be found in Sections S5 and S6 of Supplementary material available on Dryad.
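As a rough illustration of the ESS/h metric, the sketch below estimates the ESS of a trace from its autocorrelations and divides by the runtime. Several details are assumptions rather than the method used here: the autocorrelation sum is truncated at the first non-positive lag (Tracer's estimator may differ), a 20% burn-in is discarded, the full runtime is used as the denominator, and the function names are hypothetical.

```python
import random

def effective_sample_size(xs):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        raise ValueError("chain has zero variance")
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n // 2):
        rho = sum(centered[i] * centered[i + lag]
                  for i in range(n - lag)) / (n * var)
        if rho <= 0.0:  # simple truncation rule (an assumption)
            break
        tau += 2.0 * rho
    return n / tau

def ess_per_hour(trace, runtime_hours, burnin_fraction=0.2):
    """Discard the burn-in, then report effective samples per hour."""
    kept = trace[int(burnin_fraction * len(trace)):]
    return effective_sample_size(kept) / runtime_hours

# Two synthetic traces: independent draws and a sticky AR(1) chain.
rng = random.Random(7)
iid_trace = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ar_trace = [0.0]
for _ in range(1999):
    ar_trace.append(0.9 * ar_trace[-1] + rng.gauss(0.0, 1.0))

iid_rate = ess_per_hour(iid_trace, runtime_hours=2.0)
ar_rate = ess_per_hour(ar_trace, runtime_hours=2.0)
print(f"iid: {iid_rate:.1f} ESS/h, autocorrelated: {ar_rate:.1f} ESS/h")
```

The autocorrelated trace yields far fewer effective samples per hour for the same chain length and runtime, which is the sense in which a poorly mixing parameter is "slow".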

Table 3.

Benchmark data sets

Data set No. of species Inline graphic No. of taxa Inline graphic No. of gene trees Inline graphic Time (h)
Frog (Barrow et al. 2014) 21 88 26 25–41
Skink (Bryson Jr et al. 2017) 10 59 50 30–54
Spider (Hamilton et al. 2016) 36 83 50 660–1100
Simulated (12) 4 12 100 24–100
Simulated (48) 16 48 100 440–950

Notes: Fifty gene trees were subsampled from the Skink and Spider data sets. The simulated data sets were directly sampled from the model specification used during Bayesian inference (described in Section S3 of Supplementary material available on Dryad). In the final column, we estimate the time required for the MCMC chain to converge using StarBeast3 with 16 threads (min–max across 5 MCMC replicates). These times were approximated as the time to achieve an effective sample size of 200 for the posterior density Inline graphic, with a 20% burn-in.

The ESS/h was evaluated in five distinct areas of parameter space. First, we considered generic summaries of convergence: the ESS/h of the posterior density Inline graphic, the likelihood Inline graphic, and the prior density Inline graphic. Second, species tree Inline graphic convergence was evaluated in terms of its height Inline graphic, its length Inline graphic, and hyperparameters Inline graphic—the Yule model birth rate (Yule 1925)—and Inline graphic—the mean effective population size. In the case of StarBeast3, where effective population sizes are estimated, we also measured the mean ESS/h associated with species-tree leaf nodes of Inline graphic. Third, gene-tree convergences were evaluated by their heights Inline graphic, their lengths Inline graphic, and the RNNI distances (Collienne and Gavryushkin 2021) to their UPGMA Inline graphic (Sokal 1958) and neighbor-joining Inline graphic trees (Saitou and Nei 1987). As there are multiple gene trees, we only considered the mean ESS/h of each term. Fourth, substitution model convergence (HKY substitution model; Hasegawa et al. 1985) was measured from the transition–transversion ratio Inline graphic, nucleotide frequencies Inline graphic, and gene-tree substitution rates Inline graphic, where the ESS/h of each term was averaged across all Inline graphic substitution models. Lastly, relaxed clock model convergence was evaluated by considering the mixing of branch rate empirical mean Inline graphic and variance Inline graphic, as well as the relaxed clock standard deviation parameter Inline graphic.

These results showed that, depending on the data set, the “slowest” parameter generally converged considerably faster for StarBeast3 than it did for StarBeast2 (see the min term in Figs. 6 and 7). On the smallest data set considered (Frog), StarBeast2 and StarBeast3 performed comparably well overall (with no significant difference in min). However, StarBeast3 performed better on all of the other data sets, with the “slowest” parameter converging between 4 and 37× as fast, and the posterior density Inline graphic converging between 2 and 36× as fast, often at a statistically significant level. For StarBeast3, the absolute time needed to converge varied considerably across the data sets, and even across multiple replicates of the same data set (see final column of Table 3). The fastest data sets, Frog and Simulated (12), required roughly 1–4 days to converge (24–100 h), while the Spider data set required about a month or more (660–1100 h).

Figure 6.

Performance benchmarking the two simulated data sets. Each point is the geometric-mean ESS/h across five replicates, for either StarBeast2, or StarBeast3 with 16 threads. The geometric-mean relative performance of StarBeast3, compared with StarBeast2, is indicated above each term, and a * is present if the difference across five replicates is significant according to a Student’s t-test. Note that the y-axis is in log-space.

Figure 7.

Performance benchmarking the two biological data sets. See Figure 6 caption for figure notation.
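The summary statistics named in the captions of Figures 6 and 7 can be sketched as follows. The ESS/h values are made up for illustration, and two details are assumptions because the captions do not specify them: the t-test is taken to be an unpaired, pooled-variance Student's t-test, and it is applied to log-transformed ESS/h.

```python
import math
from statistics import geometric_mean, mean, variance

def relative_performance(new, old):
    """Ratio of geometric-mean ESS/h (new method over old method)."""
    return geometric_mean(new) / geometric_mean(old)

def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical ESS/h from five replicates of each program (not real data).
sb2 = [10.0, 12.0, 9.0, 11.0, 10.5]
sb3 = [95.0, 120.0, 88.0, 110.0, 101.0]

ratio = relative_performance(sb3, sb2)
t_stat, df = student_t([math.log(x) for x in sb3],
                       [math.log(x) for x in sb2])

# Two-sided 5% critical value of Student's t with 8 degrees of freedom.
T_CRIT_DF8 = 2.306
significant = abs(t_stat) > T_CRIT_DF8
print(f"relative performance: {ratio:.1f}x, t = {t_stat:.1f}, df = {df}")
```

Working with geometric means and log-transformed values matches the log-scaled y-axis of the figures, where a constant ratio between methods appears as a constant vertical offset.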

Notably, relaxed clock model parameters converged up to Inline graphic as fast under StarBeast3. We attribute this to the use of a real-space branch rate parameterization (where branch rates are real numbers as opposed to discrete bins, as implemented in StarBeast2) as well as constant distance operators, which adjust branch rates and divergence times simultaneously (Zhang and Drummond 2020; Douglas et al. 2021b). The disparity between StarBeast3 and StarBeast2 was less extreme for the smaller Inline graphic gene tree Frog data set (Barrow et al. 2014), consistent with previous experiments (Douglas et al. 2021b).

Substitution model parameters Inline graphic generally converged faster for StarBeast2 than they did for StarBeast3. Note, however, that this is by design. The total operator weight assigned to Inline graphic parameters was 50% smaller in StarBeast3, in order to ensure balanced convergence across all areas of parameter space. In all data sets considered, substitution models converged significantly faster than any other area of parameter space, despite receiving relatively little operator weight. Computational resources spent on the substitution model were therefore better spent in “slower” areas of parameter space, such as gene-tree node heights.
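To make the effect of reweighting concrete, the following sketch normalizes operator weights into proposal probabilities and then halves the weight of one group. The group names and weights are hypothetical and are not the values used by StarBeast2 or StarBeast3.

```python
import math

def proposal_probabilities(weights):
    """Normalize non-negative operator weights so that they sum to one."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical operator groups and weights (illustration only).
before_weights = {
    "substitution_model": 10.0,
    "gene_tree_node_heights": 30.0,
    "species_tree": 20.0,
}
# Halve the weight of the substitution model operators.
after_weights = dict(before_weights, substitution_model=5.0)

before = proposal_probabilities(before_weights)
after = proposal_probabilities(after_weights)
for name in before:
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```

Because the normalizing constant also changes, halving a weight does not exactly halve the corresponding probability; the freed probability mass is shared among the remaining operators in proportion to their weights.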

The Inline graphic operators (Table 1) confirmed the value of the Inline graphic and Inline graphic operators for operating on their respective areas of parameter space. The Inline graphic operator almost always outperformed other operators at proposing species node heights Inline graphic (Table 4). The exceptions were the Skink data set, for which the Inline graphic operator was superior at proposing branch lengths, and the Frog data set, for which Inline graphic, Inline graphic, and Inline graphic were all on a par. In general, very little operator weight was awarded to the Inline graphic, Inline graphic, Inline graphicInline graphic, and Inline graphic operators for their abilities to propose species node heights. Similarly, among Inline graphic variants evaluated by Inline graphic, the Inline graphic operator was marginally favored by all data sets (Table 5). This was due to the operator making larger or more frequent topological changes to the species tree, in less computational runtime, especially compared with Inline graphic and Inline graphic. Overall, this experiment reinforced the value of learning operator weights on a problem-by-problem basis. A full breakdown of the remaining four adaptive operators can be found in Section S6 of Supplementary material available on Dryad.

Table 4.

Learned weights of the suboperators of Inline graphic), averaged across five replicates

Data set Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Frog 0.06 0.0078 0.34a 9.8e-05 0.04 0.22 0.33
Simulated (12) 1.1e-05 0.00013 0.99a 2.6e-05 0.00043 8e-04 0.0061
Simulated (48) 4.3e-05 0.00055 0.98a 5.1e-06 0.0011 0.00029 0.016
Skink 0.008 0.0087 0.34 4.8e-05 0.013 0.04 0.59a
Spider 0.0019 0.0034 0.84a 1.5e-05 0.0063 0.0025 0.15

Notes: The operator which attained the highest proposal probability is indicated by Inline graphic.

Table 5.

Average species-tree RNNI distance between the states before and after each proposal, and operator runtime (shown as distance/runtime), for the suboperators of Inline graphic(Inline graphic), averaged across five replicates

Data set Inline graphic Inline graphic Inline graphic Inline graphic
Frog 0.0091/0.29 ms 0.0091/0.29 msInline graphic 0.0091/0.4 ms 0.0091/0.39 ms
Simulated (12) 0.003/0.091 ms 0.0032/0.094 msInline graphic 0.0028/0.19 ms 0.0028/0.19 ms
Simulated (48) 0.00043/0.77 ms 0.00043/0.62 msInline graphic 0.00047/1 ms 0.00047/0.83 ms
Skink 0.021/0.3 ms 0.021/0.3 msInline graphic 0.021/0.5 ms 0.021/0.48 ms
Spider 0.019/1.6 ms 0.019/1.2 msInline graphic 0.019/1.8 ms 0.019/1.3 ms

Notes: The timer starts at the beginning of the proposal and ends when the proposal has been accepted or rejected. NE = narrow exchange; NER = narrow exchange rates; CNE = coordinated narrow exchange; CNER = coordinated narrow exchange rates. The operator which was awarded the highest proposal probability for each data set is indicated by Inline graphic.

Lastly, we evaluated the effect of threading on StarBeast3, by comparing its performance under 1, 2, 4, 8, and 16 threads allotted to the Inline graphic gene-tree operator (Fig. 8). There was a positive-but-modest correlation between the number of threads and the overall rate of convergence among the terms considered, with an overall log-linear slope coefficient of 0.19. This can be interpreted as follows: across the range of threads and data sets considered, doubling the number of threads was associated with an increase in mixing by 14%. Multithreading provided the strongest boost for the Skink and Spider data sets and made little difference to the simulated data set (48 taxa). This is an unexpected result, because the Skink and Spider data sets have fewer genes (Inline graphic compared with Inline graphic), and may be due to the former data sets having more taxa and thus larger trees.
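The step from a slope of 0.19 to a 14% gain per doubling follows from the form of the fitted model: if ESS/h is proportional to (number of threads) raised to the power of the slope, then doubling the threads multiplies ESS/h by 2 raised to that power. The sketch below checks this arithmetic and recovers the slope from noise-free synthetic points; the function names and the synthetic data are illustrative assumptions, not the benchmark measurements.

```python
import math

def loglog_slope(threads, ess_per_hour):
    """Least-squares slope of log(ESS/h) against log(threads).

    The slope does not depend on the base of the logarithm.
    """
    xs = [math.log2(t) for t in threads]
    ys = [math.log2(e) for e in ess_per_hour]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def gain_per_doubling(slope):
    """Relative increase in ESS/h when the thread count doubles."""
    return 2.0 ** slope - 1.0

threads = [1, 2, 4, 8, 16]
synthetic = [t ** 0.19 for t in threads]  # noise-free points on the fitted line

slope = loglog_slope(threads, synthetic)
gain = gain_per_doubling(0.19)
print(f"recovered slope: {slope:.2f}, gain per doubling: {gain:.1%}")
```

Under the same fitted relationship, going from 1 to 16 threads corresponds to a factor of 16^0.19, or about 1.69.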

Figure 8.

Effect of threading on StarBeast3 performance. Each point represents the ESS/h of the posterior density Inline graphic (averaged across five replicates), for the indicated thread count and data set. These terms are normalized to enable comparison across data sets, by dividing each by the corresponding value for one thread. A linear model was fitted to the ESS/h and number of threads, each in Inline graphic space, and is reported at the top of the plot. The positive coefficient of the slope indicates that performance increased with the number of threads, across the range of threads considered. Parallel MCMC chain lengths were optimized using the adaptive scheme presented in Figure 3.

Benchmarking on Large Data Sets

We benchmarked the performance of StarBeast3 on simulated data sets with increasingly large numbers of gene trees Inline graphic, ranging from 250 to 1000 genes. Each gene was 200 nucleotides in length. In order to achieve convergence in a timely manner, we performed inference under a strict clock model (i.e., all branch rates fixed Inline graphic) and with a small sample size (Inline graphic species Inline graphic taxa). These experiments showed that StarBeast3 required more time to produce one sample for larger Inline graphic, and therefore more time to produce one effective sample, as expected (Fig. 9). The Inline graphic gene data set would require Inline graphic h for the average ESS to exceed 200 in all areas of parameter space, while the Inline graphic gene data set would require Inline graphic h. Furthermore, we confirmed that gene-tree parallelization gave a noticeable-but-modest improvement to runtime (Fig. 9). Although the trees were small, this experiment showed that StarBeast3 is indeed capable of running on large data sets with several hundred genes.

Figure 9.

Performance of StarBeast3, across varying gene-tree sizes Inline graphic and varying thread counts. Fifteen replicates of MCMC were run under each setting. Top: mean time taken to produce one effective sample (averaged across the ESSes of the following terms: Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic), Inline graphic se. Means and standard errors were computed in log space. Bottom: time required to produce one state in the MCMC chain.

Discussion

The Next Generation of Bayesian MCMC Operators

In recent years, Bayesian MCMC proposals have advanced significantly beyond the unidimensional random walk. The use of adaptive algorithms and advanced proposal kernels has become increasingly prevalent (Haario et al. 2001; Vihola 2012; Yang and Rodríguez 2013; Benson and Friel 2018). In phylogenetic inference, for instance, tree proposals have been guided by conditional clade probabilities and parsimony scores (Höhna and Drummond 2012; Zhang et al. 2020), and mirror kernels learn target distributions which act as “mirror images” (Thawornwattana et al. 2018).

Here, we introduced a range of recently developed MCMC operators to the MSC, including Bactrian proposal kernels (Yang and Rodríguez 2013), which have been successfully applied to bird phylogeny (Maliet et al. 2019), and tree “flex” operators (BICEPS; Bouckaert 2021), which have been applied to coronavirus disease-2019 genomic data (Douglas et al. 2021a). We also invoked a series of more meticulous operators which account for known correlations, such as the AVMN kernel (Baele et al. 2017), constant distance operators (Zhang and Drummond 2020), and the NER operator (Douglas et al. 2021b), as well as adaptive operators that improve over the course of MCMC, such as the adaptable operator sampler (Douglas et al. 2021b), parallel gene-tree operators, and the AVMN kernel (Baele et al. 2017). Indeed, these operators have yielded a software package which outperforms StarBeast2 by up to one-and-a-half orders of magnitude, depending on the data set and the parameter.

While StarBeast3 provides a clear advancement to the problem, Bayesian MCMC is still lagging behind the volumes of next-generation genomic data. Therefore, the continued development of efficient, meticulous, and adaptive MCMC operators is essential.

Efficient Parallelized Bayesian Inference under the MSC

As genomic data become increasingly available, concatenating genomic sequences and inferring the phylogeny of the species as that of the genes can become enticing. However, this approach makes for an inconsistent estimator of topology when divergence times are small (Pamilo and Nei 1988), and a biased estimator of species divergence times and substitution rates when ILS is present (Arbogast et al. 2002; Mendes and Hahn 2016; Ogilvie et al. 2016). MSC methods address these issues, but at the cost of demanding computational runtimes.

Therefore, as multithreading technologies become increasingly affordable, the appeal in parallelizing multispecies inference becomes clear. StarBeast3 exploits the assumption of conditional independence between gene trees, by doing Bayesian inference on gene trees in parallel, and therefore it scales with the size of the problem. StarBeast3 can handle large data sets (with hundreds of genes) and achieve convergence several times faster than its predecessors.
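The structure of this idea, though not its implementation, can be illustrated with a toy example. In the sketch below, each "gene" is a single real number whose conditional density depends only on a shared value (a stand-in for the species tree) and on its own datum, so a short Metropolis chain can be run on every gene independently. Everything here is a hypothetical stand-in: the target density, the function names, and the use of Python threads, which show the control flow only and do not speed up CPU-bound work in standard Python. StarBeast3 itself is a BEAST 2 package.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def log_density(x, shared, datum):
    """Toy conditional log density: x ~ N(shared, 1), datum ~ N(x, 1)."""
    return -0.5 * (x - shared) ** 2 - 0.5 * (datum - x) ** 2

def short_chain(job):
    """Run a short random-walk Metropolis chain for one gene."""
    x, shared, datum, seed, steps = job
    rng = random.Random(seed)  # private generator, so jobs share no state
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, 0.5)
        log_ratio = log_density(proposal, shared, datum) - log_density(x, shared, datum)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
    return x

def update_gene_states(states, shared, data, seeds, steps=50, threads=4):
    """Update all genes in parallel, holding the shared value fixed."""
    jobs = [(x, shared, d, s, steps) for x, d, s in zip(states, data, seeds)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(short_chain, jobs))

states = [0.0] * 8
data = [0.5 * i for i in range(8)]
seeds = list(range(8))

parallel = update_gene_states(states, 1.0, data, seeds)
serial = [short_chain((x, 1.0, d, s, 50)) for x, d, s in zip(states, data, seeds)]
print(parallel == serial)
```

Because the genes are conditionally independent given the shared value, the parallel and serial updates give identical results; only the wall-clock time can differ.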

A Balanced Traversal Through Parameter Space

All areas of parameter space should be explored approximately evenly during MCMC. If one area of parameter space is being explored more rapidly than another, then computational resources allotted to the former should be diverted to the latter. This is best exemplified by the phylogenetic substitution model which, despite requiring relatively little attention to converge, still requires full recalculation of the tree likelihood upon every proposal (Felsenstein 1981). Conversely, tree topologies often converge rather poorly and can require significant attention to be rescued from local optima. By fine-tuning our MCMC operator proposal probabilities, we have achieved a balanced traversal through all areas of the MSC parameter space. Although some parameters converge more slowly for StarBeast3 than they do for StarBeast2 (such as those in the substitution model), the slowest parameters converge significantly faster for the former, up to Inline graphic as fast (see the min term in Figs. 6 and 7).

For StarBeast3, we employed adaptable operators which are able to learn the proposal probabilities of other operators based on their ability to explore a single area of parameter space (Douglas et al. 2021b). However, there would be a great benefit in an adaptable operator scheme which learns and applies a balanced exploration across different areas of parameter space on a problem-by-problem basis.

Conclusion

Here we introduce StarBeast3, a software package for performing efficient Bayesian inference on genomic data under the MSC model. We verified StarBeast3’s correctness and benchmarked its performance against StarBeast2, which is itself an order of magnitude faster than its still-popular predecessor *BEAST. We showed that StarBeast3 is significantly faster than StarBeast2. Notably, relaxed clock parameters converged between 3 and 30× faster, and, most importantly, even the “slowest” parameters converged up to Inline graphic faster. Our adaptive operator scheme allows proposal probabilities to be learned on a problem-by-problem basis, making StarBeast3 suitable for a range of data sets. By estimating effective population sizes (instead of analytically integrating the term out), we were able to parallelize gene-tree proposals and demonstrated that doubling the number of allotted threads was associated with an increase in performance of around 14%. StarBeast3 is highly effective at performing fast Bayesian inference on large data sets with over 100 genes.

Software Availability

StarBeast3 is available as an open-source BEAST 2 package with an easy-to-use graphical user interface. Instructions for downloading and running StarBeast3 can be found at https://github.com/rbouckaert/starbeast3.

Supplementary Material

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.f1vhhmgzk.

Funding

This study was supported by a Marsden grant 18-UOA-096 from the Royal Society of New Zealand. Software packages were benchmarked using the New Zealand eScience Infrastructure (NeSI) cluster, funded by the New Zealand Ministry of Business, Innovation, and Employment.

References

  1. Arbogast B.S., Edwards S.V., Wakeley J., Beerli P., Slowinski J.B.. 2002. Estimating divergence times from molecular data on phylogenetic and population genetic timescales. Annu. Rev. Ecol. Syst. 33:707–740. [Google Scholar]
  2. Baele G., Lemey P., Rambaut A., Suchard M.A.. 2017. Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in beast. Bioinformatics 33:1798–1805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Ballesteros J.A., Sharma P.P.. 2019. A critical appraisal of the placement of xiphosura (chelicerata) with account of known sources of phylogenetic error. Syst. Biol. 68:896–917. [DOI] [PubMed] [Google Scholar]
  4. Barrow L.N., Ralicki H.F., Emme S.A., Lemmon E.M.. 2014. Species tree estimation of North American chorus frogs (hylidae: Pseudacris) with parallel tagged amplicon sequencing. Mol. Phylogenet. Evol. 75:78–90. [DOI] [PubMed] [Google Scholar]
  5. Belfiore N.M., Liu L., Moritz C.. 2008. Multilocus phylogenetics of a rapid radiation in the genus Thomomys (Rodentia: Geomyidae). Syst. Biol. 57:294–310. [DOI] [PubMed] [Google Scholar]
  6. Benson A., Friel N.. 2018. Adaptive MCMC for multiple changepoint analysis with applications to large datasets. Electron. J. Stat. 12:3365–3396. [Google Scholar]
  7. Bickford D., Lohman D.J., Sodhi N.S., Ng P.K., Meier R., Winker K., Ingram K.K., Das I.. 2007. Cryptic species as a window on diversity and conservation. Trends Ecol. & Evol. 22:148–155. [DOI] [PubMed] [Google Scholar]
  8. Blom M., Bragg J.G., Potter S., Moritz C.. 2016a. Accounting for uncertainty in gene tree estimation: summary-coalescent species tree inference in a challenging radiation of Australian lizards. Syst. Biol. 66:352–366. [DOI] [PubMed] [Google Scholar]
  9. Blom M., Horner P., Moritz C.. 2016b. Convergence across a continent: adaptive diversification in a recent radiation of Australian lizards. Proc. R. Soc. B 283:20160181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Bouckaert R. 2016. Phylogeography by diffusion on a sphere: whole world phylogeography. PeerJ 4:e2406. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Bouckaert R. 2021. An efficient coalescent epoch model for Bayesian phylogenetic inference. bioRxiv. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Bouckaert R., Vaughan T.G., Barido-Sottani J., Duchêne S., Fourment M., Gavryushkina A., Heled J., Jones G., Kühnert D., De Maio N., Matschiner M., Mendes F.K., Müller N.F., Ogilvie H.A., du Plessis L., Popinga A., Rambaut A., Rasmussen D., Siveroni I., Suchard M.A., Wu C.H., Xie D., Zhang C., Stadler T., Drummond A.J.. 2019. Beast 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15:e1006650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Bouckaert R., Alvarado-Mora M.V., Pinho J.R.. 2013. Evolutionary rates and HBV: issues of rate estimation with Bayesian molecular methods. Antivir. Ther. 18:497–503. [DOI] [PubMed] [Google Scholar]
  14. Bouckaert R.R. 2020. Obama: Obama for Bayesian amino-acid model averaging. PeerJ 8:e9460. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Bragg J.G., Potter S., Bi K., Catullo R., Donnellan S.C., Eldridge M.D.B., Joseph L., Keogh J.S., Oliver P., Rowe K.C., Moritz C.. 2017. Resources for phylogenomic analyses of Australian terrestrial vertebrates. Mol. Ecol. Resour. 17:869–876. [DOI] [PubMed] [Google Scholar]
  16. Bryson Jr R.W., Linkem C.W., Pavón-Vázquez C.J., Nieto-Montes de Oca A., Klicka J., McCormack J.E.. 2017. A phylogenomic perspective on the biogeography of skinks in the plestiodon brevirostris group inferred from target enrichment of ultraconserved elements. J. Biogeogr. 44:2033–2044. [Google Scholar]
  17. Cadena C.D., Kozak K.H., Gómez J.P., Parra J.L., McCain C.M., Bowie R.C., Carnaval A.C., Moritz C., Rahbek C., Roberts T.E., Sanders N.J., Schneider C.J., VanDerWal J., Zamudio K.R., Graham C.. 2011. Latitude, elevational climatic zonation and speciation in New World vertebrates. Proc. R. Soc. Lond. [Biol] 279:194–201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Collienne L., Gavryushkin A.. 2021. Computing nearest neighbour interchange distances between ranked phylogenetic trees. J. Math. Biol. 82:1–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Degnan J.H., Rosenberg N.A.. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24:332–340. [DOI] [PubMed] [Google Scholar]
  20. Douglas J. 2020. Uglytrees: a browser-based multispecies coalescent tree visualiser. Bioinformatics. 37:268–269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Douglas J., Geoghegan J.L., Hadfield J., Bouckaert R., Storey M., Ren X., de Ligt J., French N., Welch D.. 2021a. Real-time genomics for tracking severe acute respiratory syndrome coronavirus 2 border incursions after virus elimination, New Zealand. Emerg. Infect. Dis. 27:2361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Douglas J., Zhang R., Bouckaert R.. 2021b. Adaptive dating and fast proposals: revisiting the phylogenetic relaxed clock model. PLoS Comput. Biol. 17:e1008322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Drummond A.J., Bouckaert R.R.. 2015. Bayesian evolutionary analysis with BEAST. Cambridge: Cambridge University Press. [Google Scholar]
  24. Drummond A.J., Ho S.Y.W., Phillips M.J., Rambaut A.. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4:e88. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Drummond A.J., Nicholls G.K., Rodrigo A.G., Solomon W.. 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:1307–1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Edwards S.V. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63:1–19. [DOI] [PubMed] [Google Scholar]
  27. Edwards S.V., Xi Z., Janke A., Faircloth B.C., McCormack J.E., Glenn T.C., Zhong B., Wu S., Lemmon E.M., Lemmon A.R., Leaché A.D., Liu L., Davis C.C.. 2016. Implementing and testing the multispecies coalescent model: a valuable paradigm for phylogenomics. Mol. Phylogenet. Evol. 94:447–462. [DOI] [PubMed] [Google Scholar]
  28. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376. [DOI] [PubMed] [Google Scholar]
  29. Fujita M.K., Moritz C.. 2012. Coalescent-based species delimitation in an integrative taxonomy. Trends Ecol. Evol. 27:480–488. [DOI] [PubMed] [Google Scholar]
  30. Gelman A., Gilks W.R., Roberts G.O.. 1997. Weak convergence and optimal scaling of random walk metropolis algorithms. Ann. Appl. Probab. 7:110–120. [Google Scholar]
  31. Geman S., Geman D.. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6:721–741. [DOI] [PubMed] [Google Scholar]
  32. Grummer J.A., Bryson R.W. Jr,Reeder T.W.. 2013. Species delimitation using Bayes factors: simulations and application to the Sceloporus scalaris species group. Syst. Biol. 63:119–133. [DOI] [PubMed] [Google Scholar]
  33. Haario H., Saksman E., Tamminen J.. 2001. An adaptive metropolis algorithm. Bernoulli 7:223–242. [Google Scholar]
  34. Hamilton C.A., Lemmon A.R., Lemmon E.M., Bond J.E.. 2016. Expanding anchored hybrid enrichment to resolve both deep and shallow relationships within the spider tree of life. BMC Evol. Biol. 16:1–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Hasegawa, M., Kishino H., Yano T.. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174. [DOI] [PubMed] [Google Scholar]
  36. Heled J., Drummond A.J.. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27:570–580. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Heled J., Drummond A.J.. 2012. Calibrated tree priors for relaxed phylogenetics and divergence time estimation. Syst. Biol. 61:138–149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Hohna S., Defoin-Platel M., Drummond A.J.. 2008. Clock-constrained tree proposal operators in Bayesian phylogenetic inference. 2008 8th IEEE International Conference on BioInformatics and BioEngineering IEEE. p. 1–7. [Google Scholar]
  39. Höhna S., Drummond A.J.. 2012. Guided tree topology proposals for Bayesian phylogenetic inference. Syst. Biol. 61:1–11. [DOI] [PubMed] [Google Scholar]
  40. Jones G. 2017. Algorithmic improvements to species delimitation and phylogeny estimation under the multispecies coalescent. J. Math. Biol. 74:447–467. [DOI] [PubMed] [Google Scholar]
  41. Jones G., Aydin Z., Oxelman B.. 2015. Dissect: an assignment-free Bayesian discovery method for species delimitation under the multispecies coalescent. Bioinformatics 31:991–998. [DOI] [PubMed] [Google Scholar]
  42. Kang Y.J., Kim S.K., Kim M.Y., Lestari P., Kim K.H., Ha B.K., Jun T.H., Hwang W.J., Lee T., Lee J., Shim S., Yoon M.Y., Jang Y.E., Han K.S., Taeprayoon P., Yoon N., Somta P., Tanya P., Kim K.S., Gwag J.G., Moon J.K., Lee Y.H., Park B.S., Bombarely A., Doyle J.J., Jackson S.A., Schafleitner R., Srinives P., Varshney R.K., Lee S.H.. 2014. Genome sequence of mungbean and insights into evolution within Vigna species. Nat. Commun. 5:1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Kubatko L.S., Gibbs H.L., Bloomquist E.W.. 2011. Inferring species-level phylogenies and taxonomic distinctiveness using multilocus data in sistrurus rattlesnakes. Syst. Biol. 60:393–409. [DOI] [PubMed] [Google Scholar]
  44. Leaché A.D., Fujita M.K., Minin V.N., Bouckaert R.R.. 2014. Species delimitation using genome-wide SNP data. Syst. Biol. 63:534–542. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Lemey P., Rambaut A., Drummond A.J., Suchard M.A.. 2009. Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5:1798–1805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Liu L., Pearl D.K., Brumfield R.T., Edwards S.V.. 2008. Estimating species trees using multiple-allele DNA sequence data. Evolution 62:2080–2091. [DOI] [PubMed] [Google Scholar]
  47. Liu L., Wu S., Yu L.. 2015. Coalescent methods for estimating species trees from phylogenomic data. J. Syst. Evol. 53:380–390. [Google Scholar]
  48. Liu L., Yu L., Edwards S. V.. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol. 10:302. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Liu L., Yu L., Kubatko L., Pearl D.K., Edwards S.V.. 2009. Coalescent methods for estimating phylogenetic trees. Mol. Phylogenet. Evol. 53:320–328. [DOI] [PubMed] [Google Scholar]
  50. Maddison W.P. 1997. Gene trees in species trees. Syst. Biol. 46:523–536. [Google Scholar]
  51. Maliet O., Hartig F., Morlon H.. 2019. A model with many small shifts for estimating species-specific diversification rates. Nat. Ecol. Evol. 3:1086–1092. [DOI] [PubMed] [Google Scholar]
  52. Mendes F.K., Hahn M.W.. 2016. Gene tree discordance causes apparent substitution rate variation. Syst. Biol. 65:711–721. [DOI] [PubMed] [Google Scholar]
  53. Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S., Warnow T.. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30:i541–i548. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Mirarab S., Warnow T.. 2015. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31:i44–i52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Mitchell K.J., Llamas B., Soubrier J., Rawlence N.J., Worthy T.H., Wood J., Lee M.S., Cooper A.. 2014. Ancient DNA reveals elephant birds and kiwi are sister taxa and clarifies ratite bird evolution. Science 344:898–900. [DOI] [PubMed] [Google Scholar]
  56. Nee S., May R.M., Harvey P.H. 1994. The reconstructed evolutionary process. Philos. Trans. R. Soc. Lond. B 344:305–311.
  57. Ogilvie H., Bouckaert R., Drummond A. 2017. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol. Biol. Evol. 34:2101–2114.
  58. Ogilvie H.A., Bouckaert R.R., Drummond A.J. 2016. Computational performance and statistical accuracy of *BEAST and comparisons with other methods. Syst. Biol. 65:381–396.
  59. Oliver J.C. 2013. Microevolutionary processes generate phylogenomic discordance at ancient divergences. Evolution 67:1823–1830.
  60. Pamilo P., Nei M. 1988. Relationships between gene trees and species trees. Mol. Biol. Evol. 5:568–583.
  61. Pepper M., Doughty P., Fujita M.K., Moritz C., Keogh J.S. 2013. Speciation on the rocks: integrated systematics of the Heteronotia spelea species complex (Gekkota; Reptilia). PLoS One 8:e78110.
  62. Rambaut A., Drummond A.J., Xie D., Baele G., Suchard M.A. 2018. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Syst. Biol. 67:901.
  63. Rannala B., Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst. Biol. 66:823–842.
  64. Rowe K.C., Aplin K.P., Baverstock P.R., Moritz C. 2011. Recent and rapid speciation with limited morphological disparity in the genus Rattus. Syst. Biol. 60:188–203.
  65. Saitou N., Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
  66. Sauquet H., Ho S.Y.W., Gandolfo M.A., Jordan G.J., Wilf P., Cantrill D.J., Bayly M.J., Bromham L., Brown G.K., Carpenter R.J., Lee D.M., Murphy D.J., Sniderman J.M.K., Udovicic F. 2011. Testing the impact of calibration on molecular divergence times using a fossil-rich group: the case of Nothofagus (Fagales). Syst. Biol. 61:289–313.
  67. Sokal R.R. 1958. A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull. 38:1409–1438.
  68. Song S., Liu L., Edwards S.V., Wu S. 2012. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl. Acad. Sci. USA 109:14942–14947.
  69. Stenson P.D., Mort M., Ball E.V., Evans K., Hayden M., Heywood S., Hussain M., Phillips A.D., Cooper D.N. 2017. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum. Genet. 136:665–677.
  70. Thawornwattana Y., Dalquen D., Yang Z. 2018. Designing simple and efficient Markov chain Monte Carlo proposal kernels. Bayesian Anal. 13:1037–1063.
  71. Vihola M. 2012. Robust adaptive Metropolis algorithm with coerced acceptance rate. Stat. Comput. 22:997–1008.
  72. Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691–699.
  73. Xi Z., Liu L., Davis C.C. 2015. Genes with minimal phylogenetic information are problematic for coalescent analyses when gene tree estimation is biased. Mol. Phylogenet. Evol. 92:63–71.
  74. Yang Z., Rannala B. 2010. Bayesian species delimitation using multilocus sequence data. Proc. Natl. Acad. Sci. USA 107:9264–9269.
  75. Yang Z., Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol. Biol. Evol. 31:3125–3135.
  76. Yang Z., Rodríguez C.E. 2013. Searching for efficient Markov chain Monte Carlo proposal kernels. Proc. Natl. Acad. Sci. USA 110:19307–19312.
  77. Yule G.U. 1925. II. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F. R. S. Philos. Trans. R. Soc. Lond. B 213:21–87.
  78. Zhang C., Huelsenbeck J.P., Ronquist F. 2020. Using parsimony-guided tree proposals to accelerate convergence in Bayesian phylogenetic inference. Syst. Biol.
  79. Zhang R., Drummond A. 2020. Improving the performance of Bayesian phylogenetic inference under relaxed clock models. BMC Evol. Biol. 20:1–28.

Articles from Systematic Biology are provided here courtesy of Oxford University Press
