Skip to main content
PeerJ logoLink to PeerJ
. 2019 Dec 18;7:e8272. doi: 10.7717/peerj.8272

Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics

Mathieu Fourment 1,, Aaron E Darling 1
Editor: Asa Ben-Hur
PMCID: PMC6966998  PMID: 31976168

Abstract

Recent advances in statistical machine learning techniques have led to the creation of probabilistic programming frameworks. These frameworks enable probabilistic models to be rapidly prototyped and fit to data using scalable approximation methods such as variational inference. In this work, we explore the use of the Stan language for probabilistic programming in application to phylogenetic models. We show that many commonly used phylogenetic models including the general time reversible substitution model, rate heterogeneity among sites, and a range of coalescent models can be implemented using a probabilistic programming language. The posterior probability distributions obtained via the black box variational inference engine in Stan were compared to those obtained with reference implementations of Markov chain Monte Carlo (MCMC) for phylogenetic inference. We find that black box variational inference in Stan is less accurate than MCMC methods for phylogenetic models, but requires far less compute time. Finally, we evaluate a custom implementation of mean-field variational inference on the Jukes–Cantor substitution model and show that a specialized implementation of variational inference can be two orders of magnitude faster and more accurate than a general purpose probabilistic implementation.

Keywords: Variational Bayes, Stan, Phylogenetics, molecular clock, Bayesian inference

Introduction

Markov chain Monte Carlo (MCMC) algorithms have become the workhorse of Bayesian phylogenetic inference since they were introduced in the late 1990’s (Mau & Newton, 1997; Larget & Simon, 1999). Recent advances in computing hardware and corresponding software implementations have allowed this class of inference method to handle increasingly large datasets (Flouri et al., 2015; Ayres et al., 2019). However the quantity of sequence data being generated every year has been growing exponentially, which, when combined with practitioner’s desires to conduct inference on increasingly rich statistical models, makes MCMC algorithms difficult to apply in practice because they are too slow to compute. Sequential Monte Carlo is an alternative sampling method (Doucet, De Freitas & Gordon, 2001) that has recently received some attention in the phylogenetic community (Bouchard-Côté, Sankararaman & Jordan, 2012; Wang, Bouchard-Côté & Doucet, 2015; Fourment et al., 2018a). Although it is fast and, unlike MCMC algorithms, easily parallelizable, designing efficient proposals for tree topologies and continuous parameters has proven to be difficult (Fourment et al., 2018a). Unlike some statistical models, phylogenetic models have a structure that makes approximating their posterior distribution especially difficult. Specifically, the combination of dependencies among discrete-valued (e.g. the tree topology) and continuous variables (e.g. branch lengths, rates, etc.) means that many recently developed methods to accelerate Bayesian inference can not be directly applied to phylogenetic models.

Attempts to accelerate MCMC

The tree topology is arguably the most difficult part of a phylogenetic model to estimate. The number of possible topologies for binary trees grows super-exponentially in the number of sequences (tips). As a result, sophisticated MCMC proposals are required to efficiently explore the enormous space of this parameter. Typically, MCMC samplers propose a new topology using simple rearrangements of the current topology such as nearest neighbor interchange and subtree-prune-regraft operations. However these relative small, local moves through the state space can lead to inadequate sampling, especially when the posterior is multimodal (Whidden & Matsen, 2015). Several advances in tree topology sampling have been proposed, such as the use of a conditional clade probability distribution to guide proposals (Höhna & Drummond, 2011). Another recently proposed approach involved an extension of Hamiltonian Monte Carlo, which uses gradient information to guide proposals, to work in the discrete state space of tree topologies (Dinh et al., 2017).

When it comes to continuous parameters, the simplest proposals update a single parameter at a time (e.g. via a multiplier or sliding window proposals). This comes at the expense of ignoring correlation among parameters. Building upon research on adaptive MCMC (Haario, Saksman & Tamminen, 2001; Roberts & Rosenthal, 2009), recent efforts have implemented multivariate proposals to update blocks of continuous phylogenetic model parameters (Baele et al., 2017). While Baele et al. (2017) showed that this proposal is statistically more efficient than standard proposals in terms of effective sampling size per unit of time, this proposal is currently not available for updating branch lengths in the BEAST packages (Suchard et al., 2018; Bouckaert et al., 2019). Finally, in order to increase the efficiency of an MCMC sampler, Aberer, Stamatakis & Ronquist (2015) developed an independence sampler to update branch lengths in an informed way. Unfortunately, none of these approaches provide the magnitude of improvement required to carry out Bayesian phylogenetic inference on the datasets of 1,000s of sequences which have now become commonplace.

Variational inference: an alternative to MCMC

One alternative to MCMC that has been proposed for Bayesian inference of model parameters is variational Bayes (VB) (Jordan et al., 1999; Wainwright & Jordan, 2008). Like MCMC, VB is a technique used to approximate intractable integrals. The main idea behind variational inference is to transform the posterior approximation of an intractable model into an optimization problem using a family of approximate densities that are tractable. The aim is to find the member of that family with the minimum Kullback–Leibler (KL) divergence to the posterior distribution of interest. Although variational inference does not provide the guarantee that MCMC would of generating samples from the true posterior distribution, variational inference tends to be much faster than MCMC since it relies on fast optimization methods. It is common for VB inference to require 100-fold less compute time than MCMC to approximate the same posterior distribution, although the quality of the approximation may not be as good. Variational inference has become popular in machine learning, as evidenced by the large number of software libraries implementing it (e.g. Stan (Carpenter et al., 2017), PyMC (Salvatier, Wiecki & Fonnesbeck, 2016), TensorFlow (Abadi et al., 2015), Edward (Tran et al., 2016)).

To date, VB has received relatively little attention in the field of phylogenetics, possibly due to the presence of discrete model components such as the tree topology which can not be inferred using standard VB algorithms. However, in recent years interest in VB for phylogenetics has begun to grow. Using the Jukes–Cantor (JC69) nucleotide substitution model and fixed topologies, Fourment et al. (2018b) employed VB to approximate the marginal likelihood of fixed topologies. Zhang & Matsen (2019) have recently shown that VB is a superior method to infer phylogenetic tree posteriors in the classical case of unrooted trees with branch lengths and the JC69 substitution model. Dang & Kishino (2019) used variational inference to approximate the CAT-Poisson model for amino acid data (Lartillot & Philippe, 2004). In that work the authors used a partial Gibbs sampling strategy to update the topology and therefore only the continuous parameters were approximated with VB.

In the present work, we describe implementations of several phylogenetic models for nucleotide datasets using the Stan language (Carpenter et al., 2017). These models are substantially richer than those described in previous studies of variational inference for phylogenetics as they include more general nucleotide substitution models, such as the general time reversible (GTR) substitution model, and rate heterogeneity across sites. We also implemented molecular clock and coalescent models to infer the substitution rate and divergence times of heterochronous and homochronous sequence data. As with previous studies, we consider the topology to be fixed (Fourment et al., 2018b). Despite this limitation, implementing the type of model we describe here under a fixed tree is a useful step forward toward fast Bayesian inference of complex phylogenetic models, as it helps us understand the quality and speed of posterior approximation that can be achieved using a generic modeling language like Stan. We compared the performance of the variational approximations to that obtained with the widely used MCMC-based phylogenetic software package BEAST, run on a fixed tree topology (Suchard et al., 2018; Bouckaert et al., 2019). Because our models are implemented in the Stan language, we are able to carry out inference using any of the inference engines available in Stan, and so we used the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014) to further validate our phylogenetic model implementations. Finally, we compared the mean-field approximations obtained with phylostan to those obtained with a highly specialized implementation of mean-field variational inference in the C language software physher (Fourment & Holmes, 2014; Fourment et al., 2018b) available from https://www.github.com/4ment/physher.

Methods

We have implemented phylogenetic models in a Python software package called phylostan. phylostan accesses the Stan package through the pystan API (Stan Development Team, 2019). When executed, the phylostan command line program generates and runs a Stan script with the selected phylogenetic model. Table 1 shows a summary of different types of models available in phylostan. The code and the associated datasets are available from https://www.github.com/4ment/phylostan.

Table 1. Models implemented in phylostan.

Type Model
Substitution models JC69, HKY, GTR
Rate heterogeniety across sites Proportion of invariant sites, Weibull and discrete distributions
Clock model Clock-free, strict, relaxed (hierarchical): autocorrelated, uncorrelated
Coalescent Constant, skyride, skygrid

Variational inference

The main idea behind variational inference is to transform posterior approximation into an optimization problem using a family of approximate densities. The aim is to find the member of that family with the minimum KL divergence to the posterior distribution of interest:

ϕ=argminϕΦKL(q(θ;ϕ)p(θ|D,τ)),

where q(θ; ϕ) is the variational distribution parameterized by a vector ϕ ∈ Φ, while θ are the model parameters (e.g. branch lengths), D are the aligned sequence data, τ is a fixed tree topology, and KL is defined as

KL(qp)=θq(θ;ϕ)logq(θ;ϕ)p(θ|D,τ).

To minimize the KL divergence, we first rewrite the KL equation:

KL(q(θ;ϕ)p(θD,τ)=E[logq(θ;ϕ)]E[logp(θD,τ)]=E[logq(θ;ϕ)]E[logp(θ,Dτ)]+logp(Dτ),

where the expectations are taken with respect to the variational distribution q. The third term log p(D | τ) on the right hand side of the last equality is a constant with respect to the variational distribution so it can be ignored for the purpose of the minimization. After switching the sign of the other two terms, the minimization problem can be framed as a maximization problem of the evidence lower bound (ELBO) function

ELBO(ϕ)=E[logp(θ,D|τ)]E[logq(θ;ϕ)].

The ELBO is easier to calculate than the KL divergence as it does not involve computing the intractable posterior nnormalization term p(D | τ). We entertained two classes of variational distributions: mean-field and full-rank Gaussian distributions. The mean-field approximation assumes a complete factorization of the distribution over each of the N parameters of the model (e.g. branch lengths, GTR parameters) and each factor is governed by its own variational parameters ϕi:

q(θ1,,θN;ϕ)=i=1Nq(θi;ϕi),

where qi; ϕi) is a Gaussian density and ϕi = (μi, σi).

The full-rank approximation is

q(θ1,,θN;ϕ)=N(θ;ϕ),

where the term ϕ = (μ, Σ) concatenates the mean vector μ and covariance matrix Σ of a multivariate Gaussian distribution. The mean vector and the covariance matrix contribute respectively N and N(N + 1)/2 parameters to the full-rank variational model.

When provided with a model definition via phylostan, the Stan software is able to estimate the variational parameters by applying stochastic gradient ascent in a black box approach (Ranganath, Gerrish & Blei, 2014; Kucukelbir et al., 2015).

In phylogenetics, parameters are usually constrained, for example, branch lengths and substitution rates are non-negative while nucleotide frequencies sum to one. Since the support of the Gaussian distribution is ℝ, Stan automatically transforms constrained parameters into unconstrained variables. For example, branch lengths are log-transformed such that they live in the real coordinate space. Stan maintains a set of transformations for common constraints (e.g. lower and/or upper bound constraints) and their corresponding Jacobians. Further information on the library of transformations can be found in the online Stan manual https://mc-stan.org/docs/2_18/reference-manual/variable-transforms-chapter.html.

Models with clocks

The simplest phylogenetic models do not distinguish between evolutionary rate and time, and for those models the only constraint on the branch lengths of a tree is that they must be non-negative. Since our models assume that branches are independent, the non-negativity is easily accommodated in a Stan model via Stan’s application of a log transform on the branch length parameters.

On the other hand, trees in molecular clock models are constrained so that parent nodes must be older than any of their descendent nodes. Here we consider that time is going backwards: the earliest taxon was sampled at time 0 and the age of every internal node is greater than 0. In the case of homochronous sequence data (i.e. every sequence was collected at the same time), the age of an internal node can be reparameterized as a proportion of the age of its parent, except for the root age. For each internal node i (excluding the root), the height hi of the node is reparameterized as hi = pi hpa(i) where the subscript pa(i) denotes the parent of node i and 0 < pi < 1. The height of the root is constrained to be non-negative.

In the case of heterochronous sequence data, the age of the sequence data at the tips of the tree must to be taken into account. For each internal node i (excluding the root), the height hi of the node can be reparameterized as hi = hd(i) + pi (hpa(i)hd(i)) where the subscript d(i) denotes the earliest taxon in the set of descendant nodes of node i. The height of the root is now constrained to be greater than the earliest taxa.

The reparameterized model consists of a vector of proportions p=(p1,,pN2), which Stan automatically transforms using a log-odds transform.

This reparametrization of the node heights is also used in the PAML software package (Yang, 2007). In our case, the transformation of the node ages requires an adjustment to the joint density. Specifically, in the multivariate case, the density is scaled by the absolute value of the Jacobian of the inverse of the transforms. The Jacobian of homochronous data is a triangular matrix for which the determinant is i=1N2hpa(i). For heterochronous data, the Jacobian is also triangular for which the determinant is i=1N2hpa(i)hd(i).

Under these reparameterizations, different ranked topologies will be evaluated during the gradient ascent optimization of the variational parameters.

Rate heterogeneity among sites

The gamma distribution is commonly used to model rate heterogeneity across sites in phylogenetic models (Yang, 1994, 1996). However, it is currently not possible to use the quantile approximation of the gamma distribution proposed by Yang (1994) since the Stan language does not provide inverse cumulative density functions (CDF). Instead we opted to use the Weibull distribution to describe rate heterogeneity across sites. The Weibull distribution has the closed-form inverse CDF F−1(p | λ, k) = λ ( − ln(1 − p))1/k. Fixing the scale parameter λ to 1, the shape of the Weibull distribution is determined by the shape parameter k. When k ≤ 1 the Weibull distribution is skewed, suggesting strong rate heterogeneity. When k > 1 the distribution is bell shaped, suggesting no or low rate heterogeneity. Since the mean of the Weibull distribution must be equal to 1 in order to preserve branch length interpretation (i.e. expected number of substitutions per site) we rescale the rates r1rK such that i=1Kpiri=1. Such rescaling is also necessary when a proportion of invariant sites is included in the model.

We also implemented rate heterogeneity across sites using a general discrete distribution with 2K − 2 parameters, by specifying K rates 0r1<r2<rK and probabilities pi = Pr{r = ri} with i=1Kpiri=1 and i=1Kpi=1 (Kosakovsky Pond & Frost, 2005). The values of ri are constrained to be ordered from smallest to largest merely for the sake of model identifiability in Stan. This distribution is parameter rich compared to the discretized Weibull distribution scheme and as a result it may be more difficult to fit to data.

Datasets and validation

To evaluate the accuracy of variational inference we applied phylostan to three empirical datasets and compared the posterior approximations to those obtained with BEAST2 (Bouckaert et al., 2019), BEAST (Suchard et al., 2018), or MrBayes (Ronquist et al., 2012). The first dataset is made of 69 human influenza virus haemagglutinin nucleotide sequences isolated between 1981 and 1998. The dataset was analyzed under the Hasegawa, Kishino and Yano (HKY) substitution model allowing for rate variation among sites (Weibull distribution with four rate classes). We used a strict clock and a constant size coalescent prior on the tree. The topology used in the phylostan and BEAST2 analyses was drawn randomly from a sample of trees generated by a preliminary analysis with BEAST2 without topological constraints.

The second dataset comprises 63 RNA sequences of type 4 from the E1 region of the hepatitis C virus (HCV) genome that were isolated in 1993. As in previous studies (Pybus et al., 2001; Faulkner et al., 2018), the substitution rate was fixed to 7.9 × 10−4 substitutions per site per year. We used the GTR substitution model with Weibull distributed rate heterogeneity (four categories). We used the Bayesian skyride tree prior and a Gaussian Markov random field prior for the effective population size trajectory (Minin, Bloomquist & Suchard, 2008). The topology used in the phylostan and BEAST (Suchard et al., 2018) analyses was also drawn randomly from a sample of trees generated by a preliminary analysis with BEAST without topological constraints. We assigned a gamma prior with rate and scale equal to 0.005 on the precision parameter.

The last dataset consists of twenty-seven 18S rRNA sequences of length 1949. This dataset, commonly referred to as DS1, has been studied several times and has become a de facto standard dataset for evaluating MCMC methods (Hedges, Moberg & Maxson, 1990; Lakner et al., 2008; Fourment et al., 2018b; Whidden and Matsen IV, 2015; Whidden et al., 2018). DS1 was analyzed under the Jukes Cantor substitution model (JC69) without a clock. In (Fourment et al., 2018b), the authors used very long runs of MrBayes (Ronquist et al., 2012) to estimate the posterior distribution of the tree topologies (see corresponding paper for more details). Akin to that study, we focused on the 42 tree topologies contained in the 95% credible set and renormalized their posterior probabilities so that they sum to one. For each topology we then estimated its ELBO under the mean-field approximation using either Stan or physher (Fourment & Holmes, 2014), and converted these approximations of the marginal likelihood to a posterior probability. Specifically, we use the ELBO estimate p^ELBO(D|τ):=maxϕΦELBO(ϕ) to approximate the marginal likelihood of a topology. Finally, we compare the ELBO-based posterior estimates to the MCMC-based estimates from MrBayes, which we consider as the ground truth. We placed independent exponential priors on branch lengths with a prior expectation of 0.1 substitutions.

In every analysis the gradient of the ELBO is evaluated with one sample and the ELBO is computed using 100 samples. While we used the default parameters provided by pystan, we used a smaller relative tolerance of 0.001 for every analysis as suggested by Yao et al. (2018).

Results

We analyzed three datasets to evaluate the accuracy of the approximations to the posterior distribution provided by the variational framework on phylogenetic models. First, we analyzed a set of heterochronous influenza A virus sequences under the strict clock model on a fixed topology with BEAST2 and phylostan. In Fig. 1 we show the posterior distributions of every parameter approximated by BEAST2 and phylostan using either mean-field variational inference or the NUTS algorithm. Although the mean-field-based variational approach tends to approximate the posterior distributions correctly, the between-run variability is high relative to other methods. Apart from two very divergent replicates, the full-rank variational approximation appears to be more precise than the mean-field approximation (Fig. 2). The two replicates that do not correctly approximate the posterior distributions have a substantially lower ELBO (−4475.75 and −4481.07) than the other replicates for which the ELBO are between −4430 and −4432, so in practice these poorly converged runs could be identified and removed by comparing the results of multiple replicates. The distributions inferred by BEAST2 using Metropolis Hastings closely match the posterior approximations given by the NUTS algorithm implemented in Stan, providing evidence that we have implemented the models in phylostan correctly.

Figure 1. Posterior approximation of phylogenetic model parameters using mean-field variational inference (phylostan), NUTS (phylostan), and Metropolis-Hastings (BEAST2) algorithms on the influenza A virus dataset.

Figure 1

Mean-field variational inference (blue lines) was replicated 10 times while the NUTS (red line) and the Metropolis-Hastings (histogram) methods were run once. Parameters are: clock substitution rate (A), nucleotide frequencies (B–E), ratio of transition and transversion rates of HKY model (F), population size (G), shape parameter of the Weibull distribution (H), and tree height (I).

Figure 2. Posterior approximation of phylogenetic model parameters using full-rank variational inference (phylostan), NUTS (phylostan), and Metropolis-Hastings (BEAST2) algorithms on the influenza A virus dataset.

Figure 2

Full-rank variational inference (blue lines) was replicated 10 times while the NUTS (red line) and the Metropolis-Hastings (histogram) methods were run once. Parameters are: clock substitution rate (A), nucleotide frequencies (B–E), ratio of transition and transversion rates of HKY model (F), population size (G), shape parameter of the Weibull distribution (H), and tree height (I).

Next we evaluated the ability of the variational framework to approximate the demographic history of a set a HCV using the skyride model (Minin, Bloomquist & Suchard, 2008) and a fixed substitution rate. In Fig. 3, we present the demographic history approximations from the replicate with the highest ELBO for the mean-field and full-rank models. The trajectories of the population size estimated from the variational distributions match the estimates from BEAST and NUTS (Fig. 3). The 95% confidence intervals computed from the variational distributions are narrower than the MCMC-based intervals. This can, at least in part, be explained by the zero-forcing nature of the reverse KL divergence. Although the mean-field variational approximation consistently recovers the population size trajectory across replicates, the full-rank approximation fails to converge to the correct posterior in 9 out of the 10 replicates.

Figure 3. Posterior medians (solid black lines) of effective population sizes and associated 95% credible intervals (blue shaded areas) for the HCV dataset using BEAST (A) and phylostan (B–D).

Figure 3

Variational-based approximations shown in this figure are from the replicates with the highest ELBO. Variational-based approximations for every replicate can found in Figs. S1 and S2.

In the final analysis we compare the accuracy and speed of mean-field variational inference as implemented in phylostan against an implementation in the physher software when using the Jukes–Cantor model. phylostan and physher use the same gradient ascent algorithm and variational distribution, for which the gradient is analytically derived (Kucukelbir et al., 2015). We find that physher is two orders of magnitude faster than phylostan with a mean computing time of 1.2 s per topology for the former and 332 seconds per topology for the latter (Fig. 4). Although it is much faster, the physher implementation of mean-field variational inference currently is specific to the Jukes–Cantor model and lacks the generality of phylostan to carry out inference under a range of models. Another important aspect that may contribute to the speed differences is how each program initializes its variational parameters. Whereas Stan initializes the variational parameters to random points in the parameter space, physher initializes these parameters deterministically with near optimal values using the Laplus method (Fourment et al., 2018b). For each branch length parameter, the Laplus approximation takes its maximum a posteriori estimate and second derivative with respect to the log-likelihood and finds the parameters (μ and σ) of the log-normal distribution by matching modes and second derivatives of the approximating and posterior distributions of the branch length. The procedure is described in more detail in (Fourment et al., 2018b). Since we use a log-transform on the branch length parameters, the two parameters μ and σ are used to initialize the location and scale parameters of the normal variational distribution.

Figure 4. The approximate posterior probabilities of the topologies in DS1 (y-axis) calculated with physher (A–J) and phylostan (K–T) versus the ground truth posterior probabilities from MrBayes (x-axis), plotted on the log scale for clarity.

Figure 4

The mean time per topology and the Kullback–Leibler divergence are reported for each of the 10 replicates. Panels are ordered in decreasing order by Kullback–Leibler divergence.

Discussion

We have developed a tool based on the Stan package for Bayesian phylogenetic inference, which to our knowledge is the first application of VB to time trees with coalescent models. Although we have focused on inferring phylogenetic models with a fixed topology due to the complexity and discrete nature of the topology space, recent research on subsplit Bayesian networks (SBN) has made a significant step toward modeling topological uncertainty in the variational framework (Zhang & Matsen, 2018, 2019). A promising area for future research will be to investigate combinations of the parameter rich models we have presented in this study with the SBN approach, for which only the JC69 model has been implemented thus far.

In our tests, we found the variational inference engine of Stan to be considerably faster than the NUTS algorithm. We then showed using the DS1 dataset that a manually coded implementation of variational inference can be two orders of magnitude faster and somewhat more accurate than the automatic differentiation variational inference produced via Stan. This finding makes it clear that the modeling flexibility and ease of use afforded by a general purpose language like Stan comes with significant tradeoffs in computational efficiency and accuracy. On the other hand, bespoke implementations of variational inference can be complex to derive, time consuming to implement, and limited in flexibility. For example, the Laplus method used to initialize the parameters of the branch length variational distributions assumes that the branch length parameters are independent. Further work would be needed to extend the Laplus method for other model parameters that are strongly correlated such as the GTR parameters. Opportunities may exist to combine the flexibility of a general purpose modeling language with an inference engine that is specialized for phylogenetic models. Implementation of variational inference within widely used phylogenetic packages such as BEAST, MrBayes, and RevBayes (Höhna et al., 2016) would be a significant step forward. Although calculating the gradient of each parameter analytically is in theory possible, this tedious task could be circumvented by adopting the automatic differentiation framework used in modern probabilistic programming languages (e.g. Stan, Edward).

In this study we have presented and evaluated a subset of the functionality available in phylostan. In addition to the models evaluated here we have also implemented the coalescent skygrid model (Gill et al., 2013) and more flexible molecular clock models such the relaxed clock using a hierarchical prior and an autocorrelated clock rate. A list of model features available in phylostan is given in Table 1.

Another promising avenue for future research would be to capture dependencies between latent variables that, by definition, are ignored by the mean-field approximation. To this end, Tran, Blei & Airoldi (2015) have proposed to augment the mean-field variational distribution with a copula, while recent studies (Rezende & Mohamed, 2015; Kingma et al., 2016) have proposed to improve the posterior approximation though a normalizing flow. Although the full-rank approximation can in principle capture some of these dependencies, in practice it seems that phylogenetic models can not be reliably fit with the full-rank approximation, at least not with the black box approach implemented in Stan. We speculate that the failure to fit the full-rank approximation on, e.g. the HCV dataset shown above, is a result of the high dimension of the covariance matrix. One possible remedy would be the use of a sparse covariance matrix to reduce the number of parameters. It is reasonable to assume that some variables have low or no correlation (e.g. guanine frequency and root height, or pairs of branch lengths for distant tree branches). It may also be possible to apply a low-rank Gaussian distribution (Miller, Foti & Adams, 2017) to alleviate the computational burden associated with the full-rank distribution.

Conclusions

Although the methods we have introduced are limited to analysis of a single fixed tree topology, and have much room for improvement, there is still scope for them to be usefully applied to biological data analysis. Phylogenetic methods have proven especially useful in the context of infectious disease research. In particular, systems such as Nextstrain (Hadfield et al., 2018) provide a fast way to incorporate data into a visualization environment that enables exploration of phylogeny and other key aspects such as sample collection time and geographical location. Nextstrain computes a phylogeny for input sequences using maximum likelihood-based methods. In order to reconstruct temporal information, it then takes the single maximum likelihood tree and applies the TreeTime software (Sagulenko, Puller & Neher, 2018) to infer divergence times and evolutionary rates. TreeTime provides a fast and highly scalable means to carry out approximate maximum likelihood inference on those model parameters. Although the current implementation may not run as fast as TreeTime, phylostan could offer a drop-in replacement to something like TreeTime in this application context. phylostan would have the advantage of carrying out joint Bayesian inference of all continuous model parameters, and being able to report the uncertainty associated with estimates of each model parameter.

Supplemental Information

Supplemental Information 1. Posterior medians (solid black lines) of effective population sizes and associated 95% credible intervals (blue shaded areas) for the HCV dataset using the mean-field variational distribution in phylostan.

10 Independent replicates are shown.

DOI: 10.7717/peerj.8272/supp-1
Supplemental Information 2. Posterior medians (solid black lines) of effective population sizes and associated 95\% credible intervals (blue shaded areas) for the HCV dataset using the full-rank variational distribution in phylostan.

10 Independent replicates are shown.

DOI: 10.7717/peerj.8272/supp-2

Acknowledgments

We would like to thank Erick Matsen, Alexei Drummond, Christiaan Swanepoel and Cheng Zhang for useful discussions.

Funding Statement

This work was supported by the ithree institute, UTS and AusGEM. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Mathieu Fourment conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Aaron E. Darling conceived and designed the experiments, contributed reagents/materials/analysis tools, approved the final draft.

Data Availability

The following information was supplied regarding data availability:

Data is available from GitHub: https://github.com/4ment/phylostan.

References

  • Abadi et al. (2015).Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: large-scale machine learning on heterogeneous systems. 2015. http://tensorflow.org http://tensorflow.org
  • Aberer, Stamatakis & Ronquist (2015).Aberer AJ, Stamatakis A, Ronquist F. An efficient independence sampler for updating branches in Bayesian Markov chain Monte Carlo sampling of phylogenetic trees. Systematic Biology. 2015;65(1):161–176. doi: 10.1093/sysbio/syv051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Ayres et al. (2019).Ayres DL, Cummings MP, Baele G, Darling AE, Lewis PO, Swofford DL, Huelsenbeck JP, Lemey P, Rambaut A, Suchard MA. Beagle 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Systematic Biology. 2019;68(6):1052–1061. doi: 10.1093/sysbio/syz020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Baele et al. (2017).Baele G, Lemey P, Rambaut A, Suchard MA. Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics. 2017;33(12):1798–1805. doi: 10.1093/bioinformatics/btx088. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Bouchard-Côté, Sankararaman & Jordan (2012).Bouchard-Côté A, Sankararaman S, Jordan MI. Phylogenetic inference via sequential Monte Carlo. Systematic Biology. 2012;61(4):579–593. doi: 10.1093/sysbio/syr131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Bouckaert et al. (2019).Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled J, Jones G, Kühnert D, De Maio N, Matschiner M, Mendes FK, Müller NF, Ogilvie HA, Du Plessis L, Popinga A, Rambaut A, Rasmussen D, Siveroni I, Suchard MA, Wu C-H, Xie D, Zhang C, Stadler T, Drummond AJ. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLOS Computational Biology. 2019;15(4):e1006650. doi: 10.1371/journal.pcbi.1006650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Carpenter et al. (2017).Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A. Stan: a probabilistic programming language. Journal of Statistical Software. 2017;76(1):1–32. doi: 10.18637/jss.v076.i01. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Dang & Kishino (2019).Dang T, Kishino H. Stochastic variational inference for Bayesian phylogenetics: a case of CAT model. Molecular Biology and Evolution. 2019;36(4):825–833. doi: 10.1093/molbev/msz020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Dinh et al. (2017).Dinh V, Bilge A, Zhang C, Matsen FA., IV . Probabilistic path Hamiltonian Monte Carlo. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Sydney: PMLR; 2017. pp. 1009–1018. (Proceedings of Machine Learning Research). [Google Scholar]
  • Doucet, De Freitas & Gordon (2001).Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo methods in practice. New York: Springer-Verlag; 2001. [Google Scholar]
  • Faulkner et al. (2018).Faulkner JR, Magee AF, Shapiro B, Minin VN. Locally-adaptive Bayesian nonparametric inference for phylodynamics. 2018. http://arxiv.org/abs/1808.04401 http://arxiv.org/abs/1808.04401
  • Flouri et al. (2015).Flouri T, Izquierdo-Carrasco F, Darriba D, Aberer AJ, Nguyen L-T, Minh BQ, Von Haeseler A, Stamatakis A. The phylogenetic likelihood library. Systematic Biology. 2015;64(2):356–362. doi: 10.1093/sysbio/syu084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Fourment et al. (2018a).Fourment M, Claywell BC, Dinh V, McCoy C, Matsen FA, IV, Darling AE. Effective online Bayesian phylogenetics via sequential Monte Carlo with guided proposals. Systematic Biology. 2018a;67(3):490–502. doi: 10.1093/sysbio/syx090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Fourment & Holmes (2014).Fourment M, Holmes EC. Novel non-parametric models to estimate evolutionary rates and divergence times from heterochronous sequence data. BMC Evolutionary Biology. 2014;14(1):163. doi: 10.1186/s12862-014-0163-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Fourment et al. (2018b).Fourment M, Magee AF, Whidden C, Bilge A, Matsen FA, IV, Minin VN. 19 Dubious ways to compute the marginal likelihood of a phylogenetic tree topology. 2018b. http://arxiv.org/abs/1811.11804. http://arxiv.org/abs/1811.11804 [DOI] [PMC free article] [PubMed]
  • Gill et al. (2013).Gill MS, Lemey P, Faria NR, Rambaut A, Shapiro B, Suchard MA. Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci. Molecular Biology and Evolution. 2013;30(3):713–724. doi: 10.1093/molbev/mss265. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Haario, Saksman & Tamminen (2001).Haario H, Saksman E, Tamminen J. An adaptive Metropolis algorithm. Bernoulli. 2001;7(2):223–242. doi: 10.2307/3318737. [DOI] [Google Scholar]
  • Hadfield et al. (2018).Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34(23):4121–4123. doi: 10.1093/bioinformatics/bty407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Hedges, Moberg & Maxson (1990).Hedges SB, Moberg KD, Maxson LR. Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and a review of the evidence for amniote relationships. Molecular Biology and Evolution. 1990;7(6):607–633. doi: 10.1093/oxfordjournals.molbev.a040628. [DOI] [PubMed] [Google Scholar]
  • Hoffman & Gelman (2014).Hoffman MD, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014;15(1):1593–1623. [Google Scholar]
  • Höhna & Drummond (2011).Höhna S, Drummond AJ. Guided tree topology proposals for Bayesian phylogenetic inference. Systematic Biology. 2011;61(1):1–11. doi: 10.1093/sysbio/syr074. [DOI] [PubMed] [Google Scholar]
  • Höhna et al. (2016).Höhna S, Landis MJ, Heath TA, Boussau B, Lartillot N, Moore BR, Huelsenbeck JP, Ronquist F. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology. 2016;65(4):726–736. doi: 10.1093/sysbio/syw021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Jordan et al. (1999).Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Machine Learning. 1999;37(2):183–233. doi: 10.1023/A:1007665907178. [DOI] [Google Scholar]
  • Kingma et al. (2016).Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M. Advances in Neural Information Processing Systems. Red Hook: Curran Associates; 2016. Improved variational inference with inverse autoregressive flow; pp. 4743–4751. [Google Scholar]
  • Kosakovsky Pond & Frost (2005).Kosakovsky Pond SL, Frost SD. A simple hierarchical approach to modeling distributions of substitution rates. Molecular Biology and Evolution. 2005;22(2):223–234. doi: 10.1093/molbev/msi009. [DOI] [PubMed] [Google Scholar]
  • Kucukelbir et al. (2015).Kucukelbir A, Ranganath R, Gelman A, Blei D. Advances in Neural Information Processing Systems. Cambridge: MIT Press; 2015. Automatic variational inference in Stan; pp. 568–576. [Google Scholar]
  • Lakner et al. (2008).Lakner C, Van Der Mark P, Huelsenbeck JP, Larget B, Ronquist F. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Systematic Biology. 2008;57(1):86–103. doi: 10.1080/10635150801886156. [DOI] [PubMed] [Google Scholar]
  • Larget & Simon (1999).Larget B, Simon DL. Markov chasin Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution. 1999;16(6):750–759. doi: 10.1093/oxfordjournals.molbev.a026160. [DOI] [Google Scholar]
  • Lartillot & Philippe (2004).Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution. 2004;21(6):1095–1109. doi: 10.1093/molbev/msh112. [DOI] [PubMed] [Google Scholar]
  • Mau & Newton (1997).Mau B, Newton MA. Phylogenetic inference for binary data on dendograms using Markov chain Monte Carlo. Journal of Computational and Graphical Statistics. 1997;6(1):122–131. [Google Scholar]
  • Miller, Foti & Adams (2017).Miller AC, Foti NJ, Adams RP. Variational boosting: iteratively refining posterior approximations. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Sydney: PMLR; 2017. pp. 2420–2429. (Proceedings of Machine Learning Research). [Google Scholar]
  • Minin, Bloomquist & Suchard (2008).Minin VN, Bloomquist EW, Suchard MA. Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Molecular Biology and Evolution. 2008;25(7):1459–1471. doi: 10.1093/molbev/msn090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Pybus et al. (2001).Pybus OG, Charleston MA, Gupta S, Rambaut A, Holmes EC, Harvey PH. The epidemic behavior of the hepatitis C virus. Science. 2001;292(5525):2323–2325. doi: 10.1126/science.1058321. [DOI] [PubMed] [Google Scholar]
  • Ranganath, Gerrish & Blei (2014).Ranganath R, Gerrish S, Blei D. Black box variational inference. In: Kaski S, Corander J, editors. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics. Vol. 33. Reykjavik: PMLR; 2014. pp. 814–822. (Proceedings of Machine Learning Research). [Google Scholar]
  • Rezende & Mohamed (2015).Rezende DJ, Mohamed S. Variational inference with normalizing flows. Proceedings of the 32nd International Conference on International Conference on Machine Learning; 6-11 July; Lille, France. 2015. pp. 1530–1538. [Google Scholar]
  • Roberts & Rosenthal (2009).Roberts GO, Rosenthal JS. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics. 2009;18(2):349–367. doi: 10.1198/jcgs.2009.06134. [DOI] [Google Scholar]
  • Ronquist et al. (2012).Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology. 2012;61(3):539–542. doi: 10.1093/sysbio/sys029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Sagulenko, Puller & Neher (2018).Sagulenko P, Puller V, Neher RA. TreeTime: maximum-likelihood phylodynamic analysis. Virus Evolution. 2018;4(1):vex042. doi: 10.1093/ve/vex042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Salvatier, Wiecki & Fonnesbeck (2016).Salvatier J, Wiecki TV, Fonnesbeck C. Probabilistic programming in python using PyMC3. PeerJ Computer Science. 2016;2(2):e55. doi: 10.7717/peerj-cs.55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Stan Development Team (2019).Stan Development Team Pystan: the python interface to Stan, version 2.19.0.0. 2019. https://github.com/stan-dev/pystan https://github.com/stan-dev/pystan
  • Suchard et al. (2018).Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution. 2018;4(1):vey016. doi: 10.1093/ve/vey016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Tran, Blei & Airoldi (2015).Tran D, Blei D, Airoldi EM. Copula variational inference. Proceedings of the 28th International Conference on Neural Information Processing Systems; 7–12 December; Montreal, Canada. Cambridge: MIT Press; 2015. pp. 3564–3572. [Google Scholar]
  • Tran et al. (2016).Tran D, Kucukelbir A, Dieng AB, Rudolph M, Liang D, Blei DM. Edward: a library for probabilistic modeling, inference, and criticism. 2016. http://arxiv.org/abs/1610.09787 http://arxiv.org/abs/1610.09787
  • Wainwright & Jordan (2008).Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning. 2008;1(1–2):1–305. [Google Scholar]
  • Wang, Bouchard-Côté & Doucet (2015).Wang L, Bouchard-Côté A, Doucet A. Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. Journal of the American Statistical Association. 2015;110(512):1362–1374. doi: 10.1080/01621459.2015.1054487. [DOI] [Google Scholar]
  • Whidden et al. (2018).Whidden C, Claywell B, Fisher T, Magee A, Fourment M, Matsen FA., IV Systematic exploration of the high likelihood density set of phylogenetic trees. 2018. http://arxiv.org/abs/1811.11007. http://arxiv.org/abs/1811.11007 [DOI] [PMC free article] [PubMed]
  • Whidden & Matsen (2015).Whidden C, Matsen FA., IV Quantifying MCMC exploration of phylogenetic tree space. Systematic Biology. 2015;64(3):472–491. doi: 10.1093/sysbio/syv006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Yang (1994).Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution. 1994;39(3):306–314. doi: 10.1007/BF00160154. [DOI] [PubMed] [Google Scholar]
  • Yang (1996).Yang Z. Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology & Evolution. 1996;11(9):367–372. doi: 10.1016/0169-5347(96)10041-0. [DOI] [PubMed] [Google Scholar]
  • Yang (2007).Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution. 2007;24(8):1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
  • Yao et al. (2018).Yao Y, Vehtari A, Simpson D, Gelman A. Yes, but did it work? Evaluating variational inference. In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. Vol. 80. Stockholm: PMLR; 2018. pp. 5581–5590. [Google Scholar]
  • Zhang & Matsen (2018).Zhang C, Matsen FA., IV . Generalizing tree probability estimation via bayesian networks. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems. Vol. 31. Red Hook: Curran Associates; 2018. pp. 1444–1453. [Google Scholar]
  • Zhang & Matsen (2019).Zhang C, Matsen FA., IV Variational Bayesian phylogenetic inference. International Conference on Learning Representations, 6–9 May; New Orleans, LA, USA. 2019. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Information 1. Posterior medians (solid black lines) of effective population sizes and associated 95% credible intervals (blue shaded areas) for the HCV dataset using the mean-field variational distribution in phylostan.

10 Independent replicates are shown.

DOI: 10.7717/peerj.8272/supp-1
Supplemental Information 2. Posterior medians (solid black lines) of effective population sizes and associated 95\% credible intervals (blue shaded areas) for the HCV dataset using the full-rank variational distribution in phylostan.

10 Independent replicates are shown.

DOI: 10.7717/peerj.8272/supp-2

Data Availability Statement

The following information was supplied regarding data availability:

Data is available from GitHub: https://github.com/4ment/phylostan.


Articles from PeerJ are provided here courtesy of PeerJ, Inc

RESOURCES