Abstract
Scientific studies in many areas of biology routinely employ evolutionary analyses based on inference of phylogenetic trees from molecular sequence data. Evolutionary processes that act at the molecular level are highly variable, and properly accounting for heterogeneity is crucial for more accurate phylogenetic inference. Nucleotide substitution rates and patterns are known to vary among sites in multiple sequence alignments, and such variation can be modeled by partitioning alignments into categories corresponding to different substitution models. Determining a priori appropriate partitions can be difficult, however, and better model fit can be achieved through flexible Bayesian infinite mixture models that simultaneously infer the number of partitions, the partition that each site belongs to, and the evolutionary parameters corresponding to each partition. Here, we consider several different types of infinite mixture models, including classic Dirichlet process mixtures, as well as novel approaches for modeling across-site evolutionary variation: hierarchical models for data with a natural group structure, and infinite hidden Markov models that account for spatial patterns in alignments. In analyses of several viral data sets, we find that different types of models perform best in different scenarios, but infinite hidden Markov models emerge as particularly promising for larger data sets and complex evolutionary patterns characterized by multiple genes and overlapping reading frames. To enable these models to scale to large data sets, we adapt efficient Markov chain Monte Carlo algorithms and exploit opportunities for parallel computing. We implement this infinite mixture modeling framework in BEAST X, a widely-used software package for phylogenetic inference.
Keywords: phylogenetics, Bayesian statistics, Bayesian nonparametrics, substitution models, across-site variation, model selection, virus evolution
Introduction
Probabilistic phylogenetic inference requires statistical models for molecular sequence evolution. Evolutionary processes are typically described using Markov models for the substitution of discrete molecular characters, such as DNA bases (Felsenstein 2004). Each observed molecular sequence can be thought of as corresponding to the tip of an unobserved phylogenetic tree and produced by a Markov model that starts at the root of the tree and proceeds down its branches. Calculation of the observed data likelihood under such a model forms the basis for both maximum likelihood and Bayesian phylogenetic inference (Felsenstein 1981).
Evolutionary processes are known to be highly variable (Yang 2006), and evolutionary modeling has been gradually refined from early, restrictive approaches to better account for such variability and in turn enable more accurate phylogenetic inferences. The simplest nucleotide substitution model assumes equal DNA base equilibrium frequencies as well as equal relative exchange rates between bases, with an overall substitution rate being the only model parameter (Jukes and Cantor 1969). Numerous extensions of this model have been developed to enable variation in relative exchange rates (Kimura 1980), equilibrium frequencies (Felsenstein 1981), and both (Lanave et al. 1984; Hasegawa et al. 1985; Tamura 1992; Tamura and Nei 1993). While early evolutionary modeling approaches assumed that such substitution models remained constant across the branches of the phylogenetic tree and the sites of the multiple sequence alignment, these restrictions have since also been relaxed in various ways (Yang 2006).
Different alignment sites in molecular sequences have different functional and structural roles and are subject to different selective pressures and thus do not necessarily evolve in the same ways (Yang 2006). Modeling advances for across-site variation initially focused on allowing the overall substitution rates of substitution models to vary along alignment sites. However, alignment sites may also evolve in qualitatively different ways. That is, the substitution pattern, which is characterized by the relative exchange rates of molecular characters in a substitution model, can also vary. A widely used method for modeling variation of substitution rates and/or relative character exchange rates is to model the parameters as random variables that are drawn from a common distribution for all sites (Nei et al. 1976; Golding 1983; Yang 1993, 1994; Waddell and Steel 1997; Huelsenbeck and Nielsen 1999). A common alternative approach is to partition alignment sites into a fixed number of different categories and estimate substitution rates and/or relative character exchange rates independently for each category. While it is possible to treat each site as belonging to its own category (Bruno 1996; Swofford et al. 1996; Nielsen 1997), this can lead to overfitting. More commonly used partitioning schemes are biologically informed. For example, sites can be partitioned by gene, by stems and loops in ribosomal data, or by codon position for protein coding sequences (Pagel and Meade 2004). However, it may not be clear how to best partition a given data set a priori, and moreover, any fixed partition would fail to account for partition uncertainty. There may be some sites that appear to clearly belong to one of the fixed partitions while other sites may be more appropriately modeled as belonging to different partitions with different probabilities.
In the face of such challenges, finite mixture models have emerged as popular and effective approaches (Yang 1994; Huelsenbeck and Nielsen 1999; Pagel and Meade 2004; Venditti et al. 2008).
A major limitation of finite mixture models is the necessity to fix an a priori number of categories. Researchers have overcome this obstacle through the development of Bayesian infinite mixture models that treat the number of categories as an unknown and unbounded parameter (Lartillot and Philippe 2004; Huelsenbeck and Suchard 2007; Wu et al. 2013). Notably, by inferring the number of categories for an evolutionary mixture model along with all other model parameters, these approaches enable the data to determine the number of alignment partitions that best captures the heterogeneity of the evolutionary process that generated the data. Infinite mixture models for evolutionary heterogeneity have thus far relied on Dirichlet processes (Ferguson 1973) to specify prior probabilities on a space of discrete distributions that can generate site-to-category assignments and evolutionary parameters corresponding to each category. This prior specification affords the flexibility to adequately model evolutionary variation while also providing a built-in mechanism that guards against overfitting (Ghahramani 2013), and Dirichlet process mixture models have been shown to routinely achieve better model fit than standard models for across-site evolutionary variation (Huelsenbeck and Suchard 2007; Wu et al. 2013). However, Dirichlet processes have limitations that may hamper their effectiveness at modeling across-site variation in certain scenarios. Specifically, Dirichlet process mixture models do not model any spatial dependence in evolutionary parameters, which may be expected, for example, for evolutionary rates at adjacent sites. To model spatially correlated patterns of evolutionary rates along multiple sequence alignments, Yang (1995) and Felsenstein and Churchill (1996) use hidden Markov models to specify finite mixture models.
Here, we build on the success of Dirichlet process mixtures in modeling across-site variation by exploiting methods from Bayesian nonparametrics that can overcome some of the Dirichlet process’ limitations. To model spatial patterns along alignments, we employ an infinite hidden Markov model (Beal et al. 2002) that expands the hidden Markov model to a countably infinite state space. We also use a hierarchical Dirichlet process framework (Teh et al. 2006) that posits different Dirichlet processes for different groups of sites while linking the separate Dirichlet processes to pool information and share statistical strength. We compare the performance of the resulting infinite mixture models as well as standard approaches for modeling across-site evolutionary variation through analyses of respiratory syncytial virus subgroup A data, hepatitis C virus subtype 4 data, rabies virus data, and a complete genome hepatitis B virus data set. We find that infinite mixture models emerge as a clearly preferable alternative to standard approaches. While the best performing infinite mixture model varies, with each of the three types of infinite mixtures yielding the best model fit in at least one scenario, infinite hidden Markov models stand out as being especially powerful. In particular, mixture models based on infinite hidden Markov models outperform other models by large margins in the majority of scenarios, especially those featuring larger data sets with more complex evolutionary patterns characterized by multiple genes and overlapping reading frames.
Materials and Methods
Consider a multiple sequence alignment of molecular sequence data $\mathbf{Y}$. Let n denote the number of sequences and s the number of sites, and let $\mathbf{Y}_i$ denote the observed molecular sequence characters at alignment site i. We assume that the data are generated by continuous-time Markov chains that act independently at the different alignment sites. Each Markov chain begins at the root node of an unobserved phylogenetic tree τ and acts independently along its lineages to ultimately produce the observed molecular sequence data at its external nodes. The Markov chain for site i is characterized by evolutionary parameters $\theta_i = (\mu_i, \mathbf{Q}_i)$. Here, $\mathbf{Q}_i$ is a matrix specifying the relative exchange rates of molecular characters. The matrix $\mathbf{Q}_i$ is normalized so that its expected substitution rate is 1, the parameter $\mu_i$ is the overall substitution rate, and the product $\mu_i \mathbf{Q}_i$ is the Markov chain’s infinitesimal rate matrix.
The parameters that characterize the rate matrix $\mathbf{Q}_i$ depend on the substitution model. Most popular substitution models parameterize $\mathbf{Q}_i$ to ensure time-reversibility of the Markov chain. Although there is no biological reason to believe that the substitution process should be time-reversible, the computational convenience of time-reversible models has led to their widespread use (Yang 2006). Time-reversible nucleotide substitution models range in complexity from the simple Jukes–Cantor model (Jukes and Cantor 1969), which features the overall substitution rate as its only parameter, to the most flexible time-reversible model, which is known as the general time-reversible (GTR) model (Lanave et al. 1984; Tavare 1986) and features a matrix $\mathbf{Q}_i$ that is specified with four nucleotide base equilibrium frequencies and six additional relative rate parameters. Bridging the gap between the Jukes–Cantor model and GTR model are several other popular time-reversible models that can be obtained by making various assumptions about the substitution process. For example, the HKY model (Hasegawa et al. 1985) posits a bias between the rates of transitions (substitutions between two purines, bases A and G, or between two pyrimidines, bases T and C) and transversions (all other types of substitutions). In addition to the transition/transversion ratio κ, the parameters also include nucleotide base equilibrium frequencies $\boldsymbol{\pi} = (\pi_A, \pi_C, \pi_G, \pi_T)$.
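As an illustration of how such a rate matrix can be assembled, the following minimal sketch builds an HKY matrix from a transition/transversion ratio and equilibrium frequencies, normalizes it to an expected substitution rate of 1, and scales it by an overall rate. The function names and the base ordering A, C, G, T are assumptions made for this example and do not reflect the BEAST X implementation.

```python
BASES = "ACGT"          # assumed base ordering for this sketch
PURINES = {"A", "G"}


def is_transition(a, b):
    """True when a and b are both purines (A, G) or both pyrimidines (C, T)."""
    return (a in PURINES) == (b in PURINES)


def hky_rate_matrix(kappa, freqs):
    """Return a 4x4 HKY matrix, normalized to an expected substitution rate of 1."""
    size = len(BASES)
    q = [[0.0] * size for _ in range(size)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                # Off-diagonal rate: kappa for transitions, 1 for transversions,
                # multiplied by the equilibrium frequency of the target base.
                q[i][j] = (kappa if is_transition(a, b) else 1.0) * freqs[j]
        # The diagonal is still 0 here, so the row sum equals the off-diagonal sum.
        q[i][i] = -sum(q[i])
    # Expected rate at equilibrium is -sum_i pi_i * q_ii.
    scale = -sum(freqs[i] * q[i][i] for i in range(size))
    return [[x / scale for x in row] for row in q]


def infinitesimal_rate_matrix(mu, kappa, freqs):
    """Scale the normalized matrix by the overall substitution rate mu."""
    return [[mu * x for x in row] for row in hky_rate_matrix(kappa, freqs)]
```

Setting kappa to 1 with equal frequencies recovers the Jukes–Cantor pattern, in which every off-diagonal entry of the normalized matrix equals 1/3.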
Under these modeling assumptions, the full data likelihood can be expressed as
$$P(\mathbf{Y} \mid \tau, \boldsymbol{\theta}) = \prod_{i=1}^{s} P(\mathbf{Y}_i \mid \tau, \theta_i),$$
where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_s)$. Allowing the evolutionary parameters $\theta_1, \dots, \theta_s$ to all assume distinct values would lead to overfitting. Instead, we want multiple $\theta_i$ to be able to take on the same value. We think of each distinct value assumed by at least one $\theta_i$ as determining an evolutionary category, where the number of categories as well as the assignment of alignment sites to categories depend on the model for across-site evolutionary variation. Suppose there are K distinct evolutionary categories with corresponding parameters $\phi_1, \dots, \phi_K$, and let $z_i$ denote the evolutionary category corresponding to alignment site i for $i = 1, \dots, s$. Thus, $\theta_i = \phi_{z_i}$. To account for uncertainty in the number of evolutionary categories and in the partitioning of alignment sites into evolutionary categories, we model K and $\mathbf{z}$ as random variables. This leads to the following Bayesian evolutionary model:
$$P(\tau, \boldsymbol{\phi}, \mathbf{z}, K \mid \mathbf{Y}) \propto \left[ \prod_{i=1}^{s} P(\mathbf{Y}_i \mid \tau, \phi_{z_i}) \right] P(\boldsymbol{\phi}, \mathbf{z}, K) \, P(\tau),$$
where $\mathbf{z} = (z_1, \dots, z_s)$ and $\boldsymbol{\phi} = (\phi_1, \dots, \phi_K)$. To specify a prior distribution for $(\boldsymbol{\phi}, \mathbf{z})$ and K, it suffices to specify a prior distribution for the $\theta_i$. We desire a prior distribution G for the $\theta_i$ that is discrete and has a countably infinite support. The discreteness will allow the number of distinct values K to be strictly less than s. While K cannot be greater than the number of alignment sites s, the countably infinite support of G will allow K to assume any value between 1 and s without having to make any adjustments for data sets with different numbers of sites. Because the number of evolutionary categories K in this framework is not fixed, it is theoretically unbounded and is said to yield an infinite mixture model. Rather than fix the distribution G, we model its uncertainty by specifying a prior distribution on G itself. The subfield of Bayesian inference that focuses on such models without a fixed number of model parameters has come to be known as Bayesian nonparametrics. We consider three different kinds of Bayesian nonparametric prior distributions for G: Dirichlet processes, hierarchical Dirichlet processes, and infinite hidden Markov models.
Dirichlet Processes
Dirichlet processes (Ferguson 1973) are widely used to specify infinite mixture models (Antoniak 1974). A Dirichlet process defines a distribution for a random probability measure. Consider a positive scalar α known as the concentration parameter and a probability distribution $G_0$ known as the base distribution. We can specify a Dirichlet process in terms of the concentration parameter and base distribution, denoted $\mathrm{DP}(\alpha, G_0)$, as follows. Let $\phi_1, \phi_2, \dots$ be a sequence of independent random variables distributed according to $G_0$, and let $v_1, v_2, \dots$ be a sequence of independent random variables that follow a $\mathrm{Beta}(1, \alpha)$ distribution. Define the random variables $w_1, w_2, \dots$ by
$$w_k = v_k \prod_{l=1}^{k-1} (1 - v_l)$$
for $k = 1, 2, \dots$, and let $\delta_{\phi}$ denote the Dirac measure, where $\delta_{\phi}(A) = 1$ if the set A contains $\phi$ and $\delta_{\phi}(A) = 0$ otherwise. Then the random probability measure
$$G = \sum_{k=1}^{\infty} w_k \, \delta_{\phi_k}$$
is distributed according to $\mathrm{DP}(\alpha, G_0)$ (Sethuraman 1994). Note that the measure is random because the $w_k$ and $\phi_k$ are random variables rather than fixed values. From this construction, it is clear that a draw from a Dirichlet process will be a discrete distribution with a countably infinite support consisting of the atoms $\phi_1, \phi_2, \dots$. Each atom $\phi_k$ is associated with a weight $w_k$, and $\sum_{k=1}^{\infty} w_k = 1$ with probability 1. The concentration parameter α determines the level of discretization: as α becomes smaller, draws from a Dirichlet process will have mass increasingly concentrated among a smaller number of atoms, while as α tends to $\infty$, the draws will be closer to continuous distributions. The value of α can be fixed beforehand, or it can be treated as a random variable and inferred from the data along with all other model parameters.
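The stick-breaking construction can be sketched in a few lines. The example below draws a truncated approximation with a finite number of atoms and reports the leftover mass that the truncation discards. The function name, the truncation level, and the `base_sampler` callable that stands in for the base distribution are illustrative assumptions, not part of the method described here.

```python
import random


def stick_breaking(alpha, base_sampler, num_atoms, rng):
    """Truncated stick-breaking draw: returns (weights, atoms, leftover mass)."""
    weights, atoms = [], []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, alpha)   # independent Beta(1, alpha) proportion
        weights.append(v * remaining)     # weight = v_k * prod_{l<k} (1 - v_l)
        atoms.append(base_sampler(rng))   # atom drawn from the base distribution
        remaining *= 1.0 - v
    return weights, atoms, remaining


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical base distribution: a standard normal on a scalar parameter.
    w, phi, rest = stick_breaking(1.0, lambda r: r.gauss(0.0, 1.0), 25, rng)
    print(len(w), sum(w) + rest)
```

With a small concentration parameter most of the mass falls on the first few atoms, whereas a large value spreads it over many atoms, which mirrors the discretization behavior described above.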
An alternative perspective on Dirichlet processes can be obtained by considering a conditional distribution characterizing a sequence of independent and identically distributed draws $\theta_1, \theta_2, \dots$ from G, where $G \sim \mathrm{DP}(\alpha, G_0)$. Blackwell and MacQueen (1973) show that G can be integrated out to obtain
$$\theta_n \mid \theta_1, \dots, \theta_{n-1}, \alpha, G_0 \sim \sum_{k=1}^{K} \frac{m_k}{n - 1 + \alpha} \, \delta_{\phi_k} + \frac{\alpha}{n - 1 + \alpha} \, G_0.$$
Here, $\phi_1, \dots, \phi_K$ are the distinct values among $\theta_1, \dots, \theta_{n-1}$, and $m_k$ denotes the number of $\theta_i$, where $1 \le i \le n - 1$, such that $\theta_i = \phi_k$. This conditional distribution can be understood through the metaphor of a Chinese restaurant, which emerged from the underlying distribution on partitions that became known as the Chinese restaurant process (Aldous 1985). In the metaphor, customers sequentially enter a Chinese restaurant with an infinite number of tables, each serving a unique dish. Each customer i corresponds to a $\theta_i$ and each dish (and table) k corresponds to a $\phi_k$. The nth customer sits at an occupied table k with probability proportional to the number of customers already sitting at the table, and sits at an unoccupied table with probability proportional to α. The “dishes” can be thought of as being independently sampled from the “menu” distribution $G_0$.
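The seating rule of the Chinese restaurant process is straightforward to simulate. The sketch below tracks only table assignments and table sizes, and omits the dishes, which would be drawn independently from the base distribution for each new table. The function name and return format are assumptions for this illustration.

```python
import random


def chinese_restaurant_process(num_customers, alpha, rng):
    """Seat customers one by one; return (table per customer, customers per table)."""
    assignments, counts = [], []
    for _ in range(num_customers):
        # Occupied table k has weight counts[k]; a new table has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[k] += 1        # join an occupied table
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    rng = random.Random(0)
    tables, sizes = chinese_restaurant_process(100, 1.0, rng)
    print(len(sizes), sizes)
```

In the across-site setting, customers play the role of alignment sites and tables play the role of evolutionary categories, so the number of occupied tables corresponds to K.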
Hierarchical Dirichlet Processes
Many standard approaches for modeling variability in evolutionary processes partition multiple sequence alignment sites into different groups on the basis that the sites in each group are believed to exhibit similar evolutionary dynamics. From the perspective of infinite mixture models, it is then natural to wonder if, rather than assuming that evolutionary parameter values for all sites are distributed according to draws from a single Dirichlet process, different groups of sites are better represented by different Dirichlet processes. While it is possible to model different groups with independent mixture models, it is appealing to consider a hierarchical Bayesian framework that allows for variation of the model between groups while still sharing information across groups.
Teh et al. (2006) propose a hierarchical Dirichlet process with a random probability measure $G_j$ for each group j, where each $G_j$ is distributed according to a Dirichlet process defined by concentration parameter α and base distribution $G_0$. The base distribution $G_0$ is itself distributed according to a Dirichlet process, characterized by concentration parameter γ and base distribution H. Because the $G_j$ all inherit their sets of atoms from the same discrete base distribution $G_0$, they all share the same atoms. The $G_j$ differ from one another in the weights that they associate with the atoms. To summarize, if we divide the alignment sites into J groups, where group j has $s_j$ sites and $\theta_{j1}, \dots, \theta_{j s_j}$ are the evolutionary parameters associated with them, we have
$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H),$$
$$G_j \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \quad j = 1, \dots, J,$$
$$\theta_{ji} \mid G_j \sim G_j, \quad i = 1, \dots, s_j.$$
Teh et al. (2006) extend the Chinese restaurant metaphor for Dirichlet processes to a Chinese restaurant franchise for hierarchical Dirichlet processes. The restaurants in the franchise share a common menu of dishes, each table in each restaurant serves one dish, and multiple tables at multiple restaurants can feature the same dish. Each restaurant corresponds to a group j, and customer i in restaurant j corresponds to $\theta_{ji}$. Each unique dish k corresponds to a $\phi_k$ and is drawn from the franchise-wide menu distribution H. We represent the dish served at table t of restaurant j by a new variable, $\psi_{jt}$. Thus, each $\theta_{ji}$ is associated with one $\psi_{jt}$, and each $\psi_{jt}$ is associated with one $\phi_k$. The number of customers in restaurant j at table t who are being served dish k is denoted $n_{jtk}$, and the number of tables in restaurant j serving dish k is denoted $m_{jk}$. While $n_{jtk}$ is in fact completely determined by j and t, it has become customary to include the subscript k to facilitate the expression of marginal counts, where a dot in a subscript represents summation over the corresponding index. For instance, $n_{j \cdot k}$ is the number of customers in restaurant j eating dish k, and $m_{j \cdot}$ is the number of occupied tables in restaurant j.
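The dynamics of the Chinese restaurant franchise can be illustrated by integrating out the random measures $G_j$ to obtain
$$\theta_{ji} \mid \theta_{j1}, \dots, \theta_{j,i-1}, \alpha, G_0 \sim \sum_{t=1}^{m_{j \cdot}} \frac{n_{jt \cdot}}{i - 1 + \alpha} \, \delta_{\psi_{jt}} + \frac{\alpha}{i - 1 + \alpha} \, G_0,$$
and integrating out $G_0$ to arrive at
$$\psi_{jt} \mid \psi_{11}, \psi_{12}, \dots, \psi_{21}, \dots, \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot \cdot} + \gamma} \, \delta_{\phi_k} + \frac{\gamma}{m_{\cdot \cdot} + \gamma} \, H.$$
Thus, a new customer enters restaurant j and sits at occupied table t serving dish $\psi_{jt}$ with probability proportional to $n_{jt \cdot}$, and the customer sits at an unoccupied table with probability proportional to α, in which case a dish for the table is needed. The dish for the newly occupied table is equal to a dish k that is already being served at at least one table in the franchise with probability proportional to $m_{\cdot k}$, and it is equal to a new dish, drawn from menu H, with probability proportional to γ.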
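The two-level seating scheme of the Chinese restaurant franchise can be simulated directly. The following sketch tracks dish indices only, leaving out the draws of dish values from the menu distribution H, and returns the dish eaten by every customer together with the franchise-wide table counts per dish. The function name, argument names, and return format are assumptions for this illustration.

```python
import random


def chinese_restaurant_franchise(group_sizes, alpha, gamma, rng):
    """Return (dish index per customer for each restaurant, tables per dish)."""
    dish_tables = []   # number of tables across the franchise serving each dish
    dishes = []        # dishes[j][i] = dish eaten by customer i in restaurant j
    for size in group_sizes:
        table_counts = []   # customers at each table of this restaurant
        table_dish = []     # dish served at each table of this restaurant
        customer_dish = []
        for _ in range(size):
            # Occupied table t has weight table_counts[t]; a new table has weight alpha.
            seat_weights = table_counts + [alpha]
            t = rng.choices(range(len(seat_weights)), weights=seat_weights)[0]
            if t == len(table_counts):
                # New table: existing dish k has weight dish_tables[k]; new dish has weight gamma.
                dish_weights = dish_tables + [gamma]
                k = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
                if k == len(dish_tables):
                    dish_tables.append(0)
                dish_tables[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            customer_dish.append(table_dish[t])
        dishes.append(customer_dish)
    return dishes, dish_tables


if __name__ == "__main__":
    rng = random.Random(0)
    per_group, tables_per_dish = chinese_restaurant_franchise([30, 30, 30], 1.0, 1.0, rng)
    print(tables_per_dish)
```

Because dishes are chosen from counts shared by all restaurants, the same evolutionary category can appear in several groups of sites, while each group keeps its own table occupancy and therefore its own category weights.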
Infinite Hidden Markov Models
Hidden Markov models (Baum and Petrie 1966) offer an alternative partitioning strategy that can account for spatial patterns along alignments. Starting at one end of the alignment, site-specific evolutionary parameters can be thought of as being generated sequentially according to a Markov chain with a finite state space (Felsenstein and Churchill 1996). Thus, the evolutionary category for a specific site depends on the evolutionary category assumed by the preceding site. In particular, for any site i, we have
$$P(z_i \mid z_1, \dots, z_{i-1}) = P(z_i \mid z_{i-1})$$
and
$$\theta_i = \phi_{z_i}.$$
To overcome the restriction of having to specify the dimension of the state space beforehand, Beal et al. (2002) introduce an infinite hidden Markov model with a countably infinite state space. Teh et al. (2006) show that an infinite hidden Markov model can in fact be achieved through an extension of the hierarchical Dirichlet process framework. The key difference is that rather than a fixed number of “groups” with the division of sites into groups determined beforehand, there is an unbounded number of groups, and the group membership of a given site is determined by the evolutionary category of the preceding site.
As in the hierarchical Dirichlet process, we have a collection of random probability measures that are distributed according to the same Dirichlet process, whose underlying base distribution is itself distributed according to a Dirichlet process. In particular, to each evolutionary category k, we associate a random probability measure $G_k$ where
$$G_k \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0)$$
and
$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H).$$
Then the Markovian nature of the process is captured by
$$\theta_i \mid \theta_{i-1}, z_{i-1} \sim G_{z_{i-1}},$$
where, as before, $z_i$ denotes the evolutionary category of site i. As in the hierarchical Dirichlet process, the common discrete base distribution $G_0$ ensures that the random measures $G_k$ share the same atoms, which ensures that any state (i.e. evolutionary category) can be reached from any other state.
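A path of evolutionary categories along an alignment can be simulated from this prior by reusing the Chinese restaurant franchise seating rules, with the restaurant for each site chosen by the category of the preceding site. The sketch below is illustrative only: the function name is hypothetical, and the treatment of the first site, which is seated in a dedicated "start" restaurant keyed by `None`, is an assumption of this example rather than something specified in the text.

```python
import random


def sample_ihmm_categories(num_sites, alpha, gamma, rng):
    """Sample category labels z_1..z_s under a franchise view of an infinite HMM."""
    dish_tables = []   # tables across all restaurants serving each category
    restaurants = {}   # previous category -> (customers per table, category per table)
    path = []
    prev = None        # assumed start state for the first site
    for _ in range(num_sites):
        table_counts, table_dish = restaurants.setdefault(prev, ([], []))
        # Seat the site in the restaurant indexed by the preceding category.
        seat_weights = table_counts + [alpha]
        t = rng.choices(range(len(seat_weights)), weights=seat_weights)[0]
        if t == len(table_counts):
            # New table: reuse a category in proportion to its table count, or create one.
            dish_weights = dish_tables + [gamma]
            k = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
            if k == len(dish_tables):
                dish_tables.append(0)
            dish_tables[k] += 1
            table_counts.append(0)
            table_dish.append(k)
        table_counts[t] += 1
        prev = table_dish[t]
        path.append(prev)
    return path


if __name__ == "__main__":
    print(sample_ihmm_categories(40, 1.0, 1.0, random.Random(0)))
```

Transitions that have been used often accumulate customers in the corresponding restaurant, so simulated paths tend to revisit familiar category-to-category transitions along the alignment.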
Posterior Inference
We implement our infinite mixture model framework in the BEAST X (v10.5.0) software package for Bayesian evolutionary inference (Suchard et al. 2018). Our framework is currently implemented in the development branch, available at https://github.com/beast-dev/beast-mcmc/, and will be included in the next beta release of BEAST X (BEAST X v10.5.0-beta6, 2025). Example XML input files and R (R Core Team 2021) code for summarizing output are available at https://github.com/mandevgill/infinitemixturemodels, and details about output log files are presented in the Supplementary material. Importantly, implementation of our framework in BEAST X enables our framework for across-site variation to be employed in the wide range of phylogenetic and phylodynamic inference models that can be specified and run by BEAST X. We generate samples from the posterior distribution through Markov chain Monte Carlo (MCMC) simulation (Metropolis et al. 1953; Hastings 1970). Standard MCMC methods for infinite mixture models can be hampered by slow mixing and high computational burden. To enable our model to scale efficiently to large genomic data sets that have become commonplace with advances in sequencing technology, we adapt a cost-effective “data squashing” MCMC sampling strategy put forth by Guha (2010) to update the site-to-category assignments. This approach can be applied to infinite mixture models based on a wide class of Bayesian nonparametric prior distributions, including all priors that we consider in this study. We outline the main ideas of the sampling scheme here and refer to Guha (2010) for further details.
The strategy of the algorithm is to simultaneously propose Metropolis-Hastings updates for the evolutionary category assignment variables $z_j$ for a group of alignment sites j that have similar full conditional distributions for the $z_j$ at the current iteration. By working with sites that have category assignment variables with approximately identically distributed full conditionals, candidate category assignments can be jointly generated in an efficient manner by simply working with one representative member of the group of sites. For the current iteration t, suppose there are K evolutionary categories with distinct evolutionary parameter values $\phi_1, \dots, \phi_K$. Rather than compute the full conditional distribution for each $z_j$, we adopt simpler mass functions $q_j$ that approximate the full conditionals and are less computationally expensive. In the case of a Dirichlet process mixture, for example, we define $q_j(k) \propto m_k \, P(\mathbf{Y}_j \mid \tau, \phi_k)$ for $k = 1, \dots, K$, and $q_j(k) = 0$ otherwise, where $m_k$ denotes the number of sites with parameter values equal to $\phi_k$. Let i be a randomly chosen site. To compare the mass functions for sites i and j, $j \neq i$, we can compute a difference measure $\Delta_{ij}$ such as the squared Hellinger distance (Yang and Le Cam 2000), in which case $\Delta_{ij} = 1 - \sum_{k=1}^{K} \sqrt{q_i(k) \, q_j(k)}$. Next, we specify a set of sites D for which we will jointly propose category assignment updates. We form the set by including site i along with sites with mass functions most similar to $q_i$ until D has the desired size. In our analyses, we let D assume different sizes in different iterations, ranging between 1 site and approximately 10% of the total number of alignment sites. To accommodate potential new evolutionary categories, we augment the state space with auxiliary parameters, as in the Gibbs sampling procedure introduced by Neal (2000) for nonconjugate Dirichlet process mixture models. Next, we construct a proposal distribution for $z_i$ by approximating the conditional distribution of $z_i$ given the data and model parameters that correspond to sites that are not in D.
Finally, we generate candidate category assignments for all sites in D as independent and identically distributed realizations from the aforementioned proposal distribution for $z_i$. This set of candidate category assignments is then jointly accepted or rejected according to the corresponding Metropolis-Hastings ratio.
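The grouping step of this sampling scheme can be sketched as follows. The example computes approximate mass functions from category sizes and per-category site likelihoods, measures dissimilarity with the squared Hellinger distance in its one-minus-Bhattacharyya-coefficient form, and selects the sites closest to a chosen site. All function names are hypothetical, the likelihood values are assumed to be supplied by the caller, and the construction of the proposal distribution and the Metropolis-Hastings acceptance step are not shown.

```python
import math


def approximate_mass(site_likelihoods, category_sizes):
    """Mass function with q(k) proportional to m_k times the site likelihood under category k."""
    weights = [m * lik for m, lik in zip(category_sizes, site_likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


def squared_hellinger(p, q):
    """Squared Hellinger distance, computed as 1 minus the Bhattacharyya coefficient."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return max(0.0, 1.0 - coefficient)  # clamp tiny negative rounding errors


def select_update_set(masses, i, size):
    """Return site i followed by the size - 1 sites whose mass functions are closest to it."""
    others = sorted(
        (j for j in range(len(masses)) if j != i),
        key=lambda j: squared_hellinger(masses[i], masses[j]),
    )
    return [i] + others[: size - 1]


if __name__ == "__main__":
    masses = [[0.9, 0.1], [0.1, 0.9], [0.85, 0.15], [0.5, 0.5]]
    print(select_update_set(masses, 0, 3))
```

Sorting all other sites by distance costs O(s log s) per iteration in this sketch; a partial selection would suffice when the set D is small relative to the number of sites.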
Each iteration of this algorithm requires computationally expensive evaluation of the observed sequence data likelihoods $P(\mathbf{Y}_i \mid \tau, \phi_k)$ for alignment sites $i = 1, \dots, s$ and occupied and unoccupied evolutionary categories $k = 1, \dots, c$. Here, c can vary from one iteration to another. Fortunately, these computations are independent and can be performed in parallel. To this end, we construct an interface between the BEAST X implementation of our model and BEAGLE (Ayres et al. 2019), a high-performance library for parallel phylogenetic likelihood evaluation.
We propose updates for evolutionary model parameters as well as Dirichlet process concentration parameters and base distribution hyperparameters using standard Metropolis-Hastings transition kernels. We implement Dirichlet process mixtures using the Chinese restaurant process representation that integrates out the random measure G and thus do not need to generate parameters that characterize the weights of G. For hierarchical Dirichlet processes (and infinite hidden Markov models), however, Teh et al. (2006) note that starting with the Chinese restaurant franchise representation but explicitly instantiating the shared base distribution $G_0$ eases the implementation by enabling the posterior conditioned on $G_0$ to factor across groups. We follow the strategy of Teh et al. (2006) by implementing a Gibbs sampler to generate weights for $G_0$. This in turn necessitates the generation of “table count” variables $m_{jk}$, for which we also implement a Gibbs sampler. Finally, to propose updates to the phylogenetic tree and hyperparameters for the phylogenetic tree prior distribution, we employ transition kernels already available in BEAST X.
Results
We evaluate different methods for modeling across-site variation in analyses of four data sets. Because BEAST X is often used for phylodynamic analyses of measurably evolving pathogens, incorporating dated-tip molecular clock models, we here focus on viral data sets, but the methods are broadly applicable in evolutionary biology. The methods include infinite mixture models as well as several commonly used “standard” approaches that employ fixed partitions and/or finite mixture models. For each method of modeling across-site variation, we conduct analyses using two different commonly used DNA substitution models: the HKY model (Hasegawa et al. 1985), and the GTR model (Lanave et al. 1984; Tavare 1986). We use the GTR model because it is the most flexible among time-reversible nucleotide substitution models, and we use the HKY model because it is a widely-used time-reversible model with a moderate number of parameters, representing a middle ground in complexity between the simpler Jukes–Cantor model and the more parameter-rich GTR model. However, other substitution models could have been chosen.
We consider several standard approaches for modeling across-site evolutionary variation. First, we employ a restrictive model with one substitution rate and one set of relative character exchange parameters for all alignment sites. We refer to this model for across-site variability as the “No Variation” model, and we adopt the convention to refer to the overall evolutionary model in terms of the substitution model and the across-site variability model (for example, “HKY + No Variation” or “GTR + No Variation”). We relax the No Variation approach by allowing for substitution rate variation according to the popular finite mixture model proposed by Yang (1994) while maintaining one set of relative character exchange rate parameters for all alignment sites. The Yang (1994) model posits a fixed number of equally probable substitution rate categories, with each rate drawn from a discretized gamma distribution. We use five different rate categories and denote the across-site variability model by “Gamma.” Three of the four data sets we consider consist partially or entirely of protein coding regions without overlapping reading frames, and it is therefore natural to consider partitioning strategies for these data sets that allow the evolutionary process to vary according to codon position. We use “Codon” to denote an across-site variability model that partitions alignment sites as follows: sites from protein coding regions are categorized according to which of the three codon positions they correspond to, and sites from noncoding regions (if any) are assigned a separate category. Thus, there are three or four total categories (depending on whether the data come entirely or partially from protein coding regions), and the Codon model posits a separate substitution rate and set of relative character exchange rates for each category. 
As a more flexible alternative, we again partition sites as in the aforementioned Codon model and allow each partition to have its own set of relative character exchange rates, but we also allow substitution rate variation within each partition according to an independent Yang (1994) model with five rate categories. We call the resulting model for across-site variability the “Codon + Gamma” model.
For all evolutionary parameters, we use vague prior distributions that correspond to default options in the BEAUti software program for setting up data analyses to be performed by BEAST X (Suchard et al. 2018). In particular, equilibrium frequencies, instantaneous rate matrix parameters, and relative substitution rates for partitions under the Codon partitioning scheme are assigned uniform Dirichlet priors. HKY model transition/transversion ratios are a priori log-normal with mean 1.0 and standard deviation 1.25. Finally, the shape/rate parameters for the Yang (1994) discretized gamma model for across-site rate variation have an exponential prior distribution with mean 0.5.
In addition to the aforementioned standard modeling of across-site variation, we employ different infinite mixture models to account for uncertainty in the number of alignment partitions and the assignment of sites to different partitions. The type of infinite mixture model is determined by the Bayesian nonparametric prior we use for the evolutionary parameters, site-to-category assignments, and number of categories. We use a Dirichlet process prior (yielding the “DP” model for across-site variability) and an infinite hidden Markov model prior (giving rise to the “IHMM” model for across site variability). For data sets that feature protein coding regions without overlapping reading frames, we also use a hierarchical Dirichlet process prior with alignment sites divided into three or four groups according to the same partitioning scheme employed under the Codon model. We refer to the resulting model for across-site variability as the “HDP-Codon” model.
We do not have strong prior beliefs or information about the number of evolutionary categories, or about the clustering patterns of alignment sites. We therefore assume a priori that all infinite mixture model concentration parameters follow diffuse gamma distributions with shape and rate parameters both equal to 0.001. We compose base distributions from which evolutionary parameter values are drawn by specifying independent distributions for equilibrium frequencies, substitution rates and (as applicable) GTR instantaneous rate matrix relative rate parameters and the HKY transition/transversion ratio. We employ Dirichlet distributions for equilibrium frequencies and GTR relative rate parameters, and normal distributions for log-transformed substitution rates and transition/transversion ratios.
We adopt vague prior distributions for base distribution hyperparameters. In particular, we assume normal distribution means are a priori normally distributed with mean 0 and standard deviation 10, and normal distribution precisions are a priori gamma distributed with shape and rate equal to 0.001. We parameterize each Dirichlet prior distribution concentration parameter as the product of a scalar η, which is an overall dispersion parameter, and a vector that characterizes the relative values of the concentration parameter components. We assign this vector a uniform Dirichlet prior distribution, and on η we place a gamma prior distribution with shape and rate equal to 0.001. HKY transition/transversion rate base distributions must be handled with extra care, and they are the only base distributions whose hyperparameters we do not jointly infer from the data. HKY model Markov chain transition probabilities can converge to finite values as the transition/transversion rate tends to infinity, so the presence of alignment sites for which there is negligible support for transversions can lead to divergent estimates of transition/transversion rates and base distribution hyperparameters. To allow for large transition/transversion rates to accommodate such alignment sites while ensuring numerical stability, we specify vague base distributions with fixed hyperparameter values. To specify the hyperparameters, we take an empirical Bayes approach and adopt the estimated mean and ten times the estimated standard deviation of transition/transversion rate estimates from HKY + No Variation models.
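The two constructions described in this paragraph can be illustrated with a short Python sketch. It is not the BEAST X implementation: the function names, the example values, and the choice to compute the empirical Bayes mean and standard deviation on the log scale are all assumptions made for illustration. The sketch also assumes the relative-weight vector lies on the simplex, as a uniform Dirichlet prior on it would imply.

```python
import math
import random
import statistics


def dirichlet_concentration(eta, relative):
    """Return the concentration vector eta * relative.

    eta is the scalar dispersion; relative is assumed to lie on the
    simplex (positive entries summing to one).
    """
    if eta <= 0 or any(r <= 0 for r in relative):
        raise ValueError("eta and all relative weights must be positive")
    if abs(sum(relative) - 1.0) > 1e-9:
        raise ValueError("relative weights must sum to one")
    return [eta * r for r in relative]


def draw_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def empirical_bayes_log_kappa(kappa_estimates, inflation=10.0):
    """Fixed normal base distribution for log(kappa): (mean, inflated sd).

    Working on the log scale is an assumption of this sketch.
    """
    logs = [math.log(k) for k in kappa_estimates]
    return statistics.mean(logs), inflation * statistics.stdev(logs)


rng = random.Random(1)
alpha = dirichlet_concentration(eta=20.0, relative=[0.25, 0.25, 0.25, 0.25])
frequencies = draw_dirichlet(alpha, rng)

# Hypothetical transition/transversion estimates from a no-variation fit.
base_mean, base_sd = empirical_bayes_log_kappa([4.1, 4.5, 3.9, 4.3])
```

Separating the overall dispersion from the relative weights lets each receive its own vague prior, while the inflated standard deviation keeps the fixed transition/transversion base distribution wide enough to admit very large values.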
In all analyses, whether modeling across-site variation via standard approaches or infinite mixture models, we employ a strict molecular clock that assumes the evolutionary rate does not vary among phylogenetic tree branches (Kimura 1968). In the infinite mixture models, we wish to model absolute site-specific substitution rates, so we fix the strict molecular clock rate to 1.0. Under the standard approaches, across-site substitution rate variation is modeled in terms of relative rates, so the strict molecular clock rate is estimated from the data, and we assign it a vague exponential prior with mean 1.0. For the phylogenetic tree, we employ skygrid coalescent-based prior distributions that flexibly model the trajectories of the effective sizes of the populations from which the samples are taken as piecewise constant functions (Gill et al. 2013). The smoothness of skygrid effective population size trajectories is governed by a precision parameter, to which we assign a diffuse gamma prior with shape and rate equal to 0.001.
We use Tracer v1.7.2 (Rambaut et al. 2018) to assess MCMC convergence and mixing and ensure sufficient posterior samples for all analyses. For each infinite mixture model-based analysis, we run the analysis long enough to generate at least 100 million post-burn-in posterior samples (logging parameter values every 10,000 iterations) and achieve an effective sample size greater than 100 for the unnormalized joint posterior density. We summarize the computational performance in supplementary table S6, Supplementary Material online.
To summarize the clustering pattern inferred under a given infinite mixture model, we use the k-means clustering algorithm available in the R (R Core Team 2021) package MASS (Venables and Ripley 2002) to divide alignment sites into K different categories, where K is the posterior median estimate of the number of evolutionary categories. Each cluster analysis is applied to the site-specific posterior median and lower and upper posterior percentile estimates of all evolutionary model parameters.
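The clustering summary can be sketched in a few lines of Python. This is a plain Lloyd-style k-means written for illustration, not the R routine used in the analyses, and the site summaries below are made-up values. Each site is represented by a (lower percentile, median, upper percentile) triple for a single parameter, and the number of clusters stands in for the posterior median number of evolutionary categories. Categories are then relabeled by their median substitution rate, from lowest to highest.

```python
import random
import statistics


def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def order_by_median(labels, points, median_index=1):
    """Relabel clusters 0, 1, ... by increasing median of the chosen feature."""
    medians = {
        c: statistics.median(
            p[median_index] for p, lab in zip(points, labels) if lab == c)
        for c in set(labels)
    }
    rank = {c: i for i, c in enumerate(sorted(medians, key=medians.get))}
    return [rank[lab] for lab in labels]


# Hypothetical per-site (lower, median, upper) substitution rate summaries.
site_summaries = [
    (0.8, 1.0, 1.2), (0.9, 1.1, 1.3), (0.7, 0.9, 1.1),
    (4.5, 5.0, 5.5), (4.8, 5.2, 5.9), (4.4, 4.9, 5.3),
]
labels, centers = kmeans(site_summaries, k=2, seed=0)
ordered_labels = order_by_median(labels, site_summaries)
```

In practice each site would carry summaries for every evolutionary parameter rather than a single triple, so the feature vectors would be correspondingly longer.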
We compare the performance of the various standard models and infinite mixture models by assessing their model fit through estimates of the log marginal likelihood (Newton and Raftery 1994). Notably, the marginal likelihood implicitly penalizes overfitting and does not systematically prefer more complex models (Jefferys and Berger 1992). A greater marginal likelihood corresponds to a better model fit, and two models M1 and M2 can be formally compared by subtracting the log marginal likelihood under model M2 from the log marginal likelihood under model M1 to obtain the log of the Bayes factor in favor of model M1 over M2 (Jeffreys 1935, 1961). The evidence in favor of M1 over M2 can be interpreted as follows depending on the value of the log Bayes factor: “very strong” if it is greater than 5, “strong” if it is between 3 and 5, “positive” if it is between 1 and 3, and “not worth more than a bare mention” if it is between 0 and 1 (Kass and Raftery 1995). To approximate the marginal likelihood, we use an adjusted version (Redelings and Suchard 2005) of the stabilized harmonic mean estimator introduced by Newton and Raftery (1994) (see the Supplementary material for details).
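The comparison rule above amounts to a subtraction followed by a lookup, as the following Python sketch shows. The function names and the example log marginal likelihoods are hypothetical, and the treatment of values that fall exactly on a threshold is an assumption, since the text does not specify it.

```python
def log_bayes_factor(log_ml_first, log_ml_second):
    """Log Bayes factor in favor of the first model over the second."""
    return log_ml_first - log_ml_second


def evidence_category(log_bf):
    """Map a nonnegative log Bayes factor to the Kass and Raftery (1995) scale.

    Boundary values are assigned to the higher category here; that choice
    is an assumption of this sketch.
    """
    if log_bf < 0:
        raise ValueError("negative log Bayes factor: swap the two models")
    if log_bf > 5:
        return "very strong"
    if log_bf >= 3:
        return "strong"
    if log_bf >= 1:
        return "positive"
    return "not worth more than a bare mention"


# Hypothetical log marginal likelihood estimates for two competing models.
log_bf = log_bayes_factor(-5120.0, -5143.0)
category = evidence_category(log_bf)
```

Because the thresholds top out at 5 log units, differences of tens or hundreds of log units all fall in the “very strong” category.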
In addition to comparing the model fit, we assess the differences in evolutionary inferences that result from the various models (see Supplementary material for full details). We compare summary phylogenetic trees (supplementary figs. S1–S4, Supplementary Material online) as well as the overall phylogenetic posterior distributions by examining split frequencies (Lakner et al. 2008) (supplementary figs. S5–S12, Supplementary Material online) and two-dimensional representations of phylogenetic treespace (Hillis et al. 2005) (supplementary figs. S13–S20, Supplementary Material online). We also examine posterior distributions of summary statistics (supplementary tables S2–S5, Supplementary Material online) and substitution rates and transition/transversion rates by site position, such as codon position and/or gene (supplementary figs. S21–S28, Supplementary Material online). To gain insight into spatial patterns under different infinite mixture models, we compare lagged autocorrelation of mean substitution rates and mean transition/transversion rates along alignments (supplementary figs. S29–S53, Supplementary Material online). Finally, for the different data sets, supplementary figs. S54–S61, Supplementary Material online illustrate the across-site variation in substitution rates and transition/transversion rates for the best fitting models with underlying HKY substitution models.
In the presence of notable differences in posterior inferences under different models, it is important to have assurance that we are able to identify which inferences are more accurate. To this end, we perform a simulation study to evaluate the extent to which models favored by Bayes factors are better able to infer the true underlying evolutionary process.
Respiratory Syncytial Virus Subgroup A
We first consider a human respiratory syncytial virus A (RSVA) data set (Zlateva et al. 2005) that features 35 sequences of 629 base pairs from the G gene, sampled between 1956 and 2002. The RSVA G gene encodes the attachment glycoprotein, and for across-site variation models that employ the Codon partitioning scheme, we divide alignment sites into three groups according to the three codon positions. Figure 1 shows the improvement in model performance over baseline HKY + No Variation and GTR + No Variation models achieved by using different models for across-site variation. Marginal likelihood estimates for all analyses are reported in supplementary table S1, Supplementary Material online.
Fig. 1.
Performance of models for across-site variation on respiratory syncytial virus subgroup A (RSVA), hepatitis C virus subtype 4 (HCV), rabies virus (RABV), and hepatitis B virus (HBV) data. Bars depict improvement in model fit in log marginal likelihood units over the baseline model with the same nucleotide substitution model that assumes no across-site variation. The best performing model for each data set and nucleotide substitution model is indicated by a star over the corresponding bar.
Among the standard approaches, the models without across-site variation clearly perform the worst while the most flexible Codon + Gamma variation scheme yields the best performance. It is interesting to note that, conditional on the substitution model, the Gamma scheme that accounts for uncertainty in substitution rate partitions while using the same relative character exchange rates for all sites performs much better than the Codon scheme, which allows for variation in substitution rates as well as relative character exchange rates but fixes partitions according to codon position. For each of the four standard approaches for modeling across-site variation, using a GTR substitution model leads to a better fit compared to using an HKY substitution model.
All of the infinite mixture models outperform the best of the standard models (GTR + Codon + Gamma) by wide margins. The best model is HKY + IHMM, with a marginal likelihood nearly 100 log units ahead of the second place HKY + HDP-Codon, which has a marginal likelihood 23 log units greater than the HKY + DP model. For each infinite mixture model, using the HKY substitution model leads to a better model fit than using the GTR substitution model (in contrast to what we observed under the standard approaches). In fact, the worst fitting model that uses the HKY substitution model has a marginal likelihood 22 log units higher than the best fitting model that uses the GTR substitution model (the GTR + HDP-Codon). The GTR + HDP-Codon has a marginal likelihood 4 log units greater than the GTR + IHMM, which has a marginal likelihood 3 log units greater than the GTR + DP. The posterior estimates of the number of evolutionary categories (Table 1) are greater and less precise for models that use the HKY substitution model rather than the GTR substitution model. Thus, the infinite mixture models compensate in some sense for a more restrictive substitution model by inferring a larger number of distinct substitution model parameters. The clustering patterns of the alignment sites are summarized in Fig. 2. The pattern variation between models that use the HKY substitution model is greater than the pattern variation between models that use the GTR substitution model. This is consistent with large differences in marginal likelihood between the models that use the former substitution model vs. the relatively small differences in marginal likelihood between models that employ the latter substitution model.
Table 1.
Posterior medians and 95% Bayesian credibility intervals (BCIs) of number of evolutionary categories inferred from infinite mixture model analyses of respiratory syncytial virus subgroup A (RSVA), hepatitis C virus subtype 4 (HCV), rabies virus (RABV), and hepatitis B virus (HBV) data.
| Data set | Model | GTR: Median | GTR: 95% BCI | HKY: Median | HKY: 95% BCI |
|---|---|---|---|---|---|
| RSVA | DP | 2 | (2, 2) | 5 | (3, 11) |
| | HDP-Codon | 2 | (2, 2) | 7 | (4, 11) |
| | IHMM | 2 | (2, 2) | 5 | (5, 7) |
| HCV | DP | 3 | (3, 3) | 12 | (10, 16) |
| | HDP-Codon | 3 | (3, 3) | 9 | (7, 12) |
| | IHMM | 3 | (3, 3) | 10 | (9, 12) |
| RABV | DP | 2 | (2, 2) | 2 | (2, 5) |
| | HDP-Codon | 2 | (2, 2) | 4 | (4, 7) |
| | IHMM | 3 | (3, 3) | 5 | (5, 5) |
| HBV | DP | 3 | (3, 3) | 27 | (26, 30) |
| | IHMM | 3 | (3, 3) | 19 | (19, 20) |
Rows correspond to different data sets and different infinite mixture models for across-site variation of substitution rates and relative exchange rates of nucleotide bases. Columns correspond to different substitution models.
Fig. 2.
Summary of alignment site clustering patterns inferred from infinite mixture model analyses of respiratory syncytial virus subgroup A data set. Alignment sites are divided into categories via the k-means clustering algorithm applied to site-specific posterior quantile estimates of evolutionary model parameters. The prespecified number of categories for each cluster analysis corresponds to the posterior median estimate of the number of evolutionary categories inferred under the infinite mixture model. In each plot, categories are ordered according to median substitution rate, from lowest to highest. The category of each alignment site on a horizontal axis is indicated by a vertical black bar.
Under all infinite mixture models, we infer a greater posterior mean substitution rate for the third codon position than for codon positions one and two (see supplementary figs. S21–S22, Supplementary Material online). However, there is substantial variation of substitution rate and transition/transversion rate within each codon position, and for each parameter, the 95% Bayesian credibility intervals for the different codon positions largely overlap. Autocorrelation plots of the mean substitution rate and (when applicable) transition/transversion rate along the alignment (supplementary figs. S28–S31, Supplementary Material online) provide insight into the different dynamics under the infinite mixture models. For the HDP-based models, the autocorrelations are positive when the lag is a multiple of three and are otherwise negative, and the magnitudes of the autocorrelations are very similar for all lags three units apart. Under the DP- and IHMM-based models, the autocorrelations are mostly positive and of relatively small magnitude.
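The period-three signature described for the HDP-based models is easy to reproduce with a standard sample autocorrelation. The Python sketch below uses the usual biased estimator, which is an assumption (the estimator used for the supplementary figures is not specified here), and the per-site rates are invented so that the third codon position evolves fastest.

```python
def lagged_autocorrelation(values, lag):
    """Sample autocorrelation of a sequence at a given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return numerator / denominator


# Hypothetical mean substitution rates repeating over codon positions 1, 2, 3.
site_rates = [0.5, 0.3, 2.0] * 100

acf = {lag: lagged_autocorrelation(site_rates, lag) for lag in range(1, 7)}
```

With these rates, the autocorrelation is strongly positive at lags 3 and 6 and negative at lags 1, 2, 4, and 5, which matches the pattern that a model with strong codon position structure produces.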
Hepatitis C Virus Subtype 4
Next, we analyze a hepatitis C virus subtype 4 (HCV) data set comprising 63 sequences of 412 base pairs from the E1 region (Ray et al. 2000). The sequences all have the same sampling date, which means that the substitution rate cannot be separated from time in the evolutionary model and is thus not identifiable. In order to estimate substitution rate multipliers, we adopt the same resolution as Wu et al. (2013) in their analysis of the data set: we assign the tree root height a restrictive normal prior distribution with mean 1.0 and standard deviation 0.1. The sequences encompass a protein coding region, and we again divide the alignment into three groups according to the three codon positions for across-site variation models that use the Codon partitioning scheme. Improvements in model fit by models for across-site variation over baseline models are shown in Fig. 1 and complete marginal likelihood estimates are available in supplementary table S1, Supplementary Material online.
The results under the standard approaches mirror those for the RSVA data set. The best model fit is attained under the least restrictive method for modeling across-site variation: the Codon + Gamma scheme. Also, for each of the four across-site variation models, an underlying GTR substitution model yields a better model fit than an HKY model.
Again, as with the RSVA analyses, every infinite mixture model outperforms the best standard model (the GTR + Codon + Gamma), and infinite mixture models that employ HKY models uniformly outperform those that use GTR models. In contrast to the RSVA analyses, Dirichlet process mixtures perform best no matter the underlying substitution model. The best fitting model is the HKY + DP, with a marginal likelihood 92 log units greater than the HKY + HDP-Codon, which exceeds the marginal likelihood of the HKY + IHMM by 83 log units. The marginal likelihoods for the GTR model-based mixtures are again relatively close to each other: the GTR + DP’s marginal likelihood is 2 log units ahead of the GTR + IHMM’s marginal likelihood, which is 4 log units greater than the GTR + HDP-Codon model’s marginal likelihood. As in the case of the RSVA analyses, the posterior estimates of the number of evolutionary categories are higher and more variable for infinite mixture models with underlying HKY substitution models (Table 1). Figure 3 summarizes the clustering patterns under different mixture models. The clustering patterns vary substantially for models that use the HKY model, reflective of the large differences in their marginal likelihoods. On the other hand, the clustering patterns for models based on the GTR model are very similar, which is in line with their relatively close marginal likelihoods.
Fig. 3.
Summary of alignment site clustering patterns inferred from infinite mixture model analyses of hepatitis C virus subtype 4 data set. Alignment sites are divided into categories via the k-means clustering algorithm applied to site-specific posterior quantile estimates of evolutionary model parameters. The prespecified number of categories for each cluster analysis corresponds to the posterior median estimate of the number of evolutionary categories inferred under the infinite mixture model. In each plot, categories are ordered according to median substitution rate, from lowest to highest. The category of each alignment site on a horizontal axis is indicated by a vertical black bar.
The posterior means of substitution rates and transition/transversion rates by codon position are greatest for the third codon position under all models, but the overall posterior distributions again exhibit considerable variability (supplementary figs. S23–S24, Supplementary Material online). The autocorrelation dynamics of the mean substitution rate and mean transition/transversion rate along alignment sites are similar in all cases (supplementary figs. S32–S34, Supplementary Material online): the autocorrelations are positive when the lags are multiples of three and otherwise negative, and the autocorrelation magnitudes are similar for lags three units apart.
Rabies Virus
We next analyze a data set comprising 47 rabies virus (RABV) sequences sampled between 1982 and 2004 that was used to study a large-scale outbreak among North American raccoons (Biek et al. 2007). Each sequence includes 1,359 bp of the glycoprotein (G) gene, 1,365 bp of the nucleoprotein (N) gene, and 87 bp of the noncoding sequence that immediately follows N. For across-site variation models that employ the Codon partitioning scheme, we thus divide the sites from the G and N genes into three groups according to codon position, and noncoding sites are assigned a separate fourth group. Supplementary table S1, Supplementary Material online presents the marginal likelihood estimates under different standard and infinite mixture models, and Fig. 1 depicts the differences in model fit.
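The four-group Codon partitioning for this alignment can be sketched as follows in Python. The region lengths come from the text, but the ordering of the regions within the alignment, the assumption that each coding region starts at codon position one, and the function name are illustrative choices rather than details taken from the analyses.

```python
from collections import Counter


def codon_partition(regions):
    """Assign each alignment site to a group.

    regions is a list of (name, length, is_coding) tuples in alignment order.
    Coding sites get groups 1, 2, 3 by codon position (each coding region is
    assumed to start in frame); noncoding sites get group 4.
    """
    groups = []
    for name, length, is_coding in regions:
        if is_coding:
            if length % 3 != 0:
                raise ValueError(f"coding region {name} is not a whole number of codons")
            groups.extend(i % 3 + 1 for i in range(length))
        else:
            groups.extend([4] * length)
    return groups


# Region lengths from the text; the ordering here is an assumption.
rabv_regions = [("G", 1359, True), ("N", 1365, True), ("noncoding", 87, False)]
site_groups = codon_partition(rabv_regions)
group_sizes = Counter(site_groups)
```

Under these assumptions each codon position group contains 908 sites (453 codons from G and 455 from N) and the noncoding group contains 87 sites, for 2,811 sites in total.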
As with the RSVA and HCV analyses, for each standard approach for modeling across-site variation, using the GTR model leads to a better model fit than using the HKY model. Furthermore, the best performance is achieved under Codon + Gamma across-site variation. In contrast to the RSVA and HCV analyses, however, the Codon across-site variation models outperform the Gamma models.
Many more departures from the trends in the RSVA and HCV analyses emerge in the infinite mixture model analyses. Models based on the HKY substitution model do not uniformly outperform models based on the GTR model. Indeed, while the HKY + IHMM model achieves the best marginal likelihood by a margin of over 1,000 log units, the GTR + IHMM model ranks second and has a marginal likelihood 847 log units greater than the third best model. However, conditional on the type of Bayesian nonparametric prior, using an HKY model always yields a better model fit than using a GTR model. While the gaps between the IHMM-based models and the rest are much bigger than any of the other gaps, the difference in marginal likelihood between any two of the infinite mixture models is greater than 50 log units. While five of the six infinite mixture models greatly outperform all of the standard models for across-site variation, the GTR + DP model has a marginal likelihood that is 30 log units less than that of the standard GTR + Codon + Gamma model, and 13 log units less than the standard GTR + Codon model’s marginal likelihood. Posterior estimates of the number of evolutionary categories (Table 1) are greater and more variable for infinite mixture models that use the HKY substitution model, but the difference is not as great as what we observe in the RSVA and HCV analyses. A summary of the clustering patterns under the different mixture models is depicted in Fig. 4. There is generally substantial variation, but the patterns under the GTR + DP and HKY + DP models are very similar.
Fig. 4.
Summary of alignment site clustering patterns inferred from infinite mixture model analyses of rabies virus data set. Alignment sites are divided into categories via the k-means clustering algorithm applied to site-specific posterior quantile estimates of evolutionary model parameters. The pre-specified number of categories for each cluster analysis corresponds to the posterior median estimate of the number of evolutionary categories inferred under the infinite mixture model. In each plot, categories are ordered according to median substitution rate, from lowest to highest. The category of each alignment site on a horizontal axis is indicated by a vertical black bar. Horizontal bars at the tops of the plots indicate whether sites correspond to the G or N genes or are noncoding sites.
Supplementary figs. S25–S26, Supplementary Material online summarize the posterior estimates of substitution rates and (when applicable) transition/transversion rates for codon positions in the G and N genes and for sites corresponding to the noncoding region. Posterior distributions for the same codon position in different genes are extremely similar under all models. The posterior mean substitution rate is highest for the third codon position in all cases, and HDP-based models exhibit the greatest contrast between the posterior means for the third position and the posterior means for other positions. Supplementary figs. S35–S43, Supplementary Material online show the lagged autocorrelation of the mean substitution rate and mean transition/transversion rate along genes G and N and the noncoding region for the different infinite mixture models. All GTR model-based mixtures exhibit similar patterns for genes G and N: within each model, the autocorrelation magnitudes are very similar for lags three units apart, and the autocorrelations are positive when the lags are multiples of three (reflecting the codon position pattern) and negative otherwise. The magnitudes of the lagged autocorrelations are much greater under the GTR + HDP model than under the GTR + DP and GTR + IHMM models. This is reflective of the HDP modeling, which allows for a greater degree of independence between different codon positions. We see similar autocorrelation patterns for genes G and N under models that use HKY substitution models, with the exception of the transition/transversion rate under the HKY + IHMM model, which exhibits lagged autocorrelations with small magnitudes that are barely positive or negative. For the noncoding region, the autocorrelation pattern does not exhibit the repetition for lags that are three units apart that we see under genes G and N. The patterns under the DP- and HDP-based models are, however, very similar throughout and show positive autocorrelations for lags between 1 and 5 units.
Hepatitis B Virus
Finally, we analyze a data set of 114 complete genome hepatitis B virus (HBV) sequences with sampling locations spanning five continents. The data set is a subset of representatives from the complete data set analyzed by Kocher et al. (2021) based on clustering according to 95% similarity using CD-HIT (Fu et al. 2012). The data set includes contemporary sequences as well as ancient sequences dated up to approximately 10,500 years ago. The HBV genome encodes four overlapping genes (C, P, S, and X), and due to overlapping reading frames we cannot employ standard or infinite mixture models that partition by codon position. For the more limited range of models we explore, supplementary table S1, Supplementary Material online and Fig. 1 show trends that are similar to those for other data sets. In particular, an underlying GTR model yields a better model fit for standard across-site variation models, and infinite mixture models with underlying HKY models uniformly outperform infinite mixture models with underlying GTR models. Posterior estimates of the number of evolutionary categories are again greater and more variable for infinite mixture models based on HKY models (Table 1). The GTR-based models yield posterior medians of three evolutionary categories, which are similar to the GTR-based model estimates in analyses of other data sets. On the other hand, the HKY-based analyses result in much higher median estimates of the number of evolutionary categories (19 under the HKY + IHMM model, 27 under the HKY + DP model) than for the other data sets.
The best fitting model is the HKY + IHMM model. The marginal likelihood estimates for the GTR + IHMM and GTR + DP models are relatively close (the GTR + IHMM model performs better by 13 log units compared to GTR + DP), but the marginal likelihoods of models are otherwise separated by hundreds (and often thousands) of log units. Summaries of the clustering patterns (Fig. 5) under the GTR-based models show very similar clustering patterns. The HKY-based models show rather different clustering dynamics, with the HKY + DP model featuring many more categories occupied by relatively small numbers of sites.
Fig. 5.
Summary of alignment site clustering patterns inferred from infinite mixture model analyses of hepatitis B virus data set. Alignment sites are divided into categories via the k-means clustering algorithm applied to site-specific posterior quantile estimates of evolutionary model parameters. The prespecified number of categories for each cluster analysis corresponds to the posterior median estimate of the number of evolutionary categories inferred under the infinite mixture model. In each plot, categories are ordered according to median substitution rate, from lowest to highest. The category of each alignment site on a horizontal axis is indicated by a vertical black bar. Horizontal bars at the tops of the plots indicate the genes that sites correspond to.
Supplementary figs. S44–S53, Supplementary Material online present lagged autocorrelations for mean substitution rates and (when applicable) mean transition/transversion rates along the contiguous portions of HBV genes C, P, S, and X under DP- and IHMM-based models. The autocorrelation trends are very similar under the GTR + DP and GTR + IHMM models: there is substantial positive autocorrelation for lags that are multiples of three, and the autocorrelation is otherwise generally negative or positive with a very small magnitude. The only exception to the latter is for gene X, in which case the mean substitution rate has substantial positive autocorrelation for lags of one and two. Under the HKY + DP and HKY + IHMM models, the mean transition/transversion rate autocorrelations generally have very small magnitudes. For genes C and P, we observe similar dynamics for the mean substitution rate under both models: the autocorrelations with lags that are multiples of three are positive and generally have a greater magnitude than the autocorrelations with different lags, which are either negative or very close to zero. Notably, the positive autocorrelations with lags that are multiples of three are of substantially smaller magnitude under the HKY + IHMM model. For genes S and X, the mean substitution rate under the HKY + DP model again exhibits substantial positive autocorrelations for lags that are multiples of three and, for other lags, autocorrelations that are generally negative or close to zero. The substantial positive autocorrelations with lags of one and two for gene X stand out as exceptions. For genes S and X, the mean substitution rate under the HKY + IHMM model generally exhibits very little autocorrelation, with the substantially positive autocorrelations of lags 1 and 3 along the segment of gene S comprising sites 2,850–3,162 standing out as exceptions.
The overall trend in autocorrelation that we see under all infinite mixture models is that the most substantial autocorrelations are the positive autocorrelations for the mean substitution rate when the lag is a multiple of three. This is in line with the general patterns suggested by the underlying codon structure and offers support for the ability of infinite mixture models to effectively capture important dynamics of multiple overlapping reading frames. This observation may appear to be contradicted by the fact that the best-fitting model (HKY + IHMM) yields estimates that diverge the most from this aforementioned overall trend. However, autocorrelation focuses only on specific kinds of patterns and does not capture more complex patterns of dependence. This is clarified by the more comprehensive summary of site-specific distributions of evolutionary parameters inferred under the HKY + IHMM model presented in supplementary figs. S59–S61, Supplementary Material online.
We stress that examining the autocorrelation of evolutionary parameters along alignment sites is not an exercise in evaluating model performance or fit. Rather, it is one way to gain insight into the dynamics of different models and the different kinds of patterns they reveal. Whether autocorrelation patterns highlight a model’s strengths or weaknesses depends on the context. As seen in our empirical examples, a very simple model with no across-site variation will not be able to detect general differences in evolutionary processes among codon positions and will exhibit poor model fit in comparison to a model that can uncover basic across-site variation patterns that align with codon positions. Here, autocorrelation patterns of evolutionary parameters along alignment sites would illustrate the advantages of the latter model. On the other hand, there can be more complex patterns of across-site evolutionary variation, and a model that can capture such complex patterns would achieve better model fit than a model that reveals more basic periodic trends but struggles to uncover more complex patterns. In this case, the limitations of the latter model would make it easier to summarize the patterns it can reveal by investigating autocorrelation across sites.
Posterior Distributions of Phylogenetic Trees
The substantial variation in model fit among different models for across-site variation in our empirical examples is accompanied by notably different posterior phylogenetic inferences, especially in the cases of the HCV, RABV, and HBV data sets. In the Supplementary material, we present a number of different summaries of phylogenetic inferences under different models. These include maximum clade credibility (MCC) trees (supplementary figs. S1–S4, Supplementary Material online), plots of posterior split frequencies (supplementary figs. S5–S12, Supplementary Material online), comparison of phylogenetic treespace sampled under different models via heatmaps of two dimensional representations of treespace obtained through multidimensional scaling (supplementary figs. S13–S20, Supplementary Material online), and posterior estimates of root height, tree length and mean evolutionary rate (supplementary tables S2–S5, Supplementary Material online).
Summary trees are widely used to express the essential findings of phylogenetic analyses, and they are often heavily relied upon as bases for downstream evolutionary and epidemiological inferences. The MCC trees inferred under the best-fitting infinite mixture models vs. the best fitting standard models for across-site variation (with the same underlying substitution model) for each data set depict many different evolutionary relationships. In the case of the HBV analyses, there are nodes in each MCC tree with posterior probability of 0.90 or greater that are not present in the other tree. Notable differences in phylogenetic inferences of the HBV data set under different models are also revealed in the other summaries. For instance, some of the 95% Bayesian credibility intervals for the tree length and mean evolutionary rate encompass very different ranges and, for some pairs of models, do not overlap. Finally, the substantially different inferences under different models that are suggested by summary trees and statistics are corroborated by more comprehensive summaries of the posterior distributions. Notably, phylogenetic treespace heatmaps and posterior split frequencies reveal particularly striking contrasts between the inferences resulting from the infinite mixture models vs. the best fitting standard models.
In contrast to what we observe for the HBV data set, different clusterings in MCC trees inferred from the RSVA, HCV and RABV data sets are primarily accounted for by tree nodes with relatively low posterior support, raising the possibility that the uncertainty pertaining to a subset of nodes may make it difficult to arrive at a consensus “best” tree, but that the overall posterior distributions are still very similar. However, a more thorough exploration of the posterior via split frequencies and heatmaps of phylogenetic treespace reveals substantial differences in phylogenetic posterior distributions inferred under different models. In general, we observe a correspondence between how much the marginal likelihoods for a pair of models differ and how much the various summaries of their phylogenetic inferences differ. As such, the summaries provide some insight into what accounts for certain differences in model fit. For instance, in the phylogenetic treespace heatmaps of HKY model-based analyses of the RABV data (supplementary fig. S17, Supplementary Material online), the heatmap of the HKY + IHMM analysis features a much greater peak in the upper right corner than the heatmaps of the HKY + HDP, HKY + DP and HKY + Codon + Gamma analyses. The HKY + IHMM heatmap differs from the other three heatmaps to a much larger degree than any of the aforementioned three heatmaps differ from each other, and this is in line with the degree of difference in marginal likelihood among the four corresponding models.
Simulation Study
We perform a simulation study to assess the extent to which models with higher marginal likelihood estimates correspond to models that better infer evolutionary parameters and phylogenetic trees that characterize generative processes. Importantly, to make the results of this study more relevant to real data (which will not have been generated by any of our nonparametric models for across-site variation), we do not simulate data using any infinite mixture models. We simulate a total of 20 different molecular sequence data sets: 10 with underlying HKY models, and 10 with underlying GTR models. Each data set comprises 500 bp. To generate a data set, we draw distinct evolutionary parameters for each alignment site from appropriate parametric distributions (a gamma distribution for transition/transversion rates, a Dirichlet distribution for equilibrium frequencies, etc.). We set the parameter values of the aforementioned parametric distributions to approximate the posterior distributions of the best-fitting analyses of RSVA data that employ the corresponding substitution model (HKY + IHMM or GTR + HDP). Once we have drawn evolutionary parameters for each site, we use πBUSS (Bielejec et al. 2014) to simulate sequence data for each site according to the corresponding evolutionary parameters on the tips of a fixed tree. For all HKY-model based simulations, we use the MCC tree inferred under the HKY + IHMM analysis of the RSVA data, and for all GTR model-based simulations, we use the MCC tree inferred under the GTR + HDP analysis of the RSVA data.
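The site-specific parameter draws described above can be illustrated with a minimal sketch. It covers only the HKY case and only two parameter types (a gamma-distributed transition/transversion rate ratio and Dirichlet-distributed equilibrium frequencies). The function name and all hyperparameter values are hypothetical placeholders, not the values fitted to the RSVA posteriors, and the sequence simulation step performed with πBUSS is not reproduced here.

```python
import random


def draw_site_parameters(n_sites, kappa_shape, kappa_scale, freq_concentration, seed=0):
    """Draw an independent kappa and base-frequency vector for every alignment site."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        # Transition/transversion rate ratio from a gamma distribution
        # (shape and scale are illustrative assumptions).
        kappa = rng.gammavariate(kappa_shape, kappa_scale)
        # Dirichlet draw obtained by normalizing independent gamma variates.
        gammas = [rng.gammavariate(a, 1.0) for a in freq_concentration]
        total = sum(gammas)
        freqs = [g / total for g in gammas]
        sites.append({"kappa": kappa, "freqs": freqs})
    return sites


# 500 sites, matching the simulated alignment length; other values are hypothetical.
sites = draw_site_parameters(
    n_sites=500,
    kappa_shape=4.0,
    kappa_scale=1.5,
    freq_concentration=[10.0, 8.0, 8.0, 10.0],
)
```

Each entry of `sites` would then parameterize the substitution model used to simulate one alignment column on the fixed tree.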
We analyze each simulated data set with IHMM- and DP-based models that employ the same substitution model that the data were simulated under. Because we do not have any knowledge about how sites might be reasonably divided into different groups, we do not use any HDP-based models. For each analysis, we evaluate the ability of infinite mixture models to estimate the true site-specific evolutionary parameter values by computing the coverage of the true values and mean squared error (across all alignment sites) for each evolutionary parameter. To examine the extent to which the favored model outperforms the unfavored model, Fig. 6 depicts, for each data set, (i) the degree to which the favored model is supported over the unfavored model (as quantified by the log Bayes factor) and (ii) the coverage and mean squared error of the favored model relative to the unfavored model for each evolutionary parameter. Supplementary tables S7 and S8, Supplementary Material online, include more details. In most cases for which the log Bayes factor in support of the favored model is 2.17 or less, no model clearly outperforms the other (there is one notable exception in which case the overall performance of the favored model is convincingly better). When the log Bayes factor in support of the favored model is 4.07 or 5.05, the favored model does not uniformly outperform the unfavored model, but its overall performance is superior: it achieves better mean squared error and coverage for more parameters, and it outperforms the unfavored model by larger margins than the unfavored model outperforms it. For cases where the log Bayes factor in support of the favored model is 6.83 or greater, the favored model performs as well as or better than (in most cases, substantially better than) the unfavored model for all parameters and performance metrics.
Notably, the correspondence between the log Bayes factor and the extent to which the favored model outperforms the unfavored model is largely in line with the widely adopted scale for interpreting the Bayes factor proposed by Kass and Raftery (1995). For instance, Kass and Raftery (1995) consider support for the favored model to be “strong” if the log Bayes factor is between 3 and 5, and “very strong” if it is greater than 5.
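As a small illustration, the scale can be written as a lookup on the natural-log Bayes factor. The “strong” and “very strong” thresholds are the ones quoted above; the two lower categories follow the Kass and Raftery (1995) table, and the treatment of values that fall exactly on a boundary is an assumption of this sketch.

```python
def kass_raftery_label(log_bf):
    """Map a nonnegative natural-log Bayes factor to a Kass and Raftery (1995) category."""
    if log_bf < 0:
        raise ValueError("pass the log Bayes factor in favor of the favored model")
    if log_bf > 5:
        return "very strong"
    if log_bf >= 3:
        return "strong"
    if log_bf >= 1:
        return "positive"
    return "not worth more than a bare mention"


# Log Bayes factor values mentioned in the simulation study.
labels = {bf: kass_raftery_label(bf) for bf in (2.17, 4.07, 5.05, 6.83, 19.0)}
```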
Fig. 6.
Relative performance of favored models over unfavored models in estimating true evolutionary parameter values from 20 simulated data sets. Each data set is analyzed with two different across-site variation models (one based on a Dirichlet process, the other based on an infinite hidden Markov model). For each data set, the plots show the log Bayes factor in support of the favored model over the unfavored model (horizontal axis) and the performance of the favored model relative to the unfavored model in estimating model parameters (vertical axes). Each point represents an evolutionary parameter (there are six parameters for HKY model-based analyses and eleven parameters for GTR model-based analyses) in a simulated data set. The top plot shows the relative coverage of the true values of the parameter across all alignment sites. The relative coverage being greater than one (above the dotted line) indicates that the favored model performs better for the corresponding parameter. The bottom plot shows the relative mean squared error across all alignment sites. The relative mean squared error being less than one (below the dotted line) indicates that the favored model performs better for the corresponding parameter.
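The relative coverage and relative mean squared error plotted in Fig. 6 can be sketched as follows for a single evolutionary parameter. The sketch assumes that each site has a posterior point estimate (for example, a posterior mean) and a posterior interval; the type of interval and estimate, the function names, and the toy numbers are all illustrative assumptions rather than details taken from the analyses.

```python
def coverage(true_values, intervals):
    """Fraction of sites whose true value lies inside the posterior interval."""
    hits = sum(1 for t, (lo, hi) in zip(true_values, intervals) if lo <= t <= hi)
    return hits / len(true_values)


def mean_squared_error(true_values, estimates):
    """Mean squared difference between site-wise estimates and true values."""
    return sum((e - t) ** 2 for t, e in zip(true_values, estimates)) / len(true_values)


# Toy example with four sites (hypothetical numbers).
true_values = [2.0, 3.0, 4.0, 5.0]
favored_estimates = [2.5, 3.0, 3.5, 5.0]
unfavored_estimates = [3.0, 3.0, 3.0, 5.0]
favored_intervals = [(1.5, 3.0), (2.0, 4.0), (3.0, 4.5), (4.0, 6.0)]
unfavored_intervals = [(2.5, 3.5), (2.0, 4.0), (2.0, 3.5), (4.0, 6.0)]

# Ratios of favored to unfavored: coverage above one and MSE below one
# both indicate that the favored model performs better.
relative_coverage = coverage(true_values, favored_intervals) / coverage(
    true_values, unfavored_intervals
)
relative_mse = mean_squared_error(true_values, favored_estimates) / mean_squared_error(
    true_values, unfavored_estimates
)
```

In this toy case the favored model covers all four true values while the unfavored model covers two, so the relative coverage is 2.0, and the relative mean squared error is 0.25.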
We also compare model performance in terms of phylogenetic inference. For each analysis, we employ three different phylogenetic tree distance metrics to compare the distance of trees from the posterior sample to the true phylogenetic tree by computing the mean squared error. The tree distance metrics use information theoretic approaches to compare phylogenetic trees in terms of the number of splits that are unique to either tree while also accounting for the similarity of nonidentical splits (Smith 2020a, 2020b, 2022). The three different metrics we use are based on different ways of quantifying the similarity of nonidentical splits: (i) shared phylogenetic information of the splits, (ii) mutual clustering information of the splits, and (iii) the phylogenetic information content of the most informative split that is consistent with the two splits being compared (see the Supplementary material for details). Supplementary tables S9 and S10, Supplementary Material online report the mean squared error of the better fitting model relative to the worse fitting model for each simulated data set. To summarize the results, we consider a model to be better performing in tree estimation if it yields a mean squared error that is less than or equal to the mean squared error yielded by the competing model under all three tree distance metrics, and strictly less under at least two of the metrics. By this measure, we can identify a better performing model for all but one data set. While the mean squared errors under the competing models for each data set are relatively close, the dynamics are similar to what we observed in the case of evolutionary parameter estimation. When supported by a log Bayes factor greater than or equal to 4.07, the favored model outperforms the unfavored model in tree estimation in every case with one exception. 
In the exceptional case, the favored model is supported by a log Bayes factor of 19.0 and yields a mean squared error that is equal to that of the unfavored model under two of the distance metrics and one percent greater under the third distance metric. The overall performance of the favored model is still consistent with its strong Bayes factor support considering that it substantially outperforms the unfavored model in evolutionary parameter estimation (with relative mean squared error between 0.77 and 0.87) while performing very similarly in tree estimation. When the log Bayes factor in support of the favored model is 2.17 or less, the favored model outperforms the unfavored model in tree estimation in half of the cases, and the unfavored model performs better in the other half. In any case, the differences in mean squared error for tree estimation are relatively small and, when taking evolutionary parameter estimation performance into account, no model exhibits a clearly superior overall performance when the log Bayes factor is 2.17 or less.
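The rule used to declare a better-performing model in tree estimation (no worse under all three distance metrics, strictly better under at least two) can be expressed as a short check. The function name and the numeric values below are hypothetical; the second example only mimics the pattern of the exceptional case, with equal errors under two metrics and a one percent difference under the third.

```python
def better_tree_model(mse_a, mse_b):
    """Return 'a', 'b', or None given per-metric mean squared errors of two models."""

    def dominates(x, y):
        # No worse under every metric, strictly better under at least two.
        no_worse = all(xi <= yi for xi, yi in zip(x, y))
        strictly_better = sum(1 for xi, yi in zip(x, y) if xi < yi)
        return no_worse and strictly_better >= 2

    if dominates(mse_a, mse_b):
        return "a"
    if dominates(mse_b, mse_a):
        return "b"
    return None


# Model a ties under one metric and wins under two: a better model is identified.
clear_case = better_tree_model([1.0, 2.0, 3.0], [1.0, 2.5, 3.5])

# Ties under two metrics and a one percent gap under the third: no model qualifies.
tied_case = better_tree_model([1.0, 2.0, 3.03], [1.0, 2.0, 3.0])
```

Under this rule, a model that is better under only one metric is not declared better, which is why a single data set can be left without a better-performing model.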
Discussion
We introduce a framework for modeling variation of evolutionary processes across multiple sequence alignment sites via a Bayesian infinite mixture model that simultaneously infers assignment of sites to evolutionary categories, the number of evolutionary categories, and the values of evolutionary model parameters that correspond to each category. Importantly, this Bayesian framework naturally accounts for uncertainty in all model parameters, including the aforementioned parameters that specifically account for across-site variation. To build infinite mixture models that can support an arbitrary number of evolutionary categories, our framework relies on Bayesian nonparametric prior distributions. Among such priors, Dirichlet processes have already been employed with great success in evolutionary modeling. Our framework offers novel methods for modeling across-site variation through two different Bayesian nonparametric priors: infinite hidden Markov models and hierarchical Dirichlet processes.
The Bayesian nonparametric priors proposed here model clustering in different ways and lead to substantial differences in model fit and posterior phylogenetic inferences (see Supplementary material). In our viral examples, we find that the suitability of each prior depends on the data set as well as on the underlying DNA substitution model. Notably, for each of the three types of priors, there is at least one scenario in which it leads to better model fit in terms of marginal likelihood than the other two priors. Infinite hidden Markov models yield the best model fit among analyses of the RSVA data under the HKY substitution model and of the RABV data and HBV data under both GTR and HKY substitution models. Dirichlet process mixtures achieve the best model fit in analyses of the HCV data under GTR and HKY substitution models. The hierarchical Dirichlet process, with a priori groups defined by codon position, leads to the best fit in analyses of the RSVA data under the GTR substitution model. These results underscore the importance of expanding beyond Dirichlet processes, which have thus far formed the basis of infinite mixture models for across-site evolutionary variation, and taking all three Bayesian nonparametric priors into consideration in order to best model across-site variation for different data sets with different evolutionary models. While no specific type of infinite mixture model emerges as a consistent favorite, infinite hidden Markov models stand out as particularly promising, outperforming other infinite mixture models by large margins in a majority of scenarios including, notably, larger data sets with more complex evolutionary patterns.
Importantly, our study provides support for regular adoption of infinite mixture models over standard models for across-site variation. Indeed, the best infinite mixture models outperform the best standard models in terms of model fit in all scenarios and, with the exception of the Dirichlet process mixture paired with a GTR substitution model in the case of RABV data, infinite mixture models always outperform standard models. While the results of our simulation study caution against overemphasizing inferential differences under competing models on the basis of small differences in model fit, the favored models in the simulation study consistently and strongly outperform competing models in terms of accurately inferring true evolutionary model parameters when supported by log Bayes factors greater than 6. To put this into perspective, in our real data examples, the best infinite mixture model is favored over the best standard model in every scenario by a log Bayes factor exceeding 90. Furthermore, in most scenarios, the log Bayes factor for competing infinite mixture models is greater than 10 (and is often in the hundreds or even thousands).
Our exploration in terms of data set sizes and complexity remains limited, but future application of our methodology on a variety of data sets will contribute to our understanding of the relative strengths and weaknesses of different types of infinite mixtures, and it can also provide opportunities to apply the methodology in novel ways. For example, three of the four data sets that we analyze consist mostly or entirely of protein coding regions, suggesting hierarchical Dirichlet processes with a priori groups determined by codon positions. Hierarchical Dirichlet processes can be specified differently for sequence data that offer alternative “natural” groupings of sites. In the case of genome data, for instance, sites can be grouped according to gene. Or, following up on our analyses of the HBV data set that features overlapping genes, sites can be grouped based on overlapping and nonoverlapping regions. Finally, we have focused on mixture models with underlying nucleotide substitution models, but this framework could be applied to substitution models for amino acid or codon data, and this will also have implications for the specification and performance of different kinds of mixture models. For codon models, for instance, hierarchical Dirichlet processes with grouping according to codon position would no longer be relevant. Furthermore, infinite mixture models would not need to capture important patterns that are explicitly accounted for by codon models, and this may better enable them to uncover interesting dynamics across codon positions.
While moving from standard models toward infinite mixture models is a big step toward “letting the data decide,” our framework still relies on more a priori modeling assumptions than ideal. For instance, we must pre-specify a particular substitution model. For every combination of data set and Bayesian nonparametric prior we examine, using an underlying HKY substitution model results in a substantially better model fit than using a less restrictive GTR model. This makes for an interesting contrast with standard modeling: for every combination of data set and standard model for across-site variation, using an underlying GTR substitution model leads to a better model fit. The sensitivity of the results underlines the importance of the substitution model. The trends we observe in our examples regarding the HKY and GTR models may not hold for other data sets, and it is also possible that other underlying substitution models may be preferable. Wu et al. (2013) present an appealing alternative to conditioning on any given substitution model: they account for substitution model uncertainty in Dirichlet process mixtures through a stochastic procedure that selects among a range of substitution models. We plan on implementing such an approach that enables substitution model selection for mixtures based on infinite hidden Markov models and hierarchical Dirichlet processes. In addition, when adhering to a single underlying substitution model, we could consider making its parametrization more data driven by using random-effect substitution models, for which efficient implementations exist in our Bayesian inference framework (Magee et al. 2024).
There is much potential for further development of infinite mixture models for across-site variation. There is a wide range of Bayesian nonparametric prior distributions that model clustering in different ways than the three types of priors that we employ. These include Pitman-Yor processes (Pitman and Yor 1997), dependent Dirichlet processes (MacEachern 2000), spatial Dirichlet processes (Gelfand et al. 2005), nested Dirichlet processes (Rodriguez et al. 2008), sticky hierarchical Dirichlet process hidden Markov models (Fox et al. 2011), and nested hierarchical Dirichlet processes (Paisley et al. 2014). In certain scenarios, models based on such priors may very well outperform any of the models that we consider. In another direction, our framework infers partitions under the assumption that overall substitution rates and relative molecular character exchange rates are constant within each partition. Wu et al. (2013) demonstrate improved inference through decoupling the clustering of substitution rates and relative exchange rates by modeling them with independent Dirichlet processes. We anticipate that similar gains can be achieved through analogous decoupled clustering in models based on other Bayesian nonparametric priors.
While our focus has been on modeling evolutionary variation across multiple sequence alignment sites, evolutionary processes often also exhibit variation among phylogenetic tree branches. Numerous molecular clock models that posit branch-specific substitution rate variation have been developed and enjoy widespread use (Thorne et al. 1998; Huelsenbeck et al. 2000; Drummond et al. 2006; Lepage et al. 2006; Drummond and Suchard 2010; Bletsa et al. 2019; Didelot et al. 2021), including a Dirichlet process mixture model (Heath et al. 2012). Other advances have united branch- and site-specific variation of substitution rates (Tuffley and Steel 1998; Galtier 2001; Huelsenbeck 2002; Zhou et al. 2010) and all substitution model parameters (Guindon et al. 2004; Gascuel and Guindon 2007; Whelan 2008; Baele et al. 2021). Simultaneous modeling of branch- and site-specific variation of evolutionary processes via infinite mixtures represents a promising next step. For instance, hierarchical Dirichlet processes offer a natural framework to allow across-site variation dynamics to differ among phylogenetic tree branches while still capturing essential aspects of an overall shared structure.
An important future direction that will allow for more insight into how to best model evolutionary variation is development of better methods for marginal likelihood estimation that are computationally feasible for infinite mixture models in phylogenetic inference. Several studies (Xie et al. 2011; Baele et al. 2012a, 2012b; Fourment et al. 2020) have shown that estimation procedures such as path sampling (Gelman and Meng 1998; Lartillot and Philippe 2006; Baele et al. 2012a), stepping-stone sampling (Xie et al. 2011), and generalized stepping-stone sampling (Fan et al. 2011; Baele et al. 2016) outperform the stabilized harmonic mean estimator that we have used. Developing and evaluating similarly sophisticated methods that can scale and perform well in the context of infinite mixture models in phylogenetics is, however, a difficult task (Filippi et al. 2016; Fourment et al. 2020; Hairault et al. 2022). While there has been limited development and/or evaluation of marginal likelihood estimation procedures specifically for Dirichlet process mixtures (Basu and Chib 2003; Lartillot and Philippe 2004; Hairault et al. 2022), to the best of our knowledge there is no such existing work for hierarchical Dirichlet processes or infinite hidden Markov models. An alternative perspective on Bayesian model selection is to focus on predictive fit rather than marginal fit, and this has spurred the development of cross-validation and information criteria. The AICM (Raftery et al. 2007), a simulation-based analog of the classic Akaike information criterion (AIC) (Akaike 1973), has been employed in phylogenetic inference (Baele et al. 2012a). However, the asymptotic theory that serves as the basis for the AIC does not hold for mixture models (Watanabe 2010; Gelman et al. 2014), and recent work suggests that the AIC is not suitable for assessing mixture models in phylogenetics (Susko and Roger 2020; Crotty and Holland 2022; Liu et al. 2023).
There has been progress in implementing procedures for assessing predictive fit in a phylogenetic inference framework that do not suffer from the same limitations as the AIC, such as leave-one-out cross-validation (Lartillot 2023). While the suitability of these developments remains questionable for our purposes (for instance, they are only valid for i.i.d. models), they represent another promising avenue for future work.
The development of Bayesian nonparametrics has largely been motivated by scenarios with growing amounts of data that necessitate increasing model complexity to adequately capture structure and patterns as data accrue (Ghahramani 2013). Such scenarios have taken center stage in genomic epidemiology, where advances in sequencing capabilities have enabled genomic surveillance in pathogen outbreaks and epidemics in close to real-time (Quick et al. 2016; Douglas et al. 2021). Phylogenetic inference has proven to be an integral tool in genomic epidemiology, providing evolutionary and epidemiological insights that cannot be obtained through other methods (Attwood et al. 2022), and researchers have developed frameworks for “online” phylogenetic inference that can efficiently deliver updated real-time inferences as new data become available (Fourment et al. 2018; Gill et al. 2020). Such online inference frameworks will undoubtedly benefit from the integration of infinite mixture models that not only outperform standard approaches on fixed data sets, but can dynamically adjust to maintain high performance as data sets grow.
In order for infinite mixture models for across-site variation to have maximal impact in “real time” inference settings and in general, it is essential that they are accompanied by efficient algorithms for posterior simulation. To this end, our implementation exploits a cost-effective “data squashing” MCMC sampling scheme for mixture model parameters (Guha 2010) and takes advantage of the opportunity to evaluate the data likelihood for different combinations of alignment site and evolutionary category in parallel via an interface with BEAGLE (Ayres et al. 2019). However, efficient posterior sampling remains challenging. For example, we have observed inconsistent performance in MCMC convergence and mixing across replicates of the same analysis. The data squashing strategy is powerful in its generality, subsuming Gibbs sampling as a special case while providing the opportunity to update parameter values for a large number of sites at once. We plan to investigate the data squashing tuning parameters in more detail and devise strategies to optimize them for different data sets and mixture models. We will also explore the possibility of adapting other promising strategies, such as ensemble approaches that enable multiple Markov chains that are exploring different regions of the posterior distribution to interact (Lindsey et al. 2022).
Acknowledgments
We would like to thank the Associate Editor and three anonymous reviewers for constructive comments that helped improve the manuscript. The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 725422-ReservoirDOCS) and from the European Union’s Horizon 2020 project MOOD (grant agreement no. 874850). The Artic Network receives funding from the Wellcome Trust through project 206298/Z/17/Z. PL acknowledges support by the Research Foundation - Flanders (“Fonds voor Wetenschappelijk Onderzoek - Vlaanderen,” G0D5117N, G0B9317N and G051322N). MAS acknowledges support from US National Institutes of Health grants U19 AI135995, R01 AI153044 and R01 AI162611. MSG acknowledges support from the Centers for Disease Control and Prevention, Department of Health and Human Services, under contract NU50CK000626.
Contributor Information
Mandev S Gill, Department of Statistics, University of Georgia, Athens, GA, USA; Institute of Bioinformatics, University of Georgia, Athens, GA, USA.
Guy Baele, Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium.
Marc A Suchard, Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA; Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA; Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, CA, USA.
Philippe Lemey, Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium.
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Data Availability
BEAST X XML input files that feature all data analyzed in this study are available at https://github.com/mandevgill/infinitemixturemodels.
References
- Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F, editors, Proceedings of the Second International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973. p. 267–281.
- Aldous D. Exchangeability and related topics. In: Ecole d’Ete de probabilities de saint-flour XIII-1983. Berlin: Springer-Verlag; 1985. p. 1–198.
- Antoniak C. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann Stat. 1974:2(6):1152–1174. 10.1214/aos/1176342871.
- Attwood SW, Hill SC, Aanensen DM, Connor TR, Pybus OG. Phylogenetic and phylodynamic approaches to understanding and combating the early SARS-CoV-2 pandemic. Nat Rev Genet. 2022:23(9):547–562. 10.1038/s41576-022-00483-8.
- Ayres DL, Cummings MP, Baele G, Darling AE, Lewis PO, Swofford DL, Huelsenbeck JP, Lemey P, Rambaut A, Suchard MA. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst Biol. 2019:68(6):1052–1061. 10.1093/sysbio/syz020.
- Baele G, Gill MS, Bastide P, Lemey P, Suchard MA. Markov-modulated continuous-time Markov chains to identify site- and branch-specific evolutionary variation in BEAST. Syst Biol. 2021:70(1):181–189. 10.1093/sysbio/syaa037.
- Baele G, Lemey P, Bedford T, Rambaut A, Suchard MA, Alekseyenko AV. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol Biol Evol. 2012a:29(9):2157–67. 10.1093/molbev/mss084.
- Baele G, Lemey P, Suchard MA. Genealogical working distributions for Bayesian model testing with phylogenetic uncertainty. Syst Biol. 2016:65(2):250–264. 10.1093/sysbio/syv083.
- Baele G, Li WLS, Drummond AJ, Suchard MA, Lemey P. Accurate model selection of relaxed molecular clocks in Bayesian phylogenetics. Mol Biol Evol. 2012b:30(2):239–243. 10.1093/molbev/mss243.
- Basu S, Chib S. Marginal likelihood and Bayes factors for Dirichlet process mixture models. J Am Stat Assoc. 2003:98(461):224–235. 10.1198/01621450338861947.
- Baum LE, Petrie T. Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat. 1966:37(6):1554–1563. 10.1214/aoms/1177699147.
- Beal MJ, Ghahramani Z, Rasmussen C. The infinite hidden Markov model. In: Dietterich TG, Becker S, Ghahramani Z, editors. Advances in neural information processing systems. Vol. 14. Cambridge: MIT Press; 2002. p. 577–584.
- Biek R, Henderson J, Waller L, Rupprecht C, Real L. A high-resolution genetic signature of demographic and spatial expansion in epizootic rabies virus. Proc Natl Acad Sci U S A. 2007:104(19):7993–7998. 10.1073/pnas.0700741104.
- Bielejec F, Lemey P, Carvalho LM, Baele G, Rambaut A, Suchard MA. πBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios. BMC Bioinformatics. 2014:15(1):133. 10.1186/1471-2105-15-133.
- Blackwell D, MacQueen J. Ferguson distributions via polya urn schemes. Ann Stat. 1973:1:353–355. 10.1214/aos/1176342372.
- Bletsa M, Suchard MA, Ji X, Gryseels S, Vrancken B, Baele G, Worobey M, Lemey P. Divergence dating using mixed effects clock modelling: an application to HIV-1. Virus Evol. 2019:5(2):vez036. 10.1093/ve/vez036.
- Bruno WJ. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol Biol Evol. 1996:13(10):1368–1374. 10.1093/oxfordjournals.molbev.a025583.
- Crotty SM, Holland BR. Comparing partitioned models to mixture models: do information criteria apply? Syst Biol. 2022:71(6):1541–1548. 10.1093/sysbio/syac003.
- Didelot X, Siveroni I, Volz EM. Additive uncorrelated relaxed clock models for the dating of genomic epidemiology phylogenies. Mol Biol Evol. 2021:38(1):307–317. 10.1093/molbev/msaa193.
- Douglas J, Geoghegan JL, Hadfield J, Bouckaert R, Storey M, Ren X, de Ligt J, French N, Welch D. Real-time genomics for tracking severe acute respiratory syndrome coronavirus 2 border incursions after virus elimination, New Zealand. Emerg Infect Dis. 2021:27(9):2361–2368. 10.3201/eid2709.211097.
- Drummond A, Ho S, Phillips M, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006:4(5):e88. 10.1371/journal.pbio.0040088.
- Drummond AJ, Suchard MA. Bayesian random local clocks, or one rate to rule them all. BMC Biol. 2010:8(1):114. 10.1186/1741-7007-8-114.
- Fan Y, Wu R, Chen MH, Kuo L, Lewis PO. Choosing among partition models in Bayesian phylogenetics. Mol Biol Evol. 2011:28(1):523–532. 10.1093/molbev/msq224.
- Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981:17(6):368–376. 10.1007/BF01734359.
- Felsenstein J. Inferring phylogenies. Sunderland (MA): Sinauer Associates, Inc; 2004. [Google Scholar]
- Felsenstein J, Churchill G. A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996:13(1):93–104. 10.1093/oxfordjournals.molbev.a025575. [DOI] [PubMed] [Google Scholar]
- Ferguson T. A Bayesian analysis of some nonparametric problems. Ann Stat. 1973:1(2):209–230. 10.1214/aos/1176342360. [DOI] [Google Scholar]
- Filippi S, Holmes CC, Nieto-Barajas LE. Scalable Bayesian nonparametric measures for exploring pairwise dependence via Dirichlet process mixtures. Electron J Stat. 2016:10(2):3338–3354. 10.1214/16-EJS1171. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fourment M, Claywell BC, Dinh V, McCoy C, Matsen IV FA, Darling AE. Effective online Bayesian phylogenetics via sequential Monte Carlo with guided proposals. Syst Biol. 2018:67(3):490–502. 10.1093/sysbio/syx090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fourment M, Magee AF, Whidden C, Bilge A, Matsen IV FA, Minin VN. 19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology. Syst Biol. 2020:69(2):209–220. 10.1093/sysbio/syz046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fox EB, Sudderth EB, Jordan MI, Willsky AS. A sticky HDP-HMM with application to speaker diarization. Ann Appl Stat. 2011:5(2A):1020–1056. 10.1214/10-AOAS395. [DOI] [Google Scholar]
- Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012:28(23):3150–3152. 10.1093/bioinformatics/bts565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol. 2001:18(5):866–873. 10.1093/oxfordjournals.molbev.a003868. [DOI] [PubMed] [Google Scholar]
- Gascuel O, Guindon S. Modelling the variability of evolutionary processes. In: Reconstructing evolution: new mathematical and computational advances. Oxford: Oxford University Press; 2007.
- Gelfand AE, Kottas A, MacEachern SN. Bayesian nonparametric spatial modeling with Dirichlet process mixing. J Am Stat Assoc. 2005:100(471):1021–1035. 10.1198/016214504000002078.
- Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Stat Comput. 2014:24(6):997–1016. 10.1007/s11222-013-9416-2.
- Gelman A, Meng XL. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat Sci. 1998:13(2):163–185. 10.1214/ss/1028905934.
- Ghahramani Z. Bayesian non-parametrics and the probabilistic approach to modelling. Philos Trans R Soc A. 2013:371(1984):20110553. 10.1098/rsta.2011.0553.
- Gill MS, Lemey P, Faria NR, Rambaut A, Shapiro B, Suchard MA. Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci. Mol Biol Evol. 2013:30(3):713–724. 10.1093/molbev/mss265.
- Gill MS, Lemey P, Suchard MA, Rambaut A, Baele G. Online Bayesian phylodynamic inference in BEAST with application to epidemic reconstruction. Mol Biol Evol. 2020:37(6):1832–1842. 10.1093/molbev/msaa047.
- Golding GB. Estimates of DNA and protein sequence divergence: an examination of some assumptions. Mol Biol Evol. 1983:1:125–142. 10.1093/oxfordjournals.molbev.a040303.
- Guha S. Posterior simulation in countable mixture models for large datasets. J Am Stat Assoc. 2010:105(490):775–786. 10.1198/jasa.2010.tm09340.
- Guindon S, Rodrigo AG, Dyer KA, Huelsenbeck JP. Modeling the site-specific variation of selection patterns along lineages. Proc Natl Acad Sci U S A. 2004:101(35):12957–12962. 10.1073/pnas.0402177101.
- Hairault A, Robert CP, Rousseau J. Evidence estimation in finite and infinite mixture models and applications, arXiv, arXiv:2205.05416, 10.48550/arXiv.2205.05416, preprint: not peer reviewed. 2022.
- Hasegawa M, Kishino H, Yano T. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985:22(2):160–174. 10.1007/BF02101694.
- Hastings W. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970:57(1):97–109. 10.1093/biomet/57.1.97.
- Heath TA, Holder MT, Huelsenbeck JP. A Dirichlet process prior for estimating lineage-specific substitution rates. Mol Biol Evol. 2012:29(3):939–955. 10.1093/molbev/msr255.
- Hillis DM, Heath TA, St John K. Analysis and visualization of tree space. Syst Biol. 2005:54(3):471–482. 10.1080/10635150590946961.
- Huelsenbeck JP. Testing a covariotide model of DNA substitution. Mol Biol Evol. 2002:19(5):698–707. 10.1093/oxfordjournals.molbev.a004128.
- Huelsenbeck JP, Larget B, Swofford DL. A compound Poisson process for relaxing the molecular clock. Genetics. 2000:154(4):1879–1892. 10.1093/genetics/154.4.1879.
- Huelsenbeck JP, Nielsen R. Variation in the pattern of nucleotide substitution across sites. J Mol Evol. 1999:48(1):86–93. 10.1007/PL00006448.
- Huelsenbeck JP, Suchard MA. A nonparametric method for accommodating and testing across-site rate variation. Syst Biol. 2007:56(6):975–987. 10.1080/10635150701670569.
- Jeffreys H. Some tests of significance, treated by the theory of probability. Math Proc Camb Philos Soc. 1935:31(2):203–222. 10.1017/S030500410001330X.
- Jeffreys H. Theory of probability. 3rd ed. Oxford: Oxford University Press; 1961.
- Jefferys W, Berger J. Ockham’s razor and Bayesian analysis. Am Stat. 1992:80:64–72.
- Jukes T, Cantor C. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press; 1969. p. 21–132.
- Kass R, Raftery A. Bayes factors. J Am Stat Assoc. 1995:90(430):773–795. 10.1080/01621459.1995.10476572.
- Kimura M. Evolutionary rate at the molecular level. Nature. 1968:217(5129):624–626. 10.1038/217624a0.
- Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980:16(2):111–120. 10.1007/BF01731581.
- Kocher A, Papac L, Barquera R, Key FM, Spyrou MA, Hübler R, Rohrlach AB, Aron F, Stahl R, Wissgott A, et al. Ten millennia of hepatitis B virus evolution. Science. 2021:374(6564):182–188. 10.1126/science.abi5658.
- Lakner C, van der Mark P, Huelsenbeck JP, Larget B, Ronquist F. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst Biol. 2008:57(1):86–103. 10.1080/10635150801886156.
- Lanave C, Preparata G, Saccone C, Serio G. A new method for calculating evolutionary substitution rates. J Mol Evol. 1984:20(1):86–93. 10.1007/BF02101990.
- Lartillot N. Identifying the best approximating model in Bayesian phylogenetics: Bayes factors, cross-validation or wAIC? Syst Biol. 2023:72(3):616–638. 10.1093/sysbio/syad004.
- Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 2004:21(6):2004. 10.1093/molbev/msh112.
- Lartillot N, Philippe H. Computing Bayes factors using thermodynamic integration. Syst Biol. 2006:55(2):195–207. 10.1080/10635150500433722.
- Lepage T, Lawi S, Tupper P, Bryant D. Continuous and tractable models for the variation of evolutionary rates. Math Biosci. 2006:199(2):216–233. 10.1016/j.mbs.2005.11.002.
- Lindsey M, Weare J, Zhang A. Ensemble Markov chain Monte Carlo with teleporting walkers. SIAM/ASA J Uncertain Quantif. 2022:10(3):860–885. 10.1137/21M1425062.
- Liu Q, Charleston MA, Richards SA, Holland BR. Performance of Akaike information criterion and Bayesian information criterion in selecting partition models and mixture models. Syst Biol. 2023:72(1):92–105. 10.1093/sysbio/syac081.
- MacEachern SN. Dependent nonparametric processes. In: ASA Proceedings of the Section on Bayesian Statistical Science; Alexandria (VA): American Statistical Association; 1999. p. 50–55.
- MacEachern SN. Dependent Dirichlet processes. Technical report, Ohio State University, Department of Statistics; 2000.
- Magee AF, Holbrook AJ, Pekar JE, Caviedes-Solis IW, Matsen IV FA, Baele G, Wertheim JO, Ji X, Lemey P, Suchard MA. Random-effects substitution models for phylogenetics via scalable gradient approximations. Syst Biol. 2024:73(3):562–578. 10.1093/sysbio/syae019.
- Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E. Equation of state calculation by fast computing machines. J Chem Phys. 1953:21(6):1087–1092. 10.1063/1.1699114.
- Neal RM. Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat. 2000:9(2):249–265. 10.1080/10618600.2000.10474879.
- Nei M, Chakraborty R, Fuerst PA. Infinite allele model with varying mutation rate. Proc Natl Acad Sci U S A. 1976:73(11):4164–4168. 10.1073/pnas.73.11.4164.
- Newton M, Raftery A. Approximate Bayesian inference with the weighted likelihood bootstrap. J R Stat Soc Ser B. 1994:56(1):3–48. 10.1111/j.2517-6161.1994.tb01956.x.
- Nielsen R. Site-by-site estimation of the rate of evolution and the correlation of rates in mitochondrial DNA. Syst Biol. 1997:46(2):346–353. 10.1093/sysbio/46.2.346.
- Pagel M, Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst Biol. 2004:53(4):571–581. 10.1080/10635150490468675.
- Paisley J, Wang C, Blei DM, Jordan MI. Nested hierarchical Dirichlet processes. IEEE Trans Pattern Anal Mach Intell. 2014:37(2):256–270. 10.1109/TPAMI.2014.2318728.
- Pitman J, Yor M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann Probab. 1997:25(2):855–900. 10.1214/aop/1024404422.
- Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L, Bore JA, Koundouno R, Dudas G, Mikhail A, et al. Real-time, portable genome sequencing for Ebola surveillance. Nature. 2016:530(7589):228–232. 10.1038/nature16996.
- Raftery A, Newton M, Satagopan J, Krivitsky P. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. In: Bayesian Statistics. Oxford: Oxford University Press; 2007. p. 1–45.
- Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Syst Biol. 2018:67(5):901–904. 10.1093/sysbio/syy032.
- Ray SC, Arthur RR, Carella A, Bukh J, Thomas DL. Genetic epidemiology of hepatitis C virus throughout Egypt. J Infect Dis. 2000:182(3):698–707. 10.1086/jid.2000.182.issue-3.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2021.
- Redelings BD, Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 2005:54(3):401–418. 10.1080/10635150590947041.
- Rodriguez A, Dunson DB, Gelfand AE. The nested Dirichlet process. J Am Stat Assoc. 2008:103(483):1131–1144. 10.1198/016214508000000553.
- Sethuraman J. A constructive definition of Dirichlet priors. Stat Sin. 1994:4:639–650.
- Smith MR. TreeDist: distances between Phylogenetic Trees. R package version 2.9.2; 2020b.
- Smith MR. Information theoretic generalized Robinson-Foulds metrics for comparing phylogenetic trees. Bioinformatics. 2020a:36(20):5007–5013. 10.1093/bioinformatics/btaa614.
- Smith MR. Robust analysis of phylogenetic tree space. Syst Biol. 2022:71(5):1255–1270. 10.1093/sysbio/syab100.
- Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018:4(1):vey016. 10.1093/ve/vey016.
- Susko E, Roger AJ. On the use of information criteria for model selection in phylogenetics. Mol Biol Evol. 2020:37(2):549–562. 10.1093/molbev/msz228.
- Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. Phylogenetic inference. In: Molecular systematics. 2nd ed. Sunderland (MA): Sinauer Associates, Inc; 1996. p. 407–514.
- Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G + C-content biases. Mol Biol Evol. 1992:9:678–687. 10.1093/oxfordjournals.molbev.a040752.
- Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993:10:512–526. 10.1093/oxfordjournals.molbev.a040023.
- Tavare S. Some probabilistic and statistical problems on the analysis of DNA sequences. Lect Math Life Sci. 1986:17:57–86.
- Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet processes. J Am Stat Assoc. 2006:101(476):1566–1581. 10.1198/016214506000000302.
- Thorne J, Kishino H, Painter I. Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol. 1998:15(12):1647–1657. 10.1093/oxfordjournals.molbev.a025892.
- Tuffley C, Steel M. Modeling the covarion hypothesis of nucleotide substitution. Math Biosci. 1998:147(1):63–91. 10.1016/S0025-5564(97)00081-3.
- Venables WN, Ripley BD. Modern applied statistics with S. 4th ed. New York: Springer; 2002.
- Venditti C, Meade A, Pagel M. Phylogenetic mixture models can reduce node-density artifacts. Syst Biol. 2008:57(2):286–293. 10.1080/10635150802044045.
- Waddell PJ, Steel MA. General time-reversible distances with unequal rates across sites: mixing and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol. 1997:8(3):398–414. 10.1006/mpev.1997.0452.
- Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res. 2010:11:3571–3594.
- Whelan S. Spatial and temporal heterogeneity in nucleotide sequence evolution. Mol Biol Evol. 2008:25(8):1683–1694. 10.1093/molbev/msn119.
- Wu CH, Suchard MA, Drummond AJ. Bayesian selection of nucleotide substitution model and their site assignments. Mol Biol Evol. 2013:30(3):669–688. 10.1093/molbev/mss258.
- Xie W, Lewis PO, Fan Y, Kuo L, Chen M-H. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst Biol. 2011:60(2):150–160. 10.1093/sysbio/syq085.
- Yang GL, Le Cam L. Asymptotics in statistics: some basic concepts. Berlin: Springer; 2000.
- Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol. 1993:10:1396–1401. 10.1093/oxfordjournals.molbev.a040082.
- Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994:39(3):306–314. 10.1007/BF00160154.
- Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995:139(2):993–1005. 10.1093/genetics/139.2.993.
- Yang Z. Computational molecular evolution. Oxford: Oxford University Press; 2006.
- Zhou Y, Brinkmann H, Rodrigue N, Lartillot N, Philippe H. A Dirichlet process covarion mixture model and its assessments using posterior predictive discrepancy tests. Mol Biol Evol. 2010:27(2):371–384. 10.1093/molbev/msp248.
- Zlateva K, Lemey P, Moës E, Vandamme AM, Van Ranst M. Genetic variability and molecular evolution of the human respiratory syncytial virus subgroup B attachment G protein. J Virol. 2005:79(14):9157–9167. 10.1128/JVI.79.14.9157-9167.2005.
Data Availability Statement
BEAST X XML input files that feature all data analyzed in this study are available at https://github.com/mandevgill/infinitemixturemodels.






