Skip to main content
Systematic Biology logoLink to Systematic Biology
. 2020 May 16;70(1):181–189. doi: 10.1093/sysbio/syaa037

Markov-Modulated Continuous-Time Markov Chains to Identify Site- and Branch-Specific Evolutionary Variation in BEAST

Guy Baele 1,, Mandev S Gill 1, Paul Bastide 1, Philippe Lemey 1, Marc A Suchard 2,3,4
Editor: David Posada
PMCID: PMC7744037  PMID: 32415977

Abstract

Markov models of character substitution on phylogenies form the foundation of phylogenetic inference frameworks. Early models made the simplifying assumption that the substitution process is homogeneous over time and across sites in the molecular sequence alignment. While standard practice adopts extensions that accommodate heterogeneity of substitution rates across sites, heterogeneity in the process over time in a site-specific manner remains frequently overlooked. This is problematic, as evolutionary processes that act at the molecular level are highly variable, subjecting different sites to different selective constraints over time, impacting their substitution behavior. We propose incorporating time variability through Markov-modulated models (MMMs), which extend covarion-like models and allow the substitution process (including relative character exchange rates as well as the overall substitution rate) at individual sites to vary across lineages. We implement a general MMM framework in BEAST, a popular Bayesian phylogenetic inference software package, allowing researchers to compose a wide range of MMMs through flexible XML specification. Using examples from bacterial, viral, and plastid genome evolution, we show that MMMs impact phylogenetic tree estimation and can substantially improve model fit compared to standard substitution models. Through simulations, we show that marginal likelihood estimation accurately identifies the generative model and does not systematically prefer the more parameter-rich MMMs. To mitigate the increased computational demands associated with MMMs, our implementation exploits recent developments in BEAGLE, a high-performance computational library for phylogenetic inference. [Bayesian inference; BEAGLE; BEAST; covarion, heterotachy; Markov-modulated models; phylogenetics.]


Molecular sequence evolution is typically modeled by Markov models of character substitution acting along the branches of a phylogenetic tree. These models are phenomenological descriptions of the evolution of DNA as a string of a number of discrete character states, with models of nucleotide substitution among four states being the most widely used in statistical phylogenetics. The Markovian property within such a model reflects the common assumption that evolution has no memory. Further, it is standard to assume that the Markov model is time-homogeneous, so that it can be characterized by a generator or instantaneous rate matrix Inline graphic that remains constant during evolution (Gascuel and Guindon 2007). Early probabilistic phylogenetic reconstruction methods assumed a single substitution model that acted independently across all sites and lineages.

The characters at different alignment sites, however, typically evolve under varying structural or functional constraints, inspiring models that accommodate among-site rate variation by scaling up or down the expected number of substitutions at different sites. Sites evolve, nonetheless, in more qualitatively different ways than simply variation in their overall substitution rates (Pagel and Meade 2004). Furthermore, selective pressures vary over time and often defy a priori site partitioning into sets with approximately equal selection across an alignment. Examples of such a complex interplay between sites come from studies on how the 3D structure of proteins evolves over time. These studies show that, although a few essential sites may be invariable over long periods of evolutionary time, most sites do change their functional environment—and as a result, the functional constraints they are subjected to—during evolution (Penny et al. 2001). In order to capture and accurately model these types of evolutionary phenomena, there is need for a class of flexible substitution models that do not require prior knowledge regarding data partitioning.

The increase in computational power over the past two decades has enabled fast evaluation of complex models in a feasible amount of time, by focusing on exploiting many-core computing solutions (Suchard and Rambaut 2009). This has paved the way for evaluating high-dimensional substitution models and modeling complex scenarios, such as clade-specific and even branch-specific evolutionary processes. Markov-modulated models (MMMs) constitute a class of mixture models that allow the substitution process to change across each branch and this for each site independently within an alignment (we refer interested readers to Supplementary materials available on Dryad at https://doi.org/10.5061/dryad.230s5h0 for an in-depth introduction). In this article, we introduce a Bayesian inference framework for MMMs, with an implementation in BEAST (Suchard et al. 2018)—a software package for Bayesian evolutionary analysis—that accommodates phylogenetic uncertainty. In doing so, we strive for optimal generality by allowing switching between evolutionary models within the MMM that have different substitution rates, relative character exchange rates and stationary distributions.

Methods

Markov-Modulated Model Structure

Consider an MMM composed of Inline graphic evolutionary models (irrespective of those models being nucleotide, amino acid, or codon models). Each evolutionary model is defined by a relative substitution rate multiplier Inline graphic and a substitution model characterized by an instantaneous rate matrix Inline graphic, of dimension Inline graphic, and stationary distribution Inline graphic. We also adopt the usual constraint -Inline graphic. The switching process between the Inline graphic models is defined by a Inline graphic-state continuous-time Markov process with rate matrix

graphic file with name M10.gif (1.1)

where the element Inline graphic corresponds to the rate of switching from substitution model Inline graphic to substitution model Inline graphic, and the diagonal elements are fixed such that the rows sum to Inline graphic. We denote the stationary distribution of this switching process by Inline graphic. These model switches follow a homogeneous, stationary—but not necessarily time-reversible—Markovian process. In Equation 1.1, we do not make use of an additional parameter Inline graphic that expresses the global rate of change between the evolutionary models because this is a deterministic parameter obtained by normalizing the model-switching process (Guindon et al. 2004; Gascuel and Guindon 2007).

The MMM is characterized by a Inline graphic rate matrix Inline graphic (Fischer and Meier-Hellstern 1993):

graphic file with name syaa037m1.jpg (1.2)

where Inline graphic is an Inline graphic identity matrix and Inline graphic denotes the Kronecker product. The MMM can therefore be considered a single Markov process with a state space equal to the Cartesian product of the state space of the switching process (between the evolutionary models) and the state space of the evolutionary models, with cardinality Inline graphic and stationary distribution Inline graphic (Guindon et al. 2004). As noted by Gascuel and Guindon (2007), the MMM in Equation 2 allows for every compound state Inline graphic to either: 1) stay in model Inline graphic and transition to Inline graphic with rate defined by Inline graphic, or 2) change evolutionary models and transition to Inline graphic with rate Inline graphic. All rows in Inline graphic sum to Inline graphic, and because Inline graphic and Inline graphic, it follows that Inline graphic. We refer to Supplementary material available on Dryad for additional information on these MMMs, for example on their identifiability when combining them with among-site rate variation (ASRV; Yang 1994, 1996).

Likelihood

In this section, we adopt a similar notation to Gascuel and Guindon (2007) to describe the data likelihood under an MMM. Likelihood calculations for MMMs employ a standard pruning approach (Felsenstein 1981), with integration over the compound states (i.e., the evolutionary model and character state) at the internal nodes of the tree, and integration over the unobserved categories at the tips. Let Inline graphic where Inline graphic are the extant characters observed at aligned site Inline graphic for Inline graphic, and let Inline graphic denote the phylogenetic tree with its branch lengths. Let Inline graphic denote the MMM that models the evolutionary process for all sites, where Inline graphic and Inline graphic represents parameters for the Inline graphicth evolutionary model, and Inline graphic parameters of the switching process. The observed data likelihood is:

graphic file with name M45.gif (1.3)

where the product is taken over every site Inline graphic in the alignment, with each site assumed to evolve independently. The sum over the compound states Inline graphic replaces the sum over the nucleotide characters Inline graphic that is performed for standard nucleotide substitution models (Gascuel and Guindon 2007). Here, Inline graphic is the likelihood of the data at site Inline graphic under category Inline graphic and given that state Inline graphic is observed at site Inline graphic of the root node Inline graphic. We can generalize this notation as Inline graphic for node Inline graphic to express the partial likelihood of observing the characters at site Inline graphic in the extant sequences descending from Inline graphic. This notation can be shortened to Inline graphic because Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic are the same for all sites and nodes. Let Inline graphic and Inline graphic be the left and right descendants of Inline graphic and Inline graphic the length of the branch connecting Inline graphic to its parent. Each partial likelihood is then defined as follows (taking into account that the evolutionary categories are unobserved; Gascuel and Guindon 2007):

graphic file with name M70.gif (1.4)

The substitution probabilities Inline graphic are computed using matrix exponentiation of Inline graphic with computational complexity Inline graphic (Pan and Chen 1999), although lower complexity may be achieved depending on the Kronecker structure of Inline graphic (but see the Supplementary material available on Dryad). Computing these probabilities for all Inline graphic branches in the phylogeny therefore sports a complexity of Inline graphic. Evaluating the Inline graphic site likelihoods through the tree-pruning (or peeling) algorithm (Felsenstein 1981) amounts to a complexity of Inline graphic. Taken together, with a relatively small cost Inline graphic for taking logarithm of site likelihoods and summing over sites results in a computational complexity of Inline graphic for the log-likelihood of the observed data.

Implementation

We have implemented MMMs and their corresponding likelihood function in BEAST (Suchard et al. 2018), a widely used software package for Bayesian phylogenetic and phylodynamic inference using Markov chain Monte Carlo integration. These models are available for use in BEAST through XML specification, allowing to construct a wide range of different modeling assumptions such as the ones detailed in this article (and the Supplementary material available on Dryad). The use of MMMs substantially increases computation time in likelihood-based inference, and we offload the computationally demanding aspects to powerful multi- and many-core hardware through the BEAGLE library (Ayres et al. 2019).

Biological Examples

We here consider substitution models that are time-reversible and therefore substitution model Inline graphic will have instantaneous rates Inline graphic that can be expressed in terms of base frequencies Inline graphic and symmetric rate parameters Inline graphic as follows:

graphic file with name M85.gif (1.5)

Thus a substitution model can be specified in terms of its base frequencies and symmetric rate parameters Inline graphic.

We adopt the following notation: MMM(Inline graphic)Inline graphic, where Inline graphic denotes the type of substitution model and Inline graphic, Inline graphic, Inline graphic, and Inline graphic denote the numbers of distinct sets of symmetric rate parameters, sets of base frequencies, the relative rate multipliers, and the structure of Inline graphic as either symmetric/triangular (Inline graphic) or asymmetric (Inline graphic), respectively. For example, an MMM(HKY)Inline graphic refers to an MMM featuring two different HKY substitution models, each with its own set of symmetric rate parameters and set of base frequencies, two different relative rate multipliers and a symmetric rate switching matrix Inline graphic. An MMM(HKY)Inline graphic refers to an MMM featuring two different HKY substitution models that share the same set of symmetric rate parameters but have different sets of base frequencies, along with two different relative rate multipliers and an asymmetric rate switching matrix Inline graphic. When the relative rate multipliers are all fixed to 1 to superimpose an ASRV model (see Supplementary material available on Dryad), the Inline graphic subscript is omitted (e.g., MMM(HKY)Inline graphic).

We here consider two empirical data sets that show the importance of employing MMMs to accurately model the substitution process, as supported by Bayesian model selection. In Supplementary material available on Dryad, we analyze two additional empirical data sets—a plant plastid gene and an influenza A virus data set—that provide evidence in favor of MMMs over traditional substitution models but also showcase the wide range of modeling assumptions possible within our MMM formulation.

Bacterial 16S Ribosomal RNA

Differences in base composition throughout the genome can bias phylogenetic inference when not properly taken into account. Often, the proportion of A+T in a genome differs from that of G+C, and different organisms exhibit different patterns of base composition. At the level of the entire genome, GC content varies greatly within and among major groups of organisms, which can skew phylogenetic reconstruction if not properly unaccounted for (Mooers and Holmes 2000). Two different evolutionary processes have been singled out as possible explanations for varying patterns of base composition: biases in the underlying process of mutation, as similar levels of GC content are often found in regions with different functional constraints, and natural selection, with increased global GC content in bacteria possibly being selected for by UV exposure (Singer and Ames 1970).

Environmental variation shaping nucleotide composition may cause unrelated taxa to share similar base composition and therefore be grouped together within a clade. To accurately reconstruct evolutionary histories through phylogenetic inference, these potentially differing base compositions need to be accommodated in an explicit manner by the nucleotide substitution model. To address this, Blanquart and Lartillot (2006) developed a nonstationary and nonhomogeneous model accounting for compositional biases, allowing the composition to change at random points in the tree, with the total number of change points across the tree being inferred from the data. Through a Bayesian analysis of eubacterial 16S rRNA and BAS1 gene yeast data sets, the authors show that in most cases, the stationarity assumption was rejected in favor of their nonstationary model.

We evaluate our MMM framework on 16S ribosomal RNA of five bacterial sequences: Deinococcus radiodurans, Thermus thermophilus, Thermotoga maritima, Aquifex pyrophilus, and Bacillus subtilis (GenBank accession numbers: Y11332.1, AJ251939.1, NR_029163.1, M83548.2, and CP009796.1). We use standard nucleotide substitution models as well as MMMs to infer their evolutionary history while fixing the Aquifex pyrophilus sequence as an outgroup. Given that the data contain three thermophilic (high GC content) and two mesophilic (lower GC content) bacteria genera (Mooers and Holmes 2000), we consider only MMM(Inline graphic)Inline graphic models and do not further explore higher-dimensional models. The true tree topology of this eubacterial data set is believed to group D. radiodurans and T. thermophilus together to the exclusion of B. subtilis, T. maritima, and A. pyrophilus, given that D. radiodurans and T. thermophilus share the same peptidoglycan and menaquinone type (Murray 1992). However, phylogenetic reconstruction under stationary models has a tendency to erroneously group D. radiodurans and B. subtilis together, because these mesophiles have similar, relatively low GC content.

Figure 1 shows the results of the phylogenetic reconstructions, with the HKY and GTR models—both featuring an ASRV model and a relaxed molecular clock with an underlying lognormal distribution—yielding similar (log) marginal likelihoods (we refer to Supplementary material available on Dryad for details on the marginal likelihood estimation procedure). Note that, because we will include an ASRV model in all of these MMMs, we set all Inline graphic in Equation 1.2 to 1 to ensure identifiability. Both the HKY and GTR models express strong support in favor of a clustering of D. radiodurans and B. subtilis (see Fig. 1), with the GTR model yielding a small increase in model fit to the data over the HKY model (log BF Inline graphic 1). As such, both models yield an incorrect clustering, which appears to be primarily based on both sequences being mesophilic (low GC content), whereas the three other sequences are considered thermophilic (high GC content). While an MMM of the type introduced by Tuffley and Steel (1998) offers no improvement over these models when Inline graphic is parameterized as an HKY model (see Supplementary material available on Dryad for the model’s details), a significant improvement in model fit can be obtained when Inline graphic is parameterized as a GTR model (log BF = 19). However, any MMM with two sets of base frequencies and with either a single set of symmetric rate parameters (an MMM(Inline graphic)Inline graphic) or with two different sets of symmetric rate parameters (an MMM(Inline graphic)Inline graphic) offers a further improvement in model fit compared to the standard nucleotide substitution models tested (8 Inline graphic log BF Inline graphic 45; we refer to Supplementary material available on Dryad for the log marginal likelihood estimates). This can be attributed to the fact that MMMs are able to accommodate differing base compositions throughout the tree topology, and consequently yield an accurate phylogenetic reconstruction of the bacterial relationships, with the D. radiodurans and T. thermophilus clustering together (see Fig. 1) (Embley et al. 1993; Mooers and Holmes 2000).

Figure 1.

Figure 1

a) Maximum clade credibility (MCC) phylogeny relating five bacterial 16S sequences; unlabeled nodes have Inline graphic0.9999 posterior probability. Standard nucleotide substitution models that assume among-site rate variation (ASRV) erroneously cluster the two mesophiles together with high posterior probability (0.649 for HKY and 0.863 for GTR in the topology on the left). However, an MMM(HKY)Inline graphic yields the correct clustering of the Deinococcus radiodurans and the Thermus thermophilus sequences with high posterior probability (topology on the right); each branch is annotated with the proportion of sites in each of the continuous-time Markov chain (CTMC) models, based on the maximum a posteriori (MAP) phylogeny. b) Number of CTMC model switches per alignment site based on the most probable hidden state realizations of the MMM on the MAP phylogeny; of the full alignment of 1304 sites, 761 sites are estimated not to switch between CTMC models. c) Mean posterior parameter estimates of the MMM show asymmetric switching between models (with circle sizes proportional to rate switching intensity) with pronounced differences in transition/transversion ratios and base frequencies.

The base frequency estimates for the CTMC models within the MMM reflect the presence of mesophilic sequences (low GC content; orange in Fig. 1) and thermophilic sequences (high GC content; blue in Fig. 1) in our data. Despite the fact that only eight branches connect the observed sequences, alignment sites switch up to four times between CTMC models across the phylogeny, indicating evolutionary dynamics that cannot possibly be accommodated using standard nucleotide substitution models. Over 40% of the alignment sites undergo at least one switch between CTMC models in a highly asymmetric manner (see Fig. 1). The two CTMC models are also characterized by pronounced differences in transition/transversion ratios. In conclusion, we show that appropriately modeling compositional heterogeneity for these eubacterial sequences enables inference of the correct phylogeny as well as base frequency compositions that reflect the presence of both mesophilic and thermophilic sequences in the data set.

Plant Plastid Genes

We consider nucleotide sequence data from the protein-coding genes of 23 completely sequenced plant plastid genomes, previously analyzed by Ané et al. (2005) to measure the independence of the substitution process between two groups of taxa as a means of detecting covarion evolution. Assuming a fixed underlying reference tree that represents the likely relationships of plant taxa for which complete chloroplast sequences were available at the time, the covarion test of Ané et al. (2005) detected significant covarion evolution (Inline graphic) in 14 of 57 genes analyzed across all positions. We here analyze the psaB gene with standard nucleotide substitution models and MMMs and compare the inferred phylogenies and model fit; we refer to Supplementary material available on Dryad for our analysis of the ndhD gene.

A comparison of standard nucleotide substitution models reveals that the combination of a GTR model and an ASRV model, along with a relaxed clock assuming an underlying lognormal distribution, yields the highest (log) marginal likelihood for both data sets. We conduct analyses with MMMs that feature an HKY or GTR substitution model with a single set of symmetric rate parameters along with two or three different sets of base frequencies (i.e., MMM(Inline graphic)Inline graphic and MMM(Inline graphic)Inline graphic models), as well as generalizations of these MMMs that feature as many different sets of rate parameters as sets of base frequencies, and both symmetric and asymmetric Inline graphic (i.e., MMM(Inline graphic)Inline graphic and MMM(Inline graphic)Inline graphic models). For all of these models, we set all Inline graphic in equation 1.2 to 1 to ensure identifiability when using an ASRV model in combination with MMMs. We also analyze the data with a nucleotide covarion model (Tuffley and Steel 1998), which we can easily compose within our MMM framework through XML specification.

The psaB data set strongly prefers the covarion-style model over a standard GTR+ASRV substitution model by a log Bayes factor of 208. The MMM(GTR)Inline graphic and MMM(GTR)Inline graphic yield log Bayes factors of 257 and 313, respectively, over the standard GTR+ASRV model. MMM(GTR)Inline graphic and MMM(GTR)Inline graphic parameterizations yield further increases in model fit of 321 and 347, respectively, over the GTR+ASRV model. Because additional categories within the MMM offer diminishing returns in terms of model fit at the expense of additional computation time, we did not explore MMMs with even higher dimensions. Figure 2 shows the maximum clade credibility (MCC) trees obtained under the standard GTR+ASRV model and the MMM(GTR)Inline graphic that generated the highest (log) marginal likelihood. While the clustering within the seed plants is identical under both models, substantial differences in posterior support can be observed for specific clades. In the remaining part of the tree, these models result in completely different clustering patterns with strong support for many clades under the MMM(GTR)Inline graphic model.

Figure 2.

Figure 2

Phylogenetic reconstruction of plant plastid sequences, based on the psaB protein-coding gene; unlabeled nodes have Inline graphic0.9999 posterior probability. Left: MCC tree based on a standard GTR+ASRV model. Right: MCC tree based on an MMM(GTR)Inline graphic with ASRV, which is strongly supported over the MCC tree generated under the GTR model (log Bayes factor of 347). While only a single different clustering can be observed within the Angiosperms, many differing clusters that have very high posterior probabilities are generated using the MMM(GTR)Inline graphic outside of the seed plants.

In Figure 3, we illustrate the complex substitution patterns across all sites on the MAP psaB phylogeny, using the most probable hidden state realizations of the MMM(GTR)Inline graphic. We use a simple counting procedure to quantify the number of differences between the ancestral model states as a means to reconstruct which sites evolve according to which CTMC within the MMM(GTR)Inline graphic, and we observe a relatively small amount of CTMC switching throughout the phylogeny (of note, we observe a 4.5-fold increase in number of sites switching between CTMCs in our analysis of the ndhD gene in Supplementary material available on Dryad). The reconstructed patterns go beyond mere codon position partitioning, as we observe different substitution dynamics per codon position. In particular, the third codon position is the only position that evolves according to a particular CTMC a majority of the time, and it also exhibits the greatest degree of switching between CTMC realizations. We depict the mean posterior instantaneous substitution rates of the various MMM components in Figure 3, showing a clearly asymmetric CTMC switching process and three distinct GTR model realizations within the MMM. This complex interplay of model components is consistent with the strong Bayes factor support of the MMM(GTR)Inline graphic over all other models tested.

Figure 3.

Figure 3

Markov-modulated model behavior on the psaB protein-coding gene phylogeny. a) Amount of time (branch lengths in genetic distance) spent in each CTMC model for each alignment site based on the most probable hidden state realizations of the MMM on the maximum a posteriori phylogeny. b) Summary of the number of sites that evolve according to each CTMC, illustrating complex substitution patterns that go beyond codon position partitioning, as well as 2.8% of sites switching between CTMC realizations. c) Distribution of sites in each codon position across the different CTMC model realizations, showing that first and second codon positions switch far less frequently between CTMC models than the third codon position, in which the substitutions occur according to a clearly predominant CTMC. d) Switching behavior of the MMM between the three CTMC models, with the mean instantaneous substitution rates shown for those models (with circle sizes proportional to rate intensity).

Conclusion

MMMs can infer substantially different phylogenies compared to standard nucleotide substitution models, and they can be associated with significant increases in model fit. A targeted simulation study that assesses the ability of MMMs to retrieve the generative models of simulated sequence alignments and to quantify their increase in model fit when the MMM was the generative model shows that these large differences are not artifacts of using such high-dimensional models (see Supplementary material available on Dryad). Our simulation study also shows similar differences in model fit compared to the ones obtained in this section for the psaB and ndhD genes, as well as the ability of state-of-the-art Bayesian model selection to select the generative substitution model even when compared with similar model parameterizations. Importantly, when simulating data under a standard GTR model, MMMs exhibit a worse model fit than under the generative GTR model. These analyses of simulated data show that MMMs can easily be used in combination with recent developments in Bayesian model selection (Baele et al. 2016) and provide additional support for our conclusions that these models can yield substantial increases in model fit over standard nucleotide substitution models.

We note that each additional CTMC within an MMM (significantly) increases computational demands, and that a search for the optimal MMM may therefore prove time-consuming for complex large data sets. Avoiding direct evaluation of the finite-time transition probabilities through emerging algorithms that instead manipulate the matrix exponential action (Ji et al. 2016) represents a possible work around. In the mean time, to make such computations manageable, BEAST can however exploit the BEAGLE library (Ayres et al. 2019) to offload the large matrix multiplications onto powerful multi-core hardware solutions. In particular, the use of graphics cards for scientific computing yields significant performance gains over standard multi-core processors (see Supplementary material available on Dryad), rendering phylogenetic inference under these MMMs feasible despite their complexity.

Finally, it remains important to recognize that phylogenetic substitution models draw inspiration from biology and biochemistry, but do not capture the full complexity of these underlying processes. MMMs offer a substantial increase in model complexity over traditional substitution models but—like most other substitution models—also make simplifying assumptions, for example, regarding site-independent evolution, as there is no mechanism within an MMM in which changes in one site result in concomitant changes in another. Resulting model misspecification (and potential overparameterization) can mislead model-based tree reconstruction methods (Steel 2005). To guard against such situations, a well-developed statistical theory such as Bayesian model testing should be employed to compare models in an objective manner and choose a model that carefully balances the model’s parameterization with the available information in the data. After all, as Steel (2005) sagely states, the aim of model selection is not to find the “true model” but to find a model with sufficient parameters to capture the key features of the data.

Additionally, we have made available an online tutorial on how to construct XML files to perform phylogenetic inference using Markov-modulated models in BEAST: http://beast.community/markov_modulated.html.

Acknowledgments

We would like to thank the editors, Bryan Carstens and David Bryant, as well as three anonymous reviewers for their constructive comments that helped improve this article. We are grateful to Cécile Ané for kindly providing the plant plastid genome data sets. We gratefully acknowledge support from NVIDIA Corporation with the donation of parallel computing resources used for this research.

Supplementary Material

Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.230s5h0.

Funding

This work was supported by the Interne Fondsen KU Leuven/Internal Funds KU Leuven under grant agreement C14/18/094, and by the Research Foundation – Flanders [“Fonds voor Wetenschappelijk Onderzoek – Vlaanderen”, G0E1420N to G.B.]. The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 725422-ReservoirDOCS); the Research Foundation – Flanders [“Fonds voor Wetenschappelijk Onderzoek – Vlaanderen,” 12Q5619N and V434319N to P.B.]; the Research Foundation – Flanders [“Fonds voor Wetenschappelijk Onderzoek – Vlaanderen,” G066215N, G0D5117N, and G0B9317N to P.L.]; National Science Foundation [DMS 1264153] and National Institutes of Health [R01 AI107034 and U19 AI135995], in part to M.A.S. The Artic Network receives funding from the Wellcome Trust through project 206298/Z/17/Z.

References

  1. Ané C., Burleigh J.G., McMahon M.M., Sanderson. M.J.  2005. Covarion structure in plastid genome evolution: a new statistical test. Mol. Biol. Evol.  22:914–924. [DOI] [PubMed] [Google Scholar]
  2. Ayres D. L., Cummings M. P., Baele G., Darling A.E., Lewis P.O., Swofford D.L., Huelsenbeck J.P., Lemey P., Rambaut A., Suchard M.A.  2019. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst. Biol.  68:1052–1061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Baele G., Lemey P., Suchard M.A.  2016. Genealogical working distributions for Bayesian model testing with phylogenetic uncertainty. Syst. Biol.  65:250–264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Blanquart S., Lartillot N.  2006. A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Mol. Biol. Evol.  23:2058—2071. [DOI] [PubMed] [Google Scholar]
  5. Embley T.M., Thomas R.H., Williams R.A.D.  1993. Reduced thermophilic bias in the 16s rDNA sequence from Thermus ruber provides further support for a relationship between Thermus and Deinococcus. Syst. Appl. Microbial.  16:25–29. [Google Scholar]
  6. Felsenstein J.  1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol.  17:368–376. [DOI] [PubMed] [Google Scholar]
  7. Fischer W., Meier-Hellstern K.  1993. The Markov-modulated Poisson process (MMPP) cookbook. Perform. Evaluation  18:149–171. [Google Scholar]
  8. Gascuel O., Guindon S.  2007. Modelling the variability of evolutionary processes. Reconstruct. Evol.  2:65–99. [Google Scholar]
  9. Guindon S., Rodrigo A.G., Dyer K.A., Huelsenbeck J.P.  2004. Modeling the site-specific variation of selection patterns along lineages. Proc. Natl. Acad. Sci. USA  101:12957–12962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Ji X., Griffing A., Thorne J.L.  2016. A phylogenetic approach finds abundant interlocus gene conversion in yeast. Mol. Biol. Evol.  33:2469–2476. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Mooers A.O., Holmes E.C.  2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol.  15:365–369. [DOI] [PubMed] [Google Scholar]
  12. Murray R.G.E.  1992. The family Deinococcaceae In: Balows, A., Trüper, H.G., Dworkin, M., Harder, W., and Schleifer, K.-H., editors. The prokaryotes: a handbook on the biology of bacteria: ecophysiology, isolation, identification, applications, Vol. 4. New York: Springer; p. 3732—3744. [Google Scholar]
  13. Pagel M., Meade A.  2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol.  53:571–581. [DOI] [PubMed] [Google Scholar]
  14. Pan V.Y., Chen Z.Q.  1999. The complexity of the matrix eigenproblem. Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing STOC ’99 ACM, New York, NY, USA. p. 507–516. [Google Scholar]
  15. Penny D., McComish B.J., Charleston M.A., Hendy M.D.  2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol.  53:711–753. [DOI] [PubMed] [Google Scholar]
  16. Singer C.E., Ames B.N.  1970. Sunlight ultraviolet and bacterial DNA base ratios. Science  170:822–826. [DOI] [PubMed] [Google Scholar]
  17. Steel M.  2005. Should phylogenetic models be trying to ‘fit an elephant’?  Trends Genet.  21:307–309. [DOI] [PubMed] [Google Scholar]
  18. Suchard M.A., Lemey P., Baele G., Ayres D.L., Drummond A.J., Rambaut A.  2018. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4:vey016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Suchard M.A., Rambaut A.  2009. Many-core algorithms for statistical phylogenetics. Bioinformatics  25:1370–1376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Tuffley C., Steel M.  1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci.  147:63–91. [DOI] [PubMed] [Google Scholar]
  21. Yang Z.  1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol.  39:306–314. [DOI] [PubMed] [Google Scholar]
  22. Yang Z.  1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol.  11:367–372. [DOI] [PubMed] [Google Scholar]

Articles from Systematic Biology are provided here courtesy of Oxford University Press

RESOURCES