Abstract
Methods for the probabilistic mapping of the history of state changes over a phylogeny have been available for the study of molecular evolution for over two decades. In spite of this, such methods have yet to be widely adopted by molecular evolutionary biologists. Here, we re-emphasize the potential of these stochastic mappings with examples pertaining to the study of the amino acid replacement process. We show how the features targeted by today’s top-performing models could have been highlighted in a full phylogenetic context with an amino acid-level Jukes-Cantor model. We also demonstrate how stochastic mappings could be used for detecting CpG hypermutability, a site-dependent feature. We hope for a larger project utilizing mapping-based methods to provide a fuller characterization of molecular evolution, and to prioritize and assess modeling efforts. Finally, we draw attention to the options available within the PhyloBayes(-MPI) software for producing mappings under a large set of evolutionary models.
Keywords: Amino acid replacement process, Nucleotide substitution, Model assessment, Bayesian inference
Introduction
In a molecular phylogenetic context, the observed data consist of a multiple sequence alignment. Assuming a given phylogeny, these observed data are sometimes augmented by a history, for each site, of state change events (between, e.g., nucleotides, amino acids, or codons) along the branches of the tree. This history, or mapping, includes the timing of state changes along the branches, and their exact nature, and must be compatible with the states in the alignment; the final states of mappings along terminal branches must be the states observed in the alignment. Mappings are produced under an assumed model. Classically, the assumed model was one of maximum parsimony (Fitch 1971), but since Nielsen’s (Nielsen 2002) seminal paper, mappings have also been drawn under probabilistic models. Since these mappings are conditional on the alignment and on parameters of the model, they are often referred to as posterior mappings. Nielsen’s insight was to apply well-known stochastic process simulation methods over a phylogeny, with a rejection-based sampling to collect only mappings compatible with (or conditional on) the alignment. Other methods from the stochastic process literature have since also been emphasized as useful for the purpose of a directed sampling (i.e., without needing a rejection step) of posterior mappings (see, e.g., Rodrigue et al. 2008; Hobolth and Stone 2009).
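The rejection-based idea can be illustrated on a single branch whose two end states are known. The sketch below is only a toy illustration under stated assumptions: it uses a 20-state Jukes-Cantor-like process with a replacement rate of 1, and the function names (`simulate_branch`, `sample_conditional_history`) are hypothetical, not part of any published software. A full implementation would also sample states at internal nodes of the phylogeny.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simulate_branch(start, length, rng):
    """Forward-simulate a history along one branch (unconstrained).

    Returns a list of (time, new_state) events. Replacements occur at rate 1
    and the new state is drawn uniformly among the 19 other amino acids.
    """
    events, state, t = [], start, 0.0
    while True:
        t += rng.expovariate(1.0)  # exponential waiting time to the next event
        if t >= length:
            return events
        state = rng.choice([a for a in AMINO_ACIDS if a != state])
        events.append((t, state))

def final_state(start, events):
    """State at the end of the branch given its list of events."""
    return events[-1][1] if events else start

def sample_conditional_history(start, end, length, rng, max_tries=100_000):
    """Rejection sampling: keep the first simulated history that ends in `end`."""
    for _ in range(max_tries):
        events = simulate_branch(start, length, rng)
        if final_state(start, events) == end:
            return events
    raise RuntimeError("no compatible history found")

rng = random.Random(1)
history = sample_conditional_history("A", "G", 0.5, rng)
print(history)
```

Most simulated histories are discarded when the two end states differ and the branch is short, which is why directed (rejection-free) samplers are often preferred in practice.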
The conditional nature of the reconstruction is critical. In forcing mappings to be compatible with the states in the alignment, posterior mappings can reflect far more features than those accounted for by the evolutionary model invoked to produce them (more on this below, with examples). As such, posterior mappings can be studied in their own right to uncover features of the evolutionary process, or can be contrasted against posterior predictive mappings, which are pure (unconstrained by the alignment states) simulations of the state change process, conditional on parameter values. In our opinion, extracting features of the evolutionary process from posterior mappings is an under-exploited approach, as is the application of the posterior predictive framework to mappings rather than alignments (Bollback 2005).
To date, most efforts in using posterior mappings have focused on one or a few features, for use in a specific application, with mappings drawn under one or a few evolutionary models. Following Nielsen’s paper, examples include using mappings to devise tests of co-evolution across sites (e.g., Dimmic et al. 2005; Dutheil et al. 2005) or across substitution contexts (Lee et al. 2015, 2016), to provide fast means of computing dN/dS ratios (e.g., Minin and Suchard 2008; Lemey et al. 2012; Guéguen and Duret 2017), to test for homoplasy (Lartillot et al. 2007), to test for horizontal gene transfer events (Cohen et al. 2010), or to test for time-heterogeneity of the substitution process (Romiguier et al. 2012). They have also been used as a basis for implementing otherwise intractable evolutionary models (e.g., Robinson et al. 2003, 2005), or simply for speeding up the implementation of others (e.g., Lartillot 2006; Rodrigue et al. 2008), but have often been utilized only as a computational device for demarginalizing the phylogenetic likelihood function within Markov chain Monte Carlo (MCMC) algorithms.
Amino Acid-Level Examples
Setting aside concerns with devising specific tests, or implementing novel models, we believe explorations of posterior mappings, under relatively simple and currently available models, could be useful in uncovering many features of the evolutionary process. We first illustrate this process with amino acid-level examples.
Exploring Posterior Mappings of Amino Acid Replacements
As a first example, we step back in modeling advances, to an amino acid-level Jukes-Cantor (JC; Jukes and Cantor 1969) model (i.e., a Poisson process). Under such a model, the dwell time between amino acid replacement events follows an exponential distribution with a rate parameter that can be fixed so that branch lengths are interpreted as the expected number of replacements per site, and the process is assumed homogeneous across the different pairwise types of replacements, homogeneous across the branches of the phylogeny, and homogeneous across the positions of the alignment. In other words, with such a model, there are no free parameters to the amino acid replacement process, which does not make any distinction between the twenty amino acid states. In a Bayesian context, with a fixed tree topology, an MCMC sampler only operates on the branch lengths and their hyperparameter. We used PhyloBayes-MPI (Lartillot et al. 2013) to produce a posterior sample of sets of branch lengths conditional on a classic alignment of 17 vertebrate β-globin genes (Yang et al. 2000), translated to an amino acid alignment. We kept the tree topology fixed as in Yang et al. (2000). We then drew posterior and posterior predictive mappings over this sample, still using the JC model at the amino acid level.
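For reference, the transition probabilities implied by a 20-state JC model can be written in closed form. The expressions below assume the normalization described above, in which the total replacement rate out of any state equals 1, so that a branch of length $t$ carries $t$ expected replacements per site; the notation $P_{ij}(t)$ is introduced here for illustration only.

$$
P_{ii}(t) = \frac{1}{20} + \frac{19}{20}\, e^{-20t/19}, \qquad
P_{ij}(t) = \frac{1}{20} - \frac{1}{20}\, e^{-20t/19} \quad (i \neq j).
$$

Each row sums to one, and every state approaches probability $1/20$ as $t$ grows, which is why mappings simulated under this model show no preference for any amino acid or replacement type.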
At the time the JC model was proposed (Jukes and Cantor 1969), most efforts at teasing out features of the amino acid replacement process consisted of counting schemes on pairwise sequence alignments, from the pioneering work of Dayhoff (1978) and others (Zuckerkandl and Pauling 1965; Anfinsen 1973; Doolittle 1981; Chothia and Lesk 1986). Retrospectively, one realizes that posterior mappings, drawn under this simplest probabilistic model of amino acid replacement, were not far from reach. Figure 1 displays some features crudely extracted from the mappings. Panel 1A takes one of Nielsen’s first examples of the application of the approach, which consists of assessing the heterogeneity in rates across sites by calculating the variance in the number of replacement events across the positions of the alignment. Some positions (columns) of the alignment are in the same amino acid state for all (or nearly all) species, whereas other positions are in several different states across species. Posterior mappings at the former positions require very few (or no) replacement events to be compatible with the alignment states, whereas mappings at the latter need many replacement events in order to end up with several different states at the ends of the terminal branches of the phylogeny. Overall, this would produce a high variance in the number of events across sites, and is expected from real data given the heterogeneity in selective pressures across the positions of a protein such as globin. The posterior distribution of the variance in the number of replacement events, obtained from the set of posterior mappings, is displayed in the blue histogram of panel 1A, and shows much higher values than the posterior predictive distribution, obtained from the set of posterior predictive mappings (orange).
The posterior predictive mappings are expected to have low variance in the number of events across sites, given the assumption of homogeneity of the JC model used in their simulation. Overall, this shows that the data have high heterogeneity in overall rates across sites, and an explicit modeling of this feature (e.g., Yang 1994) is likely warranted.
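The contrast in panel 1A can be mimicked with a few lines of code. This is a rough sketch under stated assumptions: the per-site event counts in `posterior_counts` are invented for illustration (they are not the β-globin results), and the predictive counts are drawn from a Poisson distribution whose mean is the average posterior count, which stands in for the total tree length under a homogeneous JC model with rate 1.

```python
import math
import random
from statistics import mean, pvariance

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

# Hypothetical number of replacement events at each of 144 sites,
# as counted from ONE posterior mapping (illustrative values only).
posterior_counts = [0] * 60 + [1] * 40 + [3] * 24 + [8] * 20
observed_variance = pvariance(posterior_counts)

# Under a homogeneous JC model, the number of events at a site in a
# predictive mapping is Poisson with mean equal to the total tree length.
rng = random.Random(42)
tree_length = mean(posterior_counts)  # assumed plug-in value
predictive_variances = [
    pvariance([poisson_draw(tree_length, rng) for _ in posterior_counts])
    for _ in range(200)
]

print(f"posterior mapping variance:   {observed_variance:.2f}")
print(f"largest predictive variance:  {max(predictive_variances):.2f}")
```

In a real analysis the posterior side would also be a distribution, with one variance per posterior mapping, rather than the single value used here.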
Panel 1B of figure 1 uses the same posterior mappings to study the heterogeneity of replacement rates between all possible pairs of amino acid states. It shows, as the area of blue circles, in the lower-left triangle of the 20 by 20 graphic, the proportion of events within a posterior mapping that are of each pairwise type, averaged over the posterior sample. A clear pattern emerges whereby the pairs of amino acids with similar biophysical properties tend to have the highest proportions of events. Indeed, even with this small alignment of 17 vertebrate β-globin sequences, 144 amino acids in length, the proportions have a high correlation with LG amino acid exchangeabilities (0.668 Pearson correlation coefficient: p-value ). The posterior predictive mappings, by the definition of the JC model, lead to orange circles (top-right triangle of the graphic) of very even areas. In panel 1C, we show the proportions of time spent in each amino acid state in posterior mappings, averaged across sites, and averaged over the posterior sample. Again, these proportions are quite similar to the LG amino acid frequencies (0.758 Pearson correlation coefficient: p-value ). As before, posterior predictive mappings lead to a very even amount of time spent in the 20 amino acids, which is expected given the simulation model. Taken together, these results suggest markedly uneven amino acid exchangeabilities and frequencies, with empirically plausible values, and an explicit modeling of these features, say, as an LG model, or even a GTR-like model, may be warranted.
On a methodological level, we note that setting the values of parameters of a GTR model to the posterior mean of the proportions of pairwise exchanges, for the relative exchangeability parameters, and to the posterior mean proportion of time spent in each state, for the state frequency parameters, would amount to a crude single iteration of a mapping-based Monte Carlo expectation-maximization algorithm (see Rodrigue et al. 2007) for finding posterior modes, but already yields plausible values.
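The two summaries involved (proportions of pairwise exchanges and proportions of time spent in each state) are simple to compute once a mapping is in hand. The sketch below assumes a hypothetical in-memory representation in which a mapping is a list of branch histories, each a list of `(state, dwell_time)` segments in order along the branch; this is not the PhyloBayes file format, and averaging over the posterior sample is left out.

```python
from collections import Counter

def summarize_mapping(mapping):
    """Return (pairwise event proportions, time proportions) for one mapping.

    `mapping` is a list of branch histories; each history is a list of
    (state, dwell_time) segments, consecutive segments being separated
    by one replacement event.
    """
    pair_counts, time_in_state = Counter(), Counter()
    for branch in mapping:
        for state, dwell in branch:
            time_in_state[state] += dwell
        # Each pair of consecutive segments is one event (direction ignored).
        for (a, _), (b, _) in zip(branch, branch[1:]):
            pair_counts[tuple(sorted((a, b)))] += 1
    n_events = sum(pair_counts.values())
    total_time = sum(time_in_state.values())
    pair_props = {p: c / n_events for p, c in pair_counts.items()} if n_events else {}
    time_props = {s: t / total_time for s, t in time_in_state.items()}
    return pair_props, time_props

# Toy mapping with three branches (illustrative values only).
toy_mapping = [
    [("A", 0.2), ("S", 0.1)],
    [("A", 0.3)],
    [("S", 0.05), ("T", 0.15), ("S", 0.2)],
]
pair_props, time_props = summarize_mapping(toy_mapping)
print(pair_props)
print(time_props)
```

Averaging `pair_props` and `time_props` over many posterior mappings would give the values that could be plugged in as crude exchangeability and frequency settings.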
We next explored mappings for across-site pattern heterogeneity, i.e., fine features of the amino acid substitution process, with respect to heterogeneity in the types of substitutions that occur across sites. In our exploration, we simply focus on site-specific calculations of the proportions of time spent in each of the twenty states. We seek to determine whether certain proportion profiles tend to occur repeatedly across the alignment, or more generally, whether the set of proportion profiles across sites tends to cluster into meaningful subgroups. To do this, we simply clustered all site-specific amino acid proportion profiles into a limited number of clusters, 10, using the K-means clustering algorithm (MacQueen 1967) as implemented in the scikit-learn Python package (Pedregosa et al. 2011). We then extracted the average (centroid) proportion profiles from each cluster to be used as the representative amino acid proportion profiles. The representative proportion profiles obtained from both posterior (panel 1D) and posterior predictive (panel 1E) mappings were plotted using logograms generated with a modified version of cogent3 (Knight et al. 2007). The profiles from panel 1D are skewed and reveal a biophysical reality of the studied protein: some sites prefer small non-polar amino acids (blue), while other sites prefer non-polar aliphatic amino acids (green), for example. The profiles computed from the posterior predictive mappings are uniform, with no discernible preference for any amino acid, as expected under the homogeneous JC model.
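The clustering step can be sketched without external dependencies. The code below is a plain implementation of Lloyd's K-means iteration, used here as a stand-in for the scikit-learn implementation mentioned above; the toy profiles cover 3 states rather than 20 and are grouped into 2 clusters rather than 10, and all values are invented for illustration.

```python
import random

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(profiles, k, rng, n_iter=100):
    """Lloyd's algorithm: returns (centroids, clusters) for a list of profiles."""
    centroids = [list(p) for p in rng.sample(profiles, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean profile of its cluster.
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, clusters

# Toy site-specific profiles: proportion of time spent in each of 3 states.
site_profiles = [
    [0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90], [0.10, 0.10, 0.80], [0.10, 0.05, 0.85],
]
centroids, clusters = kmeans(site_profiles, k=2, rng=random.Random(0))
for centroid in centroids:
    print([round(x, 3) for x in centroid])
```

The centroids play the role of the representative profiles shown as logograms; with real data, the number of clusters is a choice that deserves some exploration.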
Assessing Amino Acid Replacement Modeling Success
Overall, our study of posterior mappings suggests that a model similar to CAT-GTR+Γ (Lartillot and Philippe 2004; Lartillot 2006) may be warranted for our data set. The CAT-GTR+Γ model has consistently come out on top in likelihood-based model comparisons (e.g., Lartillot 2006; Lartillot et al. 2007; Bujaki and Rodrigue 2022; Lartillot 2023), so it is not too surprising that the features captured by such a model would be reflected in the mappings produced by simpler models. And now that such a model has been developed, we can assess how well the posterior predictive mappings resemble the posterior mappings. Leaving aside formal posterior predictive tests, we simply repeat our graphical contrasting using a finite version of CAT-GTR+Γ, built from 10 empirically predefined amino acid preference profiles (Quang et al. 2008), which we call C10+GTR+Γ.
Panel 2A of Fig. 2 shows that the distributions of the variance in the number of substitutions recovered from posterior and posterior predictive mappings with C10+GTR+Γ overlap, but still exhibit differences. This result suggests that the overall rate heterogeneity has been somewhat resolved, but there is still more variance to account for in the posterior mappings. One possibility is that the overall rate distribution across the alignment does not follow a gamma law; other approaches, such as mixtures of gamma distributions (Mayrose et al. 2005), or non-parametric methods (Huelsenbeck and Suchard 2007), or even using a semi-supervised learning approach (Silvestro et al. 2024), could be considered. Interestingly, when using the more sophisticated substitution model, namely C10+GTR+Γ, there is much more variance in the substitution rate across sites recovered with both types of mappings than that recovered with the simplest model from the previous figure (Fig. 1: panel 1A). This is partly because more sophisticated models allow for more complex evolutionary histories involving more hidden amino acid replacements at certain sites (Lartillot et al. 2007).
Fig. 1.
Comparisons between posterior and posterior predictive mappings generated under the JC model for some key features of the evolutionary process of the β-globin gene, including 17 vertebrate sequences, 144 amino acids in length. A Variance in the number of amino acid replacement events across sites calculated from the posterior (blue) and posterior predictive (orange) mappings. B Proportion of amino acid substitution events calculated from posterior (lower triangle, blue) and posterior predictive (upper triangle, orange) mappings. C Proportion of time spent in each amino acid calculated from posterior (blue) and posterior predictive (orange) mappings. D Average profiles (k-means centroids) of the proportion of time spent in each amino acid calculated from posterior mappings. E Average profiles (k-means centroids) of the proportion of time spent in each amino acid calculated from posterior predictive mappings. Color scheme used in (D-E) logograms: AGST (small non-polar: blue), FYW (aromatic: orange), ILVM (non-polar aliphatic: green), HKR (polar positive: red), DE (polar negative: purple), NQ (polar neutral: cyan), P (proline: brown), C (cysteine: pink) (color figure online)
Fig. 2.
Comparisons between posterior and posterior predictive mappings generated under the C10+GTR+Γ model (using 10 predefined amino acid profiles), a finite version of the CAT-GTR+Γ model, for some key features of the evolutionary process of the β-globin gene. A Variance in the number of amino acid substitution events across sites calculated from the posterior (blue) and posterior predictive (orange) mappings. B Proportion of amino acid substitution events calculated from posterior (lower triangle, blue) and posterior predictive (upper triangle, orange) mappings. C Proportion of time spent in each amino acid calculated from posterior (blue) and posterior predictive (orange) mappings. D Average profiles (k-means centroids) of the proportion of time spent in each amino acid calculated from posterior mappings. E Average profiles (k-means centroids) of the proportion of time spent in each amino acid calculated from posterior predictive mappings. Color scheme used in (D-E) logograms: AGST (small non-polar: blue), FYW (aromatic: orange), ILVM (non-polar aliphatic: green), HKR (polar positive: red), DE (polar negative: purple), NQ (polar neutral: cyan), P (proline: brown), C (cysteine: pink) (color figure online)
Both the proportion of events (panel 2B) and the proportion of time spent in each amino acid (panel 2C) are very similar between the posterior and posterior predictive mappings, with Pearson correlation coefficients of 0.983 (p-value ) and 0.839 (p-value ), respectively. The model has good performance for these features.
Finally, the logograms obtained from posterior predictive mappings (panel 2E) under C10+GTR+Γ are at least slightly suggestive of biologically plausible amino acid profiles, with some dominated by I, L and V, or by D and E, or by K and R. We note that we used a finite version of CAT-GTR+Γ with predefined empirical profiles that are not fully adapted to our gene of interest, which may explain such unfocused logograms. The use of other empirical profiles may be worth exploring for this dataset.
Altogether, by examining posterior predictive mappings in contrast to posterior mappings, we can see that we are at least partially capturing the features set out by the modeling objectives of C10+GTR+Γ, while suggesting further rounds of modeling explorations.
A Codon-Level Example
The previous examples are based on models or mapping statistics with the same overall form under the assumption of site independence. In this section, we explore using mappings generated under a simple, site-independent model to detect site-dependent features. Our example uses a simple codon substitution model based on the framework proposed by Muse and Gaut (1994), referred to in Rodrigue et al. (2008) as the MG-F14. The site-dependent feature we explore is that of CpG hypermutability. Specifically, for a given mapping, we count the number of CpG transitions (that is, CpG to TpG as well as CpG to CpA), denoted as C, while accounting for the CpG transition opportunity over the phylogeny. We do this by dividing C by the sum of all dwell times spent in CpG states, denoted T. We refer to this ratio, C/T, as the CpG transition rate, although this expression should be taken loosely, since it is a very crude approximation. Note that determining CpG transitions requires knowledge of the current state at adjacent nucleotide sites along the alignment, which may be within a given codon, or straddling two consecutive codons.
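The counting rule can be illustrated on the joint history of one pair of adjacent nucleotide sites along a single branch. This sketch rests on assumptions made for illustration: the event representation `(time, position, new_nucleotide)` and the function name `cpg_counts` are hypothetical, and a full calculation would sum the counts and dwell times over all branches and all adjacent site pairs.

```python
def cpg_counts(start, events, branch_length):
    """Count CpG transitions and CpG dwell time along one branch.

    `start` is the dinucleotide at the top of the branch (e.g., "CG");
    `events` is a time-sorted list of (time, position, new_nucleotide),
    with position 0 for the 5' site and 1 for the 3' site.
    Returns (C, T): number of CpG transitions and time spent as CpG.
    """
    state, t_prev = list(start), 0.0
    n_transitions, cpg_time = 0, 0.0
    for t, pos, nuc in events:
        if state == ["C", "G"]:
            cpg_time += t - t_prev
            # CpG -> TpG (C to T at 5' site) or CpG -> CpA (G to A at 3' site)
            if (pos, nuc) in {(0, "T"), (1, "A")}:
                n_transitions += 1
        state[pos] = nuc
        t_prev = t
    if state == ["C", "G"]:
        cpg_time += branch_length - t_prev
    return n_transitions, cpg_time

# Toy history: CG -> TG at 0.2, TG -> CG at 0.5, CG -> CA at 0.6.
toy_events = [(0.2, 0, "T"), (0.5, 0, "C"), (0.6, 1, "A")]
C, T = cpg_counts("CG", toy_events, branch_length=1.0)
print(C, T, C / T)
```

Restricting the count to particular codon positions, or to synonymous changes only, amounts to filtering the events before they are tallied.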
The test is based on the idea that posterior mappings may carry information about CpG hypermutability, whereas posterior predictive mappings do not, since the substitution model used to produce them ignores this feature. The test calculates the proportion of times the CpG transition rate computed from a posterior substitution mapping generated from a set of parameters $\theta_k$, written here as $r^{\mathrm{post}}_k$, is greater than the CpG transition rate computed from the predictive substitution mapping also generated from parameters $\theta_k$, written here as $r^{\mathrm{pred}}_k$. For a sample of K sets of parameter values, the test thus produces a crude (uncalibrated) posterior predictive p-value, denoted ppp-value, given by the following equation:
$$\text{ppp-value} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left( r^{\mathrm{post}}_k > r^{\mathrm{pred}}_k \right) \qquad (1)$$
where $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the posterior CpG transition rate is greater than the predicted rate and 0 otherwise.
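Equation 1 reduces to a paired comparison followed by an average. A minimal sketch, assuming the posterior and posterior predictive CpG transition rates have already been computed and stored in two lists aligned by parameter set (the function name and the numbers are illustrative only):

```python
def ppp_value(posterior_rates, predictive_rates):
    """Proportion of parameter sets whose posterior rate exceeds the predictive rate."""
    if not posterior_rates or len(posterior_rates) != len(predictive_rates):
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(post > pred for post, pred in zip(posterior_rates, predictive_rates))
    return wins / len(posterior_rates)

# Toy rates for K = 4 parameter sets (invented values).
posterior_rates = [5.0, 4.2, 6.1, 3.9]
predictive_rates = [2.0, 4.5, 2.2, 1.8]
print(ppp_value(posterior_rates, predictive_rates))
```

Values close to 1 indicate that the posterior mappings consistently show more CpG transitions per unit of CpG dwell time than the model would predict.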
Note that CpG transition rates can be easily calculated to account for the position of CpGs in codons (positions 1-2 or 2-3) or at the interface between two codons (positions 3-1). It is also possible to filter for synonymous or non-synonymous CpG transitions, for example.
Simulation Study
To assess the reliability of this stochastic mapping approach to detecting signals of CpG hypermutability, we used a complex site-dependent codon substitution model from Laurin-Lemay et al. (2018). This model is built from the MG-F14 model, but includes a parameter, , capturing CpG hypermutability.
The parameter values needed for generating synthetic codon alignments were obtained from the posterior distributions from the analysis of 10 mammalian protein-coding gene alignments (Laurin-Lemay et al. 2018) using the MG-F14 codon substitution model implemented in PhyloBayes-MPI (Rodrigue and Lartillot 2013). As for , we set it to three different values in three different sets of simulations to assess the test in the absence of CpG hypermutability (), and in its presence at two different levels ( and ). The synthetic alignments were then produced using the jump-chain simulation algorithm described in Laurin-Lemay et al. (2022).
Each synthetic alignment was analyzed with the MG-F14 using PhyloBayes-MPI, producing posterior and posterior predictive mappings for each set of parameters drawn from the MCMC sample. For each analysis, we sampled 500 posterior predictive and 500 posterior mappings (1000 mappings in total) across 50 sets of posterior values, with 10 replicates per set of values. Each experimental condition (three values) was replicated 100 times (10 genes × 10 synthetic alignments). We then calculated the proportions of false positives and false negatives under each condition using our previously defined simple test (Eq. 1).
The test based on the MG-F14 model had a low rate of false positives (Table 1: 6%), as expected at a 5% significance threshold. It also had no false negatives under either level of CpG hypermutability tested (Table 1). These findings suggest that this simple test is reliable and could be used for detecting CpG hypermutability in real data.
Table 1.
False negative and false positive rates of the CpG hypermutability test obtained with the MG-F14 model on synthetic codon alignments generated using the same model, incorporating varying levels of CpG hypermutability (i.e., )
| Type of controls | Models used to generate the synthetic alignments | Positive tests (%) |
|---|---|---|
| Negative | | 6 |
| Positive | | 100 |
| Positive | | 100 |
Empirical Study
Following the same protocol as in the simulation study, we applied the CpG hypermutability test to 137 mammalian codon sequence alignments previously studied in Laurin-Lemay et al. (2018). When considering all dinucleotide contexts (codon positions 1-2, 2-3, 3-1) and both synonymous and non-synonymous substitutions, we recovered 86%, 88%, and 93% of positive tests when using significance thresholds of 1%, 5% and 10%, respectively (Table 2). We note that all of these genes were identified as having significant levels of CpG hypermutability using an explicit model for this purpose (Laurin-Lemay et al. 2018).
Table 2.
Proportion of positive CpG hypermutability tests (%) obtained with the MG-F14 substitution model on 137 mammalian gene alignments. The column titles indicate the counting rules applied to substitution mappings to compute the test statistics. Values in parentheses correspond to the proportions of positive tests at different significance thresholds: 1%, 5%, and 10%, respectively. Bold values indicate the highest proportion of positive tests for each significance threshold
| Codon positions | Non-synonymous + Synonymous (%) | Synonymous (%) | Non-synonymous (%) |
|---|---|---|---|
| 1–2 + 2–3 + 3–1 | (86; 88; 93) | (80; 86; 88) | (89; 92; 95) |
| 1–2 | (42; 45; 50) | (0; 0; 0) | (42; 45; 50) |
| 2–3 | (; ; 98) | (88; 94; ) | (75; 81; 83) |
| 3–1 | (91; 92; 94) | (81; 87; 88) | (82; 91; 93) |
Interestingly, when we broke down the test by codon position, we observed different levels of positive results (see Table 2). As expected, we found low levels of positive tests at positions 1 and 2 (e.g., 45% at a 5% threshold of significance), where all substitutions are non-synonymous and imply significant biochemical changes. Codon positions 1 and 2 can undergo changes from arginine (CGN) to histidine (CAY), glutamine (CAR), cysteine (TGY), tryptophan (TGG), or the stop codon TGA. Consequently, no test could yield a positive result when focused on the synonymous part of the CpG transitions that occur at positions 1 and 2, because no synonymous substitutions were available.
Conversely, we found the highest levels of positive tests at positions 2 and 3 (e.g., 98% at a 5% threshold of significance), where both synonymous and non-synonymous substitutions can occur. Further analysis revealed that the signal for positive tests at positions 2 and 3 was primarily due to synonymous substitutions (e.g., 94% at a 5% threshold), as opposed to 81% for non-synonymous CpG transitions at the same significance level. However, the combination of synonymous and non-synonymous CpG transitions yielded the strongest signal.
Overall, these experiments suggest that our crude stochastic mapping approach can reveal CpG hypermutability for most data sets, while being far simpler than the complex model and elaborate conditional approximate Bayesian computation procedure of our previous work (Laurin-Lemay et al. 2018). We personally take this lesson to heart for our future work: before embarking on a laborious modeling project, take the time to conduct the appropriate prospecting using stochastic mapping.
Conclusions and Future Work
Although several authors (e.g., Bollback 2005; Minin and Suchard 2008; Romiguier et al. 2012) have repeatedly predicted a bright future for the use of posterior mappings in molecular evolution, the approach has yet to take off as once hoped. We see this as a missed opportunity. This is all the more true given that the current bioinformatics situation is historically unique, with access to ever-growing genomic resources (e.g., Consortium Z 2020) and access to increasingly sophisticated phylogenetic models (e.g., Rodrigue et al. 2020). As demonstrated, the approach can help both to guide model development and to assess modeling progress.
Much work remains to be done. Formalizing the contrast between posterior predictive and posterior mappings in calibrated tests is an important area of future work. More imaginative mapping-based measurements could help reveal signals for many features that are not captured by current models. For example, in terms of mutation processes, we could look at the frequency of codon changes implying doublet or triplet mutations with a simple JC-type model operating at the level of codon states (a 61 by 61 JC model) rather than the more common nucleotide and amino acid states. Amino acid doublet replacements, or nearest neighbor interactions, could be studied more broadly under a 400 by 400 amino acid dipeptide replacement matrix (Gonnet et al. 1994). The extent to which we could envisage n-tuple state matrices to directly produce posterior mappings is an open question. A first step beyond the point of tractability would be to use a simple model to propose posterior mappings, and a Metropolis-Hastings accept/reject rule under the target model, as in Robinson et al. (2003), but without the extra complications of sampling parameter values as well. From a demographic point of view, it might also be possible to study the impact of effective population size on the occurrence of biochemically radical or conservative amino acid replacement events along a phylogeny, the expectation being that sub-trees of the phylogeny involving species with low effective population sizes would have more radical amino acid replacements than sub-trees involving species with high effective population sizes. Producing mappings with other classes of models, including covarion-like approaches (e.g., Tuffley and Steel 1998; Zhou et al. 2010) and the PoMo model (De Maio et al. 2015), should also be explored further.
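As a reminder of what the accept/reject rule mentioned above would look like, the expression below gives the standard Metropolis-Hastings acceptance probability for an independence sampler. The notation is introduced here for illustration and is not taken from Robinson et al. (2003): $m$ is the current mapping, $m'$ is a mapping proposed from the simple model, $q(\cdot)$ is the probability density of a mapping under the simple model, and $p(\cdot)$ is its density under the target model, both for fixed parameter values and both restricted to mappings compatible with the alignment.

$$
A(m \to m') = \min\left\{ 1,\; \frac{p(m')\, q(m)}{p(m)\, q(m')} \right\}
$$

The normalizing constants of the two conditional distributions cancel in the ratio, so only the unnormalized mapping densities under each model are needed.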
In studying protein evolution, many works have focused on the detailed sequence of amino acid replacements using in silico simulations (e.g., Parisi and Echave 2001; Pollock et al. 2012). Posterior mappings offer the possibility of studying actual (albeit inferred) histories of amino acid replacements. Ancestral sequence reconstruction and protein resurrection studies generally focus on the inferred states at the internal nodes of phylogenetic trees (e.g., Harms and Thornton 2010; Eick et al. 2016; Xie et al. 2021). Posterior mappings could supplement such studies, enabling a resurrection of a set of proteins as they might have evolved over time, which would offer a detailed characterization of protein evolution.
Mappings in PhyloBayes(-MPI) can be generated under numerous models (Lartillot et al. 2013), including site-heterogeneous models such as CAT and CAT-GTR (Lartillot and Philippe 2004), as well as a wide set of codon substitution models (Rodrigue and Lartillot 2013). From the mapping files produced by the software, it is up to the investigator to come up with new ways of studying the inferred evolutionary process. Other software packages also offer the possibility of stochastic mapping (Drummond and Rambaut 2007; Revell 2011), but under a more limited set of models.
It is likely that the informational content of posterior mappings will be ahead of that of posterior predictive mappings for some time, while we progressively incorporate our biological understanding into evolutionary models (Rodrigue and Philippe 2010). Though it is unclear, we dare speculate that sufficient modeling efforts may eventually bring us to a point where we (or recent artificial intelligence systems, e.g., Trost et al. 2023) are unable to distinguish between posterior and posterior predictive mappings; the evolutionary histories anticipated by our models would match closely those inferred from real data.
Acknowledgements
We thank two anonymous reviewers for their constructive feedback. This work was funded by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada to NR. We also thank Wade Hong and Julio Aguilar-Hernandez for their help in managing our computer cluster.
Data Availability
The software and data used in the study are available at https://github.com/Simonll/Stochastic-character-mapping.
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
References
- Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181(4096):223–230. 10.1126/science.181.4096.223 [DOI] [PubMed] [Google Scholar]
- Bollback JP (2005) Posterior mapping and posterior predictive distributions. Springer, New York, pp 439–462. 10.1007/0-387-27733-1_16 [Google Scholar]
- Bujaki T, Rodrigue N (2022) Bayesian cross-validation comparison of amino acid replacement models: contrasting profile mixtures, pairwise exchangeabilities, and gamma-distributed rates-across-sites. J Mol Evol 90(6):468–475. 10.1007/s00239-022-10076-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chothia C, Lesk A (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823–826. 10.1002/j.1460-2075.1986.tb04288.x [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cohen O, Gophna U, Pupko T (2010) The complexity hypothesis revisited: connectivity rather than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol 28(4):1481–1489. 10.1093/molbev/msq333 [DOI] [PubMed] [Google Scholar]
- Consortium Z (2020) A comparative genomics multitool for scientific discovery and conservation. Nature 587(7833):240–245. 10.1038/s41586-020-2876-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dayhoff MO (ed) (1978) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation, Washington, D.C [Google Scholar]
- De Maio N, Schrempf D, Kosiol C (2015) Pomo: an allele frequency-based approach for species tree estimation. Syst Biol 64(6):1018–1031 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dimmic MW, Hubisz MJ, Bustamante CD, Nielsen R (2005) Detecting coevolving amino acid sites using Bayesian mutational mapping. Bioinformatics 21(Suppl 1):i126–i135. 10.1093/bioinformatics/bti1032 [DOI] [PubMed] [Google Scholar]
- Doolittle RF (1981) Similar amino acid sequences: Chance or common ancestry? Science 214(4517):149–159. 10.1126/science.7280687 [DOI] [PubMed] [Google Scholar]
- Drummond AJ, Rambaut A (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7:1–8
- Dutheil J, Pupko T, Jean-Marie A, Galtier N (2005) A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol 22(9):1919–1928. 10.1093/molbev/msi183
- Eick GN, Bridgham JT, Anderson DP, Harms MJ, Thornton JW (2016) Robustness of reconstructed ancestral protein functions to statistical uncertainty. Mol Biol Evol. 10.1093/molbev/msw223
- Fitch WM (1971) Toward defining the course of evolution: Minimum change for a specific tree topology. Syst Biol 20(4):406–416. 10.1093/sysbio/20.4.406
- Gonnet G, Cohen M, Benner S (1994) Analysis of amino acid substitution during divergent evolution: the 400 by 400 dipeptide substitution matrix. Biochem Biophys Res Commun 199(2):489–496. 10.1006/bbrc.1994.1255
- Guéguen L, Duret L (2017) Unbiased estimate of synonymous and nonsynonymous substitution rates with nonstationary base composition. Mol Biol Evol 35(3):734–742. 10.1093/molbev/msx308
- Harms MJ, Thornton JW (2010) Analyzing protein structure and function using ancestral gene reconstruction. Curr Opin Struct Biol 20(3):360–366. 10.1016/j.sbi.2010.03.005
- Hobolth A, Stone EA (2009) Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. Ann Appl Stat. 10.1214/09-aoas247
- Huelsenbeck JP, Suchard MA (2007) A nonparametric method for accommodating and testing across-site rate variation. Syst Biol 56(6):975–987. 10.1080/10635150701670569
- Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Mammalian protein metabolism, Elsevier, pp 21–132, 10.1016/b978-1-4832-3211-9.50009-7
- Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, Eaton M, Hamady M, Lindsay H, Liu Z, Lozupone C, McDonald D, Robeson M, Sammut R, Smit S, Wakefield MJ, Widmann J, Wikman S, Wilson S, Ying H, Huttley GA (2007) PyCogent: a toolkit for making sense from sequence. Genome Biol 8(8):R171. 10.1186/gb-2007-8-8-r171
- Lartillot N (2006) Conjugate Gibbs sampling for Bayesian phylogenetic models. J Comput Biol 13(10):1701–1722. 10.1089/cmb.2006.13.1701
- Lartillot N (2023) Identifying the best approximating model in Bayesian phylogenetics: Bayes factors, cross-validation or WAIC? Syst Biol 72(3):616–638. 10.1093/sysbio/syad004
- Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21(6):1095–1109. 10.1093/molbev/msh112
- Lartillot N, Brinkmann H, Philippe H (2007) Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 7(Suppl 1):S4. 10.1186/1471-2148-7-s1-s4
- Lartillot N, Rodrigue N, Stubbs D, Richer J (2013) PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst Biol 62(4):611–615. 10.1093/sysbio/syt022
- Laurin-Lemay S, Rodrigue N, Lartillot N, Philippe H (2018) Conditional approximate Bayesian computation: a new approach for across-site dependency in high-dimensional mutation–selection models. Mol Biol Evol 35(11):2819–2834. 10.1093/molbev/msy173
- Laurin-Lemay S, Dickson K, Rodrigue N (2022) Jump-chain simulation of Markov substitution processes over phylogenies. J Mol Evol 90(3–4):239–243. 10.1007/s00239-022-10058-0
- Lee HJ, Rodrigue N, Thorne JL (2015) Relaxing the molecular clock to different degrees for different substitution types. Mol Biol Evol 32(8):1948–1961. 10.1093/molbev/msv099
- Lee HJ, Kishino H, Rodrigue N, Thorne JL (2016) Grouping substitution types into different relaxed molecular clocks. Philos Trans R Soc B Biol Sci 371(1699):20150141. 10.1098/rstb.2015.0141
- Lemey P, Minin VN, Bielejec F, Pond SLK, Suchard MA (2012) A counting renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino acid sites under positive selection. Bioinformatics 28(24):3248–3256. 10.1093/bioinformatics/bts580
- MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proc. of the fifth Berkeley symposium on mathematical statistics and probability, Berkeley, vol 1, pp 281–297
- Mayrose I, Friedman N, Pupko T (2005) A gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics 21(Suppl 2):ii151–ii158. 10.1093/bioinformatics/bti1125
- Minin VN, Suchard MA (2008) Fast, accurate and simulation-free stochastic mapping. Philos Trans R Soc B Biol Sci 363(1512):3985–3995. 10.1098/rstb.2008.0176
- Muse S, Gaut B (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 10.1093/oxfordjournals.molbev.a040152
- Nielsen R (2002) Mapping mutations on phylogenies. Syst Biol 51(5):729–739. 10.1080/10635150290102393
- Parisi G, Echave J (2001) Structural constraints and emergence of sequence patterns in protein evolution. Mol Biol Evol 18(5):750–756. 10.1093/oxfordjournals.molbev.a003857
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
- Pollock DD, Thiltgen G, Goldstein RA (2012) Amino acid coevolution induces an evolutionary Stokes shift. Proc Natl Acad Sci. 10.1073/pnas.1120084109
- Quang LS, Gascuel O, Lartillot N (2008) Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics 24(20):2317–2323. 10.1093/bioinformatics/btn445
- Revell LJ (2011) phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol Evol 3(2):217–223
- Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL (2003) Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20(10):1692–1704. 10.1093/molbev/msg184
- Rodrigue N, Lartillot N (2013) Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package. Bioinformatics 30(7):1020–1021. 10.1093/bioinformatics/btt729
- Rodrigue N, Philippe H (2010) Mechanistic revisions of phenomenological modeling strategies in molecular evolution. Trends Genet 26(6):248–252. 10.1016/j.tig.2010.04.001
- Rodrigue N, Lartillot N, Bryant D, Philippe H (2005) Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347(2):207–217. 10.1016/j.gene.2004.12.011
- Rodrigue N, Philippe H, Lartillot N (2007) Exploring fast computational strategies for probabilistic phylogenetic analysis. Syst Biol 56(5):711–726. 10.1080/10635150701611258
- Rodrigue N, Lartillot N, Philippe H (2008) Bayesian comparisons of codon substitution models. Genetics 180(3):1579–1591. 10.1534/genetics.108.092254
- Rodrigue N, Latrille T, Lartillot N (2020) A Bayesian mutation-selection framework for detecting site-specific adaptive evolution in protein-coding genes. Mol Biol Evol 38(3):1199–1208. 10.1093/molbev/msaa265
- Romiguier J, Figuet E, Galtier N, Douzery EJP, Boussau B, Dutheil JY, Ranwez V (2012) Fast and robust characterization of time-heterogeneous sequence evolutionary processes using substitution mapping. PLoS ONE 7(3):e33852. 10.1371/journal.pone.0033852
- Silvestro D, Latrille T, Salamin N (2024) Toward a semi-supervised learning approach to phylogenetic estimation. Syst Biol. 10.1093/sysbio/syae029
- Trost J, Haag J, Höhler D, Jacob L, Stamatakis A, Boussau B (2023) Simulations of sequence evolution: How (un)realistic they are and why. Mol Biol Evol. 10.1093/molbev/msad277
- Tuffley C, Steel M (1998) Modeling the covarion hypothesis of nucleotide substitution. Math Biosci 147(1):63–91
- Xie VC, Pu J, Metzger BP, Thornton JW, Dickinson BC (2021) Contingency and chance erase necessity in the experimental evolution of ancestral proteins. eLife. 10.7554/elife.67336
- Yang Z (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol 39(3):306–314. 10.1007/bf00160154
- Yang Z, Nielsen R, Goldman N, Pedersen AMK (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155(1):431–449. 10.1093/genetics/155.1.431
- Zhou Y, Brinkmann H, Rodrigue N, Lartillot N, Philippe H (2010) A Dirichlet process covarion mixture model and its assessments using posterior predictive discrepancy tests. Mol Biol Evol 27(2):371–384
- Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins. Elsevier, Amsterdam, pp 97–166. 10.1016/b978-1-4832-2734-4.50017-6
Data Availability Statement
The software and data used in the study are available at https://github.com/Simonll/Stochastic-character-mapping.


