Significance
Phylogenetic methods inferring molecular coevolution have recently gained traction for their capacity to predict protein surface interactions and to clarify their function within metabolic pathways. These methods rely on phylogenies inferred under the assumption that coevolution does not exist (i.e., sites evolve independently). However, violations of this assumption lead to considerable inaccuracies in the inferred phylogeny, which in turn can negatively affect the estimation of coevolution. We tackle this problem by developing a Bayesian method to simultaneously infer phylogenetic relationships and predict coevolution from nucleotide sequences. The main novelty of our method is its ability to account for the interdependencies between molecular coevolution and phylogeny, thus relaxing a long-standing assumption in the study of molecular evolution.
Keywords: Bayesian inference, phylogeny, molecular coevolution, tree of life
Abstract
Patterns of molecular coevolution can reveal structural and functional constraints within or among organic molecules. These patterns are better understood when considering the underlying evolutionary process, which enables us to disentangle the signal of the dependent evolution of sites (coevolution) from the effects of shared ancestry of genes. Conversely, disregarding the dependent evolution of sites when studying the history of genes negatively impacts the accuracy of the inferred phylogenetic trees. Although molecular coevolution and phylogenetic history are interdependent, analyses of the two processes are conducted separately, a choice dictated by computational convenience, but at the expense of accuracy. We present a Bayesian method and associated software to infer how many and which sites of an alignment evolve according to an independent or a pairwise dependent evolutionary process, and to simultaneously estimate the phylogenetic relationships among sequences. We validate our method on synthetic datasets and challenge our predictions of coevolution on the 16S rRNA molecule by comparing them with its known molecular structure. Finally, we assess the accuracy of phylogenetic trees inferred under the assumption of independence among sites using synthetic datasets, the 16S rRNA molecule and 10 additional alignments of protein-coding genes of eukaryotes. Our results demonstrate that inferring phylogenetic trees while accounting for dependent site evolution significantly impacts the estimates of the phylogeny and the evolutionary process.
Molecular coevolution is the evolutionary process by which interactions between distant sites of a molecule, or sites of different molecules, are maintained such as to preserve advantageous functional or structural properties. For instance, coevolving fragments within protein sequences are involved in folding constraints and informative of folding intermediates, peptide assembly, or key mutations with known roles in genetic diseases (1, 2). The ever-growing availability of molecular sequences (nucleotides and amino acids) provides us with an unprecedented amount of data that hold a strong potential to reveal genes and gene regions evolving under a constrained process (3, 4).
There exist several methods to infer coevolution from sequence data alone (based on matching patterns between sites, as reviewed in refs. 5 and 6). However, these methods do not exploit a key component in modeling the underlying evolutionary processes: the phylogenetic tree describing the relationships between molecular sequences. Incorporating the phylogenetic signal in the analysis of coevolution is crucial because it enables us to distinguish between truly coevolving patterns and similar patterns induced by the shared history of sequences (5, 7). To this end, several methods have been developed to infer coevolution while accounting for phylogenetic relationships (7), but only a few of these explicitly model the process of coevolution along a given phylogenetic tree (8–11).
All phylogeny-aware methods to detect coevolution rely on the assumption that the phylogenetic relationships between sequences are known and can be treated as “observed data.” Typically, phylogenetic trees are themselves inferred from molecular data (12), but their inference is based on a fundamental assumption that each site evolves independently of all of the others (13). This assumption, which is evidently violated in the presence of coevolution, has benefits in terms of computational tractability, because the likelihood of an alignment given a phylogenetic tree is the product of the individual likelihood of each site. This simplification of the evolutionary mechanism in the presence of nonindependent sites has been shown to decrease the accuracy of the inferred phylogenetic trees (14, 15). However, datasets with strong functional or structural constraints are often analyzed within phylogenetic frameworks that assume independence among sites. For instance, the small ribosomal subunit (16S) is frequently used to estimate the earliest evolutionary relationships between the major lineages of the tree of life (16, 17), neglecting its numerous structural constraints and evidence of coevolution (18).
The presence of coevolutionary patterns across many nucleotide and amino acid sequences extends far beyond the 16S gene and is supported by a large body of evidence (9, 19). Ignoring the interdependencies between the phylogenetic history of the sequences and the constrained processes governing the evolution of nucleotides or amino acids can severely hamper our ability to infer correct phylogenetic trees (14, 15) and accurately detect coevolution (5, 7). However, the inference of these interdependent processes is still conducted separately for mathematical convenience, at the expense of assuming independence among sites which was originally described as “not necessarily biologically valid” by Felsenstein in 1983, at the dawn of likelihood-based molecular phylogenetics (13). To address this issue, we present a Bayesian framework to analyze a nucleotide alignment and jointly estimate (i) the number of pairs of sites that coevolve and their position in the sequence, thus differentiating from a background model of independent evolution, (ii) the parameters of the independent and dependent models of substitution, and (iii) the phylogenetic tree describing the relationships between sequences. Our method is called CoevRJ and the software implementing it is available at https://bitbucket.org/XavMeyer/coevrj.
We evaluate the performance of CoevRJ in reconstructing phylogenetic trees and inferring coevolution based on an extensive range of simulated datasets, an alignment of the highly coevolving 16S rRNA and 10 empirical eukaryote datasets of protein-coding genes. We show that CoevRJ provides an accurate identification of the coevolving sites, as validated by simulated and empirical data. We assess the effects of coevolution on phylogenetic estimates by comparing our results with those obtained under the assumption of independent evolution and demonstrate the importance of accounting for dependence among sites when inferring phylogenetic trees on datasets subject to coevolution.
Results
CoevRJ: A Bayesian Framework to Jointly Estimate Phylogeny and Molecular (Co)evolution.
The CoevRJ method simultaneously infers the phylogenetic relationships between molecular sequences as well as the number and position along the sequence of the sites (if any) that evolved in a dependent fashion. The method estimates the posterior probability of the many scenarios of evolution considered, as well as their parameters values using the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (20) (Fig. 1).
Our approach, further described in Materials and Methods, includes a mixture of two evolutionary models that can accommodate many possible scenarios of dependent and independent evolution among sites within a gene. Sites are not a priori assigned to either category; rather, their mode of evolution is estimated from the data. Independent sites are considered to evolve under a general time-reversible substitution model with rate heterogeneity among sites modeled by a Gamma distribution [GTR+ model (21)]. The mean rate of substitution is set to 1 and the shape of the Gamma distribution is modeled by a single parameter and discretized into a finite set of rate multipliers to incorporate varying substitution rates across all sites of the alignment.
Dependent sites are modeled with an adapted version of the Coev model (11), under which coevolving pairs of sites evolve in a dependent fashion, such that the nucleotides at both sites remain within a predefined set of nucleotide pairs, defined as the “coevolving profile.” A substitution in one site of a coevolving pair is expected to trigger a subsequent substitution at the other site such that the nucleotides combination remains in the profile. A coevolving profile contains between two and four nucleotide pairs encompassing the possible cosubstitutions. For instance, pairs (AA, CC, TT, GG) or the Watson–Crick base pairs (AT, CG) may form a coevolving profile. However, the Watson–Crick base pairs augmented with the wobbling pair GT cannot form a profile since the wobbling pair differs by a single substitution from the two other pairs (11). The coevolution process is modeled as a reversible continuous-time Markov chain with rate parameters distinguishing among single site substitutions (i) leading to the profile (rate ), (ii) breaking the profile (rate ), and (iii) allowing the pairs of sites to evolve within out-of-profile pairs (rate ).
The combination of GTR+ and Coev forms a mixture of models parameterized by the number of coevolving pairs of sites and their position within the molecular alignment. This mixture of models ranges from sites being fully independent and evolving under a pure GTR+ model to all sites being involved in nonoverlapping coevolving pairs. Independent and coevolving sites, regardless of the configuration of the mixture, are assumed to evolve on the same phylogeny (including topology and branch lengths), and therefore both contribute to its estimation. The branch lengths are, however, decoupled between models by scaling their respective rate matrices to yield an expected value of one substitution per branch length unit and by applying a unique branch length (or rate) multiplier for sites under coevolution.
Estimating all these parameters represents a significant computational challenge that we overcome by making some simplifying assumptions. We assume the coevolving pairs to follow a homogeneous (co)evolutionary process and share the same substitution rates (). This (co)evolutionary process is mainly characterized by parameters and that define the attraction of a pair of sites to their coevolving profile. As these two rate parameters are not expressed in the GTR+ model, we considered the rate parameters of coevolving and independent sites as independent. Finally, we infer the profile of each coevolving pair from a reduced set of profiles limited to the nucleotide pairs observed at the corresponding sites in the alignment. Under these assumptions, Bayesian inference of the GTR+ and Coev mixture of models is tractable with the CoevRJ method, as we demonstrate in the following sections using synthetic and empirical datasets.
CoevRJ Accurately Estimates Pairs of Sites and the Phylogeny.
To assess the performance of CoevRJ, we generated a total of 250 alignments of nucleotides each with 1,000 sites and 50 taxa, with a proportion of coevolving sites ranging from 0% (independent evolution) to 50% of the sites. CoevRJ correctly recovered the number of pairs of coevolving sites and their position (if any) under each scenario, with accuracy equal to 0.99 or higher (SI Appendix, Table S1). For datasets simulated under independent evolution, CoevRJ identified at most one pair over the 1,000 simulated sites. For datasets simulated with coevolving pairs, CoevRJ accurately identified most of the coevolving pairs of sites with a sensitivity of 98%. Overall, the models inferred with CoevRJ were consistent with the amount of coevolution simulated: Both the number of coevolving sites and their position were accurately identified as well as the sites that were not coevolving.
We then measured the ability of CoevRJ to recover the simulated amount of rate heterogeneity (i.e., the parameter of the GTR+ model) and the total number of substitutions, measured as the total branch length of the reconstructed phylogenetic tree. To estimate the effects of ignoring coevolution between sites, we reanalyzed the datasets under a standard GTR+ model where all sites are considered to evolve independently [as implemented in MrBayes (23)]. The performances of CoevRJ and GTR+, assessed by the relative errors with respect to the simulated parameters, were equivalent in the absence of coevolution, reflecting the fact that CoevRJ correctly reduced to a model where all sites are independent (Fig. 2 A and B). However, as the proportions of coevolving sites in the alignment increased, the accuracy of the estimated rate heterogeneity and branch lengths decreased substantially under the GTR+ model, while it remained essentially unchanged under CoevRJ (Fig. 2 A and B).
Finally, the accuracy of the inferred phylogenetic tree with increasing levels of coevolution was consistently improved when using CoevRJ rather than the standard GTR+ model. In presence of coevolution, the tree topologies inferred with CoevRJ were more accurate than those inferred under the GTR+ model (SI Appendix, Fig. S1). The divergence between the trees estimated by the GTR+ model and CoevRJ increased significantly with the proportion of coevolution (SI Appendix, Fig. S1) as a small but significant increase in misidentified bipartitions (internal nodes) affected the phylogenetic trees inferred under the independent sites model (Fig. 2C).
CoevRJ Identifies Alternative Hypotheses for the “Tree of Life.”
To gain insight on the effect of accounting for coevolution on an empirical dataset, we analyzed an alignment of the 16S rRNA that includes sequences for 146 taxa spanning the three domains of life, Bacteria, Archea, and Eukaryotes (18). The 16S rRNA is subject to many structural constraints and is therefore used as a benchmark for the method’s ability to predict coevolution. Using this dataset, pairs of sites predicted as coevolving can be assessed by comparing them to the known 3D structure of the small ribosomal subunit for several species [E. coli (24), Drosophila melanogaster (25) and Homo sapiens (26)]. Coincidentally, since 16S is shared by all prokaryotes and eukaryotes and is a slow-evolving gene, it is also often used to infer phylogenetic relationships, especially focusing on the earliest nodes in the tree of life (16, 17).
CoevRJ predicted 256 pairs of nucleotides (19.5% of the alignment positions) as coevolving with a posterior probability greater than 0.95. Of these, 94% of the pairs of sites were located very closely on the 3D structure, for example less than 6.5 Å for E. coli (Fig. 3 and Materials and Methods). The majority of these pairs (71%) were inferred as having a profile consistent with Watson–Crick base pairs (AT, GC). Among the 16 pairs having probability greater than 0.95 and not supported by the structure of E. coli, 14 were inferred with profiles diverging from a pure Watson–Crick profile, suggesting that they could be involved in functional constraints (SI Appendix, Table S2). Additionally, 12 of them are known to bind with other small ribosomal subunits not represented in our dataset (summarized in SI Appendix, Table S2 from ref. 28). Sites involved in such bonds could be coevolving with residues on other sequences and may thus present evolutionary patterns departing strongly from the one of independent evolution, which may lead CoevRJ to infer them as coevolving within the 16S rRNA. Finally, 108 pairs were predicted as significantly coevolving but with probability lower than 0.95 (27% with sites closely located on the structure of E. coli). Empirically validating these predictions is difficult as they could result from coevolution affecting only a portion of the phylogeny and would require the 3D structures for many species. However, the interpretation of the coevolving pairs based on the 3D structure of E. coli was confirmed when using the D. melanogaster and H. sapiens structures for validation (SI Appendix, Fig. S2).
The phylogenetic tree inferred by CoevRJ from the 16S RNA dataset significantly differed from the one obtained under the assumption of independence among sites. Our analyses identified 51 internal nodes not shared between the two topologies (normalized Robinson–Foulds distance (29) of 0.18; Fig. 4 and SI Appendix, Figs. S3–S9). In addition to topological differences, the two models inferred substantially different branch lengths. For instance, the branch separating the Archaea from the Eukaryotes had an estimated branch length that was 25% longer with CoevRJ than with the GTR+ model, while branches deep in the bacterial clades were generally inferred as shorter with CoevRJ (SI Appendix, Figs. S10–S12). These discrepancies have a strong impact on the estimates of divergence times based on these phylogenies. Indeed, we found strongly differing ultrametric trees when using the phylogeny estimated under the CoevRJ or the GTR+ model (Fig. 4), and conflicting estimates of the divergence times were observed for varying settings of the underlying molecular clock (SI Appendix, Fig. S13).
A Wider Perspective on the Failure to Account for Dependence Among Sites.
We tested the CoevRJ approach on a diverse range of 10 protein-coding genes of eukaryotes from the Selectome database (30) having significantly different alignment size and gene annotation (SI Appendix, Table S3). These alignments were specifically selected out of 8,000 protein-coding genes for their significant signals of coevolution as predicted by the Coev method (11). For each dataset, we inferred the parameters with both CoevRJ and GTR+ and computed the discrepancies measured between the methods for the inferred rate heterogeneity, branch lengths, and tree topology.
While the proportion and intensity of coevolution detected within these datasets varied (Fig. 5A and SI Appendix, Fig. S14), not accounting for dependence among sites led to decrease in the estimate of the rate heterogeneity between sites (Fig. 5B). Additionally, the estimates of branch lengths differed substantially between CoevRJ and GTR+. The total branch length inferred on these datasets was inconsistent between both methods without showing a bias (Fig. 5C). Notably, the difference in total branch length did not come from a global factor equally affecting all branches but from many changes of varying amplitude on different branches (Fig. 5D).
The tree topologies inferred under CoevRJ and GTR+ differed for all genes (Fig. 5E). The amount of differences among topologies obtained under the different methods ranged from 0.06 to more than 0.3 (normalized Robinson–Foulds distance). The differences measured on the substitution rates, branch lengths, and tree topologies suggested that accounting for dependence among sites significantly impacted the parameter estimates without presenting a consistent bias. The diversity of these discrepancies suggests that failing to account for the dependencies between sites led to unpredictable effects on the inferred evolutionary histories.
Discussion
We presented a method to analyze an alignment, while moving beyond the unrealistic assumption that all sites evolve as independent units. CoevRJ jointly infers the posterior probability of the phylogenetic tree, the pairs of coevolving sites, and the underlying parameters of the evolutionary models. CoevRJ therefore enables us to capture the reciprocal effects of the shared evolutionary history of molecular sequences and the pairs of sites that coevolve. The joint analysis of these two processes has remained an unsolved challenge in previous approaches (6, 31). Our results show that CoevRJ can accurately estimate both the phylogenetic tree and the parameters of the underlying (co)evolutionary process.
Modeling the evolutionary process enables CoevRJ to extract more information from the data than just the position of the pairs of coevolving sites. The posterior probability of each coevolving pair is informative of the strength of the nucleotide pairing along with the inferred distribution of profiles that determine the nature of the pairings. Estimating these parameters within a Bayesian framework results in intuitive posterior probabilities and enables us to use Bayes factors to properly define thresholds for the significance of predicted coevolving pairs (Materials and Methods). Given these advantages, the power of CoevRJ to predict coevolution compared favorably to existing methods on the 16S rRNA dataset (SI Appendix, Fig. S15).
Our findings join previous studies showing that the accuracy of standard phylogenetic inference is negatively impacted when dependence among sites is present in the data (14, 15). Phylogenetic trees inferred from synthetic datasets with CoevRJ show that our method can correct these inaccuracies. Similarly, phylogenetic trees inferred on the eukaryote datasets strongly differed from the ones inferred with a model assuming that sites evolve independently. The extent of the divergences were not predictable with respect to the nature and magnitude of the coevolutionary predictions. Such inaccuracies in the phylogeny could impact analyses using these phylogenetic trees. For instance, phylogenies inferred on the 16S rRNA datasets suggested that conflicting conclusions would be reached when aiming to date the tree of life (Fig. 4).
The machinery developed for CoevRJ can be extended to other mixtures of models for RNA, DNA, or amino acid sequences. For instance, a targeted study of the secondary structure of RNA could be conducted by replacing the Coev model by models accounting for substitution between Watson–Crick pairs and wobbling pairs (e.g., refs. 8 and 32). However, further improvements to the performance of this machinery are required to relax the most limiting assumptions on the current mixture of models to better integrate the richness and complexity of molecular evolution. For instance, the assumption of rate homogeneity of the coevolving pairs could be relaxed by adding a Gamma model of rate heterogeneity, as for the independent sites, or by considering a mixture of coevolutionary processes with different substitution rates (). Additionally, extending the model to consider coevolution at more than two sites at a time (e.g., triplet, quadruplet, or n-tuple) could better capture the potential underlying molecular structures. Finally, the study of coevolution in protein-coding genes would deeply benefit from the integration of codons or amino acid models, which is currently computationally prohibitive.
Extending CoevRJ to amino acid models would enable further investigation of the effect of dependent sites on dating of the tree of life, which is frequently achieved by analyzing the RNA and the proteins in the small ribosomal subunit (16, 17). Solving this computational challenge would also facilitate the study of protein–protein interactions (4) jointly with the underlying evolutionary process. While several methodological challenges remain, our approach paves the way to a new generation of more realistic models of molecular evolution. Improving our understanding of the shared history of genes and species requires that we integrate more complexity in evolutionary models (33), and our method demonstrates that such additional complexity is counterbalanced by a significant improvement of the inferred phylogenetic relationships and (co)evolutionary processes.
Materials and Methods
Evolutionary Models.
We designed a set of models in which nucleotides can either evolve independently of the others or according to a coevolutionary process whereby pairs of sites evolve in a mutually dependent fashion. The proportions of both types of sites in an alignment, as well as the specific assignment of each site to either model, is assumed to be unknown and estimated from the data.
Our Bayesian model allows us to jointly infer the following parameters (which are described in detail in the paragraphs below):
-
i)
the phylogenetic tree (topology and branch lengths) describing the relationships between genes;
-
ii)
the number of pairs of coevolving sites in the alignment;
-
iii)
the assignment of each individual site to either a coevolving pair or to the set of independently evolving sites;
-
iv)
the parameters of the substitution model describing independent site evolution; and
-
v)
the parameters of the substitution model for coevolving pairs of sites.
Independent substitution model.
Independent sites are modeled as evolving under the GTR+ model (21). This model accounts for rate heterogeneity using a discrete Gamma distribution model with four different rate multipliers known as the GTR+. We identify the instantaneous rates quantifying the rate of change from one nucleotide to another , as well as the parameter defining the shape of the Gamma distribution with . We scale the instantaneous rate matrix by the sum of its off-diagonal elements to disentangle the effect of the branch lengths and the rate parameters .
The likelihood of each independent site contained in the set is computed separately using the Felsenstein pruning algorithm (13) on the phylogenetic tree . The joint likelihood, under the assumption of independence among sites, is then computed as
with identifying the tree topology.
Dependent substitution model.
Dependent sites are assumed to follow a pairwise coevolution model adapted from ref. 11. Under this model, a pair of sites defined by two positions in the alignment are assumed to evolve within a profile of coevolution identified as . Following ref. 11, we define as “coevolving profile” the set of coevolving nucleotides for two sites (e.g., AT and CG). For any pair of nucleotides, there exist up to 192 possible profiles representing all of the possible combinations of pairs of nucleotides (34).
Since the Coev model does not allow double substitutions, evolutionary changes within the coevolving profile require at least two substitutions, for instance AT GT CG. We model this process with two parameters describing (i) the rate at which a coevolving pair is replaced by a noncoevolving one (e.g., GT), thus exiting the coevolving profile, and (ii) the rate at which the pair of sites return to the coevolving profile (e.g., CG). We indicate the two rates with and , respectively. In a coevolving pair, we expect the rate to be much larger than the rate , that is, it should be rare to leave a coevolving profile (low ) and it should be highly probable to regain a coevolving pair (high ). Therefore, the ratio expresses the attraction of pairs of positions to stay within the coevolving profile. Additionally, substitutions maintaining the pair of positions out of the profile occur at rate for each position.
The matrix of instantaneous substitution rates is then composed of the rates quantifying the rate of going from a pair of nucleotides to the pair and defined by
Branch lengths were not inferred by the original Coev method. Therefore, substitution rates were assumed to be consistent under the site-independent and site-dependent hypotheses. In other words, a substitution on two independent sites was equivalent to two substitutions on a pair of sites such that the total number of substitutions was preserved. This was achieved by doubling the length of branches under the Coev model.
Here we relax that assumption by inferring from the data the factor describing the difference in units of branch length between independent and coevolving substitutions. Branch lengths are therefore decoupled for sites under dependent and independent evolution by a rate modifier . This modification leads the probability transition matrix for coevolving pairs to be computed as
for the th branch. As for the , we normalize the matrix to disentangle the effects of the branch length and the rate scaling . Furthermore, we assume that all coevolving pairs share the same parameters of the matrix , whereas an individual coevolving profile is estimated for each pair.
In summary, a set of dependent sites is formed of pairs. Each pair is characterized by its positions and profile and its evolution is characterized by the modified Coev substitution model with parameters . The likelihood of each coevolving pair is computed using the Felsenstein pruning algorithm (13) on the phylogenetic tree . The joint likelihood of dependent sites is then given as
This model therefore assumes independence between coevolving pairs and thus does not directly account for dependence between three or more sites.
The set of models.
Given sequences of sites, we consider the set of models defined by the number of coevolving pairs such that . These sites are split in a subset of independent sites and a subset of coevolving pairs (dependent sites),
Although site assignment to either category can change during the Bayesian sampling algorithm, each site is exclusively dependent or independent in a given sample () and, if dependent, a site can only be present in one pair at a time. Finally, the joint likelihood of both types of sites is given as
Bayesian Framework.
We implemented the models in a Bayesian framework named CoevRJ that estimates all of the free parameters as well as the number of pairs . We used the RJMCMC algorithm (20) to estimate the joint posterior distribution of the parameters and models space:
which contains the probability for how many and which sites evolve independently or in coevolving pairs. The analysis starts from a model where all sites are independent () and proposes alternative configurations where pairs of sites are coevolving.
Proposals.
To explore this complex parameter and model space, we designed several proposals. Two types of parameters were differentiated: those that do or do not depend on .
Proposals for the phylogenetic tree and GTR+ model parameters.
These proposals aim to update the branch lengths , the tree topology , and the parameters . The space of phylogenetic tree topologies is sampled by using the stochastic nearest-neighbor interchange as well as the extended subtree pruning and regrafting proposals, while the continuous parameters of this category are sampled using adaptive multivariate normal proposals as described in ref. 35.
Proposals for the Coev model parameters ().
The parameters are sampled whenever . The branch-length scaling factor is updated using two different proposals. The first one is a simple multiplier proposal. The second proposal accounts for a potential negative correlation between and enabling both sets of parameters to change without impacting the overall number of substitutions in the phylogenetic tree.
Negative correlation between these parameters may happen when the model transitions from a purely independent sites model () to a model with many coevolving pairs () evolving faster than the independent sites (). Under these circumstances, many independent proposals would be required to reduce the branch lengths while increasing , and therefore would strongly impact the mixing and the convergence of the sampling process.
Therefore, we account for this negative correlation by proposing a move with a multivariate normal distribution having covariance matrix , where the th row corresponds to the branch length and the last row corresponds to the parameter . This covariance matrix is built such that
with and variances arbitrarily fixed (e.g., and ). The choice of these parameters results from empirical observations on the mean (Monte Carlo) sampling variance of branch length and the multiplier . The value of the parameter enforces that we expect significant correlations between the branch lengths and a negative correlation with the rate multiplier.
Furthermore, given that the matrix is normalized, the parameters are defined on the interval and have their sum constrained to one. To obtain an efficient sampling of such parameters, we use the reparameterization described in ref. 36. We replace the parameters by the parameters . The values of and are then given as
We then fix the parameter . One of the advantages of this reparameterization is that the new parameters are lying in and can thus be sampled with standard proposal kernels. We therefore sample parameters and using normal distributions with variances empirically calibrated to provide proper mixing on simulated datasets.
Proposals affecting k and the positions of pairs.
We have three different types of proposals that operate on the parameters and ; specifically, the proposals (i) add and remove pairs, (ii) change sites included in a pair, and (iii) change the coevolving profile in a pair.
Proposals moving through the model space.
The first two proposals are transdimensional moves because they change the number of parameters in the model and their acceptance probabilities are given by the RJMCMC algorithm (20). Our set of models defines a sequence of models with being the number of pairs and with each model having parameters identified by . The sequence increases the model complexity such that .
The probability of making a jump from a model with parameters to a more complex model with parameters is given as
[1] |
where A is defined as
[2] |
The first ratio represents the ratio of posterior probabilities. The second is the ratio of the proposal probabilities and the third is the ratio between the probability of drawing the random values required for the moves. The last term is the Jacobian of the mapping function that defines the relation between the parameters and the auxiliary variables in both models.
When proposing a move to a model having a different number of parameters, vectors of random numbers and are drawn such as to complete the parameters spaces of and such that with and . Assuming independent random numbers , the probability of drawing is defined as
To move from model to , we first draw a pair of independent sites. For the sake of simplicity, we consider hereafter that sites and profiles are drawn from uniform distributions. The notation used, however, accommodates more sensible approaches (see SI Appendix for details of the CoevRJ implementation).
In this proposal, a pair of positions is drawn by first selecting a position and then a second position . The probability of making this move is given as
A new profile must then be assigned to this new pair by drawing it directly from the distribution .
The opposite move going from to removes a pair arbitrarily chosen among the existing pairs with probability . Since this move only removes parameters, no random numbers are drawn. Therefore, Eq. 2 defining the acceptance probability of both moves simplifies to
[3] |
Additionally, given that the random parameters are drawn independently from the current parameter value , the last term of Eq. 3, the determinant of the Jacobian, is equal to 1.
For the general case where , the probability of making the move () is then given by Eq. 3. The probability of a backward move () is given as . In the special case identified by the proposals , that moves from a model having only independent sites () to a model having one coevolving pair (), parameters as well as must be proposed. Coev parameters and are drawn from independent normal distributions with parameters and , respectively, while the branch-length scaling factor is drawn from a Gamma distribution with parameters (). In CoevRJ, these distributions have been tuned to result in more efficient proposals. For this special move, the probability is therefore altered and is given by
Proposals sampling the pair space ().
Two proposals aim to provide a proper mixing of the pairs of positions under coevolution without affecting the number of pairs and are subject to the standard Metropolis–Hastings acceptance ratio (37). The first proposal breaks an existing pair and exchanges one of its positions with a site considered as independent. A pair is selected arbitrarily among the existing pairs. The position kept is equal to with probability , otherwise the position with probability . The position is then removed from the pair.
The independent site is drawn from the probability distribution and a new profile is drawn according to the profile probability distribution . The Hastings ratio for this move is thus given by
The second proposal chooses two pairs arbitrarily among the existing and mixes their positions randomly. Once the new pairs are created, new profiles and have to be drawn. This proposal is symmetrical with the exception of the proposals on the profiles. The Hastings ratio is then given by
Proposals exploring the profile space (when k ).
The last proposal samples the possible profiles for any given coevolving pairs by drawing a new profile for a pair according to the probability . Its Hastings ratio is equal to 1 given that such moves are symmetric.
Priors on standard parameters.
We assume a uniform prior on the tree topology and an exponential prior on each branch length with rate . The Gamma rate distribution modeling the rate heterogeneity of independent sites is defined by its shape parameter for which we assume an exponential distribution with rate . Finally, to maintain consistency with the GTR model implemented in MrBayes (23), the exchangeability rates of this model are assigned a flat Dirichlet prior distribution.
Priors on Coev model parameters.
The branch-length scaling parameter for the coevolving pairs has a Gamma prior distribution with shape and scale parameters equal to 2. This distribution has its mode located at 2, which expresses our prior belief that two dependent sites should evolve at the same pace as two independent sites. Parameters that are used to reparameterize the parameters () have a normal prior distribution . These normal distributions are defined such as to favor a substitution rate greater than the others (SI Appendix, Table S4). This parameterization reflects our belief that pairs of sites under coevolution should differ from an independent evolutionary process (i.e., ) and be constrained to stay within the profile (i.e., high ratio), which is also supported by empirical findings and simulations (11).
For the 16S rRNA dataset, we used results from the Coev model (11) to estimate informed prior distributions on parameters (), which we derived following the methodology described in refs. 36 and 38. We computed the mean and SD of parameters () from pairs of positions with significant support for the Coev model (). Using these values, we estimated the moments of the prior distributions for the parameters. The resulting parameterization (SI Appendix, Table S4) is close to the one defined for the default prior that results in a stationary frequency for nucleotide pairs in the profile to be close to .
Finally, we defined a uniform prior distribution on the profile based on the set of profiles observed in the alignment for a given pair of positions. This empirical prior reduces the space of parameters to explore and plays a key role in making analysis tractable with CoevRJ. However, this simplification implies that pairs of sites cannot (co)evolve under an unobserved profile (e.g., invariant sites are evolving under an independent process by default).
While more informative priors on the distribution of profiles could be used for specific type of molecular data (e.g., favoring Watson–Crick profiles for RNA sequences), we chose to use this conservative and vague prior since it accommodates the wide range of scenarios we considered including coevolution within DNA sequences. Under this prior, the marginal probability of the profiles inferred on the 16S rRNA dataset showed that the method had the power to infer the profile even without more informative priors. Indeed, 71% of the estimated coevolving pairs followed a pure Watson–Crick profile (AT, CG) and 17% of the pairs evolved under a profile containing canonical pairs coupled with other ones (AT, XX or CG, XX or AT, CG, XX). These results suggest that our priors do not prevent us from inferring common coevolving profiles.
Priors on the number of pairs.
We used a hierarchical prior on the number of pairs by assigning a Poisson distribution with rate on the number of pairs and by an exponential distribution with parameter as a hyperprior on . This scheme enables the parameter to be estimated using Gibbs sampling. We fixed the value of to 10. This hyperprior results in an expected value for of 0.1 (i.e., a Poisson distribution with mode at ), which expresses our prior belief that coevolution should be a rare evolutionary event. Furthermore, the choice of parameter was validated by analyses conducted on the simulated datasets (SI Appendix, Fig. S16). The parameter values used for the hyperprior maintained close to the true number of simulated coevolving pairs.
Priors on the configuration of pairs.
The last prior defines the probability of observing a given set of coevolving pairs of positions. We assumed a uniform prior reflecting that pairs of sites are considered a priori to be equally likely to be under coevolution. Under this assumption, the prior is given by the number of possible configurations of pairs for a sequence of positions. Considering that we do not differentiate the order of positions in a pair, the total number of configurations for a given is defined as
Furthermore, we do not differentiate the ordering of the pairs. There are permutations for a set of pairs and thus the total number of ordered configurations is given by
The probability of observing a given configuration is then given as
Significance of Pairs Inferred as Coevolving.
To confidently consider pairs to be coevolving, we designed a method defining a threshold on the probability at which a pair is considered as strongly significant. This threshold ensures that we only report pairs predicted with a marginal posterior probability significantly higher from the one we expect from the random sampling of pairs under a uniform prior on their positions integrated over the prior probability of seeing any given number of pairs.
We define this threshold after an MCMC run with CoevRJ by using the inferred posterior probability of a given pair to compute the Bayes factor
[4] |
where is the prior probability of pair to be coevolving.
This prior probability explains the uniform probability of observing pair assuming that every pair is equally likely to be sampled. This probability is conditioned on the number of pairs , such that . Here we adopt a conservative approach and use the inferred posterior probability of as the probability . The prior probability of sampling a pair given is defined as
where defines the number of nucleotides in the sequence. The probability of drawing pair at step is given as
while the probability of not drawing or is defined as
The significance threshold for a pair is then derived from Eq. 4 by using the threshold for strong significance suggested by Kass and Raftery (39). When applied on the posterior distribution inferred with CoevRJ on the 16S rRNA dataset, this approach resulted in a threshold of . Pairs inferred with a posterior probability smaller than this value were therefore treated as insignificant.
Experimental Setting.
Experiments on all of the datasets consisted of a comparison of results obtained with CoevRJ to the ones obtained with the GTR+ model assuming site independence as implemented in MrBayes (23). Prior distributions in MrBayes were set as the ones for CoevRJ with . Both implementations were run under similar settings with four processors dedicated for (40).
Runs were considered as having converged when the distribution of tree topologies stabilized [i.e., when the average SD of the splits frequencies (41) was measured to reach 0.05 using three independent runs for each dataset]. In addition, the parameter traces were examined to ensure proper convergence. The burn-in phase of each run was discarded and the remaining samples were used to estimate the posterior distribution. Comparisons between tree distributions were conducted by (i) computing the majority-rule consensus tree from the tree distributions obtained under each model and (ii) computing the normalized Robinson–Foulds distance (29) between the consensus trees.
Simulation of datasets.
We simulated five categories of nucleotides sequences with varying amount of coevolving sites (0, 5, 10, 20, and 50%). We simulated 50 replicates for each of these categories. For each replicate, we simulated a random phylogenetic tree with 50 tips using the R package APE (42) with branch length drawn from an exponential distribution (). This tree was used to generate an alignment composed of 1,000 nucleotides.
Sites evolving independently were simulated with the Evolver simulator (43) using a GTR+ model with arbitrary rates. The shape parameter of the Gamma distribution was drawn from a distribution Gamma.
Pairs of coevolving sites were simulated using the Coev simulator (34). Each pair was attributed a random profile composed of two nucleotide pairs (e.g., AA, CC). The parameters for the Coev simulator were set to and such as to generate strongly coevolving pairs of sites (). For both the site-independent and site-dependent models, we assumed equal base frequencies for each state (i.e., 25% for each state in the GTR model and 6.25% for each state in the Coev model).
Proximity of nucleotides on the 3D structure of the 16S rRNA dataset.
To validate CoevRJ predictions of coevolution on the 16S rRNA dataset, we compared these results with pairs of nucleotides located closely on the 3D structure of the molecule (and thus potentially bonding). These pairs were identified as nucleotides having their two closest atoms at a distance of less than 6.5 Å. This threshold is consistent with the one used in ref. 18 (8 Å) and is representative of the average resolution of the Protein Data Bank (PDB) structure considered (24–26). The results obtained with this threshold are robust when considering other possible values in the range of 4 Å to 8 Å (SI Appendix, Fig. S3).
Molecular dating of the 16S rRNA dataset.
We analyzed the consensus trees obtained with both CoevRJ and the pure GTR+ model using the penalized likelihood framework (27) as implemented in the R package APE (42). We used multiple relaxed molecular clock models (i.e., correlated and relaxed) to ensure that our observations were not due to the use of a specific model. Furthermore, each model was used under four different values, changing the strength with which the rate is constrained along branches. This parameter took the value ranging from rates along the branch being totally independent to strongly related.
Data Availability.
The simulated and empirical molecular sequences used for the analyses can be found on the CoevRJ git repository (https://bitbucket.org/XavMeyer/coevrj). The PDB structures used to validate the findings can be accessed on the RSCB PDB (https://www.rcsb.org) with the identifiers 4GD2 (E. coli, ref. 24), 5VYC (H. sapiens, ref. 25), and 4V6W (D. melanogaster, ref. 26).
Supplementary Material
Acknowledgments
We thank Michael May, John Huelsenbeck, and two anonymous reviewers whose insightful comments helped improve and clarify this manuscript, and the Vital-IT facilities of the Swiss Institute of Bioinformatics for the use of their HPC infrastructure. This work was supported by Swiss National Science Foundation Grant P2GEP2_178032 (to X.M.), Swedish Research Council Grant 2015-04748 and the Swedish Foundation for Strategic Research (D.S.), and Swiss National Science Foundation Grant 4075-40_167276 and the University of Lausanne (N.S.).
Footnotes
The authors declare no conflicts of interest.
This article is a PNAS Direct Submission.
Data deposition: The simulated and empirical molecular sequences used for the analyses can be found on the CoevRJ git repository (https://bitbucket.org/XavMeyer/coevrj).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1813836116/-/DCSupplemental.
References
- 1.Dib L, Salamin N, Gfeller D. Polymorphic sites preferentially avoid co-evolving residues in MHC class I proteins. PLoS Comput Biol. 2018;14:e1006188. doi: 10.1371/journal.pcbi.1006188. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Douam F, et al. A protein coevolution method uncovers critical features of the hepatitis C virus fusion mechanism. PLoS Pathog. 2018;14:e1006908. doi: 10.1371/journal.ppat.1006908. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat Rev Genet. 2013;14:249–261. doi: 10.1038/nrg3414. [DOI] [PubMed] [Google Scholar]
- 4.Szurmant H, Weigt M. Inter-residue, inter-protein and inter-family coevolution: Bridging the scales. Curr Opin Struct Biol. 2018;50:26–32. doi: 10.1016/j.sbi.2017.10.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Talavera D, Lovell SC, Whelan S. Covariation is a poor measure of molecular coevolution. Mol Biol Evol. 2015;32:2456–2468. doi: 10.1093/molbev/msv109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: A key issues review. Rep Prog Phys. 2018;81:032601. doi: 10.1088/1361-6633/aa9965. [DOI] [PubMed] [Google Scholar]
- 7.Dutheil JY. Detecting coevolving positions in a molecule: Why and how to account for phylogeny. Brief Bioinform. 2012;13:228–243. doi: 10.1093/bib/bbr048. [DOI] [PubMed] [Google Scholar]
- 8.Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999;15:446–454. doi: 10.1093/bioinformatics/15.6.446. [DOI] [PubMed] [Google Scholar]
- 9.Yeang C-H, Haussler D. Detecting coevolution in and among protein domains. PLoS Comput Biol. 2007;3:e211. doi: 10.1371/journal.pcbi.0030211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Dutheil JY, Jossinet F, Westhof E. Base pairing constraints drive structural epistasis in ribosomal RNA sequences. Mol Biol Evol. 2010;27:1868–1876. doi: 10.1093/molbev/msq069. [DOI] [PubMed] [Google Scholar]
- 11.Dib L, Silvestro D, Salamin N. Evolutionary footprint of coevolving positions in genes. Bioinformatics. 2014;30:1241–1249. doi: 10.1093/bioinformatics/btu012. [DOI] [PubMed] [Google Scholar]
- 12.Yang Z, Rannala B. Molecular phylogenetics: Principles and practice. Nat Rev Genet. 2012;13:303–314. doi: 10.1038/nrg3186. [DOI] [PubMed] [Google Scholar]
- 13.Felsenstein J. Statistical inference of phylogenies. J R Stat Soc Ser A. 1983;146:246–272. [Google Scholar]
- 14.Huelsenbeck JP, Nielsen R. Effect of nonindependent substitution on phylogenetic accuracy. Syst Biol. 1999;48:317–328. doi: 10.1080/106351599260319. [DOI] [PubMed] [Google Scholar]
- 15.Nasrallah CA, Mathews DH, Huelsenbeck JP. Quantifying the impact of dependent evolution among sites in phylogenetic inference. Syst Biol. 2011;60:60–73. doi: 10.1093/sysbio/syq074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Brown CT, et al. Unusual biology across a group comprising more than 15% of domain bacteria. Nature. 2015;523:208–211. doi: 10.1038/nature14486. [DOI] [PubMed] [Google Scholar]
- 17.Hug LA, et al. A new view of the tree of life. Nat Microbiol. 2016;1:16048. doi: 10.1038/nmicrobiol.2016.48. [DOI] [PubMed] [Google Scholar]
- 18.Yeang CH, Darot JF, Noller HF, Haussler D. Detecting the coevolution of biosequences—An example of RNA interaction prediction. Mol Biol Evol. 2007;24:2119–2131. doi: 10.1093/molbev/msm142. and erratum (2008) 25:2077. [DOI] [PubMed] [Google Scholar]
- 19.Uguzzoni G, et al. Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis. Proc Natl Acad Sci USA. 2017;114:E2662–E2671. doi: 10.1073/pnas.1615068114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732. [Google Scholar]
- 21.Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol. 1994;39:306–314. doi: 10.1007/BF00160154. [DOI] [PubMed] [Google Scholar]
- 22.Bernier CR, et al. RiboVision suite for visualization and analysis of ribosomes. Faraday Discuss. 2014;169:195–207. doi: 10.1039/c3fd00126a. [DOI] [PubMed] [Google Scholar]
- 23.Ronquist F, et al. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61:539–542. doi: 10.1093/sysbio/sys029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Dunkle JA, et al. Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science. 2011;332:981–984. doi: 10.1126/science.1202692. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Anger AM, et al. Structures of the human and Drosophila 80s ribosome. Nature. 2013;497:80–85. doi: 10.1038/nature12104. [DOI] [PubMed] [Google Scholar]
- 26.Lomakin IB, et al. Crystal structure of the human ribosome in complex with DENR-MCT-1. Cell Rep. 2017;20:521–528. doi: 10.1016/j.celrep.2017.06.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Sanderson MJ. Estimating absolute rates of molecular evolution and divergence times: A penalized likelihood approach. Mol Biol Evol. 2002;19:101–109. doi: 10.1093/oxfordjournals.molbev.a003974. [DOI] [PubMed] [Google Scholar]
- 28.Cannone JJ, et al. The Comparative RNA Web (CRW) Site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 2002;3:2. doi: 10.1186/1471-2105-3-2. and erratum (2002) 3:15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147. [Google Scholar]
- 30.Moretti S, et al. Selectome update: Quality control and computational improvements to a database of positive selection. Nucleic Acids Res. 2014;42:D917–D921. doi: 10.1093/nar/gkt1065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Figliuzzi M, Barrat-Charlaix P, Weigt M. How pairwise coevolutionary models capture the collective residue variability in proteins? Mol Biol Evol. 2017;35:1018–1027. doi: 10.1093/molbev/msy007. [DOI] [PubMed] [Google Scholar]
- 32.Nasrallah CA, Huelsenbeck JP. A phylogenetic model for the detection of epistatic interactions. Mol Biol Evol. 2013;30:2197–2208. doi: 10.1093/molbev/mst108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Lartillot N. Probabilistic models of eukaryotic evolution: Time for integration. Philos Trans R Soc Lond B Biol Sci. 2015;370:20140338. doi: 10.1098/rstb.2014.0338. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Dib L, et al. Coev-web: A web platform designed to simulate and evaluate coevolving positions along a phylogenetic tree. BMC Bioinformatics. 2015;16:394. doi: 10.1186/s12859-015-0785-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Meyer X, Chopard B, Salamin N. Accelerating Bayesian inference for evolutionary biology models. Bioinformatics. 2017;33:669–676. doi: 10.1093/bioinformatics/btw712. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Gelman A, Bois F, Jiang J. Physiological pharmacokinetic analysis using population modeling and informative prior distributions. J Am Stat Assoc. 1996;91:1400–1412. [Google Scholar]
- 37.Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109. [Google Scholar]
- 38.Gelman A. Method of moments using Monte Carlo simulation. J Comput Graph Stat. 1995;4:36–54. [Google Scholar]
- 39.Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90:773–795. [Google Scholar]
- 40.Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics. 2004;20:407–415. doi: 10.1093/bioinformatics/btg427. [DOI] [PubMed] [Google Scholar]
- 41.Lakner C, van der Mark P, Huelsenbeck JP, Larget B, Ronquist F. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst Biol. 2008;57:86–103. doi: 10.1080/10635150801886156. [DOI] [PubMed] [Google Scholar]
- 42.Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20:289–290. doi: 10.1093/bioinformatics/btg412. [DOI] [PubMed] [Google Scholar]
- 43.Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The simulated and empirical molecular sequences used for the analyses can be found on the CoevRJ git repository (https://bitbucket.org/XavMeyer/coevrj). The PDB structures used to validate the findings can be accessed on the RSCB PDB (https://www.rcsb.org) with the identifiers 4GD2 (E. coli, ref. 24), 5VYC (H. sapiens, ref. 25), and 4V6W (D. melanogaster, ref. 26).