PLOS Computational Biology. 2021 May 24;17(5):e1008957. doi: 10.1371/journal.pcbi.1008957

On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins

Edwin Rodriguez Horta 1,2, Martin Weigt 2,*
Editor: Rita Casadio3
PMCID: PMC8177639  PMID: 34029316

Abstract

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.

Author summary

Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold into close-to-identical three-dimensional structures and fulfill almost identical biological tasks. Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches also have to assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose null models for sequence alignments, which maintain the patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings, which are frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have only a small influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.

Introduction

Global coevolutionary modeling approaches have recently attracted considerable interest [1, 2], either directly for predicting residue-residue contacts from sequence ensembles corresponding to homologous protein families [3–5], for predicting mutational effects [6–8], for designing artificial but functional protein sequences [9–12], or as an input to deep-learning based protein structure prediction. The latter approach has recently led to a breakthrough in predicting protein structure from sequence [13–17].

The basic idea of coevolutionary models, like the Direct-Coupling Analysis (DCA) [6], is that the amino-acid sequences, typically given in the form of a multiple-sequence alignment (MSA) of width (or aligned sequence length) L, can be considered as a sample drawn from some unknown probability distribution P(a1, …, aL), with (a1, …, aL) being an aligned amino-acid sequence. This probabilistic model is typically parameterized as P(a1, …, aL) ∝ exp{∑i<j Jij(ai, aj) + ∑i hi(ai)} via biases (or fields) hi(ai) representing site-specificities in amino-acid usage (i.e. patterns of amino-acid conservation), and via statistical couplings Jij(ai, aj), which represent coevolutionary constraints and cause correlated amino-acid usage in positions i and j [18].

In most cases, the parameters of these models are inferred by (approximate) maximum-likelihood, under the assumption that the sequences in the MSA are (almost) independently and identically distributed according to P(a1, …, aL). On one hand, this assumption is needed to make model inference from MSA technically feasible. On the other hand, it is in obvious contradiction to the fact that sequences in homologous protein families share common ancestry in evolution, and therefore typically show considerable phylogenetic correlations, which can be used to infer this unknown ancestry from data [19]. Phylogeny induces highly non-trivial correlations between MSA columns [20], which however do not represent any functional relationship.

Disentangling correlations induced by functional or structural couplings from phylogeny-caused correlations turns out to be a highly non-trivial task [20–22]. Simple statistical corrections have been proposed, like down-weighting similar sequences when determining statistical correlations [3], or the average-product correction (APC) [23] applied to the final coevolutionary coupling scores. While sequence weighting has initially been reported to significantly improve contact prediction, recent works show little effect [24], probably due to the fact that, e.g., Pfam [25] is now based on reference proteomes and therefore less redundant than databases used to be about a decade ago. APC was shown to be more of a correction of biases related to amino-acid conservation than to phylogeny [26].

To make progress, we suggest a complementary approach. Instead of removing phylogenetic correlations from DCA-type analyses, we suggest null models that have the same conservation and phylogenetic patterns as the original MSA of the protein family under consideration, but strictly lack any functional or structural couplings.

Running DCA on artificial MSA generated by these null models, and comparing the results to those obtained from natural MSA, we find some remarkable results: while the largest eigenvalues of the residue-residue covariance matrix appear to be dominated by phylogenetic effects, the strongest DCA couplings are hardly influenced by phylogeny. The spurious couplings induced by phylogeny are, however, non-zero, and may limit the accuracy of contact prediction when going beyond the first few strongest couplings. More precisely, our observations are quantitatively compatible with the idea that already the first false-positive contact predictions are caused by not correctly treating phylogeny in DCA. This also shows that methods properly dissecting phylogenetic and functional correlations in sequence data have a high potential to substantially extend contact predictions beyond current methods.

The paper is organized as follows. After this introduction, we provide the Materials and Methods, with a short review of DCA, but most importantly with the presentation of three null models. The Results section compares the spectral properties of the residue-residue covariance matrix of the real data with those of MSA generated by the null models, followed by an assessment of the couplings inferred by DCA and their relation to residue-residue contacts. The Conclusion sums up the results and discusses potentially interesting future directions. Supporting tables and figures are shown in S1 Text.

Materials and methods

Protein families, sequence alignments and Direct Coupling Analysis

Coevolutionary analysis is mostly applied to families of homologous proteins (or protein domains), as provided by the Pfam database [25]. Multiple sequence alignments (MSA) can be downloaded in the form of rectangular arrays $D = \{a_i^m \mid i = 1, \dots, L;\ m = 1, \dots, M\}$ of width L (aligned sequence length) and depth M (number of aligned sequences). The entries $a_i^m \in \{-, A, C, \dots, Y\}$ are either one of the 20 standard amino acids, or alignment gaps represented as “−”. Note that insertions are not aligned in Pfam alignments, and are therefore typically removed from the MSA before statistical model learning. Here L describes the sequence length after removal of insertions. For the structural analysis, the Pfam MSA is mapped to experimentally resolved PDB protein structures [27], and distances are measured as minimum distances between heavy atoms. Following established standards in the coevolutionary literature, we use a cutoff of 8 Å for residue-residue contacts.
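The contact definition above can be illustrated with a minimal sketch. It assumes that the heavy-atom coordinates of each aligned residue have already been extracted from a PDB structure into one array per residue; the function name `contact_map`, the input layout and the strict inequality at the cutoff are assumptions made for this illustration only.

```python
import numpy as np

def contact_map(residue_coords, cutoff=8.0):
    """Boolean contact map from heavy-atom coordinates.

    residue_coords: list of (n_atoms_k, 3) arrays in Angstrom, one per residue
    (hypothetical input layout). Two residues are called a contact if the
    minimum distance between any of their heavy atoms is below `cutoff`.
    """
    L = len(residue_coords)
    contacts = np.zeros((L, L), dtype=bool)
    for i in range(L):
        for j in range(i + 1, L):
            # All pairwise atom-atom differences between residues i and j.
            diff = residue_coords[i][:, None, :] - residue_coords[j][None, :, :]
            d_min = np.sqrt((diff ** 2).sum(axis=2)).min()
            contacts[i, j] = contacts[j, i] = d_min < cutoff
    return contacts

# Toy example with three residues placed on a line (hypothetical coordinates).
toy_coords = [
    np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    np.array([[5.0, 0.0, 0.0]]),
    np.array([[20.0, 0.0, 0.0]]),
]
toy_contacts = contact_map(toy_coords)
```

The diagonal is left as `False`, since only pairs of distinct residues are of interest for contact prediction.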

For our work, we have selected three datasets:

  • DS1: Detailed results are given for 9 Pfam protein families, with not too long sequences (L < 250), not too large MSA (M < 10,000 after removing duplicate sequences) and available PDB structures, see details in Table A in S1 Text.

  • DS2: Statistical results are given for 60 Pfam protein families belonging to the PSICOV benchmark set [28]. Only families with M < 12,000 pairwise distinct sequences were considered, cf. Table B in S1 Text.

  • DS3: We have also compiled a dataset of 20 smaller Pfam protein families with known PDB structures, to study the influence of finite MSA depth on our findings, cf. Table C in S1 Text.

Global coevolutionary models, such as those constructed by DCA, describe the sequence variability between the members of a protein family, i.e. between different rows of the MSA D, via a statistical model

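$$P(a_1, \dots, a_L \mid J, h) = \frac{1}{Z} \exp\left\{ \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j) + \sum_{1 \le i \le L} h_i(a_i) \right\}, \qquad (1)$$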

parameterized via pairwise coevolutionary residue-residue couplings Jij(ai, aj) and single-residue biases (or fields) hi(ai), while Z is a normalization factor also known as partition function. In the simplest setting, these parameters are inferred from the data via maximum likelihood, i.e.

$$\{J, h\} = \underset{J, h}{\operatorname{argmax}} \prod_{m=1}^{M} P(a_1^m, \dots, a_L^m \mid J, h). \qquad (2)$$

This maximization directly implies that the model P reproduces the empirical statistics of single MSA columns and of column pairs,

$$f_i(a_i) = \sum_{\{a_k \mid k \neq i\}} P(a_1, \dots, a_L), \qquad f_{ij}(a_i, a_j) = \sum_{\{a_k \mid k \neq i, j\}} P(a_1, \dots, a_L), \qquad (3)$$

where fi(a) represents the fraction of amino acids a in column i, i.e. the residue-conservation statistics, while the fij(a, b) describe the fraction of sequences having simultaneously amino acids a and b in columns i and j, thereby representing residue covariation / coevolution, i.e. the correlated usage of amino acids in pairs of columns, cf. Fig 1. The inference of the parameters is a computationally hard task, since, e.g., the computation of the marginals in Eq (3) depends on an exponential sum over O(21^L) terms. Many approximation schemes have been proposed; we use plmDCA [29], which is based on pseudo-likelihood maximization, since it represents a well-tested compromise between accuracy and running time.
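As an illustration of the empirical statistics fi(a) and fij(a, b), the following minimal sketch computes them from an MSA that is assumed to be already encoded as an integer array (0 for the gap, 1 to 20 for the amino acids). The function name, the encoding and the toy alignment are hypothetical and are not part of plmDCA; no sequence weighting or pseudocount is applied.

```python
import numpy as np

Q = 21  # alphabet size: 20 amino acids plus the gap symbol

def column_frequencies(msa, q=Q):
    """Return f_i(a) with shape (L, q) and f_ij(a, b) with shape (L, L, q, q).

    `msa` is an (M, L) integer array with entries in 0..q-1 (assumed encoding).
    """
    msa = np.asarray(msa)
    M, L = msa.shape
    onehot = np.eye(q)[msa]                    # shape (M, L, q)
    fi = onehot.mean(axis=0)                   # single-column frequencies
    # Pair frequencies: fraction of rows with a in column i and b in column j.
    fij = np.einsum("mia,mjb->ijab", onehot, onehot) / M
    return fi, fij

# Toy alignment: 4 sequences of length 3 (hypothetical data).
toy_msa = np.array([[1, 2, 0],
                    [1, 2, 3],
                    [4, 5, 3],
                    [4, 5, 0]])
fi, fij = column_frequencies(toy_msa)
```

The pair-frequency tensor grows as L² · 21², so for long alignments one would typically compute it block by block rather than in a single call.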

Fig 1. Schematic representation of the information used by DCA and the null models.

MSA contain several types of information about the sequence variability. The sequence profile and residue covariation describe the statistics of individual MSA columns and column pairs; both are used in DCA. However, the MSA also contains phylogenetic information, here represented by the matrix $\{D^H_{mn} \mid 1 \le m < n \le M\}$ of Hamming distances between sequences, or by the (inferred) phylogenetic tree. The different null models use the profile and phylogenetic information, but no residue covariation.

A particularity of this approach is that Eq (2) assumes that the sequences in the MSA D form an independently and identically distributed sample of P(a1, …, aL) and that the likelihood can be factorized into a product over the rows of D. This assumption is incorrect; biological sequences are the result of natural evolution and thus show hierarchical phylogenetic relations. Phylogeny by itself leads to a non-trivial correlation structure between different residue positions with a power-law spectrum [20], and this leads to non-zero but also non-functional residue-residue couplings when using DCA. These may interfere with the functional couplings, which are e.g. used for residue-residue contact prediction from MSA data, and thereby negatively impact prediction accuracy.

It is notoriously hard to disentangle the two, cf. [21, 22]. The problem is that evolution is a non-equilibrium stochastic process, whose dynamics in principle depends on the evolutionary constraints represented, e.g., by the couplings and fields in the DCA model. Global model inference from phylogenetically correlated data remains an open question.

Here we follow a different route. We define different null models, which take residue conservation and in part also phylogeny into account, but do not show any intrinsic amino-acid covariation. The null models allow us to create large numbers of suitably randomized sequence ensembles, on which standard DCA can be run. The couplings resulting from randomized data can be used to assess the statistical significance of the couplings resulting from the real MSA D, and therefore to discard purely phylogeny-caused couplings. While, to the best of our knowledge, this has never been done in the context of protein families and DCA, somewhat similar techniques have been proposed in the context of phylogenetic profiling [30], but applied to correlations rather than couplings.

Null model I: Profile-aware sequence randomization

The first null model is very simple. It randomizes the input MSA while conserving the single-column statistics fi(a), for all sites i = 1, …, L and all amino acids or gaps a ∈ {−, A, …, Y}. This is done by random and mutually independent permutations of all MSA columns. This destroys all correlations between positions (the coevolutionary ones) and between sequences (the phylogenetic ones); only the residue conservation patterns of the original MSA are preserved. Formally, the randomized sequences become an independently and identically distributed sample from the profile model

$$P_{\mathrm{profile}}(a_1, \dots, a_L) = \prod_{i=1}^{L} f_i(a_i). \qquad (4)$$

So in principle there are no couplings between different residues at all. However, when running DCA on this sample, inferred couplings will be non-zero due to finite sample size. They will take distinct values from one randomization to the next, but there may be systematic biases due to the distinct conservation levels, which are maintained as compared to the original Pfam MSA.
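The column-wise randomization of Null model I can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the function name `null_model_1` and the integer encoding of the alignment are assumptions.

```python
import numpy as np

def null_model_1(msa, rng):
    """Permute every column of the (M, L) array `msa` independently.

    Each column keeps exactly the same multiset of symbols, so the profile
    f_i(a) is preserved, while correlations between columns and between rows
    are destroyed.
    """
    msa = np.asarray(msa)
    shuffled = np.empty_like(msa)
    for i in range(msa.shape[1]):
        shuffled[:, i] = rng.permutation(msa[:, i])
    return shuffled

# Toy alignment with 50 sequences of length 8 (hypothetical random data).
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 21, size=(50, 8))
randomized = null_model_1(toy_msa, rng)
```

Calling the function repeatedly with the same generator yields independent randomizations, on which DCA can be run to estimate the finite-sample contribution to the couplings.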

Null model II: Profile- and phylogeny-aware sequence randomization

The second null model is more complicated, since it also preserves (at least approximately) the phylogenetic information contained in the original MSA. Here we assume that this information is coded in the pairwise distances between sequences, i.e. in the matrix $\{D^H_{mn} \mid 1 \le m < n \le M\}$ of Hamming distances between all pairs of sequences, as is done in distance-based phylogeny reconstruction [19, 31].
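For concreteness, the matrix of pairwise Hamming distances can be computed as in the following sketch. It assumes an integer-encoded alignment and counts gaps like any other symbol; both choices, as well as the function name, are assumptions of this illustration.

```python
import numpy as np

def hamming_matrix(msa):
    """Return the (M, M) matrix of Hamming distances between rows of `msa`.

    The distance is the number of aligned positions at which two sequences
    differ (unnormalized mismatch count).
    """
    msa = np.asarray(msa)
    # Compare every pair of rows position by position and count mismatches.
    return (msa[:, None, :] != msa[None, :, :]).sum(axis=2)

# Toy alignment with three sequences of length three (hypothetical data).
toy_msa = np.array([[1, 2, 3],
                    [1, 2, 4],
                    [5, 6, 4]])
toy_dist = hamming_matrix(toy_msa)
```

The broadcasted comparison needs memory of order M² · L, so for alignments with thousands of sequences the rows would have to be processed in chunks.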

The aim of the second null model is to construct a randomized MSA which preserves both the sequence profile given by the position-specific frequencies fi(a), and the matrix $\{D^H_{mn}\}$ of pairwise Hamming distances between sequences. This can be achieved by the following Markov chain Monte Carlo (MCMC) procedure acting on the entire alignment. Our method is initialized using a sample of Null model I, i.e. all coevolutionary and phylogenetic information from the original MSA is destroyed, but the profile is preserved. The resulting randomized MSA after t MCMC steps is called $\tilde{D}^t = \{\tilde{a}_i^m \mid i = 1, \dots, L;\ m = 1, \dots, M\}$.

In step t → t + 1, one column i ∈ {1, …, L} is selected randomly, as well as two rows m, n ∈ {1, …, M}. An exchange of the entries $\tilde{a}_i^m$ and $\tilde{a}_i^n$ is attempted, to obtain a new matrix $\tilde{D}^{t+1}$. This matrix is accepted with a Metropolis-Hastings acceptance probability $p_{\mathrm{acc}}$ of

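$$p_{\mathrm{acc}} = \min\left[1, \exp\left\{-\beta\left(\left\| D^H(\tilde{D}^{t+1}) - D^H(D) \right\| - \left\| D^H(\tilde{D}^{t}) - D^H(D) \right\|\right)\right\}\right], \qquad (5)$$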

otherwise the exchange is refused and the matrix $\tilde{D}^t$ remains unchanged in step t + 1. In this expression, $D^H(D)$ stands for the matrix of Hamming distances between the rows of the original MSA D, analogously for the randomized MSA, and ||⋅|| for the Frobenius norm of matrices. This acceptance rule guarantees that each exchange which brings the distance matrix $D^H(\tilde{D}^t)$ closer to the target matrix $D^H(D)$ is accepted. Exchanges going in the opposite direction are accepted with a smaller probability, which depends exponentially on the formal “inverse temperature” β. Here we use simulated annealing, i.e. we initialize β at a small value and slowly increase it over time, in order to force the Hamming distances of the randomized MSA closer and closer to the ones of the natural MSA. Fig A in S1 Text illustrates that very high correlations (Pearson correlation > 0.97) between the two distance matrices are actually obtained across protein families by our algorithm. Fig B in S1 Text shows the distance histograms between natural and randomized sequences. We find that, with rare exceptions, randomized sequences are at most at 60–70% sequence identity (minimal Hamming distance 30–40%) to the closest natural sequences, showing that Null model II actually generates sequences that are distant from the original Pfam MSA.
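The MCMC procedure with simulated annealing can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the geometric annealing schedule, the default values of `n_steps`, `beta_start` and `beta_end`, the choice of two distinct rows, and all function names are hypothetical. The sketch exploits the fact that swapping two entries of one column only changes the distances between the two affected rows and all other rows, so the Frobenius norm can be updated without recomputing the full distance matrix.

```python
import numpy as np

def hamming_matrix(msa):
    """Matrix of Hamming distances (mismatch counts) between all rows."""
    return (msa[:, None, :] != msa[None, :, :]).sum(axis=2).astype(float)

def null_model_2(msa, rng, n_steps=20000, beta_start=0.1, beta_end=50.0):
    """Profile-preserving randomization that anneals towards the target distances."""
    msa = np.asarray(msa)
    M, L = msa.shape
    target = hamming_matrix(msa)
    # Initialization: a sample of Null model I (independent column permutations).
    current = np.column_stack([rng.permutation(msa[:, i]) for i in range(L)])
    R = hamming_matrix(current) - target       # residual distance matrix
    sq = float(np.sum(R ** 2))                 # squared Frobenius norm of R
    for beta in np.geomspace(beta_start, beta_end, n_steps):  # assumed schedule
        i = rng.integers(L)
        m, n = rng.choice(M, size=2, replace=False)
        col = current[:, i]                    # view into the alignment
        a_m, a_n = col[m], col[n]
        if a_m == a_n:
            continue                           # the swap would change nothing
        # Change of the distance d(m, k) for every other row k; d(n, k) changes
        # by the opposite amount, and d(m, n) itself is unaffected by the swap.
        d_m = (col != a_n).astype(float) - (col != a_m).astype(float)
        d_m[[m, n]] = 0.0
        new_sq = sq + 2.0 * (np.sum((R[m] + d_m) ** 2 - R[m] ** 2)
                             + np.sum((R[n] - d_m) ** 2 - R[n] ** 2))
        new_sq = max(float(new_sq), 0.0)
        delta = np.sqrt(new_sq) - np.sqrt(sq)
        # Metropolis rule of Eq (5): improvements are always accepted.
        if delta <= 0 or rng.random() < np.exp(-beta * delta):
            col[m], col[n] = a_n, a_m
            R[m] += d_m
            R[:, m] += d_m
            R[n] -= d_m
            R[:, n] -= d_m
            sq = new_sq
    return current, float(np.sqrt(sq))

# Toy run on a small random alignment (hypothetical data).
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 21, size=(20, 15))
randomized, final_cost = null_model_2(toy_msa, rng, n_steps=2000)
```

Because only entries within a column are exchanged, the profile is conserved exactly at every step, whatever the acceptance history. The norm here runs over the full symmetric matrix; restricting it to m < n would only rescale β by a constant factor.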

Since the sequence profile remains unchanged by this procedure, and the natural distances between sequences are approached, the randomized MSA contains approximately the same conservation and phylogenetic properties as the biological sequence data. However, potentially existing functional correlations between MSA columns are eliminated. The resulting data-covariance matrix is expected to have non-trivial properties in agreement with [20], and DCA is expected to be able to reproduce this correlation structure via couplings Jij(a, b). Repeating the randomization many times, we can assess the statistics of phylogeny-generated couplings, and thereby the significance of the couplings found by running DCA on the original protein sequences collected in D.

Note also that, in the limit where the formal inverse temperature in Eq (5) is set to β = 0, i.e. in the case of infinite formal temperature $T = \beta^{-1}$, we recover Null model I. One could therefore use β as an interpolating parameter between these two null models.

Null model III: Profile- and phylogeny-aware sequence resampling

To corroborate the results of Null model II, we have also used a complementary strategy using explicitly an evolutionary model and a phylogenetic tree to resample sequences on this tree. The evolutionary model we use is the Felsenstein model for independent-site evolution [32], i.e. a model which does account for site-specific conservation profiles and phylogeny, but not for any intrinsic correlation / coupling between different sites. In this context, the stationary probability distribution of sequences is described by a profile model

$$P_{\omega}(a_1, \dots, a_L) = \prod_{i=1}^{L} \omega_i(a_i), \qquad (6)$$

which has the same form as the profile model in Eq (4), but the factors are not given directly by the empirical amino-acid frequencies in the MSA columns. All sites i = 1, …, L evolve independently, and for each site i the probability of finding some amino acid b, given an ancestral amino acid a some time t before, is given by

$$P(a_i = b \mid a_i = a, t) = e^{-\mu t}\, \delta_{a,b} + \left(1 - e^{-\mu t}\right) \omega_i(b), \qquad (7)$$

with μ being the mutation rate and $\delta_{a,b}$ the Kronecker symbol, which equals one if and only if the two arguments are equal, and zero otherwise. In this model, with probability $e^{-\mu t}$ there is no mutation and the amino acid in position i does not change, while with probability $1 - e^{-\mu t}$ there is at least one mutation. In the latter case, the new amino acid b is emitted with its equilibrium probability ωi(b). While being simple, the Felsenstein model of evolution is frequently used in phylogenetic inference.

The algorithm proceeds in the following way, using the implementation of [22]:

  • A phylogenetic tree T is inferred from the MSA D using FastTree [33]. Instead of using inter-sequence Hamming distances (as in Null model II), FastTree uses a maximum-likelihood approach based on a model of independent-site evolution, i.e. no coevolutionary information is taken into account in tree inference.

  • The mutation rate μ and all site-specific frequencies {ωi(a)} are inferred using maximum likelihood.

  • To resample the MSA according to this model, the root sequence is drawn randomly from Pω, and stochastically evolved on the branches of T using the transition probability Eq (7).

  • The resampled MSA is composed of the sequences obtained at the leaves of T.

This procedure thus allows us to generate many artificial MSA evolved on the same phylogeny and with the same stationary sequence distribution as inferred from the natural sequences given in D, while no coevolutionary information is taken into account at any of the four steps. Note that the generated MSA are expected to be noisier than the ones of Null model II. In particular, the column statistics will differ from fi(a), and the inter-protein Hamming distances DH will also differ more from the ones in the training MSA, cf. Fig C in S1 Text showing that correlations between the DH matrices remain large but not as large as in Null model II (Pearson correlations 0.7–0.95 for the protein families in dataset DS1).
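The resampling step can be illustrated by the following sketch, which evolves a root sequence along a given tree using the transition probability of Eq (7). It is not the implementation of [22]: the tree is assumed to be given as a hypothetical list of (parent, child, branch length) edges with parents listed before their children and node 0 as the root, and the equilibrium frequencies ωi(a) and the rate μ are assumed to have been inferred beforehand. Tree inference with FastTree is not shown.

```python
import numpy as np

def evolve_on_tree(edges, leaves, omega, mu, rng):
    """Resample an alignment on a fixed tree with independent-site evolution.

    edges:  list of (parent, child, branch_length); parents must appear
            before their children, node 0 is the root (assumed format).
    leaves: list of node labels whose sequences form the output rows.
    omega:  (L, q) array of site-specific equilibrium frequencies.
    """
    L, q = omega.shape

    def draw_from_omega():
        # One symbol per site, drawn from the site-specific distribution.
        return np.array([rng.choice(q, p=omega[i]) for i in range(L)])

    seqs = {0: draw_from_omega()}              # root drawn from P_omega
    for parent, child, t in edges:
        # With probability exp(-mu * t) a site is copied unchanged; otherwise
        # a new symbol is emitted from omega_i, as in Eq (7).
        keep = rng.random(L) < np.exp(-mu * t)
        seqs[child] = np.where(keep, seqs[parent], draw_from_omega())
    return np.array([seqs[leaf] for leaf in leaves])

# Toy tree with three leaves (hypothetical topology and branch lengths).
toy_edges = [(0, 1, 0.5), (0, 2, 0.5), (1, 3, 0.1), (1, 4, 0.1)]
toy_leaves = [2, 3, 4]
rng = np.random.default_rng(0)
toy_omega = rng.dirichlet(np.ones(21), size=10)   # 10 sites, 21 symbols
resampled = evolve_on_tree(toy_edges, toy_leaves, toy_omega, mu=1.0, rng=rng)
```

Setting `mu=0.0` freezes evolution, so all leaves then carry the root sequence, which is a convenient sanity check of the transition rule.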

Again DCA can be run on many of the resampled MSA, and the DCA couplings of the natural MSA can be compared with the statistics of the resampled ones, to assess their statistical significance beyond finite-size and phylogenetic effects.

Results and discussion

The two Null models II and III, which both include phylogenetic correlations between proteins, lead to qualitatively coherent, but quantitatively slightly different results, which reflect the different randomization strategies. In the main text of this article, we will present almost exclusively the results of Null model II, in comparison to the natural MSA and Null model I. The results for Null model III are relegated to S1 Text, unless explicitly stated.

The spectral properties of the residue-residue correlation matrix are dominated by phylogenetic effects

Following the mathematical derivations published in [20], we would expect that the residue-residue covariance matrix C = {cij(a, b) | i, j = 1, …, L; a, b ∈ {−, A, …, Y}} with cij(a, b) = fij(a, b) − fi(a)fj(b) is strongly impacted by phylogenetic correlations in the data. More precisely, while totally random data would lead to the Marchenko-Pastur distribution for the eigenvalue spectrum of C, the hierarchical structure of data on the leaves of a phylogenetic tree leads to a power-law tail of large eigenvalues.

It is thus not very astonishing that both Null models II and III show fat tails in the spectrum of their data covariance matrices C (even if Null model II does not fulfill the mathematical conditions of the derivation in [20], because it is not generated according to a hierarchical process), while the spectrum of Null model I has a substantially more compact support, cf. Fig 2, and Figs D and E in S1 Text. The interesting observation is that, at the level of the eigenvalue spectrum, the natural data are hardly distinguishable from the phylogeny-aware Null models II and III, in contrast to Null model I.
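The eigenvalue spectrum of the covariance matrix C can be obtained from a one-hot encoded alignment, as in the following sketch. The function names and the integer encoding are assumptions of this illustration; no sequence weighting or pseudocount is used.

```python
import numpy as np

def covariance_spectrum(msa, q=21):
    """Eigenvalues (in decreasing order) of the residue-residue covariance matrix.

    The matrix has entries c_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b) and size
    (L*q, L*q); `msa` is an (M, L) integer array with entries in 0..q-1.
    """
    msa = np.asarray(msa)
    M, L = msa.shape
    onehot = np.eye(q)[msa].reshape(M, L * q)
    # bias=True normalizes by M, which reproduces f_ij - f_i f_j exactly.
    C = np.cov(onehot, rowvar=False, bias=True)
    return np.sort(np.linalg.eigvalsh(C))[::-1]

def fraction_above(eigvals, lam):
    """Fraction of eigenvalues larger than `lam` (cumulative spectrum)."""
    return float(np.mean(eigvals > lam))

# Toy alignment (hypothetical random data).
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 21, size=(30, 5))
toy_eigvals = covariance_spectrum(toy_msa)
```

Evaluating `fraction_above` on a grid of thresholds for a natural MSA and for samples of the null models gives cumulative curves of the kind compared in this section.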

Fig 2. Eigenvalue spectra of the covariance matrix of the natural MSA and for Null models I and II.

We show the cumulative distribution of the unified eigenvalue spectra for the 60-protein dataset DS2, i.e. the fraction of eigenvalues larger than λ is shown as a function of λ. We observe that the phylogeny-aware Null model II shows the same fat tail for large eigenvalues, which is also present in the natural data, while the non-phylogenetic Null model I has a more compact support. The cutoff of the tail for large λ is an effect of the inter-family variability of the largest eigenvalues among the 60 spectra, cf. Fig D in S1 Text for the 9 individual proteins in dataset DS1.

This suggests the following conclusions: the dominant global residue-residue correlation structure, as far as reflected by the largest eigenvalues of the C-matrix, results from phylogeny. A comparison with principal-component analysis (PCA) relates these eigenvalues to the large-scale organization of sequences in sequence space, e.g. into clusters of sequences. Note that the eigenvectors are expected to contain complementary information, e.g. used for PCA or for the identification of protein sectors [34, 35], defined as multi-residue groups of coherent evolution.

Phylogenetic effects induce couplings in DCA, but these are smaller than couplings found in natural sequences

However, the couplings derived by DCA are not directly related to the largest eigenvalues of the residue-residue covariance matrix. Actually, the computationally most efficient DCA approximations based on mean-field [3] or Gaussian [36] approximations, relate the couplings J to the negative of the inverse of C. The DCA couplings are therefore dominated by the smallest eigenvalues of C, cf. also [37].
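The last statement can be made explicit with a short sketch, under the assumption that C is symmetric and invertible (in practice this requires some regularization and the removal of redundant states, details that are not specified here). The eigenvalues λ_k and orthonormal eigenvectors v_k are symbols introduced only for this illustration:

```latex
C = \sum_k \lambda_k \, v_k v_k^{\top}
\quad \Longrightarrow \quad
J \simeq -C^{-1} = -\sum_k \frac{1}{\lambda_k} \, v_k v_k^{\top}
```

Each mode enters the couplings with weight 1/λ_k, so modes with small eigenvalues contribute most, while the modes with the largest eigenvalues are strongly suppressed.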

Here we use plmDCA; the resulting couplings therefore lack any simple relation to the eigenvalues and eigenvectors of the residue-residue covariance matrix. In contrast to standard plmDCA, we do not use any sequence weighting, since it might interfere with the phylogenetic signal in an uncontrolled way. In Fig 3, we plot histograms of the DCA couplings (APC-corrected Frobenius norms FAPC of the coupling matrices for each residue pair, i.e. the standard output of plmDCA, cf. [29] and the Introduction for a short explanation of APC) for Null models I, II and the natural MSA D for the datasets DS2 of large and DS3 of small-to-medium depth MSA. Equivalent results for the nine individual families in DS1 are shown in Fig F in S1 Text, along with those for Null model III in Fig G in S1 Text.
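How a coupling tensor is condensed into one score per residue pair can be sketched as follows. This is an illustration of the commonly used recipe (zero-sum gauge, Frobenius norm, average-product correction), not the plmDCA code itself: the function name is hypothetical, the norm is taken over all q states although some implementations exclude the gap state, and the averages in the correction run over pairs with i ≠ j.

```python
import numpy as np

def apc_frobenius(J):
    """APC-corrected Frobenius norms from couplings J of shape (L, L, q, q).

    J[i, j] is assumed to be the transpose of J[j, i]; diagonal blocks are
    ignored.
    """
    L = J.shape[0]
    # Zero-sum gauge: remove row and column means of every q x q block.
    Jz = (J - J.mean(axis=2, keepdims=True) - J.mean(axis=3, keepdims=True)
          + J.mean(axis=(2, 3), keepdims=True))
    F = np.sqrt((Jz ** 2).sum(axis=(2, 3)))
    np.fill_diagonal(F, 0.0)
    row_mean = F.sum(axis=1) / (L - 1)         # average over j != i
    total_mean = F.sum() / (L * (L - 1))       # average over all pairs i != j
    F_apc = F - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(F_apc, 0.0)
    return F_apc

# Toy couplings with the required symmetry (hypothetical random data).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6, 21, 21))
toy_J = A + A.transpose(1, 0, 3, 2)
toy_scores = apc_frobenius(toy_J)
```

The gauge transformation makes the score independent of how much of the single-site statistics is absorbed into the couplings rather than into the fields.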

Fig 3. DCA scores derived from natural sequence data and from MSA generated by Null models I and II, for datasets DS2 (panel A) of large MSA, and DS3 (panel B) of smaller MSA.

For the protein families under study, we show the histograms of DCA coupling scores FAPC (APC-corrected Frobenius norm of couplings, the standard output of plmDCA), for the natural MSA and samples of Null models I and II. Here and in the following, histograms are normalized as probability distributions, i.e. to area one under the curve. It becomes evident that phylogenetic effects create, at least for sufficiently deep MSA, larger couplings than expected from finite sample size alone. However, couplings derived from the natural MSA have substantially larger values. The figures also include the positive predictive value (PPV, scale on the right of each panel), i.e. the fraction of true contacts among all couplings FAPC above some threshold θ, as a function of θ, for plmDCA run on the natural MSA. We clearly see that almost all large couplings correctly predict contacts, while the PPV starts to drop once we reach values of FAPC that are also reached by phylogenetic effects in Null model II. We find this to be true for all non-trivial contacts (sequence separation |i − j| > 4) as well as for long-distance contacts (|i − j| > 24).

We see that across all protein families, DCA couplings from natural data reach significantly larger values than those derived from both Null models. The latter two miss in particular the strong tail for large values; their supports being much more concentrated. For large MSA (DS2), the phylogeny-aware Null model II generates larger couplings than the phylogeny-unaware Null model I, i.e. they go beyond what is to be expected from finite-sample effects alone. This latter difference almost vanishes for smaller MSA (DS3), where the coupling histograms for Null models I and II almost coincide. Interestingly, the phylogenetic couplings of Null model II are almost invariant with respect to family size, while finite-size effects (Null model I) decrease with family size, and the tail of large couplings resulting from natural sequences grows with family size.

It is very interesting that, while the spectra are similar for natural data and phylogeny-aware null models, the dominant residue-residue couplings are neither explainable by phylogeny nor by finite sample size. They must consequently result from intrinsic evolutionary constraints acting on the proteins due to natural selection for correctly folded and properly functioning proteins.

Residue-residue contact predictions are moderately impacted by phylogenetic effects

This observation becomes even more interesting when we compare the couplings of residue-residue contacts and non-contacts. In Fig 4A, we show normalized coupling histograms for the two subsets of residue pairs in dataset DS2. The tail of large couplings is present only in the contacts, explaining why DCA and the closely related GREMLIN accurately predict contacts when couplings are high enough [38, 39], cf. also the positive predictive value (PPV) as a function of the coupling in Fig 3. As is visible in Fig H in S1 Text for the nine individual protein families, the contact prediction in a family depends crucially on the size of this tail of large couplings.

Fig 4. Histogram of DCA scores derived from natural sequence data (Panel A) and Null model II (Panel B) for residue-residue contacts and non-contacts.

For the protein families in DS2, we show the histograms of DCA coupling scores (APC-corrected Frobenius norm of couplings), separated for contacts and non-contacts (defined using the representative protein structures in Table B in S1 Text). Only pairs with linear separation |i − j| > 4 along the chain are taken into account.

Using Null model II, we can assess the strength of phylogeny-induced spurious couplings on the same subsets of contacts and non-contacts extracted for DS2, cf. Fig 4B and Fig I in S1 Text (and similarly Fig J in S1 Text for Null model III). We see that the two histograms for contacts and non-contacts become almost identical to each other; due to the larger number of non-contacts, the largest couplings are therefore dominated by non-contacts across all studied protein families. Most interestingly, when we compare the non-contact histograms in both panels of Fig 4, they are extremely similar. It appears that the strength of the non-contact couplings in the natural data is mostly consistent with the phylogeny-induced spurious couplings in Null model II.

The differences between these histograms translate into differences between PPV curves, as shown in Fig 5. The upper panels show the results for the union of all predictions, i.e. the largest DCA coupling scores FAPC of all families come first, for the dataset DS2 of large MSA (Panel A) and the dataset DS3 of smaller MSA (Panel B). We see that the largest couplings derived from the natural MSA are contacts in both cases, but many more contacts are found in the larger MSA. However, in the lower panels we see that for smaller MSA only about 60% of the considered families have such large scores, leading to an initial average PPV (each family considered individually, and the individual PPV are averaged) of only about 0.6, while almost all large MSA lead to initially accurate contact predictions. The figures also contain a contact prediction for randomized data from Null model II. In the upper panels we observe a very weak contact signal; it results from the fact that conserved sites tend to lead to larger phylogeny-induced spurious couplings, but they also tend to be concentrated in proteins, e.g. in active sites or the protein core, and consequently to have a higher contact fraction.
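The PPV curve for a single family can be computed as in the following sketch, which ranks all residue pairs with sequence separation |i − j| > 4 by their score and reports the running fraction of true contacts. The function name and the input layout (a symmetric score matrix and a boolean contact map) are assumptions of this illustration.

```python
import numpy as np

def ppv_curve(scores, contacts, min_separation=5):
    """Positive predictive value as a function of the number of predictions.

    scores:   (L, L) symmetric matrix of coupling scores.
    contacts: (L, L) boolean contact map.
    Only pairs with j - i >= min_separation are ranked (5 means |i - j| > 4).
    """
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_separation)
    order = np.argsort(scores[i, j])[::-1]     # strongest couplings first
    hits = contacts[i, j][order].astype(float)
    return np.cumsum(hits) / np.arange(1, hits.size + 1)

# Toy example with L = 7, where only three pairs pass the separation filter.
toy_scores = np.zeros((7, 7))
toy_contacts = np.zeros((7, 7), dtype=bool)
for (a, b), s, c in [((0, 5), 0.9, True), ((1, 6), 0.5, False), ((0, 6), 0.1, True)]:
    toy_scores[a, b] = toy_scores[b, a] = s
    toy_contacts[a, b] = toy_contacts[b, a] = c
toy_ppv = ppv_curve(toy_scores, toy_contacts)
```

For the joint prediction over several families, the scores and contact labels of all families would be concatenated before sorting, whereas the family-averaged curves are obtained by averaging the individual outputs of this function.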

Fig 5. PPV for residue-residue contact prediction from natural data and Null model II.

The positive predictive values for residue contact prediction are shown for datasets DS2 (Panels A and C) and DS3 (Panels B and D), using the natural data (red, blue) and randomized data from Null model II (green). The upper panels (A,B) show joint contact prediction for all proteins, the lower panels (C,D) the averages over the individual PPV curves for all single families. All panels also show hypothetical PPV curves (purple), which might be reached by a method removing phylogenetic biases; they artificially combine DCA scores obtained from natural MSA on contacts, and from Null model I on non-contacts.

The histograms in Figs 3 and 4 are derived from individual samples of the Null models. One might expect that they change from sample to sample. While this is the case for individual couplings, the histograms remain remarkably unchanged when comparing samples, cf. Figs K and L in S1 Text. These observations show that, while phylogenetic effects result in non-zero couplings between residues when DCA is applied, these couplings are relatively weak and never reach the size of the couplings that allow for a high-confidence contact prediction. This idea is also corroborated by a quantitative assessment of the statistical significance of the couplings derived from natural sequences, as compared to the ones generated by the Null models. To this aim, we assign a z-score to each residue pair (i, j): using 50 samples of Null model II, we determine the mean and standard deviation of the couplings derived from Null model II, individually for each pair (i, j). We use these values to determine the z-score, i.e. the number of standard deviations by which the actual coupling (from the natural MSA) deviates from the mean for Null model II. In Fig 6, we observe that this z-score is highly correlated with the plmDCA score derived from natural MSA, across all families. Almost all DCA scores above 0.2–0.3 have highly significant z-scores of 3 or more. Even larger correlations between DCA and z-scores are observed in Null model III, cf. Fig M in S1 Text. One might be tempted to use this statistical significance score instead of the DCA coupling strength for contact prediction. In Fig 5 we show that the two lead to highly comparable results, with a slight advantage for the standard FAPC after the first few predictions. This difference might result from the aforementioned observation that conserved positions tend to have larger phylogeny-induced couplings (and thus larger variances between different samples of Null model II), causing a systematic reduction of the related z-scores.

Fig 6. z-scores of couplings derived from the natural MSA, as compared to the distribution of couplings derived from Null model II.

For each residue pair (i, j), we calculate the z-score for the DCA score derived from natural data as compared to 50 realizations of Null model II. Panel A shows the data for the dataset DS2 of large MSA, Panel B for DS3 of small-intermediate MSA.
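The per-pair z-score described above can be written down in a few lines. The following is a minimal sketch, not the authors' implementation: the function name is hypothetical, the coupling scores are assumed to be already available as arrays, and the use of the sample standard deviation (`ddof=1`) is our assumption, since the text does not specify the estimator.

```python
import numpy as np

def coupling_zscores(natural, null_samples, eps=1e-12):
    """z-score of each natural coupling relative to null-model couplings.

    natural      : array of shape (...), DCA scores from the natural MSA
    null_samples : array of shape (S, ...), scores from S null-model samples
    eps          : floor on the standard deviation to avoid division by zero
    """
    natural = np.asarray(natural, dtype=float)
    null_samples = np.asarray(null_samples, dtype=float)
    mean = null_samples.mean(axis=0)          # per-pair mean over samples
    std = null_samples.std(axis=0, ddof=1)    # per-pair sample std (assumption)
    return (natural - mean) / np.maximum(std, eps)

# Synthetic illustration: 50 null samples for a protein of length 30.
rng = np.random.default_rng(0)
null_scores = rng.normal(0.05, 0.02, size=(50, 30, 30))
natural_scores = rng.normal(0.05, 0.02, size=(30, 30))
natural_scores[3, 17] = 0.4                   # one strong, contact-like coupling
z = coupling_zscores(natural_scores, null_scores)
```

Because the standard deviation is estimated separately for each pair, pairs with strongly fluctuating null couplings receive reduced z-scores, in line with the remark on conserved positions above.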

Conclusion

Global coevolutionary modeling treats multiple-sequence alignments of homologous protein sequences as collections of independently and identically distributed samples of some unknown probability distribution P(a1, …, aL), which has to be reconstructed from data. The assumption of independence is obviously violated due to the common evolutionary history; in particular, sequences from related species show strong phylogenetic correlations.

It is, however, notoriously difficult to unify the idea of a global model including coevolutionary covariation between sites and phylogenetic correlations between sequences. Statistical corrections may improve the situation slightly, but they are too simple to take the hierarchical correlation structure into account, which is generated by the evolutionary dynamics on a phylogenetic tree.

Here we have proposed to approach this problem in a complementary way, by introducing null models (i.e. randomized or re-emitted multiple-sequence alignments) that reproduce conservation and phylogeny, but do not contain any real coevolutionary signal. When applying Direct Coupling Analysis as a prototypical global coevolutionary modeling approach, we observe that phylogenetic correlations between sequences lead to a changed residue-residue correlation structure, represented by a fat tail in the eigenvalue spectrum of the data covariance matrix. They also lead to distributed couplings, which, however, are smaller than the largest couplings found when applying DCA to natural sequence data, i.e. smaller than the couplings used for residue-residue contact prediction. Since the latter are significantly larger than couplings resulting from phylogeny, we can conclude that the first predicted contacts are influenced only to a very limited degree by phylogenetic couplings.
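To make the notions of a randomized alignment and of the covariance spectrum concrete, here is a minimal sketch. It assumes a conservation-only randomization that permutes each alignment column independently, and an MSA encoded as integers in 0..q-1 with q = 21 states (20 amino acids plus gap); both the function names and this encoding are illustrative assumptions rather than the published implementation.

```python
import numpy as np

def shuffle_columns(msa, rng):
    """Permute every column independently.

    Single-site frequencies are preserved exactly, while correlations between
    columns and the similarity structure between rows are destroyed.
    """
    msa = np.asarray(msa)
    out = np.empty_like(msa)
    for j in range(msa.shape[1]):
        out[:, j] = rng.permutation(msa[:, j])
    return out

def covariance_spectrum(msa, q=21):
    """Eigenvalues (descending) of the covariance of the one-hot encoded MSA."""
    msa = np.asarray(msa)
    n_seq, length = msa.shape
    onehot = np.zeros((n_seq, length * q))
    rows = np.repeat(np.arange(n_seq), length)
    cols = (np.arange(length) * q + msa).ravel()   # column index of (site, state)
    onehot[rows, cols] = 1.0
    cov = np.cov(onehot, rowvar=False)             # (L*q) x (L*q) covariance
    return np.linalg.eigvalsh(cov)[::-1]

# Synthetic illustration with random integers standing in for a real alignment.
rng = np.random.default_rng(1)
msa = rng.integers(0, 21, size=(50, 6))
shuffled = shuffle_columns(msa, rng)
spectrum = covariance_spectrum(shuffled)
```

Comparing the spectra of a natural alignment and of its randomized counterparts is one way to visualize how much of the large-eigenvalue tail survives when a given type of correlation is removed.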

However, it is also striking that, across the studied protein families, the phylogeny-caused couplings in Null models II and III almost reach the DCA-score threshold found before for accurate contact prediction. This suggests that the suppression of phylogenetic biases in the data (or their better consideration in model inference) may shift this threshold down and therefore allow for predicting many more contacts. The potential gain would be limited by the finite depth of the MSA, whose effects are assessed by Null model I. We can therefore quantitatively assess the potential of removing phylogenetic effects by the following hypothetical DCA output: on all contacts in our datasets DS2 and DS3 we use the standard plmDCA scores derived from natural data, and on all non-contacts we remove phylogenetic effects by using couplings derived by running plmDCA on samples of Null model I. The resulting hypothetical PPV curves are given in Fig 5 as purple lines: they are substantially higher than the real PPV obtained from the original data. In the case of the large MSA in DS2, we find a broad plateau of almost perfect PPV close to one, starting to drop only after about 50 top residue pairs, but staying above 90% even after 100 pairs, as compared to about 70% for the real DCA predictions. Even in the small-to-medium-depth MSA of DS3 the potential effect of removing phylogenetic effects is considerable, even if the finite-sample effect is much more pronounced. While in the real data fewer than 60% of the considered proteins start with a true-positive prediction, the hypothetical phylogeny-removed prediction starts with a PPV above 70%. Since coevolution-based scores are also input to most of the recent deep-learning-based contact predictors, we could imagine that corrections for phylogenetic effects would also improve the accuracy of these methods.
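The hypothetical DCA output described above amounts to a simple score substitution followed by the usual ranking. A minimal sketch, with hypothetical function names and toy numbers, and assuming that the scores from the natural MSA and from a Null model I sample have already been computed for the same residue pairs:

```python
import numpy as np

def ppv_curve(scores, is_contact):
    """PPV(n): fraction of true contacts among the n highest-scoring pairs."""
    scores = np.asarray(scores, dtype=float)
    is_contact = np.asarray(is_contact, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    return np.cumsum(is_contact[order]) / np.arange(1, len(scores) + 1)

def hypothetical_scores(natural, null_one, is_contact):
    """Natural scores on contacts, Null model I scores on non-contacts."""
    natural = np.asarray(natural, dtype=float)
    null_one = np.asarray(null_one, dtype=float)
    is_contact = np.asarray(is_contact, dtype=bool)
    return np.where(is_contact, natural, null_one)

# Toy numbers (illustrative only): two contacts, two non-contacts.
is_contact = np.array([True, True, False, False])
natural = np.array([0.8, 0.3, 0.5, 0.4])      # non-contacts inflated by phylogeny
null_one = np.array([0.1, 0.2, 0.05, 0.1])    # finite-sampling noise only

real_ppv = ppv_curve(natural, is_contact)
hypo_ppv = ppv_curve(hypothetical_scores(natural, null_one, is_contact), is_contact)
```

Note that this construction uses the known contact map to decide which score to take, so it is an upper-bound style thought experiment and not a usable predictor.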

Using many realizations of the Null models, we can provide a z-score for the couplings found in the original sequence data, and thereby assess their statistical significance beyond effects of finite and phylogenetically correlated sampling. This is of interest in exploratory studies: in a recent study, one of us has used Null model II in a collaboration aiming at finding potential epistatic effects in a global analysis of more than 50,000 SARS-CoV-2 genomes [40]. Due to the obvious strong correlation between these very recently diverged genomes, potential epistatic couplings have to be assessed carefully, and scoring them by Null model II has turned out to be an essential element in the identification of a sparse, but statistically significant genome-wide network of epistatic couplings.

Supporting information

S1 Text. The file contains all supporting tables and figures cited in the main text.

(PDF)

Acknowledgments

We thank Pierre Barrat-Charlaix and Alejandro Lage-Castellanos for numerous discussions.

Data Availability

Data and code are available at https://github.com/ed-rodh/Null_models_I_and_II.

Funding Statement

This work (MW) was funded by the EU H2020 Research and Innovation Programme MSCA-RISE-2016 under Grant Agreement No. 734439 InferNet. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. De Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nature Reviews Genetics. 2013;14(4):249–261. doi: 10.1038/nrg3414
  • 2. Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: a key issues review. Reports on Progress in Physics. 2018;81(3):032601. doi: 10.1088/1361-6633/aa9965
  • 3. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences. 2011;108(49):E1293–E1301. doi: 10.1073/pnas.1111471108
  • 4. Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nature biotechnology. 2012;30(11):1072–1080. doi: 10.1038/nbt.2419
  • 5. Ovchinnikov S, Park H, Varghese N, Huang PS, Pavlopoulos GA, Kim DE, et al. Protein structure determination using metagenome sequence data. Science. 2017;355(6322):294–298. doi: 10.1126/science.aah4043
  • 6. Morcos F, Schafer NP, Cheng RR, Onuchic JN, Wolynes PG. Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection. Proceedings of the National Academy of Sciences. 2014;111(34):12408–12413. doi: 10.1073/pnas.1413575111
  • 7. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Molecular biology and evolution. 2016;33(1):268–280.
  • 8. Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CP, Springer M, Sander C, et al. Mutation effects predicted from sequence co-variation. Nature biotechnology. 2017;35(2):128–135. doi: 10.1038/nbt.3769
  • 9. Cheng RR, Morcos F, Levine H, Onuchic JN. Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proceedings of the National Academy of Sciences. 2014;111(5):E563–E571. doi: 10.1073/pnas.1323734111
  • 10. Tian P, Louis JM, Baber JL, Aniana A, Best RB. Co-Evolutionary Fitness Landscapes for Sequence Design. Angewandte Chemie International Edition. 2018;57(20):5674–5678. doi: 10.1002/anie.201713220
  • 11. Reimer JM, Eivaskhani M, Harb I, Guarné A, Weigt M, Schmeing TM. Structures of a dimodular nonribosomal peptide synthetase reveal conformational flexibility. Science. 2019;366(6466). doi: 10.1126/science.aaw4388
  • 12. Russ WP, Figliuzzi M, Stocker C, Barrat-Charlaix P, Socolich M, Kast P, et al. An evolution-based model for designing chorismate mutase enzymes. Science. 2020;369(6502):440–445. doi: 10.1126/science.aba3304
  • 13. Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS computational biology. 2017;13(1):e1005324. doi: 10.1371/journal.pcbi.1005324
  • 14. Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics. 2018;34(19):3308–3315.
  • 15. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, et al. Protein structure prediction using multiple deep neural networks in CASP13. Proteins: Structure, Function, and Bioinformatics. 2019. doi: 10.1002/prot.25834
  • 16. Greener JG, Kandathil SM, Jones DT. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nature communications. 2019;10(1):1–13. doi: 10.1038/s41467-019-11994-0
  • 17. Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences. 2020;117(3):1496–1503. doi: 10.1073/pnas.1914677117
  • 18. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences. 2009;106(1):67–72. doi: 10.1073/pnas.0805923106
  • 19. Felsenstein J. Inferring phylogenies. vol. 2. Sinauer Associates, Sunderland, MA; 2004.
  • 20. Qin C, Colwell LJ. Power law tails in phylogenetic systems. Proceedings of the National Academy of Sciences. 2018;115(4):690–695. doi: 10.1073/pnas.1711913115
  • 21. Obermayer B, Levine E. Inverse Ising inference with correlated samples. New Journal of Physics. 2014;16(12):123017. doi: 10.1088/1367-2630/16/12/123017
  • 22. Rodriguez Horta E, Barrat-Charlaix P, Weigt M. Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data. Entropy. 2019;21(11):1090. doi: 10.3390/e21111090
  • 23. Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008;24(3):333–340.
  • 24. Hockenberry AJ, Wilke CO. Phylogenetic weighting does little to improve the accuracy of evolutionary coupling analyses. Entropy. 2019;21(10):1000. doi: 10.3390/e21101000
  • 25. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic acids research. 2019;47(D1):D427–D432.
  • 26. Vorberg S, Seemayer S, Söding J. Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction. PLoS computational biology. 2018;14(11):e1006526. doi: 10.1371/journal.pcbi.1006526
  • 27. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The protein data bank. Nucleic acids research. 2000;28(1):235–242.
  • 28. Jones DT, Buchan DW, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012;28(2):184–190.
  • 29. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E. 2013;87(1):012707. doi: 10.1103/PhysRevE.87.012707
  • 30. Cohen O, Ashkenazy H, Levy Karin E, Burstein D, Pupko T. CoPAP: coevolution of presence–absence patterns. Nucleic acids research. 2013;41(W1):W232–W237.
  • 31. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution. 1987;4(4):406–425.
  • 32. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution. 1981;17(6):368–376. doi: 10.1007/BF01734359
  • 33. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PloS one. 2010;5(3):e9490. doi: 10.1371/journal.pone.0009490
  • 34. Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: evolutionary units of three-dimensional structure. Cell. 2009;138(4):774–786. doi: 10.1016/j.cell.2009.07.038
  • 35. Rivoire O, Reynolds KA, Ranganathan R. Evolution-based functional decomposition of proteins. PLoS Computational Biology. 2016;12(6):e1004817. doi: 10.1371/journal.pcbi.1004817
  • 36. Baldassi C, Zamparo M, Feinauer C, Procaccini A, Zecchina R, Weigt M, et al. Fast and accurate multivariate Gaussian modeling of protein families: Predicting residue contacts and protein-interaction partners. PloS ONE. 2014;9(3):e92721. doi: 10.1371/journal.pone.0092721
  • 37. Cocco S, Monasson R, Weigt M. From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction. PLoS computational biology. 2013;9(8):e1003176. doi: 10.1371/journal.pcbi.1003176
  • 38. Uguzzoni G, Lovis SJ, Oteri F, Schug A, Szurmant H, Weigt M. Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis. Proceedings of the National Academy of Sciences. 2017;114(13):E2662–E2671. doi: 10.1073/pnas.1615068114
  • 39. Anishchenko I, Ovchinnikov S, Kamisetty H, Baker D. Origins of coevolution between residues distant in protein 3D structures. Proceedings of the National Academy of Sciences. 2017;114(34):9122–9127. doi: 10.1073/pnas.1702664114
  • 40. Zeng HL, Dichio V, Horta ER, Thorell K, Aurell E. Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes. Proceedings of the National Academy of Sciences. 2020;117(49):31519–31526. doi: 10.1073/pnas.2012331117
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008957.r001

Decision Letter 0

Rita Casadio, Arne Elofsson

7 Oct 2020

Dear Weigt,

Thank you very much for submitting your manuscript "Phylogenetic correlations have limited effect on coevolution-based contact prediction in proteins" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Please consider answering all the criticisms

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. In particular, you need to show that the conclusions are also relevant for more families, in particular for much smaller ones, and that the approach really helps to improve predictions.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Rita Casadio

Guest Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Coevolution analysis forms the basis of many recent methods for template-free protein structure prediction. It aims to predict 3D contacts between residues in protein domains from their statistical signature of coevolution. All methods in use today treat the sequences in the input multiple sequence alignment (MSA) as being independent, with some simple sequence weighting to correct for different degrees of redundancy. This is a gross assumption because in fact the protein sequences are statistically dependent on each other, as described by the phylogenetic tree that can be reconstructed from the MSA.

This study investigates the influence of the assumption of independence by randomizing real MSAs such that the sequence composition at each position is conserved and the pairwise distances between the sequences are conserved, while the statistical signature of co-evolution between residues is removed (except for noise that remains, of course).

The main results of the study are reported to be

(1) that spurious co-evolution signatures caused by the unjustified approximation of independence are not limiting our ability to predict contacts for the most strongly co-evolving pairs of positions;

(2) that we could predict many more contacts if we were able to suppress the spurious couplings by a proper treatment of the phylogenetic dependence of the sequences.

Main issues:

1. Result (1) is not as general as the authors purport. It only applies to MSAs with very many sequences and as such is close to a self-fulfilling prophecy. If the MSAs are large enough, the phylogenetic noise will not limit the contact predictions. Also, these results are not new and not surprising. We have known for a long time that the position pairs obtaining the top DCA scores in large MSAs are to a large fraction in contact, except for false positives caused by homo-multimerization and other well-described effects. Other studies have shown that the independence assumption adds considerable noise to the coupling scores.

2. To learn a huge number of coupling parameters without overfitting, models such as GREMLIN and DCA use strong regularizing square penalties that push the coupling parameters towards zero. MSAs with many diverse sequences therefore produce higher scores, as the signal can better push against the strong regularizers. Therefore, the larger and more diverse the MSA is, the larger the top co-evolution scores usually are and the less limiting the influence of spurious signals from phylogeny will be. Therefore the depth and diversity of the MSA is an absolutely critical parameter to understand the influence and limitations induced by the independence assumption. The authors have ignored this parameter.

Also, they show results only for very large MSAs (table 1), large enough for current methods to predict a sufficient number of contacts to determine their structures reliably. The interesting regime is of course that of an intermediate number of sequences, 50–500 for example, for which it is still challenging to predict contacts.

3. So what? It is not clear in what way the insights in this study would help us to improve contact prediction in the future. How precisely might we reduce or remove the spurious signals from the unjustified assumption of independence of the sequences? How can we use the presented null models to improve contact prediction?

Reviewer #2: This is an interesting and insightful paper.

It tackles a long-standing issue in DCA, namely the relevance of phylogenetic correlations between sequences and how they affect the inferred co-evolutionary coupling strengths.

The authors address the problem elegantly, and their main result is that, mostly, phylogenetic correlations affect only low-strength couplings, so that high-strength ones can be safely trusted.

Although their work is convincing, I have some issues that I would like to see addressed.

1) As a practitioner, I (and I guess most people using DCA for practical applications) know that DCA works extremely well, with really very very few false positives within the first N contacts (N being the length of the sequences in the MSA). Once down-weighted by similarity, thus, it seems that phylogenetic correlations play only a minor role. This is perfectly in line with their findings.

My understanding is that correlated mutations for O(N) pairs give the same final scores whether they are "clustered" on branches of the tree or scattered everywhere. And in principle, the only reason why we should downplay their role when the same pair appears in a subbranch is not because they are close on the tree, but because the full sequences are close to each other (otherwise they would be wonderful co-occurrences to be duly appreciated by DCA). But this sequence similarity is precisely what re-weighting accounts for.

In this respect, I have to admit that I never felt the worry about phylogenetic correlations as compelling and urgent (interesting for sure to understand, but not necessarily as a major concern).

Could the authors comment?

2) Connected to this, I'd also like to comment on the null models they used in this manuscript, to understand to what extent theirs is not a circular argument.

My understanding is that phylogenetic trees are built using the independent site hypothesis. Some of them are built using (weighted) Hamming distances, others using likelihood approaches. In this respect, what the authors do is perfectly consistent with the correlation present in these trees. Nonetheless, that's the definition of phylogeny that people have given because i) it works and ii) it is mathematically and computationally tractable. Should the phylogenetic information that really affects DCA be encoded in ways that escape the independent site approximation used to build trees, it would not be addressed by the present manuscript.

In this respect they might more carefully phrase their results stating that phylogeny information related to tree-building is only mildly affecting the results, whereas deeper phylogenetic information, beyond the single site approximation, might still do it and this should be addressed in future work.

This is just a suggestion to stress that what is real (phylogenetic correlation) is different from the approximate ways we use to capture it (present tree-building algorithms).

Reviewer #3: In this paper, the authors performed an analysis based on Null models to support the idea that phylogenetic correlations only provide a limited contribution to coevolution-based contact prediction methods.

To prove their claim, the authors defined two types of Null models, preserving conservation only and conservation+phylogeny from the original MSA, and compared the DCA couplings obtained with these models against those derived from the original MSA. As a result, they show that, while phylogenetic effects are dominant in determining the structure of the residue-residue correlation matrix, couplings derived by DCA using MSAs preserving conservation+phylogeny but not containing any coevolution signal are not sufficient to distinguish residue-residue contacts.

Overall, I believe that the procedure applied is sound and results are properly supported by data.

However, I suggest to better justify the choice of 9 protein families analyzed in this study, possibly providing more details on data selection. In general, I believe that enlarging the set of considered families would definitely add to the paper. Maybe, results for some additional family could be reported and discussed in the supplementary material.

Minor:

I suggest to reproduce all figures to improve readability in B/W

There is a missing citation in Fig4 caption

Reviewer #4: The authors investigate the influence of phylogenetic correlations in coevolutionary-based contact prediction.

The effect of phylogenetic correlations is investigated by analyzing the performances of plmDCA (a coevolutionary-based contact predictor) on real MSAs against those obtained on the same MSAs after a reshuffling that keeps unaltered the position-specific amino acid frequencies and the pairwise Hamming distances between sequences (i.e. Null model II).

As a general comment, the approach looks interesting although I find it hard to understand whether the conclusions are interesting or not. My major concerns follow.

1. The authors evaluate DCA couplings at sequence separation > 4. Such sequence separation is probably too short to get strong conclusions from the analysis. Residue-residue contacts are more abundant for short sequence separations and are usually ignored since they "mask" the more interesting long-range contacts. In fact, contact prediction accuracy is usually assessed on long-range contacts (sequence separation > 24), which provide stronger constraints on the protein 3D structure.

2. It is not clear why the authors chose to focus only on 9 Pfam protein families. Since this work essentially provides a statistical analysis, more robust conclusions may be obtained on a larger set of protein families. For example, I would suggest to consider the (quite popular) benchmark set of 150 single chain, single domain proteins taken from the PSICOV/MetaPSICOV benchmark dataset (obtained from Pfam). Also, the authors should find some way to summarize all the results in a single plot/table instead of showing a different plot for each protein family. As an alternative, the authors should at least justify why this cannot be done or why this is not useful/interesting.

3. In order to make the results more accessible to the scientific community that works on protein contact prediction, it would be useful to compare the contact prediction accuracy on the real MSAs against the accuracy obtained on the shuffled MSAs (e.g. Null model II). In particular, it is not clear whether plmDCA achieves good or bad prediction accuracy on the benchmark set.

4. The conclusions are a bit misleading. For instance, in Section "Conclusion" the authors state that "contact prediction is influenced only to a very limited degree by phylogenetic couplings" but they also add that "it is also striking that, across several protein families, the phylogeny-caused couplings in Null models II and III almost reach the DCA-score threshold found before for accurate contact prediction. This suggests that the suppression of phylogenetic biases in the data (or their better consideration in model inference), may shift this threshold down and therefore allow for predicting much more contacts." First of all, I remark again that 9 protein families are not enough to draw strong conclusions. Anyway, such concluding remarks do not clarify whether we should care or not about phylogenetic bias in coevolutionary-based contact prediction.

5. It would be interesting to know (or at least discuss) whether the phylogeny-caused couplings identified on the shuffled MSAs can be used to filter out false positive contact predictions. Maybe I am wrong, but at least in principle, it seems to me that this could be possible by simply performing a Z-test.

6. I would be curious to know how much the null model II shuffles the single protein sequences. Such an analysis could be easily done by simply comparing/aligning a sequence in the shuffled MSA against those in the real MSA in order to detect the one with the highest sequence similarity. The average sequence similarity can thus give an idea of how much the shuffled MSA is similar to the real one.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Paolo De Los Rios

Reviewer #3: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008957.r003

Decision Letter 1

Rita Casadio, Arne Elofsson

17 Feb 2021

Dear Weigt,

Thank you very much for submitting your manuscript "On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Dear authors, there are still some pending issues raised by one of the reviewers. I encourage you to take these into consideration and resubmit the manuscript with comments as soon as possible.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Rita Casadio

Guest Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for the clearer statement of their conclusions and for the improvements of their analyses despite the difficulties of working with bad Internet connections etc. As such the study sheds light on an interesting and not much quantitatively researched aspect of Potts models.

I have two remaining major issues that should be easy to address.

Major point:

1. Starting on line 393, the authors write:

`Suppression of phylogenetic biases in the data (or their better consideration in model inference) ... may allow for predicting much more contacts. The comparison with Null model I, which contains only finite-sample effects at given residue conservations, shows that sufficiently big protein families will be needed to exploit this. For the smallest protein families, represented by our dataset DS3, the finite-sample effects are found to be almost as large as the phylogenetic effects on the DCA results, while the two were well separated in the case of the deeper MSA in datasets DS1 and DS2.'

Looking at Fig. 3B (showing data for DS3 with small MSAs), at the F^APC value corresponding to a PPV of 0.5, the density of null model I predictions in green is about a factor of 2 lower than the density of null model II predictions. This means that, if the phylogenetic biases could be removed, the PPV would increase from TP/(TP+FP) = 0.5 to roughly TP'/(TP'+FP') = TP/(TP+FP/2) = 0.67. This is no small improvement and contradicts the above statement made by the authors.

Obviously, this result is somewhat surprising as the logarithmic y axis hides the magnitude of the effect, and it is cumbersome to read it off the graph. It would therefore be important to add traces to Fig. 5 that show the PPV versus number of predictions for the hypothetical case that phylogenetic effects can be corrected, for DS2 and DS3. The discussion and conclusions will probably have to be adjusted accordingly.

2. Did the authors use sequence weighting? For the comparison of the green and magenta traces in Fig. 3 it is strictly necessary to not use sequence weighting. Otherwise, the weights for null model I would be close to one (since sequences are independent) while those of null model II would be significantly smaller than 1, which would systematically shrink the sizes of the coupling coefficients under null model II and make the distribution incomparable to null model I. Sequence weighting would also render the analysis in point 1 meaningless.
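
The arithmetic in point 1 and the weighting concern in point 2 can be illustrated with a minimal Python sketch. `corrected_ppv` assumes that removing phylogenetic bias would scale the number of false positives by a fixed factor (0.5 in the reviewer's reading of Fig. 3B) while leaving the true positives unchanged. `sequence_weights` follows the reweighting scheme commonly used in DCA, in which each sequence is down-weighted by the number of sequences within a given identity threshold. The function names and the 0.8 threshold are assumptions for illustration and are not taken from the manuscript.

```python
def corrected_ppv(tp, fp, fp_fraction_kept=0.5):
    """Hypothetical PPV if only a fraction of the false positives survived.

    tp, fp: counts of true and false positives among the current predictions.
    fp_fraction_kept: assumed fraction of false positives remaining after
    phylogenetic effects are corrected (0.5 means half are removed).
    """
    return tp / (tp + fp_fraction_kept * fp)


def sequence_weights(msa, identity_threshold=0.8):
    """Weight of each sequence: 1 / (number of sequences, itself included,
    whose fractional identity to it is at least identity_threshold).

    Assumes all sequences are aligned and have the same length.
    """
    weights = []
    for s in msa:
        neighbours = sum(
            1
            for t in msa
            if sum(a == b for a, b in zip(s, t)) / len(s) >= identity_threshold
        )
        weights.append(1.0 / neighbours)
    return weights


def effective_size(msa, identity_threshold=0.8):
    """Effective number of sequences, i.e. the sum of all sequence weights."""
    return sum(sequence_weights(msa, identity_threshold))
```

With equal numbers of true and false positives (PPV 0.5), `corrected_ppv(50, 50)` gives 2/3, in line with the estimate of roughly 0.67. Identical or near-identical sequences share their weight, so an MSA with strong phylogenetic clustering has an effective size well below its number of rows, whereas an MSA of independent sequences keeps weights close to one.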

Minor points:

3. Define abbreviation APC, explain in 1-2 sentences what it does and give a proper reference for completeness.

4. Define `normalized frequency' used in Fig. 3.

5. The caption of Figure 4 is missing an explanation of what A and B refer to.

6. In figure 4B, what sense does contact and non-contact make for null model II where there are no contacts?

7. Line 363:

`In Fig. 5 we show that the two lead to highly comparable results, with a slight advantage for the standard F^APC after the first few predictions.'

Please explain why. My guess: Residue conservation is positively correlated with both higher variance of the DCA scores and with a higher probability for a contact. Therefore, dividing by the square root of the variance can reduce PPV.
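
For reference on point 3, the average product correction (APC) is commonly written as F^APC_ij = F_ij - (F_i. F_.j) / F_.., where F_i. and F_.j are row and column means and F_.. is the overall mean of the coupling scores. The sketch below is a minimal Python illustration under the assumptions that the score matrix is symmetric and that all means are taken over off-diagonal entries only; the function name `apc` is hypothetical, and the exact convention used in the manuscript may differ.

```python
def apc(scores):
    """Average product correction of a symmetric score matrix (list of lists).

    Row means and the overall mean are taken over off-diagonal entries only.
    The diagonal of the returned matrix is set to 0.0.
    """
    n = len(scores)
    row_mean = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Every row has n - 1 off-diagonal entries, so the mean of the row means
    # equals the mean over all off-diagonal entries.
    overall_mean = sum(row_mean) / n
    if overall_mean == 0:
        raise ValueError("overall mean score is zero; APC is undefined")
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # By symmetry, the column mean of j equals the row mean of j.
                corrected[i][j] = (
                    scores[i][j] - row_mean[i] * row_mean[j] / overall_mean
                )
    return corrected
```

The correction subtracts the part of each score that is explained by the average score levels of the two positions involved, so a matrix with a uniform background score is mapped to zero everywhere.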

Reviewer #3: I have no further comments/concerns.

Reviewer #4: The revised version of the manuscript addresses all my concerns. I have no further comments at this time.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Reviewer #4: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008957.r005

Decision Letter 2

Arne Elofsson

9 Apr 2021

Dear Weigt,

We are pleased to inform you that your manuscript 'On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008957.r006

Acceptance letter

Arne Elofsson

20 May 2021

PCOMPBIOL-D-20-01261R2

On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins

Dear Dr Weigt,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. The file contains all supporting tables and figures cited in the main text.

    (PDF)

    Attachment

    Submitted filename: reviewer_reply.pdf

    Attachment

    Submitted filename: reviewer_reply.pdf

    Data Availability Statement

    Data and code are available at https://github.com/ed-rodh/Null_models_I_and_II.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
