Significance
Most biological processes rely on specific interactions between proteins, but the experimental characterization of protein−protein interactions is a labor-intensive task of frequently uncertain outcome. Computational methods based on exponentially growing genomic databases are urgently needed. It has recently been shown that coevolutionary methods are able to detect correlated mutations between residues in different proteins, which are in contact across the interaction interface, thus enabling the structure prediction of protein complexes. Here we show that the applicability of coevolutionary methods is much broader, connecting multiple scales relevant in protein−protein interaction: the residue scale of interprotein contacts, the protein scale of specific interactions between paralogous proteins, and the evolutionary scale of conserved interactions between homologous protein families.
Keywords: coevolution, protein−protein interaction networks, paralog matching, statistical inference, direct coupling analysis
Abstract
Understanding protein−protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein−protein interactions are urgently needed. Such methods should connect multiple scales, from evolutionarily conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue−residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue−residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.
Almost all biological processes depend on interacting proteins. Understanding protein−protein interactions is therefore key to our understanding of complex biological systems. In this context, at least two questions are of interest: First, the question “who with whom,” i.e., which proteins interact; this concerns the networks connecting specific proteins inside one organism, but also—in the context of this article—the evolutionary perspective of protein−protein interactions, which are conserved across different species. Their coevolution is at the basis of many modern computational techniques for characterizing protein−protein interactions. The second question is the question “how” proteins interact with each other, in particular, which residues are involved in the interaction interfaces, and which residues are in contact across the interfaces. Such knowledge may provide important mechanistic insight into questions related to interaction specificity or competitive interaction with partially shared interfaces.
The experimental identification of protein−protein interactions is an arduous task (for reviews, cf. refs. 1 and 2): High-throughput techniques that aim to identify protein−protein interactions in vivo or in vitro are well documented and include large-scale yeast two-hybrid assays and protein affinity mass spectrometry assays. Such large-scale efforts have revealed useful information but are hampered by high false positive and false negative error rates. Structural approaches based on protein cocrystallization are intrinsically low-throughput and of uncertain outcome due to the unphysiological treatment needed for protein purification, enrichment, and crystallization. It is therefore tempting to use the exponentially increasing genomic databases to design in silico techniques for identifying protein−protein interactions (cf. refs. 3 and 4). Prominent techniques, to date, include the search for colocalization of genes on the genome (e.g., operons in bacteria) (5, 6), the Rosetta stone method (domains fused to a single protein in some genome are expected to interact in other genomes) (7, 8), and also coevolutionary techniques like phylogenetic profiling (correlated presence or absence of interacting proteins in genomes) (9) or similarities between phylogenetic trees of groups of orthologous proteins (compare the mirrortree method) (10, 11). Despite the success of all these methods, their sensitivity is limited due to the use of relatively coarse global criteria (genomic location, phylogenetic distance) instead of full amino acid sequences.
The availability of thousands of sequenced genomes (12), thanks to next-generation sequencing techniques, enables the application of much finer-scale statistical modeling approaches, which take into account the full sequence (13). In this context, direct coupling analysis (DCA) (14) was developed to detect direct interprotein coevolution and, in turn, interprotein residue−residue contacts between bacterial signal transduction proteins, to help to assemble protein complexes (15, 16) and shed light on interaction specificity (17, 18). The applicability of DCA and related coevolutionary approaches (19, 20) to protein−protein interactions far beyond the signaling system has been recently established (21, 22).
However, these methods require a large joint multiple sequence alignment (MSA) of at least about 1,000 amino acid sequence pairs to work accurately. Each line of this MSA concatenates a pair of interacting proteins. So far, the application of coevolutionary methods remains therefore restricted to those cases where such joint alignments could be constructed easily: (i) Each species has only a single copy of the family, i.e., no paralogs exist. Matching of interacting proteins can be achieved by uniqueness in the genome. (ii) Even if paralogs exist, genes of interacting proteins are frequently colocalized on the genome and can therefore be matched by chromosomal vicinity. This finding is true, in particular, in the case of bacteria; functionally related proteins are frequently coded in operons and consequently cotranscribed.
Colocalization is used extensively in the construction of joint MSA for covariance analysis (22–25). However, the case of multiple paralogs with noncolocalized genes has remained out of reach for coevolutionary analysis, despite its enormous relevance: Out of the 4,499 Pfam-29 (26) protein families with more than 500 sequences, 3,221 have, on average, more than two paralogs per species, and 1,378 families have even more than five paralogs. Another observation underlines the importance of addressing genes that are not colocalized: Out of 3,643 protein−protein interactions reported for Escherichia coli in the IntAct Molecular Interaction database (27), only 1,341 (36.8%) concern intraoperon interactions.
Here we suggest an approach, based on a simultaneous construction of the joint MSA and detection of interprotein coevolution, to solve the problem of matching paralogs. The method is based on the idea that the correct matching of interacting paralogs maximizes the interprotein coevolutionary signal. The corresponding optimization problem turns out to be extraordinarily hard to solve exactly. We therefore propose two approximate strategies: The first one is computationally very efficient and of sufficient accuracy for subsequent contact prediction. If interaction partner prediction is the central task, a slower but more accurate iterative scheme can be used. The validity of the approach is demonstrated in the cases of bacterial two-component signal transduction and the protein−protein interaction network between the proteins of the Tryptophan biosynthesis pathway (Trp pathway). Our findings open the field to broad applications to protein interactions beyond single-copy or colocalized protein-coding genes, and help to bridge the multiple scales of interprotein coevolution.
Results
An Efficient Approach to Paralog Matching in Interacting Protein Families.
Paralog matching by maximizing the interfamily covariation.
In this paper, we show that DCA may help to solve the aforementioned paralog problem by simultaneously matching paralogs and determining interprotein coevolutionary scores. We argue that the best matching is actually the one maximizing the interprotein covariation; empirical evidence for the correctness of this idea will be provided later in this section.
We use the injective matching strategy illustrated in Fig. 1A (see SI Appendix for mathematical details); it starts from the individual MSAs of the two protein families. Only sequences belonging to the same species are matched. For each single species, all proteins from the family of lower paralog number are matched to pairwise different proteins in the other family. Matched sequence pairs are concatenated. In this article, we consider neither the sparse case, where only part of the sequences are matched, nor cases of promiscuous interaction, where one protein should be matched to several others.
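The injective matching within a single species can be made concrete with a short sketch. The Python snippet below is purely illustrative: the function name `injective_matchings` and the toy protein labels are assumptions and not part of the published pipeline. It enumerates every way of pairing all proteins of the family with fewer paralogs with pairwise distinct proteins of the other family.

```python
from itertools import permutations
from math import perm


def injective_matchings(family_a, family_b):
    """Yield all injective matchings between two paralog lists of one species.

    Every protein of the shorter list is paired with a distinct protein of
    the longer list; each matching is a list of (protein_a, protein_b) pairs.
    """
    if len(family_a) <= len(family_b):
        for chosen in permutations(family_b, len(family_a)):
            yield list(zip(family_a, chosen))
    else:
        for chosen in permutations(family_a, len(family_b)):
            yield list(zip(chosen, family_b))


# Toy species with 2 paralogs in one family and 3 in the other.
toy = list(injective_matchings(["a1", "a2"], ["b1", "b2", "b3"]))
print(len(toy), perm(3, 2))  # both numbers count the same set of matchings
```

For a species with n paralogs in the smaller family and m in the larger one, the number of such matchings is m!/(m − n)!, which is what `math.perm(m, n)` returns.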
Fig. 1.
Paralog matching procedures. (A) The considered injective strategy to match paralogs. For each species (depicted by different colors), each paralog from the family with the lower paralog number is matched to a distinct sequence in the other family (injection). (B) The pipeline of the PPM algorithm. Species are sorted by increasing matching entropy (a measure of the computational complexity of the matching). Starting from a seed matching (generated, in our case, by restricting the MSA to all genomes having a single sequence in both families), the algorithm calculates the DCA model, uses it to add and match a new species, and iterates these two steps until all species are matched. (C) The IPM pipeline; 2^k random matchings are generated, and each one is independently refined using hill climbing of the likelihood. After refinement, pairs of matchings are merged using average matching scores. Refinement and merging are iterated until only a single refined matching is left.
For any given matching π, we thus obtain a joint MSA Xπ, which concatenates two subalignments of the original single-family MSAs. Within Gaussian DCA (28), the total log-likelihood of an arbitrary MSA depends only on its regularized covariance matrix (the explicit expression is given in SI Appendix). The amount of interprotein coevolution can be quantified by the interprotein log-likelihood, which results from the difference between two quantities: (i) the log-likelihood of the joint MSA Xπ and (ii) the log-likelihood of the two single-family sub-MSAs, modeled separately. The best matching is the one that maximizes this interfamily log-likelihood over all admissible matchings. Due to the huge number of possible matchings, which is exponential in the number of species and superexponential in the number of paralogs inside each species, the exact solution of this optimization task is, unfortunately, infeasible. Furthermore, we empirically observed this discrete optimization problem to be plagued by many local likelihood maxima, such that local search algorithms easily get stuck.
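For orientation, the structure of this objective can be written out under explicit assumptions. The notation below is illustrative and not taken from the paper: C denotes the regularized covariance matrix of the joint MSA with M matched rows, C_AA and C_BB are its two single-family diagonal blocks, and each Gaussian model is assumed to be fitted by maximum likelihood, so that additive constants cancel in the difference. The exact regularized expression is the one given in SI Appendix.

```latex
\begin{align}
\mathcal{L}(X) &= -\frac{M}{2}\,\log\det C(X) + \mathrm{const},\\
\mathcal{L}_{\mathrm{inter}}(\pi)
  &= \mathcal{L}(X^{\pi}) - \mathcal{L}(X^{\pi}_{A}) - \mathcal{L}(X^{\pi}_{B})
   = -\frac{M}{2}\left[\log\det C - \log\det C_{AA} - \log\det C_{BB}\right],\\
\pi^{*} &= \operatorname*{arg\,max}_{\pi}\ \mathcal{L}_{\mathrm{inter}}(\pi).
\end{align}
```

Under these assumptions, Fischer's inequality (det C ≤ det C_AA · det C_BB for a positive-definite C) implies that the interprotein term is nonnegative and vanishes when the interprotein block of C is zero, which is consistent with reading it as a measure of interfamily covariation.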
We therefore propose two heuristic algorithms to approximate the solution of this optimization task. A fast progressive method is applicable to large-scale data sets (e.g., many pairs of large families). Although having limited accuracy in identifying specifically interacting paralogs, the method is suitable for subsequent interfamily DCA analysis to predict residue−residue contacts between proteins, or to discriminate interacting from noninteracting families. A slow but accurate iterative method is more suitable for smaller-scale problems, where the accurate identification of individual interacting protein pairs is central.
An efficient progressive paralog-matching algorithm.
A first algorithmic strategy to find the matching maximizing the interfamily covariation is inspired by progressive techniques in constructing MSAs (29): Species are matched progressively, starting with the simplest ones (species with low paralog numbers in our case) and iteratively adding more complicated species with higher paralog numbers. Each species is matched only once, on the basis of all already matched species. Our progressive paralog-matching (PPM) algorithm proceeds as follows (technical details are provided in Methods and in SI Appendix; the pipeline is depicted in Fig. 1B):
1. Species are ordered according to the entropy of their possible matchings, i.e., to the expected hardness of the matching task.
2. Species of low entropy are used to generate a seed matching. In our specific case, zero-entropy species, i.e., species with a single paralog, are used.
3. In order of increasing entropy, species are added recursively: (a) Gaussian DCA (GaussDCA) (28) is applied to the already matched MSA. (b) The GaussDCA parameters are used to score each pair of paralogs inside the new species to be added. (c) An optimal matching for the new species is constructed using these scores.
The algorithm terminates when all species are included. The absence of iterative error correction makes this algorithm computationally efficient. However, errors fixed early on may propagate through the whole procedure and disturb later matched species.
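The three steps above can be sketched as a short control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the matching entropy is taken to be the logarithm of the number of injective matchings, the optimal matching is found by brute force (only practical for small paralog numbers), and `fit_model` is a hypothetical callable standing in for GaussDCA that returns a pair-scoring function. Every species is assumed to have at least one sequence in each family.

```python
from itertools import permutations
from math import log, perm


def matching_entropy(n_a, n_b):
    """Log of the number of injective matchings (assumed entropy definition)."""
    return log(perm(max(n_a, n_b), min(n_a, n_b)))


def best_matching(seqs_a, seqs_b, score_pair):
    """Brute-force the injective matching with the largest total pair score."""
    swap = len(seqs_a) > len(seqs_b)
    short, long_ = (seqs_b, seqs_a) if swap else (seqs_a, seqs_b)
    best, best_total = [], float("-inf")
    for chosen in permutations(long_, len(short)):
        # Keep every pair ordered as (family A protein, family B protein).
        pairs = [(c, s) if swap else (s, c) for s, c in zip(short, chosen)]
        total = sum(score_pair(a, b) for a, b in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best


def progressive_matching(species, fit_model):
    """species: dict name -> (seqs_a, seqs_b); fit_model: pairs -> score_pair."""
    def entropy(name):
        seqs_a, seqs_b = species[name]
        return matching_entropy(len(seqs_a), len(seqs_b))

    order = sorted(species, key=entropy)                  # step 1
    matched = [(species[s][0][0], species[s][1][0])       # step 2: seed
               for s in order if entropy(s) == 0.0]
    for s in order:                                       # step 3
        if entropy(s) == 0.0:
            continue                                      # already in the seed
        score_pair = fit_model(matched)                   # (a) refit the model
        matched += best_matching(*species[s], score_pair) # (b) score, (c) match
    return matched


def toy_fit_model(matched_pairs):
    # Stand-in for GaussDCA: ignores the alignment, scores by shared label.
    return lambda a, b: 1.0 if a[1:] == b[1:] else 0.0


toy_species = {
    "sp2": (["a1", "a2"], ["b2", "b1", "b3"]),
    "sp1": (["a0"], ["b0"]),
}
result = progressive_matching(toy_species, toy_fit_model)
print(result)
```

In a realistic setting the brute-force search would be replaced by a polynomial-time assignment solver, and the toy scoring function by scores derived from the fitted coupling parameters.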
An accurate iterative paralog-matching algorithm.
The PPM algorithm matches the proteins belonging to each species only once, based on the previously matched species. Any matching error made at some stage is kept up to the end, possibly causing other matching errors. It would be possible to correct at least part of these errors when considering later included proteins. However, the likelihood landscape has many local maxima, so a simple iterative refinement remains stuck close to the PPM result.
To overcome this limitation, our slow but accurate iterative paralog-matching (IPM) algorithm follows three steps (all extensively described in SI Appendix):
1. Generate K random paralog matchings respecting species. In practical applications, K = 256 was found to be a good compromise between computational time and accuracy.
2. Independently refine all K matchings iteratively by hill climbing (discrete analog of gradient ascent): at step t + 1, improve the matching within each single species, based on the GaussDCA model computed from the matching at the previous step t, until convergence to a local likelihood maximum.
3. Merge pairs of matchings by averaging and refinement: substitute two matchings with a new one obtained from the average GaussDCA model, and refine it by hill climbing as above. This is iterated until a single matching remains. Subsequently, perform a final refinement step: produce K′ = 32 noisy (partially scrambled) versions of the last matching and merge them again as above, thus obtaining a new matching; repeat until the score reaches a plateau.
The idea behind the merging step is simple: The consensus of two imperfect matchings should reinforce the common signal compared with the random noise. This nonlocal change of the matching is found to be able to escape local log-likelihood maxima. Details of the algorithm are given in Methods and in SI Appendix; the pipeline is depicted in Fig. 1C.
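The refine-and-merge schedule can be sketched as a small reduction loop. This is only a skeleton under stated assumptions: `refine` and `merge` are hypothetical callables standing in for hill climbing and for the construction of a matching from the average model, neighbouring list entries are paired arbitrarily, and the final refinement with K′ noisy copies is omitted.

```python
def iterative_matching(initial_matchings, refine, merge):
    """Refine K matchings independently, then merge them pairwise down to one.

    Returns the final matching and the number of merging rounds performed.
    """
    pool = [refine(m) for m in initial_matchings]
    rounds = 0
    while len(pool) > 1:
        merged = [refine(merge(pool[i], pool[i + 1]))
                  for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:              # an unpaired matching moves on unchanged
            merged.append(pool[-1])
        pool = merged
        rounds += 1
    return pool[0], rounds


# Toy stand-ins: a "matching" is a number, refinement does nothing,
# and merging takes the average of the two inputs.
final, rounds = iterative_matching(
    list(range(256)),
    refine=lambda m: m,
    merge=lambda a, b: (a + b) / 2,
)
print(final, rounds)
```

With K = 256 starting points, the pairwise schedule needs eight merging rounds before a single matching remains.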
Simultaneous Identification of Interaction Partners and Interprotein Residue−Residue Contacts in Bacterial Signal Transduction.
To test both algorithmic strategies, we first consider bacterial two-component systems (TCS) (14), which are the most widespread signal transduction systems in bacteria. TCS have played a prominent role in the development of DCA (14). They consist of two interacting proteins, the histidine sensor kinase (SK), as a signal receiver, and the response regulator (RR), which, upon activation, typically acts as a transcription factor and triggers a transcriptional response (30). In particular, we use the dataset of Procaccini et al. (17), which collects 8,998 interacting (so-called cognate) protein pairs from 712 distinct species (Methods). A random matching between SK and RR inside species would make, on average, one correct prediction per species; that is, only a fraction of 712/8,998 = 7.9% of all matched SK/RR pairs would be correct. Earlier approaches to match SK and RR have used Bayesian residue networks (23) or aligned protein similarity networks (31); although they improve substantially over random matchings, their accuracy remains inferior to the algorithms presented here.
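The random baseline quoted above follows from a standard combinatorial fact: a uniformly random one-to-one matching of n items has, on average, exactly one item in its correct place. The following sketch checks this by exhaustive enumeration for small n and reproduces the quoted fraction; it assumes equal numbers of SK and RR per species, and the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def mean_correct_pairs(n):
    """Exact mean number of correct pairs over all n! one-to-one matchings."""
    all_matchings = list(permutations(range(n)))
    hits = sum(sum(1 for i, p in enumerate(sigma) if i == p)
               for sigma in all_matchings)
    return Fraction(hits, len(all_matchings))


for n in range(1, 7):
    print(n, mean_correct_pairs(n))   # the mean equals 1 for every n

# One expected correct pair per species, 712 species, 8,998 pairs in total.
random_precision = Fraction(712, 8998)
print(f"{float(random_precision):.1%}")
```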
We first check the self-consistency of our matching idea: Is our MSA of SK/RR, which are colocalized in joint operons and therefore expected to be truly interacting, stable under the matching procedure? To answer this question, we infer a DCA model using this MSA, and we rematch all species. No changes are observed: The true MSA is actually a fixed point of the proposed algorithmic procedure. As a second step, we run PPM. Two possibilities to assess the quality of the matching are considered. First, we check which fraction of the 8,998 matched pairs actually coincides with cognate pairs (as in the dataset published in ref. 17). Second, we use the matched MSA to predict interprotein residue−residue contacts. Because this second test requires only a single run of DCA on the matched alignment, we replace GaussDCA with the more accurate but slower plmDCA (pseudo-likelihood maximization DCA) (32). This algorithm results in a plmDCA score for each residue pair; the largest interprotein residue−residue scores are used to predict interprotein residue−residue contacts; see SI Appendix for details.
Before running PPM, only 59 out of 8,998 sequence pairs are matched immediately because both SK and RR are unique in the genome. The extension of this seed matching by PPM is shown in Fig. 2A: Although the seed matching alone is insufficient to predict interprotein contacts between SK and RR (only one true contact out of the strongest 15 interprotein predictions), it is sufficient to guide PPM to 84.7% precision: 7,620 out of 8,998 cognate pairs are correctly identified. The red line in Fig. 2A shows the number of correctly matched pairs as a function of all matched pairs during the progression of the algorithm. This mildly sublinear curve signals a moderate decay in accuracy during the matching procedure, which results from a tradeoff between increasingly more accurate DCA models (larger sequence numbers) and increasingly harder matching tasks (species were sorted according to their matching entropy). The final matching is sufficient to provide accurate interprotein contact predictions; all of the 15 highest-scoring residue pairs are true interprotein contacts [distance < 8 Å in Protein Data Bank (PDB) file 3dge (33)]. We observe that, with increasing size of the progressive matching, the contact prediction becomes more and more accurate: For 1,014 matched sequences, 10 out of the first 15 plmDCA predictions are interprotein contacts, and, for 2,000 matched sequences, even 13 out of 15 are contacts; see Fig. 2B for a more quantitative assessment.
Fig. 2.
The progressive matching procedure matches cognate pairs and enables interprotein residue contact prediction. (A) The red line shows the fraction of the 8,998 SK/RR cognate pairs, which are correctly matched by the progressive matching algorithm, as a function of the matched pairs. A perfect matching procedure would follow the dashed diagonal. The SK/RR complex structure is overlaid with the 15 highest-scoring contact predictions at three different steps of the algorithm: for the seed alignment, after having matched 1,014 proteins, and at the end of the matching. Green bonds show correct predictions, and red bonds show incorrect predictions (contact cutoff 8 Å). The upper structure shows the prediction obtained with the full cognate MSA. (B) The positive predictive value (i.e., the fraction of true positives amongst all interprotein contact predictions) is shown, as a function of the number of predictions, for several joint MSA: the true operon-based cognate matching (solid black); the matching of the seed alignment (magenta); and after having matched 1,014 (blue), 2,000 (green), and 8,998 (red) sequences. The perfect predictor is depicted as black dashed. The prediction accuracy grows during the progressive matching and finally reaches almost the accuracy of the cognate matching.
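The positive predictive value used in Fig. 2B can be computed from a ranked list of scored residue pairs. The sketch below is illustrative only: the function name, the input layout, and the toy numbers are assumptions, and the set of true contacts is taken as given (for example, from a distance cutoff applied to a known complex structure).

```python
def ppv_curve(scored_pairs, true_contacts):
    """Return the PPV after the top 1, 2, ... predictions.

    scored_pairs: iterable of ((i, j), score) for interprotein residue pairs.
    true_contacts: set of (i, j) pairs that are in contact in the structure.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    curve, hits = [], 0
    for n, (pair, _score) in enumerate(ranked, start=1):
        hits += pair in true_contacts   # True counts as 1, False as 0
        curve.append(hits / n)
    return curve


# Toy example with four scored pairs, two of which are true contacts.
toy_scores = [((1, 5), 0.9), ((2, 7), 0.8), ((3, 3), 0.4), ((4, 9), 0.7)]
toy_contacts = {(1, 5), (4, 9)}
curve = ppv_curve(toy_scores, toy_contacts)
print(curve)
```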
PPM assigns a protein−protein matching score to each of the pairs in the final matching; compare step 3b of the PPM algorithmic description. Fig. 3A shows that the highest matching scores exclusively indicate truly interacting pairs. All of the first 1,347 pairs are cognate pairs. Although PPM is computationally very efficient, its accuracy in identifying true interaction partners is limited. In the progressive strategy, once a matching error is made, it is not corrected but influences all subsequently matched species. To overcome this limitation, we have applied the computationally more involved IPM algorithm (Fig. 3C); thanks to the nonlocal merging steps, we reach a precision of 91.2% (8,206 true matches). As IPM proceeds to maximize the log-likelihood, the matching error decreases monotonically, up to fluctuations. Furthermore, Fig. 3C, Inset shows that IPM slightly exceeds the log-likelihood of the true operon-based matching, but the error rate is not decreasing any more beyond that point, suggesting that the intrinsic error rate of the association between log-likelihood and matching error is close to 9%.
Fig. 3.
PPM vs. IPM algorithm for the SK/RR system. Shown are the histograms of DCA scores of the final (A) PPM and (B) IPM matching. The fraction of true positive (TP) predictions is colored in green, and the fraction of false positive (FP) predictions in red. Although the high-scoring pairs are exclusively TP, low and intermediate scores show a mixture of TP and FP. The overall histogram is only insignificantly shifted toward higher scores when comparing IPM to PPM, but the overall weight of the FP is visibly decreased. C shows the dependence of the number of matching errors (FP) on the log-likelihood of the IPM matching. Iteration proceeds from the upper left to the lower right corner, showing the last two stages of IPM: first, the progressive merging of locally optimal matchings (blue points), and, then, the final refinement stage (red points). The overall procedure arrives at a log-likelihood that is slightly superior to the one of the true matching (dotted vertical line), at a precision of about 91.2% (8,206 TP out of 8,998 TP+FP). (Inset) An enlargement of the refinement stage; the almost-linear relation between log-likelihood and error clearly breaks down once the log-likelihood of the cognate matching is reached.
Simultaneous Identification of Interacting Families and Specifically Interacting Proteins in a Bacterial Metabolic Pathway.
DCA has been used to identify interacting protein families (21, 25). Based again on the availability of large joint MSA, only pairs of families showing significant interprotein coevolution are expected to interact. Again, we argue that, even without a large known set of (potentially) interacting protein pairs, the PPM strategy simultaneously creates such an alignment, and the interprotein plmDCA scores are informative about interfamily interaction. Following ref. 25, the average of the four highest interprotein residue−residue plmDCA scores is used (SI Appendix).
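The family-level score described above is simple to state in code. The following sketch assumes that the interprotein residue−residue scores are already available as a flat list; the function name and the toy values are illustrative.

```python
import heapq


def interfamily_score(interprotein_scores, top=4):
    """Average of the `top` largest interprotein residue-residue scores."""
    best = heapq.nlargest(top, interprotein_scores)
    if not best:
        raise ValueError("no interprotein scores given")
    return sum(best) / len(best)


# Toy list of interprotein scores for one pair of families.
toy_scores = [0.05, 0.40, 0.10, 0.30, 0.20, 0.50]
print(interfamily_score(toy_scores))
```

Averaging a few top scores rather than taking the single maximum makes the family-level score less sensitive to one isolated high-scoring residue pair.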
As a test system, we choose the tryptophan biosynthesis pathway comprising seven different proteins, TrpA through TrpG, which catalyze subsequent reactions in the pathway. Among the 21 protein family pairs, only two are known to interact based on experimentally resolved cocrystal structures: TrpA−TrpB [PDB 1k7f (34)] and TrpE−TrpG [PDB 1qdl (35)]. Although individual Pfam MSA sizes reach from 8,713 sequences for TrpF to 78,265 for TrpG, pairing by uniqueness in the genome only in three cases leads to joint MSAs beyond 1,000 sequences (TrpC−TrpF, 1,578; TrpA−TrpC, 1,546; TrpA−TrpF, 1,433). The actually interacting pairs have extremely small joint MSAs of 15 sequences for TrpE−TrpG and 95 sequences for TrpA−TrpB. No detection of interactions is possible with such small alignments (Fig. 4). In ref. 25, we have shown that matching by genomic colocalization leads to joint MSA sizes of 2,519 to 8,053 sequences, with a majority below 4,000 sequences. These alignments separate the two known interacting pairs (interprotein plmDCA scores 0.3, 0.38) from an almost continuous background of scores not exceeding 0.17.
Fig. 4.
Detection of protein−protein interactions between enzymes of the tryptophan biosynthesis pathway. (A) The known PPI between the seven enzymes in the Trp pathway; only TrpA−TrpB and TrpE−TrpG are known to interact. (B−D) The results (B) for the seed matching (subalignment made of genomes having a single sequence in both families), (C) for matchings of 1,000 sequences per protein pair, and (D) for the full matching. Line width is proportional to the interprotein coevolution score; the first two predictions are colored (TP, green; FP, red). For the seed alignment, none of the true PPIs is recognized, whereas for 1,000 sequences, one out of the two PPIs is recognized. The second true PPI has the third score, but there is no gap between true and false PPI. For the full matching, the known PPI are found to be the two highest-scoring pairs, with scores detached from an almost continuous distribution of the remaining 19 scores. E and F show the PDB structures of the complexes (E) TrpA−TrpB and (F) TrpE−TrpG, together with the 15 highest DCA-scoring interprotein pairs, colored in green for TP interprotein contact predictions (12 for TrpA−TrpB, 11 for TrpE−TrpG) and in red for FP predictions (3 for TrpA−TrpB, 4 for TrpE−TrpG). The contact prediction is based on the fully matched PPM alignments.
To test our paralog matching, we apply PPM to each of the 21 Trp protein pairs (Fig. 4). The seed matchings, generated by uniqueness in the genome, range from 15 to 1,578 protein pairs. They do not allow for recovering the correct interacting family pairs (ranks 5 and 21 out of 21). After having matched 1,000 protein pairs in each family, and computed the interprotein plmDCA score, the three highest-scoring family pairs are TrpA−TrpB (score 0.23), TrpF−TrpG (score 0.18), and TrpE−TrpG (score 0.17), followed by almost continuous scores below 0.15. The correct interactions thus have ranks 1 and 3, but no gap exists between the scores of interacting pairs and the scores of noninteracting pairs.
Using the full progressive matchings, TrpA−TrpB (TrpE−TrpG) have an interprotein plmDCA score of 0.34 (0.25) followed by almost continuous scores below 0.15. The two correct interactions are recognized with a gap, which is almost as large as in the matching obtained using genomic colocalization, illustrating again the strong capacity of our method to recover accurately the matching between interacting proteins. Again, we stress that our method, at variance with the one presented in ref. 25, does not need any information about the genomic location but only about the protein sequences in the MSA. Thus, it is of more general applicability.
Results obtained at the level of interaction networks can be corroborated by interprotein contact predictions obtained for the two interacting pairs (Fig. 4 E and F): For TrpA−TrpB, 9 out of the first 10 (and 12 out of the first 15) interprotein contact predictions are true positives. The situation is very similar for TrpE−TrpG: 10 out of the first 10 and 11 out of the first 15 predicted pairs are in contact across the interface.
To assess the robustness of our results, we included more (putative) negative controls to perform a larger-scale analysis extending beyond the Trp system. We considered a larger dataset of 40 protein families, which we tested exhaustively against the four proteins involved in interactions, i.e., TrpA, TrpB, TrpE, and TrpG (each of TrpA, TrpB, TrpE, and TrpG is tested against all other 39 families); see SI Appendix for the selection of these proteins and detailed results. Despite the increased number of possible protein family pairs, the scoring gap of the known interacting pairs vs. all other pairs discriminates interacting from noninteracting pairs; only one pair (Trp2/P0ABY7, score 0.226) shows an interprotein plmDCA score close to TrpA/TrpB (0.337) and TrpE/TrpG (0.245). The alignment of this pair has, however, an insufficient sequence number for reliable coevolutionary inference. A large interprotein plmDCA score based on a sufficiently deep MSA seems to provide a promising predictor of conserved protein−protein interaction.
Discussion
Global methods to detect coevolution, like DCA, PSICOV (Precise Structural Contact Prediction Using Sparse Inverse Covariance Estimation), and GREMLIN, have recently enjoyed growing popularity in a very specific setting: Starting from a large multiple sequence alignment of homologous proteins, these approaches have helped to extract residue−residue contacts from residue−residue amino acid covariation. In the context of interacting proteins, the inferred interprotein contacts have, in turn, helped to structurally assemble protein complexes. However, the applicability of these methods has remained limited due to the a priori need to obtain joint multiple sequence alignments of pairs of interacting proteins, with each row containing a pair of interacting proteins out of two protein families. This MSA has to be obtained by external information like the uniqueness of the two protein families inside a species (no paralogs present) or the genomic colocalization in bacterial operons.
In this work, we show that one can turn the argument around: The coevolution between two protein families itself can be used to identify interacting partner proteins, and thereby to generate the joint MSA while simultaneously obtaining an interprotein contact prediction. We have shown that an accurate matching between protein families can be obtained, which (i) connects only proteins in the same species and (ii) maximizes the detectable interfamily coevolutionary signal. The idea is that basically any mismatch connecting two noninteracting proteins decreases the interfamily covariation. In Fig. 3, we have actually observed that there is an almost monotonically decreasing relation between the log-likelihood of a matching (which is a measure of the total interfamily coevolutionary signal) and the error rate in the matching, compared with a gold-standard dataset of colocalized bacterial proteins from two-component signal transduction pathways. However, once the log-likelihood of the gold-standard matching was obtained (or even slightly exceeded), the residual matching error of about 9% did not decrease any more; this may be a sign of an intrinsic limitation of the idea connecting likelihood and matching accuracy, but it may also be a biological signal. About 60% of the mismatches were pairwise switches (transpositions) between two TCS, and 18% concern triples. It has been speculated, before, that 15 to 20% of all bacterial signaling systems display some tendency to crosstalk; that is, interactions are not really one-to-one. Part of the “mismatched” proteins could actually be read as predictions for inter-TCS crosstalk. However, in model species E. coli and Bacillus subtilis, where cases of crosstalk have been reported (36, 37), no matching errors were found.
The intuitive idea of maximizing the interfamily coevolutionary signal leads to a computationally extremely hard problem: The search space (i.e., all possible joint MSA) is exponentially large in the number of species and superexponential in the number of paralogs inside each species. The problem would become much simpler to solve if the global log-likelihood score could be replaced by a local correlation measure maximizing, e.g., the Frobenius norm of the interprotein covariance matrix. This is implemented as a first fast stage of the IPM algorithm, but, in the case of TCS, it gets stuck at a high error rate of almost 40% of mismatches. Global modeling is necessary to reach high accuracy in paralog matching. We have also seen that the accuracy drops only slightly (error rate ∼15%) when the slow iterative procedure is replaced by a fast PPM. The resulting joint alignments are sufficiently precise to enable accurate interprotein contact prediction, and to discriminate between interacting and noninteracting protein families.
The two strategies, progressive and IPM, both open the road to large-scale analysis for predicting currently unknown protein−protein interactions. Coevolution-based procedures to analyze PPI have extensively used colocalization (22–25). A natural question is, what fraction of the known bacterial interactome comes from colocalized genes? Given our partial knowledge of the interactome at present, we still cannot provide a precise answer to this question. However, we can give a partial estimate based on current knowledge in E. coli, i.e., in the currently best-studied model species. E. coli’s proteome consists of 4,323 nonredundant proteins organized in 2,148 operons (817 of which host at least two genes). This results in 4,885 potential PPIs within the same operon, in comparison with more than 9 million protein pairs in total. IntAct (27), one of the most comprehensive databases for the PPI network, reports 3,643 PPI for E. coli, of which more than one-third (1,341 pairs) are intraoperon PPI. A domain-based database of structurally known PPI, iPfam (38), reports 4,100 interacting family pairs (∼2,000 of which are homodimers). The breakdown of the 4,885 possible intraoperon interactions in terms of distinct protein domains gives 8,068 distinct intraoperon domain pairs. Of the 2,100 heterodimeric domain pairs in iPfam, only 640 are present in E. coli, 214 of which are in the same operon. Again, about one-third of the known interactions originate from the same operon. Our methodology provides an efficient and scalable algorithmic strategy to analyze the remaining two-thirds of the known interactome, for which criteria such as genomic proximity cannot be used.
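The counts quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that "protein pairs in total" means unordered pairs of distinct proteins; all numbers are taken from the text, and the variable names are ours.

```python
from math import comb

n_proteins = 4323
total_pairs = comb(n_proteins, 2)      # unordered pairs of distinct proteins
print(total_pairs)                     # 9342003, i.e. more than 9 million

# Fraction of known interactions that are intraoperon
intact_total, intact_intraoperon = 3643, 1341        # IntAct, E. coli
ipfam_in_ecoli, ipfam_intraoperon = 640, 214         # iPfam heterodimeric pairs

frac_intact = intact_intraoperon / intact_total
frac_ipfam = ipfam_intraoperon / ipfam_in_ecoli
print(round(frac_intact, 3), round(frac_ipfam, 3))   # 0.368 0.334

# Share of all protein pairs that lie within a single operon
print(4885 / total_pairs)
```

Both databases therefore give an intraoperon share close to one-third, while intraoperon pairs make up only a very small fraction of all possible pairs.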
Methods
Gaussian Direct Coupling Analysis.
The basis of the paralog matching procedure is the GaussDCA formulated in ref. 28. Let us assume a matched MSA $X^{\pi}$ of $M$ sequences of length $L$. The MSA is transformed into an $M \times 20L$-dimensional binary array by replacing each amino acid with a distinct 20-dimensional vector containing one entry “1” and 19 entries “0”; gaps are represented by zero vectors. The empirical covariance matrix of the transformed MSA is the $20L \times 20L$-dimensional square matrix $C$ (the explicit dependence on the matching $\pi$ leading to the MSA is suppressed here), and the empirical mean is the $20L$-dimensional vector $\bar{x}$; see SI Appendix for the precise definition of these quantities using standard DCA sequence weighting and pseudocounts. Given these empirical matrices, the GaussDCA model assigns a probability
$$P(x) = \frac{1}{(2\pi)^{10L}\sqrt{\det C}} \exp\left\{-\frac{1}{2}\,(x-\bar{x})^{T} C^{-1} (x-\bar{x})\right\}$$
to any amino acid sequence $x$ of length $L$ in binary representation. From this expression, the log-likelihood of the original MSA can be easily determined as $\mathcal{L} = \sum_{m=1}^{M} \log P(x^{m})$ (SI Appendix). Our matching strategy aims at maximizing this likelihood by selecting the matching $\pi$ leading to the joint MSA $X^{\pi}$. In the initial phase of IPM, we will also use the squared Frobenius norm of the covariance matrix $C$, i.e., $\|C\|_F^2 = \sum_{i,j} C_{ij}^2$, as a faster to compute objective function (SI Appendix).
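The two objective functions can be illustrated with a small NumPy sketch. It is not the GaussDCA implementation of ref. 28: sequence reweighting and pseudocounts are replaced here by a plain ridge term (an assumption made only to keep the covariance invertible), the Frobenius objective is restricted to the interprotein block, and the function names and toy alignments are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"           # 20 letters; '-' is the gap
AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def one_hot(msa):
    """Encode M aligned sequences of length L as an M x 20L binary array."""
    M, L = len(msa), len(msa[0])
    X = np.zeros((M, 20 * L))
    for m, seq in enumerate(msa):
        for i, a in enumerate(seq):
            if a in AA_INDEX:                   # gaps stay as zero vectors
                X[m, 20 * i + AA_INDEX[a]] = 1.0
    return X

def gaussian_loglik(X, ridge=0.1):
    """Sum over sequences of log N(x | mean, C), with C = cov + ridge * I.

    The ridge term is a stand-in (assumption) for the reweighting and
    pseudocounts of the actual method.
    """
    M, D = X.shape
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / M + ridge * np.eye(D)
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum("md,de,me->", Xc, np.linalg.inv(C), Xc)
    return -0.5 * (M * D * np.log(2 * np.pi) + M * logdet + quad)

def interprotein_frobenius2(X, L_a):
    """Squared Frobenius norm of the covariance block coupling protein A
    (first L_a alignment columns) to protein B (remaining columns)."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]
    return float(np.sum(C[: 20 * L_a, 20 * L_a:] ** 2))

# Toy joint alignments: one column for protein A, one for protein B.
matched = ["AC", "AC", "DE", "DE", "FG", "FG"]
mismatched = ["AE", "AC", "DG", "DE", "FC", "FG"]   # B column permuted
for name, msa in [("matched", matched), ("mismatched", mismatched)]:
    X = one_hot(msa)
    print(name, gaussian_loglik(X), interprotein_frobenius2(X, L_a=1))
```

In this toy case the correctly matched alignment scores higher on both objectives, which is the behavior the matching procedure exploits.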
Paralog Matching.
The two matching strategies are described in Results, and extensive details are provided in SI Appendix. As the optimal assignment problem can be easily formulated in terms of linear programming, we used the Gurobi library (39) to efficiently solve it.
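As an illustration of the assignment step within a single species, the sketch below finds the one-to-one pairing that maximizes a total pairing score. It uses brute-force enumeration instead of the linear-programming formulation solved with Gurobi in this work, so it is only viable for a handful of paralogs; the score matrix and the function name `best_assignment` are hypothetical, and equal numbers of paralogs in both families are assumed.

```python
from itertools import permutations

def best_assignment(score):
    """Return (best_total, matching) maximizing sum_i score[i][matching[i]].

    Brute force over all permutations of a square score matrix; a stand-in
    for the linear-programming solution used in practice.
    """
    n = len(score)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# Hypothetical pairing scores for 3 paralogs of family A (rows) against
# 3 paralogs of family B (columns) in one species.
score = [
    [2.0, 0.5, 0.1],
    [0.4, 0.3, 1.8],
    [0.2, 1.5, 0.6],
]
total, matching = best_assignment(score)
print(matching)        # (0, 2, 1): A0-B0, A1-B2, A2-B1
```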
Data Extraction.
TCS.
The data for the SK/RR analysis were originally published in ref. 17; here we give a short description: 769 bacterial genomes were scanned using hmmer (Hidden Markov Model biosequence analysis) (40) with the Pfam 22.0 Hidden Markov Models (41) for the following SK domains: “HisKA” (PF00512), “HWE_HK” (PF07536), “HisKA_2” (PF07568), “HisKA_3” (PF07730), “His_kinase” (PF06580), and “Hpt” (PF01627), and, for the RR domain, “Response_reg” (PF00072). Using a simple operational definition of an operon as a sequence of consecutive genes of the same coding sense, with intergenic distances not exceeding 200 base pairs, a total of M = 8,998 SK/RR pairs were identified in operons containing a single SK (of type HisKA) and a single RR domain. As reference structure, we consider the PDB entry 3dge (33).
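The operational operon definition used above can be sketched as follows. The `Gene` record, the function name `group_operons`, and the example coordinates are hypothetical; the intergenic distance is taken here as the number of bases between the end of one gene and the start of the next, which is an assumption, and genome circularity is ignored.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int      # first base (1-based, inclusive)
    end: int        # last base (inclusive)
    strand: str     # "+" or "-"

def group_operons(genes, max_gap=200):
    """Group genes into runs of consecutive same-strand neighbours whose
    intergenic distance does not exceed max_gap base pairs."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1
            if gene.strand == prev.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons

# Hypothetical genome fragment
genes = [
    Gene("a", 1, 900, "+"),
    Gene("b", 950, 1700, "+"),      # 49 bp after a, same strand: same operon
    Gene("c", 2500, 3000, "+"),     # 799 bp after b: new operon
    Gene("d", 3100, 3500, "-"),     # close to c but opposite strand
]
operon_names = [[g.name for g in op] for op in group_operons(genes)]
print(operon_names)                 # [['a', 'b'], ['c'], ['d']]
```

Selecting the operons that contain exactly one SK and one RR domain would then be a filter on the resulting groups.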
Trp operon.
The tryptophan biosynthetic pathway consists of seven enzymes (TrpA, TrpB, TrpC, TrpD, TrpE, TrpF, and TrpG). Only two protein−protein interactions are known and resolved structurally: TrpA−TrpB [PDB 1k7f (42)] and TrpG−TrpE [PDB 1qdl (43)]. Single-protein MSA have been extracted using the pipeline proposed in ref. 25: (i) Extract sequences corresponding to names from Uniprot (Universal Protein Resource); (ii) run MAFFT (multiple-alignment program for amino acid or nucleotide sequences) (44) using mafft --anysymbol --auto; (iii) create a profile Hidden Markov Model using hmmbuild from the hmmer suite, and search Uniprot using hmmsearch (45); and (iv) remove inserts.
In addition to the seven Trp enzymes, we also created, with the same procedure, an enlarged dataset of 33 negative controls. Details are provided in SI Appendix.
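Steps (ii) to (iv) of the extraction pipeline can be sketched as below. The command lines are only assembled, not executed, and all file names are placeholders. The insert-removal function assumes the common HMMER alignment convention in which insert columns are written as lowercase residues and "." characters; whether the original pipeline removes inserts in exactly this way is not stated here.

```python
def remove_inserts(aligned_seq):
    """Keep match columns only: drop lowercase residues and '.' characters.

    Assumes uppercase letters and '-' mark match columns, while lowercase
    letters and '.' mark insert columns.
    """
    return "".join(c for c in aligned_seq if not (c.islower() or c == "."))

# Step (ii): align the seed sequences (MAFFT writes the alignment to stdout).
mafft_cmd = ["mafft", "--anysymbol", "--auto", "seed.fasta"]
# Step (iii): build a profile HMM, then search the sequence database and
# save the alignment of all hits with -A.
hmmbuild_cmd = ["hmmbuild", "family.hmm", "seed.aln"]
hmmsearch_cmd = ["hmmsearch", "-A", "hits.sto", "family.hmm", "uniprot.fasta"]

# Step (iv): strip insert columns from each aligned hit.
print(remove_inserts("MKv..LA-tgG"))    # MKLA-G
```

After this step every sequence of a family has the same length, which is what the covariance computation requires.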
Note.
While finalizing this manuscript, we learned that A.-F. Bitbol, R. S. Dwyer, L. J. Colwell, and N. S. Wingreen have prepared a related paper on predicting interacting paralog pairs (46).
Acknowledgments
We thank Christoph Feinauer, Guido Uguzzoni, and Hendrik Szurmant for helpful discussions. M.W. was partly funded by the Agence Nationale de la Recherche Project COEVSTAT (ANR-13-BS04-0012-01). C.B. was partly funded by the European Research Council (Grant 267915).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: Julia package for paralog matching is available at https://github.com/Mirmu/ParalogMatching.jl.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1607570113/-/DCSupplemental.
References
- 1. Shoemaker BA, Panchenko AR. Deciphering protein−protein interactions. Part I. Experimental techniques and databases. PLOS Comput Biol. 2007;3(3):e42. doi: 10.1371/journal.pcbi.0030042.
- 2. Rao VS, Srinivas K, Sujini GN, Kumar GN. Protein-protein interaction detection: Methods and analysis. Int J Proteomics. 2014;2014:147648. doi: 10.1155/2014/147648.
- 3. Shoemaker BA, Panchenko AR. Deciphering protein−protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLOS Comput Biol. 2007;3(4):e43. doi: 10.1371/journal.pcbi.0030043.
- 4. Keskin O, Tuncbag N, Gursoy A. Predicting protein−protein interactions from the molecular to the proteome level. Chem Rev. 2016;116(8):4884–4909. doi: 10.1021/acs.chemrev.5b00683.
- 5. Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem Sci. 1998;23(9):324–328. doi: 10.1016/s0968-0004(98)01274-2.
- 6. Galperin MY, Koonin EV. Who’s your neighbor? New computational approaches for functional genomics. Nat Biotechnol. 2000;18(6):609–613. doi: 10.1038/76443.
- 7. Marcotte CJV, Marcotte EM. Predicting functional linkages from gene fusions with confidence. Appl Bioinformatics. 2002;1(2):93–100.
- 8. Marcotte EM, et al. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999;285(5428):751–753. doi: 10.1126/science.285.5428.751.
- 9. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999;96(8):4285–4288. doi: 10.1073/pnas.96.8.4285.
- 10. Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein−protein interaction. Protein Eng. 2001;14(9):609–614. doi: 10.1093/protein/14.9.609.
- 11. Juan D, Pazos F, Valencia A. High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc Natl Acad Sci USA. 2008;105(3):934–939. doi: 10.1073/pnas.0709671105.
- 12. Reddy TB, et al. The Genomes OnLine Database (GOLD) v.5: A metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 2015;43(Database issue):D1099–D1106. doi: 10.1093/nar/gku950.
- 13. de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat Rev Genet. 2013;14(4):249–261. doi: 10.1038/nrg3414.
- 14. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein−protein interaction by message passing. Proc Natl Acad Sci USA. 2009;106(1):67–72. doi: 10.1073/pnas.0805923106.
- 15. Schug A, Weigt M, Onuchic JN, Hwa T, Szurmant H. High-resolution protein complexes from integrating genomic information with molecular simulation. Proc Natl Acad Sci USA. 2009;106(52):22124–22129. doi: 10.1073/pnas.0912100106.
- 16. Dago AE, et al. Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc Natl Acad Sci USA. 2012;109(26):E1733–E1742. doi: 10.1073/pnas.1201301109.
- 17. Procaccini A, Lunt B, Szurmant H, Hwa T, Weigt M. Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: Orphans and crosstalks. PLoS One. 2011;6(5):e19729. doi: 10.1371/journal.pone.0019729.
- 18. Cheng RR, Morcos F, Levine H, Onuchic JN. Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci USA. 2014;111(5):E563–E571. doi: 10.1073/pnas.1323734111.
- 19. Jones DT, Buchan DW, Cozzetto D, Pontil M. PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012;28(2):184–190. doi: 10.1093/bioinformatics/btr638.
- 20. Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue−residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA. 2013;110(39):15674–15679. doi: 10.1073/pnas.1314045110.
- 21. Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue−residue interactions across protein interfaces using evolutionary information. eLife. 2014;3:e02030. doi: 10.7554/eLife.02030.
- 22. Hopf TA, et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife. 2014;3:e03430. doi: 10.7554/eLife.03430.
- 23. Burger L, van Nimwegen E. Accurate prediction of protein−protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol. 2008;4(1):165. doi: 10.1038/msb4100203.
- 24. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein−protein interaction by message passing. Proc Natl Acad Sci USA. 2009;106(1):67–72. doi: 10.1073/pnas.0805923106.
- 25. Feinauer C, Szurmant H, Weigt M, Pagnani A. Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the Trp operon. PLoS One. 2016;11(2):e0149166. doi: 10.1371/journal.pone.0149166.
- 26. Finn RD. 2012. Pfam: The Protein Families Database. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics (Wiley, New York), Vol 3.
- 27. Orchard S, et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(Database issue):D358–D363. doi: 10.1093/nar/gkt1115.
- 28. Baldassi C, et al. Fast and accurate multivariate Gaussian modeling of protein families: Predicting residue contacts and protein-interaction partners. PLoS One. 2014;9(3):e92721. doi: 10.1371/journal.pone.0092721.
- 29. Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987;25(4):351–360. doi: 10.1007/BF02603120.
- 30. Stock AM, Robinson VL, Goudreau PN. Two-component signal transduction. Annu Rev Biochem. 2000;69(1):183–215. doi: 10.1146/annurev.biochem.69.1.183.
- 31. Bradde S, et al. Aligning graphs and finding substructures by a cavity approach. Europhys Lett. 2010;89(3):37009.
- 32. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlin Soft Matter Phys. 2013;87(1):012707. doi: 10.1103/PhysRevE.87.012707.
- 33. Casino P, Rubio V, Marina A. Structural insight into partner specificity and phosphoryl transfer in two-component signal transduction. Cell. 2009;139(2):325–336. doi: 10.1016/j.cell.2009.08.032.
- 34. Weyand M, Schlichting I, Marabotti A, Mozzarelli A. Crystal structures of a new class of allosteric effectors complexed to tryptophan synthase. J Biol Chem. 2002;277(12):10647–10652. doi: 10.1074/jbc.M111285200.
- 35. Knöchel T, et al. The crystal structure of anthranilate synthase from Sulfolobus solfataricus: Functional implications. Proc Natl Acad Sci USA. 1999;96(17):9479–9484. doi: 10.1073/pnas.96.17.9479.
- 36. Howell A, Dubrac S, Noone D, Varughese KI, Devine K. Interactions between the YycFG and PhoPR two-component systems in Bacillus subtilis: The PhoR kinase phosphorylates the non-cognate YycF response regulator upon phosphate limitation. Mol Microbiol. 2006;59(4):1199–1215. doi: 10.1111/j.1365-2958.2005.05017.x.
- 37. Rietkötter E, Hoyer D, Mascher T. Bacitracin sensing in Bacillus subtilis. Mol Microbiol. 2008;68(3):768–785. doi: 10.1111/j.1365-2958.2008.06194.x.
- 38. Finn RD, Miller BL, Clements J, Bateman A. iPfam: A database of protein family and domain interactions found in the Protein Data Bank. Nucleic Acids Res. 2014;42(D1):D364–D373. doi: 10.1093/nar/gkt1210.
- 39. Gurobi Optimization, Inc. 2015. Gurobi Optimizer Reference Manual (Gurobi Optimization, Houston).
- 40. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–763. doi: 10.1093/bioinformatics/14.9.755.
- 41. Finn RD, et al. The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res. 2016;44(D1):D279–D285. doi: 10.1093/nar/gkv1344.
- 42. Weyand M, Schlichting I, Marabotti A, Mozzarelli A. Crystal structures of a new class of allosteric effectors complexed to tryptophan synthase. J Biol Chem. 2002;277(12):10647–10652. doi: 10.1074/jbc.M111285200.
- 43. Knöchel T, et al. The crystal structure of anthranilate synthase from Sulfolobus solfataricus: Functional implications. Proc Natl Acad Sci USA. 1999;96(17):9479–9484. doi: 10.1073/pnas.96.17.9479.
- 44. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–780. doi: 10.1093/molbev/mst010.
- 45. Finn RD, et al. HMMER web server: 2015 update. Nucleic Acids Res. 2015;43(W1):W30–W38. doi: 10.1093/nar/gkv397.
- 46. Bitbol A-F, Dwyer RS, Colwell LJ, Wingreen NS. Inferring interaction partners from protein sequences. Proc Natl Acad Sci USA. 2016;113:12180–12185. doi: 10.1073/pnas.1606762113.