Abstract
The recent availability of protein–protein interaction networks for several species makes it possible to study protein complexes in an evolutionary context. In this article, we present a novel network-based framework for reconstructing the evolutionary history of protein complexes. Our analysis is based on generalizing evolutionary measures for single proteins to the level of whole subnetworks, comprehensively considering a broad set of computationally derived complexes and accounting for both sequence and interaction changes. Specifically, we compute sets of orthologous complexes across species, and use these to derive evolutionary rate and age measures for protein complexes. We observe significant correlations between the evolutionary properties of a complex and those of its member proteins, suggesting that protein complexes form early in evolution and evolve as coherent units. Additionally, our approach enables us to directly quantify the extent to which gene duplication has played a role in the evolution of complexes. We find that about one quarter of the sets of orthologous complexes have originated from evolutionary cores of homodimers that underwent duplication and divergence, testifying to the important role of gene duplication in protein complex evolution.
INTRODUCTION
Recent technological advances, such as yeast two-hybrid screens (1) and co-immunoprecipitation (coIP) assays (2), enable the systematic characterization of protein–protein interaction (PPI) networks across multiple species. Large-scale PPI networks are currently available for human and most model species (3–5).
To date, evolutionary analysis of protein network data has been mostly limited to comparison of single interactions (6), or whole networks (7). In the context of the latter, methods were developed to identify protein complexes that are conserved across species (8,9). Other approaches for studying the evolution of protein pathways or complexes have been mostly based on sequence similarity only (10). Functionally linked proteins were shown to have a tendency to evolve together (11–13); conversely, proteins with similar phylogenetic profiles were shown to have higher chances of participating in the same biochemical pathways (14). Another study (15) showed that phylogenetic profiles of proteins in the same functional module tend to be significantly coherent, with variations in the level of coherence between different types of modules.
The evolution of modularity in PPI networks was studied by Pereira-Leal and coworkers (16,17), who proposed that the duplication of self-interacting proteins plays a key role in the formation of a modular network structure. Furthermore, they suggested that duplication of whole complexes is also a contributing factor for modularity, observing that a significant fraction of the complexes in Saccharomyces cerevisiae bear strong similarity to each other. An additional recent work (10) studied evolutionarily cohesive modules in PPI networks, i.e. modules whose components have a uniform pattern of loss and gain throughout evolution. It was shown that younger cohesive modules play different roles than older ones and are more likely to be horizontally transferred. In addition, the cohesiveness of a module was shown to correlate with its size and inter-connectivity, and inversely correlate with the rate of duplication among the member proteins.
In this study, we present a novel computational framework for reconstructing the evolutionary history of protein complexes from a network perspective. Our method is based on generalizing established evolutionary measures for single proteins (18,19) to the level of protein subnetworks. Specifically, we define statistical measures for the level of homology between pairs of complexes, and use these measures to search for sets of orthologous complexes across species. The settings of our analysis differ from previous studies in three key points: (i) In contrast to previous studies (15–17) that restricted their analysis to known complexes and metabolic pathways, we consider a comprehensive set of computationally derived putative protein complexes in all of the studied species. (ii) We identify conserved protein complexes by taking into account both sequence and interaction patterns rather than testing conservation based on sequence only [as in (10)] or interaction only [as in (15)]. (iii) We consider all patterns of conservation rather than restricting the analysis to complexes that are conserved in all species [as in (8)].
We use the sets of orthologous complexes to infer evolutionary rate and age estimates for the member complexes. These estimates are validated in several ways and employed to investigate mechanistic aspects of protein complex evolution. First, we find a high level of agreement between the evolutionary rates of proteins and those of the complexes they form, supporting the view that protein complexes tend to undergo evolution as coherent units. Second, we study the role of duplication of self-interacting proteins in the evolution of protein complexes, showing that about one quarter of the sets of orthologous complexes are likely to have originated from conserved cores of homodimers that underwent duplication and divergence.
MATERIALS AND METHODS
PPI network construction
Our analysis includes seven species: Homo sapiens, Drosophila melanogaster, Plasmodium falciparum, Caenorhabditis elegans, budding yeast S. cerevisiae, Escherichia coli and Helicobacter pylori. For each species we obtained up-to-date PPIs and protein sequence data, gathered from recently published papers (20–29) and from public databases (3,30–36). High-throughput mass spectrometry data (22,27,28) was translated into binary PPIs using the spoke model (37). To deal with false positive errors (falsely reported interactions), we adapted a method by Bader et al. (38) and assigned confidence values to the interactions based on their supporting experimental evidence (Supplementary Data).
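As a minimal sketch of the spoke-model conversion, assuming each purification is recorded as a bait mapped to the list of preys it pulled down (the input layout and the name `spoke_model` are illustrative and not taken from the original pipeline):

```python
def spoke_model(purifications):
    """Expand bait -> preys purifications into undirected binary interactions.

    Under the spoke model, each bait is paired with every prey it pulled down;
    prey-prey pairs are NOT added (that would be the matrix model).
    """
    edges = set()
    for bait, preys in purifications.items():
        for prey in preys:
            if prey != bait:  # assumption: a bait recovering itself is ignored here
                edges.add(frozenset((bait, prey)))
    return edges


# Toy input: two purifications with hypothetical protein names.
purifications = {"A": ["B", "C", "D"], "B": ["A", "E"]}
edges = spoke_model(purifications)
```

The A-B pair is reported by both purifications but stored once, and no C-D edge is created because both proteins are preys of the same bait.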
Protein cluster detection
We identify highly connected clusters within the PPI networks using two algorithms: (i) NetworkBLAST (8)—which performs a greedy search for dense subnetworks; and (ii) Markov clustering (MCL) (39)—which uses simulated random walks within the network to detect distinct clusters. The MCL method was recently shown to outperform other clustering techniques (40).
Protein clusters were obtained for each species separately by merging the outputs of the two algorithms, while maintaining an upper bound of 80% on the permitted overlap between clusters. The merging procedure as well as benchmarks of the two algorithms using the MIPS database are detailed in the Supplementary Data. The numbers of obtained clusters are depicted in Supplementary Figure 1a and range from 162 (P. falciparum) to 3419 (yeast).
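The actual merging procedure is given in the Supplementary Data. The sketch below only illustrates one way an 80% overlap bound could be enforced; it assumes that overlap is measured relative to the smaller cluster and that clusters are accepted greedily in input order, both of which are assumptions rather than details from the text.

```python
def overlap(a, b):
    """Fraction of the smaller cluster that is shared with the other one."""
    return len(a & b) / min(len(a), len(b))


def merge_cluster_sets(primary, secondary, max_overlap=0.8):
    """Pool two collections of clusters, skipping any cluster whose overlap
    with an already accepted cluster exceeds max_overlap."""
    kept = []
    for cluster in list(primary) + list(secondary):
        candidate = frozenset(cluster)
        if candidate and all(overlap(candidate, k) <= max_overlap for k in kept):
            kept.append(candidate)
    return kept


# Toy clusters over hypothetical protein identifiers.
from_algorithm_1 = [{1, 2, 3, 4, 5}]
from_algorithm_2 = [{1, 2, 3, 8, 9}, {1, 2, 3, 4, 5, 7}]
merged = merge_cluster_sets(from_algorithm_1, from_algorithm_2)
```

Here the second cluster shares 3 of 5 proteins with the first and is kept, whereas the third contains the first entirely and is dropped.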
To validate the collection of identified clusters, we measured the coherence of their member proteins with respect to their functional annotation and essentiality status. A total of 6854 (70%) of the clusters (across all species) exhibited significant functional coherence, and 1511 (34%) out of the 4366 clusters inferred for S. cerevisiae and E. coli (for which we had gene essentiality information) were significantly enriched with essential proteins (Supplementary Figure 2 and Supplementary Data).
As an additional validation, we evaluated the correspondence of the S. cerevisiae clusters to curated complexes from MIPS (41). This was done by computing sensitivity and specificity indices as in (42). Restricting the analysis to yeast clusters that intersect some MIPS complex, we found that 62% of those significantly match a known complex (sensitivity), covering 97% of the MIPS complexes (specificity; see Supplementary Data).
Constructing sets of orthologous clusters
The construction of sets of orthologous clusters (SOCs) consists of two steps: (i) identifying pairs of homologous clusters; and (ii) using the homologous pairs to identify SOCs.
The homology relations are determined as follows: given two species α and β, we define a protein similarity graph G = (V_α, V_β, E), where V_α (V_β) is the protein set of α (β). We connect pairs of sequence-similar proteins by an edge, using a BLAST E-value cutoff of ≤10^-6 (thus ensuring a significance level of approximately 0.01 after correcting for multiple hypothesis testing). Given two clusters c_α and c_β from species α and β, respectively, we measure their level of homology using two complementary statistical scores. (i) Edge-based score: the density of sequence similarity edges connecting protein pairs from the two clusters, evaluated with the hypergeometric score HG (43) over N_E(A, B), the number of sequence similarity edges connecting pairs of proteins in sets A and B. (ii) Node-based score: the total number of proteins that have a potential ortholog in the opposite set, computed from N_V(A, B), the number of proteins in set B that are sequence similar to a protein from A.
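One way to write the two scores that is consistent with the definitions of N_E and N_V, and with the four-argument call HG(|O|, |O ∩ P|, |O ∩ S|, |O ∩ P ∩ S|) used in the gene-duplication analysis, is sketched below. The argument order (population size, marked items, sample size, observed overlap) is inferred from that later call. The exact parameterization of S_E and S_V, and whether the scores are reported as tail probabilities or on a logarithmic scale, are assumptions.

```latex
\begin{align*}
\mathrm{HG}(N, K, n, k) &= \sum_{i=k}^{\min(n,K)}
    \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} \\[4pt]
S_E(c_\alpha, c_\beta) &= \mathrm{HG}\bigl(|V_\alpha|\,|V_\beta|,\;
    N_E(V_\alpha, V_\beta),\; |c_\alpha|\,|c_\beta|,\;
    N_E(c_\alpha, c_\beta)\bigr) \\[4pt]
S_V(c_\alpha, c_\beta) &= \mathrm{HG}\bigl(|V_\beta|,\;
    N_V(c_\alpha, V_\beta),\; |c_\beta|,\;
    N_V(c_\alpha, c_\beta)\bigr)
\end{align*}
```

Under this reading, S_E asks whether the protein pairs spanning the two clusters carry more similarity edges than expected among all cross-species pairs, and S_V asks whether c_β contains more proteins similar to c_α than a random protein set of the same size.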
We filter the computed relations by placing a bound of 5% on the false discovery rate (FDR) of the two scores (44) (i.e. in expectation, 5% of the discovered relations are false positives). Further requiring that for every related pair at least 25% of the proteins in one of the clusters have a sequence similar protein in the other cluster, yields a preliminary set of pairs of homologous clusters. For each cluster, we then report only its best match (taking the mean over the two scores) in each species and construct a cluster homology graph. The nodes in this graph correspond to protein clusters (across different species); and the edges connect clusters to their best matches (note that this relation might be one sided). Notably, the sequence similarity criterion employed in the protein similarity graph coincides with that of Sharan et al. (8) and Kelly et al. (45). We chose not to use stricter definitions such as reciprocal best BLAST matches, or members of the same Inparanoid (46) cluster, since as previously noted by Sharan et al. (8), this may result in missing many functional orthologs that exhibit a relatively weak sequence similarity signal.
The construction of the SOCs starts by enumerating all 7-node cliques (complete subgraphs) in the cluster homology graph and then merging cliques that have six nodes in common until no more merging is possible. We then remove all the merged cliques from the graph and repeat the procedure using cliques of decreasing sizes. At iteration i (1 ≤ i ≤ 6), the algorithm enumerates all the cliques of size 8 − i and merges cliques with 7 − i nodes in common. In the sixth and last iteration, we consider cliques consisting of pairs of clusters. To obtain better support for the implied orthology relations within the SOCs resulting from these small seeds, we require the best-match relations between the two clusters in the clique to be mutual. We note that a SOC might contain a few clusters from the same species. These may be paralogous clusters or overlapping clusters.
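The iterative clique procedure can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: one-sided best matches are treated as undirected edges when looking for cliques of three or more clusters, merged groups keep absorbing any group that shares enough nodes, and cliques are enumerated by brute force, which is only practical for small graphs. The function names and the toy cluster labels are hypothetical.

```python
from itertools import combinations


def k_cliques(adj, k):
    """All k-node cliques of an undirected graph {node: set(neighbours)}."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def merge_seeds(seeds, min_shared):
    """Repeatedly union any two groups sharing at least min_shared nodes."""
    groups = [set(s) for s in seeds]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if len(groups[i] & groups[j]) >= min_shared:
                groups[i] |= groups[j]
                del groups[j]
                merged = True
                break  # indices changed, restart the scan
    return groups


def build_socs(best_match, max_size=7):
    """best_match: {cluster: set of clusters reported as its best matches}."""
    nodes = set(best_match) | {v for vs in best_match.values() for v in vs}
    adj = {u: set() for u in nodes}
    for u, vs in best_match.items():
        for v in vs:
            adj[u].add(v)
            adj[v].add(u)
    socs = []
    for size in range(max_size, 1, -1):  # clique sizes 7, 6, ..., 2
        seeds = k_cliques(adj, size)
        if size == 2:  # pairs must be mutual best matches
            seeds = [s for s in seeds
                     if all(v in best_match.get(u, ()) for u in s for v in s - {u})]
        groups = merge_seeds(seeds, size - 1)
        socs.extend(frozenset(g) for g in groups)
        for u in set().union(*groups):  # remove merged cliques from the graph
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)
    return socs


# Toy homology graph: a 4-clique, one mutual pair, one one-sided pair.
best_match = {
    "h1": {"f1", "w1", "y1"}, "f1": {"h1", "w1", "y1"},
    "w1": {"h1", "f1"}, "y1": {"h1", "w1"},
    "e1": {"p1"}, "p1": {"e1"},
    "x1": {"z1"},
}
socs = build_socs(best_match)
```

In the toy graph the four mutually linked clusters form one SOC, the mutual pair forms a second, and the one-sided pair is discarded.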
Handling false negatives in the interaction data
False negative (undetected) interactions may lead to underestimation of conservation levels and result in discarding true orthology relations between clusters. To estimate the false negative rate in the data, we measured the fraction of potential cluster-orthology misses (Supplementary Data). Intuitively, we define a potential miss as a case where a cluster seems to be conserved when using only sequence data, and not conserved when using both sequence and PPI data. The estimated false negative rate for the entire data set was 40%.
To tackle this problem, we used a filtering criterion which aims at removing clusters for which orthologs may be obscured by lack of PPI data (Supplementary Data). The estimated false negative rate after the filtering was reduced to 36% (Supplementary Figure 3). The filtering reduced the size of the set of clusters that are members of a SOC by 25%. Notably, the set of species-specific clusters was reduced by 37.2% (Figure 2D). This pronounced difference indicates that many of the species-specific clusters may have been inferred as such due to lack of PPI data, and that our filtering procedure has managed to pin down many of those cases.
We also computed the false negative rate based on manually curated protein complexes from the MIPS database (Supplementary Data). The estimated rates (38.4% and 32.3% without and with filtering, respectively) are in line with the estimations above. Notably, the false negative rates computed for the prokaryotes (E. coli and H. pylori), along with that of P. falciparum, are substantially higher than those of the rest of the species in this study. In addition to the lack of experimental data in the latter two networks, this observed gap is likely to stem from actual differences in the networks themselves (namely, that sequence similarities are less likely to imply conservation of interactions) as evident from Figure 2E. While expected for the prokaryotes, it was also shown that wiring in the PPI network of P. falciparum is substantially different from that of other eukaryotes (47).
Propensity for gene loss and protein age estimation
The propensity for gene loss (PGL) (19) measure quantifies the conservation of a protein in evolution and is based on the presence/absence of its orthologs across a set of species (more details on the computational process are provided in the next section). To compute the PGL values, we obtained clusters of orthologous genes in 17 eukaryotic species from NCBI's HomoloGene database (48). The eukaryotic species include nine animals, five fungi, two plants and one pathogen. We considered the PGL values of all proteins whose ancestor dates back to the bilateria or fungi ancestors (or earlier) under an optimal parsimonious reconstruction. The corresponding phylogenetic tree was taken from NCBI and the divergence time estimates were taken from (49,50) (Supplementary Figure 5).
In addition, we classify the proteins into age groups according to the lowest common ancestor of their phyletic pattern in the phylogenetic tree. We treat the evolutionary age as a real value by representing every group by its estimated divergence time (Supplementary Figure 5). Species-specific proteins are assigned a minimal age value of zero.
Propensity for cluster loss and cluster age estimation
A phylogenetic tree relating the investigated species was taken from NCBI (48). Divergence time estimates were taken from (49–52). In addition, we used an estimated divergence time of 2000 My between H. pylori and E. coli (Supplementary Figure 4).
The propensity for cluster loss (PCL) measure is defined analogously to the protein-level PGL. Given a phylogenetic tree and a pattern of presence and absence of a protein cluster across the leaves of the tree, the pattern of presence and absence across all the inner (ancestral) nodes in the tree is determined using an optimal parsimonious reconstruction. This reconstruction seeks to minimize the number of losses along the branches of the phylogenetic tree, while being constrained by the Dollo parsimony principle, under which cluster loss is treated as irreversible [a cluster can be lost independently in several evolutionary lineages but cannot be regained (53)]. The PCL is then defined as the ratio between the total length of branches in the phylogenetic tree along which the cluster was lost and the total length of branches along which it could have been lost. For the computation of PCL, we considered only clusters that can be traced back to the eukaryotic ancestor or to the root of the phylogenetic tree under an optimal parsimonious reconstruction.
For age estimation, the clusters are classified into five distinct age groups: Bilateria, Fungi/Metazoa, Eukarya, Eukarya/Bacteria and species-specific. The assignment of a cluster to an age group is done according to the most recent ancestor, common to all the species in its phyletic pattern. Similarly to single proteins, we treat the evolutionary age of a cluster as a real-valued variable by representing every age group by its estimated divergence time (Supplementary Figure 4).
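The Dollo reconstruction and the resulting PCL can be sketched as below. The sketch assumes a rooted tree given as child lists with per-branch lengths, places the single gain at the most recent common ancestor of all species carrying the cluster, and places each loss on the branch leading to a maximal subtree with no carriers. The tree, branch lengths and function name are toy values chosen for illustration; the real divergence times are those of Supplementary Figure 4 and are not reproduced here.

```python
def dollo_origin_and_pcl(children, branch_length, root, present):
    """Return (origin node, PCL) for a presence/absence pattern.

    children:      {internal node: [child nodes]}; leaves are not keys
    branch_length: {node: length of the branch from its parent}
    present:       set of leaves (species) in which the cluster is observed
    """
    has = {}

    def mark(node):
        kids = children.get(node, [])
        if kids:
            has[node] = any([mark(k) for k in kids])  # list: visit every child
        else:
            has[node] = node in present
        return has[node]

    if not mark(root):
        raise ValueError("pattern contains no species from the tree")

    # Single gain: descend while only one child lineage carries the cluster.
    origin = root
    while True:
        carriers = [k for k in children.get(origin, []) if has[k]]
        if len(carriers) != 1:
            break
        origin = carriers[0]

    lost = kept = 0.0
    stack = [origin]
    while stack:
        node = stack.pop()
        for kid in children.get(node, []):
            if has[kid]:
                kept += branch_length[kid]
                stack.append(kid)
            else:
                lost += branch_length[kid]  # whole subtree lost on this branch
    total = lost + kept
    return origin, (lost / total if total else None)


# Toy tree: root -> (P, Q), P -> (a, b), Q -> (c, d); lengths are arbitrary.
children = {"root": ["P", "Q"], "P": ["a", "b"], "Q": ["c", "d"]}
branch_length = {"P": 1.0, "Q": 1.0, "a": 2.0, "b": 2.0, "c": 2.0, "d": 2.0}
origin, pcl = dollo_origin_and_pcl(children, branch_length, "root", {"a", "c", "d"})
origin_q, pcl_q = dollo_origin_and_pcl(children, branch_length, "root", {"c", "d"})
```

The returned origin node corresponds to the age group of the cluster; mapping it to a real-valued age only requires a lookup of that node's estimated divergence time.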
Gene duplication and cluster evolution
In the following, we consider two proteins of the same species as putatively paralogous if their BLAST E-value is lower than 10^-6. For a given SOC, let S be its set of proteins, and let O denote the set of proteins from the participating species whose evolutionary age is not smaller than that of the SOC (as inferred by its phyletic pattern). We consider O∩S as the putative evolutionary core of the SOC. To evaluate the role played by duplication in the formation of a given SOC, we measure the enrichment of its core with duplicated, self-interacting proteins. To this end, we define P as the set of proteins that satisfy the following conditions: (i) the protein is self-interacting or has a self-interacting paralog; and (ii) it coresides in a cluster with one of its paralogs. We then compute a hypergeometric score quantifying the enrichment of the core with proteins from P: HG(|O|,|O ∩ P|,|O ∩ S|,|O ∩ P ∩ S|). The obtained P-values were corrected for multiple hypothesis testing using the procedure of Benjamini and Hochberg (44) and placing an FDR cutoff of 5%, where the number of hypotheses equals the number of SOCs (647).
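The core-enrichment test can be illustrated with a short sketch. It assumes that HG denotes the upper tail of the hypergeometric distribution with arguments ordered as (population, marked, sample, overlap), and it uses a plain Benjamini-Hochberg step-up rule; the sets and P-values below are toy values, and the function names are illustrative.

```python
from math import comb


def hg_tail(N, K, n, k):
    """P(X >= k) when n items are drawn from N, of which K are marked."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank  # largest rank passing its threshold
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Toy sets: O = sufficiently old proteins, P = duplicated self-interactors,
# S = proteins of one SOC (all identifiers are hypothetical).
O = set(range(20))
P = {0, 1, 2, 3}
S = {0, 1, 2, 10}
p_core = hg_tail(len(O), len(O & P), len(O & S), len(O & P & S))

# One P-value per SOC would be collected and corrected together.
calls = benjamini_hochberg([0.01, 0.02, 0.03, 0.5])
```

In the toy case three of the four core proteins belong to P, giving a tail probability of 65/4845, and the correction rejects the three smallest P-values at an FDR of 5%.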
RESULTS
A framework for evolutionary analysis of protein complexes
We amassed PPI data from public databases and recent publications to construct a comprehensive up-to-date collection of PPI networks for seven species: H. sapiens, D. melanogaster, C. elegans, S. cerevisiae, P. falciparum, E. coli and H. pylori (Methods section, Supplementary Data).
As experimental data on protein complexes are not available for most of the analyzed species (with the exception of S. cerevisiae, and to a lesser extent H. sapiens and E. coli), we applied computational approaches to infer protein complexes within each of the networks. To this end, we used two previously published algorithms for protein complex detection (8,39). We merged their results into a single collection of putative protein complexes, which we term clusters, for each network. Overall, we identified 9886 clusters within the seven networks (Supplementary Figure 1a). We validated the identified clusters by evaluating the coherency of their member proteins with respect to functional annotation and essentiality status (see Methods section). We used the identified protein clusters together with cross-species protein similarity information to derive SOCs, which are key to the evolutionary analysis presented in the sequel.
Sets of orthologous clusters
We define a SOC as a collection of clusters from two or more species that are likely to have evolved from a common ancestral protein complex. To identify these sets, we extended the notion of clusters of orthologous groups [COGs, see (18)] from the single gene level to the level of protein subnetworks: the SOC inference algorithm starts by identifying pairs of clusters from different species that are potentially orthologous. The algorithm then proceeds to find sets of clusters (cliques), each from a different species, in which all members are potentially orthologous. Finally, the SOCs are formed using an iterative clustering procedure, which merges pairs of cliques that differ only by a single node. The SOC construction pipeline is depicted in Figure 1 and described in the Methods section.
Altogether, we obtained 647 SOCs spanning two to seven species each, with a median of three clusters per SOC (Figure 2; see Supplementary Table 4 for the complete list of inferred SOCs). The SOCs allow inferring phyletic patterns for clusters (or whole SOCs), i.e. patterns of presence/absence of protein clusters across the seven studied species. Overall, 52 out of the 120 possible phyletic patterns were observed, with the number of occurrences of each pattern varying from more than 50 (spanning different subsets of the investigated eukaryotes, excluding P. falciparum) to a single occurrence (typically involving both eukaryotes and prokaryotes). The SOCs cover 2823 (28%) of the clusters with relative ratios of coverage varying from 10% (H. pylori) to 37% (P. falciparum). Expectedly, the percentage of clusters participating in the SOCs was substantially lower for the two investigated prokaryotes due to their large evolutionary distance from the rest of the species.
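The figure of 120 possible patterns is consistent with counting every subset of the seven species that contains at least two species, which is presumably how it was obtained given that a SOC spans two or more species:

```latex
\sum_{k=2}^{7} \binom{7}{k}
  \;=\; 2^{7} - \binom{7}{0} - \binom{7}{1}
  \;=\; 128 - 1 - 7
  \;=\; 120 .
```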
To validate the computed SOCs, we first evaluated their functional coherence using the functional annotations of the participating clusters (Supplementary Data). A total of 257 (39%, versus a random expectation of 5%) of all SOCs and 219 (60%) of the SOCs of size three or more were found to be functionally coherent. In addition, we constructed a phylogenetic tree relating the analyzed species according to their co-membership in SOCs (Supplementary Data). The reconstructed tree (Figure 2E, right) closely matched the known tree of life (54) with the only exception being the lack of a separate prokaryotic clade. Notably, when using the conservation of individual PPIs rather than SOC co-membership to construct the phylogenetic tree, we obtained a less accurate tree with yeast and human placed together in a separate clade (Figure 2E, left). It is reasonable to assume that this deviation reflects the dominance of the yeast network in the available PPI data. Importantly, this effect vanishes when using cluster orthology as the basis for the tree reconstruction. As a further validation for the SOC construction, we traced the phyletic patterns of manually curated protein complexes from the MIPS database (41). We estimated the accuracy of these patterns by comparing the inferred presence/absence indicators to prior biological knowledge (Supplementary Data). The inferred patterns attained an accuracy level of 80%. Examples of SOCs constructed for MIPS complexes are given in Figure 3A and the Supplementary Data.
A notable problem in the analysis of large-scale PPI data in general and of protein clusters in particular, is the prevalence of false negatives. To tackle this problem, we restricted the analysis to clusters for which we had confidence in their inferred phyletic patterns, and filtered clusters for which orthologs in more than one species could not be detected due to possible lack of PPI data (see Methods section). The results presented in the following sections were obtained with the filtered collection of clusters.
Evolutionary measures for protein clusters
We developed two novel measures for characterizing the evolution of clusters: propensity for loss in evolution and evolutionary age. Both measures rely on the phyletic patterns induced by membership of a cluster in SOCs and on the phylogenetic tree relating the investigated species (Supplementary Figure 4).
The PCL is a cluster-level analog of the PGL measure introduced by Krylov et al. (19). The PGL of a gene is an estimate for the rate at which it was lost in evolution. Given a phylogenetic tree over a set of species and a phyletic pattern for the gene across these species, the PGL of the gene is the ratio between the overall lengths of branches along which the gene was lost and the total length of branches along which it was either lost or preserved. Analogously, we computed the PCL value of a cluster by reconstructing its phyletic pattern across the ancestral species in the phylogenetic tree that relates the seven investigated species, and measuring the relative length of branches along which the cluster was lost (see Methods section).
The evolutionary age estimate is based on a classification of the clusters into several distinct age groups reflecting their estimated emergence time relative to the lineage split events in the phylogeny of the investigated species. The age groups, in ascending order (from less to more ancient), include: Bilateria, Fungi/Metazoa, Eukarya and Eukarya/Bacteria. The age group of a cluster is determined as its latest possible emergence time under an optimal parsimonious reconstruction (see Methods section), in a manner similar to (10). We defined an additional age group, the species-specific group, as the set of clusters that have no putative orthologs in other species. We assign the clusters in this group a minimal age value of zero. The distribution of species-specific clusters among the species shows a trend similar to that observed above, with higher rates of species-specific clusters found for the two prokaryotes, and covers a total of 15.8% of the clusters (Figure 2D). To validate the evolutionary measures, we investigated their correlation with various functional attributes. Our findings, provided in the Supplementary Data, are consistent with those previously reported for single proteins (18,19).
Mechanistic principles of protein complex evolution
The inferred phyletic patterns and evolutionary measures allow us to directly probe various mechanistic aspects of the evolution of protein complexes. In the following, we concentrate on two fundamental questions: do proteins tend to evolve independently of one another or do proteins within the same complex evolve in a coherent manner? And, how central is the role of gene duplication in the evolution of protein complexes?
Cluster evolution versus single protein evolution
The evolution of PPI networks was previously shown to have modular characteristics in the sense that proteins in a complex are likely to be lost or gained concomitantly (12,13). To obtain further insights into the evolution of complexes, we looked at the mode of organization of proteins into clusters throughout their evolution. We considered the following two trends: (i) the proteins in a cluster were originally unrelated and became a functional unit through evolution; and (ii) the organization into the same cluster characterizes proteins in a cluster ever since their emergence.
To test which of these scenarios is more prevalent, we computed the median PGL and evolutionary age values of the proteins in each cluster (see Methods section) and compared them with the respective PCL and cluster age values. We concentrated on the eukaryotic clusters, as PGL information for prokaryotic genes was not readily available. The results, summarized in Table 1, show significant correlations between the evolutionary attributes of a cluster and those of its member proteins. This supports the plausibility of the second scenario.
Table 1.
| Species | Cluster measure | Median PGL | Median protein age |
|---|---|---|---|
| Homo sapiens | Age | −0.270 | 0.364 |
| Homo sapiens | PCL | ns | −0.243 |
| Saccharomyces cerevisiae | Age | −0.339 | 0.438 |
| Saccharomyces cerevisiae | PCL | 0.201 | −0.306 |
| All | Age | −0.337 | 0.311 |
| All | PCL | 0.124 | −0.240 |

Shown are Pearson correlation coefficients between each cluster-level measure (rows) and the median value over the member proteins of the cluster (columns). Nonsignificant correlations are marked as ‘ns’.
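The comparison summarized in Table 1 pairs each cluster-level value with the median of the corresponding protein-level values and then correlates the two series. A minimal sketch of that computation is given below; the cluster names and numbers are toy values chosen so that the result is easy to verify, the helper names are illustrative, and the significance test behind the ‘ns’ entries is not shown.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy data: cluster-level age and the ages of each cluster's member proteins.
cluster_age = {"c1": 1.0, "c2": 2.0, "c3": 3.0}
member_ages = {"c1": [1, 2, 3], "c2": [4, 4, 9], "c3": [5, 6, 10]}

names = sorted(cluster_age)
median_protein_age = [median(member_ages[c]) for c in names]
r = pearson([cluster_age[c] for c in names], median_protein_age)
```

The same pairing applies to the other cells of the table, with PCL in place of cluster age or median PGL in place of median protein age.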
An example of a match between the conservation of complexes and proteins is the yeast coat protein complex I (COPI), which mediates intra-Golgi and Golgi-to-ER trafficking. The core proteins of the coat complex machineries are known to be highly conserved in eukaryotes (55). On the other hand, they are not expected to be present in prokaryotes, as they lack endomembranes (56). Consistent with this expectation, the SOC containing the COPI cluster in yeast consists solely of eukaryotic clusters, and includes all the investigated eukaryotes except P. falciparum (Figure 3A). Proteins comprising this SOC include both GTPases (ARF1 in yeast and human, F13D12.7, F52A8.2 and C26C6.2 in C. elegans, and CG15010 in D. melanogaster), and coat proteins (SEC21/27, COP1 in yeast and COPA, COPB1/2, COPG2 in human). Notably, interactions among coat proteins (and other proteins related to the COPI complex) are missing from the P. falciparum network; as a result, the corresponding cluster is missing from the set of P. falciparum clusters and, consequently, from the SOC containing the yeast COPI cluster.
A second example is the mating-type cluster in yeast shown in Figure 3B. This cluster contains the genes HMLALPHA1, MATALPHA2 and HMRA1, HMRA2, which are either expressed from the MAT locus by haploids or serve as silent cassettes for the exchange of mating types (57). In addition, it contains the Ste12 transcription factor and its interacting protein Mcm1. Because this cluster is involved in a highly specific process such as mating, we would not expect it to be evolutionarily conserved. Indeed, the SOC construction identifies it as yeast-specific. On the other hand, when analyzing individual components of the cluster, we find that Mcm1 is highly conserved throughout the evolutionary scale. In addition to controlling mating functions, this protein also affects other processes in the cell such as cell-cycle progression, cell wall synthesis and DNA repair (58–60). Thus, despite the observed correlation, the evolutionary history of a complex does not necessarily reflect the evolutionary history of all of its components. This is probably because individual proteins may find novel roles within the cell, not necessarily in the context of a complex.
The role of gene duplication in protein complex evolution
Gene duplication and subsequent divergence is one of the fundamental forces underlying the expansion of eukaryotic proteomes (61). It was recently hypothesized that it is key to the development of modularity in PPI networks as well (16). Specifically, it was suggested that a substantial portion of the complexes in the yeast PPI network have originated from evolutionary cores of homodimers. According to this hypothesis, those ancient homodimers served as ‘seeds’ which subsequently evolved to whole complexes through events of duplication, diversification and augmentation by additional proteins. To support this conjecture, Pereira-Leal et al. (16) derived a series of corollaries and showed that they hold in yeast. In particular, they showed that dimers of paralogous proteins are likely to have evolved from the duplication of homodimers, and that protein complexes tend to contain pairs of paralogs (see Supplementary Data for a validation of these corollaries on our data).
Here, we used the constructed SOCs to provide a more explicit validation for the role of duplication of self-interacting proteins in evolution. We hypothesized that a set of complexes that originated from an ancestral homodimer seed would contain a conserved core of paralogous, self-interacting proteins. To measure the effect of duplication of self-interacting proteins on the evolution of the clusters in our data set, we estimated how many of the SOCs conform to this expectation. For each SOC, we isolated its putative evolutionary core by considering only proteins whose estimated age is at least as high as that induced by the SOC (see Methods section). If the clusters in the SOC have evolved from duplications of a homodimer seed, we would expect the core to be enriched with paralogous self-interacting proteins. Hence, we computed for each SOC a statistical score that compares this level of overrepresentation to random sets of proteins of at least the same evolutionary age as those contained in the SOC (see Methods section). After correcting for multiple hypothesis testing using the FDR procedure of Benjamini and Hochberg (44) and using a cutoff of 5%, we found that the cores of 142 (22%) of the SOCs were enriched with paralogous self-interacting proteins, testifying to the important role of duplication in the evolution of protein complexes.
Figure 3C presents one such SOC. This SOC is annotated as chromatin modification and contains orthologous histone deacetylase units from H. sapiens, S. cerevisiae and C. elegans (panel 1). The putative core of the SOC, highlighted in panel 2, is dominated by two conserved sets of duplicated proteins. The first set comprises a group of paralogs from human (TBL1XR1, TBL1X, RBBP4 associated with histone deacetylation and chromatin assembly) and a single representative each from yeast and worm (histone-binding proteins TUP1 and K07A1.12, respectively), where the yeast representative is known to be self-interacting (62). The second set contains one representative from yeast and one from worm (the RPD3 and C53A5.3 histone deacetylase proteins), and four paralogous human proteins (HDAC1/2/3/9 belonging to the histone deacetylase family), of which two (HDAC1/3) are self-interacting.
DISCUSSION
We presented a framework for evolutionary analysis of protein complexes. By generalizing concepts from the level of single proteins, we constructed orthologous sets containing clusters from seven different species. These sets allow us to infer patterns of presence and absence across the evolutionary tree, and consequently to estimate the propensity for loss in evolution and evolutionary age. We verified the orthologous sets in several ways including reconstructing the participating species' phylogeny and manually investigating a small set of hand-curated complexes.
We used the inferred SOCs to investigate mechanistic aspects of protein complex evolution. First, we probed the relationship between the evolutionary characteristics of a cluster as a whole and those of its constituents, observing a significant correlation between the two. Second, we showed the importance of gene duplication as a mechanism for the evolution of protein complexes.
The resulting new evolutionary measures can be employed to study other aspects of protein complex evolution beyond the mechanistic aspects studied here. A fundamental question in this regard is how different functional attributes impact the evolution of a complex. In the Supplementary Data we show that the evolutionary rate of a complex significantly correlates with its level of connectivity in the network, the specificity of its function and its essentiality. These findings are consistent with those previously reported for single proteins (18,19) and agree with our previous findings on the coherent evolution of the protein members of a complex.
It is encouraging that current PPI networks are already rich enough to support the careful study of intricate processes such as protein complex evolution, once the still considerable rates of false-positive and false-negative interactions are controlled for. No less important, the integrated computational approach laid out here is likely to yield further insights into protein complex evolution as molecular interaction databases continue to grow in size, accuracy and species coverage.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Tel-Aviv University president and rector scholarship (to N.Y.); Israel Science Foundation and the Ministry of Science and Technology, Israel (to M.K.); a French-Israeli collaboration initiative funded by the Israeli Ministry of Science (MOST) (to E.R.); German Israel Foundation grant (to R.S.); Converging Technologies grant from the Israel Science Foundation (to M.K., E.R. and R.S.). Funding for open access charge: Israel Science Foundation.
Conflict of interest statement. None declared.
REFERENCES
- 1. Fields S. High-throughput two-hybrid analysis. The promise and the peril. FEBS J. 2005;272:5391–5399. doi: 10.1111/j.1742-4658.2005.04973.x.
- 2. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422:198–207. doi: 10.1038/nature01511.
- 3. Xenarios I, Salwinski L, Joyce X, Higney P, Kim S, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002;30:303–305. doi: 10.1093/nar/30.1.303.
- 4. Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A, Tyers M. Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–D539. doi: 10.1093/nar/gkj109.
- 5. Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, et al. Intact–open source resource for molecular interaction data. Nucleic Acids Res. 2007;35:D561–D565. doi: 10.1093/nar/gkl958.
- 6. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M. Identification of potential interaction networks using sequence-based searches for conserved protein–protein interactions or ‘interologs’. Genome Res. 2001;11:2120–2126. doi: 10.1101/gr.205301.
- 7. Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat. Biotechnol. 2006;24:427–433. doi: 10.1038/nbt1196.
- 8. Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T. Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA. 2005;102:1974–1979. doi: 10.1073/pnas.0409522102.
- 9. Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res. 2006;16:1169–1181. doi: 10.1101/gr.5235706.
- 10. Campillos M, von Mering C, Jensen LJ, Bork P. Identification and analysis of evolutionarily cohesive functional modules in protein networks. Genome Res. 2006;16:374–382. doi: 10.1101/gr.4336406.
- 11. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA. 1999;96:4285–4288. doi: 10.1073/pnas.96.8.4285.
- 12. Ettema T, van der Oost J, Huynen M. Modularity in the gain and loss of genes: applications for function prediction. Trends Genet. 2001;17:485–487. doi: 10.1016/s0168-9525(01)02384-8.
- 13. Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, et al. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 2004;5:R7. doi: 10.1186/gb-2004-5-2-r7.
- 14. Qin H, Lu HH, Wu WB, Li WH. Evolution of the yeast protein interaction network. Proc. Natl Acad. Sci. USA. 2003;100:12820–12824. doi: 10.1073/pnas.2235584100.
- 15. Snel B, Huynen MA. Quantifying modularity in the evolution of biomolecular systems. Genome Res. 2004;14:391–397. doi: 10.1101/gr.1969504.
- 16. Pereira-Leal JB, Levy ED, Kamp C, Teichmann SA. Evolution of protein complexes by duplication of homomeric interactions. Genome Biol. 2007;8:R51. doi: 10.1186/gb-2007-8-4-r51.
- 17. Pereira-Leal JB, Teichmann SA. Novel specificities emerge by stepwise duplication of functional modules. Genome Res. 2005;15:552–559. doi: 10.1101/gr.3102105.
- 18. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637. doi: 10.1126/science.278.5338.631.
- 19. Krylov DM, Wolf YI, Rogozin IB, Koonin EV. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 2003;13:2229–2235. doi: 10.1101/gr.1589103.
- 20. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al. A map of the interactome network of the metazoan C. elegans. Science. 2004;303:540–543. doi: 10.1126/science.1091403.
- 21. Stanyon CA, Liu G, Mangiola BA, Patel N, Giot L, Kuang B, Zhang H, Zhong J, Finley RL Jr. A Drosophila protein-interaction map centered on cell-cycle regulators. Genome Biol. 2004;5:R96. doi: 10.1186/gb-2004-5-12-r96.
- 22. Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang HC, Hirai A, et al. Large-scale identification of protein–protein interaction of Escherichia coli K-12. Genome Res. 2006;16:686–691. doi: 10.1101/gr.4527806.
- 23. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, et al. The protein–protein interaction map of Helicobacter pylori. Nature. 2001;409:211–215. doi: 10.1038/35051615.
- 24. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature. 2005;437:1173–1178. doi: 10.1038/nature04209.
- 25. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al. A human protein–protein interaction network: a resource for annotating the proteome. Cell. 2005;122:957–968. doi: 10.1016/j.cell.2005.08.029.
- 26. LaCount DJ, Vignali M, Chettier R, Phansalkar A, Bell R, Hesselberth JR, Schoenfeld LW, Ota I, Sahasrabudhe S, Kurschner C, et al. A protein interaction network of the malaria parasite Plasmodium falciparum. Nature. 2005;438:103–107. doi: 10.1038/nature04104.
- 27. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440:631–636. doi: 10.1038/nature04532.
- 28. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440:637–643. doi: 10.1038/nature04670.
- 29. Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, et al. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J. Biol. 2006;5:11. doi: 10.1186/jbiol36.
- 30. Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen CK, et al. Wormbase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res. 2005;33:D383–D389. doi: 10.1093/nar/gki066.
- 31. FlyBase-Consortium. The flybase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 2003;31:172–175. doi: 10.1093/nar/gkg094.
- 32. Mori H, Isono K, Horiuchi T, Miki T. Functional genomics of Escherichia coli in Japan. Res. Microbiol. 2000;151:121–128. doi: 10.1016/s0923-2508(00)00119-4.
- 33. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, Fleischmann RD, Ketchum KA, Klenk HP, Gill S, Dougherty BA, et al. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature. 1997;388:539–547. doi: 10.1038/41483.
- 34. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13:2363–2371. doi: 10.1101/gr.1680803.
- 35. Fraunholz MJ, Roos DS. Plasmodb: exploring genomics and post-genomics data of the malaria parasite, Plasmodium falciparum. Redox Rep. 2003;8:317–320. doi: 10.1179/135100003225002961.
- 36. Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, et al. Saccharomyces genome database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 2004;32:D311–D314. doi: 10.1093/nar/gkh033.
- 37. Bader G, Hogue C. Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol. 2002;20:991–997. doi: 10.1038/nbt1002-991.
- 38. Bader JS, Chaudhuri A, Rothberg JM, Chant J. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol. 2004;22:78–85. doi: 10.1038/nbt924.
- 39. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584. doi: 10.1093/nar/30.7.1575.
- 40. Brohee S, van Helden J. Evaluation of clustering algorithms for protein–protein interaction networks. BMC Bioinformatics. 2006;7:488. doi: 10.1186/1471-2105-7-488.
- 41. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, et al. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004;32:D41–D44. doi: 10.1093/nar/gkh092.
- 42. Hirsh E, Sharan R. Identification of conserved protein complexes based on a model of protein network evolution. Bioinformatics. 2007;23:e170–e176. doi: 10.1093/bioinformatics/btl295.
- 43. Harkness W. Properties of the extended hypergeometric distribution. Ann. Math. Stat. 1965;36:938–945.
- 44. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 1995;57:289–300.
- 45. Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA. 2003;100:11394–11399. doi: 10.1073/pnas.1534710100.
- 46. Remm M, Storm C, Sonnhammer E. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 2001;314:1041–1052. doi: 10.1006/jmbi.2000.5197.
- 47. Suthram S, Sittler T, Ideker T. The plasmodium protein network diverges from those of other eukaryotes. Nature. 2005;438:108–112. doi: 10.1038/nature04135.
- 48. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, et al. Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003;31:28–33. doi: 10.1093/nar/gkg033.
- 49. Borenstein E, Shlomi T, Ruppin E, Sharan R. Gene loss rate: a probabilistic measure for the conservation of eukaryotic genes. Nucleic Acids Res. 2007;35:e7. doi: 10.1093/nar/gkl792.
- 50. Hedges SB, Blair JE, Venturi ML, Shoe JL. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol. Biol. 2004;4:2. doi: 10.1186/1471-2148-4-2.
- 51. Friedman R, Hughes AL. Pattern and timing of gene duplication in animal genomes. Genome Res. 2001;11:1842–1847. doi: 10.1101/gr.200601.
- 52. Feng DF, Cho G, Doolittle RF. Determining divergence times with a protein clock: update and reevaluation. Proc. Natl Acad. Sci. USA. 1997;94:13028–13033. doi: 10.1073/pnas.94.24.13028.
- 53. Farris J. Phylogenetic analysis under Dollo's law. Syst. Zool. 1977;26:77–88.
- 54. Sogin ML, Hinkle G, Leipe DD. Universal tree of life. Nature. 1993;362:795. doi: 10.1038/362795a0.
- 55. Bock JB, Matern HT, Peden AA, Scheller RH. A genomic perspective on membrane compartment organization. Nature. 2001;409:839–841. doi: 10.1038/35057024.
- 56. Devos D, Dokudovskaya S, Alber F, Williams R, Chait BT, Sali A, Rout MP. Components of coated vesicles and nuclear pore complexes share a common molecular architecture. PLoS Biol. 2004;2:e380. doi: 10.1371/journal.pbio.0020380.
- 57. Herskowitz I. A regulatory hierarchy for cell specialization in yeast. Nature. 1989;342:749–757. doi: 10.1038/342749a0.
- 58. Bahler J. Cell-cycle control of gene expression in budding and fission yeast. Annu. Rev. Genet. 2005;39:69–94.
- 59. Abraham DS, Vershon AK. N-terminal arm of Mcm1 is required for transcription of a subset of genes involved in maintenance of the cell wall. Eukaryot. Cell. 2005;4:1808–1819.
- 60. Workman C, Mak H, McCuine S, Tagne J, Agarwal M, Ozier O, Begley T, Samson L, Ideker T. A systems approach to mapping DNA damage response pathways. Science. 2006;312:1054–1059. doi: 10.1126/science.1122088.
- 61. Teichmann SA, Park J, Chothia C. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl Acad. Sci. USA. 1998;95:14658–14663. doi: 10.1073/pnas.95.25.14658.
- 62. Jabet C, Sprague ER, VanDemark AP, Wolberger C. Characterization of the N-terminal domain of the yeast transcriptional repressor Tup1. Proposal for an association model of the repressor complex Tup1 x Ssn6. J. Biol. Chem. 2000;275:9011–9018. doi: 10.1074/jbc.275.12.9011.