Abstract
A fundamental issue in molecular evolution is how to identify the evolutionary forces that determine the fate of duplicated genes. The dosage balance hypothesis has been invoked to explain gene duplication patterns at the genomic level under the premise that a dosage imbalance among protein-complex subunits or interacting partners is often deleterious. Here we examine this hypothesis by investigating the molecular basis of dosage sensitivity. We focus on the extent of protein wrapping, which indicates how strongly the structural integrity of a protein relies on its interactive context. From this perspective, we predict that the duplicates of a highly under-wrapped protein or protein subunit should (1) be more sensitive to dosage imbalance and be less likely to be retained and (2) be more likely to survive from a whole-genome duplication (WGD) than from a non-WGD because a WGD causes little or no dosage imbalance. Our under-wrapping analysis of more than 12,000 protein structures strongly supports these predictions and further reveals that the effect of dosage sensitivity on gene duplicability decreases with increasing organismal complexity.
Author Summary
A gene duplication provides an extra gene copy that can be free to accumulate mutations and gain a new function. Therefore, gene duplication plays a very important role in evolution. However, the presence of an additional gene copy can sometimes be deleterious because it can lead to an excessive dosage relative to those of its interacting partners. This dosage imbalance effect in turn influences the fate of duplicated genes in evolution. Our study gives the first description to our knowledge of the molecular/structural basis for the dosage imbalance effect. We study the relationships between gene family size and extent of protein under-wrapping, a molecular quantifier of the reliance of the protein on binding partnerships to maintain structural integrity, indicative of the extent of structure protection from disruptive hydration. Using more than 12,000 protein three-dimensional structures from six organisms that range from bacteria to human, we show an inverse relationship between extent of protein under-wrapping and family size. That is, a duplication is unlikely to be tolerated if the protein is highly under-wrapped (i.e., its structure requires substantial stabilizing interactions with other proteins). We also show that the effect of dosage imbalance is more apparent in unicellular organisms but is buffered to some extent in higher eukaryotes.
Introduction
Gene duplication is a primary source of new genes and increases genome complexity [1,2]. In recent years, the evolutionary forces influencing gene duplicability have been under intense study. In particular, the gene dosage balance hypothesis [3] has often been invoked to explain gene duplication patterns at the genomic level [4]. The dosage balance hypothesis states that an imbalance in the concentrations of the subcomponents of macromolecular complexes can be deleterious [3]. Although this notion was originally proposed in the context of protein complexes, it can be extended to other protein interaction partnerships [5]. If dosage imbalance is indeed deleterious, the outcome of a gene duplication event would largely depend on the immediate dosage sensitivity effect. While significant progress has been made in the last several years [4,6–9], the influence of dosage imbalance on the retention of gene duplicates remains poorly understood. So far, the most relevant studies on this topic have mainly focused on protein complex data or protein-protein interaction data, which have inherent limitations. First, such data represent the interacting context of a protein in an abstract way. For example, the potential dosage imbalance effect of protein subunits in a complex may crucially depend on their topological positions within the complex and on the complex-assembly pathway [5]. Second, and more importantly, there is a conceptual distinction between a-priori plausible protein associations and obligatory associations required to preserve the structural integrity and functionality of the protein. Thus, even if the interacting context of a protein could be characterized by some measurements (e.g., protein connectivity or interacting surface), the potential imbalance effect would still be hard to assess. Lastly, most current protein interaction data are known to be noisy, being plagued with both false positives and false negatives [10,11].
Recent advances in structural genomics and biophysics enable us to examine the dosage balance hypothesis in the light of the three-dimensional structure of proteins. In this regard, we focus on a specific attribute of protein structure, the so-called under-wrapping [12–17]. This attribute quantifies the extent to which the protein structure is reliant on the interactive context to maintain its integrity. In particular, overexpressing a highly under-wrapped protein can increase the propensity for aberrant misfolding and aggregation [16], promoting dosage sensitivity.
The under-wrapping parameter describes the solvent accessibility of the major determinants of protein structure: the backbone hydrogen bonds (Figure 1). Thus, in order for the structure to prevail and remain functionally competent, backbone hydrogen bonds must be “wrapped” by clusters of non-polar amino acid residues that exclude the surrounding water, thereby preventing the competing hydration of the paired polar groups. Since backbone hydration competes with structure retention, the intramolecular hydrogen bonds that are water-accessible, termed dehydrons [13], represent structural vulnerabilities. As a consequence, dehydrons promote binding partnerships with the concurrent exclusion of surrounding water, as needed to maintain the structural integrity of the protein [13,15,17]. The hydrogen-bond protection requirement poses a strong constraint on protein architecture and dictates that highly under-wrapped proteins, i.e., those with a large number of dehydrons, should be highly interactive [15] to maintain their structural integrity.
As shown in Figure 1, the wrapping extent can be accurately determined from reported structures deposited in the Protein Data Bank (PDB) [12]. As protein structures become more under-wrapped, they become more reliant on binding partnerships [15]. Thus, protein under-wrapping quantifies how strongly the structural integrity of a protein depends on its binding partners [13], thereby framing a vantage point to study the dosage imbalance effect.
Results/Discussion
From the above reasoning we predict that the probability of retention of gene duplicates in evolution (i.e., gene duplicability) should decrease with the extent of hydrogen bond under-wrapping of the polypeptide encoded by the gene. To test this prediction, we compiled non-redundant proteins with PDB-reported structures, calculated the under-wrapping extent for each protein (subunit), and determined the duplicability (m, the gene family size) for the corresponding gene. Interestingly, in all six organisms studied (Escherichia coli, yeast, worm, fly, human and thale cress), we found a negative correlation between protein under-wrapping extent and gene duplicability (Figures 2A–2C and S1).
Since it has been shown that genes with particular biological functions tend to duplicate in evolution [18–20], we examined the potential influence of functional bias on our results. We compared the under-wrapping extent of yeast singletons with that of duplicates in different functional categories and found that singletons are consistently more under-wrapped than duplicates in each functional category (Figure S2). This result indicates that the effect of protein wrapping on gene duplicability is independent of the previously known functional bias of gene duplication.
Our study reveals a universal negative effect of protein under-wrapping on gene duplicability in a variety of species, strongly supporting the dosage balance hypothesis. The decreasing tendency is most significant from m = 1 to 4 and becomes less obvious at higher duplicability. However, the strength of the dependence between the two variables varies considerably among species: the negative correlation is quite strong in simple organisms such as E. coli and yeast, but becomes weak in complex organisms such as humans. To perform a more rigorous comparison, we used linear regression to roughly capture the dependence between protein under-wrapping and gene duplicability. As shown in Figure 2D, as organismal complexity increases, the effect of protein under-wrapping on gene duplicability decreases, that is, E. coli > yeast > worm > fly ∼ human ∼ thale cress, suggesting a less important role of the dosage imbalance effect in complex organisms. To further understand this intriguing trend, we examined the per-gene-family protein under-wrapping distributions in different species. As shown in Figure 3, E. coli and yeast proteins have relatively broad under-wrapping distributions, while human proteins show a narrow distribution mainly from 35% to 55%. There are fewer well-wrapped proteins (<35%) in humans, implying that most human proteins need binding partners to maintain the integrity of their functional structure. On the other hand, unicellular species appear to possess more autonomous protein folders (under-wrapping <35%), capable of operating without forming obligatory complexes [17]. However, the contrasting distributions between complex and simple organisms are hard to interpret because of the large differences at the proteome level.
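The regression comparison described above can be illustrated with a minimal sketch. It assumes that under-wrapping (in %) is regressed on family size m and that the slope is the quantity compared across species; the text does not state the exact regression setup, so this choice, the function name `least_squares_slope`, and the toy numbers are all hypothetical.

```python
def least_squares_slope(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data (NOT from the paper): family size m vs. mean under-wrapping (%).
toy = {
    "species_A": ([1, 2, 3, 4], [50.0, 46.0, 42.0, 38.0]),
    "species_B": ([1, 2, 3, 4], [45.0, 44.0, 43.0, 42.0]),
}
slopes = {name: least_squares_slope(m, uw)[0] for name, (m, uw) in toy.items()}
# A more negative slope means a stronger dependence of under-wrapping on duplicability.
```

In this toy setting, species_A shows the steeper negative slope, which is the kind of ordering the comparison across organisms relies on.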
Duplicated genes can arise from either whole-genome duplication (WGD) or non-WGD (including individual or segmental duplication) [21]. In a WGD, every gene in the genome is duplicated at the same time, so that binding partnerships are also duplicated, leading to less chance of dosage imbalance than in a non-WGD. Thus, an interesting prediction stemming from the dosage balance hypothesis is that duplicates of highly under-wrapped proteins would be more likely to survive from a WGD than from a non-WGD event. Since the duplication history of yeast genes has been largely elucidated [22], we decided to test this prediction using yeast duplicates with m = 2. We classified the yeast duplicates into two groups: one group from WGD and the other from non-WGD. By performing the analysis conditioned on the same m, the under-wrapping difference between the two groups should mainly be determined by the underlying duplication mechanisms. We found that the under-wrapping extent in WGD duplicates is significantly higher than that in non-WGD duplicates (Figure 4A, N_WGD = 51, N_non-WGD = 56, two-tailed Wilcoxon rank test p < 8 × 10^−10), implying that the dosage imbalance effect was indeed relaxed in the WGD. Again, we examined this trend in different functional categories and found that the WGD duplicates are consistently more under-wrapped than the non-WGD duplicates in each category (Figure 4B).
In higher eukaryotes, a considerable number of highly under-wrapped proteins are associated with highly duplicated genes, suggesting that complex organisms are less sensitive to the dosage imbalance effect. This can possibly be attributed to several factors. First, complex organisms may have more efficient systems to adjust gene expression levels (e.g., chaperones, proteases and non-coding RNAs). It has been shown that in cultured cells more than 60% of human promoter polymorphisms cause more than two-fold differences in gene-expression level [23]. Second, widespread alternative splicing in higher eukaryotes may play an important role in fixing the imbalance effect, since different splicing variants might represent an “escape route” to avoid dosage imbalance. Third, it has been suggested that proteins tend to physically interact with similar partners, especially with their own duplicates [24]. Complex organisms may have higher allostery (i.e., dimerization or oligomerization), which can partly alleviate dosage imbalance. Fourth, complex organisms generally have a smaller effective population size than do simple organisms [25], so that a duplicate bearing a slightly deleterious dosage imbalance effect would have a better chance to be fixed in the population, thereby allowing a longer time for functional innovation. Last but not least, adaptation (positive selection) due to functional diversification may have played an important role in determining the retention of duplicated genes in complex organisms [26,27] (e.g., MHC genes in mammals [28]).
In summary, we have identified protein under-wrapping as a molecular basis of dosage sensitivity. An imbalance-generating duplication becomes less tolerable if the protein is severely under-wrapped and therefore requires substantial stabilizing interactions with other proteins. Indeed, the extent of under-wrapping in a protein can be used as an approximate predictor of the strength of the effect of dosage imbalance on gene duplicability. The prediction can be made more broadly and precisely in the future when more data on protein structures, especially on protein complexes, become available.
Materials and Methods
Gene family size calculation in model organisms.
We obtained gene information from the following sources: E. coli, E. coli Genome and Proteome Database (http://genprotec.mbl.edu/) (GenProtEC); Saccharomyces cerevisiae, Saccharomyces Genome Database (http://www.yeastgenome.org/) (SGD1.01); Caenorhabditis elegans, WormBase (http://www.wormbase.org/) (WB170); Drosophila melanogaster, Berkeley Drosophila Genome Project (http://www.fruitfly.org/) (BDGP 4.3); Homo sapiens, Ensembl Genome Database (NCBI36); Arabidopsis thaliana, The Arabidopsis Information Resource (http://www.arabidopsis.org/) (TAIR7.0). Then, based on the GenProtEC family annotation, 4,485 E. coli genes were grouped into 2,901 gene families (a singleton gene is counted as one family in our analysis); based on the Ensembl gene family annotation [29], 6,024 yeast genes were grouped into 4,661 families, 20,173 worm genes were grouped into 11,503 families, 14,116 fly genes were grouped into 9,477 families, and 22,357 human genes were grouped into 12,394 families. Thale cress gene families were classified using the MCL algorithm [30] with the default Ensembl parameters, which grouped 26,819 genes into 10,236 gene families. We excluded genes annotated with more than one gene family from our analysis.
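As an illustration of how a gene family size m could be derived from such annotations, the following minimal sketch takes (gene, family) pairs, drops genes annotated with more than one family, and counts members per family. The input format, the function name `family_sizes`, and the decision to count family sizes after the exclusion step are assumptions made for the example, not details given in the text.

```python
from collections import Counter, defaultdict


def family_sizes(gene_family_pairs):
    """Map each unambiguous gene to the size m of its gene family.

    gene_family_pairs: iterable of (gene_id, family_id) annotations.
    Genes annotated with more than one family are excluded.
    A singleton gene is simply a family with one member (m = 1).
    """
    families_of = defaultdict(set)
    for gene, family in gene_family_pairs:
        families_of[gene].add(family)

    # Keep only genes that belong to exactly one family.
    unique = {g: next(iter(fams)) for g, fams in families_of.items() if len(fams) == 1}

    # Assumption: family size is counted over the retained genes only.
    size = Counter(unique.values())
    return {gene: size[fam] for gene, fam in unique.items()}


# Hypothetical annotations: g4 is ambiguous and is therefore dropped.
pairs = [("g1", "F1"), ("g2", "F1"), ("g3", "F2"), ("g4", "F1"), ("g4", "F2")]
m = family_sizes(pairs)
```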
Computing the extent of protein under-wrapping.
For each of the six organisms under study, we constructed a set of non-redundant genes with at least one PDB representative structure. From the reported structure we calculated the extent of protein under-wrapping by determining the ratio of the number of insufficiently wrapped hydrogen bonds (dehydrons) to the total number of backbone hydrogen bonds in the structure. The dehydron identification from reported protein structure follows the protocol detailed in Chen et al. [12]. Together, our dataset includes 822 E. coli genes, 476 yeast genes, 29 worm genes, 94 fly genes, 2,275 human genes and 168 thale cress genes, for which we have both gene duplicability and protein structural data.
The extent of hydrogen-bond wrapping, ρ, measures the number of non-polar groups contained within a desolvation domain defined as two intersecting balls of fixed radius (∼thickness of three water layers) centered at the α-carbons of the residues paired by the amide-carbonyl hydrogen bond (Figure 1). In this study we adopted r = 5.7Å, and while the wrapping statistics on hydrogen bonds vary with this radius, the tails of the distribution remain invariant, thus enabling a unique identification of dehydrons. An across-PDB analysis reveals that hydrogen bonds are wrapped on average by ρ = 24.3 ± 4.8 non-polar groups for desolvation radius 5.7Å. Being insufficiently wrapped, dehydrons lie in the tails of the distribution, i.e., their desolvation microenvironment contains 19 or fewer non-polar groups, so that their ρ value is below the mean minus one Gaussian dispersion [12,15]. Thus, the overall under-wrapping of a protein is computed by determining the percentage of intramolecular hydrogen bonds with ρ ≤ 19. This criterion for identifying a dehydron fits the well-defined ansatz used to assess the wrapping statistics, which places dehydrons at the 8% percentile of most under-wrapped hydrogen bonds irrespective of the desolvation radius adopted [13–17]. Hence, the criterion is justified by the robustness of the results to variations in the assessment of the bond microenvironment.
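The definitions above can be sketched in a few lines. The example counts non-polar groups inside the desolvation domain of one hydrogen bond and then computes the percentage of bonds with ρ ≤ 19, where 19 follows from the mean minus one dispersion (24.3 − 4.8 = 19.5) applied to integer counts. Reading the domain as the union of the two balls, representing each non-polar group by a single coordinate, and the function names are assumptions for illustration; the actual protocol is the one detailed in Chen et al. [12].

```python
import math

DESOLVATION_RADIUS = 5.7  # Å, radius adopted in the text
DEHYDRON_CUTOFF = 19      # rho <= 19 marks a dehydron (24.3 - 4.8 = 19.5, integer counts)


def wrapping_rho(ca_a, ca_b, nonpolar_coords, radius=DESOLVATION_RADIUS):
    """Count non-polar groups within `radius` of either alpha-carbon.

    ca_a, ca_b: (x, y, z) of the alpha-carbons of the two hydrogen-bonded residues.
    nonpolar_coords: iterable of (x, y, z), one point per non-polar group (assumed).
    """
    return sum(
        1
        for p in nonpolar_coords
        if math.dist(p, ca_a) <= radius or math.dist(p, ca_b) <= radius
    )


def underwrapping_percent(rhos, cutoff=DEHYDRON_CUTOFF):
    """Percentage of backbone hydrogen bonds that are dehydrons (rho <= cutoff)."""
    if not rhos:
        raise ValueError("no backbone hydrogen bonds supplied")
    return 100.0 * sum(1 for r in rhos if r <= cutoff) / len(rhos)


# Hypothetical coordinates (Å): three of the four points fall inside the domain.
rho = wrapping_rho((0, 0, 0), (4, 0, 0), [(1, 0, 0), (9, 0, 0), (10, 0, 0), (2, 5, 0)])
```

A real calculation would first identify backbone hydrogen bonds and non-polar groups from the PDB coordinates; those steps are outside the scope of this sketch.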
The under-wrapping variation of a protein generated by structural differences in reported PDB entries is less than 8.8%. This variability arises from the different structural adaptations (induced fits) adopted by the protein in different crystallized complexes or from differences between uncomplexed protein structure in solution (often determined by NMR) and crystal structure. To account for such differences, the under-wrapping extent for each gene is typically averaged over all its PDB representations (Text S1). We obtained per-gene-family under-wrapping distributions by averaging the under-wrapping values among members within a gene family whenever available.
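The two averaging steps described above can be sketched as follows: values from multiple PDB representations are averaged per gene, and per-gene values are then averaged within each gene family. The input layout and function names are hypothetical, and a plain arithmetic mean is assumed.

```python
from collections import defaultdict
from statistics import mean


def per_gene_underwrapping(structure_values):
    """Average under-wrapping (%) over all PDB representations of each gene.

    structure_values: iterable of (gene_id, under-wrapping % of one PDB entry).
    """
    by_gene = defaultdict(list)
    for gene, value in structure_values:
        by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


def per_family_underwrapping(gene_values, family_of):
    """Average per-gene under-wrapping over the family members that have data.

    gene_values: dict gene_id -> under-wrapping %.
    family_of: dict gene_id -> family_id; genes without a family entry are skipped.
    """
    by_family = defaultdict(list)
    for gene, value in gene_values.items():
        if gene in family_of:
            by_family[family_of[gene]].append(value)
    return {family: mean(values) for family, values in by_family.items()}


# Hypothetical values: g1 has two PDB entries, g2 has one; both belong to family F1.
genes = per_gene_underwrapping([("g1", 40.0), ("g1", 44.0), ("g2", 50.0)])
families = per_family_underwrapping(genes, {"g1": "F1", "g2": "F1"})
```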
In this study, the wrapping computations involved more than 12,000 protein structures because a large fraction of the non-redundant proteins examined had various PDB representations with differences arising from the following sources: complexation diversity, level of structure resolution, NMR conformational diversity and high B-factors in the crystal (Text S1). The under-wrapping data obtained in our study are given in Tables S1–S6.
Yeast WGD versus non-WGD duplicates analysis.
We obtained WGD gene duplicate pairs from Kellis et al. [22]. We used the Wilcoxon rank test (two-tailed) to determine whether the distributions of protein under-wrapping differ between WGD and non-WGD duplicates, since the underlying distributions are not normal. We used the GO term analysis tools [31] to map yeast genes to the GO terms in the default GO slim file.
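For readers who want to reproduce this kind of comparison, here is a minimal pure-Python sketch of a two-sided rank test for two independent samples. It assumes that the test meant is the Wilcoxon rank-sum (Mann-Whitney U) test, since the WGD and non-WGD groups are unpaired, and it uses the normal approximation with tie correction and no continuity correction. The function name and the toy values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    # Pool the samples, remembering which sample each value came from (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j for this tie group
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical under-wrapping values (%), NOT the paper's data.
wgd = [50, 52, 54, 56, 58]
non_wgd = [40, 42, 44, 46, 48]
u_stat, p_value = rank_sum_test(wgd, non_wgd)
```

For small samples an exact test would be preferable to the normal approximation; with group sizes like those reported in the text (51 and 56), the approximation is usually considered adequate.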
Acknowledgments
The help of Anuphap Prachumwat in gene family classification is gratefully acknowledged. We would like to thank the three reviewers for their valuable suggestions.
Footnotes
A previous version of this article appeared as an Early Online Release on December 13, 2007 (doi:10.1371/journal.pgen.0040011.eor).
Author contributions. HL, W-HL, and AF designed the research. HL, KRP, JC, and AF performed the research. HL, W-HL, and AF wrote the paper.
Funding. This research was supported by National Institutes of Health (NIH) grants to WHL and AF.
Competing interests. The authors have declared that no competing interests exist.
References
1. Ohno S. Evolution by gene duplication. New York: Springer-Verlag; 1970.
2. Long M, Betran E, Thornton K, Wang W. The origin of new genes: glimpses from the young and old. Nat Rev Genet. 2003;4:865–875. doi: 10.1038/nrg1204.
3. Veitia RA. Exploring the etiology of haploinsufficiency. Bioessays. 2002;24:175–184. doi: 10.1002/bies.10023.
4. Papp B, Pal C, Hurst LD. Dosage sensitivity and the evolution of gene families in yeast. Nature. 2003;424:194–197. doi: 10.1038/nature01771.
5. Veitia RA. Gene dosage balance: deletions, duplications and dominance. Trends Genet. 2005;21:33–35. doi: 10.1016/j.tig.2004.11.002.
6. Prachumwat A, Li WH. Protein function, connectivity, and duplicability in yeast. Mol Biol Evol. 2006;23:30–39. doi: 10.1093/molbev/msi249.
7. Yang J, Lusk R, Li WH. Organismal complexity, protein complexity, and gene duplicability. Proc Natl Acad Sci U S A. 2003;100:15661–15665. doi: 10.1073/pnas.2536672100.
8. Aury JM, Jaillon O, Duret L, Noel B, Jubin C, et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 2006;444:171–178. doi: 10.1038/nature05230.
9. Kondrashov FA, Koonin EV. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet. 2004;20:287–290. doi: 10.1016/j.tig.2004.05.001.
10. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002;417:399–403. doi: 10.1038/nature750.
11. Batada NN, Hurst LD, Tyers M. Evolutionary and physiological importance of hub proteins. PLoS Comput Biol. 2006;2:e88. doi: 10.1371/journal.pcbi.0020088.
12. Chen J, Zhang X, Fernandez A. Molecular basis for specificity in the druggable kinome: sequence-based analysis. Bioinformatics. 2007;23:563–572. doi: 10.1093/bioinformatics/btl666.
13. Fernandez A. Keeping dry and crossing membranes. Nat Biotechnol. 2004;22:1081–1084. doi: 10.1038/nbt0904-1081.
14. Fernandez A, Berry RS. Proteins with H-bond packing defects are highly interactive with lipid bilayers: implications for amyloidogenesis. Proc Natl Acad Sci U S A. 2003;100:2391–2396. doi: 10.1073/pnas.0335642100.
15. Fernandez A, Berry RS. Molecular dimension explored in evolution to promote proteomic complexity. Proc Natl Acad Sci U S A. 2004;101:13460–13465. doi: 10.1073/pnas.0405585101.
16. Fernandez A, Kardos J, Scott LR, Goto Y, Berry RS. Structural defects and the diagnosis of amyloidogenic propensity. Proc Natl Acad Sci U S A. 2003;100:6446–6451. doi: 10.1073/pnas.0731893100.
17. Fernandez A, Scheraga HA. Insufficiently dehydrated hydrogen bonds as determinants of protein interactions. Proc Natl Acad Sci U S A. 2003;100:113–118. doi: 10.1073/pnas.0136888100.
18. Seoighe C, Wolfe KH. Yeast genome evolution in the post-genome era. Curr Opin Microbiol. 1999;2:548–554. doi: 10.1016/s1369-5274(99)00015-6.
19. Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. Selection in the evolution of gene duplications. Genome Biol. 2002;3:RESEARCH0008. doi: 10.1186/gb-2002-3-2-research0008.
20. Marland E, Prachumwat A, Maltsev N, Gu Z, Li WH. Higher gene duplicabilities for metabolic proteins than for nonmetabolic proteins in yeast and E. coli. J Mol Evol. 2004;59:806–814. doi: 10.1007/s00239-004-0068-x.
21. Graur D, Li WH. Fundamentals of molecular evolution. Sunderland (Massachusetts): Sinauer Press; 2000.
22. Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–624. doi: 10.1038/nature02424.
23. Rockman MV, Wray GA. Abundant raw material for cis-regulatory evolution in humans. Mol Biol Evol. 2002;19:1991–2004. doi: 10.1093/oxfordjournals.molbev.a004023.
24. Lukatsky DB, Shakhnovich BE, Mintseris J, Shakhnovich EI. Structural similarity enhances interaction propensity of proteins. J Mol Biol. 2007;365:1596–1606. doi: 10.1016/j.jmb.2006.11.020.
25. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302:1401–1404. doi: 10.1126/science.1089370.
26. Kirschner M, Gerhart J. Evolvability. Proc Natl Acad Sci U S A. 1998;95:8420–8427. doi: 10.1073/pnas.95.15.8420.
27. Liang H, Li WH. Gene essentiality, gene duplicability and protein connectivity in human and mouse. Trends Genet. 2007;23:375–378. doi: 10.1016/j.tig.2007.04.005.
28. Hao L, Nei M. Rapid expansion of killer cell immunoglobulin-like receptor genes in primates and their coevolution with MHC Class I genes. Gene. 2005;347:149–159. doi: 10.1016/j.gene.2004.12.012.
29. Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, et al. Ensembl 2006. Nucleic Acids Res. 2006;34:D556–D561. doi: 10.1093/nar/gkj133.
30. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584. doi: 10.1093/nar/30.7.1575.
31. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. GO::TermFinder—Open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20:3710–3715. doi: 10.1093/bioinformatics/bth456.