Abstract
Gene duplication is widely accepted as a key evolutionary process, leading to new genes and novel protein functions. By providing the raw genetic material necessary for functional expansion, the mechanisms that involve the retention and functional diversification of duplicate genes are one of the central topics in evolutionary and comparative genomics. One proposed source of retention and functional diversification is protein subcellular relocalization (PSR). PSR postulates that changes in the subcellular location of eukaryotic duplicate proteins can positively modify function and therefore be beneficial to the organism. As such, PSR would promote retention of those relocalized duplicates and result in significantly lower death rates compared with death rates of nonrelocalized duplicate pairs. We surveyed both relocalized and nonrelocalized duplicate proteins from the available genomes and proteomes of 59 eukaryotic species and compared their relative death rates over a Ks range between 0 and 1. Using the Cox proportional hazard model, we observed that the death rates of relocalized duplicate pairs were significantly lower than the death rates of the duplicates without relocalization in most eukaryotic species examined in this study. These observations suggest that PSR significantly increases retention of duplicate genes and that it plays an important, but currently underappreciated, role in the evolution of eukaryotic genomes.
Keywords: duplicate genes, protein subceulluar relocalization, genome evolution, death rates
Introduction
For more than 40 years, it has been widely accepted that gene duplication is an important process underlying the evolution of new genes (Ohno 1970). With increasing availability of genomic data over the last 15 years, there has been renewed interest in this aspect of genome evolution, specifically with regards to the various evolutionary mechanisms involved with the retention and functional diversification of duplicate genes or paralogs (Zhang 2003; Innan and Kondrashov 2010). Some of the more well-known mechanisms of retention and diversification include neofunctionalization (Ohno 1970) and subfunctionalization (Hughes 1994; Force et al. 1999), both of which have been reviewed at great length in the literature (Innan and Kondrashov 2010). Recently, protein subcellular relocalization (PSR) was proposed as a means by which duplicate genes can potentially evolve novel functions through changes in the localization of their proteins within the cell (Byun and Geeta 2007; Byun et al. 2009). The basic premise behind PSR is that changes in a protein’s subcellular location can cause it to take on new or modified roles within the cell. Such functional changes due to subcellular relocalization have been observed in biochemical studies (Bizily et al. 2003; Lessering et al. 2004; Heilmann et al. 2004). PSR postulates that if such functional changes happen to a duplicate protein and the change is advantageous, the duplicate gene may be retained and ultimately lead to the evolution of a new gene.
The N-terminal peptide (NTP) is one of the best understood signals responsible for protein subcellular localization (Kaiser et al. 1987; Bannai et al. 2002). The NTP, a short degenerate sequence of approximately 13–85 amino acids located at the N-terminus of a protein, specifies its location within the eukaryotic cell. Once the protein is delivered to its correct subcellular location, the NTP is typically cleaved off and degraded, and therefore does not participate directly in mature protein function (Bannai et al. 2002). Therefore, changes to the NTP sequence can cause proteins to relocate without changing the actual sequence of the mature protein. In some instances, even minor changes to the NTP, such as a single nucleotide subsititution, are potentially capable of altering protein localization (Byun and Geeta 2007).
Over the past few years, several studies have supported the idea that PSR plays a key role in the evolution of duplicate genes in eukaryotes such as Saccharomyces (Marques et al. 2008) and humans (Wang et al. 2009). However, to date, no large-scale study has been undertaken to evaluate PSR as a universal mechanism of general eukaryotic duplicate gene evolution. Although a study comparing singleton and duplicate genes in Saccharomyces and Schizosaccharomyces found no significant difference in the PSR rate (Qian and Zhang 2009), it does not negate the importance of PSR as an evolutionary mechanism for duplicate genes. Rather, it suggests that PSR may play a role in the evolution of orthologous genes as well.
In this study, we examined duplicate gene pairs and their corresponding proteins compiled from the complete genomes and proteomes of 59 metazoan, single-celled eukaryotes, plant, algal, and fungal species to compare the retention (as measured by death rates) of relocalized duplicates and nonrelocalized duplicates over a range of Ks (number of synonymous substitutions per synonymous sites) values 0 to 1. Using the Cox proportional hazard model to compare death rates among over 700,000 relocalized and nonrelocalized duplicate gene pairs, we found compelling evidence to suggest that PSR duplicates have significantly lower death rates than duplicates, which do not relocalize. This observation, which suggests that relocalization significantly increases retention of duplicate genes, is consistent with the idea that PSR plays an important role in the evolution of duplicates and eukaryotic genomes.
Results and Discussion
A total of 7,16,917 duplicate gene pairs and their corresponding proteins were identified and analyzed from 59 different fungal, metazoan, green plant/green algae, and basal eukaryotic species (supplementary table S1, Supplementary Material online). For each species, we determined the total number of duplicate genes, and then calculated the proportion of each genome that was duplicated (supplementary table S1, Supplementary Material online). To simplify reporting of these data, we placed each of these species in one of the following categories: Fungi, Metazoan, Plants, Algae, and Basal Eukaryotes (single-celled protists) and summarize the data in table 1. The results of our analyses were consistent to what has been documented from other studies. For example, we predicted the percentage of duplicate genes in Homo sapiens and Arabidopsis thaliana was 32.6% and 50.7%, respectively. Although our estimates appear to be more conservative, they are largely consistent with predicted values of 38% for H. sapiens by Li et al. (2001) and 65% for A. thaliana by Arabidopsis Genome Initiative (2000). Estimates were also consistent with that of Gu et al. (2000) for Saccharomyces, Drosophila, and Caenorhabditis elegans. They estimated the total number of protein families in each species to be 530, 674, and 1,219, respectively; our estimates were 371, 644, and 1,283. The overall consistency of our estimates with those of other studies supports the validity of our method/algorithm of identifying paralogs.
Table 1.
Group | Average Proportion of Duplication |
---|---|
Fungi/algae | 0.13 ± 0.05 |
Plants | 0.37 ± 0.15 |
Metazoan | 0.25 ± 0.11 |
Basal eukaryote | 0.15 ± 0.090 |
For each duplicate protein pair identified from each eukaryotic species, the subcellular location was predicted using MultiLoc2. We chose MultiLoc2 because of its ability to predict localization in more subcellular compartments and in a greater variety of species groups than other comparable predictors. Furthermore, MultiLoc2 has shown higher accuracy than similar prediction programs through its incorporation of phylogenetic profiles and GO (Blum et al. 2009). It has been documented that some proteins exhibit dual targeting, which can complicate predictions of subcellular localization made by targeting software (Baudisch et al. 2013). We minimized this potential problem by not focusing on predicting specific subcellular locations of duplicate proteins but rather focsuing on whether they were predicted to be in the same or different locations. In this study, we were not interested in predicting the specific subcellular location of duplicate protein pairs. Rather, we focused on whether they were predicted to be the same or different. We categorized duplicate pairs as either relocalized (duplicate proteins with different predicted subcellular locations) or nonrelocalized (duplicates with identical predicted subcellular locations). We used these estimates to calculate the frequency of relocalized duplicate gene pairs (supplementary table S1, Supplementary Material online, %RDG) and summarize the data in table 2. The percentage of relocalized gene pairs ranged from 21.3% in metazoa to 29.1% in basal eukaryotes. The upper range of relocalized duplicates in individual species were found in rice (Oryza sativa 67,697/1,91,985 = 35.3%), and platypus (Ornithorhynchus anatinus 2,360/4,116 = 57.3%). The lower range was represented by Drosophila melanogaster (173/1,679 = 10.3%), horse (Equus caballus 1,234/13,753 = 15.7%), and the trypanosomatid Leishmania major (130/1,768 = 13.1%). Although the predictive nature of MultiLoc2 is a limitation of this study, we were encouraged to find that its predictions of subcellular localization were consistent with an empirically derived estimate for S. cervisiae. Our predicted estimate of 28.6% (562/1,966) for S. cervisiae falls within the 24–37% range empirically determined by Marques et al. (2008).
Table 2.
Group | RD | TND | FRD |
---|---|---|---|
Fungi | 962 | 3,453 | 0.28 |
Plants/algae | 164,663 | 528,618 | 0.31 |
Metazoan | 37,491 | 175,899 | 0.21 |
Basal eukaryotes | 2,607 | 8,947 | 0.29 |
Note.—RD, relocalized duplicates; TND, total number of duplicates; FR, frequency of relocalized duplicates. For each eukaryotic group, the frequency of PSR among duplicates is high. The total number of relocalized duplicate proteins are based on subcellular locations predicted by MultiLoc2.
For each species, we calculated the hazard ratio (death rate for nonrelocalized/death rate for relocalized) for duplicate pairs with Ks values ranging from 0 < Ks < 1 (supplementary table S2, Supplementary Material online, for full list of all ratios) using the Cox proportional hazard model. We chose a cut off of Ks = 1 to minimize potential errors associated with multiple hits (Li 1997) and potential multiple relocalizations at higher Ks values. The aim of using the Cox proportional hazard model was to compare the death rates of nonrelocalized and relocalized duplicate pairs by estimating the hazard ratio associated with them. Hazard ratios more than 1 indicate a higher death rate of nonrelocalized duplicates relative to the death rate of relocalized (PSR) duplicates. The hazard ratios for eight species ranging from V. carteri to H. sapiens are shown in table 3 as examples of our total data set (supplementary table S2, Supplementary Material online). The hazard ratios can be interpreted as follows: a hazard ratio of 1.58 for V. carteri with 0 < Ks < 0.05 (table 3) mean that nonrelocalized duplicate genes have a 58% higher hazard rate or death rate as compared with relocalized duplicates. Within individual species, hazard ratios varied as a function of Ks. In other words, death rates appeared to fluctuate with the duplicate gene’s relative age. Even in relatively young duplicates (Ks ≤ 0.05), 33.9% (20/59) of all species were observed to have hazard ratios significantly greater than 1 compared with 6.8% of species that showed the reverse (hazard ratio < 1). The observation that death rates in relocalized duplicates was significantly lower than nonrelocalized duplicate pairs at a Ks ≤ 0.05 for 34% of the species we examined, suggests that PSR may influence paralog retention during the earlier stages of duplication in some eukaryotic species. This is particularly interesting given that evolutionary forces, which act in the early stages following duplication, may be crucial in determining the ultimate fate of duplicated genes (Moore and Purugganan 2003). Hazard ratios were also not consistent between species (table 3). However, when hazard ratios were examined over a large number of eukaryotes from 0 < Ks < 1, it was apparent that, overall, hazard ratios were significantly greater than 1. Although it is possible that these observations may be caused by the continuous generation of relocalized duplicates from nonrelocalized duplicates over time rather than actual retention itself, a preliminary analysis of positive selection amongst the relocalized and nonrelocalized duplicate pairs from all species used in this study suggest that the data are more likely to be due to retention rather than a gradual accumulation of relocalized duplicates. Assuming that preferential retention of relocalized duplicates is due in part to some added benefit and thus subject to positive selection, we examined the data to see whether more relocalized duplicates exhibited evidence of positive selection over nonrelocalized duplicates. To obtain a reasonable sample size for the each of the Ks ranges used in our hazard ratio analysis, we combined all duplicates from all species used in this study. We then calculated the proportion of relocalized and nonrelocalized duplicates with a Ka/Ks > 1.5. Although Ka/Ks > 1 is typically the standard by which positive selection is measured, we chose 1.5 to give more weight to our initial analysis. For each of the Ks ranges used in this study, we found that significantly more relocalized duplicates have a Ka/Ks > 1.5 than nonrelocalized duplicates, a result consistent with preferential retention of relocalized duplicates rather than a gradual relocalization over time (supplementary table S3, Supplementary Material online).
Table 3.
Species | 0 < Ks < 0.05 | 0 < Ks < 0.1 | 0 < Ks < 0.25 | 0 < Ks < 0.5 | 0 < Ks < 0.75 | 0 < Ks < 1 |
---|---|---|---|---|---|---|
Volvox carteri | 1.58* | 1.30* | 1.40** | 1.32* | 1.36** | 1.13* |
Saccharomyces cervisiae | 1.73* | 1.82** | 3.80** | 2.26** | 2.23* | 2.72** |
Caenorhabditis elegans | 0.92 | 0.69 | 1.36* | 1.20* | 1.23* | 1.59** |
Drosophila rerio | 1.24** | 1.52** | 1.31** | 1.35** | 1.61** | 1.55** |
Homo sapiens | 1.92** | 1.56** | 1.64** | 1.38** | 1.39** | 1.27** |
Mus musculus | 1.26** | 1.17** | 1.09** | 1.43** | 1.46** | 1.40** |
Phytophthora ramorum | 1.34** | 1.38** | 1.31** | 1.21** | 1.17** | 1.13** |
Oryza sativa | 1.46** | 1.24** | 1.21** | 1.22** | 1.20** | 1.24* |
Note.—Hazard ratios = 1 indicate death rates between relocalized and nonrelocalized duplicates are equal. Hazard ratios > 1 indicate death rates of relocalized duplicates are lower than the death rates of nonrelocalized duplicates.
*Significant hazard ratios P < 0.05.
**Significant hazard ratios P < 0.001.
It is also possible that gene conversion could lead to biases in our death rate estimations with the appearance of fewer older (high Ks) nonrelocalized duplicates resulting in apparent lower retention rates over time. As gene conversion tends to occur in large gene families (>5 members), we removed all such duplicates by excluding those genes with more than five identifying matches (Lynch and Conery 2000) and then reanalyzed the data. Although specific patterns of retention for individual species changed as members of large gene families were removed, the overall results remained unchanged: Relocalized duplicate pairs had significantly higher retention than nonrelocalized duplicates. In fact, in this case, we found no instances in which nonrelocalized duplicates had significantly higher retention (supplementary table S4, Supplementary Material online).
The use of homology-based predictors like MultiLoc2 was another possible limitation with this analysis. Unfortunately, the performance of predictors that ignore homology would likely be inadequate for this type of study. To minimize potential biases introduced by homology, we reanalyzed the data by first removing all highly similar duplicate pairs (Ks < 0.01). This was done to eliminate the possibilty of erroneous predictions, which could lead to an overrepresentation of nonrelocalized duplicates with low Ks. When we compare the result of this analysis with the earlier results, we find the overall pattern intact: Relocalized duplicate pairs have significantly higher retention than nonrelocalized duplicates (supplementary tables S5 and S6, Supplementary Material online).
We observed that hazard ratios for individual species varied depending upon whether we used the entire or a subset of the data. One of the most dramatic differences we noticed was in A. thaliana. In the complete data set, we observed significant retention of relocalized duplicates but when corrected for gene conversion, we no longer saw such retention. In this particular case, it is possible that 1) gene conversion was biasing the data so that nonrelocalized duplicates appeared to have low Ks or that 2) a number of relocalized duplicates are located in large multigene families in A. thaliana. Removing them may have biased the results against retention of relocalized duplicates. Although we did not quantify our observations, we did note that a number of relocalized duplicates in A. thaliana did appear to belong to large multigene families involved in secondary metabolism. This is consistent with observations made by Heilmann et al. (2004).
As part of our investigation, we also examined the mutation patterns in the NTP region of duplicate gene pairs. Although the most common types of NTP mutation in the analysis were duplicate pairs with base substitutions (None), it was the complete deletion/gain of the NTP that resulted in proportionately more subcellular relocalizations (fig. 1a and b). The higher frequency of relocalization associated with complete NTP indels compared with base substitutions is not necessarily surprising given the greater magnitude of the former type of mutation. Based on this observation, we speculate that different mechanisms of gene duplication may influence the manner in which the duplicate proteins relocalize. For example, whole-genome and large-scale segmental duplications would likely give rise to duplicates with intact NTPs. Products of these types of duplication events would probably relocalize through base substitutions (and/or indels), which we found to be very common in the NTP. On the other hand, small-scale duplications caused by mechanisms such as illegitimate crossing over have the potential to generate duplicates with complete NTP additions/deletions, which in turn are more likely to result in subcellular relocalizations. Although, in our study, we did not distingush between duplicates formed by whole-genome or segmental duplications, we did examine some species that have not had any documented whole-genome duplications (WGD) (e.g., C. intestinalis), and some that have had multiple WGD such as polyploidizations (e.g., O. sativa) (Blanc and Wolfe 2004). In both types of species, we found evidence to support the idea that relocalized duplicates tend to have higher retention than nonrelocalized duplicates. The purpose of this work was to examine general patterns of duplicate gene retention across eukaryotic genomes. A closer examination of these hazard ratios in specific species from the perspective of their unique genomic history as well as specific gene families is an area of future research.
How could PSR initially be advantageous enough to significantly increase retention? We can envision several scenarios. First, the ancestral function of duplicate genes may not change with relocalization but instead could allow that function to be carried out in different subcellular compartments. The existence of identical protein functions in different subcellular compartments is not a new concept and can be achieved through alternate transcription and translation, alternate splicing (Regev-Rudzki et al. 2005) and possibly through PSR of duplicate genes. Increased distribution of ancestral function in different compartments by PSR could be viewed as a type of subfuctionalization as described by Hughes (1994). Second, it is not unreasonable to think that a change in subcellular location can have an immediate and beneficial effect on function. Such a change in function has been demonstrated experimentally. For example, when the cytosolic protein IDE (insulin degrading enzyme) was engineered to relocalize to the mitochondria, it immediately changed its function from regulating plasma insulin levels to degrading peptides (Leissring et al. 2004). One possible example of PSR and a change in function is that of dioscorins. Dioscorins are essential storage proteins found in yam tubers (Dioscorea spp.). They are assumed to aggregrate in the vacuoles of storage cells due to the presence of a signal peptide at their N-terminus (Lu et al. 2012). Although they have signficant cDNA similarity to α-carbonic anhydrases (αCAHs), a gene family that participates in the reversible hydration of CO2 in mammals, dioscorins share many characteristics of plant storage proteins such as high amide content and solubility. Despite the lack of highly conserved histidines characteristic of αCAHs, it was found that dioscorins still possess the ability for αCAHs activity, leading researchers to conclude that dioscorins are a novel type of αCAHs (Lu et al. 2012). Given that αCAHs in C3 dicots are generally known to localize to the chloroplast and stroma (Moroney et al. 2001), it is possible that these storage proteins may be an example of neofunctionalization via PSR. Another possible example of neofunctionalization through PSR may be the nonmuscle myosin heavy chain isoforms MHC-A and MHC-B in Xenopus. These two isoforms are encoded by two duplicate genes and are known to have distinct subcellular localizations based on immunoflourescence microscopy. The isoforms were shown to have very different enzymatic activities leading researchers to suggest that these two MHC isoforms have distinct functions (Kelley et al. 1996). A literature search results in a plethora of examples which suggest that different subcellular locations of duplicate proteins are associated with diverse functions (Pidoux and Tasken 2010; Castellano and Santos 2011). In future, it would be interesting to document the functional categories of both relocalized and nonrelocalized duplicate pairs, along with their subcellular locations as it may reveal important evolutionary patterns about gene diversification via PSR. Although we cannot say with certainty that the example described earlier are cases of neofunctionalization by PSR without examining outgroups and information on ancestral function, in light of our data indicating significant retention of relocalized duplicate genes, such studies would be worth pursing.
Another advantage that can initially arise from relocalization may be to alleviate dosage effects. Relocalization of duplicate proteins may cause them to become functionally inactive due to changes in the metabolic environment of the cell. This could serve to reduce overexpression of these proteins and thereby restore normal protein dosage. Evolution of novel function through PSR and dosage effects is not necessarily mutually exclusive. In fact, they may work together in that initial retention through a reduction in dosage, may give a fraction of these duplicates the time needed to accumulate beneficial mutations resulting in advantageous functions.
Conclusion
Several past studies have indicated the importance of PSR in the evolution of duplicate genes in vertebrates (Rosso et al. 2008; Kassahn et al. 2009). Here, we examined 56 species to see whether PSR plays a much wider role in eukaryotic genome evolution. The observation of significantly decreased death rates of relocalized duplicates in genomes spanning from single-celled eukaryotes to plants to mammals suggests that PSR is an important evolutionary process that may drive neofunctionalization in eukartyotes but yet has largely remained under-appreciated.
Materials and Methods
Coding sequences (CDSs) of annotated genes from 59 species were downloaded from Ensembl (Flicek et al. 2011) release 62, and Ensembl Genomes release 9 (Kersey et al. 2010) (for a complete list of all species see supplementary table S1, Supplementary Material online). MySQL queries were used to obtain the complete set of CDSs for each species. Metazoan coding sequences (CDS) from Ensembl genes were obtained using the following SQL query from the public Ensembl MySQL server at ensembldb.ensembl.org, where ID was a variable interpolated by an ad hoc Perl script that repeated the query once for each species’ database.
SELECT m.stable_id,m.description,scds.sequence_cds FROM sequence_cds scds, member m WHERE m.member_id=scds.member_id AND m.genome_db_id=ID
CDSs for nonmetazoan species were obtained using the following SQL query from the online MySQL database at mysql.ebi.ac.uk in a similar manner.
SELECT stable_id, description, sequence_cds FROM member, sequence_cds WHERE sequence_cds.member_id = member.member_id AND member.genome_db_id = ID.
Gene descriptions and correspondences between gene, transcript and protein IDS were downloaded from Ensembl and Ensembl genome using XML-based queries using the Martservice utility of the BioMart (Haider et al. 2009) interface for each of the species. The two green algal CDS collections for Chlamydomonas reinhartii (Merchant et al. 2007) and Volvox carteri (Prochnik et al. 2010) were obtained from phytosome (Goodstein et al. 2012).
The functional units of these analyses are paralogous protein pairs that represent putative duplicated genes presumed to share a common ancestor in the species lineage. The CDS transcriptome for each species was processed to have one representative sequence for each coding gene. In cases where alternative transcripts were annotated, the longest CDS was selected to represent the gene. Each processed CDS transcriptome was conceptually translated to create a representation of the species’ proteome. Proteins pairs were initially identified through all versus all intraspecific Blast (Altschul et al. 1997) analysis. The initial criteria for selection on candidate pairs from Blast analysis were proteins that align with an expect (e value) of ≤1e−3 and a Blast score ratio of ≥0.33. The Blast score ratio (Vilella et al. 2009) takes into account the bit score for the protein pair A–B as relates to the self-score of each protein gets when BLASTed against itself, where
(1) |
A larger BSR represents a higher quality of protein alignment in terms of length and sequence similarity. We used the threshold BSR of 0.33, as recommended by Vilella et al. (2009). Although all pairs were used in the clustering analysis (discussed later), we used proteins with ≥50% sequence identity along with at least 80% of their length for subcellular localization studies.
Protein pairs were then subjected to additional more stringent reciprocal filters to reduce spurious matches due to factors such as shared protein domains. Using criteria similar to those developed by Gu et al. (2002), we eliminated pairs whose alignment length was less than 80% of the total length of protein and, for peptides of length ≥150 amino acid residues, a minimum cutoff for percent sequence identity (I) of 30% was used. For peptides of length <150, the minimum I was calculated by using the method of Rost (1999):
(2) |
where L is the length of the alignment. This formula was derived from an empirical study that suggested that shorter peptides require a higher threshold for percent identity. Protein pairs that met all of the above criteria were retained for further analysis.
Proteins were clustered using a stringent double-linkage algorithm, in which filtered, reciprocal protein pairs for A, B, and C must all exist for proteins A, B, and C to be clustered. Resulting clusters are regarded as gene families. After this procedure, some proteins are represented in more than one cluster, which indicates that a nonreciprocal pair exists in the filtered set. Such pairs meet the e value and BSR thresholds but do not reciprocally pass the downstream filters, indicating a lower percentage identity or that they do not align more than 80% of protein length. Superclusters were formed by evaluating all protein pairs for single linkages and merging clusters where proteins were duplicated, until each protein was represented in only one cluster or supercluster. The family data, protein pair data and results of other analysis below were stored in a partially normalized MySQL database for future reference.
As gene conversion tends to occur in large gene families (>5 members), to minimize the potential effects from gene conversion, which could bias the death rates of nonrelocalized duplicate pairs, we ran all subsequent analyses on two data sets: 1) with all identified duplicate genes and 2) excluding all duplicates with more than five identifying matches (Lynch and Conery 2000).
The CDS sequences corresponding to protein pairs were assembled and each pair was analyzed for rates of synonymous (Ks) substitution. CDSs were translated and the proteins aligned with CLUSTALW (Thompson et al. 1994), which was then back-translated to the CDS alignment using an ad hoc BioPerl (Stajich et al. 2002) script. Ks was calculated using the yn00 program (Yang and Nielson 2000), which accounts for both the transition/transversion rate and codon usage biases.
Subcellular localization for individual proteins was predicted using Multiloc2 (Blum et al. 2009). Multiloc2 was used because it is capable of predicting localization in many subcellular compartments (specifically HighRes) and is trained for a greater variety of species groups. MultiLoc2 uses several subpredictors based on overall amino acid composition, identification of sorting signals, and detection of sequence motifs. Furthermore, the incorporation of phylogenetic profiles and GO (Gene Ontology) terms results in MultiLoc2 outperforming other comparable prediction systems in two benchmark studies done by Blum et al. (2009). One potential limitation in using MultiLoc2 is its use of homology. Such predictors could potentially cause erroneous subcellular predictions in pairs that have high similarity. Unfortunately, the performance of predictors, which ignore homology, would likely be inadequate for this type of study. To minimize these potential biases in MultiLoc2, we conducted the following survival analysis in two ways: 1) with all identified duplicate pairs and 2) with all duplicate pairs but those with a Ks < 0.01. By removing highly similar duplicate pairs, we remove those duplicates that are most likely to be problematic for MultiLoc2.
N-Terminal Mutations
The NTP mutations were categorized as having 1) base pair substitutions (None); 2) indels at the terminal end (Terminal); 3) internal indels (Partial); and 4) complete deletions (Complete). For each of the five eukaryotic groups (fungi, metazoan, algae, plants, and basal eukaryotes), we determined the total number of duplicate pairs, the total number of duplicate pairs with each of the four types of mutations, and then calculated the frequency of each. We also calculated the frequency of relocalization of each mutation type by dividing the total number of relocalized duplicate pairs for each mutation category by the total number of relocalized pairs for each of the five eukaryotic groups.
Statistical Analyses
The techniques of survival analysis include several parametric regression models (e.g., exponential, Weibull, log-logistic, and log-normal) and a semi-parametric model (Cox Proportional Hazard) to estimate the association between covariates and the distribution of the survival time or the response variable (Therneau and Grambsch 2000; Tableman and Kim 2004). The Cox Proportional Hazard model is currently the most widely used approach (Harrell 2001). In this study, we used the Cox proportional hazard model to compare the death rates associated with relocalized and nonrelocalized duplicate pairs (Therneau and Grambsch 2000). The model is defined as follows:
(3) |
where t represents survival time (Ks) of relocalized and nonrelocalized duplicate pairs, h0(t) is called the baseline hazard, β is a coefficient, and DP represents duplicate pairs status (DP = 0 or relocalized duplicate pairs and DP = 1 or nonrelocalized duplicate pairs). The β coefficient is estimated by maximizing the partial likelihood function introduced by Cox (1972). The hazard ratio for DP = 1 and DP = 0 is defined as hDP = 1(t)/hDP=0(t) = e1×β/e0×β = eβ. The hazard ratio or eβ > 1 represents that the death rate of nonrelocalized duplicate pairs is higher as compared with relocalized duplicate pairs. The analyses were performed using an open source statistical software R (R Development Core Team 2011). All estimates and confidence intervals were obtained using the coxph function available in the survival package (Tableman and Kim 2004).
The Cox proportional hazard model does not assume that the gene duplication rate or the birth rate is constant. The only assumption is that the hazard in the comparison group (nonrelocalized duplicate genes) is a constant proportion of the hazard in the reference group (relocalized duplicate genes). Graphical checks of the overall adequacy of the Cox proportional hazard model was performed using the Cox–Snell residuals plot (Tableman and Kim 2004). The plots show that the model gave a reasonable fit to the data and therefore the proportionality assumption of the model is satisfied.
Supplementary Material
Supplementary tables S1–S6 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).
Acknowledgments
The authors thank the following people and institutions for their assistance: A. Vilella (European Bioinformatics Institute) for advice on MySQL queries of the Ensembl databases; S. McKay (iPlant Collaborative; Cold Spring Harbor Laboratory) for assistance with bioinformatics and high performance computing; O. Kohlbacher for discussions on the use of MultiLoc2, and the NIH Fellows Editorial Board for editorial assistance. They thank the Broad and Genome Institutes (Washington University) for permission to use some of their unpublished genomes for this analysis. They also thank E. Koonin, M. Lynch, J. Zhang, R. Geeta, S. Phelan, and A. Fernandez for their helpful suggestions to this manuscript as well as the anonymous reviewers for their thoughtful, insightful comments. This work was supported by the Fairfield University’s Science Institute and, in part, by the Intramural Research Program of the National Institutes of Health, NLM ZIA LM000097-13.
Literature Cited
- Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815. doi: 10.1038/35048692. [DOI] [PubMed] [Google Scholar]
- Bannai H, Tamada Y, Maruyama O, Nakai K, Miyano S. Extensive feature detection of N-teminal protein sorting signals. Bioinformatics. 2002;18:298–305. doi: 10.1093/bioinformatics/18.2.298. [DOI] [PubMed] [Google Scholar]
- Baudisch B, Langner U, Garz I, Klösgen RB. The exception proves the rule? Dual targeting of nuclear encoded proteins into endosymbiotic organelles. New Phytol. Advance Access published September. 2013;20 doi: 10.1111/nph.12482. 2013, doi: 10.1111/nph.12482. [DOI] [PubMed] [Google Scholar]
- Bizily SP, Tehryung K, Muthugapatti KK, Meagher RB. Subcellular targeting of methylmercury lyase enhances its specific activity for organic mercury detoxification in plants. Plant Physiol. 2003;131:463–471. doi: 10.1104/pp.010124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blanc G, Wolfe KH. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell. 2004;16:1667–1678. doi: 10.1105/tpc.021345. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Blum T, Briesemeister S, Kohlbacher O. MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinformatics. 2009;10:274–285. doi: 10.1186/1471-2105-10-274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Byun McKay SA, Geeta R. Protein subcellular relocalization: a new perspective on the origin of novel genes. Trends Ecol Evol. 2007;22:338–344. doi: 10.1016/j.tree.2007.05.002. [DOI] [PubMed] [Google Scholar]
- Byun McKay SA, Geeta R, Duggan R, Carroll B, McKay SJ. Missing the subcellular target: a mechanism of eukaryotic gene evolution. In: Pontarotti P, editor. Evolutionary biology from concept to application II. New York: Springer-Verlag; 2009. pp. 175–183. [Google Scholar]
- Castellano E, Santos E. Functional specificity of ras isoforms. Genes Cancer. 2011;2:216–231. doi: 10.1177/1947601911408081. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cox DR. Regression models and life-tables. J R Stat Soc B. 1972;34:187–220. [Google Scholar]
- Flicek P, et al. Ensembl 2011. Nucleic Acids Res. 2011;39(Database issue):D800–D806. doi: 10.1093/nar/gkq1064. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Force A, et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999;15:1531–1545. doi: 10.1093/genetics/151.4.1531. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goodstein DM, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:1178–1186. doi: 10.1093/nar/gkr944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gu Z, Cavalcanti A, Chen FC, Bouman P, Li WH. Extent of gene duplication in the genomes of Drosophila, nematode and yeast. Mol Biol Evol. 2000;19:256–262. doi: 10.1093/oxfordjournals.molbev.a004079. [DOI] [PubMed] [Google Scholar]
- Gu X, Wang Y, Gu J. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat Genet. 2002;31:205–209. doi: 10.1038/ng902. [DOI] [PubMed] [Google Scholar]
- Haider S, et al. BioMart Central Portal—unified access to biological data. Nucleic Acids Res. 2009;37(Web Server issue):W23–W27. doi: 10.1093/nar/gkp265. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer Series in Statistics. New York: Springer-Verlag; 2001. [Google Scholar]
- Heilmann I, Pidkowich MS, Girke T, Shanklin J. Switching desaturase enzyme specificity by alternate subcellular targeting. Proc Natl Acad Sci U S A. 2004;101:10266–10271. doi: 10.1073/pnas.0402200101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hughes AL. The evolution of functionally novel proteins after gene duplication. Proc Biol Sci. 1994;256:119–124. doi: 10.1098/rspb.1994.0058. [DOI] [PubMed] [Google Scholar]
- Innan H, Kondrashov F. The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010;11:97–108. doi: 10.1038/nrg2689. [DOI] [PubMed] [Google Scholar]
- Kaiser CA, Preuss D, Grisafi P, Botstein D. Many random sequences functionally replace the secretion signal sequence of yeast invertase. Science. 1987;235:312–317. doi: 10.1126/science.3541205. [DOI] [PubMed] [Google Scholar]
- Kähäri A, et al. Ensembl genomes: extending Ensembl across the taxonomic space. Nucleic Acids Res. 2010;38:563–569. doi: 10.1093/nar/gkp871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kassahn KS, Dang VT, Wilkins SJ, Perkins AC, Ragan MA. Evolution of gene function and regulatory control after whole genome duplication: comparative vertebrates. Genome Res. 2009;19:1404–1418. doi: 10.1101/gr.086827.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kelley CA, et al. Xenopus nonmuscle myosin heavy chain isoforms have different subcellular localizations and enzymatic activities. J Cell Biol. 1996;134:675–687. doi: 10.1083/jcb.134.3.675. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kersey PJ, et al. Ensembl genomes: extending Ensembl across the taxonomic space. Nucleic Acids Res. 2010;38(suppl 1):D563–D569. doi: 10.1093/nar/gkp871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leissring MA, et al. Alternative translation initiation generates a novel isoform of insulin-degrading enzyme targeted to mitochondria. Biochem J. 2004;383:439–446. doi: 10.1042/BJ20041081. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li WH. Molecular evolution. Sunderland (MA): Sinauer Associates; 1997. [Google Scholar]
- Li WH, Gu Z, Wang H, Nekrutenko A. Evolutionary analysis of the human genome. Nature. 2001;409:847–849. doi: 10.1038/35057039. [DOI] [PubMed] [Google Scholar]
- Lu YL, Chia CY, Liu YW, Hou WC. Biological activities and applications of dioscorins, the major tuber storage proteins of yam. J Tradition Compl Med. 2012;2:41–46. doi: 10.1016/s2225-4110(16)30069-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lynch M, Conery J. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155. doi: 10.1126/science.290.5494.1151. [DOI] [PubMed] [Google Scholar]
- Marques A, Vinckenbosch N, Brawand D, Kaessmann H. Functional diversification of duplicate genes through subcellular adaptation of encoded proteins. Genome Res. 2008;9:R54. doi: 10.1186/gb-2008-9-3-r54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Merchant SS, et al. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science. 2007;318:245–250. doi: 10.1126/science.1143609. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moore RC, Purugganan MD. The early stages of gene duplication. Proc Natl Acad Sci U S A. 2003;100:15682–15687. doi: 10.1073/pnas.2535513100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moroney JV, Bartlett SG, Samuelsson G. Carbonic anhydrases in plants and algae. Plant Cell Environ. 2001;24:141–153. [Google Scholar]
- Ohno S. Evolution by gene duplication. New York: Springer-Verlag; 1970. p. 160. [Google Scholar]
- Pidoux G, Taskén K. Specificity and spatial dynamics of protein kinase A signaling organized by A-kinase-anchoring proteins. J Mol Endocrinol. 2010;44:271–284. doi: 10.1677/JME-10-0010. [DOI] [PubMed] [Google Scholar]
- Prochnik SE, et al. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri. Science. 2010;329:223–226. doi: 10.1126/science.1188800. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Qian W, Zhang J. Protein subcellular relocalization in the evolution of yeast singleton and duplicate genes. Genome Biol Evol. 2009;1:198–204. doi: 10.1093/gbe/evp021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- R Development Core Team. 2011 R: a language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing. [cited 2013 Dec 10]. Available at: http://www.R-project.org/ [Google Scholar]
- Regev-Rudzki N, Karniely S, Ben-Haim NN, Pines O. Yeast aconitase in two locations and two metabolic pathways: seeing small amounts is believing. Mol Biol Cell. 2005;16:4163–4171. doi: 10.1091/mbc.E04-11-1028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosso L, et al. Birth and rapid subcellular adaptation of a hominoid-specific CDC14 protein. PLoS Biol. 2008;6(6):e140. doi: 10.1371/journal.pbio.0060140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94. doi: 10.1093/protein/12.2.85. [DOI] [PubMed] [Google Scholar]
- Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618. doi: 10.1101/gr.361602. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tableman M, Kim JS. Survival analysis using S: analysis of time-to-event data. New York: Chapman and Hall/CRC; 2004. [Google Scholar]
- Therneau TM, Grambsch PT. Modeling survival data: extending the Cox model statistics for biology and health. 2000. New York: Springer. [Google Scholar]
- Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acid Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vilella AJ, et al. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–335. doi: 10.1101/gr.073585.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang X, Huang Y, Lavrov D, Gu X. Comparative study of human mitochondrial proteome reveals extensive protein subcellular relocalization after gene duplications. BMC Evol Biol. 2009;9:275. doi: 10.1186/1471-2148-9-275. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Z, Nielson R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 2000;17:32–43. doi: 10.1093/oxfordjournals.molbev.a026236. [DOI] [PubMed] [Google Scholar]
- Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol. 2003;18:292–298. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.