Abstract
Rapidly evolving proteins can aid the identification of genes underlying phenotypic adaptation across taxa, but functional and structural elements of genes can also affect evolutionary rates. In plants, the ‘edges’ of exons, flanking intron junctions, are known to contain splice enhancers and to have a higher degree of conservation compared to the remainder of the coding region. However, the extent to which these regions may be masking indicators of positive selection or account for the relationship between dN/dS and other genomic parameters is unclear. We investigate the effects of exon edge conservation on the relationship of dN/dS to various sequence characteristics and gene expression parameters in the model plant Arabidopsis thaliana. We also obtain lineage-specific dN/dS estimates, making use of the recently sequenced genome of Thellungiella parvula, the second closest sequenced relative after the sister species Arabidopsis lyrata. Overall, we find that the effect of exon edge conservation, as well as the use of lineage-specific substitution estimates, upon dN/dS ratios partly explains the relationship between the rates of protein evolution and expression level. Furthermore, the removal of exon edges shifts dN/dS estimates upwards, increasing the proportion of genes potentially under adaptive selection. We conclude that lineage-specific substitutions and exon edge conservation have an important effect on dN/dS ratios and should be considered when assessing their relationship with other genomic parameters.
Keywords: dN/dS, Arabidopsis thaliana, lineage-specific evolution, splice enhancer
Introduction
Rates of sequence evolution are known to vary between genes, particularly at non-synonymous sites (Bromham 2009). Various genomic parameters are significant predictors of dN/dS, an estimate of the rate of protein evolution corrected by the underlying rate of substitution at synonymous sites. In a substantial number of species, including Arabidopsis thaliana, expression level is considered the best predictor of dN/dS ratios (Akashi 2003; Krylov et al. 2003; Wright et al. 2004; Drummond et al. 2005; Cherry 2010a), alongside expression breadth (an estimate of the proportion of tissues in which a gene is expressed) (Duret & Mouchiroud 2000; Winter et al. 2004; Zhang & Li 2004; Park & Choi 2010). Other variables, including codon usage bias (Urrutia & Hurst 2001; Xia et al. 2009), GC content (Ticher & Graur 1989; Cherry 2010b), protein multi-functionality (Hahn & Kern 2005; Podder et al. 2009), the number of interacting partners per protein (Fraser & Hirsh 2004; Makino & Gojobori 2006; Wang & Lercher 2011), recombination rate (Pál et al. 2001; Wright et al. 2006), gene/protein length (Coghlan & Wolfe 2000; Urrutia & Hurst 2003; Lemos et al. 2005; Stoletzki & Eyre-Walker 2007) and both intron number and length (Seoighe et al. 2005; Tang et al. 2006; Larracuente et al. 2008) have all been associated with dN/dS ratios.
Variations in dN/dS are thought to stem primarily from gene-specific selective pressures related to the functionality of their protein products (Tennessen 2008). As such, dN/dS is often used to identify those genes likely to be involved in adaptation (Yang & Bielawski 2000; Hurst 2002; Nielsen 2005). Determining which genes are under selection is important for understanding how genetic diversity is maintained and the relative importance of opposing selective forces in shaping a species’ genetic diversity.
As with many species, plant genes predominantly evolve under purifying selection (Gossmann et al. 2010), with low estimates for the number of positively selected genes in sorghum (Hamblin et al. 2006), maize (Ross-Ibarra et al. 2009), A. thaliana (Schmid et al. 2005; Slotte et al. 2011) and A. lyrata (Foxe et al. 2008). In these studies, dN/dS was calculated from pairwise alignments resulting in ratios which are a composite of substitutions in both lineages compared since their divergence from their last common ancestor. Using an outgroup species allows the calculation of lineage-specific dN/dS (Arbiza et al. 2006; Bakewell et al. 2007; Kawahara & Imanishi 2007; Weedall et al. 2008; Parmakelis et al. 2010; Toll-Riera et al. 2011) which could unmask further genes with species-specific signatures of positive selection and/or potentially stronger associations between certain genomic characteristics and the rate of sequence evolution. The model plant Arabidopsis thaliana is an ideal organism for investigating genomewide signatures of selection in the plant taxa as a sister species, Arabidopsis lyrata (with an estimated 13 mya divergence time from A. thaliana), has been sequenced (Hu et al. 2011). Thellungiella parvula [43 mya divergence from A. thaliana (Dassanayake et al. 2011)] provides a suitable outgroup for assessing lineage-specific sequence evolution. The availability of multiple A. thaliana genomes (Cao et al. 2011; Gan et al. 2011) enables the assessment of intraspecific diversity, which can be used to estimate deviations from a neutral expectation based on both sequence divergence and intraspecific variation, such as the neutrality index (NI) (Haldane 1956).
The interpretation of dN/dS and NI estimates assumes that synonymous substitutions are mostly evolving under neutral or nearly neutral conditions and are a proxy of the underlying mutation rate. However, exon sequences can contain exonic splicing enhancers (ESEs), sequence motifs involved in both constitutive and regulated splicing by facilitating the assembly of splicing complexes (Tacke & Manley 1999; Blencowe 2000; Zheng 2004). ESEs are enriched in the vicinity of splice sites, particularly downstream of a splice acceptor, with their peak abundance increasing closer to an exon–intron boundary (Wu et al. 2005). As higher conservation in this region, including at synonymous sites, can reflect differential patterns of codon usage (Comeron & Guthrie 2005; Parmley & Hurst 2007; Warnecke & Hurst 2007; Caceres & Hurst 2013) and affect the overall dN/dS estimate per gene (Carlini & Genut 2006; Parmley et al. 2006), this may influence relationships between dN/dS and various genomic parameters, particularly in compact, intron-rich genomes. In A. thaliana's genome, 75% of the genes are multi-exonic, 29% of the exons are below 100 bp, the median exon length is 53 codons, and ESE hexamers have been identified (Pertea et al. 2007). Thus, ESE conservation could have a strong impact on estimates of dN/dS, and consequently on estimates of the relative contribution of positive and purifying selection to A. thaliana genome evolution.
It is not yet known, however, how dN/dS estimates in plants are influenced either by increased conservation at exon edges or by the introduction of an outgroup species to obtain lineage-specific estimates, nor how either factor may affect the covariance between the rate of sequence evolution and genomic parameters previously shown to be significant predictors of NI and/or dN/dS.
Here, we address this issue by examining coding sequence evolution in A. thaliana, with A. lyrata and T. parvula as comparison species. We investigate whether the calculation of lineage-specific sequence evolutionary rate and/or the removal of exon edges (i) unmasks a larger proportion of genes with signatures of selection, (ii) alters the relationship between expression level and evolutionary rate and (iii) alters the association between dN/dS and other structural and functional parameters previously identified as dN/dS correlates in one or more other species.
Materials and methods
Genome sequences and gene annotations
Exon coordinates for A. thaliana strain Col-0 were obtained from The Arabidopsis Information Resource (TAIR) (ftp://ftp.arabidopsis.org/, file ‘TAIR10_exon_20101028’, downloaded 15 February 2013). The A. lyrata genome (Hu et al. 2011), strain MN47 (Entrez genome project ID 41137), was obtained from GenBank (http://www.ncbi.nlm.nih.gov/nuccore/ADBK00000000, downloaded 17 October 2012). The T. parvula genome, version 2.0 (Dassanayake et al. 2011), was obtained from http://thellungiella.org/blast/db/TpV8-4.fa (downloaded 17 October 2012).
Other data sources
Codon usage bias per gene was expressed both as the effective number of codons (ENC) (Wright 1990) and as the frequency of optimal codons (Fop) (Ikemura 1981). The number of protein–protein interactions (PPIs) per gene was obtained from BioGRID, version 3.1.75 (Stark et al. 2006, 2011). Recombination data were obtained from Marais et al. (2004); this variable is used as a control because no significant relationship between recombination and dN/dS is expected in an effectively obligate selfer. A gene's degree of multi-functionality was measured as the number of GOslim terms assigned to it for biological processes. ‘GOslim’ is a condensed set of gene ontology (GO) categories, obtained from TAIR (ftp://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.txt, downloaded 8 October 2013) (Berardini et al. 2004). The majority of GOslim terms (∼87%) are derived from curated experimental or computational evidence, rather than being inferred from sequence similarity, which can result in higher false prediction rates (Jones et al. 2007). All raw data used in this study are available in Table S1 (Supporting information).
Tests of sequence evolution and selection
Two measures of the degree and direction to which A. thaliana sequences diverge from a neutral expectation were calculated – a neutrality index (NI) and dN/dS. Calculations require data on the number of polymorphic and diverged residues in each sequence. To obtain the former, we used single nucleotide polymorphism (SNP) data obtained after aligning 17 fully sequenced and independently assembled accessions against the Col-0 reference genome (Gan et al. 2011) (data from Po-0 were not used as it has both unusually high heterozygosity and similarity to Oy-0). Diverged positions were identified from pairwise alignments of A. thaliana against both A. lyrata and T. parvula. Alignments were made for 21 198 genes against A. lyrata and 10 289 genes against T. parvula, of which 7086 genes could be aligned against both. Alignments were first obtained for exons in the longest available transcript per A. thaliana gene, using blastn (Altschul et al. 1990) with default parameters and a significance threshold of 1e−10. These were refined by applying the Smith–Waterman algorithm to the best blastn hit [fasta36.3.5d with parameters –A –a (require alignments to use entire sequence)] (Pearson 2000). The resulting alignments were then concatenated to create a single sequence alignment per gene. To ensure the alignment was in-frame, the translated A. thaliana sequence was aligned against either the A. lyrata or T. parvula sequence using tblastn (default parameters and significance threshold 1e−10).
For genes with at least 150 aligned bases, dN/dS was estimated from the concatenated sequences using the Yang and Nielsen model, as implemented in the yn00 program of PAML (Yang 2007). These estimates are referred to as ‘pairwise’ dN/dS. We also calculated a lineage-specific estimate of dN/dS using the extremophile crucifer Thellungiella parvula (Dassanayake et al. 2011) as an outgroup, according to the method of Toll-Riera et al. (2011). First, we identified those T. parvula genes with detectable homology to an A. thaliana gene for >50% of the CDS length of the longest Col-0 transcript (blastn with default parameters). Multiple sequence alignments between the CDS of an A. thaliana gene, its A. lyrata orthologue (if extant) and the homologous sequence in T. parvula were made using PRANK (Löytynoja & Goldman 2008). dN/dS was calculated using the codeml program of PAML (Yang 2007), with the equilibrium codon frequencies of the model used as free parameters (CodonFreq = 3). These data were filtered to remove sequences less than 150 bp in length or with branches showing either dS < 0.02, dS > 2 or dN > 2, as these are either unreliable for estimates of the dN/dS ratio, non-bona fide orthologues or otherwise saturated with substitutions (Löytynoja & Goldman 2008). We assumed an unrooted tree topology of [(A. thaliana, A. lyrata), T. parvula].
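The filtering thresholds quoted above can be illustrated with a short sketch. It assumes that per-branch dN and dS estimates have already been parsed from the codeml output into simple records; the `BranchEstimate` type and the `passes_filters` helper are hypothetical names introduced here for illustration and are not part of PAML.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BranchEstimate:
    """Hypothetical record holding codeml estimates for one branch of one gene."""
    dN: float
    dS: float


def passes_filters(aligned_bp: int, branches: List[BranchEstimate],
                   min_len: int = 150, min_dS: float = 0.02,
                   max_dS: float = 2.0, max_dN: float = 2.0) -> bool:
    """Return True if a gene survives the length and divergence thresholds."""
    # Sequences shorter than 150 bp are discarded.
    if aligned_bp < min_len:
        return False
    # Any branch with dS < 0.02, dS > 2 or dN > 2 disqualifies the gene.
    for branch in branches:
        if branch.dS < min_dS or branch.dS > max_dS or branch.dN > max_dN:
            return False
    return True
```

The default thresholds are those stated in the text; applying them per branch (rather than per gene) follows the wording "branches showing either".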
The neutrality index for each sequence, NI, was calculated as NI = log{[(2Ds + 1)(2Pn + 1)]/[(2Dn + 1)(2Ps + 1)]}, where Ds and Dn are the numbers of synonymous and non-synonymous substitutions, and Ps and Pn are the numbers of synonymous and non-synonymous polymorphisms (Haldane 1956). NI values can be tested with the null hypothesis of neutrality, that the ratios of intra- and interspecies non-synonymous to synonymous variation are equal. Positive selection is inferred when interspecies exceeds intraspecies variation – adaptive mutations spread throughout a population rapidly and so affect the number of observed substitutions (i.e. divergence), but not the number of polymorphisms (Egea et al. 2008). NI can thus be interpreted in the same manner as a McDonald–Kreitman test for comparing the ratio of fixed to within-species differences: its symmetrical distribution allows the inference of purifying selection when NI > 0 and positive selection when NI < 0 (McDonald & Kreitman 1991).
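As a worked illustration of the neutrality index defined above, the following sketch computes NI from the four counts. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name `neutrality_index` is introduced only for this example.

```python
import math


def neutrality_index(Dn: int, Ds: int, Pn: int, Ps: int) -> float:
    """Log-transformed neutrality index with the +1 continuity correction.

    Dn, Ds: non-synonymous and synonymous substitutions (divergence).
    Pn, Ps: non-synonymous and synonymous polymorphisms.
    Assumption: natural logarithm (the base is not given in the text).
    """
    numerator = (2 * Ds + 1) * (2 * Pn + 1)
    denominator = (2 * Dn + 1) * (2 * Ps + 1)
    return math.log(numerator / denominator)
```

With an excess of non-synonymous divergence (for example Dn = 10, Ds = 5, Pn = 2, Ps = 5) the value is negative, matching the interpretation of NI < 0 as a signal of positive selection; the +1 terms keep the ratio defined when any count is zero.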
Exon edge trimming
To assess the effect of exon edge conservation on rates of sequence evolution, we removed up to 30 codons from the edges of each A. thaliana exon that could be fully aligned against the A. lyrata or T. parvula genome with an alignment both in-frame and a multiple of three in length. Exons were then concatenated, and genes with sequences of at least 150 bp after trimming constituted ‘trimmed’ subsets of, at minimum, 1443 genes (i.e. those for which all 30 codons can be removed) and 174 genes, for alignments against A. lyrata and T. parvula, respectively. All analyses comparing ‘trimmed’ and ‘untrimmed’ sequences use the same set of exons per gene. A supplemental file containing both the raw alignments and evolutionary rate estimates for all data sets is available at the DRYAD repository (http://dx.doi.org/10.5061/dryad.905sq).
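The trimming procedure can be sketched as below. This is a minimal illustration under stated assumptions: k codons are removed from each end of every exon, a gene is dropped if any of its exons is too short to lose k codons from both ends, and the helper names `trim_exon` and `trimmed_gene` are hypothetical. The text does not spell out how exons that are too short are handled, so that rule in particular is an assumption.

```python
from typing import List, Optional


def trim_exon(exon: str, k: int) -> Optional[str]:
    """Drop k codons from each end of an in-frame exon; None if too short."""
    if len(exon) % 3 != 0:
        raise ValueError("exon length must be a multiple of three")
    cut = 3 * k
    # Assumption: an exon must retain at least one base after trimming.
    if len(exon) <= 2 * cut:
        return None
    return exon[cut:len(exon) - cut]


def trimmed_gene(exons: List[str], k: int, min_len: int = 150) -> Optional[str]:
    """Trim every exon, concatenate, and keep the gene only if >= min_len bp."""
    parts = []
    for exon in exons:
        trimmed = trim_exon(exon, k)
        if trimmed is None:
            return None  # gene excluded: k codons cannot be removed from this exon
        parts.append(trimmed)
    sequence = "".join(parts)
    return sequence if len(sequence) >= min_len else None
```

Calling `trimmed_gene` for k = 1 to 30 on the same set of exons gives the series of progressively trimmed sequences from which dN/dS can be re-estimated.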
Randomization test
Estimates of dN, dS, dN/dS and NI vary when codons are removed from the edges of exons, suggesting that the strength of selection differs in these regions. To assess whether the difference is indeed due to the nature of exon edges or due to codon removal, we created a parallel set of estimates of dN, dS, dN/dS and NI after random codon removal (s = 1000 randomizations per gene) for comparison. A numerical P-value was calculated as follows: letting q be the number of times the ‘sequential removal’ estimate of dN, dS or dN/dS was higher than the ‘random removal’ estimate (or lower, in the case of NI), then p = ((s − q) + 1)/(s + 1). As variable estimates of dN, dS, dN/dS and NI can in turn alter the correlation strength with predictors of evolutionary rate (such as expression level), the above test was also repeated using estimates of Spearman's rho for both the ‘sequential removal’ and ‘random removal’ conditions.
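The randomization test can be sketched as follows, reading the denominator of the P-value as (s + 1), the usual form for an empirical P-value. The function names are hypothetical, and the estimates themselves (dN, dS, dN/dS or NI for each randomized sequence) are assumed to have been computed elsewhere, for example with PAML.

```python
import random
from typing import List


def remove_random_codons(seq: str, n: int, rng: random.Random) -> str:
    """Delete n codons chosen at random positions from an in-frame sequence."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    drop = set(rng.sample(range(len(codons)), n))
    return "".join(c for i, c in enumerate(codons) if i not in drop)


def randomization_p_value(sequential_estimate: float,
                          random_estimates: List[float],
                          higher: bool = True) -> float:
    """Empirical P-value p = ((s - q) + 1) / (s + 1).

    q counts how often the 'sequential removal' estimate exceeds the
    'random removal' estimate (higher=True, for dN, dS, dN/dS) or falls
    below it (higher=False, for NI).
    """
    s = len(random_estimates)
    if higher:
        q = sum(1 for r in random_estimates if sequential_estimate > r)
    else:
        q = sum(1 for r in random_estimates if sequential_estimate < r)
    return ((s - q) + 1) / (s + 1)
```

With s = 1000 randomizations, the smallest attainable value is 1/1001, reached when the sequential estimate is more extreme than every random estimate.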
Expression data
Three independent sources of A. thaliana transcript abundance data were used: (i) the Arabidopsis Development Atlas (ADA), representing 79 tissues, generated by the AtGenExpress Consortium (Schmid et al. 2005) [NASCARRAYS reference numbers 150–154, (http://affymetrix.arabidopsis.info/, downloaded 7 November 2011)]. Expression level was quantified both as the maximum absolute gcRMA [robust multi-array analysis corrected for the GC content of the oligo (Wu et al. 2004)] across all tissue types (after clustering the data into seven types – root, stem, seed, leaf, flower, pollen and apex) (Slotte et al. 2011) and as the average across all 79 tissues (Yang & Gaut 2011). Expression breadth was calculated from this database as both the number of tissues in which a gene is expressed and the tissue specificity index (tau), a scalar measure bounded between 0 (for housekeeping genes) and 1 (for genes expressed in a single tissue) (Yanai et al. 2005). (ii) Massively parallel signature sequencing (MPSS) data (Brenner et al. 2000; Meyers et al. 2004; Nakano et al. 2006) – which quantifies gene expression by counting short (17–20 bp) mRNA-derived tags – representing five tissues (http://mpss.udel.edu/at/mpss_index.php, downloaded 28 March 2011). Expression level was quantified as either the average (Yang 2009) or the maximum number of tags across all tissues (Foxe et al. 2008). (iii) RNA-seq transcript abundance data, where expression levels were taken as absolute read values corrected by sequence length (Gan et al. 2011). In addition to the indices of expression obtained from each data set, all three estimates of transcript abundance (MPSS, ADA and RNA-seq) were transformed into Z-scores (Cheadle et al. 2003) to allow direct comparisons between them. In addition, the weighted average of two sets of A. thaliana protein abundance data was obtained for a total of 19 761 genes (pax-db.org, downloaded 15 February 2013) (Baerenfaller et al. 2008; Castellana et al. 2008). These data sets use tandem mass spectrometry to quantify protein abundance by spectral counting.
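The tissue specificity index (tau) mentioned above has a compact definition, tau = Σ(1 − x_i/x_max)/(n − 1), where x_i is the expression level in tissue i and n is the number of tissues. The sketch below applies that formula directly; any preprocessing of the expression values (such as log transformation or thresholding) is not described in the text and is left out here.

```python
from typing import Sequence


def tau(expression: Sequence[float]) -> float:
    """Tissue specificity index: 0 for uniform expression, 1 for one tissue."""
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs expression values for at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        raise ValueError("gene is not expressed in any tissue")
    # Sum the shortfall of each tissue relative to the most expressing tissue.
    return sum(1 - x / x_max for x in expression) / (n - 1)
```

A gene expressed at the same level everywhere gives tau = 0, and a gene detected in a single tissue gives tau = 1, matching the bounds stated in the text.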
Alternative splicing
Alternative splicing indices were calculated as described in Chen et al. (2014). In brief, alternative splicing events were identified by comparing mapping coordinates from EST data [obtained from dbEST (Boguski et al. 1993); ftp://ftp.ncbi.nih.gov/repository/dbEST, downloaded 1 May 2011] to the genome sequence. To avoid biases introduced by differential transcript coverage between genes (Kim et al. 2007; Nilsen & Graveley 2010; Chen et al. 2012), we used a transcript number normalization method (Kim et al. 2007) whereby the number of alternative splicing events per gene is calculated as the average number of events detected using 100 random samples of 10 mapped ESTs.
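The transcript number normalization can be sketched as a resampling loop. The representation used here is an assumption made for illustration: each mapped EST is reduced to the set of alternative splicing events it supports, and genes with fewer than 10 mapped ESTs are skipped. How events are called from EST coordinates is not shown, and `normalized_as_events` is a hypothetical name.

```python
import random
from typing import List, Optional, Set


def normalized_as_events(est_events: List[Set[str]],
                         n_samples: int = 100,
                         sample_size: int = 10,
                         rng: Optional[random.Random] = None) -> Optional[float]:
    """Average number of distinct events seen in random samples of ESTs.

    est_events: one set of event identifiers per mapped EST (assumed format).
    Returns None when the gene has fewer ESTs than sample_size.
    """
    rng = rng or random.Random()
    if len(est_events) < sample_size:
        return None
    total = 0
    for _ in range(n_samples):
        sample = rng.sample(est_events, sample_size)
        detected = set().union(*sample)  # distinct events in this sample
        total += len(detected)
    return total / n_samples
```

Sampling a fixed number of ESTs per gene keeps deeply sequenced genes from appearing more alternatively spliced simply because more transcripts were observed.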
Correlations between evolutionary rates and functional and structural gene characteristics
To determine whether, and to what extent, any functional and structural variables affect a gene's dN/dS and NI estimates, various correlation analyses were performed. All analyses were conducted in R (R Development Core Team 2012). Initially, all correlations were assessed using Spearman's rho. However, as many of the variables found to be significantly associated with dN/dS are themselves covariates of expression level (the strongest correlate of dN/dS), it is possible that some parameters co-vary with dN/dS as a by-product of their relationship with expression level. As such, to better understand the relative contribution of genomic features to dN/dS, we assessed the relationship between individual parameters and dN/dS after controlling for the effect of expression level, using partial Spearman's correlation coefficients [R package ‘ppcor’ (Kim & Yi 2006, 2007)]. In addition, to test whether correlation strengths for dN/dS with any given genomic feature differ between pairwise and lineage-specific dN/dS estimates, we assessed statistical significance using a t-test on the Z-transformed values of rho, as implemented by the paired.r method of the R package ‘psych’ (Revelle 2014).
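The idea of a partial Spearman correlation, controlling for a single variable such as expression level, can be sketched in plain Python. This is not a reimplementation of the ‘ppcor’ package: it uses the standard first-order partial correlation formula applied to rank correlations, and all function names are introduced here for illustration.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def partial_spearman(x: Sequence[float], y: Sequence[float],
                     z: Sequence[float]) -> float:
    """Rank correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Here `x` would be a genomic feature, `y` the dN/dS estimates and `z` the expression level; a partial coefficient close to the raw rho indicates that the feature's association with dN/dS is not simply a by-product of expression.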
Results
Correlates of dN/dS and NI in A. thaliana
Expression level and breadth [calculated using RNA-seq data (Gan et al. 2011)] were significant predictors of dN/dS and NI (calculated from pairwise alignments between A. thaliana and A. lyrata) and were in fact their strongest predictors compared to other variables (Table 1 and Table S2 in Supporting information). Similar results were obtained when using independent expression-level estimates from two alternative platforms, microarrays and MPSS as well as when applying four normalization procedures previously used for each set of estimates (Table S2 in Supporting information).
Table 1.
Correlation strength of dN/dS and NI with different variables in A. thaliana, after alignment against A. lyrata, T. parvula or both
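Variable | dN/dS (alignments of A. thaliana with A. lyrata) | NI (alignments of A. thaliana with A. lyrata) | dN/dS (alignments of A. thaliana with T. parvula) | NI (alignments of A. thaliana with T. parvula) | dN/dS (alignments of A. thaliana with both A. lyrata and T. parvula) | NI (alignments of A. thaliana with both A. lyrata and T. parvula) |
---|---|---|---|---|---|---|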
Average exon length | 0.103 | −0.026 | 0.017 | 0.045 | −0.141 | −0.043 |
Average intron length | −0.070 | 0.043 | −0.052 | 0.061 | −0.027 | 0.088 |
Gene length | −0.243 | 0.092 | −0.067 | −0.047 | −0.169 | 0.044 |
Primary transcript length | −0.243 | 0.092 | −0.067 | −0.047 | −0.170 | 0.043 |
Protein length | −0.124 | 0.050 | −0.015 | −0.060 | −0.186 | −0.034 |
Total exon length | −0.203 | 0.075 | −0.066 | −0.039 | −0.200 | 0.005 |
Total intron length | −0.228 | 0.086 | −0.056 | −0.041 | −0.021 | 0.089 |
UTR length (5′) | −0.183 | 0.032 | −0.131 | 0.003 | −0.01 | 0.035 |
UTR length (3′) | −0.122 | 0.053 | −0.070 | 0.040 | −0.051 | 0.086 |
Expression breadth | −0.399 | 0.120 | −0.284 | 0.117 | −0.130 | 0.232 |
Exp. level (RNA-seq) | −0.415 | 0.145 | −0.285 | 0.117 | −0.143 | 0.217 |
Protein abundance | −0.302 | 0.078 | −0.241 | 0.095 | −0.086 | 0.194 |
Tissue specificity (tau) | 0.277 | −0.088 | 0.210 | −0.092 | 0.128 | −0.175 |
Effective number of codons | 0.059 | −0.016 | 0.065 | −0.035 | 0.064 | −0.043 |
Frequency of optimal codons | −0.194 | 0.065 | −0.187 | 0.116 | −0.069 | 0.176 |
GC (%) | −0.009 | 0.036 | −0.057 | 0.081 | −0.110 | 0.038 |
Intron density | −0.158 | 0.048 | −0.022 | −0.052 | 0.026 | 0.064 |
Total no. of introns | −0.212 | 0.071 | −0.038 | −0.069 | −0.014 | 0.062 |
Multifunctionality | −0.132 | −0.013 | −0.137 | −1.18 × 10⁻⁴ | −0.045 | −0.012 |
No. of protein–protein interactions | −0.060 | 0.031 | −0.084 | 0.069 | −0.113 | 0.152 |
Recombination rate | 0.007 | −0.058 | −0.011 | −0.019 | 0.026 | −0.022 |
All values shown are correlation strengths, as Spearman's rho. All values are statistically significant at P < 0.05, except for those underlined.
When dN/dS and NI estimates were obtained from alignments of A. thaliana and T. parvula, a more distant relative of A. thaliana than A. lyrata, expression level and breadth remained the strongest predictors of both dN/dS and NI, albeit with comparatively weaker correlation strengths in the case of dN/dS (but equivalent correlation strengths for NI) (Table 1 and Table S2 in Supporting information). In general, other variables such as gene length and codon usage bias explain progressively smaller proportions of dN/dS and NI variance, in an equivalent order to that found using A. thaliana–A. lyrata estimates. The association of dN/dS and NI with any non-expression-related genomic parameter is not fully accounted for by that parameter's association with gene expression: after the effect of expression level was removed using partial Spearman's correlation coefficients, all significant correlates of dN/dS and NI remained so (Table 2 and Table S3 in Supporting information).
Table 2.
Partial correlations of dN/dS and 11 evolutionary rate predictors in A. thaliana, after controlling for expression level
Variable | Alignments of A. thaliana with A. lyrata | Alignments of A. thaliana with T. parvula | Alignments of A. thaliana with both A. lyrata and T. parvula |
---|---|---|---|
Average exon length | 0.077 | 0.008 | −0.151 |
Average intron length | −0.037 | −0.037 | −0.014 |
Gene length | −0.155 | −0.051 | −0.156 |
Protein length | −0.093 | −0.029 | −0.196 |
Total exon length | −0.124 | −0.052 | −0.191 |
Total intron length | −0.148 | −0.039 | 0.002 |
Total no. of introns | −0.136 | −0.021 | 0.007 |
Frequency of optimal codons | −0.130 | −0.121 | −0.045 |
Expression breadth | −0.220 | −0.148 | −0.055 |
Protein abundance | −0.126 | −0.112 | −0.008 |
No. of protein–protein interactions | −0.108 | −0.118 | −0.118 |
All values shown are partial correlation strengths, as Spearman's rho. All values are statistically significant at P < 0.05, except those underlined.
Accounting for exon edge conservation influences dN/dS and its relationship with various genomic parameters, and unmasks higher levels of positive selection
Using pairwise alignments of A. thaliana against either A. lyrata or T. parvula, we find that codon removal at the edges of exons results in increased dN, dS and dN/dS estimates when compared to estimates made after random codon removal from any position in the sequence (Fig. 1 and Table S4 in Supporting information). This is observed irrespective of whether 10, 20 or 30 codons are removed (Table S4 in Supporting information). Estimates of NI were found to decrease after codon removal from the exon edges compared to random codon removal, also suggesting a weakening in the departure of sequence evolution from a neutral expectation (Fig. 1 and Table S4 in Supporting information). These patterns are consistent with exon edges being under selective constraint, having fewer non-synonymous substitutions than sequence elsewhere in the gene. In general, exon edge removal shifts dN/dS values towards a range indicative of either stronger positive or relaxed purifying selection, with an overall increase in the proportion of genes potentially under adaptive selection (Table 3 and Table S5 in Supporting information).
Fig. 1.
dN, dS, dN/dS and NI after exon edge removal. dN/dS (a), dN (b), dS (c) and NI (d) for a sample of 1443 genes with at least one fully alignable exon between A. thaliana and A. lyrata, after removing one codon at a time from exon edges (black), to a maximum of 30. The effects of random codon removal are shown in red. Distributions significantly differ when 30 codons are removed sequentially, but not randomly, compared to when no codons are removed. For sequential removal vs. no removal, Kruskal–Wallis P = 0.02 (dN/dS) and < 2.2 × 10⁻¹⁶ (NI). For random removal vs. no removal, Kruskal–Wallis P = 0.08 (dN/dS) and 0.49 (NI).
Table 3.
Exon edge removal shifts dN/dS values towards a range indicative of either stronger positive or relaxed purifying selection, with the proportion of genes potentially under adaptive selection increased
Dataset | Max. no. of codons removed from each gene | No. of genes | % of genes with dN/dS >1 (no codons removed) | % of genes with dN/dS >1 (after sequential codon removal) | % of genes with dN/dS >1 (after random codon removal) | χ² (chi-square test) | P (chi-square test) |
---|---|---|---|---|---|---|---|
Alignments of A. thaliana against A. lyrata | 10 | 3213 | 1.81 | 2.4 | 1.81 | 11.25 | 7.96 × 10⁻⁴ |
Alignments of A. thaliana against A. lyrata | 20 | 2041 | 1.62 | 2.45 | 1.71 | 6.43 | 0.011 |
Alignments of A. thaliana against A. lyrata | 30 | 1443 | 1.39 | 2.43 | 1.39 | 6.22 | 0.013 |
Alignments of A. thaliana against T. parvula | 10 | 779 | 0.64 | 1.67 | 0.77 | 8.17 | 4.27 × 10⁻³ |
Alignments of A. thaliana against T. parvula | 20 | 350 | 0.29 | 1.43 | 0.29 | 16.00 | 6.33 × 10⁻⁵ |
Alignments of A. thaliana against T. parvula | 30 | 174 | 0 | 2.87 | 0 | NA | NA |
To understand the effect of higher conservation at the exon edges on the relationships between dN/dS and other genomic parameters, we then re-analysed the correlations. We found that the correlation strength of dN/dS with several genomic features – in particular, expression level and expression breadth – decreased after the removal of exon edges. In contrast, we observed only marginal changes to these correlation coefficients after removing an equivalent number of codons from random positions (Fig. 2 and Table S6 in Supporting information). This suggests that, after the removal of exon edges, the decreased correlation strength between dN/dS and genomic parameters is not explained by increased noisiness resulting from the use of shorter sequences to estimate dN/dS. It also suggests that a dN/dS-based test of selection is most acute for more highly expressed genes and that stronger correlations of dN/dS with their various characteristics reflect the stronger constraints upon them. Furthermore, when considering NI, several variables including expression level, expression breadth, the total number of introns and various measures of gene length become marginally, but significantly, stronger predictors of NI (Table S6 in Supporting information). Nevertheless, the relative order of these parameters as predictors of dN/dS remains largely unchanged with expression level still the dominant predictor.
Fig. 2.
Variables that have a significantly different correlation with dN/dS after the sequential removal of 30 codons from exon edges, compared to random codon removal. The four variables shown – expression breadth, expression level, tau and GC content – are those which have significantly different estimates of rho for their correlation with dN/dS before and after codon removal. Two criteria are met for each variable: that rho is significantly different after sequential, compared to random codon removal, and that rho is significantly different after sequential, compared to no codon removal. Estimates of dN/dS are made using alignments of A. thaliana against A. lyrata. Data for this figure, including P-values and sample sizes, are shown in Table S6 (Supporting information).
Reduced prominence of gene expression as a predictor of A. thaliana's lineage-specific dN/dS
Lineage-specific dN/dS estimates derived from multiple alignments of A. thaliana genes with A. lyrata and T. parvula resulted in a marked decrease in the correlation between dN/dS and various genomic parameters including expression level and expression breadth, with total exon length becoming the strongest correlate of dN/dS (Table 1 and Table S2 in Supporting information).
To rule out the possibility that reductions in both the absolute and relative strength of the correlation between dN/dS and gene expression when examining lineage-specific changes may be explained by differences in the gene/codon set tested, we recalculated pairwise dN/dS for A. thaliana against A. lyrata and T. parvula using only those codons common to the multiple alignments of A. thaliana, A. lyrata and T. parvula (i.e. those used to estimate lineage-specific dN/dS; Table S7 in Supporting information). This analysis confirmed that when the same codons are analysed, lineage-specific dN/dS estimates have markedly weaker correlations with numerous genomic features compared to either pairwise estimate (Table 4 and Table S8 in Supporting information).
Table 4.
Correlates of dN/dS using estimates derived from codons common to the alignment of A. thaliana, A. lyrata and T. parvula
Variable | Alignments of A. thaliana with A. lyrata | Alignments of A. thaliana with T. parvula | Alignments of A. thaliana with both A. lyrata and T. parvula |
---|---|---|---|
Average exon length† | −0.040 | −0.040 | −0.106 |
Average intron length | −0.080 | −0.060 | −0.002 |
Gene length | −0.146 | −0.115 | −0.154 |
Primary transcript length | −0.146 | −0.115 | −0.154 |
Protein length† | −0.107 | −0.089 | −0.177 |
Total exon length | −0.139 | −0.114 | −0.182 |
Total intron length | −0.088 | −0.072 | −0.040 |
UTR length (5′) | −0.012 | −0.007 | 0.045 |
UTR length (3′) | −0.066 | −0.056 | −0.017 |
Expression breadth† | −0.286 | −0.317 | −0.182 |
Exp. level (RNA-seq)† | −0.256 | −0.284 | −0.144 |
Protein abundance† | −0.198 | −0.225 | −0.102 |
Tau (tissue specificity)† | 0.214 | 0.239 | 0.124 |
Effective number of codons† | 0.116 | 0.123 | 0.051 |
Frequency of optimal codons† | −0.142 | −0.189 | −0.051 |
GC (%) | −0.054 | −0.076 | −0.087 |
Intron density | −0.050 | −0.056 | −0.020 |
Total no. of introns | −0.079 | −0.067 | −0.045 |
Multifunctionality | −0.060 | −0.038 | −0.036 |
Protein–protein interactions | −0.127 | −0.153 | −0.098 |
Recombination rate | −0.001 | −0.063 | 0.041 |
Correlation strengths are shown as Spearman's rho. All values are statistically significant at P < 0.05, except those underlined. The rightmost column shows lineage-specific dN/dS estimates.
† Significantly different correlation strength when using lineage-specific dN/dS estimates compared to pairwise estimates.
We further found that using lineage-specific substitution patterns markedly reduces the number of genes with dN/dS > 1 (21 genes have dN/dS > 1, 0.3% of the sample analysed) when compared to pairwise alignments of A. thaliana with A. lyrata (423 genes have dN/dS > 1, 2% of the sample analysed; chi-square P < 2.2 × 10⁻¹⁶), although not when compared to alignments with T. parvula (41 genes have dN/dS > 1, 0.4% of the sample analysed, chi-square P = 0.327). In summary, when examining lineage-specific dN/dS estimates, the prominence of gene expression is diminished, and total exon length becomes the strongest correlate (Table 1). This pattern is not explained by variations in the sample of genes/codons used for the analyses. Importantly, we observed no evidence that the use of lineage-specific dN/dS estimates unmasks any additional signatures of positive selection compared to pairwise alignments.
Discussion
Selective constraint upon exon edges affects the relationship between dN/dS and expression
Previous studies have shown that in mammalian species, exonic splicing enhancer (ESE) sequences result in higher conservation of synonymous sites at exon edges, suggestive of selective constraint to maintain correct splicing (Carlini & Genut 2006; Parmley et al. 2006). Here, we show that the removal of codons at the exon edges has a strong effect on the rate of substitutions at synonymous sites in A. thaliana, suggesting similar constraint, and associated functional importance, for ESE-containing regions in plants. A moderate increase was also observed in the rate of non-synonymous substitutions, reflecting the fact that purifying selection at these sites is stronger than the average observed at non-synonymous sites elsewhere in the gene.
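The idea of removing codons from exon edges can be sketched as below. This is a simplified illustration and not the pipeline used in this study: the function name `strip_exon_edges` is hypothetical, and the sketch assumes that every exon begins and ends on a codon boundary, which real gene models often do not.

```python
def strip_exon_edges(exon_codons, n):
    """Drop n codons from every exon edge that flanks an intron.

    exon_codons: list of exons in transcript order, each a list of
    codon strings. The start of the first exon and the end of the
    last exon do not flank an intron, so they are left intact.
    Exons too short to lose the requested codons become empty lists.
    """
    kept = []
    last = len(exon_codons) - 1
    for i, codons in enumerate(exon_codons):
        start = n if i > 0 else 0                      # 5' edge follows an intron
        end = len(codons) - n if i < last else len(codons)  # 3' edge precedes an intron
        kept.append(codons[start:end] if start < end else [])
    return kept


# Toy three-exon gene (invented sequence, for illustration only).
gene = [
    ["ATG", "AAA", "CCC", "GGG"],
    ["TTT", "AAA", "CCC", "GGG", "TTT"],
    ["AAA", "CCC", "TAA"],
]
print(strip_exon_edges(gene, 1))
```

Evolutionary rates would then be re-estimated on the retained codons and compared with estimates obtained after removing an equal number of codons at random positions, as in the control described in the text.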
This study is, to the best of our knowledge, the first to explore the relationship between rates of sequence evolution and genomic parameters (including gene expression) in the context of exon edge conservation due to the presence of splice enhancers. Generally, the removal of exon edges weakened the association of both dN/dS and NI with measures of expression level and breadth. The relationship between dN/dS and other genomic parameters – such as various measures of gene/protein length – showed a moderate decrease, whereas the association between several length parameters and NI was strengthened (Table S6 in Supporting information). The weakened association of dN/dS and NI with gene expression after the removal of exon edges suggests that stronger purifying selection acting upon splice enhancer regions partly explains the association between dN/dS, NI and expression. From this, we can infer stronger splice-mediated selection in more highly expressed genes.
It is possible that more highly expressed genes are under increased constraint for accurate splice site definition, with this relationship partly masked by the stronger association of higher expression with lower dN/dS, which largely reflects constraint on the gene's function. In this respect, selection may also be masked on other properties expected to be under stronger constraint in more highly expressed genes, such as codon usage affecting translational error rate (Drummond et al. 2005), translation efficiency (Akashi & Eyre-Walker 1998) and mRNA stability (Tuller et al. 2010), although such analyses are beyond the scope of this study.
It is reasonable to ask whether anything can explain the higher selective constraint upon exon edges in such a way as to also relate both to a gene's structure and to its expression. One possible explanation may be the extent to which a gene is alternatively spliced. Alternative splicing has been shown to positively correlate with both the ratio of total intron length to overall gene length (Koralewski & Krutovsky 2011) and gene expression level (Chen et al. 2014). As longer genes are more likely to have more complex exon–intron architectures (Zhu et al. 2009), they are expected to have a higher number of possible alternative splicing events. If we assume that the edges of alternatively spliced exons are under stronger selection for accurate splicing than those of non-alternatively spliced exons, then genes with higher levels of alternative splicing are expected to show a greater discrepancy in evolutionary rate estimates before and after codon removal. Using estimates of the number of alternative splicing events per gene, we find that dN/dS ratios (calculated from pairwise alignments of A. thaliana and A. lyrata to maximize sample size) are more strongly affected by codon removal from the exon edges in genes with higher levels of alternative splicing – for instance, the increase in dN after 10 codons are removed is significantly higher for genes with more splicing events (rho = 0.13, P = 2.7 × 10−4; Table S9 in Supporting information). This pattern is also observed when removing 20 or 30 codons from exon edges (Table S9 in Supporting information). Although based upon a limited sample size, this finding merits further scrutiny as it suggests that genes with alternative splicing events, compared to genes without such events, have a higher degree of conservation at exon edges relative to conservation of the remaining coding sequence.
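The rank correlation described above can be sketched in a few lines. This is a minimal pure-Python illustration of Spearman's rho (Pearson correlation of average ranks, with ties sharing their mean rank), not the code used in this study. The function names and the toy values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values, for illustration only: alternative splicing events per gene
# and the change in dN after removing codons from the exon edges.
splicing_events = [0, 1, 2, 4, 7]
delta_dn = [0.001, 0.000, 0.004, 0.003, 0.009]
print(spearman_rho(splicing_events, delta_dn))
```

A rank-based measure is a natural choice here because the number of splicing events per gene is a skewed count variable; the sketch does not handle the degenerate case in which one variable is constant.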
Lineage-specific dN/dS estimates have a stronger relationship with gene length than with expression level
The use of pairwise alignments for estimating dN/dS could influence any relationship between dN/dS and a gene's characteristics as biases are introduced due to branch-specific changes in the strength and direction of selection. For example, if a gene in A. lyrata was under a greater degree of purifying selection than its A. thaliana orthologue, this would result in a decreased dN/dS estimate in A. thaliana (Toll-Riera et al. 2011). This would introduce noise into the correlation of dN/dS and any genic feature in A. thaliana. Estimating a lineage-specific dN/dS using T. parvula as an outgroup, we found the correlation strength of dN/dS with many genic features, both structural and functional, is reduced (Table S2 in Supporting information). In particular, the estimate of rho for the expression level–dN/dS relationship is reduced by approximately 50% when using a lineage-specific compared to a pairwise dN/dS estimate (Table 1 and Table S2 in Supporting information). However, the use of lineage-specific dN/dS estimates increased the correlation between dN/dS and gene length. This is of interest given the relationship between the three variables – as expression and length are both negative correlates of dN/dS, it follows that genes under stronger purifying selection are more likely to be both highly expressed and longer. As selection for higher expression can reasonably predict a gene's length, with shorter genes minimizing costly transcription and translation (Castillo-Davis et al. 2002; Eisenberg & Levanon 2003; Urrutia & Hurst 2003), this suggests that gene length itself, rather than expression, could be a stronger predictor of dN/dS. This finding also supports a previously observed negative relationship between dN/dS and gene length identified using A. thaliana–A. lyrata orthologous pairs (Yang & Gaut 2011).
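The size of the drop in correlation strength can be checked against the expression-related rows of Table 1. The short sketch below copies those rho values and computes the relative reduction in |rho|. It assumes that the first two numeric columns hold the pairwise estimates and, as the table footnote states, that the rightmost column holds the lineage-specific estimates; the function name is hypothetical, and values from Table S2 are not reproduced here.

```python
# Spearman's rho values copied from the expression-related rows of Table 1:
# (pairwise column 1, pairwise column 2, lineage-specific).
table1 = {
    "Expression breadth": (-0.286, -0.317, -0.182),
    "Exp. level (RNA-seq)": (-0.256, -0.284, -0.144),
    "Protein abundance": (-0.198, -0.225, -0.102),
}


def relative_reduction(pairwise_rho, lineage_rho):
    """Percentage by which |rho| shrinks from a pairwise to a lineage-specific estimate."""
    return 100.0 * (abs(pairwise_rho) - abs(lineage_rho)) / abs(pairwise_rho)


for name, (pair_1, pair_2, lineage) in table1.items():
    print(
        f"{name}: {relative_reduction(pair_1, lineage):.1f}% "
        f"and {relative_reduction(pair_2, lineage):.1f}%"
    )
```

Expressing the change relative to the pairwise value makes rows with different baseline correlation strengths directly comparable.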
It is possible that the comparatively reduced prominence of expression level as a predictor of evolutionary rate is explained in this case by mating system: A. thaliana, unlike A. lyrata or T. parvula, is a near obligate selfer, having a patchy distribution of inbred populations with relatively rare outcrossed mating between different ecotypes (Tian et al. 2002). Selfing increases genomewide homozygosity, and thus decreases the number of gametes which may be independently sampled in a given population, in effect reducing effective population size (Szövényi et al. 2014). As a consequence, the efficacy of selection – particularly purifying selection – at purging weakly deleterious mutations is reduced (Wright et al. 2013; Glémin & Muyle 2014). In this respect, the degree of constraint acting upon highly expressed genes may be partially masked when using lineage-specific dN/dS estimates. Nevertheless, the proposal that A. thaliana experiences a general trend of relaxed selection compared to A. lyrata is only weakly supported (Glémin 2007) and, in any case, the relationship of expression level to lineage-specific dN/dS for A. lyrata is equally reduced, assuming expression to be equivalent in both species (rho = −0.15, P < 2.2 × 10−16; Table S2 in Supporting information). In addition, it is important to note that the differences between pairwise and lineage-specific dN/dS are not explained by differences in the gene/codon samples used to estimate dN/dS. Although a smaller proportion of the A. thaliana genome can be simultaneously aligned with both the A. lyrata and the T. parvula genomes, similar results are obtained when restricting the analyses to a common set of codons. Finally, we believe that T. parvula is a justifiable outgroup species as it has an estimated divergence time from A. thaliana of approx. 40 mya; this falls within the range of distances for species used to calculate lineage-specific dN/dS (e.g. approx. 90 mya for the divergence of humans and dogs, as in Toll-Riera et al. (2011)) without confounding the estimate by saturation.
Exon edge removal, but not lineage-specific substitution patterns, unmasks higher levels of positive selection
One key objective of this study was to assess whether exon edge conservation and the use of pairwise alignments could be masking higher levels of molecular adaptation than have previously been observed. In general, we find that the proportion of genes under potential positive selection (dN/dS >1) is increased by the removal of exon edges. Of particular interest are four genes (AT1G08680, AT1G60930, AT2G17305 and AT4G27370) where dN/dS ratios are higher than 1 only after codons are removed from the exon edges, but not when codons are removed from random positions. This could suggest, in these cases, that an adaptive signature has been partially masked by disproportionate synonymous substitutions at the edges of exons. Of note is that AT1G08680 (ARF GAP-like zinc finger-containing protein ZIGA4) has been linked to adaptive germination phenotypes (Morrison & Linder 2014) and that AT1G60930 (RECQ helicase L4B) appears to be a duplicate gene that has undergone a degree of functional divergence (Singh et al. 2010). As duplicated genes can undergo asymmetric sequence divergence relative to each other (Conant & Wagner 2003), an adaptive interpretation is in this case plausible.
When considering lineage-specific dN/dS, however, the proportion of genes with dN/dS >1 is significantly lower than when dN/dS is estimated using pairwise alignments of A. thaliana with A. lyrata. This could indicate that dN/dS values higher than 1 are, for several genes, being driven by increased dN/dS values in the A. lyrata lineage which, notably, does not show prevalent self-fertilization.
Having found a significant effect of exon edge conservation and lineage-specific substitution upon dN/dS estimates when each was considered separately, we wished to test whether the relationship between dN/dS and the set of genomic parameters changed when both factors were taken into account together. However, there were only a limited number of genes for which full exons could be aligned across all three species, as required for the analysis of codon removal at the exon edges and the estimates of lineage-specific dN/dS. Using a limited sample (n = 73) in which 10 codons could be removed from the exon edges, we found no significant differences in the relationship of dN/dS to any genomic parameter after codons were removed from the exon edges compared to removal at random sites (Table S6 in Supporting information). Better annotation of A. lyrata and T. parvula, or the genomes of related species, would improve the testing of the effects of exon edge conservation upon dN/dS estimates using lineage-specific substitutions.
The variation in sequence evolution among genes and its association with genic characteristics, including expression, could also be partly explained by genomic context. Most notably, chromosomal location has been associated with gene expression in A. thaliana (Yamada et al. 2003; Schmid et al. 2005). Several studies have also shown that across the genome, there are nonrandom clusters of genes with similar expression profiles in a variety of taxa (Lercher et al. 2002; Versteeg et al. 2003). Clusters of genes with similar evolutionary rate have also been identified (Williams & Hurst 2000; Lercher et al. 2001). A common mechanism may explain both types of cluster (e.g. Williams & Hurst (2002), but see Lercher et al. (2004)), although further assessment of such hypotheses falls outside the scope of this study.
In summary, we show that higher conservation at the edges of exons in A. thaliana plays an important part in determining dN/dS ratios by increasing the proportion of conserved synonymous sites. The effect of these conserved regions upon overall dN/dS values partly explains the relationship between rates of protein evolution and expression level. By accounting for lineage-specific substitution patterns and the effect of conservation at the exon edges, the ability of expression level to explain variation in evolutionary rate is diminished, with gene length becoming the strongest correlate. In addition, we found evidence of masked positive selection from the conservation of exon edges, irrespective of the noise introduced to dN/dS estimates by the use of pairwise alignments.
Acknowledgments
The authors wish to thank Laurence Hurst for comments on this manuscript. This work was supported by a University of Bath fee studentship to SJB, a BBSRC grant (BB/F022697/1) to PXK and a Royal Society Dorothy Hodgkin Research Fellowship (DH071902), Royal Society research grant (RG0870644) and a Royal Society research grant for fellows (RG080272) to AOU.
Data accessibility
Alignments and associated evolutionary rate estimates are available for download at the DRYAD repository (http://datadryad.org), entry doi:10.5061/dryad.905sq.
Supporting information
Additional supporting information may be found in the online version of this article.
Table S1 Structural and functional characteristics of A. thaliana genes.
Table S2 Relationship between dN/dS and NI with various genomic characteristics in A. thaliana.
Table S3 Partial correlations between dN/dS and 20 genomic characteristics in A. thaliana, controlling for expression level.
Table S4 Average estimates of four evolutionary rate variables after sequential codon removal from the exon edges vs. random codon removal.
Table S5 Characteristics of the dN/dS and NI distributions for dataset A (pairwise alignment of A. thaliana with A. lyrata) and dataset B (pairwise alignment of A. thaliana with T. parvula), before and after codon removal at exon-intron junctions.
Table S6 Correlations between 4 selection strength/direction variables and 25 genomic characteristics, after the removal of 10, 20 and 30 codons from the exon edges vs. random removal of an equal number of codons.
Table S7 dN/dS estimates using codons common to the alignment of A. thaliana, A. lyrata and T. parvula.
Table S8 Relationship between dN/dS and evolutionary rate predictors using estimates derived from codons common to the alignment of A. thaliana, A. lyrata and T. parvula.
Table S9 Relationship between the average number of alternative splicing events per gene and the difference in evolutionary rate estimates before and after codon removal from the exon edges.
References
- Akashi H. Translational selection and yeast proteome evolution. Genetics. 2003;164:1291–1303. doi: 10.1093/genetics/164.4.1291.
- Akashi H, Eyre-Walker A. Translational selection and molecular evolution. Current Opinion in Genetics & Development. 1998;8:688–693. doi: 10.1016/s0959-437x(98)80038-5.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- Arbiza L, Dopazo J, Dopazo H. Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome. PLoS Computational Biology. 2006;2:e38. doi: 10.1371/journal.pcbi.0020038.
- Baerenfaller K, Grossmann J, Grobei MA, et al. Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 2008;320:938–941. doi: 10.1126/science.1157956.
- Bakewell MA, Shi P, Zhang J. More genes underwent positive selection in chimpanzee evolution than in human evolution. Proceedings of the National Academy of Sciences USA. 2007;104:7489–7494. doi: 10.1073/pnas.0701705104.
- Berardini TZ, Mundodi S, Reiser L, et al. Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiology. 2004;135:745–755. doi: 10.1104/pp.104.040071.
- Blencowe BJ. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. Trends in Biochemical Sciences. 2000;25:106–110. doi: 10.1016/s0968-0004(00)01549-8.
- Boguski MS, Lowe TMJ, Tolstoshev CM. dbEST - database for expressed sequence tags. Nature Genetics. 1993;4:332–333. doi: 10.1038/ng0893-332.
- Brenner S, Johnson M, Bridgham J, et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology. 2000;18:630–634. doi: 10.1038/76469.
- Bromham L. Why do species vary in their rate of molecular evolution? Biology Letters. 2009;5:401–404. doi: 10.1098/rsbl.2009.0136.
- Caceres EF, Hurst LD. The evolution, impact and properties of exonic splice enhancers. Genome Biology. 2013;14:R143. doi: 10.1186/gb-2013-14-12-r143.
- Cao J, Schneeberger K, Ossowski S, et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nature Genetics. 2011;43:956–963. doi: 10.1038/ng.911.
- Carlini DB, Genut JE. Synonymous SNPs provide evidence for selective constraint on human exonic splicing enhancers. Journal of Molecular Evolution. 2006;62:89–98. doi: 10.1007/s00239-005-0055-x.
- Castellana NE, Payne SH, Shen Z, et al. Discovery and revision of Arabidopsis genes by proteogenomics. Proceedings of the National Academy of Sciences USA. 2008;105:21034–21038. doi: 10.1073/pnas.0811066106.
- Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA. Selection for short introns in highly expressed genes. Nature Genetics. 2002;31:415–418. doi: 10.1038/ng940.
- Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. Journal of Molecular Diagnostics. 2003;5:73–81. doi: 10.1016/S1525-1578(10)60455-2.
- Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Molecular Biology and Evolution. 2014;31:1402–1413. doi: 10.1093/molbev/msu083.
- Chen L, Tovar-Corona JM, Urrutia AO. Alternative splicing: a potential source of functional innovation in the eukaryotic genome. International Journal of Evolutionary Biology. 2012;2012:10. doi: 10.1155/2012/596274.
- Cherry JL. Expression level, evolutionary rate, and the cost of expression. Genome Biology and Evolution. 2010a;2:757–769. doi: 10.1093/gbe/evq059.
- Cherry JL. Highly expressed and slowly evolving proteins share compositional properties with thermophilic proteins. Molecular Biology and Evolution. 2010b;27:735–741. doi: 10.1093/molbev/msp270.
- Coghlan A, Wolfe KH. Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast. 2000;16:1131–1145. doi: 10.1002/1097-0061(20000915)16:12<1131::AID-YEA609>3.0.CO;2-F.
- Comeron JM, Guthrie TB. Intragenic Hill-Robertson interference influences selection intensity on synonymous mutations in Drosophila. Molecular Biology and Evolution. 2005;22:2519–2530. doi: 10.1093/molbev/msi246.
- Conant GC, Wagner A. Asymmetric sequence divergence of duplicate genes. Genome Research. 2003;13:2052–2058. doi: 10.1101/gr.1252603.
- Dassanayake M, Oh DH, Haas JS, et al. The genome of the extremophile crucifer Thellungiella parvula. Nature Genetics. 2011;43:913–918. doi: 10.1038/ng.889.
- Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. Why highly expressed proteins evolve slowly. Proceedings of the National Academy of Sciences USA. 2005;102:14338–14343. doi: 10.1073/pnas.0504070102.
- Duret L, Mouchiroud D. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Molecular Biology and Evolution. 2000;17:68–74. doi: 10.1093/oxfordjournals.molbev.a026239.
- Egea R, Casillas S, Barbadilla A. Standard and generalized McDonald–Kreitman test: a website to detect selection by comparing different classes of DNA sites. Nucleic Acids Research. 2008;36:W157–W162. doi: 10.1093/nar/gkn337.
- Eisenberg E, Levanon EY. Human housekeeping genes are compact. Trends in Genetics. 2003;19:362–365. doi: 10.1016/S0168-9525(03)00140-9.
- Foxe JP, Dar V-u-N, Zheng H, et al. Selection on amino acid substitutions in Arabidopsis. Molecular Biology and Evolution. 2008;25:1375–1383. doi: 10.1093/molbev/msn079.
- Fraser H, Hirsh A. Evolutionary rate depends on number of protein-protein interactions independently of gene expression level. BMC Evolutionary Biology. 2004;4:13. doi: 10.1186/1471-2148-4-13.
- Gan X, Stegle O, Behr J, et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature. 2011;477:419–423. doi: 10.1038/nature10414.
- Glémin S. Mating systems and the efficacy of selection at the molecular level. Genetics. 2007;177:905–916. doi: 10.1534/genetics.107.073601.
- Glémin S, Muyle A. Mating systems and selection efficacy: a test using chloroplastic sequence data in Angiosperms. Journal of Evolutionary Biology. 2014;27:1386–1399. doi: 10.1111/jeb.12356.
- Gossmann TI, Song B-H, Windsor AJ, et al. Genome wide analyses reveal little evidence for adaptive evolution in many plant species. Molecular Biology and Evolution. 2010;27:1822–1832. doi: 10.1093/molbev/msq079.
- Hahn MW, Kern AD. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Molecular Biology and Evolution. 2005;22:803–806. doi: 10.1093/molbev/msi072.
- Haldane JB. The estimation and significance of the logarithm of a ratio of frequencies. Annals of Human Genetics. 1956;20:309–311. doi: 10.1111/j.1469-1809.1955.tb01285.x.
- Hamblin MT, Casa AM, Sun H, et al. Challenges of detecting directional selection after a bottleneck: lessons from Sorghum bicolor. Genetics. 2006;173:953–964. doi: 10.1534/genetics.105.054312.
- Hu TT, Pattyn P, Bakker EG, et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genetics. 2011;43:476–481. doi: 10.1038/ng.807.
- Hurst LD. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends in Genetics. 2002;18:486. doi: 10.1016/s0168-9525(02)02722-1.
- Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. Journal of Molecular Biology. 1981;151:389–409. doi: 10.1016/0022-2836(81)90003-6.
- Jones C, Brown A, Baumann U. Estimating the annotation error rate of curated GO database sequence annotations. BMC Bioinformatics. 2007;8:170. doi: 10.1186/1471-2105-8-170.
- Kawahara Y, Imanishi T. A genome-wide survey of changes in protein evolutionary rates across four closely related species of Saccharomyces sensu stricto group. BMC Evolutionary Biology. 2007;7:9. doi: 10.1186/1471-2148-7-9.
- Kim E, Magen A, Ast G. Different levels of alternative splicing among eukaryotes. Nucleic Acids Research. 2007;35:125–131. doi: 10.1093/nar/gkl924.
- Kim SH, Yi SV. Correlated asymmetry of sequence and functional divergence between duplicate proteins of Saccharomyces cerevisiae. Molecular Biology and Evolution. 2006;23:1068–1075. doi: 10.1093/molbev/msj115.
- Kim SH, Yi SV. Understanding relationship between sequence and functional evolution in yeast proteins. Genetica. 2007;131:151–156. doi: 10.1007/s10709-006-9125-2.
- Koralewski TE, Krutovsky KV. Evolution of exon-intron structure and alternative splicing. PLoS ONE. 2011;6:e18055. doi: 10.1371/journal.pone.0018055.
- Krylov DM, Wolf YI, Rogozin IB, Koonin EV. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Research. 2003;13:2229–2235. doi: 10.1101/gr.1589103.
- Larracuente AM, Sackton TB, Greenberg AJ, et al. Evolution of protein-coding genes in Drosophila. Trends in Genetics. 2008;24:114–123. doi: 10.1016/j.tig.2007.12.001.
- Lemos B, Bettencourt BR, Meiklejohn CD, Hartl DL. Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Molecular Biology and Evolution. 2005;22:1345–1354. doi: 10.1093/molbev/msi122.
- Lercher MJ, Chamary JV, Hurst LD. Genomic regionality in rates of evolution is not explained by clustering of genes of comparable expression profile. Genome Research. 2004;14:1002–1013. doi: 10.1101/gr.1597404.
- Lercher MJ, Urrutia AO, Hurst LD. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nature Genetics. 2002;31:180–183. doi: 10.1038/ng887.
- Lercher MJ, Williams EJ, Hurst LD. Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias. Molecular Biology and Evolution. 2001;18:2032–2039. doi: 10.1093/oxfordjournals.molbev.a003744.
- Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–1635. doi: 10.1126/science.1158395.
- Makino T, Gojobori T. The evolutionary rate of a protein is influenced by features of the interacting partners. Molecular Biology and Evolution. 2006;23:784–789. doi: 10.1093/molbev/msj090.
- Marais G, Charlesworth B, Wright SI. Recombination and base composition: the case of the highly self-fertilizing plant Arabidopsis thaliana. Genome Biology. 2004;5:R45. doi: 10.1186/gb-2004-5-7-r45.
- McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991;351:652–654. doi: 10.1038/351652a0.
- Meyers BC, Tej SS, Vu TH, et al. The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Research. 2004;14:1641–1653. doi: 10.1101/gr.2275604.
- Morrison GD, Linder CR. Association mapping of germination traits in Arabidopsis thaliana under light and nutrient treatments: searching for G × E effects. G3 (Bethesda). 2014;4:1465–1478. doi: 10.1534/g3.114.012427.
- Nakano M, Nobuta K, Vemaraju K, et al. Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Research. 2006;34:D731–D735. doi: 10.1093/nar/gkj077.
- Nielsen R. Molecular signatures of natural selection. Annual Review of Genetics. 2005;39:197–218. doi: 10.1146/annurev.genet.39.073003.112420.
- Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463:457–463. doi: 10.1038/nature08909.
- Pál C, Papp B, Hurst LD. Does the recombination rate affect the efficiency of purifying selection? The yeast genome provides a partial answer. Molecular Biology and Evolution. 2001;18:2323–2326. doi: 10.1093/oxfordjournals.molbev.a003779.
- Park S, Choi S. Expression breadth and expression abundance behave differently in correlations with evolutionary rates. BMC Evolutionary Biology. 2010;10:241. doi: 10.1186/1471-2148-10-241.
- Parmakelis A, Moustaka M, Poulakakis N, et al. Anopheles immune genes and amino acid sites evolving under the effect of positive selection. PLoS ONE. 2010;5:e8885. doi: 10.1371/journal.pone.0008885. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Parmley JL, Chamary JV, Hurst LD. Evidence for purifying selection against synonymous mutations in mammalian exonic splicing enhancers. Molecular Biology and Evolution. 2006;23:301–309. doi: 10.1093/molbev/msj035. [DOI] [PubMed] [Google Scholar]
- Parmley JL, Hurst LD. Exonic splicing regulatory elements skew synonymous codon usage near intron-exon boundaries in mammals. Molecular Biology and Evolution. 2007;24:1600–1603. doi: 10.1093/molbev/msm104. [DOI] [PubMed] [Google Scholar]
- Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods in Molecular Biology. 2000;132:185–219. doi: 10.1385/1-59259-192-2:185. [DOI] [PubMed] [Google Scholar]
- Pertea M, Mount SM, Salzberg SL. A computational survey of candidate exonic splicing enhancer motifs in the model plant Arabidopsis thaliana. BMC Bioinformatics. 2007;8:159. doi: 10.1186/1471-2105-8-159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Podder S, Mukhopadhyay P, Ghosh TC. Multifunctionality dominantly determines the rate of human housekeeping and tissue specific interacting protein evolution. Gene. 2009;439:11–16. doi: 10.1016/j.gene.2009.03.005. [DOI] [PubMed] [Google Scholar]
- R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. [Google Scholar]
- Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois, USA: Northwestern University; 2014. [Google Scholar]
- Ross-Ibarra J, Tenaillon M, Gaut BS. Historical divergence and gene flow in the genus Zea. Genetics. 2009;181:1399–1413. doi: 10.1534/genetics.108.097238. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schmid M, Davison TS, Henz SR, et al. A gene expression map of Arabidopsis thaliana development. Nature Genetics. 2005;37:501–506. doi: 10.1038/ng1543. [DOI] [PubMed] [Google Scholar]
- Seoighe C, Gehring C, Hurst LD. Gametophytic selection in Arabidopsis thaliana supports the selective model of intron length reduction. PLoS Genetics. 2005;1:e13. doi: 10.1371/journal.pgen.0010013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Singh S, Roy S, Choudhury S, Sengupta D. DNA repair and recombination in higher plants: insights from comparative genomics of Arabidopsis and rice. BMC Genomics. 2010;11:443. doi: 10.1186/1471-2164-11-443.
- Slotte T, Bataillon T, Hansen TT, et al. Genomic determinants of protein evolution and polymorphism in Arabidopsis. Genome Biology and Evolution. 2011;3:1210–1219. doi: 10.1093/gbe/evr094.
- Stark C, Breitkreutz BJ, Chatr-Aryamontri A, et al. The BioGRID interaction database: 2011 update. Nucleic Acids Research. 2011;39:D698–D704. doi: 10.1093/nar/gkq1116.
- Stark C, Breitkreutz BJ, Reguly T, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Research. 2006;34:D535–D539. doi: 10.1093/nar/gkj109.
- Stoletzki N, Eyre-Walker A. Synonymous codon usage in Escherichia coli: selection for translational accuracy. Molecular Biology and Evolution. 2007;24:374–381. doi: 10.1093/molbev/msl166.
- Szövényi P, Devos N, Weston DJ, et al. Efficient purging of deleterious mutations in plants with haploid selfing. Genome Biology and Evolution. 2014;6:1238–1252. doi: 10.1093/gbe/evu099.
- Tacke R, Manley JL. Determinants of SR protein specificity. Current Opinion in Cell Biology. 1999;11:358–362. doi: 10.1016/S0955-0674(99)80050-7.
- Tang CS, Zhao YZ, Smith DK, Epstein RJ. Intron length and accelerated 3′ gene evolution. Genomics. 2006;88:682–689. doi: 10.1016/j.ygeno.2006.06.017.
- Tennessen JA. Positive selection drives a correlation between non-synonymous/synonymous divergence and functional divergence. Bioinformatics. 2008;24:1421–1425. doi: 10.1093/bioinformatics/btn205.
- Tian D, Araki H, Stahl E, Bergelson J, Kreitman M. Signature of balancing selection in Arabidopsis. Proceedings of the National Academy of Sciences USA. 2002;99:11525–11530. doi: 10.1073/pnas.172203599.
- Ticher A, Graur D. Nucleic acid composition, codon usage, and the rate of synonymous substitution in protein-coding genes. Journal of Molecular Evolution. 1989;28:286–298. doi: 10.1007/BF02103424.
- Toll-Riera M, Laurie S, Albà MM. Lineage-specific variation in intensity of natural selection in mammals. Molecular Biology and Evolution. 2011;28:383–398. doi: 10.1093/molbev/msq206.
- Tuller T, Waldman YY, Kupiec M, Ruppin E. Translation efficiency is determined by both codon bias and folding energy. Proceedings of the National Academy of Sciences USA. 2010;107:3645–3650. doi: 10.1073/pnas.0909910107.
- Urrutia AO, Hurst LD. Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics. 2001;159:1191–1199. doi: 10.1093/genetics/159.3.1191.
- Urrutia AO, Hurst LD. The signature of selection mediated by expression on human genes. Genome Research. 2003;13:2260–2264. doi: 10.1101/gr.641103.
- Versteeg R, van Schaik BD, van Batenburg MF, et al. The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Research. 2003;13:1998–2004. doi: 10.1101/gr.1649303.
- Wang GZ, Lercher MJ. The effects of network neighbours on protein evolution. PLoS ONE. 2011;6:e18288. doi: 10.1371/journal.pone.0018288.
- Warnecke T, Hurst LD. Evidence for a trade-off between translational efficiency and splicing regulation in determining synonymous codon usage in Drosophila melanogaster. Molecular Biology and Evolution. 2007;24:2755–2762. doi: 10.1093/molbev/msm210.
- Weedall GD, Polley SD, Conway DJ. Gene-specific signatures of elevated non-synonymous substitution rates correlate poorly across the Plasmodium genus. PLoS ONE. 2008;3:e2281. doi: 10.1371/journal.pone.0002281.
- Williams EJ, Hurst LD. The proteins of linked genes evolve at similar rates. Nature. 2000;407:900–903. doi: 10.1038/35038066.
- Williams EJ, Hurst LD. Clustering of tissue-specific genes underlies much of the similarity in rates of protein evolution of linked genes. Journal of Molecular Evolution. 2002;54:511–518. doi: 10.1007/s00239-001-0043-8.
- Winter EE, Goodstadt L, Ponting CP. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Research. 2004;14:54–61. doi: 10.1101/gr.1924004.
- Wright F. The ‘effective number of codons’ used in a gene. Gene. 1990;87:23–29. doi: 10.1016/0378-1119(90)90491-9.
- Wright SI, Foxe JP, DeRose-Wilson L, et al. Testing for effects of recombination rate on nucleotide diversity in natural populations of Arabidopsis lyrata. Genetics. 2006;174:1421–1430. doi: 10.1534/genetics.106.062588.
- Wright SI, Kalisz S, Slotte T. Evolutionary consequences of self-fertilization in plants. Proceedings of the Royal Society B: Biological Sciences. 2013;280:20130133. doi: 10.1098/rspb.2013.0133.
- Wright SI, Yau CBK, Looseley M, Meyers BC. Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Molecular Biology and Evolution. 2004;21:1719–1726. doi: 10.1093/molbev/msh191.
- Wu Y, Zhang Y, Zhang J. Distribution of exonic splicing enhancer elements in human genes. Genomics. 2005;86:329–336. doi: 10.1016/j.ygeno.2005.05.011.
- Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association. 2004;99:909–917.
- Xia Y, Franzosa EA, Gerstein MB. Integrated assessment of genomic correlates of protein evolutionary rate. PLoS Computational Biology. 2009;5:e1000413. doi: 10.1371/journal.pcbi.1000413.
- Yamada K, Lim J, Dale JM, et al. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science. 2003;302:842–846. doi: 10.1126/science.1088305.
- Yanai I, Benjamin H, Shmoish M, et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005;21:650–659. doi: 10.1093/bioinformatics/bti042.
- Yang H. In plants, expression breadth and expression level distinctly and non-linearly correlate with gene structure. Biology Direct. 2009;4:45. doi: 10.1186/1745-6150-4-45.
- Yang L, Gaut BS. Factors that contribute to variation in evolutionary rate among Arabidopsis genes. Molecular Biology and Evolution. 2011;28:2359–2369. doi: 10.1093/molbev/msr058.
- Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution. 2007;24:1586–1591. doi: 10.1093/molbev/msm088.
- Yang Z, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends in Ecology & Evolution. 2000;15:496–503. doi: 10.1016/S0169-5347(00)01994-7.
- Zhang L, Li WH. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Molecular Biology and Evolution. 2004;21:236–239. doi: 10.1093/molbev/msh010.
- Zheng ZM. Regulation of alternative RNA splicing by exon definition and exon sequences in viral and mammalian gene expression. Journal of Biomedical Science. 2004;11:278–294. doi: 10.1159/000077096.
- Zhu L, Zhang Y, Zhang W, et al. Patterns of exon-intron architecture variation of genes in eukaryotic genomes. BMC Genomics. 2009;10:47. doi: 10.1186/1471-2164-10-47.
Associated Data
Supplementary Materials
Table S1 Structural and functional characteristics of A. thaliana genes.
Table S2 Relationship between dN/dS and NI with various genomic characteristics in A. thaliana.
Table S3 Partial correlations between dN/dS and 20 genomic characteristics in A. thaliana, controlling for expression level.
Table S4 Average estimates of four evolutionary rate variables after sequential codon removal from the exon edges vs. random codon removal.
Table S5 Characteristics of the dN/dS and NI distributions for dataset A (pairwise alignment of A. thaliana with A. lyrata) and dataset B (pairwise alignment of A. thaliana with T. parvula), before and after codon removal at exon-intron junctions.
Table S6 Correlations between four selection strength/direction variables and 25 genomic characteristics, after the removal of 10, 20 and 30 codons from the exon edges vs. random removal of an equal number of codons.
Table S7 dN/dS estimates using codons common to the alignment of A. thaliana, A. lyrata and T. parvula.
Table S8 Relationship between dN/dS and evolutionary rate predictors using estimates derived from codons common to the alignment of A. thaliana, A. lyrata and T. parvula.
Table S9 Relationship between the average number of alternative splicing events per gene and the difference in evolutionary rate estimates before and after codon removal from the exon edges.
Data Availability Statement
Alignments and associated evolutionary rate estimates are available for download at the DRYAD repository (http://datadryad.org), entry doi:10.5061/dryad.905sq.