eLife. 2024 Sep 6;12:RP87335. doi: 10.7554/eLife.87335

The protein domains of vertebrate species in which selection is more effective have greater intrinsic structural disorder

Catherine A Weibel 1,2,†, Andrew L Wheeler 3, Jennifer E James 4,§, Sara M Willis 4,#, Hanon McShea 5, Joanna Masel 4
Editors: C Brandon Ogbunugafor 6, Detlef Weigel 7
PMCID: PMC11379457  PMID: 39239703

Abstract

The nearly neutral theory of molecular evolution posits variation among species in the effectiveness of selection. In an idealized model, the census population size determines both the minimum magnitude of the selection coefficient required for deleterious variants to be reliably purged, and the amount of neutral diversity. Empirically, an ‘effective population size’ is often estimated from the amount of putatively neutral genetic diversity and is assumed to also capture a species’ effectiveness of selection. A potentially more direct measure of the effectiveness of selection is the degree to which selection maintains preferred codons. However, past metrics that compare codon bias across species are confounded by among-species variation in %GC content and/or amino acid composition. Here, we propose a new Codon Adaptation Index of Species (CAIS), based on Kullback–Leibler divergence, that corrects for both confounders. We demonstrate the use of CAIS correlations, as well as the Effective Number of Codons, to show that the protein domains of more highly adapted vertebrate species evolve higher intrinsic structural disorder.

Research organism: None

eLife digest

Evolution is the process through which populations change over time, starting with mutations in the genetic sequence of an organism. Many of these mutations harm the survival and reproduction of an organism, but only by a very small amount.

Some species, especially those with large populations, can purge these slightly harmful mutations more effectively than other species. This fact has been used by the ‘drift barrier theory’ to explain various profound differences amongst species, including differences in biological complexity. In this theory, the effectiveness of eliminating slightly harmful mutations is specified by an ‘effective’ population size, which depends on factors beyond just the number of individuals in the population.

Effective population size is normally calculated from the amount of time a ‘neutral’ mutation (one with no effect at all) stays in the population before becoming lost or taking over. Estimating this time requires both representative data for genetic diversity and knowledge of the mutation rate. A major limitation is that these data are unavailable for most species. A second limitation is that a brief, temporary reduction in the number of individuals has an oversized impact on the metric, relative to its impact on the number of slightly harmful mutations accumulated.

Weibel, Wheeler et al. developed a new metric to more directly determine how effectively a species purges slightly harmful mutations. Their approach is based on the fact that the genetic code has ‘synonymous’ sequences. These sequences code for the same amino acid building block, with one of these sequences being only slightly preferred over others.

The metric by Weibel, Wheeler et al. quantifies the proportion of the genome from which less preferred synonymous sequences have been effectively purged. It judges a population to have a higher effective population size when the usage of synonymous sequences departs further from the usage predicted from mutational processes.

The researchers expected that natural selection would favour ‘ordered’ proteins with robust three-dimensional structures, i.e., that species with a higher effective population size would tend to have more ordered versions of a protein. Instead, they found the opposite: species with a higher effective population size tend to have more disordered versions of the same protein. This changes our view of how natural selection acts on proteins.

Why species are so different remains a fundamental question in biology. Weibel, Wheeler et al. provide a useful tool for future applications of drift barrier theory to a broad range of ways that species differ.

Introduction

Species differ from each other in many ways, including mating system, ploidy, spatial distribution, life history, size, lifespan, genome size, mutation rate, selective pressure, and population size. These differences make the process of purifying selection more efficient in some species than others. Our understanding of both the causes and consequences of these differences is limited in part by the lack of a reliable metric with which to measure them. In the long term, the probability that a gene is fixed for one allele rather than another allele is given by the ratio of fixation and counter-fixation probabilities (Bulmer, 1991). In an idealized population of constant population size and no selection at linked sites, a mutation–selection–drift model describes how this ratio of fixation probabilities depends on the census population size N (Kimura, 1962), and hence gives the fraction of sites expected to be found in preferred vs. non-preferred states (Figure 1).

Figure 1. The effectiveness of selection, calculated as the long-term ratio of time spent in fixed deleterious: fixed beneficial allele states given symmetric mutation rates, is a function of the product sN.

Assuming a diploid Wright–Fisher population with s << 1, the probability of fixation of a new mutation is $\pi(N,s) = \frac{1 - e^{-s/2}}{1 - e^{-Ns}}$, and the y-axis is calculated as $\pi(N,-s)/(\pi(N,-s)+\pi(N,s))$. s is held constant at a value of 0.001 and N is varied. Results for other small magnitude values of s are superimposable. For small sN, selection is ineffective at producing codon bias. For large sN, selection is highly effective. For only a relatively narrow range of intermediate values of sN, the degree of codon bias depends quantitatively on sN.
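The curve described in this legend can be sketched numerically. The snippet below is a minimal illustration, not the authors' plotting code. It assumes the fixation probability in the legend reads (1 − e^(−s/2)) / (1 − e^(−Ns)); that reading, and the function names `fixation_prob` and `deleterious_fraction`, are assumptions made for this example.

```python
import math

def fixation_prob(n, s):
    """Fixation probability of a new mutation.

    Assumed form: (1 - e^{-s/2}) / (1 - e^{-N s}); s may be negative.
    """
    if s == 0:
        return 1.0 / (2 * n)  # neutral limit of the expression above
    # expm1(x) = e^x - 1 keeps precision for small |x|; the two sign flips cancel.
    return math.expm1(-s / 2) / math.expm1(-n * s)

def deleterious_fraction(n, s):
    """Ratio pi(N,-s) / (pi(N,-s) + pi(N,s)) shown on the y-axis."""
    to_deleterious = fixation_prob(n, -s)
    to_beneficial = fixation_prob(n, s)
    return to_deleterious / (to_deleterious + to_beneficial)

s = 0.001  # held constant, as in the legend
for n in (10, 100, 1_000, 10_000, 100_000):
    frac = deleterious_fraction(n, s)
    print(f"N = {n:>7}  sN = {s * n:>6.2f}  deleterious fraction = {frac:.3g}")
```

Under this assumed form the fraction stays near 0.5 while sN is much smaller than 1 and falls toward 0 once sN is well above 1, which is the narrow transition the legend describes. For sN above roughly 700 the exponential overflows in double precision, so a production version would need a more careful formulation.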

This reasoning has been extended to real populations by positing that species have an ‘effective’ population size, Ne (Ohta, 1973). Ne is the census population size of an idealized population that reproduces a property of interest in the focal population. Ne is therefore not a single quantity per population, but instead depends on which property is of interest.

The amount of neutral polymorphism is the usual property used to empirically estimate Ne (Charlesworth, 2009; Doyle et al., 2015; Lynch et al., 2016). However, the property of most relevance to nearly neutral theory is instead the inflection point s at which non-preferred alleles become common enough to matter (Figure 1), and hence the degree to which highly exquisite adaptation can be maintained in the face of ongoing mutation and genetic drift (Kimura, 1962; Ohta, 1972; Ohta, 1992). While genetic diversity has been found to reflect some aspects of life history strategy (Romiguier et al., 2014), there remain concerns about whether neutral genetic diversity and the limits to weak selection always remain closely coupled in non-equilibrium settings.

As a practical matter, Ne is usually calculated by dividing some measure of the amount of putatively neutral (often synonymous) polymorphism segregating in a population by that species’ mutation rate (Charlesworth, 2009). As a result, Ne values are only available for species that have both polymorphism data and accurate mutation rate estimates, limiting their use. Worse, Ne is not a robust statistic. In the absence of a clear species definition, polymorphism is sometimes calculated across too broad a range of genomes, substantially inflating Ne (Daubin and Moran, 2004); a poor sampling scheme can have the converse effect of deflating genetic diversity. Transient hypermutation (Plotkin et al., 2006), which is common in microbes, causes further short-term inconsistencies in polymorphism levels. Perhaps most importantly, a recent bottleneck will deflate Ne based on the coalescence time, even if too brief to lead to significant erosion of fine-tuned adaptations. But drift barrier theory concerns the degree to which adaptation is fine-tuned, and so a better metric would capture that directly, rather than indirectly rely on neutral diversity.
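The division of polymorphism by mutation rate mentioned above can be made concrete with a small worked example. It assumes the standard neutral expectation that nucleotide diversity equals 4·Ne·μ in a diploid population (2·Ne·μ in a haploid one), which the text does not spell out. The function name and the input values are hypothetical and chosen only for illustration.

```python
def effective_population_size(neutral_diversity, mutation_rate, ploidy=2):
    """Estimate Ne from neutral diversity, assuming diversity = 2 * ploidy * Ne * mu."""
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    return neutral_diversity / (2 * ploidy * mutation_rate)

# Hypothetical inputs: per-site synonymous diversity and per-site,
# per-generation mutation rate.
pi_s = 0.001
mu = 1.25e-8
print(round(effective_population_size(pi_s, mu)))
```

Because the estimate is a simple ratio, any error in either input propagates proportionally, which is one way to see why the paragraph above calls polymorphism-based Ne fragile.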

An alternative approach to measure the efficiency of selection exploits codon usage bias, which is influenced by weak selection for factors such as translational speed and accuracy (Hershberg and Petrov, 2008; Plotkin and Kudla, 2011; Hunt et al., 2014). The degree of bias in synonymous codon usage that is driven by selective preference offers a more direct way to assess how effective selection is at the molecular level in a given species (Li, 1987; Bulmer, 1991; Akashi, 1996; Subramanian, 2008). Conveniently, it can be estimated from only a single genome, that is, without polymorphism or mutation rate data for that species.

One commonly used metric, the Codon Adaptation Index (CAI) (Sharp and Li, 1987; Sharp et al., 2010) takes the average of Relative Synonymous Codon Usage (RSCU) scores, which quantify how often a codon is used, relative to the codon that is most frequently used to encode that amino acid in that species. While this works well for comparing genes within the same species, it unfortunately means that the species-wide strength of codon bias appears in the normalizing denominator (see Equation 4 and Figure 3—figure supplement 1A). Paradoxically, this can make more exquisitely adapted species have lower rather than higher species-averaged CAI scores (Figure 3—figure supplement 1B; Rocha, 2004; Botzman and Margalit, 2011).

To compare species using CAI, it has been suggested that instead of taking a genome-wide average, one should consider a set of highly expressed reference genes (Sharp et al., 2005; Vicario et al., 2007; Subramanian, 2008; dos Reis and Wernisch, 2009). This approach assumes that the relative strength of selection on those reference genes (often a function of gene expression) remains approximately constant across the set of species considered (red distributions in Figure 2). Its use also requires careful attention to the length of reference genes (Urrutia and Hurst, 2001; Doherty and McInerney, 2013), and some approaches also require information about tRNA gene copy numbers and abundances (dos Reis and Wernisch, 2009).

Figure 2. More highly adapted species (bottom) have a higher proportion of their sites subject to effective selection on codon bias (blue area).

The Codon Adaptation Index (CAI) attempts to compare the intensity of selection (Figure 1, x-axis) in a subset of genes under strong selection (red areas). Given the narrow range of quantitative dependence of codon bias on sN shown in Figure 1, our new metric is intended to capture differences in the proportion of the proteome subject to substantial selection (blue areas).

Since codon bias varies quantitatively within only a small range of sN (Figure 1), a promising approach is to measure the proportion of sites at which codon adaptation is effective. We posit that more highly adapted species have a higher proportion of both genes and sites subject to effective selection on codon bias (Figure 2; Galtier et al., 2018). Indeed, CAI might also rely in part on variation in the fraction of sites within the reference genes that is subject to effective selection as a function of species (Figure 2, red). Here we take this logic further, considering all sites in a proteome-wide approach. Averaging across the entire proteome provides robustness to shifts in the expression level of or strength of selection on particular genes. The proteome-wide average depends on the fraction of sites whose selection coefficients exceed the ‘drift barrier’ for that particular species (Figure 2, blue threshold).

In estimating the effects of selection, it is critical to control for other causes of codon bias. In particular, species differ in their mutational bias with respect to the proportion of the genome that consists of guanine-cytosine base pairs (GC), and in the frequency of GC-biased gene conversion (Urrutia and Hurst, 2001; Duret and Galtier, 2009; Doherty and McInerney, 2013; Figuet et al., 2014). Here, we control for %GC, capturing species differences both in mutation and in gene conversion, by calculating the Kullback–Leibler divergence of the observed codon frequencies away from the codon frequencies that we would expect to see given the genomic %GC content of the species. Kullback–Leibler divergence measures the distance of an observed probability distribution from an expected reference distribution, capturing a measure of surprise (Kullback and Leibler, 1951). This method does not require us to specify preferred vs. non-preferred codons, and can thus also accommodate situations in which different genes have different codon preferences (Gingold et al., 2014; Cope et al., 2018).
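To illustrate the idea of measuring departure from a %GC-based expectation, the sketch below computes a Kullback–Leibler divergence for the synonymous codons of a single amino acid. It is a simplified illustration and not the CAIS calculation itself, which is defined in the Methods. The assumptions here are that expected codon probabilities are products of per-nucleotide probabilities, with G and C each at half the GC fraction and A and T each at half the remainder. The function names and the observed frequencies are hypothetical.

```python
import math

def expected_codon_probs(codons, gc):
    """Expected synonymous codon probabilities from nucleotide content alone."""
    # Assumption: G = C = gc / 2 and A = T = (1 - gc) / 2, sites independent.
    nuc_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    raw = {c: math.prod(nuc_p[n] for n in c) for c in codons}
    total = sum(raw.values())
    # Normalize within the amino acid so the probabilities sum to 1.
    return {c: p / total for c, p in raw.items()}

def kl_divergence(observed, expected):
    """Sum of o * ln(o / e) over codons; terms with o == 0 contribute 0."""
    return sum(o * math.log(o / expected[c]) for c, o in observed.items() if o > 0)

lysine = ["AAA", "AAG"]
expected = expected_codon_probs(lysine, gc=0.4)

# Hypothetical observed usage: one matching the expectation, one skewed to AAG.
unbiased = dict(expected)
biased = {"AAA": 0.3, "AAG": 0.7}
print(kl_divergence(unbiased, expected), kl_divergence(biased, expected))
```

The divergence is zero when observed usage matches the %GC-based expectation and grows as usage departs from it in either direction, so no codon needs to be labeled as preferred in advance.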

An alternative metric, the Effective Number of Codons (ENC), originally quantified how far the codon usage of a sequence departs from equal usage of synonymous codons (Wright, 1990), with lower ENC values indicating greater departure. This approach creates a complex relationship with GC content (Fuglsang, 2008), and so ENC was later modified to correct for GC content (Novembre, 2002). However, a remaining issue with this modified ENC is that differences among species in amino acid composition might act as a confounding factor, even after controlling for GC content. Specifically, species that make more use of an amino acid for which there is stronger selection among codons (which is sometimes the case; Vicario et al., 2007) would have higher codon bias, even if each amino acid considered on its own had identical codon bias irrespective of which species it is in. Confounding with amino acid frequencies has been shown to be a problem at the individual protein level (Cope et al., 2018). Neither ENC (Fuglsang, 2004; Fuglsang, 2008) nor the CAI (Sharp and Li, 1987) adequately controls for differences in amino acid composition when applied across species. Despite early claims to the contrary (Wright, 1990), this problem is not easy to fix for ENC (Fuglsang, 2004; Fuglsang, 2008).

Here, we extend the CAI, using the information-theory-based Kullback–Leibler divergence, so that it corrects for both GC and amino acid composition (see Methods) to create a new Codon Adaptation Index of Species (CAIS). The availability of a complete genome allows both CAIS and ENC to be readily calculated without data on polymorphism or mutation rate, without selecting reference genes, and without concerns about demographic history. Our purpose is to find an accessible metric that can quantify the limits to weak selection important to nearly neutral theory; this differs from past evaluations focused on comparing different genes of the same species and recapitulating ‘ground truth’ simulations thereof (Sun et al., 2013; Zhang et al., 2012; Liu et al., 2018). To demonstrate the usefulness of our method, we identify a novel correlation with intrinsic structural disorder (ISD), pointing to what else might be subject to weak selective preferences at the molecular level. While ENC can also identify subtle selection on ISD, CAIS can do so without the risk of confounding with amino acid frequencies.

Results

Both ENC and CAIS solve the GC confounding problem that plagues CAI

CAI is seriously confounded with GC content (Figure 3A). ENC is not confounded with GC content (Figure 3B), while CAIS has only a very weak correlation that is not significant after correction for multiple comparisons (Figure 3C).

Figure 3. Codon Adaptation Index (CAI) is seriously confounded with GC content (A), while Effective Number of Codons (ENC) and Codon Adaptation Index of Species (CAIS) are not (B and C).

We control for phylogenetic confounding via Phylogenetic Independent Contrasts (PIC) (Felsenstein, 1985); this yields an unbiased R² estimate (Rohlf, 2006). Each datapoint is one of 118 vertebrate species with ‘Complete’ intergenic genomic sequence (allowing for %GC correction) and TimeTree divergence dates (allowing for PIC correction). Red line shows unweighted lm(y ~ x) with gray region as 95% confidence interval. Figure 3—figure supplement 1 shows in more detail why CAI is not appropriate for species-wide effectiveness of selection measurements. Plots without PIC correction are shown in Figure 3—figure supplement 2. The impact of amino acid frequency correction on CAIS is shown in Figure 3—figure supplement 3.

Figure 3—figure supplement 1. Codon Adaptation Index (CAI) is not appropriate for species-wide effectiveness of selection measurements.

Each CAI value shown is averaged over an entire species’ proteome. (A) The value of CAI is driven by its normalizing denominator term, CAImax. (B) As a result, CAI is inversely proportional to Codon Adaptation Index of Species (CAIS). Each datapoint is one of 118 vertebrate species with ‘Complete’ intergenic genomic sequence available (allowing for %GC correction) and TimeTree divergence dates (allowing for Phylogenetic Independent Contrasts [PIC] correction). p-values shown are for Pearson’s correlation.
Figure 3—figure supplement 2. The same relationships are shown as in Figure 3, but without correction for phylogenetic confounding, suggesting GC confounding for the Effective Number of Codons (ENC) but not the Codon Adaptation Index of Species (CAIS).

Codon Adaptation Index (CAI) (A) and ENC (B) both correlate with genomic GC, but CAIS (C) does not. Red line shows lm(y ~ x), with gray region as 95% confidence interval. We use Phylogenetic Independent Contrasts (PIC) corrected results rather than these results because PIC correction removes non-independent errors to produce an unbiased R² estimate (Rohlf, 2006).
Figure 3—figure supplement 3. Vertebrate Codon Adaptation Index of Species (CAIS) values are not greatly affected by computation for a standardized amino acid composition vs. computation for the amino acid frequencies in the species in question.

Proteins in better adapted species evolve more structural disorder

As an example of how correlations with codon adaptation metrics can be used to identify weak selective preferences, we investigate protein ISD. Disordered proteins are more likely to be harmful when overexpressed (Vavouri et al., 2009), and ISD is more abundant in eukaryotic than prokaryotic proteins (Schad et al., 2011; Xue et al., 2012; Basile et al., 2019), suggesting that low ISD might be favored by more effective selection.

However, compositional differences among proteomes might not be driven by differences in how a given protein sequence evolves as a function of the effectiveness of selection. Instead, they might be driven by the recent birth of ISD-rich proteins in animals (James et al., 2021), and/or by differences among sequences in their subsequent tendency to proliferate into many different genes (James et al., 2023). To focus only on the effects of descent with modification, we use a linear mixed model, with each species having a fixed effect on ISD, while controlling for Pfam domain identity as a random effect. We note that once GC is controlled for, codon adaptation can be assessed similarly in intrinsically disordered vs. ordered proteins (Gossmann et al., 2012). Controlling for Pfam identity is supported, with standard deviation in ISD of 0.178 among Pfams compared to residual standard deviation of 0.058, and a p-value on the significance of the Pfam random effect term of 3 × 10⁻¹³. Controlling in this way for Pfam identity, we then ask whether the fixed species effects on ISD are correlated with CAIS and with ENC.

Surprisingly, more exquisitely adapted species have more disordered protein domains (Figure 4). Results using ENC and CAIS are similar, with ENC having higher power; the correlation coefficient is 0.36 for CAIS compared to 0.50 for ENC, and the p-value for ENC is 3 orders of magnitude lower. We note, however, that amino acid frequencies strongly influence ISD (Theillet et al., 2013). The CAIS correlation is more reliable than the ENC correlation because by construction, CAIS controls for differences in amino acid frequencies among species.

Figure 4. Protein domains have higher intrinsic structural disorder (ISD) when found in more exquisitely adapted species, according to (A) the Codon Adaptation Index of Species (CAIS) and (B) the Effective Number of Codons (ENC).

We plot -ENC rather than ENC to more easily compare results with those from CAIS. (C) Correcting for local rather than genome-wide %GC removes the relationship. Each datapoint is one of 118 vertebrate species with ‘complete’ intergenic genomic sequence available (allowing for %GC correction), and TimeTree divergence dates (allowing for Phylogenetic Independent Contrasts [PIC] correction). ‘Effects’ on ISD shown on the y-axis are fixed effects of species identity in our linear mixed model, after PIC correction. Red line shows unweighted lm(y ~ x) with gray region as 95% confidence interval. Panels without PIC correction are presented in Figure 4—figure supplement 1.

Figure 4—figure supplement 1. The same relationships are shown as in Figure 4, here without correction for phylogenetic confounding.

As in Figure 4, intrinsic structural disorder (ISD) of protein domains is higher in more highly adapted species, as measured by Codon Adaptation Index of Species (CAIS) (A) and Effective Number of Codons (ENC) (B), but not by CAIS calculated with local GC% rather than genome-wide GC% (C). ISD is calculated as in Figure 4. Red line shows lm(y ~ x), with gray region as 95% confidence interval.

Different parts of the genome have different GC contents (Bernardi, 2000; Eyre-Walker and Hurst, 2001; Lander et al., 2001), primarily because the extent to which GC-biased gene conversion increases GC content depends on the local rate of recombination (Galtier et al., 2001; Meunier and Duret, 2004; Duret et al., 2006; Duret and Galtier, 2009). We therefore also calculated a version of CAIS whose codon frequency expectations are based on local intergenic GC content. This performed worse (Figure 4C) than our simple use of genome-wide GC content (Figure 4A) with respect to the strength of correlation between CAIS and ISD. If GC-biased gene conversion is a more powerful force than weak selective preferences among codons, then local GC content will evolve more rapidly than codon usage (Kondrashov et al., 2010). In this case, genome-wide GC may serve as an appropriately time-averaged proxy. It is also possible that the local non-coding sequences we used were too short (at 3000 bp or more), creating excessive noise that obscured the signal.

Many vertebrates have higher recombination rates and hence GC-biased gene conversion near genes; in this case genome-wide GC content would misestimate the codon usage expected from the combination of mutation bias and GC-biased gene conversion in the vicinity of genes. If GC-biased gene conversion drove CAIS, we would expect high $|\overline{\mathrm{localGC}} - \mathrm{globalGC}|$ to predict high CAIS. We do not see this relationship (Figure 5), suggesting that gene conversion strength is not a confounding factor impacting CAIS.

Figure 5. Codon Adaptation Index of Species (CAIS) is not correlated with the degree to which local genomic regions differ in their GC content from global GC content.

If CAIS were driven by GC-biased gene conversion, genomes with more heterogeneous %GC distributions should have higher CAIS scores.

Younger animal-specific protein domains have higher ISD (James et al., 2021). It is possible that selection in favor of high ISD is strongest in young domains, which might use more primitive methods to avoid aggregation (Foy et al., 2019; Bertram and Masel, 2020). To test this, we analyze two subsets of our data: those that emerged prior to the last eukaryotic common ancestor (LECA), here referred to as ‘old’ protein domains, and ‘young’ protein domains that emerged after the divergence of animals and fungi from plants. Young and old domains show equally strong trends of increasing disorder with species’ adaptedness (Figure 6).

Figure 6. More exquisitely adapted species have higher intrinsic structural disorder (ISD) in both young (A and B) and old (C and D) protein domains, according to both the Codon Adaptation Index of Species (CAIS) (A, C), and the Effective Number of Codons (ENC) (B, D).

Age assignments are taken from James et al., 2021, with vertebrate protein domains that emerged prior to the last eukaryotic common ancestor (LECA) classified as ‘old’, and vertebrate protein domains that emerged after the divergence of animals and fungi from plants as ‘young’. ‘Effects’ on ISD shown on the y-axis are fixed effects of species identity in our linear mixed model. The same n = 118 datapoints are shown as in Figures 3 and 4. Red line shows lm(y ~ x), with gray region as 95% confidence interval. Panels without Phylogenetic Independent Contrasts (PIC) correction are shown in Figure 6—figure supplement 1.

Figure 6—figure supplement 1. Without correction for phylogenetic confounding, more highly adapted species have higher intrinsic structural disorder (ISD) in both young (A and B) and old (C and D) protein domains, according to both the Codon Adaptation Index of Species (CAIS) (A, C), and the Effective Number of Codons (ENC) (B, D).

Age assignments and ISD effects are calculated as in Figure 6. The same n = 118 datapoints are shown as in Figures 3–5. Red line shows lm(y ~ x), with gray region as 95% confidence interval.

Discussion

When different properties are each causally affected by a species’ exquisiteness of adaptation, this will create a correlation between the properties. We use codon adaptation as a reference property, such that correlations with codon adaptation indicate selection. To detect ISD as a novel property under selection, we used a linear mixed model approach that controls for Pfam identity as a random effect. This approach shows that the same Pfam domain tends to be more disordered when found in a well-adapted species (i.e. a species with a higher CAIS or ENC). This is true for both ancient and recently emerged protein domains.

It is important that no additional variable such as GC content or amino acid frequencies creates a spurious correlation by affecting both CAIS and our property of interest. For this reason, we define CAIS as the observed Kullback–Leibler divergence (Kullback and Leibler, 1951) from the codon usage expected given the GC content. The GC content pertinent to this expectation depends primarily on mutation bias and GC-biased gene conversion (Romiguier and Roux, 2017), but potentially also on selection on individual nucleotide substitutions that is hypothesized to favor higher %GC (Long et al., 2018). By controlling for %GC, we exclude all these forces from influencing CAIS or ENC. We thus capture the extent of adaptation in codon bias, including translational speed, accuracy, and any intrinsic preference for GC over AT that is specific to coding regions. These remaining codon-adaptive factors do not create a statistically convincing correlation between CAIS and GC (Figure 3C), nor between ENC and GC (Figure 3B), although CAI is strongly correlated with GC (Figure 3A). Notably, our new CAIS metric of codon adaptation controls for amino acid frequencies, rather than, like ENC, only GC content.

A direct effect of ISD on fitness agrees with studies of random Open Reading Frames (ORFs) in Escherichia coli, where fitness was driven more by amino acid composition than %GC content, after controlling for the intrinsic correlation between the two (Kosinski et al., 2022). However, we have not ruled out a role for selection for higher %GC in ways that are general rather than restricted to coding regions, whether in shaping mutational biases (Smith and Eyre-Walker, 2001; Hershberg and Petrov, 2009; Hildebrand et al., 2010; Novoa et al., 2019; Forcelloni and Giansanti, 2020) or the extent of gene conversion, or even at the single-nucleotide level in a manner shared between coding regions and intergenic regions (Long et al., 2018).

A more complex metric could control for more than just GC content and amino acid frequencies. First vs. second vs. third codon positions have different nucleotide usage on average, but while correcting for this might be useful for comparing genes (Zhang et al., 2012), correcting for it while comparing species might remove the effect of interest. Similarly, while it might be useful to control for dinucleotide and trinucleotide frequencies (Brbić et al., 2015), to avoid circularity these would need to be taken from intergenic sequences, with care needed to avoid influence from unannotated protein-coding genes or even pseudogenes.

Note that if a species were to experience a sudden reduction in census population size, for example due to habitat loss, leading to less effective selection, it would take some multiple of the neutral coalescent time for CAIS to fully adjust. CAIS thus represents a relatively long-term historical pattern of adaptation. The timescales setting neutral polymorphism-based Ne estimates are likely shorter, based on a single round of coalescence. It is possible that the reason that we obtained correlations when we controlled for genome-wide GC content, but not when we controlled for local GC content, is also that codon adaptation adjusts slowly relative to the timescale of fluctuations in local GC content.

Here, we developed a new metric of species adaptedness at the codon level, capable of quantifying degrees of codon adaptation even among vertebrates. We chose vertebrates partly due to the abundance of suitable data, and partly as a stringent test case, given past studies finding limited evidence for codon adaptation (Kessler and Dean, 2014). It remains to be seen how CAIS behaves among species with stronger codon adaptation. We restricted our analysis to only the best annotated genomes, in part to ensure the quality of intergenic %GC estimates, and in part limited by the feasibility of running linear mixed models with 6 million datapoints. The phylogenetic tree is well resolved for vertebrate species, with an overrepresentation of mammalian species. Despite the focus on vertebrates, we were able to discover new results regarding selection on ISD.

Our finding that more effective selection prefers higher ISD was unexpected, given that lower-Ne eukaryotes have more disordered proteins than higher-Ne prokaryotes (Ahrens et al., 2017; Basile et al., 2019). However, this can be reconciled in a model in which highly disordered sequences are less likely to be found in high-Ne species, but the sequences that are present tend to have slightly higher disorder than their low-Ne homologs. High ISD might help mitigate the trade-off between affinity and specificity in protein–protein interactions (Dunker et al., 1998; Huang and Liu, 2013; Lazar et al., 2022); non-specific interactions might be short-lived due to the high entropy associated with disorder, which specific interactions are robust to.

Codon adaptation metrics more directly quantify how species vary in their exquisiteness of adaptation, than do estimates of effective population size that are based on neutral polymorphism. Both CAIS and ENC can also be estimated for far more species because they do not require polymorphism or mutation rate data, nor tRNA gene copy numbers and abundances, but only a single complete genome. CAIS has the additional advantage of not being confounded with amino acid frequencies. This makes CAIS a useful tool for applying nearly neutral theory to protein evolution, as shown by our worked example of ISD.

Methods

Key resources table.

| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
| --- | --- | --- | --- | --- |
| Software, algorithm | IUPRED2 | DOI: https://doi.org/10.1093/nar/gky384 | RRID:SCR_014632 | |
| Software, algorithm | Codon Adaptation Index of Species | This paper | | See Materials and methods |
| Software, algorithm | Codon Adaptation Index | DOI: https://doi.org/10.1093/nar/15.3.1281 | | |
| Software, algorithm | ape | DOI: https://doi.org/10.1093/bioinformatics/bty633 | RRID:SCR_017343 | R package |
| Software, algorithm | Effective Number of Codons | DOI: https://doi.org/10.1093/oxfordjournals.molbev.a004201 | | |

Species

Pfam sequences and IUPRED2 estimates of ISD predictions were taken from James et al., 2021, who studied species marked as ‘Complete’ in the GOLD database, with divergence dates available in TimeTree (Kumar et al., 2017). James et al., 2021 applied a variety of quality controls to exclude contaminants from the set of Pfams and assign accurate dates of Pfam emergence. Pfams that emerged prior to LECA are classified here as ‘old’, and Pfams that emerged after the divergence of animals and fungi from plants are classified as ‘young’, following annotation by James et al., 2021. Species list and other information can be found at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species (copy archived at MaselLab, 2024).

Codon Adaptation Index

Sharp and Li, 1987 quantified codon bias through the CAI, a normalized geometric mean of synonymous codon usage bias across sites, excluding stop and start codons. We modify this to calculate CAI including stop and start codons, because of documented preferences among stop codons in mammals (Wangen and Green, 2020). While CAI is usually used to compare genes within a species, among-species comparisons can be made using a reference set of genes that are highly expressed (Sharp and Li, 1987). Each codon i is assigned an RSCU value:

$$\mathrm{RSCU}_i = \frac{N_i}{\frac{1}{n_a}\sum_{j=1}^{n_a} N_j}, \tag{1}$$

where $N_i$ denotes the number of times that codon i is used, and the denominator is the mean count over all $n_a$ codons that code for that specific amino acid. RSCU values are normalized to produce a relative adaptiveness value $w_i$ for each codon, relative to the best adapted codon for that amino acid:

$$w_i = \frac{\mathrm{RSCU}_i}{\mathrm{RSCU}_{max}}. \tag{2}$$
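As a hedged illustration of Equations 1 and 2, the following minimal Python sketch computes RSCU and relative adaptiveness from a table of codon counts. The function name `rscu_and_w`, the two-amino-acid toy code table, and the counts are hypothetical choices made for this example and are not the scripts used in this study.

```python
from collections import defaultdict


def rscu_and_w(counts, codon_to_aa):
    """Return (RSCU, w) dictionaries following Equations 1 and 2."""
    synonyms = defaultdict(list)
    for codon, aa in codon_to_aa.items():
        synonyms[aa].append(codon)

    rscu, w = {}, {}
    for aa, codons in synonyms.items():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid never observed: RSCU is undefined
        mean_count = total / len(codons)  # (1 / n_a) * sum_j N_j
        for c in codons:
            rscu[c] = counts.get(c, 0) / mean_count
        rscu_max = max(rscu[c] for c in codons)
        for c in codons:
            w[c] = rscu[c] / rscu_max
    return rscu, w


# Hypothetical toy data: two amino acids with two synonymous codons each.
toy_code = {"AAA": "K", "AAG": "K", "GAT": "D", "GAC": "D"}
toy_counts = {"AAA": 30, "AAG": 10, "GAT": 5, "GAC": 5}
rscu, w = rscu_and_w(toy_counts, toy_code)
```

In this toy case the lysine codon AAA is used three times as often as AAG, giving RSCU values of 1.5 and 0.5 and relative adaptiveness values of 1 and 1/3, while the evenly used aspartate codons both have RSCU and w equal to 1.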

Let L be the number of codons across all protein-coding sequences considered. Then

$$\mathrm{CAI} = \left[\prod_{i=1}^{L} w_i\right]^{\frac{1}{L}}. \tag{3}$$

To understand the effects of normalization, it is useful to rewrite this as:

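$$\mathrm{CAI} = \left[\prod_{i=1}^{L} \frac{\mathrm{RSCU}_i}{\mathrm{RSCU}_{max}}\right]^{\frac{1}{L}} = \frac{\mathrm{CAI}_{raw}}{\mathrm{CAI}_{max}}, \tag{4}$$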

where $\mathrm{CAI}_{raw}$ is the geometric mean of the ‘unnormalized’ or observed synonymous codon usages, and $\mathrm{CAI}_{max}$ is the maximum possible CAI given the observed codon frequencies.
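The decomposition in Equation 4 can be checked numerically. The sketch below uses hypothetical RSCU values and a hypothetical four-codon sequence, and computes each geometric mean in log space so that long sequences do not underflow. It assumes every RSCU value is positive, since a codon with zero usage in the reference set would make the logarithm undefined.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(values)))."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical RSCU values; rscu_max holds, for each codon, the largest
# RSCU among the codons of the same amino acid.
rscu = {"AAA": 1.5, "AAG": 0.5, "GAT": 1.0, "GAC": 1.0}
rscu_max = {"AAA": 1.5, "AAG": 1.5, "GAT": 1.0, "GAC": 1.0}

sequence = ["AAA", "AAG", "GAT", "AAA"]  # L = 4 codons

cai_direct = geometric_mean([rscu[c] / rscu_max[c] for c in sequence])
cai_raw = geometric_mean([rscu[c] for c in sequence])
cai_max = geometric_mean([rscu_max[c] for c in sequence])
```

Only the AAG codon is suboptimal here, so the direct calculation and the ratio of `cai_raw` to `cai_max` both equal (1/3)^(1/4).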

GC content

We calculated total %GC content (intergenic and genic) during a scan of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019 (described in James et al., 2023). Of the 170 vertebrates meeting the quality criteria of James et al., 2021, 118 had annotated intergenic sequences within NCBI, so we restricted the dataset further to keep only the 118 species for which total GC content was available.

Codon Adaptation Index of Species

Controlling for GC bias in synonymous codon usage

Consider a sequence region r within species s where each nucleotide has an expected probability $g_r$ of being G or C. For our main analysis, we consider just one region r encompassing the entire genome of a species s. In a secondary analysis, we break the genome up and use local values of $g_r$ in the non-coding regions within and surrounding a gene or set of overlapping genes. To annotate the boundaries of these local regions, we first selected 1500 base pairs flanking each side of every coding sequence identified by NCBI annotations. Coding sequence annotations are broken up according to exon by NCBI. When coding sequences of the same gene did not fall within 3000 base pairs of each other, they were treated as different regions. When two coding sequences, whether from the same gene or from different genes, had overlapping 1500 bp catchment areas, we merged them together. We then calculated $g_r$ based on the non-coding sites within each region, including both genic regions such as promoters and non-genic regions such as introns and intergenic sequences.
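One way to implement the region annotation described above is to pad each coding sequence by the flank length and then merge padded intervals that overlap. The following is a minimal sketch under stated assumptions rather than the code used here: coordinates are half-open (start, end) pairs on a single contig, intervals that merely touch are kept separate, and strand and contig length are ignored. The function name and the coordinates are hypothetical.

```python
def merge_catchments(cds_intervals, flank=1500):
    """Pad each CDS by `flank` bp and merge overlapping catchment areas."""
    padded = sorted((max(0, start - flank), end + flank)
                    for start, end in cds_intervals)
    regions = []
    for start, end in padded:
        if regions and start < regions[-1][1]:
            # Overlaps the previous catchment area: extend that region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Hypothetical CDS coordinates (half-open, in base pairs).
cds = [(10000, 10300), (12000, 12500), (20000, 20200)]
regions = merge_catchments(cds)
```

The first two coding sequences lie 1700 bp apart, so their 1500 bp catchment areas overlap and they form a single region; the third lies 7500 bp further on and forms its own region.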

With no bias between C vs. G, nor between A vs. T, nor patterns beyond the overall composition taken one nucleotide at a time, the expected probability of seeing codon i in a triplet within r is

$$p_{i,r} = \left(\frac{g_r}{2}\right)^{k_{GC}} \left(\frac{1-g_r}{2}\right)^{k_{AT}}, \tag{5}$$

where $k_{GC}$ and $k_{AT}$ are the numbers of G/C and A/T positions in codon i, so that $k_{GC} + k_{AT} = 3$ total positions. The expected probability that amino acid a in region r is encoded by codon i is

$$E_{i,r} = \frac{p_{i,r}}{\sum_{j=1}^{n_a} p_{j,r}}. \tag{6}$$
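Equations 5 and 6 can be sketched in a few lines of Python. The example below assumes the standard genetic code with stop codons treated as one synonymous family (denoted `*`); the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODON_TO_AA = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def null_probability(codon, g):
    """Equation 5: probability of a triplet given GC content g alone."""
    k_gc = sum(base in "GC" for base in codon)
    k_at = 3 - k_gc
    return (g / 2) ** k_gc * ((1 - g) / 2) ** k_at


def expected_codon_usage(g):
    """Equation 6: null probabilities renormalized within each amino acid."""
    p = {codon: null_probability(codon, g) for codon in CODON_TO_AA}
    totals = {}
    for codon, aa in CODON_TO_AA.items():
        totals[aa] = totals.get(aa, 0.0) + p[codon]
    return {codon: p[codon] / totals[CODON_TO_AA[codon]] for codon in p}


expected = expected_codon_usage(0.4)
```

With a GC content of 0.4, for example, the two lysine codons have null probabilities 0.027 (AAA) and 0.018 (AAG), so AAA is expected to encode 60% of lysines in the absence of any codon preference.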

We can then measure the degree to which the observed codon frequencies diverge from these expected probabilities using the Kullback–Leibler divergence. This gives a CAIS metric for a species s, where $O_{i,s}$ is the observed frequency of codon i:

$$\mathrm{CAIS}(s) = \sum_{i=1}^{64} O_{i,s} \log\left(\frac{O_{i,s}}{E_{i,s}}\right). \tag{7}$$
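To illustrate how the Kullback–Leibler term behaves, the sketch below evaluates it for the codons of a single amino acid. It assumes natural logarithms and the convention that codons with zero observed frequency contribute nothing to the sum; both choices, like the function name and the frequencies, are assumptions of this example.

```python
import math


def kl_divergence(observed, expected):
    """Sum of O_i * log(O_i / E_i); terms with O_i = 0 contribute 0."""
    return sum(o * math.log(o / expected[codon])
               for codon, o in observed.items() if o > 0)


# Null expectation for lysine at a GC content of 0.4 (from Equation 6).
expected_lys = {"AAA": 0.6, "AAG": 0.4}

no_bias = kl_divergence({"AAA": 0.6, "AAG": 0.4}, expected_lys)
gc_rich_bias = kl_divergence({"AAA": 0.4, "AAG": 0.6}, expected_lys)
```

The divergence is zero when usage matches the GC-based expectation and positive otherwise; in the second case it equals 0.2 ln(1.5), regardless of which codon is the favored one.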

Controlling for amino acid composition

Some amino acids may be more intrinsically prone to codon bias. We want a metric that quantifies effectiveness of selection (not amino acid frequency), so we re-weight CAIS on the basis of a standardized amino acid composition, to remove the effect of variation among species in amino acid frequencies.

Let $F_a$ be the frequency of amino acid a across the entire dataset of 118 vertebrate genomes. We want to re-weight $O_{i,s}$ on the basis of $F_a$ to ensure that differences in amino acid frequencies among species do not affect CAIS, while preserving relative codon frequencies for the same amino acid. We do this by solving for $\alpha_{a,s}$ so that

$$F_a = \alpha_{a,s} \sum_{j=1}^{n_a} O_{j,s}. \tag{8}$$

We then define $f'_{i,s} = \alpha_{a,s} O_{i,s}$ to obtain an amino acid frequency adjusted CAIS:

$$\mathrm{CAIS}(s) = \sum_{i=1}^{64} f'_{i,s} \log\left(\frac{O_{i,s}}{E_{i,s}}\right). \tag{9}$$

The $F_a$ values for our species set are at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species/blob/main/CAIS_ENC_calculation/Total_amino_acid_frequency_vertebrates.txt. Use of the standardized set of amino acid frequencies $F_a$ has only a small effect on computed CAIS values relative to using each vertebrate species’ own amino acid frequencies (Figure 3—figure supplement 3).
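The re-weighting in Equations 8 and 9 can be sketched as follows. This is an illustrative reading, not the repository code: it assumes natural logarithms, a single genome-wide GC content g, and that the ratio inside the logarithm compares within-amino-acid codon frequencies with the expectation of Equation 6. Under that reading the weight on each codon is $F_a$ times its frequency among synonymous codons. The function names, weights, and counts are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def null_probability(codon, g):
    """Equation 5 for a region (here the whole genome) with GC content g."""
    k_gc = sum(base in "GC" for base in codon)
    return (g / 2) ** k_gc * ((1 - g) / 2) ** (3 - k_gc)


def cais(codon_counts, g, standard_aa_freqs):
    """Amino-acid-adjusted CAIS (Equation 9), natural log assumed."""
    total = 0.0
    for aa, f_a in standard_aa_freqs.items():
        codons = [c for c, a in CODON_TO_AA.items() if a == aa]
        n_a = sum(codon_counts.get(c, 0) for c in codons)
        if n_a == 0:
            continue  # amino acid absent from these counts
        p_a = sum(null_probability(c, g) for c in codons)
        for c in codons:
            observed = codon_counts.get(c, 0) / n_a   # within-amino-acid frequency
            expected = null_probability(c, g) / p_a   # Equation 6
            if observed > 0:
                # f' = F_a * observed is the re-weighted codon frequency
                total += f_a * observed * math.log(observed / expected)
    return total


# Hypothetical inputs: only lysine (K) and aspartate (D) carry weight.
weights = {"K": 0.5, "D": 0.5}
counts = {"AAA": 40, "AAG": 60, "GAT": 60, "GAC": 40}
value = cais(counts, g=0.4, standard_aa_freqs=weights)

# Making lysine ten times more common leaves the metric unchanged.
lysine_rich = {"AAA": 400, "AAG": 600, "GAT": 60, "GAC": 40}
value_lysine_rich = cais(lysine_rich, g=0.4, standard_aa_freqs=weights)
```

Because each amino acid enters only through its standardized weight, changing how often lysine is used, without changing the relative usage of its codons, does not change the result.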

CAIS corrected for local intergenic GC content but not species-wide amino acid composition is

$$\mathrm{CAIS}_{localGC}(s) = \left(\prod_{r=1}^{G} \sum_{i=1}^{64} O_{i,r} \log\left(\frac{O_{i,r}}{E_{i,r}}\right)\right)^{\frac{1}{L}}, \tag{10}$$

where $O_{i,r}$ is the number of times codon i appears in region r of species s, $E_{i,r}$ is the expected number of times codon i would appear in region r of species s given the local intergenic GC content, G is the number of regions, and $L = \sum_{r=1}^{G} \sum_{i=1}^{64} O_{i,r}$ is the total number of codons in the genome. Rewritten for greater computational ease:

$$\mathrm{CAIS}_{localGC}(s) = e^{\frac{1}{L} \sum_{r=1}^{G} \ln\left(\sum_{i=1}^{64} O_{i,r} \log\left(\frac{O_{i,r}}{E_{i,r}}\right)\right)}. \tag{11}$$

Given the limited impact of amino acid frequency correction, we used Equation 11 for the local GC results, but we could correct for amino acid composition by replacing the $O_{i,r}$ prefactor with $f'_{i,s}$, or even $f'_{i,r}$.
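The computational advantage of Equation 11 over Equation 10 can be seen in a short sketch. It takes the per-region sums as given and assumes each is strictly positive, since the logarithm is otherwise undefined. The function names and all numbers are hypothetical.

```python
import math


def cais_local_gc(region_terms, total_codons):
    """Equation 11: exponential of the summed log region terms over L."""
    return math.exp(sum(math.log(t) for t in region_terms) / total_codons)


def cais_local_gc_naive(region_terms, total_codons):
    """Equation 10 evaluated literally, as a product raised to 1/L."""
    running_product = 1.0
    for term in region_terms:
        running_product *= term
    return running_product ** (1 / total_codons)


# Small hypothetical case: both forms agree.
small = [2.0, 8.0]
stable_small = cais_local_gc(small, total_codons=4)
naive_small = cais_local_gc_naive(small, total_codons=4)

# Many regions with small terms: the literal product underflows to zero.
many = [1e-3] * 5000
stable_many = cais_local_gc(many, total_codons=1_000_000)
naive_many = cais_local_gc_naive(many, total_codons=1_000_000)
```

With thousands of regions, the running product in the literal form leaves the range of double-precision floating point, whereas the log-space form remains well behaved.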

Novembre’s ENC controlled for total GC content

The effective number of codons is based on the squared deviations $X_a^2$ of the frequencies of the codons for each amino acid a from null expectations:

$$X_a^2 = \sum_{i=1}^{n_a} \frac{N_a (O_i - E_i)^2}{E_i}, \tag{12}$$

where $N_a$ is the total number of times that amino acid a appears. Novembre, 2002 defines the corrected ‘F value’ of amino acid a as

$$\hat{F}_a = \frac{X_a^2 + N_a - n_a}{n_a (N_a - 1)} \tag{13}$$

and

$$\mathrm{ENC} = 2 + \frac{9}{\hat{F}_2} + \frac{1}{\hat{F}_3} + \frac{5}{\hat{F}_4} + \frac{3}{\hat{F}_6}, \tag{14}$$

where each $\hat{F}_{n_a}$ is the average of the ‘F values’ for amino acids with $n_a$ synonymous codons. Past measures of ENC do not contain stop or start codons (Wright, 1990; Novembre, 2002; Fuglsang, 2004), but as we did for CAI and CAIS above, we include stop codons as an ‘amino acid’ and therefore amend Equation 14 to

$$\mathrm{ENC} = 2 + \frac{9}{\hat{F}_2} + \frac{2}{\hat{F}_3} + \frac{5}{\hat{F}_4} + \frac{3}{\hat{F}_6}. \tag{15}$$
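Equations 12, 13, and 15 can be combined in a short sketch. It assumes the standard genetic code with the three stop codons treated as one threefold ‘amino acid’, takes $O_i$ as the observed within-amino-acid codon frequency, and takes the null frequencies $E_i$ as input. Function names and counts are hypothetical, and the example does not reproduce the repository code.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def f_hat(counts, expected):
    """Equations 12 and 13 for one amino acid.

    counts: codon -> observed count; expected: codon -> null frequency E_i.
    """
    n_total = sum(counts.values())   # N_a
    n_codons = len(expected)         # n_a
    x2 = sum(n_total * (counts.get(c, 0) / n_total - e) ** 2 / e
             for c, e in expected.items())
    return (x2 + n_total - n_codons) / (n_codons * (n_total - 1))


def enc_with_stops(codon_counts, expected_by_codon):
    """Equation 15: stop codons ('*') count as a threefold 'amino acid'."""
    f_by_class = {2: [], 3: [], 4: [], 6: []}
    for aa, codons in SYNONYMS.items():
        if len(codons) == 1:
            continue  # Met and Trp enter through the constant 2
        counts = {c: codon_counts.get(c, 0) for c in codons}
        expected = {c: expected_by_codon[c] for c in codons}
        f_by_class[len(codons)].append(f_hat(counts, expected))
    enc = 2.0
    for values in f_by_class.values():
        mean_f = sum(values) / len(values)
        enc += len(values) / mean_f  # class sizes are 9, 2, 5 and 3
    return enc


# Null expectation for a GC content of 0.5: synonymous codons equally likely.
uniform = {c: 1 / len(SYNONYMS[aa]) for c, aa in CODON_TO_AA.items()}
even_usage = {c: 1000 for c in CODON_TO_AA}
one_codon_only = {codons[0]: 1000 for codons in SYNONYMS.values()}
enc_even = enc_with_stops(even_usage, uniform)
enc_biased = enc_with_stops(one_codon_only, uniform)
```

With perfectly even usage the result is close to 64, the total number of codons once stop codons are included, and with a single codon used per amino acid it falls to 21, the number of amino acids plus the stop family.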

Statistical analysis

All statistical modeling was done in R 3.5.1. Scripts for calculating CAI and CAIS were written in Python 3.7.

Phylogenetic Independent Contrasts

Spurious phylogenetically confounded correlations can occur when closely related species share similar values of both metrics. One danger of such pseudoreplication is Simpson’s paradox, where negative slopes within taxonomic groups and a positive slope among them might combine to yield an overall positive slope. We avoid pseudoreplication by using Phylogenetic Independent Contrasts (PIC) (Felsenstein, 1985) to assess correlation. PIC analysis was done using the R package ‘ape’ (Paradis and Schliep, 2019).
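The Simpson’s paradox scenario described above can be made concrete with invented numbers. The sketch below is purely illustrative and does not implement PIC: it uses two hypothetical clades of three species each, with made-up trait values, and compares ordinary least-squares slopes within clades with the slope obtained when all species are pooled as if independent.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up trait values for two hypothetical clades.
clade_a_x, clade_a_y = [1, 2, 3], [3, 2, 1]
clade_b_x, clade_b_y = [11, 12, 13], [13, 12, 11]

slope_a = ols_slope(clade_a_x, clade_a_y)
slope_b = ols_slope(clade_b_x, clade_b_y)
slope_pooled = ols_slope(clade_a_x + clade_b_x, clade_a_y + clade_b_y)
```

Both within-clade slopes are negative, yet the pooled slope is positive because the single difference between the two clades dominates, which is the kind of pseudoreplication that phylogenetic contrasts are designed to avoid.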

Acknowledgements

We thank Luke Kosinski, David Liberles, and Sawsan Wehbi for helpful discussions, Paul Nelson for providing the genome-wide GC contents, and the University of Arizona Undergraduate Biology Research Program for training. We thank Gavin Douglas for writing a convenient end-to-end implementation of CAIS on the basis of our preprint, which can be found at https://github.com/gavinmdouglas/handy_pop_gen/blob/main/CAIS.py, and for catching a minor bug in our code in time for us to correct it in the version of record. We thank the anonymous reviewers for constructive feedback, and Laurent Duret for helpful elaboration on the concerns of reviewer 1.

Funding Statement

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Contributor Information

Joanna Masel, Email: masel@arizona.edu.

C Brandon Ogbunugafor, Yale University, United States.

Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany.

Funding Information

This paper was supported by the following grants:

  • National Institutes of Health GM104040 to Catherine A Weibel, Jennifer E James, Sara M Willis, Joanna Masel.

  • National Institutes of Health GM132008 to Andrew L Wheeler.

  • John Templeton Foundation 60814 to Catherine A Weibel, Jennifer E James, Sara M Willis, Joanna Masel.

  • Arnold and Mabel Beckman Foundation Scholars Program to Catherine A Weibel.

  • National Science Foundation WAESO/LSAMP Cooperative Agreement HRD-1101728 to Catherine A Weibel.

  • National Aeronautics and Space Administration Arizona NASA Space Grant Consortium, Cooperative Agreement 80NSSC20M0041 to Catherine A Weibel.

  • National Science Foundation Graduate Research Fellowship Program to Hanon McShea.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing.

Formal analysis, Investigation, Visualization, Methodology, Writing – review and editing.

Resources, Data curation, Supervision, Investigation, Methodology, Writing – original draft.

Resources, Data curation, Supervision.

Methodology, Writing – review and editing.

Conceptualization, Formal analysis, Supervision, Funding acquisition, Methodology, Writing – original draft, Project administration, Writing – review and editing.

Additional files

MDAR checklist

Data availability

There is no new data. Processed data and code underlying this article are available in the public repository at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species (copy archived at MaselLab, 2024).

The following previously published dataset was used:

Jennifer J, Sara W, Paul N, Catherine W, Luke K, Joanna M. 2020. Data from: Universal and taxon-specific trends in protein sequences as a function of age. figshare.

References

  1. Ahrens JB, Nunez-Castilla J, Siltberg-Liberles J. Evolution of intrinsic disorder in eukaryotic proteins. Cellular and Molecular Life Sciences. 2017;74:3163–3174. doi: 10.1007/s00018-017-2559-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Akashi H. Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster. Genetics. 1996;144:1297–1307. doi: 10.1093/genetics/144.3.1297. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Basile W, Salvatore M, Bassot C, Elofsson A. Why do eukaryotic proteins contain more intrinsically disordered regions? PLOS Computational Biology. 2019;15:e1007186. doi: 10.1371/journal.pcbi.1007186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bernardi G. Isochores and the evolutionary genomics of vertebrates. Gene. 2000;241:3–17. doi: 10.1016/s0378-1119(99)00485-0. [DOI] [PubMed] [Google Scholar]
  5. Bertram J, Masel J. Evolution rapidly optimizes stability and aggregation in lattice proteins despite pervasive landscape valleys and mazes. Genetics. 2020;214:1047–1057. doi: 10.1534/genetics.120.302815. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Botzman M, Margalit H. Variation in global codon usage bias among prokaryotic organisms is associated with their lifestyles. Genome Biology. 2011;12:R109. doi: 10.1186/gb-2011-12-10-r109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Brbić M, Warnecke T, Kriško A, Supek F. Global shifts in genome and proteome composition are very tightly coupled. Genome Biology and Evolution. 2015;7:1519–1532. doi: 10.1093/gbe/evv088. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991;129:897–907. doi: 10.1093/genetics/129.3.897. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Charlesworth B. Fundamental concepts in genetics: effective population size and patterns of molecular evolution and variation. Nature Reviews. Genetics. 2009;10:195–205. doi: 10.1038/nrg2526. [DOI] [PubMed] [Google Scholar]
  10. Cope AL, Hettich RL, Gilchrist MA. Quantifying codon usage in signal peptides: Gene expression and amino acid usage explain apparent selection for inefficient codons. Biochimica et Biophysica Acta. Biomembranes. 2018;1860:2479–2485. doi: 10.1016/j.bbamem.2018.09.010. [DOI] [PubMed] [Google Scholar]
  11. Daubin V, Moran NA. Comment on “The origins of genome complexity.”. Science. 2004;306:978. doi: 10.1126/science.1100559. [DOI] [PubMed] [Google Scholar]
  12. Doherty A, McInerney JO. Translational selection frequently overcomes genetic drift in shaping synonymous codon usage patterns in vertebrates. Molecular Biology and Evolution. 2013;30:2263–2267. doi: 10.1093/molbev/mst128. [DOI] [PubMed] [Google Scholar]
  13. dos Reis M, Wernisch L. Estimating translational selection in eukaryotic genomes. Molecular Biology and Evolution. 2009;26:451–461. doi: 10.1093/molbev/msn272. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Doyle JM, Hacking CC, Willoughby JR, Sundaram M, DeWoody JA. Mammalian genetic diversity as a function of habitat, body size, trophic class, and conservation status. Journal of Mammalogy. 2015;96:564–572. doi: 10.1093/jmammal/gyv061. [DOI] [Google Scholar]
  15. Dunker AK, Garner E, Guilliot S, Romero P, Albrecht K, Hart J, Obradovic Z, Kissinger C, Villafranca JE. Protein Disorder and the Evolution of Molecular Recognition: Theory, Predictions and Observations. Pac Symp Biocomput. 1998:473–484. [PubMed] [Google Scholar]
  16. Duret L, Eyre-Walker A, Galtier N. A new perspective on isochore evolution. Gene. 2006;385:71–74. doi: 10.1016/j.gene.2006.04.030. [DOI] [PubMed] [Google Scholar]
  17. Duret L, Galtier N. Biased gene conversion and the evolution of mammalian genomic landscapes. Annual Review of Genomics and Human Genetics. 2009;10:285–311. doi: 10.1146/annurev-genom-082908-150001. [DOI] [PubMed] [Google Scholar]
  18. Eyre-Walker A, Hurst LD. The evolution of isochores. Nature Reviews. Genetics. 2001;2:549–555. doi: 10.1038/35080577. [DOI] [PubMed] [Google Scholar]
  19. Felsenstein J. Phylogenies and the comparative method. The American Naturalist. 1985;125:1–15. doi: 10.1086/284325. [DOI] [PubMed] [Google Scholar]
  20. Figuet E, Ballenghien M, Romiguier J, Galtier N. Biased gene conversion and GC-content evolution in the coding sequences of reptiles and vertebrates. Genome Biology and Evolution. 2014;7:240–250. doi: 10.1093/gbe/evu277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Forcelloni S, Giansanti A. Evolutionary forces and codon bias in different flavors of intrinsic disorder in the human proteome. Journal of Molecular Evolution. 2020;88:164–178. doi: 10.1007/s00239-019-09921-4. [DOI] [PubMed] [Google Scholar]
  22. Foy SG, Wilson BA, Bertram J, Cordes MHJ, Masel J. A shift in aggregation avoidance strategy marks a long-term direction to protein evolution. Genetics. 2019;211:1345–1355. doi: 10.1534/genetics.118.301719. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Fuglsang A. The ‘effective number of codons’ revisited. Biochemical and Biophysical Research Communications. 2004;317:957–964. doi: 10.1016/j.bbrc.2004.03.138. [DOI] [PubMed] [Google Scholar]
  24. Fuglsang A. Impact of bias discrepancy and amino acid usage on estimates of the effective number of codons used in a gene, and a test for selection on codon usage. Gene. 2008;410:82–88. doi: 10.1016/j.gene.2007.12.001. [DOI] [PubMed] [Google Scholar]
  25. Galtier N, Piganeau G, Mouchiroud D, Duret L. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics. 2001;159:907–911. doi: 10.1093/genetics/159.2.907. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Galtier N, Roux C, Rousselle M, Romiguier J, Figuet E, Glémin S, Bierne N, Duret L. Codon usage bias in animals: Disentangling the effects of natural selection, effective population size, and GC-biased gene conversion. Molecular Biology and Evolution. 2018;35:1092–1103. doi: 10.1093/molbev/msy015. [DOI] [PubMed] [Google Scholar]
  27. Gingold H, Tehler D, Christoffersen NR, Nielsen MM, Asmar F, Kooistra SM, Christophersen NS, Christensen LL, Borre M, Sørensen KD, Andersen LD, Andersen CL, Hulleman E, Wurdinger T, Ralfkiær E, Helin K, Grønbæk K, Ørntoft T, Waszak SM, Dahan O, Pedersen JS, Lund AH, Pilpel Y. A dual program for translation regulation in cellular proliferation and differentiation. Cell. 2014;158:1281–1292. doi: 10.1016/j.cell.2014.08.011. [DOI] [PubMed] [Google Scholar]
  28. Gossmann TI, Keightley PD, Eyre-Walker A. The effect of variation in the effective population size on the rate of adaptive molecular evolution in eukaryotes. Genome Biology and Evolution. 2012;4:658–667. doi: 10.1093/gbe/evs027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Hershberg R, Petrov DA. Selection on codon bias. Annual Review of Genetics. 2008;42:287–299. doi: 10.1146/annurev.genet.42.110807.091442. [DOI] [PubMed] [Google Scholar]
  30. Hershberg R, Petrov DA. General rules for optimal codon choice. PLOS Genetics. 2009;5:e1000556. doi: 10.1371/journal.pgen.1000556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hildebrand F, Meyer A, Eyre-Walker A. Evidence of selection upon genomic GC-content in bacteria. PLOS Genetics. 2010;6:e1001107. doi: 10.1371/journal.pgen.1001107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Huang Y, Liu Z. Do intrinsically disordered proteins possess high specificity in protein–protein interactions? Chemistry – A European Journal. 2013;19:4462–4467. doi: 10.1002/chem.201203100. [DOI] [PubMed] [Google Scholar]
  33. Hunt RC, Simhadri VL, Iandoli M, Sauna ZE, Kimchi-Sarfaty C. Exposing synonymous mutations. Trends in Genetics. 2014;30:308–321. doi: 10.1016/j.tig.2014.04.006. [DOI] [PubMed] [Google Scholar]
  34. James JE, Willis SM, Nelson PG, Weibel C, Kosinski LJ, Masel J. Universal and taxon-specific trends in protein sequences as a function of age. eLife. 2021;10:e57347. doi: 10.7554/eLife.57347. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. James JE, Nelson PG, Masel J. Differential retention of pfam domains contributes to long-term evolutionary trends. Molecular Biology and Evolution. 2023;40:msad073. doi: 10.1093/molbev/msad073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Kessler MD, Dean MD. Effective population size does not predict codon usage bias in mammals. Ecology and Evolution. 2014;4:3887–3900. doi: 10.1002/ece3.1249. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Kimura M. On the probability of fixation of mutant genes in a population. Genetics. 1962;47:713–719. doi: 10.1093/genetics/47.6.713. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Kondrashov AS, Povolotskaya IS, Ivankov DN, Kondrashov FA. Rate of sequence divergence under constant selection. Biology Direct. 2010;5:5. doi: 10.1186/1745-6150-5-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Kosinski LJ, Aviles NR, Gomez K, Masel J. Random peptides rich in small and disorder-promoting amino acids are less likely to be harmful. Genome Biology and Evolution. 2022;14:evac085. doi: 10.1093/gbe/evac085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Kullback S, Leibler RA. On information and sufficiency. The Annals of Mathematical Statistics. 1951;22:79–86. doi: 10.1214/aoms/1177729694. [DOI] [Google Scholar]
  41. Kumar S, Stecher G, Suleski M, Hedges SB. TimeTree: A resource for timelines, timetrees, and divergence times. Molecular Biology and Evolution. 2017;34:1812–1819. doi: 10.1093/molbev/msx116. [DOI] [PubMed] [Google Scholar]
  42. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, 
Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J, International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062. [DOI] [PubMed] [Google Scholar]
  43. Lazar T, Tantos A, Tompa P, Schad E. Intrinsic protein disorder uncouples affinity from binding specificity. Protein Science. 2022;31:e4455. doi: 10.1002/pro.4455. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Li WH. Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons. Journal of Molecular Evolution. 1987;24:337–345. doi: 10.1007/BF02134132. [DOI] [PubMed] [Google Scholar]
  45. Liu SS, Hockenberry AJ, Jewett MC, Amaral LAN. A novel framework for evaluating the performance of codon usage bias metrics. Journal of the Royal Society, Interface. 2018;15:20170667. doi: 10.1098/rsif.2017.0667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Long H, Sung W, Kucukyildirim S, Williams E, Miller SF, Guo W, Patterson C, Gregory C, Strauss C, Stone C, Berne C, Kysela D, Shoemaker WR, Muscarella ME, Luo H, Lennon JT, Brun YV, Lynch M. Evolutionary determinants of genome-wide nucleotide composition. Nature Ecology & Evolution. 2018;2:237–240. doi: 10.1038/s41559-017-0425-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Lynch M, Ackerman MS, Gout JF, Long H, Sung W, Thomas WK, Foster PL. Genetic drift, selection and the evolution of the mutation rate. Nature Reviews. Genetics. 2016;17:704–714. doi: 10.1038/nrg.2016.104. [DOI] [PubMed] [Google Scholar]
  48. MaselLab Codon-adaptation-index-of-species. swh:1:rev:408af3d150311c4732219abae67c6929421908dfSoftware Heritage. 2024 https://archive.softwareheritage.org/swh:1:dir:60841500b11a4aeb2f505d00dedae2a1db75f513;origin=https://github.com/MaselLab/Codon-Adaptation-Index-of-Species;visit=swh:1:snp:0ddc98c60b565558d38fd277d5e0a81e1e9fef0c;anchor=swh:1:rev:408af3d150311c4732219abae67c6929421908df
  49. Meunier J, Duret L. Recombination drives the evolution of GC-content in the human genome. Molecular Biology and Evolution. 2004;21:984–990. doi: 10.1093/molbev/msh070. [DOI] [PubMed] [Google Scholar]
  50. Novembre JA. Accounting for background nucleotide composition when measuring codon usage bias. Molecular Biology and Evolution. 2002;19:1390–1394. doi: 10.1093/oxfordjournals.molbev.a004201. [DOI] [PubMed] [Google Scholar]
  51. Novoa EM, Jungreis I, Jaillon O, Kellis M. Elucidation of codon usage signatures across the domains of life. Molecular Biology and Evolution. 2019;36:2328–2339. doi: 10.1093/molbev/msz124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Ohta T. Population size and rate of evolution. Journal of Molecular Evolution. 1972;1:305–314. doi: 10.1007/BF01653959. [DOI] [PubMed] [Google Scholar]
  53. Ohta T. Slightly deleterious mutant substitutions in evolution. Nature. 1973;246:96–98. doi: 10.1038/246096a0. [DOI] [PubMed] [Google Scholar]
  54. Ohta T. The nearly neutral theory of molecular evolution. Annual Review of Ecology and Systematics. 1992;23:263–286. doi: 10.1146/annurev.ecolsys.23.1.263. [DOI] [Google Scholar]
  55. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. doi: 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]
  56. Plotkin JB, Dushoff J, Desai MM, Fraser HB. Codon usage and selection on proteins. Journal of Molecular Evolution. 2006;63:635–653. doi: 10.1007/s00239-005-0233-x. [DOI] [PubMed] [Google Scholar]
  57. Plotkin JB, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews. Genetics. 2011;12:32–42. doi: 10.1038/nrg2899. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Rocha EPC. Codon usage bias from tRNA’s point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome Research. 2004;14:2279–2286. doi: 10.1101/gr.2896904. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Rohlf FJ. A comment on phylogenetic correction. Evolution; International Journal of Organic Evolution. 2006;60:1509–1515. doi: 10.1554/05-550.1. [DOI] [PubMed] [Google Scholar]
  60. Romiguier J, Gayral P, Ballenghien M, Bernard A, Cahais V, Chenuil A, Chiari Y, Dernat R, Duret L, Faivre N, Loire E, Lourenco JM, Nabholz B, Roux C, Tsagkogeorga G, Weber AAT, Weinert LA, Belkhir K, Bierne N, Glémin S, Galtier N. Comparative population genomics in animals uncovers the determinants of genetic diversity. Nature. 2014;515:261–263. doi: 10.1038/nature13685. [DOI] [PubMed] [Google Scholar]
  61. Romiguier J, Roux C. Analytical biases associated with GC-content in molecular evolution. Frontiers in Genetics. 2017;8:16. doi: 10.3389/fgene.2017.00016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Schad E, Tompa P, Hegyi H. The relationship between proteome size, structural disorder and organism complexity. Genome Biology. 2011;12:R120. doi: 10.1186/gb-2011-12-12-r120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Sharp PM, Li WH. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research. 1987;15:1281–1295. doi: 10.1093/nar/15.3.1281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research. 2005;33:1141–1153. doi: 10.1093/nar/gki242. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Sharp PM, Emery LR, Zeng K. Forces that influence the evolution of codon bias. Philosophical Transactions of the Royal Society B. 2010;365:1203–1212. doi: 10.1098/rstb.2009.0305. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Smith NGC, Eyre-Walker A. Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans. Molecular Biology and Evolution. 2001;18:982–986. doi: 10.1093/oxfordjournals.molbev.a003899. [DOI] [PubMed] [Google Scholar]
  67. Subramanian S. Nearly neutrality and the evolution of codon usage bias in eukaryotic genomes. Genetics. 2008;178:2429–2432. doi: 10.1534/genetics.107.086405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Sun X, Yang Q, Xia X. An improved implementation of effective number of codons (nc) Molecular Biology and Evolution. 2013;30:191–196. doi: 10.1093/molbev/mss201. [DOI] [PubMed] [Google Scholar]
  69. Theillet FX, Kalmar L, Tompa P, Han KH, Selenko P, Dunker AK, Daughdrill GW, Uversky VN. The alphabet of intrinsic disorder. Intrinsically Disordered Proteins. 2013;1:e24360. doi: 10.4161/idp.24360. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Urrutia AO, Hurst LD. Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics. 2001;159:1191–1199. doi: 10.1093/genetics/159.3.1191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Vavouri T, Semple JI, Garcia-Verdugo R, Lehner B. Intrinsic protein disorder and interaction promiscuity are widely associated with dosage sensitivity. Cell. 2009;138:198–208. doi: 10.1016/j.cell.2009.04.029. [DOI] [PubMed] [Google Scholar]
  72. Vicario S, Moriyama EN, Powell JR. Codon usage in twelve species of Drosophila. BMC Evolutionary Biology. 2007;7:226. doi: 10.1186/1471-2148-7-226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Wangen JR, Green R. Stop codon context influences genome-wide stimulation of termination codon readthrough by aminoglycosides. eLife. 2020;9:e52611. doi: 10.7554/eLife.52611. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Wright F. The “effective number of codons” used in a gene. Gene. 1990;87:23–29. doi: 10.1016/0378-1119(90)90491-9. [DOI] [PubMed] [Google Scholar]
  75. Xue B, Dunker AK, Uversky VN. Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. Journal of Biomolecular Structure & Dynamics. 2012;30:137–149. doi: 10.1080/07391102.2012.675145. [DOI] [PubMed] [Google Scholar]
  76. Zhang Z, Li J, Cui P, Ding F, Li A, Townsend JP, Yu J. Codon deviation coefficient: a novel measure for estimating codon usage bias and its statistical significance. BMC Bioinformatics. 2012;13:43. doi: 10.1186/1471-2105-13-43. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife assessment

C Brandon Ogbunugafor 1

This study develops a useful metric for quantifying codon usage adaptation - the Codon Adaptation Index of Species (CAIS). This metric permits direct comparisons of the strength of selection at the molecular level across species. The study is based on solid evidence, and the authors identify relationships between CAIS and the presence of disordered protein domains. Other correlations, such as the one between CAIS and body size, are weak and non-significant. In summary, the study introduces an interesting new approach to quantifying codon usage across species, which may be helpful in attempts to measure selection at the molecular level.

Reviewer #2 (Public Review):

Anonymous

Summary:

The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index of Species (CAIS) that attempts to control for differences in amino acid usage and GC% across species. Using their new metric, the authors observe a positive relationship between CAIS and the overall “disorderedness” of a species’ protein domains. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection sNe when mutation bias changes across species.

Strengths:

(1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance.

(2) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences. A significant improvement over the previous version is the implementation of a software tool for applying this method.

(3) The authors do a better job of putting their results in the context of the underlying theory of CAIS compared to the previous version.

(4) The paper is generally well-written.

Weaknesses:

(1) The previously observed correlation between CAIS and body size was due to a bug when calculating phylogenetic independent contrasts. I commend the authors for acknowledging this mistake and updating the manuscript accordingly. I feel that the unobserved correlation between CAIS and body size should remain in the final version of the manuscript. Although it is disappointing that it is not statistically significant, the corrected results are consistent with previous findings (Kessler and Dean 2014).

(2) I appreciate the authors providing a more detailed explanation of the theoretical basis of the model. However, I remain skeptical that shifts in CAIS across species indicate shifts in the strength of selection. I am leaving the math from my previous review here for completeness.

As in my previous review, let’s take a closer look at the ratio of observed codon frequencies vs. expected codon frequencies under mutation alone, which was previously notated as RSCUS in the original formulation. In this review, I will keep using the RSCUS notation, even though it has been dropped from the updated version. The key point is this is the ratio of observed and expected codon frequencies. If this ratio is 1 for all codons, then CAIS would be 0 based on equation 7 in the manuscript – consistent with the complete absence of selection on codon usage. From here on out, subscripts will only be used to denote the codon and it will be assumed that we are only considering the case of r = genome for some species s.

$$\mathrm{RSCUS}_i = \frac{O_i}{E_i}$$

I think what the authors are attempting to do is “divide out” the effects of mutation bias (as given by Ei), such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represents adaptive codon usage. Consider Gilchrist et al. GBE 2015, which says that the expected frequency of codon i at selection-mutation-drift equilibrium in gene g for an amino acid with Na synonymous codons is

$$E_{i,g} = \frac{e^{-\Delta M_i - \Delta\eta_i \phi_g}}{\sum_{j=1}^{N_a} e^{-\Delta M_j - \Delta\eta_j \phi_g}} \tag{1}$$

where $\Delta M$ is the mutation bias, $\Delta\eta$ is the strength of selection scaled by the strength of drift, and $\phi_g$ is the gene expression level of gene g. In this case, $\Delta M$ and $\Delta\eta$ reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which $\Delta M, \Delta\eta = 0$. Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), the $E_{i,g}$ could be considered the expected observed frequency of codon i in gene g, $E[O_{i,g}]$.

Let’s re-write $E_i = \frac{p_i}{\sum_{j=1}^{N_a} p_j}$ in the form of Gilchrist et al., such that it is a function of mutation bias $\Delta M$. For simplicity we will consider just the two codon case and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the terms $g_r$ and $1 - g_r$ can be written as

$$g_r = \frac{\mu_{AT \to GC}}{\mu_{AT \to GC} + \mu_{GC \to AT}}, \qquad 1 - g_r = \frac{\mu_{GC \to AT}}{\mu_{AT \to GC} + \mu_{GC \to AT}}$$

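where $\mu_{x \to y}$ is the mutation rate from nucleotides x to y. As described in Gilchrist et al. GBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias $\Delta M_{NNA,NNG} = \log\left(\frac{\mu_{AT \to GC}}{\mu_{GC \to AT}}\right)$. This can be expressed in terms of the equilibrium GC content by recognizing that

$$\frac{g_r}{1 - g_r} = \frac{\mu_{AT \to GC}}{\mu_{GC \to AT}} \implies \frac{g_r}{1 - g_r} = e^{\Delta M}$$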

As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon i at an amino acid becomes just a Bernoulli process.

$$p_i = g_r^{x}\,(1 - g_r)^{(1 - x)}$$

If we do this, then

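$$E_{NNA} = \frac{p_{NNA}}{p_{NNA} + p_{NNG}} = \frac{1 - g_r}{g_r + (1 - g_r)} = \frac{1}{\frac{g_r}{1 - g_r} + 1} = \frac{1}{e^{\Delta M} + 1} = \frac{e^{-\Delta M}}{1 + e^{-\Delta M}} \tag{6}$$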

Recall that in the Gilchrist et al. framework, the reference codon has $\Delta M_{NNG,NNG} = 0 \implies e^{-\Delta M_{NNG,NNG}} = 1$. Thus, we have recovered the Gilchrist et al. model from the formulation of $E_i$ under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for $\Delta\eta$ in equation (1).
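
The equivalence claimed here can be checked numerically. The following is a minimal sketch, assuming (as the two-codon argument implies, though the review does not define it explicitly) that x = 1 for the G-ending codon and x = 0 for the A-ending codon; the function names are illustrative and are not taken from the manuscript or its code.

```python
import math


def expected_nna_from_gc(g_r: float) -> float:
    """E_NNA under mutation alone: p_NNA / (p_NNA + p_NNG), with p = g^x (1-g)^(1-x)."""
    p_nna = 1.0 - g_r  # assumption: x = 0 for the A-ending codon
    p_nng = g_r        # assumption: x = 1 for the G-ending codon
    return p_nna / (p_nna + p_nng)


def delta_m_from_gc(g_r: float) -> float:
    """Mutation bias of NNA relative to the reference NNG: log(g / (1 - g))."""
    return math.log(g_r / (1.0 - g_r))


def expected_nna_from_delta_m(delta_m: float) -> float:
    """Selection-free form with NNG as reference: e^{-dM} / (1 + e^{-dM})."""
    return math.exp(-delta_m) / (1.0 + math.exp(-delta_m))


# The two routes should agree for any equilibrium GC content strictly between 0 and 1.
for g_r in (0.2, 0.41, 0.5, 0.8):
    via_gc = expected_nna_from_gc(g_r)
    via_dm = expected_nna_from_delta_m(delta_m_from_gc(g_r))
    print(f"g_r={g_r:.2f}  via GC={via_gc:.6f}  via Delta M={via_dm:.6f}")
```

Both columns agree to floating-point precision, which is the sense in which the GC-based expectation is the selection-free special case of the equilibrium model.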

We can then calculate the expected RSCUS using equation (1) (using notation $E[O_i]$) and equation (6) for the two codon case. For simplicity, assume we are only considering a gene of average expression (defined as $\phi_g = 1$). Assume in this case that NNG is the reference codon ($\Delta M_{NNG}, \Delta\eta_{NNG} = 0$).

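$$\begin{aligned}
E[\mathrm{RSCUS}_{NNA}] &= \frac{E[O_{NNA}]}{E_{NNA}} \\
&= \frac{e^{-\Delta\eta_{NNA}}\left(e^{-\Delta M_{NNA}} + e^{-\Delta M_{NNG}}\right)}{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta M_{NNG} - \Delta\eta_{NNG}}} \\
&= \frac{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta M_{NNG} - \Delta\eta_{NNA}}}{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta M_{NNG} - \Delta\eta_{NNG}}} \\
&= \frac{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta\eta_{NNA}}}{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + 1}
\end{aligned}$$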

This shows that the expected value of RSCUS for a two codon amino acid is expected to increase as the strength of selection ∆η increases, which is desired. Note that ∆η in Gilchrist et al. is formulated in terms of selection against a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If ∆η = 0 (i.e. selection does not favor either codon), then E[RSCUS] = 1. Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if sNe (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances.
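
To make the dependence on mutation bias concrete, here is a small sketch of the two-codon expected RSCUS, computed as the ratio of the selection-mutation-drift frequency to the mutation-only frequency with NNG as the reference codon. The parameter values are hypothetical and chosen only for illustration.

```python
import math


def expected_rscus_nna(delta_m: float, delta_eta: float, phi: float = 1.0) -> float:
    """E[RSCUS_NNA] = E[O_NNA] / E_NNA for a two-codon amino acid, NNG as reference."""
    weight_sel = math.exp(-delta_m - delta_eta * phi)  # NNA weight with selection
    weight_mut = math.exp(-delta_m)                    # NNA weight under mutation alone
    observed = weight_sel / (weight_sel + 1.0)         # reference codon has weight 1
    expected = weight_mut / (weight_mut + 1.0)
    return observed / expected


# Hypothetical values: selection held fixed (NNA favored), mutation bias varied.
delta_eta = -1.0
for delta_m in (-2.0, 0.0, 2.0):
    print(f"Delta M={delta_m:+.1f}  E[RSCUS_NNA]={expected_rscus_nna(delta_m, delta_eta):.4f}")

# With no selection, the ratio is 1 whatever the mutation bias.
print(expected_rscus_nna(1.5, 0.0))
```

Holding the selection parameter fixed and changing only the mutation bias changes the printed ratio, which is the reviewer's concern in numerical form.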

Consider our 2-codon amino acid scenario. You can see how changing GC content without changing selection can alter the CAIS values calculated from these two codons. Particularly problematic appear to be cases of extreme mutation biases, where CAIS tends toward 0 even for higher absolute values of the selection parameter. Codon usage for the majority of the genome will be primarily determined by mutation biases, with selection being generally strongest in relatively few highly-expressed genes. Strong enough mutation biases ultimately can overwhelm selection, even in highly-expressed genes, reducing the fraction of sites subject to codon adaptation.
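
A rough numerical illustration of this scenario follows. It is a sketch only: it uses a plain Kullback-Leibler divergence between observed and expected two-codon frequencies as a stand-in for the manuscript's CAIS formula (equation 7 there, which is not reproduced in this review), and the selection and expression values are hypothetical.

```python
import math


def two_codon_frequencies(g_r: float, delta_eta: float, phi: float = 1.0):
    """Return (observed, expected) frequencies, each ordered as (NNA, NNG).

    Expected: mutation alone, set by equilibrium GC content g_r.
    Observed: selection-mutation-drift equilibrium with NNG as reference codon.
    """
    delta_m = math.log(g_r / (1.0 - g_r))
    w = math.exp(-delta_m - delta_eta * phi)  # weight of NNA; NNG has weight 1
    observed = (w / (w + 1.0), 1.0 / (w + 1.0))
    expected = (1.0 - g_r, g_r)
    return observed, expected


def kl_divergence(observed, expected) -> float:
    """Sum of O * log(O / E); equals zero when observed matches expected."""
    return sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0.0)


delta_eta = -1.0  # hypothetical: NNA favored, held constant across GC contents
for phi in (0.1, 1.0, 10.0):  # hypothetical low, average, and high expression
    row = []
    for g_r in (0.01, 0.2, 0.5, 0.8, 0.99):
        obs, exp_ = two_codon_frequencies(g_r, delta_eta, phi)
        row.append(f"{kl_divergence(obs, exp_):.4f}")
    print(f"phi={phi:>4}: " + "  ".join(row))
```

Under these assumptions, the divergence at average expression is largest near balanced GC content and shrinks toward 0 at the extreme GC values, even though the selection parameter never changes.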

Review image 1.

Review image 2. CAIS (Low Expression).

Review image 3. CAIS (Average Expression).

Review image 4. CAIS (High Expression).

If we treat the expected codon frequencies as genome-wide frequencies, then we are basically assuming this genome is made up entirely of a single 2-codon amino acid with selection on codon usage being uniform across all genes. This is obviously not true, but I think it shows some of the potential limitations of the CAIS approach. Based on these simulations, CAIS seems best employed under specific scenarios. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content around 0.41, so I suspect their results are okay (assuming things like GC-biased gene conversion are not an issue). Outliers in GC content probably are best excluded from the analysis.

Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids. One potential challenge to CAIS is the non-monotonic changes in codon frequencies observed in some species (again, see Shah and Gilchrist 2011 and Gilchrist et al. 2015).

eLife. 2024 Sep 6;12:RP87335. doi: 10.7554/eLife.87335.3.sa2

Author response

Catherine A Weibel 1, Andrew L Wheeler 2, Jennifer E James 3, Sara M Willis 4, Hanon McShea 5, Joanna Masel 6

The following is the authors’ response to the original reviews.

In addition to our responses to reviewer suggestions below, a minor bug in the calculation of CAIS was brought to our attention by a reader of our preprint. We have corrected this bug and rerun analyses, whose results became slightly stronger as noise was removed. While we were doing that, someone pointed out to us that our equations were almost the same as Kullback-Leibler divergence, which explains why our metric performed so well. We have made the numerically trivial (see before vs. after figure below) mathematical change to use Kullback-Leibler divergence instead, and now have a better story, with a solid basis in information theory, as to why CAIS works.

Author response image 1.


Unfortunately, we discovered a second bug that caused our PIC correction code to fail to perform the needed correction for phylogenetic confounding. The previously reported correlation of CAIS (or ENC) with body mass no longer survives PIC-correction. We have therefore removed this analysis from the manuscript. Our story now rests more on the theoretical basis of CAIS and ENC, and less on post facto validation, than it previously did. We now also present CAIS and ENC on a more equal footing. ENC results are slightly stronger, while CAIS has the complementary advantage of correcting for amino acid frequencies.

The work involved in these changes, as well as some of the responses to reviews below, justifies changing the second author into a co-first author, and adding an additional coauthor (Hanon McShea) who discovered the second bug.

Reviewer #1 (Public Review):

In this manuscript, the authors propose a new codon adaptation metric, Codon Adaptation Index of Species (CAIS), which they present as an easily obtainable proxy for effective population size. To permit between-species comparisons, they control for both amino acid frequencies and genomic GC content, which distinguishes their approach from existing ones. Having confirmed that CAIS negatively correlates with vertebrate body mass, as would be expected if small-bodied species with larger effective populations experience more efficient selection on codon usage, they then examine the relationship between CAIS and intrinsic structural disorder in proteins.

The idea of a robust species-level measure of codon adaptation is interesting. If CAIS is indeed a reliable proxy for the effectiveness of selection, it could be useful to analyze species without reliable life history- or mutation rate data (which will apply to many of the genomes becoming available in the near future).

A key question is whether CAIS, in fact, measures adaptation at the codon level. Unfortunately, CAIS is only validated indirectly by confirming a negative correlation with body mass. As a result, the observations about structural disorder are difficult to evaluate.

As discussed in the preamble above, we have replaced the body mass validation with a stronger theoretical basis in information theory.

A potential problem is that differences in GC between species are not independent of life history. Effective population size can drive compositional differences due to the effects of GC-biased gene conversion (gBGC). As noted by Galtier et al. (2018), genomic GC correlates negatively with body mass in mammals and birds. It would therefore be important to examine how gBGC might affect CAIS, and to what extent it could explain the relationship between CAIS and body mass.

Suppose that gBGC drives an increase in GC that is most pronounced at 3rd codon positions in high-recombination regions in small-bodied species. In this case, could observed codon usage depart more strongly from expectations calculated from overall genomic GC in small vertebrates compared to large ones? The authors also report that correcting for local intergenic GC was unsuccessful, based on the lack of a significant negative relationship with body mass (Figure 3D). In principle, this could also be consistent with local GC providing a relatively more appropriate baseline in regions with high recombination rates. Considering these scenarios would clarify what exactly CAIS is capturing.

Figure 3 (previously Supplementary Figures S5A and S5B) shows that CAIS is negligibly correlated with %GC (not robust to multiple comparisons correction), and ENC not at all. We believe this is evidence against the possibility brought up by the reviewer, i.e. that Ne might affect gBGC (and hence global %GC). This relationship, if present, could act as a confounding effect, but it is not present within our species dataset.

Note that we expect our genomic-GC-based codon usage expectations to reflect unchecked gBGC in an average genomic region, independently of whether that species has high or low Ne. Our working model is that non-selective forces, including gBGC as well as conventional mutation biases, vary among species, and that they rather than selection determine each species’ genome-wide %GC. By correcting for genome-wide %GC, CAIS and ENC correct for both mutation bias and gBGC, in order to isolate the effects of selection.
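
As a rough illustration of what a genomic-GC-based expectation could look like, the sketch below computes expected frequencies within a set of synonymous codons from a single genome-wide GC value. It assumes each nucleotide is independently G or C with probability equal to the genome-wide GC content; this is an illustrative simplification, not the code used for the manuscript, and the function names are hypothetical.

```python
def codon_weight(codon: str, gc: float) -> float:
    """Unnormalized probability of a codon given genome-wide GC content.

    Assumption: positions are independent, each G or C with probability gc
    and A or T with probability 1 - gc. Constant factors shared by all
    codons cancel when normalizing within a synonymous family.
    """
    weight = 1.0
    for base in codon.upper():
        weight *= gc if base in "GC" else (1.0 - gc)
    return weight


def expected_synonymous_frequencies(codons, gc: float) -> dict:
    """Expected frequency of each codon among its synonyms: p_i / sum_j p_j."""
    weights = {codon: codon_weight(codon, gc) for codon in codons}
    total = sum(weights.values())
    return {codon: w / total for codon, w in weights.items()}


# Example with GC content 0.41 for a two-fold (lysine) and a four-fold (alanine) family.
print(expected_synonymous_frequencies(["AAA", "AAG"], 0.41))
print(expected_synonymous_frequencies(["GCA", "GCC", "GCG", "GCT"], 0.41))
```

Because the expectation is normalized within each amino acid, only positions that differ among synonymous codons affect the result, which is why a single genome-wide %GC value is enough to set the baseline in this simplified picture.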

This argument, based on an average genomic region, is vulnerable to gene-rich genomic regions having differentially higher recombination rates and hence GC-biased gene conversion. However, we do not see the expected positive correlation between |local GC − global GC| and CAIS (see new Figure 5), again suggesting that gene conversion strength is not a confounding factor acting on CAIS.

Given claims about "exquisitely adapted species", the case for using CAIS as a measure of codon adaptation would also be stronger if a relationship with gene expression could be demonstrated. RSCU is expected to be higher in highly expressed genes. Is there any evidence that the equivalent GC-controlled measure behaves similarly?

Correlations with gene expression are outside the scope of the current work, which is focused on producing and exploiting a single value of codon adaptation per species. It is indeed possible that our general approach of using Kullback-Leibler divergence to correct for genomic %GC could be useful in future work investigating differences among genes.

The manuscript is overall easy to follow, though some additional context may be helpful for the general reader. A more detailed discussion of how this work compares to the approach taken by Galtier et al. (2018), which accounted for GC content and gBGC when examining codon preferences, would be appropriate, for example. In addition, it would have been useful to mention past work that has attempted to explicitly quantify selection on codon usage.

One key difference between our work and that of Galtier et al. 2018 is that our approach does not rely on identifying specific codon preferences as a function of species. Our approach might therefore be robust to scenarios where different genes have different codon preferences (see Gingold et al. 2014 https://doi.org/10.1016/j.cell.2014.08.011). At a high level, our results are in broad agreement with those of Galtier et al., 2018, who found that gBGC affected all animal species, regardless of Ne, and who like us, found that the degree of selection on codon usage depended on Ne.

Reviewer #2 (Public Review):

Summary:

The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index of Species (CAIS) that controls for differences in amino acid usage and GC% across species. Using their new metric, the authors find a previously unobserved negative correlation between the overall adaptiveness of codon usage and body size across 118 vertebrates. As body size is negatively correlated with effective population size and thus the general strength of natural selection, the negative correlation between CAIS and body size is expected. The authors argue this was previously unobserved due to failures of other popular metrics such as Codon Adaptation Index (CAI) and the Effective Number of Codons (ENC) to adequately control for differences in amino acid usage and GC content across species. Most surprisingly, the authors also find a positive relationship between CAIS and the overall "disorderedness" of a species' protein domains. As some of these results are unexpected, which is acknowledged by the authors, I think it would be particularly beneficial to work with some simulated datasets. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection Ne when the mutation bias changes across species.

Strengths:

(1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance (see Cope et al. Biochemica et Biophysica Acta - Biomembranes 2018 for a clear example of this).

We now cite Cope et al. as an example of how amino acid composition can act as a confounding factor.

(2) The authors present numerous analyses using both ENC and mean CAI as a comparison to CAIS, helping give a sense of how CAIS corrects for some of the issues with these other metrics. I also enjoyed that they examined the previously unobserved relationship between codon usage bias and body size, which has bugged me ever since I saw Kessler and Dean 2014. The result comparing protein disorder to CAIS was particularly interesting and unexpected.

Unfortunately, our previous PIC correction code was buggy, and in fact the relationship with body size does not survive PIC correction (although it is strong prior to PIC correction). We have therefore removed it from the paper. However, the more novel result on protein disorder remains strong.

(3) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences.

Weaknesses:

(1) The main weakness of this work is that it lacks simulated data to confirm that it works as expected. This would be particularly useful for assessing the relationship between CAIS and the overall effect of protein structure disorder, which the authors acknowledge is an unexpected result. I think simulations could also allow the authors to assess how their metric performs in situations where mutation bias and natural selection act in the same direction vs. opposite directions. Additionally, although I appreciate their comparisons to ENC and mean CAI, a comparison to other popular codon metrics for calculating the overall adaptiveness of a genome (e.g. dos Reis et al.'s S statistic, which is a function of tRNA Adaptation Index (tAI) and ENC) may be more appropriate. Even if results are similar to S, CAIS has a noted advantage that it doesn't require identifying tRNA gene copy numbers or abundances, which I think are generally less readily available than genomic GC% and protein-coding sequences.

The main limitation of dos Reis’s test in our view is that, like the better versions of CAI, it requires comparable orthologs across species. See also the discussion below re the benefits of proteome-wide approach. We now also note the advantage of not needing tRNA gene copy numbers and abundances.

Simulated datasets would be great, but we think it a nice addition rather than must-have, in particular because we are skeptical about whether our understanding of all relevant processes is good enough such that simulations would add much to our more heuristic argument along the lines of Figure 2. E.g. the complications of Gingold et al. 2014 cited above are pertinent, but incorporating them would make simulations quite involved. Instead, we now have a stronger theoretical justification for CAIS grounded in information theory. We have significantly expanded discussion of Figure 2 to give a clearer idea of the conceptual underpinnings of CAIS and ENC.

The authors mention the selection-mutation-drift equilibrium model, which underlies the basic ideas of this work (e.g. higher Ne results in stronger selection on codon usage), but a more in-depth framing of CAIS in terms of this model is not given. I think this could be valuable, particularly in addressing the question "are we really estimating what we think we're estimating?"

Let's take a closer look at the formulation for RSCUS. From here on out, subscripts will only be used to denote the codon and it will be assumed that we are only considering the case of r = genome for some species s.

$$\mathrm{RSCUS}_i = \frac{O_i}{E_i}$$

I think what the authors are attempting to do is "divide out" the effects of mutation bias (as given by Ei), such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represent adaptive codon usage. Consider Gilchrist et al. GBE 2015, which says that the expected frequency of codon i at selection-mutation-drift equilibrium in gene g for an amino acid with Na synonymous codons is

$$E_{i,g} = \frac{e^{-\Delta M_i - \Delta\eta_i \phi_g}}{\sum_{j=1}^{N_a} e^{-\Delta M_j - \Delta\eta_j \phi_g}} \tag{1}$$

where $\Delta M$ is the mutation bias, $\Delta\eta$ is the strength of selection scaled by the strength of drift, and $\phi_g$ is the gene expression level of gene g. In this case, $\Delta M$ and $\Delta\eta$ reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which $\Delta M, \Delta\eta = 0$. Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), the $E_{i,g}$ could be considered the expected observed frequency of codon i in gene g, $E[O_{i,g}]$.

Let’s re-write $E_i = \frac{p_i}{\sum_{j=1}^{N_a} p_j}$ in the form of Gilchrist et al., such that it is a function of mutation bias $\Delta M$. For simplicity we will consider just the two codon case and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the terms $g_r$ and $1 - g_r$ can be written as

$$g_r = \frac{\mu_{AT \to GC}}{\mu_{AT \to GC} + \mu_{GC \to AT}}, \qquad 1 - g_r = \frac{\mu_{GC \to AT}}{\mu_{AT \to GC} + \mu_{GC \to AT}}$$

where $\mu_{x \to y}$ is the mutation rate from nucleotides x to y. As described in Gilchrist et al. GBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias $\Delta M_{NNA,NNG} = \log\left(\frac{\mu_{AT \to GC}}{\mu_{GC \to AT}}\right)$. This can be expressed in terms of the equilibrium GC content by recognizing that

$$\frac{g_r}{1 - g_r} = \frac{\mu_{AT \to GC}}{\mu_{GC \to AT}} \implies \frac{g_r}{1 - g_r} = e^{\Delta M}$$

As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon i at an amino acid becomes just a Bernoulli process.

$$p_i = g_r^{x}\,(1 - g_r)^{(1 - x)}$$

If we do this, then

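$$E_{NNA} = \frac{p_{NNA}}{p_{NNA} + p_{NNG}} = \frac{1 - g_r}{g_r + (1 - g_r)} = \frac{1}{\frac{g_r}{1 - g_r} + 1} = \frac{1}{e^{\Delta M} + 1} = \frac{e^{-\Delta M}}{1 + e^{-\Delta M}} \tag{6}$$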

Recall that in the Gilchrist et al. framework, the reference codon has $\Delta M_{NNG,NNG} = 0 \implies e^{-\Delta M_{NNG,NNG}} = 1$. Thus, we have recovered the Gilchrist et al. model from the formulation of $E_i$ under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for $\Delta\eta$ in equation (1).

We can then calculate the expected RSCUS using equation (1) (using notation $E[O_i]$) and equation (6) for the two codon case. For simplicity, assume we are only considering a gene of average expression (defined as $\phi_g = 1$). Assume in this case that NNG is the reference codon ($\Delta M_{NNG}, \Delta\eta_{NNG} = 0$).

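$$\begin{aligned}
E[\mathrm{RSCUS}_{NNA}] &= \frac{E[O_{NNA}]}{E_{NNA}} \\
&= \frac{e^{-\Delta\eta_{NNA}}\left(e^{-\Delta M_{NNA}} + e^{-\Delta M_{NNG}}\right)}{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta M_{NNG} - \Delta\eta_{NNG}}} \\
&= \frac{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta M_{NNG} - \Delta\eta_{NNA}}}{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta M_{NNG} - \Delta\eta_{NNG}}} \\
&= \frac{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + e^{-\Delta\eta_{NNA}}}{e^{-\Delta M_{NNA} - \Delta\eta_{NNA}} + 1}
\end{aligned}$$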

This shows that the expected value of RSCUS for a two-codon amino acid is expected to increase as the strength of selection Δη increases, which is desired. Note that Δη in Gilchrist et al. is formulated in terms of selection against a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If Δη = 0 (i.e. selection does not favor either codon), then E[RSCUS] = 1. Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if sNe (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content ranging around 0.41, so I suspect their results are okay.

Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids.
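
One way such an extension might look is sketched below. It applies the general N-codon form of equation (1) to compute expected RSCUS values for any number of synonymous codons, taking the mutation-only expectation to be the same equation with all selection terms set to 0. The four-codon parameter values are hypothetical, and this is not an analysis performed in the review or the manuscript.

```python
import math


def equilibrium_frequencies(delta_m, delta_eta, phi: float = 1.0):
    """Codon frequencies at selection-mutation-drift equilibrium (equation (1) form).

    delta_m and delta_eta are per-codon values relative to a reference codon,
    whose own entries are 0.
    """
    weights = [math.exp(-m - e * phi) for m, e in zip(delta_m, delta_eta)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rscus(delta_m, delta_eta, phi: float = 1.0):
    """Ratio of equilibrium frequencies to mutation-only frequencies, per codon."""
    observed = equilibrium_frequencies(delta_m, delta_eta, phi)
    neutral = equilibrium_frequencies(delta_m, [0.0] * len(delta_m), phi)
    return [o / n for o, n in zip(observed, neutral)]


# Hypothetical four-codon amino acid; the last codon is the reference (both terms 0).
delta_eta = [-0.8, 0.4, 0.1, 0.0]
for label, delta_m in (("weak bias", [0.2, -0.1, 0.3, 0.0]),
                       ("strong bias", [3.0, -2.0, 2.5, 0.0])):
    ratios = expected_rscus(delta_m, delta_eta)
    print(label, [round(r, 4) for r in ratios])
```

With the selection terms held fixed, the two mutation-bias settings give different ratios, so the dependence on mutation bias noted for two codons carries over under these assumptions.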

We thank Reviewer 2 for explicitly laying out the math that was implicit in our Figures 1 and 2. While we keep our more heuristic presentation, our revised manuscript now more clearly acknowledges that the per-site codon adaptation bias depicted in Figure 1 has limited sensitivity to s*Ne. The reason that we believe our approach worked despite this is that we think the phenomenon is driven by what is shown in Figure 2. I.e., where Ne makes a difference is by determining the proteome-wide fraction of codons subject to significant codon adaptation, rather than by determining the strength of codon adaptation at any particular site or gene. We have made multiple changes to the text to make this point clearer.

Another minor weakness of this work is that although the method is generally applicable to any species with an annotated genome and the code is publicly available, the code itself contains hard-coded values for GC% and amino acid frequencies across the 118 vertebrates. The lack of a more flexible tool may make it difficult for less computationally-experienced researchers to take advantage of this method.

Genome-wide %GC values are hard-coded because they were taken from the previous study of James et al. (2023) https://doi.org/10.1093/molbev/msad073. As summarized in the manuscript, genome-wide %GC was a byproduct of a scan of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019. The more complicated code used to calculate the intergenic %GC, and the code used to calculate amino acid frequencies, is located at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species. Luckily, someone else just wrote a simpler end-to-end pipeline for us, on the basis of our preprint. We now note this in the Acknowledgements, and link to it: https://github.com/gavinmdouglas/handy_pop_gen/blob/main/CAIS.py.

Associated Data


    Data Citations

    1. Jennifer J, Sara W, Paul N, Catherine W, Luke K, Joanna M. 2020. Data from: Universal and taxon-specific trends in protein sequences as a function of age. figshare.

    Supplementary Materials

    MDAR checklist

    Data Availability Statement

    There is no new data. Processed data and code underlying this article are available in the public repository at https://github.com/MaselLab/Codon-Adaptation-Index-of-Species (copy archived at MaselLab, 2024).

    The following previously published dataset was used:

    Jennifer J, Sara W, Paul N, Catherine W, Luke K, Joanna M. 2020. Data from: Universal and taxon-specific trends in protein sequences as a function of age. figshare.


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
