Abstract
Codon usage bias is a fundamental feature of the genetic code, yet its impact on messenger RNA translation is incompletely defined. Here, we integrate comparative genomics, human tissue proteomes, and large cancer cell line and patient cancer datasets to reveal a conserved codon-bias axis. Across mammals, we show that GC-biased gene conversion drives human-specific drifts in GC3 (third codon nucleotide bias score), yet the functional dichotomy is maintained: A/T-ending codons associate with proliferation and RNA processing, while G/C-ending codons (third nucleotide guanine or cytosine) associate with differentiation and neuronal functions. At the isoacceptors level, synonymous codons segregate into distinct functional categories. To mechanistically connect codon usage to cancer, we introduce the ANN- and m7G-indices, capturing codons decoded by tRNAs carrying the transfer RNA (tRNA) modifications t6A and m7G. Both indices negatively correlate with GC3 and enrich for pro-oncogenic proliferative pathways. Human tissue proteomes reveal strong codon bias discordance between RNA and protein levels, with nervous system tissues enriched for G/C-ending codons while proliferative organs are A/T-biased. Analysis of 2600 cancer cell lines and 21 cancer types revealed heterogeneous codon preferences in cancer cell lines but a global A/T-ending shift in human cancer-upregulated proteins. These findings establish synonymous codon divergence and tRNA modification indices as key determinants of translational reprogramming in health and cancer.
Graphical Abstract
Introduction
The translation of the genetic code into protein is a heavily regulated and energy-intensive process in which genetic information, stored as codons in the coding sequences (CDS) of messenger RNAs (mRNAs), is converted into the amino acids that form proteins. During mRNA translation, a limited set of transfer RNA (tRNA) anticodons (48 in human and 47 in mouse) is available to decode the 64 mRNA codons, including 3 stop codons, that encode the 20 standard amino acids [1]. This arrangement leads to a degree of genetic code redundancy, where most amino acids [apart from tryptophan (Trp) and methionine (Met)] have multiple encoding codons. Furthermore, tRNA wobble modifications bridge the mismatch between the numbers of codons and anticodons, allowing for noncognate pairings at the third codon nucleotide that expand, and in some cases restrict, the decoding capacity of certain tRNA anticodons [2]. This redundancy also allows for the dynamic fine-tuning of mRNA translation: changes in tRNA modification levels alter codon optimality in response to external or internal cellular cues [3–7]. For example, codon-biased translation is increasingly being recognized as a driver of oncogenesis [5, 8] as well as a player in many other diseases [1]. Thus, the classical view of codon usage and optimality is shifting from a static metric to a dynamic and adaptable one [3, 5, 9, 10].
Despite the ever-increasing interest in mRNA translation at the codon scale, and the availability of tools, such as Ribo-seq, that can provide such information, the dynamic nature of codon-biased translation remains incompletely understood. For example, why do certain tRNA modifications that occur in multiple tRNAs drive a pro-oncogenic codon-biased translational program [5]? How can certain tRNA modifications, and their consequent impact on codon optimality, drive the cellular stress response [3, 6, 7]? Are there global pro-oncogenic codon signatures that drive cancer progression? Codon usage has long been recognized to vary across tissues and gene expression programs [11], and we previously showed how tissue-specific enrichment of tRNA modifications influences mRNA translation and codon bias in mouse tissues [9]. Other studies have highlighted the role of tRNA modifications in driving oncogenesis and other diseases [12, 13]. However, a holistic view of codon-biased translation is yet to be achieved.
In this study, we combined elements of evolutionary biology, computational analysis of CDS codon usage and bias, and analysis of multiple large proteomics datasets spanning human physiology, cancer cell lines, and human cancers to reveal the evolutionary and functional relevance of codon usage bias. Importantly, our analysis validates the presence of distinct tissue signatures of codon usage in humans, akin to what was observed in rodents [9], as well as the presence of a semi-global codon-biased oncogenic signature that could be driven by several tRNA modifications.
Materials and methods
Data sources and preprocessing
Protein CDS for human and mouse transcriptomes were retrieved from GENCODE (v49 for human and vM38 for mouse). Rat CDS were retrieved from Ensembl (GRCr8). CDS were harmonized, and the canonical CDS (for human and mouse) or the longest CDS (for rat) per gene was selected for downstream analysis. Canonical transcripts were identified per gene using the GENCODE GTF tags, and the CDS of that transcript was used for downstream analysis. Codon count tables were generated with the coRdon package in R. In parallel, we repeated the analysis using the longest CDS for human and mouse, and the results and observed trends remained unchanged.
The human tissue transcriptomes and proteomes were retrieved from a previously published article [14]. RNA-seq expression matrices (log₂-transformed TPM) and proteomics relative abundance matrices were imported, and gene identifiers were mapped from Ensembl IDs to HGNC gene symbols to homogenize all datasets and files.
The cancer cell line analysis was conducted using the Cell Model Passports database (https://cellmodelpassports.sanger.ac.uk/) [15]. The RNA-seq (17 February 2025) and proteomics (11 February 2025) datasets were downloaded as csv or tsv files and imported into R after harmonization for analysis. Cell line annotation metadata (23 April 2025) and gene identifiers (12 December 2024) were also downloaded and used in our analysis.
Proteomics data for human cancers with matched normal tissues were retrieved from the CancerProteome database [16]. Differentially expressed proteins [fold change (FC) and false discovery rate (FDR) values] in 21 cancer types were downloaded from the database and used for the analysis.
Per-tumor sample proteomics were retrieved from a published study [17]. Per-sample protein intensities were downloaded from the supplementary data of the study and used for downstream analysis after z-score normalization.
GC3 score analysis
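The GC content at the third codon position (GC₃) was calculated per gene as:
$$\mathrm{GC3}_i = \frac{N_{\mathrm{GC3},i}}{N_{\mathrm{total},i}}$$
where $N_{\mathrm{GC3},i}$ is the number of codons having a guanine or cytosine nucleotide at position 3 for gene i, and $N_{\mathrm{total},i}$ is the total number of codons in the gene.
To compare across species, mean rodent GC₃ was defined as:
$$\overline{\mathrm{GC3}}_{\mathrm{rodent},i} = \frac{\mathrm{GC3}_{\mathrm{mouse},i} + \mathrm{GC3}_{\mathrm{rat},i}}{2}$$
The human-specific drift was quantified by:
$$\Delta\mathrm{GC3}_i = \mathrm{GC3}_{\mathrm{human},i} - \overline{\mathrm{GC3}}_{\mathrm{rodent},i}$$
Genes with |ΔGC3| ≥ 0.2 were considered drift genes and were selected for downstream analysis.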
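As an illustration, the GC3 and ΔGC3 calculations described in this section can be sketched in Python (the analysis itself was performed in R). The function names are hypothetical, and two points are assumptions rather than statements from the methods: every complete codon in the CDS, including a terminal stop codon if present, is counted, and ΔGC3 is taken as the human score minus the mean of the mouse and rat scores.

```python
def gc3_score(cds: str) -> float:
    """Fraction of codons whose third nucleotide is G or C."""
    cds = cds.upper().replace("U", "T")
    # Split into complete codons; trailing partial codons are ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        raise ValueError("CDS contains no complete codon")
    gc_ending = sum(1 for codon in codons if codon[2] in "GC")
    return gc_ending / len(codons)


def delta_gc3(human: float, mouse: float, rat: float) -> float:
    """Human GC3 minus the mean rodent GC3 (assumed sign convention)."""
    return human - (mouse + rat) / 2


def is_drift_gene(delta: float, threshold: float = 0.2) -> bool:
    """Apply the |dGC3| >= 0.2 cutoff used to select drift genes."""
    return abs(delta) >= threshold
```

For example, `gc3_score("ATGGCCAAATAA")` counts two G/C-ending codons (ATG, GCC) out of four.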
ANN index
The ANN index was calculated by this equation:
![]() |
m7G index
To calculate the m7G-index, we first identified consensus m7G-modified tRNAs, defined as those detected in at least 3 of the 5 studies that profiled m7G at the isoacceptors level [12, 18–21]. The m7G-index was then calculated as:
![]() |
Isoacceptors frequencies analysis
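For each gene, codon usage was decomposed into frequencies of tRNA isoacceptors (synonymous codons encoding the same amino acid). The isoacceptors frequency of codon c in gene g was calculated by:
$$f_{c,g} = \frac{n_{c,g}}{\sum_{c' \in \mathrm{syn}(c)} n_{c',g}}$$
where $n_{c,g}$ is the count of codon c in gene g and syn(c) is the set of codons synonymous with c (including c itself).
This yielded a score from 0 to 1.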
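A minimal Python sketch of the per-gene isoacceptors frequency, assuming the standard genetic code and assuming that stop codons and codons absent from the gene are simply omitted from the output (the source does not state how these cases were handled). `isoacceptor_frequencies` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def isoacceptor_frequencies(cds: str) -> dict:
    """Codon count divided by the summed count of its synonymous codons."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(
        cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)
    )
    # Total codon count per amino acid (synonymous family) in this gene.
    family_totals = defaultdict(int)
    for codon, n in counts.items():
        if codon in CODON_TABLE:
            family_totals[CODON_TABLE[codon]] += n
    return {
        codon: n / family_totals[CODON_TABLE[codon]]
        for codon, n in counts.items()
        if codon in CODON_TABLE and CODON_TABLE[codon] != "*"
    }
```

With this definition, the frequencies of all observed codons for one amino acid sum to 1 within a gene, so single-codon amino acids (Met, Trp) always score 1 when present.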
T-statistics for isoacceptors frequencies versus background
The T-statistic describing the isoacceptors codon frequencies of a list of selected genes was calculated as a one-sample T-statistic against the genome-wide background:
$$T_i = \frac{\bar{x}_i - \mu_i}{s_i / \sqrt{n}}$$
where $T_i$ is the T-statistic for the isoacceptors frequency of codon i, $\bar{x}_i$ is the mean frequency for codon i in the sample, $\mu_i$ refers to the mean frequency for codon i in the background (genome-wide), $s_i$ refers to the genome-wide standard deviation of the isoacceptors frequency for codon i, and n refers to the sample size (i.e. the number of foreground or selected genes). The resulting T-stat value indicates the direction of enrichment (positive versus negative) and statistical significance (|T-stat| ≥ 2 corresponds approximately to P ≤ 0.05).
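The statistic described here can be sketched in Python as follows. This is only an illustration under the assumption that the standard one-sample form (sample mean minus background mean, divided by the background standard deviation over the square root of n) is used; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean


def isoacceptor_t_stat(sample_freqs, background_mean, background_sd):
    """One-sample T-statistic of a gene set against the genome-wide background.

    sample_freqs: isoacceptors frequencies of one codon in the selected genes.
    background_mean, background_sd: genome-wide mean and standard deviation
    of the same codon's isoacceptors frequency.
    """
    n = len(sample_freqs)
    if n == 0:
        raise ValueError("sample is empty")
    if background_sd <= 0:
        raise ValueError("background standard deviation must be positive")
    return (mean(sample_freqs) - background_mean) / (background_sd / sqrt(n))
```

A positive value indicates that the codon is used more often in the selected genes than genome-wide, and a negative value indicates depletion.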
Poly-amino acid repeat detection (Poly-Q/A/P analysis)
After k-means clustering of human, mouse, and rat genes by their GC3 scores, we identified the genes in each cluster and used them for downstream analysis. Protein sequences were translated from CDS and scanned for runs of glutamine (Q), alanine (A), and proline (P). For each gene, the longest uninterrupted repeat length was extracted. Genes were classified as “present” for a poly-run if the maximal run length was greater than or equal to a minimum threshold (m = 5).
Statistical testing included:
Presence/absence analysis: pairwise Fisher’s exact tests with Benjamini–Hochberg multiple test correction were applied across species for each cluster to identify presence or absence of Poly-Q/A/P containing genes in each GC3 cluster versus the background as well as between species.
Repeat length comparison: Kruskal–Wallis tests were used to assess differences in run lengths across species. Pairwise Wilcoxon rank-sum tests with Benjamini–Hochberg correction were conducted for species pairs.
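The repeat detection step can be illustrated with a short Python sketch that finds the longest uninterrupted run of a residue and applies the m = 5 threshold. The helper names are hypothetical, and the statistical tests are not reproduced here.

```python
import re


def longest_run(protein: str, residue: str) -> int:
    """Length of the longest uninterrupted run of `residue` in `protein`."""
    runs = re.findall(f"{re.escape(residue)}+", protein.upper())
    return max((len(run) for run in runs), default=0)


def has_poly_run(protein: str, residue: str, min_len: int = 5) -> bool:
    """Classify a protein as 'present' for a poly-run of the given residue."""
    return longest_run(protein, residue) >= min_len


def poly_qap_summary(protein: str, min_len: int = 5) -> dict:
    """Longest run and presence call for Q, A, and P."""
    return {
        residue: (longest_run(protein, residue),
                  has_poly_run(protein, residue, min_len))
        for residue in "QAP"
    }
```

For instance, a protein containing `QQQQQQ` is called present for poly-Q, whereas one whose longest proline run is `PP` is called absent for poly-P.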
GC-biased gene conversion analysis
To analyze recombination events at the third codon nucleotide that could drive drifts in GC3 scores, we first created codon-level multiple sequence alignments per gene for all genes detected in all three species (15 348 genes). Next, we identified the consensus nucleotide in the rodent (mouse and rat) CDS and compared the human nucleotide to it. A change from A/T to G/C in humans was counted as B (biased towards GC), and a change from G/C to A/T was counted as S. Next, we calculated the GC-biased gene conversion (gBGC) score as:
$$\mathrm{score}_{\mathrm{frac}} = \frac{B}{B + S}$$
A higher $\mathrm{score}_{\mathrm{frac}}$ indicates recombination events at the third codon nucleotide substituting A/T with G/C, leading to a higher GC3 score, and vice versa. A $\mathrm{score}_{\mathrm{frac}}$ around 50% indicates no significant drift. The analysis was conducted in three passes: first, we analyzed all codons for synonymous and nonsynonymous recombination events (all3rd); then, we analyzed only four-fold degenerate codons (codons in which a change in the third nucleotide does not change the encoded amino acid) for synonymous recombination events (fourfold3); and lastly, we excluded the four-fold degenerate codons to analyze nonsynonymous recombination events (fourfoldAA_allpos).
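The tallying of B and S events can be sketched in Python as below. This is a simplified illustration, not the pipeline used in the study: it takes pre-aligned third-position nucleotides as input, assumes the score is B / (B + S), and assumes that sites where mouse and rat disagree (no rodent consensus) or that contain gaps are skipped. All names are hypothetical.

```python
WEAK = set("AT")    # A/T-ending
STRONG = set("GC")  # G/C-ending


def gbgc_score_frac(human_third: str, mouse_third: str, rat_third: str):
    """Fraction of third-position changes that go from A/T to G/C in human.

    Each argument is a string of aligned third-codon-position nucleotides,
    one character per codon. Returns None when no change is observed.
    """
    b = s = 0
    for h, m, r in zip(human_third.upper(), mouse_third.upper(), rat_third.upper()):
        if m != r:
            continue  # assumption: no rodent consensus, site skipped
        if m in WEAK and h in STRONG:
            b += 1    # A/T -> G/C, counted as B
        elif m in STRONG and h in WEAK:
            s += 1    # G/C -> A/T, counted as S
    if b + s == 0:
        return None
    return b / (b + s)
```

Values above 0.5 indicate a net drift towards G/C-ending codons in human, and values below 0.5 indicate a net drift towards A/T-ending codons.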
Amino acid z-score analysis
For each protein-coding gene and species, amino acid usage frequencies were converted into z-scores to normalize for the background distribution. First, using the codon count tables, we translated each codon to its amino acid and created an amino acid count table. Next, we calculated the amino acid z-score per gene as:
$$z_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}$$
where $z_{i,j}$ is the standardized usage of amino acid i in gene j, $x_{i,j}$ is the amino acid raw count, $\mu_j$ is the mean of the amino acid counts in gene j, and $\sigma_j$ is the standard deviation of all amino acid counts in the gene.
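A small Python sketch of the within-gene amino acid z-score follows. It starts from a protein sequence rather than a codon count table for brevity, and it makes two assumptions not stated in the methods: all 20 standard amino acids enter the mean and standard deviation (including those with zero counts), and the sample standard deviation is used. The function name is hypothetical.

```python
from statistics import mean, stdev

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_z_scores(protein: str) -> dict:
    """Standardize amino acid counts within a single gene product."""
    protein = protein.upper()
    counts = {aa: protein.count(aa) for aa in STANDARD_AMINO_ACIDS}
    values = list(counts.values())
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("all amino acid counts are identical")
    return {aa: (n - mu) / sigma for aa, n in counts.items()}
```

Because the scores are centered within each gene, they sum to zero per gene, so a positive score marks an amino acid used more than the gene's own average.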
Amino acid z-scores were calculated for all genes across species, then we selected the genes present in the 3 species for downstream analysis by Sparse Partial Least Squares Discriminant Analysis (sPLS-DA) and variable importance in projection (VIP) using the mixOmics package [22].
To identify amino acid changes in outlier genes (i.e. those that showed differences between species in the sPLS-DA analysis), we first calculated the amino acid z-score shift in outliers using this equation:
$$\Delta z_{i,s} = \bar{z}^{\,\mathrm{outlier}}_{i,s} - \bar{z}^{\,\mathrm{nonoutlier}}_{i,s}$$
where $\bar{z}^{\,\mathrm{outlier}}_{i,s}$ is the mean z-score of amino acid i in the outlier genes of species s, and $\bar{z}^{\,\mathrm{nonoutlier}}_{i,s}$ is the mean z-score in the nonoutlier genes. Statistical significance was assessed with a two-sample t-test for each (species × amino acid) pair, and false discovery rates were controlled by the Benjamini–Hochberg method.
Because amino acids differ in their capacity to modulate GC3 (due to codon degeneracy), we quantified each amino acid’s GC3 potential as the proportion of its synonymous codons ending in G or C:
$$P_i = \frac{\text{number of synonymous codons of amino acid } i \text{ ending in G or C}}{\text{number of synonymous codons of amino acid } i}$$
We then calculated a GC3-weighted shift:
$$\Delta z^{\mathrm{GC3}}_{i,s} = \Delta z_{i,s} \times P_i$$
This weighting prioritizes amino acids whose codon structure allows stronger influence on GC3 variation. Heatmaps of $\Delta z^{\mathrm{GC3}}_{i,s}$ were generated, with amino acids ordered by GC3 potential (highest to lowest). Rows corresponded to amino acids and columns to species. A diverging color scale (blue = negative shift, red = positive shift, white = no change) was centered at zero. This representation highlights amino acids where outlier genes are enriched or depleted relative to background, and links those shifts to their potential impact on GC3 composition.
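The GC3 potential of each amino acid follows directly from the standard genetic code, as the Python sketch below illustrates. The weighted shift is written here as the product of the z-score shift and the GC3 potential, which is an assumption about how the weighting was applied. Function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc3_potential() -> dict:
    """Proportion of each amino acid's synonymous codons ending in G or C."""
    codons_by_aa = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are excluded
            codons_by_aa[aa].append(codon)
    return {
        aa: sum(codon[2] in "GC" for codon in codons) / len(codons)
        for aa, codons in codons_by_aa.items()
    }


def gc3_weighted_shift(mean_z_outlier: float, mean_z_nonoutlier: float,
                       potential: float) -> float:
    """z-score shift (outlier minus nonoutlier) scaled by GC3 potential."""
    return (mean_z_outlier - mean_z_nonoutlier) * potential
```

Under the standard code, Met and Trp have a potential of 1 (their single codon ends in G), Ile has 1/3, and amino acids with two, four, or six codons have 0.5.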
Isoacceptors clustering and codon-level feature importance
For the isoacceptors-based clustering analysis, we used per-gene isoacceptors frequencies as calculated above. Genes were clustered using k-means (k = 10, 100 random starts, seed fixed for reproducibility) on the gene × codon isoacceptors frequency matrix after centering and scaling.
To identify codons contributing most strongly to cluster separation, we calculated, for each codon, the ratio of between-cluster sum of squares (BSS) to total sum of squares (SSₜ) across genes. This statistic, equivalent to ANOVA η², quantifies the proportion of codon variance explained by cluster membership. Codons with high BSS/SSₜ values were considered highly cluster informative. Codons were ranked accordingly, and the top 20 codons were selected for visualization.
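The BSS/SSₜ statistic for a single codon can be sketched in Python as follows, given that codon's (scaled) isoacceptors frequency in each gene and the cluster label of each gene. The function name is hypothetical, and returning 0 for a codon with no variance is an assumption made to avoid division by zero.

```python
from collections import defaultdict
from statistics import mean


def bss_over_sst(values, labels) -> float:
    """Proportion of a codon's variance explained by cluster membership.

    values: one value per gene for a single codon.
    labels: the k-means cluster label of each gene, in the same order.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # assumption: a constant codon carries no cluster information
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    # Between-cluster sum of squares: cluster size times squared centroid offset.
    bss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    return bss / ss_total
```

Applying this function to every codon column and sorting the results in descending order reproduces the kind of ranking from which the top 20 codons were taken.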
For each cluster, centroid isoacceptors frequencies were computed as the average frequency of each codon among genes assigned to that cluster. To highlight enrichment or depletion, centroid values were mean-centered against the genome-wide codon averages, producing ΔisoFreq values. Positive ΔisoFreq values indicate codons enriched in a given cluster relative to the background, whereas negative values indicate depletion. These deviations were visualized in heatmaps with rows corresponding to clusters and columns to codons.
To assign functional interpretation to clusters, we performed over-representation analysis (ORA) of gene ontology biological processes (GOBP) using the clusterProfiler R package. Enrichment results were corrected for multiple testing using the Benjamini–Hochberg method, and only significant enrichments (FDR < 0.05) were retained.
To provide an overview across clusters, we performed keyword analysis of significant GOBP and KEGG terms. Enrichment descriptions were tokenized into keywords, stop words removed, and term frequencies weighted by term frequency–inverse document frequency. The top-ranked keywords per cluster were used to summarize biological functions associated with codon-usage clusters.
Isoacceptors codon-specific ORA enrichment
To investigate whether synonymous codons enrich for distinct functional categories, we ranked genes by isoacceptors frequency for each codon. For each codon, the top 5% of genes (highest isoacceptors frequency) were selected, and ORA GOBP analysis performed.
We compared enrichment results across synonymous codons of the same amino acid to assess functional divergence. Representative examples included Histidine (CAC versus CAT) and Valine codons, where enrichment patterns were distinct despite encoding the same amino acid. Enrichment was summarized as dot plots of significant GOBP categories per codon.
Isoacceptors analysis of GOBP terms
Here, we analyzed each GOBP term as an independent entity, restricting our analysis to terms with ≥ 8 genes. For each GOBP term, we computed the isoacceptors frequencies per codon across its member genes and compared them to the genome-wide average; the deviations were expressed as T-statistics (see the section ‘T-statistics for isoacceptors frequencies versus background’). The resulting codon × pathway matrix of T-statistics was clustered and visualized as a heatmap (ComplexHeatmap), allowing detection of global AT-ending versus GC-ending codon biases across pathways.
Keyword analysis of GOBP enriched terms in isoacceptors clustering analysis
For the GOBP ORA results of each cluster, we tokenized the titles of the GOBP terms and removed English stop words and tokens with fewer than 2 alphabetic characters. Next, we calculated the raw count n(w, c) of each token w per cluster c and then the relative token frequency:
$$\mathrm{tf}(w,c) = \frac{n(w,c)}{\sum_{w'} n(w',c)}$$
Next, we calculated the tf–idf score using this equation:
$$\text{tf-idf}(w,c) = \mathrm{tf}(w,c) \times \mathrm{idf}(w), \qquad \mathrm{idf}(w) = \log\frac{N}{\mathrm{df}(w)}$$
where c is the cluster, N the number of clusters, tf the term frequency, idf the inverse document frequency (i.e. how unique the token is across all clusters), and df(w) the number of clusters containing word w. For each cluster, we report the top 15 words by tf–idf (ties not allowed). Next, we visualized the enriched keywords per cluster as a heatmap with clustering based on shared keywords between clusters.
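The keyword scoring can be illustrated with the Python sketch below. Several details are assumptions: the idf is taken as the natural logarithm of N / df(w), the stop-word list is a tiny illustrative set rather than the list actually used, and tokens are defined as runs of alphabetic characters. Function names are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; the list used in the study is not specified.
STOP_WORDS = {"of", "the", "and", "to", "in", "by", "for", "via", "or"}


def tokenize(title: str) -> list:
    """Lowercase alphabetic tokens with at least 2 letters, minus stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]


def tf_idf(cluster_titles: dict) -> dict:
    """tf-idf score of every token in every cluster.

    cluster_titles maps a cluster label to the list of its GOBP term titles.
    """
    counts = {
        cluster: Counter(tok for title in titles for tok in tokenize(title))
        for cluster, titles in cluster_titles.items()
    }
    n_clusters = len(counts)
    # df(w): number of clusters in which token w occurs at least once.
    df = Counter(word for cnt in counts.values() for word in cnt)
    scores = {}
    for cluster, cnt in counts.items():
        total = sum(cnt.values())
        scores[cluster] = {
            word: (n / total) * math.log(n_clusters / df[word])
            for word, n in cnt.items()
        }
    return scores
```

A token that appears in every cluster receives an idf of zero and therefore never ranks among the top keywords, which is the intended behavior for generic words such as "cell".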
GC3 analysis of GOBP terms
As with the isoacceptors frequencies analysis of GOBP terms, we selected terms with ≥8 genes. For each term, we calculated the mean GC3 score of genes nested under the term to generate per-term GC3 score.
Keyword enrichment analysis of GO terms
To identify and summarize differences between biological processes associated with codon bias, we performed a keyword enrichment analysis on GOBP terms stratified by GC3 content. GOBP terms were first ranked by their mean GC3 index across annotated genes. The top and bottom 5% of terms were extracted and treated as “high-GC3” and “low-GC3” groups, respectively. Term names were tokenized into individual words, lowercased, and cleaned by removing standard English stop words and terms shorter than three characters. Word frequencies were then calculated separately for the high- and low-GC3 groups. To quantify relative enrichment, we computed log₂ fold-changes in frequency between the two groups, adding a pseudo-count to avoid division by zero. The resulting “keyness” scores highlighted keywords disproportionately associated with high- versus low-GC₃ pathways. For visualization, the top 10 positively and negatively enriched words were plotted as a bar chart, enabling direct comparison of thematic trends between GC3-rich and GC3-poor biological processes.
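The "keyness" score can be sketched in Python as a log₂ ratio of word counts between the two groups. The exact form of the pseudo-count is not given in the text, so this sketch assumes a pseudo-count of 1 added to the raw counts of both groups; the function name is hypothetical.

```python
import math
from collections import Counter


def keyness(high_tokens, low_tokens, pseudo_count: float = 1.0) -> dict:
    """log2 fold-change of word counts in high-GC3 versus low-GC3 terms.

    Positive scores mark words over-represented in the high-GC3 group,
    negative scores mark words over-represented in the low-GC3 group.
    """
    high = Counter(high_tokens)
    low = Counter(low_tokens)
    vocabulary = set(high) | set(low)
    return {
        word: math.log2((high[word] + pseudo_count) / (low[word] + pseudo_count))
        for word in vocabulary
    }


def top_keywords(scores: dict, k: int = 10):
    """Top-k positively and top-k negatively enriched words."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], ranked[-k:][::-1]
```

If the two groups contain very different numbers of tokens, normalizing the counts to relative frequencies before taking the ratio would be a reasonable alternative design.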
Results
Analysis of mammalian third nucleotide codon bias reveals contribution of amino acid content and recombination events to GC3 drifts across species
We first compared third nucleotide G/C content (GC3 scores) and synonymous codon usage across human, mouse, and rat CDS, reflecting the most widely used mammalian experimental systems and their relationship to human biology. We obtained CDS for human (GENCODE v48), mouse (GENCODE vM37), and rat (Ensembl GRCr8). We selected the canonical transcript per gene (for human and mouse) or the longest transcript per gene (for rat). We employed two metrics for our analysis: GC3 scores and isoacceptors frequencies. GC3 scores were calculated as the fraction of codons in a CDS that end in G or C (i.e. the third nucleotide is G or C) out of all codons in that CDS [23]. The score ranges from zero (no G/C-ending codons) to 1 (all codons are G/C-ending; see the ‘Materials and methods’ section). Isoacceptors frequencies were calculated as described previously [24, 25]. Essentially, we divided the count of a given codon in a CDS by the summed count of all synonymous codons of the same amino acid, yielding an isoacceptors score ranging from zero to 1 (see the ‘Materials and methods’ section). We next created codon count tables and matrices for GC3 scores per gene and isoacceptors frequencies per codon per gene. To compare between the three species, we selected the orthologous genes (≈15 000 protein-coding genes) present in all 3 datasets. We observed clear differences between human GC3 scores and those of the rat and mouse (Fig. 1A). While the GC3 density plots of the rat and mouse scores revealed a sharp peak around the median, the human GC3 distribution was flatter, with more genes having extreme GC3 scores. The median of the human GC3 scores remained similar to that of the other species (≈0.6). Spearman rank correlation analysis showed that, while the correlation between the three species is high, it is higher between rat and mouse than between either rodent and human (Fig. 1B).
Figure 1.
GC3 scores diverge between humans and rodents. (A) Density plot of GC3 scores across the three species. (B) Heatmap of Spearman’s rank correlation analysis of gene GC3 scores across the three species. (C) Ridgeline analysis of isoacceptors codon frequencies of Arginine codons showing near identical values across all species. The same was evident for all other amino acid codons. (D) Heatmap of Spearman’s rank correlation analysis of gene isoacceptors frequencies scores across the three species. (E) Sparse partial least square discriminant analysis (sPLS-DA) of amino acid z-scores across species. Each dot represents one gene. Each gene is colored by its species of origin. (F, G) Density plot of different amino acid z-scores across species showing examples of divergent amino acid z-scores.
Despite these differences in GC3 scores between the three species, isoacceptors frequency scores were nearly identical for all codons, with a Spearman’s rank correlation Rho of 1 (p < 1e-300) in all pairwise comparisons (Fig. 1C and D). Thus, we could not attribute such differences to synonymous codon substitution. We reasoned that evolutionary changes in the amino acid composition of genes or in gene length could explain such differences. First, we analyzed gene length differences between the three species. Globally, we observed statistically significant differences in CDS lengths across species (Supplementary Fig. S1A and B). Next, we extracted the outlier genes based on their GC3 score differences between the human and mouse and the human and rat datasets by calculating ΔGC3 between human and each species separately (see the ‘Materials and methods’ section). We selected the top 200 outlier genes (an arbitrary cutoff) for downstream analysis and observed that around 50% of those genes are shared between the human versus mouse and human versus rat GC3 analyses (Supplementary Fig. S1C). Analysis of the gene lengths of the outlier genes revealed statistically significant differences between rat outlier genes and those of the other species but not between human and mouse (Supplementary Fig. S1D and E).
Next, we examined whether changes in amino acid composition between the CDS of different species could explain the observed GC3 patterns, given that GC3 scores are sensitive to both synonymous and nonsynonymous codon substitutions. To do so, we first collated the codons pertaining to each amino acid for each CDS to create amino acid counts per gene. Next, we normalized the counts by calculating the amino acid (AA) z-score for each gene. Sparse partial least squares discriminant analysis (sPLS-DA) revealed differences in the AA content of genes across the three species (Fig. 1E). Variable importance in projection (VIP) analysis revealed that aspartate, arginine, and alanine were the top determinants of the clustering of genes across the three species (Supplementary Fig. S1F). Plotting the z-scores for each amino acid across all genes and species revealed clear differences in some amino acids, such as alanine, serine, and others (Fig. 1F and G), while other amino acids had near identical z-score values across species.
Next, we analyzed the outlier genes to examine whether their amino acid sequences could explain the changes in GC3 scores across species. First, we compared amino acid z-scores between outlier and nonoutlier sets for each species. Next, we weighted the observed differences by the intrinsic GC3 potential of each amino acid (i.e. the fraction of its synonymous codons that terminate in G/C). This weighting highlighted amino acids where differential usage is most likely to impact GC3 scores. Significance was assessed with t-tests, and adjusted p-values were obtained using the Benjamini–Hochberg method. We visualized the significant amino acids in our analysis as heatmaps of Δz (outlier – background) values, ordered by GC3 potential (Supplementary Fig. S2). Certain amino acid enrichment patterns were stable across species while others showed differences. For example, aspartate had a higher Δz in human outliers than in other species, while Ala had a higher Δz in rat outliers.
Previously, gBGC due to recombination was reported to lead to synonymous and nonsynonymous codon switching [26, 27]. The nonsynonymous recombination events could lead to evolutionary changes in amino acid sequences. To test whether recombination events could explain our observations, we initially calculated the delta-GC3 score for each gene (ΔGC3), which we defined as the human GC3 score minus the mean rodent (mouse and rat) GC3 score, $\Delta\mathrm{GC3} = \mathrm{GC3}_{\mathrm{human}} - \overline{\mathrm{GC3}}_{\mathrm{rodent}}$ (see the ‘Materials and methods’ section). A positive ΔGC3 indicates a drift in human genes towards more GC-ending codons compared to rodents and vice versa. We selected genes with |ΔGC3| ≥ 0.2 to examine, which represents an absolute difference of 0.2 (20 percentage points) in GC3 score. The analysis yielded 133 human genes with GC3 gain (i.e. higher GC3 scores compared to the mouse orthologs) and 170 genes with GC3 loss (i.e. lower GC3 scores in humans). Analysis of gene length revealed statistically significant differences in both gain and loss gene sets between human and mouse/rat orthologs (Supplementary Fig. S3A and B). We selected the top gene by ΔGC3 score (SRSF8, ΔGC3 = 0.47) and the bottom gene (PARD6B, ΔGC3 = −0.38) for visualization (Supplementary Fig. S3C and D). As seen in the two examples presented, A/T is substituted with G/C and vice versa at the third codon position, leading to the GC3 drift observed.
Next, we analyzed gBGC by first creating per-gene codon alignment maps for all orthologs present across the three species and identifying the rodent consensus sequence at the third codon position. We conducted the analysis in three passes to differentiate between synonymous and nonsynonymous recombination events. First, we analyzed the four-fold degenerate sites (i.e. codons in which any change in the third nucleotide is synonymous, for example, those of Ala, which is encoded by GCU, GCC, GCA, and GCG). Next, we excluded the four-fold degenerate sites to identify recombination events leading to nonsynonymous substitutions (i.e. changing the amino acid sequence of the protein). Finally, we did a global pass to evaluate the overall trends in recombination. We tallied the AT-to-GC and GC-to-AT events at the third codon nucleotide in human genes compared to rodent genes to analyze the recombination events that drive the drift in GC3 scores. Using k-means clustering, we grouped the orthologs into three clusters after multiple iterations. There were two large clusters, C1, which was A/T drifted in humans, and C3, which was G/C drifted in humans, and a smaller cluster, C2 (Fig. 2A). We conducted our gBGC analysis on each of the three GC3 clusters. We observed in all events (synonymous and nonsynonymous) (Supplementary Fig. S3E), in synonymous recombination events (Supplementary Fig. S3F), and in nonsynonymous events (Supplementary Fig. S3G) that cluster 3 (C3) had more AT-to-GC conversion events, while cluster 1 (C1) had fewer AT-to-GC conversion events (or more GC-to-AT conversion events). Cluster 2 (C2) remained neutral with no significant drift. These observations align with our clustering and the changes observed in GC3 scores between human and rodents. It is also notable that we observed strong agreement between the ΔGC3 gene scores and their gBGC scores (Supplementary Fig. S3H), indicating that ΔGC3 can be an easy and quick way to screen for potential drifts due to recombination events.
Figure 2.
(A) k-means clustering of genes based on their GC3 scores across the three species examined. (B) ORA of GOBP of genes belonging to cluster 1 (C1) of the k-means GC3 clusters. (C) ORA of GOBP of genes belonging to cluster 3 (C3) of the k-means GC3 clusters. (D) GOBP ORA analysis of the most A/T-ending biased genes. (E) GOBP ORA analysis of the most G/C-ending biased genes.
Together, these analyses demonstrate that third-position nucleotide composition differs systematically between humans and rodents, with human genes exhibiting broader GC3 distributions and more extreme GC3 values. While these differences are accompanied by modest shifts in amino acid composition and CDS length, they are most parsimoniously explained by recombination-associated gBGC acting on both synonymous and nonsynonymous sites. Importantly, these evolutionary forces reshape nucleotide and codon composition without global restructuring of synonymous decoding preferences, which remain highly conserved across species.
Third nucleotide codon bias is linked to distinct molecular pathways
Our analysis revealed strong changes in global codon bias between humans and rodents; however, whether these changes lead to differences in functional enrichment remains unclear. To understand how codon patterns influence functional pathways, we performed GOBP ORA on the genes belonging to each of the 3 clusters shown in Fig. 2A. Our analysis revealed that the genes in cluster C1, which had lower GC3 scores in human, were enriched for pathways linked to mitosis and cell division, proliferation, and chromosomal functions (Fig. 2B). On the other hand, cluster C3, which had higher GC3 scores in human, was enriched for genes linked to synaptic and neuronal function and cell fate commitment (Fig. 2C). To evaluate whether these patterns could explain the global differences in GC3 scores across species (i.e. the presence of more extreme GC3 scores in human), we analyzed the top and bottom genes by their GC3 scores (using preset thresholds: human: >0.9 or <0.3; mouse and rat: >0.8 or <0.3) using ORA of GOBP. In all species, we observed that the AT-biased genes (i.e. lower GC3 scores) were enriched for pathways related to cell proliferation and mRNA processing (Fig. 2D; only human data are shown for brevity), while GC-biased genes were enriched for cell fate commitment and neuronal functions (Fig. 2E; only human data are shown for brevity). To validate the biased enrichment of GC3 (GC-biased) and AT3 (AT-biased) genes towards cell fate and neuronal pathways versus cell proliferation and mRNA processing, we orthogonally analyzed the human GOBP pathways to evaluate their inherent bias. To do so, we focused on pathways with ≥8 genes and calculated the average GC3 score per pathway as the mean GC3 score of the genes nested under the pathway. The median GC3 score of all GOBP terms was around 0.6 (Supplementary Fig. S4A). GC3-rich GOBP terms were linked to differentiation and neuronal functions (Supplementary Fig. S4B), while AT3-rich GOBP terms were mostly linked to cell proliferation (Supplementary Fig. S4C). We further conducted a keyword search in the top 5% GC3- or AT3-rich terms and found that GC3-rich terms were enriched for differentiation and signaling keywords, while AT3-rich terms were enriched for RNA processing and chromosome-related keywords (Supplementary Fig. S4D). The same patterns were observed when we repeated the analysis using gene ontology cellular component and molecular function terms (data not shown). To evaluate whether the same patterns are replicated across species, we repeated the analysis using mouse and rat GOBP terms. The same dichotomy between proliferation/RNA processing and differentiation/neuronal pathways was also observed (data not shown).
In summary, while we observed evolutionary drifts in third nucleotide codon composition between humans and rodents, the most AT-ending and GC-ending biased genes in all three species enriched for the same functional programs, i.e. proliferation versus differentiation. Thus, evolutionary shifts in third-position nucleotide composition alter the distribution of GC3 values without reassigning genes to different functional programs, preserving higher-order biological organization across species.
Synonymous codon usage clusters genes into distinct groups
The translation of specific genes or gene sets in a given context depends on their GC3 scores (or third nucleotide bias) as well as their isoacceptors frequencies [4, 6, 23, 24, 28]. While our analysis revealed evolutionary changes in GC3 bias from rodents to humans due to nonsynonymous and synonymous recombination, gene isoacceptors frequencies were globally conserved, apart from a few outliers (data not shown) [10]. Davis et al. [10] also showed that genes cluster by their isoacceptors frequencies into functional groups. We thus replicated their analysis by calculating per-codon isoacceptors scores (ranging from 0 to 1; see the ‘Materials and methods’ section) for each gene across its CDS. Here, we focus on the human CDS for brevity. However, we repeated the same analysis for mouse and rat CDS and observed similar trends (not shown).
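As an illustration only, a per-codon isoacceptors score bounded between 0 and 1 can be obtained by dividing the count of each codon by the total count of all codons encoding the same amino acid within a gene. This is one common definition and is an assumption here; the exact definition used in this study is given in the ‘Materials and methods’ section. The function name `isoacceptor_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def isoacceptor_frequencies(cds: str) -> dict:
    """Per-gene frequency of each codon among its synonymous codons (0 to 1)."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Skip stop codons and codons containing ambiguous bases.
    codon_counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    aa_counts = Counter()
    for codon, n in codon_counts.items():
        aa_counts[CODON_TABLE[codon]] += n
    return {c: n / aa_counts[CODON_TABLE[c]] for c, n in codon_counts.items()}
```

In this sketch, codons that never occur in a gene receive no entry; whether such cases are treated as zero or as missing values before clustering is a separate choice that is not specified here.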
Based on their per codon isoacceptors frequencies, we clustered the human genes into 10 k-mean clusters after multiple iterations (Fig. 3A). Next, we conducted ORA GOBP analysis on each cluster to identify functional groups. Indeed, we observed unique functional groupings of clusters (Supplementary Table S1). To summarize our findings, we conducted keyword analysis on the GOBP enrichment of each cluster, identified unique keywords for each cluster, and used the top keywords to indicate functional grouping (Fig. 3B). We observed clear functional clustering based on isoacceptors codon frequencies, confirming Davis et al’s results [10]. For example, Clusters 1 and 6 were enriched for immune-linked pathways, while Cluster 3 was enriched for neuronal and differentiation pathways. Cluster 4 was enriched for mitotic and cell division pathways. Clusters 5 and 8 were enriched for translation and RNA metabolism pathways. Cluster 7 was enriched for reproduction, embryogenesis, and development pathways. Cluster 10 was enriched for pathways linked to ion signaling, muscle contraction, and synaptic transmission. We next evaluated what codons could be contributing to the clustering observed. To determine which codons were most informative for distinguishing between the k-means clusters, we quantified, for each codon, the proportion of its variance explained by cluster membership (between-cluster sum of squares/total sum of squares, BSS/SSₜ). This measure is equivalent to an ANOVA effect size (η²). Codons with higher BSS/SSₜ values therefore represent features that contribute strongly to cluster separation. Ranking codons by their BSS/SSₜ scores revealed a subset of highly informative codons (Fig. 3C). We next summarized the distribution of these top codons across clusters by calculating the average isoacceptors frequency for each cluster and centering it against the global background means. 
This yielded Δ isoFreq values, where positive values indicate codons enriched in a given cluster relative to the genome-wide baseline, and negative values indicate depletion (Fig. 3D). Interestingly, in addition to codon-specific enrichment patterns that distinguish individual clusters, we observed a consistent segregation of clusters along a broader A/T-ending versus G/C-ending codon axis. This suggests that synonymous codon usage is hierarchically organized, with fine-scale, isoacceptors-specific preferences operating within a global third-position nucleotide bias. In this framework, GC3 imposes a genome-wide compositional constraint, while isoacceptors frequencies encode more granular, functionally distinct translational programs.
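The two quantities described above, the BSS/SSₜ ratio and the Δ isoFreq values, can be written out for a single codon as in the following minimal sketch. It assumes one isoacceptors frequency per gene and one cluster label per gene; the function names are hypothetical, and the background is taken as the mean over all genes supplied.

```python
from statistics import mean


def _group_by_cluster(values, labels):
    """Collect per-gene values under their cluster label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups


def bss_over_tss(values, labels):
    """Between-cluster sum of squares divided by total sum of squares (eta squared)."""
    grand = mean(values)
    total_ss = sum((v - grand) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a codon with no variance cannot separate clusters
    groups = _group_by_cluster(values, labels)
    between_ss = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between_ss / total_ss


def delta_iso_freq(values, labels):
    """Cluster mean minus the background mean, per cluster."""
    grand = mean(values)
    groups = _group_by_cluster(values, labels)
    return {label: mean(g) - grand for label, g in groups.items()}
```

Applying `bss_over_tss` to every codon and sorting by the result gives a ranking of the kind shown in Fig. 3C, while `delta_iso_freq` gives the centered cluster means of the kind shown in Fig. 3D.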
Figure 3.
Isoacceptors frequencies cluster genes in distinct functional clusters (A) K-mean clustering of human genes by their isoacceptors frequencies. (B) Heatmap of the top representative keywords of enriched GOBP terms in each cluster. (C) Top 20 informative codons across clusters. Delta iso-freq indicate shift in mean iso freq of codon in that cluster compared to the background (all genes). (D) Heatmap showing the Δ isoacceptors frequencies (isoFreq) of the top 20 informative codons in each cluster. Δ isoFreq was calculated by comparing the isoacceptors frequencies of the codon in a cluster versus the background. (E, F) Example of enrichment analysis of synonymous codon pair. His-CAC or His-CAU biased genes were selected and GOBP ORA analysis performed.
While clustering analyses demonstrate that genes segregate into functional programs based on their overall isoacceptors frequency profiles, they do not directly address whether individual synonymous codons within the same amino acid family associate with distinct biological processes. To examine this question at higher resolution, we performed codon-centric functional enrichment analyses. For each codon, we selected the top 5% of genes by isoacceptors frequency and conducted GOBP ORA analysis to evaluate codon-driven functional enrichment and to compare synonymous codons. To demonstrate how synonymous codons are linked to, and could potentially regulate, different biological functions, we selected histidine as an example. Histidine (His) is encoded by two codons, CAC and CAU (CAT in the plots). His has only one anticodon, GUG, and thus requires the queuosine (Q) tRNA modification to decode the CAU codon [29]. Functional enrichment analysis revealed that His-CAC-rich genes were enriched for pathways linked to protein-DNA complexes, nucleosome assembly, and differentiation of myeloid cells and megakaryocytes (Fig. 3E). In contrast, His-CAU-rich genes were enriched for pathways related to immune response and mitochondrial respiratory complex assembly (Fig. 3F). A more extended example is valine, which is encoded by four codons. Val-GTG is the most abundantly used valine codon in humans and rodents, while Val-GTA is the rarest (Supplementary Fig. S5A). We observed specific and distinct enrichment patterns across the four codons (Supplementary Fig. S5B). Val-GTA was enriched for RNA metabolic and splicing processes, Val-GTC for immune pathways, Val-GTG for developmental pathways, and Val-GTT for chromosomal and nuclear division pathways.
In summary, we show that codon isoacceptors frequencies, or codon usage preferences, segregate genes into functional groups. Importantly, synonymous codons are enriched in different processes and pathways. This is of particular importance given that changes in tRNA modifications were shown to drive oncogenesis or the oxidative stress response by shifting translation towards specific codons and genes [5, 24, 30]. Thus, our analysis clarifies why changes in tRNA modifications can shift cellular phenotypes merely by changing the optimality of synonymous codons.
Cancer-linked tRNA modifications preferentially target A/T-biased codon programs
In recent years, our understanding of epitranscriptomic regulation of mRNA translation and disease pathologies has evolved drastically [1, 2, 5]. Importantly, many tRNA modifications have been shown to promote cancer via codon-biased translation [1, 5]. Despite increasing evidence linking tRNA modifications to cancer, how codon composition shapes their translational impact remains unclear. While the role of many tRNA modifications in cancer remains to be studied, a few modifications have been consistently linked to cancer aggressiveness. In particular, N7-methylguanosine (m7G) and N6-threonylcarbamoyladenosine (t6A) have been consistently reported to be linked to the pathology of several cancers. Building on our observation that genes segregate into functional programs based on synonymous codon usage, we asked whether cancer-associated tRNA modifications preferentially engage specific codon-defined gene sets. First, we examined t6A, which promotes the translation of ANN codons via codon-anticodon stacking [31]. t6A is implicated in multiple cancers, such as glioblastoma [30], hepatocellular carcinoma [32], and others. To understand the links between t6A and cancer, we created an ANN index, in which we calculated, for each gene, the ratio of ANN codons to all codons, reflecting a gene’s potential sensitivity to t6A-dependent decoding (see the ‘Materials and methods’ section). The median ANN index across the genome was ≈ 0.3 (Fig. 4A). Next, we examined m7G, which is implicated in multiple cancers due to increased copy number of its writer METTL1 [5]. We first created a consensus set of m7G-decoded codons. We curated the m7G-modified tRNAs detected in five studies [12, 18–21] and selected those that appear in at least three studies as the consensus set. Next, we created the m7G-index by calculating the ratio of m7G-decoded codons in a gene to all codons (see the ‘Materials and methods’ section), reflecting a gene’s sensitivity to m7G-dependent decoding (Fig. 4B). We observed good agreement between the gene ANN index and m7G-index (Spearman’s Rho = 0.6) (Fig. 4C). We further observed a significant negative correlation between the ANN index or m7G-index and gene GC3 (Spearman’s Rho = −0.632 and −0.525, respectively) (Fig. 4D and Supplementary Fig. S6A), indicating that both modifications could shift translation towards A/T-ending codons.
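The two indices and the consensus rule described above can be sketched as follows. The sketch assumes that ANN codons are those with adenosine in the first position, that all complete codons in the supplied CDS form the denominator, and that the mapping from consensus tRNAs to the codons they decode is provided separately. The function names and the idea of passing the codon set as an argument are illustrative choices, not the implementation used in this study.

```python
from collections import Counter


def split_codons(cds: str) -> list:
    """Split a CDS into complete codons, using the DNA alphabet."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        raise ValueError("CDS contains no complete codon")
    return codons


def ann_index(cds: str) -> float:
    """Ratio of codons starting with A (ANN codons) to all codons in the gene."""
    codons = split_codons(cds)
    return sum(codon[0] == "A" for codon in codons) / len(codons)


def consensus_set(study_sets, min_studies=3) -> set:
    """Items (e.g. modified tRNAs) reported by at least `min_studies` studies."""
    counts = Counter(item for study in study_sets for item in set(study))
    return {item for item, n in counts.items() if n >= min_studies}


def codon_set_index(cds: str, codon_set) -> float:
    """Ratio of codons belonging to `codon_set` to all codons in the gene."""
    codons = split_codons(cds)
    return sum(codon in codon_set for codon in codons) / len(codons)
```

With the five study-level tRNA lists as input, `consensus_set` returns the tRNAs supported by at least three studies; once these are translated into their decoded codons, `codon_set_index` yields an m7G-index-like score for each gene.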
Figure 4.
tRNA modifications indices predict the impact of tRNA modifications on codon usage bias. (A) ANN index density plot across human genes. (B) m7G-index density plot across human genes. (C) Spearman’s rank correlation analysis between gene m7G-index and ANN index scores. (D) Spearman’s rank correlation analysis between gene ANN index and GC3 scores. (E) GOBP ORA of top 10% genes by ANN index scores. (F) GOBP ORA of top 10% genes by m7G-index scores.
Next, we selected the top and bottom 10% genes by their ANN index or m7G-index and conducted ORA GOBP analysis to examine their functional enrichment. Genes enriched in ANN codons enriched for mRNA translation and proliferation linked pathways (Fig. 4E), while those poor in ANN codons enriched for differentiation linked GOBP terms (Supplementary Fig. S6B). Genes rich in m7G codons enriched for pathways linked to mRNA translation, immunity, and metal ion stress response (Fig. 4F), while those poor in m7G codons enriched for differentiation and development linked pathways (Supplementary Fig. S6C).
Together, our analysis shows that two cancer-associated tRNA modification systems, t6A and m7G, preferentially target genes with shared codon composition features. Both ANN- and m7G-rich genes are biased toward A/T-ending codons and are enriched for proliferative and translational pathways, whereas genes depleted of these codons preferentially associate with differentiation and developmental programs. These results indicate that oncogenic tRNA modification pathways engage pre-existing codon-defined gene programs, providing a unifying framework linking codon composition to translational reprogramming in cancer.
Human proteomes reveal tissue-specific codon-biased translation programs
Previously, it was shown that codon usage and bias influence mRNA translation and gene expression at the tissue level [11, 23, 28]. Importantly, in mouse tissues, the most abundantly translated mRNAs were A/T-ending biased except in the brain, which was more G/C-ending biased [28]. Understanding these patterns is important for developing more specific gene and mRNA therapeutics via codon reengineering, to achieve better protein expression in target tissues/cells and reduce expression at off-target sites [10, 28]. To that end, we retrieved an atlas of proteomes and transcriptomes of 32 human tissues from the GTEx database [14] and analyzed the codon patterns across tissues. Spearman’s rank correlation showed a modest correlation between RNA and protein expression across tissues, with a global Spearman’s Rho of 0.299 and a range from 0.145 (breast mammary tissue) to 0.471 (cerebellum) (Fig. 5A). Next, for each tissue, we selected the top 10% expressed genes/proteins and analyzed their GC3 scores and isoacceptors frequencies. At the RNA expression level, all tissues appeared to be G/C-ending biased, with similar isoacceptors frequency patterns across the board (Fig. 5B). However, at the protein expression level, we observed clear clustering of G/C-ending versus A/T-ending biased tissues (Fig. 5C), further confirming the transcriptome-proteome mismatch and underscoring the influence of post-transcriptional processes (i.e. translational, epitranscriptomic, or proteome-level regulation) on codon usage bias. In human proteomes, tissues from the nervous system, arteries, skin, lungs, and spleen were G/C-ending biased. In contrast, liver, heart, muscles, and other internal organs were mostly A/T-ending biased. The testis was the most A/T-ending biased tissue as well as the tissue with the strongest difference between RNA and protein GC3 scores (Fig. 5D). Importantly, the observed tissue clustering mirrored what was observed previously in mice using Ribo-seq [28], where the brain was G/C-ending biased while most other tissues examined were A/T-ending biased. Given our knowledge of tissue physiology, this dichotomy fits the view that A/T-ending bias is linked to proliferation and G/C-ending bias is linked to neuronal function and differentiation. To test this further, we conducted ORA GOBP analysis on the top 10% expressed proteins in each tissue (Supplementary Fig. S7A). We observed general clustering trends mirroring those observed in the protein isoacceptors frequencies analysis. Further, GOBP ORA analysis of the most A/T-biased tissue (the testis; Supplementary Fig. S7B) and one of the most G/C-biased tissues (the cerebellum; Supplementary Fig. S7C) recapitulated the proliferation versus differentiation/neuronal function bias observed at the level of global A/T-ending versus G/C-ending bias. Examining the GC3 scores of the top 20 GOBP terms enriched in each tissue revealed clear G/C-ending versus A/T-ending preferences for different tissues (Supplementary Fig. S8A and B).
Figure 5.
Analysis of human tissues proteomes and transcriptomes. (A) Spearman’s rank correlation analysis of RNA and protein expression across all tissues. (B) Heatmap of isoacceptors codon frequencies analysis of top 10% expressed RNAs in human tissues presented as T-stat of tissue expressed genes versus the genome as background. (C) Heatmap of isoacceptors codon frequencies analysis of top 10% expressed proteins in human tissues presented as T-stat of tissue expressed genes versus the genome as background. (D) GC3 density plots of the top 10% expressed mRNAs and proteins in testis.
In summary, we observed tissue-specific codon patterns that cluster human tissues based on their synonymous codon biases, in agreement with previous computational analysis [23] and analyses based on model organisms [28, 33]. Importantly, we observed strong differences between RNA and protein expression, as well as in codon bias when analyzed at the RNA or the protein level, indicating a strong regulatory influence at the level of mRNA translation on codon selection and bias that impacts the proteome [4, 24]. Our observations indicate that tissue-specific codon-biased translation is fine-tuned to each tissue’s physiological functions.
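The tissue-level comparison summarized in Fig. 5B and C (top 10% expressed genes or proteins versus the genome as background, expressed as a T-statistic per codon) can be approximated by the sketch below. The choice of Welch's unequal-variance statistic is an assumption, since the exact test is not restated in this section, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def top_fraction(expression: dict, fraction: float = 0.10) -> list:
    """Names of the most highly expressed entries (top `fraction`, at least one)."""
    n = max(1, int(len(expression) * fraction))
    ranked = sorted(expression, key=expression.get, reverse=True)
    return ranked[:n]


def welch_t(sample, background) -> float:
    """Welch's t-statistic comparing a sample mean with a background mean."""
    se = sqrt(variance(sample) / len(sample)
              + variance(background) / len(background))
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(sample) - mean(background)) / se
```

For a given tissue and codon, `sample` would hold the isoacceptors frequencies of the genes returned by `top_fraction` and `background` those of all genes; a positive value indicates that the codon is used more often in the highly expressed set than in the genome at large.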
Analysis of human cancer proteomes reveals global A/T-biased translational shifts
Having hypothesized that globally A/T-ending codon-biased translation, driven, for example, by certain tRNA modifications such as t6A or m7G, could be a hallmark of cancer, we sought to test this notion in human cancer specimens. To that end, we retrieved the proteomes of different cancers from the CancerProteome database (http://bio-bigdata.hrbmu.edu.cn/CancerProteome/index.jsp) [16]. The database integrates analyses of different public datasets and provides information on proteins that are differentially expressed in cancer samples versus normal tissues, which allowed us to examine A/T-ending codon-biased translational shifts in cancer. Using this information, we extracted the upregulated and downregulated proteins in the 21 cancer types provided by the database and analyzed their GC3 scores and isoacceptors frequencies. Globally, based on the GC3 scores of the differentially expressed proteins, upregulated proteins were more A/T-ending biased, while downregulated proteins were more G/C-ending biased (Fig. 6A). However, certain cancers, such as acute myeloid leukemia, were G/C-ending biased (Fig. 6B). Isoacceptors frequencies analysis offered a more granular view of codon usage bias, revealed specific codon preferences across cancer types, and confirmed the global A/T-ending codon shift in cancer, with a few exceptions (Fig. 6C and D). Thus, across multiple cancer types, proteins upregulated in tumors preferentially exhibit A/T-ending codon bias relative to the normal tissue of origin, consistent with engagement of proliferation-associated codon programs. These trends also agree with our observation of the potential impact of cancer-associated tRNA modifications and are consistent with recent observations of A/T-ending codon-biased programs in proliferating cells [34]. While this trend is observed broadly, notable exceptions exist, highlighting cancer-type-specific variations in codon usage and optimality. 
These findings suggest that many cancers converge on a shared codon bias translational program, while retaining context-dependent codon programs that warrant further investigation.
Figure 6.
Analysis of human cancer proteomes compared to normal tissues. (A) Global GC3 scores of upregulated and downregulated proteins (|log2FC| > 0.58 and FDR < 0.05) across all cancers compared to their respective normal tissues. (B) Global GC3 scores of upregulated and downregulated proteins (|log2FC| > 0.58 and FDR < 0.05) in acute myeloid leukemia. (C) Heatmap of different cancers isoacceptors frequencies. Upregulated and downregulated proteins were analyzed separately and compared against the genome to generate T-statistics which were used for visualization. (D) Cancer abbreviations and full names.
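The thresholds given in the caption above (|log2FC| > 0.58, corresponding to roughly a 1.5-fold change, and FDR < 0.05) can be applied as in the following minimal sketch, which then compares the mean GC3 of the two groups. The input layout (a mapping from protein name to a log2 fold change and FDR pair) and the function names are assumptions made for illustration; they do not reflect the format of the CancerProteome database.

```python
from statistics import mean


def split_by_regulation(proteins: dict, lfc_cutoff: float = 0.58,
                        fdr_cutoff: float = 0.05):
    """Return (upregulated, downregulated) protein names passing both thresholds."""
    up, down = [], []
    for name, (log2fc, fdr) in proteins.items():
        if fdr >= fdr_cutoff or abs(log2fc) <= lfc_cutoff:
            continue  # not significantly changed
        (up if log2fc > 0 else down).append(name)
    return up, down


def mean_gc3(names, gene_gc3: dict) -> float:
    """Mean GC3 of the named proteins that have a GC3 score available."""
    return mean(gene_gc3[n] for n in names if n in gene_gc3)
```

A lower mean GC3 in the upregulated group than in the downregulated group corresponds to the A/T-ending shift shown in panel A.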
Cancer cell lines exhibit heterogeneous codon usage patterns not observed in patient tumors
The importance of understanding codon decoding extends from evolutionary biology to understanding diseases and developing therapeutics. As shown above, the codon sets decoded by certain cancer-linked tRNA modifications agree with the global codon usage patterns in human cancers. Given the ongoing interest in epitranscriptomics, tRNA modifications, and codon-biased mRNA translation as drivers of cancer and as potential therapeutic avenues [5, 8], we extended our analysis to the 2600 cancer cell lines in the Cell Model Passports database (https://cellmodelpassports.sanger.ac.uk/) [15]. We first conducted quality control analysis on the whole dataset, which showed variations in the numbers of models (i.e. cell lines) related to different tissues (Supplementary Fig. S9A) and in the representation of tissue status (e.g. tumor versus metastasis) in the database (Supplementary Fig. S9B). Importantly, the number of detected genes and proteins showed acceptable variability and a normal distribution across the database (Supplementary Fig. S9C and D). Spearman’s rank correlation revealed only a weak correlation between RNA and protein expression (Spearman’s Rho = 0.181 globally across all models; Fig. 7A). GC3 scores based on proteomics data showed a relatively tight distribution across all models with no major variations (Fig. 7B). Next, we calculated the isoacceptors T-stat of the top 10% expressed proteins in the 948 models that remained after filtering and plotted them using k-means and hierarchical clustering (Fig. 7C). We observed several major clusters, with one cluster being significantly G/C-ending biased (K-mean cluster 1) while another cluster was A/T-ending biased (K-mean cluster 2; Fig. 7C). Examining the cluster profiles revealed that models originating from the same tissue were spread across clusters and overlapped with models from other tissues, indicating intra-tissue heterogeneity in terms of isoacceptors frequencies (Supplementary Fig. S10A). To examine this notion, we analyzed models by tissue of origin. We observed significant intra-tissue heterogeneity across all tissues and models. For example, among cell models from the central nervous system and among those from hematopoietic and lymphoid tissues, we observed distinct clusters based on isoacceptors codon bias (Supplementary Fig. S10B and C). Thus, the k-means clustering cannot be attributed only to the tissue of origin of the model but rather indicates a more complex picture. Notwithstanding these observations, we noted patterns of enrichment of different tissue origins, cancer types, tissue status, and growth properties of the models across the clusters (Supplementary Fig. S11). For example, cells growing in suspension and cells derived from metastatic tumors were more represented in cluster 1 (Supplementary Fig. S11C and D). More granular analysis of cancer type revealed specific clustering. For example, breast carcinoma was more represented in cluster 2, while B-cell non-Hodgkin’s lymphoma was more represented in cluster 1 (Supplementary Fig. S11B).
Figure 7.
Analysis of cancer cell line atlas reveals heterogeneity in codon usage. (A) Spearman’s rank correlation analysis of RNA and protein expression across all examined models (i.e. cell lines). (B) Histogram of top 10% expressed proteins GC3 scores across all examined models. (C) Heatmap with K-mean clustering of isoacceptors frequencies T-stat of the top 10% expressed proteins in all tested models.
We wanted to understand whether this heterogeneity in the codon usage of cancer cell lines is an inherent cancer feature due to patient heterogeneity or is due to culturing conditions. To do so, we retrieved proteomics data of human cancer samples from a recently published dataset [17]. The dataset comprised 999 tumor samples representing 22 cancer types. We retrieved the preprocessed protein intensities per sample, normalized the protein expression, selected the top 10% expressed proteins per sample, and then analyzed the isoacceptors frequencies for each sample to evaluate patient-to-patient heterogeneity in codon usage across cancers. We observed that, within each of the examined tumor types, isoacceptors codon frequencies were largely homogeneous, with few variations in certain codons across samples (Supplementary Fig. S12). Thus, the heterogeneity of codon usage observed in cancer cell lines appears to be a feature of the cell lines rather than of the parent tumors. However, this should be validated by comparing cell lines and their tumors of origin using the same methodologies. Unfortunately, such datasets are not available in the literature. It is also important to note the difference between the patterns observed for differentially expressed proteins (Fig. 6), which revealed an A/T-ending bias compared to normal tissues, and those observed for the top expressed proteins in cancer samples, which showed clear synonymous codon usage bias but a less defined A/T- versus G/C-ending bias (Supplementary Fig. S12). This contrast highlights the difference between oncogenic programs (i.e. divergence from normal tissues) and abundant proteins, which could be expressed at the same levels in cancer and normal tissues (i.e. housekeeping genes).
In summary, we observed heterogeneous codon usage patterns across the many cell lines used in cancer research, which was not observed when we examined human cancer samples proteomes. This heterogeneity underscores important unaddressed limitations in cancer research and raises questions regarding the replicability of findings related to mRNA translation dynamics, for example due to tRNA modification changes or other epitranscriptomic features, across different cell lines. Thus, there is a need to expand studies investigating translational changes in cancer cells to include different cell lines with different codon usage dynamics to ensure the robustness of observations and relevance to clinical cancer features and to be able to discern cell line-specific adaptations from intrinsic tumor biology.
Discussion
In this work, we employed two complementary metrics to investigate how codon-biased translation shapes mRNA translation and links protein output to biological function. GC3 scores provide a global measure of codon bias, reflecting whether genes preferentially use A/T-ending or G/C-ending codons [23]. In contrast, isoacceptors codon frequencies capture synonymous codon bias at higher resolution, offering a more granular view of codon preference within individual CDS [35]. Using these metrics, we observed that GC3 scores exhibit evolutionary drift from rodents to humans as a result of both synonymous and nonsynonymous recombination, consistent with previous reports of gBGC [26, 27], yet without altering gene functional enrichment. Importantly, these evolutionary shifts did not lead to major changes in isoacceptors codon frequencies, with only a limited number of outliers showing divergence [10]. These observations have important implications in interpreting results from rodent models commonly used to study various diseases or to probe the molecular basis of codon decoding. Our findings demonstrate that conclusions drawn from rodent models regarding codon-biased translational programs are broadly transferable to humans at the functional level, despite underlying evolutionary differences. Both GC3- and isoacceptors-based analyses revealed robust functional stratification of genes according to their codon content. At the global level, A/T-ending versus G/C-ending codon bias derived from GC3 scores consistently separated genes involved in proliferation from those associated with differentiation and neuronal functions, at both the gene and pathway levels [36], consistent with recent evidence that AT3- and GC3-biased codons are differentially utilized in proliferative versus differentiated states [34]. 
Synonymous codon usage analysis revealed a more granular view of functional gene enrichment and stratifications, demonstrating that differential enrichment of synonymous codons can support distinct translational programs, as illustrated by the histidine and valine codon examples presented here.
We also show, for the first time, how tRNA modification indices, specifically the ANN index and m7G-index used in this work, can explain how tRNA modifications are linked to oncogenic programs [5]. These indices reveal that genes enriched in codons decoded by t6A- and m7G-dependent tRNAs preferentially engage A/T-ending codon programs, suggesting that a global A/T-biased translational landscape is a recurring feature of cancer, in agreement with mechanistic studies showing that codon composition directly influences translational efficiency in proliferative contexts [34]. We observed this feature in many cancers when we compared human cancer proteomes with their tissues of origin [16]. Although this A/T-ending bias was prominent in most cancers, notable exceptions were identified, showcasing the need for future studies that systematically profile tRNA expression and modification landscapes across diverse cancer types, subtypes, and clinical contexts to better understand these variations. These observations align with experimental evidence linking t6A and tRNA m7G to cancer progression and stemness, and with their established roles in shaping decoding efficiency and codon optimality [12, 18–21, 30, 32]. The enzymatic systems responsible for these modifications, YRDC/KEOPS for t6A and METTL1-WDR4 for m7G, are frequently reported to be upregulated in multiple cancers [1, 5, 12, 20, 30], and perturbation of either pathway often produces convergent phenotypes, including enhanced stemness and increased chemotherapy resistance. Although these modifications occur outside the anticodon itself (t6A at position 37 and m7G in the variable/T-arm region), they can still alter translational output by modulating tRNA structure, stability, and decoding efficiency of their cognate codon sets. Accordingly, increased activity of these pathways is expected to preferentially amplify translation of gene programs enriched for the corresponding codons. 
However, we note that not all codons decoded by m7G-modified tRNAs are A/T-ending; some modified tRNAs recognize mixed codon families that include both A/T- and G/C-ending codons, such as tRNA-ArgTCT, which decodes both AGA and AGG codons and is recognized as oncogenic driver [12]. Accordingly, the indices used here should be interpreted as empirical summaries of the curated consensus codon sets analyzed in this study, providing a first-pass framework for phenotype interpretation and hypothesis generation. More direct approaches, including codon-level Ribo-seq and targeted proteomic analyses, will be required to mechanistically resolve how specific tRNA modifications alter codon-dependent translation in a context- and system-dependent manner. More broadly, multiple tRNA modifications—including anticodon-adjacent (position 37 modifications) [37], wobble, and body modifications—have been implicated in tumorigenesis and oxidative-stress responses by reweighting decoding landscapes, either locally by favoring translation of codon-enriched transcripts or more globally by biasing translation toward broader codon programs depending on context (e.g. A/T-ending codon biased translation described herein) [3, 4, 8]. Taken together, these studies support the view that multiple epitranscriptomic pathways can converge on predictable, codon-program–level consequences, either independently or in combination, in a context-dependent manner.
Analysis of cancer cell lines revealed pronounced heterogeneity in codon usage among lines derived from the same tissue or cancer origin, a pattern not observed in patient tumor proteomes. This suggests that codon usage heterogeneity is largely a feature of cancer cell lines, potentially arising from culturing conditions or adaptations acquired during their establishment and propagation. This observation raises important concerns regarding the interpretation and reproducibility of findings derived from individual cell line models. Consequently, studies investigating translational dysregulation in cancer should incorporate multiple cell lines with distinct codon usage profiles to assess whether observed effects are model-specific or broadly representative of a given cancer type or subtype.
One of the interesting applications of understanding codon usage bias in healthcare is the codon reengineering of mRNA/gene constructs to achieve better protein production in target cells/tissues while reducing aberrant off-target protein production. We previously demonstrated this principle in mouse tissues using in vivo adenovirus-mediated gene delivery of mutant EGFP constructs [9], and similar findings were reported in vitro by Davis et al. [10]. Thus, a complete understanding of codon usage at the tissue, cell, and cell-state levels is essential for future biotechnological applications. Our current analysis extends these observations to human tissues, showing that distinct tissues exhibit characteristic codon usage signatures that broadly mirror, with some differences, those observed in mice [9]. Importantly, these patterns were not observable when we examined mRNA expression datasets of the same tissues, highlighting the importance of translational regulation in dictating codon usage and proteome output and supporting Plotkin et al.’s [11] intuition regarding potential codon-mediated translational control influencing tissue-specific gene products in humans. We further observed distinct codon usage signatures when we examined the proteomes of human cancers and cancer cell lines. Nonetheless, it is important to note that current approaches to studying cancer translation or proteomics are performed in bulk. Thus, the true cancer cell signatures might be diluted or altered by the presence of other cells in the samples, such as immune cells. Despite this limitation, our computational analyses consistently identify a global A/T-ending codon bias in the proteomes of most cancers, supporting the existence of a cancer-associated codon signature. Previous works also alluded to the presence of such signatures in cancers. For example, Rapino et al. 
[38] showed that wobble uridine tRNA modifications drive the translation of proteins enriched in AAA, GAA, and CAA codons in BRAF(V600E)-expressing melanoma cells. The upregulation of these proteins was linked to resistance to chemotherapy in melanoma.
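As an illustration of how a third-position codon bias of this kind can be quantified for a single coding sequence, the following minimal Python sketch computes the fraction of G/C-ending codons (GC3) and its A/T-ending complement. It is not the pipeline used in this study. The function name `gc3_fraction`, the example sequence, and the choice to skip Met, Trp, and stop codons are assumptions made for the example.

```python
# Assumption: Met (ATG) and Trp (TGG) have a single codon each and stop codons
# encode no amino acid, so all five are skipped. Other GC3 definitions exist.
UNINFORMATIVE = frozenset({"ATG", "TGG", "TAA", "TAG", "TGA"})


def gc3_fraction(cds: str, skip: frozenset = UNINFORMATIVE) -> float:
    """Return the fraction of counted codons whose third nucleotide is G or C."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA input
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counted = [c for c in codons if c not in skip]
    if not counted:
        raise ValueError("no informative codons in sequence")
    gc_ending = sum(1 for c in counted if c[2] in "GC")
    return gc_ending / len(counted)


# Hypothetical toy CDS: ATG (skipped), GCC (G/C-ending), AAA (A/T-ending), TAA (skipped)
example = "ATGGCCAAATAA"
gc3 = gc3_fraction(example)
print(f"GC3 = {gc3:.2f}, A/T-ending fraction = {1 - gc3:.2f}")
```

In a proteome-level analysis, a per-gene score of this type would typically be weighted by protein or mRNA abundance. That weighting scheme is left out here because it depends on the dataset.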
Although the analysis here focused on cancer datasets, owing to their availability and relatively high quality, it is important to note that the applications extend beyond the cancer field. Translational deregulation and the consequent proteostasis dysregulation are known to occur in many conditions such as aging [39, 40], diabetes [13, 41], and neurodegenerative diseases [42] to name a few. Understanding global and specific codon patterns in disease conditions could, in theory, allow for the design of more robust gene/mRNA therapeutics that can achieve better results in target cells.
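To make the idea of codon-informed construct design concrete, the sketch below recodes a coding sequence toward A/T-ending synonymous codons while leaving the encoded protein unchanged. It is a deliberately naive illustration, not a method from this study. The names `recode_third_base` and `translate` are hypothetical, the standard genetic code is assumed, and the rule of picking the synonymous codon with the fewest base changes is an arbitrary choice made for the example. A real design would also need to consider tRNA availability and modification status, mRNA structure, and regulatory motifs.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def _codons(cds: str) -> list:
    """Split a CDS into codons after normalising case and U -> T."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    for codon in codons:
        if codon not in CODON_TO_AA:
            raise ValueError(f"unrecognised codon: {codon}")
    return codons


def translate(cds: str) -> str:
    """Translate a CDS into one-letter amino acids ('*' marks a stop codon)."""
    return "".join(CODON_TO_AA[c] for c in _codons(cds))


def recode_third_base(cds: str, prefer: str = "AT") -> str:
    """Replace codons with synonymous codons ending in a preferred base, if any exist."""
    out = []
    for codon in _codons(cds):
        if codon[2] in prefer:
            out.append(codon)  # already ends in a preferred base
            continue
        aa = CODON_TO_AA[codon]
        candidates = [c for c, a in CODON_TO_AA.items() if a == aa and c[2] in prefer]
        # Assumption: prefer the candidate that changes the fewest bases.
        candidates.sort(key=lambda c: sum(x != y for x, y in zip(c, codon)))
        out.append(candidates[0] if candidates else codon)
    return "".join(out)


original = "ATGGCCAAGTGGTAG"  # hypothetical toy CDS: M A K W stop
recoded = recode_third_base(original)
assert translate(recoded) == translate(original)  # protein sequence is preserved
print(original, "->", recoded)
```

Codons for Met and Trp are left untouched because they have no synonymous alternative. Passing `prefer="GC"` gives the opposite, G/C-ending recoding.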
It is important to highlight that our analysis and conclusions regarding cancer codon usage shifts are limited by the quality of the available datasets. Technical variation or low-quality samples could skew the analysis. While we selected datasets from studies that we believe are of high quality, further orthogonal analyses and validation are required to fully comprehend the influence of codon usage and bias on cancer initiation, progression, and outcomes.
In conclusion, our study demonstrates that codon usage and bias segregate genes into functionally coherent groups and that codon-biased translation is finely tuned to the physiological demands of tissues and cells. We identify a recurrent A/T-biased codon signature in cancer proteomes, which may be leveraged to optimize gene and mRNA therapeutics targeting malignant cells. We anticipate that continued investigation of translational regulation in cancer, particularly through the generation of high-resolution epitranscriptomic and translatomic datasets at the tissue and single-cell levels, will be essential for advancing our understanding of cancer biology and for the development of novel therapeutic strategies.
Acknowledgements
The authors report no conflicts of interest or ethical concerns regarding this work. A large language model (ChatGPT) was used for language editing and as an aid in code writing.
Author contributions: Sherif Rashad (Conceptualization [lead], Data curation [lead], Formal analysis [lead], Funding acquisition [equal], Investigation [lead], Methodology [lead], Project administration [equal], Visualization [lead], Writing—original draft [lead]), Kuniyasu Niizuma (Funding acquisition [equal], Project administration [equal], Writing—original draft [supporting]).
Contributor Information
Sherif Rashad, Department of Neurosurgical Engineering, Graduate School of Biomedical Engineering, Tohoku University, Sendai 980-8575, Japan; Department of Translational Neuroscience, Tohoku University Graduate School of Medicine, Sendai 980-8575, Japan.
Kuniyasu Niizuma, Department of Neurosurgical Engineering, Graduate School of Biomedical Engineering, Tohoku University, Sendai 980-8575, Japan; Department of Translational Neuroscience, Tohoku University Graduate School of Medicine, Sendai 980-8575, Japan; Department of Neurosurgery, Tohoku University Graduate School of Medicine, Sendai 980-8575, Japan.
Supplementary data
Supplementary data is available at NAR Genomics & Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was supported by the Japan Society for the Promotion of Science grant number 23H02741 for S.R. and by JST Moonshot R&D project number JPMJPS2023 for K.N. Open access charges were paid by The Support Program for the Article Processing Charge (APC) For Tohoku University Members.
Data availability
No new data was generated for this manuscript.
References
- 1. Rashad S, Marahleh A. Metabolism meets translation: dietary and metabolic influences on tRNA modifications and codon biased translation. WIREs RNA. 2025;16:e70011. 10.1002/wrna.70011.
- 2. Suzuki T. The expanding world of tRNA modifications and their disease relevance. Nat Rev Mol Cell Biol. 2021;22:375–92. 10.1038/s41580-021-00342-0.
- 3. Rashad S, Al-Mesitef S, Mousa A et al. Translational response to mitochondrial stresses is orchestrated by tRNA modifications. bioRxiv, 10.1101/2024.02.14.580389, 14 February 2024, preprint: not peer reviewed.
- 4. Huber SM, Begley U, Sarkar A et al. Arsenite toxicity is regulated by queuine availability and oxidation-induced reprogramming of the human tRNA epitranscriptome. Proc Natl Acad Sci USA. 2022;119:e2123529119. 10.1073/pnas.2123529119.
- 5. Dedon PC, Begley TJ. Dysfunctional tRNA reprogramming and codon-biased translation in cancer. Trends Mol Med. 2022;28:964–78. 10.1016/j.molmed.2022.09.007.
- 6. Chionh YH, McBee M, Babu IR et al. tRNA-mediated codon-biased translation in mycobacterial hypoxic persistence. Nat Commun. 2016;7:13302. 10.1038/ncomms13302.
- 7. Chan CTY, Pang YLJ, Deng W et al. Reprogramming of tRNA modifications controls the oxidative stress response by codon-biased translation of proteins. Nat Commun. 2012;3:937. 10.1038/ncomms1938.
- 8. Shen C, Che Y, Zhou K et al. ALKBH1 drives tumorigenesis and drug resistance via tRNA decoding reprogramming and codon-biased translation. Cancer Discov. 2025;15:2298–325. 10.1158/2159-8290.CD-24-1043.
- 9. Ando D, Rashad S, Begley TJ et al. Decoding codon bias: the role of tRNA modifications in tissue-specific translation. Int J Mol Sci. 2025;26:706. 10.3390/ijms26020706.
- 10. Davis ET, Raman R, Byrne SR et al. Genes and pathways comprising the human and mouse ORFeomes display distinct codon bias signatures that can regulate protein levels. bioRxiv, 10.1101/2025.02.03.636209, 4 February 2025, preprint: not peer reviewed.
- 11. Plotkin JB, Robins H, Levine AJ. Tissue-specific codon usage and the expression of human genes. Proc Natl Acad Sci USA. 2004;101:12588–91. 10.1073/pnas.0404957101.
- 12. Orellana EA, Liu Q, Yankova E et al. METTL1-mediated m(7)G modification of Arg-TCT tRNA drives oncogenic transformation. Mol Cell. 2021;81:3323–38. 10.1016/j.molcel.2021.06.031.
- 13. Santos M, Anderson CP, Neschen S et al. Irp2 regulates insulin production through iron-mediated Cdkal1-catalyzed tRNA modification. Nat Commun. 2020;11:296. 10.1038/s41467-019-14004-5.
- 14. Jiang L, Wang M, Lin S et al. A quantitative proteome map of the human body. Cell. 2020;183:269–83. 10.1016/j.cell.2020.08.036.
- 15. Gonçalves E, Poulos RC, Cai Z et al. Pan-cancer proteomic map of 949 human cell lines. Cancer Cell. 2022;40:835–49. 10.1016/j.ccell.2022.06.010.
- 16. Lv D, Li D, Cai Y et al. CancerProteome: a resource to functionally decipher the proteome landscape in cancer. Nucleic Acids Res. 2024;52:D1155–62. 10.1093/nar/gkad824.
- 17. Knol JC, Lyu M, Böttger F et al. The pan-cancer proteome atlas, a mass spectrometry-based landscape for discovering tumor biology, biomarkers, and therapeutic targets. Cancer Cell. 2025;43:1328–46. 10.1016/j.ccell.2025.05.003.
- 18. García-Vílchez R, Añazco-Guenkova AM, Dietmann S et al. METTL1 promotes tumorigenesis through tRNA-derived fragment biogenesis in prostate cancer. Mol Cancer. 2023;22:119.
- 19. García-Vílchez R, Añazco-Guenkova AM, López J et al. N7-methylguanosine methylation of tRNAs regulates survival to stress in cancer. Oncogene. 2023;42:3169–81.
- 20. Dai Z, Liu H, Liao J et al. N(7)-Methylguanosine tRNA modification enhances oncogenic mRNA translation and promotes intrahepatic cholangiocarcinoma progression. Mol Cell. 2021;81:3339–55. 10.1016/j.molcel.2021.07.003.
- 21. Lin S, Liu Q, Lelyveld VS et al. Mettl1/Wdr4-mediated m(7)G tRNA methylome is required for normal mRNA translation and embryonic stem cell self-renewal and differentiation. Mol Cell. 2018;71:244–55. 10.1016/j.molcel.2018.06.001.
- 22. Rohart F, Gautier B, Singh A et al. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13:e1005752. 10.1371/journal.pcbi.1005752.
- 23. Benisty H, Hernandez-Alias X, Weber M et al. Genes enriched in A/T-ending codons are co-regulated and conserved across mammals. Cell Syst. 2023;14:312–323 e313.
- 24. Rashad S, Al-Mesitef S, Mousa A et al. Translational response to mitochondrial stresses is orchestrated by tRNA modifications. bioRxiv, 10.1101/2024.02.14.580389, 14 February 2024, preprint: not peer reviewed.
- 25. Rashad S, Byrne SR, Saigusa D et al. Codon usage and mRNA stability are translational determinants of cellular response to canonical ferroptosis inducers. Neuroscience. 2022;501:103–30. 10.1016/j.neuroscience.2022.08.009.
- 26. Rousselle M, Laverré A, Figuet E et al. Influence of recombination and GC-biased gene conversion on the adaptive and nonadaptive substitution rate in mammals versus birds. Mol Biol Evol. 2019;36:458–71. 10.1093/molbev/msy243.
- 27. Palidwor GA, Perkins TJ, Xia X. A general model of codon bias due to GC mutational bias. PLoS One. 2010;5:e13431. 10.1371/journal.pone.0013431.
- 28. Ando D, Rashad S, Begley TJ et al. Decoding codon bias: the role of tRNA modifications in tissue-specific translation. Int J Mol Sci. 2025;26:706. 10.3390/ijms26020706.
- 29. Rashad S. Queuosine tRNA modification: connecting the microbiome to the translatome. Bioessays. 2024;47:e202400213.
- 30. Wu X, Yuan H, Wu Q et al. Threonine fuels glioblastoma through YRDC-mediated codon-biased translational reprogramming. Nat Cancer. 2024;5:1024–44. 10.1038/s43018-024-00748-7.
- 31. Akiyama N, Ishiguro K, Yokoyama T et al. Structural insights into the decoding capability of isoleucine tRNAs with lysidine and agmatidine. Nat Struct Mol Biol. 2024;31:817–25. 10.1038/s41594-024-01238-1.
- 32. Guo J, Zhu P, Ye Z et al. YRDC mediates the resistance of lenvatinib in hepatocarcinoma cells via modulating the translation of KRAS. Front Pharmacol. 2021;12:744578. 10.3389/fphar.2021.744578.
- 33. Allen SR, Stewart RK, Rogers M et al. Distinct responses to rare codons in select Drosophila tissues. eLife. 2022;6:e76893. 10.7554/eLife.76893.
- 34. Guimaraes JC, Mittal N, Gnann A et al. A rare codon-based translational program of cell proliferation. Genome Biol. 2020;21:44. 10.1186/s13059-020-1943-5.
- 35. Tumu S, Patil A, Towns W et al. The gene-specific codon counting database: a genome-based catalog of one-, two-, three-, four- and five-codon combinations present in Saccharomyces cerevisiae genes. Database. 2012;8:bas002. 10.1093/database/bas002.
- 36. Gingold H, Tehler D, Christoffersen NR et al. A dual program for translation regulation in cellular proliferation and differentiation. Cell. 2014;158:1281–92. 10.1016/j.cell.2014.08.011.
- 37. Weller C, Bartok O, McGinnis CS et al. Translation dysregulation in cancer as a source for targetable antigens. Cancer Cell. 2025;43:823–840.e18. 10.1016/j.ccell.2025.03.003.
- 38. Rapino F, Delaunay S, Rambow F et al. Codon-specific translation reprogramming promotes resistance to targeted therapy. Nature. 2018;558:605–9. 10.1038/s41586-018-0243-7.
- 39. Keele GR, Zhang JG, Szpyt J et al. Global and tissue-specific aging effects on murine proteomes. Cell Rep. 2023;42:112715. 10.1016/j.celrep.2023.112715.
- 40. Pechmann S. Ageing-related decline of translation as a consequence of transcription dysregulation. J R Soc Interface. 2025;22:20250323. 10.1098/rsif.2025.0323.
- 41. Steinthorsdottir V, Thorleifsson G, Reynisdottir I et al. A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet. 2007;39:770–5. 10.1038/ng2043.
- 42. Bento-Abreu A, Jager G, Swinnen B et al. Elongator subunit 3 (ELP3) modifies ALS through tRNA modification. Hum Mol Genet. 2018;27:1276–89. 10.1093/hmg/ddy043.