Abstract
Engineering translation holds great promise for maximizing protein yields in agriculture and biotechnology, but the diversity of plant genomes hinders predictable engineering. To identify mRNA features that broadly improve translation, we conducted a comparative translatome analysis across model plants. We found that codons with G or C at the third position (GC3) are consistently associated with higher translation efficiency. Experimental results confirmed that elevating GC3 increases both protein output and mRNA abundance, in both GC- and AT-rich species. Comparative analyses across 80 plant species, spanning a wide range of GC3 levels, show that GC3 content is positively correlated with translation efficiency. Additionally, high GC3-codon usage is conserved among endogenous high-abundance proteins, such as Rubisco small subunits and ribosomal proteins. Finally, tRNA availability likely explains why GC3 codons broadly enhance translation. Together, our results provide a simple guideline for codon optimization: increasing GC3 can enhance protein production across diverse plants.
INTRODUCTION
Among the various layers of gene expression regulation, translation represents a promising target for engineering gene expression due to its direct impact on protein production. Manipulating mRNA features that influence translation has enabled the modulation of protein output in diverse plant species 1–5. Gene-specific regulatory elements, such as upstream open reading frames (uORFs) in 5′ UTRs and binding sites for specific RNA-binding proteins, often located in 3′ UTRs, provide transcript-specific translational regulation 6–9. In contrast, certain sequence features exert more global effects on translation. For example, the nucleotide context surrounding the start codon, known as the Kozak sequence, plays a critical role in translation initiation across eukaryotes 10–13. In GC-rich plants, such as rice and maize, manipulating codon usage, specifically the GC content at the third nucleotide position of the codon (GC3), has high potential for enhancing protein production. In rice, polysome profiling revealed that highly translated mRNAs are associated with higher GC3 levels 14. In maize, increasing GC3 content in transgenes elevates both mRNA and protein levels, likely by reducing siRNA-mediated gene silencing 15, though its role in translation remains to be investigated.
Codon optimization, which substitutes synonymous codons to match the host species’ codon usage, has long been employed for heterologous protein production 16,17. The availability or complete lack of certain tRNAs is a major reason why codon optimization is often necessary, particularly for expression in prokaryotic species 18–20. In eukaryotes, tRNAs have diversified, increasing both the total number of tRNA genes and the diversity of anticodons 21. Additionally, wobble base-pairing allows certain tRNA anticodons to recognize multiple codons 19,22,23. While codon optimization has successfully increased target proteins in many plants 24–27, it is not universally effective and, in some cases, has led to reduced protein production 28–30. These findings suggest that codon usage is subject to additional regulation that remains to be elucidated across plants.
The diversity of crops and plants overall presents a challenge for our ability to reliably predict and engineer translation. Early studies analyzing the annotated coding sequences have shown that for the Kozak sequence, eudicots preferentially use AAAAUGGC, whereas monocots prefer GCCAUGGC (AUG indicates start codon) 12,31,32. For codon usage, eudicots preferentially use A and U in the 3rd position of each codon, while monocots prefer G and C 16,33–35. Yet, it remains unclear how these sequence features directly affect translation globally, because translation was difficult to quantify at genome scale until recently.
Our ability to quantify the translation efficiency (TE) of individual mRNAs has been revolutionized by ribosome profiling (Ribo-seq). Ribo-seq is a method for deep sequencing of ribosome footprints, and can be used to infer TE when normalized to transcript abundance from RNA-seq 36. Ribo-seq has revealed 100-fold differences in TE in yeast, and 4000-fold differences in TE in maize 36,37. Using hundreds of Ribo-seq datasets in yeast, mammals, and humans, in-depth models have been developed in these species to predict TE of a given mRNA based on its sequence features 38–42.
Here we aimed to uncover mRNA features that universally increase translation efficiency and protein production in diverse plant species. We employed both computational and experimental strategies and found that increasing GC3 could be widely applied to enhance protein production in both AT-rich and GC-rich plants. The conserved GC3-favoring tRNA pools help explain why GC3 is linked to high translation. We propose that increasing GC3 could be a universal codon-optimization guideline for maximizing protein yields in plants and possibly other eukaryotes.
RESULTS
Identifying mRNA features associated with high translation efficiency
To identify mRNA features associated with translation efficiency (TE) regulation across plants, we sought to compare mRNA sequences between high-TE genes and low-TE genes from monocots and eudicots. To this end, we analyzed Ribo-seq and RNA-seq data from two monocots, maize (Zea mays B73) 37 and rice (Oryza sativa Nipponbare) 43, and two eudicots, Arabidopsis (Arabidopsis thaliana Col-0) 9 and tomato (Solanum lycopersicum Heinz 1706) 44. Using the ratio of Ribo-seq to RNA-seq abundance, we calculated TE for >17,000 genes per species. Subsequently, we defined the genes with the highest and lowest 10% of TE values as high-TE genes and low-TE genes in each species (Fig. 1A). Consistent with this classification, protein abundance measured by quantitative proteomics shows that high-TE genes are associated with higher protein levels compared to low-TE genes (Fig. 1B).
Figure 1: Comparison of high-TE and low-TE genes across four plant species.
A) Workflow of identification of high-TE and low-TE genes. B) Violin plots showing high-TE genes are associated with higher protein abundance measured by quantitative proteomics. C) Sequence bias of high-TE genes in four model species. While all four species show strong Kozak sequence bias, enrichment of G and C in the 3rd nucleotide of each codon (GC3) was only observed in monocots (maize and rice) but not in eudicots (Arabidopsis and tomato). Sequence logos displaying 6 nt upstream and 30 nt downstream from the CDS start codon are shown.
Given that 5′ UTRs can strongly influence translation initiation, we first performed a motif search using 5′ UTR plus 20 nt of coding sequences in high-TE genes, with sequences from low-TE genes serving as the control. While several motifs within the 5′ UTR were enriched in high-TE genes in various species, e.g., UA-rich and CA-rich regions in Arabidopsis, maize, and rice, as well as AGC motifs in maize and rice, we found the Kozak sequence was consistently identified as the most significant motif across all four species (Supplemental Fig. 1). Sequence logo comparisons confirmed that high-TE genes display a stronger bias in the Kozak sequence relative to low-TE genes (Fig. 1C and Supplemental Fig. 2), consistent with the established role of the Kozak sequence in regulating translation initiation 45. The monocots maize and rice prefer GCCAUGGCG, whereas the eudicots Arabidopsis and tomato prefer AAAAAUGGC, coinciding with the GC-rich genomes of maize and rice, in contrast to the AT-rich genomes of Arabidopsis and tomato.
Besides Kozak sequences, we observed that both maize and rice display sequence bias extending further into the coding sequence. Specifically, the 3rd nucleotide of each codon shows a strong preference for G or C (GC3) in high-TE genes (Fig. 1C), suggesting that GC3 codons are associated with increased translation. In contrast, the two eudicots do not exhibit sequence bias beyond the Kozak sequence (Fig. 1C).
GC3 codons are associated with high TE
To investigate the role of GC3 codons in TE regulation, we examined which codons are favored by high-TE genes compared to low-TE genes across the four species (Fig. 2A). In maize and rice, we observed strong codon preference for each amino acid in high-TE genes. Importantly, the codons favored by high-TE genes are all GC3 codons. The only GC3 codons that are not preferred by high-TE genes are found in six-fold degenerate amino acids (UUG for Leu, and AGG for Arg). In contrast, in Arabidopsis and tomato, the bias for GC3 codons is much weaker: only some amino acids, such as Ile, Lys, Phe, and Tyr, show a minor preference for the GC3 codons (Fig. 2A).
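The comparison described above can be illustrated with a short Python sketch. It is a minimal illustration rather than the analysis code used in this study: the function names (`codon_fractions`, `differential_usage`) are hypothetical, the standard genetic code is assumed, and usage is expressed as the fraction of each codon among its synonymous codons, which is one reasonable reading of "differential codon usage".

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codon_fractions(cds_list):
    """Fraction of each sense codon among the synonymous codons of its amino acid."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        # Ignore any trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


def differential_usage(high_cds, low_cds):
    """Per-codon difference in synonymous fraction (high-TE minus low-TE)."""
    high = codon_fractions(high_cds)
    low = codon_fractions(low_cds)
    return {codon: high.get(codon, 0.0) - low.get(codon, 0.0)
            for codon, aa in CODON_TABLE.items() if aa != "*"}
```

A positive value for a codon means that it is used more often, relative to its synonyms, in the high-TE group than in the low-TE group.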
Figure 2: Higher TE is associated with increased use of GC3 codons.
A) Differential codon usage of each amino acid between high- and low-TE genes. GC3 codons are highlighted in red font and outlined in black in the bar charts. B) Violin plots showing the distribution of GC3 content in all expressed genes, low-TE, and high-TE groups in the four species. Significance determined by Wilcoxon rank-sum test (ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001).
We next examined the overall CDS (coding sequence) GC3 content versus TE across species (Fig. 2B). In the GC-rich species maize and rice, we observed a striking difference in GC3 content between high-TE and low-TE genes, with high-TE genes having a median GC3 content of 90% and 77%, compared to 47% and 51% in low-TE genes, respectively (Fig. 2B). These results are consistent with previous rice polysome profiling data linking high GC3 with high TE, as well as with transgenic maize research showing that elevated GC3 increases target protein levels 14,15. In the AT-rich species Arabidopsis and tomato, AU3 codons are more common, so we expected that high-TE genes may have lower GC3. Surprisingly, high-TE genes in Arabidopsis still show a small but significant increase in GC3 compared to low-TE genes (median 44% vs. 41%) (Fig. 2B). In tomato, which has the lowest GC content of the four species, high-TE genes also exhibit slightly higher GC3, although the difference is not statistically significant (median 38% vs. 37%) (Fig. 2B).
To corroborate these findings, we analyzed an additional 22 Ribo-seq datasets, which produced consistent results across all four species from different growth conditions and tissue types (Supplemental Table 1, Supplemental Fig. 3). Together, these findings support that higher GC3 is correlated with higher TE, with much stronger differences observed in monocots compared to eudicots.
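For concreteness, GC3 content as used here can be computed as the percentage of codons whose third nucleotide is G or C. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the CDS is assumed to be in frame starting at the first nucleotide, and the stop codon is counted if it is included in the input (whether the study counted it is not stated here).

```python
from statistics import median


def gc3_content(cds):
    """Percentage of codons in an in-frame CDS with G or C at the third position."""
    seq = cds.upper().replace("U", "T")
    third_positions = seq[2::3]  # nucleotides 3, 6, 9, ... of the CDS
    if not third_positions:
        raise ValueError("CDS is shorter than one codon")
    n_gc = sum(base in "GC" for base in third_positions)
    return 100.0 * n_gc / len(third_positions)


def median_gc3(cds_list):
    """Median GC3 content (%) of a group of coding sequences."""
    return median(gc3_content(cds) for cds in cds_list)
```

Applying `median_gc3` separately to the high-TE and low-TE gene sets would give group medians of the kind compared in Fig. 2B.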
Increasing GC3 improves protein production in both GC- and AT-rich species
To directly test whether high GC3 content can increase protein production in GC-rich and AT-rich species, we performed dual luciferase assays using luciferases with varying GC3 content but identical amino acid sequences. In the first set of plasmids, we designed Firefly luciferase (Fluc) with low (50%), mid (64%), and high (83%) GC3 content, and used Renilla luciferase (Rluc) as the internal control (Fig. 3A, Supplemental File 1). In maize, mid GC3 and high GC3 Fluc variants yielded 5.6-fold and 55-fold increases in relative luminescence (a proxy of Fluc protein levels) over the low GC3 variant (Fig. 3C, left). We observed a similar trend in Arabidopsis, with mid GC3 and high GC3 showing 2.2-fold and 4.1-fold higher luminescence than the low GC3 variant (Fig. 3E, left). In BY2 cells from tobacco, which belongs to the same family as tomato (Solanaceae), mid GC3 and high GC3 also showed 1.5-fold and 2.6-fold higher luminescence than low GC3 (Fig. 3G, left). These results demonstrate that high GC3 content can increase protein production, even in AT-rich species.
Figure 3: Higher GC3 codons increase protein production across plants, even for AT-rich species.
A-B) Schematic of plasmids to test the effect of GC3 content on protein production using two different reporters. C-H) Relative protein levels from dual luciferase assay and relative mRNA levels from RT-qPCR, quantified for Fluc (C, E, G) or Nluc (D, F, H) in maize (C, D), Arabidopsis (E, F), and tobacco BY2 protoplasts (G, H). The numbers above the graphs indicate the fold change of the median normalized to the low GC3 constructs. Significance determined by Wilcoxon rank-sum test (ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). I) The commonly used fluorescent proteins (FPs) contain extremely high GC3. Top: Comparison of GC3 content of native GFP and DsRed versus widely used engineered versions. Middle: Suggested GC3 content for GFP using various codon optimization tools in Arabidopsis, tomato, maize, and rice. Bottom: GC3 content of expressed genes in the four model plants.
To validate this trend, we performed dual luciferase assays using a second set of plasmids, in which NanoLuc luciferase (Nluc) was modified to contain three levels of GC3 content (44%, 66%, and 87%), with Fluc serving as the internal control (Fig. 3B, Supplemental File 1). Consistently, the Nluc results demonstrated that higher GC3 content increased protein production in all three species (Fig. 3D, F, left and 3H).
To determine whether the increase in Fluc and Nluc luminescence was driven by enhanced translation efficiency or greater mRNA abundance, we quantified the luciferase mRNA levels using RT-qPCR. The qPCR primers were designed within regions that are not codon-modified to ensure fair comparisons across different luciferase variants. Interestingly, in both maize and Arabidopsis, higher GC3 also increased Fluc and Nluc mRNA levels (Fig. 3C–F, right). However, the increase in mRNA levels alone cannot fully account for the observed rise in Fluc and Nluc protein levels, suggesting that enhanced translation efficiency by higher GC3 also contributes to increased protein production.
The increased Fluc and Nluc protein levels in Arabidopsis and tobacco contradict the conventional codon optimization guideline, which recommends matching the host genome’s codon frequency to boost protein production. The low GC3 Fluc and Nluc are the only two luciferases that fall within the normal range of GC3 content in Arabidopsis (Supplemental Fig. 4), yet these low GC3 variants consistently showed the lowest activity (Fig. 3E, F, left and Fig. 3G, H). Moreover, the low GC3 Nluc was specifically designed to match the codon usage of Arabidopsis high-TE genes, but it was still outperformed by both the original Nluc sequence (mid GC3) and the Nluc optimized to maize high-TE genes (high GC3) (Fig. 3E, F). Thus, increasing GC3 content represents a promising new strategy to enhance protein production in both AT- and GC-rich species.
To explore whether high GC3 content broadly enhances heterologous expression, we examined the GC3 composition of other commonly used reporter genes. Fluorescent proteins (FPs), such as GFP and DsRed, are widely used across many organisms and have been extensively engineered to improve expression and brightness 46. Compared to the native GFP and DsRed (32% and 52% GC3), the current commonly used variants show strikingly higher GC3 content (93–97% for GFP; 97–99% for DsRed) (Fig. 3I, top, Supplemental Table 2). The extremely high GC3 content of these FPs well exceeds the GC3 levels recommended by various codon-optimization tools and most of the expressed genes in Arabidopsis, tomato, rice, and maize (Fig. 3I, middle and bottom). The widespread adoption of high-GC3 FPs further supports the capacity of GC3 to enhance protein production across diverse species, regardless of their endogenous GC3 content.
TE and GC3 are positively correlated across diverse lineages
We further investigated whether the preference for GC3 codons in high-TE genes is conserved across broader lineages. Because Ribo-seq data are only available for a limited number of species, we asked whether data from one species could be used to make predictions in another. To explore this, we first identified orthologs among the four model species using OrthoFinder 47, and then examined the correlation of TE between the orthologs in different species across different Ribo-seq datasets. We observed a consistent positive correlation in all species analyzed, with the strongest interspecies correlation between maize and rice (ρ = 0.52–0.58), and the weakest between rice and tomato (ρ = 0.27–0.50) (Fig. 4A). Orthologs of Arabidopsis low-TE and high-TE genes in tomato, maize, and rice showed a consistent and significant difference in TE, validating our approach for identifying low-TE and high-TE gene groups based on orthology (Fig. 4B). Likewise, using ortholog predictions reproduced similar findings that high-TE genes are associated with higher GC3 (Fig. 4C), supporting the applicability of this approach to species lacking Ribo-seq data.
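The interspecies comparison above amounts to a Spearman rank correlation of TE values over ortholog pairs. The following Python sketch shows that calculation under simplifying assumptions: orthologs are supplied as one-to-one gene pairs (how one-to-many OrthoFinder groups were handled is not specified here), and the function names and input dictionaries are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ortholog_te_spearman(te_a, te_b, ortholog_pairs):
    """Spearman's rho of TE between species A and B over ortholog pairs.

    te_a, te_b: dicts mapping gene ID to TE in each species.
    ortholog_pairs: iterable of (gene_in_A, gene_in_B) tuples (assumed one-to-one).
    """
    pairs = [(te_a[a], te_b[b]) for a, b in ortholog_pairs
             if a in te_a and b in te_b]
    x, y = zip(*pairs)
    return pearson(average_ranks(x), average_ranks(y))
```

Pairs in which either gene lacks a TE estimate are dropped before ranking, so the coefficient reflects only orthologs expressed in both datasets.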
Figure 4: High TE is consistently associated with high GC3 across diverse species.
A) Spearman correlation coefficients of TE between different Ribo-seq datasets from Arabidopsis (At), tomato (Sl), rice (Os), and maize (Zm), comparing the same genes if the data are from the same species, or orthologs if comparing two different species. B-C) Violin plots of TE (B) and GC3 (C) of Arabidopsis low-TE and high-TE genes, or of orthologs of Arabidopsis low-TE and high-TE genes in tomato, maize, and rice. Significance determined by Wilcoxon rank-sum test (ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). D) Categorization of 80 species used in this analysis. E) Scatter plot showing that GC3 content is positively correlated with total genomic GC content in the 80 species. Red dashed line: the regression slope; grey dashed line: one-to-one relationship. F) Median GC3 content of low-TE and high-TE genes for each of the 80 species identified using orthologs of Arabidopsis low-TE and high-TE genes. Legend shared with (E). Grey lines connect medians for the same species.
We selected 80 diverse plant species, including a wide range of angiosperms, gymnosperms, and non-seed plants (Fig. 4D, Supplemental Table 3). Their GC3 content ranges from a minimum of 35.6% in the model legume (Medicago truncatula) to a maximum of 66.6% in the duckweed Spirodela polyrhiza (Supplemental Table 3). Most angiosperms outside of Poaceae, including early-diverged angiosperms, eudicots, and non-Poaceae monocots, have GC3 content well below that of the Poaceae cluster (Fig. 4E). Overall, GC3 content is strongly correlated with genome GC content across species (ρ = 0.73, Fig. 4E). Notably, the regression slope between genomic GC content and GC3 content (red dashed line) exceeds the one-to-one linear relationship (grey dashed line) (Fig. 4E), suggesting a selective bias favoring higher GC3 content.
We applied the ortholog-based TE prediction strategy to the 80 diverse species. In all species, the median GC3 content of the high-TE genes is higher than that of the low-TE genes, with statistically significant differences in all 80 species (Fig. 4F). Together, our ortholog-based TE predictions suggest that high TE is positively correlated with high GC3 across diverse species. This result is consistent with our observation that high GC3 increases heterologous protein production across both GC-rich and AT-rich species (Fig. 3).
Endogenous high-abundance proteins exhibit increased GC3 across diverse lineages
We further investigated the relationship between GC3 and protein levels of endogenous genes. Mining Arabidopsis and maize proteomics data, we observed that high-protein genes have significantly higher GC3 content in both species (Fig. 5A, B). This raised the possibility that high-protein genes may have adopted high GC3 content during evolution to enhance their expression. To address this, we identified homologs encoding three high-abundance protein groups across the 80 species: 1) Rubisco small subunit proteins (RBCS), 2) light-harvesting complex components (LHC), and 3) ribosomal proteins (RP) (Supplemental Fig. 5 and Supplemental Table 4). We then compared their GC3 content relative to the species’ median GC3. Remarkably, all three protein groups displayed significantly higher GC3 content across a wide range of species (Fig. 5C–F). On average, the GC3 content of RBCS, LHC, and RP was 21%, 16%, and 10% above the species’ median GC3 across diverse lineages (Fig. 5C–F). The degree of increase in GC3 content also correlates with their protein abundance (Supplemental Fig. 5). These observations suggest that high GC3 is a conserved feature linked to the high protein abundance of these important genes across evolution.
Figure 5: Important highly expressed proteins preferentially adopt high GC3.
A-B) Violin plots showing that high-protein genes are associated with higher GC3 in maize and Arabidopsis (***: p < 0.001, Wilcoxon rank-sum test). C) Phylogenetic tree of 80 plant species from OrthoFinder. The four model species are highlighted in red. D-F) Boxplots illustrating differences in GC3 content between D) Rubisco small subunit (RBCS), E) light harvesting complex (LHC), and F) ribosomal protein (RP) genes, and the median GC3 content of all genes in each species. Homologs of Arabidopsis RBCS, LHC, and RP genes were used to identify these important, highly abundant protein genes. The grey vertical dashed line indicates 0 (no difference compared to the median of all genes).
GC3 codons better match the tRNA pool across diverse species
To elucidate why higher GC3 broadly increases protein production, we tested the hypothesis that GC3 codons may better match the tRNA pool across diverse species. To evaluate this, we curated high-confidence tRNA genes in the 80 plant species with tRNAscan-SE 48 and quantified anticodon counts as a proxy for tRNA availability (Supplemental Table 5). The total number of tRNA genes identified ranged from 153 to 4,066, with a median of 577 (Supplemental Fig. 6A). The number of distinct anticodons identified in each species ranged from 45 to 53, with a median of 48 (Supplemental Fig. 6B). The anticodon counts across species consistently showed positive correlations, with more closely related species generally exhibiting stronger correlations (Supplemental Fig. 6C).
To connect tRNA anticodon availability to codon usage, we next calculated the codon weight (Wi) for each codon, which quantifies codon optimality (ranging from 0 to 1, where 0 is nonoptimal and 1 is optimal) (Supplemental Fig. 7, Supplemental Table 6). Wi is determined by the corresponding tRNA abundance and anticodon-codon base pairing efficiency, with penalties applied for wobble base-pairing 19. We found that Wi values vary greatly across species, although some codons are consistently high, such as GAG for Glu (average 0.92), while others are consistently low, such as UUA for Leu (average 0.26) (Supplemental Fig. 7). Although the GC3 codons for each amino acid do not always have the highest Wi (Supplemental Fig. 7), on average, codons with a 3rd position G or C have higher Wi than those ending with A or U (Fig. 6A).
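To make the Wi calculation more concrete, the sketch below sums, for each codon, the gene copy numbers of the anticodons that can read it, down-weights wobble pairings by a penalty, and scales the result so that the best codon has a weight of 1. It is a simplified illustration under loudly stated assumptions: the penalty values are placeholders chosen for illustration and are not the values used in this study or in reference 19, only a reduced set of wobble pairings is modeled, anticodons are assumed to be written 5′→3′ in the DNA alphabet, and codons with no matching tRNA keep a weight of 0.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [c for c in map("".join, product("ACGT", repeat=3))
                if c not in STOP_CODONS]

# (first anticodon base, third codon base) -> penalty s in [0, 1].
# PLACEHOLDER values for illustration only; an anticodon "A" is treated as inosine.
PLACEHOLDER_PENALTY = {
    ("G", "C"): 0.0, ("C", "G"): 0.0, ("T", "A"): 0.0, ("A", "T"): 0.0,
    ("G", "T"): 0.4,  # G:U wobble
    ("T", "G"): 0.7,  # U:G wobble
    ("A", "C"): 0.3,  # I:C wobble
    ("A", "A"): 0.9,  # I:A wobble
}


def codon_weights(anticodon_counts, penalty=PLACEHOLDER_PENALTY):
    """Relative codon weights (0 to 1) from tRNA gene counts per anticodon."""
    raw = {}
    for codon in SENSE_CODONS:
        first, second, third = codon
        total = 0.0
        for (anti_base, codon_base), s in penalty.items():
            if codon_base != third:
                continue
            # Anticodon 5'->3': wobble base, then complements of codon bases 2 and 1.
            anticodon = anti_base + COMPLEMENT[second] + COMPLEMENT[first]
            total += (1.0 - s) * anticodon_counts.get(anticodon, 0)
        raw[codon] = total
    top = max(raw.values())
    if top == 0:
        raise ValueError("no tRNA gene matches any sense codon")
    return {codon: value / top for codon, value in raw.items()}
```

With counts such as `{"CTC": 4, "TTC": 2}`, the codon GAG is read by both anticodons (one through a penalized wobble pair) and receives the top weight, whereas GAA is read only by TTC and scores lower.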
Figure 6: tRNA availability favors GC3 codons across different lineages.
A) Codons with a third nucleotide of G or C have higher codon weight (Wi) on average. Violin plots show the distribution of Wi for all 61 sense codons across 80 plant species, grouped by the third nucleotide of each codon. Letters a–d represent significant differences as determined by ANOVA followed by Tukey’s HSD (p < 0.05). B) High tAI is associated with high TE across plants. Violin plots of tAI for all expressed genes, low-TE and high-TE genes for maize, rice, Arabidopsis, and tomato, are shown. Significance determined by Wilcoxon rank-sum test (ns: p > 0.05, *: p < 0.05, **: p < 0.01, ***: p < 0.001). C) Codons with a third nucleotide of G or C show stronger positive correlations with tAI compared to those ending with A or U. Violin plots display the distribution of correlations between tAI and the frequency of specific third-position nucleotides across all genes in the 80 species. D) GC3 content is positively correlated with tAI across diverse plant lineages. Spearman’s rank correlation (ρ) between GC3 and tAI is shown for the 80 species. The four model species are highlighted. The grey vertical dashed line indicates ρ = 0.
To determine how well all codons in a gene collectively match the tRNA pool, we calculated the tRNA adaptation index (tAI), which incorporates Wi values for all codons in the CDS 19. In maize, rice, Arabidopsis, and tomato, tAI is positively associated with TE, as well as RNA-seq and Ribo-seq measurements (Fig. 6B, Supplemental Fig. 8). This suggests that genes better matching the tRNA pool also exhibit higher mRNA abundance and translation. Importantly, we found that tAI is almost always positively correlated with the frequency of G3 and C3 codons within genes (Fig. 6C), as well as GC3 content overall (Fig. 6D), across the 80 plant species. Conversely, the frequencies of A3 and U3 codons are nearly always negatively correlated with tAI (Fig. 6C). This provides a plausible mechanistic explanation for why higher GC3 can improve protein production across diverse plants, including many AT-rich species. Although eudicots generally have low GC3 content (Fig. 4E), their tAI values are still positively associated with GC3 (ρ = 0.24 on average, Fig. 6D). In contrast, lineages with higher GC3 content, such as Poaceae, show a much stronger correlation between GC3 and tAI (ρ = 0.75 on average, Fig. 6D). Together, the relationship between tAI and GC3 suggests that higher GC3 enhances protein production by better matching the tRNA pool, which also correlates with elevated mRNA abundance and translation efficiency.
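The tAI of a gene is commonly defined as the geometric mean of the Wi values of its codons. The sketch below follows that definition; the function name `tai` and the input format (a dictionary of codon weights keyed by DNA codons) are assumptions made for illustration, and codons without a positive weight in the dictionary (for example, stop codons) are simply skipped, which may differ from the handling used in the study.

```python
import math


def tai(cds, weights):
    """tRNA adaptation index: geometric mean of codon weights across an in-frame CDS.

    weights: dict mapping DNA codons to Wi values in (0, 1].
    Codons that are absent from `weights` or have a weight of 0 are skipped.
    """
    seq = cds.upper().replace("U", "T")
    log_weights = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        w = weights.get(seq[i:i + 3], 0.0)
        if w > 0:
            log_weights.append(math.log(w))
    if not log_weights:
        raise ValueError("no codon in the CDS has a positive weight")
    return math.exp(sum(log_weights) / len(log_weights))
```

Because the index is a geometric mean, a few codons with very low weights pull the tAI of a gene down more strongly than they would in an arithmetic average.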
GC3 Recoder: a tool for customizing codon usage
To help researchers easily modify the GC3 content of a CDS, we developed an online app, GC3 Recoder (https://larrywu.shinyapps.io/GC3-recoder/) (Supplemental Fig. 9). Users can adjust GC3 content to a target percentage or match codon frequencies specified by a custom table, such as the maize high-TE gene codon frequency table used to generate the high-GC3 luciferases in this study, with the option to preserve specific restriction enzyme sites.
DISCUSSION
Contrary to the conventional recommendations to match the codon frequency of the host genomes 49, our findings suggest a new codon optimization strategy: increasing GC3 content can enhance protein production across diverse plant species, including both GC- and AT-rich genomes. Because AT-rich species generally have a narrower GC content distribution, their genomes contain fewer high-GC3 genes; therefore, the effect of high GC3 is more difficult to detect. Yet important and highly abundant proteins, such as the Rubisco small subunit and ribosomal proteins, preferentially adopt high GC3 throughout evolution. In addition to our luciferase experiments, the widely used high-GC3 GFP and DsRed further highlight the applicability of GC3 for improving protein expression across diverse species.
High GC3 content is easy to achieve, as every amino acid has at least one GC3 codon, making it theoretically possible to encode any protein with 100% GC3 codons. In our case, we simply matched the codon usage of maize high-TE genes. We also developed an online tool, GC3 Recoder, to help researchers easily modify the GC3 content for gene engineering.
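The claim that any protein can be encoded with 100% GC3 codons can be checked with a few lines of Python. The sketch below is a deliberately naive illustration and is not the algorithm behind GC3 Recoder or the codon tables used for the luciferase variants: it assumes the standard genetic code, replaces every A- or U-ending sense codon with the alphabetically first synonymous codon ending in G or C, leaves stop codons untouched, and ignores restriction sites, target GC3 percentages, and codon frequency tables.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# One GC3 codon per amino acid (arbitrary choice: first in alphabetical order).
GC3_CHOICE = {}
for _codon in sorted(CODON_TABLE):
    _aa = CODON_TABLE[_codon]
    if _aa != "*" and _codon[2] in "GC":
        GC3_CHOICE.setdefault(_aa, _codon)


def translate(cds):
    """Translate an in-frame DNA or RNA CDS into a one-letter protein string."""
    seq = cds.upper().replace("U", "T")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def recode_to_gc3(cds):
    """Return a synonymous CDS in which every sense codon ends in G or C."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    recoded = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE[codon]
        if aa == "*" or codon[2] in "GC":
            recoded.append(codon)  # keep stop codons and existing GC3 codons
        else:
            recoded.append(GC3_CHOICE[aa])
    return "".join(recoded)
```

`GC3_CHOICE` contains an entry for all 20 amino acids, which confirms that a GC3 codon is always available; in practice, a frequency-weighted choice among GC3 codons, as in the maize high-TE codon table, would be preferable to this fixed choice.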
Our results, together with previous findings, strongly suggest that high GC3 content synergistically enhances protein production through multi-level regulatory mechanisms. In addition to boosting translation efficiency and mRNA abundance, high GC3 may also reduce DNA methylation and small RNA-mediated gene silencing, as well as disrupt AU-rich motifs that destabilize mRNAs 15,50,51. Our results further suggest that GC3 codons better match tRNA availability. One possible reason for the universally GC3-favoring tRNA pools in diverse species is that GC base pairing enables stronger codon–anticodon hydrogen bonding, which could promote faster and more efficient translation 17,52.
Beyond plants, high GC3 in humans also increases mRNA level and protein production, both in reporter assays and in endogenous genes 53,54. Numerous species, including many animals and fungi, have a large tRNA repertoire, where matching tRNAs are available for GC3 codons. Thus, provided the corresponding anticodons are present, GC3 preference may be generalized to other eukaryotes.
MATERIALS AND METHODS
Ribo-seq and RNA-seq processing
The code and parameters used are provided in the accompanying GitHub repository (https://github.com/kaufm202/GC3_codon_use). Briefly, raw Ribo-seq and RNA-seq data were downloaded from the NCBI Sequence Read Archive (SRA) or Amazon Web Services (AWS) using the accession IDs in Supplemental Table 1. Adapter and quality trimming were performed using cutadapt 55. rRNA and tRNA sequences were acquired from the NCBI Nucleotide database and removed using Bowtie2 56. The filtered reads were then aligned to CDS from the Araport 11 annotation for Arabidopsis, the MSU release 7 annotation for rice, the iTAG v5.0 annotation for tomato, and the NAM 5.0 annotation for maize using STAR 57. RSEM was then used to estimate transcripts per million (TPM) from the STAR alignments 58. Translation efficiency (TE) was calculated as the mean Ribo-seq TPM divided by the mean RNA-seq TPM within CDS for each transcript. Only the most abundant transcript isoform from each gene and transcripts with RNA-seq TPM > 1 and TE < 30 were used for downstream analysis. The genes with the 10% lowest and highest TE were defined as the low-TE and high-TE genes in each species.
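The TE calculation and grouping described above can be summarized in a short Python sketch. This is not the pipeline in the accompanying repository: the function names and input format (dictionaries of replicate TPM values per transcript) are hypothetical, isoform selection is assumed to have been done beforehand, and the rounding used to take 10% of the genes is an assumption.

```python
from statistics import mean


def translation_efficiency(ribo_tpm, rna_tpm, min_rna_tpm=1.0, max_te=30.0):
    """TE per transcript: mean Ribo-seq TPM divided by mean RNA-seq TPM.

    ribo_tpm, rna_tpm: dicts mapping transcript ID to a list of replicate TPMs.
    Transcripts with mean RNA-seq TPM <= min_rna_tpm or TE >= max_te are dropped.
    """
    te = {}
    for tx, rna_reps in rna_tpm.items():
        if tx not in ribo_tpm:
            continue
        rna = mean(rna_reps)
        if rna <= min_rna_tpm:
            continue
        value = mean(ribo_tpm[tx]) / rna
        if value < max_te:
            te[tx] = value
    return te


def te_groups(te, fraction=0.10):
    """Return (low-TE IDs, high-TE IDs): the bottom and top `fraction` by TE."""
    ranked = sorted(te, key=te.get)
    n = max(1, int(len(ranked) * fraction))  # rounding down is an assumption
    return ranked[:n], ranked[-n:]
```

For a species with about 17,000 retained genes, `te_groups` would return roughly 1,700 genes in each group.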
Analysis of quantitative proteomics data
Quantitative proteomics data from label-free mass spectrometry was downloaded from published datasets for Arabidopsis 59 and maize 60, using label-free quantification (LFQ) values in Arabidopsis, and distributed normalized spectral abundance factor (dNSAF) in maize. To match the published dataset, the maize Ribo-seq and RNA-seq were realigned to the B73 RefGen_v2 5a genome annotation. Abundance values within each species were normalized to the minimum value.
Motif analysis
The 5′ UTR plus 20 nt of the CDS of each transcript was extracted using the GenomicFeatures package 61 in R 62. For transcripts without annotated 5′ UTRs, the 100 nt upstream of the CDS start codon were used as the 5′ UTR. Motif discovery in high-TE genes was performed using STREME from the MEME suite 63,64, with the low-TE genes as the control and a p-value cutoff of 0.05.
Modification of GC3 content of Fluc and Nluc and construction of dual luciferase assay (DLA) plasmids
The sequences of different Fluc and Nluc luciferase variants with different GC3 contents are provided in Supplemental File 1. Fluc and Nluc were codon-modified to modulate GC3 content without changing the peptide sequence using the ExpOptimizer codon optimization tool available online through NovoPro (www.novoprolabs.com/tools/codon-optimization). For the Fluc sequences, the first 20 codons were unmodified to avoid potential effects from the Kozak sequence, then the following 372 codons were modified, with PpuMI, KasI, and BamHI RE sites excluded. The Low GC Fluc contains the unmodified Fluc sequence of the pHsu133 plasmid 9 (see below). The Mid GC Fluc was codon-modified with the default codon usage table provided by NovoPro for Zea mays as the expression host. The High GC Fluc was codon optimized with a user-defined codon usage table derived from our maize high-TE genes (Supplemental File 2). For Nluc, the first 10 codons, qPCR primer annealing sites, and HA tag sequence were unmodified, and the remaining 153 codons were codon-modified, with XbaI, KpnI, and EcoRI RE sites excluded. The Mid GC Nluc contains the original Nluc sequence of the pHsu510 plasmid (see below). The Low GC Nluc was codon-modified with a user-defined codon usage table derived from our Arabidopsis high-TE genes (Supplemental File 2). The High GC Nluc was codon-modified with a user-defined codon usage table derived from our maize high-TE genes (Supplemental File 2).
For the Fluc/Rluc DLA constructs, the pHsu133 plasmid 9 containing 35S-Fluc and 35S-Rluc was used as the Low GC reporter. The codon-modified Fluc variants were synthesized by BioBasic and cloned into pHsu133 via KasI and PpuMI sites, generating the Mid GC and High GC Fluc plasmids.
For the Nluc/Fluc constructs, plasmid 3698 (a gift from Yves Poirier’s Lab) was used to generate the pHsu510 plasmid containing RHIP1-Nluc and 35S-Fluc, and used as the Mid GC Nluc plasmid. The codon-modified Nluc sequences were subcloned into pHsu510 via XbaI and KpnI sites, generating the Low GC and High GC Nluc plasmids.
Protoplast transformation and dual luciferase assays
Protoplast isolation, transformation, and dual luciferase assays were performed as previously described 9,65. For Arabidopsis protoplasts, surface-sterilized Arabidopsis Col-0 seeds were cold stratified in water for 2 days at 4°C, then planted on soil and grown for 19–21 days under a 16-hour light (~100 μmol m⁻² s⁻¹, cool white, fluorescent bulbs) and 8-hour dark cycle at 22°C with intermittent watering. Fully expanded rosette leaves were excised and thinly sliced for protoplast isolation. Leaf slices were placed in enzyme solution (1% [w/v] cellulase, 0.25% [w/v] macerozyme, 0.4 M mannitol, 20 mM KCl, 20 mM MES, and 10 mM CaCl2), then vacuum infiltrated for 30 minutes, followed by a 2-hour incubation with gentle shaking at 40 rpm, and then 5 minutes at 80 rpm to release protoplasts. The protoplasts were filtered through a 70 μm cell strainer, then centrifuged at 100 × g at 4°C, and washed twice with cold W5 (154 mM NaCl, 125 mM CaCl2, 5 mM KCl, and 2 mM MES). The protoplasts were counted using a hemocytometer, then resuspended in cold MMG (4.5 mM MES [pH 5.7], 0.4 M mannitol, and 15 mM MgCl2) to a concentration of 1 × 10⁵ protoplasts per 150 μL.
For maize protoplasts, surface-sterilized maize B73 seeds were imbibed in water overnight at room temperature, then transferred to a petri dish with filter paper moistened with water for 3–4 days until germinated, then transferred to soil and grown at 22°C in the dark with intermittent watering. Leaves from 14-day etiolated maize seedlings were excised and thinly sliced 66. Protoplast isolation was then carried out with the same method as Arabidopsis, except with a 4-hour incubation in the enzyme solution.
For tobacco BY2 cell protoplasts, the cells were spread on petri dishes containing solid BY2 growth media (LS media plus 3% [w/v] sucrose, 0.56 mM myo-inositol, 3.8 μM thiamine, 4.4 mM K2HPO4, and 0.8% [w/v] Phytoblend agar) and grown for 10 days at room temperature in the dark. BY2 cells were then scraped directly from the plate for protoplast isolation. Protoplast isolation was carried out with the same method as for Arabidopsis, except with no vacuum infiltration, with a 3-hour incubation in enzyme solution, and using a 100 μm cell strainer at the filtering step.
Six to eight replicates of 1 × 10⁵ protoplasts were mixed with 5 μg of plasmid DNA and 170 μL PEG solution (40% [w/v] PEG4000, 0.2 M mannitol, 100 mM CaCl2) and incubated for 5 minutes. Protoplasts were washed four times with cold W5, then transferred to cell culture plates and incubated in the dark for 16 to 18 hours. Protoplasts were then centrifuged at 2,250 × g for 3 min at 4 °C, the supernatant was removed, and the protoplasts were lysed by resuspending in 100 μL of 1× Passive Lysis Buffer (Promega E1910). After shaking at room temperature for 15 minutes, lysates were centrifuged at 2,250 × g for 3 min at 4 °C, and 20 μL of cleared lysate was used for the luciferase assays. Fluc/Rluc and Nluc/Fluc DLAs were performed using the Dual-Luciferase Reporter Assay System (Promega E1960) and Nano-Glo Dual-Luciferase Reporter Assay System (Promega N1630), respectively, in accordance with the manufacturer’s protocols using the GloMax Navigator Plate Reader (Promega, GM2010). For Fluc/Rluc DLA, Fluc relative luminescence is reported as the Fluc luminescence normalized to Rluc luminescence for each sample, and then normalized to the median of the low GC samples. For Nluc/Fluc DLA, Nluc relative luminescence is reported as Nluc luminescence normalized to Fluc luminescence for each sample, and then normalized to the median of the low GC samples.
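The two-step normalization described above (reporter over control within each sample, then division by the median of the low GC group) can be written as a short Python sketch. The function name, the tuple layout, and the example readings are illustrative assumptions and are not taken from the study's scripts or data.

```python
from statistics import median


def relative_luminescence(samples, reference_group="low"):
    """Normalize dual-luciferase readings.

    samples -- list of (group, reporter_luminescence, control_luminescence)
    Returns a list of (group, relative_luminescence) in the same order.
    """
    # Step 1: reporter signal divided by the co-expressed control signal.
    ratios = [(group, reporter / control) for group, reporter, control in samples]
    # Step 2: divide every ratio by the median ratio of the reference group.
    ref = median(r for group, r in ratios if group == reference_group)
    return [(group, r / ref) for group, r in ratios]


# Made-up readings for illustration only.
readings = [("low", 10, 10), ("low", 20, 10), ("low", 30, 10), ("high", 80, 10)]
print(relative_luminescence(readings))
```

With this convention the low GC group has a median of exactly 1, so values for the other variants read directly as fold changes.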
RT-qPCR of DLA lysate
RNA was extracted from 60 μL DLA lysate using RNA Clean and Concentrator-5 kit (Zymo Research, R1016). To remove remaining DNA, purified RNA was then treated with 5 U of DNase I (Zymo Research, E1012) for 30 min, then repurified using RNA Clean and Concentrator-5 kit. cDNA synthesis was performed from 4 μL of eluted RNA using LunaScript RT SuperMix Kit (NEB, E3010) with a total reaction volume of 8 μL. qPCR was then performed using Luna Universal qPCR Master Mix (NEB, M3003) using primers provided in Supplemental File 1. Relative Fluc expression was calculated using the ΔΔCt method 67 using Rluc as the reference gene for Fluc/Rluc DLA, and relative Nluc expression was calculated using Fluc as the reference gene for Nluc/Fluc DLA, and normalized to the median of the low GC samples.
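As a worked illustration of the relative expression calculation, the Python sketch below computes 2^(−ΔCt) for each sample against its reference gene and then divides by the median value of the low GC group. It assumes perfect doubling per cycle, as in the standard ΔΔCt method, and that the median is taken over the per-sample 2^(−ΔCt) values; the function name and the Ct values are hypothetical.

```python
from statistics import median


def relative_expression(samples, reference_group="low"):
    """Relative transcript abundance from Ct values.

    samples -- list of (group, ct_target, ct_reference)
    Returns a list of (group, relative_expression) in the same order.
    """
    # 2^-(Ct_target - Ct_reference): target abundance relative to the
    # reference gene measured in the same lysate.
    abundance = [(g, 2.0 ** -(ct_t - ct_r)) for g, ct_t, ct_r in samples]
    # Express every sample relative to the median of the reference group.
    ref = median(a for g, a in abundance if g == reference_group)
    return [(g, a / ref) for g, a in abundance]


# Made-up Ct values for illustration only.
cts = [("low", 20.0, 18.0), ("low", 21.0, 18.0), ("low", 22.0, 18.0), ("high", 19.0, 18.0)]
print(relative_expression(cts))
```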
GC3 Recoder online app
GC3 Recoder (https://larrywu.shinyapps.io/GC3-recoder/), a Shiny app for codon optimization, was developed to modify the input CDS codons using synonymous codons without altering the protein sequence. After checking the input sequence for errors and locking specific restriction sites, the app changes codons using one of two distinct methods specified by users. With the ‘Target GC3’ method, the app randomly selects synonymous codons to achieve the target GC3 content. With the ‘Frequency’ method, it selects synonymous codons based on a user-provided frequency table (consisting of 3 columns: codon, corresponding amino acid, and frequency per thousand).
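To make the ‘Target GC3’ idea concrete, the following Python sketch recodes a CDS toward a requested GC3 fraction using random synonymous substitutions. It is not the app's source code: the function name, the `locked` parameter (codon indices left untouched), and the sampling strategy are assumptions, and the standard nuclear genetic code is used.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def is_gc3(codon):
    return codon[2] in "GC"


def recode_to_target_gc3(cds, target, locked=(), seed=None):
    """Return a synonymous CDS whose GC3 fraction approximates `target`."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    locked = set(locked)
    # A codon is flexible if it is not locked and its amino acid has both
    # GC-ending and AT-ending synonymous codons (all but Met and Trp).
    flexible = [
        i for i, c in enumerate(codons)
        if i not in locked
        and len({is_gc3(s) for s in SYNONYMS[CODON_TABLE[c]]}) == 2
    ]
    flexible_set = set(flexible)
    fixed_gc = sum(is_gc3(c) for i, c in enumerate(codons) if i not in flexible_set)
    # Number of flexible positions that must end in G/C to reach the target.
    need = round(target * len(codons)) - fixed_gc
    need = max(0, min(len(flexible), need))
    gc_positions = set(rng.sample(flexible, need))
    for i in flexible:
        want_gc = i in gc_positions
        options = [s for s in SYNONYMS[CODON_TABLE[codons[i]]] if is_gc3(s) == want_gc]
        codons[i] = rng.choice(options)
    return "".join(codons)


recoded = recode_to_target_gc3("ATG" + "GCT" * 9, target=0.5, seed=1)
print(recoded)
```

When the target cannot be reached (for example, a sequence rich in Met and Trp with a target of 0), the sketch returns the closest achievable GC3 rather than raising an error.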
Phylogenetic analysis
Genome sequences and annotations for 80 species were downloaded from the sources described in Supplemental Table 3. CDSs were extracted and then translated to peptide sequences using GenomicFeatures 61 and Biostrings 68 in R 62. Only one CDS per gene was used: the longest CDS when using all genes, or the CDS with the highest homology when using orthologs and homologs. Orthologs were identified using OrthoFinder 47. Arabidopsis Rubisco small subunit (RBCS), light-harvesting complex (LHC), and ribosomal protein (RP) genes were obtained from TAIR 69,70, and genes with Ribo-seq TPM > 500 were then removed. For homolog identification of the Arabidopsis RBCS, LHC, and RP genes, protein sequences were aligned using BLASTp with Diamond 71. We defined homologs as any protein with greater than 40% homology with at least one of the Arabidopsis proteins. The transcript IDs for the orthologs of Arabidopsis low-TE and high-TE genes and the homologs of the Arabidopsis RBCS, LHC, and RP genes in all 80 species are provided in Supplemental Table 4.
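The rule of keeping one CDS per gene by length can be sketched in Python as follows. This is an illustration rather than the R code used here; the function name and record layout are hypothetical, and ties are resolved by keeping the first isoform encountered, which is an assumption.

```python
def longest_cds_per_gene(records):
    """Keep the longest CDS for each gene.

    records -- iterable of (gene_id, transcript_id, cds_sequence)
    Returns a dict mapping gene_id to (transcript_id, cds_sequence).
    """
    best = {}
    for gene, transcript, seq in records:
        # Replace the stored isoform only if this one is strictly longer.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (transcript, seq)
    return best


isoforms = [
    ("geneA", "geneA.1", "ATGAAATAA"),
    ("geneA", "geneA.2", "ATGAAAGGGTAA"),
    ("geneB", "geneB.1", "ATGTAA"),
]
print(longest_cds_per_gene(isoforms))
```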
Calculation of tAI
tRNA gene copies were identified using tRNAscan-SE 48. tRNA gene copies were filtered to a high confidence set by removing low-quality hits, pseudo-tRNAs, tRNAs with ambiguous anticodon, mitochondrial and plastidial tRNAs, and tRNAs with a high number of copies within a small genomic region. The total tRNA gene copies for each anticodon were then summed.
tRNA adaptation index (tAI) was calculated as previously described, with a minor modification to prevent outlier anticodon counts from skewing the final tAI values 19. Outliers were identified if the anticodon counts were greater than two standard deviations from the mean when the maximum anticodon count is excluded. Outlier anticodon counts were reduced to be equal to the next highest anticodon count within two standard deviations of the mean. Then, for each codon, the weight was calculated from the sum of anticodon counts for anticodons that pair with a given codon, with penalties for wobble base-pairing. The final codon weights (Wi) represent the codon weight normalized to the codon with the maximum weight. tAI was calculated as the geometric mean of Wi of all codons in a given CDS, excluding the start and stop codons.
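The outlier adjustment and the final geometric mean can be illustrated with the Python sketch below. It makes several assumptions that are not specified above: the population standard deviation is used, only counts above the mean are treated as outliers, a single instance of the maximum count is excluded when computing the mean and standard deviation, and the normalized codon weights Wi (including wobble penalties) are supplied as a positive-valued dictionary rather than derived here. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def cap_outliers(counts):
    """Reduce outlier anticodon gene counts.

    counts -- dict mapping anticodon to tRNA gene copy number
    """
    values = sorted(counts.values())
    trimmed = values[:-1]  # exclude the maximum count
    limit = mean(trimmed) + 2 * pstdev(trimmed)
    # Highest count that still lies within two SD of the mean.
    cap = max(v for v in values if v <= limit)
    return {anticodon: min(n, cap) for anticodon, n in counts.items()}


def tai(cds, weights):
    """Geometric mean of codon weights, excluding start and stop codons.

    cds     -- in-frame CDS including the start and stop codons
    weights -- dict mapping codon to its normalized weight Wi (0 < Wi <= 1)
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    body = codons[1:-1]  # drop the first (start) and last (stop) codon
    return math.exp(sum(math.log(weights[c]) for c in body) / len(body))


# Toy values for illustration only.
print(cap_outliers({"AGC": 1, "TGC": 2, "CGC": 3, "GGC": 2, "CAT": 50}))
print(tai("ATGGCCGCTTAA", {"GCC": 1.0, "GCT": 0.25}))
```

In the toy example the count of 50 is reduced to 3, the highest count within the threshold, and the two-codon CDS body yields the geometric mean of 1.0 and 0.25.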
Supplementary Material
ACKNOWLEDGEMENTS
We thank Ning Jiang for the helpful discussion on the research. This work used computational resources and services provided by the Institute for Cyber-Enabled Research at Michigan State University. This work was supported by a predoctoral training award under Grant Number T32-GM110523 from the National Institute of General Medical Sciences of the National Institutes of Health to IDK, and research grants from the National Science Foundation under Award Numbers 2425390 and 2051885, and the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM155375 to PYH. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF or NIH.
Data visualization
Sequence logos were visualized using the ggseqlogo package 72 in R. Phylogenetic trees were visualized using ggtree 73. All other plots were generated using ggplot2 74 in R.
Footnotes
Code availability
The Bash and R scripts used to process the data as described above are available in the GitHub repository: https://github.com/kaufm202/GC3_codon_use.
REFERENCES:
- 1. Reis R. S., Deforges J., Sokoloff T. & Poirier Y. Modulation of Shoot Phosphate Level and Growth by PHOSPHATE1 Upstream Open Reading Frame. Plant Physiol. 183, 1145–1156 (2020).
- 2. Xing S. et al. Fine-tuning sugar content in strawberry. Genome Biol. 21, 230 (2020).
- 3. Liu X. et al. Fine-tuning Flowering Time via Genome Editing of Upstream Open Reading Frames of Heading Date 2 in Rice. Rice 14, 59 (2021).
- 4. Hardy E. C. & Balcerowicz M. Untranslated yet indispensable—UTRs act as key regulators in the environmental control of gene expression. J. Exp. Bot. 75, 4314–4331 (2024).
- 5. Wu H.-Y. L., Jen J. & Hsu P. Y. What, where, and how: Regulation of translation and the translational landscape in plants. Plant Cell 36, 1540–1564 (2024).
- 6. Francischini C. W. & Quaggio R. B. Molecular characterization of Arabidopsis thaliana PUF proteins – binding specificity and target candidates. FEBS J. 276, 5456–5470 (2009).
- 7. Von Arnim A. G., Jia Q. & Vaughn J. N. Regulation of plant translation by upstream open reading frames. Plant Sci. 214, 1–12 (2014).
- 8. Wu H.-W. et al. Noise reduction by upstream open reading frames. Nat. Plants 8, 474–480 (2022).
- 9. Wu H.-Y. L. et al. Improved super-resolution ribosome profiling reveals prevalent translation of upstream ORFs and small ORFs in Arabidopsis. Plant Cell 36, 510–539 (2024).
- 10. Kozak M. Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes. Cell 44, 283–292 (1986).
- 11. Lukaszewicz M., Feuermann M., Jérouville B., Stas A. & Boutry M. In vivo evaluation of the context sequence of the translation initiation codon in plants. Plant Sci. 154, (2000).
- 12. Sugio T. et al. Effect of the sequence context of the AUG initiation codon on the rate of translation in dicotyledonous and monocotyledonous plant cells. J. Biosci. Bioeng. 109, 170–173 (2010).
- 13. Xie J. et al. Precise genome editing of the Kozak sequence enables bidirectional and quantitative modulation of protein translation to anticipated levels without affecting transcription. Nucleic Acids Res. 51, 10075–10093 (2023).
- 14. Zhao D. et al. Analysis of Ribosome-Associated mRNAs in Rice Reveals the Importance of Transcript Size and GC Content in Translation. G3 7, 203–219 (2017).
- 15. Sidorenko L. V. et al. GC-rich coding sequences reduce transposon-like, small RNA-mediated transgene silencing. Nat. Plants 3, 875–884 (2017).
- 16. Gustafsson C., Govindarajan S. & Minshull J. Codon bias and heterologous protein expression. Trends Biotechnol. 22, 346–353 (2004).
- 17. Plotkin J. B. & Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nat. Rev. Genet. 12, 32–42 (2011).
- 18. Kanaya S., Yamada Y., Kudo Y. & Ikemura T. Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143–155 (1999).
- 19. dos Reis M. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 32, 5036–5044 (2004).
- 20. Burgess-Brown N. A. et al. Codon optimization can improve expression of human genes in Escherichia coli: A multi-gene study. Protein Expr. Purif. 59, 94–102 (2008).
- 21. Santos F. B. & Del-Bem L.-E. The Evolution of tRNA Copy Number and Repertoire in Cellular Life. Genes 14, 27 (2022).
- 22. Crick F. H. C. Codon—anticodon pairing: The wobble hypothesis. J. Mol. Biol. 19, 548–555 (1966).
- 23. Agris P. F. et al. Celebrating wobble decoding: Half a century and still much is new. RNA Biol. 15, 537–553 (2018).
- 24. Perlak F. J., Fuchs R. L., Dean D. A., McPherson S. L. & Fischhoff D. A. Modification of the coding sequence enhances plant expression of insect control protein genes. Proc. Natl. Acad. Sci. 88, 3324–3328 (1991).
- 25. Lonsdale D. M., Moisan L. J. & Harvey A. J. The effect of altered codon usage on luciferase activity in tobacco, maize and wheat. Plant Cell Rep. 17, 396–399 (1998).
- 26. Li X., Wei J., Tan A. & Aroian R. V. Resistance to root-knot nematode in tomato roots expressing a nematicidal Bacillus thuringiensis crystal protein. Plant Biotechnol. J. 5, 455–464 (2007).
- 27. Jeong Y. S. et al. Effect of codon optimization on the enhancement of the β-carotene contents in rice endosperm. Plant Biotechnol. Rep. 11, 171–179 (2017).
- 28. Suo G. et al. Effects of codon modification on human BMP2 gene expression in tobacco plants. Plant Cell Rep. 25, 689–697 (2006).
- 29. Laguía-Becher M. et al. Effect of codon optimization and subcellular targeting on Toxoplasma gondii antigen SAG1 expression in tobacco leaves to use in subcutaneous and oral immunization in mice. BMC Biotechnol. 10, 52 (2010).
- 30. Agarwal P., Gautam T., Singh A. K. & Burma P. K. Evaluating the effect of codon optimization on expression of bar gene in transgenic tobacco plants. J. Plant Biochem. Biotechnol. 28, 189–202 (2019).
- 31. Joshi C. P., Zhou H., Huang X. & Chiang V. L. Context sequences of translation initiation codon in plants. Plant Mol. Biol. 35, 993–1001 (1997).
- 32. Gupta P., Rangan L., Ramesh T. V. & Gupta M. Comparative analysis of contextual bias around the translation initiation sites in plant genomes. J. Theor. Biol. 404, 303–311 (2016).
- 33. Kawabe A. & Miyashita N. T. Patterns of codon usage bias in three dicot and four monocot plant species. Genes Genet. Syst. 78, 343–352 (2003).
- 34. Mukhopadhyay P., Basak S. & Ghosh T. C. Differential Selective Constraints Shaping Codon Usage Pattern of Housekeeping and Tissue-specific Homologous Genes of Rice and Arabidopsis. DNA Res. 15, 347–356 (2008).
- 35. Tatarinova T. V., Alexandrov N. N., Bouck J. B. & Feldmann K. A. GC3 biology in corn, rice, sorghum and other grasses. BMC Genomics 11, (2010).
- 36. Ingolia N. T., Ghaemmaghami S., Newman J. R. S. & Weissman J. S. Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 324, 218–223 (2009).
- 37. Lei L. et al. Ribosome profiling reveals dynamic translational landscape in maize seedlings under drought stress. Plant J. 84, 1206–1208 (2015).
- 38. Huang T. et al. Analysis and Prediction of Translation Rate Based on Sequence and Functional Features of the mRNA. PLoS ONE 6, e16036 (2011).
- 39. Weinberg D. E. et al. Improved Ribosome-Footprint and mRNA Measurements Provide Insights into Dynamics and Regulation of Yeast Translation. Cell Rep. 14, 1787–1799 (2016).
- 40. Schlusser N., González A., Pandey M. & Zavolan M. Current limitations in predicting mRNA translation with deep learning models. Genome Biol. 25, 227 (2024).
- 41. Liu Y. et al. Translation efficiency covariation identifies conserved coordination patterns across cell types. Nat. Biotechnol. (2025).
- 42. Zheng D. et al. Predicting the translation efficiency of messenger RNA in mammalian cells. Nat. Biotechnol. (2025).
- 43. Yang X. et al. Comparative ribosome profiling reveals distinct translational landscapes of salt-sensitive and -tolerant rice. BMC Genomics 22, 612 (2021).
- 44. Wu H. Y. L., Song G., Walley J. W. & Hsu P. Y. The tomato translational landscape revealed by transcriptome assembly and ribosome profiling. Plant Physiol. 181, 367–380 (2019).
- 45. Kozak M. Possible role of flanking nucleotides in recognition of the AUG initiator codon by eukaryotic ribosomes. Nucleic Acids Res. 9, 5233–5252 (1981).
- 46. Shaner N. C., Steinbach P. A. & Tsien R. Y. A guide to choosing fluorescent proteins. Nat. Methods 2, 905–909 (2005).
- 47. Emms D. M. & Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
- 48. Chan P. P., Lin B. Y., Mak A. J. & Lowe T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096 (2021).
- 49. Sharp P. M. & Li W.-H. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).
- 50. Fan X. C., Myer V. E. & Steitz J. A. AU-rich elements target small nuclear RNAs as well as mRNAs for rapid degradation. Genes Dev. 11, 2557–2568 (1997).
- 51. Musaev D. et al. UPF1 regulates mRNA stability by sensing poorly translated coding sequences. Cell Rep. 43, 114074 (2024).
- 52. Chevance F. F. V., Le Guyon S. & Hughes K. T. The Effects of Codon Context on In Vivo Translation Speed. PLoS Genet. 10, e1004392 (2014).
- 53. Kudla G., Lipinski L., Caffin F., Helwak A. & Zylicz M. High Guanine and Cytosine Content Increases mRNA Levels in Mammalian Cells. PLoS Biol. 4, e180 (2006).
- 54. Mordstein C. et al. Codon Usage and Splicing Jointly Influence mRNA Localization. Cell Syst. 10, 351–362.e8 (2020).
- 55. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011).
- 56. Langmead B. & Salzberg S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
- 57. Dobin A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
- 58. Li B. & Dewey C. N. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
- 59. Song G., Hsu P. Y. & Walley J. W. Assessment and Refinement of Sample Preparation Methods for Deep and Quantitative Plant Proteome Profiling. PROTEOMICS 18, 1800220 (2018).
- 60. Walley J. W. et al. Integration of omic networks in a developmental atlas of maize. Science 353, 814–818 (2016).
- 61. Lawrence M. et al. Software for Computing and Annotating Genomic Ranges. PLoS Comput. Biol. 9, e1003118 (2013).
- 62. R Core Team. R: A Language and Environment for Statistical Computing. https://www.R-project.org/ (2025).
- 63. Bailey T. L. et al. MEME Suite: Tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208 (2009).
- 64. Bailey T. L. STREME: accurate and versatile sequence motif discovery. Bioinformatics 37, 2834–2840 (2021).
- 65. Yoo S.-D., Cho Y.-H. & Sheen J. Arabidopsis mesophyll protoplasts: a versatile cell system for transient gene expression analysis. Nat. Protoc. 2, 1565–1572 (2007).
- 66. Gomez-Cano L., Yang F. & Grotewold E. Isolation and Efficient Maize Protoplast Transformation. BIO-Protoc. 9, (2019).
- 67. Livak K. J. & Schmittgen T. D. Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2^(−ΔΔCT) Method. Methods 25, 402–408 (2001).
- 68. Pagès H., Aboyoun P., Gentleman R. & DebRoy S. Biostrings: Efficient manipulation of biological strings. https://bioconductor.org/packages/Biostrings (2025).
- 69. Rhee S. Y. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31, 224–228 (2003).
- 70. Reiser L. et al. The Arabidopsis Information Resource in 2024. Genetics 227, iyae027 (2024).
- 71. Buchfink B., Reuter K. & Drost H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
- 72. Wagih O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).
- 73. Yu G. Using ggtree to visualize data on tree-like structures. Curr. Protoc. Bioinforma. 69, (2020).
- 74. Wickham H. ggplot2. (Springer International Publishing, New York, 2016). doi: 10.1007/978-3-319-24277-4.