BMC Genomics. 2026 Feb 24;27:327. doi: 10.1186/s12864-026-12672-4

Temporal and gene-specific dynamics of codon usage evolution in SARS-CoV-2 genomes

Paweł Błażej 1, Dorota Mackiewicz 1, Paweł Mackiewicz 1
PMCID: PMC13037255  PMID: 41735839

Abstract

Background

Since its emergence in 2019, SARS-CoV-2 has undergone continuous evolution, raising questions about codon usage and adaptation to the human host. Because viral fitness depends on rapid replication and efficient protein production, evolutionary processes that optimize translation speed and accuracy may be favoured. The aim of this study was to investigate temporal and gene-specific changes in synonymous codon usage in this coronavirus to assess whether its evolution reflects adaptation to the human translational machinery. To address this question, we analyzed 84,324 genomes collected between January 2020 and October 2024.

Results

Codons were recoded into six groups based on their synonymous usage in human protein-coding genes, enabling detection of temporal shifts in viral codon preferences. This analysis revealed pronounced changes in codon class composition occurring around 2021–2022, early 2023, and late 2023–2024, periods that coincide with major viral variant replacements. Distinct evolutionary trends were observed among functional gene groups. Structural genes exhibited codon usage biased toward less optimal (less frequent) human codon classes. In contrast, non-structural genes (ORF1a and ORF1ab) showed a progressive increase in the use of more optimal (more frequent) codon classes, whereas accessory genes exhibited variable patterns.

Discussion

Greater codon adaptation in ORF1a and ORF1ab likely enhances translation efficiency, supporting genome replication and transcription. Conversely, suboptimal codon usage in structural and accessory genes may favour immune evasion or regulate translation to prevent overuse of host resources. Codon shifts correlated strongly with nucleotide composition, indicating combined effects of mutational pressure and selection. Notably, codon usage dynamics aligned with vaccination campaigns and infection surges, suggesting that intense selective pressure and high replication rates promoted new mutations shaping codon preferences.

Conclusions

SARS-CoV-2 codon adaptation varies by time, gene type, and function, balancing replication efficiency with immune evasion. These insights may guide codon (de)optimization strategies in mRNA and DNA vaccines against emerging variants, e.g. by replacing more optimal with less optimal codons.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-026-12672-4.

Keywords: Adaptation, Codon usage, Coronavirus, COVID-19, Optimization, SARS-CoV-2, Translation

Background

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the etiological agent responsible for the COVID-19 pandemic, emerged in Wuhan, China, in late 2019 [1]. SARS-CoV-2 is classified as a betacoronavirus and shares a phylogenetic relationship with SARS-CoV and MERS-CoV but demonstrates distinct epidemiological and clinical characteristics [2, 3].

The SARS-CoV-2 genome consists of a positive-sense, single-stranded RNA molecule of approximately 30 kb [4]. The typical genome organisation includes 14 open reading frames (ORFs), which encode 31 proteins. Among them, there are two large polyproteins (encoded by ORF1a and ORF1ab), four structural proteins and several accessory proteins [5, 6]. The polyproteins are translated directly from the viral genomic RNA and subsequently cleaved by viral proteases into 16 non-structural proteins, which perform essential functions in viral replication, transcription, RNA modification and protein processing, as well as in suppressing host gene expression and the immune response. They form a replication-transcription complex, which synthesises new viral genomes through RNA-dependent RNA polymerase. The other proteins, structural and accessory ones, are produced by discontinuous transcription of subgenomic mRNAs and subsequent translation.

Structural proteins include the surface glycoprotein known as the Spike protein (S), the envelope protein (E), the membrane glycoprotein (M) and the nucleocapsid phosphoprotein (N) [7]. The S protein binds to the host cell receptor ACE2, allowing the virus to enter the cell [8]. The E protein participates in virion assembly, budding and release, whereas the M protein maintains virion shape and coordinates viral assembly. The N protein binds and protects the viral RNA genome, participating in viral replication and transcription. Products of other ORFs primarily modulate host immune responses and support viral replication.

The viral strategy involves efficient replication and the production of a vast array of proteins, which are then assembled into numerous virions. Thus, we can expect that the evolution of mechanisms accelerating these processes should be favoured. One such mechanism is a change in synonymous codon usage, which can influence the speed and accuracy of translation [9–14]. Many studies indicate that highly expressed genes prefer codons that are recognised by more abundant tRNA molecules. This ensures that the products of these genes are produced quickly and in large quantities. Since SARS-CoV-2 utilises the human host machinery, we can expect a relationship between the codon usage in the viral genes and the codon bias typical of Homo sapiens genes. Codon usage can also affect co-translational protein folding during translation elongation [14–20]. It can also be important for modifying viral proteins that interact directly with the host’s immune system. However, it should be noted that biased synonymous codon usage can also arise as a by-product of selection acting at the amino acid level, interacting with nucleotide substitution patterns [21]. This effect is particularly pronounced in A/U-rich genomes, such as that of SARS-CoV-2 with 63% A + U, because strong mutational pressure toward adenine and uracil, combined with selective constraints on nonsynonymous substitutions involving these codons, alters the synonymous codon frequencies [21].

Some analyses showed an adaptation of viral codon usage to that of humans in general [22, 23] or of specific tissues [24, 25]. It was also found that the gene for the coronavirus Spike protein and ORF1ab had the highest codon adaptation index (CAI), especially in the Omicron variant [26, 27]. However, other studies showed a lower adaptation to the codon usage of the human host for SARS-CoV-2 strains with new mutations [28], for all its genes considered together [29, 30] or only for the genes encoding the N and S proteins [31–33]. Some authors noticed that the CAI values calculated for the concatenated coding sequences of the coronavirus have decreased over time with occasional fluctuations [34, 35]. This was interpreted as evidence that the virus is evolving to be less pathogenic. However, the period considered in these studies was rather short. A recent study [32] reported small changes in overall and individual protein genes but noticed a large increase in the CAI of the ORF8 sequence resulting from a nonsense mutation that truncates the reading frame and consequently leads to pronounced variability in codon bias.

The inconsistency in results reported by different authors may stem from the use of diverse methodologies and codon parameters, as well as the analysis of varied datasets. Therefore, we developed a new measure that recodes codon usage with respect to the human host and studied a larger number of genomes collected over a much longer period of coronavirus evolution. This allowed us to track temporal changes in the synonymous codon usage of protein-coding sequences over a longer time. This study represents the first comprehensive analysis of coronavirus codon adaptation. Our results demonstrate that codon adaptation is dynamic over time and varies across genes and functional categories, reflecting an evolutionary trade-off between optimizing replication efficiency and evading host immune responses. Importantly, these insights provide a mechanistic foundation for rational codon (de)optimization strategies in mRNA and DNA vaccine design against emerging variants, enabling deliberate modulation of viral protein expression and immunogenicity.

Results

Codon usage in individual gene sequences of coronavirus in comparison to human

We applied a new measure, termed Codon Block Recoding (CBR), which assigns labels (from 0 to 5) to codons according to their relative usage in human protein-coding genes (see Methods). Within each synonymous block, codons were assigned labels from 0 (most frequent) to the highest rank (least frequent), proportional to the number of codons in the block. The fractional composition of codon groups (0–5) in different SARS-CoV-2 Wuhan-Hu-1 genes and in human protein-coding genes is presented in Fig. 1. We investigated 12 functional open reading frames (ORFs) or genes annotated in SARS-CoV-2 genomes: two non-structural protein-coding sequences (ORF1ab and ORF1a), four structural protein-coding sequences for the envelope protein (E), membrane glycoprotein (M), nucleocapsid phosphoprotein (N) and surface glycoprotein (S), as well as six accessory open reading frames: ORF3a, ORF6, ORF7a, ORF7b, ORF8 and ORF10.
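The within-block ranking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a plain rank (0 for the most used codon of each amino acid), breaks ties alphabetically, writes codons in the DNA alphabet (T instead of U), and uses made-up usage counts rather than real human codon frequencies. The function name `cbr_labels` and the variable `toy_usage` are hypothetical; the exact procedure is the one given in Methods.

```python
from collections import defaultdict

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def cbr_labels(usage):
    """Rank codons within each synonymous block by reference usage.

    usage: dict mapping codon -> count (or frequency) in the reference host.
    Returns dict codon -> label, where 0 marks the most used codon of a block.
    Stop codons are skipped; ties are broken alphabetically (an assumption).
    """
    blocks = defaultdict(list)
    for codon, amino_acid in GENETIC_CODE.items():
        if amino_acid != "*":
            blocks[amino_acid].append(codon)

    labels = {}
    for codons in blocks.values():
        ranked = sorted(codons, key=lambda c: (-usage.get(c, 0), c))
        for rank, codon in enumerate(ranked):
            labels[codon] = rank
    return labels


# Made-up counts for illustration only (NOT real human codon usage).
toy_usage = {"CTG": 40, "CTC": 20, "TTG": 13, "CTT": 12, "CTA": 8, "TTA": 7,
             "AAG": 32, "AAA": 24}
labels = cbr_labels(toy_usage)
print(labels["CTG"], labels["TTA"], labels["AAG"], labels["AAA"])
```

With this ranking, six-fold degenerate blocks (Leu, Ser, Arg) span labels 0–5, whereas two-fold blocks such as Lys only use labels 0 and 1, which is why the label range depends on block size.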

Fig. 1.

The percentages of codon groups (0–5) for protein-coding sequences of SARS-CoV-2 (Wuhan-Hu-1) and human

Generally, codons 0 and 1 constitute the largest fractions in viral genes, often comprising 21–32% and 25–46%, respectively. However, human protein-coding genes are richer in codons 0, which account for 46% of their codons, whereas codons 1 occur at a fraction comparable to that in viral genes, i.e. 30%. Codons 2 are moderately used (12–26%), whereas the other groups can even be absent or reach a maximum of 15%. The predominance of codons 0 and 1 in viral genes, as in the human protein-coding genes, indicates possible adaptation toward the host translational machinery.

Codon usage shows marked variation among the viral genes (Fig. 1). For instance, codon group 0 is nearly 1.5 times more frequent in ORF7b compared to the surface protein gene (S gene), while codon group 1 occurs 1.8 times more often in S gene than in the envelope protein gene (E gene). Codon group 2 is over twice as common in E gene compared to ORF8. Codon group 3 reaches its highest proportion in ORF10 (15%) but is absent in ORF6, whereas codon group 4 is most enriched in ORF7b (11%) but absent from ORF10. Finally, codon group 5 appears over seven times more frequently in E gene than in ORF3a. ORF7a, ORF7b and N gene are characterised by the highest proportion of 0 + 1 codons, approximately 70%, which are the most frequently used in human genes. In contrast, the E gene sequence contains only about 50% of 0 + 1 codons.

Since the usage of identified codon groups is expressed as fractions, differences in their usage can be effectively visualised using a two-dimensional correspondence analysis plot (Fig. 2). The analysis was performed on mean fractions of codon groups for each viral gene, calculated from genomes collected across specific months and years. Given that genomes from the United States and the United Kingdom account for approximately 90% of the dataset, we restricted all our analyses to these two countries in order to reduce potential sampling bias. The first two dimensions explain almost 75% of the variation. The genes are clearly separated from each other. The most distinct positions are occupied by ORF7b, ORF6, ORF10 and E gene, driven by the high proportion of codon group 4 in ORF7b, group 2 in both E gene and ORF6, group 3 in ORF10 and group 5 in both ORF10 and E gene. They are the smallest genes in the genome, and therefore are the most likely to be outliers. The remaining sequences predominantly contain high fractions of only codon groups 0 and/or 1. The gene sequences from different times but belonging to the same gene have a strong tendency to cluster together. However, the circles in the plot do not completely overlap, reflecting differences in codon composition, particularly in ORF7b, which suggests temporal variation and strain-specific shifts in codon usage.

Fig. 2.

Correspondence analysis plot based on codon group (0–5) frequencies in individual SARS-CoV-2 protein-coding sequences. Each point represents a gene sequence present in genomes collected across specific months and years. Red numbers indicate the labelled codon group centroids

Codon usage in gene sequences from coronavirus genomes collected across various times

To examine temporal variations in codon composition in greater detail, we performed correspondence analyses separately for each gene present in genomes isolated at different times (Figs. 3 and 4). The first dimension explains from 71.8% (ORF1ab) to 99.9% (gene E) of variation, whereas the second from 0.04% (gene E) to 24.35% (ORF8) of variation.

Fig. 3.

Correspondence analysis plot of codon group (0–5) frequencies in individual SARS-CoV-2 structural and non-structural protein-coding sequences. Each point represents a gene sequence present in genomes collected across specific months (marked from 01 to 12) and years (marked by colours). Red numbers indicate the labelled codon group centroids

Fig. 4.

Correspondence analysis plot of codon group (0–5) frequencies in individual SARS-CoV-2 accessory protein-coding sequences. Each point represents a gene sequence present in genomes collected across specific months (marked from 01 to 12) and years (marked by colours). Red numbers indicate the labelled codon group centroids. For three ORFs, the insets without centroids were presented to better visualize the distribution of points

Along the first principal axis, most coronavirus genes cluster into two major groups: one comprising sequences from 2020 to 2021 and the other from 2022 to 2024 (Figs. 3 and 4). This division reflects shifts in the prevalence of specific codon groups between these periods. Sequences from 2020 to 2021 are enriched in codons 0 (M gene, ORF3a), codons 1 (E gene), codons 5 (ORF7b), or codons 1 and 3 (N gene). In contrast, sequences from 2022 to 2024 are dominated by codons 1 (ORF3a), codons 2 (E and M genes), codons 4 (ORF7b), or codons 4 and 5 (N gene).

ORF1a sequences also segregate into these two groups; however, sequences from January–February 2022 occupy an intermediate position in the plot, whereas ORF1ab sequences from this period cluster with the 2020–2021 group. The 2020–2021 group is characterised by an abundance of codons 2 and 3 (ORF1a) or codons 2 (ORF1ab), whereas the 2022–2024 group is dominated by codons 1 in both ORF1a and ORF1ab.

S genes are also separated into 2020–2021 and 2022–2024 groups, but along the second axis, with the former group enriched in codons 3 and the latter in codons 2 and 4. ORF6 genes form two main clusters according to the first dimension but corresponding to different periods: one includes sequences rich in codons 1 from 2020 to 2021 and July 2022–February 2023, whereas the other contains sequences enriched in codons 2 from February–May 2022 and April 2023–December 2024.

In addition to these main clusters, several smaller sets of sequences from specific time intervals can be identified for many genes (Figs. 3 and 4), indicating evolution of specific codon compositions over shorter periods. These sets are usually separated by the second principal axis. For example, ORF1a and ORF1ab sequences from July–December 2021 are rich in codon groups 4 and 5, and additionally, ORF1ab sequences throughout 2024 show a relatively higher abundance of codon group 4. The M genes from 2024 also form a separate cluster due to a lower content of codon group 3. A high fraction of codons 0 appeared independently in N genes in July–October 2021 and May–December 2023. S genes from 2024 are distinct from others because of a higher fraction of codons 5 and a lower fraction of codons 0, whereas 2022–2023 genes are depleted in codons 1. ORF3a sequences from 2023 contain considerably fewer codons 0 than other sequences, particularly those from 2020 to 2021, which are abundant in these codons. A compact cluster of ORF7a sequences from June–November 2021 is characterized by a high fraction of codons 5 and a low fraction of codons 3. Sequences from April–December 2023, although more dispersed in the plot, are distinguished from others by higher fractions of codons 0 and 2, and a reduced content of codons 1. ORF7b sequences from 2024 stand out due to an increased fraction of codons 1 at the expense of codons 0, while sequences from July–December 2021 are also enriched in codons 1 but have a lower fraction of codons 2. Among ORF8 sequences, two sets are characterised by distinct codon composition. Sequences from May–December 2023 are enriched in codons 0 rather than codons 1, while sequences from May–December 2021 have higher fractions of codons 2–5. ORF10 sequences from September 2020–January 2021 tend to contain more codons 4 than codons 3. In the plots for some sequences, i.e. the M and N genes and ORF8, the points representing subsequent months and years are arranged in series, which indicates a gradual change in the composition of codon groups with time.

General changes in codon usage in coronavirus sequences over time

To objectively identify general trends in codon usage and to find, for each gene, sets of sequences that share similar codon usage but differ from other sets, we carried out clustering based on the fractions of codon groups (0–5), averaged over months and years. Hierarchical clustering was identified as the optimal method for seven genes, while k-means, DIANA, and PAM were optimal for three, one, and one of the remaining genes, respectively. The number of clusters (sets of sequences with a specific codon usage) varied from two (E and N genes, ORF1a, ORF1ab, ORF7a, ORF8), through three (ORF3a) and four (M and S genes, ORF7b, ORF10), to five (ORF6). Relatively high positive average silhouette values, from 0.642 (ORF1ab) to 0.943 (E gene), suggest that the clusters are well separated and cohesive.
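For readers less familiar with the silhouette criterion, the following Python sketch shows how an average silhouette width can be computed for monthly vectors of codon group fractions once cluster assignments are available. It is only an illustration under stated assumptions: Euclidean distance is assumed, singleton clusters are given a width of 0, the clustering step itself (hierarchical, k-means, DIANA or PAM) is not reproduced, and the data and the name `average_silhouette` are hypothetical.

```python
from math import dist
from statistics import mean


def average_silhouette(points, assignment):
    """Mean silhouette width for points (tuples) with integer cluster ids."""
    clusters = {}
    for idx, cid in enumerate(assignment):
        clusters.setdefault(cid, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i, cid in enumerate(assignment):
        own = [j for j in clusters[cid] if j != i]
        if not own:  # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != cid
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(widths)


# Toy monthly fractions of codon groups 0-5 (each row sums to 1); not real data.
toy_points = [
    (0.30, 0.40, 0.15, 0.05, 0.05, 0.05),
    (0.31, 0.39, 0.15, 0.05, 0.05, 0.05),
    (0.25, 0.30, 0.25, 0.10, 0.05, 0.05),
    (0.24, 0.31, 0.25, 0.10, 0.05, 0.05),
]
toy_clusters = [0, 0, 1, 1]
score = average_silhouette(toy_points, toy_clusters)
print(round(score, 3))
```

Values close to 1 indicate that each monthly profile lies much nearer to its own cluster than to the closest alternative cluster.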

Consistent with these findings, the general null hypothesis for each gene, stating that there are no differences in codon group fractions among clusters across individual months and years, was rejected using nonparametric ANOVA-type testing at the 0.05 significance level. Likewise, the null hypotheses of equality between all pairwise cluster combinations were also rejected at the 0.05 level. The same conclusions can be drawn from the analysis of similarities (ANOSIM) at the 0.01 level.

Figure 5 presents the distribution of these clusters along the studied period. Consistent with the results of the correspondence analysis, the largest number of cluster transitions occurred near the 2021/2022 boundary, at the beginning of 2023 and around the transition 2023/2024. The changes reflect a temporal variation in codon content within the coronavirus genome in terms of human codon usage. The first change refers to genes E, M, N, and S, as well as ORF1ab, ORF1a, ORF3a, ORF6, and ORF7b. The second shift was observed in gene M, ORF3a, ORF6, ORF8, and ORF7a, whereas the third alteration occurred in genes M and S, as well as ORF3a, ORF7a, ORF7b, ORF8, and ORF10.
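The counting of cluster transitions can be sketched in Python as follows. This is a simplified illustration rather than the procedure used for Fig. 5: it assumes that every gene has one cluster identifier per consecutive month, and the function name `count_cluster_changes` and the toy assignments are hypothetical.

```python
def count_cluster_changes(cluster_by_gene):
    """Count, for each pair of consecutive months, the genes that switch cluster.

    cluster_by_gene: {gene: [cluster id for each consecutive month]}.
    Returns a list whose element t is the number of genes whose cluster
    differs between month t and month t + 1.
    """
    lengths = {len(series) for series in cluster_by_gene.values()}
    if len(lengths) != 1:
        raise ValueError("all genes need the same number of months")
    n_months = lengths.pop()
    return [
        sum(1 for series in cluster_by_gene.values() if series[t] != series[t + 1])
        for t in range(n_months - 1)
    ]


# Toy cluster assignments over four consecutive months (not real data).
toy_assignments = {
    "E": [1, 1, 2, 2],
    "M": [1, 1, 2, 3],
    "ORF6": [1, 2, 2, 2],
}
changes = count_cluster_changes(toy_assignments)
print(changes)  # [1, 2, 1]
```

Peaks in such a series mark the month boundaries at which many genes change their codon composition at once.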

Fig. 5.

The distribution of clusters in genes and variants of SARS-CoV-2 over the studied time period. Clusters for each gene include sequences characterised by a similar fraction of codon groups (0–5), averaged for a given month and year. Identical point colours for a given gene indicate that its sequences share similar codon usage across different months and years. The number of cluster changes indicates how many shifts between the clusters occurred for all genes

For most genes, such as the E and N genes, ORF1ab, and ORF7b, each cluster appears only once within a single continuous period, indicating that the changes in codon fractions were irreversible. In contrast, clusters reappear in separate, non-contiguous periods for some other genes, such as ORF3a, ORF7a, ORF8, and ORF10, suggesting a recurrence of similar codon compositions. For example, cluster 2 of ORF10 occurs in February-August 2020, March 2021-May 2022, January-July 2023 and January-November 2024.

Most cluster changes are associated with shifts in coronavirus variants (Fig. 5). The highest numbers of codon cluster transitions, at 2021/2022, in early 2023, and at 2023/2024, correspond, respectively, to the replacements of Delta by Omicron BA, of Omicron BA and Omicron BQ.1 by Omicron XBB, and of Omicron XBB and Omicron EG.5.1 by Omicron JN. The shift from Alpha to Delta is also reflected in the cluster change. The number of cluster changes was positively and significantly correlated with the variant shifts (Pearson’s correlation coefficient = 0.38, p-value = 0.004).

Temporal variations in the content of coronavirus codons corresponding to the most frequent codons in human genes

The codon groups labelled as 0 and 1 correspond to those preferred in human genes. Therefore, their combined frequency (0 + 1) can serve as a proxy for the translational efficiency of coronavirus genes when using the host machinery. To analyse potential evolutionary trends in codon usage over time, we calculated the arithmetic mean of this frequency for the sequences of individual genes aggregated monthly and compared it with the occurrence of coronavirus variants (Figs. 6 and 7).
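A minimal Python sketch of this calculation is given below. It assumes that a codon-to-label mapping is already available, that sequences are in frame and written in the DNA alphabet, and that codons without a label (stop or ambiguous codons) are simply skipped. The names `fraction_01` and `monthly_mean_01`, the labels and the sequences are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import mean


def fraction_01(cds, labels):
    """Fraction of codons in a coding sequence that carry label 0 or 1."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    scored = [labels[c] for c in codons if c in labels]  # skip unlabelled codons
    if not scored:
        raise ValueError("no labelled codons in sequence")
    return sum(1 for label in scored if label <= 1) / len(scored)


def monthly_mean_01(records, labels):
    """records: iterable of (year_month, cds) -> {year_month: mean 0 + 1 fraction}."""
    by_month = defaultdict(list)
    for year_month, cds in records:
        by_month[year_month].append(fraction_01(cds, labels))
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Hypothetical labels for a handful of codons and toy sequences (not real data).
toy_labels = {"CTG": 0, "CTC": 1, "TTG": 2, "TTA": 5, "AAG": 0, "AAA": 1}
toy_records = [
    ("2021-11", "CTGAAGTTGTTA"),  # labels 0, 0, 2, 5 -> 0.5
    ("2021-11", "CTGCTCAAGAAA"),  # labels 0, 1, 0, 1 -> 1.0
    ("2022-01", "TTGTTATTGCTG"),  # labels 2, 5, 2, 0 -> 0.25
]
monthly = monthly_mean_01(toy_records, toy_labels)
print(monthly)  # {'2021-11': 0.75, '2022-01': 0.25}
```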

Fig. 6.

Changes over time (X-axis) in the frequency of codon groups 0 and 1 (Y-axis) in SARS-CoV-2 structural and non-structural protein-coding sequences. The dynamics of changes was described by the arithmetic mean calculated monthly. For comparison, the temporal distribution of variants was also presented

Fig. 7.

Changes over time (X-axis) in the frequency of codon groups 0 and 1 (Y-axis) in SARS-CoV-2 accessory protein-coding sequences. The dynamics of changes was described by the arithmetic mean calculated monthly. For comparison, the temporal distribution of variants was also presented

The two non-structural protein-coding sequences, i.e. ORF1a and ORF1ab, overlap in the viral genome over a long section, so the changes in the frequency of codons 0 + 1 are very similar (Fig. 6). These codons were relatively rare from January 2020 to December 2021. From then until April 2022, we can observe a substantial shift towards higher frequencies of synonymous codons that are also the most commonly used in human genes. This change is clearly associated with the replacement of the Delta variant by Omicron BA. Next, their frequency remained at a similar, relatively high level, with a small decrease from the end of 2023 to the beginning of 2024, which corresponds to the transition from Omicron XBB and EG.5.1 to Omicron JN. After that, the frequency gradually increased again. In the long run, the global tendency indicates that these sequences adapted their synonymous codon usage to that of human protein-coding genes.

In contrast to the sequences coding for non-structural proteins, those encoding structural ones (envelope, membrane and nucleocapsid proteins as well as surface glycoprotein) generally showed a long-term decrease in the codons frequently used in the human host (Fig. 6). In all four sequences, the frequencies of 0 + 1 codons at the end of 2024 are lower than those in January 2020. Thus, the structural protein-coding sequences tended to use less optimal codons over time in terms of human codon usage. Interestingly, the dynamics of these changes differ between the sequences. In the case of the S gene, we observed a general trend toward a lower 0 + 1 frequency with several local extrema: local minima in September 2021 and February 2023, as well as local maxima in February 2022 and October 2023 (Fig. 6). These fluctuations show that the decrease in the contribution of 0 + 1 codons was not linear but was disturbed by occasional variations. These changes align with the distribution of variants. For instance, the local decline in 0 + 1 codons in mid-2021 corresponds to the dominance of Delta, the gradual decrease throughout 2022 reflects the prevalence of Omicron BA, whereas the rise in 2023 coincides with the emergence of Omicron XBB.

The three other structural protein-coding sequences demonstrated a characteristic decline from November 2021 up to March 2022 (E gene), February 2022 (M gene) or August 2022 (N gene). All these curves exhibit an inflection point around the transition between 2021 and 2022, corresponding to the shift from Delta to Omicron BA. The genes encoding the envelope protein and membrane glycoprotein exhibited a consistently high frequency of codons 0 and 1 up to November 2021. In contrast, the frequency of these codons in the N gene fluctuated, reaching a peak in October 2021, which closely coincided with the dominance of Delta. After the drastic decrease, the codon frequency remained stable over time in the E gene, but in the genes for membrane glycoprotein and nucleocapsid phosphoprotein it fluctuated, reaching several local extrema. These changes can be correlated with the occurrence of Omicron XBB and JN.

The sequences coding for accessory proteins do not show consistent tendencies in the change of 0 + 1 codon frequencies (Fig. 7). However, we can notice a long-term drop in this measure for the ORF6 sequence. Up to November 2021, the frequency was relatively high. Between November 2021 and April 2022 (the shift from Delta to Omicron BA), there was a drastic decrease, with the minimum in March 2022, followed by a fast increase from April 2022 to September 2022, when Omicron BQ.1 emerged. After a short stabilisation at a relatively high level, the frequency diminished again from January 2023 with the occurrence of Omicron XBB. Then, from June 2023, the frequency remained permanently low. ORF3a, ORF7a and ORF8 showed a rather high and constant 0 + 1 frequency over time, each with an episodic quick decline and rise in a different period: November 2021–April 2022, with a minimum in February 2022 (corresponding to the Delta–Omicron BA transition); February 2023–January 2024, with a minimum in July 2023 (associated with the emergence of Omicron XBB); and February 2021–February 2022, with a minimum in October 2021 (coinciding with transitions between Alpha, Delta and Omicron BA), respectively.

In contrast, the frequency of codons 0 and 1 in ORF7b and ORF10 remained relatively low during most of the study period. However, two sharp fluctuations were observed: for ORF7b, between April 2021 and March 2022, peaking in October 2021 (corresponding to Delta dominance), and for ORF10, between May 2023 and May 2024, reaching a maximum in October 2023 (coinciding with the presence of Omicron EG.5.1).

Relationships between the content of 0 + 1 codon groups and nucleotide composition

The observed dynamics of 0 + 1 codon frequency in individual coronavirus genes were correlated with concurrent shifts in their global nucleotide composition, calculated monthly (Table 1). Most of Pearson’s correlation coefficients were statistically different from zero, and ten genes showed absolute correlations above 0.6 with at least one nucleotide.

Table 1.

Pearson’s correlation coefficient (R) with p-values (p) between temporal changes in 0 + 1 codon frequencies in individual coronavirus genes and variations in their nucleotide composition. The coefficients with the absolute value larger than 0.6 were bolded

Name of gene encoding | Adenine R | Adenine p | Uracil R | Uracil p | Guanine R | Guanine p | Cytosine R | Cytosine p
ORF1ab polyprotein | 0.891 | 4.1E-20 | 0.520 | 5.6E-05 | 0.868 | 5.2E-18 | -0.810 | 5.2E-14
ORF1a polyprotein | 0.909 | 4.1E-22 | 0.448 | 6.7E-04 | -0.273 | 0.052 | -0.838 | 1.0E-15
envelope protein | 0.482 | 2.3E-04 | -0.998 | 9.6E-68 | -0.438 | 9.0E-04 | 0.998 | 1.9E-66
membrane glycoprotein | -0.730 | 2.2E-10 | -0.485 | 2.1E-04 | 0.449 | 6.7E-04 | 0.742 | 8.0E-11
nucleocapsid phosphoprotein | -0.080 | 0.599 | -0.475 | 2.8E-04 | 0.704 | 1.8E-09 | -0.590 | 2.5E-06
surface glycoprotein | -0.753 | 2.9E-11 | -0.525 | 4.9E-05 | 0.160 | 0.285 | 0.882 | 3.4E-19
ORF3a protein | -0.059 | 0.707 | -0.096 | 0.528 | -0.245 | 0.085 | 0.129 | 0.391
ORF6 protein | 0.994 | 2.6E-54 | -0.105 | 0.495 | 0.978 | 1.2E-38 | -0.990 | 3.0E-48
ORF7a protein | 0.802 | 1.3E-13 | -0.827 | 4.8E-15 | -0.311 | 0.025 | 0.634 | 2.2E-07
ORF7b protein | -0.360 | 0.008 | 0.200 | 0.173 | 0.040 | 0.779 | -0.197 | 0.174
ORF8 protein | -0.917 | 4.1E-23 | 0.488 | 1.9E-04 | -0.019 | 0.889 | 0.043 | 0.779
ORF10 protein | -0.307 | 0.027 | 0.535 | 3.4E-05 | 0.154 | 0.297 | -0.730 | 2.2E-10

For example, the increase in 0 + 1 codon frequency in ORF10 was accompanied by a reduction in cytosine content, in ORF1a additionally by a higher adenine fraction, and in ORF1ab and ORF6 also by increased guanine content (Table 1). A positive correlation with guanine fraction was likewise observed for the nucleocapsid phosphoprotein gene. The frequency of 0 + 1 codons was negatively correlated with adenine in ORF8 as well as in the genes encoding the surface and membrane glycoproteins, where this frequency increased with cytosine content. ORF7a and the envelope protein gene likewise showed a positive correlation between 0 + 1 codon frequency and cytosine, but in these cases, 0 + 1 codons were more abundant in adenine-rich and less frequent in uracil-rich sequences.

Association of 0 + 1 codon group content with vaccine doses and confirmed COVID-19 cases

The data for 0 + 1 codon groups showed interesting relationships with daily COVID-19 vaccine doses administered per million people (Figs S1 and S2) as well as daily new confirmed COVID-19 cases per million people (Figs S3 and S4). Until the end of 2021, vaccination doses rose significantly, peaking in April and December 2021, whereas the cases started to grow from the middle of December 2021 and reached their maximum in the middle of January 2022. Close to these peaks, both for doses and cases, we observed a sharp rise in 0 + 1 codon usage for ORF1ab and ORF1a, and an abrupt decrease in the E and M genes as well as ORF6. Marked fluctuations or shifts in the 0 + 1 measure, corresponding to the vaccination surge and confirmed infections, were also evident in the N and S genes, along with ORF3a, ORF7b, and ORF8. The relationship is more pronounced when the cases are compared with the absolute difference between values for 0 + 1 codon frequencies, d(0 + 1), across two consecutive months (Figs. 8 and 9). The peaks for these two variables coincide for nine genes: E, M and N genes as well as ORF1ab, ORF1a, ORF3a, ORF6, ORF7b and ORF8. The correlation coefficients between the two parameters for these genes are statistically significant, ranging from 0.49 to 0.82 (Table 2).
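The quantity d(0 + 1) and its correlation with case numbers can be sketched in Python as follows. The example is illustrative only: the monthly values are invented, each difference is paired with the case count of the later of the two months (an assumption, since the alignment is not specified here), and the helper names `abs_monthly_change` and `pearson` are hypothetical.

```python
from math import sqrt


def abs_monthly_change(values):
    """Absolute difference between each pair of consecutive monthly values."""
    return [abs(later - earlier) for earlier, later in zip(values, values[1:])]


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented monthly 0 + 1 frequencies for one gene over five months.
freq_01 = [0.60, 0.60, 0.70, 0.71, 0.71]
# Invented daily new cases per million for months 2-5 (paired with the later month).
cases = [10, 200, 50, 20]

d_01 = abs_monthly_change(freq_01)
r = pearson(d_01, cases)
print([round(v, 2) for v in d_01], round(r, 3))
```

Using the absolute difference rather than the signed one treats sharp rises and sharp drops in 0 + 1 frequency alike, which is what allows genes with opposite trends to be compared against the same case curve.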

Fig. 8.

Absolute month-to-month changes in 0 + 1 codon frequencies in SARS-CoV-2 structural and non-structural protein-coding sequences (left Y-axis) alongside daily new confirmed COVID-19 cases per million people (right Y-axis) over time (X-axis)

Fig. 9.

Absolute month-to-month changes in 0 + 1 codon frequencies in SARS-CoV-2 accessory protein-coding sequences (left Y-axis) alongside daily new confirmed COVID-19 cases per million people (right Y-axis) over time (X-axis)

Table 2.

Pearson’s correlation coefficient (R) with p-values (p) between absolute month-to-month changes in 0 + 1 codon frequencies in individual coronavirus genes and daily new confirmed COVID-19 cases per million people. The statistically significant coefficients were bolded

Name of gene encoding | R | p
ORF1ab polyprotein | 0.529 | 3.4E-05
ORF1a polyprotein | 0.512 | 5.9E-05
envelope protein | 0.823 | 2.6E-14
membrane glycoprotein | 0.764 | 1.8E-11
nucleocapsid phosphoprotein | 0.626 | 4.4E-07
surface glycoprotein | -0.059 | 0.663
ORF3a protein | 0.570 | 6.1E-06
ORF6 protein | 0.491 | 1.2E-04
ORF7a protein | -0.077 | 0.617
ORF7b protein | 0.589 | 2.7E-06
ORF8 protein | 0.646 | 1.8E-07
ORF10 protein | -0.143 | 0.343

Discussion

To study changes in the codon frequencies of SARS-CoV-2 protein-coding genes from the perspective of human codon usage, we introduced a measure called Codon Block Recoding (CBR), which assigns codons to six groups according to their relative synonymous usage in human genes. The recoding is straightforward to interpret. If relative usage reflects the speed and efficiency of translation, then the composition of codon groups can approximate the overall effectiveness of protein synthesis. CBR and the human synonymous codon usage (HCU) remain closely related, as evidenced by a strong and statistically significant correlation between the two measures (Spearman’s correlation coefficient = -0.86; p-value < 2.2e-16). However, CBR reduces differences in synonymous codon frequencies arising from variation in codon family sizes by ranking codons independently for each amino acid. Codons labelled 0 or 1 are those with the highest synonymous usage within their respective codon families. Thus, they can be interpreted as the top preferred codons in human genes, irrespective of amino acid composition. The combined content of these codons corresponds closely to the Codon Adaptation Index (CAI) [36, 37]. Indeed, for the coronavirus sequences analysed, the frequency of 0 + 1 codons shows a strong correlation with CAI (Pearson’s correlation coefficient = 0.79; p-value < 2.2e-16). However, our measure is less sensitive to fluctuations in the codon frequencies of the reference set that cannot be directly related to the translation process but result, for example, from mutational pressure and the overall genomic nucleotide composition [38, 39]. Moreover, CAI is a one-dimensional measure that oversimplifies codon usage patterns, so it cannot capture the multifactorial relationships among codons. By contrast, our approach considers six codon groups, thereby reducing the usage of 64 codons to a more manageable dimensionality while retaining greater informational depth than CAI.
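For orientation, CAI as defined by Sharp and Li [36] is the geometric mean of the relative adaptiveness values of the codons in a sequence, where the relative adaptiveness of a codon is its reference frequency divided by that of the most frequent codon in the same synonymous family. The following minimal Python sketch illustrates this definition. The reference counts, the names `TOY_REFERENCE`, `relative_adaptiveness` and `cai`, and the choice to skip codons absent from the table are illustrative assumptions; they are not the human values or the CodonW implementation used in this study.

```python
import math

# Hypothetical reference counts for two codon families (NOT real human data).
TOY_REFERENCE = {
    "Lys": {"AAA": 40, "AAG": 60},
    "Asp": {"GAT": 30, "GAC": 70},
}

def relative_adaptiveness(reference):
    """Return w[codon] = count / count of the most used synonymous codon."""
    w = {}
    for codon_counts in reference.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            w[codon] = count / top
    return w

def cai(cds, w):
    """Geometric mean of w over the codons of a coding sequence.

    Codons missing from w are skipped (an assumption of this sketch).
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    return math.exp(sum(logs) / len(logs))

w = relative_adaptiveness(TOY_REFERENCE)
cai_preferred = cai("AAGGAC", w)   # only the most used codons
cai_rare = cai("AAAGAT", w)        # only the less used codons
```

In this toy case the sequence built from the most used codons reaches the maximum value, whereas the other sequence scores lower; the single number does not show which usage classes contribute to the difference, which is the limitation discussed above.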

It should also be noted that CAI considers only the relative frequencies among synonymous codons for each amino acid and ignores differences in the overall amino acid composition of the protein. However, each amino acid is associated with a distinct total pool of available tRNAs, which influences the speed and efficiency of translation. Proteins enriched in amino acids decoded by abundant tRNAs are translated more efficiently, whereas those rich in amino acids recognized by rare tRNAs elongate more slowly.

In contrast to CAI, the CBR approach incorporates both relative synonymous codon usage and the degeneracy level of codons. This is because codons from less degenerate families can only be assigned lower recoding labels (0 and 1), whereas codons from more degenerate families can also receive higher labels (e.g., 4 and 5). CBR therefore captures the relationship between codon degeneracy and tRNA gene copy number well. Using the high-confidence set of tRNA genes from GtRNAdb [40], we found that codons with lower degeneracy levels tend to have more tRNA gene copies per codon than those with higher degeneracy: one-fold degenerate (FD) codons have 9 copies per codon, 2FD have 8.9, 3FD have 7.7, 4FD have 6.7, and 6FD have 4.7 (Spearman’s correlation coefficient ρ = −1, p = 0.017). We further correlated the number of tRNA gene copies with both the relative codon adaptiveness values used in CAI and the recoding labels used in CBR, and found that the latter showed a stronger association (ρ = −0.37, p = 0.004) than the former (ρ = 0.30, p = 0.02). These results indicate that CBR reflects the composition of the tRNA pool better than CAI, while simultaneously incorporating relative synonymous codon usage and thus accounting for multiple aspects of the translational process. Thus, the CBR measure captures mutational and/or selection pressure not only on the usage of synonymous codons but also on the usage of amino acids, as related to tRNA gene copy numbers.
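The correlation between degeneracy and tRNA gene copies per codon can be retraced from the five values quoted above. The sketch below computes Spearman's ρ and an exact two-sided permutation p-value in plain Python. That an exact test was used in the original analysis is an assumption, although it is consistent with the reported p = 0.017; the function names are illustrative.

```python
from itertools import permutations

# Values quoted in the text: degeneracy level and tRNA gene copies per codon.
degeneracy = [1, 2, 3, 4, 6]
copies_per_codon = [9.0, 8.9, 7.7, 6.7, 4.7]

def ranks(values):
    """Ranks 1..n in ascending order (no ties are assumed in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def exact_two_sided_p(x, y):
    """Share of all rank permutations with |rho| at least as large as observed."""
    observed = abs(spearman_rho(x, y))
    rx, ry = ranks(x), ranks(y)
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        if abs(spearman_rho(rx, list(perm))) >= observed - 1e-12:
            hits += 1
    return hits / total

rho = spearman_rho(degeneracy, copies_per_codon)          # -1.0
p_exact = exact_two_sided_p(degeneracy, copies_per_codon)  # 2/120, about 0.0167
```

With five strictly monotonic points, only two of the 120 possible rank orderings (perfectly increasing and perfectly decreasing) are as extreme as the observed one, which gives 2/120 ≈ 0.017.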

The recoding of codons according to synonymous codon usage in humans proved useful in detecting differences and temporal changes in the codon preferences of protein-coding sequences from SARS-CoV-2 genomes. The analyses revealed that coronavirus genes can be differentiated based on their codon group composition. The greatest deviations were observed in the envelope glycoprotein gene, ORF6, ORF7b, and ORF10. Sequences of the nucleocapsid phosphoprotein gene, ORF7a, and ORF7b were most enriched in 0 + 1 codons, suggesting better adaptation to the host translational machinery, whereas the envelope glycoprotein gene showed the lowest abundance of these codons. In agreement with this, ribosome-profiling experiments demonstrated that the N gene exhibited the highest translation levels among all genes studied, whereas ORF7a and ORF7b showed higher levels than the E gene [41].

Despite the relatively small magnitude of changes in codon usage, statistical analyses identified distinct and significant clusters of sequences sharing similar codon usage patterns across time. This suggests that the codon group composition of coronavirus genes underwent notable temporal shifts. The most pronounced shifts occurred around the 2021/2022 boundary, at the beginning of 2023, and during the 2023/2024 transition, closely corresponding to major coronavirus variant replacements: Delta to Omicron BA, Omicron BA and Omicron BQ.1 to Omicron XBB, and Omicron XBB and Omicron EG.5.1 to Omicron JN. In addition to these large-scale transitions, smaller changes were also observed within specific intermediate intervals.

Notably, the analysis of the combined frequency of the two most widely used synonymous codons in human protein-coding genes revealed distinct temporal tendencies in coronavirus genes. This frequency increased over time in ORF1ab and ORF1a. Although the fraction of these codons never exceeded the expected value calculated for human codon usage, the increase suggests a clear trend towards optimisation of viral codon usage with respect to human codon preferences. These ORFs encode polyproteins, which are cleaved into smaller non-structural proteins. The enhanced adaptation of these sequences is likely linked to the critical role of their products in viral replication and in the expression of other genes at the early stage of the infection cycle. Since the coronavirus utilises the host translational machinery, including tRNAs, the use of codons that are also preferred in human genes can facilitate and speed up protein biosynthesis. This is in line with the general view that synonymous codon bias can affect the rate and efficiency of translation [9–14]. Consequently, more efficient translation, leading to a higher yield of polyproteins and their products, can accelerate viral proliferation and transmission. Our findings indicate that the codon usage of key viral proteins adapts over time to align with the host’s codon bias. In agreement with this, a general study of codon usage in the coding sequences of 502 human-infecting viruses, including SARS-CoV-2, found that the adaptation is visible in early viral proteins [24].

However, the structural protein-coding sequences exhibited a decreasing tendency in the frequency of the synonymous codons preferred by the human host. This suggests that these sequences tend to use less optimal codons in terms of human codon usage, which corresponds to the results of Posani et al. [33]. The usage of poorly optimized codons may cause slower and less accurate translation elongation, which also influences the folding of synthesized proteins [14–20]. Consequently, the overall production of viral proteins can be reduced. Moreover, deviations from optimal translation rates, as well as undesirable interactions between codons and noncognate tRNAs, can increase the number of misfolded protein variants [42, 43]. Since the structural proteins are exposed to the host immune system, their lower abundance and/or altered structures resulting from non-optimal translation can be beneficial for the virus due to reduced recognition by the immune system.

Moreover, the variable structure of epitopes can help to avoid the host response. Additionally, the accelerated shifts in codon usage patterns observed with time may confer enhanced adaptability to host immunological pressures, potentially increasing viral fitness in the face of host defences. In fact, we observed fluctuations in the usage of the optimal codons in the genes encoding the surface glycoprotein and nucleocapsid phosphoprotein, which elicit a strong immune response [44, 45]; these fluctuations can be associated with changes in the structure of these proteins over time.

The structural proteins, particularly the surface glycoprotein, are subject to nonsynonymous adaptive mutations linked to their functional roles, such as receptor binding and immune evasion [31]. This pressure, being particularly significant, may reduce selective forces on synonymous sites, thereby making the evolution of optimal codon usage impossible.

The genes for accessory proteins demonstrated variable tendencies in optimal codon usage. Due to their various functions, it is difficult to find a general explanation for the changes in their codon frequencies. The short-term decline and rise in codon usage in four genes for accessory proteins suggest that these changes were not favoured by selection and did not have adaptive consequences. Nevertheless, the tendency to evolve and maintain a low frequency of optimal codons, observed in the structural and some accessory protein-coding sequences, can also be associated with greater flexibility and adaptation to invade a broader range of hosts with different codon usage signatures [46, 47]. In fact, it was found that SARS-CoV-2 can infect a range of mammalian species [48–50]. It was also proposed that viruses with codon usage too similar to that of the host can be harmful to its cells due to depletion of the tRNA pool [51, 52]. This can disrupt host translation and other cellular processes, potentially leading to over-exploitation of the host cell, which may ultimately limit the virus’s ability to multiply intensively.

Changes in the frequency of 0 + 1 codon groups within individual genes were significantly correlated with variation in their nucleotide composition, reflecting the combined effects of mutational pressure and/or selective constraints on codon usage. Notably, the direction of these correlations varied by gene and nucleotide, suggesting gene-specific forces shaping codon preferences. In most cases, 0 + 1 content increased with uracil depletion and guanine enrichment, whereas for adenine and cytosine, both positive and negative correlations were observed.

The most striking changes were observed at the turn of 2021 and 2022. Eight of the 11 studied protein-coding sequences also showed a drastic change in 0 + 1 codon usage within a short time. These changes correspond very well to the shift from the Delta to the Omicron BA variant [53–55]. Omicron was first identified on November 24, 2021, in South Africa and on December 1, 2021, in the United States. By the week ending December 25, it had already become the dominant variant. The emergence of Omicron likely resulted from its high mutation rate and ability to evade immunity, as well as the global connectivity and transmission that facilitated its rapid spread.

Changes in 0 + 1 codon group frequencies were also closely linked to vaccination dynamics and infection trends. Peaks in vaccine doses (April and December 2021) and in confirmed cases (January 2022) coincided with sharp increases in ORF1ab and ORF1a or decreases in E, M, and ORF6, with additional fluctuations in N, S, ORF3a, ORF7b, and ORF8. Vaccination campaigns can impose selective pressure on the virus, promoting mutations that enhance immune evasion. Consistent with this, a retrospective study reported increased viral diversity in India during 2021–2022 following widespread vaccination [56]. As a result, more viruses could replicate if the immune system cannot fight them off, thereby providing additional opportunities for mutations to occur [57]. Similarly, the larger number of infected individuals also provided more opportunities for the virus to replicate, which in turn increased the probability of generating new mutations. It is therefore plausible that the changes in codon usage observed in our study resulted from the accumulation of new mutations driven by both vaccination pressure and high infection rates.

The quality of our study depends on the availability and consistency of high-quality coronavirus genomes and their uniform representation across the analysed period. We observed that genome sampling is uneven among countries. Therefore, we focused the analyses on data from the USA and the UK, which were the most comprehensively represented. Nonetheless, it would be valuable to test whether our conclusions hold for other countries as well. Potential biases may also affect data on COVID-19 cases and deaths. Consequently, systematic and standardized data collection across regions, pandemics, and viral species is essential to obtain reliable insights into the long-term evolution of viral genomes.

Conclusions

Our findings suggest that changes in codon usage, in terms of optimality for the human host, vary depending on the type and role of the genes. This reveals a trade-off between their competing evolutionary strategies. Viral codon usage appears to result not just from adaptation to host codon preferences but from multiple selective and mutational pressures, encompassing efficient transcription, mRNA export, and immune evasion [58]. The results, highlighting variations in codon usage across different coronavirus protein-coding genes, provide valuable insights that can aid in designing effective attenuated vaccines for novel strains, thereby supporting efforts to combat the pandemic. Attenuation can be achieved through, for example, codon deoptimization, whereby optimal codons are replaced with less preferred synonymous codons, leading to reduced translational efficiency of the viral mRNA encoded in DNA- or mRNA-based vaccines.

Methods

Datasets

We initially analysed 94,571 SARS-CoV-2 genomes obtained from the NCBI Virus database. Only complete genomes isolated from humans and containing both a correctly annotated collection date and country of origin were included. Protein-coding sequences were identified using Viral Annotation DefineR, VADR v1.4.2 [59]. The reference model applied was vadr-models-sarscov2-1.3-2, which is based on the NC_045512.2 RefSeq sequence from the Wuhan-Hu-1 isolate [60]. In the final analyses, we retained only genomes with confidently identified genes.

The downloaded genome set represented isolates from 94 countries worldwide. However, sequencing efforts were heavily skewed, with genomes from the United States and the United Kingdom accounting for over 88% of the dataset, i.e. 84,324 genomes. To avoid bias from this imbalance, we restricted our analysis to genomes from these two countries. The resulting dataset covered collection dates from January 2020 to October 2024. The data about the genomes are available in Supplementary_file.txt.

The temporal distribution of coronavirus variants in the USA and the United Kingdom was based on data from the Global Initiative on Sharing All Influenza Data (GISAID) (https://gisaid.org) [61, 62], accessed via CoVariants.org (2025), with major processing by Our World in Data (https://ourworldindata.org/). Daily new confirmed COVID-19 cases per million people and daily vaccine doses administered per million people for the United States and the United Kingdom were also obtained from Our World in Data (https://ourworldindata.org/), which is based on source data from the World Health Organization (2025) [63]. For both sets, the 7-day rolling averages were downloaded. The case data for the USA were supplemented with data from the U.S. Centers for Disease Control and Prevention (CDC) (https://covid.cdc.gov/covid-data-tracker). For each time interval, the average values from the two countries were calculated. The datasets were analysed using monthly time intervals.

Codon usage analysis

The novel measure termed codon block recoding (CBR) maps the 64 codons to a set of ordered categorical labels reflecting synonymous codon usage patterns. Specifically, CBR assigns each codon a label from 0 to 5, corresponding to its relative usage among synonymous codons, i.e. codons encoding the same amino acid, in the protein-coding sequences of an organism. The most frequent codon in a synonymous block is assigned the label 0, whereas the least frequent codon is assigned the label n − 1, where n is the number of codons in the block. Methionine and tryptophan, each represented by a single codon, have the label 0. The labels 0, 1, 2, 3, 4 and 5 are ordered categorical variables, which can be useful in qualitative studies. They are also easy to interpret and can be used, for example, to analyse the codons in groups characterised by distinct (high, medium or low) relative usage. The abundance of these groups is represented as fractions for each sequence and can be treated as compositional data. This recoding can be applied to any variant of the genetic code and used for any genome with an appropriate reference.
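The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the usage counts in `TOY_USAGE` are made up and cover only three amino acids (they are not the human reference values), ties are broken alphabetically, which the text does not specify, and the function names `cbr_labels` and `group_fractions` are hypothetical.

```python
from collections import Counter

# Hypothetical usage counts per synonymous block (NOT the human reference).
TOY_USAGE = {
    "Met": {"ATG": 100},
    "Lys": {"AAA": 40, "AAG": 60},
    "Ile": {"ATT": 35, "ATC": 50, "ATA": 15},
}

def cbr_labels(usage_by_block):
    """Label codons 0..n-1 by decreasing usage within each synonymous block."""
    labels = {}
    for codon_counts in usage_by_block.values():
        # Most frequent codon first; alphabetical tie-break is an assumption.
        ranked = sorted(codon_counts, key=lambda c: (-codon_counts[c], c))
        for label, codon in enumerate(ranked):
            labels[codon] = label
    return labels

def group_fractions(cds, labels, n_groups=6):
    """Fractions of codons of a coding sequence falling into groups 0..5."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(labels[c] for c in codons if c in labels)
    total = sum(counts.values())
    return [counts.get(g, 0) / total for g in range(n_groups)]

labels = cbr_labels(TOY_USAGE)
fractions = group_fractions("ATGAAGAAAATCATA", labels)
preferred_share = fractions[0] + fractions[1]   # the "0 + 1" codon content
```

The six fractions sum to one for each sequence, which is why they can be handled as compositional data, and the sum of the first two entries corresponds to the 0 + 1 codon content used throughout the Discussion.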

In this study, we recoded and assigned respective labels to codons in individual coronavirus protein-coding sequences according to the relative frequencies of synonymous codons in 93,487 human genes obtained from the Codon Usage Database [64]. The human synonymous codon usage (HCU) and the derived human codon block recoding (HCBR) are presented in table form in Fig. 10. The table includes the canonical codon blocks encoding 20 amino acids and stop codons. For comparison with the new measure, the Codon Adaptation Index (CAI) was calculated with CodonW [65], with the relative usage of each codon within each synonymous codon family in the human gene dataset provided as the reference.

Fig. 10.

The human synonymous codon usage (HCU) table with assigned human codon block recoding (HCBR) labels. Cod - codon; AA - encoded amino acid; HCU - human synonymous codon usage; LA - assigned label corresponding to the usage

Statistical analyses

The correspondence analysis was conducted on mean fractions of six codon groups (labelled from 0 to 5) for each viral gene, calculated from genomes collected during a given month and year. The calculations were performed for all 12 genes considered together and separately for each gene in R software [66] using the FactoMineR package [67].
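The input to this analysis is a table of mean codon group fractions per gene and month. A minimal sketch of that aggregation step is given below; the record layout (ISO collection date, gene name, six fractions), the example values and the function name `monthly_mean_fractions` are assumptions made for illustration and do not reproduce the R code used in the study.

```python
from collections import defaultdict

def monthly_mean_fractions(records):
    """Average codon group fractions per (gene, 'YYYY-MM') bin.

    records: iterable of (collection_date 'YYYY-MM-DD', gene, fractions).
    """
    sums = {}
    counts = defaultdict(int)
    for date, gene, fractions in records:
        key = (gene, date[:7])              # year and month of collection
        if key not in sums:
            sums[key] = [0.0] * len(fractions)
        for i, value in enumerate(fractions):
            sums[key][i] += value
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}

# Made-up example records for one gene.
example_records = [
    ("2021-12-03", "S", [0.30, 0.30, 0.20, 0.10, 0.05, 0.05]),
    ("2021-12-20", "S", [0.40, 0.20, 0.20, 0.10, 0.05, 0.05]),
    ("2022-01-05", "S", [0.35, 0.25, 0.20, 0.10, 0.05, 0.05]),
]
monthly = monthly_mean_fractions(example_records)
```

Each row of the resulting table (one gene in one month) is a six-part composition that can then be passed to a correspondence analysis or to the clustering described next.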

To identify sets of sequences for individual genes characterised by similar codon usage over the studied time period, we performed clustering using the clValid package in R [68]. The analysis was conducted on the dataset of codon group fractions, averaged over months and years for individual genes. We evaluated all available clustering methods (hierarchical, kmeans, diana, fanny, som, model, sota, pam, clara and agnes) and used the Silhouette index as an internal validation criterion to select the optimal clustering approach and configuration for each gene dataset. The method yielding the highest Silhouette score (from 0.64 to 0.94) was considered optimal, ensuring that the clustering choice reflected the underlying data structure rather than methodological bias. This metric evaluates cluster quality based solely on the dataset, assessing how similar each object is to its own cluster relative to other clusters.
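The selection criterion can be illustrated with a plain Python computation of the mean Silhouette width, which compares, for each object, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) as (b − a) / max(a, b). The sketch assumes Euclidean distances, at least two clusters, and a score of 0 for singleton clusters; the points and both candidate partitions are made up, and the function names are hypothetical rather than taken from clValid.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, cluster_ids):
    """Mean Silhouette width of a partition (requires at least two clusters)."""
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)
    scores = []
    for i, cid in enumerate(cluster_ids):
        own = [j for j in clusters[cid] if j != i]
        if not own:                      # singleton cluster: score set to 0
            scores.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != cid
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {"good": [0, 0, 1, 1], "bad": [0, 1, 0, 1]}
scores = {name: mean_silhouette(points, ids) for name, ids in candidates.items()}
best = max(scores, key=scores.get)
```

Choosing the candidate with the highest mean Silhouette width mirrors the rule stated above, where each clustering method and number of clusters yields one candidate partition per gene dataset.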

The differences in the codon group content between these clusters were assessed with a nonparametric comparison of multivariate samples based on F-approximations for ANOVA-type statistics, assuming 1000 permutations and using a subset algorithm to determine which factor levels (clusters) differ significantly from one another. The analyses were conducted with the npmv package in R [69].

For the same purpose, we also applied the analysis of similarities (ANOSIM). This method used a dissimilarity matrix of the codon group fractions to statistically test whether significant differences exist between clusters. The matrix was calculated using the Bray-Curtis metric, which is suitable for fractional and compositional data. Calculations were performed with 100 permutations using the vegan package in R [70].
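A compact Python sketch of this procedure is shown below: Bray-Curtis dissimilarities are ranked, the ANOSIM statistic R is the difference between the mean between-cluster and mean within-cluster ranks divided by N(N − 1)/4, and significance is estimated by permuting cluster labels. The example compositions, the random seed and the function names are illustrative assumptions; this is not the vegan implementation, although the p-value convention (hits + 1) / (permutations + 1) follows common practice.

```python
import random
from itertools import combinations

def bray_curtis(u, v):
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))

def average_ranks(values):
    """Ascending ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def r_statistic(pairs, ranks, groups):
    within = [r for (i, j), r in zip(pairs, ranks) if groups[i] == groups[j]]
    between = [r for (i, j), r in zip(pairs, ranks) if groups[i] != groups[j]]
    n = len(groups)
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(between) - mean(within)) / (n * (n - 1) / 4)

def anosim(samples, groups, permutations=100, seed=0):
    pairs = list(combinations(range(len(samples)), 2))
    ranks = average_ranks([bray_curtis(samples[i], samples[j]) for i, j in pairs])
    observed = r_statistic(pairs, ranks, groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)            # permute cluster labels
        if r_statistic(pairs, ranks, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Made-up three-part compositions forming two clearly distinct clusters.
samples = [
    [0.50, 0.30, 0.20], [0.52, 0.28, 0.20], [0.48, 0.32, 0.20],
    [0.20, 0.30, 0.50], [0.22, 0.28, 0.50], [0.18, 0.32, 0.50],
]
groups = ["A", "A", "A", "B", "B", "B"]
r_value, p_value = anosim(samples, groups)
```

An R close to 1 indicates that dissimilarities between clusters are consistently larger than those within clusters, whereas values near 0 indicate no separation.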

The change of SARS-CoV-2 variants (V) between subsequent months was calculated according to:

[Equation provided as an image in the original article (d33e1197.gif); not recoverable from the extracted text]

where v is a coronavirus variant, whereas n and p are percentages of a given variant in the next and previous months, respectively.

Correlation coefficients and their statistical significance were calculated between various variables in R. When multiple hypotheses were tested, the Benjamini–Hochberg correction [71] was applied to adjust the p-values.
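The Benjamini–Hochberg adjustment multiplies each p-value by m / rank, where m is the number of tests and rank is its position in ascending order, and then enforces monotonicity from the largest p-value downwards. A minimal Python sketch follows; the input p-values are made up, and the function name is hypothetical (in R, the same adjustment is available through `p.adjust` with `method = "BH"`).

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]          # made-up example values
adjusted_p = benjamini_hochberg(raw_p)     # [0.02, 0.04, 0.04, 0.02]
```

A test is then called significant at a given false discovery rate when its adjusted p-value falls below that threshold.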

Supplementary Information

Additional file 1: Supplementary figures Fig. S1-S4 with legends (Figures_S1_S4.pdf).

Additional file 2: Data about the studied genomes (Supplementary_file.txt).

Acknowledgements

We are grateful to anonymous reviewers, whose insightful comments and suggestions greatly improved this work.

Authors’ contributions

P.B. and P.M. conceived and designed the study. P.B., P.M. and D.M. prepared datasets, performed analyses and interpreted the results. D.M. reviewed the literature. P.B. wrote the initial version of the manuscript, whereas P.M. and D.M. wrote the final version. All authors approved the final version of the manuscript.

Funding

Some computations were carried out at the Wrocław Centre for Networking and Supercomputing, Poland, under Grant No. 442. This work was supported by COST Action: CA21169-Information, Coding, and Biological Function: the Dynamics of Life.

Data availability

All data generated during this study are included in this published article and its supplementary information files (Supplementary_file.txt, Figures_S1_S4.pdf).

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Zhu N, Zhang D, Wang W, Li C, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R, et al. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med. 2020;382(8):727–33.
2. Lu R, Yang P, Cui J, Zhang Q, Fan L, Dai Y, Wu Q, Li L, Zhang S, Zhu L, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for outbreak origin and receptor binding. Lancet. 2020;395(10228):565–74.
3. Lu R, Yang P, Cui L, Zhang Q, Wu J, Zhang L, Huang C, Wang W, Li N, Hu Y, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for outbreak origin and control. Lancet. 2020;395(10229):1025–35.
4. Wu AP, Peng YS, Huang BY, Ding X, Wang XY, Niu PH, Meng J, Zhu ZZ, Zhang Z, Wang JY, et al. Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China. Cell Host Microbe. 2020;27(3):325–8.
5. Gupta S, Gupta D, Bhatnagar S. Analysis of SARS-CoV-2 genome evolutionary patterns. Microbiol Spectr. 2024;12(2):e02654-23.
6. Redondo N, Zaldívar-López S, Garrido JJ, Montoya M. SARS-CoV-2 Accessory Proteins in Viral Pathogenesis: Knowns and Unknowns. Front Immunol. 2021;12:708264.
7. Wrapp D, Wang N, Corbett KS, De Silva E, Ippolito GC, McLellan JS. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science. 2020;367(6483):1260–3.
8. Pizzato M, Baraldi C, Boscato Sopetto G, Finozzi D, Gentile C, Gentile MD, Marconi R, Paladino D, Raoss A, Riedmiller I, et al. SARS-CoV-2 and the Host Cell: A Tale of Interactions. Front Virol. 2022;1:815388.
9. Bulmer M. Coevolution of codon usage and transfer RNA abundance. Nature. 1987;325(6106):728–30.
10. Frumkin I, Lajoie MJ, Gregg CJ, Hornung G, Church GM, Pilpel Y. Codon usage of highly expressed genes affects proteome-wide translation efficiency. Proc Natl Acad Sci. 2018;115(21):E4940–9.
11. Ikemura T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol. 1985;2(1):13–34.
12. Kanaya S, Yamada Y, Kinouchi M, Kudo Y, Ikemura T. Codon usage and tRNA genes in eukaryotes: Correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J Mol Evol. 2001;53(4–5):290–8.
13. Kanaya S, Yamada Y, Kudo Y, Ikemura T. Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene. 1999;238(1):143–55.
14. Liu Y, Yang Q, Zhao F. Synonymous but Not Silent: The Codon Usage Code for Gene Expression and Protein Folding. Annu Rev Biochem. 2021;90(1):375–401.
15. Komar AA. A Code Within a Code: How Codons Fine-Tune Protein Folding in the Cell. Biochem (Moscow). 2021;86(8):976–91.
16. Liu Y. A code within the genetic code: codon usage regulates co-translational protein folding. Cell Commun Signal. 2020;18:145.
17. Moss MJ, Chamness LM, Clark PL. The Effects of Codon Usage on Protein Structure and Folding. Annu Rev Biophys. 2024;53(1):87–108.
18. Wu X, Xu M, Yang J-R, Lu J. Genome-wide impact of codon usage bias on translation optimization in Drosophila melanogaster. Nat Commun. 2024;15(1):8329.
19. Yu C-H, Dang Y, Zhou Z, Wu C, Zhao F, Sachs MS, Liu Y. Codon Usage Influences the Local Rate of Translation Elongation to Regulate Co-translational Protein Folding. Mol Cell. 2015;59:744–54.
20. Zhou M, Wang T, Fu J, Xiao G, Liu Y. Nonoptimal codon usage influences protein structure in intrinsically disordered regions. Mol Microbiol. 2015;97(5):974–87.
21. Błażej P, Mackiewicz D, Wnetrzak M, Mackiewicz P. The Impact of Selection at the Amino Acid Level on the Usage of Synonymous Codons. G3-Genes Genomes Genetics. 2017;7(3):967–81.
22. Nyayanit DA, Yadav PD, Kharde R, Cherian S. Natural Selection Plays an Important Role in Shaping the Codon Usage of Structural Genes of the Viruses Belonging to the Coronaviridae Family. Viruses. 2020;13(1):3.
23. Ramazzotti D, Angaroni F, Maspero D, Mauri M, D’Aliberti D, Fontana D, Antoniotti M, Elli EM, Graudenzi A, Piazza R. Large-scale analysis of SARS-CoV-2 synonymous mutations reveals the adaptation to the human codon usage during the virus evolution. Virus Evol. 2022;8(1):veac026.
24. Hernandez-Alias X, Benisty H, Schaefer MH, Serrano L. Translational adaptation of human viruses to the tissues they infect. Cell Rep. 2021;34(11):108872.
25. Miller JB, Hippen AA, Wright SM, Morris C, Ridge PG. Human viruses have codon usage biases that match highly expressed proteins in the tissues they infect. Biomed Genet Genomics. 2017;2(2):1–5.
26. Fumagalli SE, Padhiar NH, Meyer D, Katneni U, Bar H, DiCuccio M, Komar AA, Kimchi-Sarfaty C. Analysis of 3.5 million SARS-CoV-2 sequences reveals unique mutational trends with consistent nucleotide and codon frequencies. Virol J. 2023;20(1):31.
27. Davidson A, Parr M, Totzeck F, Churkin A, Barash D, Frishman D, Tuller T. Over time analysis of the codon usage of SARS-CoV-2 and its variants. Comput Struct Biotechnol J. 2025;27:2034–50.
28. Tyagi N, Sardar R, Gupta D. Natural selection plays a significant role in governing the codon usage bias in the novel SARS-CoV-2 variants of concern (VOC). PeerJ. 2022;10:e13562.
29. Eldin P, David A, Hirtz C, Battini J-L, Briant L. SARS-CoV-2 Displays a Suboptimal Codon Usage Bias for Efficient Translation in Human Cells Diverted by Hijacking the tRNA Epitranscriptome. Int J Mol Sci. 2024;25(21):11614.
30. Roy A, Guo F, Singh B, Gupta S, Paul K, Chen X, Sharma NR, Jaishee N, Irwin DM, Shen Y. Base Composition and Host Adaptation of the SARS-CoV-2: Insight From the Codon Usage Perspective. Front Microbiol. 2021;12:548275.
31. Liu Y. Is SARS-CoV-2 facing constraints in its adaptive evolution? Biomol Biomed. 2025;25(11):2407–15.
32. Padhiar NH, Ghazanchyan T, Fumagalli SE, DiCuccio M, Cohen G, Ginzburg A, Rikshpun B, Klein A, Santana-Quintero L, Smith S, et al. SARS-CoV-2 CoCoPUTs: analyzing GISAID and NCBI data to obtain codon statistics, mutations, and free energy over a multiyear period. Virus Evol. 2025;11(1):veae115.
33. Posani E, Dilucca M, Forcelloni S, Pavlopoulou A, Georgakilas AG, Giansanti A. Temporal evolution and adaptation of SARS-CoV-2 codon usage. Front Bioscience-Landmark. 2022;27(1):13.
34. Mogro EG, Bottero D, Lozano MJ. Analysis of SARS-CoV-2 synonymous codon usage evolution throughout the COVID-19 pandemic. Virology. 2022;568:56–71.
35. Wu X, Shan Kj, Zan F, Tang X, Qian Z, Lu J. Optimization and Deoptimization of Codons in SARS-CoV-2 and Related Implications for Vaccine Development. Adv Sci. 2023;10(23):2205445.
36. Sharp PM, Li WH. The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987;15(3):1281–95.
37. Bahiri-Elitzur S, Tuller T. Codon-based indices for modeling gene expression and transcript evolution. Comput Struct Biotec. 2021;19:2646–63.
38. Chen YH. A comparison of synonymous codon usage bias patterns in DNA and RNA virus genomes: quantifying the relative importance of mutational pressure and natural selection. Biomed Res Int. 2013;2013:406342.
39. Plotkin JB, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet. 2011;12(1):32–42.
40. Chan PP, Lowe TM. GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res. 2009;37(Database issue):D93–97.
41. Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O, et al. The coding capacity of SARS-CoV-2. Nature. 2021;589(7840):125–30.
42. Jia X, He X, Huang C, Li J, Dong Z, Liu K. Protein translation: biological processes and therapeutic strategies for human diseases. Signal Transduct Target Therapy. 2024;9(1):44.
43. Komar AA, Samatova E, Rodnina MV. Translation Rates and Protein Folding. J Mol Biol. 2024;436(14):168384.
44. Shah VK, Firmal P, Alam A, Ganguly D, Chattopadhyay S. Overview of Immune Response During SARS-CoV-2 Infection: Lessons From the Past. Front Immunol. 2020;11:1949.
45. Torbati E, Krause KL, Ussher JE. The Immune Response to SARS-CoV-2 and Variants of Concern. Viruses. 2021;13(10):1911.
46. Jenkins GM, Holmes EC. The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res. 2003;92(1):1–7.
47. Luo W, Roy A, Guo F, Irwin DM, Shen X, Pan J, Shen Y. Host Adaptation and Evolutionary Analysis of Zaire ebolavirus: Insights From Codon Usage Based Investigations. Front Microbiol. 2020;11:570131.
48. Pekar JE, Magee A, Parker E, Moshiri N, Izhikevich K, Havens JL, Gangavarapu K, Malpica Serrano LM, Crits-Christoph A, Matteson NL, et al. The molecular epidemiology of multiple zoonotic origins of SARS-CoV-2. Science. 2022;377(6609):960–6.
49. Schindell BG, Allardice M, McBride JAM, Dennehy B, Kindrachuk J. SARS-CoV-2 and the Missing Link of Intermediate Hosts in Viral Emergence - What We Can Learn From Other Betacoronaviruses. Front Virol. 2022;2:875213.
50. Tan CCS, Lam SD, Richard D, Owen CJ, Berchtold D, Orengo C, Nair MS, Kuchipudi SV, Kapur V, van Dorp L, et al. Transmission of SARS-CoV-2 from humans to animals and potential host adaptation. Nat Commun. 2022;13(1):2988.
51. Castellano LA, McNamara RJ, Pallarés HM, Gamarnik AV, Alvarez DE, Bazzini AA. Dengue virus preferentially uses human and mosquito non-optimal codons. Mol Syst Biol. 2024;20(10):1085–108.
52. Chen F, Wu P, Deng S, Zhang H, Hou Y, Hu Z, Zhang J, Chen X, Yang J-R. Dissimilation of synonymous codon usage bias in virus–host coevolution due to translational selection. Nat Ecol Evol. 2020;4(4):589–600.
53. Keyel AC, Russell A, Plitnick J, Rowlands JV, Lamson DM, Rosenberg E, St. George K. SARS-CoV-2 Vaccine Breakthrough by Omicron and Delta Variants, New York, USA. Emerg Infect Dis. 2022;28(10):1990–8.
54. Paton RS, Overton CE, Ward T. The rapid replacement of the SARS-CoV-2 Delta variant by Omicron (B.1.1.529) in England. Sci Transl Med. 2022;14:eabo5395.
55. Robles-Escajeda E, Mohl JE, Contreras L, Betancourt AP, Mancera BM, Kirken RA, Rodriguez G. Rapid Shift from SARS-CoV-2 Delta to Omicron Sub-Variants within a Dynamic Southern U.S. Borderplex. Viruses. 2023;15(3):658.
56. Jena D, Ghosh A, Jha A, Prasad P, Raghav SK. Impact of vaccination on SARS-CoV-2 evolution and immune escape variants. Vaccine. 2024;42(21):126153.
57. Rouzine IM, Rozhnova G. Evolutionary implications of SARS-CoV-2 vaccination for the future design of vaccination strategies. Commun Med. 2023;3(1):86.
58. Mordstein C, Cano L, Morales AC, Young B, Ho AT, Rice AM, Liss M, Hurst LD, Kudla G. Transcription, mRNA Export, and Immune Evasion Shape the Codon Usage of Viruses. Genome Biol Evol. 2021;13(9):evab106.
  • 59.Nawrocki EP. Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genomics Bioinf. 2023;5(1):lqad002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y-Y, Tao Z-W, Huang Y, Lan K, et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579(7798):270–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 2017;1(1):33–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Shu YL, McCauley J, GISAID. Global initiative on sharing all influenza data - from vision to reality. Eurosurveillance. 2017;22(13):2–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Mathieu E, Ritchie H, Rodés-Guirao L, Appel C, Gavrilov D, Giattino C, et al. Coronavirus (COVID-19) Vaccinations. Published online at OurWorldinData.org.; 2020. https://archive.ourworldindata.org/20260223-071105/covid-vaccinations.html.
  • 64.Nakamura Y, Gojobori T, Ikemura T. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 2000;28(1):292–292. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Peden JF. Analysis of codon usage. PhD thesis. United Kingdom: University of Nottingham. 1999.
  • 66.Team RC. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2025. https://www.R-project.org/. [Google Scholar]
  • 67.Le S, Josse J, Husson F, FactoMineR. An R package for multivariate analysis. J Stat Softw. 2008;25(1):1–18. [Google Scholar]
  • 68.Brock G, Datta S, Pihur V, Datta S. clValid: An R package for cluster validation. J Stat Softw. 2008;25(4):1–22. [Google Scholar]
  • 69.Ellis AR, Burchett WW, Harrar SW, Bathke AC. Nonparametric Inference for Multivariate Data: The R Package npmv. J Stat Softw. 2017;76(4):1–18.36568334 [Google Scholar]
  • 70.Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P, O’Hara R, Solymos P, Stevens M, Szoecs E et al. vegan: Community Ecology Package. R package. In. 2025. https://CRAN.R-project.org/package=vegan.
  • 71.Benjamini Y, Hochberg Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B. 1995;57(1):289–300. [Google Scholar]

Associated Data

Supplementary Materials

12864_2026_12672_MOESM1_ESM.pdf (338.8KB, pdf)

Additional file 1: Supplementary figures Fig. S1-S4 with legends (Figures_S1_S4.pdf).

12864_2026_12672_MOESM2_ESM.txt (3.1MB, txt)

Additional file 2: Data about the studied genomes (Supplementary_file.txt).

Data Availability Statement

All data generated during this study are included in this published article and its supplementary information files (Supplementary_file.txt, Figures_S1_S4.pdf).


Articles from BMC Genomics are provided here courtesy of BMC