NAR Genomics and Bioinformatics. 2024 Feb 2;6(1):lqae009. doi: 10.1093/nargab/lqae009

A simple method for estimating time-irreversible nucleotide substitution rates in the SARS-CoV-2 genome

Kazuharu Misawa 1,2, Ryo Ootsuki 3,4
PMCID: PMC11640943  PMID: 39678027

Abstract

SARS-CoV-2 is the cause of the current worldwide pandemic of severe acute respiratory syndrome. The change of nucleotide composition of the SARS-CoV-2 genome is crucial for understanding the spread and transmission dynamics of the virus because viral nucleotide sequences are essential in identifying viral strains. Recent studies have shown that cytosine (C) to uracil (U) substitutions are overrepresented in SARS-CoV-2 genome sequences. These asymmetric substitutions between C and U indicate that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences. Thus, we develop a new time-irreversible model of nucleotide substitutions to estimate the substitution rates in SARS-CoV-2 genomes. We investigated the number of nucleotide substitutions among the 7862 genomic sequences of SARS-CoV-2 registered in the Global Initiative on Sharing All Influenza Data (GISAID) that have been sampled from all over the world. Using the new method, the substitution rates in SARS-CoV-2 genomes were estimated. The C-to-U substitution rates of SARS-CoV-2 were estimated to be 1.95 × 10^−3 ± 4.88 × 10^−4 per site per year, compared with 1.48 × 10^−4 ± 7.42 × 10^−5 per site per year for all other types of substitutions.

Introduction

Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is an RNA virus that has spread globally and is the cause of the current COVID-19 pandemic (1,2). The study of the molecular evolution of SARS-CoV-2 is important as it provides a better understanding of the dynamics of virus spread and transmission. Understanding molecular evolution is essential in developing effective vaccines, therapeutic approaches, and identification of viral strains. In addition, continuous surveillance of the evolution of the virus will contribute to the implementation of surveillance strategies and long-term preparedness against the disease. The main objective of this study is to propose a method for predicting nucleotide changes in SARS-CoV-2 genomes.

Genomic analyses of SARS-CoV-2 have demonstrated that 50% of the sequence mutations are cytosine-to-uracil (C-to-U) transitions with an 8-fold base frequency directional asymmetry between C-to-U and U-to-C substitutions (3–6). The asymmetric substitutions between C and U indicate that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences, although many time-reversible nucleotide substitution models are available, including the Jukes–Cantor model (7), the Kimura 2-parameter model (8), the Hasegawa–Kishino–Yano model (9), the Tamura and Nei model (10) and the General Time-Reversible model (11). The molecular evolution of SARS-CoV-2 can be studied more effectively using time-irreversible models (12,13) than using time-reversible models. Previous studies of time-irreversible models have used iterative approaches, such as the Newton–Raphson method, for estimating the substitution rates. Iterative methods are time-consuming because the calculation is repeated until the estimates converge.

Here, we present a new time-irreversible model of nucleotide substitutions to estimate the substitution rates in SARS-CoV-2 genomes, together with a simple algorithm for estimating the substitution rates by the diagonalization method. The diagonalization method is often used for time-reversible models, such as the Hasegawa–Kishino–Yano model (9) and the general time-reversible model (11). To verify the new model, we investigated the number of nucleotide substitutions in genomic sequences of SARS-CoV-2 registered in the Global Initiative on Sharing All Influenza Data (GISAID) (14) that have been sampled from all over the world.

Materials and methods

Definition of substitution rate matrix

In this study, the process of nucleotide substitution is considered as a continuous Markov process. The four RNA bases C, U, G and A are designated as 1, 2, 3 and 4, respectively, and $P_{ij}(t)$ is the probability that nucleotide $i$ is substituted by $j$ in time period $t$. $P(t)$ is a matrix in which the $ij$th element is $P_{ij}(t)$.

$P(t)$ satisfies the following Chapman–Kolmogorov equation:

graphic file with name M0007.gif (1)

From equation (1), the following is obtained:

graphic file with name M0008.gif (2)
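For orientation, the standard textbook relations for a time-homogeneous continuous-time Markov process are sketched below. The symbols $s$ and $R$ are assumptions chosen for illustration, and these forms are not necessarily identical to the authors' equations (1) and (2).

$$P(t+s) = P(t)\,P(s), \qquad P(0) = I$$

Differentiating with respect to $s$ at $s = 0$, and writing $R$ for the derivative of $P$ at time zero, gives

$$\frac{dP(t)}{dt} = P(t)\,R, \qquad P(t) = e^{Rt}$$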

Time-irreversible model

Here, a new time-irreversible model is proposed for SARS-CoV-2 evolution. To model the directional asymmetry between C-to-U and U-to-C substitutions, a matrix, Inline graphic, was created, where Inline graphic is the C-to-U substitution rate and Inline graphic is the rate of other types of nucleotide substitutions.

graphic file with name M00012.gif (3)

Inline graphic is a derivative of the substitution probability matrix with respect to time Inline graphic (3).
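A minimal sketch of a rate matrix with this structure is given below, assuming that every off-diagonal entry shares one rate except the C-to-U entry and that each diagonal entry is minus the sum of the rest of its row. The function name `rate_matrix` and the parameter names `c_to_u` and `other` are hypothetical, and the numerical values are used only for illustration.

```python
import numpy as np

# Index order follows the text: C, U, G, A (numbered 1-4 in the paper, 0-3 here).
BASES = "CUGA"

def rate_matrix(c_to_u, other):
    """Build a 4x4 rate matrix with a single elevated C-to-U rate.

    c_to_u and other are hypothetical parameter names for the two rates.
    """
    R = np.full((4, 4), other, dtype=float)
    R[0, 1] = c_to_u                       # C -> U has its own rate
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))    # rows of a rate matrix sum to zero
    return R

R = rate_matrix(c_to_u=1.95e-3, other=1.48e-4)
print(np.round(R, 6))
```

Because the C-to-U entry differs from the U-to-C entry, the matrix is not symmetric, which is what makes the model directional.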

Computing the powers of the substitution rate matrix by diagonalization

The substitution rate matrix Inline graphic defined by equation (3) can be diagonalized as

graphic file with name M00016.gif (4)

where

graphic file with name M00017.gif (5)

From equation (4), Inline graphic is obtained as

graphic file with name M00019.gif (6)

A probability matrix, Inline graphic, is obtained by

graphic file with name M00021.gif (7)

Thus,

graphic file with name M00022.gif

Using equation (7), Inline graphic can be obtained by

graphic file with name M00024.gif (8)

where Inline graphic is defined by

graphic file with name M00026.gif (9)

Finally, Inline graphic can be calculated by

graphic file with name M00028.gif (10)

where

graphic file with name M00029.gif (11)
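The diagonalization route can be sketched numerically as follows. This is an illustration under assumptions, not the authors' closed-form expressions: it uses a generic eigendecomposition from NumPy, a rate matrix with one elevated C-to-U rate, and hypothetical function names. The result is checked against a truncated power series and against the Chapman–Kolmogorov relation.

```python
import numpy as np

def rate_matrix(c_to_u, other):
    """Rate matrix in the order C, U, G, A with one elevated C-to-U rate."""
    R = np.full((4, 4), other, dtype=float)
    R[0, 1] = c_to_u
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))
    return R

def transition_probabilities(R, t):
    """P(t) = V exp(Lambda t) V^-1, from the eigendecomposition R = V Lambda V^-1."""
    eigvals, V = np.linalg.eig(R)
    P = V @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(V)
    return P.real  # drop any round-off imaginary parts

def expm_taylor(A, terms=30):
    """Reference value of exp(A) from a truncated power series."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

R = rate_matrix(1.95e-3, 1.48e-4)
P1 = transition_probabilities(R, 1.0)
P2 = transition_probabilities(R, 2.0)
print(np.allclose(P1, expm_taylor(R)))   # diagonalization vs. power series
print(np.allclose(P1 @ P1, P2))          # Chapman-Kolmogorov: P(1) P(1) = P(2)
```

Once the eigendecomposition is available, evaluating the probabilities at any time only requires exponentiating the eigenvalues, which is why no iterative procedure is needed.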

Estimation of nucleotide substitution rates

Notably,

graphic file with name M00030.gif (12)

where Inline graphic and Inline graphic can be estimated using equation (6) and by solving simultaneous equations (11) and (12). The arithmetic mean is used when multiple estimates are obtained.

Estimation of nucleotide contents with respect to time

Inline graphic is the observed number of cases where the ancestral nucleotide is Inline graphic and the derived nucleotide is Inline graphic in time Inline graphic. Inline graphic is a matrix in which the ijth element is Inline graphic. The expected value of Inline graphic can be obtained by

graphic file with name M00040.gif (13)

where

graphic file with name M00041.gif (14)

and Inline graphic is the number of nucleotides Inline graphic in the ancestral sequence.
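As an illustration of how expected counts follow from the transition probabilities, the sketch below assumes that the expected number of sites with ancestral nucleotide i and derived nucleotide j is the ancestral count of i multiplied by the probability of i being replaced by j. The ancestral counts are hypothetical round numbers, and the function names and rate values are assumptions made only for this example.

```python
import numpy as np

def rate_matrix(c_to_u, other):
    """Rate matrix in the order C, U, G, A with one elevated C-to-U rate."""
    R = np.full((4, 4), other, dtype=float)
    R[0, 1] = c_to_u
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))
    return R

def transition_probabilities(R, t):
    eigvals, V = np.linalg.eig(R)
    return (V @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(V)).real

def expected_counts(n0, R, t):
    """Entry (i, j) is n0[i] * P_ij(t): expected sites that started as i and are now j."""
    return np.asarray(n0, dtype=float)[:, None] * transition_probabilities(R, t)

n0 = [5500, 9600, 5900, 9000]            # hypothetical ancestral counts of C, U, G, A
R = rate_matrix(1.95e-3, 1.48e-4)
E = expected_counts(n0, R, t=2.0)        # t in years
composition = E.sum(axis=0)              # expected nucleotide content after t years
print(np.round(composition, 1))
```

Summing each column of the expected-count matrix gives the predicted nucleotide content at time t, so the same calculation can be repeated over a grid of times to draw a trend curve.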

If Inline graphic is the estimate of Inline graphic, Inline graphic can be estimated by

graphic file with name M00047.gif (15)

Using equation (16) we obtain estimated values for w, x, y and z.

graphic file with name M00048.gif (16)

Applying equation (17) yields the estimated value for a and b.

graphic file with name M00049.gif (17)

Equation (18) provides the estimated value for h.

graphic file with name M00050.gif (18)

Confidence intervals of the estimates of the evolutionary rates

To obtain confidence intervals for the estimated substitution rates, we used the bootstrap method. Bootstrap samples were created by repeatedly resampling the virus sequences with replacement from the original set, keeping the sample size unchanged. The statistic of interest was computed for each bootstrap sample to build its distribution, and the range from the 0.5th to the 99.5th percentile of that distribution was taken as the 99% confidence interval.
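The percentile bootstrap described above can be sketched as follows. The input values are simulated, not the study's data, and the function name `bootstrap_ci` and its parameters are hypothetical; the statistic here is assumed to be the mean of per-sequence estimates.

```python
import numpy as np

def bootstrap_ci(estimates, n_boot=2000, level=0.99, seed=0):
    """Percentile bootstrap confidence interval for the mean of `estimates`."""
    rng = np.random.default_rng(seed)
    estimates = np.asarray(estimates, dtype=float)
    n = estimates.size
    # Each row holds the indices of one resample, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = estimates[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.percentile(boot_means, [100 * alpha, 100 * (1 - alpha)])
    return lower, upper

# Simulated per-sequence rate estimates (illustrative values only).
rng = np.random.default_rng(1)
rates = rng.normal(loc=1.95e-3, scale=4.88e-4, size=200)
low, high = bootstrap_ci(rates)
print(f"mean = {rates.mean():.3e}, 99% CI = ({low:.3e}, {high:.3e})")
```

With `level=0.99` the interval runs from the 0.5th to the 99.5th percentile of the bootstrap distribution, matching the range described in the text.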

Sequence analysis of SARS-CoV-2

To verify the proposed model, the number of nucleotide substitutions of genomic sequences and the changes in nucleotide contents of SARS-CoV-2 were investigated. Genomic sequences of SARS-CoV-2 sampled at six-month intervals from 31 December 2019 to 31 December 2021 were retrieved from the GISAID database (14). Gapped sites were excluded from the analysis. The sequence of the sample taken on 31 December 2019 was assumed to be the ancestral sequence, because it is the sequence first identified in Wuhan, China (gisaid_epi_isl: EPI_ISL_402125).

Genomes with >29 000 nucleotides were considered to have complete coverage. Sequences with <0.05% unique amino acid substitutions (i.e. substitutions not seen in other sequences in the database) and no insertions/deletions, unless verified by the submitter, were included in the analysis. Only sequences without undetermined nucleotides (Ns) were used. A pairwise alignment of each genome sequence with the reference sequence was obtained using MAFFT (15), a rapid tool for multiple sequence alignment. The substitution rates were estimated by comparing the sample sequences collected as of 31 December 2020 with the reference sequence. Because 1792 sequences were collected on that date (Table 1), an estimate was derived from each of these pairwise comparisons. The overall estimate and its standard error were calculated from the mean of these estimates, and the confidence intervals of the overall estimates were determined from the same pairwise estimates. The date, region and sample size of the GISAID sequences used in this study are given in Table 1. To avoid sampling bias, we investigated the nucleotide changes in each region independently.
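The counting step behind these pairwise comparisons can be sketched as below, assuming the two sequences have already been aligned to equal length. The function name and the toy sequences are hypothetical; columns containing a gap or an undetermined base are skipped, and T is read as U.

```python
from collections import Counter

VALID = set("CUGA")

def count_substitutions(ancestral, derived):
    """Count (ancestral, derived) nucleotide pairs over aligned columns."""
    if len(ancestral) != len(derived):
        raise ValueError("aligned sequences must have equal length")
    anc = ancestral.upper().replace("T", "U")
    der = derived.upper().replace("T", "U")
    counts = Counter()
    for a, d in zip(anc, der):
        if a in VALID and d in VALID:   # skips gaps ('-') and undetermined bases ('N')
            counts[(a, d)] += 1
    return counts

# Toy alignment (not real data): two C-to-U changes, one gapped and one N column.
anc = "ACGTCC-ATN"
der = "ATGTCTCAT-"
counts = count_substitutions(anc, der)
print(counts[("C", "U")], counts[("U", "C")])
```

The resulting table of pair counts has one entry per ancestral and derived nucleotide combination, which is the form of input the estimation step works from.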

Table 1.

Date, region and sample size of the GISAID sequences used in this study

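| Region | 2019/12/31 | 2020/6/30 | 2020/12/31 | 2021/6/30 | 2021/12/31 | Total |
|---|---|---|---|---|---|---|
| Africa | 0 | 19 | 11 | 13 | 5 | 48 |
| Asia | 1 | 70 | 211 | 334 | 440 | 1056 |
| Europe | 0 | 120 | 887 | 2369 | 1741 | 5117 |
| North America | 0 | 298 | 650 | 1057 | 558 | 2563 |
| Oceania | 0 | 52 | 13 | 11 | 28 | 104 |
| South America | 0 | 25 | 20 | 186 | 32 | 263 |
| Total | 1 | 584 | 1792 | 3970 | 2810 | 9157 |

Column headings are sampling dates.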

Synonymous and nonsynonymous changes of SARS-CoV-2 genes

To test whether the trend of nucleotide contents in the SARS-CoV-2 genome is caused by mutational bias or natural selection, the numbers of nonsynonymous and synonymous substitutions per site of the SARS-CoV-2 genes were estimated, because the selective force depends on the function of the protein, which in turn depends on the amino acid sequence. Table 4 shows the numbers of nonsynonymous and synonymous changes per site of the SARS-CoV-2 genes estimated by the Nei–Gojobori (NG86) model (16).

Table 3.

Major strains of SARS-CoV-2

| Strain | Pango lineage |
|---|---|
| Reference | |
| Alpha | B.1.1.7 |
| Beta | B.1.1.351 |
| Gamma | P.2 |
| S | A.23.1 |
| Omicron | BA.1 |

Results

Estimates of nucleotide substitutions of the SARS-CoV-2 genome

Table 2 shows the estimated substitution rates. C-to-U substitution rates were estimated as 1.95 × 10^−3 ± 4.88 × 10^−4 per site per year, and the rates for other types of substitutions were 1.48 × 10^−4 ± 7.42 × 10^−5 per site per year.

Table 2.

Estimated substitution rate per site per year

| Type | Rate | SD |
|---|---|---|
| Non C-to-U | 1.48 × 10^−4 | 7.42 × 10^−5 |
| C-to-U | 1.95 × 10^−3 | 4.88 × 10^−4 |

Changes in the nucleotide contents of the SARS-CoV-2 genome

Bar plots of the changes in C content of the SARS-CoV-2 genomes against sample date are shown in Figure 1. The number of Cs decreased in the SARS-CoV-2 genome over the period from 31 December 2019 to 31 December 2021, indicating that the nucleotide frequencies had not reached equilibrium. Figure 1 also shows the corresponding bar plots for U content; the number of Us increased over the same period. In the upper panels of Figure 1, the observed frequencies of U and C on 31 December 2021 differ slightly from the estimated trend line, but these differences are not significant (P > 0.05). The lower panels of Figure 1 show the changes in the numbers of Gs and As, which were almost unchanged throughout the same period. Solid lines in Figure 1 are trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The dotted curves are the 99% confidence intervals of the predicted nucleotide contents. These curves indicate that Cs will decrease almost linearly with time, while Us will increase, all over the world.

Figure 1.

Bar plots of the changes in C (cytosine), U (uracil), G (guanine), and A (adenine) contents of the SARS-CoV-2 genomes and sample dates over the time period from 31 December 2019 to 31 December 2021. x axis: date; y axis: content. Solid lines represent trend curves depicting changes in nucleotide contents predicted by the new time-irreversible model. Dotted curves represent the 99% confidence intervals of the predicted nucleotide contents.

Global trends in nucleotide substitution rates of the SARS-CoV-2 genome

Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in Africa, Asia, Europe, North America, Oceania, and South America, respectively. Solid lines of Supplementary Figures S1–S6 also show trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The same trend of nucleotide substitutions was observed in all regions.

Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in several of the dominant strains, namely, Alpha, Beta, Gamma, Delta and Omicron, respectively. Solid lines of Supplementary Figures S7–S11 are trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The results indicate that the trend of nucleotide changes is indeed a global pattern across all SARS-CoV-2 strains.

Synonymous and nonsynonymous changes of SARS-CoV-2 genes

Table 4 shows the numbers of nonsynonymous and synonymous substitutions per site of the SARS-CoV-2 genes estimated by the Nei–Gojobori model (16). In this table, $d_N$ indicates the number of nonsynonymous substitutions per site and $d_S$ indicates the number of synonymous substitutions per site. Except for S, the $d_N/d_S$ ratio is smaller than one. In total, the $d_N/d_S$ ratio is 0.47.

Table 4.

Synonymous and nonsynonymous substitutions

| CDS | Start | End | $d_N$ × 1000 | $d_S$ × 1000 | $d_N/d_S$ | Length |
|---|---|---|---|---|---|---|
| ORF1a | 265 | 13468 | 0.57 | 1.36 | 0.42 | 4400 |
| ORF1b | 13467 | 21555 | 0.54 | 1.15 | 0.47 | 2695 |
| S | 21562 | 25384 | 1.71 | 1.51 | 1.13 | 1273 |
| ORF3a | 25392 | 26220 | 2.23 | 5.93 | 0.38 | 275 |
| E | 26244 | 26472 | 5.98 | 19.50 | 0.31 | 75 |
| M | 26522 | 27191 | 2.35 | 6.88 | 0.34 | 222 |
| ORF6 | 27201 | 27387 | 13.54 | 54.40 | 0.25 | 61 |
| ORF7a | 27393 | 27759 | 8.74 | 11.05 | 0.79 | 121 |
| ORF7b | 27755 | 27887 | 10.04 | 38.74 | 0.26 | 43 |
| ORF8 | 27893 | 28259 | 3.95 | 13.60 | 0.29 | 121 |
| N | 28273 | 29533 | 4.41 | 4.87 | 0.91 | 419 |
| ORF10 | 29557 | 29674 | 11.17 | 43.22 | 0.26 | 38 |
| Total | | | 1.31 | 2.80 | 0.47 | 9743 |
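The ratio column of Table 4 can be recomputed from the two substitution columns, as in the short check below. The values are copied from Table 4; the variable names are hypothetical, and the tolerance only allows for rounding to two decimal places.

```python
# (nonsynonymous x 1000, synonymous x 1000, reported ratio), copied from Table 4
table4 = {
    "ORF1a": (0.57, 1.36, 0.42), "ORF1b": (0.54, 1.15, 0.47),
    "S": (1.71, 1.51, 1.13), "ORF3a": (2.23, 5.93, 0.38),
    "E": (5.98, 19.50, 0.31), "M": (2.35, 6.88, 0.34),
    "ORF6": (13.54, 54.40, 0.25), "ORF7a": (8.74, 11.05, 0.79),
    "ORF7b": (10.04, 38.74, 0.26), "ORF8": (3.95, 13.60, 0.29),
    "N": (4.41, 4.87, 0.91), "ORF10": (11.17, 43.22, 0.26),
    "Total": (1.31, 2.80, 0.47),
}

ratios = {gene: dn / ds for gene, (dn, ds, _) in table4.items()}
for gene, ratio in ratios.items():
    print(f"{gene:6s} ratio = {ratio:.2f} (reported {table4[gene][2]:.2f})")

# Genes whose ratio exceeds one, i.e. more nonsynonymous than synonymous change per site
above_one = [gene for gene, ratio in ratios.items() if ratio > 1.0]
print(above_one)
```

The scaling factor of 1000 cancels in the ratio, so the columns can be divided directly.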

Discussion

In this study, we proposed a method for predicting nucleotide changes in SARS-CoV-2 genomes. The results shown in Figure 1 and Supplementary Figures S1–S6 demonstrate persistent changes in nucleotide frequencies in the SARS-CoV-2 genome. In addition, the comprehensive analysis presented in Figure 1 and Supplementary Figures S1–S6 showed that the high C-to-U substitution rate is not limited to any one continent but is widespread worldwide. Sequence analyses of SARS-CoV-2 revealed that the nucleotide composition estimated by our method was consistent with the observed changes in nucleotide composition.

The proposed method is based on the time-irreversible model described in equation (3). When there is a stationary distribution of nucleotide content, i.e. Inline graphic, and the detailed balance condition described in equation (19) is satisfied for all $i$ and $j$ in the stationary state, the process is time reversible (3).

graphic file with name M00064.gif (19)

These asymmetric substitutions between C and U indicate that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences.
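The reversibility argument can be illustrated numerically. The sketch below assumes a rate matrix with a single elevated C-to-U rate, computes its stationary distribution, and tests whether the stationary flux from i to j equals the flux from j to i for every pair. All function names and rate values are hypothetical choices for this example.

```python
import numpy as np

def rate_matrix(c_to_u, other):
    """Rate matrix in the order C, U, G, A with one elevated C-to-U rate."""
    R = np.full((4, 4), other, dtype=float)
    R[0, 1] = c_to_u
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))
    return R

def stationary_distribution(R):
    """Solve pi R = 0 together with sum(pi) = 1 as a least-squares system."""
    A = np.vstack([R.T, np.ones(R.shape[0])])
    b = np.zeros(R.shape[0] + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def satisfies_detailed_balance(R):
    """True when pi_i R_ij equals pi_j R_ji for all i and j."""
    pi = stationary_distribution(R)
    flux = pi[:, None] * R
    return bool(np.allclose(flux, flux.T, atol=1e-12))

asymmetric = rate_matrix(1.95e-3, 1.48e-4)   # elevated C-to-U rate
symmetric = rate_matrix(1.48e-4, 1.48e-4)    # all rates equal
print(satisfies_detailed_balance(asymmetric))
print(satisfies_detailed_balance(symmetric))
print(np.round(stationary_distribution(asymmetric), 4))
```

When all rates are equal the fluxes balance pairwise; raising only the C-to-U rate breaks that balance, which is the sense in which the model is time-irreversible.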

In this study, identical substitution rates are assumed in equation (3) for all types of substitutions except C-to-U. It is possible to incorporate a more complex model. Let us assume that u and v are the transition and transversion rates, respectively. The difference in rates between transitions and transversions can be taken into account by modifying equation (3) as follows:

graphic file with name M00065.gif (20)
graphic file with name M00066.gif (21)

where

graphic file with name M00067.gif (22)

Thus, we obtain substitution matrix Inline graphic by:

graphic file with name M00069.gif (23)

Equation (23) is, however, difficult to use for estimating h, u and v. Previous studies showed that the rate of G-to-U is higher than that of other transversions in SARS-CoV-2 (17,18). Further study is needed to refine the model of the evolution of the SARS-CoV-2 genome.

Nucleotide substitution rates of the SARS-CoV-2 genome

Using the new time-irreversible model of nucleotide substitutions proposed in this study, nucleotide substitution rates were estimated. The results suggest that the C-to-U substitution rate is more than 10 times higher than the rates of other types of substitutions (Table 2). Hoshino et al. used the general time-reversible model with invariable sites and gamma distribution among site rate variation (GTR + G + I) as a nucleotide substitution model. The estimated mean substitution rate was Inline graphic substitutions per site per year (95% highest posterior density interval, Inline graphic) (19). This estimate was lower than the C-to-U substitution rate and higher than the non-C-to-U substitution rate estimated by the proposed time-irreversible model.
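As a worked check using the point estimates reported in Table 2 (ignoring their standard deviations), the ratio of the two rates is

$$\frac{1.95 \times 10^{-3}}{1.48 \times 10^{-4}} \approx 13.2$$

so the C-to-U rate is roughly 13 times the rate of the other substitution types.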

In this study, a simple algorithm for estimating the substitution rates using the diagonalization method is presented. The nucleotide substitution rates for the new model can be calculated as easily as with the traditional time-reversible models because the diagonalization method, which is often used for time-reversible models such as the Hasegawa–Kishino–Yano model (9) and the general time-reversible model (11), can also be applied to the new model. To validate the new model, the number of nucleotide substitutions in genomic sequences of SARS-CoV-2 registered in the GISAID database that have been sampled from all over the world was analysed.

The changes in nucleotide contents differ among continents, as evidenced by Supplementary Figures S1–S6. However, the differences might be due to errors arising from the limited sample size, especially in Africa and Oceania. As shown in Table 1, sample sizes differ substantially among continents.

Amino acid changes and natural selection of SARS-CoV-2 genes

It is widely known that mutational asymmetries affect amino acid substitutions. Jordan et al. found similar trends in amino acid changes across 15 taxonomic groups representing bacteria, archaea, and eukaryotes (20). Misawa et al. showed that these trends are mainly caused by CpG hypermutability (21). The C-to-U substitutions in SARS-CoV-2 genomes are caused by host RNA editing enzymes, such as the APOBEC family of cytidine deaminases (22–24). The C-to-U hypermutation of the SARS-CoV-2 genome will increase the number of hydrophobic amino acids in the virus proteins, because the codons of the four most hydrophobic amino acids (phenylalanine, isoleucine, leucine and valine) contain a U in the first or second position, whereas the codons of the most polar amino acids (asparagine, aspartic acid, arginine, glutamine, glutamic acid and lysine) do not contain a U in the first or second position (3,25) (see the codon table in Figure S12). The model presented in this study suggests that the number of Cs is decreasing in the SARS-CoV-2 genome while that of Us is increasing, indicating that nucleotide frequencies have not reached equilibrium. Evolutionary studies of the SARS-CoV-2 genome must be continued to predict the future course of the COVID-19 pandemic.
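The codon argument can be checked directly against the standard genetic code. In the sketch below the polar residues are taken to be N, D, R, Q, E and K (an assumption about which six amino acids are meant), and the helper names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons_for(aa):
    """All codons that encode the one-letter amino acid `aa`."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]

def has_u_in_first_two(codon):
    """True when the first or second codon position is U."""
    return "U" in codon[:2]

hydrophobic = "FILV"   # phenylalanine, isoleucine, leucine, valine
polar = "NDRQEK"       # assumed set of polar residues

for aa in hydrophobic:
    print(aa, all(has_u_in_first_two(c) for c in codons_for(aa)))
for aa in polar:
    print(aa, any(has_u_in_first_two(c) for c in codons_for(aa)))
```

Every codon of the four hydrophobic residues carries a U in its first or second position, while no codon of the six polar residues does, so a C-to-U change at those positions can only move a codon towards the hydrophobic set.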

Table 4 shows that the $d_N/d_S$ ratio is below one for every gene except S; across all genes, the ratio is 0.47. The S gene encodes the spike protein of SARS-CoV-2, which is believed to undergo natural selection (26). As shown in Table 4, the spike protein, which contains 1273 amino acids, accounts for roughly 13% of the 9743 amino acids encoded by the genome. Hence, the predominant global trend of nucleotide variation cannot be attributed to neutral evolution (27).

Limitations of the proposed method

It should be noted that the newly proposed method in this study may have limited applicability to RNA viruses that replicate through RNA-dependent RNA polymerases, as the C-to-U substitutions observed in SARS-CoV-2 genomes are primarily attributed to host RNA editing enzymes, such as the APOBEC family of cytidine deaminases. Additionally, in the analysis of the SARS-CoV-2 genome, the ancestral state is known. However, in cases where the ancestral state is unknown, it becomes necessary to estimate the state. Future studies are warranted to gain further insights into the evolutionary dynamics of SARS-CoV-2.

Acknowledgements

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We thank Dr Nao Nishida, Dr Naoko Fujito and Dr Naoki Osada for their useful comments and discussions. We thank Margaret Biswas, PhD, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.

Author contributions: Kazuharu Misawa: Conceptualization, Formal analysis, Methodology, Validation, Writing – original draft. Ryo Ootsuki: Formal analysis, Visualization, Writing – review & editing.

Contributor Information

Kazuharu Misawa, Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama 236-0004, Japan; RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.

Ryo Ootsuki, Department of Natural Sciences, Faculty of Arts and Sciences, 1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan; Department of Chemical and Biological Sciences, Faculty of Science, Japan Women's University, 2-8-1 Mejirodai, Bunkyo-ku, Tokyo 112-8681, Japan.

Data availability

All sequence data used in this study can be downloaded from the GISAID database (https://www.gisaid.org/). All Python code and lists of GISAID accession numbers of virus sequences used in this study are available on GitHub (https://github.com/kazumisawa/coronavirusEvolution) and FigShare (https://doi.org/10.6084/m9.figshare.23691411).

Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in Africa, Asia, Europe, North America, Oceania, and South America, respectively. Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in several of the dominant strains, namely, Alpha, Beta, Gamma, Delta, and Omicron, respectively. Figure S12 shows the standard codon table.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT in order to improve English. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Supplementary data

Supplementary Data are available at NARGAB Online.

Funding

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers JP17K08682, JP19K22647, JP20K07316 to K.M.

Conflict of interest statement. None declared.

References

  • 1. Wang D., Hu B., Hu C., Zhu F., Liu X., Zhang J., Wang B., Xiang H., Cheng Z., Xiong Y. et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA. 2020; 323:1061–1069.
  • 2. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G., Hu Y., Tao Z.W., Tian J.H., Pei Y.Y. et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020; 579:265–269.
  • 3. Simmonds P. Rampant C→U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: causes and consequences for their short- and long-term evolutionary trajectories. mSphere. 2020; 5:e00408-20.
  • 4. Iwasaki Y., Abe T., Ikemura T. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol. 2021; 21:89.
  • 5. Kim K., Calabrese P., Wang S., Qin C., Rao Y., Feng P., Chen X.S. The roles of APOBEC-mediated RNA editing in SARS-CoV-2 mutations, replication and fitness. Sci. Rep. 2022; 12:14972.
  • 6. Nakata Y., Ode H., Kubota M., Kasahara T., Matsuoka K., Sugimoto A., Imahashi M., Yokomaku Y., Iwatani Y. Cellular APOBEC3A deaminase drives mutations in the SARS-CoV-2 genome. Nucleic Acids Res. 2023; 51:783–795.
  • 7. Jukes T.H., Cantor C.R. Evolution of protein molecules. In: Munro H.N. (ed.) Mammalian Protein Metabolism. 1969; NY: Academic Press.
  • 8. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980; 16:111–120.
  • 9. Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985; 22:160–174.
  • 10. Tamura K., Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993; 10:512–526.
  • 11. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 1986; 17:57–86.
  • 12. Boussau B., Gouy M. Efficient likelihood computations with nonreversible models of evolution. Syst. Biol. 2006; 55:756–768.
  • 13. Jayaswal V., Jermiin L.S., Poladian L., Robinson J. Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Syst. Biol. 2011; 60:74–86.
  • 14. Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 2017; 1:33–46.
  • 15. Katoh K., Misawa K., Kuma K., Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002; 30:3059–3066.
  • 16. Nei M., Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 1986; 3:418–426.
  • 17. Azgari C., Kilinc Z., Turhan B., Circi D., Adebali O. The mutation profile of SARS-CoV-2 is primarily shaped by the host antiviral defense. Viruses. 2021; 13:394.
  • 18. Forni D., Cagliani R., Pontremoli C., Clerici M., Sironi M. The substitution spectra of coronavirus genomes. Brief Bioinform. 2022; 23:bbab382.
  • 19. Hoshino K., Maeshiro T., Nishida N., Sugiyama M., Fujita J., Gojobori T., Mizokami M. Transmission dynamics of SARS-CoV-2 on the Diamond Princess uncovered using viral genome sequence analysis. Gene. 2021; 779:145496.
  • 20. Jordan I.K., Kondrashov F.A., Adzhubei I.A., Wolf Y.I., Koonin E.V., Kondrashov A.S., Sunyaev S. A universal trend of amino acid gain and loss in protein evolution. Nature. 2005; 433:633–638.
  • 21. Misawa K., Kamatani N., Kikuno R.F. The universal trend of amino acid gain-loss is caused by CpG hypermutability. J. Mol. Evol. 2008; 67:334–342.
  • 22. Bishop K.N., Holmes R.K., Sheehy A.M., Malim M.H. APOBEC-mediated editing of viral RNA. Science. 2004; 305:645.
  • 23. Kosuge M., Furusawa-Nishii E., Ito K., Saito Y., Ogasawara K. Point mutation bias in SARS-CoV-2 variants results in increased ability to stimulate inflammatory responses. Sci. Rep. 2020; 10:17766.
  • 24. Ratcliff J., Simmonds P. Potential APOBEC-mediated RNA editing of the genomes of SARS-CoV-2 and other coronaviruses and its impact on their longer term evolution. Virology. 2021; 556:62–72.
  • 25. Matyášek R., Řehůřková K., Berta Marošiová K., Kovařík A. Mutational asymmetries in the SARS-CoV-2 genome may lead to increased hydrophobicity of virus proteins. Genes (Basel). 2021; 12:826.
  • 26. Lopez-Cortes G.I., Palacios-Perez M., Zamudio G.S., Velediaz H.F., Ortega E., Jose M.V. Neutral evolution test of the spike protein of SARS-CoV-2 and its implications in the binding to ACE2. Sci. Rep. 2021; 11:18847.
  • 27. Frost S.D.W., Magalis B.R., Kosakovsky Pond S.L. Neutral theory and rapidly evolving viral pathogens. Mol. Biol. Evol. 2018; 35:1348–1354.



Articles from NAR Genomics and Bioinformatics are provided here courtesy of Oxford University Press
