Abstract
SARS-CoV-2 is the cause of the current worldwide pandemic of severe acute respiratory syndrome. Changes in the nucleotide composition of the SARS-CoV-2 genome are crucial for understanding the spread and transmission dynamics of the virus because viral nucleotide sequences are essential for identifying viral strains. Recent studies have shown that cytosine (C) to uracil (U) substitutions are overrepresented in SARS-CoV-2 genome sequences. This asymmetry between C-to-U and U-to-C substitutions indicates that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences. We therefore developed a new time-irreversible model of nucleotide substitutions to estimate the substitution rates in SARS-CoV-2 genomes. We investigated the number of nucleotide substitutions among the 7862 genomic sequences of SARS-CoV-2 registered in the Global Initiative on Sharing All Influenza Data (GISAID) that were sampled from all over the world. Using the new method, the C-to-U substitution rate of SARS-CoV-2 was estimated to be 1.95 × 10⁻³ ± 4.88 × 10⁻⁴ per site per year, compared with 1.48 × 10⁻⁴ ± 7.42 × 10⁻⁵ per site per year for all other types of substitutions.
Introduction
Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is an RNA virus that has spread globally and is the cause of the current COVID-19 pandemic (1,2). The study of the molecular evolution of SARS-CoV-2 is important because it provides a better understanding of the dynamics of virus spread and transmission. Understanding molecular evolution is essential for developing effective vaccines and therapeutic approaches and for identifying viral strains. In addition, continuous monitoring of the evolution of the virus will contribute to the implementation of surveillance strategies and long-term preparedness against the disease. The main objective of this study is to propose a method for predicting nucleotide changes in SARS-CoV-2 genomes.
Genomic analyses of SARS-CoV-2 have demonstrated that 50% of the sequence mutations are cytosine-to-uracil (C-to-U) transitions, with an 8-fold directional asymmetry between C-to-U and U-to-C substitutions (3–6). Many time-reversible nucleotide substitution models are available, including the Jukes–Cantor model (7), the Kimura 2-parameter model (8), the Hasegawa–Kishino–Yano model (9), the Tamura–Nei model (10) and the general time-reversible model (11). However, the asymmetry between C-to-U and U-to-C substitutions indicates that these traditional time-reversible models cannot be applied to the evolution of SARS-CoV-2 sequences. The molecular evolution of SARS-CoV-2 can be studied more effectively using time-irreversible models (12,13). Previous studies of time-irreversible models have used iterative approaches, such as the Newton–Raphson method, for estimating the substitution rates. Iterative methods are time-consuming because the calculation must be repeated until the estimates converge.
Here, we present a new time-irreversible model of nucleotide substitutions to estimate the substitution rates in SARS-CoV-2 genomes, together with a simple algorithm for estimating the substitution rates by using the diagonalization method. The diagonalization method is often used for time-reversible models, such as the Hasegawa–Kishino–Yano model (9) and the general time-reversible model (11). To verify the new model, we investigated the number of nucleotide substitutions in genomic sequences of SARS-CoV-2 registered in the Global Initiative on Sharing All Influenza Data (GISAID) (14) that were sampled from all over the world.
Materials and methods
Definition of substitution rate matrix
In this study, the process of nucleotide substitution is considered as a continuous-time Markov process. The four RNA bases C, U, G and A are designated as 1, 2, 3 and 4, respectively, and $P_{ij}(t)$ is the probability that nucleotide $i$ is substituted by $j$ in time period $t$. $\mathbf{P}(t)$ is a matrix in which the $ij$th element is $P_{ij}(t)$. $\mathbf{P}(t)$ satisfies the following Chapman–Kolmogorov equation:
![]() |
(1) |
Thus, equation (1) is obtained as
![]() |
(2) |
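For orientation, the standard relations for a time-homogeneous continuous-time Markov chain can be written as below. This is a textbook sketch rather than a transcription of equations (1) and (2): the symbols $\mathbf{P}(t)$ for the substitution probability matrix and $\mathbf{Q}$ for the substitution rate matrix are notation assumed here and may differ from the authors' own.

```latex
\begin{align}
\mathbf{P}(s+t) &= \mathbf{P}(s)\,\mathbf{P}(t), \qquad \mathbf{P}(0) = \mathbf{I},\\
\frac{d\mathbf{P}(t)}{dt} &= \mathbf{P}(t)\,\mathbf{Q}, \qquad
\mathbf{Q} = \left.\frac{d\mathbf{P}(t)}{dt}\right|_{t=0},\\
\mathbf{P}(t) &= e^{\mathbf{Q}t} = \sum_{n=0}^{\infty}\frac{(\mathbf{Q}t)^{n}}{n!}.
\end{align}
```

The first line is the Chapman–Kolmogorov relation, the second is the forward equation that follows from it, and the third is its solution as a matrix exponential.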
Time-irreversible model
Here, a new time-irreversible model is proposed for SARS-CoV-2 evolution. To model the
directional asymmetry between C-to-U and U-to-C substitutions, a matrix,
, was created, where
is the C-to-U substitution rate and
is the rate of other types of nucleotide
substitutions.
![]() |
(3) |
is a derivative of the substitution probability matrix with respect to
time
(3).
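The two-rate structure described above can be illustrated with a minimal Python sketch. The layout of the matrix is an assumption made for illustration: states are ordered C, U, G, A as in the text, C-to-U occurs at a rate called `h` here, and every other change occurs at a single rate called `a` (both names are hypothetical). The substitution probabilities are obtained with a truncated Taylor series and repeated squaring, which is a generic numerical route and not the diagonalization used by the authors.

```python
# Assumed state order, following the text: C, U, G, A (indices 0..3).
BASES = "CUGA"

def rate_matrix(h, a):
    """Assumed rate matrix: C->U occurs at rate h, every other change at rate a."""
    q = [[a] * 4 for _ in range(4)]
    q[0][1] = h                      # C -> U is the only distinct rate
    for i in range(4):
        q[i][i] = 0.0
        q[i][i] = -sum(q[i])         # each row of a rate matrix sums to zero
    return q

def matmul(x, y):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probabilities(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t), by a truncated Taylor series plus repeated squaring."""
    scale = t / (2 ** squarings)
    step = [[q[i][j] * scale for j in range(4)] for i in range(4)]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms + 1):
        term = matmul(term, step)                       # step^n / (n-1)!
        term = [[v / n for v in row] for row in term]   # step^n / n!
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    for _ in range(squarings):
        result = matmul(result, result)                 # undo the scaling
    return result

q = rate_matrix(h=1.95e-3, a=1.48e-4)    # rates from Table 2, per site per year
p = transition_probabilities(q, t=2.0)   # two years
print("P(C->U) =", p[0][1], " P(U->C) =", p[1][0])
```

Because `q[0][1]` and `q[1][0]` differ, the C-to-U and U-to-C probabilities differ as well, which is the asymmetry the model is meant to capture.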
Computing the powers of the substitution rate matrix by diagonalization
The substitution rate matrix
defined by equation (3) can be diagonalized as
![]() |
(4) |
where
![]() |
(5) |
From equation (4),
is obtained as
![]() |
(6) |
A probability matrix,
, is obtained by
![]() |
(7) |
Thus,
![]() |
Using equation (7),
can be obtained by
![]() |
(8) |
where
is defined by
![]() |
(9) |
Finally,
can be calculated by
![]() |
(10) |
where
![]() |
(11) |
Estimation of nucleotide substitution rates
Notably,
![]() |
(12) |
where
and
can be estimated using
equation (6) and by solving simultaneous
equations (11) and (12). The arithmetic mean is used when multiple
estimates are obtained.
Estimation of nucleotide contents with respect to time
$N_{ij}(t)$ is the observed number of cases where the ancestral nucleotide is $i$ and the derived nucleotide is $j$ in time $t$. $\mathbf{N}(t)$ is a matrix in which the $ij$th element is $N_{ij}(t)$. The expected value of $N_{ij}(t)$ can be obtained by
![]() |
(13) |
where
![]() |
(14) |
and $n_i$ is the number of nucleotides $i$ in the ancestral sequence.
If
is the estimate of
,
can be estimated
by
![]() |
(15) |
Using equation (16) we obtain estimated values for w, x, y and z.
![]() |
(16) |
Applying equation (17) yields the estimated value for a and b.
![]() |
(17) |
Equation (18) provides the estimated value for h.
![]() |
(18) |
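The link between substitution probabilities and predicted nucleotide contents can be sketched in Python. The sketch assumes that the expected count of sites with ancestral nucleotide i and derived nucleotide j is the number of ancestral nucleotides i multiplied by the substitution probability from i to j. The ancestral counts are hypothetical round numbers, the rate names `h` and `a` are illustrative, and the probabilities use a first-order approximation (identity plus rate times time) that is only reasonable when the rates multiplied by the time span are small.

```python
def expected_counts(ancestral_counts, p):
    """E[N_ij] = n_i * P_ij for a given substitution probability matrix p."""
    return [[n_i * p_ij for p_ij in row]
            for n_i, row in zip(ancestral_counts, p)]

def predicted_content(ancestral_counts, p):
    """Expected number of each derived nucleotide: column sums of E[N_ij]."""
    e = expected_counts(ancestral_counts, p)
    return [sum(e[i][j] for i in range(len(e))) for j in range(len(e))]

# Hypothetical ancestral counts in the order C, U, G, A.
ancestral = [5500, 9600, 5900, 8900]

# First-order approximation of P(t) with assumed rates h (C->U) and a (others).
h, a, t = 1.95e-3, 1.48e-4, 2.0
p = [[a * t] * 4 for _ in range(4)]
p[0][1] = h * t
for i in range(4):
    p[i][i] = 0.0
    p[i][i] = 1.0 - sum(p[i])        # each row of P sums to one

content = predicted_content(ancestral, p)
for base, before, after in zip("CUGA", ancestral, content):
    print(base, before, round(after, 1))
```

With these inputs the predicted number of Cs falls and the predicted number of Us rises, while the total number of sites is conserved.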
Confidence intervals of the estimates of the evolutionary rates
To obtain confidence intervals for the estimates of substitution rates, we used the bootstrap method. Bootstrap samples were created by repeatedly resampling, with replacement, sets of virus sequences of the same size as the original set. The statistics were computed for each bootstrap sample to form a bootstrap distribution, and the range from the 0.5th to the 99.5th percentile of this distribution was taken as the 99% confidence interval.
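The percentile bootstrap described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it resamples per-sequence rate estimates and uses their mean as the statistic, and the function name `bootstrap_ci`, the number of replicates and the seed are all assumptions.

```python
import random
import statistics

def bootstrap_ci(values, n_boot=2000, lower=0.5, upper=99.5, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement, keeping the original sample size each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )

    def percentile(q):
        # Pick the sorted bootstrap mean closest to the requested percentile.
        idx = min(len(means) - 1, max(0, round(q / 100 * (len(means) - 1))))
        return means[idx]

    return percentile(lower), percentile(upper)

# Hypothetical per-sequence rate estimates (per site per year).
estimates = [i * 1e-4 for i in range(1, 21)]
low, high = bootstrap_ci(estimates)
print(low, high)
```

Taking the 0.5th and 99.5th percentiles leaves 0.5% of the bootstrap distribution in each tail, which gives the 99% interval.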
Sequence analysis of SARS-CoV-2
To verify the proposed model, the number of nucleotide substitutions in genomic sequences and the changes in nucleotide contents of SARS-CoV-2 were investigated. Genomic sequences of SARS-CoV-2 sampled at six-month intervals from 31 December 2019 to 31 December 2021 were retrieved from the GISAID database (14). Gapped sites were excluded from the analysis. The sequence of the sample taken on 31 December 2019 was assumed to be the ancestral sequence because it is the sequence first identified in Wuhan, China (gisaid_epi_isl: EPI_ISL_402125).
Genomes with >29 000 nucleotides were considered as having complete coverage. Sequences with <0.05% unique amino acid substitutions (i.e. substitutions not seen in other sequences in the database) and no insertions/deletions, unless verified by the submitter, were included in the analysis. Only sequences without undetermined nucleotides (Ns) were used. A pairwise alignment of each genome sequence and the reference sequence was obtained using MAFFT (15), a rapid tool for multiple sequence alignment. The substitution rates were estimated by comparing each of the 1792 sequences sampled on 31 December 2020 (Table 1) with the reference sequence, which gave one estimate per pairwise comparison. The mean of these estimates was taken as the overall estimate, and its standard error and confidence intervals were calculated from the same set of pairwise estimates. The date, region and sample size of the GISAID sequences used in this study are given in Table 1. To avoid sampling bias, we investigated the nucleotide changes in each region independently.
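The pairwise comparison step can be illustrated with a short Python sketch that tallies ancestral and derived nucleotide pairs from two aligned sequences. It is a simplified illustration under stated assumptions: the sequences are already aligned to equal length, gaps and ambiguous bases are skipped, and T in the input is read as U. The function name `count_substitutions` is hypothetical.

```python
from collections import Counter

def count_substitutions(ancestral, derived, alphabet="CUGA"):
    """Count (ancestral, derived) nucleotide pairs over aligned sites."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for x, y in zip(ancestral.upper(), derived.upper()):
        x, y = x.replace("T", "U"), y.replace("T", "U")
        if x in alphabet and y in alphabet:   # skips gaps and ambiguous bases
            counts[(x, y)] += 1
    return counts

# Toy aligned pair: two C-to-U changes and one gapped site.
counts = count_substitutions("ACGT-CC", "ATGTAUC")
print(counts[("C", "U")], counts[("U", "C")])
```

The resulting counts form the observed 4 × 4 table of ancestral versus derived nucleotides for one pairwise comparison.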
Table 1.
Date, region and sample size of the GISAID sequences used in this study (column headings are sampling dates)
| Region | 2019/12/31 | 2020/6/30 | 2020/12/31 | 2021/6/30 | 2021/12/31 | Total |
|---|---|---|---|---|---|---|
| Africa | 0 | 19 | 11 | 13 | 5 | 48 |
| Asia | 1 | 70 | 211 | 334 | 440 | 1056 |
| Europe | 0 | 120 | 887 | 2369 | 1741 | 5117 |
| North America | 0 | 298 | 650 | 1057 | 558 | 2563 |
| Oceania | 0 | 52 | 13 | 11 | 28 | 104 |
| South America | 0 | 25 | 20 | 186 | 32 | 263 |
| Total | 1 | 584 | 1792 | 3970 | 2810 | 9157 |
Synonymous and nonsynonymous changes of SARS-CoV-2 genes
To test whether the trend in nucleotide contents of the SARS-CoV-2 genome is caused by mutational bias or natural selection, the numbers of nonsynonymous and synonymous substitutions per site of the SARS-CoV-2 genes were estimated, because the selective force depends on the function of the protein, which in turn depends on the amino acid sequence. Table 4 shows the numbers of nonsynonymous and synonymous changes per site of the SARS-CoV-2 genes estimated by the Nei–Gojobori (NG86) model (16).
Table 3.
Major strains of SARS-CoV-2
| Strain | Pango lineage |
|---|---|
| Reference | |
| Alpha | B.1.1.7 |
| Beta | B.1.351 |
| Gamma | P.2 |
| S | A.23.1 |
| Omicron | BA.1 |
Results
Estimates of nucleotide substitutions of the SARS-CoV-2 genome
Table 2 shows the estimated substitution rates. The C-to-U substitution rate was estimated as 1.95 × 10⁻³ ± 4.88 × 10⁻⁴ per site per year, and the rate for other types of substitutions was estimated as 1.48 × 10⁻⁴ ± 7.42 × 10⁻⁵ per site per year.
Table 2.
Estimated substitution rate per site per year
| Type | Rate | SD |
|---|---|---|
| Non-C-to-U | 1.48 × 10⁻⁴ | 7.42 × 10⁻⁵ |
| C-to-U | 1.95 × 10⁻³ | 4.88 × 10⁻⁴ |
Changes in the nucleotide contents of the SARS-CoV-2 genome
Bar plots of the changes in C content of the SARS-CoV-2 genomes against sample date are shown in Figure 1. The number of Cs in the SARS-CoV-2 genome decreased over the period from 31 December 2019 to 31 December 2021, indicating that the nucleotide frequencies had not reached equilibrium. Figure 1 also shows bar plots of the changes in U content; the number of Us increased over the same period. In the upper panels of Figure 1, the observed frequencies of U and C on 31 December 2021 differ slightly from the estimated trend line, but these differences are not significant (P > 0.05). The lower panels of Figure 1 show the changes in the numbers of Gs and As, which were almost unchanged throughout the same period. Solid lines in Figure 1 are trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The dotted curves are the 99% confidence intervals of the predicted nucleotide contents. These curves indicate that, all over the world, Cs will decrease almost linearly with time while Us will increase.
Figure 1.
Bar plots of the changes in C (cytosine), U (uracil), G (guanine) and A (adenine) contents of the SARS-CoV-2 genomes against sample date over the period from 31 December 2019 to 31 December 2021. x axis: date; y axis: content. Solid lines represent trend curves of changes in nucleotide contents predicted by the new time-irreversible model. Dotted curves represent the 99% confidence intervals of the predicted nucleotide contents.
Global trends in nucleotide substitution rates of the SARS-CoV-2 genome
Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes against sample date in Africa, Asia, Europe, North America, Oceania and South America, respectively. Solid lines in Supplementary Figures S1–S6 show trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The same trend of nucleotide substitutions was observed in all regions.
Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes against sample date in several of the dominant strains, namely Alpha, Beta, Gamma, Delta and Omicron, respectively. Solid lines in Supplementary Figures S7–S11 are trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The results indicate that the trend of nucleotide changes is a global pattern across the SARS-CoV-2 strains examined.
Synonymous and nonsynonymous changes of SARS-CoV-2 genes
Table 4 shows the number of nonsynonymous and synonymous substitutions per site of the SARS-CoV-2 genes estimated by the Nei and Gojobori model (16). In this table, $d_N$ indicates the number of nonsynonymous substitutions per site and $d_S$ indicates the number of synonymous substitutions per site. Except for S, the $d_N/d_S$ ratio is smaller than one. In total, the $d_N/d_S$ ratio is 0.47.
Table 4.
Synonymous and nonsynonymous substitutions
| CDS | Start | End | $d_N$ × 1000 | $d_S$ × 1000 | $d_N/d_S$ | Length (amino acids) |
|---|---|---|---|---|---|---|
| ORF1a | 265 | 13468 | 0.57 | 1.36 | 0.42 | 4400 |
| ORF1b | 13467 | 21555 | 0.54 | 1.15 | 0.47 | 2695 |
| S | 21562 | 25384 | 1.71 | 1.51 | 1.13 | 1273 |
| ORF3a | 25392 | 26220 | 2.23 | 5.93 | 0.38 | 275 |
| E | 26244 | 26472 | 5.98 | 19.50 | 0.31 | 75 |
| M | 26522 | 27191 | 2.35 | 6.88 | 0.34 | 222 |
| ORF6 | 27201 | 27387 | 13.54 | 54.40 | 0.25 | 61 |
| ORF7a | 27393 | 27759 | 8.74 | 11.05 | 0.79 | 121 |
| ORF7b | 27755 | 27887 | 10.04 | 38.74 | 0.26 | 43 |
| ORF8 | 27893 | 28259 | 3.95 | 13.60 | 0.29 | 121 |
| N | 28273 | 29533 | 4.41 | 4.87 | 0.91 | 419 |
| ORF10 | 29557 | 29674 | 11.17 | 43.22 | 0.26 | 38 |
| Total | | | 1.31 | 2.80 | 0.47 | 9743 |
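The ratios in Table 4 can be re-derived from the two rate columns with a few lines of Python. The values are copied from the table; the variable names are illustrative, and the check only confirms the arithmetic, not the underlying estimates.

```python
# (gene, nonsynonymous x 1000, synonymous x 1000, length in amino acids)
table4 = [
    ("ORF1a", 0.57, 1.36, 4400), ("ORF1b", 0.54, 1.15, 2695),
    ("S", 1.71, 1.51, 1273), ("ORF3a", 2.23, 5.93, 275),
    ("E", 5.98, 19.50, 75), ("M", 2.35, 6.88, 222),
    ("ORF6", 13.54, 54.40, 61), ("ORF7a", 8.74, 11.05, 121),
    ("ORF7b", 10.04, 38.74, 43), ("ORF8", 3.95, 13.60, 121),
    ("N", 4.41, 4.87, 419), ("ORF10", 11.17, 43.22, 38),
]

# Ratio of nonsynonymous to synonymous substitutions per site, per gene.
ratios = {gene: dn / ds for gene, dn, ds, _ in table4}
above_one = [gene for gene, r in ratios.items() if r > 1]

# Share of all encoded amino acids that belong to the spike protein.
total_length = sum(length for *_, length in table4)
spike_share = 1273 / total_length

for gene, r in ratios.items():
    print(f"{gene:6s} {r:.2f}")
print("ratio above one:", above_one)
print(f"spike share of encoded amino acids: {spike_share:.1%}")
```

Only S has a ratio above one, and the gene lengths add up to the 9743 amino acids given in the Total row.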
Discussion
In this study, we proposed a method for predicting nucleotide changes in SARS-CoV-2 genomes. The results shown in Figure 1 and Supplementary Figures S1–S6 demonstrate persistent changes in nucleotide frequencies in the SARS-CoV-2 genome. In addition, the analysis presented in Figure 1 and Supplementary Figures S1–S6 showed that the high C-to-U substitution rate is not limited to any one continent but is found worldwide. Sequence analyses of SARS-CoV-2 revealed that the nucleotide composition estimated by our method was consistent with the observed changes in nucleotide composition.
The proposed method is based on the time-irreversible model described in equation (3). When a stationary distribution of nucleotide content exists and the detailed balance condition described in equation (19) is satisfied for all $i$ and $j$ in the stationary state, the process is time reversible (3).
![]() |
(19) |
These asymmetric substitutions between C and U indicate that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences.
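The reversibility argument can be made concrete with a short worked check. The notation is assumed for illustration: $\pi_i$ is the stationary frequency of nucleotide $i$, $q_{ij}$ is the substitution rate from $i$ to $j$, $h$ is the C-to-U rate and $a$ is the common rate of all other substitutions. The second line uses Kolmogorov's criterion, which states that a chain is reversible only if the product of rates around every closed cycle is the same in both directions.

```latex
\begin{align}
\pi_i\, q_{ij} &= \pi_j\, q_{ji} \quad \text{for all } i \neq j
  && \text{(detailed balance)}\\
q_{CU}\, q_{UG}\, q_{GC} &= h\,a^{2}, \qquad
q_{CG}\, q_{GU}\, q_{UC} = a^{3}
  && \text{(cycle through C, U, G in both directions)}
\end{align}
```

Under these assumptions the two cycle products agree only when $h = a$, so a model with an elevated C-to-U rate cannot satisfy detailed balance.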
In this study, identical substitution rates are assumed in equation (3) for all substitution types except C-to-U. A more complex model can also be considered. Let u and v be the transition and transversion rates, respectively. The difference in rates between transitions and transversions can be taken into account by modifying equation (3) as follows:
![]() |
(20) |
![]() |
(21) |
where
![]() |
(22) |
Thus, we obtain substitution matrix
by:
![]() |
(23) |
However, equation (23) is difficult to use for estimating h, u and v. Previous studies showed that the rate of G-to-U is higher than that of other transversions in SARS-CoV-2 (17,18). Further study is needed to refine the model of the evolution of the SARS-CoV-2 genome.
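One way to picture the extended model is the following Python sketch of a three-rate matrix. The layout is an assumption made for illustration and is not a transcription of the authors' equations: states are ordered C, U, G, A; C-to-U has its own rate `h`; the other transitions (U-to-C, G-to-A and A-to-G) share rate `u`; and all transversions share rate `v`.

```python
def extended_rate_matrix(h, u, v):
    """Assumed layout: C->U at rate h, other transitions at u, transversions at v."""
    order = "CUGA"
    transitions = {("C", "U"), ("U", "C"), ("G", "A"), ("A", "G")}
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(order):
        for j, y in enumerate(order):
            if i != j:
                q[i][j] = u if (x, y) in transitions else v
    q[0][1] = h                      # C -> U overrides the shared transition rate
    for i in range(4):
        q[i][i] = -sum(q[i])         # rows of a rate matrix sum to zero
    return q

# Hypothetical rates, chosen only to show the structure.
q = extended_rate_matrix(h=2e-3, u=3e-4, v=1e-4)
for base, row in zip("CUGA", q):
    print(base, row)
```

Setting `u` equal to `v` recovers the two-rate structure, which gives a quick consistency check on any implementation of the extended model.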
Nucleotide substitution rates of the SARS-CoV-2 genome
Using the new time-irreversible model of nucleotide substitutions proposed in this study, nucleotide substitution rates were estimated. The results suggest that the C-to-U substitution rate is more than 10 times higher than the rates of other types of substitutions (roughly 13-fold: 1.95 × 10⁻³ versus 1.48 × 10⁻⁴ per site per year).
Hoshino et al. used the general time-reversible model with invariable
sites and gamma distribution among site rate variation (GTR + G + I) as a nucleotide
substitution model. The estimated mean substitution rate was
substitutions per
site per year (95% highest posterior density interval,
)
(19). This estimate was lower than the C-to-U substitution rate and higher than the non-C-to-U substitution rate estimated by the proposed time-irreversible model.
In this study, a simple algorithm for estimating the substitution rates using the diagonalization method is presented. The diagonalization method is often used for time-reversible models, such as the Hasegawa–Kishino–Yano model (9) and the general time-reversible model (11). Because the method can also be applied to the new model, the nucleotide substitution rates can be calculated as easily as with a traditional time-reversible model. To validate the new model, the number of nucleotide substitutions in genomic sequences of SARS-CoV-2 registered in the GISAID database and sampled from all over the world was analysed.
The changes in nucleotide contents differ among continents, as evidenced by Supplementary Figures S1–S6. However, the difference might be due to sampling error arising from the limited sample sizes, especially in Africa and Oceania. As shown in Table 1, sample sizes differ substantially among continents.
Amino acid changes and natural selection of SARS-CoV-2 genes
It is widely known that mutational asymmetries affect amino acid substitutions. Jordan et al. found similar trends in amino acid changes across 15 taxonomic groups representing bacteria, archaea and eukaryotes (20). Misawa et al. showed that these trends are mainly caused by CpG hypermutability (21). The C-to-U substitutions in SARS-CoV-2 genomes are caused by host RNA editing enzymes, such as the APOBEC family of cytidine deaminases (22–24). The C-to-U hypermutation of the SARS-CoV-2 genome will increase the number of hydrophobic amino acids in the virus proteins, because the codons of the four most hydrophobic amino acids (phenylalanine, isoleucine, leucine and valine) contain a U in the first or second position, whereas the codons of the most polar amino acids (asparagine, aspartic acid, arginine, glutamine, glutamic acid and lysine) do not contain a U in the first or second position (3,25) (see the codon table in Figure S12). The model presented in this study suggests that the number of Cs is decreasing in the SARS-CoV-2 genome while that of Us is increasing, indicating that nucleotide frequencies have not reached equilibrium. Evolutionary studies of the SARS-CoV-2 genome must be continued to predict the future course of the COVID-19 pandemic.
Table 4 shows that the $d_N/d_S$ ratio is below one for all genes except S. In total, the $d_N/d_S$ ratio is 0.47. The S gene encodes the spike protein of SARS-CoV-2, which is believed to be under natural selection (26). As shown in Table 4, the spike protein, which contains 1273 amino acids, accounts for roughly 13% of the 9743 amino acids encoded by the genome. Hence, the predominant global trend of nucleotide variation is unlikely to be driven by natural selection on the spike protein and is more consistent with mutational bias under largely neutral evolution (27).
Limitations of the proposed method
It should be noted that the newly proposed method in this study may have limited applicability to RNA viruses that replicate through RNA-dependent RNA polymerases, as the C-to-U substitutions observed in SARS-CoV-2 genomes are primarily attributed to host RNA editing enzymes, such as the APOBEC family of cytidine deaminases. Additionally, in the analysis of the SARS-CoV-2 genome, the ancestral state is known. However, in cases where the ancestral state is unknown, it becomes necessary to estimate the state. Future studies are warranted to gain further insights into the evolutionary dynamics of SARS-CoV-2.
Acknowledgements
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We thank Dr Nao Nishida, Dr Naoko Fujito and Dr Naoki Osada for their useful comments and discussions. We thank Margaret Biswas, PhD, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.
Author contributions: Kazuharu Misawa: Conceptualization, Formal analysis, Methodology, Validation, Writing – original draft. Ryo Ootsuki: Formal analysis, Visualization, Writing – review & editing.
Contributor Information
Kazuharu Misawa, Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama 236-0004, Japan; RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.
Ryo Ootsuki, Department of Natural Sciences, Faculty of Arts and Sciences, 1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan; Department of Chemical and Biological Sciences, Faculty of Science, Japan Women's University, 2-8-1 Mejirodai, Bunkyo-ku, Tokyo 112-8681, Japan.
Data availability
All sequence data used in this study can be downloaded from the GISAID database (https://www.gisaid.org/). All Python code and lists of GISAID accession numbers of the virus sequences used in this study are available on GitHub (https://github.com/kazumisawa/coronavirusEvolution) and FigShare (https://doi.org/10.6084/m9.figshare.23691411).
Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes against sample date in Africa, Asia, Europe, North America, Oceania and South America, respectively. Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes against sample date in several of the dominant strains, namely Alpha, Beta, Gamma, Delta and Omicron, respectively. Figure S12 shows the standard codon table.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work the authors used ChatGPT in order to improve English. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Supplementary data
Supplementary Data are available at NARGAB Online.
Funding
This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers JP17K08682, JP19K22647, JP20K07316 to K.M.
Conflict of interest statement. None declared.
References
- 1. Wang D., Hu B., Hu C., Zhu F., Liu X., Zhang J., Wang B., Xiang H., Cheng Z., Xiong Y. et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA. 2020; 323:1061–1069.
- 2. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G., Hu Y., Tao Z.W., Tian J.H., Pei Y.Y. et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020; 579:265–269.
- 3. Simmonds P. Rampant C→U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: causes and consequences for their short- and long-term evolutionary trajectories. mSphere. 2020; 5:e00408-20.
- 4. Iwasaki Y., Abe T., Ikemura T. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol. 2021; 21:89.
- 5. Kim K., Calabrese P., Wang S., Qin C., Rao Y., Feng P., Chen X.S. The roles of APOBEC-mediated RNA editing in SARS-CoV-2 mutations, replication and fitness. Sci. Rep. 2022; 12:14972.
- 6. Nakata Y., Ode H., Kubota M., Kasahara T., Matsuoka K., Sugimoto A., Imahashi M., Yokomaku Y., Iwatani Y. Cellular APOBEC3A deaminase drives mutations in the SARS-CoV-2 genome. Nucleic Acids Res. 2023; 51:783–795.
- 7. Jukes T.H., Cantor C.R. Evolution of protein molecules. In: Munro H.N. (ed.) Mammalian Protein Metabolism. 1969; NY: Academic Press.
- 8. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980; 16:111–120.
- 9. Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985; 22:160–174.
- 10. Tamura K., Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993; 10:512–526.
- 11. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 1986; 17:57–86.
- 12. Boussau B., Gouy M. Efficient likelihood computations with nonreversible models of evolution. Syst. Biol. 2006; 55:756–768.
- 13. Jayaswal V., Jermiin L.S., Poladian L., Robinson J. Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Syst. Biol. 2011; 60:74–86.
- 14. Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 2017; 1:33–46.
- 15. Katoh K., Misawa K., Kuma K., Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002; 30:3059–3066.
- 16. Nei M., Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 1986; 3:418–426.
- 17. Azgari C., Kilinc Z., Turhan B., Circi D., Adebali O. The mutation profile of SARS-CoV-2 is primarily shaped by the host antiviral defense. Viruses. 2021; 13:394.
- 18. Forni D., Cagliani R., Pontremoli C., Clerici M., Sironi M. The substitution spectra of coronavirus genomes. Brief Bioinform. 2022; 23:bbab382.
- 19. Hoshino K., Maeshiro T., Nishida N., Sugiyama M., Fujita J., Gojobori T., Mizokami M. Transmission dynamics of SARS-CoV-2 on the Diamond Princess uncovered using viral genome sequence analysis. Gene. 2021; 779:145496.
- 20. Jordan I.K., Kondrashov F.A., Adzhubei I.A., Wolf Y.I., Koonin E.V., Kondrashov A.S., Sunyaev S. A universal trend of amino acid gain and loss in protein evolution. Nature. 2005; 433:633–638.
- 21. Misawa K., Kamatani N., Kikuno R.F. The universal trend of amino acid gain-loss is caused by CpG hypermutability. J. Mol. Evol. 2008; 67:334–342.
- 22. Bishop K.N., Holmes R.K., Sheehy A.M., Malim M.H. APOBEC-mediated editing of viral RNA. Science. 2004; 305:645.
- 23. Kosuge M., Furusawa-Nishii E., Ito K., Saito Y., Ogasawara K. Point mutation bias in SARS-CoV-2 variants results in increased ability to stimulate inflammatory responses. Sci. Rep. 2020; 10:17766.
- 24. Ratcliff J., Simmonds P. Potential APOBEC-mediated RNA editing of the genomes of SARS-CoV-2 and other coronaviruses and its impact on their longer term evolution. Virology. 2021; 556:62–72.
- 25. Matyášek R., Řehůřková K., Berta Marošiová K., Kovařík A. Mutational asymmetries in the SARS-CoV-2 genome may lead to increased hydrophobicity of virus proteins. Genes (Basel). 2021; 12:826.
- 26. Lopez-Cortes G.I., Palacios-Perez M., Zamudio G.S., Velediaz H.F., Ortega E., Jose M.V. Neutral evolution test of the spike protein of SARS-CoV-2 and its implications in the binding to ACE2. Sci. Rep. 2021; 11:18847.
- 27. Frost S.D.W., Magalis B.R., Kosakovsky Pond S.L. Neutral theory and rapidly evolving viral pathogens. Mol. Biol. Evol. 2018; 35:1348–1354.