A simple method for estimating time-irreversible nucleotide substitution rates in the SARS-CoV-2 genome

Kazuharu Misawa; Ryo Ootsuki

doi:10.1093/nargab/lqae009

NAR Genom Bioinform. 2024 Feb 2;6(1):lqae009. doi: 10.1093/nargab/lqae009

PMCID: PMC11640943 PMID: 39678027

Abstract

SARS-CoV-2 is the cause of the current worldwide pandemic of severe acute respiratory syndrome. The change of nucleotide composition of the SARS-CoV-2 genome is crucial for understanding the spread and transmission dynamics of the virus because viral nucleotide sequences are essential in identifying viral strains. Recent studies have shown that cytosine (C) to uracil (U) substitutions are overrepresented in SARS-CoV-2 genome sequences. These asymmetric substitutions between C and U indicate that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences. Thus, we develop a new time-irreversible model of nucleotide substitutions to estimate the substitution rates in SARS-CoV-2 genomes. We investigated the number of nucleotide substitutions among the 7862 genomic sequences of SARS-CoV-2 registered in the Global Initiative on Sharing All Influenza Data (GISAID) that have been sampled from all over the world. Using the new method, the substitution rates in SARS-CoV-2 genomes were estimated. The C-to-U substitution rates of SARS-CoV-2 were estimated to be 1.95 × 10⁻³ ± 4.88 × 10⁻⁴ per site per year, compared with 1.48 × 10⁻⁴ ± 7.42 × 10⁻⁵ per site per year for all other types of substitutions.

Introduction

Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is an RNA virus that has spread globally and is the cause of the current COVID-19 pandemic (1,2). The study of the molecular evolution of SARS-CoV-2 is important as it provides a better understanding of the dynamics of virus spread and transmission. Understanding molecular evolution is essential in developing effective vaccines, therapeutic approaches, and identification of viral strains. In addition, continuous surveillance of the evolution of the virus will contribute to the implementation of surveillance strategies and long-term preparedness against the disease. The main objective of this study is to propose a method for predicting nucleotide changes in SARS-CoV-2 genomes.

Genomic analyses of SARS-CoV-2 have demonstrated that 50% of the sequence mutations are cytosine-to-uracil (C-to-U) transitions, with an 8-fold base frequency directional asymmetry between C-to-U and U-to-C substitutions (3–6). Many time-reversible nucleotide substitution models are available, including the Jukes-Cantor model (7), the Kimura 2-parameter model (8), the Hasegawa-Kishino-Yano model (9), the Tamura and Nei model (10) and the General Time-Reversible model (11). However, the asymmetric substitutions between C and U indicate that these traditional time-reversible models cannot be applied to the evolution of SARS-CoV-2 sequences. The molecular evolution of SARS-CoV-2 can be studied more effectively using time-irreversible models (12,13) than using time-reversible models. Previous studies of time-irreversible models have used iterative approaches, such as the Newton-Raphson method, to estimate the substitution rates. Iterative methods are time-consuming because the calculations must be repeated until the estimates converge.

Here, we present a new time-irreversible model of nucleotide substitutions to estimate the substitution rates in SARS-CoV-2 genomes, together with a simple algorithm for estimating the substitution rates by using the diagonalization method. The diagonalization method is often used for time-reversible models, such as the Hasegawa–Kishino–Yano model (9) and the general time-reversible model (11). To verify the new model, we investigated the number of nucleotide substitutions in genomic sequences of SARS-CoV-2 registered in the Global Initiative on Sharing All Influenza Data (GISAID) (14) that have been sampled from all over the world.

Materials and methods

Definition of substitution rate matrix

In this study, the process of nucleotide substitution is considered as a continuous-time Markov process. The four RNA bases C, U, G and A are designated as 1, 2, 3 and 4, respectively, and $P_{ij}(t)$ is the probability that nucleotide $i$ is substituted by nucleotide $j$ in time period $t$. $P(t)$ is a matrix in which the $ij$th element is $P_{ij}(t)$.

$P(t)$ satisfies the following Chapman–Kolmogorov equation:

$$P(t+s) = P(t)\,P(s) \tag{1}$$

Thus, equation (1) is obtained as

(2)

Time-irreversible model

Here, a new time-irreversible model is proposed for SARS-CoV-2 evolution. To model the directional asymmetry between C-to-U and U-to-C substitutions, a matrix, $R$, was created, where $h$ is the C-to-U substitution rate and $a$ is the rate of other types of nucleotide substitutions.

$$R = \begin{pmatrix} -(h+2a) & h & a & a \\ a & -3a & a & a \\ a & a & -3a & a \\ a & a & a & -3a \end{pmatrix} \tag{3}$$

$R$ is a derivative of the substitution probability matrix with respect to time (3).
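As a minimal sketch of the rate matrix described above, the following Python snippet builds a 4 × 4 matrix in which the C-to-U entry carries one rate and every other off-diagonal entry carries a second rate. The function name `rate_matrix`, the labels `h` and `a`, and the row-is-original, column-is-replacement layout are assumptions of this sketch; the numerical values are the point estimates reported in Table 2.

```python
BASES = ("C", "U", "G", "A")  # designated 1, 2, 3 and 4 in the text


def rate_matrix(h, a):
    """Return a 4x4 rate matrix: C-to-U occurs at rate h, every other change at rate a."""
    n = len(BASES)
    R = [[a if i != j else 0.0 for j in range(n)] for i in range(n)]
    R[0][1] = h  # C (index 0) -> U (index 1)
    for i in range(n):
        # Each diagonal entry is minus the sum of the other entries in its row,
        # so that every row of the rate matrix sums to zero.
        R[i][i] = -sum(R[i][j] for j in range(n) if j != i)
    return R


R = rate_matrix(h=1.95e-3, a=1.48e-4)  # per site per year, from Table 2
for base, row in zip(BASES, R):
    print(base, ["%.2e" % x for x in row])
```

Setting `h` equal to `a` recovers a fully symmetric matrix, which is one way to see that the asymmetry of the model comes from the single C-to-U entry.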

Computing the powers of the substitution rate matrix by diagonalization

The substitution rate matrix $R$ defined by equation (3) can be diagonalized as

(4)

where

(5)

From equation (4), Inline graphic is obtained as

(6)

A probability matrix, Inline graphic , is obtained by

(7)

Thus,

Using equation (7), Inline graphic can be obtained by

(8)

where Inline graphic is defined by

(9)

Finally, Inline graphic can be calculated by

(10)

where

(11)
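The diagonalization route described in this subsection can be illustrated numerically. The sketch below assumes NumPy is available and uses its general eigen-solver rather than the closed-form expressions of equations (4)–(11), so it is a numerical stand-in and not the authors' algorithm. The names `matrix_power` and `transition_probabilities` are hypothetical, and the rates are the Table 2 point estimates.

```python
import numpy as np

h, a = 1.95e-3, 1.48e-4              # per site per year (Table 2)
R = np.full((4, 4), a)               # base order: C, U, G, A
R[0, 1] = h                          # C -> U
np.fill_diagonal(R, 0.0)
np.fill_diagonal(R, -R.sum(axis=1))  # rows of a rate matrix sum to zero

# Diagonalize: R = V diag(lam) V^{-1}
lam, V = np.linalg.eig(R)
V_inv = np.linalg.inv(V)


def matrix_power(n):
    """R^n computed as V diag(lam^n) V^{-1}."""
    return (V @ np.diag(lam ** n) @ V_inv).real


def transition_probabilities(t):
    """P(t) = exp(R t) computed as V diag(exp(lam t)) V^{-1}."""
    return (V @ np.diag(np.exp(lam * t)) @ V_inv).real


P = transition_probabilities(1.0)    # one year
print(np.round(P, 6))
print("row sums:", P.sum(axis=1))
```

Once the eigenvalues and eigenvectors are known, any power of the rate matrix and any transition probability matrix follow from scalar operations on the eigenvalues, which is why no iterative fitting is needed at this step.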

Estimation of nucleotide substitution rates

Notably,

(12)

where Inline graphic and can be estimated using equation (6) and by solving simultaneous equations (11) and (12). The arithmetic mean is used when multiple estimates are obtained.

Estimation of nucleotide contents with respect to time

$N_{ij}(t)$ is the observed number of cases where the ancestral nucleotide is $i$ and the derived nucleotide is $j$ in time $t$. $N(t)$ is a matrix in which the $ij$th element is $N_{ij}(t)$. The expected value of $N(t)$ can be obtained by

(13)

where

(14)

and Inline graphic is the number of nucleotides in the ancestral sequence.

If Inline graphic is the estimate of , can be estimated by

(15)

Using equation (16) we obtain estimated values for w, x, y and z.

(16)

Applying equation (17) yields the estimated value for a and b.

(17)

Equation (18) provides the estimated value for h.

(18)
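To make the relationship between ancestral counts and expected nucleotide contents concrete, the sketch below multiplies a vector of ancestral base counts by a transition probability matrix. Several things here are assumptions: the ancestral counts in `n0` are hypothetical round numbers, not values from the paper; the transition probabilities are approximated by a truncated power series of exp(Rt) instead of the diagonalization used in the text; and all function names are illustrative.

```python
def rate_matrix(h, a):
    """C-to-U at rate h, all other changes at rate a; base order C, U, G, A."""
    R = [[a] * 4 for _ in range(4)]
    R[0][1] = h
    for i in range(4):
        R[i][i] = 0.0
        R[i][i] = -sum(R[i])  # rows sum to zero
    return R


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transition_probabilities(R, t, terms=30):
    """Approximate P(t) = exp(R t) with a truncated power series."""
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    Rt = [[x * t for x in row] for row in R]
    P = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, Rt)]
        P = [[P[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return P


def expected_composition(ancestral_counts, P):
    """Expected count of each derived base j: sum over ancestral bases i of n_i * P_ij(t)."""
    return [sum(ancestral_counts[i] * P[i][j] for i in range(4)) for j in range(4)]


R = rate_matrix(h=1.95e-3, a=1.48e-4)   # per site per year (Table 2)
n0 = [5500, 9600, 5900, 9000]           # hypothetical ancestral counts of C, U, G, A
for years in (0.0, 1.0, 2.0):
    comp = expected_composition(n0, transition_probabilities(R, years))
    print(years, [round(x, 1) for x in comp])
```

With these inputs the expected C count falls and the expected U count rises over two years, while G and A move by only a few sites, which is the qualitative pattern the model is meant to capture.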

Confidence intervals of the estimates of the evolutionary rates

To obtain confidence intervals for the estimated rates, we used the bootstrap method. Bootstrap samples were generated by repeatedly resampling the virus sequences with replacement, each time drawing a new set of the same size as the original set. The statistics computed from the bootstrap samples form a distribution, and the range from the 0.5th to the 99.5th percentile of this distribution was taken as the 99% confidence interval.
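The percentile bootstrap described above can be sketched as follows. This is an illustration under stated assumptions: the resampled statistic is taken to be the mean of per-sequence rate estimates, the input values are simulated toy numbers and not data from the study, and the names `percentile` and `bootstrap_ci` are hypothetical.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_ci(estimates, n_boot=2000, level=0.99, seed=0):
    """Percentile bootstrap confidence interval for the mean of `estimates`."""
    rng = random.Random(seed)
    n = len(estimates)
    # Resample with replacement, keeping the original sample size each time.
    means = sorted(sum(rng.choices(estimates, k=n)) / n for _ in range(n_boot))
    tail = (1.0 - level) / 2.0 * 100.0  # 0.5 for a 99% interval
    return percentile(means, tail), percentile(means, 100.0 - tail)


# Toy per-sequence estimates (simulated, for illustration only).
toy_rng = random.Random(42)
toy_estimates = [toy_rng.gauss(2.0e-3, 5.0e-4) for _ in range(200)]
low, high = bootstrap_ci(toy_estimates)
print("99%% CI for the mean: %.3e to %.3e" % (low, high))
```

Fixing the seed makes the interval reproducible; increasing `n_boot` stabilizes the extreme percentiles, which matter more for a 99% interval than for a 95% one.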

Sequence analysis of SARS-CoV-2

To verify the proposed model, the number of nucleotide substitutions of genomic sequences and the changes in nucleotide contents of SARS-CoV-2 were investigated. Genomic sequences of SARS-CoV-2 sampled at six-month intervals from 31 December 2019 to 31 December 2021 were retrieved from the GISAID database (14). Gapped sites were excluded from the analysis. The sequence of the sample taken on 31 December 2019 was assumed to be the ancestral sequence, because it is the sequence first identified in Wuhan, China (gisaid_epi_isl: EPI_ISL_402125).

Genomes with >29 000 nucleotides were considered as having complete coverage. Sequences with <0.05% unique amino acid substitutions (i.e. substitutions not seen in other sequences in the database) and no insertions/deletions, unless verified by the submitter, were included in the analysis. Only sequences without undetermined nucleotides (Ns) were used. A pairwise alignment of each genome sequence and the reference sequence was obtained using MAFFT (15), which is a rapid tool for multiple sequence alignment. The substitution rates were estimated by comparing the sample sequences collected as of 31 December 2020 with the reference sequence. Given that 1792 sequences were collected on that date (as shown in Table 1), estimates were derived from these individual sample comparisons. We calculated the mean value of the estimates to determine the overall estimate and its standard error. The confidence intervals of the overall estimates were determined using the estimates of substitution rates obtained from pairwise comparisons. The date, region and sample size details of the GISAID sequences used in this study are given in Table 1. To avoid sampling bias, we investigated the nucleotide changes in each region independently.
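The pairwise comparison step can be sketched as a simple tally over aligned positions. The snippet below is a hypothetical illustration, not the authors' code: it assumes the two sequences are already aligned to equal length, treats T as U, and skips any column containing a gap or a non-ACGU symbol. The function name `count_substitutions` and the toy sequences are made up for the example.

```python
BASES = "CUGA"  # index order matches the designation C=1, U=2, G=3, A=4


def count_substitutions(ancestral, derived):
    """Return a 4x4 matrix N where N[i][j] counts ancestral base i aligned with derived base j."""
    if len(ancestral) != len(derived):
        raise ValueError("aligned sequences must have equal length")
    index = {base: k for k, base in enumerate(BASES)}
    index["T"] = index["U"]  # sequence files usually write U as T
    counts = [[0] * 4 for _ in range(4)]
    for x, y in zip(ancestral.upper(), derived.upper()):
        if x not in index or y not in index:
            continue  # skip gapped sites and undetermined or ambiguous symbols
        counts[index[x]][index[y]] += 1
    return counts


# Toy aligned pair: one C-to-T change, one N and one gap (both skipped).
N = count_substitutions("CCGATC-A", "CTGATNAA")
for base, row in zip(BASES, N):
    print(base, row)
```

The diagonal of the returned matrix counts unchanged sites and the off-diagonal entries count each type of substitution, which is the kind of input a rate estimator would need.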

Table 1.

Date, region and sample size of the GISAID sequences used in this study

	Sampling date
Region	2019/12/31	2020/6/30	2020/12/31	2021/6/30	2021/12/31	Total
Africa	0	19	11	13	5	48
Asia	1	70	211	334	440	1056
Europe	0	120	887	2369	1741	5117
North America	0	298	650	1057	558	2563
Oceania	0	52	13	11	28	104
South America	0	25	20	186	32	263
Total	1	584	1792	3970	2810	9157

Synonymous and nonsynonymous changes of SARS-CoV-2 genes

To test whether the trend of nucleotide contents in the SARS-CoV-2 genome is caused by mutational bias or by natural selection, the numbers of nonsynonymous and synonymous substitutions per site of the SARS-CoV-2 genes were estimated, because the selective force depends on the function of the protein, which in turn depends on the amino acid sequence. Table 4 shows the numbers of nonsynonymous and synonymous changes per site of the SARS-CoV-2 genes estimated by the Nei–Gojobori (NG86) model (16).

Table 3.

Major strains of SARS-CoV-2

Strain	Pango lineage
Reference
Alpha	B.1.1.7
Beta	B.1.1.351
Gamma	P.2
S	A.23.1
Omicron	BA.1

Results

Estimates of nucleotide substitutions of the SARS-CoV-2 genome

Table 2 shows the estimated substitution rates. C-to-U substitution rates were estimated as 1.95 × 10⁻³ ± 4.88 × 10⁻⁴ per site per year, and for other types of substitutions the rates were 1.48 × 10⁻⁴ ± 7.42 × 10⁻⁵ per site per year.

Table 2.

Estimated substitution rate per site per year

Type	Rate	SD
Non C-to-U	1.48 × 10⁻⁴	7.42 × 10⁻⁵
C-to-U	1.95 × 10⁻³	4.88 × 10⁻⁴

Changes in the nucleotide contents of the SARS-CoV-2 genome

Bar plots of the changes in C content of the SARS-CoV-2 genomes against sample dates are shown in Figure 1. The number of Cs decreased in the SARS-CoV-2 genome over the period from 31 December 2019 to 31 December 2021, indicating that the nucleotide frequencies had not reached equilibrium. Figure 1 also shows bar plots of the changes in U content against sample dates. The number of Us increased in the SARS-CoV-2 genome over the same period. In the upper panels of Figure 1, the observed frequencies of U and C on 31 December 2021 differ slightly from the estimated trend line, but these differences are not significant (P > 0.05). The lower panels of Figure 1 show the changes in the number of Gs and As, respectively. The numbers of Gs and As were almost unchanged throughout the same period. Solid lines in Figure 1 are trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The dotted curves are the 99% confidence intervals of the predicted nucleotide contents. These curves indicate that Cs will decrease almost linearly with time, while Us will increase, all over the world.

Global trends of nucleotide substitution rates of the SARS-CoV-2 genome

Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in Africa, Asia, Europe, North America, Oceania, and South America, respectively. Solid lines of Supplementary Figures S1–S6 also show trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The same trend of nucleotide substitutions was observed in all regions.

Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in several of the dominant strains, namely, Alpha, Beta, Gamma, Delta and Omicron, respectively. Solid lines of Supplementary Figures S7–S11 are trend curves of changes in nucleotide contents predicted by the new time-irreversible model. The results indicate that the trend of nucleotide changes is indeed a global pattern across all SARS-CoV-2 strains.

Synonymous and nonsynonymous changes of SARS-CoV-2 genes

Table 4 shows the numbers of nonsynonymous and synonymous substitutions per site of the SARS-CoV-2 genes estimated by the Nei and Gojobori model (16). In this table, dN indicates the number of nonsynonymous substitutions per site and dS indicates the number of synonymous substitutions per site. Except for S, the dN/dS ratio is smaller than one. In total, the dN/dS ratio is 0.47.

Table 4.

Synonymous and nonsynonymous substitutions

CDS	Start	End	dN × 1000	dS × 1000	dN/dS	Length (amino acids)
ORF1a	265	13468	0.57	1.36	0.42	4400
ORF1b	13467	21555	0.54	1.15	0.47	2695
S	21562	25384	1.71	1.51	1.13	1273
ORF3a	25392	26220	2.23	5.93	0.38	275
E	26244	26472	5.98	19.50	0.31	75
M	26522	27191	2.35	6.88	0.34	222
ORF6	27201	27387	13.54	54.40	0.25	61
ORF7a	27393	27759	8.74	11.05	0.79	121
ORF7b	27755	27887	10.04	38.74	0.26	43
ORF8	27893	28259	3.95	13.60	0.29	121
N	28273	29533	4.41	4.87	0.91	419
ORF10	29557	29674	11.17	43.22	0.26	38
Total			1.31	2.80	0.47	9743
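The ratios in Table 4 can be re-derived from the two rate columns as a quick consistency check. In the sketch below the tuples are copied from Table 4, and the reading of the three numeric columns as dN × 1000, dS × 1000 and dN/dS is an assumption based on the surrounding text, since the column labels did not survive intact. The variable names are illustrative.

```python
# (CDS, dN x 1000, dS x 1000, reported dN/dS, length in amino acids), copied from Table 4
TABLE4 = [
    ("ORF1a", 0.57, 1.36, 0.42, 4400),
    ("ORF1b", 0.54, 1.15, 0.47, 2695),
    ("S", 1.71, 1.51, 1.13, 1273),
    ("ORF3a", 2.23, 5.93, 0.38, 275),
    ("E", 5.98, 19.50, 0.31, 75),
    ("M", 2.35, 6.88, 0.34, 222),
    ("ORF6", 13.54, 54.40, 0.25, 61),
    ("ORF7a", 8.74, 11.05, 0.79, 121),
    ("ORF7b", 10.04, 38.74, 0.26, 43),
    ("ORF8", 3.95, 13.60, 0.29, 121),
    ("N", 4.41, 4.87, 0.91, 419),
    ("ORF10", 11.17, 43.22, 0.26, 38),
]

# Recompute each ratio from the rounded dN and dS values.
for name, dn, ds, reported, length in TABLE4:
    print("%-6s dN/dS = %.2f (reported %.2f)" % (name, dn / ds, reported))

above_one = [name for name, dn, ds, _, _ in TABLE4 if dn / ds > 1.0]
total_length = sum(row[4] for row in TABLE4)
spike_share = 1273 / total_length

print("CDS with dN/dS > 1:", above_one)
print("total length:", total_length)
print("spike share of encoded amino acids: %.1f%%" % (100 * spike_share))
```

Because the inputs are already rounded to two decimals, the recomputed ratios can differ from the reported ones in the last digit; the check is only meant to confirm that the columns are mutually consistent.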

Discussion

In this study, we proposed a method for predicting nucleotide changes in SARS-CoV-2 genomes. The results shown in Figure 1 and Supplementary Figures S1–S6 demonstrate persistent changes in nucleotide frequencies in the SARS-CoV-2 genome. In addition, the comprehensive analysis presented in Figure 1 and Supplementary Figures S1–S6 showed that the high C-to-U substitution rate is not limited to any one continent but is widespread worldwide. Sequence analyses of SARS-CoV-2 revealed that the nucleotide composition estimated by our method was consistent with the observed changes in nucleotide composition.

The proposed method is based on the time-irreversible model described in equation (3). When there is a stationary distribution of nucleotide content, i.e. $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)$, and the detailed balance condition described in equation (19), in which $R_{ij}$ denotes the $ij$th element of the substitution rate matrix, is satisfied for all $i$ and $j$ in the stationary state, the process is time reversible (3).

$$\pi_i R_{ij} = \pi_j R_{ji} \tag{19}$$
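Whether a given rate matrix satisfies detailed balance can be checked numerically. The sketch below first finds the stationary distribution by repeatedly applying the stochastic matrix I + R/q (a standard uniformization step, chosen here for simplicity and not taken from the paper) and then compares the flux between every pair of bases in both directions. The function names and the tolerance are assumptions of this illustration; the rates are the Table 2 point estimates.

```python
BASES = ("C", "U", "G", "A")


def rate_matrix(h, a):
    """C-to-U at rate h, all other changes at rate a; rows sum to zero."""
    R = [[a] * 4 for _ in range(4)]
    R[0][1] = h
    for i in range(4):
        R[i][i] = 0.0
        R[i][i] = -sum(R[i])
    return R


def stationary_distribution(R, iterations=2000):
    """Solve pi R = 0 by iterating pi <- pi (I + R / q), with q above the largest exit rate."""
    q = 1.1 * max(-R[i][i] for i in range(4))
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = [pi[j] + sum(pi[i] * R[i][j] for i in range(4)) / q for j in range(4)]
    return pi


def detailed_balance_violations(R, pi, tol=1e-12):
    """Pairs of bases for which the flux pi_i R_ij differs from pi_j R_ji."""
    return [
        (BASES[i], BASES[j])
        for i in range(4)
        for j in range(i + 1, 4)
        if abs(pi[i] * R[i][j] - pi[j] * R[j][i]) > tol
    ]


R = rate_matrix(h=1.95e-3, a=1.48e-4)
pi = stationary_distribution(R)
print("stationary distribution:", dict(zip(BASES, [round(p, 4) for p in pi])))
print("pairs violating detailed balance:", detailed_balance_violations(R, pi))
```

When `h` is set equal to `a`, the stationary distribution is uniform and no pair violates the condition, so the same check also confirms that the symmetric special case is time reversible.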

These asymmetric substitutions between C and U indicate that traditional time-reversible substitution models cannot be applied to the evolution of SARS-CoV-2 sequences.

In this study, all substitution rates other than the C-to-U rate are assumed to be identical in equation (3). A more complex model can also be incorporated. Let us assume that u and v are the transition and transversion rates, respectively. The difference in rates between transitions and transversions can be taken into account by modifying equation (3) as follows:

(20)

(21)

where

(22)

Thus, we obtain substitution matrix Inline graphic by:

(23)

Equation (23) is, however, difficult to handle to estimate h, u and v. Previous studies showed that the rate of G-to-U is higher among transversions in SARS-CoV-2 (17,18). Further study is needed to refine the model of the evolution of SARS-CoV-2 genome.
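One way to picture the extended model is to write down a rate matrix with three parameters. The sketch below is a hypothetical construction based only on the verbal description in this subsection: C-to-U occurs at rate h, the remaining transitions (U-to-C, G-to-A, A-to-G) at rate u, and all transversions at rate v. It is not a reconstruction of equations (20)–(23), and the example values of `u` and `v` are arbitrary placeholders, not estimates.

```python
BASES = ("C", "U", "G", "A")
# Transitions are pyrimidine-pyrimidine (C, U) or purine-purine (G, A) changes.
TRANSITIONS = {("C", "U"), ("U", "C"), ("G", "A"), ("A", "G")}


def extended_rate_matrix(h, u, v):
    """Rate matrix with C-to-U rate h, other transitions at rate u, transversions at rate v."""
    R = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            if (x, y) == ("C", "U"):
                R[i][j] = h
            elif (x, y) in TRANSITIONS:
                R[i][j] = u
            else:
                R[i][j] = v
        R[i][i] = -sum(R[i])  # diagonal is still 0.0 here, so this is minus the row's off-diagonal sum
    return R


# h from Table 2; u and v are arbitrary illustrative values.
R_ext = extended_rate_matrix(h=1.95e-3, u=3.0e-4, v=1.0e-4)
for base, row in zip(BASES, R_ext):
    print(base, ["%.2e" % x for x in row])
```

Setting `u` and `v` to the same value collapses this matrix back to the two-parameter form with a single C-to-U rate and one common rate for everything else.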

Nucleotide substitution rates of the SARS-CoV-2 genome

Using the new time-irreversible model of nucleotide substitutions proposed in this study, nucleotide substitution rates were estimated. The results suggest that the C-to-U substitution rate is more than 10 times higher (about 13-fold, Table 2) than the rates of other types of substitutions. Hoshino et al. used the general time-reversible model with invariable sites and gamma-distributed rate variation among sites (GTR + G + I) as a nucleotide substitution model (19). Their estimated mean substitution rate was lower than the C-to-U substitution rate and higher than the non-C-to-U substitution rate estimated by the proposed time-irreversible model.
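For reference, the ratio of the two point estimates in Table 2 works out as follows. This is a back-of-the-envelope check that uses only the reported means and ignores their standard deviations.

```latex
\frac{\text{C-to-U rate}}{\text{non-C-to-U rate}}
  = \frac{1.95 \times 10^{-3}}{1.48 \times 10^{-4}}
  \approx 13.2
```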

In this study, a simple algorithm for estimating the substitution rates using the diagonalization method is presented. The nucleotide substitution rates for the new model can be calculated as easily as with the traditional time-reversible model because the diagonalization method can be applied to the new model. To validate the new model, the number of nucleotide substitutions in genomic sequences of SARS-CoV-2 registered in the GISAID database that have been sampled from all over the world were analysed. The diagonalization method is often used for time-reversible models, such as the Hasegawa–Kishino–Yano model (9) and the general time-reversible model (11).

The changes in nucleotide contents differ among continents, as evidenced by Supplementary Figures S1–S6. However, the differences might be due to errors arising from limited sample sizes, especially in Africa and Oceania. As shown in Table 1, the sample size differs substantially between continents.

Amino acid changes and natural selection of SARS-CoV-2 genes

It is widely known that mutational asymmetries affect amino acid substitutions. Jordan et al. found similar trends in amino acid changes across 15 taxonomic groups representing bacteria, archaea and eukaryotes (20). Misawa et al. showed that these trends are mainly caused by CpG hypermutability (21). The C-to-U substitutions in SARS-CoV-2 genomes are caused by host RNA editing enzymes, such as the APOBEC family of cytidine deaminases (22–24). The C-to-U hypermutation of the SARS-CoV-2 genome will increase the number of hydrophobic amino acids in the virus proteins, because the codons of the four most hydrophobic amino acids (phenylalanine, isoleucine, leucine and valine) contain a U in the first or second position, whereas the codons of the most polar amino acids (asparagine, aspartic acid, arginine, glutamine, glutamic acid and lysine) do not contain a U in the first or second position (3,25) (see the codon table in Figure S12). The model presented in this study suggests that the number of Cs is decreasing in the SARS-CoV-2 genome while that of Us is increasing, indicating that nucleotide frequencies have not reached equilibrium. Evolutionary studies of the SARS-CoV-2 genome must be continued to predict the future course of the COVID-19 pandemic.
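The codon-position argument can be verified directly against the standard genetic code. The sketch below builds the standard codon table from its usual compact encoding (bases in the order U, C, A, G) and tests the two claims. The polar set here includes both glutamine and glutamic acid, which is an assumption of this sketch, and the helper names are illustrative.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

HYDROPHOBIC = "FILV"  # phenylalanine, isoleucine, leucine, valine
POLAR = "NDRQEK"      # asparagine, aspartic acid, arginine, glutamine, glutamic acid, lysine


def has_u_in_first_two(codon):
    """True if the codon has a U in its first or second position."""
    return "U" in codon[:2]


def codons_for(amino_acids):
    """All codons that encode any of the given one-letter amino acids."""
    return [codon for codon, aa in CODON_TABLE.items() if aa in amino_acids]


all_hydrophobic_have_u = all(has_u_in_first_two(c) for c in codons_for(HYDROPHOBIC))
any_polar_has_u = any(has_u_in_first_two(c) for c in codons_for(POLAR))
print("every hydrophobic codon has U in position 1 or 2:", all_hydrophobic_have_u)
print("any polar codon has U in position 1 or 2:", any_polar_has_u)
```

Under the standard code the first check is true and the second is false, so a C-to-U change in the first or second codon position can only move a codon toward the U-containing set.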

Table 4 shows that the dN/dS ratio is below one for every gene except S. In total, the dN/dS ratio is 0.47. The S gene encodes the spike protein of SARS-CoV-2, which is believed to be under natural selection (26). As shown in Table 4, the spike protein, which contains 1273 amino acids, accounts for roughly 13% of the total 9743 amino acids encoded by the genome. Hence, the predominant global trend of nucleotide variation cannot be attributed to neutral evolution (27).

Limitations of the proposed method

It should be noted that the newly proposed method in this study may have limited applicability to RNA viruses that replicate through RNA-dependent RNA polymerases, as the C-to-U substitutions observed in SARS-CoV-2 genomes are primarily attributed to host RNA editing enzymes, such as the APOBEC family of cytidine deaminases. Additionally, in the analysis of the SARS-CoV-2 genome, the ancestral state is known. However, in cases where the ancestral state is unknown, it becomes necessary to estimate the state. Future studies are warranted to gain further insights into the evolutionary dynamics of SARS-CoV-2.

Supplementary Material

lqae009_Supplemental_File

lqae009_supplemental_file.docx (652.7 KB, docx)

Acknowledgements

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We thank Dr Nao Nishida, Dr Naoko Fujito and Dr Naoki Osada for their useful comments and discussions. We thank Margaret Biswas, PhD, from Edanz (https://jp.edanz.com/ac) for editing a draft of this manuscript.

Author contributions: Kazuharu Misawa: Conceptualization, Formal analysis, Methodology, Validation, Writing – original draft. Ryo Ootsuki: Formal analysis, Visualization, Writing – review & editing.

Contributor Information

Kazuharu Misawa, Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama 236-0004, Japan; RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.

Ryo Ootsuki, Department of Natural Sciences, Faculty of Arts and Sciences, 1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan; Department of Chemical and Biological Sciences, Faculty of Science, Japan Women's University, 2-8-1 Mejirodai, Bunkyo-ku, Tokyo 112-8681, Japan.

Data availability

All sequence data used in this study can be downloaded from the GISAID database (https://www.gisaid.org/). All Python code and lists of GISAID accession numbers of virus sequences used in this study are available on GitHub (https://github.com/kazumisawa/coronavirusEvolution) and FigShare (https://doi.org/10.6084/m9.figshare.23691411).

Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in Africa, Asia, Europe, North America, Oceania and South America, respectively. Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in several of the dominant strains, namely, Alpha, Beta, Gamma, Delta and Omicron, respectively. Figure S12 shows the standard codon table.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT in order to improve English. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Supplementary data

Supplementary Data are available at NARGAB Online.

Funding

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers JP17K08682, JP19K22647, JP20K07316 to K.M.

Conflict of interest statement. None declared.

References

1. Wang D., Hu B., Hu C., Zhu F., Liu X., Zhang J., Wang B., Xiang H., Cheng Z., Xiong Y. et al. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA. 2020; 323:1061–1069.
2. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G., Hu Y., Tao Z.W., Tian J.H., Pei Y.Y. et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020; 579:265–269.
3. Simmonds P. Rampant C→U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: causes and consequences for their short- and long-term evolutionary trajectories. mSphere. 2020; 5:e00408-20.
4. Iwasaki Y., Abe T., Ikemura T. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol. 2021; 21:89.
5. Kim K., Calabrese P., Wang S., Qin C., Rao Y., Feng P., Chen X.S. The roles of APOBEC-mediated RNA editing in SARS-CoV-2 mutations, replication and fitness. Sci. Rep. 2022; 12:14972.
6. Nakata Y., Ode H., Kubota M., Kasahara T., Matsuoka K., Sugimoto A., Imahashi M., Yokomaku Y., Iwatani Y. Cellular APOBEC3A deaminase drives mutations in the SARS-CoV-2 genome. Nucleic Acids Res. 2023; 51:783–795.
7. Jukes T.H., Cantor C.R. In: Munro H.N. (ed.) Mammalian Protein Metabolism. 1969; NY: Academic Press.
8. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980; 16:111–120.
9. Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985; 22:160–174.
10. Tamura K., Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993; 10:512–526.
11. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 1986; 17:57–86.
12. Boussau B., Gouy M. Efficient likelihood computations with nonreversible models of evolution. Syst. Biol. 2006; 55:756–768.
13. Jayaswal V., Jermiin L.S., Poladian L., Robinson J. Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Syst. Biol. 2011; 60:74–86.
14. Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 2017; 1:33–46.
15. Katoh K., Misawa K., Kuma K., Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002; 30:3059–3066.
16. Nei M., Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 1986; 3:418–426.
17. Azgari C., Kilinc Z., Turhan B., Circi D., Adebali O. The mutation profile of SARS-CoV-2 is primarily shaped by the host antiviral defense. Viruses. 2021; 13:394.
18. Forni D., Cagliani R., Pontremoli C., Clerici M., Sironi M. The substitution spectra of coronavirus genomes. Brief Bioinform. 2022; 23:bbab382.
19. Hoshino K., Maeshiro T., Nishida N., Sugiyama M., Fujita J., Gojobori T., Mizokami M. Transmission dynamics of SARS-CoV-2 on the Diamond Princess uncovered using viral genome sequence analysis. Gene. 2021; 779:145496.
20. Jordan I.K., Kondrashov F.A., Adzhubei I.A., Wolf Y.I., Koonin E.V., Kondrashov A.S., Sunyaev S. A universal trend of amino acid gain and loss in protein evolution. Nature. 2005; 433:633–638.
21. Misawa K., Kamatani N., Kikuno R.F. The universal trend of amino acid gain-loss is caused by CpG hypermutability. J. Mol. Evol. 2008; 67:334–342.
22. Bishop K.N., Holmes R.K., Sheehy A.M., Malim M.H. APOBEC-mediated editing of viral RNA. Science. 2004; 305:645.
23. Kosuge M., Furusawa-Nishii E., Ito K., Saito Y., Ogasawara K. Point mutation bias in SARS-CoV-2 variants results in increased ability to stimulate inflammatory responses. Sci. Rep. 2020; 10:17766.
24. Ratcliff J., Simmonds P. Potential APOBEC-mediated RNA editing of the genomes of SARS-CoV-2 and other coronaviruses and its impact on their longer term evolution. Virology. 2021; 556:62–72.
25. Matyášek R., Řehůřková K., Berta Marošiová K., Kovařík A. Mutational asymmetries in the SARS-CoV-2 genome may lead to increased hydrophobicity of virus proteins. Genes (Basel). 2021; 12:826.
26. Lopez-Cortes G.I., Palacios-Perez M., Zamudio G.S., Velediaz H.F., Ortega E., Jose M.V. Neutral evolution test of the spike protein of SARS-CoV-2 and its implications in the binding to ACE2. Sci. Rep. 2021; 11:18847.
27. Frost S.D.W., Magalis B.R., Kosakovsky Pond S.L. Neutral theory and rapidly evolving viral pathogens. Mol. Biol. Evol. 2018; 35:1348–1354.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

lqae009_Supplemental_File

lqae009_supplemental_file.docx^{(652.7KB, docx)}

Data Availability Statement

Supplementary Figures S1–S6 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in Africa, Asia, Europe, North America, Oceania, and South America, respectively. Supplementary Figures S7–S11 show bar plots of the changes in nucleotide contents of the SARS-CoV-2 genomes and the sample dates observed in several of the dominant strains, namely, Alfa, Beta, Gamma, Delta, and Omicron, respectively. Figure S12 shows the standard codon table.

[B1] 1. Wang D., Hu B., Hu C., Zhu F., Liu X., Zhang J., Wang B., Xiang H., Cheng Z., Xiong Yet al.. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA. 2020; 323:1061–1069. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] 2. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G., Hu Y., Tao Z.W., Tian J.H., Pei Y.Yet al.. A new coronavirus associated with human respiratory disease in China. Nature. 2020; 579:265–269. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] 3. Simmonds P Rampant C→U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: causes and consequences for their short- and long-term evolutionary trajectories. mSphere. 2020; 5:e00408-20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4. Iwasaki Y., Abe T., Ikemura T.. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol. 2021; 21:89. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] 5. Kim K., Calabrese P., Wang S., Qin C., Rao Y., Feng P., Chen X.S.. The roles of APOBEC-mediated RNA editing in SARS-CoV-2 mutations, replication and fitness. Sci. Rep. 2022; 12:14972. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6. Nakata Y., Ode H., Kubota M., Kasahara T., Matsuoka K., Sugimoto A., Imahashi M., Yokomaku Y., Iwatani Y.. Cellular APOBEC3A deaminase drives mutations in the SARS-CoV-2 genome. Nucleic Acids Res. 2023; 51:783–795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] 7. Jukes T.H., Cantor T.H.. Munro H.N. Mammalian Protein Metabolism. 1969; NY: Academic Press. [Google Scholar]

[B8] 8. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980; 16:111–120. [DOI] [PubMed] [Google Scholar]

[B9] 9. Hasegawa M., Kishino H., Yano T.. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985; 22:160–174. [DOI] [PubMed] [Google Scholar]

[B10] 10. Tamura K., Nei M.. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993; 10:512–526. [DOI] [PubMed] [Google Scholar]

[B11] 11. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 1986; 17:57–86. [Google Scholar]

[B12] 12. Boussau B., Gouy M.. Efficient likelihood computations with nonreversible models of evolution. Syst. Biol. 2006; 55:756–768. [DOI] [PubMed] [Google Scholar]

[B13] 13. Jayaswal V., Jermiin L.S., Poladian L., Robinson J.. Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Syst. Biol. 2011; 60:74–86. [DOI] [PubMed] [Google Scholar]

[B14] 14. Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob. Chall. 2017; 1:33–46.

[B15] 15. Katoh K., Misawa K., Kuma K., Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002; 30:3059–3066.

[B16] 16. Nei M., Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 1986; 3:418–426.

[B17] 17. Azgari C., Kilinc Z., Turhan B., Circi D., Adebali O. The mutation profile of SARS-CoV-2 is primarily shaped by the host antiviral defense. Viruses. 2021; 13:394.

[B18] 18. Forni D., Cagliani R., Pontremoli C., Clerici M., Sironi M. The substitution spectra of coronavirus genomes. Brief. Bioinform. 2022; 23:bbab382.

[B19] 19. Hoshino K., Maeshiro T., Nishida N., Sugiyama M., Fujita J., Gojobori T., Mizokami M. Transmission dynamics of SARS-CoV-2 on the Diamond Princess uncovered using viral genome sequence analysis. Gene. 2021; 779:145496.

[B20] 20. Jordan I.K., Kondrashov F.A., Adzhubei I.A., Wolf Y.I., Koonin E.V., Kondrashov A.S., Sunyaev S. A universal trend of amino acid gain and loss in protein evolution. Nature. 2005; 433:633–638.

[B21] 21. Misawa K., Kamatani N., Kikuno R.F. The universal trend of amino acid gain-loss is caused by CpG hypermutability. J. Mol. Evol. 2008; 67:334–342.

[B22] 22. Bishop K.N., Holmes R.K., Sheehy A.M., Malim M.H. APOBEC-mediated editing of viral RNA. Science. 2004; 305:645.

[B23] 23. Kosuge M., Furusawa-Nishii E., Ito K., Saito Y., Ogasawara K. Point mutation bias in SARS-CoV-2 variants results in increased ability to stimulate inflammatory responses. Sci. Rep. 2020; 10:17766.

[B24] 24. Ratcliff J., Simmonds P. Potential APOBEC-mediated RNA editing of the genomes of SARS-CoV-2 and other coronaviruses and its impact on their longer term evolution. Virology. 2021; 556:62–72.

[B25] 25. Matyášek R., Řehůřková K., Berta Marošiová K., Kovařík A. Mutational asymmetries in the SARS-CoV-2 genome may lead to increased hydrophobicity of virus proteins. Genes (Basel). 2021; 12:826.

[B26] 26. Lopez-Cortes G.I., Palacios-Perez M., Zamudio G.S., Velediaz H.F., Ortega E., Jose M.V. Neutral evolution test of the spike protein of SARS-CoV-2 and its implications in the binding to ACE2. Sci. Rep. 2021; 11:18847.

[B27] 27. Frost S.D.W., Magalis B.R., Kosakovsky Pond S.L. Neutral theory and rapidly evolving viral pathogens. Mol. Biol. Evol. 2018; 35:1348–1354.