Future COVID19 surges prediction based on SARS-CoV-2 mutations surveillance

Fares Z Najar; Evan Linde; Chelsea L Murphy; Veniamin A Borin; Huan Wang; Shozeb Haider; Pratul K Agarwal

doi:10.7554/eLife.82980

eLife. 2023 Jan 19;12:e82980. doi: 10.7554/eLife.82980

Editors: Jameel Iqbal, Mone Zaidi

PMCID: PMC9894583 PMID: 36655992

Abstract

COVID19 has aptly revealed that airborne viruses such as SARS-CoV-2, with the ability to mutate rapidly combined with high rates of transmission and fatality, can cause a deadly worldwide pandemic in a matter of weeks (Platto et al., 2021). Apart from vaccines and post-infection treatment options, strategies for preparedness will be vital in responding to the current and future pandemics. Therefore, there is wide interest in approaches that allow prediction of increases in infections (‘surges’) before they occur. We describe here real-time genomic surveillance, based in particular on mutation analysis of viral proteins, as a methodology for a priori determination of surges in the number of infection cases. The full results are available for SARS-CoV-2 at http://pandemics.okstate.edu/covid19/ and are updated daily as new virus sequences become available. This approach is generic and will also be applicable to other pathogens.

Research organism: Viruses

Introduction

Protein and genome sequence analyses identify molecular-level changes that enable viral adaptations for increased spread through the host population. Concrete evidence for a direct relationship between specific mutations and an increase in rates of infection (and fatality) requires extensive laboratory studies that take significant time. The availability of an unprecedented number of SARS-CoV-2 genome sequences makes it possible to identify the number and types of mutations, which in turn can provide vital knowledge in real time, crucial for decision making by health professionals for medical interventions. We are investigating several different approaches (synonymous, non-synonymous, and non-synonymous/synonymous ratio for the nucleotide sequences [Zhang et al., 2006] and conservative or radical substitutions for the amino acid sequences) for using the number and types of mutations as a means to predict surges in infections as well as to monitor the changes in critical viral proteins. Recently, such analysis has been reported for single SARS-CoV-2 proteins (Kistler et al., 2022). Our approach, however, is based on analysis of the whole viral genome and, moreover, is performed continually in real time.

Materials and methods

The SARS-CoV-2 genomic sequence data and the number of COVID19 (Platto et al., 2021) cases are continually obtained from the sources described below. The genomic sequences are carefully filtered for quality control and used to calculate non-synonymous (K_a) and synonymous (K_s) mutation rates for each of the 26 proteins separately.

Data and data sources

Data for the number of reported COVID19 cases were accessed from the Our World in Data project (https://ourworldindata.org/coronavirus-source-data), which compiled case counts from the Johns Hopkins University dashboard (Dong et al., 2020).

Genomic sequence data

An in-house pipeline of scripts (using Linux commands) was designed around the eUtils tools (Nadkarni and Parikh, 2012) from NCBI in order to download and process the SARS-CoV-2 records from NCBI’s GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Briefly, we used the esearch and efetch commands to obtain these GenBank records. The search string ‘SARS-CoV-2’, refined to ‘SARS-CoV-2 [ORGN]’, was used to download the identified records in the GenBank text format. After workflow optimization (post May 2022), the search process used NCBI’s newer datasets and dataformat command-line tools to identify sequences of interest while continuing to use the efetch tool to download records in the GenBank text format. Collectively, as of November 21, 2022, a total of 6,468,196 records were searched, and the 3,126,129 sequences that matched the search criterion and passed the quality control steps were used for the results presented here.
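The search-then-fetch pattern described above can be sketched in Python against the HTTP interface of the NCBI E-utilities. This is a minimal illustration and not the authors' pipeline, which used the command-line tools. The function names, the choice of the nuccore database, and the paging values are assumptions made for the example. Only the search string is taken from the text.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities web service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term="SARS-CoV-2 [ORGN]", retstart=0, retmax=500):
    """Build an ESearch request that lists nucleotide record IDs for a query.

    The database (nuccore) and the paging values are illustrative assumptions.
    """
    params = {"db": "nuccore", "term": term, "retstart": retstart, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids):
    """Build an EFetch request that returns records in GenBank text format."""
    params = {"db": "nuccore", "id": ",".join(ids), "rettype": "gb", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

The functions only build request URLs, so they can be inspected without network access. A real download loop would page through the ESearch results and respect NCBI's request-rate limits.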

Quality control

Incomplete and ambiguous SARS-CoV-2 genomic sequences and records containing incomplete collection dates were filtered out using the designed pipeline. For the records passing the quality control steps, the nucleotide sequence for each gene was extracted. A non-redundant version of the extracted nucleotide sequences was derived and translated to the cognate amino acid sequences. In the final phase of the pipeline, the accession numbers for each viral isolate, along with the nucleotide sequences, the associated protein sequences, the collection dates, and the country of collection, were stored in a SQLite relational database, where they were indexed with unique identifiers to allow the retrieval and analysis of any part of the parsed data.
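A filter of this kind, followed by storage in SQLite, might look like the sketch below. The exact quality criteria are not spelled out in the text, so the rules here are assumptions: a sequence counts as unambiguous when it contains only A, C, G, and T, and a collection date counts as complete when it has the form YYYY-MM-DD. The table layout and the function names are likewise hypothetical.

```python
import re
import sqlite3

# Assumed date format: year-month-day. Partial dates such as "2020-03" fail.
FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def passes_qc(sequence, collection_date):
    """Keep records with only A/C/G/T bases and a complete collection date."""
    unambiguous = bool(sequence) and set(sequence.upper()) <= set("ACGT")
    return unambiguous and bool(FULL_DATE.match(collection_date))


def store(records, path=":memory:"):
    """Insert QC-passing (accession, date, country, sequence) tuples into SQLite.

    The schema is a hypothetical layout; accession serves as the unique key.
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS isolates ("
        "accession TEXT PRIMARY KEY, collection_date TEXT, "
        "country TEXT, sequence TEXT)"
    )
    kept = [r for r in records if passes_qc(r[3], r[1])]
    db.executemany("INSERT OR IGNORE INTO isolates VALUES (?, ?, ?, ?)", kept)
    db.commit()
    return db
```

Using the accession as the primary key means a record that is downloaded twice is stored only once.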

Frequency of data updates

As of July 2022, the described sources are monitored daily for updates. New data is continually downloaded and used for analysis.

Alignments and non-synonymous (K_a), synonymous (K_s) calculations

The translated protein and nucleotide sequences were aligned using the clustal-omega (Sievers and Higgins, 2014) and Pal2Nal (Suyama et al., 2006) programs to align the codons with their associated amino acids. The resulting alignments were then processed through the program kaks_calculator (Zhang, 2022) to calculate non-synonymous (K_a) and synonymous (K_s) values and their ratio (K_a/K_s), which were used to assess the mutational adaptation for each protein. The parameters required for the kaks_calculator were based on the maximum-likelihood method derived from the work of Goldman and Yang, 1994. The first reported SARS-CoV-2 genomic sequence (‘the Wuhan sequence’) (Wu et al., 2020) was used as a reference for all the K_a, K_s, and K_a/K_s calculations. We explored the possibility of using other sequence(s) as references (e.g., from the previous day or the previous month); however, due to the increasing number of variations available every day, it is difficult to select a representative sequence on an ongoing basis. It was also found that using the Wuhan sequence as a reference provided the most intuitive and interpretable results.
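The distinction between synonymous and non-synonymous changes can be illustrated with a simple codon-by-codon comparison against a reference. The sketch below is a deliberately simplified count of differing codons and is not the maximum-likelihood K_a and K_s estimation performed by kaks_calculator, which also normalizes by the number of available sites and corrects for multiple substitutions. It assumes codon-aligned inputs that contain only A, C, G, T, and gap characters. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(reference, query):
    """Return (synonymous, non_synonymous) codon differences versus the reference."""
    if len(reference) != len(query) or len(reference) % 3:
        raise ValueError("sequences must be codon-aligned and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(reference), 3):
        ref_codon, qry_codon = reference[i:i + 3], query[i:i + 3]
        if ref_codon == qry_codon or "-" in ref_codon + qry_codon:
            continue  # identical codon or alignment gap: nothing to count
        if CODON_TABLE[ref_codon] == CODON_TABLE[qry_codon]:
            syn += 1  # same amino acid, different codon
        else:
            nonsyn += 1  # the amino acid changed
    return syn, nonsyn
```

For example, comparing ATGGCTAAA with ATGGCCAGA yields one synonymous change (GCT to GCC, both alanine) and one non-synonymous change (AAA, lysine, to AGA, arginine).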

List of proteins investigated

The number of unique nucleotide sequences observed to date for each of the 26 proteins/open reading frames (ORFs) is listed in Table 1. The full results are available on the project website, https://pandemics.okstate.edu/covid19/, which is continually updated.

Table 1. Number of unique records for the 26 proteins/open reading frames (ORFs).

Total number of quality-controlled SARS-CoV-2 sequences analyzed: 3,126,129 (as of November 21, 2022). Only three proteins showing the most relevant results and one other protein (marked by *) for comparison are depicted in the figures. These proteins are shown in bold.

Name	Unique records
Envelope protein	1314
Membrane protein	11,338
Nucleocapsid protein	70,579
Spike protein	188,166
Non-structural protein 1 (NSP1), leader protein	11,656
NSP2	67,837
NSP3	245,627
NSP4	31,257
NSP5, 3C-like proteinase	11,879
NSP6	16,479
NSP7	1304
NSP8	4490
NSP9	2848
NSP10	2429
NSP11	88
NSP12, RNA-dependent RNA polymerase (RDRP)*	60,575
NSP13, helicase	35,421
NSP14, 3'-to-5' exonuclease	28,501
NSP15, endoRNAse	12,901
NSP16, 2'-O-ribose methyltransferase	7636
ORF3a	41,694
ORF6	2117
ORF7a	9312
ORF7b	1368
ORF8	7036
ORF10	710

Results

Collective non-synonymous mutations in key proteins of SARS-CoV-2 showed a significant increase 10–14 days before the rapid rise in COVID19 cases, particularly for the surges that occurred after the emergence of the Gamma, Delta, Omicron, and BA.5 variants (Figure 1 and the related Figure 1—figure supplement 1 with the unnormalized results). At present, over 6.4 million SARS-CoV-2 genome sequences collected all over the world are available from GenBank (https://www.ncbi.nlm.nih.gov/sars-cov-2/); these were used for analysis of 26 SARS-CoV-2 proteins, including the structural (spike, envelope, membrane, nucleocapsid) proteins, non-structural proteins (NSPs), and ORFs. Note that our analysis was performed with the first reported (‘Wuhan’) SARS-CoV-2 sequence as a reference (Wu et al., 2020). In other words, the computed mutations are calculated in comparison to this reference sequence. The reason for an increase in mutations ahead of a surge is the search for adaptation against acquired immunity (or gain in function) in either a single protein or a combination of proteins. The Omicron variant shows the most drastic changes in several different proteins, which coincided with the largest increase in the rate of infections (Figure 1). Non-synonymous mutations (K_a) in several proteins show a significant increase before the increase in the rate of infections (or surges), thereby providing a means for surge prediction.

Figure 1. Non-synonymous mutations over the course of the COVID19 outbreak were identified by analysis of 6.4 million sequences. Gray dots indicate individual mutations, while black lines show weighted means for each day. Red lines show new COVID19 cases (averaged weekly) across the world. The green arrows mark the time when new mutations occurred in significant numbers before the outbreaks, allowing prediction of future outbreaks. The mutation values have been normalized using average of all mutations in the year 2020 (the first full year of the pandemic) as 1 (marked by dashed lines). Raw results are available in Figure 1—figure supplement 1. Values of 0 indicate same sequence as the Wuhan sequence, while larger values indicate more mutations. Note that each gray dot corresponds to a unique sequence, and there can be multiple records showing the same mutation. The weighted mean for the day is calculated by using all sequences reported for the day. The peaks for COVID19 cases are labeled with prevalent variants. Alpha/Beta, Omicron, and Omicron BA.2, BA.5 were the prevalent variants at the time of labeled peaks. For the two peaks in 2021 the case was less clear, with Gamma and Delta variants being observed at different times in different parts of the world.
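The weighting and normalization described in the caption can be sketched as follows. This is one plausible reading and not the authors' code. It assumes that each unique sequence contributes its K_a value weighted by the number of records carrying it, and that the 2020 baseline is the record-weighted average over all 2020 observations. The tuple layout and the function names are hypothetical.

```python
from collections import defaultdict


def weighted_daily_means(observations):
    """Compute the record-weighted mean K_a for each collection day.

    observations: list of (date 'YYYY-MM-DD', ka_value, record_count) tuples,
    one per unique sequence per day.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for day, ka, count in observations:
        totals[day][0] += ka * count  # running weighted sum
        totals[day][1] += count       # running record count
    return {day: s / n for day, (s, n) in sorted(totals.items())}


def normalize_to_2020(observations):
    """Scale the daily means so that the weighted 2020 average equals 1."""
    in_2020 = [(ka, n) for day, ka, n in observations if day.startswith("2020")]
    baseline = sum(ka * n for ka, n in in_2020) / sum(n for _, n in in_2020)
    means = weighted_daily_means(observations)
    return {day: mean / baseline for day, mean in means.items()}
```

Under this reading, a normalized value of 4 on a given day means that the weighted mean for that day is four times the 2020 average.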

Use of mutational rates as a surge predictor

In addition to non-synonymous mutations, a number of other metrics were investigated for a reliable prediction signal. In particular, the commonly used ratio of non-synonymous to synonymous mutations, K_a/K_s (Figure 1—figure supplement 2), and the rate of mutations (derivative of the observed number of mutations with respect to time) (Figure 1—figure supplement 3) were investigated in detail for suitability as a signal for surge prediction. As seen in Figure 1—figure supplement 2, K_a/K_s did not provide a reliable surge prediction signal. Figure 1—figure supplement 3 shows the rate of mutations (calculated as a numerical derivative). For the Omicron surge, the proteins did show an increased rate of mutations; however, for all other cases a clear signal was absent. Furthermore, the rate of mutations approach presented two additional challenges. First, a number of instances were observed where the rate of mutations increased without a corresponding increase in reported infections (false positive signal). Second, incoming genomic data are generally noisy (due to the smaller number of samples and large variations in the weighting of different mutations) and change quickly; therefore, the most recent rate of mutations is very noisy as well. It was concluded that, at this stage, the rate (derivative) of mutations is not a reliable signal for surge prediction. In the future, this could be revisited with more stable reporting of genomic sequences and shorter timeframes between sample collection and sequence publication. Figure 1—figure supplement 4 presents a side-by-side comparison of the metrics investigated. Overall, it appears that collective non-synonymous mutations (K_a) provide the most reliable signal for surge prediction. In the remaining text, we discuss the key results and their importance.
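A numerical derivative of a daily series can be computed with finite differences. The sketch below uses central differences in the interior and one-sided differences at the two ends. The text does not state which scheme was used, so this choice is an assumption, as is the function name.

```python
def mutation_rate(daily_means):
    """Finite-difference derivative of a daily series (change per day).

    Interior points use a central difference over two days. The first and
    last points fall back to a one-sided difference over one day.
    """
    n = len(daily_means)
    if n < 2:
        raise ValueError("need at least two daily values")
    rate = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        rate.append((daily_means[hi] - daily_means[lo]) / (hi - lo))
    return rate
```

Because each value is a difference of neighboring points, day-to-day noise in the input is amplified in the output, which is consistent with the observation above that the most recent rate of mutations is very noisy.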

Spike protein

Spike protein interacts with the angiotensin-converting enzyme 2 receptor and plays a vital role in infecting human cells (Xia, 2021). Spike protein has been the target of mRNA-based vaccines. Viral sequences show significant changes in synonymous and non-synonymous mutations in the spike protein (188,166 unique sequences observed so far), with large increases ahead of the surges in reported human infections, most noticeably the surges associated with the Gamma/Delta and the Omicron variants (Figure 1A). It is important to note that the mutations increase 10–14 days before the increase in human infections. It is also interesting to note that the synonymous mutations (data available on the website) decrease after surges. The decrease in mutations prior to the Omicron BA.2 surge corresponds to reversal mutations (returning to the reference sequence). However, at present the non-synonymous and synonymous mutations following the Omicron variant remain elevated, more so than in any other period of the COVID19 outbreak.

Proteins showing significant mutations

In addition to the spike protein, the SARS-CoV-2 membrane (Lu et al., 2021; Figure 1B, 11,338 unique sequences observed so far) and envelope (Zheng et al., 2021; Figure 1C, 1314 unique sequences observed so far) proteins have also shown significant mutations, starting just before the Omicron variant (November 2021 onward). In the case of the membrane protein, there was a significant increase that started with the Gamma/Delta variants (June 2021 onward) and increased further just before the BA.5 surge. The spike, membrane, and envelope proteins are all located on the surface of SARS-CoV-2 and potentially interact with the components of the immune system. The large increase in mutations in all these external proteins assumes importance in the post-vaccination period (discussed further below).

Other proteins

For comparison, Figure 1D shows mutations from RNA-dependent RNA polymerase (RDRP, 60,575 unique sequences observed so far), which has been targeted for development of antiviral drug therapies. To date, RDRP has shown a comparatively lower magnitude of non-synonymous mutations. Note that the gray dots are individual mutations, and the mean (black line) for each day is weighted by the number of sequences showing each mutation. Significant increases in mutations are also observed in NSPs 1, 4, 6, 13, 15, and ORFs 6, 7a, and 7b (data available on the project website). Overall, this analysis allows us to monitor ongoing mutations in different proteins; when a rapid rise is observed over a short period of time, we issue surveillance watches and warnings (reserved for the most extreme cases) for possible new variants with a combination of proteins showing new mutations.

Vaccination and mutational frequencies

Widespread vaccination against SARS-CoV-2 (December 2020 onward) coincides with a significant increase in mutation rates of several SARS-CoV-2 proteins. Spike, membrane, and envelope proteins have shown rapid mutations, especially in the Omicron variant (gray dots in Figure 1). This is possibly due to viral adaptations under the selective pressure exerted by the vaccine, as a significant number of mutations were observed in 2021, especially for the spike protein (gray dots in Figure 1A indicate that spike protein has mutated much more than any other protein). The long-term effectiveness of mRNA-based SARS-CoV-2 vaccines remains unknown. After the initial regimen of two doses, the administration of additional booster (third and fourth) doses has decreased due to improvement in COVID19 fatality rates as well as political reasons (Sabahelzain et al., 2021). This situation raises concerns. Other proteins have shown reversal mutations (higher similarity with the reference sequence) after periods of significant increase in mutations; however, post vaccination, the significant mutations observed in the spike, envelope, and membrane proteins related to the Omicron variant remain at extremely elevated levels. As Omicron, BA.2, BA.5, and subsequent variants are showing increased rates of transmission, gain or improvement of function in other proteins could lead to the emergence of newer variants of concern. Over the long term, this needs to be addressed by vaccines with longer periods of effectiveness and by post-infection treatment options, including antiviral drugs.

Surge prediction

The methodology presented here allows monitoring of potential increases in the reported number of human infections. To date, spike protein has shown the most direct correlation between the rate of non-synonymous mutations and the rate of human infection. In particular, in the case of the Omicron variant and also the Gamma variant, spike protein showed a rapid increase in mutations about 10–14 days ahead of the surge. Furthermore, membrane protein showed rapid mutations before the surge related to BA.5. Therefore, such an increase in mutations serves as an indication of upcoming surges. For example, we issued a surge watch on the website on June 29, 2022, which was converted to a warning on July 14. This was confirmed by an increase in infection cases worldwide throughout July (see Figure 1—figure supplement 5). Further, we issued an additional warning on September 7, 2022, which was confirmed by surges in several European countries, including France, the United Kingdom, Germany, and Italy (see Figure 1—figure supplement 6).
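One way to check a lead time such as the 10–14 days reported here is to shift the case series against the mutation series and find the lag with the highest correlation. The sketch below is an illustration under stated assumptions and is not the procedure used by the authors, whose watch and warning criteria are not specified in the text. It assumes two equal-length daily series and uses the Pearson correlation. The function names and the 21-day search window are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def best_lead(mutations, cases, max_lag=21):
    """Return (lag_days, correlation) where mutations best match later cases."""
    n = min(len(mutations), len(cases))
    best_lag, best_r = 0, float("-inf")
    for lag in range(max_lag + 1):
        if n - lag < 3:
            break  # too few overlapping days to correlate
        # Pair the mutation value on day t with the case count on day t + lag.
        r = pearson(mutations[: n - lag], cases[lag:n])
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r
```

A high correlation at a given lag describes past data only. It does not by itself establish that mutations cause or reliably precede surges.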

The role of different (or dominant) SARS-CoV-2 variants in major surges is unclear at this time and needs further research. Different variants have been prevalent in different geographic regions at different times over the course of the COVID19 outbreak; therefore, it is difficult to assign the surges to individual variants. In particular, the Gamma and Delta variants were both prevalent in different countries in 2021. We are working on enabling this analysis by geographic location, and the results will be available through the website. However, at present our analysis is able to make predictions about collective surges before they occur, as illustrated by the case of BA.5.

In the future, a number of factors could affect the performance of the presented approach. In particular, as the pandemic situation has improved in the second half of 2022, the number of tests being performed and the number of sequences being deposited into public repositories have decreased. Furthermore, it is widely discussed that the population is showing increased immunity against the virus due to vaccination and naturally acquired immunity. The presented approach is dependent on the availability of sequences; therefore, we hope that the scientific community will continue to urge the medical community and public health agencies to commit resources to sequencing samples from COVID19-positive patients on a regular basis. Nonetheless, even with a smaller number of sequences available, our approach is weighted by mutations and by the percentage of sequences showing non-synonymous mutations. Therefore, whenever new mutations show up in large percentages, our approach will still be able to work. On the other hand, viruses continue to evolve, and if the population acquires large-scale immunity leading to a drastic reduction in the number of infections, our surveillance approach would still allow preparation in cases of significant viral genome changes (such as going from SARS-CoV to SARS-CoV-2) whenever they occur and lead to the possibility of another major outbreak.

Discussion

The methodology and the website described here provide real-time mutational changes of 26 SARS-CoV-2 proteins and ORFs. The changes in non-synonymous mutations correlate with the increase in reported cases of infection. Apart from identifying mutations of concern for in-depth scientific studies, the website is intended to keep the medical community informed about potential upcoming surges. Warnings of increases in mutations and expected surges are displayed on the website (and are also available through email alerts). It should be noted that this real-time analysis depends on the various health labs and medical facilities swiftly depositing the viral genome sequences into public databases such as GenBank. The shorter the lag time in depositing the sequences by the wider community, the more accurate and effective the prediction capabilities of our approach and the website will be.

Funding Statement

No external funding was received for this work.

Contributor Information

Pratul K Agarwal, Email: pratul.agarwal@okstate.edu.

Jameel Iqbal, DaVita Labs, United States.

Mone Zaidi, Icahn School of Medicine at Mount Sinai, United States.

Additional information

Competing interests

No competing interests declared.

Reviewing editor, eLife.

Founder and owner of Arium BioLabs LLC.

Author contributions

Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing.

Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Investigation, Methodology, Writing – review and editing.

Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing.

Resources, Data curation, Software, Formal analysis, Investigation, Visualization, Writing – review and editing.

Formal analysis, Investigation, Visualization, Writing – review and editing.

Conceptualization, Supervision, Investigation, Visualization, Methodology, Writing – review and editing.

Conceptualization, Data curation, Formal analysis, Supervision, Validation, Investigation, Methodology, Writing – original draft, Project administration, Writing – review and editing.

Additional files

MDAR checklist

elife-82980-mdarchecklist1.docx (100 KB, docx)

Data availability

All sequences used in this work are available from GenBank. The protocols used for analysis are described in the Materials and methods section.

References

Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet. Infectious Diseases. 2020;20:533–534. doi: 10.1016/S1473-3099(20)30120-1.
Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution. 1994;11:725–736. doi: 10.1093/oxfordjournals.molbev.a040153.
Kistler KE, Huddleston J, Bedford T. Rapid and parallel adaptive mutations in spike S1 drive clade success in SARS-CoV-2. Cell Host & Microbe. 2022;30:545–555. doi: 10.1016/j.chom.2022.03.018.
Lu S, Ye Q, Singh D, Cao Y, Diedrich JK, Yates JR, Villa E, Cleveland DW, Corbett KD. The SARS-CoV-2 nucleocapsid phosphoprotein forms mutually exclusive condensates with RNA and the membrane-associated M protein. Nature Communications. 2021;12:502. doi: 10.1038/s41467-020-20768-y.
Nadkarni PM, Parikh CR. An eUtils toolset and its use for creating a pipeline to link genomics and proteomics analyses to domain-specific biomedical literature. Journal of Clinical Bioinformatics. 2012;2:9. doi: 10.1186/2043-9113-2-9.
Platto S, Wang Y, Zhou J, Carafoli E. History of the COVID-19 pandemic: origin, explosion, worldwide spreading. Biochemical and Biophysical Research Communications. 2021;538:14–23. doi: 10.1016/j.bbrc.2020.10.087.
Sabahelzain MM, Hartigan-Go K, Larson HJ. The politics of COVID-19 vaccine confidence. Current Opinion in Immunology. 2021;71:92–96. doi: 10.1016/j.coi.2021.06.007.
Sievers F, Higgins DG. Clustal Omega. Current Protocols in Bioinformatics. 2014;48:3. doi: 10.1002/0471250953.bi0313s48.
Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research. 2006;34:W609–W612. doi: 10.1093/nar/gkl315.
Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y, Yuan M-L, Zhang Y-L, Dai F-H, Liu Y, Wang Q-M, Zheng J-J, Xu L, Holmes EC, Zhang Y-Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–269. doi: 10.1038/s41586-020-2008-3.
Xia X. Domains and functions of spike protein in SARS-CoV-2 in the context of vaccine design. Viruses. 2021;13:109. doi: 10.3390/v13010109.
Zhang Z, Li J, Zhao XQ, Wang J, Wong GKS, Yu J. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics, Proteomics & Bioinformatics. 2006;4:259–263. doi: 10.1016/S1672-0229(07)60007-2.
Zhang Z. KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences. Genomics, Proteomics & Bioinformatics. 2022;20:536–540. doi: 10.1016/j.gpb.2021.12.002.
Zheng M, Karki R, Williams EP, Yang D, Fitzpatrick E, Vogel P, Jonsson CB, Kanneganti TD. TLR2 senses the SARS-CoV-2 envelope protein to produce inflammatory cytokines. Nature Immunology. 2021;22:829–838. doi: 10.1038/s41590-021-00937-x.

eLife. doi: 10.7554/eLife.82980.sa0

Editor's evaluation

Jameel Iqbal

This paper details the creation and data behind the website http://pandemics.okstate.edu/covid19/. The authors explore if there is a cause and effect between the detection of unusually increased mutation activity in the genomic surveillance databases and subsequent near-term surges in SARS-CoV-2 case numbers.

eLife. doi: 10.7554/eLife.82980.sa1

Decision letter

Editor: Jameel Iqbal

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Future COVID19 surges prediction based on SARS-CoV-2 mutations surveillance" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Mone Zaidi as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) Update the manuscript to include recent prediction/performance data on the Omicron variants.

2) Please address in detail all the recommendations made by the two reviewers.

Reviewer #1 (Recommendations for the authors):

The following should be addressed:

The paper notes that the ratio of non-synonymous to synonymous mutations (ka/ks) which is typically used, did not provide clear trends.

Ideally, these would be plotted against the best model that the authors are proposing. Separate graphs fail to demonstrate where one model fails and another succeeds.

The paper has numerous typos and careful proofreading is warranted. For example, line 170 "Therefore, such increase is mutations serve" should read "in"

Line 182 "reported cases of infections. Apart from identifying mutations of concerns for in-depth scientific studies" should read "concern"

Line 183 "the website is intended to keep the medical community informed about the potential surges." should read "about potential" without "the".

Additionally, potentially off-putting text should be rephrased. For example, line 157 "improvement in COVID-19 fatality rates as well as political reasons. This situation raises concerns." should be re-phrased to eliminate politics as a conjecture unless there is a documented reference to such.

Reviewer #2 (Recommendations for the authors):

This work suggests an interesting approach to stay ahead of the virological case curve but is too preliminary. The manuscript itself doesn't make any firm conclusions as to which protein or set of proteins is the most predictive. The claim of 14 days is hard to verify from the data provided in Figure 1 and is likely restricted to the detection of the Omicron variant. Furthermore, the method does not do a good job of predicting the variants evolving from the Omicron variant, suggesting that the baseline comparison needs to be refreshed. Given that the authors' focus is on infectivity, perhaps a more targeted focus at receptor binding domains of the spike protein may provide a more robust method, similarly providing comparisons based on major locations of the world (e.g. US, Europe, South Africa, India) might provide cleaner data as it could detect the local emergence of strains. The authors should also discuss the usefulness of such an approach in light of increased population immunity and decline in testing, which will negate the advantage of sequencing over case detection.

eLife. 2023 Jan 19;12:e82980. doi: 10.7554/eLife.82980.sa2

Author response

Essential revisions:

1) Update the manuscript to include recent prediction/performance data on the Omicron variants.

We have updated the manuscript with the most up-to-date data available. The data is current up to 21st November 2022, including the Omicron variants. Further, we have included the predictions we made in June and September and their validation based on the increase in cases of infection.

2) Please address in detail all the recommendations made by the two reviewers.

We have made revisions as described in detail below.

Reviewer #1 (Recommendations for the authors):

The following should be addressed:

The paper notes that the ratio of non-synonymous to synonymous mutations (ka/ks) which is typically used, did not provide clear trends.

Ideally, these would be plotted against the best model that the authors are proposing. Separate graphs fail to demonstrate where one model fails and another succeeds.

Thanks for the suggestion. We have included these plots as Figure 1—figure supplement 4.

The paper has numerous typos and careful proofreading is warranted. For example, line 170 "Therefore, such increase is mutations serve" should read "in".

Done. Thanks for pointing this mistake out. We have carefully proofread the manuscript and corrected such mistakes and other errors.

Line 182 "reported cases of infections. Apart from identifying mutations of concerns for in-depth scientific studies" should read "concern"

Done.

Line 183 "the website is intended to keep the medical community informed about the potential surges." should read "about potential" without "the".

Done. Thank you again.

Additionally, potentially off-putting text should be rephrased. For example, line 157 "improvement in COVID-19 fatality rates as well as political reasons. This situation raises concerns." should be re-phrased to eliminate politics as a conjecture unless there is a documented reference to such.

Done. We have added an appropriate reference (Sabahelzain MM, Hartigan-Go K, Larson HJ. The politics of COVID-19 vaccine confidence. Current Opinion in Immunology. 2021;71:92–96). We hope that the reviewer finds our changes acceptable.

Reviewer #2 (Recommendations for the authors):

This work suggests an interesting approach to stay ahead of the virological case curve but is too preliminary. The manuscript itself doesn't make any firm conclusions as to which protein or set of proteins is the most predictive. The claim of 14 days is hard to verify from the data provided in Figure 1 and is likely restricted to the detection of the Omicron variant. Furthermore, the method does not do a good job of predicting the variants evolving from the Omicron variant, suggesting that the baseline comparison needs to be refreshed. Given that the authors' focus is on infectivity, perhaps a more targeted focus at receptor binding domains of the spike protein may provide a more robust method, similarly providing comparisons based on major locations of the world (e.g. US, Europe, South Africa, India) might provide cleaner data as it could detect the local emergence of strains.

We thank the reviewer for the very insightful feedback. Like the reviewer, we are extremely focused on the rigor, validation, and reproducibility of our approach. The reviewer raises a number of interesting points. Our responses (in sequential order) are provided below:

1. Regarding the approach being too preliminary: as mentioned in response to the other reviewer, our approach has been successful, as the prediction issued in September (while this manuscript was in review) was validated by a surge in the number of cases in several European countries. We would very much like to validate this approach more thoroughly right away; however, unlike lab studies, where concerns can be addressed by additional experiments, we depend on incoming data as the pandemic evolves. Our warnings from June and September have been validated, and we are confident that this approach and website will help the medical and research community. We will continue to refine the approach.

2. Conclusion about which set of proteins: Based on the data so far, we do not believe that a single protein or a few proteins will cause new surges (as discussed in the manuscript). Our analysis reveals that any protein with new mutations could cause surges. Therefore, our website presents data on all SARS-CoV-2 proteins. We are concerned that drawing conclusions from a limited set of proteins could be detrimental to the cause and could miss surges driven by other proteins.

3. 14 days: Please see the plots in response to Reviewer 1. We issued a warning on September 7, and cases started rising in mid-to-late September. We believe additional surges would give us more confidence.

4. Limited to the Omicron variant: We believe the reason for this is that many more sequences became available. It is not a limitation of the approach but rather a limitation of data availability.

5. Predicting the evolution of variants: This is a very interesting point. If we understood this comment correctly, the reviewer is asking whether we could predict the emergence of new variants. Although this is very much possible, predicting new variants (through mutational adaptations) was not the focus of the current study. We will keep this suggestion in mind for ongoing studies. However, if the comment concerned surge predictions for Omicron sub-variants, new data have now been included in the revised manuscript. We thank the reviewer for this suggestion.

6. Baseline comparison needs to be refreshed: If we understood correctly, the suggestion is to change the reference sequence from the Wuhan sequence to another appropriate sequence. We have been wondering about this ourselves (as noted in the original manuscript). However, the question is which one. The answer is not immediately clear to us, as multiple variants are prevalent across the world. We are keeping an eye on the literature, and if an obvious candidate emerges, we will shift to a new reference sequence in the future. In the meantime, we are also exploring the use of different reference sequences for different countries.

7. Receptor binding domains: As mentioned above, the approach is meant to be general and not tied to a particular protein. We are afraid that focusing on a single protein or domain will miss future outbreaks. We continue to list the more interesting mutations in all SARS-CoV-2 proteins in our warnings. If the data suggest that a single protein or domain is highly indicative of future surges, we will publish an update.

8. Different locations and countries: This is a great point! As per the reviewer’s suggestion, we have enabled the infrastructure on our website. We have listed the charts associated with the US on the website (under the section More Charts in the top navigation menu). However, we found that GenBank does not have a sufficient number of sequences from other countries to provide the resolution needed. Therefore, we are working on enabling real-time analysis of the GISAID data. However, we found that the sequences in GISAID vary extremely widely in quality. Since we are aiming for high-quality and reproducible studies (as both reviewers also suggest), we do not feel comfortable at this stage including the results for different countries in the main manuscript. The reviewer’s point is a great one, and we have built the infrastructure for geographic breakdown analysis; the results will be continually added to the project website once quality control is considered satisfactory.
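
To make the reference-sequence question in point 6 concrete, the following is a minimal sketch, not the pipeline used in this work, of how the apparent number of substitutions in a sample depends on which baseline it is compared against. The function name `count_substitutions` and the short sequence fragments are hypothetical and chosen only for illustration; the sequences are assumed to be already aligned to equal length.

```python
def count_substitutions(reference: str, sample: str) -> int:
    """Count positions where two pre-aligned protein sequences differ.

    Positions holding a gap ('-') or an unknown residue ('X') in either
    sequence are skipped (an assumption made for this illustration).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "X"}
    return sum(
        1
        for r, s in zip(reference, sample)
        if r != s and r not in skip and s not in skip
    )


# Toy aligned fragments (hypothetical, not real SARS-CoV-2 data).
original_ref = "MFVFLVLLPLVS"  # stands in for the original baseline
later_ref = "MFVFLVLLPLIS"     # a later baseline carrying one fixed change
sample = "MFVFFVLLPLIS"        # carries that fixed change plus one new one

vs_original = count_substitutions(original_ref, sample)
vs_later = count_substitutions(later_ref, sample)
print(vs_original, vs_later)
```

Against the older baseline the sample shows both the long-established change and the new one, whereas against the later baseline only the new change is counted. This is the trade-off behind refreshing the reference: a newer baseline isolates recent changes but is only meaningful where that baseline is actually the prevalent variant.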

The authors should also discuss the usefulness of such an approach in light of increased population immunity and decline in testing, which will negate the advantage of sequencing over case detection.

Done. This is another interesting point. We have added a new paragraph to the manuscript at the end of the Results section. Briefly, as widely discussed in the community, immunity across the population is developing. However, new variants keep emerging. Therefore, it is important to monitor new mutations in all proteins. The decrease in testing is not optimal. We are hopeful that our results will encourage the medical community and public policy makers to continue testing. In the meantime, we believe the ratio of sequences showing new mutations to sequences showing no new mutations for the same day (or the same week) will continue to be useful, even though the overall number of sequences being reported may go down.
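
The normalization argument above can be sketched as follows. This is an illustrative example under stated assumptions rather than the code behind the website: the function `mutated_fraction_by_day`, the record layout, and the dates and counts are all hypothetical. It reports the fraction of a day's sequences that carry a new mutation (rather than the mutated-to-unmutated ratio, to avoid division by zero), which conveys the same information.

```python
from collections import defaultdict


def mutated_fraction_by_day(records):
    """Return {day: fraction of that day's sequences with a new mutation}.

    `records` is an iterable of (collection_day, has_new_mutation) pairs,
    a hypothetical layout used only for this sketch.
    """
    totals = defaultdict(int)
    mutated = defaultdict(int)
    for day, has_new_mutation in records:
        totals[day] += 1
        if has_new_mutation:
            mutated[day] += 1
    return {day: mutated[day] / totals[day] for day in totals}


# Hypothetical counts: a high-volume day and a low-volume day in which
# the same share of sequences carries a new mutation.
records = [("2022-01-10", i < 100) for i in range(1000)]
records += [("2022-11-10", i < 10) for i in range(100)]

fractions = mutated_fraction_by_day(records)
print(fractions)
```

Although ten times fewer sequences are reported on the second day, the per-day fraction is the same, which is why a normalized measure can remain informative as sequencing volume declines. With very few sequences per day the fraction becomes noisy, so aggregating by week, as mentioned above, may be preferable.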

Once again we thank the reviewer for all the insights and great suggestions.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

MDAR checklist

elife-82980-mdarchecklist1.docx (100 KB, docx)

Data Availability Statement

All sequences used in this work are available from GenBank. The protocols used for analysis are described in the Materials and methods section.

[bib1] Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases. 2020;20:533–534. doi: 10.1016/S1473-3099(20)30120-1.

[bib2] Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution. 1994;11:725–736. doi: 10.1093/oxfordjournals.molbev.a040153.

[bib3] Kistler KE, Huddleston J, Bedford T. Rapid and parallel adaptive mutations in spike S1 drive clade success in SARS-CoV-2. Cell Host & Microbe. 2022;30:545–555. doi: 10.1016/j.chom.2022.03.018.

[bib4] Lu S, Ye Q, Singh D, Cao Y, Diedrich JK, Yates JR, Villa E, Cleveland DW, Corbett KD. The SARS-CoV-2 nucleocapsid phosphoprotein forms mutually exclusive condensates with RNA and the membrane-associated M protein. Nature Communications. 2021;12:502. doi: 10.1038/s41467-020-20768-y.

[bib5] Nadkarni PM, Parikh CR. An eutils toolset and its use for creating a pipeline to link genomics and proteomics analyses to domain-specific biomedical literature. Journal of Clinical Bioinformatics. 2012;2:9. doi: 10.1186/2043-9113-2-9.

[bib6] Platto S, Wang Y, Zhou J, Carafoli E. History of the COVID-19 pandemic: origin, explosion, worldwide spreading. Biochemical and Biophysical Research Communications. 2021;538:14–23. doi: 10.1016/j.bbrc.2020.10.087.

[bib7] Sabahelzain MM, Hartigan-Go K, Larson HJ. The politics of COVID-19 vaccine confidence. Current Opinion in Immunology. 2021;71:92–96. doi: 10.1016/j.coi.2021.06.007.

[bib8] Sievers F, Higgins DG. Clustal Omega. Current Protocols in Bioinformatics. 2014;48:3. doi: 10.1002/0471250953.bi0313s48.

[bib9] Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research. 2006;34:W609–W612. doi: 10.1093/nar/gkl315.

[bib10] Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y, Yuan M-L, Zhang Y-L, Dai F-H, Liu Y, Wang Q-M, Zheng J-J, Xu L, Holmes EC, Zhang Y-Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–269. doi: 10.1038/s41586-020-2008-3.

[bib11] Xia X. Domains and functions of spike protein in SARS-CoV-2 in the context of vaccine design. Viruses. 2021;13:109. doi: 10.3390/v13010109.

[bib12] Zhang Z, Li J, Zhao XQ, Wang J, Wong GKS, Yu J. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics, Proteomics & Bioinformatics. 2006;4:259–263. doi: 10.1016/S1672-0229(07)60007-2.

[bib13] Zhang Z. KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences. Genomics, Proteomics & Bioinformatics. 2022;20:536–540. doi: 10.1016/j.gpb.2021.12.002.

[bib14] Zheng M, Karki R, Williams EP, Yang D, Fitzpatrick E, Vogel P, Jonsson CB, Kanneganti TD. TLR2 senses the SARS-CoV-2 envelope protein to produce inflammatory cytokines. Nature Immunology. 2021;22:829–838. doi: 10.1038/s41590-021-00937-x.

Alignments and non-synonymous (K_a), synonymous (K_s) calculations