This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2022 Feb 16:2021.09.07.21263228. [Version 2] doi: 10.1101/2021.09.07.21263228

Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness

Fritz Obermeyer 1,2, Martin Jankowiak 1,2, Nikolaos Barkas 1, Stephen F Schaffner 1,3,4, Jesse D Pyle 1, Lonya Yurkovetskiy 5, Matteo Bosso 5, Daniel J Park 1, Mehrtash Babadi 1, Bronwyn L MacInnis 1,4,6, Jeremy Luban 1,5,6,7, Pardis C Sabeti 1,3,4,6,8,*, Jacob E Lemieux 1,9,*
PMCID: PMC8863165  PMID: 35194619

Abstract

Repeated emergence of SARS-CoV-2 variants with increased fitness necessitates rapid detection and characterization of new lineages. To address this need, we developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase fitness, including previously identified spike mutations and many non-spike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, identifies viral lineages of concern as they emerge, and prioritizes mutations of biological and public health concern for functional characterization.

One-sentence summary:

A Bayesian hierarchical model of all SARS-CoV-2 viral genomes predicts lineage fitness and identifies associated mutations.


The SARS-CoV-2 pandemic has been characterized by repeated waves of cases driven by the emergence of new lineages with higher fitness, where fitness encompasses any trait that affects the lineage’s growth, including its basic reproduction number (R0), ability to evade existing immunity, and generation time. Rapidly identifying such lineages as they emerge and accurately forecasting their dynamics is critical for guiding outbreak response. Doing so effectively would benefit from the ability to interrogate the entirety of the global SARS-CoV-2 genomic dataset. The large size (currently over 7.5 million virus genomes) and geographic and temporal variability of the available data present significant challenges that will only become greater as more viruses are sequenced. Current phylogenetic approaches are computationally inefficient on datasets with more than ~5000 samples and take days to run at that scale. Ad hoc methods to estimate the relative fitness of particular SARS-CoV-2 lineages are a computationally efficient alternative (1–3), but have typically relied on models in which one or two lineages of interest are compared to all others and do not capture the complex dynamics of multiple co-circulating lineages.

Furthermore, estimates of relative fitness based on lineage frequency data alone (2–4) do not take advantage of additional statistical power that can be gained from analyzing the independent appearance and growth of the same mutation in multiple lineages. Performing a mutation-based analysis of lineage prevalence has the additional advantage of identifying specific genetic determinants of a lineage’s phenotype, which is critically important both for understanding the biology of transmission and pathogenesis and for predicting the phenotype of new lineages. The SARS-CoV-2 pandemic has already been dominated by several genetic changes of functional and epidemiological importance, including the spike (S) D614G mutation that is associated with higher SARS-CoV-2 loads (5, 6). In addition, mutations found in Variants of Concern (VoC), such as S:N439R, S:N501Y, and S:E484K, have been linked, respectively, to increased transmissibility (7), enhanced binding to ACE2 (8), and antibody escape (9, 10). Despite these successes, identifying functionally important mutations in the context of a large background of genetic variants of little or no phenotypic consequence remains challenging.

We set out to formulate a principled approach to modeling the relative fitness of SARS-CoV-2 lineages, estimating their growth as a linear combination of the effects of individual mutations. We developed PyR0, a hierarchical Bayesian regression model that enables scalable analysis of the complete set of publicly available SARS-CoV-2 genomes, and that could be applied to any viral genomic dataset and to other phenotypes. The model, which is summarized in Figure 1A and described in detail in the supplemental note, avoids the complexity of full phylogenetic inference by first clustering genomes by genetic similarity (refining PANGO lineages (11)), and then estimating the incremental effect on growth rate of each of the most common amino acid changes on the lineages in which they appear. By regressing growth rate as a function of genome sequence, the model shares statistical strength among genetically similar lineages without explicitly relying on phylogeny. By modeling only the multinomial proportion of different lineages rather than the absolute number of samples for each lineage (4, 12), and by doing so within 14-day intervals in 1560 globally distributed geographic regions, the model achieves robustness to a number of sources of bias that affect all lineages, across regions, and over time, including differences in data collection and changes in transmission due to such factors as social behavior, public health policy, and vaccination.
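The regression structure described above can be written compactly. What follows is a simplified sketch rather than the full hierarchical model of the supplemental note: priors and any region-specific perturbations of the growth rates are omitted, the intercepts $\alpha_{pc}$ and per-cluster growth rates $\beta_c$ are notation assumed here, and the indices are read as $t$ for time bin, $p$ for region, $c$ for cluster, and $f$ for amino acid substitution.

```latex
\begin{aligned}
\beta_c &= \sum_f X_{cf}\,\Delta\log R_f, \\
y_{tp\cdot} &\sim \operatorname{Multinomial}\!\Big(\textstyle\sum_c y_{tpc},\;
  \operatorname{softmax}_c\big(\alpha_{pc} + t\,\beta_c\big)\Big).
\end{aligned}
```

Here $t$ is taken to be measured in units of the generation time, so that $\beta_c$ and $\Delta\log R_f$ share the same scale. Because the softmax is unchanged when a common term is added to every cluster in a given region and time bin, factors that affect all lineages equally drop out of the likelihood, which is the source of the robustness described above.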

Figure 1.

A. Overview of the PyR0 analysis pipeline. After clustering UShER’s mutation-annotated tree, sequence data are used to construct spatio-temporal lineage prevalence counts $y_{tpc}$ and amino acid substitution covariates $X_{cf}$. Pyro is used to fit a Bayesian multivariate logistic multinomial regression model to $y_{tpc}$ and $X_{cf}$.

B. Relative fitness versus date of lineage emergence. Circle size is proportional to cumulative case count inferred from lineage proportion estimates and confirmed case counts. Inset table lists the 10 fittest lineages inferred by the model. $R/R_A$ is the fold increase in relative fitness over the Wuhan (A) lineage, assuming a fixed generation time of 5.5 days.

We fit PyR0 to 6,466,300 SARS-CoV-2 genomes available on GISAID (13, 14) as of January 20, 2022, in a model that contained 1544 PANGO lineages and 2904 nonsynonymous mutations. The output of the model is a posterior distribution for the relative fitness (exponential growth rate) of each lineage and for the contribution to the fitness from each mutation. Fitting this large model is computationally challenging, so we used stochastic variational inference, an approximate inference method that reduced our task to solving a 75-million-dimensional optimization problem on a GPU. Inference was implemented in the Pyro (15) probabilistic programming framework (see Supplemental Materials). The trained model can be used to infer lineage fitness, predict the fitness of completely new lineages, forecast future lineage proportions, and estimate the effects of individual mutations on fitness.

The model’s lineage fitness estimates (Figure 1B) show a modest upward trend over time among all lineages, accompanied by numerous lineages with dramatically higher fitness. Sensitivity analyses revealed broad consistency of fitness estimates across spatial data subsets (Figure S1). The upward trend may in part reflect an upward bias caused by the lineage assignment process, as can be seen in simulation studies (Figure S2), but the high tail of the distribution exhibits elevated fitness values far in excess of this trend. The rate of increase in fitness was not constant between the emergence of the virus into human populations in late 2019 and early 2022. Rather, periods of rapid evolution in fitness occurred and heralded new waves of increase in case counts (Figure 1B and Figure 2CDE). The model correctly inferred BA.2 to have the highest fitness to date, 8.9-fold (95% CI, 8.6–9.2) higher than the original A lineage (Figure 1B inset). Similar fitness was estimated for other Omicron sub-lineages BA.1 and BA.1.1 (Figure 1B). These fitness estimates, obtained in mid-January 2022, predict that B.1.1.529 and its sublineages (collectively called Omicron in the WHO classification) will continue to displace other lineages, including the previously dominant Delta (Figure S3). While PANGO lineages facilitate communication by providing a stable nomenclature, we observed some PANGO lineages with multiple successive peaks in some regions, which could not be accounted for by a multivariate logistic growth model. We therefore algorithmically refined the 1544 PANGO lineages into 3000 finer clusters, and found that our model identified significant heterogeneity within some PANGO lineages (Figure S4). Notably, B.1.1 displayed the greatest variability among lineages, followed by B.1.
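To make the fold-increase figures concrete, the short sketch below converts a fold increase in relative fitness into a log-scale effect and a per-day growth-rate difference. It assumes, as the Figure 1B caption does, a fixed generation time of 5.5 days, and it assumes that relative fitness equals the exponential of the growth-rate difference multiplied by the generation time. The function names are illustrative and are not part of PyR0.

```python
import math

GENERATION_TIME_DAYS = 5.5  # fixed generation time assumed in Figure 1B


def fold_to_delta_log_r(fold):
    """Convert a fold increase R/R_A into a log-scale effect (per generation)."""
    return math.log(fold)


def daily_growth_advantage(fold, generation_time=GENERATION_TIME_DAYS):
    """Per-day difference in exponential growth rate implied by a fold increase."""
    return math.log(fold) / generation_time


ba2_fold = 8.9  # BA.2 estimate quoted in the text
delta_log_r = fold_to_delta_log_r(ba2_fold)
per_day = daily_growth_advantage(ba2_fold)
print(f"delta log R = {delta_log_r:.3f}, growth advantage = {per_day:.3f} per day")
```

Under these assumptions an 8.9-fold fitness increase corresponds to a log-scale effect of about 2.19 per generation, or roughly 0.40 per day.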

Figure 2.

A. Infectivity relative to WT of lentiviral vectors pseudotyped with the indicated Spike mutants. Target cells were HEK293T cells expressing ACE2 and TMPRSS2 transgenes. The genetic background of the Spike was Wuhan-Hu-1 bearing D614G. Red bars were significantly different from WT (adjusted p values shown). Black bars were not significantly different from WT.

B. For the 1701 SARS-CoV-2 clusters with at least one amino acid substitution in the RBD, we compare: i) the PyR0 prediction for the contribution to Δ log R from RBD substitutions only; to ii) antibody binding computed using the antibody-escape calculator in (17). The escape calculator is based on an intuitive non-linear model parameterized using deep mutational scanning data for 33 neutralizing antibodies elicited by SARS-CoV-2. PyR0 predictions exhibit high (Spearman) correlation with predictions from Greaney et al.

C-E. We dissect PyR0 Δ log R estimates into S-gene (C), RBD (D), and non-S-gene (E) contributions for 3000 SARS-CoV-2 clusters (blue dots). The horizontal axis corresponds to the date at which each cluster first emerged. Red squares denote the median Δ log R within each monthly bin. The increased importance of S-gene mutations (notably in the RBD) over non-S-gene mutations starting around November 2021 is apparent.

We found that the model would have provided early warning of the rise of VoCs had it been routinely applied to SARS-CoV-2 samples, highlighting the benefit of timely publication of genomic data. For example, PyR0 would have forecast the coming dominance of B.1.1.7 in late November 2020 (Figure S5A), while the first models forecasting its rapid rise were published in mid-December 2020 (16). Similar predictions would have been available for BA.1 by early December 2021 (Figure S5B, S6) and for AY.4 by May 2021 (Figure S5C). Likewise, the elevated fitness of BA.2 was identified by mid-December 2021 on the basis of 76 observed sequences (Figure S6). While variant-specific models were accurate and useful (2) in predicting the rise of these lineages, each modeling effort was specific to a particular lineage and geographic region; by contrast, PyR0’s global approach provides similar early detection while also offering automated, rapid, and unbiased consideration of all variants and lineages, together with ranking based on relative fitness. When we tested the model’s predictive ability (Figure S5), we found that forecasts were reliable for 1–2 months into the future, after which they tended to be disrupted by the emergence of a completely new strain (Table S1, Figure S7). Remarkably, the accuracy of forecasts typically stabilized within two weeks after the emergence of a new competitive lineage in a region (Figure S7).
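The forecasting idea can be illustrated with a minimal sketch of multinomial logistic growth: each lineage's log-proportion changes linearly in time at a rate set by its log relative fitness, and proportions are renormalized with a softmax. This is not the PyR0 implementation. The function names, the two-lineage scenario, and the 1.5-fold fitness advantage are hypothetical, and the 5.5-day generation time is the fixed value assumed elsewhere in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forecast_proportions(init_props, log_r, days, generation_time=5.5):
    """Project lineage proportions `days` ahead under multinomial logistic growth.

    init_props: current proportion of each lineage (positive, summing to 1)
    log_r: log relative fitness of each lineage, per generation
    """
    generations = days / generation_time
    logits = [math.log(p) + generations * r for p, r in zip(init_props, log_r)]
    return softmax(logits)


# Hypothetical scenario: an incumbent lineage at 99% and a newcomer at 1%
# whose relative fitness is 1.5-fold higher per generation.
props_now = [0.99, 0.01]
log_r = [0.0, math.log(1.5)]
for days in (0, 30, 60, 90):
    incumbent, newcomer = forecast_proportions(props_now, log_r, days)
    print(f"day {days:3d}: newcomer share = {newcomer:.3f}")
```

In this toy scenario the newcomer passes 50% between days 60 and 90. A lineage that emerges later with still higher fitness would invalidate such a projection, consistent with the observation that forecasts tend to be disrupted by completely new strains.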

By basing fitness estimates on the contributions of individual mutations, PyR0 can forecast the fitness of novel or hypothetical lineages using their mutational profiles alone. This is possible with SARS-CoV-2 because of the high rate of convergent evolution (Table 1, Figure S8), which allows the model to infer the fitness of new constellations of mutations based on the trajectories of other lineages in which they have previously emerged. This predictive capability is highly desirable from a public health standpoint because forecasts are available as soon as sequences from new lineages appear. To test the reliability of this kind of estimate, we fit leave-one-out estimators on subsets of the dataset with entire PANGO lineages removed (Figure S9). These estimators showed excellent agreement with estimators based on the observed behavior of the lineages, and they were also more accurate than naive phylogenetic estimators that assume the fitness of each new strain is equal to its parent lineage’s fitness (Pearson’s ρ = 0.983, after correcting for parent fitness, Figure S9). These results demonstrate the feasibility of this kind of estimate using the simplest possible linear-additive model, and provide a foundation for future research for more complex modeling that includes effects such as epistasis between mutations and migration across regions.
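Under a linear-additive model, the predicted log fitness of a new lineage is the sum of the log effects of the mutations it carries, which means the fold increases multiply. The sketch below illustrates this with three fold-increase values taken from Table 1. The helper name, the example mutational profile, and the placeholder substitution "X000Y" are hypothetical. Giving untabulated mutations zero effect is a simplifying assumption made only for this example, since the fitted model carries a coefficient for each of its 2904 mutations.

```python
import math

# Fold increase in fitness per substitution, copied from Table 1.
FOLD_INCREASE = {
    ("S", "H655Y"): 1.051,
    ("S", "T95I"): 1.046,
    ("S", "N764K"): 1.040,
}


def predict_delta_log_r(mutations, fold_table):
    """Sum per-mutation log effects; untabulated mutations contribute zero."""
    return sum(math.log(fold_table.get(m, 1.0)) for m in mutations)


# Hypothetical lineage carrying two tabulated substitutions and one
# placeholder substitution that is absent from the table.
profile = [("S", "H655Y"), ("S", "T95I"), ("S", "X000Y")]
delta = predict_delta_log_r(profile, FOLD_INCREASE)
predicted_fold = math.exp(delta)
print(f"predicted fold increase = {predicted_fold:.4f}")
```

Because the effects add on the log scale, the predicted fold increase for this profile is simply 1.051 × 1.046. Epistasis between mutations, which the text names as future work, would break this multiplicative rule.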

Table 1:

Amino acid substitutions most significantly associated with increased fitness. Significance is defined as posterior mean / posterior standard deviation. Fitness is per 5.5 days (estimated generation time of the Wuhan (A) lineage (1, 19)). Final column: number of PANGO lineages in which each substitution emerged independently.

Rank  Gene   Substitution  Fold Increase in Fitness  Number of Lineages
1     S      H655Y         1.051                     33
2     S      T95I          1.046                     30
3     ORF1a  P3395H        1.039                     5
4     S      N764K         1.040                     6
5     ORF1a  K856R         1.039                     2
6     S      S371L         1.041                     3
7     E      T9I           1.040                     5
8     S      Q954H         1.040                     5
9     ORF9b  P10S          1.039                     25
10    S      L981F         1.040                     2
11    N      P13L          1.040                     25
12    S      G339D         1.039                     4
13    S      S375F         1.040                     5
14    S      S477N         1.039                     47
15    S      N679K         1.040                     11
16    S      S373P         1.040                     5
17    M      Q19E          1.039                     5
18    S      D796Y         1.038                     11
19    S      N969K         1.040                     5
20    S      T547K         1.038                     3
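Table 1 is ordered by significance (posterior mean divided by posterior standard deviation) rather than by effect size, which is why the fold-increase column is not monotone in rank. The sketch below shows how such a ranking can be computed. The mutation names and posterior summaries are hypothetical and are not values from the fitted model.

```python
def rank_by_significance(posteriors):
    """Sort mutations by posterior mean / posterior standard deviation, descending."""
    scored = [(name, mean / sd) for name, (mean, sd) in posteriors.items() if sd > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical posterior summaries of the log-scale effect:
# name -> (posterior mean, posterior standard deviation).
posteriors = {
    "mut_b": (0.080, 0.020),   # larger effect, but uncertain
    "mut_a": (0.050, 0.002),   # moderate effect, tightly estimated
    "mut_c": (-0.010, 0.001),  # small negative effect, tightly estimated
}
ranking = rank_by_significance(posteriors)
for name, score in ranking:
    print(f"{name}: mean/sd = {score:.1f}")
```

In this example "mut_a" outranks "mut_b" despite its smaller effect, because its estimate is far more certain.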

Unbiased, genome-wide estimates of the effect of SARS-CoV-2 mutations on fitness also provide a powerful tool for better understanding the biology of fitness. Our model allowed us to estimate the contribution of 2904 amino acid substitutions (Figure 3A, Table 1) to lineage fitness and to rank them by inferred statistical significance (Figure S10). Cross-validation confirmed that these results replicate across different geographic regions (Figure S11). The highest concentrations of fitness-associated mutations were found in the S, N, and ORF1 polyprotein genes (ORF1a and ORF1b; Figures 3A,B, S12, S13). Using spatial autocorrelation as a measure of spatial structure, we found evidence of functional hotspots in the S, N, ORF7a, ORF3a, and ORF1a genes (Table S2). Within S, there were three hotspots of fitness-enhancing mutations, each within a defined functional region: the N-terminal domain, the receptor-binding domain (RBD), and the furin-cleavage site (Figure 3B). We assessed mutational enrichment in the top-ranked set of mutations and identified an enrichment for lysine-to-asparagine mutations in the S gene (Figure S14C). We visualized top-scoring mutations within atomic structures for the spike protein (Figure 3D,E), the nucleocapsid’s N-terminal domain (Figure 3F), the polymerase (Figure S15), and two proteases (Figure S16). Many of the top mutations in the S gene occurred in the RBD, making direct contacts with the ACE2 receptor, including K417N/T and E484K (Figure 3D,E). Two top-ranked mutations, T478K and S477N, occur in a flexible loop adjacent to the S-ACE2 interface (Figure 3E), suggesting that these mutations may affect the kinetics of receptor engagement and possibly viral entry. Other mutations occurred in regions proximal to essential enzymatic active sites of the viral replication (Figure S15) or protein processing (Figure S16) machinery.

Figure 3.

Manhattan plot of amino acid changes assessed in this study. A. Changes across the entire genome. B. Changes in the first 850 amino acids of S. In each of A-C the y axis shows effect size Δ log R, the estimated change in log relative fitness due to each amino acid change. The bottom three axes show the background density of all observed amino acid changes, the density of those associated with growth (weighted by |Δ log R|), and the ratio of the two. The top 55 amino acid changes are labeled. See Figure S13 for detailed views of S, N, ORF1a, and ORF1b. C. Changes in the first 250 amino acids of N. D. Structure of the spike-ACE2 complex (PDB: 7KNB). Spike subunits colored light blue, light orange, and gray. Top-ranked mutations are shown as red spheres. ACE2 is shown in magenta. E. Close-up view of the RBD interface. F. Top-ranked mutations in the N-terminal RNA-binding domain of N. Residues 44–180 of N (PDB: 7ACT) are shown in light blue. Amino acid positions corresponding to top mutations in this region are shown as red spheres. A 10-nt bound RNA is shown in gray.

We tested several of the high-scoring mutations in single-cycle infectivity assays as done previously (6), focusing on the RBD (Figure 2A). We found that while some individual mutations increased infectivity, on average high-scoring RBD mutations did not promote infectivity per se. We considered an alternate possibility that fitness of Spike mutations is driven by immune escape. Using RBD-aggregated mutations as a proxy for immune escape, we found that the fitness effect of these Spike mutations correlates well with antibody escape estimates from Greaney et al. (17) (Figure 2B). Together with the observed jump in fitness beginning in late 2021 (Figure 2C) associated with Spike mutations, but not mutations elsewhere in the genome (Figure 2E), these results suggest that immune escape is currently the dominant driver of fitness increases. In contrast to mutations in Spike, those in the serine-arginine rich region of N were linked to increased efficiency of SARS-CoV-2 genomic RNA packaging (18). Within ORF1, we found fitness-associated mutations across all viral enzymes, and clusters within additional non-structural proteins (nsps). The highest concentration of fitness-associated mutations is found in nsp4, nsp6, and nsp12–14 (Figure S12B, S13C,D), suggesting unexplored function at those sites. For example, nsp4 and nsp6 have roles in assembly of replication compartments, and substitutions in these regions may influence the kinetics of replication (see Supplemental Note 3). We note that while convergent evolution makes it possible to identify candidate functional mutations, observational data alone are insufficient to declare mutations as causal rather than merely correlated. For this reason, hits identified by our study require functional follow-up, and can be prioritized by our uncertainty-ranked list of important mutations.
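The comparison in Figure 2B rests on a Spearman rank correlation, which asks only whether two quantities rise and fall together, not whether they are linearly related. A minimal, dependency-free sketch of that statistic follows. The per-cluster values are hypothetical stand-ins for RBD fitness contributions and antibody-escape scores, not data from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-cluster values (not taken from the paper).
rbd_fitness = [0.02, 0.10, 0.05, 0.30, 0.18]
escape_score = [0.10, 0.40, 0.30, 0.90, 0.35]
rho = spearman(rbd_fitness, escape_score)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks enter the calculation, the statistic is insensitive to the non-linear form of the escape calculator's output.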

In summary, PyR0 provides an unbiased, automated approach for detecting viral lineages with increased fitness. By combining a model-based assessment of lineage fitness with absolute case counts, our model provides a global picture of the events of the first two years of the pandemic. Because it assesses the contribution of individual mutations and aggregates across all lineages and geographic regions, it can identify mutations and gene regions that likely increase fitness, and it can predict the relative fitness of new lineages based solely on viral sequence. Applied to the full set of publicly available SARS-CoV-2 genomes, it provides a principled, unbiased analysis of the mutations driving increased fitness of the virus, identifying experimentally established driver mutations in S and highlighting the key role of non-S mutations, particularly in N, ORF1b, and ORF1a, which have received relatively less research attention. By jointly estimating lineage and mutational fitness from millions of viral sequences across thousands of regions, PyR0 shares statistical strength across regions and mutations to yield mechanistic insight into viral fitness and enhance public health by forecasting lineage dynamics.

Supplementary Material

Supplement 1
media-1.tsv (314.2KB, tsv)
Supplement 2
media-2.tsv (520KB, tsv)

Acknowledgements:

We acknowledge crucial assistance in data preprocessing from Angie Hinrichs. We thank Trevor Bedford and Cornelius Roemer for visualizing the outputs of our model on nextstrain.org. We acknowledge helpful discussions and feedback from Du Phan, William Hanage, Christopher Tomkins-Tinch, Shira Weingarten-Gabbay, Katie Siddle, Sagar Gosai, Steven Reilly, Eli Bingham, Mehrtash Babadi, Holly Soutter, Debora Marks, Noor Youssef, Sarah Gurev, and Nicole Thadani. We gratefully acknowledge the authors from the originating laboratories and the submitting laboratories, who generated and shared via GISAID genetic sequence data on which this research is based.

Funding:

This work was sponsored by the U.S. Centers for Disease Control and Prevention (BAA), as well as support from the Doris Duke Charitable Foundation (J.E.L.), the Howard Hughes Medical Institute (P.C.S.), the National Institute of Allergy and Infectious Diseases R37AI147868 (J.L.), and the Evergrande COVID-19 Response Fund Award from the Massachusetts Consortium on Pathogen Readiness (J.L.).

Footnotes

Authors have no competing interests.

Data and materials availability:

We gratefully acknowledge all data contributors, i.e. the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID initiative (13) on which this research is based. A total of 6,466,300 submissions are included in this study. A complete list of 6.4 million accession numbers is included as Data (S3).

References and Notes:

  • 1.Davies N. G., Abbott S., Barnard R. C., Jarvis C. I., Kucharski A. J., Munday J. D., Pearson C. A. B., Russell W., Tully D. C., Washburne A. D., Wenseleers T., Gimma A., Waites W., Wong K. L. M., van Zandvoort K., Silverman J. D., CMMID COVID-19 Working Group, COVID-19 Genomics UK (COG-UK) Consortium, Diaz-Ordaz K., Keogh R., Eggo R. M., Funk S., Jit M., Atkins K. E., Edmunds W. J., Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England. Science. 372 (2021), doi: 10.1126/science.abg3055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Volz E., Mishra S., Chand M., Barrett J. C., Johnson R., Geidelberg L., Hinsley W. R., Laydon D. J., Dabrera G., O’Toole Á., Others, Assessing transmissibility of SARS-CoV-2 lineage B. 1.1. 7 in England. Nature, 1–17 (2021). [DOI] [PubMed] [Google Scholar]
  • 3.Stefanelli P., Trentini F., Guzzetta G., Marziano V., Mammone A., Poletti P., Grané C. M., Manica M., del Manso M., Andrianou X., Others, Co-circulation of SARS-CoV-2 variants B. 1.1. 7 and P. 1. medRxiv (2021) (available at https://www.medrxiv.org/content/10.1101/2021.04.06.21254923v1.abstract). [Google Scholar]
  • 4.Vöhringer H. S., Sanderson T., Sinnott M., De Maio N., Nguyen T., Goater R., Schwach F., Harrison I., Hellewell J., Ariani C., Gonçalves S., Jackson D., Johnston I., Jung A. W., Saint C., Sillitoe J., Suciu M., Goldman N., Birney E., Funk S., Volz E., Kwiatkowski D., Chand M., Martincorena I., Barrett J. C., Gerstung M., The Wellcome Sanger Institute Covid-19 Surveillance Team, The COVID-19 Genomics UK (COG-UK) Consortium, Genomic reconstruction of the SARS-CoV-2 epidemic across England from September 2020 to May 2021. bioRxiv (2021),, doi: 10.1101/2021.05.22.21257633. [DOI] [Google Scholar]
  • 5.Korber B., Fischer W. M., Gnanakaran S., Yoon H., Theiler J., Abfalterer W., Hengartner N., Giorgi E. E., Bhattacharya T., Foley B., Hastie K. M., Parker, Partridge D. G., Evans C. M., Freeman T. M., de Silva T. I., McDanal C., Perez L. G., Tang H., Moon-Walker A., Whelan S. P., LaBranche C. C., Saphire E. O., Montefiori D. C., Angyal A., Brown R. L., Carrilero L., Green L. R., Groves D. C., Johnson K. J., Keeley A. J., Lindsey B. B., Parsons P. J., Raza M., Rowland-Jones S., Smith N., Tucker R. M., Wang D., Wyles M. D., Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell (2020), doi: 10.1016/j.cell.2020.06.043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Yurkovetskiy L., Wang X., Pascal K. E., Tomkins-Tinch C., Nyalile T. P., Wang Y., Baum A., Diehl W. E., Dauphin A., Carbone C., Veinotte K., Egri S. B., Schaffner S. F., Lemieux J. E., Munro J. B., Rafique A., Barve A., Sabeti P. C., Kyratsous C. A., Dudkina N. V., Shen K., Luban J., Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant. Cell. 183, 739–751.e8 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Deng X., Garcia-Knight M. A., Khalid M. M., Servellita V., Wang C., Morris M. K., Sotomayor-González A., Glasner D. R., Reyes K. R., Gliwa A. S., Reddy N. P., Sanchez San Martin C., Federman S., Cheng J., Balcerek J., Taylor J., Streithorst J. A., Miller S., Sreekumar B., Chen P.-Y., Schulze-Gahmen U., Taha T. Y., Hayashi J. M., Simoneau C. R., Kumar G. R., McMahon S., Lidsky P. V., Xiao Y., Hemarajata P., Green N. M., Espinosa A., Kath C., Haw M., Bell J., Hacker J. K., Hanson C., Wadford D. A., Anaya C., Ferguson D., Frankino P. A., Shivram H., Lareau L. F., Wyman S. K., Ott M., Andino R., Chiu C. Y., Transmission, infectivity, and neutralization of a spike L452R SARS-CoV-2 variant. Cell. 184, 3426–3437.e8 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Starr T. N., Greaney A. J., Hilton S. K., Ellis D., Crawford K. H. D., Dingens A. S., Navarro M. J., Bowen J. E., Tortorici M. A., Walls A. C., King N. P., Veesler D., Bloom J. D., Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding. Cell. 182, 1295–1310.e20 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Choi B., Choudhary M. C., Regan J., Sparks J. A., Padera R. F., Qiu X., Solomon I. H., Kuo H.-H., Boucau J., Bowman K., Adhikari U. D., Winkler M. L., Mueller A. A., Hsu T. Y.-T., Desjardins M., Baden L. R., Chan B. T., Walker B. D., Lichterfeld M., Brigl M., Kwon D. S., Kanjilal S., Richardson E. T., Jonsson A. H., Alter G., Barczak A. K., Hanage W. P., Yu X. G., Gaiha G. D., Seaman M. S., Cernadas M., Li J. Z., Persistence and Evolution of SARS-CoV-2 in an Immunocompromised Host. N. Engl. J. Med. 383, 2291–2293 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Greaney A. J., Starr T. N., Gilchuk P., Zost S. J., Binshtein E., Loes A. N., Hilton S. K., Huddleston J., Eguia R., Crawford K. H. D., Dingens A. S., Nargi R. S., Sutton R. E., Suryadevara N., Rothlauf P. W., Liu Z., Whelan S. P. J., Carnahan R. H., Crowe J. E. Jr, Bloom J. D., Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe. 29, 44–57.e9 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Rambaut A., Holmes E. C., O’Toole Á., Hill V., McCrone J. T., Ruis C., du Plessis L., Pybus O. G., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 5, 1403–1407 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Campbell F., Archer B., Laurenson-Schafer H., Jinnai Y., Konings F., Batra N., Pavlin B., Vandemaele K., Van Kerkhove M. D., Jombart T., Morgan O., le O. de Waroux Polain, Increased transmissibility and global spread of SARS-CoV-2 variants of concern as at June 2021. Euro Surveill. 26 (2021), doi: 10.2807/1560-7917.ES.2021.26.24.2100509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.GISAID Initiative and global contributors, EpiCoV(TM) human coronavirus 2019 database. GISAID (2020), (available at https://gisaid.org). [Google Scholar]
  • 14.Elbe S., Buckland-Merrett G., Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 1, 33–46 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Bingham E., Chen J. P., Jankowiak M., Obermeyer F., Pradhan N., Karaletsos T., Singh R., Szerlip P., Horsfall P., Goodman N. D., Pyro: Deep universal probabilistic programming. J. Mach. Learn. Res. 20, 973–978 (2019). [Google Scholar]
  • 16.Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Virological (2020), (available at https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563). [Google Scholar]
  • 17.Greaney A. J., Starr T. N., Bloom J. D., An antibody-escape calculator for mutations to the SARS-CoV-2 receptor-binding domain. bioRxiv (2021), doi: 10.1101/2021.12.04.471236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Syed A. M., Taha T. Y., Tabata T., Chen I. P., Ciling A., Khalid M. M., Sreekumar B., Chen P.-Y., Hayashi J. M., Soczek K. M., Ott M., Doudna J. A., Rapid assessment of SARS-CoV-2–evolved variants using virus-like particles. Science. 374, 1626–1632 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ferretti L., Ledda A., Wymant C., Zhao L., Ledda V., Abeler-Dörner L., Kendall M., Nurtay A., Cheng H.-Y., Ng T.-C., Lin H.-H., Hinch R., Masel J., Kilpatrick A. M., Fraser C., The timing of COVID-19 transmission. bioRxiv (2020),, doi: 10.1101/2020.09.04.20188516. [DOI] [Google Scholar]
  • 20.McBroome J., Thornlow B., Hinrichs A. S., Kramer A., De Maio N., Goldman N., Haussler D., Corbett-Detig R., Turakhia Y., A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Mol. Biol. Evol. (2021), doi: 10.1093/molbev/msab264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Turakhia Y., Thornlow B., Hinrichs A. S., De Maio N., Gozashti L., Lanfear R., Haussler D., Corbett-Detig R., Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat. Genet. 53, 809–816 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Nersisyan S., Zhiyanov A., Shkurnikov M., Tonevitsky A., T-CoV: a comprehensive portal of HLA-peptide interactions affected by SARS-CoV-2 mutations. bioRxiv (2021), p. 2021.07.06.451227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Hopf T. A., Schärfe C. P. I., Rodrigues J. P. G. L. M., Green A. G., Kohlbacher O., Sander C., Bonvin A. M. J. J., Marks D. S., Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife. 3 (2014), doi: 10.7554/eLife.03430. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Frazer J., Notin P., Dias M., Gomez A., M in J. K., Brock K., Gal Y., Marks D. S., Disease variant prediction with deep generative models of evolutionary data. Nature. 599, 91–95 (2021). [DOI] [PubMed] [Google Scholar]
  • 25.Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L., Lerer A., Automatic differentiation in PyTorch (2017), (available at https://openreview.net/pdf?id=BJJsrmfCZ).
  • 26.Gorinova M., Moore D., Hoffman M., in Proceedings of the 37th International Conference on Machine Learning, Iii H. D., Singh A., Eds. (PMLR, 2020), vol. 119 of Proceedings of Machine Learning Research, pp. 3648–3657. [Google Scholar]
  • 27.Neal R. M., Slice sampling. The Annals of Statistics. 31 (2003),, doi: 10.1214/aos/1056562461. [DOI] [Google Scholar]
  • 28.Kingma D. P., Ba J., Adam: A Method for Stochastic Optimization. arXiv [cs.LG] (2014), (available at http://arxiv.org/abs/1412.6980).
  • 29.Cappello L., Kim J., Liu S., Palacios J. A., Statistical Challenges in Tracking the Evolution of SARS-CoV-2. arXiv [stat.AP] (2021), (available at http://arxiv.org/abs/2108.13362).
  • 30.Lin A. E., Diehl W. E., Cai Y., Finch C. L., Akusobi C., Kirchdoerfer R. N., Bollinger L., Schaffner S. F., Brown E. A., Saphire E. O., Andersen K. G., Kuhn J. H., Luban J., Sabeti P. C., Reporter Assays for Ebola Virus Nucleoprotein Oligomerization, Virion-Like Particle Budding, and Minigenome Activity Reveal the Importance of Nucleoprotein Amino Acid Position 111. Viruses. 12 (2020), doi: 10.3390/v12010105.
  • 31.Syed A. M., Taha T. Y., Khalid M. M., Tabata T., Chen I. P., Sreekumar B., Chen P.-Y., Hayashi J. M., Soczek K. M., Ott M., Doudna J. A., Rapid assessment of SARS-CoV-2 evolved variants using virus-like particles. bioRxiv (2021), p. 2021.08.05.455082.
  • 32.Angelini M. M., Akhlaghpour M., Neuman B. W., Buchmeier M. J., Severe Acute Respiratory Syndrome Coronavirus Nonstructural Proteins 3, 4, and 6 Induce Double-Membrane Vesicles. mBio. 4 (2013), doi: 10.1128/mbio.00524-13.
  • 33.Graham R. L., Sims A. C., Brockway S. M., Baric R. S., Denison M. R., The nsp2 replicase proteins of murine hepatitis virus and severe acute respiratory syndrome coronavirus are dispensable for viral replication. J. Virol. 79, 13399–13411 (2005).
  • 34.Jungreis I., Sealfon R., Kellis M., SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat. Commun. 12, 2642 (2021).
  • 35.Islam M. R., Hoque M. N., Rahman M. S., Alam A. S. M. R. U., Akther M., Puspo J. A., Akter S., Sultana M., Crandall K. A., Hossain M. A., Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity. Sci. Rep. 10, 14004 (2020).
  • 36.Cornillez-Ty C. T., Liao L., Yates J. R. 3rd, Kuhn P., Buchmeier M. J., Severe acute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complex involved in mitochondrial biogenesis and intracellular signaling. J. Virol. 83, 10314–10318 (2009).
  • 37.Gupta M., Azumaya C. M., Moritz M., Pourmal S., Diallo A., Merz G. E., Jang G., Bouhaddou M., Fossati A., Brilot A. F., Diwanji D., Hernandez E., Herrera N., Kratochvil H. T., Lam V. L., Li F., Li Y., Nguyen H. C., Nowotny C., Owens T. W., Peters J. K., Rizo A. N., Schulze-Gahmen U., Smith A. M., Young I. D., Yu Z., Asarnow D., Billesbølle C., Campbell M. G., Chen J., Chen K.-H., Chio U. S., Dickinson M. S., Doan L., Jin M., Kim K., Li J., Li Y.-L., Linossi E., Liu Y., Lo M., Lopez J., Lopez K. E., Mancino A., Moss F. R., Paul M. D., Pawar K. I., Pelin A., Pospiech T. H., Puchades C., Remesh S. G., Safari M., Schaefer K., Sun M., Tabios M. C., Thwin A. C., Titus E. W., Trenker R., Tse E., Tsui T. K. M., Wang F., Zhang K., Zhang Y., Zhao J., Zhou F., Zhou Y., Zuliani-Alvarez L., QCRG Structural Biology Consortium, Agard D. A., Cheng Y., Fraser J. S., Jura N., Kortemme T., Manglik A., Southworth D. R., Stroud R. M., Swaney D. L., Krogan N. J., Frost A., Rosenberg O. S., Verba K. A., CryoEM and AI reveal a structure of SARS-CoV-2 Nsp2, a multifunctional protein involved in key host processes. bioRxiv (2021), doi: 10.1101/2021.05.10.443524.
  • 38.Jin Z., Du X., Xu Y., Deng Y., Liu M., Zhao Y., Zhang B., Li X., Zhang L., Peng C., Duan Y., Yu J., Wang L., Yang K., Liu F., Jiang R., Yang X., You T., Liu X., Yang X., Bai F., Liu H., Liu X., Guddat L. W., Xu W., Xiao G., Qin C., Shi Z., Jiang H., Rao Z., Yang H., Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature. 582, 289–293 (2020).
  • 39.Osipiuk J., Azizi S.-A., Dvorkin S., Endres M., Jedrzejczak R., Jones K. A., Kang S., Kathayat R. S., Kim Y., Lisnyak V. G., Maki S. L., Nicolaescu V., Taylor C. A., Tesar C., Zhang Y.-A., Zhou Z., Randall G., Michalska K., Snyder S. A., Dickinson B. C., Joachimiak A., Structure of papain-like protease from SARS-CoV-2 and its complexes with non-covalent inhibitors. Nat. Commun. 12, 743 (2021).
  • 40.Hillen H. S., Kokic G., Farnung L., Dienemann C., Tegunov D., Cramer P., Structure of replicating SARS-CoV-2 polymerase. Nature. 584, 154–156 (2020).
  • 41.Yan L., Ge J., Zheng L., Zhang Y., Gao Y., Wang T., Huang Y., Yang Y., Gao S., Li M., Liu Z., Wang H., Li Y., Chen Y., Guddat L. W., Wang Q., Rao Z., Lou Z., Cryo-EM Structure of an Extended SARS-CoV-2 Replication and Transcription Complex Reveals an Intermediate State in Cap Synthesis. Cell. 184, 184–193.e10 (2021).
  • 42.Chen J., Malone B., Llewellyn E., Grasso M., Shelton P. M. M., Olinares P. D. B., Maruthi K., Eng E. T., Vatandaslar H., Chait B. T., Kapoor T. M., Darst S. A., Campbell E. A., Structural Basis for Helicase-Polymerase Coupling in the SARS-CoV-2 Replication-Transcription Complex. Cell. 182, 1560–1573.e13 (2020).
  • 43.Chen Y., Cai H., Pan J., Xiang N., Tien P., Ahola T., Guo D., Functional screen reveals SARS coronavirus nonstructural protein nsp14 as a novel cap N7 methyltransferase. Proc. Natl. Acad. Sci. U. S. A. 106, 3484–3489 (2009).
  • 44.Liu C., Shi W., Becker S. T., Schatz D. G., Liu B., Yang Y., Structural basis of mismatch recognition by a SARS-CoV-2 proofreading enzyme. Science (2021), doi: 10.1126/science.abi9310.
  • 45.Huang Y., Yang C., Xu X.-F., Xu W., Liu S.-W., Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta Pharmacol. Sin. 41, 1141–1149 (2020).
  • 46.Cubuk J., Alston J. J., Incicco J. J., Singh S., Stuchell-Brereton M. D., Ward M. D., Zimmerman M. I., Vithani N., Griffith D., Wagoner J. A., Bowman G. R., Hall K. B., Soranno A., Holehouse A. S., The SARS-CoV-2 nucleocapsid protein is dynamic, disordered, and phase separates with RNA. Nat. Commun. 12, 1936 (2021).
  • 47.Chen Z., Pei D., Jiang L., Song Y., Wang J., Wang H., Zhou D., Zhai J., Du Z., Li B., Qiu M., Han Y., Guo Z., Yang R., Antigenicity analysis of different regions of the severe acute respiratory syndrome coronavirus nucleocapsid protein. Clin. Chem. 50, 988–995 (2004).

Associated Data

Supplementary Materials

Supplement 1
media-1.tsv (314.2KB, tsv)
Supplement 2
media-2.tsv (520KB, tsv)

Data Availability Statement

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID initiative (13), on which this research is based. A total of 6,466,300 submissions are included in this study. A complete list of the 6.4 million accession numbers is included as Data (S3).


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints