Simple epidemiologic metrics can accurately predict which mutations in the SARS-CoV-2 genome will increase in frequency over the coming months.
Abstract
SARS-CoV-2 evolution threatens vaccine- and natural infection-derived immunity as well as the efficacy of therapeutic antibodies. To improve public health preparedness, we sought to predict which existing amino acid mutations in SARS-CoV-2 might contribute to future variants of concern. We tested the predictive value of features comprising epidemiology, evolution, immunology, and neural network-based protein sequence modeling, and identified primary biological drivers of SARS-CoV-2 intra-pandemic evolution. We found evidence that ACE2-mediated transmissibility and resistance to population-level host immunity have each waxed and waned as primary drivers of SARS-CoV-2 evolution over time. We retroactively identified with high accuracy (area under the receiver operator characteristic curve, AUROC=0.92-0.97) mutations that will spread, at up to four months in advance, across different phases of the pandemic. The behavior of the model was consistent with a plausible causal structure wherein epidemiological covariates combine the effects of diverse and shifting drivers of viral fitness. We applied our model to forecast mutations that will spread in the future and characterize how these mutations affect the binding of therapeutic antibodies. These findings demonstrate that it is possible to forecast the driver mutations that could appear in emerging SARS-CoV-2 variants of concern. We validate this result against Omicron, showing elevated predictive scores for its component mutations prior to emergence, and rapid score increase across daily forecasts during emergence. This modeling approach may be applied to any rapidly evolving pathogen with sufficiently dense genomic surveillance data, such as influenza, and unknown future pandemic viruses.
INTRODUCTION
SARS-CoV-2 evolution presents an ongoing challenge to public health. Tens of thousands of mutations have arisen in the SARS-CoV-2 genome as the pandemic has progressed. Understanding the relative importance of mutations in viral proteins, particularly those of relevance for antiviral immunity, is key to allocating preparedness efforts. Mutations in the viral Spike protein have received particular attention because Spike is the target of antibody-mediated immunity and is the primary antigen in current vaccines (1). As of December 1st, 2021, there are 10,381 distinct amino acid substitutions, insertions, or deletions in Spike sequences from the GISAID database (2). These mutations occur at all but one position in the protein, in different combinations, creating over 160,000 unique Spike protein sequences. A small subset of these mutations are components of “Variants Being Monitored” (VBMs), “Variants of Interest” (VOIs) or “Variants of Concern” (VOCs), as classified by the United States Centers for Disease Control and Prevention (CDC) (3). The distinction between VOIs and the higher alert VOCs is whether a negative clinical impact is suspected or confirmed. VBMs are variants that would be classified as VOCs if not for low prevalence.
Early statistical and algorithmic identification of the key Spike amino acid changes contributing to future putative VBM/VOI/VOCs is of clear benefit to public health strategy. Such predictions could enhance the identification of vulnerabilities for antibody-based therapeutics, vaccines, and diagnostics. Predicting future successful mutations would extend the time available to develop proactive responses at earlier stages of spread. It would also complement existing forecasting efforts which seek to predict overall SARS-CoV-2 incidence, hospitalizations, and death over time (4–6). Focus on the success of individual mutations rather than genomic variants also facilitates longer-term forecasting. The combinatorics of modeling genomic variants quickly become intractable. As a toy example, for a protein of length 1200, there are over 250 million distinct sequences that differ by only two amino acid changes. By focusing on amino acid success from the outset, we rely on common and largely correct assumptions about independence between mutations, and are able to leverage more information per mutation, thus extending the timeline on which evolution can be meaningfully forecast.
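The toy example above can be checked with a short calculation. The sketch below assumes a 20-letter amino acid alphabet and counts substitutions only (no insertions or deletions); the function name is illustrative and not part of the study's code.

```python
from math import comb

def count_k_substitution_variants(length: int, k: int, alphabet_size: int = 20) -> int:
    """Count sequences that differ from a reference at exactly k positions.

    Assumes substitutions only and a fixed alphabet (20 amino acids by default).
    """
    # Choose which k positions change, then pick one of the
    # (alphabet_size - 1) alternative residues at each chosen position.
    return comb(length, k) * (alphabet_size - 1) ** k

# Double mutants of a 1200-residue protein.
double_mutants = count_k_substitution_variants(1200, 2)
print(f"{double_mutants:,}")
```

Raising k to 3 multiplies the count by several thousand again, which is why modeling full haplotypes rather than individual mutations becomes intractable so quickly.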
There is a robust and expanding set of analyses characterizing the features of amino acid mutations of SARS-CoV-2. Studies have identified the emergence of new variants with altered biological or antigenic properties (7–9) and characterized them using low-throughput methods (10, 11). Deep mutational scanning elucidates the in vitro biological effects of all single site amino acid substitutions in a fixed genomic backbone (12–14). Others have characterized the distribution of immunodominant sites across the viral proteome (15, 16) and estimated the fitness of viral sequences using neural natural language processing (NLP) applied to protein sequences (17).
We sought here to build upon these data and approaches to forecast the mutations that will spread from season to season. We hypothesized that this would also allow us to identify the dominant biological drivers of viral evolution over short-term timescales. These two goals are mutually reinforcing: the features that are most useful for forecasting can be inferred as measuring viral fitness. Conversely, a better understanding of evolutionary dynamics can make modeling more accurate and robust. To accomplish these goals we described patterns of rapid mutation spread both globally and within the United States and elucidated the relative predictive importance of amino acid mutational features comprising immunity, transmissibility, evolution, language model, and epidemiology. Next, we utilized data from previous infection waves to train and back-test a forecasting model that anticipates future spreading mutations and illustrated how forecasted mutations could differentially affect clinical antibodies. We extended this analysis to forecast mutations, specifically on the Delta lineage, across the whole SARS-CoV-2 proteome. As the number of Omicron sequences increases, such a targeted analysis could be repeated for that lineage as well.
RESULTS
Biological and epidemiological features of SARS-CoV-2 mutations that spread
For the purpose of developing the models, we defined “spreading” amino acid mutations as a specified fold change in frequency across multiple countries, comparing time windows before and after a chosen date (Fig. 1). These mutations could be substitutions, insertions, or deletions. Within each country, we tabulated the number of GISAID sequences (2) containing the mutation being modeled, versus those that did not, in the three months before and after a date of interest (Fig. 1A). For each mutation, we calculated a fold change and an associated p-value adjusted for multiple comparisons. Mutations with a significant Benjamini-Hochberg adjusted p-value (q < 0.05) from any country were retained. This set was further filtered using the following empirical criteria, all of which had to be met to define a mutation as spreading: a fold change (FC) from baseline of at least 10.0 in at least one country; a FC of at least 2.0 across three or more countries; and a minimum global frequency of 0.1% in the later time window. We highlight that the sequences used to calculate fold change from baseline and minimum frequency were all collected after those used for model training or feature calculation, with no overlap or interleaving between the two datasets. Performance was assessed over time by repeating this analysis in shifting or sliding time windows covering the whole data collection period, which corresponded to the three months prior to the desired forecast start date (Fig. 1B). Assessed data windows ranged from January-March 2020 to June-August 2021.
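The filtering criteria for a spreading mutation can be sketched as a small predicate. This is a minimal illustration, not the study's code: the `CountryResult` type and `is_spreading` function are hypothetical names, and the sketch assumes the fold-change filters are evaluated over all countries with data rather than only over countries with a significant q-value.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CountryResult:
    country: str
    fold_change: float  # frequency after the date divided by frequency before
    q_value: float      # Benjamini-Hochberg adjusted p-value for that country

def is_spreading(results: List[CountryResult],
                 global_freq_after: float,
                 q_max: float = 0.05,
                 fc_strong: float = 10.0,
                 fc_broad: float = 2.0,
                 min_countries: int = 3,
                 min_global_freq: float = 0.001) -> bool:
    """Apply the four criteria from the text; all must hold."""
    # Significant adjusted p-value in at least one country.
    significant = any(r.q_value < q_max for r in results)
    # Fold change of at least 10 in at least one country.
    strong = any(r.fold_change >= fc_strong for r in results)
    # Fold change of at least 2 in three or more countries.
    broad = sum(r.fold_change >= fc_broad for r in results) >= min_countries
    # Global frequency of at least 0.1% in the later window.
    frequent = global_freq_after >= min_global_freq
    return significant and strong and broad and frequent

example = [
    CountryResult("A", 12.0, 0.001),
    CountryResult("B", 2.5, 0.20),
    CountryResult("C", 3.0, 0.04),
]
print(is_spreading(example, global_freq_after=0.002))
```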
This definition of spreading mutations captured the expansion of VOI/VOCs globally (fig. S1A) as well as the growth of a number of lesser-known mutations (fig. S1B). Implicit in a mutation-centric approach to forecasting is the assumption that mutations accumulate in a manner that is approximately independent, or at least that their interactions can be averaged out when looking across all genomic backgrounds. To test for significant violations of this implicit assumption, we tested for linkage between all pairs of spreading mutations (fig. S2). Enrichment for co-occurrence between pairs of mutations at a rate of greater than 8-fold was observed for fewer than 5% of mutation pairs. Thus, we find that (pairwise) independence between mutations is a useful and approximately correct simplifying assumption.
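The co-occurrence enrichment used to test the independence assumption can be illustrated as the ratio of observed to expected joint prevalence. The sketch below uses hypothetical function names and made-up counts; the log of this ratio is the pointwise mutual information described in the statistical methods (the choice of log base 2 is an assumption).

```python
from math import log2

def cooccurrence_enrichment(n_total: int, n_a: int, n_b: int, n_ab: int) -> float:
    """Observed joint prevalence divided by the prevalence expected under independence.

    n_total: number of sequences; n_a, n_b: sequences carrying mutation A or B;
    n_ab: sequences carrying both.
    """
    if n_a == 0 or n_b == 0:
        raise ValueError("both mutations must be observed at least once")
    # (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total)), simplified.
    return n_ab * n_total / (n_a * n_b)

def pointwise_mutual_information(n_total: int, n_a: int, n_b: int, n_ab: int) -> float:
    """Log ratio of observed to expected joint prevalence (base 2 assumed here)."""
    enrichment = cooccurrence_enrichment(n_total, n_a, n_b, n_ab)
    if enrichment == 0:
        return float("-inf")  # the pair never co-occurs
    return log2(enrichment)

# Made-up counts: A in 100 of 1000 sequences, B in 50, both in 40.
print(cooccurrence_enrichment(1000, 100, 50, 40))
print(pointwise_mutual_information(1000, 100, 50, 40))
```

A pair at exactly 8-fold enrichment has a base-2 PMI of 3, so the 8-fold threshold in the text corresponds to PMI > 3 under this convention.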
We next determined which features of amino acid mutations are informative for predicting their spread at baseline (Table 1, data file S1). Within the receptor binding domain (RBD) of Spike, we found that ACE2 binding affinity was a useful predictor of mutation spread (area under the receiver operator characteristic curve, AUROC=0.85; Fig. 1C). Another useful predictor was the change in in vitro expression of Spike mutants (AUROC=0.82; fig. S3A). Among measures of immune escape, the binding contributions of known antibody epitopes (antibody binding score) to anti-SARS-CoV-2 antibodies were predictive of mutation spread (AUROC=0.71; Fig. 1C) whereas CD4+ or CD8+ T-cell immunogenicity did not offer substantial explanatory power for mutation spread (AUROC=0.52-0.62; fig. S3A). We found that Natural Language Processing (NLP) scores for sequence plausibility (grammaticality) (17) were about as predictive as deep mutational scanning data (AUROC=0.82; Fig. 1C). The best evolutionary feature for prediction of spread (AUROC=0.86; Fig. 1C) was obtained from Fixed Effects Likelihood (FEL) (18) in the HyPhy package [http://www.hyphy.org] (19), which tests for pervasive negative or positive selection across the internal branches of a phylogenetic tree.
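Each AUROC quoted above scores a single feature as a ranking of mutations. As a reminder of what the statistic measures, the sketch below computes AUROC as the probability that a randomly chosen spreading mutation has a higher feature value than a randomly chosen non-spreading one, with ties counted as one half. The function name and toy values are illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence

def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUROC of a single feature via pairwise comparison (Mann-Whitney form)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy feature values and spreading labels for four mutations.
feature = [0.9, 0.8, 0.4, 0.3]
spreading = [True, False, True, False]
print(auroc(feature, spreading))
```

A value of 0.5 corresponds to a feature with no ranking ability, which is why the T-cell immunogenicity features (0.52-0.62) are described as offering little explanatory power.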
Table 1. Summary of analytical features.
Feature group | Variable | Meaning | Source or reference | Number of parameters |
Evolution | Positive selection (FEL, MEME) | Parameters from Fixed Effects Likelihood (FEL) and Mixed Effects Model of Evolution (MEME) | HyPhy (19) | 11 |
Evolution | Codon-SHAPE | RNA SHAPE constraint | Manfredonia et al. 2020 (32) | 3 |
Evolution | Viral entropy | Shannon entropy at each codon position for an amino acid site | This work | 3 |
Immune | CD8 epitope escape | The frequency of SARS-CoV-2 mutations in cytotoxic lymphocyte (CTL) epitopes | Agerer et al. 2021 (15) | 1 |
Immune | CD8 response | The percent and average CD8+ T cell response to an epitope in patients | Tarke et al. 2021 (33) | 2 |
Immune | CD4 response | The percent and average CD4+ T cell response to an epitope in patients | Tarke et al. 2021 (33) | 2 |
Immune | Antibody binding score | The estimated percent contribution of a site to binding of the indicated antibody, as estimated by Molecular Operating Environment (MOE) | This work | 17 |
Immune | Maximum escape fraction in vitro | The maximum escape fraction across all conditions for that mutation | Greaney et al. 2021 (34) | 1 |
Epidemiology | Variant frequency | The percent of sequences with the mutation | Calculated from GISAID (2) | 1 |
Epidemiology | Fraction of unique haplotypes | The fraction of unique Spike haplotypes in which a mutation is observed | Calculated from GISAID (2) | 1 |
Epidemiology | Number of countries | The number of countries where the mutation has been observed | Calculated from GISAID (2) | 1 |
Epidemiology | Epi Score | The exponentially weighted mean rank across the other epidemiology variables | Calculated from GISAID (2) | 1 |
Transmissibility | RBD expression change | Change in RBD expression due to the mutation | Starr et al. 2020 (13) | 1 |
Transmissibility | ACE2 binding change | The change in binding affinity for ACE2 | Starr et al. 2020 (13) | 1 |
Language model | Language model | Grammaticality and semantic change of a mutation | Hie et al. 2021 (17) | 2 |
The highest predictive performance, however, was obtained from epidemiological features, that is, variables which more directly measure sampled mutation counts (Table 1). The most predictive variable in this feature category was “Epi Score”, the exponentially weighted mean ranking across the other epidemiological variables (mutation frequency, fraction of unique haplotypes in which the mutation occurs, and the number of countries in which it occurs), with AUROC=0.99. This score captures both lineage expansion and recurrent mutation that occurs in multiple variant lineages by convergent evolution. We note that the utility of recurrent mutation signals is consistent with recent findings that convergent evolution plays a substantial role in SARS-CoV-2 adaptation (20). As observed for the RBD alone, within Spike we also obtained the best predictive performance with epidemiologic (AUROC=0.96) and evolutionary (AUROC=0.84) measures (Fig. 1C). The performance of other feature sets for spike is presented in fig. S3B.
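To make the idea of a rank-based composite concrete, the sketch below combines the three epidemiological variables into a single score. The exact weighting behind Epi Score is not given in this section, so the form used here (the mean of exponentially transformed percentile ranks, with a hypothetical steepness parameter `alpha`) is purely an illustrative assumption and should not be read as the authors' formula. All names and input values are made up.

```python
from bisect import bisect_right
from math import exp
from typing import Dict, List, Tuple

def percentile_ranks(values: List[float]) -> List[float]:
    """Fraction of items with a value less than or equal to each item."""
    ordered = sorted(values)
    n = len(values)
    return [bisect_right(ordered, v) / n for v in values]

def epi_score_sketch(features: Dict[str, Tuple[float, float, int]],
                     alpha: float = 5.0) -> Dict[str, float]:
    """Illustrative rank-based composite (ASSUMED form, not the published one).

    features maps mutation -> (frequency, fraction of unique haplotypes,
    number of countries). Each variable is converted to a percentile rank,
    and the ranks are combined with an exponential transform so that
    top-ranked mutations are emphasized.
    """
    names = list(features)
    columns = list(zip(*(features[m] for m in names)))  # one tuple per variable
    ranks = [percentile_ranks(list(col)) for col in columns]
    return {
        m: sum(exp(alpha * r[i]) for r in ranks) / len(ranks)
        for i, m in enumerate(names)
    }

toy = {
    "mutA": (0.50, 0.30, 40),
    "mutB": (0.01, 0.02, 5),
    "mutC": (0.10, 0.10, 12),
}
scores = epi_score_sketch(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

Because the score depends only on ranks, it needs no model fitting, which is consistent with the description of Epi Score as a non-model-based predictor.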
We next sought to interrogate the robustness of this approach to changes in the underlying drivers of SARS-CoV-2 evolution. For example, it has been hypothesized that selection due to immune pressure has increased with time as more individuals became immune through infection or vaccination (20). In particular, the Gamma P.1 lineage is thought to have spread rapidly in Brazil largely due to immune selection in a population with high seroprevalence (21). We measured the predictive performance of antibody binding scores, which quantify the predicted percent contribution of each Spike site to antibody affinity. We took this metric as a proxy for B cell immunodominance (Table 1) (22). Taking the maximum of this value across antibodies at a given site yielded the maximum antibody binding score. The predictiveness of this metric increased from nearly uninformative early in the pandemic (p-value for difference from random=0.53), to an AUROC of 0.75 (p<1e-4; fig. S2C) for predicting spreading mutations during the third wave of the pandemic (Fig. 1D). Predictiveness subsequently decreased again to 0.64 by summer of 2021, coincident with the emergence of Delta. However, we found that epidemiological features maintained their performance, achieving AUROCs of 0.92-0.97 over multiple evaluation periods (Fig. 1D).
Last, we trained models to predict spreading mutations using all, or various subsets of, the features identified above. We employed logistic regression with baseline features as inputs. The best predictors were epidemiologic features (AUROC=0.98) and positive selection features (AUROC=0.83; fig. S4A). The performance of the full model was comparable to the non-model-based performance of Epi Score (fig. S4B). Therefore, to simplify reproducibility and further minimize the risk of overfitting, we used Epi Score to predict mutation spread going forward. We found that taking the top 5% of mutations according to their Epi Score achieved reasonable sensitivity (~50%) and maintained a positive predictive value of between 20 and 60% across time windows (fig. S5). Given that an average of ~3% of observed mutations are spreading at any point in time, this represents roughly a 10-fold improvement in sensitivity (~50% versus the 5% expected when selecting 5% of mutations at random), and a 6- to 20-fold improvement in positive predictive value relative to random selection.
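The comparison with random selection can be checked with a short calculation. The sketch assumes that randomly selecting a fraction f of mutations recovers, in expectation, a fraction f of the spreading mutations (sensitivity f) and has a positive predictive value equal to the base rate of spreading mutations. Function and variable names are illustrative.

```python
def fold_over_random(observed: float, random_expectation: float) -> float:
    """Ratio of an observed metric to its expectation under random selection."""
    if random_expectation <= 0:
        raise ValueError("random expectation must be positive")
    return observed / random_expectation

fraction_selected = 0.05   # top 5% of mutations by Epi Score
base_rate = 0.03           # ~3% of observed mutations are spreading
sensitivity = 0.50         # ~50% sensitivity reported in the text
ppv_low, ppv_high = 0.20, 0.60

# Random selection of 5% of mutations has expected sensitivity 5%.
sens_fold = fold_over_random(sensitivity, fraction_selected)
# Random selection has expected PPV equal to the base rate.
ppv_fold_low = fold_over_random(ppv_low, base_rate)
ppv_fold_high = fold_over_random(ppv_high, base_rate)

print(f"sensitivity fold: {sens_fold:.1f}")
print(f"PPV fold: {ppv_fold_low:.1f} to {ppv_fold_high:.1f}")
```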
In summary, immunity, transmissibility, evolution, language model, and epidemiologic features all effectively predicted mutation spread. The methodology captured changes to the underlying selective forces over the course of the pandemic. We found that epidemiologic features in particular display superior accuracy and maintain it over time.
Examining global dynamics and the emergence of VOCs
To determine whether local or global dynamics drive mutation spread, we examined whether spreading mutations in the United States were better predicted by global or US-only epidemiological values. We tested the performance of Epi Score across four waves of the pandemic. We found that mutations were predicted with an AUROC above 0.85 up to four months in advance, both within the United States and globally. Global epidemiology metrics were best overall and were generally more predictive of country-level mutation spread than the country-level metrics themselves (fig. S6).
To illustrate the practical utility of Epi Score using global features, we assessed how early we would have been able to forecast the spread of Spike mutations that define current and former CDC VOCs, VOIs, and VBMs (n=50 defining mutations). To be conservative, we defined the date that a mutation was first forecast as the earliest date at which it was predicted to spread in two consecutive analysis periods. Of the 50 mutations (Fig. 2A), the median time between when a mutation was forecast to spread and when it reached 1% frequency was 5 months. The maximum was 20 months, while the minimum was 0 months for D614G, because this mutation had already reached a frequency of 69% by the first forecast period. The distribution of these forecast intervals is presented in Fig. 2B.
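The conservative rule for dating a first forecast can be sketched as a scan over per-period predictions. This is an illustration under two assumptions: the two analysis periods must be adjacent, and the forecast is dated to the second (confirming) period. The function name and example flags are hypothetical.

```python
from typing import Optional, Sequence

def first_confirmed_forecast(predicted: Sequence[bool]) -> Optional[int]:
    """Index of the period in which a forecast is first confirmed.

    predicted[i] is True if the mutation was predicted to spread in analysis
    period i. A forecast counts only once it holds in two adjacent periods;
    the returned index is that of the second period (ASSUMED convention).
    Returns None if the mutation is never confirmed.
    """
    for i in range(1, len(predicted)):
        if predicted[i - 1] and predicted[i]:
            return i
    return None

# A single isolated hit in period 1 is ignored; the forecast is
# confirmed in period 4, after hits in periods 3 and 4.
flags = [False, True, False, True, True, True]
print(first_confirmed_forecast(flags))
```

Requiring confirmation trades some lead time for fewer false alarms from one-off fluctuations in the score.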
Of particular note, Y145H was forecast to spread starting in July of 2021. This mutation is now a defining mutation of AY.4.2, a spreading sub-lineage of the Delta VOC. As of October 2021, AY.4.2 accounted for 8.5-11.3% of samples in the UK. Estimated growth rates remain slightly higher for AY.4.2 than for Delta, and the household secondary attack rate was higher for AY.4.2 cases than for other Delta cases (23). Based on these observations, we conclude that our approach was able to predict key mutations, across all current and former VOC/VOI/VBMs, several months in advance. Early warning of mutations in current VOCs, VOIs, and VBMs would have been possible before reaching worrisome degrees of global spread.
Understanding performance through a causal lens
Seeking to understand the high predictive performance of epidemiologic features, we constructed a directed acyclic graph to represent the hypothesized causal relationships, and to probe whether relative trends in performance were consistent with the expectations that follow from this model (Fig. 3A). We proposed that epidemiologic features mediate the relationship between viral fitness and mutation spread. Our rationale was that if a mutation’s contribution to viral fitness was sufficient to drive it to appreciable prevalence at one time point (as measured by global frequency and geographic distribution), and in the context of many genetic backgrounds, it would likely drive it to higher prevalence in the future as well (unless it were outcompeted by a more fit adaptation, or the fitness landscape changed). This type of mediated relationship (fitness → current prevalence → future prevalence) implies that epidemiological prevalence features will capture information from both known and unknown drivers of selection.
If the causal model were reasonable, we would expect first that variables whose causal effects are mediated, as defined above, should predict epidemiologic variables at a comparable or even greater accuracy compared to spreading mutations. This is illustrated by comparing the first and second columns of Fig. 3B. We observed that, with the exception of the maximal antibody binding score, all top variables predicted Epi Scores better than they predict mutation spread. The lower predictiveness of maximal antibody binding score for Epi Scores would be consistent with a slight time lag effect due to shifting evolutionary pressures.
A second criterion for mediation is that information from these variables should not substantially complement the predictiveness of the epidemiologic variables alone. In other words, there should be little or no additional information that other inputs provide relative to the epidemiologic variables. We assessed this by comparing the AUROCs of two-variable models in column 3 of Fig. 3B with the AUROC for Epi Score alone (0.983). The only nominal AUROC increase for a complemented model was observed for the evolutionary measure FEL (0.984). We did not find statistically significant complementarity with Epi Score for this or any other variable, either within the RBD or across full length Spike (see supplemental section “Mediation Analysis”, table S1).
Our examination of mediated causal relationships begins by assuming a causal graph based on prior knowledge. Such an approach is common to many causal inference methods (24) and represents a well-understood limitation of these methods (24). Therefore, we considered this as a tool to more systematically analyze the plausibility of our results. Although it is generally difficult to verify the structure of proposed causal graphs, our findings support the concept that epidemiological variables mediate the effects of other classes of explanatory variables, and this may explain their high predictive accuracy.
Emergence and spread of Omicron
While this work was in revision, we were confronted with the emergence in late November 2021 of the Omicron (B.1.1.529/21K) variant. Despite the low frequency of many of the individual mutations that define the major haplotype of Omicron (median allele frequency 0.00046), we observed high Epi Score values across Spike (median Epi Score of 9.51; Fig. 4A). A benefit of the computational simplicity of Epi Score is that predictions can easily be updated on a daily basis. We therefore sought to move beyond single time point Epi Scores to examine trends in Epi Score across time for the Omicron mutations. The time-series analysis showed that the Omicron Spike mutations had progressively higher Epi Score values long before the acceleration that characterized the emergence of Omicron in November 2021 (Fig. 4B). We additionally found that the spread of Omicron was rapidly reflected in the rising Epi Scores of its mutations, and that daily forecasts allowed the identification of trending scores.
As an independent approach to assess the singularity of Omicron, we also examined the evolutionary nature of the Omicron mutations using our language model. Omicron had a grammaticality change between that of Alpha and Delta, but the highest semantic change (predicted antigenic shift) of any SARS-CoV-2 lineage (fig. S7). Indeed, Omicron’s semantic change score was twice that of both Alpha and Delta, consistent with high levels of mutation and immune escape adaptation.
Forecasting spreading mutations in Spike and proteome-wide
Building upon the accurate prediction of spreading mutations across different waves of the pandemic, we next leveraged Epi Score on current data to forecast mutations that may contribute to VOIs and VOCs over the coming months. Because global metrics outperformed metrics restricted to the United States, even for forecasting within the United States, we focused on global forecasting. We considered shortening our feature calculation window to further mitigate the effects of shifting evolutionary dynamics. However, we found that longer feature calculation windows improved performance across all prediction windows (fig. S8).
As an application of the forecasting analysis, we examined how forecasted mutations intersected with the binding sites of clinical antibodies as of October 19th, 2021. We found wide variation in the number of forecasted mutations per antibody epitope (Table 2), ranging from 10 mutations for Celltrion’s CT-P59, to two low-frequency mutations for VIR-7831 (sotrovimab), which was designed to be more robust to viral evolution by targeting a region that is conserved across coronaviruses (25). The two mutations in the epitope of sotrovimab, A344S and R346K, do not limit neutralization (25, 26). As an additional proof of concept, we focused our attention on Spike S494P, a mutation reported to have enhanced binding affinity to ACE2 (27), and to reduce neutralization by 3-5-fold in some convalescent sera (27). We found that the S494P mutation decreases the neutralization potential of clinical therapeutic antibodies: LY-CoV555 (bamlanivimab), CT-P59, and, to a lesser extent, REGN10933 (casirivimab) (Fig. 5).
Table 2. Forecasted mutations for therapeutic antibodies.
Clinical therapeutic antibody | Forecasted mutations in epitopes |
VIR-7831 (sotrovimab) | A344S†, R346K† |
LY-CoV016 (etesevimab) | K417T‡, K417N*, L455F‡ |
REGN10987 (imdevimab) | R346K†, K444N*, G446V* |
LY-CoV555 (bamlanivimab) | L452R*, L452Q‡, V483F†, E484K*, E484Q*, F490S*, S494L‡, S494P* |
REGN10933 (casirivimab) | K417T*, K417N*, L455F*, G476S*, S477I‡, T478K‡, E484K*, E484Q*, F490S* |
CT-P59 | K417T‡, K417N†, L452R*, L452Q‡, L455F‡, E484K*, E484Q‡, F490S‡, S494L‡, S494P‡ |
Last, to demonstrate the flexibility and extensibility of our approach, we forecasted the spread of mutations specifically on the Delta genomic background, across the full SARS-CoV-2 proteome. Because the components of Epi Score can be calculated for any mutation where sequencing data are available, extension to the full proteome is trivial and not computationally taxing. It can also be reasonably calculated on any subset of sequences to determine which mutations are most likely to spread based on their characteristics within that subset (or lineage). Therefore, it is also straightforward to adapt this approach to produce lineage-specific forecasts. Fig. 6A shows a Manhattan-style plot of Epi Scores across the full SARS-CoV-2 genome. The plot highlights all mutations at positively selected sites (FEL, fixed effects model for detecting site-wise selective pressure, FDR < 0.05) that currently occur at a frequency over 0.1% on a Delta background. We found 151 such mutations, distributed across the proteome. The mutation density was 1.8 per 100 amino acids across the whole proteome, with a rate varying from 0 to 12.3 across SARS-CoV-2 proteins (Fig. 6B). By this measure, the highest mutational density was identified in ORF3/NS3, an accessory protein that is reported to modulate autophagosome–lysosome fusion (ORF3a) (28) and antagonize interferon (Orf3b) (29). Spike was close to average, with a density of 2.3 mutations per 100 amino acids. Based on the Epi Score ranking, the top 5 mutations for potential to spread were Spike:G142D, Spike:T95I, NSP3:A1711V, N:Q9L, and NSP2:K81N. All mutation Epi Scores proteome-wide are presented in data file S2.
In summary, we established a method for predicting spreading mutations and applied it to forecast future contributors to putative VOCs/VOIs/VBMs. These predictions yield mutations known to be important from in vitro data. We conclude that this approach can anticipate spreading mutations many months in advance. We find that a subset of forecast mutations could have implications for the continued efficacy of clinical antibodies, but that the level of these effects varies widely. We then extended our analysis to encompass the full SARS-CoV-2 proteome, and to produce Delta and informative Omicron forecasts. This work also suggests that there is considerable potential for spreading mutations located outside of Spike, underlining the importance of forecasting methods that can be applied across the whole viral proteome.
DISCUSSION
We established a working definition for spreading mutations and leveraged this definition to deliver a systematic analysis of amino acid features predictive of mutation spread. This yielded a simple, explainable, and accurate approach for forecasting mutations several months in advance, across multiple pandemic waves. Calculating this scoring was also efficient enough to enable daily forecast updates on millions of sequences using only a laptop. Although this strategy required nothing more than genomic surveillance data, we also highlighted the value of the complete mapping of epitopes, in vitro deep site-directed mutagenesis, and downstream functional experimental validation. Confidence in the prediction of spreading mutations came through retrospectively evaluating multiple waves of the pandemic and verifying consistency with experimental data, and with a plausible causal framework. Furthermore, long observed lags between the earliest warning signals and high population frequency of current mutations in VOCs, VOIs, and VBMs gave further support for using forecasting to anticipate the spread of future concerning mutations. Although this approach will be limited in its ability to anticipate mutations that appear and rise to high frequencies within a short time frame, we found this to be a rare occurrence.
We evaluated epidemiologic features aggregated in the Epi Score such as mutation frequency, and the distribution of mutations across countries and fraction of unique haplotypes across which a mutation occurs. We explored other predictors, including the rate of increase of each of these features, but did not find that they improved performance. We note that the fraction of unique haplotypes shared similarities to phylogenetic measures of recurrent mutation. However, there is considerable lack of phylogenetic resolution in such calculations, so the number of recurrent mutations is a statistically “noisy” measure, depends strongly on the method used to build phylogenies, and is very expensive to compute. The fraction of unique haplotypes, on the other hand, is fast to compute, can be perfectly estimated, and will increase with both recurrent mutation and single-lineage expansion; both of which are indicative of a positive contribution to fitness.
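The three component features are simple tabulations over sequence records, which is part of why they are cheap to compute. The sketch below derives them for one mutation from a toy list of (country, haplotype) records. The record layout and function name are assumptions made for illustration, not the study's data format.

```python
from typing import FrozenSet, List, Tuple

Record = Tuple[str, FrozenSet[str]]  # (country, set of mutations in the sequence)

def epidemiology_features(records: List[Record], mutation: str) -> Tuple[float, float, int]:
    """Return (frequency, fraction of unique haplotypes, number of countries)."""
    if not records:
        raise ValueError("no sequence records provided")
    carriers = [(country, hap) for country, hap in records if mutation in hap]
    # Share of all sequences carrying the mutation.
    frequency = len(carriers) / len(records)
    # Share of distinct haplotypes (distinct mutation sets) that include it.
    unique_haplotypes = {hap for _, hap in records}
    hap_fraction = sum(mutation in hap for hap in unique_haplotypes) / len(unique_haplotypes)
    # Number of distinct countries in which it was observed.
    n_countries = len({country for country, _ in carriers})
    return frequency, hap_fraction, n_countries

toy_records = [
    ("US", frozenset({"D614G", "N501Y"})),
    ("UK", frozenset({"D614G"})),
    ("US", frozenset({"D614G"})),
    ("BR", frozenset()),
]
print(epidemiology_features(toy_records, "D614G"))
```

The haplotype fraction rises both when a mutation recurs on new backgrounds and when a lineage carrying it diversifies, matching the two signals described above.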
Omicron emerged as the paper was completing the review process. Despite the limited numbers of viral sequences available as of December 2021, we observed a distinctive pattern of Omicron mutations that, despite low frequency of many individual mutations, already had high Epi Score values. It is also notable that for all mutations, high Epi Score values antedated the emergence of Omicron, even though those mutations had not yet converged on the same haplotypes. We interpret these data as indicative that individual mutations were endowed with advantageous properties in the viral genome even before their co-occurrence on the Omicron spike.
There are limits to this study; general prediction of viral evolution is fundamentally an intractable problem. The current work only addresses a simpler question: predicting which mutations will increase in frequency over some threshold in the near future based on the analysis of their recent patterns of spread. Thus, the study predicts spread of existing mutations, but not a true emergence of previously unobserved mutations. In addition, it is difficult to predict which lineages, i.e., a major viral haplotype, will spread because this would require the complex projection of growth of multiple mutations together. These limitations notwithstanding, the data on Omicron suggest that successful lineages may be defined by the convergence of mutations that, individually, exhibited high Epi Score values and other features that signal adaptive evolution.
Although this work forecasts which mutations will spread, the success of a given mutation does not necessarily result in clinical or public health consequences. Therefore, we posit that the value of the predictions is to prioritize mutations for functional screening. Here, we demonstrate how a subset of spreading mutations differentially impact clinical antibodies. We also extended the analysis to encompass the whole viral proteome. By this approach, we identified spreading amino acid replacements in other viral proteins, and highlighted positions under strong positive selection. Given the limited understanding of the role of non-Spike regions of the proteome in driving the pandemic, we believe that those non-Spike mutations should be prioritized for understanding their role in evading innate immunity, increasing the replication of SARS-CoV-2, and more generally for their contribution to viral fitness. We intend for these results to provide a foundation for future improvement. Although we have shown that Epi Score is robust to shifting evolutionary dynamics, performance can be monitored in real-time, and if necessary, re-tuned to capture novel behavior as now shown with the emergence of Omicron. This approach can also be generalized and improved upon to stay ahead of evolutionary cycles for other pathogens (30), when sufficiently rich and representative genomic sampling is available.
MATERIALS AND METHODS
Study Design. Sample size. The current work to define spreading amino acid mutations was based on viral sequences and metadata obtained from the GISAID EpiCoV project (https://www.gisaid.org/). A total of 4,487,305 sequences were analyzed.
Research objectives. We hypothesized that the pattern of spread could be estimated from the large GISAID database. Next, we hypothesized that one or more variables comprising biological, immunological, epidemiological and genomic (including language) features could be identified as drivers of the spread.
Experimental design. We used predictive models and expressed predictive performance using the area under the receiver operator characteristic curve (AUROC). Prediction was performed using forward feature selection followed by logistic regression. The criterion for forward selection was cross-validated AUROC of the logistic regression model within the training set. Feature selection and model fitting were performed separately within each fold of the outer cross validation loop. Logistic regression was chosen due to its sample efficiency.
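As an illustration of the performance metric only, AUROC can be read as the probability that a randomly chosen spreading mutation receives a higher score than a randomly chosen non-spreading one, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The function name `auroc` and the toy scores and labels are hypothetical and are not taken from the study data or code.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(positive score > negative score), ties count 1/2."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie contributes half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for six mutations; label 1 = later classified as spreading.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

In this toy case eight of the nine positive-negative pairs are ordered correctly. The quadratic loop is adequate for small examples; a rank-based formulation would be preferable for large mutation sets.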
Statistical analysis. Spreading mutations were defined based on a Fisher’s exact test for frequency fold change per country, adjusted for multiple comparisons, followed by filters for rate of spread (max fold change of at least 10, fold change > 2 in three or more countries), and a minimum prevalence of 0.1%. We estimated epistasis using pointwise mutual information, which corresponds to the log ratio of the observed prevalence of a pair to the expected prevalence assuming independence. The most predictive variable, “Epi Score,” was defined as the exponentially weighted mean ranking across the other epidemiological variables (mutation frequency, fraction of unique haplotypes in which the mutation occurs, and the number of countries in which it occurs). For natural language processing (NLP) neural network features, we used the grammaticality and semantic change scores reported by Hie et al. (17), in which a bidirectional long short-term memory (BiLSTM) model was trained on Spike sequences from GISAID and GenBank. Natural selection features were generated using MEME (31) and FEL (18) methods implemented in the HyPhy package (19) (version 2.5.31). Mediation analysis was based on the Baron and Kenny test. The list of forecast mutations was generated by calculating Epi Scores on the most recent three months of data and taking the top 5% of mutations, a cutoff chosen based on empirical analyses.
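The pointwise mutual information estimate described above can be sketched as follows, under stated assumptions. The text does not specify the logarithm base or how zero counts are handled; this sketch assumes base 2 and rejects zero counts. The function name, argument names, and example counts are hypothetical.

```python
import math


def pointwise_mutual_information(n_both: int, n_a: int, n_b: int, n_total: int) -> float:
    """log2 of the observed pair prevalence over the prevalence expected under independence.

    n_both  : sequences carrying both mutations
    n_a     : sequences carrying mutation A
    n_b     : sequences carrying mutation B
    n_total : all sequences considered
    """
    if min(n_both, n_a, n_b, n_total) <= 0:
        raise ValueError("all counts must be positive for a finite PMI")
    p_both = n_both / n_total            # observed prevalence of the pair
    p_a = n_a / n_total
    p_b = n_b / n_total
    expected = p_a * p_b                 # prevalence if A and B were independent
    return math.log2(p_both / expected)  # base 2 is an assumption of this sketch


# Hypothetical counts: the pair is seen four times more often than independence predicts.
pmi_enriched = pointwise_mutual_information(n_both=20, n_a=100, n_b=50, n_total=1000)
# Hypothetical counts matching the independence expectation exactly.
pmi_neutral = pointwise_mutual_information(n_both=5, n_a=100, n_b=50, n_total=1000)
```

Positive values indicate that two mutations co-occur more often than expected under independence, and negative values indicate that they co-occur less often.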
Acknowledgments
We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID used in the current study. The full GISAID acknowledgement list can be found at http://data.hyphy.org/web/SARS-CoV-2/gisaid.csv.gz. We thank Darren Martin and Emma Hodcroft for useful discussion.
Funding: This research is funded by Vir Biotechnology. D.L.R. is funded by the MRC (MC_UU_12014/12). S.L.K.P. was supported in part by the NIH/NIAID (AI134384).
Author contributions: M.C.M., A.T. conceived the study. M.C.M., I.B., S.W., J.dI., F.A.L., B.L.H. performed experiments. M.C.M., I.B., E.F., L.S., F.A.L., B.L.H., S.L.K.P., A.T. analyzed and interpreted data. Br.B., Bo.B., D.L.R., G.S., D.C., H.W.V., S.L.K., A.T. supervised research. M.C.M., D.C., H.W.V., S.L.K.P., A.T. wrote the manuscript.
Competing interests: M.C.M., I.B., J.dI., F.A.L., E.F., L.S., G.S., D.C., H.W.V., and A.T. are employees of Vir Biotechnology and may hold shares of the company. H.W.V. is a founder of PierianDx and Casma Therapeutics and holds stock or stock options in these companies. Neither company funded the present work. H.W.V. holds a number of patents from work at Washington University School of Medicine. The present work does not involve these patents. Vir has filed a patent application on this work: 63/212,945; “Predicting mutational drivers of future pathogen spread”.
Data and materials availability: All data associated with this study are present in the paper or supplementary materials. Data file S1 describes the sources for all features used for prediction. Data file S2 provides the Epi Scores proteome-wide. Data file S3 provides all the figure data for figures S3-S7. Code used in this study is available at DOI: https://zenodo.org/badge/latestdoi/440943417.
This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/. This license does not apply to figures/photos/artwork or other content included in the article that is credited to a third party; obtain authorization from the rights holder before using such material.
References and Notes
- 1. McCallum M., De Marco A., Lempp F. A., Tortorici M. A., Pinto D., Walls A. C., Beltramello M., Chen A., Liu Z., Zatta F., Zepeda S., di Iulio J., Bowen J. E., Montiel-Ruiz M., Zhou J., Rosen L. E., Bianchi S., Guarino B., Fregni C. S., Abdelnabi R., Foo S. C., Rothlauf P. W., Bloyet L.-M., Benigni F., Cameroni E., Neyts J., Riva A., Snell G., Telenti A., Whelan S. P. J., Virgin H. W., Corti D., Pizzuto M. S., Veesler D., N-terminal domain antigenic mapping reveals a site of vulnerability for SARS-CoV-2. Cell 184, 2332–2347.e16 (2021). doi:10.1016/j.cell.2021.03.028
- 2. Elbe S., Buckland-Merrett G., Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob. Chall. 1, 33–46 (2017). doi:10.1002/gch2.1018
- 3. Centers for Disease Control and Prevention, SARS-CoV-2 Variants of Concern (available at https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/variant-surveillance/variant-info.html).
- 4. Adiga A., Wang L., Hurt B., Peddireddy A., Porebski P., Venkatramanan S., Lewis B., Marathe M., All Models Are Useful: Bayesian Ensembling for Robust High Resolution COVID-19 Forecasting. medRxiv, 2021.03.12.21253495 (2021). doi:10.1145/3447548.3467197
- 5. Zhao H., Merchant N. N., McNulty A., Radcliff T. A., Cote M. J., Fischer R. S. B., Sang H., Ory M. G., COVID-19: Short term prediction model using daily incidence data. PLOS ONE 16, e0250110 (2021). doi:10.1371/journal.pone.0250110
- 6. Ray E. L., Wattanachit N., Niemi J., Kanji A. H., House K., Cramer E. Y., Bracher J., Zheng A., Yamana T. K., Xiong X., Woody S., Wang Y., Wang L., Walraven R. L., Tomar V., Sherratt K., Sheldon D., Reiner R. C., Prakash B. A., Osthus D., Li M. L., Lee E. C., Koyluoglu U., Keskinocak P., Gu Y., Gu Q., George G. E., España G., Corsetti S., Chhatwal J., Cavany S., Biegel H., Ben-Nun M., Walker J., Slayton R., Lopez V., Biggerstaff M., Johansson M. A., Reich N. G., Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. medRxiv, 2020.08.19.20177493 (2020). doi:10.1101/2020.08.19.20177493
- 7. Padane A., Kanteh A., Leye N., Mboup A., Manneh J., Mbow M., Diaw P. A., Ndiaye B. P., Lo G., Lo C. I., Ahoudi A., Gueye-Gaye A., Malomar J. J. N., Dia A., Dia Y. A., Diagne N. D., Wade D., Sesay A. K., Toure-Kane N. C., Dalessandro U., Mboup S., First detection of the British variant of SARS-CoV-2 in Senegal. New Microbes New Infect., 100877 (2021). doi:10.1016/j.nmni.2021.100877
- 8. Valesano A. L., Rumfelt K. E., Dimcheff D. E., Blair C. N., Fitzsimmons W. J., Petrie J. G., Martin E. T., Lauring A. S., Temporal dynamics of SARS-CoV-2 mutation accumulation within and across infected hosts. PLOS Pathog. 17, e1009499 (2021). doi:10.1371/journal.ppat.1009499
- 9. Charkiewicz R., Nikliński J., Biecek P., Kiśluk J., Pancewicz S., Moniuszko-Malinowska A. M., Flisiak R., Krętowski A. J., Dzięcioł J., Moniuszko M., Gierczyński R., Juszczyk G., Reszeć J., The first SARS-CoV-2 genetic variants of concern (VOC) in Poland: The concept of a comprehensive approach to monitoring and surveillance of emerging variants. Adv. Med. Sci. 66, 237–245 (2021). doi:10.1016/j.advms.2021.03.005
- 10. Dejnirattisai W., Zhou D., Supasa P., Liu C., Mentzer A. J., Ginn H. M., Zhao Y., Duyvesteyn H. M. E., Tuekprakhon A., Nutalai R., Wang B., López-Camacho C., Slon-Campos J., Walter T. S., Skelly D., Costa Clemens S. A., Naveca F. G., Nascimento V., Nascimento F., Fernandes da Costa C., Resende P. C., Pauvolid-Correa A., Siqueira M. M., Dold C., Levin R., Dong T., Pollard A. J., Knight J. C., Crook D., Lambe T., Clutterbuck E., Bibi S., Flaxman A., Bittaye M., Belij-Rammerstorfer S., Gilbert S. C., Carroll M. W., Klenerman P., Barnes E., Dunachie S. J., Paterson N. G., Williams M. A., Hall D. R., Hulswit R. J. G., Bowden T. A., Fry E. E., Mongkolsapaya J., Ren J., Stuart D. I., Screaton G. R., Antibody evasion by the P.1 strain of SARS-CoV-2. Cell 184, 2939–2954.e9 (2021). doi:10.1016/j.cell.2021.03.055
- 11. Collier D. A., De Marco A., Ferreira I. A. T. M., Meng B., Datir R. P., Walls A. C., Kemp S. A., Bassi J., Pinto D., Silacci-Fregni C., Bianchi S., Tortorici M. A., Bowen J., Culap K., Jaconi S., Cameroni E., Snell G., Pizzuto M. S., Pellanda A. F., Garzoni C., Riva A., Elmer A., Kingston N., Graves B., McCoy L. E., Smith K. G. C., Bradley J. R., Temperton N., Ceron-Gutierrez L., Barcenas-Morales G., Harvey W., Virgin H. W., Lanzavecchia A., Piccoli L., Doffinger R., Wills M., Veesler D., Corti D., Gupta R. K.; CITIID-NIHR BioResource COVID-19 Collaboration; COVID-19 Genomics UK (COG-UK) Consortium, Sensitivity of SARS-CoV-2 B.1.1.7 to mRNA vaccine-elicited antibodies. Nature 593, 136–141 (2021). doi:10.1038/s41586-021-03412-7
- 12. Starr T. N., Greaney A. J., Dingens A. S., Bloom J. D., Complete map of SARS-CoV-2 RBD mutations that escape the monoclonal antibody LY-CoV555 and its cocktail with LY-CoV016. Cell Reports Medicine 2, 100255 (2021). doi:10.1016/j.xcrm.2021.100255
- 13. Starr T. N., Greaney A. J., Hilton S. K., Ellis D., Crawford K. H. D., Dingens A. S., Navarro M. J., Bowen J. E., Tortorici M. A., Walls A. C., King N. P., Veesler D., Bloom J. D., Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell 182, 1295–1310.e20 (2020). doi:10.1016/j.cell.2020.08.012
- 14. Starr T. N., Greaney A. J., Addetia A., Hannon W. W., Choudhary M. C., Dingens A. S., Li J. Z., Bloom J. D., Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. Science 371, 850–854 (2021). doi:10.1126/science.abf9302
- 15. Agerer B., Koblischke M., Gudipati V., Montaño-Gutierrez L. F., Smyth M., Popa A., Genger J.-W., Endler L., Florian D. M., Mühlgrabner V., Graninger M., Aberle S. W., Husa A.-M., Shaw L. E., Lercher A., Gattinger P., Torralba-Gombau R., Trapin D., Penz T., Barreca D., Fae I., Wenda S., Traugott M., Walder G., Pickl W. F., Thiel V., Allerberger F., Stockinger H., Puchhammer-Stöckl E., Weninger W., Fischer G., Hoepler W., Pawelka E., Zoufaly A., Valenta R., Bock C., Paster W., Geyeregger R., Farlik M., Halbritter F., Huppa J. B., Aberle J. H., Bergthaler A., SARS-CoV-2 mutations in MHC-I-restricted epitopes evade CD8+ T cell responses. Sci. Immunol. 6, eabg6461 (2021). doi:10.1126/sciimmunol.abg6461
- 16. Tarke A., Sidney J., Methot N., Yu E. D., Zhang Y., Dan J. M., Goodwin B., Rubiro P., Sutherland A., Wang E., Frazier A., Ramirez S. I., Rawlings S. A., Smith D. M., da Silva Antunes R., Peters B., Scheuermann R. H., Weiskopf D., Crotty S., Grifoni A., Sette A., Impact of SARS-CoV-2 variants on the total CD4+ and CD8+ T cell reactivity in infected or vaccinated individuals. Cell Reports Medicine 2, 100355 (2021). doi:10.1016/j.xcrm.2021.100355
- 17. Hie B., Zhong E. D., Berger B., Bryson B., Learning the language of viral evolution and escape. Science 371, 284–288 (2021). doi:10.1126/science.abd7331
- 18. Kosakovsky Pond S. L., Frost S. D. W., Not so different after all: A comparison of methods for detecting amino acid sites under selection. Mol. Biol. Evol. 22, 1208–1222 (2005). doi:10.1093/molbev/msi105
- 19. Kosakovsky Pond S. L., Poon A. F. Y., Velazquez R., Weaver S., Hepler N. L., Murrell B., Shank S. D., Magalis B. R., Bouvier D., Nekrutenko A., Wisotsky S., Spielman S. J., Frost S. D. W., Muse S. V., HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Mol. Biol. Evol. 37, 295–299 (2020). doi:10.1093/molbev/msz197
- 20. Martin D. P., Weaver S., Tegally H., San J. E., Shank S. D., Wilkinson E., Lucaci A. G., Giandhari J., Naidoo S., Pillay Y., Singh L., Lessells R. J., Gupta R. K., Wertheim J. O., Nekrutenko A., Murrell B., Harkins G. W., Lemey P., MacLean O. A., Robertson D. L., de Oliveira T., Kosakovsky Pond S. L.; NGS-SA; COVID-19 Genomics UK (COG-UK), The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages. Cell 184, 5189–5200.e7 (2021). doi:10.1016/j.cell.2021.09.003
- 21. Faria N. R., Mellan T. A., Whittaker C., Claro I. M., Candido D. D. S., Mishra S., Crispim M. A. E., Sales F. C. S., Hawryluk I., McCrone J. T., Hulswit R. J. G., Franco L. A. M., Ramundo M. S., de Jesus J. G., Andrade P. S., Coletti T. M., Ferreira G. M., Silva C. A. M., Manuli E. R., Pereira R. H. M., Peixoto P. S., Kraemer M. U. G., Gaburo N. Jr., Camilo C. D. C., Hoeltgebaum H., Souza W. M., Rocha E. C., de Souza L. M., de Pinho M. C., Araujo L. J. T., Malta F. S. V., de Lima A. B., Silva J. D. P., Zauli D. A. G., Ferreira A. C. S., Schnekenberg R. P., Laydon D. J., Walker P. G. T., Schlüter H. M., Dos Santos A. L. P., Vidal M. S., Del Caro V. S., Filho R. M. F., Dos Santos H. M., Aguiar R. S., Proença-Modena J. L., Nelson B., Hay J. A., Monod M., Miscouridou X., Coupland H., Sonabend R., Vollmer M., Gandy A., Prete C. A. Jr., Nascimento V. H., Suchard M. A., Bowden T. A., Pond S. L. K., Wu C.-H., Ratmann O., Ferguson N. M., Dye C., Loman N. J., Lemey P., Rambaut A., Fraiji N. A., Carvalho M. D. P. S. S., Pybus O. G., Flaxman S., Bhatt S., Sabino E. C., Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus, Brazil. Science 372, 815–821 (2021). doi:10.1126/science.abh2644
- 22. Vilar S., Cozza G., Moro S., Medicinal chemistry and the molecular operating environment (MOE): Application of QSAR and molecular docking to drug discovery. Curr. Top. Med. Chem. 8, 1555–1572 (2008). doi:10.2174/156802608786786624
- 23. UK Health Security Agency, SARS-CoV-2 variants of concern and variants under investigation in England: Technical briefing 27 (2021; https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1029715/technical-briefing-27.pdf).
- 24. Pearce N., Lawlor D. A., Causal inference-so much more than statistics. Int. J. Epidemiol. 45, 1895–1903 (2016). doi:10.1093/ije/dyw328
- 25. Cathcart A. L., Havenar-Daughton C., Lempp F. A., Ma D., Schmid M. A., Agostini M. L., Guarino B., di Iulio J., Rosen L. E., Tucker H., Dillen J., Subramanian S., Sloan B., Bianchi S., Pinto D., Saliba C., Wojcechowskyj J. A., Noack J., Zhou J., Kaiser H., Chase A., Montiel-Ruiz M., Dellota E., Park A., Spreafico R., Sahakyan A., Lauron E. J., Czudnochowski N., Cameroni E., Ledoux S., Werts A., Colas C., Soriaga L., Telenti A., Purcell L. A., Hwang S., Snell G., Virgin H. W., Corti D., Hebner C. M., The dual function monoclonal antibodies VIR-7831 and VIR-7832 demonstrate potent in vitro and in vivo activity against SARS-CoV-2. bioRxiv, 2021.03.09.434607 (2021). doi:10.1101/2021.03.09.434607
- 26. Starr T. N., Czudnochowski N., Liu Z., Zatta F., Park Y.-J., Addetia A., Pinto D., Beltramello M., Hernandez P., Greaney A. J., Marzi R., Glass W. G., Zhang I., Dingens A. S., Bowen J. E., Tortorici M. A., Walls A. C., Wojcechowskyj J. A., De Marco A., Rosen L. E., Zhou J., Montiel-Ruiz M., Kaiser H., Dillen J. R., Tucker H., Bassi J., Silacci-Fregni C., Housley M. P., di Iulio J., Lombardo G., Agostini M., Sprugasci N., Culap K., Jaconi S., Meury M., Dellota E. Jr., Abdelnabi R., Foo S. C., Cameroni E., Stumpf S., Croll T. I., Nix J. C., Havenar-Daughton C., Piccoli L., Benigni F., Neyts J., Telenti A., Lempp F. A., Pizzuto M. S., Chodera J. D., Hebner C. M., Virgin H. W., Whelan S. P. J., Veesler D., Corti D., Bloom J. D., Snell G., SARS-CoV-2 RBD antibodies that maximize breadth and resistance to escape. Nature 597, 97–102 (2021). doi:10.1038/s41586-021-03807-6
- 27. Chakraborty S., Evolutionary and structural analysis elucidates mutations on SARS-CoV2 spike protein with altered human ACE2 binding affinity. Biochem. Biophys. Res. Commun. 534, 374–380 (2021). doi:10.1016/j.bbrc.2020.11.075
- 28. Zhang Y., Sun H., Pei R., Mao B., Zhao Z., Li H., Lin Y., Lu K., The SARS-CoV-2 protein ORF3a inhibits fusion of autophagosomes with lysosomes. Cell Discov. 7, 31 (2021). doi:10.1038/s41421-021-00268-z
- 29. Konno Y., Kimura I., Uriu K., Fukushi M., Irie T., Koyanagi Y., Sauter D., Gifford R. J., Nakagawa S., Sato K.; USFQ-COVID19 Consortium, SARS-CoV-2 ORF3b is a potent interferon antagonist whose activity is increased by a naturally occurring elongation variant. Cell Rep. 32, 108185 (2020). doi:10.1016/j.celrep.2020.108185
- 30. Tang J. W., Lam T. T., Zaraket H., Lipkin W. I., Drews S. J., Hatchette T. F., Heraud J.-M., Koopmans M. P., Abraham A. M., Baraket A., Bialasiewicz S., Caniza M. A., Chan P. K. S., Cohen C., Corriveau A., Cowling B. J., Drews S. J., Echavarria M., Fouchier R., Fraaij P. L. A., Hachette T. F., Heraud J.-M., Jalal H., Jennings L., Kabanda A., Kadjo H. A., Khanani M. R., Koay E. S. C., Koopmans M. P., Krajden M., Lam T. T., Lee H. K., Lipkin W. I., Lutwama J., Marchant D., Nishimura H., Nymadawa P., Pinsky B. A., Rughooputh S., Rukelibuga J., Saiyed T., Shet A., Sloots T., Tamfum J. J. M., Tang J. W., Tempia S., Tozer S., Treurnicht F., Waris M., Watanabe A., Wemakoy E. O.; INSPIRE investigators, Global epidemiology of non-influenza RNA respiratory viruses: Data gaps and a growing need for surveillance. Lancet Infect. Dis. 17, e320–e326 (2017). doi:10.1016/S1473-3099(17)30238-4
- 31. Murrell B., Wertheim J. O., Moola S., Weighill T., Scheffler K., Kosakovsky Pond S. L., Detecting individual sites subject to episodic diversifying selection. PLOS Genet. 8, e1002764 (2012). doi:10.1371/journal.pgen.1002764
- 32. Manfredonia I., Nithin C., Ponce-Salvatierra A., Ghosh P., Wirecki T. K., Marinus T., Ogando N. S., Snijder E. J., van Hemert M. J., Bujnicki J. M., Incarnato D., Genome-wide mapping of SARS-CoV-2 RNA structures identifies therapeutically-relevant elements. Nucleic Acids Res. 48, 12436–12452 (2020). doi:10.1093/nar/gkaa1053
- 33. Tarke A., Sidney J., Kidd C. K., Dan J. M., Ramirez S. I., Yu E. D., Mateus J., da Silva Antunes R., Moore E., Rubiro P., Methot N., Phillips E., Mallal S., Frazier A., Rawlings S. A., Greenbaum J. A., Peters B., Smith D. M., Crotty S., Weiskopf D., Grifoni A., Sette A., Comprehensive analysis of T cell immunodominance and immunoprevalence of SARS-CoV-2 epitopes in COVID-19 cases. Cell Reports Medicine 2, 100204 (2021). doi:10.1016/j.xcrm.2021.100204
- 34. Greaney A. J., Starr T. N., Gilchuk P., Zost S. J., Binshtein E., Loes A. N., Hilton S. K., Huddleston J., Eguia R., Crawford K. H. D., Dingens A. S., Nargi R. S., Sutton R. E., Suryadevara N., Rothlauf P. W., Liu Z., Whelan S. P. J., Carnahan R. H., Crowe J. E. Jr., Bloom J. D., Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host Microbe 29, 44–57.e9 (2021). doi:10.1016/j.chom.2020.11.007
- 35. T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, Sylvain, P. Ivanov, D. Avila, Abdalla, S. Abdalla, Willing, J. development team, in 20th International Conference on Electronic Publishing, (2016).
- 36. Harris C. R., Millman K. J., van der Walt S. J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N. J., Kern R., Picus M., Hoyer S., van Kerkwijk M. H., Brett M., Haldane A., Del Río J. F., Wiebe M., Peterson P., Gérard-Marchant P., Sheppard K., Reddy T., Weckesser W., Abbasi H., Gohlke C., Oliphant T. E., Array programming with NumPy. Nature 585, 357–362 (2020). doi:10.1038/s41586-020-2649-2
- 37. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É., Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
- 38. S. Seabold, J. Perktold, in 9th Python in Science Conference, (2010).
- 39. W. McKinney, in Proceedings of the 9th Python Science Conference, S. van der Walt, J. Millman, Eds. (2010), pp. 56–61.
- 40. Pond S. L. K., Frost S. D. W., Muse S. V., HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21, 676–679 (2005). doi:10.1093/bioinformatics/bti079
- 41. Greaney A. J., Loes A. N., Crawford K. H. D., Starr T. N., Malone K. D., Chu H. Y., Bloom J. D., Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies. Cell Host Microbe 29, 463–476.e6 (2021). doi:10.1016/j.chom.2021.02.003