Abstract
Knowledge of the fitness effects of mutations to SARS-CoV-2 can inform assessment of new variants, design of therapeutics resistant to escape, and understanding of the functions of viral proteins. However, experimentally measuring effects of mutations is challenging: we lack tractable lab assays for many SARS-CoV-2 proteins, and comprehensive deep mutational scanning has been applied to only two SARS-CoV-2 proteins. Here we develop an approach that leverages millions of publicly available SARS-CoV-2 sequences to estimate effects of mutations. We first calculate how many independent occurrences of each mutation are expected to be observed along the SARS-CoV-2 phylogeny in the absence of selection. We then compare these expected observations to the actual observations to estimate the effect of each mutation. These estimates correlate well with deep mutational scanning measurements. For most genes, synonymous mutations are nearly neutral, stop-codon mutations are deleterious, and amino-acid mutations have a range of effects. However, some viral accessory proteins are under little to no selection. We provide interactive visualizations of effects of mutations to all SARS-CoV-2 proteins (https://jbloomlab.github.io/SARS2-mut-fitness/). The framework we describe is applicable to any virus for which the number of available sequences is sufficiently large that many independent occurrences of each neutral mutation are observed.
The rapid evolution of SARS-CoV-2 has led to the emergence of viral variants with enhanced transmissibility, escape from therapeutics, or reduced recognition by immunity [1, 2]. To anticipate and mitigate this evolution, the scientific community has launched efforts to assess the risk of new viral variants [3] and create therapeutics that target constrained regions of the virus where resistance is less likely to evolve [4, 5, 6]. Both efforts require determining how specific mutations affect viral fitness.
Unfortunately, experimentally measuring the effects of mutations is challenging for most SARS-CoV-2 proteins. For spike, tractable lab assays have identified key functional and antigenic mutations [1, 7], and enabled deep mutational scanning measurements of how most mutations affect receptor binding, cellular infection, and antibody recognition [8, 9, 10, 11]. These experimental data are valuable for assessing new spike variants [3, 12, 13] and designing antibody therapeutics with greater resistance to escape [14, 15, 16]. But most SARS-CoV-2 proteins lack tractable lab assays, despite contributing to viral fitness [17, 18, 19] and being targets of efforts to develop anti-viral drugs [20]. The only non-spike SARS-CoV-2 protein with large-scale experimental measurements of mutation effects is Mpro [21, 22].
An alternative to experiments is to estimate effects of mutations by analyzing natural viral sequences. The amount of data available for such analyses has increased dramatically over the last few years with the sequencing of SARS-CoV-2 from millions of human infections. So far analyses of these sequences have focused on analyzing expanding viral clades to identify mutations that mediate immune escape or increase transmissibility [23, 24, 25]. The basic idea is that mutations that repeatedly appear near the base of clades that increase in relative frequency are likely beneficial to the virus. However, only a small minority of all possible mutations are beneficial, with most being nearly neutral or deleterious. For purposes such as identifying constrained drug targets or understanding the function of viral proteins, it is important to estimate the effects of neutral or deleterious mutations as well as beneficial ones. Other studies have analyzed broader alignments of coronaviruses substantially diverged from SARS-CoV-2 [26, 27], but the resulting estimates are limited by sparse sampling and possible changes in the impacts of some mutations across divergent viruses.
Here we develop a new approach that uses natural sequences to estimate the effects of mutations. Our basic insight is that there are now so many SARS-CoV-2 sequences that all non-deleterious single-nucleotide mutations are expected to independently occur many times along the observed phylogenetic tree. We therefore first calculate the number of expected observations of independent occurrences of each mutation based on the neutral mutation rate of SARS-CoV-2. We then compare these expected observations to the actual observations in the SARS-CoV-2 tree to estimate the effect of each mutation. The resulting estimates correlate well with existing deep mutational scanning data. Most viral proteins have regions under strong selective constraints. However, some accessory proteins show only weak selection against amino-acid and even stop-codon mutations. Overall, our work demonstrates a new approach to determine the effects of mutations, and provides detailed maps of mutational effects across the SARS-CoV-2 proteome.
Results
Mutation effects from actual versus expected counts
To determine how many times each mutation is expected to be observed, we used the pre-built UShER tree [28, 29, 30] of ~7-million public SARS-CoV-2 sequences to count nucleotide mutations at four-fold degenerate sites [Figure 1A; 31]. Because mutations at such sites never alter the amino-acid sequence, these counts reflect the mutation process in the absence of protein-level selection (see below for caveats about nucleotide-level selection). The expected count of a mutation from nucleotide x to y is simply the average count of this type of mutation across all four-fold degenerate sites with parental identity x. Importantly, we count independent occurrences of each mutation along the branches of the tree, not the sequences with the mutation in the final alignment (Figure 1A). We also compute expected counts separately for each SARS-CoV-2 clade to account for shifts in mutation spectrum [31, 32], and apply quality-control steps to remove spurious mutations (see Methods).
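As an illustration of what "four-fold degenerate" means in practice, the following minimal sketch flags third codon positions at which all four nucleotides encode the same amino acid. It uses the standard genetic code; the function names (`is_fourfold_degenerate`, `fourfold_sites`) and the toy sequence are hypothetical and are not taken from the authors' pipeline.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon: str) -> bool:
    """True if any nucleotide at the third position encodes the same amino acid."""
    if len(codon) != 3 or not set(codon) <= set(BASES):
        return False  # skip incomplete or ambiguous codons
    return len({CODON_TABLE[codon[:2] + b] for b in BASES}) == 1


def fourfold_sites(coding_seq: str, start: int = 1) -> list[tuple[int, str]]:
    """Return (1-based site, parental nucleotide) for four-fold degenerate sites.

    `start` is the 1-based position of the first nucleotide of `coding_seq`
    (a hypothetical convention chosen for this sketch).
    """
    sites = []
    for i in range(0, len(coding_seq) - 2, 3):
        codon = coding_seq[i : i + 3]
        if is_fourfold_degenerate(codon):
            sites.append((start + i + 2, codon[2]))
    return sites


# Toy coding sequence (not a real viral gene): ATG GCT TTT GGA
example_sites = fourfold_sites("ATGGCTTTTGGA")
print(example_sites)
```

In this toy sequence only the alanine (GCT) and glycine (GGA) codons contribute four-fold degenerate third positions.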
The expected counts per mutation vary with mutation type, ranging from ~565 for C→T to only ~9 for T→G mutations (Figure 1B). This variation arises because the SARS-CoV-2 mutation spectrum is highly biased towards specific mutation types [31, 32, 33, 34]. However, because there are so many SARS-CoV-2 sequences we are able to estimate the rate of even the rarest mutation types with high accuracy [31]. For instance, there are ~1.9 × 10^4 observed occurrences of T→G mutations across all ~2,100 four-fold degenerate sites with a parental identity of T, which is enough to estimate the T→G mutation rate (and therefore the expected counts of each mutation) with high accuracy.
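The T→G figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers from the text; the Poisson error estimate is an assumption added here for illustration, not a calculation described in the paper.

```python
import math

# Approximate values quoted in the text.
total_t_to_g = 1.9e4  # observed T->G occurrences at four-fold degenerate sites
n_sites_t = 2100      # four-fold degenerate sites with parental identity T

# Expected count per site is the average over sites with parental T.
expected_per_site = total_t_to_g / n_sites_t

# Assumption for illustration only: if the total behaved like a Poisson count,
# its relative standard error would be about 1 / sqrt(total).
relative_error = 1 / math.sqrt(total_t_to_g)

print(f"expected T->G count per site: {expected_per_site:.1f}")
print(f"approximate relative error of the rate: {relative_error:.2%}")
```

Under that assumption, even the rarest mutation type has a rate estimate with a relative error below one percent, in line with the statement that the rate is estimated with high accuracy.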
We compared the expected counts to the actual observed counts of mutations averaged across sites (Figure 1). For synonymous mutations, the expected and actual counts are similar. But for nonsynonymous and especially stop-codon mutations, the actual counts are substantially lower than the expected counts, reflecting purifying selection for protein function.
The ratio of actual to expected counts for each mutation is related to its effect on viral fitness. The intuition is straightforward: mutations arise at all sites, but viruses with deleterious mutations are less likely to transmit and be observed in sequencing of human SARS-CoV-2. Therefore, the ratio of actual to expected counts will be one for neutral mutations, and less than one for deleterious mutations. In the Methods and Appendix, we show that under plausible assumptions about SARS-CoV-2 evolution and sampling intensity (fraction of viruses sequenced), the fitness cost of a deleterious mutation scales roughly inversely with the ratio of actual to expected counts for mutations with costs greater than a few percent. A key result is the dependence on sampling intensity: if all human SARS-CoV-2 were sequenced, even deleterious mutations would have a high chance of being sampled, and we would need to study the subsequent spread of the mutations to assess their fitness. But the actual sampling intensity is ~0.1%, since there are ~7-million publicly available SARS-CoV-2 sequences and the total number of human infections is now probably roughly on par with the total global population of ~8-billion people. At this sampling intensity, the number of times a mutation is observed reflects more subtle reductions in transmission efficiency. We quantify the effect of each mutation as the logarithm of the ratio of actual to expected counts after summing counts for all nucleotide mutations that encode the relevant amino-acid mutation. The statistical noise is greater for mutations with fewer expected counts: the figures in this paper show mutations with ≥ 10 expected counts unless otherwise noted, with legends linking to interactive plots that enable adjustment of this threshold.
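The calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the record layout, the function names, and the toy counts are hypothetical, and the pseudocount (added so that mutations with zero observed counts still give a finite logarithm) is an assumption not stated in this section.

```python
import math
from collections import defaultdict

PSEUDOCOUNT = 0.5  # assumption for this sketch; avoids log(0) when actual == 0


def mutation_effect(actual: float, expected: float,
                    pseudocount: float = PSEUDOCOUNT) -> float:
    """Natural log of the (pseudocount-adjusted) ratio of actual to expected counts."""
    return math.log((actual + pseudocount) / (expected + pseudocount))


def amino_acid_effects(nt_records, min_expected: float = 10.0) -> dict:
    """Sum counts over nucleotide mutations encoding each amino-acid mutation.

    nt_records: iterable of (aa_mutation, actual_count, expected_count), one
    record per nucleotide mutation. Amino-acid mutations whose summed expected
    count is below `min_expected` are dropped, mirroring the threshold of 10
    used for the figures.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for aa_mutation, actual, expected in nt_records:
        totals[aa_mutation][0] += actual
        totals[aa_mutation][1] += expected
    return {
        aa: mutation_effect(actual, expected)
        for aa, (actual, expected) in totals.items()
        if expected >= min_expected
    }


# Toy counts, invented purely to exercise the functions.
toy_records = [
    ("A1S", 40, 50),  # two nucleotide mutations encode the same amino-acid change
    ("A1S", 10, 50),
    ("A1*", 0, 20),   # stop codon never observed despite 20 expected counts
    ("A1G", 3, 5),    # too few expected counts to report
]
effects = amino_acid_effects(toy_records)
for aa_mutation, effect in effects.items():
    print(aa_mutation, round(effect, 2))
```

A neutral mutation (actual equal to expected) gives an effect of zero, and more negative values indicate stronger purifying selection.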
Mutation-effect estimates are robust to subsampling, with some evidence of epistasis in spike
We computed the correlations among mutation-effect estimates made using subsets of SARS-CoV-2 sequences from different viral clades or geographic locations. These estimates were well correlated, with some modest variation in estimates across sequence subsets (Figure 2A,B).
The modest variation in estimates from different sequence subsets could have two causes: statistical noise due to finite mutation counts, or real shifts in mutation effects during SARS-CoV-2 evolution [35, 36]. To test for statistical noise, we computed correlations with different thresholds for how many expected counts are required before making an estimate for a mutation (Figure 2C). Correlations increased with this count threshold, consistent with reduced statistical noise for larger mutation counts. But the correlation for spike mutations was consistently lower for cross-clade but not cross-geography comparisons (Figure 2C). The lower cross-clade correlation for spike appears to be due to epistatic shifts in mutation effects [35, 36, 37, 38, 39] or changes in the selective landscape [40] between SARS-CoV-2 clades, since the correlation is lower between clades with higher spike divergence (Figure 2D). In particular, the interactive version of Figure 2A shows that mutations that are more beneficial in Omicron BA.5 than Delta are often antibody-escape mutations (e.g., K444N or G446S in spike [12]). This result makes sense, since newer variants like Omicron BA.5 are evolving under increased immune selection compared to earlier variants like Delta that circulated in a more immunologically naive population.
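The threshold analysis can be sketched as below: effects estimated from two sequence subsets are correlated using only mutations whose expected counts meet a minimum. The data layout, function names, and toy values are hypothetical, and requiring the threshold in both subsets is an assumption made for this sketch.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_at_threshold(subset_a: dict, subset_b: dict, min_expected: float):
    """Correlate effects for mutations with enough expected counts in both subsets.

    subset_a, subset_b: dict mapping mutation -> (effect, expected_count).
    Returns None when fewer than three mutations pass the threshold.
    """
    shared = [
        m for m in subset_a
        if m in subset_b
        and subset_a[m][1] >= min_expected
        and subset_b[m][1] >= min_expected
    ]
    if len(shared) < 3:
        return None
    return pearson([subset_a[m][0] for m in shared],
                   [subset_b[m][0] for m in shared])


# Toy estimates (invented): m4 has very few expected counts and a noisy effect.
clade_a = {"m1": (0.0, 50), "m2": (-1.0, 50), "m3": (-2.0, 50), "m4": (1.0, 2)}
clade_b = {"m1": (0.1, 40), "m2": (-0.9, 40), "m3": (-2.2, 40), "m4": (-3.0, 3)}

r_low_threshold = correlation_at_threshold(clade_a, clade_b, min_expected=1)
r_high_threshold = correlation_at_threshold(clade_a, clade_b, min_expected=10)
print(r_low_threshold, r_high_threshold)
```

In the toy data, excluding the low-count mutation raises the correlation, which is the qualitative pattern expected if disagreement between subsets is driven by statistical noise.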
Despite evidence for some shifts in mutation effects in spike, for the rest of this paper we aggregate counts across viral clades to make a single estimate for each amino-acid mutation. The reason is that the accuracy of the estimates increases with the number of counts (Figure 2C), and several mutation types only have enough counts for reasonable estimates when aggregating across clades (Figure 1B). For the purposes of this paper, we deemed it preferable to have more accurate and comprehensive pan-SARS-CoV-2 estimates than noisier clade-specific estimates for fewer mutations. However, the interactive version of Figure 2A linked in the legend enables exploration of mutations with disparate estimates among clades.
An important question is whether the mutation fitness estimates are affected by noise from limited statistical sampling of mutations or whether sequencing errors and bioinformatic artefacts distort the estimates. To assess if this is the case, we repeated the entire fitness estimation using an even larger pre-built UShER mutation-annotated tree of all ~14-million SARS-CoV-2 sequences in GISAID [41] as of March 29, 2023. There is an extremely high correlation between fitness estimates made using the ~7-million publicly available sequences and the larger GISAID tree (Figure S1). This concordance indicates that the set of ~7-million public sequences is large enough that doubling the data does not appreciably shift the estimates, and so throughout this paper we use that sequence set due to our preference for publicly available data. Furthermore, fitness estimates using sequences from specific countries (USA and England, Figure 2B) are also highly concordant, suggesting that sequencing and bioinformatic workflows are not driving the signal. Lastly, positions known a priori to be under strong constraint (e.g., start codons, the ribosomal slippage site) typically have few or no mutations, suggesting that sequencing errors in consensus sequences are rare.
Structural and non-structural proteins are under strong purifying selection, but most accessory proteins are not
The distributions of mutation effects concur with biological intuition about how different classes of mutations impact protein function. Most synonymous mutations are nearly neutral, most stop codons are highly deleterious, and amino-acid mutations range from slightly beneficial to highly deleterious (Figure 3A). The handful of synonymous mutations with highly deleterious effects are in either regions of known non-coding constraint (e.g., the ORF1ab ribosomal slippage site [43]) or two regions in the center of E and the end of M (Figure S2).
To investigate differences in functional constraint among viral proteins, we computed the distributions of mutation effects separately for each gene (Figure 3B and S3). SARS-CoV-2 proteins are grouped into three categories: nonstructural (or nsp) proteins, structural proteins (spike, M, N, and E), and accessory proteins (names prefixed with “ORF”) [44]. The nonstructural and structural proteins are essential, and these proteins show strong selection against stop codons and clear although variable purifying selection against amino-acid mutations (Figure 3B and S3; e.g., nsp13 is under stronger protein-level constraint than nsp1).
However, most accessory proteins are under little constraint (Figure 3B and S3). Stop-codon and amino-acid mutations to ORF7a and ORF8 are not more deleterious than synonymous mutations (although recall that our estimates are only sensitive to fitness costs greater than a few percent). The lack of deleterious mutations to ORF8 is consistent with the fact that viruses with deletions in this gene have spread in humans [45] and that major variants had stop codons early in ORF8. Indeed, the loss of accessory proteins such as ORF8 appears to occur with some regularity during the early evolution of non-human viruses in humans [46]. The only accessory protein under strong purifying selection against stop codons is ORF3a (Figure 3B), for which stop codons in the first 240 residues are clearly deleterious (Figure S4). These observations concur with experiments showing SARS-CoV-2 is attenuated by deletion of ORF3a but there is little effect of deleting ORF6, ORF7a, or ORF8 [19, 47, 48]. However, ORF3a’s function must be relatively insensitive to its protein sequence, since other than selection against stop codons there is only amino-acid level constraint at a few sites like 135 and 138 (Figure S4). Observations such as these could help guide experimental studies to better understand protein function.
Mutation-effect estimates correlate with experiments
We examined how the mutation effects estimated using our approach compare with prior high-throughput deep mutational scanning measurements. For spike, two distinct experimental methodologies have been used to characterize large numbers of mutations: yeast display of the receptor-binding domain (RBD) [8, 49] and spike pseudotyped lentiviruses [9]. For Mpro (also known as nsp5 or 3CLpro), two different labs have performed deep mutational scanning using the same basic methodology of assaying protease cleavage in yeast [21, 50, 22].
For spike, our estimates from natural sequences correlate with the experiments almost as well as the two experimental methodologies correlate with each other (Figure 4A), with Pearson correlations of 0.66 between the estimates and experiments versus 0.72 between the two experiments. Neither experiment fully captures how mutations affect viral fitness, since both RBD yeast display and lentiviral pseudotyping are imperfect proxies for spike function during actual human infections. Therefore, it is unclear how much the differences between the mutation-effect estimates and experiments are due to noise in the estimates versus limitations of the experiments. However, the fact that the estimates correlate with the experiments almost as well as the experiments correlate with each other (Figure 4A) suggests the estimates are of comparable quality to experimental measurements. At least some of the mutations with the greatest divergence between our estimates and the deep mutational scanning likely represent experimental artifacts. For instance, P527L, which is favorable in the RBD deep mutational scan but deleterious in the sequence-based estimates and full-spike scan, is at the C-terminus of the yeast-displayed RBD [8] where it may adopt a non-native conformation.
The sequence-based estimates for Mpro also correlate with the deep mutational scans for that protein, although in this case the experiments correlate substantially better with each other than with our estimates (Figure 4B). However, the Mpro experiments all use a similar yeast-based methodology [21, 50, 22] that fails to capture significant aspects of Mpro’s function during human infections. For instance, a stop codon at Q306 is well tolerated in the deep mutational scans but extremely unfavorable in our sequence-based estimates, and such a mutation would clearly be highly deleterious to actual virus as it would truncate the polyprotein. Similarly, K61N is well tolerated in the deep mutational scans but extremely unfavorable in our estimates, probably because in the full viral polyprotein this residue mediates important interactions between Mpro and nsp7–10 [51].
Mutation-effect estimates better capture functional constraint than dN/dS ratios or predictions from other methods
A longstanding approach for analyzing protein constraint is to compare rates of nonsynonymous (dN) and synonymous (dS) substitutions at each site [52, 53]. These dN/dS ratios can be calculated by counting mutations or using phylogenetic substitution models [53, 54]. A limitation of dN/dS ratios is that they cannot be interpreted in terms of fitness effects, since they simply represent the relative rate of amino-acid substitution rather than the effects of specific mutations [55, 56]. Nonetheless, we can compare dN/dS ratios to our mutation-effect estimates as measures of the average constraint at each site. The mutation-effect estimates greatly outperform dN/dS ratios as a measure of site-level constraint as assessed by correlation with deep mutational scanning experiments (Figure S5). This is in part because some aspects of functional constraint cannot be captured by a dN/dS ratio. For instance, the ACE2-affinity enhancing spike mutation N501Y arose in several SARS-CoV-2 variants early in the pandemic, and has since remained fixed due to its importance for receptor binding [49]. Our mutation-effect estimates correctly reflect that site 501 strongly prefers tyrosine, but the site has a high dN/dS ratio due to the early convergent evolution of this site to that preferred amino acid.
Our mutation-effect estimates also correlate better with deep mutational scanning experiments than predictions from two algorithms trained to learn epistatic models of mutation effects from phylogenetically broader but more sparsely sampled sequence data [27, 26] (Figure S6). Our estimates also correlate better with experiments than predictions by a machine-learning algorithm that integrates sequence and epidemiological data [25] (Figure S6). These results suggest that our straightforward approach of directly reading out the effects of mutations from their actual versus expected counts can outperform much more complex models when millions of sequences are available.
Fixed mutations tend to have beneficial or neutral effects
Amino-acid mutations that have fixed in at least one viral clade are estimated to mostly have neutral or beneficial effects, whereas most other mutations are deleterious (Figure S7). This fact is unsurprising: viral lineages that expand into new clades do so because they have acquired beneficial mutations while avoiding deleterious ones [57, 58, 59]. But the fact that the beneficial effects of fixed mutations are correctly estimated by our approach, which simply counts mutation occurrences and does not incorporate information on lineage size, demonstrates such mutations occur independently in many viral lineages that are more successful than average.
Most fixed mutations are estimated to be beneficial regardless of whether estimates are made using all viral clades, or just clades that did not fix the mutation (Figure S8). However, a few beneficial fixed mutations show epistatic entrenchment [38, 60] in the sense that they are not particularly beneficial in clades in which they did not fix (Figure S8). The most striking example is S373P in spike, which has experimentally been shown to be neutral or slightly deleterious in pre-Omicron clades, but strongly beneficial in the Omicron clades in which it fixed [49, 36].
Interactive exploration of amino-acid fitnesses
To enable easy access to the mutation-effect estimates, we created interactive plots to enable exploration of the data for each protein. A static view of one of these plots is in Figure 5; see https://jbloomlab.github.io/SARS2-mut-fitness for interactive versions for all proteins. These plots enable both high-level inspection of functional constraint across each protein, and detailed interrogation of the effects of specific mutations.
Discussion
Enough SARS-CoV-2 viruses have now been sequenced that many independent occurrences of every tolerated single-nucleotide mutation have been observed along the viral phylogeny. Here we have described a new approach that leverages this fact to estimate the effects of these mutations. In essence, we treat natural evolution as a deep mutational scan, with the millions of publicly available SARS-CoV-2 sequences providing a readout of this experiment. The key is simply to calculate how many times each mutation has been “tested” along the history of sampled viral sequences, and compare that expectation to the actual observations of the mutation among viruses sufficiently fit to have been sequenced in actual human infections.
The resulting estimates of mutational effects are robust to subsetting on specific viral clades or geographies, and correlate well with experimental measurements. In broad strokes, the mutation effects illuminate patterns of constraint: for instance, there is strong selection on structural and non-structural proteins, but only limited purifying selection on the accessory proteins.
However, the real value of our approach is in the detailed maps of effects of specific mutations to all viral proteins, including proteins with poorly understood functions not easily characterized in the lab. These maps will be of value for designing drugs that target constrained sites, interpreting the consequences of mutations observed during viral surveillance, and guiding experiments to mechanistically characterize protein function.
There are several caveats to our approach. First, because the number of observations of any given mutation is small compared to the millions of SARS-CoV-2 sequences being analyzed, our approach requires careful quality control to remove sequencing errors. Second, we assume the rate of each type of nucleotide mutation is uniform across the viral genome, and neglect higher-order context that may influence mutation rate [61, 62]. Likewise, we neglect constraint on nucleotide identity beyond the encoded protein sequence [63, 64], although this probably has only a minor effect, since our analyses show just a handful of synonymous sites are under strong selection (Figure S2). Third, the exact relationship between the statistics we calculate and viral fitness depends on the fraction of all infections that are sequenced (sampling intensity) and viral population dynamics. Although we derive this relationship, we do not adjust for sampling intensity and population dynamics when estimating mutation effects. Fourth, we make a single estimate for each mutation across all SARS-CoV-2, neglecting the epistasis that can affect some mutations [35, 36]. Finally, there are a few technical caveats to how we count mutations that are discussed in the Methods section.
Conceptually, our approach differs from prior methods that aim to identify beneficial SARS-CoV-2 mutations associated with viral clades that increase in frequency [23, 24, 25]. Those methods draw information primarily from what happens down-stream of a mutation. In contrast, we treat all mutations equivalently regardless of whether they are on a tip node or at the base of a large clade. Our approach is better for estimating effects of deleterious or nearly neutral mutations, but clade-growth methods may be better for beneficial mutations. In particular, clade size carries information beyond that contained in mutation counts alone (Figure S9). Hopefully future work can combine mutation-counting and clade-growth methods for even better estimates of SARS-CoV-2 mutation effects. Note our approach is conceptually similar to estimating fitness costs of HIV or polio mutations from mutation-selection balance in deep sequencing of intra-population viral quasispecies [65, 66], except we analyze mutation occurrences rather than frequencies to account for the phylogenetic structure and genetic hitchhiking that characterize global SARS-CoV-2 evolution.
The power of the approach we have described will increase with more viral sequencing. SARS-CoV-2 is the first virus with enough sequences that every tolerated mutation is observed multiple independent times. As costs drop, it is easy to imagine a future with even more viral sequences. As this occurs, viral genomic sequencing—which has traditionally been used primarily to track evolution and spread—will also become an increasingly precise tool to determine the effects of specific mutations.
Methods
Code and data availability
See the GitHub repository at https://github.com/jbloomlab/SARS2-mut-fitness for the computer code and processed data (e.g., fitness estimates and mutation counts). That repository contains a README with links to specific data files as well as a description of the computational pipeline. See https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aa_fitness.csv for final estimates of amino-acid fitnesses across all clades; other intermediate data files are also provided in the GitHub repository. The specific version of the repository used for this paper is tagged as “bioRxiv-v2” on GitHub (https://github.com/jbloomlab/SARS2-mut-fitness/tree/bioRxiv-v2). The pipeline is fully reproducible, and is run using snakemake [67] with interactive plots rendered using altair [68].
The interactive plots are rendered at https://jbloomlab.github.io/SARS2-mut-fitness via GitHub pages.
Versioning of analyses of different sequence sets
The figures in this manuscript show analyses of the set of all publicly available sequences as of May 11, 2023. However, the pipeline can be run on different sequence sets. The sequence sets on which the analysis is currently run include all sets listed under the “mat_trees” key in https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/config.yaml; these include the sets of all public sequences from several earlier dates (such as those available for the first version of this analysis), as well as the set of all sequences in GISAID as of March 29, 2023. A version of the results for each sequence set is provided in the GitHub repository (https://github.com/jbloomlab/SARS2-mut-fitness) in subdirectories with names like “results_public_2023-05-11”, and the index page for the interactive plots (https://jbloomlab.github.io/SARS2-mut-fitness) links at the bottom to plots for each sequence set. The “current_mat” key in https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/config.yaml specifies which sequence set is used to generate the main results that are in the “results” subdirectory in the GitHub repository and are shown by default in the interactive plots; we anticipate periodically updating this to newer sequence sets as more sequences become available. See https://jbloomlab.github.io/SARS2-mut-fitness/mat_aa_fitness_correlations.html for the correlations among mutation effects estimated from the different sequence sets.
For the GISAID sequence set, we acknowledge the submitters of the sequences listed at the following URLs: https://doi.org/10.55876/gis8.230403ab, https://doi.org/10.55876/gis8.230403hg, https://doi.org/10.55876/gis8.230403ht, and https://doi.org/10.55876/gis8.230403tg.
Counting mutations along the phylogenetic tree
We counted occurrences of each mutation in each viral clade using the UShER pre-built mutation-annotated tree [28, 29, 30] from May 11, 2023 (http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2023/05/11/public-2023-05-11.all.masked.nextclade.pangolin.pb.gz), which contains all ~7-million SARS-CoV-2 sequences that are available in public databases. To make these counts at a per-clade level, we first subsetted the mutation-annotated tree on all sequences for each Nextstrain clade [69], retained only clades with at least 10^4 sequences, and then used the matUtils program distributed with UShER to extract the nucleotide mutations on every branch of each clade-subsetted mutation-annotated tree. For the analyses by geographic location (Figure 2), we subsetted on all sequences whose names began with “USA” or “England” as these were the two locations with the most publicly available sequences.
We then performed quality control by ignoring any branch that met any of the following criteria:
it had more than four nucleotide mutations;
it contained more than one nucleotide mutation that was a reversion to the Wuhan-Hu-1 reference sequence;
it contained more than one nucleotide mutation that was a reversion to the founder sequence for that clade as provided at https://raw.githubusercontent.com/neherlab/SC2_variant_rates/7e738194a8c6592082f1caa9a6ca70cb68289790/data/clade_gts.json by [34];
it contained more than one nucleotide mutation to the same codon.
The rationale for the first exclusion is that highly mutated branches are often indicative of sequencing errors or viral evolution in chronically infected humans, neither of which corresponds to the pattern of typical SARS-CoV-2 transmission in acute infections. Because the virus’s evolution is very densely sampled, only a small fraction of branches have more than four mutations (Figure S10). The rationale for the second and third exclusions is that excess reversions can arise from base-calling pipelines that erroneously call low-coverage sites as reference. We ignore branches with multiple nucleotide mutations to the same codon (this is very rare) because, as detailed below, our method is only designed to make estimates for mutations that represent single-nucleotide changes from the clade founder. Note also that the mutation-annotated tree does not include insertion or deletion mutations, and so we only consider (and make estimates for) point mutations.
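The four branch-level criteria above can be sketched as a single filter. This is a minimal illustration under stated assumptions, not the pipeline's code: the `NtMutation` record, the `keep_branch` function, and the `codon_of_site` lookup (which ignores sites covered by overlapping reading frames) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtMutation:
    parent: str  # nucleotide before the mutation
    site: int    # 1-based genome position
    mutant: str  # nucleotide after the mutation


def keep_branch(mutations, reference, founder, codon_of_site, max_mutations=4):
    """Return True if a branch passes all four quality-control criteria.

    reference, founder: dict mapping site -> nucleotide in the Wuhan-Hu-1
    reference and in the clade founder, respectively.
    codon_of_site: dict mapping site -> codon identifier (hypothetical helper).
    """
    # 1. Too many nucleotide mutations on one branch.
    if len(mutations) > max_mutations:
        return False
    # 2. More than one reversion to the reference sequence.
    if sum(m.mutant == reference.get(m.site) for m in mutations) > 1:
        return False
    # 3. More than one reversion to the clade founder sequence.
    if sum(m.mutant == founder.get(m.site) for m in mutations) > 1:
        return False
    # 4. More than one nucleotide mutation in the same codon.
    codons = [codon_of_site[m.site] for m in mutations if m.site in codon_of_site]
    if len(codons) != len(set(codons)):
        return False
    return True


# Toy seven-site genome, invented for illustration.
toy_reference = {1: "A", 2: "C", 3: "G", 4: "T", 5: "A", 6: "C", 7: "G"}
toy_founder = {**toy_reference, 4: "C"}
toy_codons = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 2}

single = [NtMutation("A", 1, "G")]
same_codon = [NtMutation("A", 1, "G"), NtMutation("C", 2, "T")]
two_reversions = [NtMutation("T", 1, "A"), NtMutation("G", 5, "A")]
print(keep_branch(single, toy_reference, toy_founder, toy_codons))
```

A mutation is treated as a reversion when its mutant nucleotide matches the reference (or founder) at that site, which implies that the parent nucleotide differed from it.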
We then specified for exclusion certain mutations and sites that are prone to sequencing or base-calling errors. Specifically, we excluded
the sites specified in Table S1 of [70] as being error prone;
sites 5629, 6851, 7328, 28095, and 29362 since they had very high error rates in some clades;
the problematic sites listed at https://github.com/W-L/ProblematicSites_SARS-CoV2, which are masked in the pre-built mutation-annotated tree;
for each clade, the clade-specific sites listed in https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/data/usher_masked_sites.yaml, which are masked in the pre-built mutation-annotated tree;
for each clade, any mutation that was a reversion from the clade founder to the Wuhan-Hu-1 reference, and the reverse complements of these mutations.
The last exclusion criterion is applied because some bioinformatics pipelines called low-coverage sites as reference.
See https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/mutation_counts/aggregated.csv for the final counts of each nucleotide mutation in each clade; note that this file also contains excluded mutations.
Calculation of expected counts
To calculate the expected counts for each nucleotide mutation, we analyzed just the four-fold degenerate sites in each clade in an approach paralleling that of [31]. Specifically, we identify all non-excluded four-fold degenerate sites in each clade founder. We then count nucleotide mutations just at those sites in each clade, and calculate the expected per-site number of mutations from nucleotide x to y as the total number of x to y mutations at four-fold degenerate sites divided by the number of four-fold degenerate sites with x as the parental identity. This analysis is done at the clade level for two reasons. First, referencing mutations to the clade founder (rather than the Wuhan-Hu-1 reference) limits problems that would arise at sites that substitute multiple times in the history of a sequence (since each clade is a relatively high-identity group, multiple mutations at the same site within a clade are very rare). Second, it is known that SARS-CoV-2 mutation rates vary somewhat among clades [31, 32]. We only retain clades with at least 5000 mutations at four-fold degenerate sites in order to avoid inaccurate estimates of expected counts due to low sampling of mutations.
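As a rough sketch of this calculation for a single clade, the following function divides the number of observed x to y mutations at four-fold degenerate sites by the number of such sites with parental identity x. The function name, argument names, and input layout are hypothetical and chosen only for illustration; they do not correspond to the file formats used in the pipeline.

```python
from collections import Counter


def expected_per_site_counts(founder_4fold, observed_muts, min_muts=5000):
    """Expected per-site counts of each x->y mutation for one clade.

    founder_4fold: dict site -> parental nucleotide, for the non-excluded
                   four-fold degenerate sites in the clade founder
    observed_muts: iterable of (site, parent_nt, mutant_nt), one entry per
                   independent occurrence of a mutation on the tree
    Returns None if the clade has fewer than min_muts usable mutations.
    """
    n_sites = Counter(founder_4fold.values())
    n_muts = Counter()
    for site, parent, mutant in observed_muts:
        # Only count mutations from the founder identity at four-fold sites.
        if founder_4fold.get(site) == parent and mutant != parent:
            n_muts[(parent, mutant)] += 1
    if sum(n_muts.values()) < min_muts:
        return None
    return {
        (x, y): n_muts[(x, y)] / n_sites[x]
        for x in n_sites
        for y in "ACGT"
        if y != x
    }


toy_founder = {10: "C", 20: "C", 30: "A"}
toy_muts = [(10, "C", "T"), (20, "C", "T"), (10, "C", "T"), (30, "A", "G"), (40, "C", "T")]
toy_rates = expected_per_site_counts(toy_founder, toy_muts, min_muts=0)
print(toy_rates[("C", "T")], toy_rates[("A", "G")])
```

The per-site value for x to y is then used as the expected count for every non-excluded site in that clade whose founder nucleotide is x, whether or not the site is four-fold degenerate.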
Mutational effects from actual versus expected counts
To estimate the effects of mutations, we simply compare the expected counts of each nucleotide mutation to the actual counts in the pre-built mutation-annotated tree. See https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/expected_vs_actual_mut_counts/expected_vs_actual_mut_counts.csv for these expected versus actual counts on a per-clade basis; note that this file also includes counts at excluded sites.
Specifically, we first sum the counts of all non-excluded nucleotide mutations that encode each amino-acid mutation to convert the nucleotide counts to amino-acid counts. In doing this, we exclude any mutations that are not from the clade-founder codon identity: in other words, we ignore sequences with histories that involve multiple mutations at the same codon in the same clade (this is a caveat of the approach, although because each clade is relatively high identity it does not have a major effect). For the overall estimates reported in this paper, we also sum these counts across all retained clades; for the analyses in Figure 2 we also make estimates without summing across clades and only for counts from sequences from specific geographic locations. We then compute the estimated fitness effect $\Delta f$ of each mutation as simply the natural logarithm of the ratio of actual to expected counts after adding a pseudocount $P$ to each count, namely $\Delta f = \ln\left(\frac{n_{\rm actual} + P}{n_{\rm expected} + P}\right)$.
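The aggregation and log-ratio steps can be illustrated with a short sketch. The function name and the input layout are hypothetical, the mutation labels are made up, and the pseudocount of 0.5 in the example call is an arbitrary value chosen for illustration.

```python
import math
from collections import defaultdict


def aa_mutation_effects(nt_counts, pseudocount):
    """Log ratio of actual to expected counts for each amino-acid mutation.

    nt_counts: iterable of (aa_mutation, actual, expected), one entry per
               non-excluded nucleotide mutation (and per clade, if summing
               across clades) that encodes the amino-acid mutation
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for aa_mutation, actual, expected in nt_counts:
        totals[aa_mutation][0] += actual
        totals[aa_mutation][1] += expected
    return {
        aa_mutation: math.log((actual + pseudocount) / (expected + pseudocount))
        for aa_mutation, (actual, expected) in totals.items()
    }


toy_counts = [
    ("geneX:A5V", 30, 12.0),   # first nucleotide mutation encoding A5V
    ("geneX:A5V", 10, 8.0),    # second nucleotide mutation encoding A5V
    ("geneX:Q7*", 0, 19.5),    # stop-codon mutation never observed
]
toy_effects = aa_mutation_effects(toy_counts, pseudocount=0.5)
print(toy_effects)
```

Here the two nucleotide mutations encoding the same amino-acid change are pooled before the ratio is taken, and the pseudocount keeps the estimate finite for the mutation with zero actual counts.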
Note that these mutation-effect estimates have more statistical noise the smaller the expected counts for each mutation. Therefore, we also track the expected counts alongside the estimates. In this paper, we only show estimates for mutations with expected counts of at least 10 unless otherwise noted. However, the figures link to interactive legends that allow adjustment of this threshold: larger values (eg, 20 or more) will lead to slightly more accurate estimates but drop some mutations, while lower values can be used if a noisier estimate is needed for a mutation that has fewer than 10 expected counts.
See https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aamut_fitness_all.csv for the estimates of amino-acid mutation effects across all clades, and see https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aamut_fitness_by_clade.csv for the clade-specific estimates. The all-clade estimates of mutation effects are what are shown in Figure 3.
For the clade correlations plotted in Figure 2, we only include clades with at least $5 \times 10^5$ expected counts across all sites, as only these clades have enough counts for reasonable per-clade estimates.
Mutation effects to amino-acid fitnesses
For the final estimates of amino-acid fitnesses shown in the heatmaps such as in Figure 5, we need a single estimate for each amino acid. This is straightforward for sites that have the same amino-acid identity in all clade founders: the “wildtype” residue shared across all clades has a fitness of zero, and all other amino acids have fitnesses equal to the effect of mutating from the “wildtype” to that amino acid. However, for sites that change amino-acid identity between clade founders, things are more complicated and we need to take the extra step below.
For each clade, we have estimated the change in fitness $\Delta f_{xy}$ caused by mutating a site from amino acid $x$ to $y$, where $x$ is the amino acid in the clade founder sequence. For each such mutation, we also have $n_{xy}$, which is the number of expected mutations from the clade founder amino acid $x$ to $y$. These $n_{xy}$ values are important because they give some estimate of our “confidence” in the $\Delta f_{xy}$ values: if a mutation has high expected counts (large $n_{xy}$) then we can estimate the change in fitness caused by the mutation more accurately, and if $n_{xy}$ is small then the estimate will be much noisier.
However, we would like to aggregate the data across multiple clades to estimate amino-acid fitness values at a site under the assumption that these are constant across clades. Things get complicated if not all clade founders have the same amino acid identity at a site. For instance, let’s say at our site of interest, the clade founder amino acid is $x_1$ in one clade and $x_2$ in another clade. For each clade we then have a set of $\Delta f_{x_1 y}$ and $n_{x_1 y}$ values for the first clade (where $y$ ranges over the 20 amino acids, including stop codon, that aren’t $x_1$), and another set of up to 20 $\Delta f_{x_2 y}$ and $n_{x_2 y}$ values for the second clade (where $y$ ranges over the 20 amino acids that aren’t $x_2$).
From these sets of mutation fitness changes, we’d like to estimate the fitness $f_x$ of each amino acid $x$, where the $f_x$ values satisfy $\Delta f_{xy} = f_y - f_x$ (in other words, a higher $f_x$ means higher fitness of that amino acid). When there are multiple clades with different founder amino acids at the site, there is no guarantee that we can find $f_x$ values that precisely satisfy the above equation since there are more $\Delta f_{xy}$ values than $f_x$ values and the $\Delta f_{xy}$ values may have noise (and in some cases even real shifts among clades due to epistasis). Nonetheless, we can try to find the $f_x$ values that come closest to satisfying the above equation.
First, we choose one amino acid to have a fitness value of zero, since the scale of the $f_x$ values is arbitrary and there are really only 20 unique parameters among the 21 $f_x$ values (there are 21 amino acids since we consider stops, but we only measure differences among them, not absolute values). Typically if there was just one clade, we would set the wildtype value of $f_x = 0$ and then for mutations to all other amino acids $y$ we would simply have $f_y = \Delta f_{xy}$. However, when there are multiple clades with different founder amino acids, there is no longer a well defined “wildtype”. So we choose the most common non-stop parental amino acid for the observed mutations and set that to zero. In other words, we find the $x$ that maximizes $\sum_y n_{xy}$ and set that $f_x$ value to zero.
Next, we choose the $f_x$ values that most closely match the measured mutation effects, weighting more strongly mutation effects with higher expected counts (since these should be more accurate). Specifically, we define a loss function as
$$L = \sum_x \sum_{y \ne x} n_{xy} \left(\Delta f_{xy} - \left[f_y - f_x\right]\right)^2$$
where we ignore effects of synonymous mutations (the $y \ne x$ condition in the second sum) because we are only examining protein-level effects. We then use numerical optimization to find the $f_x$ values that minimize that loss $L$.
Finally, we would still like to report an equivalent of the $n_{xy}$ values for the $f_x$ values that gives us some sense of how accurately we have estimated the fitness of each amino acid. To do that, we tabulate $N_x = \sum_y \left(n_{xy} + n_{yx}\right)$, the total number of expected mutations either from or to amino acid $x$, as the “count” for the amino acid. Amino acids with larger values of $N_x$ should have more accurate estimates of $f_x$.
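To make the fitting step concrete, here is a minimal sketch of one way to obtain per-amino-acid fitnesses from per-clade mutation effects by minimizing a count-weighted squared error between each mutation effect and the corresponding difference in amino-acid fitnesses. It is an illustration under stated assumptions rather than the paper's implementation: the text only says that numerical optimization is used, and this sketch substitutes simple coordinate descent (each fitness is repeatedly set to its closed-form optimum given the others). The function names and the `(x, y, effect, expected_count)` input layout are hypothetical.

```python
from collections import defaultdict


def fit_aa_fitness(effects, n_sweeps=1000):
    """Fit per-amino-acid fitnesses at one site.

    effects: list of (x, y, delta_f, n_expected) for mutations from founder
             amino acid x to y, pooled over clades. "*" denotes a stop codon.
    Returns (reference amino acid, dict amino acid -> fitness).
    """
    # Reference: most common non-stop parental amino acid by expected counts.
    parent_counts = defaultdict(float)
    for x, y, _, n in effects:
        if x != "*" and x != y:
            parent_counts[x] += n
    ref = max(sorted(parent_counts), key=parent_counts.get)

    aas = sorted({aa for x, y, _, _ in effects for aa in (x, y)})
    f = {aa: 0.0 for aa in aas}
    for _ in range(n_sweeps):
        for z in aas:
            if z == ref:
                continue  # the reference stays fixed at zero
            num = den = 0.0
            for x, y, d, n in effects:
                if x == y:
                    continue  # synonymous changes are ignored
                if y == z:
                    num += n * (d + f[x])
                    den += n
                elif x == z:
                    num += n * (f[y] - d)
                    den += n
            if den > 0:
                f[z] = num / den
    return ref, f


def weighted_loss(effects, f):
    """Count-weighted squared error between effects and fitness differences."""
    return sum(n * (d - (f[y] - f[x])) ** 2 for x, y, d, n in effects if x != y)


# Two clades with different founder amino acids (A and B) at the same site.
toy = [("A", "B", -1.0, 10.0), ("B", "A", 1.0, 10.0), ("B", "C", -1.0, 5.0)]
toy_ref, toy_f = fit_aa_fitness(toy)
print(toy_ref, toy_f, weighted_loss(toy, toy_f))
```

Because the loss is quadratic in the fitness values, any standard least-squares solver would serve equally well; coordinate descent is used here only to keep the sketch dependency-free.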
See https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aa_fitness.csv for these overall amino-acid fitness estimates.
Site numbering and protein naming
All sites are numbered according to the sequential Wuhan-Hu-1 reference numbering scheme, using the reference sequence at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/wuhCor1.fa.gz. The protein annotations are taken from the associated GTF at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/genes/ncbiGenes.gtf.gz. Those protein annotations refer to the polyproteins encoding the non-structural proteins as ORF1a and ORF1ab. To convert from ORF1ab numbering/naming to the nsp-based naming (eg, nsp1, nsp2, etc) we use the conversions specified under “orf1ab_to_nsps” in https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/config.yaml, which are in turn taken from Theo Sanderson’s annotations at https://github.com/theosanderson/Codon2Nucleotide/blob/main/src/App.js.
Comparison to deep mutational scanning
Deep mutational scanning data were taken from published studies [9, 49, 21, 50, 22], using the data at the links specified under the “dms_datasets” key in https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/config.yaml. For the spike deep mutational scanning [9] we only included mutations with “times seen” values of at least three in the deep mutational scanning. The RBD data [49] include measurements for two phenotypes (ACE2 affinity and RBD expression), and one of the Mpro studies [21] includes measurements for three different phenotypes in yeast (growth, FRET, and transcription factor activity). Figures 4, S5, and S6 show the effect averaged across all phenotypes measured by each of these studies. For plots that break the correlations out by phenotype, see https://jbloomlab.github.io/SARS2-mut-fitness/dms_S_all_corr.html and https://jbloomlab.github.io/SARS2-mut-fitness/dms_nsp5_all_corr.html.
Comparison to dN/dS and other mutation-effect prediction algorithms
For the comparison to the dN/dS approaches shown in Figure S5, we used the dN/dS values available at https://github.com/spond/SARS-CoV-2-variation [71] for all SARS-CoV-2 sequences, which were calculated using the FEL approach [53]. The dN/dS ratios only provide a single number for each site, which cannot be directly compared to either the mutation-effect estimates or the deep mutational scanning, which estimate the effects of individual amino-acid mutations. We therefore computed site-summary metrics of the mutation-effect estimates and the deep mutational scanning as the average effect of all measured amino-acid mutations at each site, excluding stop codons. The correlations in Figure S5 are with those site-summary metrics.
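The site-summary metric described above amounts to a per-site average over amino-acid mutations with stop codons left out. A minimal sketch, using a hypothetical `(site, mutant_amino_acid) -> effect` dictionary as input:

```python
from collections import defaultdict
from statistics import mean


def site_mean_effects(effects):
    """Average effect of all measured amino-acid mutations at each site.

    effects: dict (site, mutant_aa) -> effect, with "*" denoting a stop codon.
    Stop-codon mutations are excluded from the average.
    """
    by_site = defaultdict(list)
    for (site, mutant_aa), value in effects.items():
        if mutant_aa != "*":
            by_site[site].append(value)
    return {site: mean(values) for site, values in by_site.items()}


toy_site_effects = {(1, "A"): -1.0, (1, "V"): -3.0, (1, "*"): -6.0, (2, "G"): 0.5}
toy_summary = site_mean_effects(toy_site_effects)
print(toy_summary)
```

The same function can be applied to either the mutation-effect estimates or a deep mutational scanning dataset, which puts both on the per-site scale needed for comparison with dN/dS.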
We also compared both our mutation-effect estimates and the spike deep mutational scanning measurements [9] to predictions from three other algorithms:
the EpiScores reported by Maher et al (2022) [25],
the DCA mutability scores reported by Rodriguez-Rivas et al (2022) [26], and
the EVE scores reported by Thadani et al (2023) [27].
These comparisons are shown in Figure S6. The Maher et al (2022) and Thadani et al (2023) studies report mutation-level predictions and so are compared directly to the deep mutational scanning and our mutation-effect estimates; Rodriguez-Rivas et al (2022) report only site-level metrics and so are compared to site-summary metrics as for the dN/dS analysis.
Derivation of relationship between actual to expected count ratio and viral fitness
The ratio of actual to expected counts that we calculate in this paper is related to the probability that we observe a viral lineage containing an occurrence of a specific mutation among sequenced human SARS-CoV-2. This probability depends on three factors: the fitness effect of the mutation, the fraction of all SARS-CoV-2 viruses that are sequenced (sampling intensity), and the growth dynamics of the viral population. In the supplementary appendix, we derive the approximate dependence of this probability on the fitness cost $s$ and sampling intensity $\epsilon$ for deleterious mutations, for both a constant and an exponentially growing viral population.
We show that for a constant viral population size, the probability of observing a lineage containing a deleterious mutation with cost $s$ is roughly $\epsilon/s$ when $s \gg \sqrt{\epsilon}$, and more weakly dependent on $s$ for smaller fitness costs (when $s \ll \sqrt{\epsilon}$). The intuitive explanation is that the average size of a mutant lineage with fitness cost $s$ is $1/s$ and we basically ask whether we sample the lineage before it disappears. If we sample more intensely (larger $\epsilon$), whether a lineage gets sampled depends primarily on the stochastic dynamics and little on the fitness effect. With a typical sampling intensity for SARS-CoV-2 between 1/1000 and 1/100, this means our approach is sensitive to fitness effects larger than a few percent per serial interval; mutations with fitness costs smaller than that will not show an appreciable difference from neutral mutations in their ratio of actual to expected counts.
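The dependence on fitness cost and sampling intensity can be explored numerically with a toy branching-process model. This is only an illustrative sketch and is not the derivation in the supplementary appendix. It assumes that each infection carrying the mutation is sequenced independently with probability `eps` and causes a Poisson-distributed number of new infections with mean `1 - s` in a population of constant size. Under those assumptions, the probability $q$ that no member of a lineage is ever sequenced satisfies $q = (1 - \epsilon)\, e^{-(1 - s)(1 - q)}$, which the function below solves by fixed-point iteration.

```python
import math


def prob_lineage_sampled(s, eps, n_iter=100_000, tol=1e-14):
    """Probability that a mutant lineage is sequenced at least once.

    Toy model (an assumption of this sketch): Poisson offspring with mean
    1 - s, and each infection sequenced independently with probability eps.
    """
    q = 0.0  # probability that no member of the lineage is ever sequenced
    q_new = q
    for _ in range(n_iter):
        q_new = (1.0 - eps) * math.exp(-(1.0 - s) * (1.0 - q))
        if abs(q_new - q) < tol:
            break
        q = q_new
    return 1.0 - q_new


# Strongly deleterious mutation: compare with the ratio eps / s.
p_strong = prob_lineage_sampled(s=0.2, eps=0.001)
print(p_strong, 0.001 / 0.2)

# Weakly deleterious mutation: the probability is far below eps / s.
p_weak = prob_lineage_sampled(s=0.001, eps=0.01)
print(p_weak, 0.01 / 0.001)
```

In this toy model the sampled probability tracks `eps / s` closely when the cost is large relative to the square root of `eps`, and becomes nearly independent of the cost when it is much smaller.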
In an exponentially growing population, the probability of observing a mutant lineage with fitness cost s again scales as , if where is the time over which the variant has expanded. If is ~ months, that is 20 generations, which again corresponds to s of at least a few percent for . For mutations with smaller fitness costs, the dependence scales more as .
Overall, these calculations indicate that for multiple different growth dynamics of the viral population, the ratio of actual to expected counts will scale inversely with the fitness cost of deleterious mutations for mutations with costs that exceed a few percent. Note that the approach we use in this paper does not account for variation in sampling intensity across space or time, does not attempt to adjust for changes in viral growth dynamics over time, uses the heuristic formula of calculating the effect as the log ratio of counts, and applies this same formula to all mutations regardless of whether they are deleterious, neutral, or beneficial. A more complete derivation might try to calculate the fitness effects from the full distribution of lineage sizes more rigorously and incorporate information about the sampling intensity and viral growth dynamics. However, such a derivation (if possible at all) is beyond the scope of this study, and we also note that good empirical data are generally lacking to precisely account for sampling intensity and viral growth dynamics over the full span of time and space from which the sequences we analyze are drawn. The key point of the derivations for our current study is simply that our approach should be sensitive to detecting the effects of mutations with fitness costs greater than a few percent.
Acknowledgments
We thank Angie Hinrichs for providing the pre-built mutation-annotated trees and thank the UShER team for promptly answering and addressing GitHub issues related to use of this package and its pre-built trees. We thank the sequence submitters to GISAID, who are listed in the tables cited in the Methods section. The work of JDB was supported in part by the NIH/NIAID under grants U19AI171399 and R01AI141707, and contract 75N93021C00015. JDB is an Investigator of the Howard Hughes Medical Institute.
Footnotes
Competing interests
JDB consults for Apriori Bio, Aerium Therapeutics, Invivyd, the Vaccine Company, GSK, and Pfizer on topics related to viral evolution. JDB receives royalty payments as an inventor on Fred Hutch licensed patents related to deep mutational scanning of viral proteins.
Literature Cited
- [1] Harvey WT, Carabelli AM, Jackson B, Gupta RK, Thomson EC, Harrison EM, et al. SARS-CoV-2 variants, spike mutations and immune escape. Nature Reviews Microbiology. 2021;19(7):409–424.
- [2] Abdool Karim SS, de Oliveira T. New SARS-CoV-2 variants—clinical, public health, and vaccine implications. New England Journal of Medicine. 2021;384(19):1866–1868.
- [3] DeGrace MM, Ghedin E, Frieman MB, Krammer F, Grifoni A, Alisoltani A, et al. Defining the risk of SARS-CoV-2 variants on immune protection. Nature. 2022;605(7911):640–652.
- [4] Moghadasi SA, Heilmann E, Khalil AM, Nnabuife C, Kearns FL, Ye C, et al. Transmissible SARS-CoV-2 variants with resistance to clinical protease inhibitors. Science Advances. 2023;9:eade8778.
- [5] Iketani S, Mohri H, Culbertson B, Hong SJ, Duan Y, Luck MI, et al. Multiple pathways for SARS-CoV-2 resistance to nirmatrelvir. Nature. 2022. doi: 10.1038/s41586-022-05514-2.
- [6] Hiscox JA, Khoo SH, Stewart JP, Owen A. Shutting the gate before the horse has bolted: is it time for a conversation about SARS-CoV-2 and antiviral drug resistance? Journal of Antimicrobial Chemotherapy. 2021;76(9):2230–2233.
- [7] Weisblum Y, Schmidt F, Zhang F, DaSilva J, Poston D, Lorenzi JC, et al. Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants. eLife. 2020;9:e61312.
- [8] Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KH, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell. 2020;182(5):1295–1310.
- [9] Dadonaite B, Crawford KH, Radford CE, Farrell AG, Timothy CY, Hannon WW, et al. A pseudovirus system enables deep mutational scanning of the full SARS-CoV-2 spike. Cell. 2023;186:1263–1278.
- [10] Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, et al. Complete mapping of mutations to the SARS-CoV-2 spike receptor-binding domain that escape antibody recognition. Cell Host & Microbe. 2021;29(1):44–57.
- [11] Cao Y, Jian F, Wang J, Yu Y, Song W, Yisimayi A, et al. Imprinted SARS-CoV-2 humoral immunity induces convergent Omicron RBD evolution. Nature. 2022. doi: 10.1038/s41586-022-05644-7.
- [12] Greaney AJ, Starr TN, Bloom JD. An antibody-escape estimator for mutations to the SARS-CoV-2 receptor-binding domain. Virus Evolution. 2022;8(1):veac021.
- [13] Tzou PL, Tao K, Pond SLK, Shafer RW. Coronavirus Resistance Database (CoV-RDB): SARS-CoV-2 susceptibility to monoclonal antibodies, convalescent plasma, and plasma from vaccinated persons. PLoS ONE. 2022;17(3):e0261045.
- [14] Starr TN, Czudnochowski N, Liu Z, Zatta F, Park YJ, Addetia A, et al. SARS-CoV-2 RBD antibodies that maximize breadth and resistance to escape. Nature. 2021;597(7874):97–102.
- [15] Rappazzo CG, Tse LV, Kaku CI, Wrapp D, Sakharkar M, Huang D, et al. Broad and potent activity against SARS-like viruses by an engineered human monoclonal antibody. Science. 2021;371(6531):823–829.
- [16] Cao Y, Jian F, Zhang Z, Yisimayi A, Hao X, Bao L, et al. Rational identification of potent and broad sarbecovirus-neutralizing antibody cocktails from SARS convalescents. Cell Reports. 2022;41(12):111845.
- [17] Thorne LG, Bouhaddou M, Reuschl AK, Zuliani-Alvarez L, Polacco B, Pelin A, et al. Evolution of enhanced innate immune evasion by SARS-CoV-2. Nature. 2022;602(7897):487–495.
- [18] Syed AM, Taha TY, Tabata T, Chen IP, Ciling A, Khalid MM, et al. Rapid assessment of SARS-CoV-2–evolved variants using virus-like particles. Science. 2021;374(6575):1626–1632.
- [19] McGrath M, Xue Y, Dillen C, Oldfield L, Assad-Garcia N, Zaveri J, et al. SARS-CoV-2 Variant Spike and accessory gene mutations alter pathogenesis. Proceedings of the National Academy of Sciences USA. 2022;119:e2204717119.
- [20] Tao K, Tzou PL, Nouhin J, Bonilla H, Jagannathan P, Shafer RW. SARS-CoV-2 antiviral therapy. Clinical Microbiology Reviews. 2021;34(4):e00109–21.
- [21] Flynn JM, Samant N, Schneider-Nachum G, Barkan DT, Yilmaz NK, Schiffer CA, et al. Comprehensive fitness landscape of SARS-CoV-2 Mpro reveals insights into viral resistance mechanisms. eLife. 2022;11:e77433. doi: 10.7554/eLife.77433.
- [22] Iketani S, Hong SJ, Sheng J, Bahari F, Culbertson B, Atanaki FF, et al. Functional map of SARS-CoV-2 3CL protease reveals tolerant and immutable sites. Cell Host & Microbe. 2022;30(10):1354–1362.
- [23] Obermeyer F, Jankowiak M, Barkas N, Schaffner SF, Pyle JD, Yurkovetskiy L, et al. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science. 2022;376(6599):1327–1332. doi: 10.1126/science.abm1208.
- [24] Lee B, Sohail MS, Finney E, Ahmed SF, Quadeer AA, McKay MR, et al. Inferring effects of mutations on SARS-CoV-2 transmission from genomic surveillance data. medRxiv. 2022. doi: 10.1101/2021.12.31.21268591.
- [25] Maher MC, Bartha I, Weaver S, Di Iulio J, Ferri E, Soriaga L, et al. Predicting the mutational drivers of future SARS-CoV-2 variants of concern. Science Translational Medicine. 2022;14(633):eabk3445.
- [26] Rodriguez-Rivas J, Croce G, Muscat M, Weigt M. Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes. Proceedings of the National Academy of Sciences. 2022;119(4):e2113118119.
- [27] Thadani NN, Gurev S, Notin P, Youssef N, Rollins NJ, Sander C, et al. Learning from pre-pandemic data to forecast viral antibody escape. bioRxiv. 2022. doi: 10.1101/2022.07.21.501023.
- [28] McBroome J, Thornlow B, Hinrichs AS, Kramer A, De Maio N, Goldman N, et al. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Molecular Biology and Evolution. 2021;38(12):5819–5824.
- [29] Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics. 2021;53(6):809–816.
- [30] Lanfear R. A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo. 2020. doi: 10.5281/zenodo.3958883.
- [31] Bloom JD, Beichman AC, Neher RA, Harris K. Evolution of the SARS-CoV-2 mutational spectrum. Molecular Biology and Evolution. 2023;40:msad085.
- [32] Ruis C, Peacock TP, Polo LM, Masone D, Alvarez MS, Hinrichs AS, et al. Mutational spectra distinguish SARS-CoV-2 replication niches. bioRxiv. 2022. doi: 10.1101/2022.09.27.509649.
- [33] De Maio N, Walker CR, Turakhia Y, Lanfear R, Corbett-Detig R, Goldman N. Mutation rates and selection on synonymous mutations in SARS-CoV-2. Genome Biology and Evolution. 2021;13(5):evab087.
- [34] Neher RA. Contributions of adaptation and purifying selection to SARS-CoV-2 evolution. Virus Evolution. 2022;8(2):veac113.
- [35] Starr TN, Greaney AJ, Hannon WW, Loes AN, Hauser K, Dillen JR, et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science. 2022;377:420–424.
- [36] Moulana A, Dupic T, Phillips AM, Chang J, Nieves S, Roffler AA, et al. Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 Omicron BA.1. Nature Communications. 2022;13:7011.
- [37] Pollock DD, Thiltgen G, Goldstein RA. Amino acid coevolution induces an evolutionary Stokes shift. Proceedings of the National Academy of Sciences. 2012;109(21):E1352–E1359.
- [38] Shah P, McCandlish DM, Plotkin JB. Contingency and entrenchment in protein evolution under purifying selection. Proceedings of the National Academy of Sciences. 2015;112(25):E3226–E3235.
- [39] Lee JM, Huddleston J, Doud MB, Hooper KA, Wu NC, Bedford T, et al. Deep mutational scanning of hemagglutinin helps predict evolutionary fates of human H3N2 influenza variants. Proceedings of the National Academy of Sciences. 2018;115(35):E8276–E8285.
- [40] Sun K, Tempia S, Kleynhans J, von Gottberg A, McMorrow ML, Wolter N, et al. Rapidly shifting immunologic landscape and severity of SARS-CoV-2 in the Omicron era in South Africa. Nature Communications. 2023;14(1):246.
- [41] Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance. 2017;22(13):30494.
- [42] Jungreis I, Nelson CW, Ardern Z, Finkel Y, Krogan NJ, Sato K, et al. Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: A homology-based resolution. Virology. 2021;558:145–151.
- [43] Bhatt PR, Scaiola A, Loughran G, Leibundgut M, Kratzel A, Meurs R, et al. Structural basis of ribosomal frameshifting during translation of the SARS-CoV-2 RNA genome. Science. 2021;372(6548):1306–1313.
- [44] V’kovski P, Kratzel A, Steiner S, Stalder H, Thiel V. Coronavirus biology and replication: implications for SARS-CoV-2. Nature Reviews Microbiology. 2021;19(3):155–170.
- [45] Su YC, Anderson DE, Young BE, Linster M, Zhu F, Jayakumar J, et al. Discovery and genomic characterization of a 382-nucleotide deletion in ORF7b and ORF8 during the early evolution of SARS-CoV-2. mBio. 2020;11(4):e01610–20.
- [46] Rochman ND, Wolf YI, Koonin EV. Molecular adaptations during viral epidemics. EMBO Reports. 2022;23(8):e55393.
- [47] Silvas JA, Vasquez DM, Park JG, Chiem K, Allué-Guardia A, Garcia-Vilanova A, et al. Contribution of SARS-CoV-2 accessory proteins to viral pathogenicity in K18 human ACE2 transgenic mice. Journal of Virology. 2021;95(17):e00402–21.
- [48] Liu Y, Zhang X, Liu J, Xia H, Zou J, Muruato AE, et al. A live-attenuated SARS-CoV-2 vaccine candidate with accessory protein deletions. Nature Communications. 2022;13(1):1–14.
- [49] Starr TN, Greaney AJ, Stewart CM, Walls AC, Hannon WW, Veesler D, et al. Deep mutational scans for ACE2 binding, RBD expression, and antibody escape in the SARS-CoV-2 Omicron BA.1 and BA.2 receptor-binding domains. PLoS Pathogens. 2022;18(11):e1010951.
- [50] Flynn JM, Huang QYM, Zvornicanin SN, Schneider-Nachum G, Shaqra AM, Kurt Yilmaz N, et al. Systematic analyses of the resistance potential of drugs targeting SARS-CoV-2 main protease. bioRxiv. 2023. doi: 10.1101/2023.03.02.530652.
- [51] Yadav R, Courouble VV, Dey SK, Harrison JJE, Timm J, Hopkins JB, et al. Biochemical and structural insights into SARS-CoV-2 polyprotein processing by Mpro. Science Advances. 2022;8(49):eadd2191.
- [52] Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics. 1998;148(3):929–936.
- [53] Kosakovsky Pond SL, Frost SD. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Molecular Biology and Evolution. 2005;22(5):1208–1222.
- [54] Yang Z, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular Biology and Evolution. 2000;17(1):32–43.
- [55] Spielman SJ, Wilke CO. The relationship between dN/dS and scaled selection coefficients. Molecular Biology and Evolution. 2015;32(4):1097–1108.
- [56] Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genetics. 2008;4(12):e1000304.
- [57] Łuksza M, Lässig M. A predictive fitness model for influenza. Nature. 2014;507(7490):57–61.
- [58] Koelle K, Rasmussen DA. The effects of a deleterious mutation load on patterns of influenza A/H3N2’s antigenic evolution in humans. eLife. 2015;4:e07361.
- [59] Huddleston J, Barnes JR, Rowe T, Xu X, Kondor R, Wentworth DE, et al. Integrating genotypes and phenotypes improves long-term forecasts of seasonal influenza A/H3N2 evolution. eLife. 2020;9:e60067.
- [60] Starr TN, Flynn JM, Mishra P, Bolon DN, Thornton JW. Pervasive contingency and entrenchment in a billion years of Hsp90 evolution. Proceedings of the National Academy of Sciences. 2018;115(17):4453–4458.
- [61] Sadykov M, Mourier T, Guan Q, Pain A. Short sequence motif dynamics in the SARS-CoV-2 genome suggest a role for cytosine deamination in CpG reduction. Journal of Molecular Cell Biology. 2021;13(3):225–227.
- [62] Beale RC, Petersen-Mahrt SK, Watt IN, Harris RS, Rada C, Neuberger MS. Comparison of the differential context-dependence of DNA deamination by APOBEC enzymes: correlation with mutation spectra in vivo. Journal of Molecular Biology. 2004;337(3):585–596.
- [63] Huston NC, Wan H, Strine MS, Tavares RdCA, Wilen CB, Pyle AM. Comprehensive in vivo secondary structure of the SARS-CoV-2 genome reveals novel regulatory motifs and mechanisms. Molecular Cell. 2021;81(3):584–598.
- [64] Kuo L, Masters PS. Functional analysis of the murine coronavirus genomic RNA packaging signal. Journal of Virology. 2013;87(9):5182–5192.
- [65] Zanini F, Puller V, Brodin J, Albert J, Neher RA. In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evolution. 2017;3(1):vex003. doi: 10.1093/ve/vex003.
- [66] Acevedo A, Brodsky L, Andino R. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature. 2014;505(7485):686–690.
- [67] Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Research. 2021;10.
- [68] VanderPlas J, Granger B, Heer J, Moritz D, Wongsuphasawat K, Satyanarayan A, et al. Altair: interactive statistical visualizations for Python. Journal of Open Source Software. 2018;3(32):1057.
- [69] Aksamentov I, Roemer C, Hodcroft EB, Neher RA. Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software. 2021;6(67):3773.
- [70] Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker CR, et al. Stability of SARS-CoV-2 phylogenies. PLoS Genetics. 2020;16(11):e1009175.
- [71] Martin DP, Weaver S, Tegally H, San JE, Shank SD, Wilkinson E, et al. The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages. Cell. 2021;184(20):5189–5200.
- [72] Markov PV, Ghafari M, Beer M, Lythgoe K, Simmonds P, Stilianakis NI, et al. The evolution of SARS-CoV-2. Nature Reviews Microbiology. 2023; p. 1–19. doi: 10.1038/s41579-023-00878-2.
Data Availability Statement
See the GitHub repository at https://github.com/jbloomlab/SARS2-mut-fitness for the computer code and processed data (e.g., fitness estimates and mutation counts). That repository contains a README with links to specific data files as well as a description of the computational pipeline. See https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aa_fitness.csv for the final estimates of amino-acid fitnesses across all clades; other intermediate data files are also provided in the GitHub repository. The specific version of the repository used for this paper is tagged as “bioRxiv-v2” on GitHub (https://github.com/jbloomlab/SARS2-mut-fitness/tree/bioRxiv-v2). The pipeline is fully reproducible and is run using snakemake [67], with interactive plots rendered using altair [68].
The interactive plots are rendered at https://jbloomlab.github.io/SARS2-mut-fitness via GitHub Pages.
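As a rough illustration of how a table of amino-acid fitness estimates such as aa_fitness.csv might be read and filtered, the following Python sketch parses a small in-memory CSV. The column names (`gene`, `aa_mutation`, `fitness`, `expected_count`), the threshold, and the three toy rows are assumptions made for this example only; they are not taken from the repository, so check the README for the actual file layout before adapting it.

```python
import csv
import io

# Toy stand-in for a downloaded fitness table. The column names and the
# rows are illustrative assumptions, NOT values from the repository.
TOY_CSV = """gene,aa_mutation,fitness,expected_count
S,A1B,-0.2,40.0
S,C2*,-4.1,25.0
N,D3E,0.1,3.0
"""


def load_fitness(handle, min_expected=10.0):
    """Parse rows and keep those whose expected count meets a threshold.

    min_expected is a hypothetical cutoff used here to drop mutations
    with too few expected observations to estimate reliably.
    """
    rows = []
    for row in csv.DictReader(handle):
        row["fitness"] = float(row["fitness"])
        row["expected_count"] = float(row["expected_count"])
        if row["expected_count"] >= min_expected:
            rows.append(row)
    return rows


def most_deleterious(rows, n=1):
    """Return the n rows with the lowest (most negative) fitness."""
    return sorted(rows, key=lambda r: r["fitness"])[:n]


rows = load_fitness(io.StringIO(TOY_CSV))
worst = most_deleterious(rows)
```

To apply the same functions to a real file, replace `io.StringIO(TOY_CSV)` with an open file handle and adjust the column names to match the file's header.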