Abstract
Predicting the evolutionary patterns of emerging and endemic viruses is key for mitigating their spread. In particular, it is critical to rapidly identify mutations with the potential for immune escape or increased disease burden. Knowing which circulating mutations pose a concern can inform treatment or mitigation strategies such as alternative vaccines or targeted social distancing. In 2021, Hie et al. (Hie B, Zhong ED, Berger B, Bryson B. 2021 Learning the language of viral evolution and escape. Science 371, 284–288; doi:10.1126/science.abd7331) proposed that variants of concern can be identified using two quantities extracted from protein language models, grammaticality and semantic change. These quantities are defined by analogy to concepts from natural language processing. Grammaticality is intended to be a measure of whether a variant viral protein is viable, and semantic change is intended to be a measure of potential for immune escape. Here, we systematically test this hypothesis, taking advantage of several high-throughput datasets that have become available, and also comparing this model with several more recently published machine learning models. We find that grammaticality can be a measure of protein viability, though methods that are trained explicitly to predict mutational effects appear to be more effective. By contrast, we do not find compelling evidence that semantic change is a useful tool for identifying immune escape mutations.
Keywords: protein language models, immune escape, machine learning, SARS-CoV-2
1. Introduction
In an ongoing viral epidemic or pandemic, it is critical to be able to identify variants of concern, that is, viral mutants that may have significant impact on the future course of the epidemic and the potential disease burden caused by it. The gold standard is to perform experimental tests to assess whether mutations confer immune escape in vitro [1,2] or may lead to breakthrough infections in vivo [3], but these experiments are labour-intensive, resulting in the characterization of a small number of variants. Larger-scale experiments are possible with deep mutational scanning (DMS) [4–7], yet DMS experiments often require somewhat artificial experimental setups that do not necessarily reflect the infection conditions viruses encounter in their natural hosts.
As an alternative to experimental strategies, several groups have employed modelling approaches, either by predicting the fitness of variants from epidemiological data [8] or by employing biophysical or other mechanistic modelling approaches to predict the fitness of a mutation from its context in the viral protein and the expected interactions with antibodies and host receptors [9]. The downside of the epidemiological approach is that it is strictly backwards looking and cannot make any prediction for newly emerging variants. By contrast, mechanistic and machine learning (ML) modelling approaches are forward-looking but tend to require intensive compute and extensive experimental data for model calibration.
An ideal modelling approach would be able to make predictions for novel mutations while requiring little or only easily obtainable experimental data. Hie et al. [10] suggested that this goal could be achieved with deep learning models, and specifically with protein language models (pLMs) trained on viral sequence data. In particular, they argued that they could extract two relevant quantities from the pLMs, the grammaticality of a mutation and the amount of semantic change induced in the viral protein. The concepts of grammaticality and semantic change are borrowed from natural language processing. In the context of viral evolution, they represent whether the mutation can be expressed in principle (grammaticality) and to what extent it may change the interaction of the expressed viral protein with its environment (semantic change). In particular, the core idea of Hie et al. [10] was that mutations with large predicted semantic change should be more likely to disrupt the protein–protein interface with neutralizing antibodies, and if those mutations also have high grammaticality, they can be selected for immune escape by evolution in vivo. Hie et al. [10] considered three different viral systems, SARS-CoV-2, influenza A and HIV-1, and demonstrated some association between escape mutations and mutations with both high grammaticality and high semantic change. However, since the original publication of Hie et al. [10], their approach has not seen much application or systematic evaluation (but see [11]). Thus, whether the language modelling approach is useful to predict variants of concern remains unclear.
Here, we take advantage of several high-throughput experimental datasets that have been published since Hie et al. [10] and ask how well the concepts of grammaticality and semantic change correlate with experimentally validated immune escape mutations. We further ask to what extent the concept of grammaticality relates to more conventional concepts such as protein stability, and whether there are alternative modelling approaches that can predict the viability of mutations more reliably. Overall, we find that while measures of both grammaticality and semantic change are somewhat informative about escape mutations, neither is sufficiently predictive to provide a compelling use case. In particular, for the purpose of distinguishing viable from inviable mutations, we find ΔΔG values to be more effective than grammaticality. For the purpose of identifying mutations with immune escape potential, semantic change does not appear sufficiently discriminatory, whereas the more recently developed method EVEscape [12] performs well in several cases. In summary, more recent and alternative approaches to predicting likely variants of concern appear more promising than the concepts of grammaticality and semantic change.
2. Results
Before presenting new results, we first provide an overview of the fundamental concepts, datasets and models used in our analysis. The model architectures used in the analysis are key for contextualizing how we address whether grammaticality and semantic change are useful for interpreting the evolution of antigenic escape.
2.1. Background and model setup
Protein language models take principles from natural language processing and apply them to protein sequences. Instead of operating on sequences of words, however, protein language models operate on sequences of amino acids. But apart from this difference, model architectures and modelling principles carry over nearly one-to-one. Where natural language models are trained to predict masked tokens in a sentence (i.e. BERT-style self-supervised learning) or the next token in a sentence fragment (i.e. GPT-style self-supervised learning), protein language models are similarly trained to predict masked amino acid tokens (e.g. ProtBERT [13] and ESM [14]) or the next amino acid token (e.g. ProGen [15]). Therefore, it makes sense to assess to what extent other concepts from natural language processing also translate to protein language models.
In natural language processing, it has been fruitful to distinguish the concepts of grammaticality and semantics of words [16,17]. Grammaticality indicates that a word fits in its location in a sentence purely based on the rules of grammar (for example, the word is a noun and a noun is required at its location in the sentence), whereas semantics describes the meaning of words. A word can have high grammaticality in a sentence but be a poor fit semantically, or vice versa. In well-formed sentences, each word has high grammaticality and good semantic fit. Hie et al. [10] built on these concepts and proposed that they could be applied to protein language models, and more specifically to evaluate variants of viral proteins. Grammaticality in the LLM context is defined as the emitted probability, $p(\tilde{x}_i \mid x)$, of the mutated token $\tilde{x}_i$ at locus $i$ in the input sequence $x$.
The grammaticality of a variant should be related to whether the mutation can be made in principle, i.e. whether the mutated protein still can be expressed and fold, and the semantic change of a variant should represent to what extent biological function has changed (figure 1). For example, a mutation on the surface of a viral spike protein with high semantic change is more likely to result in structural changes that disrupt protein–protein interfaces and potentially enable immune escape. Semantic change is defined as the distance between the mean embeddings, $\hat{z}$, of the wild-type and mutant sequences. We define it in the same way as Hie et al. [10] (see equation (4.1) in §4).
Figure 1.

Mutations placed on a wild-type genetic background can have impacts on both viability and antigenic variation. In the language-of-viral-escape model [10], grammaticality is an analogy for protein viability and semantic change is an analogy for change in the surface properties of the spike protein and thus specifically antigenic variation. The hypothesis is that the most effective antigenic escape mutations will be both highly grammatical and semantically different. Schematic drawing modified from Hie et al. [10].
Since the concepts of grammaticality and semantic change are derived from natural language modelling, in the following we will refer to them in aggregate as the ‘language-of-viral-escape model.’ See §4 for further discussion on the definitions of semantic change and grammaticality.
However, we caution that the analogies employed by the language-of-viral-escape model are imperfect, and the two quantities grammaticality and semantic change may not adequately quantify a viral protein’s viability and immune escape propensity. For example, antigenic escape is just one possible phenotype that can result from mutational changes on a surface antigen. Viral spike proteins need to bind host receptors for cell entry, and a priori it is not clear why a large semantic change should specifically correspond to weakened interactions with an antibody but not to weakened interactions with the host receptor. Immune escape variants need to disrupt the binding of neutralizing antibodies while preserving binding to the host receptor. Similarly, even if grammaticality as defined by Hie et al. [10] is associated with protein viability, there may be alternative ways of calculating grammaticality, for example using a biophysically based quantity such as ΔΔG, which may better capture viability.
2.2. Alternative datasets and alternative models to validate the language-of-viral-escape model
Hie et al. [10] trained three bespoke models for three different viral systems, influenza haemagglutinin (influenza HA), HIV-1 envelope glycoprotein (HIV Env) and SARS-CoV-2 spike protein. At the time of their publication, mutational data for SARS-CoV-2 were sparse. To expand on this work, we benchmarked their model on several more comprehensive SARS-CoV-2 datasets that have since become available [5–7,18]. Additionally, we compared these results with data from influenza HA [4,19] and HIV Env [20,21] to assess the generalization of the language-of-viral-escape model to other viral systems.
For SARS-CoV-2, Hie et al. [10] validated their semantic change and grammaticality scores on only 12 escape mutations and 16 non-escape mutations reported by Baum et al. [22]. In a direct comparison of these 28 mutations, the escape mutations had consistently higher semantic change than the non-escape mutations and both groups of mutations had high grammaticality (figure 2A, inset). However, when calculating grammaticality and semantic change for a much larger DMS dataset [7], we found that escape mutations were not enriched for high semantic change and many escape mutations had surprisingly low grammaticality scores (figure 2A). In fact, the distributions of the grammaticality and semantic change scores for escape mutations, non-escape mutations, and mutations that rendered the virus not viable were largely overlapping (see figure 2A, marginal density estimates).
Figure 2.
(A) Semantic change and grammaticality scores of mutations, as predicted by Hie et al. [10], are intended to provide insight into antigenic escape. (B) Deep mutational scanning data [7] validate the propensity of a mutation to escape antibody pressure, but also give insight into the viability of mutations. Purple represents mutations that had an escape fraction greater than 0.5 in DMS experiments. Orange represents mutations that had an escape fraction below 0.5. Pink represents mutations that render the virus not viable. Grey represents mutations that were not tested for antigenic escape. Black X’s and circles are escape and non-escape mutations verified by Baum et al. [22], respectively.
These observations suggested that a more systematic analysis of grammaticality and semantic change in relation to the measured phenotypes of viral proteins was needed. Since the DMS data provided us with measurements for both viability and immune escape for all tested mutations, we could systematically investigate the association of grammaticality and semantic change with respect to both phenotypes. And even though DMS data covered only a fraction of the total mutation landscape for SARS-CoV-2 spike, it increased the number of mutations available for validation by over two orders of magnitude (figure 2B).
In addition to this more systematic validation, we also asked whether there are alternative approaches to calculating grammaticality. For example, the machine learning-guided protein engineering community develops deep learning frameworks with the express purpose of predicting allowed mutations, in particular using structure-based frameworks [23–26]. Additionally, virtually all protein language models have been trained on masked token prediction and should also be able to predict whether a mutation is permitted or not [14,25,27–29]. Finally, changes in protein free energy, measured by ΔΔG, have long been used to assess whether a mutation at a given site in a protein is permitted or not [30]. All of these modelling approaches could potentially yield measures of grammaticality that are more informative than the one proposed by Hie et al. [10].
To evaluate the effectiveness of grammaticality, we employed a variety of deep-learning models (table 1). First, we leveraged two protein language models: the bidirectional long short-term memory (BiLSTM) model trained by Hie et al. [10] and the transformer-based ESM2 model from Meta [14]. The bespoke BiLSTM was trained on specific SARS-CoV-2, influenza and HIV protein sequence datasets for their respective surface antigens. ESM2 was trained on the UniRef50 dataset (approx. 65M sequences clustered at 50% sequence similarity) [31]. Furthermore, for ESM2, we employed two distinct approaches to calculating grammaticality. The first, which we call ‘unmasked’, was proposed by Hie et al. [10]. For each possible mutation at a site, this approach makes a separate prediction, which is then interpreted as a grammaticality score. The second, which we call ‘masked’, is canonical BERT-style masked language modelling (MLM): mask the focal site and use the sequence context to predict amino acid propensities for all 20 amino acids at that site. We convert these propensities into grammaticality scores.
Table 1.
Models used to calculate the grammaticality of mutations.
| model | input: sequences | input: structures | training: self-supervised | training: supervised | output: AA propensities | output: ΔΔG |
|---|---|---|---|---|---|---|
| BiLSTM (Hie et al.) | ✓ | — | ✓ | — | ✓ | — |
| ESM2 | ✓ | — | ✓ | — | ✓ | — |
| MutComputeX | — | ✓ | ✓ | — | ✓ | — |
| MutRank | ✓ | ✓ | ✓ | — | ✓ | — |
| Stability Oracle | — | ✓ | — | ✓ | — | ✓ |
| EVE | ✓ | — | ✓ | — | ✓ | — |
Second, we utilized the structure-based machine learning models MutComputeX [23], MutRank [24] and Stability Oracle [32]. MutComputeX uses a self-supervised three-dimensional residual neural network (3DResNet) architecture trained on millions of spatially represented protein microenvironments to predict the masked amino acid using the cross-entropy loss. MutRank is a two-stage self-supervised graph-transformer framework trained to learn protein representations tailored for protein engineering applications. Using a masked microenvironment as input, MutRank is first trained to predict the masked amino acid from its microenvironment, akin to MutComputeX, and then, with ‘From’ and ‘To’ amino acid CLS tokens as additional inputs, it is fine-tuned in a self-supervised manner with a rank loss to predict mutational propensities based on the site-specific amino acid distribution obtained from a protein’s multiple sequence alignment (MSA). Finally, Stability Oracle uses the same architecture and training procedure as MutRank. However, rather than being fine-tuned to predict MSA-based amino acid propensities, it is fine-tuned in a supervised manner on measured ΔΔG data [33]. Thus, rather than predicting amino acid mutational propensities, Stability Oracle predicts ΔΔG values (kcal/mol) for mutations, where more negative ΔΔG values are considered more grammatical.
Third, we used the model EVE to compute grammaticality [34]. (We used the related model EVEscape [12] to compute semantic scores; see below.) EVE produces fitness scores for mutations via a variational autoencoder trained on case-specific multiple sequence alignments. These EVE scores are one of three components of the EVEscape score, combined with the likelihoods that a mutation will (i) occur in an antibody-accessible region and (ii) disrupt antibody binding. We liken the EVE score to grammaticality, because it is trained to be a generic fitness score of the viral protein, whereas the EVEscape score accounts for interaction with the host immune system. Thus, we liken the EVEscape score to semantic change.
For all models, we rank-ordered grammaticality scores and then normalized the ranks by the total number of mutations to arrive at a relative rank score. This score, which is a value between 0 and 1, allowed us to directly compare predictions from widely differing models producing outputs on different scales and with differing distributions.
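To make this concrete, the relative rank score can be computed as follows (a minimal sketch; the function and variable names are ours, not those of the published pipeline):

```python
import numpy as np
from scipy.stats import rankdata

def relative_rank(scores: np.ndarray, reverse: bool = False) -> np.ndarray:
    """Rank model scores and normalize the ranks to (0, 1].

    Set reverse=True for models such as Stability Oracle, where more
    negative ddG values indicate more grammatical (stabilizing) mutations.
    """
    s = -scores if reverse else scores
    return rankdata(s) / len(s)
```

This normalization discards the absolute scale of each model's output and retains only the ordering, which is what allows outputs as different as probabilities and free energies to be compared on a common axis.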
2.3. Different grammaticality measures can distinguish between viable and nonviable mutations
We calculated seven different grammaticality scores for all possible point mutations in the SARS-CoV-2 spike protein, using the six different models described above (table 1) and employing two separate strategies for ESM2, masked and unmasked. We had available viability and immune escape data from Starr et al. [7] for approximately 15% of these mutations (figure 2B).
We found that with the exception of the unmasked ESM2 model, all models produced significantly higher grammaticality scores for viable mutations than for nonviable mutations (figure 3A). Among the language models, the bespoke model by Hie et al. [10] displayed a larger effect size than the masked ESM2 model. However, in both cases, while the difference in grammaticality scores between viable and nonviable mutations was significant, the effect size was small and the interquartile ranges of the two distributions showed extensive overlap. Thus, neither a high nor a low pLM grammaticality score was predictive of whether a mutation would be viable.
Figure 3.
All possible mutations of the coronavirus spike protein DMS experiment [7] tested under different models. Colours represent mutations that are viable (green), not viable (pink) or not tested (grey) in the DMS experiment. The values predicted for each mutation are ranked and then normalized to be between 0 and 1. (A) Grammaticality scores for each of the seven models. Note that the ranks for Stability Oracle are reversed since negative ΔΔG values represent more stable mutations. (B) Semantic change scores for both the Hie et al. [10] model and the ESM2 model in addition to EVEscape scores. Results of the Mann–Whitney U rank test are indicated as follows: NS, not significant; *p < 0.05; **p < 0.01; ***p < 0.001.
We saw more substantive differences among the three structure-based frameworks, MutComputeX, MutRank and Stability Oracle. For all three models, the median grammaticality score for nonviable mutations fell into the bottom quartile of grammaticality scores for viable mutations (figure 3A). While effect sizes among the three models were somewhat comparable, overall the strongest separation was seen for Stability Oracle, which predicts ΔΔG values rather than amino acid propensities. These results suggest that changes in protein stability, a biophysical property, best capture changes in protein viability, as compared with grammaticality scores computed from amino acid propensities from sequence- or structure-based self-supervised models. Note, however, that EVE performed similarly well. This model, trained on sequence alignments, may inherently capture biophysical limitations based on observed amino acid frequencies.
We also assessed how semantic change related to protein viability. We calculated semantic change for two language models, the bespoke model by Hie et al. [10] and the generic model ESM2. The hybrid model EVEscape, which uses structural and non-structural input data, was also included in our analysis. Results were comparable for both language models, where viable mutations had lower semantic change scores (figure 3B). This result makes intuitive sense: a mutation that changes more of the protein biochemistry (creates a larger semantic change) should also be more disruptive to the protein and more likely to render it nonviable. However, this result calls into question whether grammaticality and semantic change are independent quantities. In fact, figure 2A shows a weak negative correlation between grammaticality and semantic change. By contrast to semantic change, when using the EVEscape score, viable mutations ranked significantly higher than non-viable mutations (figure 3B). This observation can be explained by the fact that the EVEscape score includes information about both viability and escape; indeed, it is strongly correlated with the EVE score.
Finally, we repeated these analyses for four additional viral surface proteins: the BA.1 (electronic supplementary material, figures S1 and S2) and Delta (electronic supplementary material, figures S3 and S4) variants of the SARS-CoV-2 spike, the haemagglutinin (HA) of influenza A virus (electronic supplementary material, figures S5 and S6), and the envelope glycoprotein (env) of human immunodeficiency virus (HIV) (electronic supplementary material, figures S7 and S8). Results were generally consistent with what we had seen for SARS-CoV-2 spike (figure 3). All grammaticality scores except those calculated with ESM2 unmasked were on average larger for viable mutations than for nonviable mutations, but EVE and Stability Oracle tended to perform best overall (electronic supplementary material, figure S2A, figure S4A, figure S6A, figure S8A). And again, viable mutations tended to have lower semantic change than nonviable mutations, regardless of the model according to which semantic change was calculated (electronic supplementary material, figure S2B, figure S4B, figure S6B, figure S8B).
2.4. Semantic change does not predict immune escape
We next proceeded to assess the relationship between semantic change and the immune escape phenotype, which was also assessed in the various DMS experiments [4,7,18–21]. Notably, for SARS-CoV-2 spike, we found no significant difference in semantic change from language models between escape and non-escape mutations (figure 4A). We saw the same result for HIV env (electronic supplementary material, figure S9A), BA.1 spike (electronic supplementary material, figure S10A) and Delta spike (electronic supplementary material, figure S11A). Only for influenza virus HA was there a significant (but small) difference in semantic change between escape and non-escape mutations, and only for the bespoke model trained by Hie et al. [10] (electronic supplementary material, figure S12A). By contrast, EVEscape scores could separate escape from non-escape mutations in several cases. We found significantly larger EVEscape scores for escape mutations in the focal spike protein (figure 4A), influenza HA (electronic supplementary material, figure S12A) and HIV env (electronic supplementary material, figure S9A). However, there was no significant difference in EVEscape scores for the BA.1 spike (electronic supplementary material, figure S10A), and for the Delta spike the effect went in the opposite direction: EVEscape scores of escape mutations were significantly lower (electronic supplementary material, figure S11A). In aggregate, these results suggest that semantic change from language models is not a reliable or strong predictor of immune escape, even when trained on mutant data for a particular viral protein. EVEscape performs better, but there is still room for improvement.
Figure 4.
All possible mutations of the coronavirus spike protein DMS experiment [7] tested under different models. Colours represent mutations that confer escape (purple), do not confer escape (orange) or are not viable (pink) in the DMS experiment. The values predicted for each mutation are ranked and then normalized to be between 0 and 1. (A) Semantic change scores for both the Hie et al. [10] model and the ESM2 model [14] in addition to EVEscape scores. (B) Grammaticality scores for each of the seven models. Note that the ranks for Stability Oracle are reversed since small ΔΔG values are consistent with higher stability. Results of the Mann–Whitney U rank test are indicated as follows: NS, not significant; *p < 0.05; **p < 0.01; ***p < 0.001.
For completeness, we also investigated whether grammaticality scores differed between escape and non-escape mutations. Most models showed no significant difference or only minor differences for SARS-CoV-2 spike (figure 4B). A few more models displayed significant differences for HIV env (electronic supplementary material, figure S9B) and influenza virus HA (electronic supplementary material, figure S12B). In the latter case, the bespoke model by Hie et al. [10] showed immune escape mutations to be substantially more grammatical than non-escape mutations. Notably, this is the opposite prediction from SARS-CoV-2 spike, where the bespoke model predicted escape mutations to be less grammatical. However, EVE scores significantly differed between escape and non-escape mutations for SARS-CoV-2 spike (figure 4B), HIV env (electronic supplementary material, figure S9B) and influenza virus HA (electronic supplementary material, figure S12B). The BA.1 (electronic supplementary material, figure S10B) and Delta (electronic supplementary material, figure S11B) spike proteins had no grammaticality measure that showed significant differences between escape and non-escape mutations. In aggregate, these results reiterate that grammaticality scores are not necessarily orthogonal to whether or not mutations are escape mutations, and that in general grammaticality and semantic change are somewhat confounded with each other.
2.5. Semantic change is weakly correlated with antibody binding
Instead of considering mutations that have been classified into two categories, escape or non-escape, it may be more useful to ask whether semantic change correlates with the strength of antibody binding for different protein variants. We asked this question using a more comprehensive dataset of 32 768 variants of the SARS-CoV-2 spike protein for which binding constants have been measured for binding to each of four different antibodies and to the ACE2 cell surface receptor [5,6]. A mutation that displays reduced binding or loss of binding to any of the antibodies enables some amount of immune escape and thus is likely beneficial to the virus, whereas a mutation that displays reduced binding to the ACE2 cell surface receptor causes reduced viral fitness and is likely deleterious.
The 32 768 variants in the dataset were chosen because they represent all possible combinations of 15 distinct mutations (2¹⁵ = 32 768) that separate the Alpha and the Omicron variants of SARS-CoV-2 in the receptor binding domain of the viral spike protein [5,6]. Thus, the dataset in effect maps out all possible evolutionary paths from Alpha to Omicron. All 32 768 variants were assessed for their binding affinity to class 1, 2, 3 and 4 antibodies (antibodies CB6, CoV555, REGN10987 and S309, respectively) and to the cell surface receptor ACE2.
We used the Hie et al. [10] model trained on the SARS-CoV-2 spike protein to calculate semantic change for all 32 768 variant sequences. We then correlated the semantic change values with the measured binding affinities, expressed as −log₁₀(K_D). Larger values of −log₁₀(K_D) indicate stronger binding, and a value below 6 indicates no detectable binding. For all four antibodies, we found a weak to moderate negative correlation between binding affinities above the limit of detection and semantic change (figure 5A). Thus, mutations with larger semantic change on average displayed weaker antibody binding. Moreover, for the three antibodies for which some mutations showed no binding at all (CB6, CoV555 and REGN10987), semantic change values were significantly larger on average for nonbinding variants than for binding variants (figure 5B). We note, however, that effect sizes were small in most cases. Correlation coefficients for three of the four antibodies fell between −0.12 and −0.25; 6.25% or less of the variance in binding was explained by semantic change. Similarly, semantic change for non-binding variants was on average only 0.175 units larger than for binding variants, while the standard deviation of semantic change values was between 0.23 and 0.28. Notably, all assayed spike variants were active binders of the S309 antibody, which nevertheless demonstrated the strongest correlation between semantic change and binding affinity (p‐value < 0.0001, figure 5A).
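The analysis just described can be sketched as follows (an illustrative outline only; the variable and function names are ours, and the detection limit of 6 for −log₁₀(K_D) follows the text above):

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

DETECTION_LIMIT = 6.0  # -log10(KD) below this = no detectable binding

def binding_vs_semantic_change(neg_log_kd: np.ndarray, sem_change: np.ndarray):
    binders = neg_log_kd >= DETECTION_LIMIT
    # Correlation restricted to variants with measurable binding
    r, p = pearsonr(neg_log_kd[binders], sem_change[binders])
    result = {"pearson_r": r, "pearson_p": p}
    if (~binders).any():
        # Is semantic change larger for non-binding (escape) variants?
        u, p_mw = mannwhitneyu(sem_change[~binders], sem_change[binders],
                               alternative="greater")
        result.update(mw_u=u, mw_p=p_mw)
    return result
```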
Figure 5.
All possible combinations of the 15 mutations defining the Omicron BA.1 coronavirus spike protein had binding values (−log₁₀(K_D)) measured for four antibodies [5] and ACE2 [6]. Semantic change was inferred from the Hie et al. [10] model. Colours represent mutations that confer escape (purple) or non-escape (orange) in the DMS experiment. (A) For each of spike’s binding partners, Pearson’s r was computed for all non-escape mutations, since escape mutations were classified as being below the limit of detection for −log₁₀(K_D) and are therefore designated non-binding (NB). Significance is denoted with ****, indicating a p‐value < 0.0001. (B) Density plots show the overlap of computed semantic change values between escape and non-escape mutations. No mutants failed to bind to antibody S309 or ACE2.
Surprisingly, the weakest correlation overall was between semantic change and the binding affinity to the ACE2 receptor (figure 5A). This is notable in particular because the range of binding affinities to the ACE2 receptor among the 32 768 variants is comparable with the range of binding affinities to any of the antibodies. Thus, while the variants in this dataset clearly differed in their ability to bind to the ACE2 receptor, which impacts viral entry, semantic change was unable to detect this variation.
We performed the same analysis on the combinatorial BA.1 mutations using the ESM2 model. Similar to the previous analysis, we found that the correlation between binding affinity and semantic change for three of the four antibodies (CB6, CoV555 and S309) was significantly different from zero but weak (electronic supplementary material, figure S13A). For CB6, CoV555 and S309, semantic change from ESM2 explained between roughly 6% and 21% of the variance in binding (corresponding to Pearson correlation magnitudes of roughly 0.25 to 0.46). Semantic change did not correlate with spike binding to REGN10987. While all variants bound to antibody S309, the differences in mean semantic change between escape and non-escape mutations for the other three antibodies were 0.094 units on average, with standard deviations between 0.28 and 0.33; all means of binders and non-binders fell within one standard deviation of one another (electronic supplementary material, figure S13B). The ACE2 binding measures correlated only weakly with semantic change from the ESM2 model.
In summary, while semantic change was weakly correlated with loss of antibody binding, the effect size was rather small, and it did not simultaneously correlate with preserved binding to the ACE2 host receptor. Taken together, it would not be possible to reliably identify immune escape mutations based on their semantic change.
3. Discussion
We have systematically tested the language-of-viral-escape model by Hie et al. [10] using several new high-throughput datasets that have been made available since the original publication of the model and also using several additional models to calculate grammaticality or semantic change scores. Overall, we have found that grammaticality is somewhat predictive of whether a mutation is viable or not, whereas semantic change is a less useful indicator of a mutation’s immune escape propensity. We have found that our results are broadly consistent across different viral systems: three SARS-CoV-2 spike variants, influenza HA, and HIV-1 env. We have also found that the bespoke models by Hie et al. [10] trained separately for each viral protein have not systematically outperformed generic, pretrained sequence- or structure-based foundation models. Finally, we have found that for the task of predicting mutant viability (i.e. grammaticality), EVE and structure-based models seem to outperform pLMs, and performance of structure-based methods can be further improved when fine-tuned on experimental datasets.
Importantly, we have seen a major difference in performance between grammaticality scores calculated using the masked or the unmasked ESM2 model. While scores obtained from the masked model always showed reduced grammaticality for nonviable mutations, and in the case of influenza HA were competitive with the structure-based models in terms of the magnitude of predicted difference between viable and nonviable mutations, results were inconclusive or pointed in the opposite direction for the unmasked model. We believe the masked approach is the appropriate one and should be used, and we discourage the use of the unmasked approach. ESM models, which are built on a BERT-style [35] encoder-only transformer architecture, utilize masked language modelling (MLM) as the training objective [14]. MLM is a fundamental self-supervised learning technique that enables the model to learn the identity of masked positions from the sequence context, which in proteins amounts to the dependencies between amino acids. When we ask the model to make a prediction for the masked site, it predicts all amino acid propensities at once, conditioned on the same sequence context. By contrast, in the unmasked approach, the model is asked to make multiple separate predictions, one for each possible sequence variant, and these predictions are not conditioned on each other. More importantly, the training procedure of ESM models for unmasked sites has biased the model towards returning either the input token or one of the potential output tokens chosen from a uniform distribution [14]. Therefore, there is no good a priori reason why the unmasked inference procedure should be successful, and our analysis here has shown that it is not.
We have found that across all models and viral systems we considered here, Stability Oracle and EVE have most consistently assigned low scores to nonviable mutations and high scores to viable mutations. Notably, these two models are very different. Stability Oracle uses structures as input, and it predicts ΔΔG values. Its strong performance is consistent with the long-standing observation that destabilizing mutations are the primary culprit for loss of function in proteins [9,36–40]. We only considered a single ΔΔG predictor here, because Stability Oracle is one of the best deep learning frameworks for ΔΔG prediction currently available, but we expect that other predictors, such as FoldX [41], PoPMuSiC [42] or Rosetta ddG [43], would perform similarly, and in proportion to their ability to predict accurate ΔΔG values. By contrast, EVE is sequence based, and it has been trained to predict the propensity with which a given protein sequence can be observed in nature [34]. Its performance is consistent with the original vision of the language-of-viral-escape model [10], which is that models trained purely on sequence data can implicitly learn which mutations are viable and which are not and encode this information in their predicted amino acid propensities.
For predicting mutant viability, we found that among the models we evaluated here, with the exception of EVE, the structure-based models performed better than the sequence-based models. In a prior study, we had compared some of the same sequence- and structure-based models for their ability to predict the wild-type residue when masked [25] and found that both model types displayed roughly comparable performance. The discrepancies between these two studies highlight that two model types can have comparable performance on one task yet differing performance on another. To correctly predict the masked wild-type amino acid at a particular residue, a model needs to assign a high probability to the wild-type amino acid and low probabilities to all other amino acids. However, the specific probabilities assigned to the other amino acids do not matter as long as they are lower than the probability of the wild-type amino acid. By contrast, to correctly predict nonviable mutations at a residue, a model needs to consistently assign low probabilities to the nonviable amino acids and higher probabilities to the viable amino acids. More generally, considering the comparable performance of Stability Oracle and EVE in predicting viability, our results here suggest that models with very different types of architectures or input data can be good at zero-shot prediction of a phenotype of interest, and whether a specific model performs well may depend on the details of how exactly it was set up and trained. This interpretation is consistent with systematic benchmarking published in the ProteinGym database [44], which shows a wide range of different model types among the top-performing models for zero-shot prediction of phenotype.
By contrast to grammaticality, we have found that semantic change does not perform as originally expected. It does not consistently differentiate escape from non-escape mutations, and correlations between binding constants and semantic change are weak even when significant. And even absent these results, the concept of semantic change suffers from a fundamental problem [45]: if a large semantic change coincided with loss of binding to an antibody, why should it not also coincide with loss of binding to the host receptor? In both cases, the surface of the spike protein has been altered sufficiently that binding is no longer possible. Instead, we have found here that semantic change does not strongly correlate with either. While distances in embedding space have been useful in some applications, such as inferring GO terms [46], it appears that the complex phenotype of immune escape is not sufficiently represented by distance in embedding space alone (i.e. semantic change). Instead, strategies that have worked well in a variety of different applications consist of transferring or fine-tuning the hidden representations of a language model to predict the phenotype of interest [32,47–50]. An additional strategy is to learn to extract a phenotype-aware representation from the initial hidden representation that is enriched for the specific downstream application [51]. Alternatively, or in combination with such approaches, one can also build biophysical models of protein folding and binding and calibrate them with experimental measurements of binding constants or measurements of viral fitness [9,52].
One recent study [11] explored a variation of the language-of-viral-escape model to assess whether variants of concern (VOCs) are distinct from non-VOCs in SARS-CoV-2. Notably, they did not use the bespoke models of Hie et al. [10] and only computed grammaticality and semantic change from ESM2 embeddings. Their main focus was the change of these quantities over evolutionary time, but they also assessed correlations of semantic change and grammaticality with viral escape. Their results were broadly consistent with our findings here: correlations are significant but weak. Moreover, they did not use a masking approach to calculate grammaticality, and we believe this may explain why they also observed weak correlations between grammaticality and physical measures of protein viability such as ΔΔG.
In summary, we have found that the language-of-viral-escape framework [10], as currently developed, is not sufficient to accurately predict immune escape mutations. While the concept of grammaticality is informative about mutant viability, the concept of semantic change provides little information about whether or not a mutant will be likely to confer immune escape. Moreover, even for mutant viability, quantities with precise biophysical meaning, such as stability changes, are often more useful than grammaticality scores extracted from language models. We believe that the way forward for AI applications in viral evolution is to fine-tune (in a supervised framework) representations from pretrained models to predict specific phenotypes rather than rely on zero-shot predictions extracted from pretrained models only.
4. Methods
4.1. Language models
Semantic change is defined as the distance between embeddings of the wild-type and mutant sequences. We define it in the same way as Hie et al. [10]:

$$\Delta \hat{z}[\tilde{x}_i] = \lVert \hat{z} - \hat{z}[\tilde{x}_i] \rVert_1, \qquad (4.1)$$

where $z \in \mathbb{R}^{N \times D}$ is an embedding for a wild-type sequence of length $N$ and $z[\tilde{x}_i]$ is the embedding of the sequence mutated to token $\tilde{x}$ at locus $i$. The mean embeddings, $\hat{z}$, are computed across the $N$ sites, resulting in $D$-dimensional vectors. After taking the difference of these two vectors, the $\ell_1$ norm is computed (i.e. the sum of the absolute values of the componentwise differences between $\hat{z}$ and $\hat{z}[\tilde{x}_i]$).
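As a concrete illustration, the following minimal sketch computes equation (4.1) from per-residue embeddings (the array names are ours; any pLM that returns an N × D embedding matrix can be used):

```python
import numpy as np

def semantic_change(z_wt: np.ndarray, z_mut: np.ndarray) -> float:
    """l1 distance between the mean (over sites) embeddings of
    wild-type and mutant sequences, each of shape (N, D)."""
    z_hat_wt = z_wt.mean(axis=0)    # (D,)
    z_hat_mut = z_mut.mean(axis=0)  # (D,)
    return float(np.abs(z_hat_wt - z_hat_mut).sum())
```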
Grammaticality in the LLM context is defined as the emitted probability from the protein language model for the mutated input sequence,

$$p(\tilde{x}_i \mid x), \qquad (4.2)$$

where $p(\tilde{x}_i \mid x)$ is the probability of observing mutation $\tilde{x}_i$ at locus $i$, with $\tilde{x}_i$ drawn from the amino acid alphabet $\mathcal{A}$. When considering a mutated sequence, one can use either a masked or an unmasked inference approach. Hie et al. [10] computed the probability over the sequence using their trained BiLSTM without masking, and in doing so extracted the probability of the mutant token(s). Their approach is aware of the identity of every token in the sequence when inferring the probability from the embedding, so it is unmasked. In addition to using the embeddings from their bespoke-trained model, we used the embeddings from ESM2 to obtain unmasked grammaticality predictions. For ESM2, we also implemented a masked grammaticality by taking a mutated sequence and swapping the amino acid token at the mutated locus with the <mask> token. The model must then compute the probability of observing each amino acid at the masked locus using only the embeddings of the remaining unmasked amino acids. These probabilities serve as the grammaticality scores for each amino acid at each site.
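The masked procedure can be sketched with the transformers library as follows (an illustrative minimal example, not the exact script from our repository):

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t30_150M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name).eval()

def masked_grammaticality(seq: str, site: int) -> dict:
    """Probability of each amino acid at `site` (0-based) given its context."""
    inputs = tokenizer(seq, return_tensors="pt")
    inputs["input_ids"][0, site + 1] = tokenizer.mask_token_id  # +1 skips CLS
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, site + 1].softmax(dim=-1)
    return {aa: probs[tokenizer.convert_tokens_to_ids(aa)].item()
            for aa in "ACDEFGHIKLMNPQRSTVWY"}
```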
Semantic change and grammaticality measures were inferred from multiple LLMs. We used the published outputs for influenza and HIV [10], but since the SARS-CoV-2 DMS [7] used a different reference sequence, we used the bespoke pre-trained coronavirus model [10] to produce new scores. We then used ESM2 (esm2_t30_150M_UR50D) [14] with the same inference scheme to obtain semantic change and grammaticality scores from this generally trained model. Additionally, we wrote Python scripts, using the transformers Python library, to allow for masking of tokens when inferring grammaticality from the ESM2 model. Scripts to perform inference of semantic change and grammaticality are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/language_models.
4.2. AlphaFold
For the structure-based models, we needed protein structures as inputs. For all three viruses, we inferred structures using AlphaFold2. As all three surface proteins exist as homotrimers in their natural state, we used AlphaFold2’s multimer functionality [53,54], as implemented in the ColabFold notebook: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb. The inferred structures are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data.
4.3. MutComputeX and MutRank
MutComputeX [23] is a self-supervised three-dimensional residual neural network (3DResNet) trained on approximately 2.1 million masked protein microenvironments sampled from approximately 23K protein structures. The 3DResNet is trained with the cross-entropy loss to predict the masked amino acid in the centre of the masked microenvironment. The trained model outputs the likelihood of each amino acid for a particular microenvironment. We used these likelihoods as our grammaticality scores to compare against other models. The inference pipeline can be found here: https://github.com/danny305/MutComputeX/blob/master/scripts/generate_predictions.py.
MutRank [24] is a self-supervised three-dimensional graph transformer that adds a second training step to a graph-transformer analogue of MutComputeX. In the second step, it trains a regression head that takes ‘FromAA’ and ‘ToAA’ CLS tokens as additional inputs to the masked microenvironment and uses the EvoRank loss on the MutComputeX training set to predict the rank score between the ‘FromAA’ and ‘ToAA’. During inference, for each masked microenvironment we set the ‘FromAA’ CLS token to the wild-type amino acid and predict the rank score for all 20 amino acids by setting each in turn as the ‘ToAA’ CLS token. We used the 20 rank scores obtained for each microenvironment as our grammaticality scores to compare against other models. The inference pipeline is distributed as part of the Stability Oracle project (next subsection).
4.4. Stability Oracle
Stability Oracle [32] is a structure-based graph-transformer model obtained by supervised fine-tuning of a graph-transformer analogue of MutComputeX on empirical ΔΔG values from the cDNA117K dataset. The architecture is identical to the MutRank architecture, but rather than training the regression head with the EvoRank loss on the MutComputeX training set, the regression head is trained (and the graph-transformer backbone is fine-tuned) with the Huber loss on the cDNA117K dataset. During inference, for each masked microenvironment we set the ‘FromAA’ CLS token to the wild-type amino acid and predict ΔΔG values for all 20 amino acids by setting each in turn as the ‘ToAA’ CLS token. We used the 20 ΔΔG values obtained for each microenvironment as our grammaticality scores to compare against other models. The ΔΔG ranks are reversed because positive ΔΔG values correspond to more destabilizing mutations (lower grammaticality) and negative ΔΔG values correspond to more stabilizing mutations (higher grammaticality). The inference pipeline can be found here: https://github.com/danny305/StabilityOracle/blob/master/scripts/run_stability_oracle.py.
4.5. EVE and EVEscape
EVEscape [12] is an unsupervised deep learning model that predicts a probability of escape as a product of three probabilities: maintenance of viral fitness (the EVE score [34]), occurrence in an antibody-accessible region and disruption of antibody binding. EVE and EVEscape utilize a variational autoencoder to learn the distributions of amino acids within sites across a protein alignment. We obtained EVE and EVEscape scores by running code from the EVEscape repository, https://github.com/OATML-Markslab/EVEscape/tree/main, with the Starr et al. [7] sequence as input. We downloaded EVE and EVEscape scores for the HIV env and influenza HA proteins from the same repository. Additionally, we downloaded EVE and EVEscape scores for the BA.1 and Delta SARS-CoV-2 strains from https://evescape.org/data. Since these sequences differ slightly from those used in the DMS of Delta and BA.1 [18], we created an alignment of the sequences used by EVE and EVEscape with those from the DMS and used scores only where the two sequences were identical.
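Conceptually, the decomposition described above can be written as follows (a toy illustration only; the released EVEscape implementation combines the components in a more elaborate way, so consult the repository for the exact computation):

```python
def evescape_score(fitness: float, accessibility: float,
                   binding_disruption: float) -> float:
    """Escape likelihood as the product of the fitness (EVE),
    antibody-accessibility and antibody-binding-disruption components."""
    return fitness * accessibility * binding_disruption
```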
4.6. Statistical testing
All statistical tests were performed in R. Scripts to modify the data for boxplots and to perform Mann–Whitney U tests are available on Github. For all figures, model names have been abbreviated, and here we briefly describe what each label means. ‘Grammaticality (Hie)’ refers to the probabilities emitted by the LLM used in Hie et al. [10] that was trained on coronavirus sequences. ‘Grammaticality (ESM2, unmasked)’ refers to the probabilities emitted by the LLM scheme used by Hie et al. [10] with the general ESM2 model. The model is unmasked because the protein sequence tokens were not masked when the model determined the probability of a token belonging at a particular locus. ‘Grammaticality (ESM2, masked)’ refers to the probabilities emitted by ESM2 when the sequence tokens are masked. ‘MutComputeX’ refers to the amino acid probabilities emitted by the structure-based model from d’Oelsnitz et al. [23]. ‘Stability Oracle’ refers to the ΔΔG values predicted by Diaz et al. [32]. Note that the ranks for Stability Oracle are reversed since more negative ΔΔG values correspond to higher stability and thus higher grammaticality. ‘Semantic change (Hie)’ refers to the normalized differences between wild-type and mutant sequence embeddings emitted by the coronavirus LLM used in Hie et al. [10]. ‘Semantic change (ESM2)’ refers to the normalized differences between wild-type and mutant sequence embeddings emitted by the general ESM2 LLM. ‘EVE’ refers to the fitness scores output by EVE [34]. ‘EVEscape’ refers to the scores output by EVEscape [12], which are a product of viral fitness (EVE score), antibody accessibility and antibody binding scores.
4.7. Coronavirus spike glycoprotein
The spike sequence used as the template for computationally mutating every residue is available on NCBI with GenBank ID QHD43416.1. The Wuhan spike sequence used for the 15 combinatorial Omicron mutations is available on NCBI with GenBank ID YP_009724390.1. The library of mutants used as input to the language models, the template sequences, and the homotrimer protein structure used as input to MutCompute, MutRank and Stability Oracle are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/cov.
We utilized data from a DMS experiment of the SARS-CoV-2 spike protein in which a pseudovirus system was used to determine the impact of mutations on antigenic escape [7]. In this experiment, a subset of sites was tested for the antigenic escape phenotype, and some mutations did not confer a viable protein; these were denoted by a lack of an escape measurement in the experiment. Viable mutations were then classified as escape or non-escape by their reported escape fraction, with those above 0.5 designated escape. These data are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/cov/starr_dms.
We also used data from previously published DMS experiments where the 15 mutations that define the Omicron SARS-CoV-2 strain were tested for binding affinity for four antibodies [5] and the cell surface receptor ACE2 [6]. These data are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/cov/omicron_experiments.
Last, we used the BA.1 and Delta spike sequences from Dadonaite et al. [18], who performed two separate DMS experiments for escape and viability. The viability cut-offs were −1.38 and −1.46, respectively, and mutations were classified as viable if their mean effect was not within two standard deviations of the threshold. Similarly, the median escape score for an individual mutation had to be more than two standard deviations from zero to be classified as an escape mutation. The protein sequences for these two strains are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/cov/dadonaite. Note that these two sequences are slightly different from the BA.1 and Delta template sequences used by EVEscape. To reconcile this, we performed an alignment of the respective BA.1 and Delta sequences and only used DMS data where the two sequences were identical in the alignment.
4.8. Influenza A haemagglutinin protein
The HA sequence used as the template for computationally mutating every residue is identical to the NCBI sequence with GenBank ID QDQ43389.1. The library of mutants used as input to the language models, the template sequence and the homotrimer protein structure used as input to MutCompute, MutRank and Stability Oracle are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/flu. Note that we use the previously published results from Hie et al. [10] for the BiLSTM language model.
We used previously published DMS of the H1 haemagglutinin protein of influenza A/WSN/1933 [4]. They determined the mutational tolerance of each site along the entire protein sequence. The results of this experiment were amino acid preferences at each site (excluding the start codon) for all amino acids; i.e. the expected post-selection frequency of all 20 amino acids at each site for all possible single mutant sequences. From these data, we classified mutations as resulting in viable or not viable proteins. The data we used from this experiment are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/flu/escape_doud2018.
We defined viable mutations as those having an amino acid preference above 0.001. To make this determination, we examined the curve of ranked amino acid preferences and noted that its behaviour changes at 0.1 and 0.001 (electronic supplementary material, figure S14). Of the 1222 mutations with an amino acid preference above 0.1, 482 are wild-type (electronic supplementary material, figure S14). This leaves 82 wild-type mutations that would be classified as not viable had we chosen a cutoff of 0.1. The experimental per-codon sequencing error rate is somewhere between 0.0002 and 0.0005, and the observed nonsynonymous post-selection frequency is approximately 0.0008 [4]. Thus, we chose the cutoff of 0.001 to be a slightly more stringent classifier of viability than the bounds for sequencing error. The data we use from this experiment are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/flu/fitness_doud2016. The code used to define the viability of mutations is available at https://github.com/allmanbrent/NLP_viral_escape.
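A hedged sketch of this viability call is shown below (the table layout and column names are assumptions about the DMS data, not our published code):

```python
import pandas as pd

# Slightly more stringent than the per-codon sequencing error bounds
# of 0.0002-0.0005; chosen from the elbow in the ranked-preference curve.
VIABILITY_CUTOFF = 1e-3

def classify_viable(prefs: pd.DataFrame) -> pd.DataFrame:
    """prefs: one row per (site, amino_acid) with a 'preference' column."""
    out = prefs.copy()
    out["viable"] = out["preference"] > VIABILITY_CUTOFF
    return out

def ranked_preferences(prefs: pd.DataFrame) -> pd.Series:
    """Diagnostic used to pick the cutoff: rank preferences and inspect
    where the curve changes behaviour (cf. figure S14)."""
    return prefs["preference"].sort_values(ascending=False).reset_index(drop=True)
```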
Escape fractions were obtained in the DMS of A/WSN/1933 [19], but a simple numerical cutoff is insufficient to define escape mutations in this case. We used dms_tools2 [55] to classify mutations as escape or non-escape for each of the antibodies tested. We then looked across antibodies, and if a variant conferred escape under any one antibody selection scheme, we considered it an escape mutation. The data we used from this experiment are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/flu/escape_doud2018. The code used to define escape mutations is available at https://github.com/allmanbrent/NLP_viral_escape.
4.9. HIV-1 envelope glycoprotein
We used the BG505.W6M.C2.T332N strain of env which has DMS experiments testing viability [21] and antigenic escape [20]. The library of mutants used as input to the language models, the template sequence and the homotrimer protein structure used as input to MutCompute, MutRank and Stability Oracle are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/hiv. Note that we used the previously published results from Hie et al. [10] for the BiLSTM language model.
Similar to the influenza experiment described above, 670 sites from the HIV BG505 strain env protein have been previously mutagenized and their amino acid preferences measured based on observed frequencies from deep sequencing taken from in vitro cell passage [21]. To define viability from these amino acid preferences, we again looked at the rank of the preferences (electronic supplementary material, figure S15). As with influenza, the behaviour of these ranked data changes at amino acid preferences of 0.1 and 0.001. With a cutoff of 0.1, 243 of the wild-type mutations would be classified as not viable. Therefore, we used the cutoff of 0.001, since this captures all wild-type mutations and is slightly more stringent than proposed error rates [21]. The data we use from this experiment are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/hiv/Haddox_supp. The code used to define the viability of mutations is available at https://github.com/allmanbrent/NLP_viral_escape.
We used escape fractions from previously published DMS on BG505 HIV in which the virus underwent antibody selection [20]. As for influenza, we used dms_tools2 [55] to classify mutations as escape or non-escape for each of the antibodies tested, and we considered a variant an escape mutation if it conferred escape under any one antibody selection scheme. The data we used from this experiment are available at https://github.com/allmanbrent/NLP_viral_escape/tree/main/data/hiv/Dingens_ab_escape. The code we used to define escape mutations is available at https://github.com/allmanbrent/NLP_viral_escape.
Acknowledgements
We thank Anastasiya Kulikova for helpful conversations and for assistance with setting up and running AI models. We thank the Institute for Foundations of Machine Learning (IFML), the Texas Advanced Computing Center and the Biomedical Research Computing Facility at the University of Texas at Austin for the computing resources used to perform the analyses in this manuscript. We also thank AMD for the donation of critical hardware and support resources from its HPC Fund.
Contributor Information
Brent E. Allman, Email: brent.allman@utexas.edu.
Luiz Vieira, Email: luiz.vieira@utexas.edu.
Daniel J. Diaz, Email: danny.diaz@utexas.edu.
Claus O. Wilke, Email: wilke@austin.utexas.edu.
Ethics
This work did not require ethical approval from a human subject or animal welfare committee.
Data accessibility
Code and data required to reproduce this work have been archived on Zenodo [56]. This archive corresponds to a 13 February 2025 snapshot of our GitHub repository associated with this project [57].
Supplementary material is available online [58].
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
B.E.A.: conceptualization, data curation, formal analysis, investigation, methodology, project administration, resources, software, validation, visualization, writing—original draft, writing—review and editing; L.V.: methodology, resources, software; D.J.D.: methodology, resources, software, writing—review and editing; C.O.W.: conceptualization, funding acquisition, investigation, project administration, supervision, writing—original draft, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
D.J.D. has a financial relationship with Intelligent Proteins LLC, which uses AI models for protein engineering.
B.E.A., L.V. and C.O.W. declare no competing interests.
Funding
This study was supported by NSF award DEB 2200169. C.O.W. was also supported by the Jane and Roland Blumberg Centennial Professorship in Molecular Evolution and the Dwight W. and Blanche Faye Reeder Centennial Fellowship in Systematic and Evolutionary Biology at The University of Texas at Austin. D.J.D. was supported by the NSF AI Institute for Foundations of Machine Learning (IFML).
References
1. Chakraborty C, Sharma AR, Bhattacharya M, Lee SS. 2022. A detailed overview of immune escape, antibody escape, partial vaccine escape of SARS-CoV-2 and their emerging variants with escape mutations. Front. Immunol. 13, 801522. (doi:10.3389/fimmu.2022.801522)
2. Focosi D, Tuccori M, Baj A, Maggi F. 2021. SARS-CoV-2 variants: a synopsis of in vitro efficacy data of convalescent plasma, currently marketed vaccines, and monoclonal antibodies. Viruses 13, 1211. (doi:10.3390/v13071211)
3. Zhou J, et al. 2023. Omicron breakthrough infections in vaccinated or previously infected hamsters. Proc. Natl Acad. Sci. USA 120, e2308655120. (doi:10.1073/pnas.2308655120)
4. Doud M, Bloom J. 2016. Accurate measurement of the effects of all amino-acid mutations on influenza hemagglutinin. Viruses 8, 155. (doi:10.3390/v8060155)
5. Moulana A, et al. 2022. Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 Omicron BA.1. Nat. Commun. 13, 7011. (doi:10.1038/s41467-022-34506-z)
6. Moulana A, Dupic T, Phillips AM, Chang J, Roffler AA, Greaney AJ, Starr TN, Bloom JD, Desai MM. 2023. The landscape of antibody binding affinity in SARS-CoV-2 Omicron BA.1 evolution. eLife 12, e83442. (doi:10.7554/elife.83442)
7. Starr TN, Greaney AJ, Addetia A, Hannon WW, Choudhary MC, Dingens AS, Li JZ, Bloom JD. 2021. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. Science 371, 850–854. (doi:10.1126/science.abf9302)
8. Obermeyer F, et al. 2022. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 376, 1327–1332. (doi:10.1126/science.abm1208)
9. Wang D, Huot M, Mohanty V, Shakhnovich EI. 2024. Biophysical principles predict fitness of SARS-CoV-2 variants. Proc. Natl Acad. Sci. USA 121, e2314518121. (doi:10.1073/pnas.2314518121)
10. Hie B, Zhong ED, Berger B, Bryson B. 2021. Learning the language of viral evolution and escape. Science 371, 284–288. (doi:10.1126/science.abd7331)
11. Lamb KD, et al. 2024. From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences. bioRxiv. (doi:10.1101/2024.07.05.602129)
12. Thadani NN, Gurev S, Notin P, Youssef N, Rollins NJ, Ritter D, Sander C, Gal Y, Marks DS. 2023. Learning from prepandemic data to forecast viral escape. Nature 622, 818–825. (doi:10.1038/s41586-023-06617-0)
13. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. 2022. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110. (doi:10.1093/bioinformatics/btac020)
14. Lin Z, et al. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130. (doi:10.1126/science.ade2574)
15. Madani A, et al. 2023. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106. (doi:10.1038/s41587-022-01618-2)
16. Lau JH, Clark A, Lappin S. 2017. Grammaticality, acceptability, and probability: a probabilistic view of linguistic knowledge. Cogn. Sci. 41, 1202–1241. (doi:10.1111/cogs.12414)
17. Turney PD, Pantel P. 2010. From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188. (doi:10.1613/jair.2934)
18. Dadonaite B, et al. 2023. A pseudovirus system enables deep mutational scanning of the full SARS-CoV-2 spike. Cell 186, 1263–1278. (doi:10.1016/j.cell.2023.02.001)
19. Doud MB, Lee JM, Bloom JD. 2018. How single mutations affect viral escape from broad and narrow antibodies to H1 influenza hemagglutinin. Nat. Commun. 9, 1386. (doi:10.1038/s41467-018-03665-3)
20. Dingens AS, Arenz D, Weight H, Overbaugh J, Bloom JD. 2019. An antigenic atlas of HIV-1 escape from broadly neutralizing antibodies distinguishes functional and structural epitopes. Immunity 50, 520–532. (doi:10.1016/j.immuni.2018.12.017)
21. Haddox HK, Dingens AS, Hilton SK, Overbaugh J, Bloom JD. 2018. Mapping mutational effects along the evolutionary landscape of HIV envelope. eLife 7, e34420. (doi:10.7554/elife.34420)
22. Baum A, et al. 2020. Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science 369, 1014–1018. (doi:10.1126/science.abd0831)
23. d’Oelsnitz S, et al. 2024. Biosensor and machine learning-aided engineering of an amaryllidaceae enzyme. Nat. Commun. 15, 2084. (doi:10.1038/s41467-024-46356-y)
24. Gong C, Klivans A, Loy JM, Chen T, Liu Q, Diaz DJ. 2024. Evolution-inspired loss functions for protein representation learning. Proc. Machine Learning Res. 235, 15.
25. Kulikova AV, Diaz DJ, Chen T, Cole TJ, Ellington AD, Wilke CO. 2023. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. Sci. Rep. 13, 13280. (doi:10.1038/s41598-023-40247-w)
26. Torng W, Altman RB. 2017. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinform. 18, 302. (doi:10.1186/s12859-017-1702-0)
27. Diaz DJ, Kulikova AV, Ellington AD, Wilke CO. 2023. Using machine learning to predict the effects and consequences of mutations in proteins. Curr. Opin. Struct. Biol. 78, 102518. (doi:10.1016/j.sbi.2022.102518)
28. Elnaggar A, et al. 2022. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127. (doi:10.1109/tpami.2021.3095381)
29. Geffen Y, Ofran Y, Unger R. 2022. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics 38, ii95–ii98. (doi:10.1093/bioinformatics/btac474)
30. Capriotti E, Fariselli P, Rossi I, Casadio R. 2008. A three-state prediction of single point mutations on protein stability changes. BMC Bioinform. 9, S6. (doi:10.1186/1471-2105-9-s2-s6)
31. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, The UniProt Consortium. 2015. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932. (doi:10.1093/bioinformatics/btu739)
32. Diaz DJ, Gong C, Ouyang-Zhang J, Loy JM, Wells J, Yang D, Ellington AD, Dimakis AG, Klivans AR. 2024. Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations. Nat. Commun. 15, 6170. (doi:10.1038/s41467-024-49780-2)
33. Tsuboyama K, Dauparas J, Chen J, Laine E, Mohseni Behbahani Y, Weinstein JJ, Mangan NM, Ovchinnikov S, Rocklin GJ. 2023. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 620, 434–444. (doi:10.1038/s41586-023-06328-6)
34. Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS. 2021. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95. (doi:10.1038/s41586-021-04043-8)
35. Devlin J, Chang M, Lee K, Toutanova K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. (doi:10.48550/arXiv.1810.04805)
36. Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. 2005. Thermodynamic prediction of protein neutrality. Proc. Natl Acad. Sci. USA 102, 606–611. (doi:10.1073/pnas.0406744102)
37. Gong LI, Suchard MA, Bloom JD. 2013. Stability-mediated epistasis constrains the evolution of an influenza protein. eLife 2, e00631. (doi:10.7554/elife.00631)
38. Liberles DA, et al. 2012. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 21, 769–785. (doi:10.1002/pro.2071)
39. Rotem A, et al. 2018. Evolution on the biophysical fitness landscape of an RNA virus. Mol. Biol. Evol. 35, 2390–2400. (doi:10.1093/molbev/msy131)
40. Wylie CS, Shakhnovich EI. 2011. A biophysical protein folding model accounts for most mutational fitness effects in viruses. Proc. Natl Acad. Sci. USA 108, 9916–9921. (doi:10.1073/pnas.1017572108)
41. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. 2005. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388. (doi:10.1093/nar/gki387)
42. Dehouck Y, Kwasigroch JM, Gilis D, Rooman M. 2011. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinform. 12, 151. (doi:10.1186/1471-2105-12-151)
43. Sora V, et al. 2023. RosettaDDGPrediction for high-throughput mutational scans: from stability to binding. Protein Sci. 32, e4527. (doi:10.1002/pro.4527)
44. Notin P, et al. 2023. ProteinGym: large-scale benchmarks for protein fitness prediction and design. In Advances in neural information processing systems, vol. 36, pp. 64331–64379. Red Hook, NY: Curran Associates. (doi:10.1101/2023.12.07.570727)
45. Wilke CO. 2024. The biophysical landscape of viral evolution. Proc. Natl Acad. Sci. USA 121, e2409667121. (doi:10.1073/pnas.2409667121)
46. Littmann M, Heinzinger M, Dallago C, Olenyi T, Rost B. 2021. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11. (doi:10.1038/s41598-020-80786-0)
47. Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B. 2019. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20, 723. (doi:10.1186/s12859-019-3220-8)
48. Randall JR, Vieira LC, Wilke CO, Davies BW. 2024. Deep mutational scanning and machine learning for the analysis of antimicrobial-peptide features driving membrane selectivity. Nat. Biomed. Eng. (doi:10.21203/rs.3.rs-3280212/v1)
49. Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. 2021. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170. (doi:10.1093/bioinformatics/btaa701)
50. Ouyang-Zhang J, Diaz D, Klivans A, Kraehenbuehl P. 2023. Predicting a protein’s stability under a million mutations. Adv. Neural Inf. Process. Syst. 36, 76229–76247.
51. Kulmanov M, Guzmán-Vega FJ, Duek Roggli P, Lane L, Arold ST, Hoehndorf R. 2024. Protein function prediction as approximate semantic entailment. Nat. Mach. Intell. 6, 220–228. (doi:10.1038/s42256-024-00795-w)
52. Gong C, Klivans A, Wells J, Loy J, Liu Q, Dimakis A, Diaz D. 2023. Binding Oracle: fine-tuning from stability to binding free energy. See https://openreview.net/forum?id=ChU7MCLk1J.
53. Evans R, et al. 2022. Protein complex prediction with AlphaFold-Multimer. bioRxiv. (doi:10.1101/2021.10.04.463034)
54. Jumper J, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. (doi:10.1038/s41586-021-03819-2)
55. Bloom JD. 2015. Software for the analysis and visualization of deep mutational scanning data. BMC Bioinform. 16, 168. (doi:10.1186/s12859-015-0590-4)
56. Allman BE, Vieira LC, Diaz DJ, Wilke CO. 2025. Associated code and data files for: a systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks. Zenodo. (doi:10.5281/zenodo.14867459)
57. Allman B. 2025. allmanbrent/NLP_viral_escape. GitHub repository. See https://github.com/allmanbrent/NLP_viral_escape.
58. Allman BE, Vieira L, Diaz DJ, Wilke CO. 2025. Supplementary material from: A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks. Figshare. (doi:10.6084/m9.figshare.c.7742879)


Figure 3. All possible mutations of the coronavirus spike protein DMS experiment [7] tested under different models.
Figure 4. All possible mutations of the coronavirus spike protein DMS experiment [7] tested under different models.
