Proceedings of the National Academy of Sciences of the United States of America. 2025 Jun 9;122(24):e2503742122. doi: 10.1073/pnas.2503742122

Predicting high-fitness viral protein variants with Bayesian active learning and biophysics

Marian Huot a,b,1, Dianzhuo Wang a,c,1,2, Jiacheng Liu d, Eugene I Shakhnovich a,2
PMCID: PMC12184641  PMID: 40489612

Significance

The study addresses a critical challenge in pandemic preparedness: the rapid identification, with scarce experimental resources, of high-fitness viral protein variants before they spread widely. We propose a few-shot learning approach for early pandemic surveillance that uniquely integrates protein language models with Bayesian optimization and biophysical modeling to efficiently predict viral evolution. Our approach leverages the evolutionary information from protein language models, the sampling efficiency of active learning, and biophysical models that can be established early in a pandemic. By identifying potential variants of concern and frequently mutated sites using very limited experimental data, our system demonstrates potential as an early warning system for emerging viral threats in the next pandemic.

Keywords: antibody escape, protein evolution, active learning, pandemic preparedness, protein language model

Abstract

The early detection of high-fitness viral variants is critical for pandemic response, yet limited experimental resources at the onset of variant emergence hinder effective identification. To address this, we introduce an active learning framework, VIRAL (Viral Identification via Rapid Active Learning), that integrates a protein language model, a Gaussian process with uncertainty estimation, and a biophysical model to predict the fitness of novel variants in a few-shot learning setting. By benchmarking on past SARS-CoV-2 data, we demonstrate that our method accelerates the identification of high-fitness variants by up to fivefold compared to random sampling while requiring experimental characterization of fewer than 1% of possible variants. We also demonstrate that our framework effectively identifies sites that are frequently mutated during natural viral evolution with a predictive advantage of up to two years compared to baseline strategies, particularly those enabling antibody escape while preserving ACE2 binding. Through systematic analysis of different acquisition strategies, we show that incorporating uncertainty in variant selection enables broader exploration of the sequence landscape, leading to the identification of evolutionarily distant but potentially dangerous variants. Our results suggest that VIRAL could serve as an effective early warning system for identifying concerning SARS-CoV-2 variants and potentially emerging viruses with pandemic potential before they achieve widespread circulation.


The relentless ascent in the number of SARS-CoV-2 infection cases has catalyzed an unprecedented protein evolution, yielding a profusion of novel mutations. Mutations in the receptor-binding domain (RBD) of the viral spike protein are particularly consequential, as they can increase binding affinity to the Angiotensin-converting enzyme 2 (ACE2) receptor (1, 2), facilitating more efficient host cell entry, or reduce susceptibility to neutralization by monoclonal antibodies (mAbs) and convalescent sera (3–5). These adaptations enable the emergence of lineages with higher fitness (6), which are better equipped to spread within populations or evade immune responses, thereby posing significant challenges to public health interventions. The rapid antigenic evolution of the SARS-CoV-2 RBD has also complicated the development of long-lasting vaccines (7), often requiring updates to match dominant variants (8). As a result, forecasting variants of concern is not only critical for surveillance but also essential for the design of vaccines capable of anticipating immune escape trajectories.

Supervised machine learning has emerged as a valuable tool for predicting viral fitness (infectivity) and identifying high-risk viral mutations. These models leverage fitness labels collected from sequencing databases such as Global Initiative on Sharing Avian Influenza Data (GISAID) (9) or integrate laboratory measurements with epidemiological data. For example, Obermeyer et al. (10) introduced a nonepistatic approach that infers fitness from GISAID and allows estimation of fitness for combinations of observed mutations. Ito et al. (11) and Maher et al. (12) integrate laboratory measurements of binding affinities with epidemiological data to train machine learning models that forecast the fitness of SARS-CoV-2 variants and identify variants susceptible to widespread transmission. However, these models were rare in the early stages of the pandemic due to the limited availability of labeled fitness data. Fitness labels are difficult to obtain unless the virus has been propagating in the population for a sufficient period.

Given these limitations, biophysical models offer a complementary approach by linking experimentally determined binding constants (KD) to fitness, as demonstrated in previous work (13–15). A key advantage of biophysical modeling is its ability to significantly reduce the functional space compared to machine learning, enabling accurate predictions with minimal experimental data while remaining a powerful fitness predictor. The underlying intuition is that high-fitness variants are usually variants that evade antibody binding while maintaining strong cellular receptor affinity. These experimental values can be obtained through low-throughput methods such as surface plasmon resonance (16) and isothermal titration calorimetry (17) or high-throughput methods such as deep mutational scanning (DMS) (18) or combinatorial mutagenesis (CM) (19). However, the time and cost associated with comprehensive experimental measurements can limit their availability during the early stages of a pandemic.

Active learning addresses the challenge of limited labeled data by prioritizing the most promising variants for experimental characterization and has been successfully used for discovery of high-fitness proteins (20) as well as chemical space exploration (21, 22). Active learning frameworks can be combined efficiently with Gaussian processes (GPs), which excel at handling scarce labeled data while providing uncertainty estimates (20, 23, 24).

This paper explores the integration of a protein language model (pLM), active learning, and biophysical modeling to enhance early pandemic response capabilities, including the detection of potential variants of concern (pVOC) and the identification of sites under pressure. We combine a biophysical model (13) with a machine learning pipeline that integrates active learning with a GP decoder on embeddings obtained from ESM3. ESM3 is a state-of-the-art pLM that generates structure-aware sequence embeddings (25), which the GP uses to smoothly predict how mutations affect binding affinities to cell receptors and antibodies. These binding affinities are then passed into a pretrained biophysical model (13). This approach efficiently identifies potentially dangerous mutations by prioritizing the most informative variants based on acquisition functions.

We validated this pipeline using deep mutational scanning data, combinatorial mutagenesis experiments, and GISAID sequencing data, demonstrating that our few-shot learning approach can effectively substitute for high-throughput screening in early pandemic surveillance.

1. Results

1.1. Overview of VIRAL.

We introduce VIRAL (Viral Identification via Rapid Active Learning), a framework that combines GP with active learning to predict the binding specificity of SARS-CoV-2 RBD variants to ACE2 and various antibodies (Fig. 1A). The model operates in a structured pipeline: First, the RBD structure and mutant sequence are input into ESM3, a protein language model, to generate sequence embeddings that incorporate both sequence and structural context. These embeddings are then fed into a GP trained on a limited set of experimental dissociation constants to predict binding affinities. The predicted dissociation constants, as well as predicted uncertainties, are subsequently passed into a biophysical model, following our previous work (13), to infer the infectivity of each variant.

Fig. 1.

Overview of the active learning framework for detecting high-fitness SARS-CoV-2 RBD variants. (A) The pipeline begins by using ESM3 and RBD structure to generate embeddings for RBD sequences. These embeddings serve as inputs to a combined Gaussian Process and biophysical model that predicts variant fitness. The framework operates in an iterative cycle where: 1) The model predicts fitness and uncertainty for untested variants, 2) Based on these predictions, the most promising variants are selected for experimental testing using various acquisition strategies (greedy, UCB, or random sampling), 3) An oracle representing experiments provides binding constants (KD values), and 4) These new measurements are used to retrain the GP model, improving its predictive power. (B) Spearman correlation of predicted fitness for different training set sizes on the DMS dataset. Error bars show SD. ESM3_coord refers to ESM3 with wildtype structure. (C) Similar to (B) but on the CM dataset.

A key strength of our approach is its few-shot capability, that is, the ability to rank variant fitness effectively even with extremely limited labeled data points, which makes it particularly powerful for early variant detection during emerging outbreaks. This capability leverages the Gaussian Process’s inherent ability to learn efficiently from small KD datasets by utilizing the prior knowledge encoded in its kernel function, while simultaneously providing valuable uncertainty estimations about its predictions (26). Additionally, our biophysical model can be trained with minimal data to provide an effective mapping from KD to fitness (13).

Notably, in our benchmark studies (Fig. 1 B and C), incorporating structural information from the wildtype RBD [PDB 6XF5 (27)] into ESM3 significantly enhances predictive accuracy across both deep mutational scanning and combinatorial mutation benchmarks compared with sequence-only methods such as ESM3 sequence only, ESM2, and ESM1v. This advantage of structure-aware embeddings is consistent with Loux et al. (28). In particular, ESM3 with structure achieves a Spearman coefficient of 0.53 on the combinatorial dataset while being trained on only 20 points (0.06% of the dataset). We also note that our model shows greater difficulty in predicting the fitness of single variants in the DMS setting, likely due to the higher diversity and complexity of that landscape. In contrast, the CM dataset contains a constrained subset of epistatic mutations whose combined effects may be easier to approximate, even under a limited-data regime. Based on these results, all subsequent analyses presented in this work utilize the ESM3 embeddings that integrate both sequence and structural information.
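To make the benchmark metric concrete, the following minimal sketch computes a Spearman rank correlation between predicted and measured fitness. It is an illustration rather than the evaluation code used in this work: the function name `spearman` and the toy values are assumptions, and ties are not rank-averaged.

```python
import numpy as np

def spearman(predicted, measured):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_pred = np.argsort(np.argsort(predicted)).astype(float)
    rank_meas = np.argsort(np.argsort(measured)).astype(float)
    return float(np.corrcoef(rank_pred, rank_meas)[0, 1])

# Toy values (hypothetical): two adjacent pairs of variants are swapped in rank.
predicted = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
measured = np.array([0.25, 0.15, 0.45, 0.35, 0.55])
rho = spearman(predicted, measured)  # 0.8 for this toy example
```

Because only ranks enter the metric, it rewards a model that orders variants correctly even when the predicted fitness values are poorly calibrated.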

The predictions from this pipeline guide an active learning strategy: Variants with high predicted infectivity are selected for validation, and new measurements are used to iteratively retrain the GP, refining its accuracy over time. This iterative process enables the model to rapidly identify high-risk variants while minimizing the number of required measurements, making it a scalable approach for early warning systems in viral surveillance.
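The iterative cycle described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the VIRAL implementation: the GP here regresses a scalar fitness directly (the biophysical mapping from KD to fitness is omitted), the kernel is a plain squared-exponential, and the names `rbf_kernel`, `gp_posterior`, `active_learning`, `beta`, and the toy data are all hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(x_train, y_train, x_pool, length_scale=3.0, noise=1e-3):
    """Posterior mean and SD of a zero-mean GP with unit prior variance."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_pool, x_train, length_scale)
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_cross.T)
    variance = np.maximum(1.0 - (v**2).sum(0), 0.0)
    return k_cross @ alpha, np.sqrt(variance)

def active_learning(embeddings, oracle, init_idx, n_rounds, batch_size, beta=1.0):
    """Iteratively acquire the variants with the highest UCB score."""
    measured = {int(i): oracle(int(i)) for i in init_idx}
    for _ in range(n_rounds):
        train = list(measured)
        pool = np.array([i for i in range(len(embeddings)) if i not in measured])
        y_train = np.array([measured[i] for i in train])
        mean, sd = gp_posterior(embeddings[train], y_train, embeddings[pool])
        best = pool[np.argsort(mean + beta * sd)[::-1][:batch_size]]
        for i in best:
            measured[int(i)] = oracle(int(i))  # "experiment" on the selected variant
    return list(measured)

# Toy landscape (hypothetical): random embeddings and a linear fitness function.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 8))
toy_fitness = embeddings[:, 0] + 0.5 * embeddings[:, 1]
acquired = active_learning(embeddings, lambda i: float(toy_fitness[i]),
                           init_idx=range(5), n_rounds=3, batch_size=4)
```

In a retrospective benchmark, the oracle simply looks up a previously measured value; in a prospective setting it would stand for a new binding experiment.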

1.2. Active Learning Identifies Top Variants.

Our objective is to maximize the identification of pVOC, defined as those ranking in the top p = 10% across the mutational landscape. To rigorously evaluate our approach, we start with a retrospective study using existing CM datasets containing experimentally measured binding constants (Materials and Methods). This retrospective analysis is crucial as it provides the only systematic way to benchmark our model’s performance: By simulating the sequential selection of variants from a completely characterized fitness landscape, we can precisely quantify how efficiently our framework identifies high-fitness variants compared to alternative strategies. Such comprehensive validation would be impossible in a prospective setting, where the fitness of untested variants remains unknown.

Our main analysis focuses on two acquisition strategies: a greedy strategy that selects variants based solely on predicted fitness, and an Upper Confidence Bound (UCB) approach that combines predicted fitness with model uncertainty (Materials and Methods). While the greedy strategy excels at exploiting regions of known high fitness, UCB balances exploitation with exploration of uncertain regions in the sequence landscape, potentially uncovering novel fitness peaks. Random sampling constitutes the baseline, mimicking what brute-force searching could achieve, that is, screening every RBD in a library indiscriminately. We define the enrichment factor (EF) as the ratio of the percentage of top variants found by the model-guided search to the percentage of top variants found by a random search. An EF greater than 1 indicates superior performance compared to brute-force screening.
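As a worked illustration of the EF definition, the sketch below compares the fraction of top variants recovered by a set of acquired variants with the fraction a random search of the same size would be expected to recover. Using the expected value of the random baseline (rather than empirical random runs) is an assumption, as are the function and argument names.

```python
import numpy as np

def enrichment_factor(acquired_idx, fitness, top_fraction=0.10):
    """EF = (fraction of top variants found by the model) / (fraction expected at random)."""
    fitness = np.asarray(fitness, dtype=float)
    acquired_idx = np.asarray(acquired_idx, dtype=int)
    n_top = max(1, int(round(top_fraction * fitness.size)))
    top_idx = np.argsort(fitness)[::-1][:n_top]      # indices of the top-p variants
    hits = np.isin(acquired_idx, top_idx).sum()      # acquired variants that are top variants
    found_by_model = hits / n_top
    found_by_random = acquired_idx.size / fitness.size  # expected recall of a uniform draw
    return found_by_model / found_by_random

# Toy landscape (hypothetical): 100 variants, the top 10% are indices 90..99.
toy_fitness = np.arange(100)
ef = enrichment_factor([95, 96, 97, 98, 99, 0, 1, 2, 3, 4], toy_fitness)  # 5 hits out of 10 acquired
```

Here half of the top variants are recovered after sampling 10% of the landscape, giving an EF of 5.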

We use an initial training set of variants carrying at most 2 mutations, to model the first variants observed in a pandemic. Given that the most concerning variants, such as BA.1, can accumulate up to 15 mutations in the RBD, an effective early warning system must efficiently evaluate variants with increasingly complex mutation combinations to assess their potential for complete antibody escape. In this setting, the UCB acquisition strategy achieves a final EF of 5 after 10 rounds of acquisition, corresponding to a total of 120 points, or 0.4% of the dataset (Fig. 2B). The greedy strategy, by contrast, struggles to outperform the random baseline because its limited exploration reduces its ability to identify dangerous variants, as quantified by the area under the receiver operating characteristic curve (AUC; Fig. 2A). The performance gap between UCB and greedy can be explained by the behavior of the GP. Because we use a zero-mean GP prior, regions lacking nearby training data default to predicting no fitness gain relative to the wildtype, but with high associated uncertainty (Materials and Methods). While greedy acquisition avoids these unexplored regions, UCB explicitly prioritizes points with high predictive uncertainty and captures distant, high-risk variants.
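The difference between the two acquisition rules can be illustrated with a small numerical sketch. The scoring form (mean plus a weighted SD), the weight name `beta`, and the toy predictions are assumptions chosen for illustration; the exact acquisition functions are given in Materials and Methods.

```python
import numpy as np

def greedy_scores(mean, sd):
    """Greedy acquisition: rank by predicted fitness only."""
    return np.asarray(mean, dtype=float)

def ucb_scores(mean, sd, beta=1.0):
    """UCB acquisition: predicted fitness plus a weighted uncertainty bonus."""
    return np.asarray(mean, dtype=float) + beta * np.asarray(sd, dtype=float)

def select_batch(scores, batch_size):
    """Indices of the highest-scoring candidates."""
    return np.argsort(scores)[::-1][:batch_size]

# Toy predictions (hypothetical). Candidates 2 and 3 lie far from the training data:
# a zero-mean prior predicts no fitness gain there, but with large uncertainty.
mean = np.array([0.30, 0.25, 0.00, 0.00, -0.10])
sd = np.array([0.05, 0.05, 1.00, 0.90, 0.10])

greedy_pick = select_batch(greedy_scores(mean, sd), 2)   # candidates 0 and 1
ucb_pick = select_batch(ucb_scores(mean, sd, beta=1.0), 2)  # candidates 2 and 3
```

Greedy stays with the two candidates already predicted to be fit, whereas UCB spends its batch on the unexplored candidates.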

Fig. 2.

Active learning performance on the combinatorial dataset. (A) Area Under the Curve (AUC), representing the model’s ability to identify top fitness variants, is shown for different strategies (UCB and greedy). An AUC above 0.5 indicates effective identification of high-fitness variants. Each round corresponds to acquiring a new batch of variants, improving the predictor. Shaded regions represent SDs across runs. (B) Enrichment in top variants across acquisition runs for each strategy. Values above 1 represent improvement over a random acquisition. (C) Enrichment across acquisition runs for different fitness thresholds p defining top variants. (D) Maximum enrichment obtained during active learning for different fitness thresholds p defining top variants. (E) Fitness of acquired variants at each round using UCB. Each point corresponds to an acquired variant from one of ten independent runs. Each run was initialized with a random low-fitness training set and followed by 10 rounds of acquisition. Newly acquired variants at each round are shown as red dots; previously acquired variants are shown in gray.

The performance of the model with UCB improves when we increase the stringency of our definition for dangerous variants by decreasing the threshold p (Fig. 2 C and D). For instance, when defining dangerous variants as those in the top 1% of fitness scores rather than the top 10%, the model achieves a maximum enrichment factor of 11, demonstrating particularly strong performance in identifying the most concerning variants. This occurs because UCB accumulates a significant number of variants in the top 1% of the fitness distribution (Fig. 2E), yielding higher enrichment under stricter definitions of danger. In contrast, the greedy strategy fails to capture rare, outlier variants in the extreme high-fitness tail (SI Appendix, Fig. S2). As the threshold p decreases, the variants acquired by greedy acquisition are no longer sufficiently enriched among the top performers, leading to a decline in enrichment (Fig. 2D).

1.3. Exploration of Dataset.

The contrasting enrichment factors between UCB and greedy strategies reflect their fundamentally different exploration behaviors. Fig. 3A illustrates this distinction through the exploration variance in the ESM3 embedding space, where UCB demonstrates substantially higher variance of the embeddings for acquired points compared to the greedy approach (see SI Appendix, Fig. S3 for impact of uncertainty weight in UCB exploration).

Fig. 3.

Exploration of combinatorial landscape. (A) Comparison of embedding variance of points acquired across rounds between UCB (orange) and greedy (blue) strategies. (B) UMAP visualization of sequence space comparing acquired variants (orange) against all top sequences (blue) and background sequences (gray) using UCB (Left) and greedy (Right) acquisition strategies. (C) Distribution of semantic change of acquired RBD variants using different strategies. (D) Correlation of ACE2/antibody binding and fitness with semantic change.

This increased variance signifies that UCB conducts a more thorough and diverse exploration of the sequence landscape, systematically sampling from a broader range of potential variants rather than concentrating on a limited region of the sequence space. This broader exploration is visually demonstrated in the UMAP visualization (Fig. 3B), where UCB successfully identifies and samples from multiple distinct clusters of high-fitness variants, while the greedy strategy remains confined to a more limited region of the sequence space. This observation highlights UCB’s ability to balance exploitation of known high-fitness regions with exploration of potentially promising but unexplored sequence clusters, in contrast to the greedy strategy’s more localized search pattern.

In Fig. 3C, the density plot reveals the semantic change distribution of sampled variants. Semantic change is defined as the Euclidean distance between wildtype and mutant embeddings (Materials and Methods). We observe that UCB’s broader sampling extends into regions shifted toward higher semantic change. This broader sampling is particularly important, as shown in Fig. 3D, where higher semantic change positively correlates with enhanced immune escape and increased viral fitness. Additionally, as shown in SI Appendix, Fig. S4, semantic change correlates positively with the number of mutations relative to the wild-type. The ability to explore sequences with greater semantic divergence from wild-type is crucial, as these more distant regions of sequence space often harbor novel beneficial mutations that could drive the emergence of escape variants.
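For concreteness, semantic change as described here can be computed as in the sketch below, assuming the distance is taken between sequence-level embedding vectors. The function name and the toy three-dimensional vectors are hypothetical.

```python
import numpy as np

def semantic_change(wildtype_embedding, mutant_embeddings):
    """Euclidean distance between the wildtype embedding and each mutant embedding."""
    wt = np.asarray(wildtype_embedding, dtype=float)
    mutants = np.atleast_2d(np.asarray(mutant_embeddings, dtype=float))
    return np.linalg.norm(mutants - wt, axis=1)

# Toy embeddings (hypothetical): the second "mutant" is identical to the wildtype.
wt = np.array([0.0, 0.0, 0.0])
mutants = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
change = semantic_change(wt, mutants)  # distances 5.0 and 0.0
```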

1.4. VIRAL Identifies Highly Mutable Sites.

Our framework effectively identifies sites (residues) that emerged as mutation hotspots during natural viral evolution, particularly those linked to antibody escape. A residue is classified as “highly mutable” if, over the entire course of the pandemic, it has exhibited more than threshold = 9 of the 20 possible amino acid substitutions (see SI Appendix, Fig. S5 for model performance with different thresholds).
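The labeling rule can be sketched as follows, assuming mutations are recorded as strings of the form wildtype residue, position, mutant residue (e.g., E484K). The input format, function name, and toy mutation list are assumptions; they do not reproduce the GISAID processing used in this work.

```python
from collections import defaultdict

def highly_mutable_sites(observed_mutations, threshold=9):
    """Return positions with more than `threshold` distinct observed substitutions."""
    substitutions = defaultdict(set)
    for mutation in observed_mutations:
        wildtype, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
        if mutant != wildtype:
            substitutions[position].add(mutant)
    return {pos for pos, residues in substitutions.items() if len(residues) > threshold}

# Toy observations (hypothetical): 10 distinct substitutions at site 484, 2 at site 501.
observed = ["E484" + aa for aa in "KQAGDRVTSP"] + ["N501Y", "N501T", "E484K"]
labels = highly_mutable_sites(observed)  # only site 484 exceeds the threshold
```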

After running VIRAL on the DMS dataset with different initial training sets (maximal enrichment of 3.1 and AUC of 0.81; see SI Appendix, Fig. S1), we found a strong correlation between sites frequently sampled by our algorithm and those that became highly mutable in natural evolution.

Across 10 independent runs with different initial training sets, the AUC values ranged from 0.62 to 0.78. When acquisition scores were averaged across multiple runs to reduce sampling noise, we obtained a robust AUC of 0.76 (Fig. 4A). These results remain consistent across different thresholds used to define “highly mutable” sites in natural evolution (SI Appendix, Fig. S5).
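The effect of averaging acquisition scores across runs can be illustrated with a minimal sketch. The AUC is computed here from pairwise comparisons between labeled and unlabeled sites (ties count one half); the per-site counts and labels are toy values, and the names are hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """Probability that a positive site outscores a negative site (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy data (hypothetical): 4 sites, the first two are "highly mutable".
labels = np.array([1, 1, 0, 0])
counts_per_run = np.array([[3, 0, 1, 0],    # run 1: number of times each site was sampled
                           [1, 2, 0, 1]])   # run 2

auc_per_run = [roc_auc(run, labels) for run in counts_per_run]  # 0.625 and 0.875
auc_averaged = roc_auc(counts_per_run.mean(axis=0), labels)     # 1.0 after averaging
```

In this toy case each run misranks a different site, so averaging the scores removes both errors.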

Fig. 4.

Active learning identifies residues under selection pressure. (A) ROC curves illustrating the performance of VIRAL in identifying frequently mutated sites in the GISAID database. Individual experimental runs are shown as gray lines. Blue curve is obtained when averaging the acquisition scores across multiple runs to reduce sampling noise. (B) Comparative ROC analysis of VIRAL benchmarked against one-batch acquisition using single-mutation variants observed by the end of 2020 (orange) or 2022 (green). (C) Number of sampled amino acids per site, plotted against the RBD sequence. (D) Structural representation of the RBD (light blue) in complex with ACE2 (dark blue) and the S309 antibody (green and yellow, representing the heavy and light chains, respectively; PDB: 8FXC). Red highlights indicate the top 20 sampled sites by VIRAL. (E–I) The KD distribution for oversampled sites identified by VIRAL is shown for ACE2, LY-CoV016, LY-CoV555, REGN10987, and S309. Higher values indicate binding loss. Red-highlighted sites cause escape and exhibit a greater average mutational binding loss than the protein-wide average.

We benchmarked our methodology against alternative approaches to predict mutation-prone sites in Fig. 4B. Our integrated approach significantly outperforms baseline predictions derived from variants observed before the end of 2020, while performing slightly better than predictions based on variants until the end of 2022. This competitive performance is particularly noteworthy because baseline predictors have an inherent advantage: they are trained directly on pandemic-era mutations that successfully emerged in the population, while our method identifies relevant sites without this prior knowledge. More precisely, with an initial training set of only one single variant per site, our pipeline enhances predictive capability by up to two years in identifying sites under pressure.

Fig. 4 C and D provide sequence and structural representations of the sites prioritized by our framework. The sampling frequency across the RBD sequence is visualized in Fig. 4C, while Fig. 4D maps the top 20 most frequently sampled sites onto the structure of the RBD-ACE2-S309 complex [PDB: 8FXC (29)]. The spatial distribution of these prioritized sites reveals an immunological pattern: They cluster in critical regions, namely the ACE2 binding interface (which overlaps with class 1 and 2 antibody epitopes) and regions recognized by class 3 and 4 antibodies.

Fig. 4 E–I further analyze the functional impact of mutations at these oversampled sites by plotting the change in log KD (ΔlogKD) at each site relative to the wild type. The red dashed line represents the global average ΔlogKD across the entire RBD.

In Fig. 4 E–I, each of these sites is selected for its ability to escape one or more antibodies while maintaining ACE2 binding, as highlighted in Fig. 4E. In particular, despite being trained on only a few percent of single-variant data, our model successfully identified critical antibody escape mutations that later emerged in major SARS-CoV-2 variants. These key positions include residue 484 (enabling LY-CoV016 and LY-CoV555 escape) in Omicron BA.1; residue 493 (enabling CoV016 and CoV555 escape) in Omicron BA.1, BA.2, and BA.5; positions 340 (enabling S309 escape) and 356 (enabling S309 escape) in Omicron BA.2; residue 445 (enabling REGN10987 escape) in variants B.1, BF.8, and XD; and position 486 (enabling CoV016 and CoV555 escape) in numerous variants ranging from N.6 to Omicron BA.5 to BL1. A comprehensive list of these positions and their associated variants can be found in SI Appendix. We also identified position 507 as an oversampled site. Although mutations at this position allow escape from the LY-CoV016 and REGN10987 antibodies, they significantly compromise the binding affinity of ACE2. This trade-off between immune escape and receptor binding likely explains why mutations at position 507 have not been widely observed in naturally circulating variants.

2. Discussion

In this study, we developed an integrated framework, VIRAL, for identifying high-fitness SARS-CoV-2 RBD variants and predicting mutation-prone sites using minimal experimental data. Our approach combines three key components: a pLM for sequence representation, Gaussian processes for efficient learning and uncertainty quantification, and biophysical modeling for fitness prediction. This integration demonstrates several significant advantages in the context of viral surveillance and variant prediction.

First, our model achieves high efficiency in identifying dangerous variants, obtaining up to fivefold enrichment over random sampling while requiring experimental characterization of less than 1% of possible variants in the combinatorial mutagenesis landscape. This significant reduction in the number of required experiments could substantially accelerate the identification of concerning variants during early pandemic stages, when experimental resources are often limited. While initializing a model with some training data (as in this study) is advantageous, it is also feasible to start with zero training data, where zero-shot predictions initially carry equal uncertainty. Indeed, pLMs have proved effective at zero-shot prediction of protein function, provided they are trained on large and diverse protein sequence databases (30). As more data are gathered, a sample-efficient model leveraging uncertainty can iteratively improve its predictions and confidence. This iterative cycle of computation and experimentation has been central to experiment prioritization, especially in drug discovery (31).

Second, our comparative analysis of acquisition strategies reveals an important balance between exploitation and exploration in viral surveillance. While the greedy strategy efficiently samples known high-fitness regions, the UCB approach enables broader exploration of the sequence landscape, particularly into regions with higher semantic change. This broader sampling is crucial for identifying evolutionarily distant but potentially dangerous variants. This aligns with our understanding of viral evolution, where new variants often emerge by exploring antigenically novel regions while maintaining essential functionality such as binding and folding. For instance, Luksza et al.’s predictive model of influenza evolution (32) showed that viral clades spread by balancing antigenic novelty and fitness, while Meijer et al. (33) demonstrated how population immunity drives selection toward previously unexposed antigenic regions. These insights emphasize that systematic exploration of antigenically distant regions, as enabled by our UCB-based framework, is critical for anticipating and mitigating the emergence of immune-escape variants. Importantly, it would not be possible to reliably identify immune escape mutations based solely on their semantic change, as underlined in a recent study (34). However, when combined with protein stability metrics and active learning frameworks, semantic change can play a complementary role in uncovering novel regions of the sequence landscape.

Third, our framework demonstrates high accuracy in identifying biologically relevant mutation sites. By defining the genetic space in terms of the effects of single mutations, we achieve a predictive advantage of two years compared to the baseline strategy, which relies on waiting for variants to emerge in nature, measuring their fitness, and then predicting the fitness effects of new mutations. Combining biophysics and active learning trained on KD values offers two major advantages. First, KD values can be experimentally measured early in a pandemic, unlike fitness values, which require the spread of mutations in the population to infer growth curves. Second, our biophysical fitness predictions are interpretable, unlike black-box models that directly output fitness values. Specifically, our predictions are driven by biophysical insights, such as antibody escape potential or tight ACE2 binding, making them biologically meaningful. Furthermore, the systematic oversampling of positions that persisted in variants of concern, particularly at sites 356, 484, 486, and 493, demonstrates that our model could highlight evolutionarily important sites using limited data. Importantly, as SARS-CoV-2 evolves under immune pressure, it accumulates multiple mutations whose effects are often epistatic (7, 35). In this context, our framework offers a major advantage: It is natively epistatic. First, the protein language model Evolutionary Scale Modeling (ESM), trained on millions of natural sequences, captures nonlinear dependencies between residues through its attention mechanism. Second, the Gaussian process models higher-order interactions present in observed KD values through its flexible kernel. Third, the biophysical fitness mapping inherently reflects nonlinear tradeoffs among ACE2 binding, antibody escape, and folding stability (13). Together, these components allow our method to anticipate high-fitness, multimutant variants with epistatic interactions that would likely be missed by brute-force approaches. Future work could further enhance this by integrating explicitly epistasis-aware kernels, such as Kermut (36), to capture even richer mutational dependencies.

Our framework is particularly well suited to the intermediate phase of an outbreak, when a new virus has been identified and initial biophysical measurements are starting to be collected, but high-throughput functional assays remain limited. Indeed, deploying our model requires two foundational components: 1) early identification of key antibodies exerting selective pressure, along with initial low-throughput KD measurements for receptor and antibody binding; 2) even very limited population-level infectivity data (e.g., from GISAID) that can be used to fit the biophysical model linking binding affinities to viral fitness (13). In such early scenarios, comprehensive combinatorial data are typically unavailable. Encouragingly, recent work shows that both antibody identification (37, 38) and KD measurements can be achieved within weeks of pathogen discovery. By leveraging even limited data from low-throughput techniques such as surface plasmon resonance (39), our framework can begin to guide surveillance efforts well before high-throughput methods (40) have been deployed. This could potentially provide a lead time of up to one year, as would have been the case during the SARS-CoV-2 pandemic.

Finally, this capability to identify multiple evolutionarily plausible, high-risk variants in advance opens the door to new strategies in vaccine design. Our method enables the systematic identification of several candidate variants and key residues that are biophysically more likely to gain prominence than the currently circulating strain. This is especially relevant for the design of polyvalent (41) or mosaic vaccines (42), where the goal is to preempt immune escape by eliciting broad, cross-protective responses against multiple potential escape trajectories.

A key assumption of our work is that the biophysical model provides a reliable estimate of viral fitness, as demonstrated for the SARS-CoV-2 pandemic by Wang et al. (13). Without the biophysical mapping from KD to fitness space, the active learning strategy could lead to the enrichment of variants that do not align with the actual fitness landscape. Fortunately, theoretical (43), simulation (14), and experimental studies (15) have all shown that viral fitness can be quantitatively linked to molecular properties of the viral protein. These successes underscore the potential of biophysical approaches in modeling complex fitness landscapes for viruses beyond SARS-CoV-2, suggesting that our methodology could be generalized to address future pandemic threats.

3. Materials and Methods

3.1. Datasets.

We use two distinct datasets in our research. The first consists of the combinatorial KD measurements from Moulana et al. (19, 44). In their study, they systematically measured binding to ACE2 and to four monoclonal antibodies (LY-CoV016, LY-CoV555, REGN10987, and S309) for all possible combinations of the 15 mutations in the RBD of BA.1 relative to the Wuhan Hu-1 strain (totaling 32,768 genotypes).

The second dataset is a deep mutational scan (DMS) from Starr et al. (18, 45), which provides, for all possible RBD single mutants, KD values (ACE2) and escape ratios ϵ(mut) (LY-CoV016, LY-CoV555, REGN10987, and S309), restricted to residues 334 to 526 inclusive. We excluded 28 sites from this dataset because data were missing for one of the biophysical constants. When computing AUC for comparison with GISAID data, we removed these sites from the labels as well to ensure unbiased estimation.

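Noting that the dissociation constant of the RBD can be written as

$$K_D = \frac{[\mathrm{RBD}_{\mathrm{free}}][\mathrm{Ab}]}{[\mathrm{RBD}_{\mathrm{bound}}]} \propto \frac{[\mathrm{RBD}_{\mathrm{free}}]}{[\mathrm{RBD}_{\mathrm{bound}}]} \propto \epsilon,$$

we assumed that the log dissociation constants of single mutants could be obtained as the sum of the wild-type log dissociation constant and the change in escape ratio relative to its minimum value (wild type):

$$\log K_D(\mathrm{mut}) = \log K_D(\mathrm{wt}) + \log \frac{\epsilon(\mathrm{mut})}{\min_m \epsilon(m)}.$$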
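As a worked illustration of this conversion, the following minimal sketch shifts a wild-type log KD by the log of each escape ratio relative to the smallest ratio. The function name, the example values, and the choice of base-10 logarithms are assumptions made for illustration; the text does not specify the base.

```python
import numpy as np

def log_kd_from_escape(log_kd_wt, escape_ratios):
    """Shift the wild-type log KD by the log escape ratio relative to its minimum."""
    escape_ratios = np.asarray(escape_ratios, dtype=float)
    # Assumption: base-10 logarithm; the minimum is taken over all mutants supplied.
    return log_kd_wt + np.log10(escape_ratios / escape_ratios.min())

# Hypothetical escape ratios for three single mutants against one antibody.
escape = [0.01, 0.1, 1.0]
log_kd = log_kd_from_escape(log_kd_wt=-9.0, escape_ratios=escape)
```

Under these assumptions, the mutant with the smallest escape ratio keeps the wild-type value, and every tenfold increase in escape ratio raises log KD by one unit.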

3.2. ESM.

ESMs are transformer-based protein language models (pLMs) designed to extract meaningful representations from protein sequences, enabling tasks such as structure prediction and large-scale protein characterization.

We use these models to obtain semantic representations of protein sequences. For a protein sequence of length $L$, we first describe it as a sequence of tokens $x \overset{\mathrm{def}}{=} (x_1, \ldots, x_L)$. In the case of the RBD, $L = 201$. We then run a forward pass of ESM and obtain the hidden representations of the final layer, $(h_1, \ldots, h_L)$, where each $h_i \in \mathbb{R}^K$. Then, we use mean pooling of these vectors to obtain a representation of the entire sequence, $z = f_{\mathrm{ESM}}(x) = \frac{1}{L}\sum_{i=1}^{L} h_i \in \mathbb{R}^K$. Finally, we standardized the embeddings using their mean and variance.
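The pooling and normalization steps can be sketched as follows. This is a minimal illustration that uses random arrays in place of real ESM outputs. It assumes that the normalization is a per-dimension z-score computed across the set of sequences, which the text does not state explicitly. The helper names `mean_pool` and `standardize` are hypothetical.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average per-residue vectors (L x K) into a single sequence embedding (K,)."""
    return hidden_states.mean(axis=0)

def standardize(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score each embedding dimension across a set of sequences (N x K)."""
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0)
    return (embeddings - mean) / (std + eps)

rng = np.random.default_rng(0)
# Stand-in for final-layer hidden states: 5 sequences, L = 201 residues, K = 8.
hidden = rng.normal(size=(5, 201, 8))
pooled = np.stack([mean_pool(h) for h in hidden])  # shape (5, 8)
z = standardize(pooled)
```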

We benchmarked different models: ESM1v (30) (K = 1,280), ESM2 (46) (K = 1,280), ESM3 (25) sequence only, and ESM3 sequence with structures from PDB 6XF5 (27) (K = 1,536). Main results are obtained using ESM3 with structure encoding.

3.3. Semantic Change.

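We denote the sequence of the wild-type RBD as $x_{wt}$ and that of the mutant as $x_{mt}$, where $x_{mt}$ may differ from $x_{wt}$ at one or more tokens. Semantic change is defined as the L2 norm of the difference between their embeddings:

$$\Delta z[x_{mt}] \overset{\mathrm{def}}{=} \lVert z_{mt} - z_{wt} \rVert_2 = \lVert f_{\mathrm{ESM}}(x_{mt}) - f_{\mathrm{ESM}}(x_{wt}) \rVert_2$$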

A high semantic change represents a large change in the semantic meaning of the protein sequence, which we observed to correlate with decreased antibody binding affinity.
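Given two embeddings, semantic change reduces to a single norm computation. The sketch below uses small hand-written vectors rather than real ESM embeddings, and the function name is illustrative.

```python
import numpy as np

def semantic_change(z_mt: np.ndarray, z_wt: np.ndarray) -> float:
    """L2 norm of the difference between mutant and wild-type embeddings."""
    return float(np.linalg.norm(z_mt - z_wt, ord=2))

# Toy embeddings chosen so that the distance is easy to verify by hand.
z_wt = np.array([0.0, 0.0, 0.0])
z_mt = np.array([3.0, 4.0, 0.0])
delta = semantic_change(z_mt, z_wt)
```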

3.4. Fitness and Biophysical Model.

We determine the fitness of the RBD based on its contribution to viral infectivity, utilizing a biophysical model established in our previous work (13). This model leverages the Boltzmann distribution to map molecular phenotypes (characterized by binding energies to cell receptors and antibodies) onto a fitness landscape. The RBD's fitness is primarily determined by two factors: its binding affinity for the ACE2 receptor and its ability to evade antibody neutralization.

In our model, the RBD can exist in multiple states, each with its associated free energy: unfolded state ($G_u$), folded but unbound state ($G_f$), folded and bound to ACE2 ($G_{bA}$), and folded and bound to one of four distinct antibodies ($G_{a_i}$, where $i$ indexes the antibodies).

The fitness function can be expressed as

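$$F = a\,\frac{C e^{-\beta G_{bA}} + e^{-\beta G_f}}{C e^{-\beta G_{bA}} + \sum_i C_i e^{-\beta G_{a_i}} + e^{-\beta G_f} + e^{-\beta G_u}}$$ [1]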

In this equation, $C = [\mathrm{ACE2}]/C_0$ and $C_i$ represent the normalized concentrations of ACE2 and antibodies, respectively, where […] denotes molar concentration, $C_0$ is the reference concentration, and $C_i$ is obtained from $[\mathrm{Ab}_i]$, $C_0$, and a neutralization coefficient $m_i$ specific to each antibody. Noting that $\beta \Delta G = \beta (G_i - G_f) = \ln(K_D)/T$ for every state $i$, where $T$ is a hyperparameter proportional to system temperature, fitness can be expressed as a function of measured dissociation constants.

The model parameters (including the scaling factor a and the effective concentrations C and Ci) were calibrated by combining experimental measurements of dissociation constants (KD) with variant prevalence data from the GISAID database (9). This biophysical framework can then predict the fitness (F) of any RBD variant given its binding affinities, whether measured experimentally or predicted computationally.

In this paper, we did not refit the biophysical model iteratively; instead, we used the following coefficients fitted in ref. 13: $T = 1.6$, $a = 1.57363338$, $C = 5.4764 \times 10^{-7}$, $K_1 = 5.6015 \times 10^{-8}$, $K_2 = 4.5128 \times 10^{-8}$, $K_3 = 7.1825 \times 10^{-8}$, and $K_4 = 4.7273 \times 10^{-7}$. Although we did not refit these parameters in our work, the biophysical model can be trained with very limited fitness data owing to its small number of parameters and can still approximate population-level fitness with high accuracy, even in the early stages of a pandemic. Moreover, the biophysical model could be iteratively updated as new experimental and fitness data become available, allowing it to adapt dynamically in a real-time outbreak setting.
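To make the mapping from dissociation constants to fitness concrete, the sketch below evaluates Eq. 1 after dividing numerator and denominator by the folded-state Boltzmann weight, so that each bound state contributes a term proportional to $K_D^{-1/T}$. The function name, the `unfolded_weight` argument (standing in for the relative weight of the unfolded state), and all numerical values are illustrative placeholders, not the fitted coefficients. KD values are assumed to be in molar units.

```python
import numpy as np

def fitness(kd_ace2, kd_abs, a, T, C, C_abs, unfolded_weight=0.0):
    """Evaluate the fitness model from dissociation constants.

    Each bound state has Boltzmann weight exp(-beta * (G_i - G_f)) = KD_i ** (-1 / T),
    measured relative to the folded, unbound state (weight 1).
    """
    w_ace2 = C * np.asarray(kd_ace2, dtype=float) ** (-1.0 / T)
    w_abs = np.sum(
        np.asarray(C_abs, dtype=float) * np.asarray(kd_abs, dtype=float) ** (-1.0 / T),
        axis=-1,
    )
    return a * (w_ace2 + 1.0) / (w_ace2 + w_abs + 1.0 + unfolded_weight)

# Placeholder parameters for illustration only (two hypothetical antibodies).
params = dict(a=1.0, T=1.0, C=1e-7, C_abs=[1e-8, 1e-8])
f_bound = fitness(kd_ace2=1e-8, kd_abs=[1e-9, 1e-9], **params)   # antibodies bind tightly
f_escape = fitness(kd_ace2=1e-8, kd_abs=[1e-6, 1e-6], **params)  # antibodies bind weakly
```

With these placeholder values, weakening antibody binding at fixed ACE2 affinity raises the computed fitness, which is the qualitative behavior the model is meant to capture.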

3.5. Uncertainty of Predictions.

To propagate variance from the predicted dissociation constants to the fitness function, we used the standard error (SE) propagation formula, which states that the variance of a function $f$ can be approximated as

$$\sigma_f^2 \approx \sum_i \left(\frac{\partial f}{\partial k_i}\right)^2 \sigma_{k_i}^2,$$ [2]

where $\sigma_f^2$ is the variance of the function, $\partial f / \partial k_i$ is the partial derivative of the function with respect to the $i$-th variable, and $\sigma_{k_i}^2$ is the variance of the $i$-th variable. The variance of each variable is obtained from the posterior distribution of the GP.

The derivatives $\partial f / \partial \log_{10} K_{D,i}$ were computed symbolically using SymPy and evaluated numerically with the respective parameter values.
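The first-order propagation in Eq. 2 can be sketched with SymPy as follows. The two-variable `toy_fitness` expression is a hypothetical stand-in for the full fitness function, the inputs are treated as independent, and the helper name and numerical values are assumptions for illustration.

```python
import sympy as sp

def propagate_variance(expr, symbols, values, variances):
    """First-order variance of expr, assuming independent inputs."""
    point = dict(zip(symbols, values))
    total = 0.0
    for sym, var in zip(symbols, variances):
        # Symbolic partial derivative, then numerical evaluation at the given point.
        grad = float(sp.diff(expr, sym).subs(point))
        total += grad ** 2 * var
    return total

# Toy expression in two log10 KD variables (illustrative, not the fitted model).
x_a, x_b = sp.symbols("x_a x_b")
toy_fitness = 1 / (1 + 10 ** (x_a - x_b))
var_f = propagate_variance(toy_fitness, [x_a, x_b], [-8.0, -9.0], [0.04, 0.09])
```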

3.6. Gaussian Process Kernel Selection.

To model protein binding using Gaussian processes, we defined a kernel function that captures the notion of similarity between different variants in the sequence embedding space. We employed a Rational Quadratic kernel, which is well suited for modeling functions with varying degrees of smoothness. The kernel function is defined as

$$K(z, z') = \left(1 + \frac{\lVert z - z' \rVert_2^2}{2 \alpha l^2}\right)^{-\alpha},$$

where α is the scale mixture parameter, and l is the length scale parameter.

From a Bayesian perspective, the kernel defines the prior covariance between data points, ensuring that variants that are closer in embedding space ($\lVert z - z' \rVert_2$ small) exhibit similar binding values, while more distant variants remain weakly correlated. This reflects the fact that mutations with similar physicochemical properties and structural contexts tend to have correlated effects on protein binding.

In our framework, the kernel hyperparameters α and l were optimized by maximizing the marginal likelihood of the observed binding data, allowing the model to adaptively learn the appropriate scale of binding variation across the mutational landscape.
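For reference, the Rational Quadratic kernel can be written directly in NumPy. This is a minimal sketch with fixed hyperparameters; the function name and example points are illustrative, and no hyperparameter optimization is performed here.

```python
import numpy as np

def rational_quadratic(Z1, Z2, alpha=1.0, length_scale=1.0):
    """Rational Quadratic kernel matrix between the rows of Z1 and the rows of Z2."""
    Z1 = np.atleast_2d(np.asarray(Z1, dtype=float))
    Z2 = np.atleast_2d(np.asarray(Z2, dtype=float))
    # Pairwise squared Euclidean distances, shape (len(Z1), len(Z2)).
    sq_dist = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return (1.0 + sq_dist / (2.0 * alpha * length_scale ** 2)) ** (-alpha)

points = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
K = rational_quadratic(points, points)
```

The diagonal equals 1 (each point is perfectly correlated with itself), and the off-diagonal entries shrink as the embedding distance grows.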

When training the Gaussian process on a training set $(Z_1, Y_1)$ of size $N$ and making predictions for a new point $(z_2, y_2)$, we used the analytical solutions for the posterior distribution:

$$P(y_2 \mid Y_1, Z_1, z_2) \sim \mathcal{N}(\mu_{pred}, \sigma_{pred}^2)$$

with

$$\mu_{pred} = K(z_2, Z_1) \cdot [K(Z_1, Z_1)]^{-1} \cdot Y_1$$

and

$$\sigma_{pred}^2 = k(z_2, z_2) - K(z_2, Z_1) \cdot [K(Z_1, Z_1)]^{-1} \cdot K(Z_1, z_2),$$

where $K(z_2, Z_1) = K(Z_1, z_2)^T$ is a vector of dimension $N$, $K(Z_1, Z_1)$ is the covariance matrix of dimension $N \times N$, and $k(z_2, z_2)$ is the kernel function evaluated at the test point.

The mean of the posterior distribution serves as a prediction for the output variable y2 corresponding to the input sample z2, while the variance (the diagonal of the covariance matrix) acts as a proxy for uncertainty. The mean of the posterior predictions in a Gaussian process represents a weighted average of the observed variables, with the weights determined by the covariance function.

In our specific application, $z$ represents the embedding of a sequence, and $y(z) = \Delta \log K_D$ denotes the change in the log-transformed dissociation constant relative to the wild type.

Importantly, when a new point $z_2$ lies far from all training points in the embedding space, the entries of $K(z_2, Z_1)$ become very small (approaching zero). As a result, the posterior mean $\mu_{pred}$ approaches zero. This reflects the GP prior assumption that, in regions without nearby data, $\Delta \log K_D = 0$, meaning no predicted change in binding compared to the wild type. However, the posterior variance at such points remains high, indicating substantial uncertainty.
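The analytical posterior and its behavior far from the training data can be checked numerically with a small sketch. The kernel hyperparameters are fixed at 1, a small jitter term is added to the training covariance for numerical stability (an assumption, not part of the formulas above), and the data points are invented for illustration.

```python
import numpy as np

def rq_kernel(A, B, alpha=1.0, length_scale=1.0):
    """Rational Quadratic kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return (1.0 + sq / (2.0 * alpha * length_scale ** 2)) ** (-alpha)

def gp_posterior(Z1, Y1, Z2, jitter=1e-8):
    """Posterior mean and variance at the rows of Z2 for a zero-mean GP."""
    K11 = rq_kernel(Z1, Z1) + jitter * np.eye(len(Z1))
    K21 = rq_kernel(Z2, Z1)
    mu = K21 @ np.linalg.solve(K11, Y1)
    # k(z2, z2) = 1 for this kernel; subtract the variance explained by the training data.
    var = 1.0 - np.einsum("ij,ji->i", K21, np.linalg.solve(K11, K21.T))
    return mu, var

Z_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_train = np.array([0.5, -0.2, 0.1])           # toy changes in log KD
Z_query = np.array([[0.0, 0.0], [50.0, 50.0]])  # one training point, one distant point
mu, var = gp_posterior(Z_train, y_train, Z_query)
```

At the training point the posterior mean reproduces the observed value with near-zero variance, whereas at the distant point the mean reverts toward zero and the variance approaches the prior value of 1.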

The kernel hyperparameters were optimized by maximizing the marginal likelihood using the sklearn.gaussian_process module of scikit-learn.
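A minimal scikit-learn sketch of this fitting step is shown below, using synthetic data in place of sequence embeddings and binding measurements. The noise level, number of optimizer restarts, and initial hyperparameters are assumptions chosen for illustration, not the settings used in this work.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

rng = np.random.default_rng(0)
Z_train = rng.normal(size=(30, 4))                    # stand-in embeddings
y_train = np.sin(Z_train[:, 0]) + 0.1 * Z_train[:, 1]  # stand-in binding changes
Z_test = rng.normal(size=(5, 4))

kernel = RationalQuadratic(length_scale=1.0, alpha=1.0)
# Note: the regressor's `alpha` is diagonal noise, unrelated to the kernel's alpha.
gp = GaussianProcessRegressor(
    kernel=kernel, alpha=1e-6, n_restarts_optimizer=2, random_state=0
)
gp.fit(Z_train, y_train)  # hyperparameters are tuned by maximizing the marginal likelihood
mu, sd = gp.predict(Z_test, return_std=True)
```

After fitting, `gp.kernel_` holds the optimized kernel, and `sd` provides the per-point posterior SD used as an uncertainty estimate.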

3.7. Model Benchmark.

We evaluate the ability of VIRAL to rank variants by computing the Spearman correlation between inferred fitness scores and ground-truth labels.

For the combinatorial dataset, the model is trained on varying numbers of data points, ranging from 20 to 160, and then used to predict the fitness of the remaining variants.

For the DMS dataset, training sets are constructed by sampling an average of $n \in \{0.2, 0.4, 0.6, 0.8, 1, 2, 3\}$ mutations per site. When $n < 1$, a single mutant at each site is randomly selected and included in the training set with probability $n$.

To ensure robustness, each experiment is repeated 10 times for both datasets and for each training size, enabling a reliable assessment of the model’s predictive accuracy across different levels of data availability.

3.8. Active Learning.

The Bayesian optimization approach used in this study incorporates active learning principles to efficiently explore a discrete set of candidate sequences, referred to as the “pool.” This strategy selects and evaluates sequences in batches, progressively refining the optimization process. The iterative procedure consists of several key steps:

  1. A random batch, denoted as $S_0$, is initially selected from the candidate set $D$. The labels for these points are obtained (e.g., experimental dissociation constants), forming the initial training dataset $D_{train}$. The unexplored dataset is then $D \leftarrow D \setminus S_0$.

  2. A surrogate model $\hat{f}(x)$ is then trained on the dataset $D_{train}$ in order to predict the fitness of unknown points in $D$. Here, it first predicts dissociation constants, which are then fed to a biophysical model that converts them into a fitness proxy. Each fitness prediction has mean $\hat{\mu}(x)$ and variance $\hat{\sigma}^2(x)$.

  3. An acquisition function $\alpha(x; \hat{f})$ is used to determine the utility of acquiring a given point $x$. This function considers various metrics, such as the predicted values for fitness, as well as the associated uncertainties.

  4. The points with the highest utility, denoted as a set $X$, are selected or “acquired.” We obtain the ground truth values for their dissociation constants and add them to the training set ($D_{train} \leftarrow D_{train} \cup X$) while removing them from the unexplored dataset ($D \leftarrow D \setminus X$).

Steps 2 to 4 are repeated iteratively until a stopping criterion is met, such as a fixed number of iterations or insufficient improvement.
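The loop described in steps 1 to 4 can be sketched as follows. This is a simplified illustration under several assumptions: the surrogate is a scikit-learn GP that predicts the fitness proxy directly (the intermediate biophysical conversion is omitted), the acquisition is a UCB score with a fixed weight `beta`, and `oracle` is a hypothetical function that returns the ground-truth label for a pool index.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

def active_learning(Z, oracle, n_init=10, batch=5, rounds=4, beta=1.0, seed=0):
    """Pool-based batch active learning with a GP surrogate and UCB acquisition."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    # Step 1: random initial batch and its labels.
    train = [int(i) for i in rng.choice(n, size=n_init, replace=False)]
    labels = {i: oracle(i) for i in train}
    for _ in range(rounds):
        pool = np.array([i for i in range(n) if i not in labels])
        # Step 2: train the surrogate on everything labeled so far.
        gp = GaussianProcessRegressor(kernel=RationalQuadratic(), alpha=1e-6, random_state=seed)
        gp.fit(Z[train], np.array([labels[i] for i in train]))
        mu, sd = gp.predict(Z[pool], return_std=True)
        # Step 3: score unexplored points with the acquisition function.
        scores = mu + beta * sd
        # Step 4: acquire the top-scoring batch and move it into the training set.
        picked = pool[np.argsort(scores)[::-1][:batch]]
        for i in picked:
            labels[int(i)] = oracle(int(i))
            train.append(int(i))
    return train, labels

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 3))        # stand-in embeddings for the pool
true_fitness = -np.sum(Z ** 2, axis=1)  # toy landscape with a single optimum
train, labels = active_learning(Z, oracle=lambda i: float(true_fitness[i]))
```

The stopping criterion here is a fixed number of rounds; an improvement-based criterion could replace the `for` loop condition.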

In our study, we adopt the following active learning parameters. For the DMS benchmark, we chose an initial training size of 165 mutations (randomly sampled, one mutation per RBD site), corresponding to ∼5% of the dataset, followed by 10 rounds acquiring 50 points each. For the combinatorial benchmark, we chose an initial training size of 20 variants (randomly sampled among single and double mutants), corresponding to ∼0.06% of the dataset, followed by 10 rounds acquiring 10 points each. Active learning acquisition was repeated 10 times for each benchmark, using different initial training sets sampled as described.

3.9. Acquisition Functions.

Various acquisition functions for active learning are considered in this study, each influencing the point selection process differently. These include:

  • $\mathrm{Random}(x, \hat{f})$: Randomly selecting points from the candidate set.

  • $\mathrm{Greedy}(x, \hat{f})$: Selecting points based on the surrogate model's mean ($\hat{\mu}(x)$).

  • $\mathrm{UCB}(x)$: Employing the Upper Confidence Bound strategy, combining the mean and a scaled SD ($\hat{\mu}(x) + \beta \hat{\sigma}(x)$).

To normalize the contribution of the variance term relative to the fitness values in UCB, we chose the coefficient $\beta$ as

$$\beta = 0.2 \times \frac{\mathrm{std}(\mathrm{fitnesses})}{\mathrm{std}(\mathrm{vars})}.$$

This ensures that the uncertainty term (vars) is scaled to have a range comparable to that of the fitness values, adjusted by a scaling factor of 0.2. We explored the impact of the uncertainty weight on UCB exploration in SI Appendix, Fig. S3.
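The three acquisition functions and the scaling of $\beta$ can be sketched as below. The function name and example values are illustrative. Because the text refers to the uncertainty term both as an SD and as "vars", the sketch takes a generic uncertainty array `unc` and leaves that choice to the caller; this is an assumption, not a statement of the exact implementation.

```python
import numpy as np

def acquisition_scores(mu, unc, strategy, weight=0.2, rng=None):
    """Utility scores for candidate points under random, greedy, or UCB acquisition."""
    mu = np.asarray(mu, dtype=float)
    unc = np.asarray(unc, dtype=float)
    if strategy == "random":
        if rng is None:
            rng = np.random.default_rng()
        return rng.random(len(mu))
    if strategy == "greedy":
        return mu
    if strategy == "ucb":
        # Scale the uncertainty so its spread is `weight` times that of the predictions.
        beta = weight * mu.std() / unc.std()
        return mu + beta * unc
    raise ValueError(f"unknown strategy: {strategy}")

mu = np.array([1.0, 2.0, 3.0, 4.0])   # toy predicted fitness values
unc = np.array([4.0, 1.0, 1.0, 1.0])  # toy uncertainty values
greedy = acquisition_scores(mu, unc, "greedy")
ucb = acquisition_scores(mu, unc, "ucb")
```

By construction, the SD of the added uncertainty bonus equals 0.2 times the SD of the predicted fitness values, which keeps the two contributions on a comparable scale.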

3.10. Evaluation of Active Learning Performance.

The primary metric used to evaluate our pipeline is the enrichment factor (EF), defined as

$$\mathrm{EF}(\text{strategy}) = \frac{\#\,\text{Top variants acquired (strategy)}}{\#\,\text{Top variants acquired (random)}},$$

where “top variants” refers to those with fitness in the top p% of the dataset. Unless otherwise specified, we use p = 10% as the default threshold for defining high-fitness variants. An EF value greater than 1 indicates that the strategy outperforms random sampling in identifying top variants.
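The EF computation can be illustrated with a small sketch. The function name and the toy indices are assumptions; the ratio is undefined when random sampling acquires no top variants, a case the sketch does not handle.

```python
import numpy as np

def enrichment_factor(fitness, acquired, acquired_random, top_pct=10.0):
    """Ratio of top-p% variants found by a strategy to those found by random sampling."""
    fitness = np.asarray(fitness, dtype=float)
    threshold = np.percentile(fitness, 100.0 - top_pct)
    is_top = fitness >= threshold
    return is_top[acquired].sum() / is_top[acquired_random].sum()

# Toy dataset of 100 variants whose fitness equals their index; the top 10% are 90..99.
fitness_values = np.arange(100.0)
strategy_picks = [95, 96, 97, 98, 10]  # four top variants
random_picks = [99, 1, 2, 3, 4]        # one top variant
ef = enrichment_factor(fitness_values, strategy_picks, random_picks)
```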

Additionally, we compute the AUC using the active learning predicted fitness as a score to identify whether a variant belongs to the top 10% of the dataset. Notably, this metric is evaluated on the entire dataset (including both tested and untested variants) to ensure the results are not negatively biased against models that have acquired all the top variants.

Last, we calculate the embedding variance of the tested variants as a quantitative measure of sequence exploration diversity. This metric is defined as the mean variance across all dimensions of the ESM3 embeddings among the acquired variants, providing intuition into how broadly the acquisition strategy samples the protein sequence space.
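As a minimal sketch of this diversity metric, the embedding variance is the per-dimension variance across the acquired variants, averaged over dimensions. The function name and toy arrays are illustrative.

```python
import numpy as np

def embedding_variance(embeddings):
    """Mean over embedding dimensions of the variance across acquired variants."""
    return float(np.var(np.asarray(embeddings, dtype=float), axis=0).mean())

narrow = np.array([[0.0, 0.0], [0.2, 0.4]])  # acquired variants close together
broad = np.array([[0.0, 0.0], [2.0, 4.0]])   # acquired variants spread out
v_narrow = embedding_variance(narrow)
v_broad = embedding_variance(broad)
```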

3.11. Comparison with GISAID Data.

We analyzed SARS-CoV-2 spike sequences from the GISAID database (47), collecting 15,371,428 sequences up to April 14, 2023. Following the methodology of Starr et al. (48), we implemented a sequence filtration process. Sequences were excluded if they 1) were from nonhuman hosts, 2) fell outside the length range of 1,260 to 1,276 amino acids, or 3) contained unicode errors, gaps, or ambiguous characters. The remaining sequences were aligned using MAFFT (49).

The final dataset comprised 11,976,984 submissions, containing 25,725 unique RBD sequences. For each unique RBD sequence, we tracked its frequency of occurrence and estimated its emergence time using the fifth percentile of its temporal distribution. To ensure robustness, we excluded singleton sequences that appeared only once in the dataset.

We assessed site mutability by analyzing the amino acid diversity at each position within the RBD. A site was labeled as “highly mutable” if it exhibited at least 9 distinct amino acid substitutions (out of a possible 20), each observed in RBD variants that appeared at least 10 times in the GISAID database. This prevalence threshold reduces the impact of sequencing errors and helps ensure that the identified mutations reflect a true evolutionary advantage rather than random noise, allowing us to identify sites under selection pressure during the pandemic. To evaluate the effectiveness of our active learning pipeline, we used site sampling frequency as a performance metric and calculated the Area Under the Receiver Operating Characteristic Curve (AUC). The AUC score quantifies our algorithm's ability to identify positions that emerged as mutation hotspots throughout the pandemic. AUC scores were also computed for mutation thresholds other than 9; see SI Appendix.
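The labeling rule for highly mutable sites can be sketched as follows. The input format (a mapping from a tuple of `(site, amino_acid)` mutations to a sequence count) is a hypothetical representation chosen for illustration, and the toy example lowers the substitution threshold so that the result is easy to follow.

```python
from collections import defaultdict

def highly_mutable_sites(variant_counts, min_count=10, min_substitutions=9):
    """Sites with at least `min_substitutions` distinct substitutions, counting only
    substitutions seen in variants observed at least `min_count` times."""
    subs_per_site = defaultdict(set)
    for mutations, count in variant_counts.items():
        if count < min_count:
            continue  # skip rare variants, which may reflect sequencing errors
        for site, amino_acid in mutations:
            subs_per_site[site].add(amino_acid)
    return {site for site, aas in subs_per_site.items() if len(aas) >= min_substitutions}

# Toy counts: each key is one RBD variant described by its mutations.
toy_counts = {
    ((484, "K"),): 500,
    ((484, "Q"),): 20,
    ((484, "A"), (501, "Y")): 12,
    ((417, "N"),): 3,  # below the prevalence threshold
}
hot_sites = highly_mutable_sites(toy_counts, min_count=10, min_substitutions=3)
```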

3.12. Baseline for Identification of Highly Mutable Sites.

We define a baseline model to identify high-fitness mutations, trained on pandemic data. The training data include mutations observed in the GISAID database within a variant having a minimum count of 1,000, under the assumption that variants with counts below this threshold lack reliable fitness estimates. Specifically, we selected mutations in variants that occurred prior to the cutoff date (December 2020 or December 2022) and had counts exceeding 1,000. The baseline was trained on 14 mutations for the 2020 cutoff and 64 mutations for the 2022 cutoff. The single-batch acquisition size is 665 minus the training size, ensuring that the total number of variants acquired by the active learning process and by the baseline is the same. We then computed the acquisition score for every site, based on this single-batch acquisition, and compared it to GISAID data.

Supplementary Material

Appendix 01 (PDF)

pnas.2503742122.sapp.pdf (23.7MB, pdf)

Acknowledgments

This work is supported by NIH R35GM139571. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We gratefully acknowledge all data contributors, i.e., the authors and their originating laboratories responsible for obtaining the specimens, and their submitting laboratories for generating the genetic sequence and metadata and sharing via the Global Initiative on Sharing All Influenza Data (GISAID), on which this research is based.

Author contributions

M.H., D.W., and E.I.S. designed research; M.H. and D.W. performed research; M.H., D.W., and J.L. analyzed data; and M.H., D.W., and E.I.S. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Contributor Information

Dianzhuo Wang, Email: johnwang@g.harvard.edu.

Eugene I. Shakhnovich, Email: shakhnovich@chemistry.harvard.edu.

Data, Materials, and Software Availability

Code and data have been deposited in GitHub (https://github.com/m-huot/VIRAL) (50). All other data are included in the manuscript and/or SI Appendix.

References

  • 1. Ozono S., et al., SARS-CoV-2 D614G spike mutation increases entry efficiency with enhanced ACE2-binding affinity. Nat. Commun. 12, 848 (2021).
  • 2. Barton M. I., et al., Effects of common mutations in the SARS-CoV-2 Spike RBD and its ligand, the human ACE2 receptor on binding affinity and kinetics. eLife 10, e70658 (2021).
  • 3. Tuekprakhon A., et al., Antibody escape of SARS-CoV-2 Omicron BA.4 and BA.5 from vaccine and BA.1 serum. Cell 185, 2422–2433.e13 (2022).
  • 4. Nabel K. G., et al., Structural basis for continued antibody evasion by the SARS-CoV-2 receptor binding domain. Science 375, eabl6251 (2022).
  • 5. Wang Q., et al., Alarming antibody evasion properties of rising SARS-CoV-2 BQ and XBB subvariants. Cell 186, 279–286.e8 (2023).
  • 6. Carabelli A. M., et al., SARS-CoV-2 variant biology: Immune escape, transmission and fitness. Nat. Rev. Microbiol. 21, 162–177 (2023).
  • 7. Rochman N. D., et al., Epistasis at the SARS-CoV-2 receptor-binding domain interface and the propitiously boring implications for vaccine escape. mBio 13, e00135–22 (2022).
  • 8. Cao Y., et al., Imprinted SARS-CoV-2 humoral immunity induces convergent omicron RBD evolution. Nature 614, 521–529 (2023).
  • 9. Elbe S., Buckland-Merrett G., Data, disease and diplomacy: GISAID's innovative contribution to global health. Glob. Chall. 1, 33–46 (2017).
  • 10. Obermeyer F., et al., Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science 376, 1327–1332 (2022).
  • 11. Ito J., et al., A protein language model for exploring viral fitness landscapes. Nat. Commun. 16, 4236 (2025).
  • 12. Maher M. C., et al., Predicting the mutational drivers of future SARS-CoV-2 variants of concern. Sci. Transl. Med. 14, eabk3445 (2022).
  • 13. Wang D., Huot M., Mohanty V., Shakhnovich E. I., Biophysical principles predict fitness of SARS-CoV-2 variants. Proc. Natl. Acad. Sci. U.S.A. 121, e2314518121 (2024).
  • 14. Chéron N., Serohijos A. W. R., Choi J. M., Shakhnovich E. I., Evolutionary dynamics of viral escape under antibodies stress: A biophysical model. Protein Sci. 25, 1332–1340 (2016).
  • 15. Rotem A., et al., Evolution on the biophysical fitness landscape of an RNA virus. Mol. Biol. Evol. 35, 2390–2400 (2018).
  • 16. Zhang Q. E., et al., SARS-CoV-2 omicron XBB lineage spike structures, conformations, antigenicity, and receptor recognition. Mol. Cell. 84, 2747–2764 (2024).
  • 17. Upadhyay V., Panja S., Lucas A., Patrick C., Mallela K. M., Biophysical evolution of the receptor-binding domains of SARS-CoVs. Biophys. J. 122, 4489–4502 (2023).
  • 18. Starr T. N., et al., Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell 182, 1295–1310.e20 (2020).
  • 19. Moulana A., et al., The landscape of antibody binding affinity in SARS-CoV-2 omicron BA.1 evolution. eLife 12, e83442 (2023).
  • 20. Hie B., Bryson B. D., Berger B., Leveraging uncertainty in machine learning accelerates biological discovery and design. Cell Syst. 11, 461–477.e9 (2020).
  • 21. Khalak Y., Tresadern G., Hahn D. F., De Groot B. L., Gapsys V., Chemical space exploration with active learning and alchemical free energies. J. Chem. Theory Comput. 18, 6259–6270 (2022).
  • 22. Graff D. E., Shakhnovich E. I., Coley C. W., Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem. Sci. 12, 7866–7881 (2021).
  • 23. Romero P. A., Krause A., Arnold F. H., Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U.S.A. 110, E193–E201 (2013).
  • 24. Gessner A., Ober S. W., Vickery O., Oglić D., Uçar T., Active learning for affinity prediction of antibodies. arXiv [Preprint] (2024). https://arxiv.org/pdf/2406.07263 (Accessed 21 September 2024).
  • 25. Hayes T., et al., Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025).
  • 26. Seeger M., Gaussian processes for machine learning. Int. J. Neural Syst. 14, 69–106 (2004).
  • 27. Zhou T., et al., Structure-based design with tag-based purification and in-process biotinylation enable streamlined development of SARS-CoV-2 spike molecular probes. SSRN Electron. J. 33, 108322 (2020).
  • 28. Loux T., Wang D., Shakhnovich E. I., More structure, less accuracy: ESM3's binding prediction paradox. bioRxiv [Preprint] (2024). 10.1101/2024.12.09.627585 (Accessed 15 December 2024).
  • 29. Addetia A., et al., Neutralization, effector function and immune imprinting of omicron variants. Nature 621, 592–601 (2023).
  • 30. Meier J., Rao R., Verkuil R., Liu J., Sercu T., Rives A., Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).
  • 31. Eisenstein M., Active machine learning helps drug hunters tackle biology. Nat. Biotechnol. 38, 512–514 (2020).
  • 32. Łuksza M., Lässig M., A predictive fitness model for influenza. Nature 507, 57–61 (2014).
  • 33. Meijers M., Ruchnewitz D., Eberhardt J., Łuksza M., Lässig M., Population immunity predicts evolutionary trajectories of SARS-CoV-2. Cell 186, 5151–5164.e13 (2023).
  • 34. Allman B. E., Vieira L., Diaz D. J., Wilke C. O., A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks. J. R. Soc. Interface 22, 20240598 (2025).
  • 35. Rochman N. D., et al., Ongoing global and regional adaptive evolution of SARS-CoV-2. Proc. Natl. Acad. Sci. U.S.A. 118, e2104241118 (2021).
  • 36. Groth P. M., Kerrn M. H., Olsen L., Salomon J., Boomsma W., Kermut: Composite kernel regression for protein variant effects. arXiv [Preprint] (2024). 10.48550/arXiv.2407.00002 (Accessed 20 December 2024).
  • 37. Barnes C. O., et al., Structures of human antibodies bound to SARS-CoV-2 spike reveal common epitopes and recurrent features of antibodies. Cell 182, 828–842 (2020).
  • 38. Barnes C. O., et al., SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies. Nature 588, 682–687 (2020).
  • 39. Huo J., et al., Neutralizing nanobodies bind SARS-CoV-2 spike RBD and block interaction with ACE2. Nat. Struct. Mol. Biol. 27, 846–854 (2020).
  • 40. Tortorici M. A., et al., Broad sarbecovirus neutralization by a human monoclonal antibody. Nature 597, 103–108 (2021).
  • 41. Chang S., et al., Strategy to develop broadly effective multivalent COVID-19 vaccines against emerging variants based on AD5/35 platform. Proc. Natl. Acad. Sci. U.S.A. 121, e2313681121 (2024).
  • 42. Cohen A. A., et al., Mosaic RBD nanoparticles protect against challenge by diverse sarbecoviruses in animal models. Science 377, eabq0839 (2022).
  • 43. Mohanty V., Shakhnovich E. I., Biophysical fitness landscape design traps viral evolution. bioRxiv [Preprint] (2025). 10.1101/2025.03.30.646233 (Accessed 1 April 2025).
  • 44. Moulana A., et al., Compensatory epistasis maintains ACE2 affinity in SARS-CoV-2 omicron BA.1. Nat. Commun. 13, 7011 (2022).
  • 45. Starr T. N., et al., SARS-CoV-2 RBD antibodies that maximize breadth and resistance to escape. Nature 597, 97–102 (2021).
  • 46. Lin Z., et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
  • 47. Elbe S., Buckland-Merrett G., Data, disease and diplomacy: GISAID's innovative contribution to global health. Glob. Chall. 1, 33–46 (2017).
  • 48. Starr T. N., et al., Prospective mapping of viral mutations that escape antibodies used to treat COVID-19. Science 371, 850–854 (2021).
  • 49. Katoh K., Standley D. M., MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30, 772–780 (2013).
  • 50. Huot M., et al., VIRAL. GitHub. https://github.com/m-huot/VIRAL. Deposited 21 May 2025.
