Early computational detection of potential high-risk SARS-CoV-2 variants

Karim Beguir; Marcin J Skwark; Yunguan Fu; Thomas Pierrot; Nicolas Lopez Carranza; Alexandre Laterre; Ibtissem Kadri; Abir Korched; Anna U Lowegard; Bonny Gaby Lui; Bianca Sänger; Yunpeng Liu; Asaf Poran; Alexander Muik; Uğur Şahin

2023 Feb 2;155:106618. doi: 10.1016/j.compbiomed.2023.106618

PMCID: PMC9892295 PMID: 36774893

Abstract

The ongoing COVID-19 pandemic is leading to the discovery of hundreds of novel SARS-CoV-2 variants daily. While most variants do not impact the course of the pandemic, some variants pose an increased risk when the acquired mutations allow better evasion of antibody neutralisation or increased transmissibility. Early detection of such high-risk variants (HRVs) is paramount for the proper management of the pandemic. However, experimental assays to determine immune evasion and transmissibility characteristics of new variants are resource-intensive and time-consuming, potentially leading to delays in appropriate responses by decision makers. Presented herein is a novel in silico approach combining spike (S) protein structure modelling and large protein transformer language models on S protein sequences to accurately rank SARS-CoV-2 variants for immune escape and fitness potential. Both metrics were experimentally validated using in vitro pseudovirus-based neutralisation test and binding assays and were subsequently combined to explore the changing landscape of the pandemic and to create an automated Early Warning System (EWS) capable of evaluating new variants in minutes and risk-monitoring variant lineages in near real-time. The system accurately pinpoints the putatively dangerous variants by selecting on average less than 0.3% of the novel variants each week. The EWS flagged all 16 variants designated by the World Health Organization (WHO) as variants of interest (VOIs) if applicable or variants of concern (VOCs) otherwise with an average lead time of more than one and a half months ahead of their designation as such.

1. Introduction

Since the emergence of the human coronavirus SARS-CoV-2 in Wuhan in December 2019, over 510,000 unique spike (S) protein sequences (as of October 20, 2022) have been identified in the protein-coding viral sequences deposited in the GISAID [[1], [2], [3]] (Global Initiative on Sharing All Influenza Data) database. Across these sequences, over 12,000 individual missense mutations have been observed in the S protein, the key target for neutralising antibodies (nAbs). While most mutations either reduce the overall fitness of the virus or bear no consequence to its features, some individual mutations or combinations of mutations lead to high-risk variants (HRVs) with modified immune evasion capabilities and/or improved transmissibility. The World Health Organization (WHO), in collaboration with various partners, works to identify new variants as they arise in the population and to label them, based on certain characteristics, as variants under monitoring (VUMs), variants of interest (VOIs) or variants of concern (VOCs). For example, the Alpha (B.1.1.7, Q) VOC spread widely due to higher transmissibility compared to the Wuhan variant, while the Beta (B.1.351) VOC has been shown to be less effectively neutralised by both convalescent sera and antibodies elicited by approved SARS-CoV-2 wild-type S protein-based vaccines [4]. The Delta (B.1.617.2, AY) variant – characterised by high transmissibility – led to increased mortality and triggered a renewed increase in cases in countries with both high and low vaccination rates such as the United Kingdom [5] and India [6]. More recently, the heavily mutated Omicron (BA.1) variant was amongst the quickest variants to be designated as a VOC by the WHO due to a combination of widespread dissemination and several concerning mutations in the S and other proteins [7]. Omicron sublineages BA.2, BA.4 and BA.5 have since acquired new mutational profiles that further escape host immune responses while simultaneously improving their ability to spread in the human population [[8], [9], [10]].

As new sequences continue to emerge, the potential for the generation of variants that are both fit and highly immune resistant creates a significant public health challenge. The transmissibility and immune escape potential of a given variant can be assessed experimentally; however, such methods are resource-intensive and time-consuming, and cannot be scaled to address the multitude of emergent variants. Available data show that thousands of new S protein variants are emerging each week at an increasing rate. For example, in the GISAID [[1], [2], [3]] database an average of 70 previously unseen variants per week emerged in September 2020 whereas an average of 11,500 per week arose in February 2022. Moreover, these numbers were likely underestimates given limited viral sequencing and data deposition in many countries [11]. Consequently, it is not feasible for health authorities to conduct experimental risk assessments whenever a new variant is identified, despite the benefits of a proactive stance in detecting HRVs before their spread. Due to these challenges and the importance of tracking the unpredictable evolution of the pandemic, calls have been made to strengthen the surveillance capacity for new SARS-CoV-2 variants [12].

In this work, a new method is presented to evaluate SARS-CoV-2 S protein variants leveraging in silico structural modelling and machine learning (ML)-based language modelling to capture features of a given variant's fitness as well as its immune escape properties. The emergence of recurrent and attention-based deep neural networks in the domain of Natural Language Processing (NLP) has led to impressive results for text generation and translation, and recently this technology has been leveraged to learn the language of biology for protein mutation analysis [13,14]. The models are trained on large datasets of available protein sequences in an unsupervised manner, meaning that protein-associated labels are not required. Once trained, information about protein properties is represented in two distinct modes in the model. On the one hand, the probabilities returned by the model indicate how likely a given sequence is to exist as a protein. On the other hand, the outputs of the model's layers provide a high-dimensional representation for each sequence – known as the embedding of the protein – which can be used either directly or to train a classification or regression model. Recently, Meier et al. demonstrated that these models also capture the effects of mutations on protein function [15]. Presented herein are two scores, the immune escape score and the fitness prior score, which incorporate such an ML-based model in conjunction with structural modelling to capture important aspects of the SARS-CoV-2 S protein sufficiently and non-specifically while still being grounded in biologically relevant properties. These scores were validated using previously published and newly generated experimental data and were used to build an Early Warning System (EWS) to detect new, emerging HRVs prior to their designation as either a VOI or a VOC by the WHO.

2. Materials and Methods

Data. The S protein sequences were collected from GISAID [[1], [2], [3]] up until May 18th, 2022 with lineages assigned using pangolin [16]. Multiple data cleaning procedures were performed, including removing sequences not conforming to basic biological assumptions, sequences having more than ten continuous amino acid mutations, and sequences whose submission date was more than two months later than their collection date. The details of this cleaning procedure are described further in the Supplementary Materials and Methods with the resulting data summarised and visualised in Fig. S1.

Epitope alteration score. The epitope alteration score counts the number of antibody neutralising epitopes potentially impacted by mutations. It was calculated by first mapping 719 binding epitopes observed in 332 experimentally resolved structures of nAbs [[17], [18], [19], [20]] onto the S protein using available protein structures (Table S1). For each structure, an epitope per antibody was calculated as the set of positions that were in contact with an antibody, where two residues were considered to be in contact if the smallest Euclidean distance between their atoms was smaller than 4 Ångstroms. Each nAb was evaluated and considered to be evaded by a variant if any position of its epitope was mutated. The number of unique evaded nAbs normalised with respect to the total number of nAbs considered was defined as the epitope alteration score.
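The calculation described above can be illustrated with a minimal sketch. It assumes atom coordinates are available as plain (x, y, z) tuples and that each nAb's epitope has already been reduced to a set of S protein positions; the function names and data layout are hypothetical and are not taken from the authors' pipeline.

```python
from math import dist

CONTACT_CUTOFF = 4.0  # Ångstroms, the contact threshold stated in the text


def in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """True if the smallest distance between any two atoms is below the cutoff."""
    return any(dist(a, b) < cutoff for a in atoms_a for b in atoms_b)


def epitope(spike_residues, antibody_residues):
    """Return the S protein positions in contact with any antibody residue.

    spike_residues: dict mapping position -> list of (x, y, z) atom coordinates
    antibody_residues: list of residues, each a list of (x, y, z) coordinates
    (both layouts are assumptions made for this illustration)
    """
    return {
        pos
        for pos, atoms in spike_residues.items()
        if any(in_contact(atoms, ab_atoms) for ab_atoms in antibody_residues)
    }


def epitope_alteration_score(epitopes_by_nab, mutated_positions):
    """Fraction of nAbs that have at least one mutated epitope position."""
    mutated = set(mutated_positions)
    evaded = [nab for nab, positions in epitopes_by_nab.items() if positions & mutated]
    return len(evaded) / len(epitopes_by_nab)
```

With four hypothetical antibodies, of which one has position 484 in its epitope, a variant mutated only at position 484 would score 1/4 = 0.25.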

Semantic change score. The semantic change score estimates a variant's functional change based on its distance from a reference in embedding space. An attention-based transformer model [21] was first pre-trained over the large collection of diverse proteins included in UniRef100 (Fig. S1a). The transformer model was then fine-tuned each month on the SARS-CoV-2 S protein variants that had been registered at least twice in GISAID [[1], [2], [3]]. At the inference step, the language model returned a probability distribution over the 20 natural amino acids for each residue position given the surrounding sequence, thereby leveraging the underlying biology and implicit evolution of the large number of sequences seen during training. Finally, the semantic change score was computed by taking the minimum score over a set of reference sequences: the wild-type variant, the D614G variant and all designated VOCs. For more details of the model architecture, unsupervised training, and the inference and machine learning calculations, see the Supplementary Materials and Methods section.
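As an illustration of the final step, the sketch below takes the minimum embedding distance over a set of reference sequences. The distance metric is not specified in this section; the L1 norm used here is an assumption, as are the function names and the dictionary layout of the reference embeddings.

```python
def l1_distance(u, v):
    """L1 (Manhattan) distance between two embedding vectors (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def semantic_change(variant_embedding, reference_embeddings, distance=l1_distance):
    """Minimum distance from the variant to any reference embedding.

    reference_embeddings: dict mapping a reference name (for example
    "wild-type", "D614G" or a designated VOC) to its embedding vector.
    """
    return min(distance(variant_embedding, ref) for ref in reference_embeddings.values())
```

Taking the minimum means that a variant close to any one reference, such as an already designated VOC, receives a low score even if it is far from the wild-type sequence.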

ACE2 binding score. The ACE2 binding score estimates the binding between each variant S protein's receptor-binding domain (RBD) and the ACE2 protein. First, 402 RBD-differentiated variants were selected for in silico simulation. For each variant, an implementation of the ML-based protein complex structure prediction pipeline (analogous to Jumper et al. [22]) was used to produce a putative structure of the RBD-ACE2 complex. During the structure generation, standard optimisation methodologies were applied to the coordinates of the predicted complex to optimise the knowledge-based potential [23]. This initial estimate was subsequently used for a local complex optimisation protocol (see Supplementary Materials and Methods) which was repeated 500 times per estimate resulting in >250,000 structures in total. Finally, to calculate the ACE2 binding score, the median binding energy (the change in estimated Gibbs free energy, ΔG, between bound and unbound states) across the optimised structures was determined for each variant (Fig. S2). For sequences with RBD mutation combinations for which there were no simulation results, a KNeighborsRegressor model was built where the training data included sequences with known RBDs along with their language model embedding and energy metrics. For each such sequence, its 7 nearest neighbours in the embedding space were identified and their energy metrics were averaged to give the estimate.
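The nearest-neighbour estimate for unsimulated RBDs can be sketched in plain Python as follows. The name KNeighborsRegressor in the text matches the scikit-learn class of that name; the Euclidean distance and unweighted averaging used here correspond to that class's defaults, but whether the authors kept those defaults is an assumption. The function name and data layout are hypothetical.

```python
from math import dist


def knn_binding_energy(query_embedding, training_set, k=7):
    """Estimate a binding energy as the mean over the k nearest embeddings.

    training_set: list of (embedding, energy) pairs for variants whose
    RBD-ACE2 complexes were simulated (layout assumed for illustration).
    """
    if len(training_set) < k:
        raise ValueError("need at least k simulated variants")
    # Sort simulated variants by Euclidean distance to the query embedding.
    nearest = sorted(training_set, key=lambda item: dist(query_embedding, item[0]))[:k]
    # Unweighted average of the neighbours' energies.
    return sum(energy for _, energy in nearest) / k
```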

Conditional log-likelihood score. The conditional log-likelihood score estimates fitness by considering the likelihood of the particular combination of mutations found in each variant's S protein. The same trained transformer model described previously (for details see the Supplementary Materials and Methods section) was leveraged to calculate the log-likelihood (Fig. 1b). The conditional log-likelihood score was calculated by comparing the raw log-likelihood of each S protein variant in question to other variants with similar mutational loads as opposed to the entire viral population.
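A minimal sketch of the idea follows. The raw log-likelihood is summed from per-position probabilities, and the conditional score is then expressed as a percentile among variants with a similar mutation count. The percentile formulation and the window of two mutations are assumptions made for illustration; the exact comparison used by the authors is given in their Supplementary Materials and Methods.

```python
from math import log


def log_likelihood(sequence, position_probs):
    """Sum of log-probabilities of the observed residue at each position.

    position_probs[i] maps amino acid -> model probability at position i.
    """
    return sum(log(position_probs[i][aa]) for i, aa in enumerate(sequence))


def conditional_log_likelihood(ll, n_mutations, population, window=2):
    """Fraction of similarly mutated variants with log-likelihood <= ll.

    population: list of (log_likelihood, mutation_count) pairs.
    window: maximum difference in mutation count for a variant to count
    as a peer (hypothetical parameter).
    """
    peers = [p_ll for p_ll, p_n in population if abs(p_n - n_mutations) <= window]
    if not peers:
        return None  # no comparable variants observed
    return sum(p_ll <= ll for p_ll in peers) / len(peers)
```

Under this formulation, a heavily mutated variant with a low raw log-likelihood can still score highly if it is more likely than the other variants carrying a comparable number of mutations.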

Fig. 1 — A schematic of the Early Warning System (EWS), a protocol for the analysis and early detection of high-risk SARS-CoV-2 variants. **(a)** Illustrated on the left-hand side, structural modelling was used to predict the binding affinity of the SARS-CoV-2 S protein to the host protein, ACE2, and to score the mutated epitope regarding its impact on immune escape. The right-hand side panel represents the machine learning (ML)-based modelling that was used to extract implicit information from unlabeled data for the hundreds of thousands of registered variants in the GISAID [[1], [2], [3]] database. As shown in the middle panel, the EWS relies on the information from structural modelling and ML-based modelling to compute an *immune escape score* and a *fitness prior score*. **(b)** A schematic of the ML model structure for assessing semantic change and log-likelihood. Once trained (Fig. S1a), the model received a variant S protein sequence as input and returned an embedding vector of the S protein sequence as well as probabilities over amino acids for each residue position (Fig. S1b). The embedding vector was then used to calculate the semantic change from a set of reference variants while the probabilities were used to compute the log-likelihood (see Materials and Methods).

Growth score. The growth score measures the change in prevalence of each S protein sequence and was computed using the GISAID [[1], [2], [3]] metadata. At a given date, only S protein sequences that had been submitted within the last 8 weeks were considered. For each lineage, its proportion amongst all submissions was calculated for the 8-week window and for the last week, denoted by r_win and r_last, respectively. The growth of the lineage was defined by their ratio, r_last/r_win, measuring the change of the proportion. Having a value larger than one indicates that the lineage is rising in prevalence and smaller than one indicates a decline. To eliminate trivial spikes in growth, this metric was computed with regard to the total count of the base lineage and not the individual sequence.
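The ratio r_last/r_win can be sketched as below, assuming the metadata has been reduced to (submission date, lineage) pairs. The function name, the data layout and the handling of empty windows are illustrative assumptions.

```python
from datetime import date, timedelta


def growth_score(submissions, lineage, today, window_weeks=8):
    """Ratio of a lineage's last-week proportion to its 8-week proportion.

    submissions: list of (submission_date, lineage) pairs (assumed layout).
    Returns None when the ratio is undefined.
    """
    window_start = today - timedelta(weeks=window_weeks)
    last_start = today - timedelta(weeks=1)
    window = [lin for d, lin in submissions if window_start < d <= today]
    last = [lin for d, lin in submissions if last_start < d <= today]
    if not window or not last:
        return None
    r_win = window.count(lineage) / len(window)   # proportion over the window
    r_last = last.count(lineage) / len(last)      # proportion in the last week
    if r_win == 0:
        return None
    return r_last / r_win
```

For example, a lineage making up 20% of the 8-week window but 50% of the last week has a growth score of 2.5, indicating rising prevalence.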

Immune escape score and fitness prior score. The epitope alteration score, semantic change, ACE2 binding score, conditional log-likelihood score and growth score have different scales and units. Therefore, a scaling strategy was introduced to set scores in the range of 0–100 so they could be compared directly. This procedure is detailed further in the Supplementary Materials and Methods. Subsequently, the immune escape score was computed as the average of the scaled semantic change score and the scaled epitope alteration score, and the fitness prior score was computed as the average of the scaled ACE2 binding score, scaled conditional log-likelihood and scaled growth score.
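The combination step can be sketched as follows. The actual scaling procedure is described only in the Supplementary Materials and Methods; the min-max scaling below is a stand-in assumption, and the sketch further assumes that every raw score has already been oriented so that a higher value indicates higher risk. All function names are hypothetical.

```python
def scale_0_100(values):
    """Min-max scale a list of raw scores to the range 0-100 (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0] * len(values)  # arbitrary midpoint when all values tie
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def immune_escape_score(scaled_semantic_change, scaled_epitope_alteration):
    """Average of the two scaled immune escape components."""
    return (scaled_semantic_change + scaled_epitope_alteration) / 2


def fitness_prior_score(scaled_ace2_binding, scaled_conditional_ll, scaled_growth):
    """Average of the three scaled fitness components."""
    return (scaled_ace2_binding + scaled_conditional_ll + scaled_growth) / 3
```

Because the scaling is computed over the variants under consideration, a variant's scaled score depends on which other variants are circulating at the time.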

Retrospective detection of HRVs. For each week, the language model was trained on S protein variants up to that week, the structures used were limited to those available prior to the analysis date, VOC reference sequences were included only for the weeks subsequent to each variant's appearance, and each sequence's lineage was limited to its associated lineage at the time of analysis. Each week, all sequences reported in the previous 8 weeks were first ranked according to immune escape score. Sequences were then filtered down to only those which were novel to that particular week and had not already been labelled as a VOC by the WHO. Next, to promote diversity in the detected pool of sequences, the sequences were clustered by their RBD mutations. For each cluster, only the highest scoring sequence was kept. Finally, the 12 top scoring sequences were selected to make up the watch list.
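The weekly selection steps (rank, filter, cluster, keep the best per cluster, take the top 12) can be sketched as below. Clustering is approximated here by grouping sequences with identical RBD mutation sets, which is an assumption; the dictionary keys and the function name are likewise hypothetical.

```python
def build_watch_list(candidates, top_n=12):
    """Select the weekly watch list from scored sequences.

    candidates: list of dicts with keys "id", "immune_escape",
    "rbd_mutations" (iterable of mutation labels), "is_novel" and "is_voc".
    """
    # 1. Rank all sequences by immune escape score, highest first.
    ranked = sorted(candidates, key=lambda c: c["immune_escape"], reverse=True)
    # 2. Keep only sequences novel to this week and not already labelled a VOC.
    eligible = [c for c in ranked if c["is_novel"] and not c["is_voc"]]
    # 3. Keep the highest-scoring sequence per RBD mutation set.
    seen, diverse = set(), []
    for c in eligible:
        key = frozenset(c["rbd_mutations"])
        if key not in seen:
            seen.add(key)
            diverse.append(c)
    # 4. The top-scoring sequences make up the watch list.
    return diverse[:top_n]
```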

VSV-SARS-CoV-2 S pseudovirus neutralisation assay. A recombinant replication-deficient VSV vector that encodes a green fluorescent protein (GFP) and luciferase (Luc) instead of the VSV-glycoprotein (VSV-G) was pseudotyped with SARS-CoV-2 spike (S) protein derived from either the Wuhan reference strain (NCBI Ref: 43740568) or variants of interest (listed in Table S5) according to published pseudotyping protocols [24]. HEK293T/17 monolayers transfected to express SARS-CoV-2 S protein with the C-terminal cytoplasmic 19 amino acids truncated (SARS-CoV-2-S[CΔ19]) were inoculated with the VSVΔG-GFP/Luc vector. After incubation for 1 h at 37 °C, the inoculum was removed, and cells were washed with PBS before a medium supplemented with anti-VSV-G antibody (clone 8G5F11, Kerafast) was added to neutralise the residual input virus. VSV-SARS-CoV-2 pseudovirus-containing medium was collected 20 h after inoculation, passed through a 0.2 μm filter and stored at −80 °C. For pseudovirus neutralisation assays, 40,000 Vero 76 cells were seeded per well in 96-well plates. Sera were serially diluted 1:2 in culture medium starting with a 1:15 dilution (dilution range of 1:15 to 1:7680). VSV-SARS-CoV-2-S pseudoparticles were diluted in a culture medium to obtain either ∼1000 or ∼200 transducing units (TU) in the assay. The same input virus amounts for all pseudoviruses were used within an individual experiment (Table S5). Serum dilutions were mixed 1:1 with pseudovirus for 30 min at room temperature prior to addition to Vero 76 cell monolayers and incubation at 37 °C for 24 h. Supernatants were removed, and the cells were lysed with luciferase reagent (Promega). Luminescence was recorded and neutralisation titers were calculated by generating a four-parameter logistic fit of the percent neutralisation at each serial serum dilution. The pVNT₅₀ is reported as the interpolated reciprocal of the dilution yielding a 50% reduction in luminescence. If no neutralisation yielding a 50% reduction in luminescence was observed, an arbitrary titer value of 7.5, half of the limit of detection (LOD), was reported.
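For reference, one common parameterisation of a four-parameter logistic curve is given below, together with the titer that follows from it. The symbols are introduced here for illustration and the exact parameterisation used in the analysis is not stated in the text: d is the reciprocal serum dilution, N(d) the percent neutralisation, T and B the upper and lower asymptotes, c the inflection point and h > 0 the slope.

```latex
N(d) = B + \frac{T - B}{1 + \left( d / c \right)^{h}},
\qquad
N(\mathrm{pVNT}_{50}) = 50
\;\Longrightarrow\;
\mathrm{pVNT}_{50} = c \left( \frac{T - 50}{50 - B} \right)^{1/h},
\quad B < 50 < T .
```

When the asymptotes are fixed at T = 100 and B = 0, the expression reduces to pVNT₅₀ = c, so the titer coincides with the inflection point of the fitted curve.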

Binding kinetics of RBD variants to ACE2 using surface plasmon resonance (SPR) spectroscopy. Binding kinetics of RBD variants were determined using a Biacore T200 device (Cytiva) with HBS-EP + running buffer (BR100669, Cytiva) at 25 °C. Carboxyl groups on the CM5 sensor chip matrix were activated with a mixture of 1-ethyl-3-(3-dimethylaminopropyl) carbodiimide hydrochloride (EDC) and N-hydroxysuccinimide (NHS) to form active esters for the reaction with amine groups. Anti-mouse-Fc-antibody (BR100838, Cytiva) was diluted in 10 mM sodium acetate buffer pH 5 (30 μg/mL) for covalent coupling to immobilisation level of ∼6000 response units (RU). Free N-hydroxysuccinimide esters on the sensor surface were deactivated with ethanolamine-HCl. Human ACE2-mFc (10108-H05H, Sino Biological Inc.) was diluted to 5 μg/mL with HBS-EP + buffer and applied at 10 μL/min for 15 s to the active flow cell for capture by immobilised antibody, while the reference flow cell was treated with buffer. Binding analysis of captured hACE2-mFc to RBD variants (for a list see Supplementary Materials and Methods) was performed using a multi-cycle kinetic method with concentrations ranging from 3.125 to 50 nM. An association period of 120 s was followed by a dissociation period of 300 s with a constant flow rate of 30 μL/min and a final regeneration step. Binding kinetics were calculated using a global kinetic fit model (1:1 Langmuir, Biacore T200 Evaluation Software Version 3.1, Cytiva).
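For reference, the standard 1:1 Langmuir interaction model relates the measured response to the kinetic rate constants as follows. The symbols are the conventional ones and are introduced here for illustration: R(t) is the response, C the analyte (RBD) concentration, R_max the maximal response, and k_a and k_d the association and dissociation rate constants.

```latex
\frac{dR}{dt} = k_a \, C \left( R_{\max} - R \right) - k_d \, R
\quad \text{(association phase)},
\qquad
\frac{dR}{dt} = -k_d \, R
\quad \text{(dissociation phase, } C = 0 \text{)},
\qquad
K_D = \frac{k_d}{k_a} .
```

A lower K_D therefore corresponds to tighter binding, which is why a higher predicted binding score is expected to correlate negatively with the measured K_D.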

3. Results

Presented herein is a novel in silico approach combining spike (S) protein structure modelling and large protein transformer language models on S protein sequences to accurately rank SARS-CoV-2 variants for immune escape and fitness potential (Fig. 1). This system is made up of an immune escape score which incorporates an ML-based model via the semantic change score alongside structural modelling captured within an epitope alteration score (see Materials and Methods). Additionally, viral fitness is estimated as the fitness prior score. Fitness prior is an a priori estimate of the propensity of the pathogen to proliferate, which comprises such complex traits as translational efficiency of the S protein, its cleavability by proteases, and RBD mobility, among others. These traits are all abstracted into a high-level metric, which is denoted as a fitness prior, an estimate of relative fitness versus other, observed S protein variants. The fitness prior score is designed to measure the probability that a particular variant is fit by combining three informative priors: the structural-modelling driven ACE2 binding score, the ML-based conditional log-likelihood score and the growth score (see Materials and Methods). The success of a viral variant in the global population is driven by more than these properties, but the S protein – the only SARS-CoV-2 protein considered in this study – primarily mediates host cell entry through ACE2 binding and serves as the main target of neutralising antibodies [26]. Therefore, the immune escape and fitness prior metrics described herein were considered the most relevant scores. These scores were subsequently validated with previously published and newly generated experimental data and were also used to explore the changing landscape of the pandemic and to build an Early Warning System (EWS) capable of evaluating new variants in minutes and risk-monitoring variant lineages in near real-time.

In silico estimation of immune escape potential and in vitro validation. The epitope alteration score (see Materials and Methods: Epitope alteration score) estimates how well each variant evades neutralising antibodies. The 227 antibodies and corresponding 719 epitopes used were well distributed across different receptor-binding domain (RBD)-targeting antibody classes [17] and epitope classes [25] (Tables S2 and S3). In addition to the RBD, the epitopes also covered positions across the entire S protein [25], including the N-terminal domain (NTD) and protease cleavage sites [26] (Table S4). An overlay of all nAb:S protein interaction interfaces was used to generate a colour-coded heatmap, indicating which surface-exposed amino acids were located in high epitope density regions (Fig. 2 a).

Fig. 2 — *In silico* scores for immune escape and fitness prior correlate with *in vitro* data. **(a)** The surface of a SARS-CoV-2 spike (S) protein structure (PDB ID: 7KDL [28]). The top row structure is coloured by the frequency of contact of S protein surface residues with neutralising antibodies (brighter, warmer colour corresponds to more antibody binding). The middle and bottom rows depict the number of evaded epitopes in Beta (B.1.351) and Omicron (BA.1), respectively (red indicates a higher number). **(b–d)** Relationships of the epitope alteration score, semantic change score, and combined immune escape score with the observed 50% pseudovirus neutralisation titer (pVNT₅₀) reduction are shown across n = 21 selected SARS-CoV-2 S protein variants. The pVNT₅₀ reduction compared to wild-type (WT) SARS-CoV-2 pseudovirus is given in percent. Variants for which pVNT₅₀ values exceeded those against the wild-type variant were assigned a pVNT₅₀ reduction of 0 (equal to wild-type). **(e)** Validation of the ACE2 binding score against the experimentally determined ACE2 binding affinity (K_D, dissociation constant) is shown across n = 19 receptor-binding domain (RBD) variants, along with a dashed fitted regression line.

The immune escape score also considers the semantic change score, a metric to capture how different an S protein variant is from a set of reference sequences, as a proxy for the potential ability to evade antibodies. If a sequence differs vastly from a known reference sequence, it is likely to have an improved ability to evade existing antibodies. The semantic change score was calculated using language models (see Materials and Methods: Semantic change score), which capture the biological properties of proteins through unsupervised learning on large amounts of biological data [13,14,27]. The model was pre-trained on UniRef100 before subsequently being trained on SARS-CoV-2 S protein sequences.

To validate the immune escape score, in vitro pseudovirus-based neutralisation test (pVNT) assays were conducted (see Materials and Methods: VSV-SARS-CoV-2 S pseudovirus neutralisation assay). The cross-neutralising effect of n ≥ 12 BNT162b2-immune sera collected after the primary 2-dose vaccination series was assessed against vesicular stomatitis virus (VSV)-based pseudoviruses bearing the S proteins of 21 selected SARS-CoV-2 variants, including Omicron BA.1, BA.2, and BA.4/5 (BA.4 and BA.5 have identical S protein sequences) (Fig. 2b–d, Fig. S3, Table S5) using previously published data [[29], [30], [31]] as well as results from new experiments. The Omicron pseudoviruses were by far the most immune escaping compared to the other variants included in the experiment. Only 40% of serum samples tested showed a detectable 50% pseudovirus neutralisation titer (pVNT₅₀) against Omicron BA.1, with a >40-fold reduction in group geometric mean titers (GMTs) compared with the wild-type reference (Fig. S3) [31]. To allow better comparison of neutralising GMTs across variants tested at different time points, the SARS-CoV-2 variant pVNT₅₀ GMTs were normalised against those of the corresponding wild-type reference in the same assay. For Omicron BA.1 the geometric mean ratio (GMR) was 0.025, indicating another 10-fold drop in the neutralising activity against Omicron BA.1 compared to the most immune escaping non-Omicron variant B.1.1.7 + E484K (GMR 0.253) (Fig. S3c). This result is in good concordance with the in silico immune escape score for Omicron BA.1, which was the highest amongst observed, circulating variants at the time of performing the assay. Across all variant pseudoviruses tested, the epitope alteration score correlated positively with the calculated pVNT₅₀ reduction (1-GMR) percentage (Fig. 2b; Pearson r = 0.64, p = 1E-3; Spearman r = 0.77, p = 3E-5). The predictive power of the epitope alteration score was distributed across different antibody and epitope classes rather than being dependent on a single class (Tables S6–S8). Positive correlation with pVNT₅₀ reduction was also observed with the semantic change score (Fig. 2c; Pearson r = 0.81, p = 5E-6; Spearman r = 0.76, p = 3E-5). The combined immune escape score exhibited an even stronger rank correlation with the observed reduction in neutralising titers (Fig. 2d; Pearson r = 0.81, p = 5E-6; Spearman r = 0.86, p = 3E-7). Consistent strong correlations were also observed using neutralising titer data presented in multiple independent studies [4,[32], [33], [34], [35], [36]], underlining the accuracy and robustness of this score (Figs. S4–S6).
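The normalisation described above can be sketched as follows: the geometric mean ratio (GMR) is the variant GMT divided by the wild-type GMT from the same assay, and the pVNT₅₀ reduction is (1 − GMR) expressed in percent and floored at 0, as stated in the caption of Fig. 2. The function name and the example titers are hypothetical.

```python
from statistics import geometric_mean


def pvnt50_reduction(variant_titers, wild_type_titers):
    """Percent reduction in pVNT50 GMT relative to the wild-type reference.

    Both arguments are lists of per-serum pVNT50 titers from the same assay.
    Variants neutralised better than wild-type are assigned a reduction of 0.
    """
    gmr = geometric_mean(variant_titers) / geometric_mean(wild_type_titers)
    return max(0.0, 1.0 - gmr) * 100.0
```

A GMR of 0.025, as reported for Omicron BA.1, corresponds to a 40-fold lower GMT and a reduction of 97.5%.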

In silico estimation of fitness and in vitro validation. The immune escape score does not capture protein changes that either enhance the efficiency of viral cell entry or otherwise impact its structure or function. A key determinant of viral spread is the effectiveness with which virus particles can attach to and invade host cells. This quality was estimated using the ACE2 binding score (see Materials and Methods: ACE2 binding score). The ACE2 binding score was based on the predicted impact of sets of mutations on the binding affinity of the variant S protein to the human ACE2 receptor, the receptor used for cell entry of SARS-CoV-2. In order to assess the validity of the ACE2 binding score, the simulation results were compared with in vitro results. Surface plasmon resonance (SPR) binding analysis was performed to determine the binding affinity (K_D, dissociation constant) between 19 RBD variants and the ACE2 receptor (see Materials and Methods: Binding kinetics of RBD variants to ACE2 using surface plasmon resonance spectroscopy). This assay measures observable association rates which are a result of a dynamic process whereas simulations measure aggregated, static binding affinity, thus marginalising the contribution of mutations toward the flexibility and kinetics of the S protein. Despite this, the ACE2 binding score showed a moderate negative correlation with the K_D values (Fig. 2e; Pearson r = −0.53, p = 2E-2; Spearman r = −0.45, p = 5E-2).

Viral fitness can also be approximated from a language model perspective when considering the probability of a particular S protein sequence. In general, the higher the log-likelihood of an S protein variant, the more probable the variant is to occur. In particular, the log-likelihood metric supports substitutions, insertions, and deletions without requiring a reference sequence to measure against (unlike the grammaticality of Hie et al. [37]). The previously described language model (see Materials and Methods and Supplementary Materials and Methods) was not provided with explicit sequence count data in the training phase, yet on average assigned higher log-likelihood values to sequences with a higher actual observed count (Fig. S7; Pearson r = 0.96, p = 1E-11; Spearman r = 0.99, p = 3E-20). High log-likelihood may indicate features common in the general variant population, which are likely to be fitness-related, thus allowing variants harbouring these to sustain additional such mutations. However, the values of log-likelihood tend to diminish as the number of mutations increases, which creates a bias toward variants with low mutation counts. Considering that all the samples used for training have been detected in the global population, and as such have satisfied minimal fitness criteria, this bias was addressed by introducing the conditional log-likelihood score (as described in Materials and Methods: Conditional log-likelihood score). The conditional log-likelihood score sheds more light on highly mutated, potentially concerning variants like Omicron BA.1. Due to its high mutational load, Omicron BA.1 may be perceived by raw log-likelihood as highly unlikely; however, relative to other variant sequences with a similar number of mutations, Omicron BA.1 stands out, leading to a higher conditional log-likelihood score (Fig. S8). However, the conditional log-likelihood score cannot fully assess variants that exhibit completely new sequence features. To account for this, the fitness prior metric includes a growth score, an empirical term representing the quantified change in the fraction of the sequences in the database that a variant in question comprises (see Materials and Methods: Growth score). This addresses the intuitive notion that novel variants which are increasing in prevalence are more imminently interesting than those which are not.

Combining fitness prior and immune escape scores to continuously monitor potential high-risk variants. Different selective pressures on virus evolution lead to variants with high immune escape and fitness because a virus must remain evolutionarily competent to successfully spread. A system that keeps track of immune escape and fitness factors (as depicted in Fig. 3 and Fig. S9) could continuously monitor potential HRVs on a near real-time basis as new sequences are added to the data pool. To begin probing this hypothesis, for each week corresponding to new VOC designations, the immune escape score and fitness prior score per lineage were averaged to rank and visualise the variant landscape. It is important to note that the ranking is relative and depends on the scores of the other circulating variants. This means that, while relative ranking of individual variants with regard to each metric remains the same, the raw rankings differ from week-to-week. Therefore, compound scores (like immune escape) tend to differ as well, which allows for the granularity necessary to discriminate between variants in different circumstances, for example, when there are multiple HRVs circulating, each with its own characteristics, or when the landscape is dominated by subvariants of a single lineage (e.g., the Delta outbreak of Summer 2021). As shown in Fig. 3, the VOCs were mostly comparatively highly immune escaping and had satisfactory fitness prior score at the week of WHO VOC designation. Over time, the most prevalent VOCs, such as Alpha: B.1.1.7, Delta: B.1.617.2 and AY and Omicron: BA.1 and BA.2 often diversify and lead to the emergence of sub-lineages. These sub-lineages often have increased fitness prior at the cost of decreased immune escape score for improved transmissibility. Simultaneously, with new emerging VOCs becoming more prevalent, the past VOCs gradually decrease in prevalence with reduced scores for both fitness prior and immune escape. 
This changing landscape over time was easily visualised using the fitness prior score and the immune escape score (Fig. 3). Note how the initially innocuous BA.2 lineage (Fig. 3e) evolved into the highly immune evasive BA.5 sublineage (Fig. 3f).
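The per-lineage averaging and relative ranking described above can be sketched as follows. This is a minimal illustration, assuming that each sequence already carries an immune escape score and a fitness prior score, that "averaged per lineage" means taking the mean of each score over the sequences of a lineage, and that the relative rank is the fraction of circulating lineages with a strictly lower value. The function names and toy numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def lineage_landscape(records):
    """Average both scores per lineage.

    records: iterable of (lineage, immune_escape, fitness_prior), one per sequence.
    Returns {lineage: (mean_immune_escape, mean_fitness_prior)}.
    """
    grouped = defaultdict(list)
    for lineage, escape, fitness in records:
        grouped[lineage].append((escape, fitness))
    return {
        lineage: (mean(e for e, _ in scores), mean(f for _, f in scores))
        for lineage, scores in grouped.items()
    }


def relative_rank(values):
    """Map each key to the fraction of keys with a strictly lower value."""
    n = len(values)
    return {
        key: sum(other < value for other in values.values()) / n
        for key, value in values.items()
    }


# Toy week of scored sequences (hypothetical lineages and scores).
week = [
    ("lineage_1", 0.6, 0.8),
    ("lineage_1", 0.4, 0.6),
    ("lineage_2", 0.9, 0.5),
    ("lineage_3", 0.1, 0.4),
]
landscape = lineage_landscape(week)
escape_rank = relative_rank({k: v[0] for k, v in landscape.items()})
```

Because `relative_rank` depends on which lineages are present, adding or removing a circulating lineage shifts the ranks of the others even when their raw scores are unchanged, which mirrors the week-to-week variation noted in the text.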

Fig. 3 — Combining immune escape and fitness prior for continuous monitoring of the SARS-CoV-2 variant landscape. Snapshot of lineages in terms of fitness prior and immune escape score on **(a)** December 20, 2020, **(b)** January 17, 2021, **(c)** May 16, 2021, **(d)** November 28, 2021, **(e)** February 27, 2022, and **(f)** May 15, 2022, corresponding to the week of the designations of Alpha/Beta, Gamma, Delta, Omicron BA.1, Omicron BA.2, Omicron BA.4 and Omicron BA.5 by the WHO as VOCs. Red markers indicate the designated lineages of the week, yellow markers are the previously designated lineages and grey markers indicate other lineages. Circles correspond to non-variant of concern (VOC) lineages and other symbols correspond to designated variants and their closely related lineages. The cross, north-east-pointing triangle, south-pointing triangle, flattened diamond, diamond, square, north-pointing triangle, and pentagon correspond to Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2, AY), Omicron BA.1, Omicron BA.2, and Omicron BA.4/BA.5 lineages, respectively. Lineages such as BA.4 and BA.5 with the same S protein sequence are indicated with the same shape. Only lineages that had been observed within the past 8 weeks relative to the indicated date for each plot and had been reported more than 10 times were included. See Fig. S9 for the corresponding density contour plots of sequences.

Detection of potentially high-risk variants prior to substantial spread in the population. The in silico Early Warning System (EWS) trains on the complete GISAID [[1], [2], [3]] SARS-CoV-2 S protein sequence database in less than a day and can score novel S protein variants within minutes. This is a non-trivial task, as newly emerging HRVs most often comprise new sets of mutations in the S protein and not all combinations of mutations present in previously concerning S protein variants lead to enhanced immune evasion or transmissibility. As shown in the previous section, the VOCs were often among the top immune escaping lineages, while the fitness prior scores were not necessarily the highest, as observed with, for example, Delta (B.1.617.2, AY) in May 2021 (Fig. 3c). Furthermore, prolonged viral evolution in patients who are not able to clear the virus generates intrapatient viral variants with increased immune evasion rather than increased fitness [[38], [39], [40]]. These results, together with increased vaccination rates worldwide, put an added emphasis on immune evasion as a key risk factor in newly emerging variants. This emphasis motivated using the immune escape score alone in the EWS for early HRV detection. Therefore, each week, the top 12 sequences according to immune escape score following the protocol outlined above (see Materials and Methods: Retrospective detection of HRVs) were aligned and used to populate a heatmap (Fig. 4) to highlight potentially concerning mutations and sequences. For example, a striking difference is clear between the heatmap corresponding to the week of Omicron BA.1 lineage emergence and that immediately preceding it (Fig. 4). The pre-Omicron week (Fig. 4a) was characterised by incremental changes to existing lineages in which already circulating, fitness-inducing mutations (like deletions in NTD loops) were observed.
The unique mutations were few, had a relatively high background frequency (heat on the red-yellow scale) and varied across samples, and thus were not considered imminently threatening. In the week of Omicron BA.1 emergence (Fig. 4b), a large fraction of new, previously unobserved mutations (in deep red) was seen. Most of the unassignable lineage samples, which were later assigned as Omicron BA.1 variants (in bold in Fig. 4b), were ranked highly and harboured the same clusters of mutations. These traits combined strongly suggest the emergence of a new (sub)lineage of potential high risk.
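The weekly selection step can be illustrated with a short sketch that keeps the 12 highest-scoring sequences of a week. It assumes the immune escape scores have already been computed, and it covers only the ranking. The full protocol is the one in Materials and Methods: Retrospective detection of HRVs and is not reproduced here. The names `weekly_watchlist` and `WATCHLIST_SIZE` are hypothetical.

```python
import heapq

WATCHLIST_SIZE = 12  # watch list size per week stated in the text


def weekly_watchlist(scored_sequences, size=WATCHLIST_SIZE):
    """Return the `size` highest-scoring sequences, best first.

    scored_sequences: iterable of (sequence_id, immune_escape_score).
    """
    return heapq.nlargest(size, scored_sequences, key=lambda item: item[1])


# Toy week with 50 scored sequences (made-up identifiers and scores).
scored = [(f"seq{i}", i / 100) for i in range(50)]
watchlist = weekly_watchlist(scored)
top_id = watchlist[0][0]
```

Using `heapq.nlargest` avoids sorting the whole week when only a short list is needed, which is convenient when thousands of novel sequences arrive each week.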

Fig. 4 — The EWS protocol for detection of high-risk variants (HRVs) and output heatmaps. A heatmap was constructed each week with the top selected sequences for the immune escape score resulting from the detection protocol outlined in the section Materials and Methods: Retrospective detection of HRVs. The two heatmaps shown here represent the top sequences resulting from this protocol from **(a)** the week just prior to the appearance of Omicron, November 21, 2021, and **(b)** the week of the appearance of Omicron, November 28, 2021. Each row represents a spike (S) protein sequence in decreasing order of immune escape score from top (highest score) to bottom (lower score). The label at the front of each row indicates either an associated, reported lineage or an unknown lineage (UNK). The UNK labels in bold in (b) indicate sequences that were later designated as belonging to Omicron. Sequences labelled with UNK were compared to the closest VOC lineage based on sequence similarity in order to distinguish common and uncommon mutations as indicated in the colourbars. All non-bolded UNK-labelled sequences in (a) and (b) were later designated to other lineages. Each column represents a mutation present in the S protein within the NTD and RBD regions indicated in purple and teal, respectively. The colour scale from green to blue represents lineage-defining mutations (defined by the WHO or inferred from mutation frequency) and the relative frequency of a mutation from 0.0 to 1.0 in the sequence population of the assigned lineage. The colour scale from red to yellow represents non-lineage-defining mutations and their frequency from 0.0 to 1.0 in the sequence population of the lineage. Boxes marked with an M indicate lineage-defining mutations that are missing from the corresponding sequence.

The EWS flagged all 16 VOI/VOC designated variants months in advance of their WHO designation. One goal of the EWS was to detect lineages each week that have not yet been designated as a VOI or a VOC (where applicable) by the WHO but will be designated as such in the future. This often represented a small proportion of the sequence candidates, as shown in Fig. 5a. To assess the system's precision, a retrospective analysis was conducted for each week between September 16th, 2020, and May 15th, 2022 (see Materials and Methods: Retrospective detection of HRVs). When using a weekly watch list of only 12 variants, less than 0.3% of the data on average, the EWS flagged all 16 (Alpha, Beta, Gamma, Delta, Epsilon, Zeta, Eta, Theta, Iota, Kappa, Lambda, Mu, Omicron BA.1, Omicron BA.2, Omicron BA.4 and Omicron BA.5) WHO VOI/VOC designated variants (Fig. 5b) with an average lead time of 52 days prior to designation as such (Table S9). The date of VOC designation was used when available; otherwise, the date of designation as a VOI was used. The EWS identified Omicron BA.1 as the highest immune escaping variant among the more than 75,000 unique S protein variants deposited between early October and early December 2021. The combination of mutations in BA.1 allows the virus to evade a large fraction of nAbs, which was accurately captured by the exceptionally high epitope alteration score (Fig. 3d and e). Roughly at the time of emergence of BA.1, another variant, BA.2, emerged (Fig. 3e). While early BA.2 variants were not as immune escaping as BA.1, several of its descendants, including BA.4 and BA.5 variants, were flagged by the EWS as increasingly immune escaping upon emergence, outperforming BA.1 and BA.2 variants (Fig. 3f).
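The lead time reported here is the number of days between the first flagging of a variant and its WHO designation. A minimal sketch of that arithmetic follows. The variant names and dates are invented placeholders, not values from Table S9, and the helper names are hypothetical.

```python
from datetime import date


def lead_time_days(flag_date, designation_date):
    """Days by which flagging preceded designation (negative if it came after)."""
    return (designation_date - flag_date).days


def average_lead_time(events):
    """events: {variant: (flag_date, designation_date)}."""
    leads = [lead_time_days(flag, designated) for flag, designated in events.values()]
    return sum(leads) / len(leads)


# Placeholder variants with made-up dates, for illustration only.
events = {
    "variant_x": (date(2021, 1, 1), date(2021, 3, 2)),    # flagged 60 days early
    "variant_y": (date(2021, 6, 10), date(2021, 6, 30)),  # flagged 20 days early
}
mean_lead = average_lead_time(events)
```

With this sign convention, a method that on average flags variants only after designation yields a negative mean.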

Fig. 5 — The EWS flags high-risk variants (HRVs) ahead of their WHO designation as either a variant of interest (VOI) or a variant of concern (VOC). **(a)** Each bar corresponds to a week and represents all novel, non-VOC spike (S) protein sequences. Each bar is split into 3 groups: sequences that will later be designated as either VOIs or VOCs by the WHO (green), sequences that were labelled with unknown lineages that later will be designated as known VOIs or VOCs (grey) and other sequences (white). **(b)** The cumulative sum of all submissions to GISAID of a given variant lineage (in log scale) over time. Green and red dashed lines indicate the date of WHO designation as either a VOC or a VOI and the date of flagging as a HRV by the EWS, respectively.

The EWS immune escape score performs better than other metrics. Exploring all the metrics presented, one can consider the growth score alone as a plausible metric that requires neither ML nor simulation to detect 15 variants early (Fig. 6a, Table S10). However, the growth score calculation requires additional observational data, such as occurrence frequencies. This caveat makes the growth score applicable only in the case of already ongoing outbreaks or variant spread, which precludes its application to newly emerging and theoretical variants and hence reduces its real-life utility (see Materials and Methods: Growth score). Additionally, the immune escape score outperformed the growth score in terms of average lead time ahead of WHO VOI or, where applicable, VOC designation, and the growth score required 50% more sequences per week to successfully detect Gamma (Table S10).

Fig. 6 — Comparing detection of high-risk variants (HRVs) using different metrics and machine learning (ML)-based approaches. **(a)** Detection results using the Early Warning System (EWS) metrics, immune escape score, semantic change score, epitope alteration score, and growth score, compared to standard ML techniques: generalised linear model (GLM) and uniform manifold approximation and projection (UMAP). The left bar chart displays the percentage of variants detected ahead of their designation as either a VOC or VOI by the WHO, the centre bar chart displays the average precision in percent for each metric and the right bar chart displays the proportion of the weeks where the used metric achieves an enrichment greater than 1 (better than random). **(b)** Detection results with respect to the watch list size per week using immune escape score, semantic change score, epitope alteration score, growth score, GLM and UMAP. The markers correspond to 12 sequences per week, the size used in the EWS.

The EWS was also evaluated in terms of precision (1 − false discovery rate, i.e., the fraction of detected sequences that are true positives), where a positive class represents detected sequences that would later be designated as either a VOI or a VOC by the WHO. Maximising precision requires the system to recall as many of these sequences as possible with a fixed watchlist size. The growth score underperforms the immune escape score when only a limited number of sequences can be detected per week (Fig. 6b). In particular, the growth score fails to detect WHO VOI/VOC designated sequences during the majority of the Summer of 2021 when many lineages were spreading widely (Fig. S10). However, as the analysis used a fixed-length list of predictions each week, typical classification metrics such as precision are limited because there were periods when there were no new WHO VOI/VOC designated sequences to detect. Therefore, the number of correctly detected sequences was also compared with the number expected from random sampling each week through an enrichment metric. For each week, the enrichment score was calculated as the number of WHO VOI/VOC designated sequences detected by a method divided by the expected number detected by random sampling. An enrichment score greater than 1 indicates that a particular method performs better than random sampling. This analysis demonstrated that the immune escape score consistently had more weeks with enrichment higher than 1 than any other metric, including the growth score (Fig. 6b). Using a watchlist size of only 12 sequences per week (Fig. 6b) across the 59 relevant weeks, the immune escape score outperformed random sampling more than 75% of the time while the growth score outperformed random sampling less than 60% of the time (p = 0.02; paired t-test).
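The two weekly metrics can be written out in a few lines. The sketch below assumes that random sampling means drawing a watch list of the same size uniformly, without replacement, from that week's candidate sequences, so the expected number of hits is the list size times the fraction of candidates that are positives. Function names and the toy numbers are hypothetical.

```python
def precision(watchlist, positives):
    """Fraction of watch-listed sequences that are later-designated positives."""
    if not watchlist:
        return 0.0
    hits = sum(1 for seq in watchlist if seq in positives)
    return hits / len(watchlist)


def enrichment(watchlist, positives, candidates):
    """Detected positives divided by the number expected from random sampling.

    Returns None when the week contains no positives, since the ratio is
    then undefined (assumption of this sketch).
    """
    n_positive = len(positives & candidates)
    expected = len(watchlist) * n_positive / len(candidates)
    if expected == 0:
        return None
    hits = sum(1 for seq in watchlist if seq in positives)
    return hits / expected


# Toy week: 1000 candidates, 20 of which are later designated (made-up data).
candidates = {f"seq{i}" for i in range(1000)}
positives = {f"seq{i}" for i in range(20)}
# A 12-sequence watch list containing 3 positives and 9 other sequences.
watchlist = [f"seq{i}" for i in range(3)] + [f"seq{i}" for i in range(500, 509)]

week_precision = precision(watchlist, positives)                # 3 of 12
week_enrichment = enrichment(watchlist, positives, candidates)  # 3 hits vs 0.24 expected
```

In this toy week random sampling would be expected to find 0.24 positives, so three hits correspond to an enrichment of about 12.5. Weeks with no positives to find give an undefined enrichment, and their precision is necessarily zero, which is the limitation of fixed-length lists mentioned in the text.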

The EWS outperforms random sampling and standard ML. To further validate the EWS, standard ML techniques were applied for comparison. Both supervised and unsupervised ML approaches were tested, corresponding conceptually to epitope alteration score and semantic change score, respectively. For unsupervised learning, Uniform Manifold Approximation and Projection (UMAP) was applied, which has been successfully used for analogous problems in biology and is known to render meaningful insights in life science settings [41]. Only 9 out of 16 variants were detected using UMAP, with detection occurring on average across all variants 8 days after VOI/VOC designation by the WHO (Fig. 6, Table S10). For supervised learning, a Generalised Linear Model (GLM) was explored; it failed to detect 3 out of 16 variants altogether and detected only 8 out of 16 variants in advance, with detection occurring on average across all variants 10 days after VOI/VOC designation by the WHO (Fig. 6, Table S10). Overall, these standard ML techniques do not reach the same predictive performance as the methods proposed in this work (details in Supplementary Materials and Methods, Table S10 and Fig. 6).

4. Discussion

Validation of the immune escape and fitness prior scores using previously published and newly generated data, along with the direct comparison to other ML-based approaches, supports the conclusion that the combination of structural simulations and ML-based modelling of SARS-CoV-2 S protein variants presented herein allows for large-scale, continuous risk monitoring. The combination of fitness prior and immune escape score allows for better visualisation and understanding of the pandemic landscape and evolution, whereas the immune escape score alone facilitates precise early detection of HRVs. The EWS, filtering down to a hit list of only 12 sequences per week from thousands (i.e., on average less than 0.3% of weekly novel variants), can accurately detect all 16 HRVs on average months ahead of the official WHO designation as either a VOI or a VOC. Additionally, detection is often possible within the same week the sequenced variant enters the database. For instance, the EWS flagged Omicron BA.2 as soon as it was uploaded to GISAID [[1], [2], [3]], nearly 3 months ahead of its designation as a VOC by the WHO. Compared with other baselines, the EWS achieves significantly higher predictive power in terms of enrichment over random sampling and all other considered metrics, demonstrating sustained precision and robustness over the period from 2021 to 2022. Each week the EWS predicts new potential sequences of concern which, together with experimental assays, could be harnessed by public health authorities and governments worldwide to increase their preparedness for HRVs and potentially alleviate associated human and economic costs. The EWS and the models presented herein can also be adapted for application to similar viruses, with accuracy relying on the amount of available sequence information and in vitro data for validation.
Future development of the EWS could include expanding the immune escape scores to consider varied vaccination status and developing further functionalities such as the assessment of known and predicted T-cell epitopes and the projection of prospective variant evolution.

Author contributions

U.S., K.B., M.J.S., A.M., Y.F., T.P. and A.P. conceived and conceptualised the work. K.B. conceived the machine learning scoring models. K.B., Y.F., T.P., and A.L. conceived the machine learning training procedure. K.B., M.J.S., Y.F., N.L.C., A.K., A.L., and I.K. conceived and developed the data pipeline, software, and visuals. Y.F., T.P., and I.K. performed the machine learning experiments. M.J.S. conceived and developed the in silico epitope alteration score and structural bioinformatics methodology. A.M. and B.G.L. planned and supervised the in vitro experiments. A.M., B.G.L. and B.S. performed in vitro experiments. A.M., B.G.L. and B.S. analyzed in vitro experimental data. U.S., K.B., M.J.S., Y.F., T.P., N.L.C., A.M., A.U.L., A.P., and Y.L. interpreted data, drafted the manuscript and revised the manuscript critically for important intellectual content. All authors supported the review of the manuscript.

Material availability

Biological materials are available from the authors under a material transfer agreement with BioNTech.

Declaration of competing interest

U.S. is a management board member and employee at BioNTech SE. A.M., B.G.L. and B.S. are employees at BioNTech SE. A.P. and Y.L. are employees at BioNTech US. U.S., A.M., Y.L., and A.P. are inventors on patents and patent applications related to RNA technology and/or the COVID-19 vaccine. U.S., A.M., B.G.L., and B.S. have securities from BioNTech SE. K.B. is a management board member and employee at InstaDeep Ltd. M.J.S., Y.F., T.P., N.L.C., A.L., I.K., A.K. and A.U.L. are employees of InstaDeep Ltd or its subsidiaries. K.B., M.J.S., Y.F., T.P., N.L.C., and A.L. are inventors of patents and patent applications related to machine learning technology. K.B., M.J.S., Y.F., T.P., N.L.C., and A.L. have securities from InstaDeep Ltd.

Acknowledgements

Supported by BioNTech and InstaDeep. We thank the BioNTech German clinical trial (NCT04380701, EudraCT: 2020-001038-36) participants, from whom the post-immunisation human sera for the cross-neutralisation analysis were obtained.

Footnotes

Appendix A. Supplementary data to this article can be found online at https://doi.org/10.1016/j.compbiomed.2023.106618.

Appendix A. Supplementary data

The following is the Supplementary data to this article:

Multimedia component 1

mmc1.docx (3.1 MB, docx)

References

1. Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID's innovative contribution to global health. Global Chall. 2017;1:33–46. doi: 10.1002/gch2.1018.
2. Shu Y., McCauley J. GISAID: global initiative on sharing all influenza data–from vision to reality. Euro Surveill. 2017;22:30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494.
3. Khare S., et al. GISAID's role in pandemic response. China CDC Wkly. 2021;3:1049–1051. doi: 10.46234/ccdcw2021.255.
4. Liu Y., et al. Neutralizing activity of BNT162b2-elicited serum. N. Engl. J. Med. 2021;384:1466–1468. doi: 10.1056/NEJMc2102017.
5. Twohig K.A., et al. Hospital admission and emergency care attendance risk for SARS-CoV-2 delta (B.1.617.2) compared with alpha (B.1.1.7) variants of concern: a cohort study. Lancet Infect. Dis. 2022;22:35–42. doi: 10.1016/S1473-3099(21)00475-8.
6. Singh J., et al. SARS-CoV-2 variants of concern are emerging in India. Nat. Med. 2021;27:1131–1133. doi: 10.1038/s41591-021-01397-4.
7. The technical advisory group on SARS-CoV-2 virus evolution (TAG-VE). Classification of Omicron (B.1.1.529): SARS-CoV-2 Variant of Concern. Nov 2021;26 https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern [cited 28 Nov 2021]
8. Wang Q., et al. Antibody evasion by SARS-CoV-2 Omicron subvariants BA.2.12.1, BA.4 and BA.5. Nature. 2022;608:603–608. doi: 10.1038/s41586-022-05053-w.
9. Cao Y., et al. BA.2.12.1, BA.4 and BA.5 escape antibodies elicited by Omicron infection. Nature. 2022;608:593–602. doi: 10.1038/s41586-022-04980-y.
10. Yao L., et al. Omicron subvariants escape antibodies elicited by vaccination and BA.2.2 infection. Lancet Infect. Dis. 2022;22:1116–1117. doi: 10.1016/S1473-3099(22)00410-8.
11. Kalia K., et al. The lag in SARS-CoV-2 genome submissions to GISAID. Nat. Biotechnol. 2021;39:1058–1060. doi: 10.1038/s41587-021-01040-0.
12. Subissi L., et al. An early warning system for emerging SARS-CoV-2 variants. Nat. Med. 2022;28:1110–1115. doi: 10.1038/s41591-022-01836-w.
13. Rives A., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 2021:118. doi: 10.1073/pnas.2016239118.
14. Elnaggar A., et al. ProtTrans: towards cracking the language of Life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225. 2020.
15. Meier J., et al. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv. 2021 doi: 10.1101/2021.07.09.450648.
16. O'Toole Á, et al. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol. 2021:7. doi: 10.1093/ve/veab064. veab064.
17. Barnes C.O., et al. SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies. Nature. 2020;588:682–687. doi: 10.1038/s41586-020-2852-1.
18. Ju B., et al. Human neutralizing antibodies elicited by SARS-CoV-2 infection. Nature. 2020;584:115–119. doi: 10.1038/s41586-020-2380-z.
19. Dejnirattisai W., et al. The antigenic anatomy of SARS-CoV-2 receptor binding domain. Cell. 2021;184:2183–2200. doi: 10.1016/j.cell.2021.02.032. e22.
20. Yan R., et al. Structural basis for bivalent binding and inhibition of SARS-CoV-2 infection by human potent neutralizing antibodies. Cell Res. 2021;31:517–525. doi: 10.1038/s41422-021-00487-9.
21. Vaswani A., et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017:5998–6008.
22. Jumper J., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
23. Alford R.F., et al. The rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theor. Comput. 2017;13:3031–3048. doi: 10.1021/acs.jctc.7b00125.
24. Berger Rentsch M., Zimmer G. A vesicular stomatitis virus replicon-based bioassay for the rapid and sensitive determination of multi-species type I interferon. PLoS One. 2011;6:e25858. doi: 10.1371/journal.pone.0025858.
25. Sikora M., et al. Computational epitope map of SARS-CoV-2 spike protein. PLoS Comput. Biol. 2021;17:e1008790. doi: 10.1371/journal.pcbi.1008790.
26. Harrison A.G., et al. Mechanisms of SARS-CoV-2 transmission and pathogenesis. Trends Immunol. 2020;41:1100–1115. doi: 10.1016/j.it.2020.10.004.
27. Steinegger M., et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinf. 2019;20:473. doi: 10.1186/s12859-019-3019-7.
28. Gobeil S.M.-C., et al. D614G mutation alters SARS-CoV-2 spike conformation and enhances protease cleavage at the S1/S2 junction. Cell Rep. 2021;34:108630. doi: 10.1016/j.celrep.2020.108630.
29. Muik A., et al. Neutralization of SARS-CoV-2 lineage B.1.1.7 pseudovirus by BNT162b2 vaccine-elicited human sera. Science. 2021;371:1152–1153. doi: 10.1126/science.abg6105.
30. Sahin U., et al. BNT162b2 vaccine induces neutralizing antibodies and poly-specific T cells in humans. Nature. 2021;595:572–577. doi: 10.1038/s41586-021-03653-6.
31. Quandt J., et al. Omicron BA.1 breakthrough infection drives cross-variant neutralization and memory B cell formation against conserved epitopes. Sci Immunol. 2022;7 doi: 10.1126/sciimmunol.abq2427. eabq2427.
32. Choi A., et al. Serum neutralizing activity of mRNA-1273 against SARS-CoV-2 variants. J. Virol. 2021:95. doi: 10.1128/JVI.01313-21. e0131321.
33. Tada T., et al. Comparison of neutralizing antibody titers elicited by mRNA and adenoviral vector vaccine against SARS-CoV-2 variants. bioRxiv. 2021 doi: 10.1101/2021.07.19.452771.
34. Liu Y., et al. BNT162b2-Elicited neutralization against new SARS-CoV-2 spike variants. N. Engl. J. Med. 2021;385:472–474. doi: 10.1056/NEJMc2106083.
35. Liu J., et al. BNT162b2-elicited neutralization of B.1.617 and other SARS-CoV-2 variants. Nature. 2021;596:273–275. doi: 10.1038/s41586-021-03693-y.
36. Xia H., et al. Neutralization and durability of 2 or 3 doses of the BNT162b2 vaccine against Omicron SARS-CoV-2. Cell Host Microbe. 2022 doi: 10.1016/j.chom.2022.02.015.
37. Hie B., et al. Learning the language of viral evolution and escape. Science. 2021;371:284–288. doi: 10.1126/science.abd7331.
38. Corey L., et al. SARS-CoV-2 variants in patients with immunosuppression. N. Engl. J. Med. 2021;385:562–566. doi: 10.1056/NEJMsb2104756.
39. Sun Y., et al. Origin and evolutionary analysis of the SARS-CoV-2 Omicron variant. J. Biosaf. Biosecur. 2022;4:33–37. doi: 10.1016/j.jobb.2021.12.001.
40. Ma W., et al. Genomic perspectives on the emerging SARS-CoV-2 Omicron variant. Dev. Reprod. Biol. 2022;20:60–69. doi: 10.1016/j.gpb.2022.01.001.
41. Becht E., et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2018;37:38–44. doi: 10.1038/nbt.4314.

[bib1] 1.Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID's innovative contribution to global health. Global Chall. 2017;1:33–46. doi: 10.1002/gch2.1018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Shu Y., McCauley J. GISAID: global initiative on sharing all influenza data–from vision to reality. Euro Surveill. 2017;22:30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Khare S., et al. Gisaid's role in pandemic response. China CDC Wkly. 2021;3:1049–1051. doi: 10.46234/ccdcw2021.255. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Liu Y., et al. Neutralizing activity of BNT162b2-elicited serum. N. Engl. J. Med. 2021;384:1466–1468. doi: 10.1056/NEJMc2102017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Twohig K.A., et al. Hospital admission and emergency care attendance risk for SARS-CoV-2 delta (B.1.617.2) compared with alpha (B.1.1.7) variants of concern: a cohort study. Lancet Infect. Dis. 2022;22:35–42. doi: 10.1016/S1473-3099(21)00475-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Singh J., et al. SARS-CoV-2 variants of concern are emerging in India. Nat. Med. 2021;27:1131–1133. doi: 10.1038/s41591-021-01397-4. [DOI] [PubMed] [Google Scholar]

[bib7] 7.The technical advisory group on SARS-CoV-2 virus evolution (TAG-VE). Classification of Omicron (B.1.1.529): SARS-CoV-2 Variant of Concern. Nov 2021;26 https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern [cited 28 Nov 2021] [Google Scholar]

[bib8] 8.Wang Q., et al. Antibody evasion by SARS-CoV-2 Omicron subvariants BA.2.12.1, BA.4 and BA.5. Nature. 2022;608:603–608. doi: 10.1038/s41586-022-05053-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9.Cao Y., et al. BA.2.12.1, BA.4 and BA.5 escape antibodies elicited by Omicron infection. Nature. 2022;608:593–602. doi: 10.1038/s41586-022-04980-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Yao L., et al. Omicron subvariants escape antibodies elicited by vaccination and BA.2.2 infection. Lancet Infect. Dis. 2022;22:1116–1117. doi: 10.1016/S1473-3099(22)00410-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.Kalia K., et al. The lag in SARS-CoV-2 genome submissions to GISAID. Nat. Biotechnol. 2021;39:1058–1060. doi: 10.1038/s41587-021-01040-0. [DOI] [PubMed] [Google Scholar]

[bib12] 12.Subissi L., et al. An early warning system for emerging SARS-CoV-2 variants. Nat. Med. 2022;28:1110–1115. doi: 10.1038/s41591-022-01836-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Rives A., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 2021:118. doi: 10.1073/pnas.2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14.Elnaggar A., et al. ProtTrans: towards cracking the language of Life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv. 2020;200706225 [Google Scholar]

[bib15] 15.Meier J., et al. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv. 2021 doi: 10.1101/2021.07.09.450648. [DOI] [Google Scholar]

[bib16] 16.O'Toole Á, et al. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol. 2021:7. doi: 10.1093/ve/veab064. veab064. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] 17.Barnes C.O., et al. SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies. Nature. 2020;588:682–687. doi: 10.1038/s41586-020-2852-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] 18.Ju B., et al. Human neutralizing antibodies elicited by SARS-CoV-2 infection. Nature. 2020;584:115–119. doi: 10.1038/s41586-020-2380-z. [DOI] [PubMed] [Google Scholar]

[bib19] 19.Dejnirattisai W., et al. The antigenic anatomy of SARS-CoV-2 receptor binding domain. Cell. 2021;184:2183–2200. doi: 10.1016/j.cell.2021.02.032. e22. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] 20.Yan R., et al. Structural basis for bivalent binding and inhibition of SARS-CoV-2 infection by human potent neutralizing antibodies. Cell Res. 2021;31:517–525. doi: 10.1038/s41422-021-00487-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] 21.Vaswani A., et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017:5998–6008. [Google Scholar]

[bib22] 22.Jumper J., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] 23.Alford R.F., et al. The rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theor. Comput. 2017;13:3031–3048. doi: 10.1021/acs.jctc.7b00125. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] 24.Berger Rentsch M., Zimmer G. A vesicular stomatitis virus replicon-based bioassay for the rapid and sensitive determination of multi-species type I interferon. PLoS One. 2011;6:e25858. doi: 10.1371/journal.pone.0025858. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] 25.Sikora M., et al. Computational epitope map of SARS-CoV-2 spike protein. PLoS Comput. Biol. 2021;17:e1008790. doi: 10.1371/journal.pcbi.1008790. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] 26.Harrison A.G., et al. Mechanisms of SARS-CoV-2 transmission and pathogenesis. Trends Immunol. 2020;41:1100–1115. doi: 10.1016/j.it.2020.10.004.

[bib27] 27.Steinegger M., et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinf. 2019;20:473. doi: 10.1186/s12859-019-3019-7.

[bib28] 28.Gobeil S.M.-C., et al. D614G mutation alters SARS-CoV-2 spike conformation and enhances protease cleavage at the S1/S2 junction. Cell Rep. 2021;34:108630. doi: 10.1016/j.celrep.2020.108630.

[bib29] 29.Muik A., et al. Neutralization of SARS-CoV-2 lineage B.1.1.7 pseudovirus by BNT162b2 vaccine-elicited human sera. Science. 2021;371:1152–1153. doi: 10.1126/science.abg6105.

[bib30] 30.Sahin U., et al. BNT162b2 vaccine induces neutralizing antibodies and poly-specific T cells in humans. Nature. 2021;595:572–577. doi: 10.1038/s41586-021-03653-6.

[bib31] 31.Quandt J., et al. Omicron BA.1 breakthrough infection drives cross-variant neutralization and memory B cell formation against conserved epitopes. Sci Immunol. 2022;7:eabq2427. doi: 10.1126/sciimmunol.abq2427.

[bib32] 32.Choi A., et al. Serum neutralizing activity of mRNA-1273 against SARS-CoV-2 variants. J. Virol. 2021;95:e0131321. doi: 10.1128/JVI.01313-21.

[bib33] 33.Tada T., et al. Comparison of neutralizing antibody titers elicited by mRNA and adenoviral vector vaccine against SARS-CoV-2 variants. bioRxiv. 2021. doi: 10.1101/2021.07.19.452771.

[bib34] 34.Liu Y., et al. BNT162b2-elicited neutralization against new SARS-CoV-2 spike variants. N. Engl. J. Med. 2021;385:472–474. doi: 10.1056/NEJMc2106083.

[bib35] 35.Liu J., et al. BNT162b2-elicited neutralization of B.1.617 and other SARS-CoV-2 variants. Nature. 2021;596:273–275. doi: 10.1038/s41586-021-03693-y.

[bib36] 36.Xia H., et al. Neutralization and durability of 2 or 3 doses of the BNT162b2 vaccine against Omicron SARS-CoV-2. Cell Host Microbe. 2022. doi: 10.1016/j.chom.2022.02.015.

[bib37] 37.Hie B., et al. Learning the language of viral evolution and escape. Science. 2021;371:284–288. doi: 10.1126/science.abd7331.

[bib38] 38.Corey L., et al. SARS-CoV-2 variants in patients with immunosuppression. N. Engl. J. Med. 2021;385:562–566. doi: 10.1056/NEJMsb2104756.

[bib39] 39.Sun Y., et al. Origin and evolutionary analysis of the SARS-CoV-2 Omicron variant. J. Biosaf. Biosecur. 2022;4:33–37. doi: 10.1016/j.jobb.2021.12.001.

[bib40] 40.Ma W., et al. Genomic perspectives on the emerging SARS-CoV-2 Omicron variant. Dev. Reprod. Biol. 2022;20:60–69. doi: 10.1016/j.gpb.2022.01.001.

[bib41] 41.Becht E., et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2018;37:38–44. doi: 10.1038/nbt.4314.
