Research. 2025 Sep 30;8:0871. doi: 10.34133/research.0871

General Intelligence Framework to Predict Virus Adaptation Based on a Genome Language Model

Shu-Yang Jiang 1,2, Shi-Shun Zhao 1, Jun-Qing Wei 2,3, Sen Zhang 2, Zhongpeng Zhao 2, Yigang Tong 3, Wei Liu 2, Jianwei Wang 4,5,*, Tao Jiang 2,*, Jing Li 2,*
PMCID: PMC12480747  PMID: 41035817

Abstract

Most human viral pandemics are caused by animal-originated viruses with human adaptation. It is challenging to infer adaptation from viral genes or their encoded protein sequences, particularly when the data labels for modeling are inadequate or the input sequence to be predicted is incomplete. Here, we developed a semi-supervised General Intelligence framework to predict Virus Adaptation based on Language-model-embedded protein sequences (GIVAL) for blind input of virus sequences. The language model in GIVAL, named virus Bidirectional Encoder Representations from Transformers (vBERT), was pretrained for embedding using hidden Markov model-contextualized tokens of viral protein sequences. vBERT outperformed prevalent pretrained models like DNABERT-2, proteinBERT, ESM-2, Transformer, and Word2Vec on distinguishing viral proteins with various-grained labels, such as serotypes and single phenotype-altering mutations. The semi-supervised GIVAL obtained higher accuracy in virus adaptation prediction and better fault tolerance on raw labels in the training dataset, overcoming the obstacle of modeling with insufficient labels and predicting blind input. GIVAL was applicable to the adaptation prediction of diverse viruses. For influenza A viruses (IAVs), higher human adaptation was predicted for equine-origin H3N8 IAVs and bovine H5N1 IAVs with simulated mutations. For coronaviruses, GIVAL predicted an adaptation shift of receptor binding from Middle East respiratory syndrome–related coronavirus (MERS-CoV) receptor to severe acute respiratory syndrome coronavirus receptor of 2 recently reported MERS-CoV-like virus variants. For monkeypox viruses, GIVAL quantified an incremental adaptation shift of viral variants, matching the rise in human monkeypox cases. In summary, GIVAL provides a generally intelligent framework for predicting virus adaptation based on its genotype, with the potential to extend to more genotype-to-phenotype prediction scenarios.

Introduction

Most emerging or re-emerging virus epidemics are caused by zoonotic viruses [1], such as the last 5 influenza pandemics [2] caused by influenza A viruses (IAVs), the worldwide Coronavirus Disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [3], and other global infectious disease outbreaks [4]. However, it is challenging to predict the cause of the next virus pandemic. It is well known that most mammal- or bird-prevalent viruses only occasionally cause spillover human infection and are incapable of sustained inter-human transmission, let alone launching a pandemic, before adapting to humans [5,6]. Such adaptation manifests as human-specific receptor binding, higher replication efficiency, antagonism of the host's antiviral immune response, more efficient transmission, and other phenotypes [6–9]. It is inspiring that viral phenotypes, including host adaptation, are determined by their genotypes and are thus theoretically predictable from their genotypes [10–16]. However, traditional strategies that explore viral genotype–phenotype causality using reverse genetics and other molecular virological methods are time-consuming [11–13,17]. Given the urgency and importance of assessing the pandemic potential of mammalian or avian viruses, it is vital to build a method to predict the adaptation of a virus to the human host based on its genotype.

Inspiring predictions of viral phenotypes, such as host adaptation, using deep learning or machine learning have been reported based on viral genome information. Distinctive viral genomic compositional traits or encoded proteins accurately predicted the reservoir hosts and arthropod vectors [18], and the adaptation and transmission of IAVs [19,20], coronaviruses (CoVs) [21,22], and monkeypox virus [23]. Such genome distinctiveness has been biologically interpreted in terms of immunomodulatory activity [6,24], viral virulence [25], and replication [26] and is thus host specific. Natural language processing (NLP) embedding of protein sequences examines the genotype–phenotype association in more detail. Language model-embedded proteins intelligently predicted signal peptides [27], phylogenetic relationships [28], subcellular localization [29], post-translational modification [30], and structural features [31]. Moreover, language-embedded viral proteins accurately predicted viral evolution and structural escape [32] as well as conservation and variant effects [33]. However, the insufficiency of adaptation or other phenotype labels for virus proteins has been one of the main obstacles for supervised learning models [34]. Additionally, most current intelligent predictors are limited to specified viruses and specified virus genes within a narrow sequence length range [35,36]. Thus, more general intelligent predictors are needed.

The present study aimed to pretrain a general viral genome language embedder, named virus Bidirectional Encoder Representations from Transformers (vBERT), for general embedding of viral protein sequences, and then to build a generalized framework of General Intelligence to predict Virus Adaptation based on a Language model (GIVAL) that requires no pre-specified training data, sample labels, or pre-trained task model. Any input viral protein sequence (full-length or segmented) leads to an output list of mapped virus genotypes, automatically optimized labeled data for modeling, automatic in-time model training, and input-based virus adaptation prediction. Notably, semi-supervised learning was conducted to obtain optimized labels based on data clustering, to train a supervised classifier, and to statistically infer the adaptation label based on the predicted classes and the raw adaptation labels of the training data. Our framework provides general intelligence for genotype-to-phenotype prediction of viruses, such as viral adaptation. GIVAL and its language embedder were benchmarked and evaluated for their potential to generally predict viral adaptation from embedded proteins of CoVs, IAVs, and monkeypox viruses.

Results

Workflow of vBERT-based GIVAL to predict host adaptation of viruses

GIVAL intelligently predicted the adaptation of an input viral protein sequence or segment without a model pre-trained on specified labeled data. Firstly, after input of an unknown viral sequence of varied length, customized BLAST+ mapping against the reference sequence dataset identified the viral protein and retrieved the mapped dataset (Fig. 1A). Secondly, the vBERT-embedded retrieved dataset was re-labeled and sampled to create the training dataset (Fig. 1B). Thirdly, a ResNet predictor was trained on the training dataset (Fig. 1B). Finally, the adaptation risk of the input sequence was statistically inferred and quantified based on the predicted label and raw annotations (Fig. 1C). The integrated results of the identified viral protein, input-customized datasets, labels, model, and predicted adaptation risk were output (Fig. 1D). GIVAL provided a semi-supervised framework for input sequence retrieval, dataset labeling and sampling, predictor training, and adaptation inference.
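The 4 steps above can be sketched in Python-style pseudocode; every helper name here (map_with_blast, vbert_embed, and so on) is an illustrative placeholder, not the authors' actual API.

```python
# Hypothetical sketch of the GIVAL pipeline; each helper is an illustrative
# stand-in for the corresponding stage in Fig. 1.
def gival_predict(input_seq, reference_db):
    # (A) Identify the viral protein and retrieve the mapped dataset
    protein, mapped = map_with_blast(input_seq, reference_db)
    # (B) Embed with vBERT, re-label by clustering, sample, train a ResNet
    embeddings = vbert_embed(mapped.sequences)
    flexible_labels = relabel_by_clustering(embeddings)
    train_set = sample_by_label(embeddings, flexible_labels)
    model = train_resnet(train_set)
    # (C) Infer adaptation risk from the predicted label and raw annotations
    predicted = model.predict(vbert_embed([input_seq]))
    risk = infer_adaptation_risk(predicted, flexible_labels, mapped.raw_hosts)
    # (D) Integrated output
    return {"protein": protein, "label": predicted, "risk": risk}
```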

Fig. 1.

Fig. 1.

The pipeline of GIVAL to generally predict the adaptation risk of a virus based on vBERT embedding of its genotype. The pipeline of GIVAL can be divided into 4 parts. (A) Dataset intelligentization for input sequence identification and mapped dataset retrieval. (B) Label intelligentization and label-based sampling based on the vBERT embedding of the mapped dataset and model intelligentization by timely training a ResNet model with flexibly optimized labels for input sequence prediction. (C) Inference and quantifying of the adaptation risk of the input sequence based on the predicted label and raw annotations. (D) Integrated output of identified viral protein, flexible dataset with label, the input-customized model, and predicted results for the input sequence.

Performance of hidden Markov model tokenization and vBERT embedding on virus proteins

vBERT was pretrained on hidden Markov model (HMM)-tokenized viral protein sequences (Fig. S1A to E). The HMM tokenizer was trained on the sampled dataset and its parameters were optimized. Firstly, viral-family-based statistical sampling of the whole dataset (Figs. S2 and S3A and B) was validated against random sampling (Fig. S3C). The more even distribution of viral families in the vBERT pretraining and HMM training datasets, compared with the whole dataset, was shown by their higher Simpson indexes (Fig. S3D). The absence of marked differences in the indexes within each family before and after sampling also confirmed that sampling preserved genetic diversity while ensuring balanced representation of distinct sequence types (Fig. S3E). Secondly, the HMM tokenizer was evaluated. Statistical analysis indicated more than 99% coverage for 1- to 4-amino-acid (AA) tokens but a sharp drop to 9% coverage for 6-AA tokens (Fig. 2A); the HMM tokenizer was therefore trained using 1- to 5-AA tokens. Three-AA tokens were the most frequent, followed by 4-AA and 5-AA tokens, with significantly fewer 1- and 2-AA tokens (P = 0.0495, Fig. 2B). The HMM-tokenized vocabulary list for vBERT pretraining showed a weighted coverage of more than 98% and an average coverage of 92% for the whole dataset (Fig. 2C and Fig. S4A). Thirdly, the probability parameters of the HMM tokenizer were optimized, yielding a starting probability of {B: 0.95, M: 0, E: 0, S: 0.05}, a transition probability of {B: {M: 0.97, E: 0.03}, M: {M: 0.64, E: 0.36}, E: {B: 0.99, S: 0.01}, S: {S: 0.54, B: 0.46}}, and a series of emission probabilities (Fig. 2D and Table S1). Fourthly, the robustness of the HMM tokenizer was verified by the high consistency between the parameters estimated from the real and virtual datasets (Fig. 2E), the similarity of the frequencies of the top 50 vocabularies among all major RNA and DNA viral families (Fig. S4B and C), and the separation between the full-token frequency vectors of each RNA or DNA viral family (Fig. 2F and G).
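The reported starting and transition probabilities define a standard BMES (Begin/Middle/End/Single) segmentation HMM, and a Viterbi decode over those states cuts a protein sequence into variable-length tokens. The sketch below is a minimal illustration under a loud assumption: it uses the published starting and transition probabilities but substitutes uniform emission probabilities for the actual values in Table S1, so its token boundaries are not the paper's.

```python
# BMES Viterbi segmenter using the paper's starting/transition probabilities.
# Emissions are assumed uniform here (illustration only; real values: Table S1).
import math

START = {"B": 0.95, "M": 0.0, "E": 0.0, "S": 0.05}
TRANS = {
    "B": {"M": 0.97, "E": 0.03},
    "M": {"M": 0.64, "E": 0.36},
    "E": {"B": 0.99, "S": 0.01},
    "S": {"S": 0.54, "B": 0.46},
}

def viterbi_states(seq):
    """Most probable BMES state path (uniform emissions)."""
    logp = {s: (math.log(p) if p else -math.inf) for s, p in START.items()}
    back = []
    for _ in seq[1:]:
        new_logp, ptr = {}, {}
        for s in "BMES":
            # Floor zero-probability transitions to avoid log(0)
            best = max((logp[t] + math.log(TRANS[t].get(s, 0) or 1e-300), t)
                       for t in "BMES")
            new_logp[s], ptr[s] = best
        back.append(ptr)
        logp = new_logp
    # A token must close on E (multi-AA) or S (single-AA)
    state = max(("E", "S"), key=lambda s: logp[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def tokenize(seq):
    """Cut the sequence wherever a token ends (state E or S)."""
    tokens, start = [], 0
    for i, state in enumerate(viterbi_states(seq)):
        if state in "ES":
            tokens.append(seq[start:i + 1])
            start = i + 1
    return tokens
```

Because every token closes on E or S, the tokens always concatenate back to the input sequence regardless of the emission model chosen.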

Fig. 2.

Fig. 2.

Performance of the HMM tokenizer in vBERT on different datasets. (A) The coverage of 1- to 10-AA vocabularies in the sequences of the whole dataset, HMM tokenizer dataset, and vBERT pretraining dataset. (B) The natural logarithm of the number of tokens with 1 to 5 AAs in the HMM-tokenized vocabulary list of the whole dataset, HMM tokenizer dataset, and vBERT pretraining dataset. (C) The weighted coverage of the HMM vocabulary list for each type of virus in the whole dataset. (D) The starting, emission, and transition probabilities of the 20 AA types and other unknown characters for the HMM tokenizer. (E) The angle cosine between the parameter vectors of the HMM model based on the real and virtual datasets. (F and G) Connection indexes with values greater than 0.6 for RNA viruses (F) and DNA viruses (G).

vBERT was optimized for its parameters and benchmarked on embedding performance. Firstly, the vBERT-optimized, pretrained on HMM-tokenized sequences, performed better than models pretrained with fixed 2-AA, 3-AA, or 4-AA tokens in the ablation study of HMM tokenization. Its performance was significantly higher on the CoV receptor binding domain (RBD) (P < 0.05, Fig. 3A), and a similar outperformance over the BERT with 2-AA tokens was observed for IAV hemagglutinin (HA) (P < 0.05, Fig. 3B). The vBERT-optimized underperformed vBERT-BPE (byte pair encoding) on the HA test dataset but outperformed it on the Spike RBD test dataset; vBERT-BPE, however, exhibited marked performance instability across both datasets. Overall, among the 3 tokenization methods (2- to 4-AA, BPE, and HMM), HMM was the most robust. Secondly, the vBERT-optimized was benchmarked at various granularities. Its coarse-grained outperformance was verified by clearer intra-group clustering and inter-group separation (Fig. 3C and D) compared with Transformer, ESM-2, DNABERT-2, proteinBERT, and other models used for parameter optimization (Figs. S5A to R and S6A to R, Table S2, and the Table). Compared with BERT models pretrained on the whole and the simulated datasets, the superior performance of the vBERT-optimized confirmed that excessive homologous sequences were not conducive to embedding. Furthermore, the vBERT-optimized based on the BERT-base configuration outperformed models based on smaller configurations, such as BERT-tiny and BERT-medium. Its fine-grained outperformance was demonstrated by the more balanced clustering and separation of vBERT-optimized-embedded tokens (Fig. 3E and Fig. S7A to D) covering biologically important AA sites 16, 313, 319, and 357 in IAV nucleoprotein (NP) [37], compared with Word2Vec-embedded tokens (Fig. 3E and Fig. S7E to H) and vBERT-BPE-embedded tokens (Fig. 3E and Fig. S7I to L). The enriched semantics of the domain–function association were shown by the clear separation between the vBERT embeddings of HMM tokens in the hypervariable IAV HA1 domain and the conserved HA2 domain, demonstrating that HMM-based vBERT embedding captures protein domain boundaries rather than merely token frequency (Fig. S8A to O). Thirdly, the embedding performance of the vBERT-optimized was further tested in scenarios evaluating the immune escape of IAV and single-AA variation in the DMS datasets. WHO-reference IAV H1 HA strains were closer to prevalent H1 strains (Fig. S9A), whereas the H3 reference strains were more dispersed from prevalent H3 strains (Fig. S9B), with the immune escape indexes (Fig. 3F) aligning with the numbers of reported H1 and H3 samples (Fig. 3G). A difference between high- and low-binding variants was observed in the principal component analysis (PCA)-reduced values of the vBERT-optimized-embedded SARS-CoV-2 Spike with a single AA variation at site 339, 449, 452, or 505 (P < 0.1, Fig. S9C). The 2 classes of IAVs with different preference entropies also differed significantly in PCA1 of the embedded HAs (P = 0.01, Fig. 3H). Taken together, the optimized vBERT performed well in multiple-grained scenarios and was sensitive to subtle semantic distinctions among virus proteins.

Fig. 3.

Fig. 3.

Embedding performance of vBERT and other pretrained models on various virus proteins. The embedding performance of the vBERT-optimized compared with models pretrained with 2 to 4 AAs as tokens and BPE tokenization, respectively, with normalized silhouette, CH, ARI, NMI, and one minus normalized DBI, on the SARS-CoV-2 Spike RBD (A) and IAV HA (B) test dataset. Reduced 2 components with t-SNE of vBERT-embedded IAV HA (C) and SARS-CoV-2 Spike RBD (D). Reduced 2 components with t-SNE of the vBERT-optimized-, vBERT-BPE-, and Word2Vec-embedded (E) tokens at site 313 and other tokens of IAV NP and the maximum ratio of the tokens at site 313 in the same cluster. The immune escape index calculated based on vBERT embedding (F) and the number of circulating H1 and H3 from 2021 to 2023 (G). (H) The PCA1 variance values of sites with high and low site entropy. The PCA1 variance values of sequences with each site singly mutated with 20 AAs were obtained based on the vBERT-embedded IAV HA DMS dataset.

Table.

Benchmarking of vBERT with other pretrained language models. The 5 clustering indexes (silhouette, CH, DBI, ARI, and NMI) of all models involved in benchmarking were calculated and compared. Average score represents the average of normalized silhouette, CH, ARI, and NMI values and one minus normalized DBI value. The optimal value of each index is marked with an asterisk (*).

Dataset     Index       vBERT        Transformer  ESM-2      proteinBERT  DNABERT-2
IAV_HA      Silhouette  0.576        0.590*       0.562      0.511        0.550
            CH          3,348.260*   3,060.565    3,064.684  2,423.532    3,070.174
            DBI         0.580*       0.636        0.683      0.721        0.636
            ARI         0.834        0.871        0.906*     0.688        0.740
            NMI         0.905        0.933        0.935*     0.812        0.823
Spike_RBD   Silhouette  0.870        0.779        0.581      0.707        0.873*
            CH          14,400.057*  4,262.676    1,165.237  1,019.655    10,420.517
            DBI         0.215        0.353        0.677      0.502        0.192*
            ARI         0.968*       0.961        0.771      0.784        0.963
            NMI         0.955*       0.947        0.813      0.776        0.951
Average score           0.919*       0.762        0.382      0.086        0.678
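The Average score row follows directly from the per-index values: each of the 10 indexes (5 per test dataset) is min-max normalized across the 5 models, DBI is inverted (1 minus the normalized value, since lower DBI is better), and the 10 normalized values are averaged per model. The sketch below reproduces the published averages from the table's values to within rounding.

```python
# Reproduce the table's "Average score" from its per-index values.
MODELS = ["vBERT", "Transformer", "ESM-2", "proteinBERT", "DNABERT-2"]

# (dataset, index, higher_is_better): scores in MODELS order
SCORES = {
    ("IAV_HA", "Silhouette", True):    [0.576, 0.590, 0.562, 0.511, 0.550],
    ("IAV_HA", "CH", True):            [3348.260, 3060.565, 3064.684, 2423.532, 3070.174],
    ("IAV_HA", "DBI", False):          [0.580, 0.636, 0.683, 0.721, 0.636],
    ("IAV_HA", "ARI", True):           [0.834, 0.871, 0.906, 0.688, 0.740],
    ("IAV_HA", "NMI", True):           [0.905, 0.933, 0.935, 0.812, 0.823],
    ("Spike_RBD", "Silhouette", True): [0.870, 0.779, 0.581, 0.707, 0.873],
    ("Spike_RBD", "CH", True):         [14400.057, 4262.676, 1165.237, 1019.655, 10420.517],
    ("Spike_RBD", "DBI", False):       [0.215, 0.353, 0.677, 0.502, 0.192],
    ("Spike_RBD", "ARI", True):        [0.968, 0.961, 0.771, 0.784, 0.963],
    ("Spike_RBD", "NMI", True):        [0.955, 0.947, 0.813, 0.776, 0.951],
}

def average_scores():
    totals = [0.0] * len(MODELS)
    for (_, _, higher), vals in SCORES.items():
        lo, hi = min(vals), max(vals)
        for i, v in enumerate(vals):
            norm = (v - lo) / (hi - lo)      # min-max across models
            totals[i] += norm if higher else 1.0 - norm  # invert DBI
    return {m: t / len(SCORES) for m, t in zip(MODELS, totals)}
```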

Generalization validation of GIVAL to predict virus adaptation

The generalization performance of the input-specified GIVAL was validated using a full IAV HA, a segmented IAV HA RBD, and a segmented SARS-CoV-2 Spike RBD [38]. Firstly, BLAST+ mapping based on the reference sequences was verified to be more accurate than Diamond on the test dataset and the 2 input sequences (Fig. 4A and B). Secondly, the outperformance of flexible labeling in GIVAL over specified labeling was shown in an ablation study. From the perspective of unsupervised learning, the flexible labels predicted by GIVAL based on RBD-covered Spike sequences divided CoV Spike sequences into 3 groups: Middle East respiratory syndrome coronaviruses (MERS-CoVs), SARS-CoV-2/SARS-related viruses, and others (Fig. 4C and Fig. S10A to C). This prediction was consistent with the 3 types of receptor binding specificity of CoVs: dipeptidyl peptidase-4 (DPP4) for MERS-CoVs, angiotensin-converting enzyme 2 (ACE2) for SARS-CoV-2/SARS-related viruses, and aminopeptidase N (APN) for others [12,39,40]. The one exception was NL63, which falls in the third group but uses ACE2 as a receptor; this was supported by the longer distance from NL63 to SARS-related viruses than to TGEV, PEDV, PDCoV, and 229E (Fig. 4D and E) and by phylogenetic analysis (Fig. 4F and G). The 2 strains PDF-2180 and NeoCoV were predicted by our model to be ACE2-binding rather than DPP4-binding (Fig. 4C and Fig. S10D), a distinction not resolved by phylogenetic analysis (Fig. S10E) but consistent with reported results [41]. Flexible labeling showed higher rationality, with distinct separation of sequences (Fig. 4H) compared to the specified host labels (Fig. 4I). The flexible labels of HA RBD (Fig. S11A to D) generally divided the dataset into human-adaptive H1 and H3 samples and avian-adaptive samples. From the perspective of supervised learning, flexible labeling gave the ResNet model higher accuracy and better tolerance of raw-label faults than specified labeling (Fig. 4J and Fig. S12A to F), and the model's predictions were verified with high accuracy in the confusion matrices and ROC_AUC of the 3 cross-validation sets of HA RBD, HA complete sequence, and Spike RBD (Figs. S13A to F, S14A to F, and S15A to F) and of the independent validation set of the HA RBD model (Fig. 4K and L). Tests on extremely short inputs showed that GIVAL achieved high accuracy on segments of 40 and 50 AAs, whereas prediction accuracy was relatively unstable for 30-AA segments (Fig. S16A to I). This might be caused by the overlapping local features of different sequence types and the limited information carried by short sequences; we therefore recommend inputting sequences longer than 40 AAs for prediction with GIVAL. The prediction performance of GIVAL was further validated on sequences from several influenza pandemics, and a cross-family generalization test was conducted by predicting influenza B virus (IBV) with a model trained on IAVs (Fig. S17A). All of the IAV sequences from the 2009 H1N1 and 1968 H3N2 pandemics and the IBVs were predicted to be human-adaptive, confirming the rationality of GIVAL's predictions (Fig. S17B). The GIVAL framework thus provides generalized adaptation prediction for input virus sequences of varied lengths.
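The flexible-labeling idea can be illustrated with a toy sketch: cluster the embedded sequences, treat cluster ids as flexible labels, and infer the adaptation of a query from the raw host annotations inside its cluster. This stand-in uses scalar 1-D "embeddings" and a tiny k-means; the paper clusters vBERT embeddings and classifies them with a ResNet.

```python
# Toy flexible labeling + adaptation inference (1-D stand-in for embeddings).
from collections import Counter
import statistics

def kmeans_1d(values, k=2, iters=20):
    """Tiny k-means on scalars (k >= 2); returns a cluster id per value."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [statistics.mean(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]

def infer_adaptation(embeddings, raw_hosts, query_emb, k=2):
    """Most frequent raw host annotation inside the query's cluster."""
    flexible = kmeans_1d(embeddings + [query_emb], k)
    query_label, sample_labels = flexible[-1], flexible[:-1]
    hosts = [h for lab, h in zip(sample_labels, raw_hosts)
             if lab == query_label]
    return Counter(hosts).most_common(1)[0][0]
```

Because the label comes from the distribution of annotations within the query's cluster rather than from any single raw label, occasional mislabeled samples in the training data are outvoted, which mirrors the raw-label fault tolerance reported above.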

Fig. 4.

Fig. 4.

Performance of GIVAL on predicting adaptation risk of IAVs and CoVs. (A) The number of correctly and incorrectly mapped sequences in the test dataset based on customized BLAST+ and Diamond. The percentage represents the mapping accuracy. (B) The mapping identity between the input HA RBD and Spike RBD sequences and the mapped reference sequences (IAV_HA_ref1-3 are strain A/Jalna/NIV9436/2009, A/mallard/Sweden/101490/2009 and A/Louisiana/02/2017; CoV_Spike_ref1-3 are strain NC045512.2, KF530080.1 and MG596802.1). (C) The Artiodactyla (ART)-, Chiroptera (CHI)-, Suiformes (SUI)-, and Primates (PRI)-adaptive ratio of CoV Spike RBD sequences in each flexible label. The red star represents the predicted label of PDF-2180 and NeoCoV strain. Reduced 2 components with PCA of vBERT-embedded 229E, NL63, PDCoV, SARS-CoV-2, SARS-related, and TGEV Spike RBD sequences (D) and the distances between the PCA vectors of NL63 and others (E). Phylogenetic analysis on Spike RBD (F) and Spike complete sequences (G) of SUI, ART, CHI SARS-like, and PRI 229E, NL63, OC43, SARS-related, and SARS-CoV-2. Reduced 2 components with PCA of vBERT-embedded IAV HA training dataset based on flexible (H) or specified adaptive host (I) labels. (J) The performance of models based on flexible and specified labels with various error rates of training human H3 label. The confusion matrix (K) and ROC_AUC (L) of the model based on the input IAV HA RBD on the independent validation dataset.

GIVAL predicts and infers high-risk mammalian H3N8 and bovine H5N1 IAVs

Initially, GIVAL with an input of IAV HA RBD was applied to predict the adaptation of mammalian IAVs other than human IAVs. Firstly, the host preference with spatiotemporal distribution was obtained. More than 80% of H3N2/H3N1, 59% of H1N1, and 25% of H1N2 swine IAV HA samples were predicted to be human-adaptive, whereas most HAs from other mammals were predicted to be avian-adaptive (Fig. 5A). Surprisingly, more than 83% of equine H3N8 HAs were predicted to be human-adaptive (Fig. 5A). Spatiotemporal analysis of these predicted samples indicated continuously high human adaptation of IAV HAs from North America and sporadically high adaptation in Asia from 1971 to 1980, in South America from 2001 to 2010, and in Oceania from 2011 to 2020 (Fig. 5B). Secondly, a Bayes method was used to screen key AAs and their locations in HA for H3N2 and H3N8, respectively, to biologically interpret the predicted host preference. A marked difference in AA distribution was observed at sites 134, 132, 186, 223, 193, 183, 190, and 226 between human-adaptive and avian-adaptive H3N2 HA sequences predicted by GIVAL. These sites were located near the RBD, which includes the 130-loop, 190-helix, and 220-loop (Fig. S18A). Predicted avian-adaptive H3N8 HAs showed a high frequency of Leu at site 219, Val at 115, Asp at 101, Ala at 135, and Asn at 89 and 185, compared with Trp at 219, Leu at 115, Asn at 101, Ser at 135 and 89, and Thr at 185 for predicted human-adaptive H3N8 HAs (Fig. S18B). Thirdly, the predicted host preference of H3 samples was verified using protein structure prediction. Root mean square deviation (RMSD) values showed that the structures of swine H3N2 samples predicted to be human-adaptive differed significantly from those of canine H3N2 samples predicted to be avian-adaptive and were significantly closer to human H3 (P < 0.0001, Fig. S18C); likewise, the structures of equine H3N8 samples predicted to be human-adaptive differed significantly from those of canine H3N8 samples predicted to be avian-adaptive and were also significantly closer to human H3 (P < 0.0001, Fig. 5C). The HA RBD sequences predicted to be human- or avian-adaptive differed at sites 132, 213, and 220 of canine H3N8 (Fig. S18D) and at sites 135, 141, and 220 of equine H3N8 (Fig. S18E). To validate our predictions regarding mammalian adaptation, we characterized the receptor-binding properties of IAV HA1 proteins using bio-layer interferometry (BLI) with α2,3-linked (3′-SLNLN) and α2,6-linked (6′-SLNLN) sialylglycan receptors. Initial validation using avian H5N1 (Fig. S19A and B) and human H3N2 (Fig. S19C and D) HA1 proteins as controls reproduced their characteristic binding profiles, with human H3N2 binding 6′-SLNLN more strongly and avian H5N1 HA1 preferentially binding 3′-SLNLN, confirming the reliability of our experimental system. Importantly, equine H3N8 exhibited stronger binding to 6′-SLNLN than to 3′-SLNLN (Fig. S19E and F), consistent with the predicted potential for human adaptation of the equine H3N8 HA RBD.
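The key-site screening step can be sketched as ranking alignment columns by how strongly the amino-acid distribution differs between sequences predicted human- versus avian-adaptive. Total variation distance is used here as a simple stand-in for the Bayes method applied in the paper; the toy sequences are illustrative, not real HA data.

```python
# Rank aligned sites by divergence of AA distributions between two groups.
from collections import Counter

def site_divergence(human_seqs, avian_seqs):
    """Per-site total variation distance between the two AA distributions."""
    scores = []
    for i in range(len(human_seqs[0])):
        h = Counter(s[i] for s in human_seqs)
        a = Counter(s[i] for s in avian_seqs)
        scores.append(0.5 * sum(
            abs(h[x] / len(human_seqs) - a[x] / len(avian_seqs))
            for x in set(h) | set(a)))
    return scores
```

A score of 1.0 marks a fully discriminative site (disjoint AA usage, like the Trp/Leu contrast at H3N8 site 219 described above), while a score near 0 marks a site shared between the two groups.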

Fig. 5.

Fig. 5.

Prediction and inference of human-adapted H3N8 and H5N1 IAVs based on HA RBD. The ratio of human- or avian-adaptive samples of each serotype and host (A) and the HRI of samples collected in each time period and continent (B) (AS, NA, EU, SA, OA, and UN represent Asia, North America, Europe, South America, Oceania, and unknown, respectively). (C) The RMSD between the HA RBD structure of H3N8 reference sequences and other human-adaptive equine H3N8 and avian-adaptive canine H3N8 based on GIVAL prediction. (D) The pipeline of high-risk H5N1 mutations prediction and quantification. (E) Logo plot of the AA distribution of the top 30 important sites in human- and avian-adaptive HA RBD sequences. (F) The predicted human and avian adaptation score for the 30,000 mutations according to AA distribution. The dashed lines represent the score of the reference H5N1 strain [A/Texas/37/2024(H5N1)]. (G) Effect indexes of top 30 sites. (H) The distribution of the top 12 sites (colored in red) and the 130-loop, 190-helix, and 220-loop (colored in orange) on the protein structure of the reference strain. (I) The predicted adaptation score of the human-adapted mutations with only the top 12 sites mutated.

Additionally, high-risk H5N1 mutations for humans were inferred using GIVAL based on the HA RBD model (Fig. 5D). Firstly, the top 30 sites with relatively high importance values and clear differences in AA distribution between human- and avian-adaptive sequences (Fig. 5E) were selected. Secondly, the adaptation of the mutants generated by mutating the top 30 sites was predicted and quantified using GIVAL. Compared with the reference strain, the overall avian adaptation score decreased and the human adaptation score increased (Fig. 5F). Thirdly, following the downward trend of the site effect indexes of the 30 sites (Fig. 5G), the top 12 and top 8 sites were selected for generating mutants, and the 12 selected sites were found to be close to or within the HA 130-loop, 190-helix, and 220-loop on the predicted protein structure of the reference strain (Fig. 5H). According to GIVAL prediction, 34 mutants with the 12 sites mutated (Table S3) and 2 with the 8 sites mutated (Table S4) were predicted to be human-adaptive, and the V152N, T192I, V151L, and N193S mutations on H5N1 HA might largely affect the adaptation risk (Fig. 5I). Analysis of the relationship between the high-risk H5N1 mutations and circulating H5N1 strains found that a large proportion of circulating strains containing the high-risk mutations were from dairy cows in North America and might pose a potential threat of transmission in the human population (Table S5).
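The mutant-generation step can be sketched as enumerating all combinations of candidate amino acids at the selected high-importance sites of a reference sequence, yielding variants to score with the trained model. The sites and AA choices below are illustrative, not the paper's actual top-12 list or reference strain.

```python
# Enumerate mutants of a reference sequence at chosen high-importance sites.
from itertools import product

def generate_mutants(ref_seq, site_choices):
    """site_choices maps a 0-based position to its candidate AAs."""
    positions = sorted(site_choices)
    for combo in product(*(site_choices[p] for p in positions)):
        seq = list(ref_seq)
        for pos, aa in zip(positions, combo):
            seq[pos] = aa
        yield "".join(seq)

# 2 candidate AAs at each of 2 sites -> 2 x 2 = 4 variants (toy example)
mutants = list(generate_mutants("VTNSG", {1: "TI", 3: "SN"}))
```

Restricting candidates to the AAs actually observed in human-adaptive sequences keeps the combinatorial space manageable, which is why limiting the search to the top 12 or top 8 sites matters.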

To summarize, a large proportion of equine H3N8 IAVs were predicted to be human-adaptive, supported by protein structure prediction, and high-risk H5N1 mutations were inferred using GIVAL.

GIVAL predicts marked adaptation shift of the prevalent monkeypox viruses

The adaptation shift of the prevalent monkeypox viruses was predicted using GIVAL. Firstly, according to the vBERT embedding score (Fig. S20A), DNT (Fig. S20B), and other prior knowledge, OPG002, OPG015, OPG019, OPG031, OPG034, OPG049, OPG100, OPG130, OPG170, and OPG172 were selected for adaptation prediction. Secondly, prediction performance was validated by the high accuracy of the model for each protein (Fig. S21A to J). Thirdly, the spatiotemporal distribution of the adaptation shift was obtained. Higher reported case numbers and type II adaptation probability scores were observed from 2022 to 2024, which might reflect a higher transmission risk of the adaptation shift in the future (Fig. 6A). A higher transmission probability was observed in North America and Europe, and sequences of the predicted dataset in Asia and Oceania were completely type II adaptive. A relatively high proportion of samples in Africa showed partial type I adaptability, indicating that some of these samples might have higher pathogenicity (Fig. 6B). Additionally, the relationship and similarity between the adaptation shifts of each clade were further analyzed. A shorter fully connected layer (FC)-based distance was observed from monkeypox viruses with type II adaptation to clade Ib than to clade Ia for most of the proteins (Fig. 6C and Fig. S22A to J), consistent with the incremental type II adaptation shift from clade Ia to Ib and from clade IIb B to IIb C (Fig. 6D). Most monkeypox viruses circulating before 2022 belonged to clades Ia, IIa, and IIb A, whereas samples after 2022 were mainly from clades Ia, Ib, IIb B, and IIb C (Fig. S23). Similarities between the adaptation shift from clade Ia before 2022 to clade Ib after 2022 and from clade IIa before 2022 to clade IIb B after 2022 were indicated in 80% of the proteins based on the vBERT-optimized embedding (Fig. 6E and F and Fig. S24A to J). Overall, a distinct incremental adaptation shift since 2022 was predicted for monkeypox viruses using GIVAL.
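The adaptation-shift comparison can be sketched as follows: a clade's shift vector is the difference between the mean embeddings of its sequences after and before 2022, and two shifts are compared via the cosine of the angle between them (as in Fig. 6F). The vectors below are plain lists standing in for vBERT embeddings.

```python
# Compare adaptation-shift vectors of two clades by angle cosine.
import math

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def shift_vector(before, after):
    """Mean embedding after 2022 minus mean embedding before 2022."""
    return [a - b for a, b in zip(mean_vec(after), mean_vec(before))]

def cosine(u, v):
    """Angle cosine: 1 for parallel shifts, 0 for orthogonal ones."""
    return (sum(a * b for a, b in zip(u, v))
            / (math.hypot(*u) * math.hypot(*v)))
```

A cosine near 1 between the clade I and clade II shift vectors of a protein indicates that the two lineages moved in a similar direction in embedding space, matching the shared shift reported for 80% of the proteins.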

Fig. 6.

Fig. 6.

Adaptation shift prediction of monkeypox viruses based on main proteins. Mean adaptation degree and score of 10 proteins of monkeypox viruses post-normalization (to a range of 0.5 to 1.0) for each year (A) and region (B) (the official review number of the map is “GS(2016)1665”). (C) Reduced 3 components with PCA of FC vector of OPG015 sequences from types I and II in the training set and clades Ia and Ib in the validation set. (D) The average adaptation score of 10 proteins of clades Ia, Ib, IIb B, and IIb C. (E) Reduced 2 components with PCA of vBERT-embedded OPG015 sequences from clades I and II collected before and after 2022. The adaptation shift from clade IIa to IIb B.1.20 and that from clade Ia to Ib were emphasized with arrows [triangles represent the 4 representative sequences, namely, IIb B.1.20 (2024): EPI_ISL_19459746; IIa (1962): EPI_ISL_13056556; Ib (2024): EPI_ISL_19345034; Ia (1978): EPI_ISL_13058456, and circles represent other sequences]. (F) The angle cosine between adaptation shift vectors of clades I and II of each protein.

Discussion

It is challenging to accurately assess the human adaptation of an emerging virus in advance based solely on its genotype. Intelligent genotype-to-phenotype prediction has provided a promising risk assessment strategy for such cases. However, currently available artificial intelligence predictors still generalize insufficiently: they are inflexible after training and limited to a specified virus protein within a narrow length range. The present study addressed this insufficiency with a general intelligent framework integrating multiple pipelines: input virus protein sequence retrieval, sequence embedding with a pretrained language model, input-mapped semi-supervised learning, and statistical inference of adaptation based on the predicted sample sharing a distribution with a group of adaptation-labeled samples.

The protein language model vBERT, pretrained for all viruses, outperformed existing models in accurately embedding the multi-grained genotype–phenotype association. The best embedding performance of vBERT was achieved by balancing data size and genome variation with statistical sampling and by a more context-dependent tokenization of the viral protein sequence using NLP techniques. vBERT was competent at multiple-grained embedding tasks, such as coarse-grained protein clustering and fine-grained key mutation capture, implying a deep understanding of the biological semantics. Moreover, the GIVAL framework can be generally utilized for tasks such as viral adaptation phenotype assessment based on an unspecified “X” viral protein sequence of full or segmented length. Dataset mapping and labeling, model training, and risk prediction were performed automatically upon input into GIVAL. The “X” input was specifically mapped to its viral protein sequence in a database of currently available virus reference genes via the benchmarked mapping tool BLAST+ before an input-specific predictor was trained. Mismatched adaptation labels for some sample records in public databases were another defect in training a virus adaptation model, owing to biologically unreasonable labeling of adaptation hosts. For example, H5N1, H7N9, and some other IAV samples in public databases were labeled with human hosts, although they were untransmissible in the human population after spillover infection [42,43]. Such a defect could be partially remedied based on the data distribution of vBERT-embedded sequences [44] and thus also contributed to the outperformance of GIVAL in predicting the host adaptation of the virus. The data distribution-based label setting, followed by statistical inference of the adaptation of the input, was also more reasonable than an arbitrary setting of label numbers.
The inference of adaptation based on predicted flexible labels effectively addresses the limitations of incomplete raw adaptation-related annotations in datasets, enabling comprehensive analysis of various adaptation annotations for input sequences. Notably, influenza viruses exhibit unique characteristics where sequences sharing the same adaptive host label may belong to diverse serotypes, resulting in relatively high heterogeneity among sequences within each flexible label category. To overcome the limitations and potential inaccuracies of direct inference using the most frequent adaptation label from predicted flexible labels, we developed a hierarchical algorithm for determining adaptation thresholds. For predicting sequences of other viruses, users may either employ our default method or optionally utilize this hierarchical approach for adaptation analysis, depending on their specific research needs. The performance of GIVAL was validated by accurately predicting the host-specified receptor binding of IAVs and CoVs.

The GIVAL model trained on an input of the RBD-containing HA segment was utilized to predict the adaptation risk of IAVs identified in avian or mammalian hosts other than humans in recent years, given the persistent concern about the next influenza pandemic. Firstly, a high human adaptation of swine H3N2/H3N1 and swine H1N1 IAVs was predicted by GIVAL, consistent with their well-known human receptor-binding specificity [45,46]. Surprisingly, a discrepancy in the adaptation of receptor binding was observed in GIVAL predictions between equine and canine H3N8 IAVs. More than 75% of equine H3N8 HA sequences were predicted to be human-adapted, whereas canine H3N8 HA was not predicted to adapt to humans; this was supported by the smaller structural RMSD between the HA RBD of equine H3N8 and human H3 than between canine H3N8 and human H3. The documented potential for cross-species transmission of equine H3N8 to humans further supports these predictions [47].

The currently prevalent H5N1 spread in birds and mammals has raised concerns about another influenza pandemic [48], although only limited binding to human receptors has been observed [49]. It is therefore urgent to flag possible mutations that would increase human receptor binding. GIVAL predicted some potential variants with mutations within or near the HA RBD, such as T192I, V151L, and R227P. We propose that timely and close attention be paid to these possible mutations and to any adaptation shift in the receptor binding of present bovine H5N1 viruses.

The monkeypox virus poses another pandemic concern [50], particularly given the sharp increase in cases and the worldwide spread of monkeypox. Based on 10 main viral proteins, GIVAL automatically derived 2 types of adaptation labels from the data distribution, consistent with the virological classification into type I, with higher morbidity and mortality, and type II, with higher transmission ability [51]. According to the risk index of qualitative adaptation degree and quantitative score, the major circulating monkeypox viruses have undergone a marked shift in human adaptation after 2022. High type II adaptation risk was also found in viruses from some regions of North America, Asia, Europe, and Oceania, based on test data from the National Center for Biotechnology Information (NCBI), consistent with the endemic regions over the past decades [52]. Additionally, there is a similar adaptation shift from clade Ia to Ib and from clade IIa to IIb B/IIb C, consistent with the evolution of the adaptation pattern [53,54].

However, there remains room to improve the vBERT embedder and the GIVAL prediction framework. Establishing a language model requires a sufficiently rich and balanced virus protein data space; however, the unbalanced availability and distribution of virus samples has limited the effectiveness of vBERT embedding, although statistical sampling was conducted for vBERT to balance the number of positive and negative samples [55]. In our study, the dataset for vBERT pretraining was also statistically downsampled based on biological knowledge; oversampling was not performed, to prevent data distortion. It is therefore important to further optimize the balance of sequence numbers without affecting the dataset's structure. Reasonable labeling is the key to training a supervised prediction model. A semi-supervised model utilizes unlabeled data more effectively by learning the probability distribution and classes [56]. In our study, the semi-supervised framework overcomes the label insufficiency of recorded sequences in public databases and is thus more applicable to phenotype prediction tasks based on sequences with insufficient labels. However, owing to the lack of information on virus families and viral protein names for some recorded samples, the mapped dataset was extracted based on homology between sequences, which may lower the prediction accuracy for some virus families. The imbalance in the numbers of various viruses has also led to imperfect flexible labeling, making some phenotypes difficult to predict. Overall, how to overcome these limitations to achieve more accurate predictions on richer tasks remains a vital issue to be explored.

In summary, GIVAL provides a framework for automatic sample labeling, data sampling, and model training to obtain integrated adaptation outputs upon an “X” viral protein input. GIVAL can not only assess the adaptation risk of currently prevalent viruses, such as IAVs and monkeypox viruses, but also predict possible high-risk variations of the virus in concern, such as bovine H5N1 viruses.

Materials and Methods

Pipeline of GIVAL for viral protein embedding and adaptation prediction

Viral protein sequences from 42 families were parsed, tokenized using HMM, and segmented for pretraining the viral protein embedder, vBERT. The optimized vBERT was selected through parameter optimization and benchmarking against other pre-trained models. Leveraging vBERT embeddings, the semi-supervised GIVAL framework was developed for virus and protein identification, dataset label optimization, and adaptive risk prediction, with risk quantification performed using a timely trained ResNet model.

Preparation of viral protein sequences

Sequences of viruses downloaded from NCBI (https://www.ncbi.nlm.nih.gov/nuccore/), the Global Initiative on Sharing All Influenza Data (GISAID) (https://gisaid.org/), and the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) (https://www.bv-brc.org/) were cleaned, and protein sequences (with more than 50 amino acids) were extracted. Sequence deduplication was performed to create the whole dataset (1.86 million protein sequences from 42 families). Strain names and other annotations were extracted using Python scripts. Sampling was conducted with different methods for different types of virus sequences in the deduplicated whole dataset. Protein sequences from the faa file were first sampled to ensure that the number of sequences in each family did not exceed 6,000. The sequences from the gb and faa files were then sampled to 147,850 based on the homology between the sequences. The SARS-CoV-2 sequences were sampled to 20,000 according to the Pango lineage. Host-based sampling was conducted on IAV sequences to ensure that the number of sequences from each host did not exceed 6,193, and IAV sequences were sampled to 19,999. The 9,213 non-SARS-CoV-2 CoV sequences were not sampled, to balance the number of sequences from different types of viruses. The sampled dataset (197,062 sequences) was utilized for the HMM tokenizer, and 100,000 sequences were further randomly sampled for vBERT pretraining. Additionally, a simulated dataset was created from the sampled dataset of 100,000 sequences with 50,000 sampled IAV HA sequences added. The Simpson index and normalized Simpson index were selected to analyze the diversity of sequences in the datasets. The indexes were calculated using Eqs. 1 and 2:

$$\text{Simpson index} = 1 - \sum_{i=1}^{S} p_i^2 \tag{1}$$

$$\text{Normalized Simpson index} = \frac{1 - \sum_{i=1}^{S} p_i^2}{1 - \frac{1}{S}} \tag{2}$$

In Eqs. 1 and 2, $p_i$ represents the frequency of the $i$th type of sequence, and $S$ represents the number of types.
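A minimal sketch of Eqs. 1 and 2 in Python (the function names and the example counts are illustrative, not from the original pipeline):

```python
import numpy as np

def simpson_index(counts):
    """Simpson diversity index: 1 minus the sum of squared type frequencies (Eq. 1)."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def normalized_simpson_index(counts):
    """Simpson index scaled by its maximum value 1 - 1/S for S types (Eq. 2)."""
    s = len(counts)
    return simpson_index(counts) / (1.0 - 1.0 / s)

# Hypothetical example: 4 sequence types with unequal sampling depth
counts = [6000, 3000, 800, 200]
```

A perfectly even dataset (all types equally frequent) reaches the maximum Simpson index of $1 - 1/S$, so its normalized index is exactly 1, which is why the normalized form is useful for comparing datasets with different numbers of types.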

Pretraining and benchmarking of vBERT for viral protein embedding

Tokenization and segmentation of viral proteins

Flexible tokenization was performed with the HMM, while specified tokenization with a fixed number of AAs per token and BPE tokenization were performed as controls. The HMM and BPE tokenizers (the latter trained with the Python package tokenizers.trainers.BpeTrainer) were trained on the HMM dataset.

To establish the HMM, the Python package jieba was downloaded from GitHub (https://github.com/fxsjy/jieba/). Sequences were first tokenized with 1 to 5 AAs as tokens, and vocabulary selection was performed. Based on the selected vocabulary list, the sequences were then tokenized without the HMM. The statuses of AAs can be divided into 4 categories, namely, beginning (B), ending (E), middle (M), and single (S). For tokens spanning multiple AAs, B denoted the first position, E the last position, and M all intermediate positions; for single-AA tokens, the position was labeled S. On this basis, the starting, emission, and transition probabilities of the 20 types of amino acids and one unknown character were calculated to establish the HMM using Eqs. 3 to 5:

$$\text{Starting probability:}\quad P_k = P\left(X_1 = S_k\right) \tag{3}$$

$$\text{Emission probability:}\quad P_k(b) = P\left(y_i = b \mid X_i = S_k\right) \tag{4}$$

$$\text{Transition probability:}\quad P_{kl} = P\left(X_{t+1} = S_l \mid X_t = S_k\right) \tag{5}$$

In Eqs. 3 to 5, $X_i$ represents the status of the $i$th AA, $S_i$ represents the $i$th status, $y_i$ represents the $i$th AA, and $b$ represents the AA of interest. Using these probabilities, the Viterbi algorithm of the HMM determined each token boundary. Viral proteins were tokenized with the HMM and segmented from random start locations based on the maximum token limit of vBERT. The tokenized vocabularies were deduplicated to create a pretraining vocabulary list.
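The B/E/M/S labeling and the count-based probability estimates of Eqs. 3 to 5 can be sketched as follows. This is a simplified illustration, not the study's implementation: the study used the jieba package, whose Viterbi decoder then consumes tables of exactly this shape to place token boundaries.

```python
from collections import Counter, defaultdict

def bies_tags(tokens):
    """Label each AA of a tokenized sequence with B/M/E/S status."""
    tags = []
    for tok in tokens:
        if len(tok) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(tok) - 2) + ["E"])
    return tags

def estimate_hmm(tokenized_corpus):
    """Count-based starting (Eq. 3), emission (Eq. 4), and transition (Eq. 5)
    probabilities over the 4 statuses, estimated from a tokenized corpus."""
    start, emit, trans = Counter(), defaultdict(Counter), defaultdict(Counter)
    for tokens in tokenized_corpus:
        seq = "".join(tokens)
        tags = bies_tags(tokens)
        start[tags[0]] += 1
        for aa, tag in zip(seq, tags):
            emit[tag][aa] += 1          # P(y_i = b | X_i = S_k)
        for a, b in zip(tags, tags[1:]):
            trans[a][b] += 1            # P(X_{t+1} = S_l | X_t = S_k)

    def norm(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (norm(start),
            {t: norm(c) for t, c in emit.items()},
            {t: norm(c) for t, c in trans.items()})
```

For example, the toy tokenization `["MK", "V", "LLA"]` yields the status path B E S B M E over the sequence MKVLLA.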

To further evaluate the performance of the HMM tokenizer, the vocabulary frequency vectors of each type of RNA or DNA virus were counted, and the connection index (CI) was defined using Eq. 6:

$$CI_{i,j} = 1 - \cos\left(v_i, v_j\right) \tag{6}$$

In Eq. 6, $v_i$ and $v_j$ represent the vocabulary frequency vectors of the $i$th and $j$th RNA/DNA family. The frequency of the 20 AAs and an unknown character in each sequence of the sampled dataset (real dataset) was counted. Virtual sequences, equal in number, were generated based on the AA frequency vectors to form a virtual dataset to further evaluate the HMM performance. More details about the performance evaluation of the HMM can be found in the "Performance evaluation of HMM tokenizer" section in the Supplementary Materials.
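Eq. 6 amounts to one minus the cosine similarity of two vocabulary frequency vectors; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def connection_index(v_i, v_j):
    """CI = 1 - cosine similarity of two vocabulary frequency vectors (Eq. 6)."""
    v_i, v_j = np.asarray(v_i, float), np.asarray(v_j, float)
    cos = (v_i @ v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
    return 1.0 - cos
```

Identical frequency profiles give CI = 0, and profiles with no shared vocabulary (orthogonal vectors) give CI = 1, so a lower CI indicates a stronger vocabulary connection between two families.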

Pretraining of vBERT embedder and parameter optimization

To pretrain the vBERT model for viral protein embedding, a BERT-base configuration (12 encoder layers, 12 attention heads, 163M parameters, and 768 hidden size) was chosen, with a training batch size of 32, a learning rate of 2e−4, and AdamWeightDecay optimization with a β1 of 0.900 and a β2 of 0.999. The total number of tokens in the sampled dataset with 100,000 sequences and the whole dataset were 37 million and 1 billion, respectively. Sequences were segmented into segments with fewer than 256 tokens. vBERT pretraining code was from the original BERT model (https://github.com/google-research/bert/), using the [CLS] token and summation of token, segment, and position embeddings, with “max position embeddings” set to 512.
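As a rough consistency check, the reported 163M parameters match standard BERT-base shapes combined with a large HMM token vocabulary. The back-of-envelope count below assumes a vocabulary of roughly 100,000 tokens, which is an inferred illustration (not a figure stated in the paper), and omits the MLM head:

```python
def bert_param_count(vocab_size, hidden=768, layers=12, max_pos=512,
                     intermediate=3072, type_vocab=2):
    """Approximate parameter count of a BERT encoder with a pooler layer."""
    # Token + position + segment embeddings, plus the embedding LayerNorm
    emb = (vocab_size + max_pos + type_vocab) * hidden + 2 * hidden
    # Q, K, V, and output projections (weights + biases)
    attn = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections (weights + biases)
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    # Each layer also carries two LayerNorms (gain + bias)
    layer = attn + ffn + 2 * (2 * hidden)
    pooler = hidden * hidden + hidden
    return emb + layers * layer + pooler
```

With the canonical 30,522-token WordPiece vocabulary this formula reproduces the well-known ~110M figure for BERT-base; with a ~100,000-token vocabulary it lands near the 163M reported for vBERT, since the embedding table dominates the difference.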

For parameter optimization, vBERT models were pretrained on various datasets (sampled 100,000, whole, simulated) with different tokenization (HMM, BPE, and fixed-AAs [2-AA, 3-AA, and 4-AA]), segmentation (no segmentation, 96-token, and 256-token), learning rates (2e−3, 2e−4, and 2e−5), and training steps (220,000, 300,000, and 380,000) based on BERT. To study the impact of model size, BERT-tiny (2 encoder layers, 2 heads, hidden size 128) and BERT-medium (8 layers, 8 heads, hidden size 512) were selected to pretrain another 2 vBERT models. Fixed-AAs vBERT models were set as ablation to compare with HMM vBERT to quantify the HMM tokenization impact.

Benchmarking of vBERT embedding for viral proteins

For each IAV serotype or SARS-CoV-2 type, 200 sequences (2,000 for HA and 1,000 for Spike RBD) were randomly sampled to form the test dataset. The Transformer [57], ESM-2 [58], DNABERT-2 (https://arxiv.org/abs/2306.15006), and proteinBERT [59] were selected as benchmarking models. The reduced 2 components with t-distributed stochastic neighbor embedding [60] of vBERT-embedded and padded sequences of the 2 datasets were clustered. More details on dimensionality reduction and clustering can be found in the "Dimensionality reduction and clustering of embedded sequences" section in the Supplementary Materials. The embedding performance of each model was evaluated using 5 clustering indexes, namely, silhouette score (silhouette), Calinski-Harabasz score (CH), Davies–Bouldin index (DBI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI). The vBERT model that obtained more optimal indexes was selected. Statistical testing was conducted based on the methods described in the "Statistical testing of significant differences" section in the Supplementary Materials.
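Of the 5 clustering indexes, the silhouette score can be sketched in plain NumPy as below. This toy version is an illustration rather than the evaluation code used in the study; in practice, scikit-learn's `silhouette_score`, `calinski_harabasz_score`, `davies_bouldin_score`, `adjusted_rand_score`, and `normalized_mutual_info_score` cover all 5 indexes.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: per sample, (b - a) / max(a, b), where
    a = mean distance to the sample's own cluster (excluding itself) and
    b = mean distance to the nearest other cluster. Assumes >= 2 clusters."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    # Full pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        same = labels == li
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        b = min(D[i, labels == lj].mean()
                for lj in set(labels.tolist()) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Two tight, well-separated clusters score close to 1; overlapping or misassigned clusters pull the score toward 0 or below.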

The IAV NP dataset (500 human-adaptive and 500 avian-adaptive sequences) was sampled to evaluate the embedding performance at each key site. One key site was selected each time, and token embeddings including the site were extracted from vBERT and Word2Vec. All tokens in the sequences were clustered into 5 categories for each model. The maximum proportion of embeddings of tokens including each selected site in the same cluster based on the embeddings of the 2 models was calculated and compared. Additionally, vBERT embedding performance was evaluated on immune escape assessment and DMS datasets, and more details can be found in the “vBERT embedding-based immune escape analysis of IAV vaccine strains” and “vBERT embedding-based mutational effect analysis of single amino acid” sections in the Supplementary Materials.

Establishment and evaluation of GIVAL to predict virus adaptation

Mapping input “X” sequence to viral protein dataset

All available virus reference sequences (9,231 records) were downloaded from the NCBI database. After cleaning, 98,939 protein sequences were extracted for mapping. The whole dataset (1.86 million sequences) was selected as the sequence dataset for GIVAL prediction, and users can add other virus and reference sequences for subsequent steps. To improve efficiency, a smaller-scale reference dataset (742 sequences) and sequence dataset (26,101 sequences) were also provided.

The input protein sequence segment was mapped to the reference dataset for sequence retrieval using customized BLAST+ (BLAST+ 2.16.0 with the reference dataset as the database), and the mapping results were further verified. The customized BLAST+ based on our reference dataset was selected to keep the consistency between the annotations of the identified viral gene and the viral gene in the sequence dataset, so as to guarantee the correct matching of the dataset related to the input sequence. Details on dataset retrieval can be found in the “Dataset retrieval in GIVAL” section in the Supplementary Materials.

To map with the retrieved protein dataset, if it was already aligned, sequences in the retrieved dataset were cut according to the reference sequence mapping location. Otherwise, the first 20 amino acids of the mapped segment were aligned with each dataset sequence via a sliding window, and the best mapping positions were located by identifying the domain with the lowest Levenshtein distance (LD). The last 20 amino acids were located using Eqs. 7 to 9:

$$l_{e0} = l_s + \text{length}_{\text{segment}} - 1 \tag{7}$$

$$LD_{k,l} = LD\left(\text{end 20 AAs},\ ms_{k,l}\right) \tag{8}$$

$$l_e = \begin{cases} l_{e0}, & \text{if } LD_{l_{e0}-19,\,l_{e0}} < 6 \\ l_{e0} + \underset{-5 \le x \le 5}{\arg\min}\ LD_{l_{e0}-19+x,\,l_{e0}+x}, & \text{else} \end{cases} \tag{9}$$

In Eqs. 7 to 9, $l_s$ and $l_e$ represent the starting and ending sites of the mapped domain on the sequence in the matched dataset, respectively, and $ms_{i,j}$ represents site $i$ to site $j$ of the mapped sequence (sites $i$ and $j$ included).
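The sliding-window anchoring described above can be sketched as follows; Eqs. 7 to 9 apply the same Levenshtein-distance machinery to the last 20 AAs. `locate_start` is a hypothetical helper for the first-20-AA step, not the authors' implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def locate_start(segment, target, window=20):
    """Slide the first `window` AAs of the segment along the target sequence
    and return the 0-based start of the window with the lowest LD."""
    probe = segment[:window]
    dists = [levenshtein(probe, target[k:k + window])
             for k in range(len(target) - window + 1)]
    return min(range(len(dists)), key=dists.__getitem__)
```

Given the located start, Eq. 7 gives the provisional end site, and Eq. 9 keeps it if the last-20-AA distance is below 6, otherwise searching ±5 positions for the best end window.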

Establishment of semi-supervised GIVAL based on vBERT embedding

Semi-supervised learning was conducted for the integrated prediction of the input sequence. Firstly, the optimized labels were obtained via unsupervised learning based on the vBERT embedding. The input sequence and the retrieved dataset were embedded with vBERT. The embedded retrieved dataset was reduced using PCA and clustered to obtain flexible labels. Random sampling was performed to balance the sequence numbers from each flexible clustering label to create the training dataset.

Secondly, the supervised ResNet predictor was trained based on the flexible labels. Each sequence embedding in the training dataset was padded so that the size of the input embedding was a fixed integer multiple of 16 × 768. The embeddings were reshaped into (c, 64, 64) as the ResNet input (c a positive integer). The ResNet model was built with a 64-channel convolutional neural network, a 64-channel max pooling layer, residual blocks (3, 4, 6, and 3 layers with 64, 128, 256, and 512 channels, respectively), average pooling, and a dropout layer. BatchNorm normalization and ReLU activation were selected. The input was converted into (64, 32, 32) after convolution, (64, 16, 16) after max pooling, and (64, 16, 16), (128, 8, 8), (256, 4, 4), and (512, 2, 2) after each residual block, and then (512, 1, 1) after average pooling; the probabilities of flexible labels were predicted and output with a Softmax function. The dimensionality of the output layer in the real-time trained model is dynamically adjusted based on the input, equaling the number of flexible cluster labels present in the training dataset. The dataset was split into 3 training–validation datasets for 3-fold cross-validation. Training was stopped when both the training and validation accuracies exceeded the threshold.
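Our reading of the padding and reshaping step can be sketched as below. The channel layout (each 16 × 768 block of tokens yielding 3 channels of 64 × 64, since 16 × 768 = 3 × 64 × 64) is an interpretation of the description, and `to_resnet_input` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def to_resnet_input(emb, block_tokens=16, hidden=768, side=64):
    """Pad a (n_tokens, hidden) vBERT embedding so that n_tokens is a
    multiple of block_tokens, then reshape it into (c, side, side)
    channels for the ResNet."""
    emb = np.asarray(emb, float)
    pad = (-emb.shape[0]) % block_tokens
    emb = np.pad(emb, ((0, pad), (0, 0)))          # zero-pad trailing tokens
    c = emb.size // (side * side)                  # 16 * 768 = 3 * 64 * 64
    return emb.reshape(c, side, side)

# Hypothetical example: a 200-token embedding pads to 208 tokens -> 39 channels
x = to_resnet_input(np.random.rand(200, 768))
```

Because c varies with sequence length, the ResNet's first convolution must accept a variable channel count, matching the dynamically adjusted model described above.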

Finally, the flexible label of the input sequence was predicted using the timely trained ResNet model, and the adaptation of the input sequence was further statistically inferred based on the predicted flexible label and the raw annotations related to viral adaptation in the training dataset. During the training of the prediction model, we relied solely on sequence distribution-based flexible labels rather than the raw annotations from the dataset. This allows users to analyze different types of adaptation annotations according to their specific needs when utilizing the model. In this study, we primarily examined annotations related to influenza virus host adaptation, CoV receptor-binding specificity, and monkeypox virus pathogenicity and transmissibility in humans. Our model employs a hierarchical approach to determine input sequence adaptation, with the “adaptation threshold” defined as follows. Adaptation labels were determined by the majority of adaptive hosts or type annotations in the predicted flexible cluster for non-influenza viruses. Based on the FC vector, the adaptation risk was further quantified using Eqs. 10 to 13:

$$CD\left(FC_x, \text{type } i\right) = \min_{m_i} \frac{1}{n_{m_i}} \sum_{j=1}^{n_{m_i}} \left(1 - \cos\left(FC_x, FC_{y_j}\right)\right) \tag{10}$$

$$CD_{\text{score}}\left(FC_x, \text{type } i\right) = 1 - \frac{CD\left(FC_x, \text{type } i\right)}{\sum_i CD\left(FC_x, \text{type } i\right)} \tag{11}$$

$$\text{Adaptation score}_i = \text{Softmax}\left(CD_{\text{score}}\left(FC_x, \text{type } i\right)\right), \quad \text{if Adaptation score}_{\text{predicted label}} = \text{maximum score} \tag{12}$$

$$\text{Adaptation score}_i = \begin{cases} \frac{1}{n} + 0.0001, & \text{if } i = \text{predicted label} \\ \frac{1 - \left(\frac{1}{n} + 0.0001\right)}{n - 1}, & \text{else} \end{cases} \tag{13}$$

In Eqs. 10 to 13, $m_i$ represents a flexible label with type $i$ adaptation, $n_{m_i}$ represents the sample number of flexible label $m_i$, $x$ represents the sample being predicted, $y_j$ represents a sample in the training set with flexible label $m_i$, and $n$ represents the number of adaptation labels.
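Eqs. 10 to 12 can be sketched as follows. The Eq. 13 fallback (applied when the predicted label does not attain the maximal score) is omitted for brevity, and both `adaptation_scores` and the toy FC-vector representation are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def cos_dist(u, v):
    """Cosine distance: 1 - cosine similarity."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def adaptation_scores(fc_x, clusters_by_type):
    """Eqs. 10 to 12: for each adaptation type, CD is the minimum (over that
    type's flexible clusters) of the mean cosine distance between the input's
    FC vector and the cluster's samples; CD_score rescales CD, and a softmax
    over CD_score yields the adaptation scores."""
    types = sorted(clusters_by_type)
    cd = np.array([min(np.mean([cos_dist(fc_x, y) for y in cluster])
                       for cluster in clusters_by_type[t])
                   for t in types])
    cd_score = 1.0 - cd / cd.sum()
    e = np.exp(cd_score - cd_score.max())          # numerically stable softmax
    return dict(zip(types, e / e.sum()))

# Toy example: the input lies close to the "human"-type cluster
fc_x = np.array([1.0, 0.0])
clusters = {"human": [[np.array([1.0, 0.0]), np.array([0.9, 0.1])]],
            "avian": [[np.array([0.0, 1.0]), np.array([0.1, 0.9])]]}
scores = adaptation_scores(fc_x, clusters)
```

A smaller cosine distance to a type's nearest flexible cluster translates into a larger CD_score, and hence a larger softmax adaptation score for that type.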

Given the high diversity of influenza virus subtypes from each adaptive host, the adaptation labels of the input were inferred according to the following principles if the virus was identified as influenza virus. If the majority host count in the predicted flexible cluster was at least 3 times greater than that of any other host label, the adaptation label was inferred using the above-mentioned method; otherwise, the host adaptation label was determined based on Eqs. 14 and 15:

$$CD\left(FC_x, \text{type } i\right) = \min_{k \in K_i} \left(1 - \cos\left(FC_x, FC_k\right)\right) \tag{14}$$

$$\text{Adaptation label}_x = \underset{i}{\arg\min}\ CD\left(FC_x, \text{type } i\right) \tag{15}$$

In Eqs. 14 and 15, $K_i$ represents the sequences with adaptation label $i$ in the predicted flexible cluster of $x$, and $\text{Adaptation label}_x$ represents the inferred adaptation label of the input.

Performance evaluation of GIVAL for clustering and classification

Firstly, the mapping performance was evaluated and benchmarked. A test dataset for mapping was created by randomly extracting 500 segments with a length of 150 to 300 AAs from the sequence dataset for GIVAL prediction. The customized BLAST+ was benchmarked with Diamond [61] on the test dataset based on the mapping accuracy. The IAV HA RBD segment (151 AAs from sites 88 to 238, excluding the signal peptide of NC_007362.1) and SARS-CoV-2 Spike RBD segment (207 AAs from sites 344 to 550 of NC_045512.2) were mapped using the customized BLAST+ and Diamond, and the results were compared by extracting the mapping identity of the input segment and the mapped reference sequences. Notably, all sites on HA sequences in this study were renumbered by offsetting the 12-residue signal peptide to maintain consistent positional references.

Secondly, complete HA sequences were selected to establish GIVAL for flexible labeling and GIVAL performance evaluation. Flexible and specified labeling based on host labels were conducted for ablation to quantify the contribution of flexible labeling. Human-adaptive H3 sequences highly homologous to clade 3c.3a [62,63] and avian-adaptive H5 sequences were extracted from the training dataset (2,343 sequences) to create an independent validation set (657 sequences). The silhouettes of flexible and specified clusters were calculated and compared. To further evaluate the impact of flexible labeling on prediction, while training models based on 2 types of labels, 5%, 10%, 15%, and 20% of the human-adaptive H3 sequences in the training dataset were mislabeled as avian-adaptive to train the prediction models. The performance of the models on the independent validation dataset was compared using 5 indexes, namely, Accuracy (ACC), Precision of human-adaptive (Pre_P) and avian-adaptive (Pre_N) cluster, and Recall of human-adaptive (TPR) and avian-adaptive (TNR). The indexes were calculated using Eqs. 16 to 20:

$$\text{ACC} = \frac{TN + TP}{TN + TP + FN + FP} \tag{16}$$

$$\text{Pre\_P} = \frac{TP}{TP + FP} \tag{17}$$

$$\text{Pre\_N} = \frac{TN}{TN + FN} \tag{18}$$

$$\text{TPR} = \frac{TP}{TP + FN} \tag{19}$$

$$\text{TNR} = \frac{TN}{TN + FP} \tag{20}$$
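Eqs. 16 to 20 reduce to standard confusion-matrix arithmetic; a minimal sketch (the function name and example counts are illustrative):

```python
def classification_indexes(tp, fp, tn, fn):
    """Eqs. 16 to 20: accuracy, per-class precision, and per-class recall
    from the counts of a binary confusion matrix."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "Pre_P": tp / (tp + fp),   # precision, human-adaptive cluster
        "Pre_N": tn / (tn + fn),   # precision, avian-adaptive cluster
        "TPR": tp / (tp + fn),     # recall, human-adaptive cluster
        "TNR": tn / (tn + fp),     # recall, avian-adaptive cluster
    }
```

For example, with 90 true positives, 10 false positives, 80 true negatives, and 20 false negatives, ACC is 0.85 and Pre_P is 0.90.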

Thirdly, the HA RBD and Spike RBD segments were respectively input to establish GIVAL for performance evaluation. The prediction accuracy of the 2 models was tested based on confusion matrices and ROC curves with AUC. The flexible labels of the mapped CoV Spike dataset (970 sequences) were verified by constructing maximum-likelihood phylogenetic trees using MEGA and iTOL (https://itol.embl.de/). Based on the CoV Spike RBD model, PDF-2180 (NC_034440.1) and NeoCoV (KC869678.4) were predicted and verified. On this basis, the prediction robustness for extreme sequence lengths was evaluated, and the adaptation risk of IAVs from pandemics and of IBVs was predicted for validation and a cross-species generalization test of GIVAL. Related details can be found in the "Prediction robustness evaluation of GIVAL for extreme sequence lengths" and "Validation and cross-species generalization test of GIVAL based on IAV from pandemics and IBVs" sections in the Supplementary Materials.

Risk assessment of multiple viruses via vBERT embedding and GIVAL prediction

Workflow for applying GIVAL to real-world viral adaptation scenarios

For an unknown viral protein sequence input, GIVAL follows the pipeline below for viral and gene identification, dataset matching, label optimization, model training, and adaptive risk prediction. Firstly, the reference sequences were utilized to establish the reference database for BLAST+ mapping, and the virus and gene of the input were identified. Secondly, the mapping result was extracted to match the proper dataset for training the predictor. The label of each sequence in the matched dataset was optimized based on vBERT embedding. Thirdly, based on the training dataset with optimized labels, a ResNet model was trained to predict the optimized label of the input. Finally, the adaptation risk was statistically inferred. Based on existing databases, users can add new sequences to both the reference dataset and the sequence dataset as needed. The identified virus and gene, information of the optimized labels, trained predictor, predicted optimized labels, and inferred adaptation risk were automatically saved.

Adaptation risk prediction of multiple viruses

Firstly, based on the trained IAV HA RBD prediction model, the adaptation risk of mammalian IAVs and H5N1 mutations was predicted. The adaptation of HA RBD sequences of mammalian IAVs (5,403 sequences) was predicted and verified using protein structure prediction, Bayesian inference, and experimental validation. More details on protein structure, Bayes inference, mammalian IAV prediction and experimental validation of the predicted adaptation of equine H3N8 IAVs can be found in the “Protein structure prediction, alignment and visualization”, “Bayes inference of adaptation-important amino acid sites”, “Adaptation prediction of IAV HA RBD from mammalian hosts”, and “Binding kinetics analysis of HA1-Glycan interactions by bio-layer interferometry (BLI)” sections in the Supplementary Materials. Based on the key sites and AA distribution obtained by Bayesian analysis, 30,000 H5N1 HA RBD mutations were generated and high-risk mutations were inferred. More details on high-risk IAV H5N1 mutation inference can be found in the “Adaptation prediction and quantify of H5N1 mutations” section in the Supplementary Materials.

Secondly, the monkeypox virus adaptation shift was predicted using GIVAL with 10 proteins. The adaptation shift was predicted using samples from clade I before 2022 and clade II before 2024 as the training set (300 sequences per protein) and clade I after 2022 (including 2022) and clade II in 2024 as the validation set (318 sequences per protein). More details on the prediction of monkeypox virus adaptation shift can be found in the “Prediction of adaptation shift for monkeypox viruses” section in the Supplementary Materials.

Acknowledgments

We gratefully acknowledge all data contributors, i.e., the authors and their originating laboratories responsible for obtaining the specimens, and their submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We appreciate Prof. Dr. Lili Ren from the National Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China for her guidance and support to the study design. We also appreciate Prof. Dr. Yuhai Bi from the Institute of Microbiology, Chinese Academy of Sciences for kindly providing the biotinylated glycan ligands (3′-SLNLN and 6′-SLNLN) for our experimental validation.

Funding: This research was supported by grants from the National Key Research and Development Program of China (grant no. 2022YFC2305005), the State Key Laboratory of Pathogen and Biosecurity (grant no. SKLPBS2408), and the Natural Science Foundation of China (grant no. 32070166).

Author contributions: J.L., T.J., J.W., and S.-S.Z. conceptualized the study. J.L., T.J., J.W., S.-Y.J., and S.-S.Z. designed the study. J.L., J.W., T.J., S.-Y.J., J.-Q.W., and S.Z. contributed to the acquisition of data. S.-Y.J., S.-S.Z., S.Z., Z.Z., Y.T., and W.L. performed data preprocessing and analysis. J.L., S.-Y.J., and J.-Q.W. built the model. J.L., T.J., and J.W. supervised the project. J.L. and S.-Y.J. wrote the manuscript. J.L., T.J., J.W., S.-Y.J., Z.Z., S.-S.Z., Y.T., and W.L. revised the paper. S.Z. conducted the experimental validation.

Competing interests: The authors declare that they have no competing interests.

Data Availability

Code and data are available online (https://github.com/Jamalijama/GIVAL and https://doi.org/10.5281/zenodo.16566992) or upon request (J.L., lj-pbs@163.com).

Supplementary Materials

Supplementary 1

Supplementary Methods

Figs. S1 to S24

Tables S1 to S5


  • 24.Takata MA, Goncalves-Carneiro D, Zang TM, Soll SJ, York A, Blanco-Melo D, Bieniasz PD. CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature. 2017;550(7674):124–127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Atkinson NJ, Witteveldt J, Evans DJ, Simmonds P. The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication. Nucleic Acids Res. 2014;42(7):4527–4545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Witteveldt J, Martin-Gans M, Simmonds P. Enhancement of the replication of hepatitis C virus replicons of genotypes 1 to 4 by manipulation of CpG and UpA dinucleotide frequencies and use of cell lines expressing SECL14L2 for antiviral resistance testing. Antimicrob Agents Ch. 2016;60(5):2981–2992. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Teufel F, Almagro AJ, Johansen AR, Gislason MH, Pihl SI, Tsirigos KD, Winther O, Brunak S, Heijne G, Nielsen H. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol. 2022;40(7):1023–1025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Lupo U, Sgarbossa D, Bitbol AF. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat Commun. 2022;13(1):6298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Odum MT, Teufel F, Thumuluri V, Almagro AJ, Johansen AR, Winther O, Nielsen H. DeepLoc 2.1: Multi-label membrane protein type prediction using protein language models. Nucleic Acids Res. 2024;52(W1):W215–W220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Martin MS, Jacob-Dolan JW, Pham V, Sjoblom NM, Scheck RA. The chemical language of protein glycation. Nat Chem Biol. 2024;21(3):324–336. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Hie B, Zhong ED, Berger B, Bryson B. Learning the language of viral evolution and escape. Science. 2021;371(6526):284–288. [DOI] [PubMed] [Google Scholar]
  • 33.Marquet C, Heinzinger M, Olenyi T, Dallago C, Erckert K, Bernhofer M, Nechaev D, Rost B. Embeddings from protein language models predict conservation and variant effects. Hum Genet. 2021;141(10):1629–1647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Lamy-Besnier Q, Brancotte B, Menager H, Debarbieux L. Viral Host Range database, an online tool for recording, analyzing and disseminating virus-host interactions. Bioinformatics. 2021;37(17):2798–2801. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Young F, Rogers S, Robertson DL. Predicting host taxonomic information from viral genomes: A comparison of feature representations. PLOS Comput Biol. 2020;16(5): Article e1007894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Li H, Sun F. Comparative studies of alignment, alignment-free and SVM based approaches for predicting the hosts of viruses based on viral sequences. Sci Rep. 2018;8(1):10032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Pinto RM, Bakshi S, Lytras S, Zakaria MK, Swingler S, Worrell JC, Herder V, Hargrave KE, Varjak M, Cameron-Ruiz N, et al. BTN3A3 evasion promotes the zoonotic potential of influenza A viruses. Nature. 2023;619(7969):338–347. [DOI] [PubMed] [Google Scholar]
  • 38.Ma X, Liang J, Zhu G, Bhoria P, Shoara AA, Mackeigan DT, Khoury CJ, Slavkovic S, Lin L, Karakas D, et al. SARS-CoV-2 RBD and its variants can induce platelet activation and clearance: Implications for antibody therapy and vaccinations against COVID-19. Research-China. 2023;6: Article 0124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Letko M, Miazgowicz K, Mcminn R, Seifert SN, Sola I, Enjuanes L, Carmody A, Doremalen N, Munster V. Adaptive evolution of MERS-CoV to species variation in DPP4. Cell Rep. 2018;24(7):1730–1737. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Cui T, Theuns S, Xie J, Van den Broeck W, Nauwynck HJ. Role of porcine aminopeptidase N and sialic acids in porcine coronavirus infections in primary porcine enterocytes. Viruses. 2020;12(4):402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Xiong Q, Cao L, Ma C, Tortorici MA, Liu C, Si J, Liu P, Gu M, Walls AC, Wang C, et al. Close relatives of MERS-CoV in bats use ACE2 as their functional receptors. Nature. 2022;612(7941):748–757. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Shu Y, Mccauley J. GISAID: Global initiative on sharing all influenza data—From vision to reality. Eurosurveillance. 2017;22(13): Article 30494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Eisfeld AJ, Biswas A, Guan L, Gu C, Maemura T, Trifkovic S, Wang T, Babujee L, Dahn R, Halfmann PJ, et al. Pathogenicity and transmissibility of bovine H5N1 influenza virus. Nature. 2024;633(8029):426–432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Mostafa A, Naguib MM, Nogales A, Barre RS, Stewart JP, Garcia-Sastre A, Martinez-Sobrido L. Avian influenza A (H5N1) virus in dairy cattle: Origin, evolution, and cross-species transmission. Mbio. 2024;15(12): Article e0254224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Wang G, Dos ABL, Stadlbauer D, Ramos I, Bermudez GM, He J, Ding Y, Wei Z, Ouyang K, Huang W, et al. Characterization of swine-origin H1N1 canine influenza viruses. Emerg Microbes Infec. 2019;8(1):1017–1026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Cong Y, Sun Y, Wang W, Meng Q, Ran W, Zhu L, Yang G, Yang W, Yang L, Wang C, et al. Comparative analysis of receptor-binding specificity and pathogenicity in natural reassortant and non-reassortant H3N2 swine influenza virus. Vet Microbiol. 2014;168(1):105–115. [DOI] [PubMed] [Google Scholar]
  • 47.Baz M, Paskel M, Matsuoka Y, Zengel J, Cheng X, Treanor JJ, Jin H, Subbarao K. A live attenuated equine H3N8 influenza vaccine is highly immunogenic and efficacious in mice and ferrets. J Virol. 2015;89(3):1652–1659. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Caserta LC, Frye EA, Butt SL, Laverack M, Nooruzzaman M, Covaleda LM, Thompson AC, Koscielny MP, Cronk B, Johnson A, et al. Spillover of highly pathogenic avian influenza H5N1 virus to dairy cattle. Nature. 2024;634(8034):669–676. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Song H, Hao T, Han P, Wang H, Zhang X, Li X, Wang Y, Chen J, Li Y, Jin X, et al. Receptor binding, structure, and tissue tropism of cattle-infecting H5N1 avian influenza virus hemagglutinin. Cell. 2025;188(4):919–929. [DOI] [PubMed] [Google Scholar]
  • 50.Elsayed S, Bondy L, Hanage WP. Monkeypox virus infections in humans. Clin Microbiol Rev. 2022;35(4): Article e0009222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Monzon S, Varona S, Negredo A, Vidal-Freire S, Patino-Galindo JA, Ferressini-Gerpe N, Zaballos A, Orviz E, Ayerdi O, Munoz-Gomez A, et al. Monkeypox virus genomic accordion strategies. Nat Commun. 2024;15(1):3059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Patino LH, Guerra S, Munoz M, Luna N, Farrugia K, Guchte A, Khalil Z, Gonzalez-Reiche AS, Hernandez MM, Banu R, et al. Phylogenetic landscape of monkeypox virus (MPV) during the early outbreak in New York City, 2022. Emerg Microbes Infec. 2023;12(1): Article e2192830. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Yu J, Zhang X, Liu J, Xiang L, Huang S, Xie X, Fang L, Lin Y, Zhang M, Wang L, et al. Phylogeny and molecular evolution of the first local monkeypox virus cluster in Guangdong Province, China. Nat Commun. 2023;14(1):8241. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Zhang S, Wang F, Peng Y, Gong X, Fan G, Lin Y, Yang L, Shen L, Niu S, Liu J, et al. Evolutionary trajectory and characteristics of Mpox virus in 2023 based on a large-scale genomic surveillance in Shenzhen, China. Nat Commun. 2024;15(1):7452. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Hou X, He Y, Fang P, Mei SQ, Xu Z, Wu WC, Tian JH, Zhang S, Zeng ZY, Gou QY, et al. Using artificial intelligence to document the hidden RNA virosphere. Cell. 2024;187(24):6929–6942.e16. [DOI] [PubMed] [Google Scholar]
  • 56.Wei X, Qiu Y, Ma Z, Hong X, Gong Y. Semi-supervised crowd counting via multiple representation learning. IEEE Trans Image Process. 2023;32:5220–5230. [DOI] [PubMed] [Google Scholar]
  • 57.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser A, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach (CA): Curran Associates Inc.; 2017.
  • 58.Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–1130. [DOI] [PubMed] [Google Scholar]
  • 59.Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–2110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Maaten LVD, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(86):2579–2605. [Google Scholar]
  • 61.Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–368. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.World Health Organization. Recommended composition of influenza virus vaccines for use in the 2019-2020 northern hemisphere influenza season. 21 Feb 2019. https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2019-2020-northern-hemisphere-influenza-season
  • 63.Kim JI, Lee I, Park S, Bae JY, Yoo K, Cheong HJ, Noh JY, Hong KW, Lemey P, Vrancken B, et al. Phylogenetic relationships of the HA and NA genes between vaccine and seasonal influenza A(H3N2) strains in Korea. PLOS ONE. 2017;12(3): Article e0172059. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data
Supplementary Materials

Supplementary 1

Supplementary Methods

Figs. S1 to S24

Tables S1 to S5

research.0871.f1.pdf (5.3MB, pdf)

Data Availability Statement

Code and data are available online at https://github.com/Jamalijama/GIVAL and https://doi.org/10.5281/zenodo.16566992, or upon request (J.L., lj-pbs@163.com).


Articles from Research are provided here courtesy of American Association for the Advancement of Science (AAAS) and Science and Technology Review Publishing House