Skip to main content
Advanced Science logoLink to Advanced Science
. 2024 Oct 14;11(45):2405058. doi: 10.1002/advs.202405058

Machine Learning Early Detection of SARS‐CoV‐2 High‐Risk Variants

Lun Li 1,2, Cuiping Li 1,2, Na Li 1,2, Dong Zou 1,2, Wenming Zhao 1,2,3,4, Hong Luo 1,2, Yongbiao Xue 1,2,4,, Zhang Zhang 1,2,3,4,, Yiming Bao 1,2,3,4,, Shuhui Song 1,2,3,4,
PMCID: PMC11615786  PMID: 39401400

Abstract

The severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) has evolved many high‐risk variants, resulting in repeated COVID‐19 waves over the past years. Therefore, accurate early warning of high‐risk variants is vital for epidemic prevention and control. However, detecting high‐risk variants through experimental and epidemiological research is time‐consuming and often lags behind the emergence and spread of these variants. In this study, HiRisk‐Detector a machine learning algorithm based on haplotype network, is developed for computationally early detecting high‐risk SARS‐CoV‐2 variants. Leveraging over 7.6 million high‐quality and complete SARS‐CoV‐2 genomes and metadata, the effectiveness, robustness, and generalizability of HiRisk‐Detector are validated. First, HiRisk‐Detector is evaluated on actual empirical data, successfully detecting all 13 high‐risk variants, preceding World Health Organization announcements by 27 days on average. Second, its robustness is tested by reducing sequencing intensity to one‐fourth, noting only a minimal delay of 3.8 days, demonstrating its effectiveness. Third, HiRisk‐Detector is applied to detect risks among SARS‐CoV‐2 Omicron variant sub‐lineages, confirming its broad applicability and high ROC‐AUC and PR‐AUC performance. Overall, HiRisk‐Detector features powerful capacity for early detection of high‐risk variants, bearing great utility for any public emergency caused by infectious diseases or viruses.

Keywords: haplotype network, high‐risk variant, machine learning, pre‐warning, SARS‐CoV‐2


This study first validates a correlation between haplotype network features and the risk levels of SARS‐CoV‐2 variants. Building on this, HiRisk‐Detector, a machine learning algorithm, is developed for the early detection of high‐risk variants. The effectiveness, robustness, and generalizability of HiRisk‐Detector are confirmed using over 7.6 million SARS‐CoV‐2 genomes.

graphic file with name ADVS-11-2405058-g004.jpg

1. Introduction

The COVID‐19 pandemic was characterized by recurrent waves of cases, driven by the emergence of multiple high‐risk SARS‐CoV‐2 variants with increased capacity for transmission and evasion of existing immunity by infection or vaccine.[ 1 ] Therefore, rapid and early detection of such high‐risk variants is critical to guide outbreak response.

To effectively address this challenge, the global health community has developed several surveillance systems to identify and monitor these variants. Currently, there are several methods for monitoring potential high‐risk variants of SARS‐CoV‐2, which are primarily based on early warning systems established by global public health organizations,[ 2 , 3 ] such as the World Health Organization (WHO), the United States Centers for Disease Control and Prevention, and the European Centers for Disease Control and Prevention. Among them, one representative method is that WHO classifies the potential high‐risk variants of SARS‐CoV‐2 into three categories—variants of concern (VOC), variants of interest (VOI), and variants under monitoring (VUM), with high risk to low risk according to transmissibility, virulence, immune escape, etc. Although this pre‐warning system is highly recognized and widely used, it requires the involvement of the WHO's Technical Advisory Group on SARS‐CoV‐2 Virus Evolution (TAG‐VE) to divide the risk of different variants. In other words, existing methods require extensive human knowledge and intervention and cannot fully automate the risk assessment. Critically, they have a certain degree of pre‐alarm lag, e.g., approximately one month by the WHO method.

On the contrary, several computational methods have been proposed in the past years to predict and analyze these variants based on a variety of molecular features primarily derived from genomes and metadata. These methods are generally classified into two types based on the need for reference sequence: reference‐free[ 4 , 5 ] and reference‐based.[ 6 , 7 , 8 , 9 ] Representative reference‐free methods include k‐mer‐based algorithm[ 4 ] and dinucleotide composition representation (DCR) based algorithm.[ 5 ] Specifically, the k‐mer‐based algorithm represents each SARS‐CoV‐2 amino acid sequence as k‐mer counts, and utilizes One‐class Support Vector Machines to identify VOC/VOI from a variety of variants.[ 4 ] In contrast, the DCR algorithm involves DCR to parse the general human adaptation of RNA viruses and applies a three‐dimensional convolutional neural network to predict the adaptation of SARS‐CoV‐2 variants.[ 5 ] While reference‐free methods offer the flexibility to study unknown genomes without the need for a reference sequence, they face challenges such as a potentially lower resolution in distinguishing highly similar sequences within the same species in the absence of a reference.

Regarding reference‐based methods, they utilize the Wuhan wild‐type variant as the reference sequence, as typified by VarEPS,[ 6 ] spreading mutations‐based algorithm,[ 7 ] and PyR0.[ 8 ] Specifically, VarEPS focuses on analyzes the effects of mutations from the perspectives of genomics and structural biology, and uses these effects to predict the risk of SARS‐CoV‐2 variants based on random forest models. In comparison, the spreading‐mutations‐based algorithm develops a pipeline that predicts which individual amino acid mutations in SARS‐CoV‐2 are likely to become more prevalent in the coming months using logistic regression. By contrast, PyR0 provides a genome‐wide, automated approach for detecting viral lineages with increased fitness using Bayesian model. These methods have extensively examined the significance of features such as mutation frequency, spatial and temporal distribution, and protein structure within risk warning systems; nevertheless, they did not sufficiently make full use of evolutionary information in early warning of high‐risk variants.

Studying the evolutionary history of SARS‐CoV‐2 is crucial for the prevention and control of the outbreak. Evidence has shown that tracking the evolutionary history holds potential value in predicting successful mutations in future variants.[ 10 ] Haplotype network is a common way to illustrate evolutionary history, offering valuable insights not only into the evolutionary relationships among sequences[ 11 ] but also into the distinct transmission characteristics of various sequences.[ 12 , 13 ] Network features in the field of graph theory, such as node degree, betweenness, and network density, have been demonstrated to be valuable for studying transmission dynamics.[ 14 ] Therefore, understanding the role of haplotype network features, and how they are related to virus transmissibility is crucial for performing risk assessment of SARS‐CoV‐2.

In this study, we initially investigated the potential of haplotype network features to aid in the early detection of SARS‐CoV‐2 high‐risk variants. Subsequently, we developed a haplotype network‐based machine learning algorithm, called HiRisk‐Detector. Building on this, we established a web‐based platform that utilizes the HiRisk‐Detector algorithm to detect high‐risk variants weekly in RCoV19 (https://ngdc.cncb.ac.cn/ncov/monitoring/risk)[ 15 ] Based on empirical real data, we show that haplotype networks have great potential for detecting high‐risk variants and that HiRisk‐Detector is capable of timely and accurate detection of SARS‐CoV‐2 potential high‐risk variants with better robustness and generality, demonstrating its great potential for better improving health preparedness against public outbreak caused by any infectious disease.

2. Results

2.1. Correlation Between Haplotype Network Features and the Risk Levels of SARS‐CoV‐2 Variants

Haplotype networks have been demonstrated to be invaluable for tracing the transmission and evolution of SARS‐CoV‐2.[ 16 ] By representing different haplotypes as nodes in the network and their connections as transmission and evolutionary relationships, we can gain insights into the virus's spread and genetic changes. A notable example is the analysis of publicly available genome sequences of SARS‐CoV‐2 from March to May 2021 globally. The resulting haplotype networks for the Delta variant exhibited unique characteristics, such as a rapidly increasing out‐degree of Delta haplotypes and an increase in the number of countries associated with them (Figure 1a). This suggests that features within the haplotype network may aid in identifying high‐risk variants of SARS‐CoV‐2 that pose a threat to public health. Overall, the use of haplotype networks provides a powerful approach to understanding the complex dynamics of SARS‐CoV‐2 transmission and evolution. With the increasing availability of genomic data, it has become feasible to detect high‐risk variants of SARS‐CoV‐2 by examining the network characteristics of haplotypes.

Figure 1.

Figure 1

Haplotype network features of SARS‐CoV‐2 variants. a) A schematic diagram of haplotype network evolution, with obvious changes for some network features, e.g., node size and geographical transmission range. b) Box plot for the distribution of the features between VOC/VOI and others. The p‐values of the Mann‐Whitney test between VOC/VOI and others are labeled as ns: 5e‐2 < p ≤ 1, *: 1e‐2 < p ≤ 5e‐2, **: 1e‐3 < p ≤ 1e‐2, ***: 1e‐4 < p ≤ 1e‐3, ****: p ≤ 1e‐4. c) Haplotypes illustration in two‐dimensional space by t‐SNE (setting perplexity as 50, iteration as 1000, and random state as 42).

To verify the hypothesis that there is a correlation between haplotype network features and the risk levels of SARS‐CoV‐2 variants, we constructed a series of haplotype networks from the standard dataset (Data S1, Supporting Information) at time points with a 7‐day interval ranging from September 5, 2020, to February 26, 2022 (see details of networks in Data S2, Supporting Information). We extracted seven features, including betweenness, out‐degree, depth, number of sequences, geographic information entropy, proportion of new sequences, and weighted depth (see definitions of these features in Supplementary Note S1, Supporting Information), from each haplotype, which has been observed within 7 days (called active haplotype, defined in Definition 3). To demonstrate the potential of haplotype networks for high‐risk variant detection, rather than the special mutations, these seven features were calculated only from haplotype networks without knowing the mutations of each haplotype.

By comparing the distributions of these network features between VOC/VOI and other variants, we calculated p‐values for each feature to assess the statistical significance of differences between high‐ and low‐risk haplotypes (Table S1, Supporting Information). Figure 1b shows that all p‐values are less than or equal to 1e‐4, which strongly implies that there are significant differences between the features of high‐ and low‐risk haplotypes. To visualize the haplotypes with those network features, we generated a random sample of 20 000 active haplotypes and reduced the dimension of features from seven to two by t‐Distributed Stochastic Neighbor Embedding[ 17 ] (t‐SNE, setting perplexity as 50, iteration as 1000, and random state as 42) implemented in scikit‐learn.[ 18 ] The 2D projection of haplotypes reveals clear boundaries between VOC/VOI and other variants for approximately three‐quarters of samples (Figure 1c). Together, our results suggest that these haplotype network features have the potential to help detect high‐risk variants of SARS‐CoV‐2.

2.2. Detection of SARS‐CoV‐2 High‐Risk Variants Based on Haplotype Network Features and Machine Learning Methods

To validate the actual ability of these haplotype network features in identifying high‐risk variants of SARS‐CoV‐2, we employed them along with machine learning methods for retrospective analysis and verification on historical high‐risk variants of SARS‐CoV‐2. We used k‐fold cross‐validation to select the optimal features and machine learning model, and evaluated their performance on the testing set.

2.2.1. Dataset Preparation

We randomly partitioned all active haplotypes into a training set and a testing set, with the former containing 80% of the active haplotypes, and the latter containing the remaining 20%. To ensure robustness, we further randomly split the training set into five groups with the same number of haplotypes for k‐fold cross‐validation.

2.2.2. Feature and Model Selection (k‐Fold Cross‐Validation)

We performed fivefold cross‐validation on the training set to select the optimal machine learning method and features. Specifically, we evaluated the cross‐validation scores for three state‐of‐the‐art machine learning methods, including LightGBM,[ 19 ] random forest,[ 20 ] and logistic regression,[ 21 ] along with each combination of the seven haplotype network features mentioned above. Due to the high transmissibility of high‐risk variants, the high‐risk haplotypes account for ≈ 80.9% (= 310 299 / 383 691) of the total number of haplotypes, after removing duplicates. To reduce the impact of data imbalance, we adopted cost‐sensitive learning[ 22 ] by setting the parameter “class_weight” to “balanced” in scikit‐learn. Note that for each machine learning method, we utilized only its default parameters except for “class_weight”. Figure S1 (Supporting Information) shows the average cross‐validation score of each machine learning method (top 10 combinations of features), showing that the random forest method with five features, including betweenness, out‐degree, depth, geographic information entropy, and weighted depth, achieves the highest cross‐validation score. Therefore, we selected the random forest method and these five features as the hyperparameters for the model.

2.2.3. Performance on Testing Data Set

To assess the final performance, we trained the model on the entire training set using the selected five features and the random forest method and then evaluated its performance on the testing set. Six metrics of performance were evaluated on the testing set, showing that all these metrics, including ROC‐AUC, PR‐AUC, F1 score, accuracy, recall, and precision, exceed 0.96 (Figure 2a). The normalized confusion matrix (Figure 2b) shows that the true positive rate and true negative rate are both greater than 0.96, demonstrating a high level of accuracy in correctly identifying positive and negative instances. Furthermore, the false negative rate and false positive rate are both below 0.04, implying a low rate of incorrectly classifying instances. The violin plot of the risk score shows that the VOC/VOI has a median and average risk score of 1.00 and 0.97, respectively, indicating an excellent performance. In contrast, the other variants have significantly lower scores with median and average risk scores of 0.00 and 0.07, respectively (Figure 2c). This suggests that the model is highly effective in identifying VOC/VOI. To summarize, our study findings indicate that these five haplotype network features are powerful in helping detect high‐risk variants of SARS‐CoV‐2.

Figure 2.

Figure 2

Performance on testing set. The model was retrained on the entire training set, with the selected AI method (random forest) and five features (betweenness, out‐degree, depth, geographic information entropy, and weighted depth); and tested on the testing set. a) shows six metrics including ROC‐AUC, PR‐AUC, F1 score, accuracy, recall, and precision on the testing set. The heights of the bars represent the values of each metric. b) The normalized confusion matrix. c) Risk score of VOC/VOI and the other variants. The risk score is the probability that a test sample is a high‐risk variant, which is predicted by the trained model.

2.3. Performance with Label Noise and Data Incompleteness

There is often a lag between the emergence of a new high‐risk variant and its official designation as a VOC/VOI by organizations, such as WHO and CDC. For example, the average lag time between the designation dates of each existing VOC/VOI and their earliest submission dates is 29.5 days (Table S2, Supporting Information). As a result, in practical applications, it is difficult to assign accurate labels to high‐risk variants in a training set until WHO designs them as VOC/VOI. In addition, strains submitted after the detection time are not available at the time of detection. All these factors prompted us to evaluate the performance of data characterized by label noise and data incompleteness.

Therefore, we developed haplotype network‐based machine learning algorithm, HiRisk‐Detector (illustrated in Figure 3 and described in Experimental Section); and assessed its efficacy on the standard dataset. To evaluate the performance of HiRisk‐Detector, we calculated PR‐AUC, ROC‐AUC, and precision at each surveillance time point after December 18, 2020. Figure 4a shows that all precision values are higher than 0.85 (Table S3, Supporting Information). In addition, we noted a steady increase in PR‐AUC, ROC‐AUC, and precision over time, ultimately approaching 1 due to the expansion of sample size and enhanced accuracy of labels in the training set. We calculated the average risk score of all active haplotypes within each WHO risk. The result shows that the average risk score of VOC is as high as 0.98, while other variants (the variants that are not VOC, VOI, or VUM) have a lower average score of 0.06 (Figure 4b; Figure S2, Supporting Information). Moreover, we observed that the average risk scores of VOC, VOI, VUM, and other variants decreased in turn. These results suggest that HiRisk‐Detector is robust to label noise and data incompleteness. This can be attributed to the inherent tolerance of random forests for outliers and noise, coupled with their reduced susceptibility to overfitting. Figure 4c shows that HiRisk‐Detector successfully identified all 13 VOC/VOI, with an average pre‐warning time of 27 days earlier than WHO's announcement. Furthermore, we calculated the number of high‐quality and complete sequences available for each VOC/VOI before HiRisk‐Detector detected at least one haplotype of this variant as a high‐risk variant (Figure S3, Supporting Information). Our findings indicate that to identify a VOC/VOI as a high‐risk variant, a minimum of 99.9 and 51.0 sequences of this variant are necessary, based on the mean and median values respectively. Taken together, these results indicate that HiRisk‐Detector is capable of delivering earlier warnings for high‐risk variants with high precision.

Figure 3.

Figure 3

Schematic illustration for HiRisk‐Detector algorithm. The process began by establishing surveillance time points every 7 days starting from September 5, 2020, to monitor the evolution and risk of SARS‐CoV‐2. The initial phase involved training a model using active haplotypes from networks constructed between September 5, 2020, and December 19, 2020, shortly after the WHO designated the first VOC/VOI. Subsequently, for each time point after December 18, 2020, a haplotype network was formed from all available sequences, from which key features were extracted to predict risk scores for each haplotype. Haplotypes with scores over 0.5 were classified as high‐risk. Additionally, if the WHO identified new variants between two consecutive surveillance points, the model was retrained with data from all networks constructed up to the current time point.

Figure 4.

Figure 4

Results of retrospective detection with data incompleteness and label noise on the standard dataset. HiRisk‐Detector was tested on the standard dataset containing 4396290 strains with submission data no later than February 28, 2022, to evaluate performance under data incompleteness and label noise. a) PR‐AUC, ROC‐AUC, and precision on each week. b) HiRisk‐Detector of each WHO risk, including VOC, VOI, VUM, and the other variants. The bars' heights represent each WHO risk's average risk score. The error bars represent the standard deviation. c) shows the earliest date that a VOC/VOI was detected as a high‐risk variant by HiRisk‐Detector, the earliest date that the VOC/VOI was submitted to public databases, and the date that the VOC/VOI was designed as VOC/VOI by WHO. The differences between the earliest date that a VOC/VOI was detected as a high‐risk variant by HiRisk‐Detector and the date that the variant was designed as VOC/VOI by WHO are illustrated on the right side. If the former is earlier than the latter, the bar is colored by Persian green; otherwise, the bar is colored by wild blue yonder.

To enable the visualization of pre‐warning results and the traceability of historical data, we further developed a web‐based platform (https://ngdc.cncb.ac.cn/ncov/monitoring/risk)[ 15 ] that uses the HiRisk‐Detector to update high‐risk variants in RCoV19 on a weekly basis.

2.4. Effect of Sequencing Intensity

Over the past year, the availability of new SARS‐CoV‐2 sequences has declined (Figure S4, Supporting Information), largely due to cost and resource limitations. Large‐scale viral sequencing demands significant resources and funding, and as the pandemic has subsided, many countries have reduced their investments in these activities. Consequently, this reduction in sequencing capacity has led us to assess the performance of HiRisk‐Detector in scenarios where sequencing resources are constrained.

As sequencing intensity decreases, the number of training samples available at each detection time point diminishes, posing significant challenges to the machine learning model's performance due to the limited data available for effective learning. To assess the model's robustness, we simulated various levels of sequencing intensity reduction, mirroring real‐world scenarios where sequencing resources may be constrained. We conducted experiments on the standard dataset with an actual sequencing intensity of 1.01%. To evaluate the performance of the algorithm under reduced sequencing conditions, we downsampled the dataset to intensities of 0.51%, 0.25%, and 0.13%, corresponding to 50%, 25%, and 12.5% of the actual intensity, respectively. Subsequently, we applied the HiRisk‐Detector algorithm as outlined in the previous section to these downsampled datasets. To enhance the reliability of our findings, we repeated each experiment three times. Figure 5 shows that there is no significant change in performances including PR‐AUC, ROC‐AUC, and precision when the sampling rate is reduced to 50%, 25%, and 12.5%. Furthermore, as the sequencing intensity decreases from 1.01% to 0.25%, the average warning time is only delayed by 3.7 days (Figure S5, Table S4, Supporting Information), and precision at each time point remains above 0.85 (Figure 5). These results indicate that HiRisk‐Detector maintains high performance even at lower sequencing intensities, demonstrating its robustness to variations in sequencing intensity.

Figure 5.

Figure 5

Performance at different sampling rates. The first column shows the variation of metrics, including PR‐AUC, ROC‐AUC, and precision, at each sampling rate, such as 50%, 25%, and 12.5%. The second column shows the boxplot of each metric at each sampling rate. The p‐values of the Mann‐Whitney test between each sampling rate are labeled as ns: 5e‐2 < p ≤ 1, *: 1e‐2 < p ≤ 5e‐2, **: 1e‐3 < p ≤ 1e‐2, ***: 1e‐4 < p ≤ 1e‐3, ****: p ≤ 1e‐4. The first, second, and third rows show the results of PR‐AUC, ROC‐AUC, and precision, respectively.

2.5. Generalization of HiRisk‐Detector for Omicron Sub‐Lineages

On November 24, 2021, scientists in Botswana and South Africa discovered the most potent SARS‐CoV‐2 variant, which caused global concern. On November 26, 2021, the WHO's TAG‐VE classified this new variant as a VOC and named it “Omicron”. Since its emergence, Omicron variants have continued to evolve genetically and antigenically, with an expanding range of sub‐lineages. There is consensus among experts in WHO's TAG‐VE that compared to previous variants, Omicron represents the most divergent VOC seen to date. The previous system classified all Omicron sub‐lineages as part of the Omicron VOC, which lacked the necessary granularity to compare new descendent lineages with altered phenotypes to the Omicron parent lineages. Therefore, in mid‐March 2023, WHO updated its tracking system and working definitions for VOC, VOI, and VUM of SARS‐CoV‐2 to better monitor the current variant landscape, which is dominated by Omicron descendent lineages (https://www.who.int/en/activities/tracking‐SARS‐CoV‐2‐variants). Correspondingly, we have also updated HiRisk‐Detector for risk assessment of Omicron sub‐lineages. We changed the dataset to the post‐Omicron dataset and assigned labels on the training set according to the WHO's new tracking system, while other steps remain the same in HiRisk‐Detector (see Experimental Section).

The results show that metrics including PR‐AUC, ROC‐AUC, F1 score, accuracy, recall, and precision are all higher than 0.92 at all testing time points (Figure 6a; Table S5, Supporting Information). The average risk score of VOI and VUM is as high as 0.99, while the average risk score of other variants is only 0.05 (Figure 6b). Additionally, the VOI/VUM (designed no earlier than March 20, 2023) was detected on average 29.8 days earlier than WHO's announcement (Table 1 ).

Figure 6.

Figure 6

Results of retrospective detection with data incompleteness and label noise on the post‐Omicron dataset. a) PR‐AUC, ROC‐AUC, F1 score, accuracy, recall, and precision on each week. b) The risk score of each WHO risk, including VOI, VUM, and the other variants. The bars' heights represent each WHO risk's average risk score. The error bars represent the standard deviation.

Table 1.

Key dates of each VOI/VUM from experiments on the post‐Omicron dataset.

PANGO lineage a) Detection date b) , c) Designed as VOI/VUM c) , d) Earlier than WHO (days) e) Earliest strains c) , f)
XBB.2.3 2023/3/20 2023/5/17 58 2023/1/4
XBB.1.9.2 2023/3/20 2023/4/26 37 2023/1/31
XBB.1.16 2023/4/3 2023/4/17 14 2023/3/21
XBB.1.9.1 2023/3/20 2023/3/30 10 2022/12/13
CH.1.1 g) 2023/3/20 2023/2/8 −40 2022/10/13
XBB.1.5 g) 2023/3/20 2023/1/11 −68 2022/11/14
XBB g) 2023/3/20 2022/10/12 −159 2022/9/12
BA.2.75 g) 2023/4/10 2022/7/6 −278 2022/6/21
a)

Include all VOI/VUM variants defined by WHO's new tracking system, as of July 2023;

b)

The date of a given variant detected as a high‐risk variant by HiRisk‐Detector;

c)

The date format is year/month/day;

d)

The date of a variant designed as VOI/VUM by WHO's new tracking system;

e)

The difference between the 3rd column and the 2nd column;

f)

The date of earliest high‐quality and complete strains based on the submission date extracted from metadata in RCoV19;

g)

Variants that were designed as VOI/VUM before March 2023.

3. Discussion and Conclusion

In this work, we explored the potential of haplotype network to help detect high‐risk variants of SARS‐CoV‐2 and developed a haplotype network‐based machine learning algorithm, HiRisk‐Detector. Subsequently, we established a web‐based platform to weekly update SARS‐CoV‐2 high‐risk variants, providing real‐time information to researchers and public health authorities. Our results indicate that haplotype network features hold great promise for detecting high‐risk variants. HiRisk‐Detector is capable of timely and accurate detection of high‐risk variants, exhibits robustness against label noise, incomplete data, and low sequencing intensity, and can be generalized to assess the risk of Omicron sub‐lineages.

Several previous studies have used mutations as inputs to machine learning models, which may lead to over‐reliance on some key mutations. To reduce the effect of mutations, we calculate seven features solely from the haplotype network without considering which specific mutations are present on each haplotype and edge. Figure S6 (Supporting Information) shows that the precision of the network‐based workflow is much higher than that of the k‐mer‐based workflow proposed in.[ 4 ] Our proposed approach highlights the significant role of population genomic data in detecting high‐risk variants.

The seven features can be divided into two categories: node‐feature and edge‐feature, according to whether a feature is calculated from node or edge. The node‐feature contains three features including geographic information entropy, number of sequences, and proportion of new sequences. In contrast, the edge‐feature contains four aspects including betweenness, out‐degree, depth, and weighted depth. Our analysis revealed that only geographic information entropy is a node‐feature among the five selected features in HiRisk‐Detector, while all others are edge‐features. This suggests that evolutionary relationships play a crucial role in detecting high‐risk variants.

Sequencing intensity plays a crucial role in the early detection and warning of SARS‐CoV‐2 high‐risk variants. However, due to differences in sequencing capabilities and levels, sequencing intensity varies across countries and regions. Additionally, as people's attention toward the pandemic wanes in the post‐pandemic era, the sequencing intensity has decreased, which can result in delays in identifying high‐risk variants. Therefore, it is essential to develop tools that are not affected or minimally affected by changes in sequencing intensity. This will ensure that we continue to monitor and track the evolution of the virus effectively. To tackle the challenges posed by insufficient sampling of rare high‐risk variants resulting from low sequencing intensity, we integrated haplotype networks into the methodologies for detecting high‐risk variants. By incorporating these networks, the model is equipped to discern the underlying evolutionary and transmission dynamics common among high‐risk variants present in the haplotype network. This approach significantly mitigates the effects of incomplete rare high‐risk variant samples during the model training phase, enhancing the robustness and accuracy of the detection process.

The JN.1 variant has maintained its prominence as the predominant variant, as the initial paper completed. Our HiRisk‐Detector has consistently tracked JN.1, assigning it a sustained high‐risk score (>0.8) since November 7, 2023, as documented in the online monitoring resource (https://ngdc.cncb.ac.cn/ncov/monitoring/risk). This serves as yet another compelling illustration of the algorithm's capacity to effectively surveil high‐risk haplotypes.

HiRisk‐Detector has been demonstrated to be a valuable tool in detecting high‐risk variants, but it still has limitations in several aspects. First, acquiring labeled data is essential for the training or retraining models. This implies that HiRisk‐Detector process heavily relies on authoritative tracking systems such as WHO. Second, the time complexity of all known haplotype network construction algorithms, including TCS,[ 23 ] MSN, MJN,[ 24 ] and McAN,[ 25 ] are not lower than O(n 2) in the worst‐case scenario, where n is the number of haplotypes. As a result, the risk detection method based on haplotype network features is relatively time‐consuming in the feature extraction stage. Therefore, future research should focus on reducing the computational time required for constructing haplotype networks. Third, in the current workflow, we have manually chosen seven candidate haplotype network features for training the model, relying on prior work experience and data analysis. However, this approach to feature extraction is somewhat arbitrary and lacks automatic capability. Consequently, employing automatic feature extraction methods such as Graph Convolutional Networks (GCN)[ 26 ] could be advantageous in improving the performance of HiRisk‐Detector. Fourth, the presence of degenerate bases in the genome sequence may arise from mixed infections or intra‐host evolution. Currently, our algorithm focuses on consensus sequences, which may overlook these complexities. To improve the detection of such variations, we plan to refine our algorithm by incorporating a step that extracts non‐consensus sequences from raw next generation sequencing data. This enhancement will allow for risk monitoring at a more detailed and finer granularity.

In summary, it is a new perspective to identify high‐risk variants of SARS‐CoV‐2 using haplotype network features. The high accuracy and robustness of HiRisk‐Detector bear promise to improve public health preparedness against the ongoing evolving virus. HiRisk‐Detector has the potential to be generalized to monitor other rapidly evolving pathogens with sufficient genomic data.

4. Experimental Section

Data Collection and Preparation

A total of 7 662 981 complete and high‐quality SARS‐CoV‐2 genome sequences (Supplementary Note 2, Supporting Information) along with their associated metadata were downloaded from RCoV19 (https://ngdc.cncb.ac.cn/ncov/).[ 27 , 28 ] This includes all available data as of July 27, 2023. To ensure accuracy, it has standardized the metadata and removed any sequences that were missing Pango lineage, submission date, or sampling location information. These data will no longer be used in subsequent analyses.

In order to be suitable for different application scenarios, two datasets were constructed for assessing HiRisk‐Detector. Since February 2022, the Omicron variants of SARS‐CoV‐2 have become the dominant strain, accounting for over 98% of publicly available sequences. Therefore, a dataset (called a standard dataset, Data S1, Supporting Information) was defined as all data collected before February 28, 2022, which includes a total of 4396290 SARS‐CoV‐2 sequences and associated metadata submitted before that date. In addition, to evaluate the performance of HiRisk‐Detector on Omicron sub‐lineages, another dataset (called post‐Omicron dataset, Data S3, Supporting Information) consisting of 4661821 SARS‐CoV‐2 sequences submitted between November 22, 2021, and July 27, 2023 was constructed. Notably, November 22, 2021, was the date of submission of the first Omicron variant sequence included in RCoV19.

In order to assign labels to haplotypes, the alias file of PANGO lineages was downloaded, which included 271 aliases (https://github.com/cov‐lineages/pango‐designation/alias_key.json).[ 29 , 30 , 31 ] Subsequently, the association between PANGO lineage, WHO label, and WHO risk was documented based on the information provided on the official WHO website (https://www.who.int/en/activities/tracking‐SARS‐CoV‐2‐variants).

Haplotype Network Construction

The sequence mutations were identified by comparing the retained sequences against the earliest released SARS‐CoV‐2 genome (GenBank: MN908947.3). To accurately characterize the diversity of haplotypes, degenerate mutations that did not belong to the [A/T/G/C] bases were eliminated. The population mutation frequency (PMF) was then calculated for each mutation within a one‐week time window. Mutations with PMF greater than 0.005 in non‐UTR regions were selected for haplotype network construction.

With the progression of the pandemic, a minority of existing variants were continuously spreading while new variants continued to emerge, leading to the constant evolution of the SARS‐CoV‐2 haplotype network over time. Haplotype network was any graph used to represent evolutionary relationships between a set of haplotypes.[ 32 ] Note that, the mutations in each haplotype were not contained in haplotype network in this work (Definition 1). To track the evolution of SARS‐CoV‐2, a range of equally spaced time points were generated and a series of haplotype networks were constructed using McAN[ 25 ] with default parameters based on the sequences available at these time points.

Definition 1

(Haplotype network) Haplotype network is defined as a graph G  = (V, E) , where V=hi|i=0,1,,m is a collection of haplotypes, hi is the (i + 1)th haplotype, m is the number of haplotypes in haplotype network G, E  = {eij = (hi ,hj )}  is the collection of edges, representing the evolutionary relationships among haplotypes. In graph G, each haplotype hi V has a node weight (the number of strains in the haplotype hi ), denoted as wv (hi ), and each edge eij has an edge weight we (eij ), defined as the number of differential nucleotide mutations between haplotype hi and hj . Haplotype network also contains the temporal and spatial distribution of strains in each haplotype, denoted as pte,hi(date) and psp,hi(location), i=0,1,,m.

More formally, let S be a collection of strains with their mutations, locations, and submission dates. The collection of time points was defined as T=ti|t=t0+Δt·i,i=0,1,,n1, where Δt represents the time interval, n the number of time points, ti the (i + 1)th time point. Haplotype networks were constructed at each time point ti in set T (Definition 2).

Definition 2

(Haplotype network at a given time point) Let S be a collection of strains with their mutations, locations, and submission dates. The haplotype network of set S at a given time point t is a haplotype network constructed from strains in set S such that their submission dates are no later than t.

Specifically, two datasets of haplotype networks were built from the standard dataset and the post‐Omicron dataset. A series of time points with a 7‐day interval ranging from September 5, 2020, to February 26, 2022, was produced and haplotype networks were constructed from standard datasets at those time points (Data S2, Supporting Information). Besides, haplotype networks were constructed from the post‐Omicron dataset at time points with a 7‐day interval ranging from November 22, 2021, to July 24, 2023 (Data S4, Supporting Information).

Feature Extraction

Seven features from the haplotype network, including betweenness, out‐degree, depth, number of sequences, geographic information entropy, the proportion of new sequences, and weighted depth (Supplementary Note 1, Supporting Information) were extracted using igraph,[ 33 ] which is a collection of network analysis tools. It is important to note that these features were calculated based on the following information without considering the specific mutations in each haplotype and edge:

  • The collection of haplotype pairs connected by edges (edges set E)

  • The node weights and edge weights (wh and we )

  • The temporal and spatial distribution of haplotypes (pte,hi(date) and psp,hi(location))

Focusing on the emerging variants within a specific period, features were extracted only from the active haplotypes (Definition 3).

Definition 3

(Active haplotype) Let G be a haplotype network, constructed at a given time point t from S, a collection of strains with their mutations, locations, and submission dates. For a given length of time Δt, an active haplotype h is a haplotype in the haplotype network G, such that there exists at least one strain s in the haplotype h, satisfying tΔttst, where ts is the submission date of strain s.

Label Assignment

To simplify the high‐risk variants detection problem, SARS‐CoV‐2 haplotypes were divided into two risk levels: high‐risk and low‐risk. This differs from the four risk levels (VOC, VOI, VUM, and Others) established by WHO. On the standard dataset, haplotypes belonging to VOC or VOI were assigned as high‐risk variants, while those belonging to VUM and others were assigned as low‐risk variants (Table S1, Supporting Information, explained in Supplementary Note 3, Supporting Information). However, on the post‐Omicron dataset, haplotypes belonging to VOI or VUM were assigned as high‐risk variants, while all other haplotypes were assigned as low‐risk variants (Table S6, Supporting Information). The reason for using different classification criteria for different datasets was that as of July 2023, no SARS‐CoV‐2 variant has been classified as VOC in the new tracking system of WHO.

HiRisk‐Detector for Detecting High‐Risk Variants of SARS‐CoV‐2

HiRisk‐Detector was developed for detecting high‐risk variants of SARS‐CoV‐2, which leveraged the haplotype network features (Figure 3; Figure S7, Supporting Information).

HiRisk‐Detector for Detecting High‐Risk Variants of SARS‐CoV‐2—Test HiRisk‐Detector on the Standard Dataset

First, a series of surveillance time points were generated with a 7‐day interval ranging from September 5, 2020, to February 26, 2022, to regularly surveil the evolution and risk of SARS‐CoV‐2. Second, the initial model was trained on active haplotypes in each haplotype network constructed at surveillance time points from September 5, 2020, to December 19, 2020. December 19, 2020, was the earliest surveillance time point after WHO designed the first VOC/VOI. Third, at each surveillance time point ti after December 18, 2020, a haplotype network was constructed using all sequences available at ti , five selected features were extracted from each active haplotype in the constructed haplotype network at ti and the risk score was predicted for each active haplotype by the trained model. Haplotypes with a risk score higher than 0.5 were defined as high‐risk variants. After that, it was determined whether WHO designed a new VOC/VOI between ti and ti − Δt, where Δt equals to seven days. If so, the model was retrained with all active haplotypes in each haplotype network constructed at each surveillance time point no later than ti . the Python package Statannotations (available at https://pypi.org/project/statannotations/) was utilized to calculate the statistical significance between high‐ and low‐risk haplotypes.

HiRisk‐Detector for Detecting High‐Risk Variants of SARS‐CoV‐2—Test HiRisk‐Detector on Post‐Omicron Dataset

A series of haplotype networks were constructed from the post‐Omicron dataset (Data S3, Supporting Information) at time points with a 7‐day interval ranging from November 22, 2021, to July 24, 2023 (Data S4, Supporting Information). The performance was evaluated on the post‐Omicron dataset, and labels were assigned according to WHO's new tracking system (Table S6, Supporting Information). The time points for detecting high‐risk haplotypes range from March 20, 2023, to July 24, 2023, with a 7‐day interval.

Conflict of Interest

The authors declare no conflict of interest.

Author Contributions

L.L., C.L., and N.L. contributed equally to this work. Conceptualization, bioinformatics analysis: S.H.S., L.L., C.P.L., and N.L.; wrote original draft: S.H.S. and L.L.; performed Web development: D.Z.; reviewed and edited the original manuscript: Y.B.X., Z.Z., Y.M.B., W.M.Z., and H.L.

Supporting information

Supporting Information

Acknowledgements

The authors thank the members of the China National Center for Bioinformation for providing SARS‐CoV‐2 mutation data and metadata, and all contributing experts in SARS‐CoV‐2 sequences. This work was supported by the National Key Research & Development Program of China (2023YFC3041500 to C.P.L.), the Key Collaborative Research Program of the Alliance of National and International·Science Organizations for the Belt·and·Road Regions (ANSO‐CR‐KP‐2022‐09 to S.H.S.), and was sponsored by Beijing Nova Program (Z211100002121006 to L.L.). This work was also supported by the National Key Research & Development Program of China (2021YFF0703703 to S.H.S., 2023YFC2604400 to F.C.), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB38030200 to Y.M.B.), the National Natural Science Foundation of China (32170678 to W.M.Z., 32270718 to S.H.S.), Youth Innovation Promotion Association of CAS (Y2021038 to S.H.S.).

Li L., Li C., Li N., Zou D., Zhao W., Luo H., Xue Y., Zhang Z., Bao Y., Song S., Machine Learning Early Detection of SARS‐CoV‐2 High‐Risk Variants. Adv. Sci. 2024, 11, 2405058. 10.1002/advs.202405058

Contributor Information

Yongbiao Xue, Email: ybxue@big.ac.cn.

Zhang Zhang, Email: zhangzhang@big.ac.cn.

Yiming Bao, Email: baoym@big.ac.cn.

Shuhui Song, Email: songshh@big.ac.cn.

Data Availability Statement

Source code of HiRisk‐Detector is written in Python and available at GitHub (https://github.com/Theory‐Lun/HiRiskPredictor), BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007386), and Zenodo (https://zenodo.org/records/10060683) under the MIT license. Web server of HiRisk‐Detector is available at https://ngdc.cncb.ac.cn/ncov/monitoring/risk. All complete and high‐quality SARS‐CoV‐2 genome sequences and associated metadata were downloaded from RCoV19 (https://ngdc.cncb.ac.cn/ncov/). The number of daily confirmed cases and deaths of COVID‐19 were downloaded from WHO (https://covid19.who.int/data). The alias of PANGO lineages were downloaded from https://github.com/cov‐lineages/pango‐designation/.

References

  • 1. Wang P., Nair M. S., Liu L., Iketani S., Luo Y., Guo Y., Wang M., Yu J., Zhang B., Kwong P. D., Graham B. S., Mascola J. R., Chang J. Y., Yin M. T., Sobieszczyk M., Kyratsous C. A., Shapiro L., Sheng Z., Huang Y., Ho D. D., Nature 2021, 593, 130. [DOI] [PubMed] [Google Scholar]
  • 2. Eurosurveillance editorial team , Eurosurveillance 2021, 26, 2101211.33478621 [Google Scholar]
  • 3. DeGrace M. M., Ghedin E., Frieman M. B., Krammer F., Grifoni A., Alisoltani A., Alter G., Amara R. R., Baric R. S., Barouch D. H., Bloom J. D., Bloyet L.‐M., Bonenfant G., Boon A. C. M., Boritz E. A., Bratt D. L., Bricker T. L., Brown L., Buchser W. J., Carreño J. M., Cohen‐Lavi L., Darling T. L., Davis‐Gardner M. E., Dearlove B. L., Di H., Dittmann M., Doria‐Rose N. A., Douek D. C., Drosten C., Edara V.‐V., et al., Nature 2022, 605, 640.35361968 [Google Scholar]
  • 4. Nicora G., Salemi M., Marini S., Bellazzi R., BMJ Health Care Inform. 2022, 29, 100643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Li J., Wu Y.‐N., Zhang S., Kang X.‐P., Jiang T., Brief. Bioinform. 2022, 23, bbac036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Sun Q., Shu C., Shi W., Luo Y., Fan G., Nie J., Bi Y., Wang Q., Qi J., Lu J., Zhou Y., Shen Z., Meng Z., Zhang X., Yu Z., Gao S., Wu L., Ma J., Hu S., Nucleic Acids Res. 2022, 50, D888. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Maher M. C., Bartha I., Weaver S., di Iulio J., Ferri E., Soriaga L., Lempp F. A., Hie B. L., Bryson B., Berger B., Robertson D. L., Snell G., Corti D., Virgin H. W., Kosakovsky Pond S. L., Telenti A., Sci. Transl. Med. 2022, 14, abk3445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Obermeyer F., Jankowiak M., Barkas N., Schaffner S. F., Pyle J. D., Yurkovetskiy L., Bosso M., Park D. J., Babadi M., MacInnis B. L., Luban J., Sabeti P. C., Lemieux J. E., Science 2022, 376, 1327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Beguir K., Skwark M. J., Fu Y., Pierrot T., Carranza N. L., Laterre A., Kadri I., Korched A., Lowegard A. U., Lui B. G., Sänger B., Liu Y., Poran A., Muik A., Sahin U., Comput. Biol. Med. 2023, 155, 106618. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Harari S., Miller D., Fleishon S., Burstein D., Stern A., Nat. Commun. 2024, 15, 648. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Forster P., Forster L., Renfrew C., Forster M., Proc. Natl. Acad. Sci. USA 2020, 117, 9241. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Azad S., Devi S., J. Travel Med. 2020, 27, taaa130. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Song S., Li C., Kang L., Tian D., Badar N., Ma W., Zhao S., Jiang X., Wang C., Sun Y., Li W., Lei M., Li S., Qi Q., Ikram A., Salman M., Umair M., Shireen H., Batool F., Zhang B., Chen H., Yang Y.‐G., Abbasi A. A., Li M., Xue Y., Bao Y., Genomics Proteomics Bioinf. 2021, 19, 727. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Koutrouli M., Karatzas E., Paez‐Espino D., Pavlopoulos G. A., Front. Bioeng. Biotechnol. 2020, 8, 34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Li C., Ma L., Zou D., Zhang R., Bai X., Li L., Wu G., Huang T., Zhao W., Jin E., Bao Y., Song S., Genomics Proteomics Bioinf. 2023, 21, 1066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Kawano‐Sugaya T., Yatsu K., Sekizuka T., Itokawa K., Hashino M., Tanaka R., Kuroda M., G3 (Bethesda) 2021, 11, jkab126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Van der Maaten L., Hinton G., J. Mach. Learn. Res. 2008, 9, 2579. [Google Scholar]
  • 18. Pedregosa F., J. Mach. Learn. Res. 2011, 12, 2825. [Google Scholar]
  • 19. Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T. Y., Advances in Neural Information Processing Systems, Curran Associates, Inc., Red Hook, New York, 2017. [Google Scholar]
  • 20. Ho T. K., in Proceedings of 3rd international conference on document analysis and recognition , 1995.
  • 21. Cox D. R., J. Royal Stat. Soc. 1958, 20, 215. [Google Scholar]
  • 22. Haixiang G., Yijing L., Shang J., Mingyun G., Yuanyue H., Bing G., Expert Syst. Appl. 2017, 73, 220. [Google Scholar]
  • 23. Templeton A. R., Crandall K. A., Sing C. F., Genetics 1992, 132, 619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Bandelt H. J., Forster P., Röhl A., Mol. Biol. Evol. 1999, 16, 37. [DOI] [PubMed] [Google Scholar]
  • 25. Li L., Xu B., Tian D., Wang A., Zhu J., Li C., Li N., Zhao W., Shi L., Xue Y., Zhang Z., Bao Y., Zhao W., Song S., Brief Bioinform. 2023, 24, bbad174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Kipf T. N., Welling M., arXiv 1609.02907, 2016.
  • 27. Song S., Ma L., Zou D., Tian D., Li C., Zhu J., Chen M., Wang A., Ma Y., Li M., Teng X., Cui Y., Duan G., Zhang M., Jin T., Shi C., Du Z., Zhang Y., Liu C., Li R., Zeng J., Hao L., Jiang S., Chen H., Han D., Xiao J., Zhang Z., Zhao W., Xue Y., Bao Y., Genomics Proteomics Bioinf. 2020, 18, 749. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Zhao W. M., Song S. H., Chen M. L., Zou D., Ma L. N., Ma Y. K., Li R. J., Hao L. L., Li C. P., Tian D. M., Tang B. X., Wang Y. Q., Zhu J. W., Chen H. X., Zhang Z., Xue Y. B., Bao Y. M., Yi Chuan 2020, 42, 212. [DOI] [PubMed] [Google Scholar]
  • 29. O'Toole Á., Scher E., Underwood A., Jackson B., Hill V., McCrone J. T., Colquhoun R., Ruis C., Abu‐Dahab K., Taylor B., Yeats C., du Plessis L., Maloney D., Medd N., Attwood S. W., Aanensen D. M., Holmes E. C., Pybus O. G., Rambaut A., Virus Evol. 2021, 7, veab064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Rambaut A., Holmes E. C., O'Toole Á., Hill V., McCrone J. T., Ruis C., du Plessis L., Pybus O. G., Nat. Microbiol. 2021, 6, 415. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Rambaut A., Holmes E. C., O'Toole Á., Hill V., McCrone J. T., Ruis C., du Plessis L., Pybus O. G., Nat. Microbiol. 2020, 5, 1403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Huson D. H., Rupp R., Scornavacca C., Phylogenetic networks: concepts, algorithms and applications, Cambridge University Press, Cambridge: 2010. [Google Scholar]
  • 33. Csardi G., Nepusz T., InterJournal, Complex Syst. 2006, 1695, 1. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

Data Availability Statement

Source code of HiRisk‐Detector is written in Python and available at GitHub (https://github.com/Theory‐Lun/HiRiskPredictor), BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007386), and Zenodo (https://zenodo.org/records/10060683) under the MIT license. Web server of HiRisk‐Detector is available at https://ngdc.cncb.ac.cn/ncov/monitoring/risk. All complete and high‐quality SARS‐CoV‐2 genome sequences and associated metadata were downloaded from RCoV19 (https://ngdc.cncb.ac.cn/ncov/). The number of daily confirmed cases and deaths of COVID‐19 were downloaded from WHO (https://covid19.who.int/data). The alias of PANGO lineages were downloaded from https://github.com/cov‐lineages/pango‐designation/.


Articles from Advanced Science are provided here courtesy of Wiley

RESOURCES