Highlights
• This review summarized the databases of known emerging infectious viruses and techniques focusing on virus variant forecasting and early warning.
• It included multi-dimensional information integration and database construction of emerging infectious viruses, virus mutation spectrum construction and variant forecast model, analysis of the affinity between mutation antigen and the receptor, propagation model of virus dynamic evolution, monitoring, and early warning for variants.
• This review provided a comprehensive view of the latest virus research and a reference for future virus prevention and control research.
Keywords: Emerging infectious disease, SARS-CoV-2, Multimodal data, Early warning
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has dramatically increased awareness of emerging infectious diseases. Advances in multiomics analysis technology have led to the development of several databases containing virus information. Researchers have integrated existing virus data to construct phylogenetic trees and to predict virus mutation and transmission in different ways, providing prospective technical support for epidemic prevention and control. This review summarizes the databases of known emerging infectious viruses and the techniques used for virus variant forecasting and early warning. It covers five aspects: multi-dimensional information integration and database construction for emerging infectious viruses, virus mutation spectrum construction and variant forecast models, analysis of the affinity between mutant antigens and receptors, propagation models of virus dynamic evolution, and monitoring and early warning for variants. Because COVID-19 and repeated influenza outbreaks have affected so many people, we focus on research on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and influenza viruses. This review provides a comprehensive view of the latest virus research and a reference for future research on virus prevention and control.
1. Introduction
Since its discovery in December 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread to almost all countries. Cumulative global infections have exceeded 700 million, resulting in more than 6 million deaths [1] and significantly affecting economies worldwide. The pandemic burdens medical and health services and other industries and severely threatens people's lives. The Omicron variant, which is more infectious than the Delta strain [2], is spreading worldwide and continues to evolve. Research and clinical data have accumulated on SARS-CoV-2 and on other infectious viruses, such as influenza viruses, as well as severe acute respiratory syndrome coronavirus 1 (SARS-CoV-1) and Middle East respiratory syndrome coronavirus (MERS-CoV), both betacoronaviruses, and Ebola virus, which have very high fatality rates. It is urgent to use these data resources efficiently to predict the future evolution and transmission trends of viruses. At present, many researchers are focusing on different aspects of virus evolution and of transmission dynamics in the population. However, there is no integrated platform that combines multi-dimensional data with evolutionary analysis tools for forecasting and early warning of emerging infectious diseases. Such a platform would be essential for preventing and controlling local epidemics. This study reviews the prominent existing databases and methods so that readers can quickly understand the progress of virus-related research from various perspectives.
2. Methodology
To understand the feasibility of this work, we conducted extensive literature investigations considering the following five aspects: multi-dimensional information integration and database construction of emerging infectious viruses, virus mutation spectrum construction and variant forecast model, analysis of the affinity between mutation antigen and the receptor, propagation model of virus dynamic evolution, and monitoring and early warning for variants (Fig. 1). Keywords specific to these five aspects were used to search Web of Science and PubMed. We screened the search results and retained the articles closely related to the subject of our review as references. We then summarized their core contents and analyzed their significance for virus research and prevention.
Fig. 1.
Forecasting and early warning of virus evolution based on a multimodal database. A) A database integrating big data from multiomics, immunology, epidemiology, and clinical research cohorts. B) The virus mutation spectrum was constructed and improved in time, and the hot spots of mutation were predicted by computer modeling. C) According to the mutation prediction results, the affinity between the mutant antigen and receptor was calculated and analyzed, inferring the viral transmissibility. D) Combining the changes in the future biological characteristics of the virus with various social information, the virus's dynamic evolution and transmission model was established. E) Monitoring and early warning of virus variants with high pathogenicity, which can easily cause large-scale epidemics and immune evasion.
Section 1 is the introduction. Section 2 describes the investigation method and the article structure. Section 3 presents the databases related to emerging infectious viruses, with examples, focusing on the resources that form the basis of virus research. Section 4 describes the current research status of virus lineage construction and mutation prediction. Once viral variants are predicted, their pathogenicity needs to be evaluated, and Section 5 summarizes the current research methods for analyzing the binding of viral variant antigens to human receptors. Section 6 covers research on virus transmission models. Monitoring and early warning of infectious diseases is a crucial step that should draw on the results of the preceding four steps; Section 7 summarizes the existing research on virus monitoring and early warning based on environmental samples and big data. Section 8 concludes the review and discusses the potential of some recent “trendy” technologies for advancing infectious disease research.
3. Database of emerging infectious viruses
COVID-19 is a global pandemic. Since its outbreak, large amounts of multimodal data have emerged, including viral genome data, clinical research information, and relevant epidemiological and immunological data. Researchers have collected these data and established a series of data platforms that give immediate access to the latest data and information on the epidemic and the virus, providing a solid foundation for preventing and controlling the epidemic and developing viral vaccines. Herein, we summarize the research on data storage for SARS-CoV-2 and several kinds of influenza viruses (Table 1).
Table 1.
Multimodal databases of emerging infectious viruses.
| Database | URL | Description | Category | References |
|---|---|---|---|---|
| GISAID | https://www.gisaid.org/ | Virus information (including influenza virus, coronavirus, HIV) | Virus basic information | [3] |
| ViPR | https://www.viprbrc.org/ | Virus information (including coronavirus, herpes virus, flavivirus, poxvirus, filovirus, etc.) | Virus basic information | [4] |
| ENA | https://www.ebi.ac.uk/ena/browser/home | Virus information (including coronavirus, herpes virus, flavivirus, poxvirus, filovirus, etc.) | Virus basic information | / |
| Nextstrain | https://nextstrain.org/ | Virus information (including influenza virus, SARS-CoV-2, dengue virus, Zika virus, Monkeypox, Ebola virus, etc.) | Virus basic information | [5] |
| NCBI Influenza Virus Database | https://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi?go=database | Influenza virus information | Virus basic information | / |
| NCBI SARS-CoV-2 Resources | https://www.ncbi.nlm.nih.gov/sars-cov-2/ | SARS-CoV-2 information | Virus basic information | / |
| UniProt | https://www.uniprot.org/ | Contains virus related proteins information | Virus basic information | [8] |
| AlphaFold | https://alphafold.com/ | Contains structural biological information of virus | Virus basic information | / |
| COVIDomic | https://covidomic.com/ | Multi-omics health data of COVID-19 patients | Clinical data | [6] |
| ClinicalTrials.gov | https://clinicaltrials.gov/ct2/home | Clinical data | Clinical data | / |
| FluReassort | https://www.jianglab.tech/FluReassort | Genomic reassortments of influenza virus | Database for specific studies | [9] |
| EpiGraphDB | https://epigraphdb.org | Contains epidemiological data on diseases caused by various viral infections | Epidemiological data | [11] |
| KGCoV | https://www.biosino.org/kgcov/ | SARS-CoV genome-epidemiological knowledge graph | Epidemiological data | [12] |
| CoV-AbDab | https://opig.stats.ox.ac.uk/webapps/covabdab/ | Information on structures of coronavirus antibodies | Immunology related information | [13] |
| CoV3D | https://cov3d.ibbr.umd.edu/ | Information on structures of coronavirus antibodies | Immunology related information | [14] |
| COVIEdb | https://biopharm.zju.edu.cn/coviedb/ | Information on immune epitopes of coronavirus | Immunology related information | [15] |
| MMDB | https://www.ncbi.nlm.nih.gov/structure/ | Immune-related crystallography structure data | Immunology related information | [16] |
| CORD-19 | https://allenai.org/data/cord-19 | Scientific research literature | Scientific research literature | [17] |
| DrugBank COVID-19 Dashboard | https://go.drugbank.com/covid-19 | Information on drugs to treat COVID-19 | Drug information | [18] |
| DockCoV2 | https://covirus.cc/drugs/ | Information on drugs to treat COVID-19 | Drug information | [19] |
Abbreviations: HIV, human immunodeficiency virus; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2; COVID-19, coronavirus disease 2019.
3.1. Multimodal and multiomics databases
Currently, the Global Initiative on Sharing All Influenza Data (GISAID) [3] is the only complete influenza data platform in the world. It contains thousands of sequences of viruses with pandemic potential in human populations, such as Zika virus, Ebola virus, different subtypes of influenza virus, and SARS-CoV-2, which is still spreading widely worldwide. It also collects relevant clinical and epidemiological data, forming a comprehensive virus archive. The World Health Organization (WHO) and many national virus prevention and control institutions cooperate with GISAID to share data resources on various infectious viruses. The Virus Pathogen Database and Analysis Resource (ViPR), created by Pickett et al. [4], covers lineage, gene, protein, immune epitope, clinical, and other multimodal information for many virus families, including coronaviruses, herpesviruses, and poxviruses, and provides various bioinformatics analyses, such as sequence alignment and phylogenetic tree construction. Nextstrain, developed by James Hadfield et al. [5], covers several viruses. Its most significant feature is the use of phylodynamic analysis to regenerate the phylogenetic tree of a virus and collect spatiotemporal information whenever the data are updated, presenting users with a spatiotemporal view of virus evolution and transmission. COVIDomic, developed by Naumov et al. [6], is a multiomics online platform that collects large amounts of health data from COVID-19 patients and analyzes their multimodal genomic data to determine the origin of the virus and the expected severity of the disease.
In addition, websites such as the National Center for Biotechnology Information (NCBI) and the European Nucleotide Archive (ENA) collect genomic information on many viruses, while platforms such as the Protein Data Bank (PDB) [7], UniProt [8], and AlphaFold collect structural biology and protein information on viruses. Notably, Ding et al. [9] developed the FluReassort database of influenza virus genomic reassortments, which summarizes and describes the correlations and reassortment preferences among different influenza virus subtypes to better understand the direction of virus evolution. This work may inspire studies of possible genetic recombination in SARS-CoV-2.
3.2. Epidemiological database
According to a literature survey, some countries have established regional epidemiological databases to cope with emerging infectious diseases in different areas. Although most of these epidemiological resources are public, there is little interaction between the databases. Some exist independently for a specific time, region, or disease classification, and some have become poorly maintained datasets because they are no longer updated. The Global Influenza B Study (GIBS) [10], launched in 2012, collected epidemiological information on 935,673 influenza B cases from countries worldwide over the past ten years, mainly through the WHO and the influenza surveillance systems of various countries. EpiGraphDB, developed by Liu et al. [11], is a graph database that integrates the biomedical and epidemiological relationships of various diseases and supports multiple interactive or programmatic access methods. It covers epidemiological data on many infectious viruses, such as avian influenza virus (AIV), swine influenza virus (SIV), Zika, dengue, and SARS-CoV-2. Furthermore, Wang et al. [12] constructed a SARS-CoV genome-epidemiological knowledge graph named KGCoV.
3.3. Immune information database
With the outbreak of COVID-19, immunological datasets have proliferated. This data is often closely linked to the research and development of vaccines and drugs for the virus, making proper storage and sharing of the data a significant issue. The CoV-AbDab [13] and CoV3D [14] databases store antibody structure data of SARS-CoV-1, SARS-CoV-2, MERS-CoV, and other coronaviruses. In addition, since T and B cells recognize viral epitopes to form immunological memory in adaptive immune response, understanding the epitopes is essential for virus diagnosis and corresponding vaccine design. Therefore, scientists have developed several virus antigen epitope databases, such as COVIEdb [15], that predict and store coronaviruses' potential immune epitope information. Based on the NCBI platform, the Macromolecular Modeling Database (MMDB) [16] is the most extensive database storing immune-related crystallography structure data. Some data platforms with diverse resources contain immunological data, such as PDB, UniProt, and ViPR.
3.4. Clinical research database
The collection and rational use of clinical and scientific research data are essential for understanding emerging infectious diseases. At present, bioRxiv and medRxiv, the United States National Library of Medicine (NLM), and other platforms collect scientific research literature on various infectious diseases. Meanwhile, scientific research datasets related to various infectious diseases are constantly emerging. For example, the COVID-19 Open Research Dataset (CORD-19) [16] is a growing open dataset of scientific research papers related to COVID-19. In terms of clinical data, the U.S. Centers for Disease Control and Prevention (CDC) collects clinicopathological data from doctors' diagnostic reports. In China, the Biological Medicine Information Center and the China National Center for Bioinformation provide disease-related information, including clinical data, and offer access to world-renowned biomedical databases. In addition, ClinicalTrials.gov is a repository of hundreds of thousands of clinical studies from 220 countries around the world. There are also several drug databases for various infectious diseases, including influenza and coronavirus, such as DrugBank [17], PubChem, and DockCoV2 [18].
4. Construction of virus mutation spectrum and variant forecast models
Drawing on the many virus database resources, scientists have constructed virus mutation spectra and developed techniques for building prediction models of virus mutations. We review recent studies on lineage construction for several emerging infectious viruses and on virus mutation prediction technology.
4.1. Virus lineage nomenclature
The lineage nomenclature systems for SARS-CoV-2 in the pandemic stage include the Pango [19], GISAID [20], Nextstrain [21], and L/S [22] nomenclatures. The first three name lineages according to the evolutionary distance of the virus but differ in naming format and specification (Fig. 2). In the Pango nomenclature, the initial letter represents the root lineage, and the dot-separated numbers that follow indicate descendant branches. GISAID uses a phylogenetic clustering framework called PhyCLIP [20], previously used to cluster avian influenza H5Nx, and names clades according to their mutation sites. Nextstrain names clades after the time when a virus variant appeared, with the number giving the year and the letter indicating the order in which the variant appeared that year. The L and S lineages are distinguished by two closely linked SNPs [22], and the authors of that study further divided the L lineage into L1 and L2 sublineages. These nomenclatures are the ones most commonly used in scientific research.
Fig. 2.
The global severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) lineage maps displayed on Nextstrain. A) The nodes were colored according to Pango lineage nomenclature. B) The nodes were colored according to GISAID lineage nomenclature. C) The nodes were colored according to Nextstrain lineage nomenclature. D) L/S lineage nomenclature was based on the difference of nucleotides at 8,782 and 28,144, which were C and T in the L subtype and T and C in the S subtype.
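The naming rules described above lend themselves to a small illustration. The following is a minimal sketch, not the procedure used by any of the cited tools: `classify_l_s` applies the two-site rule from the Fig. 2 caption, and `pango_ancestry` expands a dot-separated Pango-style name into its parent lineages. Both function names are hypothetical, the genome is assumed to be already aligned to reference coordinates, and Pango alias prefixes are deliberately left unresolved.

```python
from typing import List

# Positions (1-based) of the two linked SNPs named in the Fig. 2 caption.
SITE_A, SITE_B = 8782, 28144


def classify_l_s(genome: str) -> str:
    """Return 'L', 'S', or 'other' for a genome in reference coordinates.

    Assumption: the sequence is aligned to the reference without
    insertions, so string index = genome position - 1.
    """
    if len(genome) < SITE_B:
        raise ValueError("sequence does not cover position 28,144")
    pair = (genome[SITE_A - 1].upper(), genome[SITE_B - 1].upper())
    if pair == ("C", "T"):
        return "L"
    if pair == ("T", "C"):
        return "S"
    return "other"


def pango_ancestry(name: str) -> List[str]:
    """Expand a Pango-style name into its chain of parent lineages.

    Alias prefixes (for example BA) are NOT resolved in this sketch.
    """
    parts = name.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


if __name__ == "__main__":
    toy = ["A"] * 29903  # toy genome, not real sequence data
    toy[SITE_A - 1], toy[SITE_B - 1] = "T", "C"
    print(classify_l_s("".join(toy)))
    print(pango_ancestry("B.1.1.7"))
```

In practice, lineage assignment is carried out by dedicated tools on full alignments; the sketch only makes the naming logic concrete.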
To help the public distinguish between different subtypes of the virus, however, the WHO classified SARS-CoV-2 variants into variants of concern (VOCs) and variants of interest (VOIs) according to virulence and transmissibility and used Greek letters to denote the different strains [23]. Unlike SARS-CoV-2, influenza viruses are classified into types A, B, and C according to virus structure, antigen type, and severity of infection. Influenza A viruses are further classified by the antigens on hemagglutinin and neuraminidase [24], [25]. The Ebola virus, which has circulated in Africa for years, is classified by region and mutation [26]. The lineage nomenclature of these emerging infectious viruses is therefore diverse. If mutation spectra are constructed according to different classification methods, the resulting work becomes chaotic, which hinders researchers' understanding of different viruses and increases the difficulty of follow-up work. At the same time, different viruses have different characteristics and cannot all be given the same classification rules. Formulating standardized lineage nomenclature rules for different viruses is thus a challenge to be solved in the future.
4.2. Mutation identification based on viral genomic data
The database resources described in Section 3 focus on virus evolution dynamics. To capture the history of virus mutation, a database needs timely updates of the temporal and spatial information on virus transmission and evolution. Spatiotemporal tracking of variants requires detailed epidemiological investigations. A study on the superspreading of SARS-CoV-2 in Austria [27] combined a phylogenetic study of the virus with epidemiological data to reconstruct a tree containing patient distribution information. This study collected spatiotemporal information from manually selected samples from various regions in Austria. However, the location where a sample was collected does not indicate that the virus variant arose there, so a tracking method like this one is not accurate enough for studying the spatiotemporal development of a virus lineage. Since then, some teams have tried to infer the history and locations of virus lineage transmission. Philippe Lemey et al. [28] integrated patient travel history data into Bayesian phylogeographic inference, a classic method for analyzing the diffusion history of virus lineages, to reduce the impact of sampling location bias. However, this method is computationally intensive and cannot analyze large datasets quickly. To solve this problem, Simon Dellicour et al. [29] proposed a workflow for phylogeographic reconstruction of lineage locations on time-scaled trees, which allows a more detailed analysis of virus lineages in geographical regions of interest. The team used this workflow to map the spatiotemporal spread of SARS-CoV-2 spike protein mutations in Belgium in 2020 [30]. Spatiotemporal tracking of variants requires accurate knowledge of when and where each variant arose; however, epidemiological survey information alone cannot achieve this goal.
Therefore, mathematical models are needed for inference, and improving the speed and accuracy of that inference is the problem this line of research now needs to solve.
Tree reconstruction and ancestral state reconstruction are the two standard methods for identifying virus mutations. Nextflu [31] tracked four seasonal influenza lineages using a processing pipeline named Augur. Nextstrain [5] improved on Nextflu and expanded the analysis to more viruses, including Ebola, Zika, and coronaviruses. GenBrowser [32] is broadly similar to Nextstrain. This tool enables complete phylogenetic analysis and detects ongoing positive selection while offering a user-friendly interactive interface for SARS-CoV-2. However, conventional phylogenetic pipelines cannot handle large amounts of data, whereas the lineage maps of emerging infectious viruses need timely updates, so scientists have explored ways to solve this problem. Sudhir Kumar et al. [33] improved the mutational order approach, which is commonly used to reconstruct the evolutionary history of tumor cells, and applied it to reconstructing the mutation history of SARS-CoV-2. Yatish Turakhia et al. [34] proposed UShER, a tree-based framework that uses a branch parsimony score to add new genome sequences to existing phylogenetic trees, significantly improving analysis speed compared with conventional phylogenetic analysis. Moreover, unlike tree reconstruction and ancestral state reconstruction, a virus evolutionary network based on a network analysis algorithm has the advantages of lower computational cost and additional freedom in connectivity. A team from the Chinese Academy of Sciences developed VENAS [35], which tracks mutations in the virus as SARS-CoV-2 spreads. Notably, this software uses topology-based community detection and a network disassortativity trimming algorithm to identify critical viral groups in transmission. Data-driven approaches to detecting new mutations are also instructive.
Anna Bernasconi et al. [36] used data mining to search for dynamic changes in the amino acid sequences of viruses and thereby monitor the emergence of variants.
Nevertheless, phylogenetics remains the mainstream method for constructing a virus mutation spectrum. To cope with the rapid growth of genomic data on emerging infectious diseases and to map virus evolution accurately and efficiently, future work needs to focus on optimizing phylogenetic methods. Some phylogenetic tools, such as IQ-TREE [37] and Bayesian Evolutionary Analysis by Sampling Trees (BEAST) [38], are already being updated to handle larger analysis tasks at higher speed.
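As a simple point of reference for what "identifying a mutation" means at the sequence level, the sketch below lists substitutions between a reference and pre-aligned samples and counts how often each one occurs. It is a deliberately naive pairwise comparison, not a phylogenetic or ancestral state reconstruction method, and it is not how Augur, UShER, or VENAS work internally. The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, List

BASES = set("ACGT")


def call_substitutions(reference: str, sample: str) -> List[str]:
    """List substitutions such as 'A5T' between two aligned sequences.

    Assumption: both sequences are already aligned and equally long.
    Gaps ('-') and ambiguous bases (for example 'N') are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    found = []
    pairs = zip(reference.upper(), sample.upper())
    for pos, (ref_base, alt_base) in enumerate(pairs, start=1):
        if ref_base != alt_base and ref_base in BASES and alt_base in BASES:
            found.append(f"{ref_base}{pos}{alt_base}")
    return found


def mutation_frequencies(reference: str, samples: Dict[str, str]) -> Counter:
    """Count in how many samples each substitution is observed."""
    counts: Counter = Counter()
    for sequence in samples.values():
        counts.update(call_substitutions(reference, sequence))
    return counts


if __name__ == "__main__":
    ref = "ACGTAC"  # toy data, not real genomes
    toy_samples = {"s1": "ACGTTC", "s2": "ACGTTC", "s3": "AC-TAN"}
    for mutation, n in mutation_frequencies(ref, toy_samples).most_common():
        print(mutation, n)
```

A tree-based method additionally places each substitution on a branch, which is what allows recurrent and inherited mutations to be told apart.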
4.3. Analysis of viral genome information from different perspectives to obtain viral evolutionary trends
A recent review [39] comprehensively introduced the molecular histories of adaptation to the human host for eight viruses, namely variola virus, HIV-1 M, SARS-CoV-1, H1N1-SIV, MERS-CoV, Ebola, Zika, and SARS-CoV-2, which helps in understanding the molecular adaptation of viruses during evolution. In the last century, scientists proposed quasispecies models that help explain the evolution of self-replicating populations [40], [41]. However, these models are weak in quantitative prediction, so some scientists built on this theory, analyzed the genotypes (nucleotide sequences) and phenotypes (amino acid sequences) of viruses, and constructed a quantitative model of virus evolution under immune selection [42]. This model predicts the mutation trend of the adaptive evolution of viruses in response to immune responses under ideal conditions. Other scientists use antigenicity to understand the adaptability of viruses and build models to predict their evolutionary direction. Łuksza and Lässig [43] developed a fitness model of HA (hemagglutinin) based on protein thermodynamics to predict the evolution of influenza A. Apart from such approaches grounded in the theory of virus evolution, a statistical measure called Shannon entropy has been used to identify the regions of the SARS-CoV-2 spike protein that are more prone to mutation, assisting mutation prediction [44]. In addition, VarEPS [45] and PyR0 [46] estimate the fitness of variant lineages from millions of viral genomes and thereby predict the evolutionary trend of viruses.
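To make the entropy-based idea concrete, the following minimal sketch computes the Shannon entropy of each column of a protein alignment; columns with higher entropy are more variable. It assumes the sequences are already aligned to equal length, and the function names and toy alignment are hypothetical rather than taken from the cited study [44].

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residues observed in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(alignment: List[str]) -> List[float]:
    """Per-position entropy for a list of equally long aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [column_entropy("".join(col)) for col in zip(*alignment)]


def most_variable_sites(alignment: List[str], top: int = 3) -> List[int]:
    """Return 1-based positions of the highest-entropy columns."""
    profile = entropy_profile(alignment)
    ranked = sorted(range(len(profile)), key=lambda i: profile[i], reverse=True)
    return [i + 1 for i in ranked[:top]]


if __name__ == "__main__":
    toy_alignment = ["NK", "NK", "NT", "NN"]  # toy data, not real spike sequences
    print(entropy_profile(toy_alignment))
    print(most_variable_sites(toy_alignment, top=1))
```

A fully conserved column has zero entropy, so ranking positions by this value gives a first, purely descriptive view of where variation concentrates; it says nothing by itself about selection or fitness.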
Artificial intelligence technology is widely used in mutation prediction because it requires neither prior knowledge nor explicit consideration of the complex processes involved in virus evolution. Existing data can be used directly to build a model and make predictions.
Yao et al. [47] used a joint random forest to analyze HA sequence data and established a model for predicting the antigenicity of H3N2. Michael R. Garvin et al. [48] proposed an interpretable AI model for discovering potential SARS-CoV-2 adaptive mutations, combining iterative random forest and random intersection trees and supporting high-performance computing. In addition to random forest, long short-term memory (LSTM) networks handle time-series problems, which suits virus mutation prediction. Refat Khan Pathan et al. [49] used LSTM to predict nucleotide mutation rates of SARS-CoV-2 and found differences in the mutation rates of several nucleotides. Md Shahadat Hossain et al. [50] combined bioinformatics analysis with LSTM to study SNPs of SARS-CoV-2 and predict their future mutation rates, with a prediction accuracy of 97%. However, most of these AI algorithms have limitations. For example, models can overfit, which happens easily when a random forest is trained on very noisy data. Complex models are also slow to compute: when an LSTM is trained on samples spanning long time series and the network is deep, the computation becomes extensive and time-consuming. In response, some researchers have proposed combining multiple machine-learning models. Yin et al. [51] used a stacked model combining logistic regression, support vector machine, naive Bayes, neural network, and nearest neighbor models to predict H1N1 mutations. Such a stacked model can improve the accuracy and stability of prediction through the diversity of its algorithms and is better suited to complex problems. Furthermore, there are few studies on prediction models of virus mutation, and most of them focus on a specific virus or subtype. The performance of these models on other viruses has yet to be verified.
5. Analysis of the mutation antigen and receptor affinity
After the future mutations of an emerging infectious virus are predicted, the infectivity of a variant can be evaluated by calculating the ability of its antigen to bind to the receptor. We introduce some methods and studies on antigen-receptor binding analysis of infectious viruses.
5.1. Research techniques for antigen-receptor binding
Antigen-receptor binding studies calculate the affinity between a viral protein and the host receptor. Standard methods for studying ligand-receptor binding affinity include molecular dynamics simulation, molecular mechanics calculation, and machine-learning models. Molecular dynamics relies mainly on Newtonian mechanics to simulate the dynamic process of a molecular system. It samples from a system consisting of molecules in different states, calculates a score, and derives other macroscopic properties from that score. This method can simulate macromolecular conformational changes and enzyme reactions; ligand-receptor binding belongs to the former. Research on molecular dynamics mainly involves constructing molecular force fields with strong computational performance and proposing sampling methods that can capture molecular states at various time scales [52]. Molecular mechanics calculation is relatively simple: the different molecular conformations are simulated and their energies are calculated. In addition, machine-learning models, such as neural networks, have been introduced into antigen-receptor binding analysis. In the following section, we review the applications of these methods in infectious disease research.
5.2. Applications in the study of emerging infectious diseases
Deng et al. [53] analyzed the binding affinity of the hemagglutinin protein H7 of H7N9 to human receptors by molecular docking. Molecular docking can model ligand-receptor binding but cannot capture more detailed dynamic changes, for which molecular dynamics simulation performs better. Chen et al. [54] used a coarse-grained force field in molecular dynamics to calculate the free energy change upon mutation. However, molecular dynamics simulation requires long simulation time scales. To calculate antigen-receptor binding ability more efficiently, especially for the widespread and rapidly mutating SARS-CoV-2, Zhan et al. [55] chose to calculate the molecular mechanics-Poisson-Boltzmann surface area (MM-PBSA) binding free energy using the energy-minimized structures of the complex before and after the residue change. In this way, the difference in the binding free energy of SARS-CoV-2 and ACE2 before and after a spike protein mutation can be predicted quickly. They used this method to calculate the binding strength of the Omicron variant (BA.2) spike protein to ACE2 [56]. Moreover, 3D structural modeling of the ligand-receptor complex is used to find high-affinity variants. Hin Hark Gan et al. [57] used this method to screen SARS-CoV-2 variants, applying molecular dynamics simulation and energy analysis to detect high-affinity combinations. In addition to improving traditional molecular dynamics and molecular mechanics methods for calculating the binding strength between mutant antigen and receptor, researchers have developed convenient calculation workflows and prediction platforms for wider use. Nash D. Rochman et al. [58] used protein structure modeling to compare the ligand-receptor binding interface of different SARS-CoV-2 variants with that of the wild type. Fabrizio Pucci et al. [59] developed a computational model named SpikePro, which can predict viral adaptation from the amino acid sequence and structure of the SARS-CoV-2 spike protein.
The Spike protein Antigenicity for SARS-CoV-2 (SAS) platform provided by Zhang et al. [60] can predict the antigenicity of SARS-CoV-2 variants and help researchers better target immune binding experiments to monitor the effectiveness of vaccine antibodies. In addition, Chen et al. [61] predicted the binding affinity of SARS-CoV-2 receptor-binding domains or ACE2 carrying different amino acid substitutions by training a convolutional neural network (CNN) model on protein sequence and structural features.
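The comparison of binding free energies before and after a mutation can be expressed in a few lines. The sketch below assumes that binding free energies (in kcal/mol) have already been produced by some upstream method such as MM-PBSA; it only computes the difference ΔΔG = ΔG(mutant) − ΔG(wild type) and, through the standard relation ΔG = RT ln K_d, the implied fold change in the dissociation constant. The function names and the numerical values are hypothetical placeholders, not results from the cited studies, and computed energies of this kind are approximate.

```python
import math
from typing import Dict, List, Tuple

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg(dg_wild_type: float, dg_mutant: float) -> float:
    """Change in binding free energy; negative means tighter binding."""
    return dg_mutant - dg_wild_type


def kd_fold_change(ddg_value: float, temperature_k: float = 298.15) -> float:
    """Ratio K_d(mutant) / K_d(wild type) implied by a given ddG."""
    return math.exp(ddg_value / (R_KCAL_PER_MOL_K * temperature_k))


def rank_variants(dg_wild_type: float,
                  dg_by_variant: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort variants from the most to the least favorable ddG."""
    scored = [(name, ddg(dg_wild_type, dg)) for name, dg in dg_by_variant.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    wild_type_dg = -10.0                        # placeholder value
    variants = {"mutA": -11.2, "mutB": -9.4}    # placeholder values
    for name, value in rank_variants(wild_type_dg, variants):
        print(name, round(value, 2), round(kd_fold_change(value), 3))
```

A fold change below 1 corresponds to a lower dissociation constant and hence stronger predicted binding, which is the quantity such screens rank variants by.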
6. Propagation model of virus dynamic evolution
To help regions and countries cope scientifically and efficiently with the possible emergence of highly transmissible virus variants, transmission models of the future dynamic evolution of viruses need to be built on the biological characteristics of viruses obtained from earlier data accumulation. In epidemiology, research on virus transmission models is well developed. Various mathematical models describe and predict the transmission modes of different infectious diseases. We summarize the classical models of virus transmission and focus on the evolution and transmission of SARS-CoV-2 over the past two years.
6.1. Epidemiological analysis of virus transmission
W.O. Kermack and A.G. McKendrick [62] established a classical mathematical model for studying the evolution of epidemic diseases. This compartment model classifies the population into compartments such as “Susceptible (S)”, “Exposed (E)”, “Infectious (I)”, and “Recovered (R)”. There are two basic compartment models: SIR, which is suitable for analyzing some acute infectious diseases, such as mumps, and SEIR (Fig. 3), which is suitable for infectious diseases with incubation periods, such as COVID-19, and has been used by some researchers for epidemiological analysis. Compartments can also be combined in other ways to describe different infectious disease transmission modes. Simulating the spread of an infectious disease may require considering other factors, such as the environment; in that case, an extra compartment representing the environment can be added to a classical compartment model. Besides compartment models, there are other methods for modeling the epidemic process, most of which are statistical, such as maximum likelihood estimation and correlation analysis.
Fig. 3.
The susceptible-exposed-infectious-removed (SEIR) model of an epidemic. β is the transmission rate, referring to the probability of virus transmission from infected to susceptible people. σ is the latency rate, referring to the proportion of exposed individuals becoming infectious per unit time. γ is the recovery rate, referring to the proportion of infectious patients recovering per unit time.
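To make the SEIR compartment model concrete, the sketch below integrates its standard differential equations with a simple forward Euler scheme. It is an illustration under stated assumptions rather than a model from any cited study: the function name `simulate_seir`, the step size, and all parameter values are hypothetical and not fitted to data.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Integrate the SEIR equations with a forward Euler scheme.

    dS/dt = -beta * S * I / N
    dE/dt =  beta * S * I / N - sigma * E
    dI/dt =  sigma * E - gamma * I
    dR/dt =  gamma * I
    """
    n = s0 + e0 + i0 + r0
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]  # one record per day, starting at day 0
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Illustrative per-day rates only; not fitted to any data set.
trajectory = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1,
                           s0=9990, e0=0, i0=10, r0=0, days=160)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][2])
print("day with most infectious individuals:", peak_day)
print("removed fraction at the end:", round(trajectory[-1][3] / 10000, 3))
```

Dropping the E compartment (moving newly infected individuals directly into I) turns the same loop into an SIR model, and an extra compartment, for example for the environment, can be added in the same way.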
6.2. Transmission models of COVID-19
Currently, many researchers use compartment models to analyze the transmission of SARS-CoV-2. However, conventional compartment models are insufficient to deal with the complexity of a pandemic; therefore, researchers combine them with other methods. Michael Small et al. [63] constructed a complex network system and modeled transmission on it based on the SEIR model. This study explored a method for establishing an appropriate model when reasonable estimates of basic epidemiological parameters are unavailable. A study evaluating the potential incidence of COVID-19 and the effectiveness of containment measures in Spain [64] combined a data-driven approach with SEIR models to track the spatial spread of the virus. Matthieu Domenech de Cellès et al. [65] modified the standard SEIR model to test the impact of influenza on the epidemiological dynamics of SARS-CoV-2. Many researchers have reformulated compartment models with different mathematical tools to construct new ones, among which the Caputo-Fabrizio and Atangana-Baleanu derivatives are widely used. These models were used to analyze the spread of COVID-19 under different conditions (vaccines and other prevention measures). Scientists hope to use mathematics as a powerful tool to build a stable and accurate model of the dynamic evolution of COVID-19 transmission. Furthermore, Wang et al. [66] proposed a spatio-temporal human activity model named STHAM to simulate the transmission dynamics of SARS-CoV-2. Unlike studies that start from compartment models, this work is agent-based: it records dynamic information for each individual and assigns different movement patterns to groups with different attributes. This is an excellent example of how demography can help study the dynamics of infectious diseases.
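The difference between a compartment model and an agent-based model can be illustrated with a toy example in which every individual carries its own state and attributes. The sketch below is not STHAM or any published model: the two mobility groups, their contact counts, and all probabilities are hypothetical values chosen only to show the mechanism.

```python
import random

SUSCEPTIBLE, EXPOSED, INFECTIOUS, REMOVED = "S", "E", "I", "R"


def run_agent_model(n_agents=500, days=60, p_transmit=0.05,
                    p_onset=0.2, p_removal=0.1, seed=0):
    """Toy agent-based SEIR model with two hypothetical mobility groups."""
    rng = random.Random(seed)
    # Each agent stores its own state and a daily contact count that
    # depends on a made-up attribute (high or low mobility).
    agents = [{"state": SUSCEPTIBLE,
               "contacts": 8 if rng.random() < 0.3 else 2}
              for _ in range(n_agents)]
    for agent in rng.sample(agents, 5):
        agent["state"] = INFECTIOUS

    daily_counts = []
    for _ in range(days):
        # Infectious agents meet randomly chosen agents.
        newly_exposed = set()
        for agent in agents:
            if agent["state"] != INFECTIOUS:
                continue
            for _ in range(agent["contacts"]):
                other = rng.randrange(n_agents)
                if (agents[other]["state"] == SUSCEPTIBLE
                        and rng.random() < p_transmit):
                    newly_exposed.add(other)
        # Progress existing infections, then apply today's new exposures.
        for agent in agents:
            if agent["state"] == EXPOSED and rng.random() < p_onset:
                agent["state"] = INFECTIOUS
            elif agent["state"] == INFECTIOUS and rng.random() < p_removal:
                agent["state"] = REMOVED
        for other in newly_exposed:
            agents[other]["state"] = EXPOSED
        counts = {s: 0 for s in (SUSCEPTIBLE, EXPOSED, INFECTIOUS, REMOVED)}
        for agent in agents:
            counts[agent["state"]] += 1
        daily_counts.append(counts)
    return daily_counts


history = run_agent_model()
print("state counts on the last day:", history[-1])
```

Because each agent is tracked individually, attributes such as age, occupation, or location could be attached to the agent record and used to vary its contact pattern, which is the kind of heterogeneity a compartment model averages away.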
7. Monitoring and early warning for variants
The preceding sections covered studies on virus variant prediction, antigenicity assessment, and transmission models. However, a crucial step is still missing before these findings can be integrated into local government efforts to combat the pandemic: monitoring and early warning for variants. Tracking the spread of viruses in society and monitoring changes in the viral genome in real time can provide guidance and buy valuable time for public health authorities to deploy interventions. Methods for virus variant monitoring and early warning are diverse (Fig. 4). This section introduces current research on monitoring and early warning methods for SARS-CoV-2, influenza, and other emerging viruses from two perspectives: environmental samples and big data.
Fig. 4.
Diversified technologies monitor and warn against the evolution and transmission of viruses. Environmental samples, such as wastewater, soil, and air in epidemic areas, can provide physical evidence for detecting virus variants. Data mining on internet platforms like Google and Twitter provides new ideas for virus information search. Multiomics analysis and phylogenetic methods improve the basic information of virus lineages. Big data and artificial intelligence combined with the global initiative on sharing can provide technical support for establishing an online monitoring and early warning platform for virus variants.
7.1. Monitoring variants by sampling from the environment
In the case of SARS-CoV-2, an important environmental sample is wastewater, as studies have confirmed that significant concentrations of viral RNA can be detected in the feces of COVID-19 patients even after their respiratory symptoms have disappeared. Therefore, community wastewater monitoring can help understand the transmission of the virus. Several SARS-CoV-2 surveillance studies currently use domestic wastewater as samples [67], [68], [69]. Their research directions include establishing early warning systems for viruses through wastewater detection, assessing transmission trends of infectious diseases, estimating prevalence, and investigating virus variants. Wastewater is not the only relevant medium: SARS-CoV-2 is a respiratory virus that spreads through the air, with aerosols carrying a high viral load as the primary transmission agent in hard-hit areas, and the virus can be present on surfaces touched by patients. Moreover, studies have confirmed that SARS-CoV-2 can be detected in soil samples from affected areas [70]. These studies suggest that collecting and testing environmental samples in a region plays an essential role in early warning of viruses. However, in most countries, testing for infectious viruses is not part of the sampling and testing performed by environmental regulatory authorities, a gap that deserves serious attention.
7.2. Virus variant monitoring and early warning based on big data
Different techniques have been used to speed up access to clinical information about infectious diseases. M. Santillana et al. [71] developed a machine-learning model, AutoRegressive Electronic health record Support vector machine (ARES), that enables real-time influenza monitoring in a specific region based on cloud-based electronic health records. Yang et al. [72] created AutoRegression with Google search data (ARGO), an influenza tracking model that estimates influenza activity from Google search data. Canelle Poirier et al. [73] combined data sources such as Google searches, real-time climate data, and influenza-related Twitter posts with electronic health data, and used an integrated machine-learning method to produce real-time estimates and short-term predictions of influenza activity in France. Furthermore, geographic information systems (GIS) have been used to help prevent and detect epidemics [74].
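The idea behind autoregressive tracking models that use search data can be shown with a much-reduced example: this week's case count is regressed on last week's count and on this week's search volume. The sketch below is an assumption-laden illustration, not the ARGO implementation, which is considerably richer. It uses a single lag, a single search series, ordinary least squares, and a synthetic noise-free series; all names and numbers are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_ar_with_search(cases, search):
    """Least-squares fit of cases[t] ~ a + b * cases[t-1] + c * search[t]."""
    rows = [[1.0, cases[t - 1], search[t]] for t in range(1, len(cases))]
    targets = cases[1:]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def nowcast(weights, last_cases, current_search):
    """Estimate the current case count from the fitted weights."""
    a, b, c = weights
    return a + b * last_cases + c * current_search


# Synthetic, noise-free series built from known coefficients (2, 0.6, 0.3).
search = [10, 14, 20, 27, 35, 30, 22, 15, 11, 9, 12, 18]
cases = [20.0]
for t in range(1, len(search)):
    cases.append(2.0 + 0.6 * cases[t - 1] + 0.3 * search[t])

weights = fit_ar_with_search(cases, search)
print("fitted weights:", [round(w, 3) for w in weights])
print("nowcast for next week:", round(nowcast(weights, cases[-1], 25), 2))
```

With real surveillance data the fit would not be exact, and a regularized regression over many lags and many search terms would be a more realistic choice than the plain least-squares fit used here.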
8. Discussion
This review provides a detailed introduction to the databases and analysis tools for emerging infectious diseases published at home and abroad in recent years, to show the trends and progress of relevant research. We summarized the representative research methods and achievements in all the citations (Table 2). As the COVID-19 pandemic is still ongoing, we paid particular attention to SARS-CoV-2-related research.
Table 2.
Representative methods in all citations.
| Name of the method | Uniform resource locator | Description | References |
|---|---|---|---|
| Pangolin | https://github.com/cov-lineages/pangolin | A lineage nomenclature in which a leading letter represents the lineage, followed by dots and numbers that indicate the descendant branches. | [19] |
| PhyCLIP | https://github.com/alvinxhan/PhyCLIP | Phylogenetic clustering by linear integer programming (PhyCLIP), a statistically principled phylogenetic clustering approach and a lineage nomenclature used by GISAID. | [20] |
| Nextstrain | https://nextstrain.org/ncov/gisaid/global/6m | The lineage nomenclature created by Nextstrain, in which a clade is named after the time when the virus variant appeared. | [21] |
| Nextflu | https://nextflu.org | A workflow and visualization program for analyzing and exploring the latest influenza virus sequence data. | [31] |
| UShER | https://genome.ucsc.edu/cgi-bin/hgPhyloPlace | A tree-based data framework that encodes the inferred evolutionary history of the virus to speed up phylogenetic analysis. | [34] |
| VENAS | https://github.com/qianjiaqiang/VENAS | The viral genome evolutionary analysis system (VENAS), software that provides integrated analysis of genomic variations and reduces time and computation cost. | [35] |
| VarEPS | https://nmdc.cn/ncovn/ | An evaluation and prewarning system for SARS-CoV-2 that includes known and virtual mutations to rapidly evaluate the risks posed by mutant strains. | [45] |
| PyR0 | https://github.com/broadinstitute/pyro-cov/tree/v0.2.1 | A pipeline based on a hierarchical Bayesian multinomial logistic regression model that infers the relative prevalence of all viral lineages across geographic regions and identifies mutations relevant to fitness. | [46] |
| SpikePro | https://github.com/3BioCompBio/SpikeProSARS-CoV-2 | A simplified computational model that predicts SARS-CoV-2 fitness from the amino acid sequence and structure of the spike protein. | [59] |
| SAS | https://www.biosino.org/sas | A platform that can predict the resistance of emerging variants and the dynamic coverage of SARS-CoV-2 antibodies among circulating strains. | [60] |
| STHAM | https://github.com/uofu-ccts/prisms-comp-model-stham | Spatio-temporal human activity model (STHAM), an extended agent-based model that simulates SARS-CoV-2 transmission dynamics. | [66] |
Abbreviations: GISAID, the Global Initiative on Sharing All Influenza Data; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2.
Since research on viruses has been performed from the perspectives of multiomics data, clinical and epidemiological data, evolution, transmission and pathogenicity, and vaccine research, many research achievements have been accumulated. A complete virus database that integrates all virus and disease information is needed to deal with emerging infectious diseases. Some international databases, such as GISAID, have achieved this goal. However, because of China's huge population and vast territory, international databases cannot cover the information of all infectious disease patients in China, and their information is not always updated in time. In virus research, the achievements of different teams in China are scattered, and some resources cannot be operated and maintained for long, so they eventually become unusable. We need to build a homegrown research platform that includes all kinds of data and information on infectious diseases nationwide and a series of tools for virus research, such as mutation prediction, antigenicity assessment, transmission models, early warning, and monitoring. At the same time, these biological and medical technologies will be combined with national social, economic, medical, and other data to provide substantial scientific support for national disease control efforts.
In the past decade, blockchain has been widely used in finance, the Internet of Things, public services, and other fields. As a “distributed ledger,” it is decentralized, and its data are hard to tamper with once written [75], which makes the information stored in a blockchain secure and reliable. When an infectious disease breaks out, traditional databases struggle with the surge in patient records, scientific data, and users who need the data. Blockchain could therefore play a significant role in combating emerging infectious diseases. Moreover, digital twin technology has attracted attention in China in recent years, and smart cities have been built nationwide to promote smart manufacturing. Smart medical treatment is an essential application of digital twins: telemedicine, health monitoring, precision medicine, and other areas all depend on this technology. Hence, digital twins have considerable potential in science and medicine when dealing with emerging infectious diseases, such as COVID-19. A digital twin can combine all information related to infectious diseases to create a scaled-down “meta-universe” that integrates multidisciplinary knowledge to help managers make more comprehensive and accurate decisions.
There is a trend to apply these emerging concepts to the healthcare and biomedical industries. The COVID-19 pandemic has lasted for more than three years and has brought more challenges to human society, and people have realized that preventing and controlling the virus is a protracted battle for all humankind. This battle requires not only the deployment of medical services and government administration but also the collaboration of all sectors of society. In the era of Industry 4.0, realizing “intelligence” is the goal of all industries. For virus research and prevention, we aim to achieve intelligent scientific research and intelligent epidemic control.
Acknowledgements
This work was supported by the National Key R&D Program of China (2022YFF1203202, 2018YFC2000205), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB38050200, XDA26040304), and the Self-supporting Program of Guangzhou Laboratory (SRPG22-007).
Conflict of interest statement
The authors declare that there are no conflicts of interest.
Author contributions
Haotian Ren: Conceptualization, Investigation, Visualization, Writing – original draft. Yunchao Ling: Supervision. Ruifang Cao: Supervision. Zhen Wang: Supervision. Yixue Li: Supervision. Tao Huang: Conceptualization, Writing – review & editing.
Contributor Information
Haotian Ren, Email: renhaotian2021@sinh.ac.cn.
Yunchao Ling, Email: lingyunchao@sinh.ac.cn.
Ruifang Cao, Email: rfcao@sinh.ac.cn.
Zhen Wang, Email: zwang01@sinh.ac.cn.
Yixue Li, Email: yxli@sinh.ac.cn.
Tao Huang, Email: huangtao@sinh.ac.cn.
References
- 1. WHO, Coronavirus disease (COVID-19) pandemic. https://www.who.int/emergencies/diseases/novel-coronavirus-2019, 2023 (accessed 3 May 2023).
- 2. J.A. Lewnard, V.X. Hong, M.M. Patel, R. Kahn, M. Lipsitch, S.Y. Tartof, Clinical outcomes associated with SARS-CoV-2 Omicron (B.1.1.529) variant and BA.1/BA.1.1 or BA.2 subvariant infection in Southern California, Nat. Med. 28 (2022) 1933–1943, https://doi.org/10.1038/s41591-022-01887-z.
- 3. Elbe S., Buckland-Merrett G. Data, disease and diplomacy: GISAID's innovative contribution to global health. Global Challenges. 2017;1:33–46. doi: 10.1002/gch2.1018.
- 4. Pickett B.E., Sadat E.L., Zhang Y., Noronha J.M., Squires R.B., Hunt V., Liu M., Kumar S., Zaremba S., et al. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 2012;40:D593–D598. doi: 10.1093/nar/gkr859.
- 5. Hadfield J., Megill C., Bell S.M., Huddleston J., Potter B., Callender C., Sagulenko P., Bedford T., Neher R.A. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34:4121–4123. doi: 10.1093/bioinformatics/bty407.
- 6. Naumov V., Putin E., Pushkov S., Kozlova E., Romantsov K., Kalashnikov A., Galkin F., Tihonova N., Shneyderman A., et al. COVIDomic: A multi-modal cloud-based platform for identification of risk factors associated with COVID-19 severity. PLoS Comput. Biol. 2021;17:e1009183. doi: 10.1371/journal.pcbi.1009183.
- 7. Berman H., Henrick K., Nakamura H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 2003;10:980. doi: 10.1038/nsb1203-980.
- 8. Bateman A., Martin M.-J., Orchard S., Magrane M., Agivetova R., Ahmad S., Alpi E., Bowler-Barnett E.H., Britto R., et al. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–D489. doi: 10.1093/nar/gkaa1100.
- 9. Ding X., Yuan X., Mao L., Wu A., Jiang T. FluReassort: a database for the study of genomic reassortments among influenza viruses. Brief. Bioinform. 2020;21:2126–2132. doi: 10.1093/bib/bbz128.
- 10. S. Caini, Q.S. Huang, M.A. Ciblak, G. Kusznierz, R. Owen, S. Wangchuk, C.M.P. Henriques, R. Njouom, R.A. Fasce, H. Yu, et al., Epidemiological and virological characteristics of influenza B: results of the Global Influenza B Study, Influenza Other Respi. Viruses 9 (S1) (2015) 3–12, https://doi.org/10.1111/irv.12319.
- 11. Liu Y., Elsworth B., Erola P., Haberland V., Hemani G., Lyon M., Zheng J., Lloyd O., Vabistsevits M., et al. EpiGraphDB: a database and data mining platform for health data science. Bioinformatics. 2021;37:1304–1311. doi: 10.1093/bioinformatics/btaa961.
- 12. Y. Wang, J. Yang, X. Zhuang, Y. Ling, R. Cao, Q. Xu, P. Wang, P. Xu, G. Zhang, Linking genomic and epidemiologic information to advance the study of COVID-19, Sci. Data 9 (2022) 121, https://doi.org/10.1038/s41597-022-01237-1.
- 13. Raybould M.I.J., Kovaltsuk A., Marks C., Deane C.M. CoV-AbDab: the coronavirus antibody database. Bioinformatics. 2021;37:734–735. doi: 10.1093/bioinformatics/btaa739.
- 14. Gowthaman R., Guest J.D., Yin R., Adolf-Bryfogle J., Schief W.R., Pierce B.G. CoV3D: a database of high resolution coronavirus protein structures. Nucleic Acids Res. 2021;49:D282–D287. doi: 10.1093/nar/gkaa731.
- 15. Wu J., Chen W., Zhou J., Zhao W., Sun Y., Zhu H., Yao P., Chen S., Jiang J., et al. COVIEdb: A database for potential immune epitopes of coronaviruses. Front. Pharmacol. 2020;11:646111. doi: 10.3389/fphar.2020.572249.
- 16. Lu Wang, L., K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, et al., CORD-19: The Covid-19 Open Research Dataset [Preprint], ArXiv. (2020) 2004.10706v4.
- 17. Wishart, D.S., Y.D. Feunang, A.C. Guo, E.J. Lo, A. Marcu, J.R. Grant, T. Sajed, D. Johnson, C. Li, et al., DrugBank 5.0: a major update to the DrugBank database for 2018, Nucleic Acids Research. 46 (2018) D1074-D1082, https://doi.org/10.1093/nar/gkx1037.
- 18. Chen T.F., Chang Y.C., Hsiao Y., Lee K.H., Hsiao Y.C., Lin Y.H., Tu Y.C.E., Huang H.C., Chen C.Y., et al. DockCoV2: a drug database against SARS-CoV-2. Nucleic Acids Res. 2021;49:D1152–D1159. doi: 10.1093/nar/gkaa861.
- 19. Rambaut A., Holmes E.C., O’Toole A., Hill V., McCrone J.T., Ruis C., du Plessis L., Pybus O.G. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. 2020;5:1403–1407. doi: 10.1038/s41564-020-0770-5.
- 20. Han A.X., Parker E., Scholer F., Maurer-Stroh S., Russell C.A. Phylogenetic clustering by linear integer programming (PhyCLIP). Mol. Biol. Evol. 2019;36:1580–1595. doi: 10.1093/molbev/msz053.
- 21. Emma B Hodcroft, J.H., Richard A Neher, Trevor Bedford. Year-letter genetic clade naming for SARS-CoV-2 on nextstrain.org. https://nextstrain.org/blog/2020-06-02-SARSCoV2-clade-naming, 2020 (accessed 2 February 2023).
- 22. Tang X., Wu C., Li X., Song Y., Yao X., Wu X., Duan Y., Zhang H., Wang Y., et al. On the origin and continuing evolution of SARS-CoV-2. Natl. Sci. Review. 2020;7:1012–1023. doi: 10.1093/nsr/nwaa036.
- 23. WHO, Tracking SARS-CoV-2 variants. https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/, 2023 (accessed 8 May 2023).
- 24. Hause B.M., Collin E.A., Liu R., Huang B., Sheng Z., Lu W., Wang D., Nelson E.A., Li F. Characterization of a novel influenza virus in cattle and Swine: proposal for a new genus in the Orthomyxoviridae family. mBio. 2014;5:e00031-14. doi: 10.1128/mBio.00031-14.
- 25. Collin E.A., Sheng Z., Lang Y., Ma W., Hause B.M., Li F. Cocirculation of two distinct genetic and antigenic lineages of proposed influenza D virus in cattle. J. Virol. 2015;89:1036–1042. doi: 10.1128/jvi.02718-14.
- 26. Simon-Loriere E., Faye O., Faye O., Koivogui L., Magassouba N., Keita S., Thiberge J.M., Diancourt L., Bouchier C., et al. Distinct lineages of Ebola virus in Guinea during the 2014 West African epidemic. Nature. 2015;524:102–104. doi: 10.1038/nature14612.
- 27. A. Popa, J.W. Genger, M.D. Nicholson, T. Penz, D. Schmid, S.W. Aberle, B. Agerer, A. Lercher, L. Endler, et al., Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2, Sci. Transl. Med. 12 (2020) eabe2555, https://doi.org/10.1126/scitranslmed.abe2555.
- 28. Lemey P., Hong S.L., Hill V., Baele G., Poletto C., Colizza V., O'Toole A., McCrone J.T., Andersen K.G., et al. Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2. Nat. Commun. 2020;11:5110. doi: 10.1038/s41467-020-18877-9.
- 29. Dellicour S., Durkin K., Hong S.L., Vanmechelen B., Marti-Carreras J., Gill M.S., Meex C., Bontems S., Andre E., et al. A phylodynamic workflow to rapidly gain insights into the dispersal history and dynamics of SARS-CoV-2 lineages. Mol. Biol. Evol. 2021;38:1608–1613. doi: 10.1093/molbev/msaa284.
- 30. Bollen N., Artesi M., Durkin K., Hong S.L., Potter B., Boujemla B., Vanmechelen B., Marti-Carreras J., Wawina-Bokalanga T., et al. Exploiting genomic surveillance to map the spatio-temporal dispersal of SARS-CoV-2 spike mutations in Belgium across 2020. Sci. Rep. 2021;11:18580. doi: 10.1038/s41598-021-97667-9.
- 31. Neher R.A., Bedford T. Nextflu: real-time tracking of seasonal influenza virus evolution in humans. Bioinformatics. 2015;31:3546–3548. doi: 10.1093/bioinformatics/btv381.
- 32. Yu D., Yang X., Tang B., Pan Y.H., Yang J., Duan G., Zhu J., Hao Z.Q., Mu H., et al. Coronavirus GenBrowser for monitoring the transmission and evolution of SARS-CoV-2. Brief. Bioinform. 2022;23:bbab583. doi: 10.1093/bib/bbab583.
- 33. Kumar S., Tao Q.Q., Weaver S., Sanderford M., Caraballo-Ortiz M.A., Sharma S., Pond S.L.K., Miura S., Yeager M. An evolutionary portrait of the progenitor SARS-CoV-2 and its dominant offshoots in COVID-19 pandemic. Mol. Biol. Evol. 2021;38:3046–3059. doi: 10.1093/molbev/msab118.
- 34. Turakhia Y., Thornlow B., Hinrichs A.S., De Maio N., Gozashti L., Lanfear R., Haussler D., Corbett-Detig R. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat. Genet. 2021;53:809–816. doi: 10.1038/s41588-021-00862-7.
- 35. Ling Y., Cao R., Qian J., Li J., Zhou H., Yuan L., Wang Z., Ma L., Zheng G., et al. An interactive viral genome evolution network analysis system enabling rapid large-scale molecular tracing of SARS-CoV-2. Science Bulletin. 2022;67:665–669. doi: 10.1016/j.scib.2022.01.001.
- 36. Bernasconi A., Mari L., Casagrandi R., Ceri S. Data-driven analysis of amino acid change dynamics timely reveals SARS-CoV-2 variant emergence. Sci. Rep. 2021;11:21068. doi: 10.1038/s41598-021-00496-z.
- 37. Minh B.Q., Schmidt H.A., Chernomor O., Schrempf D., Woodhams M.D., von Haeseler A., Lanfear R. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 2020;37:1530–1534. doi: 10.1093/molbev/msaa015.
- 38. Suchard, M.A., P. Lemey, G. Baele, D.L. Ayres, A.J. Drummond, and A. Rambaut, Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10, Virus Evolution. 4 (2018) vey016, https://doi.org/10.1093/ve/vey016.
- 39. Rochman, N.D., Y.I. Wolf, and E.V. Koonin, Molecular adaptations during viral epidemics, Embo Reports. 23 (2022) e55393, https://doi.org/10.15252/embr.202255393.
- 40. Eigen M. Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften. 1971;58:465–523. doi: 10.1007/BF00623322.
- 41. Swetina J., Schuster P. Self-replication with errors. A model for polynucleotide replication. Biophys. Chem. 1982;16:329–345. doi: 10.1016/0301-4622(82)87037-3.
- 42. Woo H.J., Reifman J. A quantitative quasispecies theory-based model of virus escape mutation under immune selection. PNAS. 2012;109:12980–12985. doi: 10.1073/pnas.1117201109.
- 43. Łuksza M., Lässig M. A predictive fitness model for influenza. Nature. 2014;507:57–61. doi: 10.1038/nature13087.
- 44. Mullick B., Magar R., Jhunjhunwala A., Farimani A.B. Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon Entropy and K-means clustering. Comput. Biol. Med. 2021;138. doi: 10.1016/j.compbiomed.2021.104915.
- 45. Sun Q., Shu C., Shi W., Luo Y., Fan G., Nie J., Bi Y., Wang Q., Qi J., et al. VarEPS: an evaluation and prewarning system of known and virtual variations of SARS-CoV-2 genomes. Nucleic Acids Res. 2022;50:D888–D977. doi: 10.1093/nar/gkab921.
- 46. Obermeyer F., Jankowiak M., Barkas N., Schaffner S.F., Pyle J.D., Yurkovetskiy L., Bosso M., Park D.J., Babadi M., et al. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness [Preprint]. medRxiv. 2022. doi: 10.1101/2021.09.07.21263228.
- 47. Yao Y., Li X., Liao B., Huang L., He P., Wang F., Yang J., Sun H., Zhao Y., et al. Predicting influenza antigenicity from Hemagglutintin sequence data based on a joint random forest method. Sci. Rep. 2017;7:1545. doi: 10.1038/s41598-017-01699-z.
- 48. Garvin, M.R., T.P. E, M. Pavicic, P. Jones, B.K. Amos, A. Geiger, M.B. Shah, J. Streich, J.G. Felipe Machado Gazolla, et al., Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal and explainable AI models, Genome Biol. 21 (2020) 304, https://doi.org/10.1186/s13059-020-02191-0.
- 49. Pathan R.K., Biswas M., Khandaker M.U. Time series prediction of COVID-19 by mutation rate analysis using recurrent neural network-based LSTM model. Chaos Solitons Fractals. 2020;138. doi: 10.1016/j.chaos.2020.110018.
- 50. Hossain M.S., Pathan A.Q.M.S.U., Islam M.N., Tonmoy M.I.Q., Rakib M.I., Munim M.A., Saha O., Fariha A., Reza H.A., Roy M., et al. Genome-wide identification and prediction of SARS-CoV-2 mutations show an abundance of variants: Integrated study of bioinformatics and deep neural learning. Inf. Med. Unlocked. 2021;27. doi: 10.1016/j.imu.2021.100798.
- 51. R. Yin, V.H. Tran, X.R. Zhou, J. Zheng, C.K. Kwoh, Predicting antigenic variants of H1N1 influenza virus based on epidemics and pandemics using a stacking model, PLoS One 13 (2018) e0207777, https://doi.org/10.1371/journal.pone.0207777.
- 52. Cao L., Zhang C., Zhang D., Chu H., Zhang Y., Li G. Recent developments in using molecular dynamics simulation techniques to study biomolecules. Acta Physico-Chimica Sinica. 2017;33:1354–1365. doi: 10.3866/PKU.WHXB201704144.
- 53. Y. Deng, Q. Liu, Q. Huang, Molecular docking of human-like receptor to hemagglutinins of avian influenza A viruses, Acta Phys. Chim. Sin. 33 (2017) 633–641, https://doi.org/10.3866/PKU.WHXB201612052.
- 54. Bai C., Wang J., Chen G., Zhang H., An K., Xu P., Du Y., Ye R.D., Saha A., et al. Predicting mutational effects on receptor binding of the spike protein of SARS-CoV-2 variants. J. Am. Chem. Soc. 2021;143:17646–17654. doi: 10.1021/jacs.1c07965.
- 55. Williams A.H., Zhan C.G. Fast prediction of binding affinities of the SARS-CoV-2 spike protein mutant N501Y (UK Variant) with ACE2 and miniprotein drug candidates. J. Phys. Chem. B. 2021;125:4330–4336. doi: 10.1021/acs.jpcb.1c00869.
- 56. Williams A.H., Zhan C.G. Generalized methodology for the quick prediction of variant SARS-CoV-2 spike protein binding affinities with human angiotensin-converting enzyme II. J. Phys. Chem. B. 2022;126:2353–2360. doi: 10.1021/acs.jpcb.1c10718.
- 57. Gan H.H., Twaddle A., Marchand B., Gunsalus K.C. Structural modeling of the SARS-CoV-2 spike/Human ACE2 complex interface can identify high-affinity variants associated with increased transmissibility. J. Mol. Biol. 2021;433. doi: 10.1016/j.jmb.2021.167051.
- 58. N.D. Rochman, G. Faure, Y.I. Wolf, P.L. Freddolino, F. Zhang, and E.V. Koonin, Epistasis at the SARS-CoV-2 receptor-binding domain interface and the propitiously boring implications for vaccine escape, mBio. 13 (2022) e0013522, https://doi.org/10.1128/mbio.00135-22.
- 59. Pucci F., Rooman M. Prediction and evolution of the molecular fitness of SARS-CoV-2 variants: Introducing SpikePro. Viruses. 2021;13:935. doi: 10.3390/v13050935.
- 60. Zhang L., Cao R., Mao T., Wang Y., Lv D., Yang L., Tang Y., Zhou M., Ling Y., Zhang G., Qiu T., Cao Z. SAS: A Platform of Spike Antigenicity for SARS-CoV-2. Front. Cell Dev. Biol. 2021;9. doi: 10.3389/fcell.2021.713188.
- 61.Chen C., Boorla V.S., Chowdhury R., Nissly R.H., Gontu A., Chothe S.K., LaBella L., Jakka P., Ramasamy S., et al. A CNN model for predicting binding affinity changes between SARS-CoV-2 spike RBD variants and ACE2 homologues [Preprint] bioRxiv. 2022 doi: 10.1101/2022.03.22.485413. [DOI] [Google Scholar]
- 62.Kermack W.O., McKendrick A.G. Contribution to the mathematical theory of epidemics. Proceed. Roy. Soc. London Series a-Contain. Papers Mathemat. Phys. Character. 1927;115(1927):700–721. doi: 10.1098/rspa.1927.0118. [DOI] [Google Scholar]
- 63.Small M., Cavanagh D. Modelling strong control measures for epidemic propagation with networks-A COVID-19 case study. IEEE Access. 2020;8:109719–109731. doi: 10.1109/ACCESS.2020.3001298. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Aleta A., Moreno Y. Evaluation of the potential incidence of COVID-19 and effectiveness of containment measures in Spain: a data-driven approach. BMC Med. 2020;18:157. doi: 10.1186/s12916-020-01619-5.
- 65.Domenech de Cellès M., Casalegno J.S., Lina B., Opatowski L. Estimating the impact of influenza on the epidemiological dynamics of SARS-CoV-2. PeerJ. 2021;9:e12566. doi: 10.7717/peerj.12566.
- 66.Wang Y., Li B., Gouripeddi R., Facelli J.C. Human activity pattern implications for modeling SARS-CoV-2 transmission. Comput. Methods Programs Biomed. 2021;199. doi: 10.1016/j.cmpb.2020.105896.
- 67.Bivins A., Greaves J., Fischer R., Yinda K.C., Ahmed W., Kitajima M., Munster V.J., Bibby K. Persistence of SARS-CoV-2 in water and wastewater. Environ. Sci. Technol. Lett. 2020;7:937–942. doi: 10.1021/acs.estlett.0c00730.
- 68.Layton B.A., Kaya D., Kelly C., Williamson K.J., Alegre D., Bachhuber S.M., Banwarth P.G., Bethel J.W., Carter K., et al. Evaluation of a wastewater-based epidemiological approach to estimate the prevalence of SARS-CoV-2 infections and the detection of viral variants in disparate Oregon communities at city and neighborhood scales. Environ. Health Perspect. 2022;130:67010. doi: 10.1289/EHP10289.
- 69.Yanac K., Adegoke A., Wang L.Q., Uyaguari M., Yuan Q.Y. Detection of SARS-CoV-2 RNA throughout wastewater treatment plants and a modeling approach to understand COVID-19 infection dynamics in Winnipeg, Canada. Sci. Total Environ. 2022;825:153906. doi: 10.1016/j.scitotenv.2022.153906.
- 70.Liu Z., Skowron K., Grudlewska-Buda K., Wiktorczyk-Kapischke N. The existence, spread, and strategies for environmental monitoring and control of SARS-CoV-2 in environmental media. Sci. Total Environ. 2021;795. doi: 10.1016/j.scitotenv.2021.148949.
- 71.Santillana M., Nguyen A.T., Louie T., Zink A., Gray J., Sung I., Brownstein J.S. Cloud-based electronic health records for real-time, region-specific influenza surveillance. Sci. Rep. 2016;6:25732. doi: 10.1038/srep25732.
- 72.Yang S.H., Santillana M., Kou S.C. Accurate estimation of influenza epidemics using Google search data via ARGO. PNAS. 2015;112:14473–14478. doi: 10.1073/pnas.1515373112.
- 73.Poirier C., Hswen Y., Bouzille G., Cuggia M., Lavenu A., Brownstein J.S., Brewer T., Santillana M. Influenza forecasting for French regions combining EHR, web and climatic data sources with a machine learning ensemble approach. PLoS One. 2021;16. doi: 10.1371/journal.pone.0250890.
- 74.Kamel Boulos M.N., Geraghty E.M. Geographical tracking and mapping of coronavirus disease COVID-19/severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) epidemic and associated events around the world: how 21st century GIS technologies are supporting the global fight against outbreaks and epidemics. Int. J. Health Geogr. 2020;19:8. doi: 10.1186/s12942-020-00202-8.
- 75.McConaghy T., Marques R., Müller A., De Jonghe D., McConaghy T., McMullen G., Henderson R., Bellemare S., Granzotto A. BigchainDB: A Scalable Blockchain Database. https://www.bigchaindb.com/whitepaper/, 2018 (accessed 8 May 2023).




