2021 Jan 6;18(4):1250–1261. doi: 10.1109/TCBB.2021.3049617

2019nCoVAS: Developing the Web Service for Epidemic Transmission Prediction, Genome Analysis, and Psychological Stress Assessment for 2019-nCoV

Ming Xiao 1, Guangdi Liu 2, Jianghang Xie 1, Zichun Dai 1, Zihao Wei 1, Ziyao Ren 1, Jun Yu 3, Le Zhang 1
PMCID: PMC8769043  PMID: 33406042

Abstract

Since the COVID-19 epidemic is still expanding around the world and poses a serious threat to human life and health, it is necessary for us to carry out epidemic transmission prediction, whole genome sequence analysis, and public psychological stress assessment for 2019-nCoV. However, transmission prediction models are insufficiently accurate, genome sequence characteristics are not clear, and it is difficult to dynamically assess the public psychological stress state under the 2019-nCoV epidemic. Therefore, this study develops the 2019nCoVAS web service (http://www.combio-lezhang.online/2019ncov/home.html), which not only offers online epidemic transmission prediction and lineage-associated underrepresented permutation (LAUP) analysis services to investigate the spreading trends and genome sequence characteristics, but also provides psychological stress assessments based on an emotional dictionary that we built for 2019-nCoV. Finally, we discuss the shortcomings and further study of the 2019nCoVAS web service.

Keywords: 2019-nCoV, COVID-19, epidemic prediction models, LAUPs (lineage-associated underrepresented permutations), psychological stress assessment, genome analysis

1. Introduction

The COVID-19 epidemic, caused by the pathogenic virus 2019-nCoV, is still expanding around the world and poses a serious threat to human life and health [1]. Thus, it is critical for us to carry out epidemic transmission prediction [2], [3], genome sequence analysis [4], [5], and public psychological stress assessments [6], [7] for 2019-nCoV.

From the perspective of epidemic transmission, the basic reproduction number (R0) is one of the key parameters for 2019-nCoV epidemic transmission prediction [8], [9], [10], [11], [12]. Although the well-established SIR [13] and SEIR [14] models can estimate R0, neither considers the quarantine of suspected patients. Furthermore, the most commonly used web services for 2019-nCoV [5], [15], [16] focus only on the statistical analysis of real epidemic data, and few online predictive services offer different epidemic transmission models.

From the perspective of genome sequence analysis, recent studies [17], [18], [19] performed sequence analysis and phylogenetic tree construction for 2019-nCoV. Most of these studies employed k-mer counting as a basic method to explore the frequent subsequence of genomes [20]. However, the k-mer counting method did not consider the characteristics of subsequences such as permutation specificity and CG content change from the perspective of lineage for 2019-nCoV, and sequencing errors cannot be avoided for the frequently mutated 2019-nCoV. Therefore, we cannot accurately and comprehensively describe the characteristics of the 2019-nCoV genome only by k-mer counting.

From the perspective of public psychological stress assessment, many previous studies [21], [22], [23] investigated the impact of 2019-nCoV on the public psychological stress state. For example, Chang et al. [21] evaluated the psychological health of 3881 students from Guangdong University using self-compiled 2019-nCoV scales. Fan et al. [22] assessed the psychological health status of people in Gansu Province through questionnaires. However, since we usually employ questionnaires to collect data, the population coverage is so narrow that it is difficult to dynamically assess the public psychological state and develop a professional emotional dictionary for 2019-nCoV. Furthermore, previous studies did not consider the connections among the public psychological stress state, real epidemic trends, and genome variation rate.

For these reasons, we developed an easy-to-use 2019 Novel Coronavirus Analysis Service (2019nCoVAS, http://www.combio-lezhang.online/2019ncov/home.html) with the following three major innovations.

First, 2019nCoVAS implements a predictive model that considers the quarantine of suspected patients and offers online epidemic transmission prediction and R0 trend analysis.

Second, 2019nCoVAS downloads all open 2019-nCoV genomes from mainstream databases for sequence analysis, as well as uses JBLA [24] to count and analyze the common lineage-associated underrepresented permutations (LAUPs) for all 2019-nCoV genomes. Additionally, we introduce MOTIF discovery [25] to find the frequent permutation pattern of common LAUPs since it can successfully describe the sequence characteristics of the genome [24], [26], [27] from the perspective of never-existing permutations. Thus, 2019nCoVAS can help us to improve the accuracy of sequence and phylogenetic analysis.

Third, 2019nCoVAS not only builds an emotional dictionary of 2019-nCoV by crawling large-scale Weibo data [38], which significantly expands the data size compared with questionnaires, but also provides related services such as high-frequency vocabulary visualization and public psychological stress assessments.

In general, 2019nCoVAS can provide epidemic transmission prediction, genome sequence analysis, and public psychological stress assessment for 2019-nCoV.

2. Implementations

2.1. Epidemic Transmission Prediction

In the beginning, we obtained the 2019-nCoV epidemic data for China by Akshare [28] from January 20 to May 1, 2020. The data consists of the number of confirmed cases, suspected cases, and deaths. We developed a Python script to preprocess the data. It should be noted that since Hubei Province added 14840 newly confirmed cases on February 12 due to a change in detection standards [29], we must proportion the newly confirmed cases on February 12 from February 7 to February 12.
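The redistribution of the February 12 spike can be sketched as follows. The paper does not state how the cases are apportioned, so this sketch assumes an even split of the spike-day count over the window that ends on the spike day. The function name `redistribute_spike` and the sample counts are illustrative and are not taken from the authors' script or from real data.

```python
def redistribute_spike(daily_new, spike_idx, window):
    """Spread the count at spike_idx evenly over the `window` days ending at spike_idx.

    Assumption: an even split. The paper only says the cases are
    "proportioned" over February 7 to February 12.
    """
    start = spike_idx - window + 1
    if start < 0:
        raise ValueError("window extends before the start of the series")
    share = daily_new[spike_idx] / window
    adjusted = [float(x) for x in daily_new]
    adjusted[spike_idx] = 0.0
    for i in range(start, spike_idx + 1):
        adjusted[i] += share
    return adjusted


# Made-up daily counts for February 6 to February 13 (not real data).
daily_new = [300, 320, 310, 290, 280, 270, 1200, 260]
# Index 6 stands for February 12; the 6-day window covers February 7 to 12.
adjusted = redistribute_spike(daily_new, spike_idx=6, window=6)
```

The total number of cases is unchanged by the adjustment; only its distribution over the window changes.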

To predict the epidemic transmission and basic reproduction number (R0) for 2019-nCoV, we build up three epidemic transmission predictive models: SIR, SEIR, and SEIRQ. In particular, the SEIRQ model considers quarantined case factors, which can be used to predict epidemic situations under the condition of suspected quarantine cases. Next, we discuss these models.

2.1.1. SIR Model

Fig. 1 shows that the SIR model [30] classifies the total population into susceptible (S), infected (I), and recovered (R) populations. The susceptible (S) population transforms to infected (I) according to infection rate β (Eq. (1) of the supplementary file, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2021.3049671), and infected (I) gradually recovers (R) at recovery rate γ (Eq. (2) of the supplementary file, available online). Related methods are listed in supplementary file S1.1, available online.

Fig. 1. SIR model.
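As a rough illustration of the S to I to R flow described above, the following sketch integrates the textbook SIR equations with forward Euler steps. The paper's own equations are in the supplementary file, so the exact form used here (new infections equal to βSI/N), the step size, and all parameter values are assumptions chosen for illustration.

```python
def simulate_sir(s0, i0, rec0, beta, gamma, days, dt=0.1):
    """Integrate the textbook SIR equations; returns one (S, I, R) tuple per day."""
    n = s0 + i0 + rec0
    s, i, r = float(s0), float(i0), float(rec0)
    steps_per_day = int(round(1 / dt))
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt   # S -> I at infection rate beta
            new_recoveries = gamma * i * dt          # I -> R at recovery rate gamma
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative values only, not fitted to the epidemic data.
history = simulate_sir(s0=9990, i0=10, rec0=0, beta=0.4, gamma=0.1, days=60)
```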

2.1.2. SEIR Model

Since 2019-nCoV has a latent period, Fig. 2 introduces a SEIR model [14] that considers the exposed population (E), which carries the virus without symptoms. Here, the susceptible (S) population gradually transforms into exposed (E) rather than infected (I). The exposed (E) gradually converts to infected (I) with the conversion ratio α, and the infected (I) eventually converts to recovered (R).

Fig. 2. SEIR model.

The conversion relationship between exposed (E) and infected (I) is described by Eq. (1) [14], and the rest of the information is listed in supplementary file S1.2, available online.

$$\frac{dE(t)}{dt} = \frac{\beta S(t) I(t)}{N} - \alpha E(t). \tag{1}$$

Here, α represents the probability that the exposed (E) will transform into infected (I). N is the sum of S(t), I(t), and R(t).
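To make Eq. (1) concrete, the sketch below evaluates the SEIR right-hand sides and advances the state with Euler steps. Only the E equation is given in the text; the S, I, and R equations follow the standard SEIR form and are an assumption here, as are the parameter values. N is passed in explicitly as `n` so that the caller decides how it is defined.

```python
def seir_derivatives(s, e, i, r, beta, alpha, gamma, n):
    """Return (dS/dt, dE/dt, dI/dt, dR/dt) for one SEIR state."""
    new_exposed = beta * s * i / n    # S -> E
    new_infected = alpha * e          # E -> I
    new_recovered = gamma * i         # I -> R
    return (
        -new_exposed,
        new_exposed - new_infected,   # Eq. (1)
        new_infected - new_recovered,
        new_recovered,
    )


def euler_step(state, rates, dt):
    """Advance every compartment by one forward Euler step."""
    return tuple(x + dx * dt for x, dx in zip(state, rates))


# Illustrative values only, not fitted to the epidemic data.
state = (9980.0, 15.0, 5.0, 0.0)
n = sum(state)
for _ in range(300):  # 30 days at dt = 0.1
    rates = seir_derivatives(*state, beta=0.5, alpha=0.2, gamma=0.1, n=n)
    state = euler_step(state, rates, dt=0.1)
```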

2.1.3. SEIRQ Model

After we quarantined the suspected patients for 2019-nCoV outbreaks in China [1], we proposed a novel infectious SEIRQ model by considering the quarantined population, which is based on a modified SEIR model [31]. Fig. 3 shows a schematic diagram for SEIRQ.

Fig. 3. SEIRQ model.

Building on the SEIR model, the SEIRQ model incorporates two types of quarantined populations: quarantined patients under observation (O) and quarantined patients in treatment (T).

Susceptible individuals (S) move to quarantined under observation (O) at quarantine rate δ1 and become exposed (E) at infection rate β. Exposed individuals (E) are quarantined under observation (O) at rate ε and, at the same time, become infected (I) at rate α. After the quarantine period, those under observation (O) who are considered uninfected return to susceptible (S) at rate δ2, whereas the others become quarantined patients in treatment (T) at rate θ1. Infected individuals (I) are confirmed as quarantined patients in treatment (T) at rate θ2; meanwhile, the infected die at rate μ2. Quarantined patients in treatment (T) gradually recover (R) at recovery rate γ or die at rate μ1. Table 1 lists the key parameters, and supplementary file S1.3, available online, lists the equations.

TABLE 1. Key Parameters of SEIRQ Model.

| Symbol | Significance | Symbol | Significance |
| --- | --- | --- | --- |
| S | Susceptible population | I | Infected population |
| R | Recovered population | E | Exposed population in latent period |
| O | Quarantined patients under observation | T | Quarantined patients in treatment |
| D | Deaths | α | Incidence rate of exposed (E) |
| β | Infection rate | γ | Cure rate |
| δ1 | Quarantine rate of susceptible (S) | δ2 | Quarantine rate of quarantined (Q) |
| ε | Quarantine rate of exposed (E) | θ1 | Rate at which quarantined under observation (O) become quarantined in treatment (T) |
| θ2 | Incidence rate of infected patients (I) | μ1 | Death rate of quarantined patients in treatment (T) |
| μ2 | Death rate of infected patients (I) | | |
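The compartment flows described above can be written as a set of rate equations. The sketch below is one plausible reading of the prose description and Table 1, not the authors' equations, which are given in supplementary file S1.3. In particular, the infection term βSI/N, the dictionary layout, and every parameter value are assumptions for illustration.

```python
def seirq_derivatives(state, p, n):
    """Right-hand sides of a SEIRQ system inferred from the flow description.

    state: dict with keys S, E, I, O, T, R, D; p: dict of rates from Table 1.
    """
    s, e, i, o, t = (state[k] for k in "SEIOT")
    infection = p["beta"] * s * i / n  # S -> E (assumed mass-action form)
    return {
        "S": -infection - p["delta1"] * s + p["delta2"] * o,
        "E": infection - p["alpha"] * e - p["epsilon"] * e,
        "I": p["alpha"] * e - p["theta2"] * i - p["mu2"] * i,
        "O": p["delta1"] * s + p["epsilon"] * e - p["delta2"] * o - p["theta1"] * o,
        "T": p["theta1"] * o + p["theta2"] * i - p["gamma"] * t - p["mu1"] * t,
        "R": p["gamma"] * t,
        "D": p["mu1"] * t + p["mu2"] * i,
    }


# Illustrative values only, not the parameters estimated in the paper.
params = {"beta": 0.6, "alpha": 0.2, "gamma": 0.1, "delta1": 0.01, "delta2": 0.07,
          "epsilon": 0.05, "theta1": 0.02, "theta2": 0.3, "mu1": 0.005, "mu2": 0.01}
state = {"S": 9970.0, "E": 20.0, "I": 10.0, "O": 0.0, "T": 0.0, "R": 0.0, "D": 0.0}
total = sum(state.values())
dt = 0.1
for _ in range(300):  # 30 days of forward Euler steps
    rates = seirq_derivatives(state, params, total)
    state = {k: state[k] + rates[k] * dt for k in state}
```

Every outflow in this reading appears as an inflow elsewhere, so the seven compartments always sum to the initial population, which is a quick sanity check on the bookkeeping.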

To validate the predictive capacity of the model, we selected real data from January 20 to January 30 as the training data to predict the number of infected populations for the SIR, SEIR, and SEIRQ models (Fig. 4A). We also use the cross-validation method [32] to compute the average root mean square error (AVG_RMSE) for the three models by Eq. (2). Fig. 4B shows the AVG_RMSE values that describe the deviation between the predictive and actual curves.

Fig. 4. Comparison between predicted and actual curves. (A) Predictive and actual curves for infected population. Horizontal and vertical axes represent date and average infected population value, respectively. (B) Average RMSE for SIR, SEIR, and SEIRQ models. Here, horizontal and vertical axes represent date and average RMSE value, respectively.

Specifically, we select the first two days of real data as training data and the next day as testing data to calculate the root mean square error (RMSE [33], Eq. (2.1)). Then, we add the third-day data to the training dataset and compute the RMSE value by using the fourth-day data as the testing data. We repeat this day-forward rule for three iterations and average the resulting RMSE values (Eq. (2.2)).

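$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{\mathrm{real},i} - x_{\mathrm{model},i}\right)^{2}} \tag{2.1}$$

$$\mathrm{AVG\_RMSE} = \frac{\sum_{j=1}^{D}\mathrm{RMSE}_{j}}{D}. \tag{2.2}$$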

Here, $x_{\mathrm{real},i}$ and $x_{\mathrm{model},i}$ represent the real confirmed case number and the number of infections predicted by the model, respectively. N is the number of days of testing data, and D represents the number of days of day-forward iterations.
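The day-forward validation of Eq. (2) can be sketched as below. The function `linear_extrapolation` is a hypothetical stand-in for fitting SIR, SEIR, or SEIRQ, since the fitting routine is not described in the text; the series is made up, and the window sizes simply mirror the description (two training days at first, one test day, three iterations).

```python
import math


def rmse(real, predicted):
    """Eq. (2.1): root mean square error between real and predicted values."""
    if len(real) != len(predicted) or not real:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(real, predicted)) / len(real))


def avg_rmse_day_forward(series, fit_predict, initial_train=2, iterations=3):
    """Eq. (2.2): average RMSE over an expanding training window."""
    scores = []
    for j in range(iterations):
        split = initial_train + j
        train, test = series[:split], series[split:split + 1]
        scores.append(rmse(test, fit_predict(train, len(test))))
    return sum(scores) / len(scores)


def linear_extrapolation(train, horizon):
    """Hypothetical predictor: continue the last observed day-to-day change."""
    slope = train[-1] - train[-2]
    return [train[-1] + slope * (h + 1) for h in range(horizon)]


confirmed = [10, 20, 40, 70, 110]  # made-up cumulative counts
score = avg_rmse_day_forward(confirmed, linear_extrapolation)
```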

2.2. Basic Reproduction Number (R0) Estimation

We usually employ R0 to describe the dynamic change in the infective trend, and we compute it for each day by Eq. (3) [34]. R0 represents the average number of people to whom one infected person transmits the disease in the absence of external intervention and herd immunity [35].

$$R_0 = 1 + \lambda T_g + \rho(1-\rho)(\lambda T_g)^{2} \tag{3}$$

Here, ρ is the ratio of the incubation period to the generation time, λ is the growth rate during early exponential growth, and $T_g$ is the sum of the incubation period and the infection period [34].
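Eq. (3) is a direct formula, so a small worked example may help. The function name and the input values below are illustrative assumptions, not estimates from the paper.

```python
def basic_reproduction_number(growth_rate, generation_time, rho):
    """Eq. (3): R0 = 1 + lambda*Tg + rho*(1 - rho)*(lambda*Tg)^2."""
    lt = growth_rate * generation_time
    return 1 + lt + rho * (1 - rho) * lt ** 2


# Illustrative inputs: lambda = 0.1 per day, Tg = 8 days, rho = 0.5.
r0_example = basic_reproduction_number(growth_rate=0.1, generation_time=8, rho=0.5)
```

With these inputs, λTg = 0.8 and ρ(1 − ρ) = 0.25, so R0 is about 1 + 0.8 + 0.16 = 1.96. A growth rate of zero gives R0 = 1, the threshold between growth and decline.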

2.3. Genome Sequence Analysis

For 2019-nCoV genome sequence analysis, we downloaded all 2019-nCoV genome sequences and variation data from the 2019nCoVR [18] and GISAID [36] databases. We also compared 2019-nCoV with other coronavirus genomes by using all coronavirus sequence data from the NGDC [37].

To analyze the LAUP sequence characteristics for the 2019-nCoV genome, we used the Jellyfish-based LAUP analysis (JBLA) application [24] to compute the underrepresented sequences (Eq. (4)) and the common LAUPs. We also introduced the MOTIF discovery method [25] to find the most frequent arrangement pattern in the common LAUPs and provide data download services to help users conduct further sequence and phylogenetic analysis. To analyze the connections among the variation rate, epidemic transmission trend, and public psychological stress state, we counted the average variation rate for 2019-nCoV genomes every day.

2.3.1. Multi-Genome LAUPs Analysis

To investigate the characteristics of the 2019-nCoV sequence, we used Jellyfish [38] to calculate all k-mers for 2019-nCoV and then calculated the LAUP [24] sequence for each 2019-nCoV genome by Eq. (4).

$$\mathrm{LAUPs}_k = \complement_{\mathrm{Sim\_set}_k}\,\mathrm{Kwg\_set}_k \tag{4}$$

Here, $\mathrm{Sim\_set}_k$ includes all possible $4^k$ k-mers, and $\mathrm{Kwg\_set}_k$ contains all k-mers that have appeared in the genome of 2019-nCoV. LAUPs are defined as the complement of $\mathrm{Kwg\_set}_k$ in $\mathrm{Sim\_set}_k$.
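The set complement in Eq. (4) can be illustrated with plain Python sets. The paper counts k-mers with Jellyfish and JBLA; this sketch swaps in a naive sliding window instead, which is only practical for small k, and the function names and the toy sequence are assumptions.

```python
from itertools import product


def observed_kmers(sequence, k):
    """Kwg_set_k: every k-mer that occurs in the sequence (ambiguous bases skipped)."""
    seq = sequence.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def laups(sequence, k):
    """Eq. (4): complement of the observed k-mers within all 4^k possible k-mers."""
    all_kmers = {"".join(p) for p in product("ACGT", repeat=k)}  # Sim_set_k
    return all_kmers - observed_kmers(sequence, k)


toy_genome = "ACGTACGT"  # toy sequence, not a real genome
missing_2mers = laups(toy_genome, 2)
```

The toy sequence contains only the 2-mers AC, CG, GT, and TA, so 12 of the 16 possible 2-mers are absent from it.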

In addition, we calculated the common LAUPs for the whole 2019-nCoV by Eq. (5) [24], determined the shortest common LAUPs, and analyzed the GC content of the common LAUPs by Eq. (6).

$$\mathrm{CommonLAUPs} = \bigcap_{i=1}^{N}\mathrm{LAUPs}_k^{\,i}. \tag{5}$$

Here, k is the length of the LAUPs and N is the total number of 2019-nCoV genomes.

$$\mathrm{content\_GC}(\mathrm{LAUP}_k) = \frac{\mathrm{num}(G_k) + \mathrm{num}(C_k)}{k}. \tag{6}$$

Here, k is the length of the LAUP. The function num() counts the occurrences of the given base in the LAUP, and content_GC() calculates the GC content of the LAUP.
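Eqs. (5) and (6) translate into a set intersection and a simple ratio. The sketch below assumes the per-genome LAUP sets have already been computed; the function names and the toy sets are illustrative.

```python
def common_laups(laup_sets):
    """Eq. (5): k-mers that are LAUPs in every genome (set intersection)."""
    sets = [set(s) for s in laup_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


def gc_content(kmer):
    """Eq. (6): fraction of G and C bases in a k-mer."""
    kmer = kmer.upper()
    return (kmer.count("G") + kmer.count("C")) / len(kmer)


# Toy LAUP sets for three hypothetical genomes (not real data).
per_genome = [{"AAA", "CCG", "GGT"}, {"CCG", "GGT", "TTT"}, {"CCG", "AAA"}]
shared = common_laups(per_genome)
shared_gc = {kmer: gc_content(kmer) for kmer in shared}
```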

As a single-strand positive-sense RNA virus, 2019-nCoV follows all the molecular rules of the RNA world. Two of the primary rules are U as a nucleobase, instead of T in DNA, and a secondary structure formed by single-strand RNA molecules that is mostly intramolecular [39]. To apply k-mer and LAUP concepts in 2019-nCoV data analysis, we first convert the full sequence of the virus into k-mers of various lengths and subsequently look for LAUPs by comparing them with two k-mer pools: one contains random k-mers generated in limited G+C and A+G content windows, and the other is generated from all unique and high-quality RNA viruses. In this protocol, LAUPs contain sequence-derived permutations that are excluded by RNA viruses. Therefore, in doing so, we are not only able to avoid the impact of genome sequencing errors but can also discover sequences that are negatively selected by viral populations [24].

Virus-infected humans, individual animals, and populations often serve as hosts in which viruses select for their best fitness by forming quasi-species, where deleterious mutations are excluded so that viral fitness evolves. In this process, LAUPs as a set of sequences are subjected to selection in terms of secondary structures and targets of cellular RNA surveillance and interactive systems (such as RNA degradation and miRNA targeting), which is complementary to the selection of protein sequences. These analyses of LAUPs can help us to improve the accuracy of genome sequence and phylogenetic analyses as well as our understanding of viral biology and host pathophysiology.

2.3.2. Statistical Analysis of Genome Variation Rate

To analyze the connections among the 2019-nCoV epidemic, genome variation, and public psychological stress state, Eq. (7) computes the median variation rate. Then, we can visualize the trend of the variation rate under the real epidemic situation and public psychological stress state (Section 2.4), and investigate their connections.

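$$\mathrm{Variation\_median} = \begin{cases} \mathrm{Variation}_{\frac{N+1}{2}}, & N \bmod 2 = 1 \\ \frac{1}{2}\left(\mathrm{Variation}_{\frac{N}{2}} + \mathrm{Variation}_{\frac{N}{2}+1}\right), & N \bmod 2 = 0. \end{cases} \tag{7}$$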

Here, N represents the number of all 2019-nCoV genomes on this day. Variation represents the sorted variation rate array of the day.
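Eq. (7) is the ordinary median of the day's variation rates. A minimal sketch follows; note that the equation indexes the sorted array from 1 while Python indexes from 0. The function name and the sample rates are illustrative.

```python
def variation_median(rates):
    """Eq. (7): median of one day's variation rates."""
    ordered = sorted(rates)
    n = len(ordered)
    if n == 0:
        raise ValueError("no genomes for this day")
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (N + 1) / 2 is 0-based index N // 2.
        return ordered[mid]
    # 1-based positions N / 2 and N / 2 + 1 are 0-based mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_day = variation_median([0.3, 0.1, 0.2])
even_day = variation_median([0.4, 0.1, 0.3, 0.2])
```

The result agrees with `statistics.median` from the Python standard library, which could be used directly in practice.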

2.4. Psychological Stress Assessment

To obtain the Chinese public psychological stress data for 2019-nCoV, we built a crawler program based on the Weibo [40] API that crawled the data for all tweets and comments about 2019-nCoV from January 1 to March 31, 2020. Table 2 shows a set of searching Chinese strings of 2019-nCoV for our crawler program.

TABLE 2. Searching Chinese Strings of 2019-nCoV.

| No. | Searching Chinese strings | English translation |
| --- | --- | --- |
| 1 | 冠状病毒/新冠病毒 | Coronavirus/COVID/SARS-CoV-2/2019-nCoV |
| 2 | 肺炎 | Pneumonia |
| 3 | 疫情 | Epidemic |
| 4 | 防控 | Prevention and control |
| 5 | 感染 | Infect/Infection |
| 6 | 医院 | Hospital |
| 7 | 确诊病例/患者 | Confirmed cases/patients |
| 8 | 新确诊病例/患者 | Newly confirmed cases/patients |
| 9 | 武汉/湖北 | Wuhan/Hubei |
| 10 | 出院 | Discharged |
| 11 | 死亡 | Death |
| 12 | 密切接触 | Close contact |
| 13 | 口罩 | Face mask |

To dynamically assess the public psychological stress state during the epidemic, we proposed an automatic expansion method for the emotional dictionary in two steps. First, we employ a left and right information entropy algorithm [41] to locate the candidate words and construct the emotional dictionary. Second, we use the SO-PMI [42] and word2vec algorithms [43], [44] to determine the emotional polarity for the candidate words and screen new words for emotional polarity discrimination.

Thus, we developed two corresponding features for our web server. One builds an emotional dictionary for 2019-nCoV, which can highlight its high frequency vocabulary. The other assesses and visualizes the dynamic changes for the public's positive and negative emotions in response to 2019-nCoV at different time points. These features not only are able to provide a retrospective assessment function for psychologists but can also be used as references for national policy development.

2.4.1. Exploring Candidate Words

We introduce left and right information entropy [45], [46] as a quantitative measure for the boundary degrees of freedom of candidate words by Eq. (8). The left and right information entropies for the candidate string (w) are labeled $H_l$ and $H_r$, respectively.

$$H_l = -\sum_{w_l \in s_l} p(w_l \mid w)\log_2 p(w_l \mid w) \tag{8.1}$$

$$H_r = -\sum_{w_r \in s_r} p(w_r \mid w)\log_2 p(w_r \mid w). \tag{8.2}$$

where $s_l$ is the left adjacency set of candidate word w, $w_l$ is an element of $s_l$, $s_r$ is the right adjacency set of candidate word w, and $w_r$ is an element of $s_r$.
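Eq. (8) can be sketched by collecting the characters that appear immediately to the left and right of each occurrence of a candidate string and computing the entropy of each collection. Treating single adjacent characters as the adjacency sets is an assumption suited to unsegmented Chinese text; the function names and the toy text are illustrative.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (base 2) of the observed neighbor distribution."""
    if not neighbors:
        return 0.0
    counts = Counter(neighbors)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def left_right_entropy(text, word):
    """Eq. (8): (H_l, H_r) for a candidate word, using adjacent characters."""
    left, right = [], []
    start = text.find(word)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(word)
        if end < len(text):
            right.append(text[end])
        start = text.find(word, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


# Toy text: "XY" always follows "a" but is followed by "b" or "c".
h_left, h_right = left_right_entropy("aXYb aXYc", "XY")
```

A high entropy on both sides suggests that the string combines freely with its surroundings and is a plausible standalone word; in the toy text the left side is fixed (entropy 0) while the right side varies (entropy 1).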

2.4.2. Discrimination of Emotional Polarity Based on PMI and Word2Vec

The SO-PMI algorithm [47], [48], [49] is mainly used to determine the degree of correlation between words. We employ it to compute the mutual information between words by Eq. (9).

$$\mathrm{PMI} = \log_2 \frac{p(w_i, w_j)}{p(w_i) \times p(w_j)}. \tag{9}$$

Here, word probabilities $p(w_i)$, $p(w_j)$ and joint probabilities $p(w_i, w_j)$ can be estimated by counting the number of observations of $w_i$ and $w_j$ as well as the co-occurrence of $w_i$ and $w_j$ [50].

Here, the high mutual information indicates a high probability for the co-occurrence of two emotional words in many texts. We also introduced the Semantic Orientation (SO) [48] to determine whether a certain emotional word W is positive or negative by Eq. (10).

$$\mathrm{SO}(W) = \mathrm{PMI}(W, \beta^{+}) - \mathrm{PMI}(W, \beta^{-}). \tag{10}$$

We need to select two sets of seed words to compute SO. One has an obvious positive tendency ($\beta^{+}$), and the other has an obvious negative tendency ($\beta^{-}$).
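Eqs. (9) and (10) can be sketched with document-level co-occurrence counts. Several choices here are assumptions rather than the authors' implementation: each post is treated as a set of tokens, PMI is summed over all seed words in a set, and a pair that never co-occurs contributes 0. The English tokens stand in for segmented Chinese words.

```python
import math


def pmi(word_a, word_b, documents):
    """Eq. (9), with probabilities estimated from document frequencies."""
    n = len(documents)
    count_a = sum(1 for d in documents if word_a in d)
    count_b = sum(1 for d in documents if word_b in d)
    count_ab = sum(1 for d in documents if word_a in d and word_b in d)
    if count_ab == 0:
        return 0.0  # assumption: no co-occurrence carries no evidence
    return math.log2((count_ab / n) / ((count_a / n) * (count_b / n)))


def semantic_orientation(word, positive_seeds, negative_seeds, documents):
    """Eq. (10): PMI with positive seeds minus PMI with negative seeds."""
    positive = sum(pmi(word, seed, documents) for seed in positive_seeds)
    negative = sum(pmi(word, seed, documents) for seed in negative_seeds)
    return positive - negative


# Toy corpus of tokenized posts (not real Weibo data).
posts = [{"thanks", "good"}, {"thanks", "good"}, {"fear", "bad"}, {"fear", "bad"}]
so_thanks = semantic_orientation("thanks", ["good"], ["bad"], posts)
so_fear = semantic_orientation("fear", ["good"], ["bad"], posts)
```

A positive SO value marks the word as positive and a negative value marks it as negative; in the toy corpus "thanks" scores 1.0 and "fear" scores -1.0.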

3. Performance

Fig. 5 shows the home page for 2019nCoVAS, the top of which is the functional navigation bar. The “home page” link shows the main features of 2019nCoVAS, followed by four drop-down menus: “infectious disease model,” “genome sequence analysis,” “psychological stress assessment,” and “related links.”

Fig. 5. Home page of 2019nCoVAS.

3.1. Infectious Disease Model

The “infectious disease model” offers two functional modes. One is “epidemic transmission prediction,” which can predict the epidemic transmission for 2019-nCoV by SIR, SEIR, and SEIRQ. The other is “R0 trend analysis,” which can carry out trend analysis for the basic reproduction number (R0).

3.1.1. Epidemic Transmission Prediction

After clicking the “epidemic transmission prediction” link, the user can employ the selective interface to input the start and end dates and choose the appropriate epidemic transmission predictive models (Fig. 6A), which are comprised of SIR, SEIR, and SEIRQ. Finally, users can view epidemic transmission predictions by clicking the “submit” button (Fig. 6A) or clicking the “reset” button to restore the parameters.

Fig. 6. Epidemic transmission prediction. (A) Infectious disease model selective interface. (B) Estimated parameters. (C) Predicted result. Horizontal and vertical axes represent date and number of infected people, respectively. Red and blue represent real confirmed cases and predicted number of infected people, respectively.

The predictive results are composed of two parts. One is Fig. 6B, which shows the estimated parameters such as the infective rate (β), incidence rate of the exposed (α), and cure rate (γ). The other is Fig. 6C, which shows the predicted epidemic transmission curve. For example, after choosing “2020-1-20” and “SEIRQ,” Fig. 6B lists the estimated parameters. Fig. 6C shows the predicted infected case curve and the real confirmed case curve.

3.1.2. R0 Trend Analysis

After clicking the “R0 trend analysis” link, the user can employ a selective interface to choose the start date (Fig. 7A) and the epidemic transmission predictive model (Fig. 7A), which includes SIR, SEIR, and SEIRQ. The user can obtain the dynamic trend for R0 by clicking the “submit” button (Fig. 7A) or clicking the “reset” button to restore the parameters to their default values. Here, R0 indicates the infectivity of the disease. We usually consider that the epidemic situation is well controlled when R0 is less than 1 [51].

Fig. 7. R0 trend analysis: (A) Selective interface. (B) R0 trend; horizontal and vertical axes represent date and value of R0, respectively.

Fig. 7B shows the daily R0 trend computed by Eq. (3). For example, after choosing “2020-1-20” and “SEIRQ,” Fig. 7B shows that R0 continues to decrease but remains greater than 1 until the end of April.

3.2. Genome Sequence Analysis

The “genome sequence analysis” menu has three features. The first is “genome k-mer analysis,” which can count k-mers and analyze the genome of 2019-nCoV. The second is “genome LAUP analysis,” which can explore LAUPs for the 2019-nCoV genome. The third is “genome variation analysis,” which can investigate the connections among the genomic variation rate, real number of infection cases, and public positive emotional rate.

3.2.1. Genome K-Mer Analysis

After clicking the “genome k-mer analysis” link, the user can employ the selective interface to input the start and end dates and choose the length of the k-mer (Fig. 8A). The user can click the “submit” button to obtain the k-mer counting results for all 2019-nCoV genomes in the selected time period (Figs. 8B and 8C), or click the “reset” button to restore the parameters to their default values.

Fig. 8. Genome k-mer analysis: (A) Selective interface. (B) Abundance histogram of k-mer counting results. Horizontal axis represents abundance of k-mers, and vertical axis represents frequency of k-mers with that abundance in the relevant genome [52]; red and blue lines represent 2019-nCoV and all coronavirus sequences, respectively. (C) Top 10 k-mer permutations with most frequent occurrence. Vertical and horizontal axes represent k-mer permutation and counting value of k-mer, respectively.

The k-mer analysis consists of two figures. One is Fig. 8B, which shows the abundance histogram [52] of the k-mer counting results. The other is Fig. 8C, which shows the top 10 k-mer permutations with the most frequent occurrences for all 2019-nCoV genomes in the selected time period. Additionally, the user can click the “data download” button (Fig. 8C) to download the detailed k-mer counting file.

For example, after choosing start date “2020-01-23,” end date “2020-04-30,” and length of k-mer “12” (Fig. 8A), Fig. 8B shows that the peak of the 12-mer frequencies of 2019-nCoV appears at an abundance of 6, whereas that of all coronaviruses appears at an abundance of 12. Fig. 8C shows that the most frequent 12-mer permutation for 2019-nCoV is “TTTTTTTTTTTT.”

3.2.2. Genome LAUPs Analysis

After clicking the “genome LAUP analysis” link, the user can employ a selective interface to select the start and end dates (Fig. 9A) and then click the “submit” button to obtain the LAUPs analysis for all 2019-nCoV genomes in the selected time period (Figs. 9B and 9C), or click the “reset” button to restore the parameters to their default values.

Fig. 9. LAUP analysis: (A) Selective interface. (B) LAUP number statistics. (C) CG content statistics of LAUPs; red and blue lines represent 2019-nCoV and all coronavirus sequences, respectively.

The LAUP analysis consists of two figures. One is Fig. 9B, which shows the statistics of common LAUPs. The other is Fig. 9C, which shows the statistics of CG content (Eq. (6)) for the common LAUPs of all 2019-nCoV genomes in the selected time period. Additionally, the user can click the “data download” button (Fig. 9C) to download the detailed LAUP analytical file.

For example, after choosing start date “2020-01-23” and end date “2020-06-01” (Fig. 9A), Fig. 9B shows that the length of the shortest common LAUPs of 2019-nCoV is 6, whereas the length of the shortest common LAUPs for all coronavirus sequences is 8. If the 6-mers of a genome sequence include any common LAUPs of 2019-nCoV when constructing the phylogenetic tree, this indicates a distant genetic connection between this genome and the genome of 2019-nCoV. Since the number of common LAUPs is small, the computing cost is much lower than that of the normal sequence alignment method for investigating the connections between the candidate genome and the genome of 2019-nCoV. Fig. 9C shows that the CG content of the LAUPs of 2019-nCoV is greater than that of all coronaviruses at the same length k.

3.2.3. Genome Variation Analysis

After clicking the “Genome variation analysis” link, the user can employ a selective interface to select the start and end dates (Fig. 10A) and then click the “Submit” button to obtain the dynamic visualization of the genomic variation rate for all 2019-nCoV genomes in the selected time period (Fig. 10B) or click the “Reset” button to restore the parameters to their default values.

Fig. 10. Genome variation analysis: (A) Selective interface. (B) Visualization of genome variation rate, real confirmed cases, and public positive emotion. Vertical axis represents date, and horizontal axis represents number of confirmed cases, rate of genomic variation, and proportion of public positive emotions (%). Red, green, and blue represent newly confirmed cases, genome variation rate, and public positive emotion rate, respectively.

Fig. 10B visualizes the daily median genome variation rate (Eq. (7)), the number of new infections in the real epidemic, and public positive emotions (Eq. (10)). Additionally, the user can click the “data download” button (Fig. 10B) to download the detailed variation rate result file.

For example, after choosing start date “2020-01-23” and end date “2020-03-15” (Fig. 10A), Fig. 10B shows that the virus variation rate gradually increases, the number of newly confirmed cases continues to decrease, and the public positive emotion rate gradually returns to zero after February 19.

3.3. Psychological Stress Assessment

The “psychological stress assessment” offers two functional modes. One is the “public psychological stress assessment,” which can analyze the dynamic public psychological stress state under the epidemic situation; the other is “analysis of emotional words,” which can visualize the high frequency of emotional words for 2019-nCoV.

3.3.1. Public Psychological Stress Assessment

After clicking the “public psychological stress assessment” link, the interface displays the rate of positive and negative emotion per day (Fig. 11A). The user can select the period of observation by dragging the scroll bar from left to right. Additionally, the user can select the start and end dates of statistics from two drop-down boxes (Fig. 11B). Finally, the user can click the “download” button to obtain the analytic results based on the selected date.

Fig. 11. Public psychological stress assessment: (A) Visualization of positive and negative emotional trends; horizontal and vertical axes represent date and emotion rate, respectively. Red and blue lines represent positive and negative emotion rates, respectively. (B) Data download interface.

The analytic results are composed of two components (Fig. 11A). The top of Fig. 11A shows the proportional distribution of the positive emotion rate over time (in days), and the bottom of Fig. 11A shows that of the negative emotions. For example, the proportion of positive emotion is low before February 5, whereas the negative emotion is high before February 3. However, positive emotion gradually increases and negative emotion gradually decreases after February 5.

3.3.2. Analysis of Emotional Words

After clicking the “analysis of emotional words” link, the interface displays the word clouds (Fig. 12A) for all emotional words generated during the epidemic period (February 2020 to April 2020). Additionally, the user can type a word into the text box (Fig. 12B) and click the “submit” button to obtain the emotional polarity of that word.

Fig. 12. Analysis of emotional words: (A) High-frequency word cloud. (B) Emotional word query interface. (C) Top five emotional words in Chinese and English.

The word clouds (Fig. 12A) include positive words, negative words, and neutral words. Here, the font size of a word is positively related to its occurrence frequency. After typing “加油” and clicking the submit button, Fig. 12B shows that the emotional polarity of the word is positive. In addition, considering that the words in the source data are in Chinese, we translated the top five emotional words in Fig. 12C.

3.4. Related Links

Finally, “related links” include well-developed tools such as “motif discovery” [25] and “CGIDLA” [26], in which “motif discovery” can provide an online service for motif discovery [25], and “CGIDLA” [26] can help us further analyze the CG permutation specificity of LAUPs.

4. Conclusions

2019nCoVAS provides an informative and interactive platform for the analysis and visualization of epidemic transmission prediction, genome sequence analysis, and public psychological stress assessments.

Fig. 4 demonstrates that the average RMSE (Eq. (2)) of SEIRQ is significantly less than that of the SIR and SEIR models between January 24 and January 30. This occurred because Wuhan started using strict isolation measures on January 24 [53] and most Chinese provinces started the level-one response to public health emergencies on January 30 [54]. Thus, the SEIRQ model, which has isolation features, is more suitable for epidemic cases with isolation measures. Fig. 6C demonstrates that the number of infective cases predicted by SEIRQ is less than the number of real confirmed cases in February, implying that the epidemic may be underestimated in February. Additionally, Fig. 7B shows that R0 estimated by SEIRQ continues to decrease but is still greater than 1 until the end of April, indicating that the epidemic will continue.

Second, genome sequence analysis (Fig. 8) shows that 2019-nCoV has a strong specificity of permutation and base content compared with other coronaviruses. For example, Fig. 8B shows that the peak of the 12-mer frequency of 2019-nCoV appears at a different abundance from those of other coronaviruses. Additionally, Fig. 9 demonstrates that the length of the shortest common LAUPs of 2019-nCoV differs from that of other coronaviruses (Fig. 9B), and the CG content of LAUPs of 2019-nCoV is greater than that of other coronaviruses for the same length K (Fig. 9C).
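The quantities compared in Figs. 8 and 9 can be made concrete with a small Python sketch. It is a simplified illustration, not the Jellyfish-based pipeline used by the service. It makes two assumptions: a k-mer that never occurs in a sequence stands in for an underrepresented permutation, and CG content is the fraction of bases in a k-mer that are C or G. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a nucleotide sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def absent_kmers(sequence, k):
    """List the k-mers over A, C, G, T that never occur in the sequence.

    Assumption: absence is used as a simple stand-in for
    'underrepresented'. Enumeration costs 4**k, so keep k small.
    """
    counts = kmer_counts(sequence, k)
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in candidates if kmer not in counts]


def cg_content(kmer):
    """Fraction of bases in the k-mer that are C or G (assumed definition)."""
    return sum(base in "CG" for base in kmer) / len(kmer)


if __name__ == "__main__":
    toy_genome = "ACGTACGGTTAACC"  # made-up sequence for illustration
    missing = absent_kmers(toy_genome, 2)
    print(len(missing), "of 16 possible 2-mers are absent")
    print({kmer: cg_content(kmer) for kmer in missing[:3]})
```

Increasing k until the absent set becomes non-empty gives the shortest absent length for one sequence, which is the kind of quantity compared across viruses in Fig. 9B.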

Additionally, dynamic visualization of variation shows that, after February 19, the variation rate gradually increased, the number of newly confirmed cases decreased, and public emotion gradually calmed (Fig. 10B), which indicates that the government's prevention and control measures quickly controlled the epidemic and effectively stabilized the public mood.

Third, Fig. 11 shows that the positive mood gradually increased and the negative mood gradually decreased (Fig. 11A) after February 5. We consider that this phenomenon may be related to the government's prevention and control measures and to major news events such as the completion of Leishenshan Hospital on February 8 [55]. According to the emotional dictionary analysis (Fig. 12A), the most frequent positive emotional words include “come on” and “thanks,” whereas the most frequent negative words include “face masks,” “supplies,” and other words related to medical supplies. This indicates that people were still worried about the provision of medical supplies during the 2019-nCoV epidemic.

In general, 2019nCoVAS is an effective web service for 2019-nCoV. However, due to the limitations of computing power, it does not provide real-time simulations. In the future, we will not only employ high-performance computing technology [56], [57] to realize real-time simulations but will also develop more in-depth LAUP analysis methods [26], [27] to further analyze the sequence characteristics of the 2019-nCoV genome. Finally, since the 2019-nCoV epidemic is still spreading around the world, we will carry out genomic sequence analysis, epidemic transmission prediction, and public psychological stress assessments for different countries.

Acknowledgments

This work was supported by National Science and Technology Major Project [2018ZX10201002], China Postdoctoral Science Foundation [2020M673221], and Sichuan University Postdoctoral Research and Development Foundation [2020SCU12056].

Biographies

Ming Xiao (Member, IEEE) received the BS, MS, and PhD degrees from Southwest University, China, in 2007, 2010, and 2018, respectively. Currently, he is a postdoctoral researcher with the College of Computer Science, Sichuan University. His research interests involve bioinformatics, data mining, and parallel computing.

Guangdi Liu received the MS degree from Xihua University, China, in 2014. Currently, he is working toward the doctoral degree in the College of Computer and Information Science, Southwest University. His research interests include bioinformatics, artificial intelligence, and data mining.

Jianghang Xie received the BS degree from the Chongqing University of Posts and Telecommunications, in 2019. Currently, he is working toward the master's degree at the College of Computer Science, Sichuan University. His research interests include machine learning and parallel computing.

Zichun Dai received the BS degree from Sichuan University, China, in 2018. Currently he is working toward the master's degree at the College of Computer Science, Sichuan University. His research interests include bioinformatics, parallel computing, and data mining.

Zihao Wei is currently working toward the graduate degree at the College of Computer Science, Sichuan University. His research interests include web services development and data mining.

Ziyao Ren is currently working toward the graduate degree at the College of Computer Science, Sichuan University. His research interests include data mining and parallel computing.

Jun Yu received the BS degree from Jilin University, China, in 1983, and the MS and PhD degrees from the New York University School of Medicine, in 1986 and 1990, respectively. From 1998 to 2003, he worked as a research scientist at the Human Genome Center, Institute of Genetics, Chinese Academy of Sciences. From 2003 to 2012, he served as the deputy director of Beijing Institute of Genomics, Chinese Academy of Sciences. Currently, he is a research scientist with the Beijing Institute of Genomics, Chinese Academy of Sciences. His research interests include genomics and bioinformatics.

Le Zhang received the BS degree from the Beijing Institute of Technology, China, in 1999, and the MS and PhD degrees from Louisiana Tech University in 2005, and completed his postdoctoral training at Harvard Medical School from 2005 to 2008. Currently, he is a full professor at the College of Computer Science, Sichuan University. His research interests include bioinformatics, computational biology, artificial intelligence, and high-performance computing.

Contributor Information

Ming Xiao, Email: xiaoming@scu.edu.cn.

Guangdi Liu, Email: liuguangdi1103@126.com.

Jianghang Xie, Email: xjh0013@163.com.

Zichun Dai, Email: daizichun@stu.scu.edu.cn.

Zihao Wei, Email: 2018141461086@stu.scu.edu.cn.

Ziyao Ren, Email: ziyaoren99@gmail.com.

Jun Yu, Email: junyu@big.ac.cn.

Le Zhang, Email: zhangle06@scu.edu.cn.

References

  • [1] Toit A. D., “Outbreak of a novel coronavirus,” Nat. Rev. Microbiol., vol. 18, no. 3, p. 123, 2020.
  • [2] Li T., Cheng Z., and Zhang L., “Developing a novel parameter estimation method for agent-based model in immune system simulation under the framework of history matching: A case study on influenza A virus infection,” Int. J. Mol. Sci., vol. 18, no. 12, Dec. 2017.
  • [3] Wu W., Song L., Yang Y., Wang J., Liu H., and Zhang L., “Exploring the dynamics and interplay of human papillomavirus and cervical tumorigenesis by integrating biological data into a mathematical model,” BMC Bioinf., vol. 21, no. Suppl 7, May 2020, Art. no. 152.
  • [4] Zhang L., Bai W., Yuan N., and Du Z., “Comprehensively benchmarking applications for detecting copy number variation,” PLoS Comput. Biol., vol. 15, no. 5, May 28, 2019, Art. no. e1007069.
  • [5] Zhang L., et al., “Revealing dynamic regulations and the related key proteins of myeloma-initiating cells by integrating experimental data into a systems biological model,” Bioinformatics, Jul. 26, 2019, doi: 10.1093/bioinformatics/btz542.
  • [6] Liu G.-D., Li Y.-C., Zhang W., and Zhang L., “A brief review of artificial intelligence applications and algorithms for psychiatric disorders,” Engineering, vol. 6, no. 4, pp. 462–467, 2020.
  • [7] Liu G., et al., “Research on psychological scales based on multitheory fusion,” Curr. Bioinf., vol. 15, pp. 1–9, 2019.
  • [8] Ivorra B., Ferrandez M. R., Vela-Perez M., and Ramos A. M., “Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections. The case of China,” Commun. Nonlinear Sci. Numer. Simul., 2020, Art. no. 105303.
  • [9] Roda W. C., Varughese M. B., Han D., and Li M. Y., “Why is it difficult to accurately predict the COVID-19 epidemic?,” Infect. Dis. Modelling, vol. 5, pp. 271–281, 2020.
  • [10] Yang Z., et al., “Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions,” J. Thoracic Dis., vol. 12, no. 3, pp. 165–174, Mar. 2020.
  • [11] Hou C., et al., “The effectiveness of quarantine of Wuhan city against the corona virus disease 2019 (COVID-19): A well-mixed SEIR model analysis,” J. Med. Virol., vol. 92, pp. 841–848, 2020.
  • [12] Prem K., et al., “The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: A modelling study,” Lancet Public Health, vol. 5, no. 5, pp. E261–E270, May 2020.
  • [13] Arulampalam M. S., Maskell S., Gordon N., and Clapp T., “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
  • [14] Hethcote H. W., “The mathematics of infectious diseases,” SIAM Rev., vol. 42, no. 4, pp. 599–653, Dec. 2000.
  • [15] Dong E., Du H., and Gardner L., “An interactive web-based dashboard to track COVID-19 in real time,” Lancet Infect. Dis., vol. 20, no. 5, pp. 533–534, 2020.
  • [16] Valls J., Tobias A., Satorra P., and Tebe C., “COVID19-Tracker: A shiny app to analise data on SARS-CoV-2 epidemic in Spain,” Gaceta Sanitaria, vol. 35, pp. 99–101, 2020.
  • [17] Lu R., et al., “Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding,” Lancet, vol. 395, no. 10224, pp. 565–574, 2020.
  • [18] Zhao W.-M., et al., “The 2019 novel coronavirus resource,” Yi Chuan = Hereditas, vol. 42, no. 2, pp. 212–221, 2020.
  • [19] Zhu N., et al., “A novel coronavirus from patients with pneumonia in China, 2019,” New Engl. J. Med., vol. 382, no. 8, pp. 727–733, 2020.
  • [20] Xiao M., et al., “K-mer counting: Memory-efficient strategy, parallel computing and field of application for bioinformatics,” in Proc. IEEE Int. Conf. Bioinf. Biomed., 2018, pp. 2561–2567.
  • [21] Jinghui C., Yuxin Y., and Dong W., “Mental health status and its influencing factors among college students during the epidemic of COVID-19,” J. South Med. Univ., no. 02, pp. 171–176, 2020.
  • [22] Peng F., et al., “Analysis of public psychological behavior and countermeasures during the epidemic of COVID-19,” Soc. Sci. Rev., no. 2, pp. 1–5, 2020.
  • [23] Zhang J., Wu W., Zhao X., and Zhang W., “Recommended psychological crisis intervention response to the 2019 novel Coronavirus pneumonia outbreak in China: A model of West China hospital,” Precis. Clin. Med., vol. 3, no. 1, pp. 3–8, 2020.
  • [24] Zhang L., Xiao M., Zhou J., and Yu J., “Lineage-associated underrepresented permutations (LAUPs) of mammalian genomic sequences based on a Jellyfish-based LAUPs analysis application (JBLA),” Bioinformatics, vol. 34, no. 21, pp. 3624–3630, Nov. 1, 2018.
  • [25] Bailey T. L., et al., “MEME SUITE: Tools for motif discovery and searching,” Nucl. Acids Res., vol. 37, no. Web Server issue, pp. W202–W208, 2009.
  • [26] Xiao M., Yang X., Yu J., and Zhang L., “CGIDLA: Developing the web server for CpG island related density and LAUPs (Lineage-associated underrepresented permutations) study,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 17, no. 6, pp. 2148–2154, Nov./Dec. 2020.
  • [27] Zhang L., Dai Z., Yu J., and Xiao M., “CpG-Island-based annotation and analysis of human house-keeping genes,” Brief. Bioinf., vol. 22, no. 1, pp. 515–525, Jan. 18, 2021.
  • [28] King A., “AkShare,” 2019. [Online]. Available: https://github.com/jindaxiang/akshare
  • [29] Ren C., Ren S., Chai Y., and Liu Y., “Modeling agile supply chain dynamics: A complex adaptive system perspective,” in Proc. IEEE Int. Conf. Syst. Man Cybern., 2002, Art. no. 6.
  • [30] Sedlazeck F. J., et al., “Accurate detection of complex structural variations using single-molecule sequencing,” Nat. Methods, vol. 15, no. 6, pp. 461–468, Jun. 2018.
  • [31] Cao S., Feng P., and Shi P., “Study on the epidemic development of COVID-19 in Hubei province by a modified SEIR model,” J. Zhejiang Univ. (Med. Sci.), vol. 49, no. 2, pp. 178–184, 2020.
  • [32] Bergmeir C. and Benítez J. M., “On the use of cross-validation for time series predictor evaluation,” Inf. Sci., vol. 191, pp. 192–213, 2012.
  • [33] Chai T. and Draxler R. R., “Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature,” Geosci. Model Develop., vol. 7, no. 3, pp. 1247–1250, 2014.
  • [34] Chan J. F.-W., et al., “A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: A study of a family cluster,” Lancet, vol. 395, no. 10223, pp. 514–523, Feb. 15, 2020.
  • [35] Dietz K., “The estimation of the basic reproduction number for infectious diseases,” Statist. Methods Med. Res., vol. 2, no. 1, pp. 23–41, 1993.
  • [36] Shu Y. and McCauley J., “GISAID: Global initiative on sharing all influenza data – From vision to reality,” Euro Surveillance, vol. 22, no. 13, 2017, Art. no. 30494.
  • [37] N. G. D. C. Members, and Partners, “Database resources of the national genomics data center in 2020,” Nucleic Acids Res., vol. 48, pp. D24–D33, 2019.
  • [38] Kingsford C., “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers,” Bioinformatics, vol. 27, no. 6, pp. 764–770, 2011.
  • [39] Green T. J., Cox R., Tsao J., Rowse M., Qiu S., and Luo M., “Common mechanism for RNA encapsidation by negative-strand RNA viruses,” J. Virol., vol. 88, no. 7, pp. 3766–3775, 2014.
  • [40] Xu T. S., Yang X., Zhang H. C., and Zhang J., “An advanced data capture method based on sina weibo,” pp. 1489–1492, 2015.
  • [41] Yao R., Xu G., and Song J., “Micro-blog new word discovery method based on improved mutual information and branch entropy,” J. Comput. Appl., vol. 36, no. 10, pp. 2772–2776, 2016.
  • [42] Kanna P. R. and Pandiaraja P., “An efficient sentiment analysis approach for product review using turney algorithm,” Procedia Comput. Sci., vol. 165, pp. 356–362, 2019.
  • [43] Zhang D., Xu H., Su Z., and Xu Y., “Chinese comments sentiment classification based on word2vec and SVMperf,” Expert Syst. Appl., vol. 42, no. 4, pp. 1857–1863, 2015.
  • [44] Rong X., “Word2vec Parameter Learning Explained,” pp. 1–19, 2014, arXiv:1411.2738.
  • [45] Liu G., Xia Y., Yang C., and Zhang L., “The review of the major entropy methods and applications in biomedical signal research,” in Proc. Int. Symp. Bioinf. Res. Appl., 2018, pp. 87–100.
  • [46] Lin C.-Y., Xue N., Zhao D., Huang X., and Feng Y., “Natural Language Understanding and Intelligent Applications,” in Proc. Int. Conf. Comput. Process. Oriental Lang., 2016, pp. 175–177.
  • [47] Salle A. and Villavicencio A., “Why so down? The role of negative (and positive) pointwise mutual information in distributional semantics,” pp. 1–6, 2019, arXiv:1908.06941.
  • [48] Toprak A. and Turan M., “The positive effect of PMI on the selection of meaningful words,” in Proc. 11th Int. Conf. Elect. Electronics Eng., 2019, pp. 911–915.
  • [49] Bouma G., “Normalized (pointwise) mutual information in collocation extraction,” in Proc. GSCL, 2009, pp. 31–40.
  • [50] Matsumoto Y., Sproat R., Wong K.-F., and Zhang M., “Computer processing of oriental languages beyond the orient: The research challenges ahead,” in Proc. 21st Int. Conf. Comput. Orient. Lang., 2006, pp. 256–277.
  • [51] Zhao S., et al., “Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak,” Int. J. Infect. Dis., vol. 92, pp. 214–217, Mar. 2020.
  • [52] Chor B., Horn D., Goldman N., Levy Y., and Massingham T., “Genomic DNA k-mer spectra: Models and modalities,” Genome Biol., vol. 10, no. 10, pp. 1–10, 2009.
  • [53] I. c. c. i. Wuhan, “Notice on epidemic prevention and control of 2019nCov in Wuhan,” 2020. [Online]. Available: http://www.gov.cn/xinwen/2020-01/23/content_5471751.htm
  • [54] Caixin, “All 31 provinces in mainland China launched the first level response to public health emergencies,” 2020. [Online]. Available: http://china.caixin.com/2020-01-29/101509411.html
  • [55] C. Daily, “Leishenshan hospital ready to receive patients,” Feb. 8, 2020. [Online]. Available: http://www.chinadaily.com.cn/a/202002/08/WS5e3e7810a310128217275fb3.html
  • [56] Shi L., Huang Z., Hu N., Yang Z., and Zhang L., “Integrating semantic query function into D-NetWeaver,” J. Med. Imag. Health Inform., vol. 5, no. 5, pp. 982–986, 2015.
  • [57] Jiang B., Dai W., Khaliq A., Zhou X., and Zhang L., “Accelerating 3D diffusion model in a cylindrical coordinate system by graphics processing unit (GPU) techniques,” Math. Comput. Simul., vol. 109, pp. 1–19, 2015.

Articles from IEEE/ACM Transactions on Computational Biology and Bioinformatics are provided here courtesy of the Institute of Electrical and Electronics Engineers
