Abstract
An approach based on fractal scaling analysis was used to characterize the organization of the SARS-CoV-2 genome sequence. The method is based on the detrended fluctuation analysis (DFA) implemented on a sliding window scheme to detect variations of long-range correlations over the genome sequence regions. The nucleotide sequence is mapped into a numerical sequence by using four different assignation rules: amino-keto, purine-pyrimidine, hydrogen-bond and hydrophobicity patterns. The originally reported sequence from Wuhan isolates (Wuhan Hu-1) was considered as a reference to contrast the structure of the 2002–2004 SARS-CoV-1 strain. Long-range correlations, quantified in terms of a scaling exponent, depended on both the mapping rule and the sequence region. Deviations from randomness were attributed to serial correlations or anti-correlations, which can be ascribed to ordered regions of the genome sequence. It was found that the Wuhan Hu-1 sequence was less random than the SARS-CoV-1 sequence, which suggests that SARS-CoV-2 possesses a more efficient genomic structure for replication and infection. In general, the isolates from the early months of 2020 showed slight correlation differences with the Wuhan Hu-1 sequence. However, early isolates from India and Italy presented visible differences that led to a more ordered sequence organization. It is apparent that the increased sequence order, particularly in the spike region, endowed some early variants with a more efficient mechanism for spreading, replicating and infecting. Overall, the results showed that the DFA provides a suitable framework to assess long-term correlations hidden in the internal organization of the SARS-CoV-2 genome sequence.
Keywords: Detrended Fluctuation Analysis (DFA), Sequence Mapping Assignation Rules, Long Range Correlations, SARS-CoV-2 Genome Sequence
1. Introduction
Several cases of a new severe acute respiratory syndrome (SARS) were detected in Wuhan, China in December 2019. The syndrome showed several similarities with the SARS outbreak that emerged in China in 2002, which was caused by a coronavirus. The new outbreak was also caused by a coronavirus, which was named SARS-CoV-2; the associated disease is known as Covid-19. The new syndrome propagated outside China and by March 2020 the virus was detected in at least 100 countries with about 100,000 confirmed cases. The infection has persisted with an increasing number of infected cases and deaths and is considered a pandemic threat to the economy, social stability and human health.
The danger of the syndrome was promptly recognized, and so was the necessity of an accurate understanding of the propagation, infection and replication mechanisms and dynamics. Swiftly, researchers and public health administrators were aware that a vaccine was probably the only expedient way to alleviate the social, economic and public health system problems caused by the virus. The first detailed description of the genome sequence was reported on January 20 [1], unfolding important efforts to characterize its structure and organization to develop an effective vaccine. The monitoring of the virus replication showed the surge of the first variant of concern (Alpha – B.1.1.7) first identified in the United Kingdom in November 2020. New variants of concern were detected, including the nowadays dominant variant Delta (B.1.617.2) documented in May 2021, which spreads much faster and may cause more severe cases than the other variants.
A plethora of studies have been carried out to characterize the structure and mutations of the SARS-CoV-2 genome sequence composed of about 29800 bp. Most of these studies focus on the description of the functional regions and their modifications caused by different mutations aimed to classify the variants into lineages [2]. Mutations, spike structure and immune escape mechanisms have been of particular interest to understand the infection mechanisms and vaccine effectivity [3]. Commonly, the analysis of the SARS-CoV-2 genome relies on comparing the nucleotide arrangement of two or more sequences to detect modifications linked to mutations of interest. More involved methods have also been used to gain insights into the complexity of the sequence. In particular, methods based on information theory are expected to provide valuable insights into the evolution of genome functionality. Ghanchi et al. [4] found that the earlier genome sequences isolated before June 2020 exhibited a higher mean and site-specific entropy compared with that of posterior isolates. This suggests that the viral genome achieved stability after the initial early period of the pandemic. Nawaz et al. [5] postulated that alignment-free methods can provide valuable insights into the structure and functionality of the SARS-CoV-2 genome sequence. Namazi [6] used sample and approximate entropy methods to conclude that SARS-CoV-2 has a significantly different complex structure than SARS-CoV-1. Fractal dimension methods and scaling analysis based on detrended fluctuation analysis (DFA) were used to find that the genome sequence possesses persistent long-range memory [7].
The reduced number of studies on the informational structure (e.g., long-range correlations) of the SARS-CoV-2 genome sequence has a limited impact on the characterization of the mechanisms and mutation effects in the sequence structure. Results in this line should provide valuable insights that might complement the studies based on evolutionary genetics. While most studies have relied on comparing the locations of nucleotides in different regions, the effect of mutations (substitutions and displacements) on the intrinsic informational organization, order and complexity of the SARS-CoV-2 genome sequence has not been explored. The reduced number of studies only provided an average characterization of the genome structure [4], [5], [7], without focusing on the local structure. Mutations are located in small regions, such that the modifications of the genome structure are also expressed in local terms. In this regard, the present study aims to explore the local organization of the SARS-CoV-2 genome sequence using long-term correlation analysis. The approach is based on a sliding window implementation of the detrended fluctuation analysis (DFA) to characterize the local fluctuations of the fractal scaling properties [8]. The main results showed that the SARS-CoV-2 genome sequence is more structured (i.e., less random) than its SARS-CoV-1 counterpart, a feature that would endow the virus with improved capacities for replication and infection. Mutations were linked to local variations of the scaling properties, particularly in the surface glycoprotein (spike) region. Overall, the results showed that the DFA is a suitable framework to characterize the internal organization of the genome sequence in terms of long-range correlations.
2. Detrended fluctuation analysis
For a given numerical mapping rule of the genome sequence, the aim is to estimate local autocorrelations that can be of interest in terms of mutations reported in the literature. A diverse set of methods has been proposed in recent decades, including spectral analysis and rescaled range (R/S) analysis. However, real sequences can be affected by non-stationarities and long-term trends, which might induce biased estimates of fractal scaling properties. The so-called detrended fluctuation analysis (DFA) was proposed in statistical physics to overcome the drawbacks of traditional methods. The method is an improvement of the classical fluctuation analysis by removing long-term trends [9]. The DFA allows the detection of intrinsic self-similarity embedded in a seemingly nonstationary time series, and also avoids the spurious detection of apparent self-similarity, which may be an artifact of extrinsic trends. These features have made the DFA the most widely used approach for the fractal analysis of complex time series in a large diversity of fields, from geophysics [10] and financial systems [11] to biology [12], physiology [13] and chemistry [14], [15], and provides an easy interpretation of fluctuation patterns in terms of scaling exponents.
For the sake of completeness, the DFA can be briefly described as follows. Given a bounded sequence $x_i$, $i = 1, \ldots, N$, the summation first converts the sequence into an unbounded sequence
$$y_k = \sum_{i=1}^{k} \left(x_i - \bar{x}\right), \quad k = 1, \ldots, N \qquad (1)$$
where $\bar{x}$ is the mean value of the sequence, and $y_k$ corresponds to the profile. Next, the sequence $y_k$ is divided into $N/n$ subsequences of length $n$ samples each. The size $n$ of the subsequences represents the scale at which the fluctuations of the unbounded sequence are characterized. A local least-squares straight-line fit is computed for each subsequence. Let $y_{n,k}$ be the resulting piecewise sequence of straight-line fits, which represents the long-term trend of the profile sequence for the scale $n$. Then, the fluctuations are quantified as the root-mean-square deviation from the trend:
$$F(n) = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left(y_k - y_{n,k}\right)^2} \qquad (2)$$
In this way, the fluctuation function $F(n)$ is dependent on the scale $n$. For small values of $n$, the least-squares fitting provides a better description of the fluctuations of $y_k$. In contrast, for large values of $n$, a coarser fitting of the fluctuations is obtained and the difference $y_k - y_{n,k}$ is higher than for small values of the scale $n$. The procedure of detrending and fluctuation quantification is repeated over a range of different subsequence lengths $n$. If the process is self-affine, the fluctuation function can be described as a power-law function of the form
$$F(n) \sim n^{\alpha} \qquad (3)$$
The exponent $\alpha$ is a generalization of the Hurst exponent and reflects the scaling behavior of the detrended fluctuations. For sufficiently long sequences (i.e., $N$ very large), one has that $\alpha = 0.5$ for uncorrelated random walks (i.e., the walk displacement grows like $n^{1/2}$). Deviations from $\alpha = 0.5$ indicate the presence of self-correlations. In particular, $\alpha < 0.5$ reflects anti-persistent behavior (i.e., a positive move is likely followed by a negative move and vice versa), whereas $\alpha > 0.5$ corresponds to persistent behavior (i.e., a positive/negative move is likely followed by a positive/negative move). The range of scales used for the computation of the DFA scaling exponent follows the usual recommendation [9], and the scaling exponent is estimated as the slope of a log–log plot of the fluctuation function $F(n)$ versus the scale $n$.
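The steps in Eqs. (1)–(3) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used for the reported results. The function names, the choice of non-overlapping boxes (trailing samples that do not fill a box are dropped) and the test scales are assumptions made for the example; scales should satisfy 3 ≤ n ≤ N so that the residual of the linear fit is not trivially zero.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fluctuation(x, n):
    """Eq. (2): RMS deviation of the profile from its piecewise linear trend."""
    mean_x = sum(x) / len(x)
    profile, total = [], 0.0
    for value in x:  # Eq. (1): cumulative sum of deviations from the mean
        total += value - mean_x
        profile.append(total)
    t = list(range(n))
    squared, used = 0.0, 0
    # Non-overlapping boxes of n samples (an assumption of this sketch).
    for start in range(0, len(profile) - n + 1, n):
        box = profile[start:start + n]
        slope, intercept = linear_fit(t, box)
        squared += sum((y - (slope * k + intercept)) ** 2 for k, y in zip(t, box))
        used += n
    return math.sqrt(squared / used)


def dfa_exponent(x, scales):
    """Eq. (3): slope of log F(n) versus log n."""
    log_n = [math.log(n) for n in scales]
    log_f = [math.log(fluctuation(x, n)) for n in scales]
    return linear_fit(log_n, log_f)[0]


# Uncorrelated +1/-1 noise should give an exponent close to 0.5.
rng = random.Random(0)
noise = [rng.choice((-1, 1)) for _ in range(4000)]
alpha = dfa_exponent(noise, scales=[8, 16, 32, 64, 128])
print(f"alpha for uncorrelated noise: {alpha:.2f}")
```

For finite sequences the estimate fluctuates around 0.5 rather than matching it exactly, which is the motivation for the randomness test of Section 2.1.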
2.1. Randomness test
The genome sequence analysis involves the detection of regions containing serial correlations. In this way, one should test whether the DFA predicts deviations from randomness. In principle, serial correlations are present when the scaling exponent is not equal to 0.5. However, the value $\alpha = 0.5$, indicating the absence of serial correlations, holds for asymptotic statistics of the fluctuation function (i.e., for very large values of the scale $n$) [16]. This is a drawback since one commonly deals with sequences of relatively small length. If the probability distribution that generated the values of the sequence is available, an approach is to generate many random sequences of finite size and to compute the statistics of the DFA scaling exponent to obtain the confidence intervals (CI). However, the exact distribution is hardly available in practice for a given process. Bootstrapping estimates can be used by considering an approximate (i.e., empirical) distribution. The iso-distributional surrogate data test [17] was used to test randomness. Briefly, the following procedure is proposed to estimate the CI for randomness: a) Compute $M$ shuffled sequences $x_i^{(s)}$ from the original sequence $x_i$. Shuffling destroys serial correlations while retaining the statistical distribution of values. That is, the sequences $x_i$ and $x_i^{(s)}$ were generated from a common distribution $P$. b) Compute the DFA scaling exponent $\alpha^{(s)}$ for the shuffled sequences $x_i^{(s)}$, which reflects the scaling behavior of a random sequence. c) Carry out the statistical analysis of the scaling exponent values $\alpha^{(s)}$ to obtain the corresponding CI for randomness. In the sequel, 10,000 randomized sequences were used to estimate the confidence intervals of the sliding-window analysis.
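Steps a) to c) can be illustrated with the following sketch, which builds a percentile confidence band from shuffled copies of a sequence. The function `shuffle_ci` and its parameters are hypothetical names introduced for the example. To keep the block short, a lag-1 autocorrelation is used as a stand-in statistic (an assumption); in the setting of this paper the statistic would be the DFA scaling exponent.

```python
import random


def shuffle_ci(sequence, statistic, n_surrogates=200, level=0.90, seed=0):
    """Percentile band of `statistic` over shuffled copies of `sequence`.

    Shuffling keeps the distribution of values but destroys serial order.
    Returns (lower, upper, mean) of the surrogate statistic.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_surrogates):
        surrogate = list(sequence)
        rng.shuffle(surrogate)
        values.append(statistic(surrogate))
    values.sort()
    tail = (1.0 - level) / 2.0
    lower = values[int(tail * (n_surrogates - 1))]
    upper = values[round((1.0 - tail) * (n_surrogates - 1))]
    return lower, upper, sum(values) / n_surrogates


def lag1_autocorrelation(x):
    """Stand-in statistic (an assumption); replace with a DFA estimator."""
    mean = sum(x) / len(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x, x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den


# Strongly ordered toy sequence: runs of +1 alternating with runs of -1.
ordered = ([1] * 20 + [-1] * 20) * 10
lo, hi, mean = shuffle_ci(ordered, lag1_autocorrelation)
observed = lag1_autocorrelation(ordered)
is_random = lo <= observed <= hi
print(f"observed={observed:.3f}, 90% band=({lo:.3f}, {hi:.3f}), random={is_random}")
```

A value of the statistic outside the band is read as a deviation from randomness; a value inside the band cannot be distinguished from a shuffled sequence with the same distribution.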
3. Numerical mapping of the genome sequence
A genome is a text sequence of the nucleotides ($b_i$) adenine (A), cytosine (C), guanine (G) and thymine (T). To apply the DFA to a sequence of nucleotides, the text sequence should be mapped into a one-dimensional numerical sequence $x_i$. In this regard, four different mappings were considered [18]:
a) Amino-Keto Rule (AK Rule). $x_i = +1$ if $b_i$ is an amino (A or C) and $x_i = -1$ if $b_i$ is a keto (G or T).
b) Purine-Pyrimidine Rule (PP Rule). $x_i = +1$ if $b_i$ is a purine (A or G) and $x_i = -1$ if $b_i$ is a pyrimidine (C or T).
c) Hydrogen Bond Energy Rule (HB Rule). $x_i = +1$ for strongly bonded pairs (C or G) and $x_i = -1$ for weakly bonded pairs (A or T).
d) Hydrophobicity Rule (HP Rule). The numerical value $x_i$ corresponds to the dipolar moment of the nucleotide according to the following assignation [19]: 2.46 for A, 7.04 for C, 6.75 for G and 4.56 for T.
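The four mapping rules can be written as lookup tables. The sketch below is illustrative; the names `RULES` and `map_sequence` are hypothetical, and the +1/−1 assignment of the three binary rules is an assumption (Section 5.3 only states that the PP rule produces the two letters +1 and −1). The HP values are those listed in rule d).

```python
# Lookup tables for the four mapping rules. The +1/-1 values of the three
# binary rules are an assumption of this sketch.
RULES = {
    "AK": {"A": 1, "C": 1, "G": -1, "T": -1},   # amino vs. keto
    "PP": {"A": 1, "G": 1, "C": -1, "T": -1},   # purine vs. pyrimidine
    "HB": {"C": 1, "G": 1, "A": -1, "T": -1},   # strong vs. weak H-bond
    "HP": {"A": 2.46, "C": 7.04, "G": 6.75, "T": 4.56},
}


def map_sequence(nucleotides, rule):
    """Map a nucleotide string to a numerical sequence.

    Raises KeyError for symbols other than A, C, G, T (e.g., ambiguity codes).
    """
    table = RULES[rule]
    return [table[base] for base in nucleotides.upper()]


for name in RULES:
    print(name, map_sequence("ACGT", name))
```

Ambiguous symbols are rejected here rather than silently mapped; how such symbols should be treated is a design choice that the text does not specify.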
4. SARS-CoV-2 genome sequences
The genome sequences were downloaded from the NCBI GenBank database at the publicly accessible site www.ncbi.nlm.nih.gov. The genome sequence for SARS-CoV-1 (NC_004718.3) is composed of 29,751 nucleotides [20]. The first reported genome sequence for SARS-CoV-2 (Wuhan Hu-1, MN908947, 29,903 bp), collected in Wuhan, China [1], was taken as a reference. Genome sequences for SARS-CoV-2 from samples collected in the early months of 2020 were used to contrast the results of the Wuhan Hu-1 isolate: MT019529 (China, collection date December 23, 2019, 29,989 bp), MT012098 (India, collection date January 27, 2020, 29,854 bp), MT039890 (South Korea, collection date January 2020, 29,903 bp), MT066156 (Italy, collection date January 30, 2020, 29,867 bp), MN985325 (USA, collection date January 19, 2020, 29,882 bp) and MT126808 (Brazil, collection date February 28, 2020, 29,876 bp). It is noted that the above sequences correspond to isolates obtained before the WHO designated variants of concern. For instance, the earliest samples of the Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1) and Delta (B.1.617.2) variants were obtained no earlier than May 2020. The aim of considering the above isolates was to evaluate early variations of the genome sequence in terms of the fractal scaling properties.
5. Results and discussion
5.1. Global scaling pattern
Fig. 1.a shows the results of applying the DFA to the Wuhan Hu-1 sequence (MN908947) for the four different numerical mapping rules. The dotted line denotes the scaling behavior for an uncorrelated control noise, corresponding to the scaling exponent $\alpha = 0.5$. The four mapping rules exhibited a scaling behavior similar to that of uncorrelated noise for scales of up to 800 bp, which is in line with results reported by Buldyrev et al. [18] for genome sequences from different sources from eukaryotes to vertebrates. A marked deviation from the uncorrelated behavior was observed for scales higher than 1000 bp. In such a case, the fluctuation function points are located above the line of the uncorrelated noise, suggesting the presence of long-term correlations. A more detailed view of the scaling behavior can be obtained from the local scaling exponent, computed as the local derivative of the log–log plot [21]; the results are shown in Fig. 1.b. These results confirm that the scaling behavior of the SARS-CoV-2 genome is scale-dependent, with a pattern close to that of uncorrelated dynamics for relatively small scales, up to about 700–800 bp. On the other hand, the local scaling exponent is visibly higher than 0.5 for long scales, a behavior that in principle may be ascribed to the presence of long-term serial correlations. Buldyrev et al. [22] postulated that the detected correlations for scales of hundreds and small thousands of bp could be attributed to simple-repeat expansions. Repeated patterns of two or more nucleotides commonly exhibit fat tails, which may be fit by a power-law function [22]. In this way, the deviations of the scaling exponent from 0.5 for high scales do not necessarily reflect long-term correlations, but instead the presence of fat-tailed distributions that can be described as power-law functions. This feature is illustrated by Fig. 1.c, where the scaling behavior of the AK Rule is compared with that from 200 sequences obtained by shuffling (i.e., randomizing) the original genome sequence. It is noted that the fluctuation function of the original sequence is located within the envelope of fluctuation functions of the shuffled sequences, such that the scaling behavior of the original sequence may not be distinguished from that of random sequences having the same distribution. In this way, one concludes that the analysis of the full genome sequence is hampered by spurious effects induced by, e.g., fat-tailed distributions. Carpena et al. [23] reported that no single scaling exists in genome sequences. In this way, the scaling analysis should consider variations of the scaling behavior along the genome sequence.
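The local scaling exponent used in Fig. 1.b can be illustrated as a finite-difference slope of the log–log plot. The sketch below uses central differences, which is an assumption; the estimator of [21] may differ. The function name and the synthetic fluctuation values are hypothetical.

```python
import math


def local_exponents(scales, fluctuations):
    """Central-difference slope of log F(n) versus log n at interior scales."""
    log_n = [math.log(n) for n in scales]
    log_f = [math.log(f) for f in fluctuations]
    return [
        (log_f[i + 1] - log_f[i - 1]) / (log_n[i + 1] - log_n[i - 1])
        for i in range(1, len(scales) - 1)
    ]


# Synthetic check: an exact power law F(n) = n^0.5 has local slope 0.5.
scales = [10, 20, 40, 80, 160]
fluct = [n ** 0.5 for n in scales]
slopes = local_exponents(scales, fluct)
print([round(s, 3) for s in slopes])
```

For a scale-dependent sequence the returned slopes vary with the scale, which is the behavior described for the genome sequence above.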
Fig. 1.
(a) Behavior of the fluctuation function for the four different mapping rules. The DFA was carried out for the full sequence (about 29800 bp). (b) Local scaling exponent obtained as the derivative of the log–log plot of the fluctuation function. (c) Fluctuation function for the AK Rule compared with that obtained from 200 shuffled (i.e., randomized) sequences.
5.2. Sliding window scaling analysis
The results in Fig. 1 showed that the SARS-CoV-2 genome is a complex sequence affected by long-term non-stationarities, which lead to the detection of spurious correlations at relatively high scales. In this regard, the analysis is focused on sequences of smaller size via the DFA implementation in a sliding window. This approach offers the advantage that variations of the scaling properties over different genome regions may be detected [9]. Given the uncertainty of the scaling exponent estimates for high scales, Fig. 1.c suggests carrying out the DFA for scales no larger than 500 bp. To illustrate the effect of the sliding window size, Fig. 2 presents the variations of the scaling exponent for the AK Rule along the bp position for three different values of the sliding window size (300, 400 and 500 bp) and a sliding step of 100 bp. The gray band denotes the 90% randomness confidence interval (CI) band obtained from 10,000 shuffled sequences of the original sliding window sequence, whereas the green line denotes the corresponding mean value. The vertical lines denote the boundaries of the functional domains of the SARS-CoV-2 genome sequence. The scaling exponent exhibits alternating regions of random and correlated behavior, which can be linked to the different functional regions of the genome sequence. In general, the structure of the scaling exponent is preserved when the sliding window size increases from 300 (Fig. 2.a) to 500 bp (Fig. 2.c). One of the effects of increasing the sliding window size was to smooth the scaling exponent fluctuations. The scaling exponent for a window of 300 bp offered more details, although with some high-frequency fluctuations. In this way, the sliding window size of 400 bp offers a balance between the details and smoothing of the scaling exponent variations. In the sequel, the window size of 400 bp will be used to analyze the SARS-CoV-2 genome sequence.
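The sliding window scheme can be sketched as follows, with the window size of 400 and step of 100 used above as defaults. The function names are hypothetical, and reporting each window by its 0-based start index is an assumption, since the text does not state which position is assigned to a window. A simple mean is used as a stand-in statistic to keep the block short; in the analysis the statistic is the DFA scaling exponent.

```python
def sliding_windows(sequence, size=400, step=100):
    """Yield (start, window) pairs; `start` is a 0-based index (assumption)."""
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


def sliding_statistic(sequence, statistic, size=400, step=100):
    """Evaluate `statistic` on every window; returns a list of (start, value)."""
    return [
        (start, statistic(window))
        for start, window in sliding_windows(sequence, size, step)
    ]


def window_mean(window):
    """Stand-in statistic (an assumption); replace with a DFA estimator."""
    return sum(window) / len(window)


# Toy sequence with a change of regime at position 1000.
toy = [1] * 1000 + [-1] * 1000
profile = sliding_statistic(toy, window_mean)
print(len(profile), profile[0], profile[8], profile[-1])

# Number of windows for a sequence with the length of Wuhan Hu-1 (29,903 bp).
n_windows = sum(1 for _ in sliding_windows([0] * 29903))
print(n_windows)
```

With these settings only complete windows are evaluated, so the last few positions of a sequence whose length is not a multiple of the step are not covered.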
Fig. 2.
Behavior of the AK Rule scaling exponent (red line) along the genome domain for three different sliding window sizes: (a) 300, (b) 400 and (c) 500 bp. The gray band and the blue line denote, respectively, the 90% CI and the mean value of the scaling exponent computed from 10,000 shuffled sequences. The vertical lines denote the boundaries of the different functional regions of the genome sequence.
Fig. 3 presents the variation of the scaling exponent for the four different mapping rules. As already shown in Fig. 2, the scaling exponent exhibits important variations along the genome position. However, the variations and the correlation structure depend on the mapping rule. The scaling exponent for the AK Rule (Fig. 3.a) fluctuates between the 90% CI randomness band and the region corresponding to long-range correlations (i.e., above the randomness band). The large peak at about 5620–6350 bp reflects significant correlations in the region ascribed to the non-structural protein nsp3 (2720–8554 bp in Table 1). Also remarkable is the peak at about 23700–24700 bp, which is in the region responsible for the spike glycoprotein. The PP Rule exhibited a different picture (Fig. 3.b), with most of the peaks outside the 90% CI randomness band lying in the anti-correlation region (i.e., below the randomness band). A prominent anti-correlation peak was displayed at about 8890–9620 bp, which is linked to the non-structural protein nsp4. The pattern exhibited by the HB Rule is quite interesting (Fig. 3.c). In the first third of the genome sequence, this scaling exponent indicates the presence of long-range anti-correlated patterns. This region is linked to several non-structural proteins (nsp2–nsp9) as well as to the 3C-like proteinase [24]. The behavior of the scaling exponent for higher bp positions showed a pattern of alternating randomness and anti-correlation, which can be ascribed to the different functional regions. In particular, the anti-correlation peak at 23500–25000 bp indicates that the hydrogen bonding structure plays an important role in the formation of the surface glycoprotein. The results for the HP Rule are presented in Fig. 3.d, showing that randomness is dominant in the fluctuations of the scaling exponent. The wide peak at about 23150–24300 bp denotes anti-correlations in the surface glycoprotein region. Overall, the results presented in Fig. 3 show that the long-range correlation pattern of the genome sequence depends on the rule used to map the nucleotide sequence. In this way, every mapping rule reveals the presence of correlations/anti-correlations that can be interpreted in terms of the characteristics (e.g., bonding energy) of the molecules involved in the genome structure. Table 1 summarizes the findings of Fig. 3 by denoting the type of scaling pattern (randomness, correlation and anti-correlation) linked to every functional region of the SARS-CoV-2 genome sequence. Except for the initial part linked to the leader protein, all regions can be ascribed to at least one type of scaling pattern. For instance, the helicase region (16237–18039 bp) exhibited correlations with the AK Rule and anti-correlation for the other three rules. Interestingly, the surface glycoprotein region (21553–25384 bp) exhibited the same pattern.
Fig. 3.
Behavior of the scaling exponent for a sliding window size of 400 bp. (a) AK Rule, (b) PP Rule, (c) HB Rule, and (d) HP Rule. The gray band and the blue line denote, respectively, the 90% CI and the mean value of the scaling exponent computed from 10,000 shuffled sequences. The vertical lines denote the boundaries of the different functional regions of the genome sequence.
Table 1.
Genome organization of SARS-CoV-2 according to the detrended fluctuation analysis (DFA).
| Region | Function | AK Rule | PP Rule | HB Rule | HP Rule |
|---|---|---|---|---|---|
| 1–265 | 5′ UTR | R | R | R | R |
| 266–805 | Leader Protein | R | R | A | R |
| 806–2719 | nsp2 | R | R | A | R |
| 2720–8554 | nsp3 | C | R | A | C |
| 8555–10054 | nsp4 | C | C,A | A | R |
| 10055–10972 | 3C-like Proteinase | R | A | A | A |
| 10973–11842 | nsp6 | R | A | A | R |
| 11843–12091 | nsp7 | R | A | A | R |
| 12092–12685 | nsp8 | R | C | A | R |
| 12686–13024 | nsp9 | R | R | A | A |
| 13025–13441 | nsp10 | C | A | A | R |
| 13442–16236 | RNA Polymerase | C | A | A | R |
| 16237–18039 | Helicase | C | A | A | A |
| 18040–19620 | EndoRNAse | R | A | A | R |
| 19621–20658 | 3′-to-5′ Exonuclease | C | A | A | A |
| 20659–21552 | 2′-O-Ribose Methyltransferase | R | C | R | A |
| 21553–25384 | Gene: S (Surface Glycoprotein) | C | A | A | A |
| 25385–26620 | Gene: ORF3a | C | R | A | C |
| 26621–26472 | Gene: E | R | R | A | R |
| 26523–27191 | Gene: M | R | R | A | R |
| 27202–27387 | Gene: ORF6 | R | R | R | A |
| 27394–27759 | Gene: ORF7a | R | C | A | R |
| 27756–27887 | Gene: ORF7b | R | R | A | R |
| 27894–28259 | Gene: ORF8 | C | R | A | R |
| 28274–29533 | Gene: N | R | R | A | A |
| 29558–29674 | Gene: ORF10 | R | R | A | R |
R: random, A: anti-correlated, C: correlated.
5.3. SARS-CoV-1 versus SARS-CoV-2
SARS-CoV-1 was detected in November 2002 in Guangdong, China. An interesting question is the similarity of genome sequences between SARS-CoV-1 and SARS-CoV-2. Fig. 4 compares the variation of the scaling exponent for the two genome sequences and the four different mapping rules. Important differences are exhibited for all cases, indicating that the scaling characteristics of the genome sequences of the two viruses are quite different. The most visible differences are found in the regions 8555–10054 (nsp4) and 21553–25384 (surface glycoprotein). The Pearson's correlation coefficient between SARS-CoV-1 and SARS-CoV-2 for the region 8555–10054 was 0.79, 0.77, 0.75 and 0.77 (p < 0.05) for the AK, PP, HB and HP mapping rules, respectively. Similarly, the Pearson's correlation coefficient was 0.84, 0.81, 0.84 and 0.85 (p < 0.05) for the region 21553–25384. The correlation is sufficiently strong to suggest that the genome sequence of SARS-CoV-2 shares some similarities with the genome sequence of SARS-CoV-1. However, as detected by the fractal scaling analysis, the imperfect correlation indicates that certain differences exist between the two genome sequences. Fig. 4.a for the AK Rule shows an increase of the correlation for SARS-CoV-2 (red line) relative to SARS-CoV-1 (blue line) in the nsp4 region. Also, the correlation peak in the surface glycoprotein region shifted to the left. The strength of the hydrogen bonds also exhibits important changes from SARS-CoV-1 to SARS-CoV-2 (Fig. 4.c). The large peak located at about 6350 bp for SARS-CoV-1 decreases its intensity and shifts to the right to about 7750 bp. To obtain a more quantitative comparison between the scaling structures, let us consider a degree of randomness (DR) of a genome sequence as the number of bp compatible with a random (shuffled) sequence relative to the whole sequence. That is,
$$\mathrm{DR} = \frac{\text{number of random bp}}{\text{total number of bp}} \times 100\% \qquad (4)$$
where a bp is considered random if the corresponding scaling exponent lies within the 90% CI randomness band. Table 2 presents the values of the DR for SARS-CoV-1 and different SARS-CoV-2 genome sequences. For both cases, the lowest degree of randomness is exhibited by the HB Rule, which suggests that the genomic sequence is primarily organized around the hydrogen bond structure. In contrast, the highest degree of randomness is exhibited by the AK Rule, with values of 65–75%. This means that the AK Rule based on the amino-keto sequencing provides a relatively low level of information on the organization of the SARS-CoV genome sequences. In terms of the two genome sequences, the results in Table 2 indicated that, except for the HP Rule, SARS-CoV-2 shows a reduced level of randomness relative to SARS-CoV-1. That is, the genome sequence of SARS-CoV-2 has a less random organization than the genome sequence of SARS-CoV-1 in terms of the amino-keto, purine-pyrimidine and hydrogen bond sequencing. In contrast, SARS-CoV-2 exhibits a higher degree of randomness relative to the hydrophobicity sequencing. Overall, the results in Fig. 4 and Table 2 indicate that the genome sequences of SARS-CoV-1 and SARS-CoV-2 exhibit important differences in the long-range order (i.e., scaling structure) of the nucleotide bases. Results in this line were reported by Ghanchi et al. [4], who found that the earlier genome sequences isolated before June 2020 exhibit a higher mean and site-specific entropy compared with that of later isolates. This suggests that the viral genome achieved stability after the initial early period of the pandemic.
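The degree of randomness of Eq. (4) reduces to counting how many positions have a scaling exponent inside the randomness band. The sketch below assumes one exponent and one band per sliding-window position; the function name and the toy numbers are hypothetical and are not taken from Table 2.

```python
def degree_of_randomness(exponents, lower, upper):
    """Eq. (4): percentage of positions whose exponent lies within the band."""
    inside = sum(
        1 for alpha, lo, hi in zip(exponents, lower, upper) if lo <= alpha <= hi
    )
    return 100.0 * inside / len(exponents)


# Toy values (hypothetical): four windows sharing the same randomness band.
exponents = [0.50, 0.62, 0.48, 0.35]
lower = [0.42] * 4
upper = [0.58] * 4
dr = degree_of_randomness(exponents, lower, upper)
print(f"DR = {dr:.1f}%")
```

In this toy case two of the four windows fall inside the band, so the DR is 50%. A lower DR indicates that a larger fraction of the sequence deviates from the shuffled reference.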
Fig. 4.
Comparison of the scaling exponent fluctuations between the SARS-CoV-1 (blue line) and SARS-CoV-2 (red line, Wuhan Hu-1) genome sequences for (a) AK Rule, (b) PP Rule, (c) HB Rule, and (d) HP Rule. The gray band denotes the 90% CI of the scaling exponent computed from 10,000 shuffled sequences. The vertical lines denote the boundaries of the different functional regions of the genome sequence. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 2.
Degree of randomness (DR) of different SARS-CoV genome sequences computed by DFA and entropy methods.
| Method | Rule | SARS-CoV-1 | SARS-CoV-2 Wuhan | SARS-CoV-2 India | SARS-CoV-2 Brazil | SARS-CoV-2 USA | SARS-CoV-2 S-Korea | SARS-CoV-2 Italy |
|---|---|---|---|---|---|---|---|---|
| DFA | AK | 74.75 | 64.71 | 62.98 | 64.71 | 64.36 | 64.12 | 62.12 |
| DFA | PP | 69.21 | 66.79 | 66.09 | 66.44 | 66.78 | 66.26 | 66.43 |
| DFA | HB | 41.87 | 37.72 | 39.45 | 38.07 | 37.72 | 38.23 | 40.32 |
| DFA | HP | 68.17 | 73.71 | 75.78 | 73.71 | 74.05 | 74.32 | 75.93 |
| Entropy | AK | 80.95 | 74.82 | 74.14 | 74.16 | 74.14 | 74.15 | 74.82 |
| Entropy | PP | 79.59 | 84.35 | 87.75 | 85.03 | 85.03 | 85.03 | 84.35 |
| Entropy | HB | 70.06 | 74.82 | 73.10 | 75.51 | 75.51 | 75.51 | 75.51 |
| Entropy | HP | 85.71 | 84.68 | 85.72 | 84.35 | 84.35 | 84.36 | 85.43 |
The results in Fig. 4 showed important differences between the genome sequences of SARS-CoV-1 and SARS-CoV-2. In particular, the genome sequence of SARS-CoV-2 appeared to be more locally correlated than the genome sequence of SARS-CoV-1. The entropy of the genome sequences was computed to contrast the findings obtained with the DFA [25]. For such computations, the mapped genome sequence is considered as a sequence of letters. For instance, the PP mapping rule produced a sequence of the two ($m = 2$) letters +1 and −1. These letters can form $m^L$ words of length $L$. Let $W_L$ be the set of words of length $L$. For a given sequence, one can compute the normalized frequency $p_j$ of the occurrence of the j-th word $w_j \in W_L$, such that $\sum_j p_j = 1$. The entropy for the set of words of length $L$ is computed with the Shannon entropy formula
$$H_L = -\sum_{j=1}^{m^L} p_j \log p_j \qquad (5)$$
High (resp., low) entropy values indicate a homogeneous (resp., heterogeneous) occurrence of the words of the set $W_L$. That is, for high entropy values a given word is as likely to occur as the other words of the set. The maximum entropy is obtained when $p_j = 1/m^L$ for all $j$ and is equal to $\log m^L = L \log m$. In this way, the above formula can be normalized to obtain the normalized entropy
$$\bar{H}_L = \frac{H_L}{L \log m} \qquad (6)$$
In this way, the range of $\bar{H}_L$ is [0,1]. The entropy computations were carried out for words of length $L = 3$ over a sliding window of 100 nucleotides. Triplets play an important role in the structure of genome sequences since these words have been linked to important diseases [26] and advantageous mutations in evolution [27]. As done for the DFA, the randomness of the sequence in the sliding window was tested using the iso-distributional surrogate data test [17] described in Section 2.1. Fig. 5 presents the variation of the normalized entropy through the SARS-CoV-1 and SARS-CoV-2 genome sequences. The entropy is contained within the randomness band for most of the sequence, indicating that the genomic mosaic is highly disorganized in most regions. However, important deviations from randomness can be observed in the nsp4 and surface glycoprotein regions, mainly for the AK and HB mapping rules (Fig. 5.a and 5.c). These important regions exhibited domains with marked nucleotide organization. The HP Rule was not sufficiently sensitive to variations of the entropy (Fig. 5.d), whereas the PP Rule showed marked deviations from randomness in the region of about 3000 bp. In general, the entropy results in Fig. 5 agree well with the results obtained with the DFA (Fig. 4) in the sense that the genome sequence exhibits alternating regions of disordered and ordered nucleotide arrangements. The information obtained with one method should be complemented with that obtained by the other method. The DFA provides insights on the serial organization while the entropy reveals the diversity of patterns involved in such serial organization [28].
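The normalized word entropy of Eqs. (5) and (6) can be sketched as follows for a mapped sequence inside one window. The function name is hypothetical, and the use of overlapping words (a word starting at every position) is an assumption, since the text does not state whether words overlap. The alphabet size m is passed explicitly (2 for the binary rules, 4 for the HP Rule).

```python
import math
import random
from collections import Counter


def normalized_word_entropy(symbols, alphabet_size, word_length=3):
    """Shannon entropy of words of `word_length`, divided by L * log(m).

    Words are taken at every position (overlapping), an assumption here.
    """
    words = [
        tuple(symbols[i:i + word_length])
        for i in range(len(symbols) - word_length + 1)
    ]
    total = len(words)
    counts = Counter(words)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / (word_length * math.log(alphabet_size))


# A constant sequence has a single word, hence zero entropy.
constant = normalized_word_entropy([1] * 100, alphabet_size=2)

# A strictly alternating sequence uses 2 of the 8 possible triplets.
alternating = normalized_word_entropy([1, -1] * 50, alphabet_size=2)

# A shuffled-like random sequence uses all triplets almost uniformly.
rng = random.Random(1)
noisy = normalized_word_entropy(
    [rng.choice((-1, 1)) for _ in range(1000)], alphabet_size=2
)
print(abs(constant), round(alternating, 4), round(noisy, 3))
```

The alternating example gives log 2 / (3 log 2) = 1/3, which shows how an ordered arrangement lowers the normalized entropy well below the value obtained for a random arrangement.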
Fig. 5.
Comparison of the entropy of triplets between the Wuhan Hu-1 (red line) and the isolate MT066156 (Italy, blue line) genome sequences for (a) AK Rule, (b) PP Rule, (c) HB Rule, and (d) HP Rule. The arrows indicate the location of some reported mutations. The gray band denotes the 90% CI of the scaling exponent computed from 10,000 shuffled sequences. The vertical lines denote the boundaries of the different functional regions of the genome sequence. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The lower section of Table 2 presents the degree of randomness (DR) computed with Eq. (4) for the entropy computations shown in Fig. 5. SARS-CoV-2 exhibited an increased DR relative to SARS-CoV-1 for the PP and HB rules. In contrast, the DR decreased for the AK rule. The DR for the HP Rule was practically unchanged. The variation of the DR in Table 2 suggests that the transition from SARS-CoV-1 to SARS-CoV-2 involved marked increases in the organization of the purine-pyrimidine and hydrogen-bond energy mosaics and an increase in the diversity of the amino-keto patterns, mainly in the nsp4 and surface glycoprotein regions. In this regard, the hydrogen bonding capacity of the spike protein has been the main objective of drug developments to reduce the negative impact of SARS-CoV-2 infections [29].
5.4. Early SARS-CoV-2 variants
The first case of SARS-CoV-2 was detected in December 2019 in Wuhan, China. The genome sequence of the virus was first reported on January 5, 2020 [1]. The uncontained worldwide spread of the virus led to its intensive replication. As part of the strategy to track the dynamics of SARS-CoV-2, the evolution of the genome sequence was monitored and the results were reported in publicly available sites (e.g., www.ncbi.nlm.nih.gov). The WHO soon became aware of the different variations of the genome sequence, which could threaten the control of the infection. Variants are characterized by a set of mutations in the genome sequence that can enhance viral fitness. Mutations involve different substitutions of nucleotide bases, which in principle should modify the scaling characteristics of the genome sequence. The degree of randomness for sequences isolated in five different countries (see Section 4) in the first months of 2020 is shown in Table 2. Isolates from USA, South Korea and Brazil showed only slight differences with the reference Wuhan Hu-1. In contrast, isolates MT012098 (India) and MT066156 (Italy) exhibited important deviations from Wuhan Hu-1. In such cases, the DR of the amino-keto structure decreased, whereas the DR of the purine-pyrimidine structure was unchanged. In turn, the DR of the hydrogen-bond and hydrophobicity structures showed a slight increase. It should be recalled that the infection was particularly strong in the Lombardy region of Italy, where the isolate was obtained. The stronger changes of the scaling structure equipped the Italy variant with an improved fitness strategy to prevail over the original and other variants. A more detailed analysis of the scaling differences between the Wuhan Hu-1 and the isolate MT066156 can be carried out by looking at the fluctuations of the scaling exponent. Similar results were obtained for the isolate MT012098 (India). Fig. 6 presents the results for a sliding window size of 300 observations, which provides a more detailed view of the scaling exponent fluctuations. In general, the scaling exponents of the Wuhan Hu-1 (red line) and the isolate MT066156 (blue line) coincide over the whole range. However, some pointwise differences can be observed, and some of them are indicated by arrows. Such differences are linked to increases in the distance from randomness, and hence to increases in the local ordering of the genome sequence. The large peak at 6115 bp for the AK Rule (Fig. 6.a) is located in the nsp4 region. Increased correlations at about 23,450 and 24,870 bp are located in the surface glycoprotein region and can be ascribed to the mutations P681R and D950N, respectively [30]. It has been postulated that the P681R mutation enhanced the cleavage of the full-length spike into the S1 and S2 subunits, leading to increased infection via cell surface entry [30]. Besides, mutations in the surface glycoprotein have been linked to increases in hospitalizations with severe outcomes [31]. The peak at about 3000 bp for the PP Rule (Fig. 6.b) reflects a change of the correlation structure in the nsp4 region. On the other hand, the large peak at about 21,500 bp can be attributed to the mutation T19R [30], which is linked to the formation of the surface glycoprotein. The large long-range anti-correlation peak at about 23,650 bp for the HB Rule (Fig. 6.c) is also located in the surface glycoprotein region and can also be related to the P681R mutation. The increase of anti-correlated behavior was also reflected at about 23,800 bp for the HP Rule (Fig. 6.d).
Fig. 6.
Comparison of the scaling exponent fluctuations between the Wuhan Hu-1 (red line) and the isolate MT066156 (Italy, blue line) genome sequences for (a) AK Rule, (b) PP Rule, (c) HB Rule, and (d) HP Rule. The arrows indicate the location of some reported mutations. The gray band denotes the 90% CI of the scaling exponent computed from 10,000 shuffled sequences. The vertical lines denote the boundaries of the different functional regions of the genome sequence. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
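For illustration, the sliding-window scaling analysis with a shuffled-sequence randomness band can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: it uses first-order (linear) detrending, an assumed set of box sizes, and illustrative ±1 conventions for the amino-keto, purine-pyrimidine and hydrogen-bond rules. The hydrophobicity rule is omitted because its numerical values are not given in this section, and all function names are hypothetical.

```python
import math
import random

# Assumed binary mappings; which class gets +1 is an illustrative convention.
AK_RULE = {"A": 1, "C": 1, "G": -1, "T": -1}   # amino vs keto
PP_RULE = {"A": 1, "G": 1, "C": -1, "T": -1}   # purine vs pyrimidine
HB_RULE = {"A": 1, "T": 1, "C": -1, "G": -1}   # weak vs strong hydrogen bond

def _line_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def dfa_exponent(x, scales=(4, 8, 16, 32, 64)):
    """First-order DFA: slope of log F(n) versus log n."""
    mean = sum(x) / len(x)
    profile, acc = [], 0.0
    for v in x:                       # integrate the mean-centred series
        acc += v - mean
        profile.append(acc)
    log_n, log_f = [], []
    for n in scales:
        n_boxes = len(profile) // n
        if n_boxes < 1:
            continue
        t = list(range(n))
        sq = 0.0
        for b in range(n_boxes):      # remove a linear trend in each box
            seg = profile[b * n:(b + 1) * n]
            slope, icpt = _line_fit(t, seg)
            sq += sum((y - (slope * i + icpt)) ** 2 for i, y in zip(t, seg))
        f = math.sqrt(sq / (n_boxes * n))
        if f > 0.0:
            log_n.append(math.log(n))
            log_f.append(math.log(f))
    if len(log_n) < 2:
        raise ValueError("need at least two usable scales")
    return _line_fit(log_n, log_f)[0]

def sliding_dfa(sequence, rule, window=300, step=50):
    """(start position, scaling exponent) for each window."""
    mapped = [rule[base] for base in sequence]
    return [(s, dfa_exponent(mapped[s:s + window]))
            for s in range(0, len(mapped) - window + 1, step)]

def shuffled_band(x, n_shuffles=200, level=0.90, seed=0):
    """Empirical band of the exponent over shuffled copies of the series."""
    rng = random.Random(seed)
    work = list(x)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(work)
        values.append(dfa_exponent(work))
    values.sort()
    tail = (1.0 - level) / 2.0
    return (values[int(tail * (n_shuffles - 1))],
            values[int(round((1.0 - tail) * (n_shuffles - 1)))])
```

With these definitions, exponents above the band would be read as correlated behavior and exponents below it as anti-correlated behavior. The short box sizes assumed here bias the exponent of an uncorrelated series slightly above 0.5, which is one reason to compare against a shuffled band rather than against the nominal value.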
5.5. Discussion
The aforementioned results showed that the SARS-CoV-2 genome sequence has a complex structure, with the correlation structure depending on the rule used for mapping the sequence into a numeric series. A correlation structure can be ascribed to every functional region, which reflects a certain ordering in the mosaic organization of the nucleotide arrangement. Of particular interest is the region 21553–25384 bp responsible for the surface glycoprotein. The spike glycoprotein is located on the outside of the virus and gives coronaviruses their crown-like appearance. This glycoprotein plays an important role in the attachment of the virus particle and its entry into the host cell, and as such is an important target for vaccine development, antibody therapies and diagnostic antigen-based tests [32]. The results in Table 1 showed that the region 21553–25384 bp exhibits a marked scaling structure over the four different mapping rules, such that the correlation structure is linked to the amino-keto, purine-pyrimidine, hydrogen-bond and hydrophobicity sequencing. Besides, early variants, like those isolated in India and Italy, exhibit increased levels of correlation and ordering in the surface glycoprotein region. In this way, the results described above indicate that an accurate characterization of the structure and function of the surface glycoprotein region should provide valuable insights to develop efficient treatments and vaccines [33], [34]. In particular, mutations in the surface glycoprotein region have been explored as the drivers of enhanced transmissibility and infectivity of the Delta variant [35], isolated as early as October 2020, which suggests that variants with mutations that are distinctive of the Delta variant were already present in the early months of 2020. The issue is of prime importance since deeper mutations in the surface glycoprotein region might lead to variants (e.g., Delta plus) with greater binding capacity and escape from antibody action [36], besides increased pathogenicity and mortality [37]. In this way, several studies have focused on characterizing and understanding the structure of the surface glycoprotein region and the effects of different mutations on such structure [35].
In terms of the approach used in our work, an interesting question is how the mutations of the early-2020 isolates MT012098 (India) and MT066156 (Italy) modified the organization of the surface glycoprotein region [38]. To address this point, the scaling exponent was computed over the surface glycoprotein region 21553–25384 bp. The sliding window size was set to 150 bp to obtain a detailed view of the fluctuations of the scaling exponent. The results for the four different mapping rules are exhibited in Fig. 7, where the scaling exponents of the Wuhan Hu-1 and the MT066156 isolate genome sequences are compared. Similar results were obtained for the isolate MT012098. In general, the scaling exponents of the two genome sequences have a similar pattern, with several peak deviations from the random behavior. Small differences are noted, which might be induced by the different mutations. Interestingly, the scaling exponent of the MT066156 isolate (blue line) exhibits increased deviations from randomness at certain peak locations indicated by arrows. Such an increase of the distance from randomness indicates that the isolates MT066156 and MT012098 have a more ordered genome sequence than the reference sequence Wuhan Hu-1, and the increased ordering is manifested in terms of amino-keto, purine-pyrimidine, weak-strong hydrogen bonding and hydrophobicity. It is apparent that the increased transmissibility documented in the Lombardy region in the early months of 2020 was linked to modifications of the surface glycoprotein region towards a more ordered organization, which would lead to fewer replication errors. Some peaks of the scaling exponent of the isolates MT066156 and MT012098 coincided with reported mutations in the spike region of the Delta variant [39], [40]. For instance, the spike L452R mutation has been linked to increased infectivity and potential cellular immunity evasion [41]. Also, it has been postulated that the spike mutation P681R is responsible for the advantages of the Delta variant in binding human lung epithelial cells and primary human airway tissues [42]. The possible mutation L452R was reflected in variations of the scaling structure of the amino-keto (Fig. 7.a) and purine-pyrimidine (Fig. 7.b) structures, whereas the mutation P681R was detected in the hydrogen-bond structure (Fig. 7.c). In this way, the variations of the scaling exponent of the different variants relative to the reference Wuhan Hu-1 sequence can be used as a guideline to explore the presence of mutations of interest [43].
Fig. 7.
Comparison of the scaling exponent fluctuations between the Wuhan Hu-1 (red line) and the isolate MT066156 (Italy, blue line) genome sequences in the surface glycoprotein region for (a) AK Rule, (b) PP Rule, (c) HB Rule, and (d) HP Rule. The arrows indicate the location of some reported mutations. The gray band denotes the 90% CI of the scaling exponent computed from 10,000 shuffled sequences. The vertical lines denote the boundaries of the different functional regions of the genome sequence. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
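The use of the scaling exponent as a guideline for locating candidate mutation sites can be illustrated with a short Python sketch. It assumes that the sliding-window exponents of the reference and of an isolate, together with the limits of the randomness band, have already been computed at the same window positions. The function name, the use of the band midpoint as the reference for randomness, and the `min_gain` threshold are hypothetical choices for the example and are not part of the method reported here.

```python
def flag_increased_order(positions, ref_alpha, iso_alpha, band, min_gain=0.02):
    """Positions where the isolate leaves the randomness band and lies
    farther from the band midpoint than the reference does."""
    lo, hi = band
    center = 0.5 * (lo + hi)          # assumed proxy for random behavior
    flagged = []
    for pos, a_ref, a_iso in zip(positions, ref_alpha, iso_alpha):
        outside = a_iso < lo or a_iso > hi
        gain = abs(a_iso - center) - abs(a_ref - center)
        if outside and gain >= min_gain:
            flagged.append((pos, gain))
    return flagged


# Toy values for three window positions (not data from this study).
hits = flag_increased_order(
    positions=[0, 150, 300],
    ref_alpha=[0.50, 0.62, 0.40],
    iso_alpha=[0.51, 0.70, 0.30],
    band=(0.42, 0.58),
)
```

In the toy example the first window is not flagged because the isolate exponent stays inside the band, whereas the other two are flagged. Flagged positions are only candidates: a deviation of the scaling exponent indicates a local change of sequence organization and would still need to be checked against the aligned sequences.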
6. Conclusions
This work used detrended fluctuation analysis (DFA) to study the mosaic organization of the SARS-CoV-2 genome sequence. The DFA was implemented on a sliding window scheme to visualize the fluctuations of the scaling exponent over the different genome sequence functional regions. The method was endowed with a bootstrap scheme to obtain a randomness confidence band from the statistics of many shuffled sequences. The sliding window scaling exponent showed that some regions of the genome sequence are likely random, whereas other regions exhibit serial correlated/anti-correlated behavior. The behavior of the scaling exponent depended on the rule used to map the nucleotide genome sequence into a numerical sequence. In this way, the scaling exponent provides valuable information on the amino-keto, purine-pyrimidine, hydrogen-bond and hydrophobic organization of the genome sequence. The Wuhan Hu-1 was used as the reference sequence to contrast the results of mutations and variants. The different SARS-CoV-2 variants exhibit a markedly more ordered sequence than the preceding SARS-CoV-1, which would imply that the genome sequence of SARS-CoV-2 has more algorithmically efficient surveillance mechanisms. That is, the more ordered genome sequence of SARS-CoV-2 would be more efficient for replication by reducing the replication errors. Most SARS-CoV-2 variants exhibited only slight modifications of the scaling exponent behavior as compared to the reference Wuhan Hu-1. However, some samples isolated in the early months of 2020 contain modifications that increase the sequence order. In particular, the surface glycoprotein (spike) region of the isolates MT012098 (India) and MT066156 (Italy) showed increased deviations from randomness at specific locations that can be linked to reported important mutations. It is apparent that the increase of the order in the spike region equipped some variants with a more informationally efficient procedure for replication, and hence for binding with the host cells. Overall, the results reported in this work show that the scaling analysis based on a sliding window DFA can provide valuable insights on the organization of the SARS-CoV-2 genome sequence, and the information obtained could be used to help the detection and monitoring of regions of interest in terms of mutations of concern.
CRediT authorship contribution statement
M. Meraz: Conceptualization, Investigation. E.J. Vernon-Carter: Conceptualization, Investigation. E. Rodriguez: Visualization, Investigation, Writing – review & editing. J. Alvarez-Ramirez: Writing – original draft.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- 1. Wu F., Zhao S., Yu B., Chen Y.-M., Wang W., Song Z.-G., Hu Y., Tao Z.-W., Tian J.-H., Pei Y.-Y., Yuan M.-L., Zhang Y.-L., Dai F.-H., Liu Y., Wang Q.-M., Zheng J.-J., Xu L., Holmes E.C., Zhang Y.-Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579(7798):265–269. doi: 10.1038/s41586-020-2008-3.
- 2. Khailany R.A., Safdar M., Ozaslan M. Genomic characterization of a novel SARS-CoV-2. Gene Rep. 2020;19:100682. doi: 10.1016/j.genrep.2020.100682.
- 3. Harvey W.T., Carabelli A.M., Jackson B., Gupta R.K., Thomson E.C., Harrison E.M., Ludden C., Reeve R., Rambaut A., Peacock S.J., Robertson D.L. SARS-CoV-2 variants, spike mutations and immune escape. Nat. Rev. Microbiol. 2021;19(7):409–424. doi: 10.1038/s41579-021-00573-0.
- 4. Ghanchi N.K., Nasir A., Masood K.I., Abidi S.H., Mahmood S.F., Kanji A., Razzak S., Khan W., Shahid S., Yameen M., Raza A., Ashraf J., Ansar Z., Dharejo M.B., Islam N., Hasan Z., Hasan R., Atta S. Higher entropy observed in SARS-CoV-2 genomes from the first COVID-19 wave in Pakistan. PloS One. 2021;16(8):e0256451. doi: 10.1371/journal.pone.0256451.
- 5. Nawaz M.S., Fournier-Viger P., Niu X., Wu Y., Lin J.C.W. COVID-19 genome analysis using alignment-free methods. Int. Conf. Ind., Eng. Other Appl. Appl. Intelligent Syst. 2021;2021:316–328. doi: 10.1007/978-3-030-79457-6_28.
- 6. Namazi H. Complexity-based classification of the coronavirus genome versus genomes of the human immunodeficiency virus (HIV) and dengue virus. Fractals. 2020;28(07):2050129. doi: 10.1142/S0218348X20501297.
- 7. de Salazar e Fernandes T., de Oliveira Filho J.S., da Silva Lopes I.M.S. Fractal signature of coronaviruses related to severe acute respiratory syndrome. Res. Biomed. Eng. 2020. doi: 10.1007/s42600-020-00069-5.
- 8. Peng C.K., Buldyrev S.V., Goldberger A.L., Havlin S., Mantegna R.N., Simons M., Stanley H.E. Statistical properties of DNA sequences. Physica A. 1995;221(1–3):180–192. doi: 10.1016/0378-4371(95)00247-5.
- 9. Peng C.-K., Buldyrev S.V., Havlin S., Simons M., Stanley H.E., Goldberger A.L. Mosaic organization of DNA nucleotides. Phys. Rev. E. 1994;49(2):1685–1689. doi: 10.1103/PhysRevE.49.1685.
- 10. Li J., Zhang X., Tang J. Noise suppression for magnetotelluric using variational mode decomposition and detrended fluctuation analysis. J. Appl. Geophys. 2020;180:104127. doi: 10.1016/j.jappgeo.2020.104127.
- 11. Shrestha K. Multifractal detrended fluctuation analysis of return on Bitcoin. Int. Rev. Finance. 2021;21(1):312–323. doi: 10.1111/irfi.12256.
- 12. Mesquita V.B., Oliveira Filho F.M., Rodrigues P.C. Detection of crossover points in detrended fluctuation analysis: an application to EEG signals of patients with epilepsy. Bioinformatics. 2021;37(9):1278–1284. doi: 10.1093/bioinformatics/btaa955.
- 13. Ravi D.K., Marmelat V., Taylor W.R., Newell K.M., Stergiou N., Singh N.B. Assessing the temporal organization of walking variability: a systematic review and consensus guidelines on detrended fluctuation analysis. Front. Physiol. 2020;11:562. doi: 10.3389/fphys.2020.00562.
- 14. Rafique M., Iqbal J., Lone K.J., Kearfott K.J., Rahman S.U., Hussain L. Multifractal detrended fluctuation analysis of soil radon (222Rn) and thoron (220Rn) time series. J. Radioanal. Nucl. Chem. 2021;328(1):425–434.
- 15. Zenteno-Catemaxca R., Moguel-Castañeda J.G., Rivera V.M., Puebla H., Hernandez-Martinez E. Monitoring a chemical reaction using pH measurements: an approach based on multiscale fractal analysis. Chaos, Solitons Fractals. 2021;152:111336. doi: 10.1016/j.chaos.2021.111336.
- 16. Lo A.W., MacKinlay A.C. A Non-random Walk Down Wall Street. Princeton University Press; NJ: 1999.
- 17. Theiler J., Eubank S., Longtin A., Galdrikian B., Farmer J.D. Testing for nonlinearity in time series: the method of surrogate data. Physica D. 1992;58(1–4):77–94. doi: 10.1016/0167-2789(92)90102-S.
- 18. Buldyrev S.V., Goldberger A.L., Havlin S., Mantegna R.N., Matsa M.E., Peng C.K., Simons M., Stanley H.E. Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis. Phys. Rev. E. 1995;51(5):5084–5091. doi: 10.1103/PhysRevE.51.5084.
- 19. Shih P., Pedersen L.G., Gibbs P.R., Wolfenden R. Hydrophobicities of the nucleic acid bases: distribution coefficients from water to cyclohexane. J. Mol. Biol. 1998;280(3):421–430. doi: 10.1006/jmbi.1998.1880.
- 20. He R., Dobie F., Ballantine M., Leeson A., Li Y., Bastien N., Cutts T., Andonov A., Cao J., Booth T.F., Plummer F.A., Tyler S., Baker L., Li X. Analysis of multimerization of the SARS coronavirus nucleocapsid protein. Biochem. Biophys. Res. Commun. 2004;316(2):476–483. doi: 10.1016/j.bbrc.2004.02.074.
- 21. Echeverría J.C., Woolfson M.S., Crowe J.A., Hayes-Gill B.R., Croaker G.D.H., Vyas H. Interpretation of heart rate variability via detrended fluctuation analysis and αβ filter. Chaos. 2003;13(2):467–475. doi: 10.1063/1.1562051.
- 22. Buldyrev S.V., Dokholyan N.V., Goldberger A.L., Havlin S., Peng C.K., Stanley H.E., Viswanathan G.M. Analysis of DNA sequences using methods of statistical physics. Physica A. 1998;249(1–4):430–438. doi: 10.1016/S0378-4371(97)00503-7.
- 23. Carpena P., Bernaola-Galván B., Coronado A.V., Hackenberg M., Oliver J.L. Identifying characteristic scales in the human genome. Phys. Rev. E. 2007;75(3). doi: 10.1103/PhysRevE.75.032903.
- 24. Kim D., Lee J.Y., Yang J.S., Kim J.W., Kim V.N., Chang H. The architecture of SARS-CoV-2 transcriptome. Cell. 2020;181(4):914–921. doi: 10.1016/j.cell.2020.04.011.
- 25. Koslicki D. Topological entropy of DNA sequences. Bioinformatics. 2011;27(8):1061–1067. doi: 10.1093/bioinformatics/btr077.
- 26. Subramanian S., Madgula V.M., George R., Mishra R.K., Pandit M.W., Kumar C.S., Singh L. Triplet repeats in human genome: distribution and their association with genes and other genomic regions. Bioinformatics. 2003;19(5):549–552. doi: 10.1093/bioinformatics/btg029.
- 27. Kashi Y., King D.G. Simple sequence repeats as advantageous mutators in evolution. Trends Genet. 2006;22(5):253–259. doi: 10.1016/j.tig.2006.03.005.
- 28. Kvalseth T.O. Entropy and correlation: some comments. IEEE Trans. Syst., Man, Cybernet. 1987;17(3):517–519. doi: 10.1109/TSMC.1987.4309069.
- 29. Pandey P., Rane J.S., Chatterjee A., Kumar A., Khan R., Prakash A., Ray S. Targeting SARS-CoV-2 spike protein of COVID-19 with naturally occurring phytochemicals: an in silico study for drug development. J. Biomol. Struct. Dyn. 2021;39(16):6306–6316. doi: 10.1080/07391102.2020.1796811.
- 30. Liu Y., Liu J., Johnson B.A., Xia H., Ku Z., Schindewolf C., Widen S.G., An Z., Weaver S.C., Menachery V.D., Xie X., Shi P.Y. Delta spike P681R mutation enhances SARS-CoV-2 fitness over Alpha variant. bioRxiv. 2021. doi: 10.1101/2021.08.12.456173.
- 31. Nagy Á., Pongor S., Győrffy B. Different mutations in SARS-CoV-2 associate with severe and mild outcome. Int. J. Antimicrob. Agents. 2021;57(2):106272. doi: 10.1016/j.ijantimicag.2020.106272.
- 32. Oladipo E.K., Ajayi A.F., Ariyo O.E., Onile S.O., Jimah E.M., Ezediuno L.O., Adebayo O.I., Adebayo E.T., Odeyemi A.N., Oyeleke M.O., Oyewole M.P., Oguntomi A.S., Akindiya O.E., Olamoyegun B.O., Aremu V.O., Arowosaye A.O., Aboderin D.O., Bello H.B., Senbadejo T.Y., Awoyelu E.H., Oladipo A.A., Oladipo B.B., Ajayi L.O., Majolagbe O.N., Oyawoye O.M., Oloke J.K. Exploration of surface glycoprotein to design multi-epitope vaccine for the prevention of Covid-19. Inf. Med. Unlocked. 2020;21:100438. doi: 10.1016/j.imu.2020.100438.
- 33. Chowdhury U.F., Shohan M.U.S., Hoque K.I., Beg M.A., Siam M.K.S., Moni M.A. A computational approach to design potential siRNA molecules as a prospective tool for silencing nucleocapsid phosphoprotein and surface glycoprotein gene of SARS-CoV-2. Genomics. 2021;113(1):331–343. doi: 10.1016/j.ygeno.2020.12.021.
- 34. Toor H.G., Banerjee D.I., Lipsa Rath S., Darji S.A. Computational drug re-purposing targeting the spike glycoprotein of SARS-CoV-2 as an effective strategy to neutralize COVID-19. Eur. J. Pharmacol. 2021;890:173720. doi: 10.1016/j.ejphar.2020.173720.
- 35. Kannan S.R., Spratt A.N., Cohen A.R., Naqvi S.H., Chand H.S., Quinn T.P., Lorson C.L., Byrareddy S.N., Singh K. Evolutionary analysis of the Delta and Delta Plus variants of the SARS-CoV-2 viruses. J. Autoimmunity. 2021;124:102715. doi: 10.1016/j.jaut.2021.102715.
- 36. Planas D., Veyer D., Baidaliuk A., Staropoli I., Guivel-Benhassine F., Rajah M.M., Planchais C., Porrot F., Robillard N., Puech J., Prot M., Gallais F., Gantner P., Velay A., Le Guen J., Kassis-Chikhani N., Edriss D., Belec L., Seve A., Courtellemont L., Péré H., Hocqueloux L., Fafi-Kremer S., Prazuck T., Mouquet H., Bruel T., Simon-Lorière E., Rey F.A., Schwartz O. Reduced sensitivity of SARS-CoV-2 variant Delta to antibody neutralization. Nature. 2021;596(7871):276–280. doi: 10.1038/s41586-021-03777-9.
- 37. Challen R., Brooks-Pollock E., Read J.M., Dyson L., Tsaneva-Atanasova K., Danon L. Risk of mortality in patients infected with SARS-CoV-2 variant of concern 202012/1: matched cohort study. BMJ. 2021;372:n579. doi: 10.1136/bmj.n579.
- 38. Schrörs B., Riesgo-Ferreiro P., Sorn P., Gudimella R., Bukur T., Rösler T., Löwer M., Sahin U., Khudyakov Y.E. Large-scale analysis of SARS-CoV-2 spike-glycoprotein mutants demonstrates the need for continuous screening of virus isolates. Plos One. 2021;16(9):e0249254. doi: 10.1371/journal.pone.0249254.
- 39. Deng X., Garcia-Knight M.A., Khalid M.M., Servellita V., Wang C., Morris M.K., Sotomayor-González A., Glasner D.R., Reyes K.R., Gliwa A.S., Reddy N.P., Sanchez San Martin C., Federman S., Cheng J., Balcerek J., Taylor J., Streithorst J.A., Miller S., Sreekumar B., Chen P.-Y., Schulze-Gahmen U., Taha T.Y., Hayashi J.M., Simoneau C.R., Kumar G.R., McMahon S., Lidsky P.V., Xiao Y., Hemarajata P., Green N.M., Espinosa A., Kath C., Haw M., Bell J., Hacker J.K., Hanson C., Wadford D.A., Anaya C., Ferguson D., Frankino P.A., Shivram H., Lareau L.F., Wyman S.K., Ott M., Andino R., Chiu C.Y. Transmission, infectivity, and neutralization of a spike L452R SARS-CoV-2 variant. Cell. 2021;184(13):3426–3437.e8. doi: 10.1016/j.cell.2021.04.025.
- 40. Liu C., Ginn H.M., Dejnirattisai W., Supasa P., Wang B., Tuekprakhon A., Nutalai R., Zhou D., Mentzer A.J., Zhao Y., Duyvesteyn H.M.E., López-Camacho C., Slon-Campos J., Walter T.S., Skelly D., Johnson S.A., Ritter T.G., Mason C., Costa Clemens S.A., Gomes Naveca F., Nascimento V., Nascimento F., Fernandes da Costa C., Resende P.C., Pauvolid-Correa A., Siqueira M.M., Dold C., Temperton N., Dong T., Pollard A.J., Knight J.C., Crook D., Lambe T., Clutterbuck E., Bibi S., Flaxman A., Bittaye M., Belij-Rammerstorfer S., Gilbert S.C., Malik T., Carroll M.W., Klenerman P., Barnes E., Dunachie S.J., Baillie V., Serafin N., Ditse Z., Da Silva K., Paterson N.G., Williams M.A., Hall D.R., Madhi S., Nunes M.C., Goulder P., Fry E.E., Mongkolsapaya J., Ren J., Stuart D.I., Screaton G.R. Reduced neutralization of SARS-CoV-2 B.1.617 by vaccine and convalescent serum. Cell. 2021;184(16):4220–4236.e13. doi: 10.1016/j.cell.2021.06.020.
- 41. Motozono C., Toyoda M., Zahradnik J., Saito A., Nasser H., Tan T.S., Ngare I., Kimura I., Uriu K., Kosugi Y., Yue Y., Shimizu R., Ito J., Torii S., Yonekawa A., Shimono N., Nagasaki Y., Minami R., Toya T., Sekiya N., Fukuhara T., Matsuura Y., Schreiber G., Ikeda T., Nakagawa S., Ueno T., Sato K. SARS-CoV-2 spike L452R variant evades cellular immunity and increases infectivity. Cell Host Microbe. 2021;29(7):1124–1136.e11. doi: 10.1016/j.chom.2021.06.006.
- 42. Li B., Deng A., Li K., Hu Y., Li Z., Xiong Q., Liu Z., Guo Q., Zou L., Zhang H., Zhang M., Ouyang F., Su J., Su W., Xu J., Lin H., Sun J., Peng J., Jiang H., Zhou P., Hu T., Luo M., Zhang Y., Zheng H., Xiao J., Liu T., Che R., Zeng H., Zheng Z., Huang Y., Yu J., Yi L., Wu J., Chen J., Zhong H., Deng X., Kang M., Pybus O.G., Hall M., Lythgoe K.A., Li Y., Yuan J., He J., Lu J. Viral infection and transmission in a large well-traced outbreak caused by the Delta SARS-CoV-2 variant. medRxiv. 2021. doi: 10.1101/2021.07.07.21260122.
- 43. van Dorp L., Houldcroft C.J., Richard D., Balloux F. COVID-19, the first pandemic in the post-genomic era. Curr. Opin. Virol. 2021;50:40–48. doi: 10.1016/j.coviro.2021.07.002.







