Abstract
With the identification of a novel coronavirus associated with the severe acute respiratory syndrome (SARS), computational analysis of its RNA genome sequence is expected to give useful clues to help elucidate the origin, evolution, and pathogenicity of the virus. In this paper, we study the collective counts of palindromes in the SARS genome along with all the completely sequenced coronaviruses. Based on a Markov-chain model for the genome sequence, the mean and standard deviation for the number of palindromes at or above a given length are derived. These theoretical results are complemented by extensive simulations to provide empirical estimates. Using a z score obtained from these mathematical and empirical means and standard deviations, we have observed that palindromes of length four are significantly underrepresented in all the coronaviruses in our data set. In contrast, length-six palindromes are significantly underrepresented only in the SARS coronavirus. Two other features are unique to the SARS sequence. First, there is a length-22 palindrome TCTTTAACAAGCTTGTTAAAGA spanning positions 25962–25983. Second, there are two repeating length-12 palindromes TTATAATTATAA spanning positions 22712–22723 and 22796–22807. Some further investigations into possible biological implications of these palindrome features are proposed.
Keywords: Markov chain, palindrome counts, simulation, RNA viral genome, severe acute respiratory syndrome
1. Introduction
In March 2003, a novel coronavirus associated with the severe acute respiratory syndrome (SARS) was identified. The outbreak of SARS in different parts of the world, causing hundreds of deaths, has initiated much international effort that includes clinical, epidemiologic, and laboratory investigations with the aim of controlling the spread of the virus (Bloom 2003, Marra et al. 2003, Ruan et al. 2003, Rota et al. 2003). Although the world was cleared of new SARS cases by July 2003, the pursuit of a thorough understanding of the origin, evolution, and pathogenicity of this deadly virus continues.
With the availability of the complete genome sequence of the SARS and several other coronaviruses in public databases (e.g., GenBank), it is possible to do a computational analysis of the viral genome, looking for unusual genome sequence features either unique to the SARS virus or common to the coronavirus family. Such information can give clues to the origin, natural reservoir, and evolution of the virus. It may contribute to the studies of the immune response to this virus and the pathogenesis of SARS-related disease (Rota et al. 2003).
Statistical and experimental studies of palindromes in the other classes of viral genomes, such as the double stranded DNA viruses, bacteriophages, retroviruses, etc., have been performed (Cain et al. 2001, Dirac et al. 2002, Hill et al. 2003, Karlin et al. 1992, Leung et al. 2002, Rocha et al. 2001, among others). These studies have suggested that palindromes might be involved in the viral packaging, replication, and defense mechanisms. Unlike these well-studied viruses involved in fatal diseases such as AIDS and various cancers, the coronaviruses have not received as much attention until the recent outbreak of SARS.
In the present study, we focus our attention on palindromes in the positive-stranded RNA genomes of coronaviruses. In accordance with GenBank convention, we represent an RNA sequence as a string of letters from the alphabet 𝒜 = {A, C, G, T}. The four letters respectively stand for the RNA bases adenine, cytosine, guanine, and uracil. The letters A and T are complementary to each other because adenine and uracil form hydrogen bonds with each other. The same applies to C and G. A palindrome is a symmetrical word such that when it is read in the reverse direction, it is exactly the complement of itself. For example, ACGT is a palindrome of length four. A palindrome is necessarily even in length because the middle base in any odd-length nucleotide string cannot be identical to its complement.
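The definition of a palindrome given here can be checked mechanically. The following is a minimal Python sketch, not code from the study; the function names `inversion` and `is_palindrome` are illustrative choices. It tests the example ACGT together with the two SARS palindromes quoted in the abstract.

```python
# Complementary base pairs, with T standing for uracil as in GenBank records.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def inversion(word):
    """Return the complementary word read in reverse."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def is_palindrome(word):
    """A word is a palindrome when it equals its own inversion."""
    return len(word) > 0 and word == inversion(word)


for word in ("ACGT", "ACG", "TCTTTAACAAGCTTGTTAAAGA", "TTATAATTATAA"):
    print(word, is_palindrome(word))
```

An odd-length word such as ACG always fails the test, because its middle base would have to equal its own complement.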
Several points are worth noting from this initial exploratory analysis of palindromes in the coronavirus genome sequences: (1) The palindrome counts in the coronavirus genomes seem lower than what would be expected from random sequences. (2) The SARS virus contains an exceptionally long palindrome with 22 nucleotide bases. This is the longest among all palindromes observed in the coronaviruses. (3) There are two copies of a length-12 palindrome situated within 100 bases of each other in the SARS genome. This is not observed in the other coronaviruses.
Whether or not these palindrome-related features have any biological relevance will, of course, have to rely on careful laboratory investigations by the virologists. At this stage, however, it would be only reasonable to assess whether these features can indeed be considered statistically unusual when compared to random-sequence models. Our observations call for investigations into the probability distributions of palindrome counts, lengths, and locations in a random sequence. This paper will focus only on the palindrome counts, leaving the others for future studies.
In the next section, the mathematical formulas for the theoretical mean and variance for the number of palindromes at or above a prescribed length are derived based on a Markov-chain random-sequence model. Section 3 summarizes the computational results in comparing palindrome counts of the coronavirus genomes to the random-sequence models. In §4, we propose some biological questions that may be investigated in relation to these observed nonrandom features. A few concluding remarks are given in §5.
2. Palindrome Counts in Markov-Chain Models
The main objective of this paper is to assess whether the palindrome counts in the coronavirus genomes are observed more (or less) frequently than expected, under some specified probability models. We model the genome sequence as a realization of a sequence of random variables ξ1, ξ2, …, ξn taking values in 𝒜 = {A, C, G, T}, where n is the genome length. Throughout, we will assume that either
{ξ1, ξ2, …, ξn} are independent and identically distributed (M0); or
{ξ1, ξ2, …, ξn} form a stationary Markov chain of order one (M1).
For studying DNA words of length k, one can choose to use Markov chains of order up to the maximum order of k−2 as the sequence model. A higher-order Markov chain will better fit the data sequence, but at the same time the number of parameters in the model increases exponentially. In this study, we carried out some simulations using the second-order Markov-chain model (M2). The computation takes much longer, but the z scores obtained gave the same interpretation as those of the M1 model. We therefore content ourselves with the M0 and M1 models for our analysis of palindromes of length four and above.
We are interested in deriving the mean and standard deviation of the random variable XL, the total number of palindromes of length at least 2L, under the M0 and M1 sequence models. This will help quantify the extent of deviation of the observed palindrome counts in the coronavirus genome from the expected counts under the specified probability model. For L ≤ k ≤ n−L, define

$$I_k = \begin{cases} 1, & \text{if } \xi_{k-j+1} = \xi'_{k+j} \text{ for all } j = 1, \ldots, L, \\ 0, & \text{otherwise,} \end{cases}$$

where the prime denotes the complementary base. We say that a palindrome occurs at k when Ik = 1. Therefore, $X_L = \sum_{k=L}^{n-L} I_k$. Note that the distribution of Ik depends only on the joint distribution of (ξk−L+1, …, ξk+L). Under the M0 or M1 model, the joint distribution of (ξk−L+1, …, ξk+L) is independent of k. Hence ℙ[Ik = 1] is a constant in k. Similarly, ℙ[Ij = 1, Ik = 1] depends only on |j − k|. Therefore, for L ≤ k ≤ n−L and 1 ≤ d ≤ n−L−k, we define

$$\gamma(0) := \mathbb{P}[I_k = 1], \qquad \gamma(d) := \mathbb{P}[I_k = 1,\ I_{k+d} = 1].$$
The expressions of γ(0) and γ(d) are crucial to calculating the mean and variance of XL (see Proposition 3 below). Lemma 1 (respectively, Lemma 2) deals with the computation of γ(0) and γ(d) under the M1 (respectively, M0) sequence model. Indeed, we will deduce Lemma 2 from Lemma 1.
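To make the counting variable concrete, the sketch below counts the positions k at which a palindrome of length at least 2L occurs, that is, it evaluates XL as a sum of indicators. It is an illustrative Python sketch under the assumption that a palindrome "at k" has its centre between the k-th and (k+1)-th bases; the function name `count_palindromes` is hypothetical and this is not the software used in the study.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def count_palindromes(seq, L):
    """Count centres k (L <= k <= n-L) with a palindrome of length >= 2L.

    With 0-based indexing, centre c has c bases to its left, so the j-th
    base to the left is seq[c - j] and the j-th to the right is seq[c + j - 1].
    """
    n = len(seq)
    count = 0
    for c in range(L, n - L + 1):
        if all(COMPLEMENT[seq[c - j]] == seq[c + j - 1] for j in range(1, L + 1)):
            count += 1
    return count


# ACGTACGT contains the length-four palindromes ACGT, GTAC, ACGT,
# and the length-six palindrome CGTACG extends the middle one.
print(count_palindromes("ACGTACGT", 2))
print(count_palindromes("ACGTACGT", 3))
```

Because each centre is counted once, a long palindrome contributes a single count to XL for every L up to half its length.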
Throughout, we use b′ to denote the complementary base of b, and w′ the inversion (i.e., the complementary word read in reverse) of the word w. Working out all the possible overlap cases involves quite a few details because the overlap structures depend on the relative sizes of d (the extent of overlap) and 2L (the cutoff length of a palindrome). However, there are only two basic patterns in the overlap. In the first pattern (as illustrated by Figure 1b), the shaded segment, due to the complementary requirement of a palindrome, will uniquely determine the left and right ends of Ck and Ck+d. In the other pattern (as illustrated by Figure 1c), the shaded segment will determine the rest of both palindromes. In Figure 1a, even though palindromes Ck and Ck+d do not actually overlap (i.e., d ≥ 2L), the occurrence of a palindrome at k will still have an effect on the probability that a palindrome will occur at k+d under the M1 sequence model. Lemma 1 provides expressions of γ(d) under all possible situations.
Figure 1. Overlapping Structures of Palindromes Ck and Ck+d for Different Values of d.
Note. (a), (b), and (c) are drawn with different scales.
Lemma 1
Suppose the genome sequence is modeled as a stationary Markov chain of order one with stationary distribution π := (π(A), π(C), π(G), π(T)). For a, b ∈ 𝒜 and m ≥ 1, let P(a, b) and P(m)(a, b) respectively denote the transition probability and the m-step transition probability from base a to base b.
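- (a) We have

$$\gamma(0) = \sum_{b_1, \ldots, b_L \in \mathcal{A}} \pi(b_L')\, P(b_1', b_1) \prod_{i=1}^{L-1} P(b_{i+1}', b_i')\, P(b_i, b_{i+1}). \tag{1}$$

- (b) For d ≥ 1, we have the following three cases:
  - (i) d ≥ 2L:
  - (ii) L ≤ d < 2L:
  - (iii) 1 ≤ d < L: we let L = qd + r, where 0 ≤ r < d.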
Proof
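(a) Note that a palindrome of length at least 2L is of the form

$$b_L' \cdots b_1'\, b_1 \cdots b_L,$$

where b1, …, bL ∈ 𝒜. Therefore,

$$\gamma(0) = \sum_{b_1, \ldots, b_L \in \mathcal{A}} \mathbb{P}[\xi_{k-L+1} \cdots \xi_{k+L} = b_L' \cdots b_1'\, b_1 \cdots b_L].$$

Because

$$\mathbb{P}[\xi_{k-L+1} \cdots \xi_{k+L} = b_L' \cdots b_1'\, b_1 \cdots b_L] = \pi(b_L')\, P(b_L', b_{L-1}') \cdots P(b_2', b_1')\, P(b_1', b_1)\, P(b_1, b_2) \cdots P(b_{L-1}, b_L),$$

(1) follows immediately after rearranging terms.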
(b) To compute the overlap probability γ(d), i.e., the probability that there are palindromes at k and k+d, we call the stretch of bases ξk−L+1 ···ξk+d+L the span of palindromes Ck and Ck+d.
For (i) d ≥ 2L: The span s of the two palindromes Ck and Ck+d is of the form acb where , c=c1 · · ·cd−2L, and . Hence,
Hence (i) follows immediately from
and
For (ii) L ≤ d < 2L: Referring to Figure 1(b), let w = bd−L+1 ···bL denote the common segment of palindromes Ck and Ck+d. Assuming d > L, let u = b1 ···bd−L and v = bL+1 ···bd; we can represent Ck = w′u′uw and Ck+d = wvv′w′, where b1, …, bd ∈ 𝒜. Therefore,
Writing it out in terms of the initial distribution and transition probabilities, we have proved (ii) for d >L. The case for d = L is similar: Take u and v as null words and proceed as in the case d >L.
To prove (iii), we consider the case r ≥ 1 first. This time, let w = b1 ···bd denote the first d bases to the right of the center of Ck and to the left of the center of Ck+d. Let u = b1 ···br and v = bd−r+1 ···bd, respectively denote the first and last r bases of w. Figure 1(c) displays the necessary structure in Ck and Ck+d for both of them to be palindromes when q =3. If q is odd, then the span of Ck and Ck+d is of the form . Therefore,
| (2) |
If q is even, then the span of Ck and Ck+d is changed accordingly to the form and
| (3) |
By making the one-to-one transformation in the summation, , and we can see that both sums on the RHS of (2) and (3) are the same. So without loss of generality, we compute γ(d) under the assumption that q is odd. The crucial step is then to calculate the probability of the span of Ck and Ck+d, and part (iii) will follow immediately from summing over all possible b1, …, bd. We first consider r ≥ 2, then
| (4) |
For r =1, (4) becomes
If r = 0, reasoning similar to the above leads us to consider just the case q is odd. However, the span of Ck and Ck+d becomes (one can take u and v as empty words) . And hence,
Under the M0 model, the stationary distribution is π = (pA, pC, pG, pT), and the transition probabilities are P(a, b) = pb and P(m)(a, b) = pb for any a, b ∈ 𝒜 and m ≥ 1. Substituting these into Lemma 1(a) and (i) and (ii) of Lemma 1(b) immediately gives us the corresponding parts in Lemma 2 below. Part (iii) of Lemma 1(b) can be simplified further according to how big the remainder r is in relation to d. We shall omit the details. In this way, we have deduced the following Lemma 2, which was first proved in Leung et al. (2002).
Lemma 2
Suppose the genome sequence is modeled as M0 and let
- (a) We have
- (b) For d ≥ 1, we have the following four cases:
  - (i) d ≥ 2L:
  - (ii) L ≤ d < 2L:

  When 1 ≤ d < L, we let L = qd + r where 0 ≤ r < d, and consider two subcases according to how big the remainder r is in relation to d.
  - (iii) 1 ≤ d < L and 0 ≤ r < (d+1)/2:
  - (iv) 1 ≤ d < L and (d+1)/2 ≤ r < d:
Proposition 3
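With the Ik’s as defined at the beginning of §2, the total number of palindromes of length at least 2L is given by $X_L = \sum_{k=L}^{n-L} I_k$. And hence,

$$\mathbb{E}(X_L) = (n - 2L + 1)\,\gamma(0)$$

and

$$\operatorname{Var}(X_L) = (n - 2L + 1)\,\gamma(0)\,[1 - \gamma(0)] + 2 \sum_{d=1}^{n-2L} (n - 2L + 1 - d)\,[\gamma(d) - \gamma(0)^2],$$

where γ(0) and γ(d) are given as in Lemma 2 under the M0 sequence model, and Lemma 1 under the M1 sequence model.
Proof
The first equation follows immediately from taking expectations on both sides of $X_L = \sum_{k=L}^{n-L} I_k$, and the second from

$$\operatorname{Var}(X_L) = \sum_{k=L}^{n-L} \operatorname{Var}(I_k) + 2 \sum_{L \le j < k \le n-L} \operatorname{Cov}(I_j, I_k),$$

where Var(Ik) = γ(0)[1 − γ(0)], Cov(Ik, Ik+d) = γ(d) − γ(0)², and there are n − 2L + 1 − d pairs of positions at distance d.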
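The mean and variance of XL reduce to simple sums once γ(0) and γ(d) are available. The Python sketch below illustrates this under stated assumptions: it uses the standard decomposition of the variance of a sum of stationary indicators, and under M0 it uses the fact that two independent bases are complementary with probability 2(pA pT + pC pG), so that γ(0) is this quantity raised to the power L. The overlap probabilities γ(d) are not reproduced here; the caller must supply them as a function. All function names are hypothetical.

```python
def theta_m0(p):
    """Probability that two independent bases are complementary under M0."""
    return 2.0 * (p["A"] * p["T"] + p["C"] * p["G"])


def gamma0_m0(p, L):
    """P[I_k = 1] under M0: L independent complementary pairs."""
    return theta_m0(p) ** L


def mean_count(n, L, gamma0):
    """E(X_L): the number of admissible centres times gamma(0)."""
    return (n - 2 * L + 1) * gamma0


def var_count(n, L, gamma0, gamma):
    """Var(X_L) from gamma(0) and a caller-supplied function gamma(d)."""
    centres = n - 2 * L + 1
    cov_sum = sum(
        (centres - d) * (gamma(d) - gamma0 ** 2) for d in range(1, centres)
    )
    return centres * gamma0 * (1.0 - gamma0) + 2.0 * cov_sum


# Rounded SARS base composition from Table 1, in the order (A, C, G, T).
sars = {"A": 0.28, "C": 0.20, "G": 0.21, "T": 0.31}
print(mean_count(29727, 2, gamma0_m0(sars, 2)))
```

With the rounded composition the mean comes out near 1,972, slightly below the 1,981.0 reported in Table 2, which was presumably computed from unrounded base frequencies.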
3. Palindrome Counts in Coronaviruses
The derived means and variances under the M0 and M1 sequence models enable us to assess whether the observed palindrome count in a genome is too abundant or rare. The z score defined in (5) below is a modification of a generally accepted measure of over (or under)representation of a DNA word. For L ≥ 2, a standardized frequency under the assumption of the M1 sequence model is defined as

$$Z_{M1} = \frac{X_L - \mu_{M1}}{\sigma_{M1}}, \tag{5}$$

where XL is the observed number of palindromes of length at least 2L, and μM1 and σM1 denote its expected value and standard deviation, respectively. (For simplicity, we do not indicate the dependence of μ and σ on L.) The corresponding z score is defined similarly for the M0 sequence model. When L is small compared with the genome length n, XL is a sum of weakly dependent random indicators Ik and it is therefore well approximated by a normal distribution. Indeed, the vector of occurrence counts of the individual palindromes in the genome, suitably normalized, converges to a multivariate normal distribution as n→∞ (see Theorem 12.5 in Waterman 1995), and hence so does their sum. For L = 2 or 3, and n around 30,000, we expect that the distribution of the z scores will be approximately standard normal. The near-straight lines in the Q-Q plots in Figure 2 confirm that this is the case. This motivates our definition: The count is said to be over (or under)represented if the z score is greater than 1.645 or less than −1.645, respectively (i.e., in the upper or lower 5% of a standard normal distribution, as commonly used in one-tailed hypothesis tests in biological experiments). However, it should be emphasized that these cutoff z score values can only be considered as a convenient statistical guideline to help bring out interesting observations rather than a strict criterion to lead to a definitive conclusion.
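The standardization and the ±1.645 guideline described above amount to a few lines of arithmetic. The following Python sketch is illustrative only; the function names are hypothetical, and the input values are the SARS and rubella entries under the M1 model taken from Table 2.

```python
def z_score(observed, mean, sd):
    """Standardized palindrome count: (observed - expected) / standard deviation."""
    return (observed - mean) / sd


def classify(z, cutoff=1.645):
    """Apply the one-tailed 5% guideline used in the text."""
    if z > cutoff:
        return "overrepresented"
    if z < -cutoff:
        return "underrepresented"
    return "not significant"


# SARS and rubella virus under M1 (observed count, mean, SD from Table 2).
for name, observed, mean, sd in [("SARS", 1554, 1687.6, 40.3), ("RUV", 868, 845.6, 28.3)]:
    z = z_score(observed, mean, sd)
    print(name, round(z, 2), classify(z))
```

Small differences in the last digit relative to the tabulated z scores can arise because the tabulated means and standard deviations are themselves rounded.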
Figure 2.
Normal Q-Q Plots of Counts of Palindromes of Length Four (Top) and Six (Bottom) in the 1,000 Random Sequences Under the M1 Model for the SARS Genome
We compute the z scores for the genomes in a data set composed of seven coronaviruses with complete genome sequences and four other RNA viruses outside the coronavirus family. For some coronaviruses, the genome sequences of multiple strains of the same virus are available; because these genomes are very similar, only one strain of each is included. Of the four other RNA viruses, two (the rubella virus and the equine arteritis virus) have positive-stranded RNA genomes like the coronaviruses, one (rabies virus) has a negative-stranded RNA genome, and the remaining one (HIV) is a retrovirus. Table 1 lists the names, abbreviations, GenBank accession numbers, genome lengths, and base compositions of all eleven viruses. Table 2 displays the z scores for counts of palindromes of length four and above under the M0 and M1 models.
Table 1.
List of Seven Coronaviruses and Four Other RNA Viruses to be Analyzed
| Name | Abbrev. | Accession | Length (bases) | Base composition (A, C, G, T) |
|---|---|---|---|---|
| SARS coronavirus Urbani | SARS | AY278741 | 29,727 | (0.28, 0.20, 0.21, 0.31) |
| Avian infectious bronchitis virus | AIBV | NC_001451.1 | 27,608 | (0.29, 0.16, 0.22, 0.33) |
| Bovine coronavirus | BCoV | NC_003045.1 | 31,028 | (0.27, 0.15, 0.22, 0.36) |
| Human coronavirus 229E | HCoV | NC_002645.1 | 27,317 | (0.27, 0.17, 0.22, 0.35) |
| Murine hepatitis virus | MHV | NC_001846 | 31,357 | (0.26, 0.18, 0.24, 0.32) |
| Porcine epidemic diarrhea virus | PEDV | NC_003436.1 | 28,033 | (0.25, 0.19, 0.23, 0.33) |
| Transmissible gastroenteritis virus | TGV | NC_002306.2 | 28,586 | (0.29, 0.17, 0.21, 0.33) |
| Rubella virus | RUV | NC_001545.1 | 9,755 | (0.15, 0.39, 0.31, 0.15) |
| Equine arteritis virus | EAV | NC_002532.2 | 12,704 | (0.21, 0.26, 0.26, 0.27) |
| Rabies virus | RV | NC_001542.1 | 11,932 | (0.29, 0.22, 0.23, 0.26) |
| Human immunodeficiency virus 1 | HIV-1 | NC_001802.1 | 9,181 | (0.36, 0.18, 0.24, 0.22) |
Table 2.
Z Scores for Counts of Palindromes of Length Four and Above
| Virus | Counts | μM0 (σM0) | μM1 (σM1) | ZM0 | ZM1 |
|---|---|---|---|---|---|
| SARS | 1,554 | 1,981.0 (43.4) | 1,687.6 (40.3) | −9.83 | −3.32 |
| AIBV | 1,578 | 1,896.6 (42.8) | 1,675.3 (38.2) | −7.45 | −2.54 |
| BCoV | 1,886 | 2,115.6 (45.4) | 2,007.5 (45.5) | −5.06 | −2.67 |
| HCoV | 1,451 | 1,843.6 (42.2) | 1,567.6 (37.0) | −9.30 | −3.15 |
| MHV | 1,793 | 2,006.6 (43.8) | 1,911.3 (41.4) | −4.88 | −2.86 |
| PEDV | 1,457 | 1,781.6 (41.2) | 1,578.8 (38.3) | −7.87 | −3.18 |
| TGV | 1,610 | 1,993.9 (43.8) | 1,695.6 (38.9) | −8.76 | −2.20 |
| RUV | 868 | 793.2 (28.0) | 845.6 (28.3) | 2.67 | 0.79 |
| EAV | 672 | 784.3 (27.2) | 710.4 (25.8) | −4.13 | −1.49 |
| RV | 559 | 758.0 (26.7) | 564.3 (23.0) | −7.45 | −0.23 |
| HIV-1 | 475 | 551.9 (23.1) | 480.2 (21.9) | −3.33 | −0.24 |
Table 2 indicates that there is a general avoidance of palindromes of length four and above in the coronavirus genomes. A natural question that follows is whether palindromes of a given exact length are also underrepresented in these viruses.
To answer this question, one would need the mean ν and standard deviation τ for the count YL of palindromes of exact length 2L. It is easy to obtain the mean because ν = E(YL) = E(XL) − E(XL+1). The standard deviation of YL can be derived with suitable modification of the method of proofs in Lemmas 1 and 2, but the expression obtained is rather lengthy due to an increase in the overlapping structures. Instead, we adopt an alternative approach to estimate the standard deviation by simulation, which at the same time serves to validate our derived means and standard deviations. This approach has a further advantage of giving us the empirical distributions, and Figure 2 shows that for small values of L, the distributions are well approximated by normal distributions.
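The identity ν = E(YL) = E(XL) − E(XL+1) is easy to evaluate once the means are known. The Python sketch below does so under the M0 model only, assuming that γ(0) equals [2(pA pT + pC pG)] raised to the power L (independent bases) and using the rounded SARS base composition from Table 1. The function names are hypothetical.

```python
def expected_at_least(n, L, p):
    """E(X_L) under M0 for base frequencies p (dict with keys A, C, G, T)."""
    theta = 2.0 * (p["A"] * p["T"] + p["C"] * p["G"])
    return (n - 2 * L + 1) * theta ** L


def expected_exact(n, L, p):
    """nu = E(Y_L) = E(X_L) - E(X_{L+1}) for palindromes of exact length 2L."""
    return expected_at_least(n, L, p) - expected_at_least(n, L + 1, p)


sars = {"A": 0.28, "C": 0.20, "G": 0.21, "T": 0.31}
for L in (2, 3, 4):
    print(2 * L, round(expected_exact(29727, L, sars), 1))
```

The values obtained this way are close to, but not identical with, the νM0 column of Table 3, because the base frequencies used here are rounded to two decimals.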
For each virus in Table 1, 1,000 random sequences were generated for both the M0 and M1 models using scripts written in the R language (http://www.r-project.org/). The sequences were run through the palindrome program, which is part of EMBOSS (European Molecular Biology Open Software Suite, Rice et al. 2000), to extract the palindrome positions and lengths. Each output was then read back into R, and the counts of palindromes of various lengths were tabulated.
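The simulation procedure just described can be sketched in a few dozen lines. The version below is a Python stand-in, not the R and EMBOSS pipeline used in the study: palindromes are counted by a direct scan rather than by the EMBOSS palindrome program, the M1 chain is started from the base frequencies rather than from an exactly computed stationary distribution, and all function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def count_at_least(seq, L):
    """Count centres carrying a palindrome of length at least 2L."""
    return sum(
        all(COMPLEMENT[seq[c - j]] == seq[c + j - 1] for j in range(1, L + 1))
        for c in range(L, len(seq) - L + 1)
    )


def count_exact(seq, L):
    """Palindromes of exact length 2L, as a difference of two counts."""
    return count_at_least(seq, L) - count_at_least(seq, L + 1)


def fit_m1(seq):
    """Estimate first-order transition probabilities from dinucleotide counts."""
    pairs = Counter(zip(seq, seq[1:]))
    trans = {}
    for a in BASES:
        row = [pairs[(a, b)] for b in BASES]
        total = sum(row)
        # Fall back to a uniform row if base a never precedes another base.
        trans[a] = [c / total for c in row] if total else [0.25] * 4
    return trans


def simulate_m0(n, base_probs, rng):
    """Independent bases with the given frequencies (order A, C, G, T)."""
    return "".join(rng.choices(BASES, weights=base_probs, k=n))


def simulate_m1(n, start_probs, trans, rng):
    """First-order Markov chain; each base is drawn given the previous one."""
    seq = rng.choices(BASES, weights=start_probs, k=1)
    for _ in range(n - 1):
        seq.append(rng.choices(BASES, weights=trans[seq[-1]], k=1)[0])
    return "".join(seq)


def empirical_mean_sd(simulate, L, reps):
    """Empirical mean and standard deviation of exact-length counts."""
    counts = [count_exact(simulate(), L) for _ in range(reps)]
    return mean(counts), stdev(counts)
```

In use, one would fit `fit_m1` to an observed genome, then pass a closure such as `lambda: simulate_m1(n, probs, trans, rng)` to `empirical_mean_sd` with `reps=1000` to obtain estimates comparable to ν and τ̂.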
Tables 3 and 4 present the counts of palindromes of exact length four, six, and eight, along with their expected values ν, estimated standard deviations τ̂, and z scores. Based on the z scores, Tables 3 and 4 indicate that length-four palindromes are significantly underrepresented across the coronavirus family under both the M0 and M1 sequence models. However, for length-six palindromes, SARS is the only member of the coronavirus family that shows underrepresentation under the M1 sequence model. For length eight or above, no distinct patterns are observed.
Table 3.
Z Scores for Palindromes of Various Lengths Under the M0 Model
| Virus | Length 4: Counts | Length 4: νM0 (τ̂M0) | Length 4: ZM0 | Length 6: Counts | Length 6: νM0 (τ̂M0) | Length 6: ZM0 | Length 8: Counts | Length 8: νM0 (τ̂M0) | Length 8: ZM0 |
|---|---|---|---|---|---|---|---|---|---|
| SARS | 1,144 | 1,469.6 (36.9) | −8.82 | 284 | 379.4 (19.4) | −4.92 | 90 | 97.9 (9.7) | −0.82 |
| AIBV | 1,142 | 1,399.5 (37.5) | −6.87 | 320 | 366.8 (18.6) | −2.52 | 91 | 96.1 (9.9) | −0.52 |
| BCoV | 1,360 | 1,563.2 (40.4) | −5.03 | 389 | 408.2 (20.4) | −0.94 | 98 | 106.6 (10.7) | −0.80 |
| HCoV | 1,054 | 1,364.7 (36.9) | −8.42 | 287 | 354.5 (18.9) | −3.57 | 82 | 92.1 (9.8) | −1.03 |
| MHV | 1,328 | 1,499.0 (38.0) | −4.50 | 340 | 379.2 (19.5) | −2.01 | 82 | 95.9 (9.9) | −1.41 |
| PEDV | 1,079 | 1,332.5 (36.5) | −6.94 | 274 | 335.9 (18.5) | −3.35 | 79 | 84.7 (9.2) | −0.62 |
| TGV | 1,180 | 1,467.3 (38.4) | −7.48 | 306 | 387.5 (19.7) | −4.14 | 85 | 102.3 (9.8) | −1.77 |
| RUV | 610 | 567.0 (22.8) | 1.89 | 167 | 161.7 (12.6) | 0.42 | 68 | 46.1 (6.9) | 3.17 |
| EAV | 479 | 589.4 (23.8) | −4.64 | 145 | 146.4 (12.3) | −0.12 | 36 | 36.4 (6.1) | −0.06 |
| RV | 407 | 567.0 (23.7) | −6.75 | 102 | 142.9 (12.4) | −3.30 | 38 | 36.0 (5.9) | 0.34 |
| HIV-1 | 347 | 416.6 (20.1) | −3.46 | 89 | 102.1 (10.2) | −1.29 | 34 | 25.0 (4.8) | 1.87 |
Table 4.
Z Scores for Palindromes of Various Lengths Under the M1 Model
| Virus | Length 4: Counts | Length 4: νM1 (τ̂M1) | Length 4: ZM1 | Length 6: Counts | Length 6: νM1 (τ̂M1) | Length 6: ZM1 | Length 8: Counts | Length 8: νM1 (τ̂M1) | Length 8: ZM1 |
|---|---|---|---|---|---|---|---|---|---|
| SARS | 1,144 | 1,242.7 (33.4) | −2.96 | 284 | 327.3 (18.0) | −2.41 | 90 | 86.5 (9.4) | 0.37 |
| AIBV | 1,142 | 1,229.8 (35.4) | −2.48 | 320 | 326.9 (17.8) | −0.39 | 91 | 87.0 (9.4) | 0.42 |
| BCoV | 1,360 | 1,476.5 (37.2) | −3.13 | 389 | 390.4 (19.5) | −0.07 | 98 | 103.4 (9.8) | −0.55 |
| HCoV | 1,054 | 1,146.9 (34.5) | −2.69 | 287 | 307.6 (17.4) | −1.18 | 82 | 82.7 (8.9) | −0.08 |
| MHV | 1,328 | 1,421.3 (37.8) | −2.47 | 340 | 364.3 (18.8) | −1.29 | 82 | 93.5 (9.8) | −1.17 |
| PEDV | 1,079 | 1,169.8 (34.5) | −2.63 | 274 | 302.9 (17.5) | −1.65 | 79 | 78.6 (9.1) | 0.05 |
| TGV | 1,180 | 1,239.5 (34.0) | −1.75 | 306 | 333.2 (18.4) | −1.48 | 85 | 89.8 (9.7) | −0.49 |
| RUV | 610 | 604.3 (24.5) | 0.23 | 167 | 172.5 (13.8) | −0.40 | 68 | 49.2 (6.9) | 2.72 |
| EAV | 479 | 529.6 (22.5) | −2.25 | 145 | 134.8 (11.3) | 0.91 | 36 | 34.3 (5.7) | 0.30 |
| RV | 407 | 415.2 (19.1) | −0.43 | 102 | 109.8 (10.4) | −0.75 | 38 | 28.9 (5.3) | 1.71 |
| HIV-1 | 347 | 358.3 (18.7) | −0.60 | 89 | 91.0 (9.6) | −0.21 | 34 | 23.1 (4.5) | 2.42 |
For palindromes of length four and above, it is possible to fit higher-order Markov models to the genome sequence. For example, the second-order Markov-chain model that takes the base, dinucleotide, as well as trinucleotide composition into account, can be used to calculate the z scores. We simulated 1,000 random sequences with the M2 model, but the results did not differ much from the M1 model.
As the EMBOSS palindrome program provides us with a detailed listing of all occurrences of palindromes of length four and above, we are able to notice two unique features in SARS. First, the SARS sequence contains a long palindrome of length 22, the longest among all palindromes observed in the coronaviruses. Second, there are two identical, length-12 palindromes situated within 100 bases of each other in the SARS genome. These are not observed in the other coronaviruses. Although contributing little to the total palindrome counts, these three palindromes appear unusual enough to warrant further study of their possible biological roles, as discussed in the next section.
4. Discussion
Various statistical assessments of unusual abundance and rarity of individual words, including individual palindromes, in nucleotide sequences have been done using random-sequence models in a number of previous studies (Karlin et al. 1992; Merkl and Fritz 1996; Rocha et al. 1998, 2001; Schbath et al. 1995, to name just a few). The present study, however, aims at investigating the unusual abundance and rarity of palindromes collectively rather than individually. The mathematical results in §2 provide a directly computable formula to give a single z score for all palindromes with a given minimal length. We hope the exploratory results in this paper will serve as a basis for more detailed investigations to see how palindromes might be involved in important biological mechanisms of the coronaviruses.
There are two random sequence models M0 and M1 used in this paper. Because M1 can take the genome dinucleotide compositions into consideration while M0 cannot, M1 is preferred over M0. Comparatively, the z scores under M1 are less extreme than those of M0. M1 is therefore more conservative in declaring the palindrome counts in a genome to be significantly different from those in random sequences. We shall base our discussion of the results on M1 whenever possible.
The counts of palindromes of length at least four in each coronavirus analyzed are significantly lower than expected (see Table 2). As the palindrome length increases to six and above, the underrepresentation of palindromes no longer holds across the family (theoretical z scores under M1 range from −1.66 to 0.46). This suggests that there is a family-wide avoidance of palindromes of exact length four in the coronaviruses, which is confirmed by the empirical z scores for exact-length palindromes in Tables 3 and 4. With this knowledge, a thorough examination of the relative abundance of individual length-four palindromes, conditional on the total length-four palindrome count, is called for. We are in the process of setting up such a study.
Although the underrepresentation of length-four palindromes is observed for all of the coronaviruses in our data set that include members from all three antigenic groups (Marra et al. 2003), this underrepresentation is not universally true in all RNA viruses, as demonstrated by the other RNA viruses outside the coronavirus family. While it is conceivable that palindrome underrepresentation is just a characteristic of the common ancestor of the coronaviruses, it is worth noting that the characteristic is preserved in the family despite the reputation for RNA viruses to be nature’s swiftest evolvers (Worobey and Holmes 1999). So far, we cannot find any previous report of underrepresentation of short palindromes in RNA viruses with eukaryotic hosts. However, avoidance of short palindromes in some bacterial and phage DNA genomes has been reported in several studies (Karlin et al. 1992; Merkl and Fritz 1996; Rocha et al. 1998, 2001, among others). The phenomenon is generally explained in relation to the defense mechanisms of the bacterial and phage genomes, protecting themselves against being destroyed by restriction enzymes capable of cutting up DNA molecules at certain palindromic sites. It will be interesting to investigate whether there is any possible interaction of the short palindromes in the coronavirus genomes with the immune system of the host cells that might have detrimental effects on the survival of the virus.
Length-six palindromes are found significantly underrepresented only in SARS but not in the other six coronaviruses (see Table 4). Would this avoidance of length-six palindromes in the SARS genome offer a protective effect on the virus, making it comparatively more difficult to be destroyed and contributing to the rapid spread and the severity of the disease? This will be an interesting point to observe as we seek to learn more about the SARS virus.
Among all palindromes found in the seven coronavirus genomes we analyzed, the longest one resides in SARS. It is composed of the 22 bases TCTTTAACAAGCTTGTTAAAGA spanning positions 25962–25983. Because the probability distribution of palindrome lengths has not been rigorously obtained, we can only attempt a rough estimate, based on the simple M0 sequence model, of the probability of observing a palindrome of length 22 or more in a genome with base composition like that of SARS. It has been demonstrated in Leung et al. (2002) that for larger values of L (say ≥ 5), we may approximate the count of palindromes at or above length 2L by a Poisson random variable with parameter λ equal to the expected count. We therefore have ℙ[maximal palindrome length ≥ 22] = ℙ[X11 ≥ 1], which can be approximated by the corresponding Poisson probability with λ11 = E(X11) = 0.01008 by Proposition 3. This Poisson probability is equal to 1 − e^(−λ11), about 1%.
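The Poisson step can be reproduced directly from the value of λ11 quoted in the text. The short Python sketch below is illustrative; the function name is hypothetical.

```python
import math


def prob_at_least_one(lam):
    """P[N >= 1] for a Poisson variable N with mean lam, i.e. 1 - exp(-lam)."""
    return -math.expm1(-lam)


lambda_11 = 0.01008  # E(X_11) under M0, as quoted in the text
print(prob_at_least_one(lambda_11))
```

For such a small λ the probability is nearly equal to λ itself, which is why the estimate is about 1%.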
Knowing that this long palindrome is quite unlikely to occur by chance, one would logically ask the question of whether it plays any particular functional role. According to the classification of open reading frames (ORFs) encoding potential nonstructural proteins of the SARS virus (Rota et al. 2003, Table 1), this palindrome occurs in the overlapping region of the two ORFs designated X1 and X2. Due to the location of this palindrome, it is tempting to speculate that it might be involved in some secondary structures serving similar purposes like those of a pseudoknot, which is typically found at frame-shift locations in overlapping coding sequences (Giedroc et al. 2000). One would have to perform a detailed secondary structure prediction on this part of the SARS and other coronavirus genomes before further suggestions can be made. The methods and tools used by Qin et al. (2003) to predict the secondary structure in another part of the SARS virus genome (around the packaging-signal sequence) are likely to be applicable here as well.
Another feature unique to SARS is the occurrence of two copies of the length-12 palindrome TTATAATTATAA, spanning positions 22712–22723 and 22796–22807. The two copies lie within 100 bases of each other in the coding sequence of the surface-spike glycoprotein, which is important for virus entry and virus-receptor interactions (Yu et al. 2003). Both copies begin on the third position of a codon. Three amino acids, Tyr-Asn-Tyr, are coded by the second through tenth bases of the palindrome. No such repeating palindromes are observed in the corresponding glycoprotein-coding sequences for any of the other six coronaviruses. Probabilistic assessment of close repeating palindromes occurring in random sequences has yet to be formulated mathematically or estimated by simulation. (The method of Robin and Daudin 1999 can be used to assess the probability that a given palindrome repeats itself in close proximity.) If such an observation is found to be unlikely to occur by chance, then these repeating palindromes might be tested for potential regulatory functions. Large palindromes present in single-stranded RNA have the inherent ability to form double-stranded stem structures through the formation of intramolecular base pairs; thus, it is possible that these sequences form secondary RNA structures in the genomic RNA and in one or more subgenomic RNAs of the SARS virus. In many of the single-stranded RNA viruses, stem structures play important regulatory roles in genome replication or gene expression. It should be possible to investigate potential regulatory roles of these repeated length-12 palindromes by engineering silent mutations within these sequences such that the encoded protein is not altered but the palindromes and putative secondary structures are lost.
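The silent-mutation idea can be illustrated on the length-12 palindrome itself. The Python sketch below is a toy check, not an experimental design from the study: the mutant word is a hypothetical example in which each of the three codons is replaced by a synonymous codon of the standard genetic code (TAT and TAC both encode Tyr; AAT and AAC both encode Asn), and the partial codon table covers only these four codons.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Partial standard genetic code, limited to the codons needed here.
CODONS = {"TAT": "Tyr", "TAC": "Tyr", "AAT": "Asn", "AAC": "Asn"}


def is_palindrome(word):
    """True when the word equals its reverse complement."""
    return word == "".join(COMPLEMENT[b] for b in reversed(word))


def translate_core(word):
    """Translate bases 2 through 10, the three full codons inside the word."""
    core = word[1:10]
    return [CODONS[core[i:i + 3]] for i in range(0, 9, 3)]


original = "TTATAATTATAA"
mutant = "TTACAACTACAA"  # hypothetical synonymous changes in all three codons

for word in (original, mutant):
    print(word, translate_core(word), is_palindrome(word))
```

Both words encode Tyr-Asn-Tyr, but only the original is a palindrome, which is the property a silent-mutation experiment of this kind would exploit.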
5. Concluding Remarks
While we hope that there will never be another outbreak of SARS, we believe that detailed analysis of the SARS genome sequence can help generate useful information for understanding the biology of the coronaviruses and perhaps other RNA viruses in general. This first exploration of palindromes in the coronavirus family generates many questions to be investigated in greater detail mathematically, computationally, and biologically.
Closely related to palindromes is the sequence feature of close inversion, which is a palindrome with its two halves separated by a short stretch of intervening nucleotides. These close inversions are well known to form stem-loop and other secondary structures involved in the viral recombination and packaging process (Rowe et al. 1997, Qin et al. 2003). We anticipate that a set of interesting and challenging questions in random-sequence models will again emerge from the analysis of close inversions.
Acknowledgments
K. P. Choi was supported by BMRC Grant BMRC01/1/21/19/140 and M. Y. Leung by NIH Grants S06GM08194-23 and S06GM08194-24 and NSF Grant DUE9981104.
Contributor Information
David S. H. Chew, Email: matchewd@nus.edu.sg, Department of Mathematics, National University of Singapore, Singapore 117543, Singapore.
Kwok Pui Choi, Email: matckp@nus.edu.sg, Departments of Mathematics, and of Statistics and Applied Probability, National University of Singapore, Singapore 117543, Singapore.
Hans Heidner, Email: hheidner@utsa.edu, Department of Biology, University of Texas at San Antonio, San Antonio, Texas 78249, USA.
Ming-Ying Leung, Email: mleung@utep.edu, Department of Mathematical Sciences, University of Texas at El Paso, El Paso, Texas 79968, USA.
References
- Bloom BR. Lessons from SARS. Science. 2003;300:701. doi: 10.1126/science.300.5620.701.
- Cain D, Erlwein O, Grigg A, Russell RA, McClure MO. Palindromic sequence plays a critical role in human foamy virus dimerization. J Virology. 2001;75:3731–3739. doi: 10.1128/JVI.75.8.3731-3739.2001.
- Dirac AM, Huthoff H, Kjems J, Berkhout B. Requirements for RNA heterodimerization of the human immunodeficiency virus type 1 (HIV-1) and HIV-2 genomes. J General Virology. 2002;83:2533–2542. doi: 10.1099/0022-1317-83-10-2533.
- Giedroc DP, Theimer CA, Nixon PL. Structure, stability and function of RNA pseudoknots involved in stimulating ribosomal frameshifting. J Molecular Biol. 2000;298:167–185. doi: 10.1006/jmbi.2000.3668.
- Hill MK, Shehu-Xhilaga M, Campbell SM, Poumbourios P, Crowe SM, Mak J. The dimer initiation sequence stem-loop of Human Immunodeficiency Virus Type 1 is dispensable for viral replication in peripheral blood mononuclear cells. J Virology. 2003;77:8329–8335. doi: 10.1128/JVI.77.15.8329-8335.2003.
- Karlin S, Burge C, Campbell AM. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 1992;20:1363–1370. doi: 10.1093/nar/20.6.1363.
- Leung MY, Choi KP, Xia A, Chen LHY. IMS preprint series 2002-2. Institute for Mathematical Sciences, National University of Singapore; Singapore: 2002. Nonrandom clusters of palindromes in herpesvirus genomes.
- Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YS, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL. The genome sequence of the SARS-associated coronavirus. Science. 2003;300:1399–1404. doi: 10.1126/science.1085953.
- Merkl R, Fritz HJ. Statistical evidence for a biochemical pathway of natural, sequence-targeted G/C to C/G transversion mutagenesis in Haemophilus influenzae Rd. Nucleic Acids Res. 1996;24:4146–4151. doi: 10.1093/nar/24.21.4146.
- Qin L, Xiong B, Luo C, Guo ZM, Hao P, Su J, Nan P, Feng Y, Shi YX, Yu XJ, Luo XM, Chen KX, Shen X, Shen JH, Zou JP, Zhao GP, Shi TL, He WZ, Zhong Y, Jiang HL, Li YX. Identification of probable genomic packaging signal sequence from SARS-CoV genome by bioinformatics analysis. Acta Pharmacologica Sinica. 2003;24:489–496.
- Rice P, Longden I, Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genetics. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2.
- Robin S, Daudin JJ. Exact distribution of word occurrences in a random sequence of letters. J Appl Probab. 1999;36:179–193.
- Rocha EP, Danchin A, Viari A. Evolutionary role of restriction/modification systems as revealed by comparative genome analysis. Genome Res. 2001;11:946–958. doi: 10.1101/gr.gr-1531rr.
- Rocha EP, Viari A, Danchin A. Oligonucleotide bias in Bacillus subtilis: General trends and taxonomic comparisons. Nucleic Acids Res. 1998;26:2971–2980. doi: 10.1093/nar/26.12.2971.
- Rota PA, Oberste MS, Monroe SS, Nix WA, Campagnoli R, Icenogle JP, Penaranda S, Bankamp B, Maher K, Chen MH, Tong S, Tamin A, Lowe L, Frace M, DeRisi JL, Chen Q, Wang D, Erdman DD, Peret TC, Burns C, Ksiazek TG, Rollin PE, Sanchez A, Liffick S, Holloway B, Limor J, McCaustland K, Olsen-Rasmussen M, Fouchier R, Gunther S, Osterhaus AD, Drosten C, Pallansch MA, Anderson LJ, Bellini WJ. Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science. 2003;300:1394–1399. doi: 10.1126/science.1085952.
- Rowe CL, Fleming JO, Nathan MJ, Sgro JY, Palmenberg AC, Baker SC. Generation of coronavirus spike deletion variants by high-frequency recombination at regions of predicted RNA secondary structure. J Virology. 1997;71:6183–6190. doi: 10.1128/jvi.71.8.6183-6190.1997.
- Ruan YJ, Wei CL, Ling AE, Vega VB, Thoreau H, Su ST, Chia JM, Ng P, Chiu KP, Lim L, Zhang T, Chan KP, Oon LE, Ng ML, Leo SY, Ng LFP, Ren EC, Stanton LW, Long PM, Liu ET. Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection. Lancet. 2003;361:1779–1785. doi: 10.1016/S0140-6736(03)13414-9.
- Schbath S, Prum B, de Turckheim E. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J Comput Biol. 1995;2:417–437. doi: 10.1089/cmb.1995.2.417.
- Waterman MS. Introduction to Computational Biology. Chapman & Hall; New York: 1995.
- Worobey M, Holmes EC. Evolutionary aspects of recombination in RNA viruses. J General Virology. 1999;80:2535–2543. doi: 10.1099/0022-1317-80-10-2535.
- Yu XJ, Luo C, Lin JC, Hao P, He YY, Guo ZM, Qin L, Su J, Liu BS, Huang Y, Nan P, Li CS, Xiong B, Luo XM, Zhao GP, Pei G, Chen KX, Shen X, Shen JH, Zou JP, He WZ, Shi TL, Zhong Y, Jiang HL, Li YX. Putative hAPN receptor binding sites in SARS-CoV spike protein. Acta Pharmacologica Sinica. 2003;24:481–488.


