Nucleic Acids Research. 2010 Sep 1;39(1):292–299. doi: 10.1093/nar/gkq642

The ends of a large RNA molecule are necessarily close

Aron M Yoffe 1, Peter Prinsen 2, William M Gelbart 1,*, Avinoam Ben-Shaul 3,*
PMCID: PMC3017586  PMID: 20810537

Abstract

We show on general theoretical grounds that the two ends of single-stranded (ss) RNA molecules (consisting of roughly equal proportions of A, C, G and U) are necessarily close together, largely independent of their length and sequence. This is demonstrated to be a direct consequence of two generic properties of the equilibrium secondary structures, namely that the average proportion of bases in pairs is ∼60% and that the average duplex length is ∼4. Based on mfold and Vienna computations on large numbers of ssRNAs of various lengths (1000–10 000 nt) and sequences (both random and biological), we find that the 5′–3′ distance—defined as the sum of H-bond and covalent (ss) links separating the ends of the RNA chain—is small, averaging 15–20 for each set of viral sequences tested. For random sequences this distance is ∼12, consistent with the theory. We discuss the relevance of these results to evolved sequence complementarity and specific protein binding effects that are known to be important for keeping the two ends of viral and messenger RNAs in close proximity. Finally we speculate on how our conclusions imply indistinguishability in size and shape of equilibrated forms of linear and covalently circularized ssRNA molecules.

INTRODUCTION

There are many situations in which it is biologically important for the two ends of a large RNA molecule to be close to each other. In animal viruses with single-stranded (ss) RNA genomes, for example, efficient replication of the genome has been shown to depend on its effective ‘circularization’. More explicitly, complementary sequences have been identified at or near the 5′- and 3′-ends that are responsible for forming ‘panhandles’ that keep the two ends close together. These panhandles are duplexes that are 21 bp in the case of yellow fever virus (1), and 15 bp in the case of influenza A (2), thereby according them unusual robustness. Another example where RNA genome circularization of this kind has been implicated in RNA replication is sindbis virus; here an 18 bp 5′–3′ panhandle has been shown to survive denaturing conditions sufficient to eliminate much of the remaining secondary structure, leaving the genome with a circular appearance in electron micrographs (3). In dengue, also (like yellow fever, influenza A and sindbis) a positive-sense RNA virus, minus-strand synthesis involves long-distance 5′–3′ base pairing that facilitates the transfer of the RNA-dependent RNA polymerase from its binding site at the 5′-end to the initiation site at the 3′-end (4). Similarly, circularization of HIV-1 has been shown to arise from base pairing between the 5′- and 3′-ends of the RNA genome (5); these interactions are found to occur as well in different HIV-1 subtypes with large sequence variation, suggesting they share an evolutionary basis.

It has also long been known that effective circularization of messenger RNA molecules is important for efficient translation. The 5′- ‘capping’ and 3′-polyadenylation of mRNAs—through a variety of specific protein-binding events—result in the association of the two ends of the molecules and subsequent formation of translation initiation complexes (6). In eukaryotes, for example, the 3′-poly(A) ‘tail’ interacts with the poly(A)-binding protein, the 5′-G-cap binds a eukaryotic initiation factor, and these two bound proteins—with the full length of mRNA intervening—simultaneously bind a ‘bridging’ protein. This effective circularization of the molecule results in recruitment of the 40S ribosomal subunit (via binding of still another protein) and initiation of translation.

Because circularization of mRNA is so important for its translation, mechanisms that co-localize the ends have evolved even in cases where the molecules are not capped or polyadenylated. Plant viruses, for example, often lack both of these special sequences and yet are translated efficiently (7,8). The effective circularization is enhanced by direct base pairing between sub-sequences in the untranslated regions (UTRs) at the 5′- and 3′-ends; the UTRs functionally replace the G-cap and poly(A) tail. Further, the RNAs of many positive-sense (mRNA) viruses have internal ribosome entry sites (IRESs) at their 5′-ends, i.e. subsequences that recruit ribosomes and initiate translation (9,10).

In all of the above examples, whether they involve direct interaction between the 5′- and 3′-ends or interaction mediated by binding proteins, particular, evolved subsequences are involved in effective circularization. But in all of these scenarios, an even more fundamental requirement is that the two ends of the fluctuating molecule must spend enough time near each other for there to be a high probability that the special elements (RNA subsequences or binding proteins) find one another. More explicitly, we will argue here that effective circularization of large RNA molecules is achieved through generic properties of secondary structure that are essentially independent of sequence. The specific evolved subsequences mentioned above are not needed so much for circularization as for facilitating the binding of particular proteins (e.g. RNA replicases and ribosome initiation factors) that are important for biological function of the circularized RNA.

Consider the analogous situation of double-stranded (ds) DNA with ‘sticky’ ends arising from complementary ss overhangs (generated, say, by a restriction enzyme). Here the probability of the two ends being covalently bound by a ligase is directly determined by, and ultimately limited by, the likelihood that they are close enough to each other to bind, i.e. that the double helix can twist and bend enough for its two ends to get close together (11). This classic problem is informed by the well-known statistical mechanical result giving the likelihood of the ends of a linear, semiflexible polymer being within a monomer distance of one another. For sufficiently long molecules ($L \gg \ell_p$) this probability is of order $(L/\ell_p)^{-3/2}$, where $L$ and $\ell_p$ are the contour and persistence lengths, respectively, of the linear polymer; the contour length is the number of monomers times the average inter-monomer distance, and the persistence length is the distance along the chain contour beyond which the polymer can bend almost freely (12). Thus, the circularization probability of long DNA is small because $L/\ell_p$ is large, i.e. the molecule is long compared to its persistence length (50 nm, for DNA): maximization of configurational entropy requires that the ends be far apart. The small probability of finding them close, decreasing as $L^{-3/2}$, reflects directly the fact that the root-mean-square distance between the ends of the molecule is increasing as $L^{1/2}$.

To understand the basis for effective circularization of ssRNA, then, it is natural to ask: is there, in analogy with dsDNA, a generic result for the probability of finding the two ends of an RNA molecule close to one another, and how different is it from that for a linear polymer? In this article we argue that there is indeed a universal distribution of end-to-end distances in large RNA molecules, and furthermore that it is essentially independent of overall sequence and length. We show in particular that the distance between ends is necessarily small, because of generic features of the secondary structure, notably that the percentage ($f$) of paired nucleotides (nt) is ∼60% and that the average duplex length ($k$) is ∼4. Using an early variant of the RNA folding algorithm developed by Zuker et al. (13,14), Fontana et al. (15) have calculated various characteristics of the minimum free energy (MFE) structure corresponding to several different types of short (20–100 nt) sequences. Averaging over many sequences of the same length (number of nucleotides, $N$) and base composition, they found that $f$ and $k$ approach a constant value with increasing $N$. They also calculated a property (the number of unpaired bases in ‘joints’ and ‘free ends’) that is closely related to our definition of the 5′–3′ distance (see next section), finding that for the short chains analyzed this number increases, yet with a gradually decreasing slope, as $N$ increases. The constancy of $f$ and $k$ has been confirmed for a wide range of biological (viral and yeast) ssRNA sequences (16) by application of the mfold and Vienna codes for predicting thermally accessible secondary structures.

For certain models of polynucleotide chains, the $N$-independence of $f$ and $k$ has been proven analytically, using a variety of powerful theoretical tools. Hofacker et al. (17), applying an elegant graph-theoretic approach, derived exact results for these properties (see their Table 3) and various other secondary structure attributes of RNA-like heteropolymers. Their results apply to an idealized ensemble where all possible secondary structures have equal statistical weight, resulting in low values of $f$ and $k$. More recently, Clote et al. (18), using the Nussinov–Jacobson (‘maximum base pairing’) model (19), have shown that, for an ssRNA chain with Watson–Crick pairing rules, $f$ approaches a constant value slightly exceeding 90% for $N$ large (>1000). Earlier, de Gennes had noted (20) that, for a random sequence of two complementary nucleotides, the distance between chain ends remains finite even as $N$ approaches infinity. Based on this notion he also concluded that ‘ … many properties of a large, open, strand are not very different from those of a cyclic strand of equal molecular length’ (20). We elaborate on this idea in the next section.

Our goal in the present work is to emphasize the generality of the proximity of the 5′- and 3′-ends of large RNA molecules of arbitrary length and sequence. Based on the general findings noted above for large ssRNA chains, we derive a simple expression for the 5′–3′ distance that can be evaluated numerically for sequences of given Inline graphic and Inline graphic. We also calculate this distance using the RNAsubopt (21,22) and mfold (23,24) folding algorithms. A further consequence of our analyses is that the secondary—and hence tertiary—structures of linear and covalently-circularized RNA molecules are practically identical. These conclusions are tested against several systematic calculations of secondary structures for specific linear and circular sequences, both random and viral.

METHODS

Figure 1A displays the MFE secondary structure of a rather short (200 nt) random-sequence ssRNA molecule, composed of equal numbers of A, C, G and U, as predicted by the mfold algorithm (23,24). The duplexes are represented in the usual way by straight ‘ladders’ and the loops by circles of different sizes. The same secondary structure is visualized slightly less schematically in Figure 1B, with more realistic scaling of duplex dimensions, using the jViz.Rna drawing program (25). This latter representation illustrates that the dangling ss segments in the ‘exterior loop’—the one including the 5′- and 3′-ends—are independent flexible chains. In Figure 1C the secondary structure is mapped into a tree graph, where each edge (bond) represents a duplex and the vertices represent the loops (15,17,26); the interior loops are denoted by solid circles, and the exterior loop by an open circle. The term ‘interior loop’ is conventionally defined as the chain of bases, both paired and unpaired, comprising a closed loop, excluding its closing (‘downstream’) base pair. In the following we slightly depart from this definition and include the closing base pair as part of the (hence closed) loop. Our definition of the exterior loop, which lacks a closing base pair, is identical to the conventional one, namely, it includes all bases (paired and unpaired) along the shortest connected (covalently or H-bonded) path from the 5′- to the 3′-end.

Figure 1.


Three different representations of the mfold-predicted minimum free energy secondary structure of a random 200 nt ssRNA of uniform composition (25% A, C, G, U). (A) Conventional schematic, drawn with mfold, showing base-paired regions (duplexes) and single-stranded loops. (B) jViz.Rna drawing (16), emphasizing the flexibility of single-stranded loops and scaled dimensions of duplexes. (C) Graph-theoretic mapping of this secondary structure, reducing duplexes to edges (bonds) and loops to vertices (filled circles); the single ‘exterior’ loop is depicted by an open circle.

5′–3′ Distance

As a simple intuitive measure of the 5′–3′ distance (in a given secondary structure of a given sequence) we use the total number of nucleotide links comprising the exterior loop, i.e.

$$D = n_{ss} + n_{ds} \tag{1}$$

Here $n_{ss}$ is the number of covalent (phosphodiester) bonds (hereafter also referred to as ss links) in the exterior loop and $n_{ds}$ is the number of base-paired (H-bonded, ds) links in the exterior loop or, equivalently, the number of duplexes emanating from the exterior loop. As it is the total number of (ss and ds) links in the nucleotide chain constituting the exterior loop, we shall refer to $D$ as the ‘effective contour length’ of this loop. Expressing $n_{ss}$ in the form $n_{ss} = n - 1 - n_{ds}$, where $n$ is the total number of nucleotides in the exterior loop, and noting that $2n_{ds}$ is the total number of paired bases in the exterior loop, it follows from Equation (1) that $D + 1 - 2n_{ds}$ is the number of unpaired bases in this loop. Figure 2 illustrates an exterior loop where Inline graphic whereas in Figure 1 Inline graphic. It should be emphasized that the average physical distance between the 5′- and 3′-ends depends not only on $D$ but also on the specific sequence of the loop, as well as the number of duplexes branching from the loop. In fact the lengths of the covalent and H-bonded links are different (the latter are about three times larger). If all links were of equal length $b$, and their joints were fully flexible, then the physical 5′–3′ distance would be roughly $b D^{1/2}$, where we have neglected excluded volume effects because of the shortness of the exterior loop (12). It follows that small, $N$-independent $D$-values imply small, $N$-independent physical distances between the two chain ends.
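The link count defined in Equation (1) can be illustrated with a minimal sketch, assuming the secondary structure is supplied in pseudoknot-free dot-bracket notation (the text format written by the Vienna package). The function name `exterior_loop_links` is hypothetical and is not part of mfold or the Vienna package. The code walks from the 5′-end to the 3′-end, crossing each duplex that branches from the exterior loop through the H-bond of its closing base pair, and counts the ss and ds links separately.

```python
def exterior_loop_links(structure: str) -> tuple[int, int]:
    """Return (ss links, ds links) along the exterior loop of a dot-bracket structure.

    Hypothetical helper for illustration; assumes balanced brackets and no pseudoknots.
    """
    # Map every opening bracket to the index of its matching closing bracket.
    partner = {}
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            partner[stack.pop()] = i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")

    n_ss = n_ds = 0
    i, last = 0, len(structure) - 1
    while i < last:
        if i in partner:
            # Cross the duplex via the H-bond of its outermost base pair.
            n_ds += 1
            i = partner[i]
        else:
            # Step along one covalent (phosphodiester) bond.
            n_ss += 1
            i += 1
    return n_ss, n_ds


def five_three_distance(structure: str) -> int:
    """Effective contour length of the exterior loop, D = ss links + ds links."""
    n_ss, n_ds = exterior_loop_links(structure)
    return n_ss + n_ds
```

For the toy structure `..((...))..` the walk takes four covalent steps and one H-bond step, so the distance is 5 links, and the exterior loop contains 5 + 1 − 2 = 4 unpaired bases.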

Figure 2.


Detailed view of an exterior loop consisting of Inline graphic covalent links and Inline graphic H-bonded links of nucleotides. The effective contour length of the loop is Inline graphic.

Four simple observations will guide our calculation of the 5′–3′ distance:

  1. The MFE secondary structures of a given linear ssRNA molecule and that of the circular RNA obtained by linking the 5′- and 3′-ends of the linear chain are very similar, and their energies practically identical. This is because the presence or absence of a covalent (phosphodiester) bond between the terminal nucleotides does not significantly alter overall base pairing. Its small influence on the configurational free energy of the molecule enters only through the entropy difference between the open exterior loop in the linear RNA and the corresponding closed (interior) loop in the circular analog. Actually, for any secondary structure of the linear ssRNA, not only the one of minimum free energy, the corresponding circular structure has essentially the same energetic and structural characteristics. Conversely, any secondary structure of a linear RNA can be regarded as derived from ‘cutting’ a specific covalent bond in one of the interior loops of the corresponding circular RNA. We thus expect that secondary structure characteristics of long RNA molecules, such as the pairing fraction or average duplex length, are practically the same for the linear and circularized ‘isomers’. These conclusions have been confirmed by numerical analyses of a large number of linear and circular RNA sequences of different lengths and compositions, as reported below and in Supplementary Figure S1 and Supplementary Table S1.

  2. As noted in the Introduction, for long chains (say Inline graphic) composed of comparable proportions of A, C, G and U (25 ± 5%), we find that $f \approx 60\%$ for randomly-permuted sequences and for most viral RNAs (Tables 1 and 2).

  3. For long chains, we also know that the average length of (i.e. number of base pairs in) a duplex, $k$, is independent of $N$ and rather insensitive to composition (for compositions involving 25 ± 5% of the four bases). For nearly all the sets of sequences examined in this study (randomly-permuted, viral and yeast-derived) $k$ is between 4 and 5 (Tables 1 and 2; Supplementary Table S1).

  4. As is well known, every secondary structure can be represented by a tree graph (26), as illustrated in Figure 1C.

Two simple and important results can easily be proved from the tree graph analogy. First, the number of vertices, $V$, and the number of bonds, $B$, of a circular RNA are related by the equality $V = B + 1$. This relation is also valid for linear RNAs provided the exterior loop is also represented by a vertex (possibly differently labeled, as in Fig. 1C). Second, on average (over all loops in any given structure), each loop (vertex) is connected to $\langle z \rangle = 2B/V = 2(1 - 1/V)$ duplexes (edges). For long ($N \gg 1$) sequences we also find $V \gg 1$ (see below), in which case we can safely set $\langle z \rangle = 2$, which (unless otherwise stated) will be the value used in our calculations. Note that the averaging here is over all loops in a given structure. The same holds, of course, after averaging over any number of structures and/or sequences. Note also that we always have $z \geq 1$, with $z = 1$ corresponding to a ‘hairpin’ loop, $z = 2$ to a ‘bubble’ or ‘bulge,’ and $z \geq 3$ to a ‘multi loop’.
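The tree-graph bookkeeping can be checked on a small example with the following sketch, which again assumes a balanced, pseudoknot-free dot-bracket string as input. The helper names `loop_degrees` and `structure_stats` are hypothetical. A duplex is taken here to be a maximal run of directly stacked base pairs, so a bulge or interior loop separates two duplexes; this is an assumption of the sketch, not necessarily the exact convention used in the paper's analysis.

```python
def loop_degrees(structure: str) -> list[int]:
    """Number of duplexes attached to each loop; the exterior loop comes first."""
    n = len(structure)
    partner = [-1] * n
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i

    degrees = []
    # Each work item is a loop region [start, stop) plus the duplexes already
    # attached to it (0 for the exterior loop, 1 for a loop closed by a duplex).
    work = [(0, n, 0)]
    while work:
        start, stop, z = work.pop()
        i = start
        while i < stop:
            j = partner[i]
            if j > i:
                z += 1  # one more duplex branches from this loop
                a, b = i, j
                # Follow stacked pairs down to the innermost pair of the duplex.
                while a + 1 < b - 1 and partner[a + 1] == b - 1:
                    a, b = a + 1, b - 1
                work.append((a + 1, b, 1))  # the loop closed by this duplex
                i = j + 1
            else:
                i += 1
        degrees.append(z)
    return degrees


def structure_stats(structure: str) -> dict:
    """Pairing fraction f, mean duplex length k, and tree-graph counts."""
    degrees = loop_degrees(structure)
    pairs = structure.count("(")
    duplexes = len(degrees) - 1  # every loop except the exterior one closes a duplex
    return {
        "f": 2 * pairs / len(structure),
        "k": pairs / duplexes if duplexes else 0.0,
        "V": len(degrees),
        "B": duplexes,
        "mean_z": sum(degrees) / len(degrees),
    }
```

For the toy structure `((..((...))..((...))..))..` this gives four loops (one exterior loop, one three-way multi loop and two hairpins) and three duplexes, so the number of vertices exceeds the number of bonds by one and the mean number of duplexes per loop is 2 × 3/4 = 1.5.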

Table 1.

Composition dependence of the average percentage of bases paired (f), the average duplex length (k) and the average 5′–3′ distance (D), for different sets of random and yeast-derived sequences of length 3000 nt; each set consists of 500 sequences

| Type of ssRNA | Folding program | G (%)a | C (%)a | A (%)a | U (%)a | f (%) | k (bp) | D (links) | D, from Equation (2) |
|---|---|---|---|---|---|---|---|---|---|
| Random, viral-like composition | RNAsubopt | 24 | 22 | 26 | 28 | 62 ± 1 | 4.0 ± 0.1 | 12 ± 4 | 11.6 |
| Random, uniform composition | RNAsubopt | 25 | 25 | 25 | 25 | 61 ± 1 | 3.9 ± 0.1 | 12 ± 5 | 12.6 |
| Yeast-derivedb | RNAsubopt | 19 | 19 | 31 | 31 | 58 ± 2 | 4.1 ± 0.1 | 14 ± 5 | 11.9 |
| Random, viral-like composition | mfold | 24 | 22 | 26 | 28 | 61 ± 1 | 4.5 ± 0.1 | 14 ± 7 | 12.8 |

Values following the ± symbols are standard deviations.

aThe randomly-permuted ssRNAs of each type are of identical composition; for the yeast ssRNAs, the mean composition is listed.

bThese are ssRNA transcripts of successive 3000 bp sections of yeast (S. cerevisiae) chromosomes XI and XII.

Table 2.

Values of f, k and D for viral ssRNAs, determined with RNAsubopt

| Viral taxon | No. of seq.a | Host | N (nt) | f (%) | k (bp) | D (links) |
|---|---|---|---|---|---|---|
| Bromoviridae RNA3 | 8 | Plant | 2210 | 63 ± 1 | 4.2 ± 0.1 | 19 ± 6 |
| Bromoviridae RNA2 | 8 | Plant | 2891 | 63 ± 2 | 4.3 ± 0.1 | 18 ± 4 |
| Bromoviridae RNA1 | 8 | Plant | 3265 | 64 ± 2 | 4.3 ± 0.1 | 15 ± 3 |
| Leviviridae | 9 | Bacterium | 3780 | 68 ± 2 | 4.3 ± 0.1 | 15 ± 9 |
| Sobemovirus | 9 | Plant | 4199 | 66 ± 2 | 4.2 ± 0.2 | 17 ± 4 |
| Luteovirus | 17 | Plant | 5725 | 62 ± 1 | 4.2 ± 0.1 | 16 ± 7 |
| Tymovirus | 9 | Plant | 6300 | 45 ± 4 | 3.9 ± 0.1 | 26 ± 5 |
| Tobamovirus | 22 | Plant | 6425 | 64 ± 1 | 4.2 ± 0.1 | 19 ± 5 |
| Astroviridae | 6 | Animal | 6719 | 63 ± 1 | 4.3 ± 0.1 | 16 ± 8 |
| Caliciviridae | 18 | Animal | 7713 | 62 ± 1 | 4.1 ± 0.1 | 20 ± 19 |

Values following the ± symbols are standard deviations.

aNumber of sequences analyzed.

Among the numerous possible secondary structures of long RNA sequences, there are often thousands whose free energies are just marginally higher (Inline graphic or less) than that of the MFE configuration, and under equilibrium conditions all these structures are nearly equally likely. Consequently, any property of the molecule that depends on its secondary structures should be averaged over their full thermal (Boltzmann) distribution. Suppose that, using RNAsubopt or a similar program, we have stochastically sampled the thermal ensemble of structures corresponding to a certain circular ssRNA sequence of given Inline graphic and Inline graphic. As argued in observation 1 above, all the linear ssRNA molecules derived by cutting any covalent (ss) bond in any interior loop of any member of the above ensemble will fold into ensembles of structures that are practically identical both to each other, and to the ensemble of the original circular molecule. The only difference is the appearance of an exterior loop, which now contains the 5′- and 3′-ends. For every given circular structure containing $V$ interior loops, this cutting procedure yields $M = \sum_{i=1}^{V} m_i$ linear ssRNA sequences, where $M$ is the total number of ss (covalent) bonds in all loops of the given structure, $m_i$ denoting the number of covalent bonds in loop $i$. Noting that the total number of nucleotides in the closed loop $i$, namely $u_i + p_i$, is equal to the total number of bonds in this loop ($m_i + z_i$), we find $m_i = u_i + p_i - z_i = u_i + z_i$, with $u_i$ and $p_i = 2z_i$ denoting the number of unpaired and H-bonded nucleotides in loop $i$, respectively, and $z_i$ the number of duplexes emerging from this loop. This yields $M = \sum_i u_i + \sum_i z_i = (1 - f)N + 2B$. We have used the fact that the first sum is the total number of unpaired nucleotides, $(1 - f)N$, and the fact that because every duplex is connected to two loops, the second sum is twice the total number ($B$) of duplexes in the structure. But $B$ can be expressed in the form $B = fN/(2k)$, so that $M = N(1 - f + f/k)$. Here, and in all subsequent analytical expressions involving $f$, its numerical value will be understood to be the fraction of bases in pairs, rather than the percentage. As before, $k$ denotes the average duplex length in the particular sequence considered. For $f = 0.6$ and $k = 4$ we find $M = 0.55N$.
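The counting argument in this paragraph reduces to a few lines of arithmetic, sketched below. The function name and its return layout are hypothetical, and the relation "loops = duplexes + 1" is taken from the tree-graph property stated earlier in this section. Given the chain length, the pairing fraction and the mean duplex length, the sketch returns the number of duplexes, the total number of ss bonds in loops, and the mean number of ss bonds per loop.

```python
def ss_bond_summary(n_nt: float, f: float, k: float) -> dict:
    """Loop bookkeeping for a circular structure (illustrative helper).

    n_nt : number of nucleotides
    f    : fraction (not percentage) of bases in pairs
    k    : mean duplex length in base pairs
    """
    duplexes = f * n_nt / (2 * k)          # f*N paired bases, 2k bases per duplex
    loops = duplexes + 1                   # tree graph: vertices = edges + 1
    # Unpaired bases contribute one ss bond each; every duplex adds two more,
    # one in each of the two loops it connects.
    ss_bonds = (1 - f) * n_nt + 2 * duplexes
    return {
        "duplexes": duplexes,
        "loops": loops,
        "ss_bonds": ss_bonds,
        "ss_bonds_per_nt": ss_bonds / n_nt,
        "ss_bonds_per_loop": ss_bonds / loops,
    }


if __name__ == "__main__":
    summary = ss_bond_summary(3000, 0.6, 4)
    for key, value in summary.items():
        print(f"{key}: {value:.2f}")
```

With a 3000 nt chain, a pairing fraction of 0.6 and a mean duplex length of 4, the sketch gives 225 duplexes and 1650 ss bonds in loops, i.e. 0.55 ss bonds per nucleotide and roughly 7.3 ss bonds per loop.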

In the next section we present numerical calculations of the average 5′–3′ distance $D$ for two types of ssRNA molecules, biological (yeast-derived and viral) and randomly-permuted sequences. The random sequences were included both for direct comparison to the biological sequences, and for general theoretical interest. In each case, a Boltzmann-weighted average $D$-value is determined for the thermal ensemble of structures associated with each sequence. We then report the mean of these ensemble-average $D$-values for each set of sequences.

For the random sequences a simple theoretical prediction of $D$ (showing good agreement with the numerical calculation) can be derived based on two reasonable approximations, as argued in Appendix 1. We show there that, for any given secondary structure of a very long ($N \gg 1$) ssRNA molecule, the 5′–3′ distance is given by

graphic file with name gkq642m2.jpg (2)

with $\langle m \rangle$ denoting the average number of ss covalent bonds per interior loop in the structure considered. In terms of the pairing fraction, $f$, and duplex length, $k$, of this structure we obtain $\langle m \rangle = 2[1 + k(1 - f)/f]$. For both the MFE structure and the canonical ensemble averages of secondary structures of random (but also viral) sequences containing roughly equal proportions of the four bases it is found that $f \approx 0.6$ and $k \approx 4$, yielding $\langle m \rangle \approx 7.3$, and hence $D \approx 12$. See also Table 1.

Numerical computations

RNA sequences

Randomly-permuted ssRNA sequences were generated with a Fisher–Yates shuffle driven by a Mersenne Twister random number generator (27) implemented in C++ (by R. Wagner, University of Michigan, available at: www-personal.umich.edu/∼wagnerr/MersenneTwister.html). Viral ssRNA sequences were obtained from the National Center for Biotechnology Information Genome Database (www.ncbi.nlm.nih.gov). Yeast (Saccharomyces cerevisiae) genomic sequences were obtained from the Saccharomyces Genome Database (www.yeastgenome.org).

Folding programs

Secondary structure predictions were made with two RNA folding programs, RNAsubopt, a program in the Vienna RNA Package, Version 1.7 (21,22), and mfold, Version 3.1 (23,24). These programs employ detailed empirically-based energy models to estimate the free energies of the non-pseudoknotted secondary structures that are formed by a specified ssRNA sequence. With RNAsubopt, it is possible to sample stochastically from the ensemble of secondary structures, with a sampling probability in proportion to each structure’s Boltzmann weight. Thus, sampling a sufficient number of structures (we use 1000), and averaging the $D$-values for this set, gives a close approximation to the ensemble-average predicted value of the end-to-end distance for that sequence. In earlier work (16) we demonstrated that the average properties of subsets of 1000 structures are not significantly different from those of the complete ensemble of structures. More generally, for any property $X$, its RNAsubopt-predicted ensemble-average value is calculated as $\langle X \rangle = \frac{1}{1000} \sum_{i=1}^{1000} X_i$, where $X_i$ is its value in the $i$th member of the stochastically-generated subset of the Boltzmann ensemble of secondary structures. In mfold, by contrast, an algorithm is used to generate a structurally diverse representation of the ensemble, rather than a thermally-representative average. We configured mfold to generate the 1000 lowest-energy structures from such a set, measured $D$ for each, and averaged them in proportion to their Boltzmann weights, to give an mfold-averaged $D$-value. For any property $X$, its mfold-predicted average value is $\langle X \rangle = \sum_i X_i e^{-\Delta G_i/RT} / \sum_i e^{-\Delta G_i/RT}$, with $\Delta G_i$ the free energy of the $i$th secondary structure relative to the MFE for that sequence.
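The two averaging schemes described above differ only in how the structures are weighted, as the following sketch shows. It assumes the per-structure property values (for example the 5′–3′ distances) have already been extracted from the folding output; the function names are hypothetical, and the thermal energy is computed for an assumed folding temperature of 37 °C with free energies in kcal/mol.

```python
import math

# Thermal energy in kcal/mol; 37 C is an assumption of this sketch.
RT_KCAL_PER_MOL = 1.987e-3 * 310.15


def sample_mean(values: list[float]) -> float:
    """Plain mean, appropriate when structures were sampled with Boltzmann probability."""
    return sum(values) / len(values)


def boltzmann_average(values: list[float],
                      rel_free_energies: list[float],
                      rt: float = RT_KCAL_PER_MOL) -> float:
    """Boltzmann-weighted mean for an unweighted list of low-energy structures.

    rel_free_energies are free energies relative to the MFE structure (>= 0).
    """
    weights = [math.exp(-dg / rt) for dg in rel_free_energies]
    partition = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / partition


if __name__ == "__main__":
    # Toy numbers, not data from the paper.
    distances = [10.0, 14.0, 30.0]
    energies = [0.0, 0.5, 3.0]
    print(sample_mean(distances))
    print(boltzmann_average(distances, energies))
```

A stochastic sample already reflects the Boltzmann weights through how often each structure appears, so weighting it again would double count; an energy-sorted list of distinct structures needs the explicit weights.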

RESULTS

While there can be significant inter-taxon variation, the average composition of the viral RNAs in this study is ∼24% G, 22% C, 26% A and 28% U (16). With this ‘viral-like’ composition, we generated 2000 random sequences of lengths 50, 100, 200 and 400; 1000 of lengths 800 and 1500; 500 of lengths 2000, 2500, 3000 and 4000; 300 of lengths 5000, 6000 and 7000; and 1000 of length 8000. These sequences were folded with RNAsubopt. Figure 3 shows the mean $D$ and standard deviation for each length of RNA, and a regression line fitted to sequences of length 400 and greater. Except for the very short sequences, $D$ is ∼12, independent of sequence length; in addition, it is relatively insensitive to small changes in composition. That this $D$-value is identical to the estimate obtained above, through the theoretical calculation, is coincidental, because the latter is based on the somewhat approximate expression given in Equation (2) (the approximations are explained in Appendix 1). But it is nevertheless very striking, and highly significant, that the simple theory predicts a $D$-value that is of the correct magnitude and that is independent of length and sequence.

Figure 3.


Mean ensemble-averaged 5′–3′ distances, $D$, from Equation (1), for random and viral sequences. Standard deviations are shown with vertical bars. The small black points represent the 10 groups of viral sequences listed in Table 2. The large gray points represent the 14 different lengths of randomly-permuted RNAs (50–8000 nt), of viral-like composition, described in the text. The line is a least-squares fit to the $D$ values for random sequences with $N \geq 400$. The asymptotic value of $D$ for the random sequences is very close to the theoretically predicted one, $D \approx 12$ [see Equation (2)].

Table 1 shows the results for 500 3000-nt ssRNAs of viral-like and uniform composition, as well as 500 ssRNAs that are the transcripts of consecutive 3000 bp sections on yeast (S. cerevisiae) chromosomes XI and XII. In these sets, the values of $D$, $f$ and $k$ (averaged over the 500 sequences) were 12–14, ∼60% and ∼4, respectively. The last column in the table lists the values of $D$ calculated according to Equation (2), and these results are seen to agree closely with those from the detailed numerical calculations (especially for the random sequences, as expected).

The viral taxa analyzed are listed in Table 2. All are non-enveloped ssRNA viruses and, except for the rod-shaped Tobamoviruses, have $T = 3$ icosahedral capsids. The Leviviridae infect bacteria, the Astroviridae and Caliciviridae are animal viruses, and the remainder infect plants. The Bromoviridae are, in addition, tripartite: the genome consists of three ssRNAs, divided among three separate capsids. The number of sequences analyzed in each case corresponds to the number of species considered.

From Figure 3 it can be seen that the values and standard deviations of $D$ for the viral RNAs are higher than, but overlap, those of the random sequences for all taxa except the Tymoviruses. This exception can be understood from the fact that small $D$-values are an inherent consequence of base pairing; all non-pathological secondary structures with a sufficiently high percentage of bases in pairs, $f$, will have a low $D$. The Tymoviruses show a relatively larger $D$ (although still small relative to sequence length) because they have a significantly smaller $f$.

We note that current RNA folding programs have been shown to be limited in their ability to correctly predict individual base pairs in long ssRNA sequences (28). Consistent with this, RNAsubopt and mfold (which use slightly different energy models to generate their ensembles of secondary structures, and different algorithms to sample from these ensembles), when given long sequences to fold, output structures that often show significant differences in the details of base pairing, as well as overall appearance. However, our simple theoretical model predicts that $D$ depends only on the values of $f$ and $k$, which we have previously found to be robust with respect to the details of the folding program used (16). Consequently, $D$ should likewise be robust to the details of the folding program, and thus insensitive to low-level inaccuracies in specific predictions of base pairing. To test this, we compared predictions of $D$ made using mfold and RNAsubopt. As expected, we found that the values do not differ significantly between the two folding programs, and can thus be considered broadly robust to the specific characteristics of the energy model used (Table 1).

There is currently no published experimental work that directly measures the 5′–3′ distance of large ($10^3$–$10^4$ nt) ssRNAs in their native state (i.e. not complexed with proteins). However, based on a combination of experimental and computational approaches, Filomatori et al. (4) have proposed a model for the secondary structure of the exterior loop of native dengue ssRNA. Their proposed loop has a $D$-value of 25, which is of the same magnitude as both the theoretical predictions in Table 1, and the numerical predictions in Table 2.

DISCUSSION

We have made two predictions in the current work, both of which can be tested experimentally. First, we have predicted with general theoretical arguments, and demonstrated with numerical computations involving the equilibrated secondary structures of a large number of ssRNAs of different lengths and sequences, that the distance between ends of an ssRNA (or ssDNA) should be ∼10–15 nt links. This corresponds to a 3D physical distance of a few nm, which is far smaller than the contour lengths of large ssRNA molecules. As mentioned earlier, a crude estimate of the 3D distance between ends may be obtained in terms of the root-mean-square (RMS) end-to-end distance associated with a flexible linear polymer defined by the string of covalent and H-bonded links shown in Figure 2. With an average link size of ∼3/4 nm and a D of 12, one obtains an RMS end-to-end distance of ∼3 nm. This is approximately an order of magnitude less than the 37 nm average distance between nucleotides (radius of gyration) that has been measured by small-angle X-ray scattering for a 6400 nt viral ssRNA (29). Our estimate of 3 nm could be confirmed by fluorescence resonance energy transfer (FRET) measurements, or still more directly by cryo-EM imaging of large ssRNA molecules whose ends have been labeled by small gold particles (for example, 1 nm particles conjugated to oligonucleotides that are complementary to the 5′- and 3′-ends).
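The arithmetic behind the ∼3 nm figure can be checked with a short sketch. It assumes the simplest ideal (freely jointed) chain relation, in which the RMS end-to-end distance equals the link size multiplied by the square root of the number of links; the function name `rms_end_to_end` is hypothetical, and the relation is a stand-in for the crude estimate described above rather than a statement of the exact expression used.

```python
import math


def rms_end_to_end(n_links: float, link_size_nm: float) -> float:
    """RMS end-to-end distance of an ideal chain: link size * sqrt(number of links)."""
    if n_links < 0 or link_size_nm < 0:
        raise ValueError("inputs must be non-negative")
    return link_size_nm * math.sqrt(n_links)


if __name__ == "__main__":
    # Values quoted in the text: D = 12 links, link size ~3/4 nm.
    print(round(rms_end_to_end(12, 0.75), 1))  # 2.6, i.e. roughly 3 nm
```

Because the distance grows only as the square root of D, even the larger viral values of 15–20 links would correspond to about 3 nm under the same assumption.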

Second, we have predicted that all the linearized ssRNAs obtained by making a single cut in a long circular ssRNA molecule should have secondary (and hence tertiary) structures that are essentially identical to that of the parent circular form. Accordingly, they should have the same size and shape. And because they necessarily have the same charge, they should show virtually indistinguishable band positions in native gels, even though the linear and circular forms can be easily distinguished in denaturing gels where the secondary structure needed to effectively circularize the linear molecule has been destroyed. Similarly, under native conditions, small-angle X-ray scattering experiments, cryo-EM, and measurements of diffusion coefficients/hydrodynamic radii should show no difference between the circular and linearized molecules. The only caveat here, as well as for the measurements of 5′–3′ distance described earlier, is that the secondary structures of the molecules be equilibrated, since this is explicitly assumed in the theoretical arguments leading to all of these predictions [for a critical discussion of the equilibration/renaturation (and the lack thereof) of ssRNA, see Uhlenbeck (30)].

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

US National Science Foundation (grant number CHE07-14411 to W.M.G.); the Israel Science Foundation (grant number 695/06 to A.B.-S.); the US–Israel Bi-National Science Foundation (grant number 2006-401 to A.B.-S.); The Netherlands Organization for Scientific Research, Rubicon grant (to P.P.); and the University of California, Los Angeles, a Dissertation Year Fellowship (to A.M.Y.). Funding for open access charge: Research grant of A.B.-S. (grant number ISF 695/06).

Conflict of interest statement. None declared.

ACKNOWLEDGMENTS

We thank Li Tai Fang and Charles M. Knobler for many helpful discussions.

APPENDIX 1: DERIVATION OF D

Consider a particular secondary structure Inline graphic of a given circular ssRNA molecule, containing Inline graphic nucleotides and with base composition Inline graphic. Let Inline graphic denote the number of Inline graphic-loops (i.e. loops composed of Inline graphic unpaired nucleotides and Inline graphic duplexes) in this structure. Each Inline graphic-loop can be cut through any of its Inline graphic covalent bonds, yielding open exterior loops of Inline graphic links. The average effective contour length Inline graphic resulting from this cutting procedure is

[Equation (A1)]

where the averages after the second equality are over all loops belonging to the particular structure. This follows from the fact that Inline graphic is the effective contour length of the exterior loop in a particular secondary structure, and Inline graphic is the statistical weight of Inline graphic-loops containing Inline graphic covalent bonds. Inline graphic, with Inline graphic denoting the fraction of Inline graphic-loops in this structure and Inline graphic denoting the total number of loops in this structure. The ‘marginal’ probability distribution Inline graphic is the fraction of loops containing Inline graphic unpaired nucleotides, regardless of the number of duplexes connected to these loops. Similarly, Inline graphic, etc. The sums over Inline graphic include all Inline graphic (Inline graphic corresponds to a bulge) yet we also note that, in the case of a hairpin (Inline graphic), energetic considerations generally imply Inline graphic. The sums over Inline graphic include all Inline graphic.
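The averaging procedure described above can be sketched numerically. Each loop is represented here by two numbers: the number of covalent bonds at which it can be cut, and the effective contour length of the exterior loop that results from such a cut. Because every cut site is taken to be equally likely, a loop's statistical weight is proportional to its number of cut sites. The function name `cut_averaged_length` and the example loop values are hypothetical, and the sketch deliberately does not commit to how the two numbers are obtained from the loop composition.

```python
def cut_averaged_length(loops):
    """Average exterior-loop length over all possible single cuts.

    loops: iterable of (cut_sites, exterior_links) pairs, one per loop.
    Each loop is weighted by its number of cut sites, so larger loops
    contribute more than they would in a plain per-loop mean.
    """
    loops = list(loops)
    total_sites = sum(sites for sites, _ in loops)
    if total_sites == 0:
        raise ValueError("no cut sites available")
    return sum(sites * links for sites, links in loops) / total_sites


if __name__ == "__main__":
    # Hypothetical structure with three loops (illustrative values only).
    example = [(4, 5), (6, 8), (10, 13)]
    print(cut_averaged_length(example))              # weighted by cut sites
    print(sum(d for _, d in example) / len(example))  # plain per-loop mean, for contrast
```

The cut-weighted average exceeds the plain per-loop mean whenever loop sizes vary, which is why second moments of the loop-size distributions enter the final expression.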

For long random sequences a simplified expression for Inline graphic [see Equation (2)], involving only Inline graphic, can be derived based on two reasonable approximations. The first is to assume there are no correlations between the distributions of unpaired and paired nucleotides in loops, i.e. Inline graphic, from which it follows that Inline graphic. Small deviations from this approximation may occur because, for hairpins, we generally have Inline graphic, whereas for other loops we have Inline graphic. The second approximation serves to relate Inline graphic to Inline graphic and Inline graphic to Inline graphic. Here we assume that the distributions Inline graphic and Inline graphic of, respectively, [the (1−fα)N] unpaired nucleotides and (Inline graphic) duplexes among the Inline graphic loops of structure Inline graphic, are random. These distributions (analogous to those of indistinguishable balls randomly distributed among boxes) are determined by maximizing the (entropy) functional Inline graphic (Inline graphic), subject to the normalization Inline graphic and conservation Inline graphic constraints. In this way we find Inline graphic, with a similar expression for Inline graphic. For concreteness and simplicity we set Inline graphic and Inline graphic for the minimum values of Inline graphic and Inline graphic, thus obtaining Inline graphic and Inline graphic. Similarly, Inline graphic, with the second equality following from the fact that, for all structures, Inline graphic. Equation (A1) now yields Equation (2) of the main text.
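The second approximation can be illustrated with a minimal sketch. It assumes the entropy functional has the usual Shannon form and relies on the standard result that, for integers at or above a minimum value with a fixed mean, the entropy-maximizing distribution is geometric. The function names `maxent_geometric` and `second_moment`, and the parameter name `n_min`, are hypothetical; the closed forms below are those of the geometric distribution and are offered as a plausible reading of the derivation, not as a transcription of it.

```python
def maxent_geometric(mean: float, n_min: int = 0):
    """Return (x, pmf) for P(n) = (1 - x) * x**(n - n_min), n >= n_min.

    x is fixed by requiring the distribution to have the given mean.
    """
    excess = mean - n_min
    if excess <= 0:
        raise ValueError("mean must exceed n_min")
    x = excess / (1.0 + excess)

    def pmf(n: int) -> float:
        return (1.0 - x) * x ** (n - n_min) if n >= n_min else 0.0

    return x, pmf


def second_moment(mean: float, n_min: int = 0) -> float:
    """Closed-form <n^2> of the shifted geometric distribution above."""
    e = mean - n_min
    return n_min**2 + 2 * n_min * e + e + 2 * e**2


if __name__ == "__main__":
    # Illustrative case: minimum value 1, mean 2.
    x, pmf = maxent_geometric(mean=2.0, n_min=1)
    numeric = sum(n * n * pmf(n) for n in range(1, 400))
    print(x, second_moment(2.0, n_min=1), round(numeric, 6))
```

In this form the second moment is determined entirely by the mean and the minimum value, which is what allows the averages in Equation (A1) to be expressed through first moments alone.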

REFERENCES

1. Corver J, Lenches E, Smith K, Robison RA, Sando T, Strauss EG, Strauss JH. Fine mapping of a cis-acting sequence element in yellow fever virus RNA that is required for RNA replication and cyclization. J. Virol. 2003;77:2265–2270. doi: 10.1128/JVI.77.3.2265-2270.2003.
2. Hsu M-T, Parvin JD, Gupta S, Krystal M, Palese P. Genomic RNAs of influenza viruses are held in a circular conformation in virions and in infected cells by a terminal panhandle. Proc. Natl Acad. Sci. USA. 1987;84:8140–8144. doi: 10.1073/pnas.84.22.8140.
3. Frey TK, Gard DL, Strauss JH. Biophysical studies of circle formation by sindbis virus 49S RNA. J. Mol. Biol. 1979;132:1–18. doi: 10.1016/0022-2836(79)90493-5.
4. Filomatori CV, Lodeiro MF, Alvarez DE, Samsa MM, Pietrasanta L, Gamarnik AV. A 5′ RNA element promotes dengue virus RNA synthesis on a circular genome. Genes Dev. 2006;20:2238–2249. doi: 10.1101/gad.1444206.
5. Ooms M, Abbink TEM, Pham C, Berkhout B. Circularization of the HIV-1 RNA genome. Nucleic Acids Res. 2007;35:5253–5261. doi: 10.1093/nar/gkm564.
6. Gallie DR. The cap and poly(A) tail function synergistically to regulate mRNA translational efficiency. Genes Dev. 1991;5:2108–2116. doi: 10.1101/gad.5.11.2108.
7. Kneller ELP, Rakotondrafara AM, Miller WA. Cap-independent translation of plant viral RNAs. Virus Res. 2006;119:63–75. doi: 10.1016/j.virusres.2005.10.010.
8. Miller WA, White KA. Long-distance RNA-RNA interactions in plant virus gene expression and replication. Annu. Rev. Phytopathol. 2006;44:447–467. doi: 10.1146/annurev.phyto.44.070505.143353.
9. Karetnikov A, Lehto K. Translation mechanisms involving long-distance base pairing interactions between the 5′ and 3′ non-translated regions and internal ribosomal entry are conserved for both genomic RNAs of blackcurrant reversion nepovirus. Virology. 2008;371:292–308. doi: 10.1016/j.virol.2007.10.003.
10. Fabian MR, White KA. 5′–3′ RNA-RNA interaction facilitates cap- and poly(A) tail-independent translation of tomato bushy stunt virus mRNA: a potential common mechanism for Tombusviridae. J. Biol. Chem. 2004;279:28862–28872. doi: 10.1074/jbc.M401272200.
11. Cloutier TE, Widom J. DNA twisting flexibility and the formation of sharply looped protein-DNA complexes. Proc. Natl Acad. Sci. USA. 2005;102:3645–3650. doi: 10.1073/pnas.0409059102.
12. Grosberg AY, Khokhlov AR. Statistical Physics of Macromolecules. New York: AIP Press; 1994.
13. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. 1981;9:133–148. doi: 10.1093/nar/9.1.133.
14. Zuker M, Sankoff D. RNA secondary structures and their prediction. Bull. Math. Biol. 1984;46:591–621.
15. Fontana W, Konings DAM, Stadler PF, Schuster P. Statistics of RNA secondary structures. Biopolymers. 1993;33:1389–1404. doi: 10.1002/bip.360330909.
16. Yoffe AM, Prinsen P, Gopal A, Knobler CM, Gelbart WM, Ben-Shaul A. Predicting the sizes of large RNA molecules. Proc. Natl Acad. Sci. USA. 2008;105:16153–16158. doi: 10.1073/pnas.0808089105.
17. Hofacker IL, Schuster P, Stadler PF. Combinatorics of RNA secondary structures. Discr. Appl. Math. 1998;88:207–237.
18. Clote P, Kranakis E, Krizanc D, Stacho L. Asymptotic expected number of base pairs in optimal secondary structure for random RNA using the Nussinov–Jacobson energy model. Discr. Appl. Math. 2007;155:759–787.
19. Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single stranded RNA. Proc. Natl Acad. Sci. USA. 1980;77:6309–6313. doi: 10.1073/pnas.77.11.6309.
20. de Gennes PG. Statistics of branching and hairpin helices for the dAT copolymer. Biopolymers. 1968;6:715–729. doi: 10.1002/bip.1968.360060508.
21. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 1994;125:167–188.
22. Wuchty S, Fontana W, Hofacker IL, Schuster P. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999;49:145–165. doi: 10.1002/(SICI)1097-0282(199902)49:2<145::AID-BIP4>3.0.CO;2-G.
23. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31:3406–3415. doi: 10.1093/nar/gkg595.
24. Mathews DH, Sabina J, Zuker M, Turner DH. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. doi: 10.1006/jmbi.1999.2700.
25. Wiese KC, Glen E, Vasudevan A. JViz.Rna: a Java tool for RNA secondary structure visualization. IEEE T. Nanobiosci. 2005;4:212–218. doi: 10.1109/tnb.2005.853646.
26. Waterman MS. Secondary structure of single-stranded nucleic acids. Adv. Math. Suppl. Stud. 1978;1:167–212.
27. Matsumoto M, Nishimura T. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM T Model. Comput. Sci. 1998;8:3–30.
28. Mathews DH. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA. 2004;10:1178–1190. doi: 10.1261/rna.7650904.
29. Muroga Y, Sano Y, Inoue H, Suzuki K, Miyata T, Hiyoshi T, Yokota K, Watanabe Y, Liu X, Ichikawa S, et al. Small angle X-ray scattering studies on local structure of tobacco mosaic virus RNA in solution. Biophys. Chem. 2000;83:197–209. doi: 10.1016/s0301-4622(99)00141-6.
30. Uhlenbeck OC. Keeping RNA happy. RNA. 1995;1:4–6.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press