Abstract
Recent studies have shown that RNA structural motifs play essential roles in RNA folding and interaction with other molecules. Computational identification and analysis of RNA structural motifs remains a challenging task. Existing motif identification methods based on 3D structure may not properly compare motifs with high structural variations. Other structural motif identification methods consider only nested canonical base-pairing structures and cannot be used to identify complex RNA structural motifs that often consist of various non-canonical base pairs due to uncommon hydrogen bond interactions. In this article, we present a novel RNA structural alignment method for RNA structural motif identification, RNAMotifScan, which takes into consideration the isosteric (both canonical and non-canonical) base pairs and multi-pairings in RNA structural motifs. The utility and accuracy of RNAMotifScan is demonstrated by searching for kink-turn, C-loop, sarcin-ricin, reverse kink-turn and E-loop motifs against a 23S rRNA (PDBid: 1S72), which is well characterized for the occurrences of these motifs. Finally, we search these motifs against the RNA structures in the entire Protein Data Bank and the abundances of them are estimated. RNAMotifScan is freely available at our supplementary website (http://genome.ucf.edu/RNAMotifScan).
INTRODUCTION
Non-coding RNAs play a large variety of roles inside a cell, and recent discoveries point to many of their novel cellular functions (1,2). The variety of functionalities of non-coding RNA is determined by their complex structures. Unlike DNAs, which usually exhibit regular double helical structures due to the interactions with the complementary strands, RNAs are single strand molecules and can fold into irregular 3D structures. Among the complex structures, there exist conserved and recurrent segments whose arrangement, abundance and interaction largely determine the folding behaviors and functionalities of the structures. These segments, viewed as the `building blocks' of RNA architecture, are usually referred to as RNA structural motifs (3–5). The identification and analysis of these motifs have largely enriched our experiences in RNA studies.
The common approach for RNA structural motif identification is to represent the RNA structural motifs by different 3D properties (i.e. torsion angles or atomic distances) of the key nucleotides and then apply heuristics to searching for the topological occurrences of the motif in the 3D RNA structures [similar to the methods for 3D protein structure comparison (6)]. Computer program, such as PRIMOS (7) and COMPADRES (8), represents and searches certain backbone conformations using pseudotorsion angles. On the other hand, NASSAM encodes the 3D motif by using a graph to store pairwise atomic distances between the key nucleotides (9). To reduce the information contained in pairwise atomic distances, ARTS builds approximated anchors based on a set of seed points before detailed matching (10). Recent progress uses shape histograms, which are also computed from pairwise atomic distances, to summarize the structural motifs (11). This method has identified the occurrences of many structural motifs in ribosomal RNAs (12). Instead of considering solely torsion angles or atomic distances, FR3D, which searches for recurrent motifs considering a combination of geometric, symbolic and sequence information, achieves the most satisfying performance (13). Although the existing methods have successfully identified many occurrences of several known RNA structural motifs, most of them require the accurate 3D coordinates of the query motif, and thus are limited to structural motifs with rigid 3D topologies. However, it is known that many motifs exhibit certain structural variation, and thus cannot be well characterized by their 3D topologies (14). Therefore, the more conserved base-pairing pattern should be considered when searching for RNA structural motifs (15,16).
It was observed that many non-canonical base pairs in RNA structural motifs are isosteric and these base pairs can interchange with each other without affecting the overall RNA structure (17). Generally, a base pair should have three properties: (i) the two nucleotides interacting through hydrogen bonds; (ii) nucleotide edges participating in the interaction; and (iii) the relative orientation of the glycosidic bonds, which is either cis or trans. Each nucleotide has three edges that can interact with another nucleotide to form a base pair, namely the Watson–Crick edge (denoted as `WC' edge), Hoogsteen edge (denoted as `H' edge) and Sugar edge (denoted as `SE' edge). Given the three properties, it is sufficient to classify all base pairs into one of the isosteric groups (17). Modeling RNA structural motifs through non-canonical base pairs is theoretically sound and can largely reduce the complexity of 3D RNA motifs. First, the definition of isostericity serves as the foundation of relating tertiary structure with non-canonical base pairs. Second, some motifs are defined by their characterized non-canonical base-pairing patterns, instead of their 3D structures. Finally, modeling RNA structural motifs by their base-pairing pattern is easier to understand comparing to their atomic coordinates.
Djelloul and Denise (19) modeled the RNA structural motifs through graphical representation of these non-canonical base pairs. They extracted structural segments containing non-canonical base pairs from the annotated RNA 3D structure. By constructing clusters through the measurement of pairwise maximum isomorphic base-pairing cores, they characterized the recurrent base-pairing patterns among these structural segments. This method has led to the rediscovery of many structural motifs, which shows the potential power of utilization of non-canonical base pairs in modeling RNA structural motifs. However, this method is not optimized for structural motif identification, for the isomorphic condition is not suitable to identify the motifs that exhibit variations in non-canonical base pairs.
Therefore, well-developed algorithms for comparing the non-canonical base-pairing patterns between two RNA tertiary structural segments are in urgent demand. However, most existing methods model and compare RNA structures only through canonical base pairs. In a typical approach, free energy values are assigned to the canonical base pairs, and secondary structure with minimum free energy are computed to model the structure (20–24). Comparative genomics approaches aim at the identification of consensus canonical base pairs from a set of synthetic genomic sequences of multiple species that are previously aligned (25,26) or even unaligned (27–30). The RNA homolog search approaches attempt to find genome sequences that match a query RNA in sequence and a model secondary structure annotated with canonical base pairs (31–33). RNA canonical base pairs are also modeled into tree structures, and the edit distance between two tree structures is then computed (34,35). Recently, variants of Sankoff's algorithm (36) are also used to compare the canonical base pairs between two RNA structures (37,38).
These computational methods can be extended to comparing RNA structures with non-canonical base pairs. We need to address the following issues raised by the inclusion of non-canonical base pairs. Most importantly, the similarity between two non-canonical base pairs should be measured. The reason is that canonical base pairs can interchange with each other while maintaining the tertiary structure, but such possibility is not guaranteed for non-canonical base pairs as defined in the isosteric matrices. In addition, canonical base pairs are usually nested stacked in forming the A-form helical regions, while RNA structural motifs usually include many multi-pairings (interactions involves more than two nucleotide residues, i.e. base triples) and pseudoknots (crossing base pairs), see Figure 3. Therefore, non-canonical base pairs, multi-pairing and crossing base pairs must be handled in order to properly compare the structural motifs.
In this article we describe a new computational method for RNA structural motif identification that takes into account isosteric base pairs and multi-pairings. Given a query motif (represented by base-pairing patterns, see Figure 1b), our new method, called RNAMotifScan, attempts to identify all possible similar motifs from the target 3D structures. The core algorithm of RNAMotifScan finds the maximum common isosteric base pairs between two RNA structures, which runs in the time complexity of O(m2n2), where m and n are the number of base pairs in the query and target RNA structural segment. Since RNA structure motifs usually have only a small number of base pairs, our rigorous algorithm is extremely efficient. We tested RNAMotifScan by searching for five previously known motifs in RNA 3D structures from Protein Data Bank (PDB) (39) and compared the results with related publications as well as the SCOR database (40). It is shown that RNAMotifScan can identify many new motif occurrences that are previously unknown and has better performance in terms of both its speed and accuracy. The complete search results can be found at the supplementary website (http://genome.ucf.edu/RNAMotifScan).
MATERIALS AND METHODS
The query RNA structural motif base-pairing patterns are adopted from related publications (see ‘Data processing' Section). We concatenate two strands of the query RNA motif into one sequence for the alignment (see Figure 1c and d, there are two ways to concatenate the query and both are searched against the target). For the target RNA segments, we first use annotation software (see ‘Data processing' Section) to translate the RNA 3D coordinates into base-pairing patterns that contain sufficient information for isosteric group classification (i.e. pairing nucleotides, interacting edges, and relative glycosidic bond orientations). We then cut the annotated target RNA structure into many local (interactions within two strands, long-range interactions are ignored) RNA structural segments. Similarly, we concatenate two strands of the target RNA structural segments into one sequence. To identify RNA motif instances, we use a dynamic programming procedure to compute the similarity between the query RNA motif and all structural segments in the target RNA and report the significant hits.
The recursive functions of the alignment procedure need to address three major issues. First, the isostericity of the base pairs should be incorporated into the scoring functions such that only base pairs belong to the same isosteric group (17) can be matched to each other. Second, there are many multi-pairings occurring in the RNA structural motif and the target RNA, which is introduced by one nucleotide simultaneously paired with two or more other nucleotides. This can be observed since each nucleotide has three edges, thus the nucleotide is able to participate in at most three base pairs. We discuss the multi-pairing issue in ‘Base-pairing relations in RNA structured motifs' Section for the alignment procedure. Finally, both the query RNA motif and the target RNA segments may contain crossing base pairs.
We divide the alignment into two steps. We first align non-crossing base pairs in the query. (Crossing base pairs in query are removed temporarily and processed in the second step, while the crossing base pairs in target structure are retained.) We then try to reinsert the removed crossing base pairs based on the resulting alignment. Note that we select the minimum number of base pairs to be matched in the second step so that most of the base pairs can be aligned optimally in the first step. Because the structural motifs are likely to be well represented by its major part of nested base pairs, which are matched optimally, it should work in most practical cases. Also, users can select the base pairs to form the query motif for the first step searching.
Base-pairing relations in RNA structural motifs
Multi-pairings are not only frequently occurred, but also important in forming the RNA structural motifs. Here, we formally define the classifications and relations of base pairs including multi-pairings. We denote the indices of the left and right nucleotides of a base pair P as Pl,Pr. Generally, two base pairs, PA and PA′, may have one of the following relations: (i) PA and PA′ are interleaving; (ii) PA′ is enclosed with PA (denoted by PA′ < IPA); (iii) PA′ is juxtapose to PA and before PA (denoted by PA′ < pPA). Specifically, RNA structural motifs may contain multi-pairings. To handle these situations, we need to redefine the above definition. We extend the enclosing relation (<I) to three subgroups (Figure 2c): PA′ < I1PA (), PA′ < I2PA () and PA′ < I3PA (). We also extend the juxtaposing relation (<p) to two subgroups (Figure 2d): PA′ < p1PA () and PA′ < p2PA ().
Aligning two RNA structural motifs
We can use a dynamic programming algorithm to compute an optimal alignment between two RNA structural segments (27). There are three major contributions in this algorithm. First, the dynamic programming algorithm is guided by the partial order base pairs. Second, we consider non-canonical base pairs and their isostericity. Finally, we also allow non-crossing multi-pairings for the query and target structure.
Given an RNA structural motif A and a target RNA structural segment B with concatenated strands and m and n base pairs, respectively. Dummy base pairs were added between nucleotides A[0] and A[|A|+1] and between nucleotides B[0] and B[|B|+1]. Let and denote the two sets of base pairs, ordered according to increasing values of the right-most base. Define the following terms:
: the two nucleotides that form the base pair PA, given by and .
: the subsequence covered by the two nucleotides of the base pair PA excluding the two nucleotides themselves. In other words, the sequence .
: the term is defined if and only if PA′ is completely juxtaposing to the left of PA, as the loop region corresponding to .
The score of the optimal alignment between two RNA sequences consists of three parts: the score of matching base pairs, the score of matching paired bases, and the score of matching unpaired subsequences (including gaps). These scores are assigned with different weights (w1, w2 and w3, respectively) to distinguish the importance of them in building an RNA motif. Define the following terms:
(PA,PB): the matching score between two base pairs, PA and PB. The score is evaluated by the isostericity between two PA and PB. Base pairs within the same isostericity group are considered to have similar structural contribution to the motifs, and their matching is given higher bonus score. Non-isosteric matching is also allowed, but with less bonus score.
(A[i...j],B[k...l]): the matching score between two subsequences A[i...j] and B[k...l]. The score is evaluated through the optimal global alignment between the two subsequences.
Gap(k): the gap penalty of inserting/deleting a sequence of length k.
M[PA,PB]: the score of the optimal alignment of the regions enclosed by base pairs PA and PB, given that PA and PB are aligned to each other. Entry records the score of the optimal alignment between two structures A and B.
All the weights and scores defined above are fixed for all searches conducted in this work.
We can compute M[PA,PB] for all pairs in , which would take O(m2n2) time, where m and n are the number of base pairs in A and B, respectively. While many RNA structural alignment algorithms have biquadratic time complexity in terms of sequence length, our algorithm is relatively efficient since the number of base pairs in an RNA structure is much smaller than its length in sequence. In computing M[PA,PB], we have two choices for matching the subsequences inside PA and PB, as they could either form consensus hairpin loops (the terminal case) or there are base pairs to be matched inside (nested base pairs, internal loop or multi-loop). Therefore,
(1) |
Here, Ms[PA,PB] is the score of matching base pairs PA and PB based on both structure isostericity and sequence conservation, and thus can be computed by
(2) |
Mh[PA,PB] is the score of matching the loop regions of PA and PB, assuming that no consensus base pair is included by PA and PB. (For example, these regions form matched hairpin loops.) It can be computed by
(3) |
For the nested base pairs, internal-loop or multi-loop case, we need to define some additional terms. A sequence of base pairs P1,P2,…,Pk form a chain if . Ml[PA,PB] represents the matching score between PA and PB, given that there is a pair of chains included by PA and PB, which form the loop. Let (, respectively) denote base pairs enclosed by PA (PB, respectively), and ordered according to increasing values of the last coordinate. For two base pairs PA′, PA that PA′ < IPA, Loop(PA) is separated into three major regions: left region, Loop(PA′) and right region. We denote the left region as () and the right region as (). Then, we will have
(4) |
To enforce the matched base pairs have the same multi-pairing pattern, we must ensure that and PA, and PB are in the same enclosing subgroup ( or <I3, Figure 2). Here, is defined as the score of two chains of the optimal matching configurations that end at and , and begin at some , and . Denote if and there is no base pair such that . Then,
(5) |
The Gap means the corresponding sequences are matched to nothing (i.e. they are deleted). Similarly, to enforce the matched base pairs have the same multi-pairing constraint, we must ensure that and PA, and PB are in the same enclosing subgroup, and and , and are in the same juxtaposing subgroup.
P-value computation
To compute the P-value for the probability that an RNA motif hits a random substructure in the database, we used the non-parametric Chebyshev's inequality. In future research, we will optimize these parameters by fitting the distribution of the overall alignment scores between pairs of RNA structures into a Gumbel-like distribution to get more accurate P-value. To obtain the mean and variance, the query is aligned against the background segments, which are generated by randomly picking base pairs from real RNA structures while maintaining the similar GC content, as well as frequencies of the interacting edges and glycosidic bonds orientations. We applied this approach on kink-turn motif, and observed Gumbel's distribution of the alignment scores (see supplementary website, http://genome.ucf.edu/RNAMotifScan). Since each motif has its own base-pairing patterns and degree of tolerance against base-pair variations, we suggest different P-value cutoffs for different motifs based on tested results (see Table 3 for the cutoffs). Additionally, false positive rates (FPRs) are computed through simulation and available on the supplementary website (http://genome.ucf.edu/RNAMotifScan).
Table 3.
Motif | P-value cutoff | PDB | NR PDB | SCOR |
---|---|---|---|---|
Kink-turn | 0.07 | 553 | 39 | 195 |
C-loop | 0.04 | 167 | 18 | – |
Sarcin–ricin | 0.02 | 633 | 46 | 107 |
Reverse kink-turn | 0.14 | 56 | 3 | – |
E-loop | 0.13 | 1356 | 148 | 37 |
C-loop and reverse kink-turn are not included in SCOR. Motifs characterized in SCOR were from the entire PDB released by October. 24, 2004. The non-redundant set (NR PDB) is constructed by removing entries with sequence identities >90%.
Data processing
Base-pair interactions of all RNA 3D structures from PDB (39) (released on August 2008) were first annotated by using MC-Annotate (41). RNAVIEW (42) generates similar results based on our experiments, and RNAMotifScan provides interfaces for both annotation tools. After annotation, 1445 RNA structures were generated from PDB (including incomplete RNA chains in the raw PDB file). Five RNA structural motifs were used as queries to test our method: the kink-turn, C-loop, sarcin–ricin, reverse kink-turn and E-loop motifs. Because they are well characterized, documented and important for many RNA folding behaviors or functionalities. The query base-pairing patterns for these motifs come from the following references: kink-turn (43), C-loop (14), sarcin–ricin (44), reverse kink-turn (4) and E-loop (14). The 2D diagrams for query base-pairing patterns of these motifs are shown in Figure 3. RNAMotifScan was implemented in ANSI C. All experiments were carried out on an Intel Xeon 2.66 GHz workstation. The tertiary structure figures were generated using PyMol (http://www.pymol.org).
RESULTS
To assess the performance of RNAMotifScan, we searched five RNA motifs against a 23S rRNA structure from Haloarcula marismortui (1S72, resolution 2.40 Å). We compared our results with three latest methods: FR3D (13), a de novo clustering method developed by Djelloul and Denise (19), and the shape histogram method developed by Apostolico et al. (11). Since the clustering method mainly aims at the de novo motif discovery, the method may miss some true instances. We also used RNAMotifScan to search the five motifs against the entire PDB for new motif occurrences.
Kink-turn
The kink-turn motif is an asymmetric internal loop serving as an important site for protein recognition and RNA tertiary interactions (45,46). The `kink' can be observed in the longer strand of the loop, which is stabilized by the two cross-strand stacking adenine residues. It brings together the two minor groove edges, and, consequently, produces a sharp turn of the two supporting helices (14,43).
RNAMotifScan has identified six local motifs (motifs involve two or less strands) following by one composite motif (motifs involve three or more strands) from 1S72 (Table 1). FR3D finds all these seven motifs but introducing several `related motifs' using the same query [see Table 5 of FR3D results (13)]. FR3D also retrieves two more composite motifs. (The reason is that FR3D produces target segment structure based on spacial frame instead of sequence order.) The current version of RNAMotifScan does not focus on identifying composite motifs, but this feature can be included in the future (see ‘Discussion' Section). The shape histogram method finds all the six local motifs, but missing all the composite motifs. The de novo clustering method successfully rediscovers the motif, however, it misses four out of the six local motifs and all composite motifs. The results suggest that RNAMotifScan has higher sensitivity than shape histogram method and de novo clustering method in identifying kink-turn motifs.
Table 1.
Ranking | Chain | Location | Score | P-value | FR3D | de novo Clustering | Shape Histogram |
---|---|---|---|---|---|---|---|
Kink-turn | |||||||
1 | 0 | 77-82/92-100 | 70.2 | 0.009 | * | * | * |
2 | 0 | 1211-1217/1146-1156 | 62.1 | 0.014 | * | * | |
3 | 0 | 936-941/1025-1034 | 55.8 | 0.022 | * | * | * |
4 | 0 | 1338-1343/1311-1319 | 54.7 | 0.024 | * | * | |
5 | 0 | 1586-1593/1601-1609 | 45.4 | 0.062 | (*) | * | |
6 | 0 | 244-250/259-267 | 44.4 | 0.072 | (*) | * | |
7 | 0 | 2903-2906/2845-2855 | 43.8 | 0.078 | (*) | ||
C-loop | |||||||
1 | 0 | 1436-1440/1424-1430 | 40.9 | 0.033 | – | * | – |
2 | 0 | 2760-2764/2716-2722 | 39.1 | 0.041 | – | * | – |
3 | 0 | 1939-1945/1892-1898 | 38.4 | 0.044 | – | – | |
4 | 0 | 1004-1009/957-964 | 34.4 | 0.081 | – | – | |
Sarcin–ricin | |||||||
1 | 0 | 211-215/225-228 | 42.8 | 0.007 | * | * | – |
2 | 0 | 1368-1372/2053-2056 | 42.8 | 0.007 | * | * | – |
3 | 0 | 2690-2694/2701-2704 | 42.8 | 0.007 | * | * | – |
4 | 9 | 76-80/102-105 | 42.0 | 0.007 | * | – | |
5 | 0 | 461-466/475-478 | 37.5 | 0.010 | * | * | – |
6 | 0 | 380-383/406-408 | 34.4 | 0.013 | * | – | |
7 | 0 | 951-955/1012-1016 | 33.4 | 0.015 | – | ||
8 | 0 | 173-177/159-162 | 29.8 | 0.022 | * | * | – |
9 | 0 | 2090-2094/2651-2654 | 26.2 | 0.037 | – | ||
10 | 0 | 1775-1779/1765-1768 | 25.5 | 0.042 | – | ||
11 | 0 | 1542-1545/1640-1643 | 21.0 | 0.117 | – | ||
12 | 0 | 585-590/568-572 | 20.8 | 0.126 | * | – | |
13 | 0 | 355-360/292-296 | 20.8 | 0.126 | * | – | |
Reverse kink-turn | |||||||
1 | 0 | 1661-1666/1520-1530 | 48.6 | 0.114 | – | * | – |
2 | 0 | 1530-1536/1649-1661 | 46.8 | 0.145 | – | * | – |
E-loop | |||||||
1 | 0 | 706-708/720-722 | 21.2 | 0.052 | – | * | |
2 | 0 | 1543-1545/1640-1642 | 20.6 | 0.061 | – | * | |
3 | 0 | 174-177/159-161 | 18.7 | 0.098 | – | * | |
4 | 0 | 663-666/680-683 | 18.6 | 0.100 | – | ||
5 | 0 | 586-590/568-571 | 18.0 | 0.120 | – | * | |
6 | 0 | 356-360/292-295 | 18.0 | 0.120 | – | * | |
7 | 0 | 2691-2694/2701-2703 | 17.8 | 0.130 | – | * | |
8 | 0 | 1369-1372/2053-2055 | 17.8 | 0.130 | – | * | |
9 | 0 | 463-466/475-477 | 17.8 | 0.130 | – | * | |
10 | 0 | 380-383/406-408 | 17.8 | 0.130 | – | * |
Symbol notations: `*' the motif occurrences are identified by the corresponding method; `(*)' motif occurrences rank below some `related motifs'; `-' the motif is not studied by the corresponding method. The bona fide motifs validated by visual inspection are indicated with bold typeface of their location. The underlined motifs are de novo found by RNAMotifScan (even they might be manually characterized before).
C-loop
The C-loop motif is an RNA–protein binding site, and characterized by the unique multi-pairings formed by its two cytosine residues (14). The two interleaving non-canonical base pairs from the two multi-pairings bring together the interacting nucleotides, leaving the unpaired adenine residue at the minor groove and fully accessible (47).
RNAMotifScan has identified three C-loop motifs in 1S72 (Table 1). The de novo clustering method can also classify the first two C-loop motifs. (FR3D and shape histogram methods were not used to search C-loop motifs. Because it is difficult for these 3D structure-based methods to identify motifs that are small and usually exhibit high structural variations, such as C-loops.) The first two C-loop motifs exhibit high conservation comparing to the query motif (isomorphic as defined in the de novo clustering method), such that they can be easily detected by the de novo clustering method. The fourth C-loop motif [supported by (43)] has one nucleotide inserted between the two multi-paired cytosine residues. Therefore, it cannot be found by the de novo clustering method but still can be detected by RNAMotifScan in which insertions (deletions) are taken into account. The results suggest that RNAMotifScan has higher sensitivity than the de novo clustering method. At the same time, we expect that our specificity can also be raised by carefully distinguishing the effects of different variations (see ‘Discussion' Section).
Sarcin–ricin
The sarcin–ricin motif in the ribosomal RNAs is involved in the interaction with elongation factors (48). This interaction can be inhibited while the motif is bounded and modified by ribotoxins such as α-sarcin (ribonuclease) and ricin (RNA N-glycosidase) (49). The base-pairing pattern is highly conserved in 23S–28S rRNA from large ribosomal subunit, producing an `S' shape bend in most of the sarcin–ricin motifs.
RNAMotifScan has identified nine known sarcin–ricin motifs, whereas eight were identified by FR3D and six were classified by the de novo clustering method. RNAMotifScan identified one new sarcin–ricin motif, which was also observed by St-Onge et al. (50). Three other motifs found by RNAMotifScan rank at low places in the results, showing a satisfactory specificity for our method (Table 1). Even though these instances show higher structural variation from the query structure, we suggest that they should be further inspected as they show interesting conservations in base-pairing pattern comparing to the known sarcin–ricin motifs.
Reverse kink-turn
The reverse kink-turn is also an asymmetric internal loop that produces sharp bend as the kink-turn motif, however, towards the opposite direction (4). Another difference is that the longer strand of the kink-turn motif makes a tight bend, while in the reverse kink-turn motif, the tight bend is observed in the shorter strand as the longer strand gradually turns to the major/deep groove (51).
The de novo clustering method suggests six reverse kink-turn occurrences. (FR3D and shape histogram method were not used to search reverse kink-turn motifs either.) We noticed that three of these six motifs given by clustering are false positives (2397–2399/2389–2391, 2307–2310/2298–2300 and 1132–1134/1228–1230), as they either come from the irregular pairing regions near hairpin loop regions instead of being the junction regions between two helical regions, or do not produce significant sharp turns. RNAMotifScan has identified two of the three true reverse kink-turn motifs (Table 1). The one motif missed is due to its higher structural variation. Even though RNAMotifScan may miss several occurrences, it has much higher specificity and thus more reliable is practical applications.
E-loop
The E-loop was originally defined as the symmetric internal loop region in the 5S rRNA that separates its helical regions IV and V (52,53). The motif can be decomposed into two isosteric submotifs, which are positioned with relative 180° rotation (44,53). The submotif is usually referred to as `bacterial E-loop', and its base-pairing pattern was summarized as a trans H/SE base pair, a trans WC/H or trans SE/H base pair, and a cis bifurcated or trans SE/H base pair by Leontis et al. (44). Since the isostericity related with bifurcated base pair is not defined, we consider only the trans SE/H as the third base pair in the query.
There are two E-loop motifs classified by the de novo clustering method and eight identified by the shape histogram method. The two sets of results show no overlap and the union of them gives totally 10 E-loop motifs. RNAMotifScan has successfully identified nine of them (Table 1), and one new E-loop occurrence. This new E-loop occurrence, as well as a segment of regular A-form helix, are superimposed with a well characterized E-loop motif (Figure 4). The superimposition of the new E-loop instance results much smaller RMSD than the superimposition of the A-form helix, indicating that this E-loop occurrence cannot be expected to find randomly. RNAMotifScan has missed one E-loop motif that has both high sequence and base-pairing variations. Note that E-loop motifs can tolerate higher variations comparing to other motifs. [They were clustered into three families using the de novo clustering method (19).] Therefore, the results generated by searching only one of its variants could be limited. However, RNAMotifScan outperforms both methods when given only one query, and the E-loop identification can be further optimized by including other variants of E-loop motifs as query.
3D Resolution affects identification accuracy
We observe that the identification results of RNAMotifScan is dependent on the quality of the annotation program, which turns out to be dependent on the resolution of the 3D RNA structure. To demonstrate this, we selected three PDB entries with different resolutions for the same 16S rRNA structure from Thermus thermophilus (PDBid: 2VQE, 1J5E and 1I95), and used RNAMotifScan to identify the five motifs in them. Only hits with P-value less than the defined cutoffs (Table 3) are counted. Since the RNA structure from 2VQE contains three RNA chains, while the other two structures contain only one RNA chain, we only consider their common RNA chain (chain A in the comparison). The results are shown in Table 2. In Table 2, we can find that MC-Annotate tends to annotate fewer base pairs in the low-resolution RNA structures. Among those missed base pairs, most of them are non-canonical base pairs, which are critical for the structural motif identification. Even if the numbers of annotated base pairs are comparable for two structures with different resolutions, their qualities differ. For example, 2VQE and 1J5E have almost the same number of annotated base pairs, but one kink-turn that can be identified in 2VQE is missed in 1J5E.
Table 2.
PDB ID | Resolution | Length | #bp | #Can. bp | #Non-can. bp | #KT | #CL | #SR | #RK | #EL |
---|---|---|---|---|---|---|---|---|---|---|
2VQE | 2.50 Å | 1522 | 766 | 433 | 333 | 3 | 0 | 2 | 0 | 6 |
1J5E | 3.05 Å | 1522 | 761 | 434 | 327 | 2 | 0 | 2 | 0 | 6 |
1I95 | 4.50 Å | 1514 | 699 | 422 | 277 | 1 | 0 | 0 | 0 | 3 |
The columns in the tables represent PDB codes of the RNA structures, the resolution, the length, the number of base pairs (bp) annotated by MC-Annotate, the number of annotated canonical base pairs (Can. bp), the number of annotated non-canonical base pairs (Non-can. bp), the number of kink-turn (KT), C-loop (CL), sarcin–ricin (SR), reverse kink-turn (RK) and E-loop (EL) being identified. All structures are Thermus thermophilus 16S rRNA structures. The P-value cutoffs are the same as those shown in Table 3.
Scanning PDB
Finally, we searched the entire PDB for the five query motifs. The running time for scanning PDB is 64 m35s for kink-turn, 74 m29s for C-loop, 51m49s for sarcin–ricin, 77 m59s for reverse kink-turn and 72 m55s for E-loop motif. The results are summarized in Table 3. The motifs identified by RNAMotifScan are several times more than the current known instances (P-value cutoffs are shown in Table 3, the estimated FPR is <0.01). Still, we expect the numbers are underestimated since our cutoffs are set to be rather stringent. Although the large difference between the identified motifs and the currently known ones may due to the fast growing of RNA structures deposited in PDB, we still find new RNA motif occurrences in non-ribosomal RNAs, such as riboswitches, ribozymes and protein–mRNA complexes. The complete results can be found at the supplementary website http://genome.ucf.edu/RNAMotif Scan.
To demonstrate the advantages of RNAMotifScan, we compared five query motifs (Figure 3) with five different newly identified motifs (Figure 5). For C-loop motif, we observed that the sequence identity is 66% between the C-loop query (Figure 3b) and the new identified C-loop motif (Figure 5b), which sequence-based search methods may miss. The sarcin–ricin motif (Figure 3c) and the E-loop motif (Figure 3e) consist of all non-canonical base pairs, such that they cannot be searched by methods that are restricted to canonical base pairs. The newly identified sarcin–ricin motif and E-loop motifs also have three isosteric base-pairing changes (Figure 5c and e). The newly identified kink-turn motif (Figure 5a) shows two base-pairing variations (trans SE-H to cis SE-SE, and trans SE-H to cis WC-WC), which would be missed by the strict base-pairing graph isomorphism search. More importantly, we found that the newly identified kink-turn (Figure 5a) and reverse kink-turn motifs (Figure 5d) show structural variations comparing to the query motifs. One nucleotide is inserted at the `kink' region of the newly identified kink-turn motif, resulting an `U' shape `kink' rather than the `V' shape `kink' in the query (Figure 6a). For the newly identified reverse kink-turn motif, the structural variation is observed at the longer strand of its junction between two helices. Two nucleotides are inserted at this region, relaxing the turn significantly (Figure 5d). At the same time, a sharp bend is created at this region (Figure 6b), in order to accommodate the insertions and maintain the proper structure of the motif.
DISCUSSION
The base pairs from the RNA 3D structures are extracted and classified by various annotation tools. The annotations of base pairs are produced based on the geometric constraints among atoms involving the hydrogen bond interactions. In another word, the accurate coordinates of atoms are critical for the classification of base pairs. Therefore, the quality of annotation results, and consequently the accuracy of RNAMotifScan, depends largely on the resolution of the RNA 3D structure (Table 2). We anticipate that with the advances of RNA structure determination techniques, more and more high-quality data can be produced and the RNA motif identification can be more reliable.
It is mentioned that FR3D is capable of discovering composite motifs, while RNAMotifScan mainly focuses on local motifs. However, RNAMotifScan can be easily extended to include RNA composite motifs. If the motif consists of n strands, there are in total n! combinations of orders that these strands can be concatenated. Theoretically, it is possible to include any number of strands with the compensation of running time. In practice, there is only a small number of strands in RNA structural motifs. Therefore, it is feasible to enumerate all possible strand concatenations. We plan to include this feature in the future versions of RNAMotifScan.
Currently, RNAMotifScan uses a scoring function that does not distinguish substitutions between different isosteric groups. Recently, Stombaugh et al. (54) studied the frequencies of non-canonical base pair substitution among different isosteric groups and proposed a more sophisticated scoring function. We plan to incorporate such scoring function into our method. Moreover, the scoring function should also be position dependent (similar as the position-specific scoring matrix). For example, the determination of C-loop motif relies on the two multi-paired cytosine residues. We should assign heavy penalty to the mutations on these nucleotides. Similarly, for E-loop motifs, we should give heavy weight to the conserved trans H/SE base pair according to the E-loop motif definition. With the incorporation of more sophisticated base pair substitution scoring function and position-dependent weights, we anticipate that RNAMotifScan will become much more accurate in identifying RNA structural motifs.
FUNDING
C.Z. and S.Z. are funded in part by the University of Central Florida In-House Research Grant (1048479). H.T. is funded in part by the METACyt Initiative at Indiana University (funded by Lilly Endowment, Inc.). Funding for open access charge: METACyt Initiative at Indiana University (funded by Lilly Endowment, Inc.).
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors thank François Major and Eric Westhof for helpful discussions and comments and the anonymous reviewers for their helpful criticism.
REFERENCES
- 1.Eddy S. Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. 2001;2:919–929. doi: 10.1038/35103511. [DOI] [PubMed] [Google Scholar]
- 2.Storz G. An expanding universe of noncoding RNAs. Science. 2002;296:1260–1263. doi: 10.1126/science.1072249. [DOI] [PubMed] [Google Scholar]
- 3.Hendrix D, Brenner S, Holbrook S. RNA structural motifs: building blocks of a modular biomolecule. Q. Rev. Biophys. 2005;38:221–243. doi: 10.1017/S0033583506004215. [DOI] [PubMed] [Google Scholar]
- 4.Leontis NB, Lescoute A, Westhof E. The building blocks and motifs of RNA architecture. Curr. Opin. Struct. Biol. 2006;16:279–287. doi: 10.1016/j.sbi.2006.05.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Moore P. Structural motifs in RNA. Annu. Rev. Biochem. 1999;68:287–300. doi: 10.1146/annurev.biochem.68.1.287. [DOI] [PubMed] [Google Scholar]
- 6.Alesker V, Nussinov R, Wolfson H. Detection of non-topological motifs in protein structures. Protein Eng. 1996;9:1103–1119. doi: 10.1093/protein/9.12.1103. [DOI] [PubMed] [Google Scholar]
- 7.Duarte CM, Wadley LM, Pyle AM. RNA structure comparison, motif search and discovery using a reduced representation of RNA conformational space. Nucleic Acids Res. 2003;31:4755–4761. doi: 10.1093/nar/gkg682. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wadley LM, Pyle AM. The identification of novel RNA structural motifs using COMPADRES: an automated approach to structural discovery. Nucleic Acids Res. 2004;32:6650–6659. doi: 10.1093/nar/gkh1002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Harrison AM, South DR, Willett P, Artymiuk PJ. Representation, searching and discovery of patterns of bases in complex RNA structures. J. Comput. Aided Mol. Des. 2003;17:537–549. doi: 10.1023/b:jcam.0000004603.15856.32. [DOI] [PubMed] [Google Scholar]
- 10.Dror O, Nussinov R, Wolfson H. ARTS: alignment of RNA tertiary structures. Bioinformatics. 2005;21(Suppl. 2):47–53. doi: 10.1093/bioinformatics/bti1108. [DOI] [PubMed] [Google Scholar]
- 11.Apostolico A, Ciriello G, Guerra C, Heitsch CE, Hsiao C, Williams LD. Finding 3D motifs in ribosomal RNA structures. Nucleic Acids Res. 2009;37:e29. doi: 10.1093/nar/gkn1044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Sargsyan K, Lim C. Arrangement of 3D structural motifs in ribosomal RNA. Nucleic Acids Res. 2010;38:3512–3522. doi: 10.1093/nar/gkq074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Sarver M, Zirbel C, Stombaugh J, Mokdad A, Leontis N. FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J. Math. Biol. 2008;56:215–252. doi: 10.1007/s00285-007-0110-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Leontis NB, Westhof E. Analysis of RNA motifs. Curr. Opin. Struct. Biol. 2003;13:300–308. doi: 10.1016/s0959-440x(03)00076-9. [DOI] [PubMed] [Google Scholar]
- 15.Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R. RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res. 2001;29:4724–4735. doi: 10.1093/nar/29.22.4724. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Parisien M, Cruz JA, Westhof E, Major F. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA. 2009;15:1875–1885. doi: 10.1261/rna.1700409. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Leontis N, Stombaugh J, Westhof E. The non-Watson-Crick base pairs and their associated isostericity matrices. Nucleic Acids Res. 2002;30:3497–3531. doi: 10.1093/nar/gkf481. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Leontis N, Westhof E. Geometric nomenclature and classification of RNA base pairs. RNA. 2001;7:499–512. doi: 10.1017/s1355838201002515. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Djelloul M, Denise A. Automated motif extraction and classification in RNA tertiary structures. RNA. 2008;14:2489–2497. doi: 10.1261/rna.1061108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Hofacker I. Vienna RNA secondary structure server. Nucleic Acids Res. 2003;31:3429–3431. doi: 10.1093/nar/gkg599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Tinoco I, Uhlenbeck OC, Levine MD. Estimation of secondary structure in ribonucleic acids. Nature. 1971;230:362–367. doi: 10.1038/230362a0. [DOI] [PubMed] [Google Scholar]
- 22.Waterman M. Secondary structure of single stranded nucleic acids. Adv. Math. Suppl. Stud. 1978;I:167–212. [Google Scholar]
- 23.Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31:3406–3415. doi: 10.1093/nar/gkg595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Zuker M, Sankoff D. RNA secondary structures and their prediction. Bull. Math. Biol. 1984;46:591–621. [Google Scholar]
- 25.Pedersen AG, Nielsen H. Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1997;5:226–233. [PubMed] [Google Scholar]
- 26.Washietl S, Hofacker I, Stadler P. Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA. 2005;102:2454–2459. doi: 10.1073/pnas.0409169102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Bafna V, Tang H, Zhang S. Consensus folding of unaligned RNA sequences revisited. J. Comput. Biol. 2006;13:283–295. doi: 10.1089/cmb.2006.13.283. [DOI] [PubMed] [Google Scholar]
- 28.Bouthinon D, Soldano H. A new method to predict the consensus secondary structure of a set of unaligned RNA sequences. Bioinformatics. 1999;15:785–798. doi: 10.1093/bioinformatics/15.10.785. [DOI] [PubMed] [Google Scholar]
- 29.Davydov E, Batzoglou S. A computational model for RNA multiple structural alignment. Theoret. Comput. Sci. 2006;368:205–216. [Google Scholar]
- 30.Gorodkin J, Heyer L, Stormo G. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res. 1997;25:3724–32. doi: 10.1093/nar/25.18.3724. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Klein R, Eddy S. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003;4:44. doi: 10.1186/1471-2105-4-44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Zhang S, Borovok I, Aharonowitz Y, Sharan R, Bafna V. A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements. Bioinformatics. 2006;22:e557–565. doi: 10.1093/bioinformatics/btl232. [DOI] [PubMed] [Google Scholar]
- 33.Zhang S, Hass B, Eskin E, Bafna V. Searching Genomes for non-coding RNA using FastR. IEEE/ACM Trans. Comput. Biol. Bioinf. 2005;2:366–379. doi: 10.1109/TCBB.2005.57. [DOI] [PubMed] [Google Scholar]
- 34.Höchsmann M, Töller T, Giegerich R, Kurtz S. Proc. IEEE Comput. Soc. Bioinform. Conf. 2003. Local similarity in RNA secondary structures; pp. 159–168. [PubMed] [Google Scholar]
- 35.Jiang T, Lin G, Ma B, Zhang K. A general edit distance between RNA structures. J. Mol. Biol. 2002;9:371–388. doi: 10.1089/10665270252935511. [DOI] [PubMed] [Google Scholar]
- 36.Sankoff D. Simulations solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 1985;45:810–825. [Google Scholar]
- 37.Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23:926–932. doi: 10.1093/bioinformatics/btm049. [DOI] [PubMed] [Google Scholar]
- 38.Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol. 2007;3:e65. doi: 10.1371/journal.pcbi.0030065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Tamura M, Hendrix D, Klosterman P, Schimmelman N, Brenner S, Holbrook S. SCOR: structural classification of RNA, version 2.0. Nucleic Acids Res. 2004;32:D182–184. doi: 10.1093/nar/gkh080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Gendron P, Lemieux S, Major F. Quantitative analysis of nucleic acid three-dimensional structures. J. Mol. Biol. 2001;308:919–936. doi: 10.1006/jmbi.2001.4626. [DOI] [PubMed] [Google Scholar]
- 42.Yang H, Jossinet F, Leontis N, Chen L, Westbrook J, Berman H, Westhof E. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003;31:3450–3460. doi: 10.1093/nar/gkg529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Lescoute A, Leontis N, Massire C, Westhof E. Recurrent structural RNA motifs, isostericity matrices and sequence alignments. Nucleic Acids Res. 2005;33:2395–2409. doi: 10.1093/nar/gki535. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Leontis NB, Stombaugh J, Westhof E. Motif prediction in ribosomal RNAs Lessons and prospects for automated motif prediction in homologous RNA molecules. Biochimie. 2002;84:961–973. doi: 10.1016/s0300-9084(02)01463-3. [DOI] [PubMed] [Google Scholar]
- 45.Vidovic I, Nottrott S, Hartmuth K, Luhrmann R, Ficner R. Crystal structure of the spliceosomal 15.5 kD protein bound to a U4 snRNA fragment. Mol. Cell. 2000;6:1331–1342. doi: 10.1016/s1097-2765(00)00131-3. [DOI] [PubMed] [Google Scholar]
- 46.Klein D, Schmeing T, Moore P, Steitz T. The kink-turn: a new RNA secondary structure motif. EMBO J. 2001;20:4214–4221. doi: 10.1093/emboj/20.15.4214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Torres-Larios A, Dock-Bregeon AC, Romby P, Rees B, Sankaranarayanan R, Caillet J, Springer M, Ehresmann C, Ehresmann B, Moras D. Structural basis of translational control by Escherichia coli threonyl tRNA synthetase. Nat. Struct. Biol. 2002;9:343–347. doi: 10.1038/nsb789. [DOI] [PubMed] [Google Scholar]
- 48.Szewczak AA, Moore PB, Chang YL, Wool IG. The conformation of the sarcin/ricin loop from 28S ribosomal RNA. Proc. Natl Acad. Sci. USA. 1993;90:9581–9585. doi: 10.1073/pnas.90.20.9581. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Spackova N, Sponer J. Molecular dynamics simulations of sarcin-ricin rRNA motif. Nucleic Acids Res. 2006;34:697–708. doi: 10.1093/nar/gkj470. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.St-Onge K, Thibault P, Hamel S, Major F. Modeling RNA tertiary structure motifs by graph-grammars. Nucleic Acids Res. 2007;35:1726–1736. doi: 10.1093/nar/gkm069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Strobel SA, Adams PL, Stahley MR, Wang J. RNA kink turns to the left and to the right. RNA. 2004;10:1852–1854. doi: 10.1261/rna.7141504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Correll CC, Freeborn B, Moore PB, Steitz TA. Metals, motifs, and recognition in the crystal structure of a 5S rRNA domain. Cell. 1997;91:705–712. doi: 10.1016/s0092-8674(00)80457-2. [DOI] [PubMed] [Google Scholar]
- 53.Leontis NB, Westhof E. The 5S rRNA loop E: chemical probing and phylogenetic data versus crystal structure. RNA. 1998;4:1134–1153. doi: 10.1017/s1355838298980566. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Stombaugh J, Zirbel CL, Westhof E, Leontis NB. Frequency and isostericity of RNA base pairs. Nucleic Acids Res. 2009;37:2294–2312. doi: 10.1093/nar/gkp011. [DOI] [PMC free article] [PubMed] [Google Scholar]