Abstract
As functional components in the three-dimensional (3D) conformation of an RNA, RNA structural motifs provide an easy way to associate molecular architectures with their biological mechanisms. In past years, many computational tools have been developed to search for motif instances by using the existing knowledge of well-studied families. Recently, with the rapidly increasing number of resolved RNA 3D structures, there is an urgent need to discover novel motifs from the newly available information. In this work, we classify all the loops in non-redundant RNA 3D structures to detect plausible RNA structural motif families by using a clustering pipeline. Compared with other clustering approaches, our method has two benefits. First, the underlying alignment algorithm is tolerant to variations in 3D structures. Second, sophisticated downstream analysis has been performed to ensure that the clusters are valid and can be easily applied to further research. The final clustering results contain many interesting new variants of known motif families, such as GNAA tetraloop, kink-turn, sarcin-ricin and T-loop. We have also discovered potential novel functional motifs conserved in ribosomal RNA, sgRNA, SRP RNA, riboswitch and ribozyme.
INTRODUCTION
Non-coding RNAs (ncRNAs) achieve their specific biological functions by folding into three-dimensional (3D) structures with many locally stable components. Among them, some highly abundant building blocks, called ‘RNA structural motifs’, play important roles that may determine the behavior of the molecules. For example, the kink-turn motifs are important binding sites for nine proteins in the bacterial 23S ribosomal RNAs (rRNAs) (1); the cleavage at sarcin-ricin motifs by toxic proteins may result in complete shutdown of protein synthesis (2). Therefore, the identification and understanding of these recurrent structural components are indispensable for the study of RNA molecules. Considering that the number of resolved RNA 3D structures has been increasing rapidly in recent years, thorough analysis of structural motifs is expected to extend our knowledge of the relationship between RNA structures and functions.
One major computational approach for studying RNA structural motifs is to search for homologous instances of known motifs by using comparative methods. Traditionally, the motifs are modeled with their 3D geometric features, such as backbone conformations or torsion angles. NASSAM (3) and PRIMOS (4) are typical tools that rely primarily on the 3D atomic coordinates. They perform well for some simple motifs, but may not work for complex ones, since the underlying computational methods are too rigid to identify flexible variations in structures. Besides 3D information, FR3D integrates pairwise interactions as constraints into the screening for RNA structural motifs (5). However, as the most critical characteristic of RNAs, the base–base interactions should be used directly as key factors in the assessment of structural discrepancy (6). Based on this idea, RNAMotifScan was proposed to search for new motif candidates that share non-canonical base–base interaction patterns with the query (7,8). The similarity between base pairs is measured with substitution scores computed from the isostericity matrix. Isostericity takes into account interacting edges, glycosidic bond orientations and C1′–C1′ distance to capture the geometric features of paired bases.
The benchmarking results show that RNAMotifScan outperforms other state-of-the-art RNA structural motif searching tools, especially for the instances with geometric variations caused by insertions or deletions.
An issue with searching tools is that they are based on the existing knowledge of RNA structural motifs, and thus cannot be used to detect new families. To tackle this problem, comparative methods are incorporated into clustering pipelines for the de novo discovery of conserved structural elements. One example is COMPADRES (9), which makes use of PRIMOS to categorize RNA structural motifs in the database of existing RNA 3D structures. Its performance is limited by the rigid comparison between loop regions, and the clustering results are difficult to apply to further research because the complex models only cover 3D geometric information. LENCS (longest extensible non-canonical substructure) adopts a much simpler model which defines the RNA structural motifs as graphs of nucleotides interconnected by base pairs and glycosidic bonds (10). Thus the structural similarity of two motifs can be evaluated by the maximum isomorphic subgraph. With this measurement, a hierarchical clustering tree is built, and the motifs with similar base-pairing patterns are categorized by cutting it with a universal threshold. LENCS has successfully identified several putative new motifs in three rRNAs without using any tertiary information. However, its sensitivity to potential structural variations is low, because only base pairs of the same type are allowed to be matched in the graphs. A recent approach to classifying RNA structural motifs takes into account all the hairpin and internal loops in the non-redundant (NR) RNA 3D structures (11). Based on FR3D, this pipeline aims at grouping together the loop regions conserved in 3D space with the help of pairwise interaction constraints. All the annotated motif instances and families are well organized in an online database named RNA 3D Motif Atlas. One issue with RNA 3D Motif Atlas is the rigid restriction on the 3D geometric discrepancy between cluster members. For example, the helix-end nucleotides of the query and target loops are required to be very close in 3D space after superimposition. As a result, it tends to categorize highly similar components into numerous small groups, and may lose insight into the possible structural variations in a motif family. RNA Bricks is an RNA structural motif database which also stores the external molecular environment of RNA structural motifs, such as contacts with other RNA motifs, proteins, metal ions and ligands (12). However, its classification of RNA structural motif families still depends on the RMSDs between atoms, which restricts the capacity to identify potential variations.
To address this issue, we developed a new clustering framework named RNAMSC for de novo RNA structural motif identification in rRNAs (13). To ensure high coverage of base-pairing information on the RNA sequences, the base-pairing annotations of two different tools, MC-Annotate (14) and RNAView (15), were combined. Then the non-canonical base pairs in the loops were compared according to their isostericity (16), and the statistically significant alignments were determined by using P-values inferred from simulated background data. After that, the conserved candidate pairs with low P-values were summarized into a graph, from which the strongly connected subgraphs were retrieved. The experimental results show that RNAMSC not only outperforms LENCS in the recovery of known motifs, but also discovers several novel motif families. In contrast to RNA 3D Motif Atlas, our approach adopts base-pairing information to measure structural similarity in the clustering. As a result, RNAMSC can detect potential motifs with higher structural flexibility.
Here we propose a new clustering pipeline to automatically detect novel RNA structural motifs by extending RNAMSC. First, the new pipeline is optimized for the large-scale inputs, such as the NR RNA structure dataset. Second, all the single-stranded regions in the RNA molecules, including the multi-way junctions, are considered in the classification. Third, the clustering results are post-processed to analyze their functionalities and relationships. By using this new approach, we have identified 191 motif families including both known and unknown motifs. Generally, the large clusters contain the known motifs in RNA 3D structures, such as GNRA tetraloop, T-loop, kink-turn and sarcin-ricin. The variations in some motif families which are separated from the majority can be retrieved based on checking their secondary and tertiary structural patterns in the downstream analysis. Furthermore, we also have discovered some novel motifs conserved in both rRNAs and non-rRNAs, such as single guide RNA (sgRNA) in Cas9 complex, Alu domain in the signal recognition particle RNA (SRP RNA), GlmS riboswitch and twister ribozyme.
MATERIALS AND METHODS
Data preparation
Our clustering approach is designed to use the existing knowledge of RNA 3D structures deposited in the PDB database (17). At the time of writing, there were over 3000 experimentally resolved macro-molecular structures containing RNAs. To avoid possible biases in the statistical evaluation of structural conservation, the NR list (of RNA-containing PDB structures) from the BGSU RNA group (18) was adopted. This dataset eliminated the redundancy both within a single PDB file and among multiple PDB files, while keeping sufficiently diverged homologous structures. The selected 876 PDB files (including 1307 RNA chains) at the 4.0 Å resolution threshold in the v1.89 NR list were downloaded.
All the plausible pairing interactions in the RNA 3D structures were identified by using MC-Annotate (14) and RNAView (15). Their annotation results were merged, and the conflicts were resolved by taking the MC-Annotate predictions. For each chain, the predicted cis Watson–Crick base pairs were retrieved to reveal the stacks in the RNA secondary structure. The pseudoknots in the structure were recognized by the program K2N (19) and then eliminated. In the pseudoknot-free secondary structure, the single-stranded regions were decomposed into hairpin loops, internal loops (including bulges) and multi-loops by the consecutively nested cis Watson–Crick base pairs (≥2). The loops without any base-pairing interaction were removed to refine the datasets. Given that some known structural motifs are closed by cis Watson–Crick base pairs, the helix ends were retained in the loops.
Loop alignment and clustering
All the motif candidates were grouped into three different datasets: HL (from hairpin loops), IL (from internal loops and bulges) and ML (from multi-loops). The HL dataset contained 1036 instances, the IL dataset contained 1868 instances and the ML dataset contained 2778 instances. In each dataset, an all-to-all alignment was performed by using RNAMotifScan. RNAMotifScan was developed for RNA structural motif searching, and it treats queries and targets differently in the computation. Thus, for any two candidates, each was aligned twice to its partner: as the query in the first alignment and as the target in the second alignment. The two corresponding Z-scores were computed from the alignment score distributions of the respective queries, and the score of the better alignment was assigned to the candidate pair as the numerical measurement of their structural similarity.
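The symmetric scoring step described above can be sketched as follows. This is a minimal illustration rather than the RNAMotifScan implementation: the function names `z_score` and `pair_similarity` and the toy background scores are hypothetical, and we assume that a Z-score is the standardized score (score minus mean, divided by standard deviation) against the query's background distribution, and that the 'better' alignment is the one with the higher Z-score.

```python
from statistics import mean, pstdev

def z_score(score, background):
    """Standardize a raw alignment score against a query's background scores."""
    mu = mean(background)
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma

def pair_similarity(score_ab, background_a, score_ba, background_b):
    """Similarity of loops A and B: the better of the two directional Z-scores.

    score_ab: raw score with A as the query and B as the target
    score_ba: raw score with B as the query and A as the target
    (Assumption: 'better' means the larger Z-score.)
    """
    return max(z_score(score_ab, background_a),
               z_score(score_ba, background_b))

# Toy numbers, made up purely for illustration
bg_a = [1.0, 2.0, 3.0, 4.0, 5.0]
bg_b = [2.0, 4.0, 6.0, 8.0, 10.0]
similarity = pair_similarity(6.0, bg_a, 9.0, bg_b)
```

Taking the maximum makes the similarity measure symmetric even though the underlying alignment treats the query and the target differently.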
After that, three weighted graphs were constructed from the alignment results for different datasets. In these graphs, the vertices were the loops in RNAs and the edges were labeled with the alignment Z-scores. Note that internal loops and multi-loops have multiple candidates due to the different ways of the concatenation of the loop segments (see Supplementary Figure S1). The maximum Z-score of all candidate alignments for two loops was chosen as the weight of the edge. A Z-score cutoff was set to determine whether an edge should be removed or not. The remaining edges represent the highly significant structural conservation between loops, and the strongly connected sub-graphs were identified with a CAST-like clique finding algorithm (20) (see Supplementary Figure S1 for more details of the pipeline).
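The graph construction and clustering steps above can be illustrated with a small sketch. The code is not the pipeline's implementation: `build_graph` and `greedy_dense_clusters` are hypothetical names, the edge rule `z >= cutoff` is an assumption about how the cutoff is applied, and the clustering routine is a simplified, add-only heuristic in the spirit of CAST rather than the published algorithm (20).

```python
from collections import defaultdict

def build_graph(candidate_scores, cutoff):
    """Keep one edge per loop pair, weighted by the maximum Z-score over all
    candidate alignments, and drop edges whose weight is below the cutoff."""
    best = {}
    for (u, v), z in candidate_scores:
        key = tuple(sorted((u, v)))
        best[key] = max(z, best.get(key, float("-inf")))
    adj = defaultdict(set)
    for (u, v), z in best.items():
        if z >= cutoff:  # assumption: edges at or above the cutoff are kept
            adj[u].add(v)
            adj[v].add(u)
    return adj

def greedy_dense_clusters(adj, vertices, min_affinity=0.5):
    """Simplified CAST-style heuristic (illustrative only): grow a cluster from
    the highest-degree unassigned vertex, repeatedly adding the vertex that is
    connected to the largest fraction of current members, and stop when that
    fraction drops below min_affinity."""
    unassigned = set(vertices)
    clusters = []
    while unassigned:
        seed = max(sorted(unassigned), key=lambda v: len(adj[v] & unassigned))
        cluster = {seed}
        unassigned.remove(seed)
        while unassigned:
            def affinity(v):
                return len(adj[v] & cluster) / len(cluster)
            best_v = max(sorted(unassigned), key=affinity)
            if affinity(best_v) < min_affinity:
                break
            cluster.add(best_v)
            unassigned.remove(best_v)
        clusters.append(cluster)
    return clusters

# Toy alignment results: ((loop, loop), Z-score); values are made up
scores = [(("a", "b"), 2.5), (("b", "a"), 1.0), (("a", "c"), 2.3),
          (("b", "c"), 2.4), (("c", "d"), 0.5), (("d", "e"), 2.6)]
graph = build_graph(scores, cutoff=2.2)
clusters = greedy_dense_clusters(graph, ["a", "b", "c", "d", "e"])
```

In this toy input the weak `c`-`d` edge is removed by the cutoff, so the loops separate into one group of three and one group of two.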
Two kinds of settings had to be chosen in this procedure: a set of parameters for RNAMotifScan in the loop alignment, and a Z-score cutoff for the clustering. Similar to the approach used in RNAMSC (13), a set of known motifs was used to help select the best possible clustering results. For RNAMotifScan, parameter sets were generated by combining the weights of sequence and structural similarity, (0.2, 0.8) or (0.4, 0.6); the gap start and extend penalties, (3, 2), (6, 2) or (6, 4); and the penalty for missing one base pair in the two inputs, 1–5. These parameter sets yielded 30 different pairwise alignment graphs for each loop dataset. For each of the alignment graphs, different Z-score cutoffs, ranging from 1.0 to 3.0 with a step size of 0.1, were applied. Therefore, there were 630 (30 × 21) different clustering results for each loop dataset (HL, IL or ML). The quality of the clustering results was measured by the performance (in terms of precision and sensitivity) in clustering the known motifs. For the HL and IL datasets, the known motifs in rRNAs (1S72 and 1J5E) were used for benchmarking. The clustering result selected for HL uses the following parameters, in the order listed above: 0.2, 0.8, 6, 4, 3 and 1.1. For IL, the selected clustering result uses the following parameters: 0.2, 0.8, 6, 2, 2 and 2.2. For the ML dataset, since there are not enough known motif instances in multi-loops to conduct a benchmarking, the clustering result was selected based on the same parameters as IL.
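The size of the parameter search can be checked by enumerating the grid. The sketch below only reproduces the counting; the variable names are illustrative, and the missing-base-pair penalty is assumed to take the integer values 1 to 5, which is what the stated total of 30 alignment settings implies.

```python
from itertools import product

weights = [(0.2, 0.8), (0.4, 0.6)]        # (sequence, structure) similarity weights
gap_penalties = [(3, 2), (6, 2), (6, 4)]  # (gap start, gap extend)
missing_bp_penalties = [1, 2, 3, 4, 5]    # assumed integer values in the range 1-5
z_cutoffs = [round(1.0 + 0.1 * i, 1) for i in range(21)]  # 1.0, 1.1, ..., 3.0

# One alignment graph per RNAMotifScan setting, one clustering per (graph, cutoff)
alignment_settings = list(product(weights, gap_penalties, missing_bp_penalties))
clustering_settings = list(product(alignment_settings, z_cutoffs))

print(len(alignment_settings))   # 30
print(len(clustering_settings))  # 630

# The settings reported as selected for the HL and IL datasets
hl_choice = (((0.2, 0.8), (6, 4), 3), 1.1)
il_choice = (((0.2, 0.8), (6, 2), 2), 2.2)
```

Both selected settings are members of this grid, which confirms that the six reported numbers follow the order: sequence weight, structure weight, gap start, gap extend, missing-base-pair penalty and Z-score cutoff.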
Motif family identification
We extracted the potentially conserved motif families from the clusters for the HL, IL and ML datasets by using both computational methods and downstream annotation. The clusters with size <3 were not considered in the further analysis. Each edge in a cluster must satisfy two requirements to be retained in the graph: first, the length of the loop region in the consensus structure derived from the alignment must be >2. Second, the root-mean-square deviation (RMSD) of corresponding C1′ atoms must be <4 Å. Removing all the invalid edges in the graph reduces the size of some clusters to fewer than three motifs. Those clusters are referred to as false classifications, and they are not considered in the further analysis. To distinguish the original clusters from the processed clusters, henceforth we will use ‘clusters’ and ‘refined clusters’ to refer to them, respectively. Furthermore, different motifs may be classified together if they share common patterns in their secondary structures (8). To identify possible sub-clusters, all the remaining loops were docked together by aligning the conserved base pairs. Then the members were separated if different 3D patterns were observed in the docking figure. Some manual analysis was done to fine-tune the clustering results in this step.
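The two edge requirements and the size filter can be expressed as a short sketch. The names and the tuple layout are hypothetical, and one detail is an assumption that the text does not spell out: a loop is taken to remain in the refined cluster only if at least one of its edges is still valid.

```python
MIN_CONSENSUS_LOOP_LEN = 2  # consensus loop length must be > 2
MAX_C1_RMSD = 4.0           # RMSD of corresponding C1' atoms must be < 4 angstroms
MIN_CLUSTER_SIZE = 3        # clusters with fewer than 3 loops are discarded

def edge_is_valid(consensus_loop_len, c1_rmsd):
    """Apply the two requirements an edge must meet to be retained."""
    return consensus_loop_len > MIN_CONSENSUS_LOOP_LEN and c1_rmsd < MAX_C1_RMSD

def refine_cluster(edges):
    """edges: iterable of (loop_u, loop_v, consensus_loop_len, c1_rmsd).

    Returns the set of loops in the refined cluster, or None when the cluster
    shrinks below the minimum size (a 'false classification').
    Assumption: a loop is kept if it has at least one valid edge.
    """
    kept = set()
    for u, v, loop_len, rmsd in edges:
        if edge_is_valid(loop_len, rmsd):
            kept.update((u, v))
    return kept if len(kept) >= MIN_CLUSTER_SIZE else None

# Toy clusters with made-up measurements
cluster_a = [("L1", "L2", 5, 1.2), ("L2", "L3", 4, 3.9),
             ("L3", "L4", 2, 1.0), ("L1", "L4", 6, 4.5)]
cluster_b = [("A", "B", 5, 1.0), ("B", "C", 3, 5.0)]

refined_a = refine_cluster(cluster_a)  # L4 loses both of its edges
refined_b = refine_cluster(cluster_b)  # only two loops remain, so it is discarded
```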
After that, the secondary and tertiary structural features of sub-clusters were extracted for function annotation, and the identical sub-clusters were investigated through comparing these features. Finally, we designed an ID system to refer to the clusters and sub-clusters. The cluster ID contains two fields: a loop type prefix and a cluster index suffix (e.g. IL1). Based on that, the sub-cluster ID is defined as cluster ID followed by the sub-cluster index, separated by an underscore character (e.g. IL1_1).
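For readers who process the supplementary tables programmatically, the ID convention can be parsed with a regular expression. This is a convenience sketch only; `parse_motif_id` is a hypothetical helper and is not part of the published pipeline.

```python
import re

# Loop type prefix (HL, IL or ML), cluster index, optional "_<sub-cluster index>"
ID_PATTERN = re.compile(r"^(HL|IL|ML)(\d+)(?:_(\d+))?$")

def parse_motif_id(motif_id):
    """Split a cluster or sub-cluster ID such as 'IL1' or 'IL1_1' into fields."""
    match = ID_PATTERN.match(motif_id)
    if match is None:
        raise ValueError(f"not a valid cluster or sub-cluster ID: {motif_id!r}")
    loop_type, cluster_index, sub_index = match.groups()
    return {
        "loop_type": loop_type,
        "cluster": int(cluster_index),
        "sub_cluster": int(sub_index) if sub_index is not None else None,
    }

cluster_fields = parse_motif_id("IL1")
sub_cluster_fields = parse_motif_id("IL1_1")
```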
RESULTS
Summary of the clustering results
We have identified 191 clusters whose sizes are >2, among them 68 clusters from the HL dataset containing 600 loops, 77 clusters from the IL dataset containing 727 loops, and 46 clusters from ML dataset containing 203 loops. All the clusters and their members are listed in Supplementary Table S1. After removing the false classification, the results have 57 refined HL clusters containing 473 loops, 57 refined IL clusters containing 463 loops and 42 refined ML clusters containing 165 loops. Some of the refined clusters were further divided into sub-clusters. There are 68, 91 and 46 sub-clusters for the HL, IL and ML datasets, respectively. Among them, 13 HL sub-clusters and 37 IL sub-clusters were annotated as known motif families. All the other sub-clusters, 55 from HL, 54 from IL and 46 from ML, potentially belong to novel motif families. All the sub-clusters and the corresponding annotation information are listed in Supplementary Table S2.
To evaluate the performance of the automatic pipeline, 10 well-studied motif families in the clustering results were analyzed. The clusters containing the maximum numbers of instances for these motifs were chosen as representatives. All the other motif instances not in the representatives were annotated as plausible variations. Table 1 summarizes the benchmarking results for the 10 motif families, including the prediction accuracy (based on motif instances in the representatives) and the numbers of variations (based on motif instances not in the representatives). The precision was computed by dividing the number of true motifs by the size of the representative cluster. If one loop consists of several different motifs, it is counted multiple times: as a true instance in its representative and as a variation of the other motifs. From Table 1, we can see that the clustering results of GNAA and GNGA motifs are very accurate. This is because they are highly conserved in both sequences and secondary structures. The related variations mainly come from the hairpin loops with multiple motifs, which causes them to be categorized into other representatives. T-loops are relatively hard to cluster together, due to their low sequence identity and simple base-pairing patterns. This indicates that the structural similarity weight should be set much greater than the sequence similarity weight when searching for T-loops in known RNA 3D structures. Both sarcin-ricin and kink-turn motifs have numerous variations, which are not classified with the majority of instances. One possible reason is that the binding activity may disturb the base-pairing interactions in them, and we will show several examples in the later sections. In addition, the precision of the kink-turn cluster (IL5) is relatively low because it contains several E-loops, whose secondary structure consensus is partially similar to the kink-turn’s (7). These two types of motifs are further divided into sub-families in the subsequent annotation procedure. Hook-turn has a unique base-pairing pattern, so it is relatively easy to identify. C-loop is relatively difficult to detect because crossing base pairs are the major components of the structural consensus. To compare the pseudoknot regions, RNAMotifScan needs to align the non-crossing base pairs first and use them as an anchor. Thanks to the proper parameter selection, our pipeline still achieves accurate classification for C-loop, whose accuracy is similar to that of the other families. The other three motifs, E-loop, tandem shear and reverse kink-turn, consist of continuous non-canonical base pairs. E-loop and tandem shear have similar 3D structures, so we mainly use their secondary structural features to distinguish them. Note that the previous accuracy analysis is based on the original clustering results. After the post-processing, the precisions of the sub-clusters for the benchmarking motifs are all 100%.
Table 1. The clustering results of 10 well-known motif families.
Motif name | Cluster ID | # of true instances | Cluster size | Precision (%) | # of variations |
---|---|---|---|---|---|
GNAA | HL1 | 85 | 87 | 98 | 5 |
GNGA | HL3 | 45 | 45 | 100 | 14 |
T-loop | HL4 | 29 | 31 | 94 | 55(6) |
Sarcin-ricin | IL3 | 47 | 56 | 84 | 18(14) |
Kink-turn | IL5 | 25 | 39 | 64 | 38 |
Hook-turn | IL6 | 26 | 31 | 84 | 0 |
C-loop | IL8 | 16 | 21 | 76 | 7 |
E-loop | IL9 | 16 | 21 | 76 | 7 |
Tandem shear | IL13 | 22 | 30 | 73 | 5 |
Reverse kink-turn | IL21 | 19 | 20 | 95 | 6 |
The numbers in the brackets show the variations detected from the datasets not containing the clusters.
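The precision column of Table 1 can be reproduced from the other two numeric columns. The short check below uses only the values in the table; the dictionary layout and variable names are illustrative.

```python
# (number of true instances, cluster size) taken from Table 1
table1 = {
    "GNAA": (85, 87),
    "GNGA": (45, 45),
    "T-loop": (29, 31),
    "Sarcin-ricin": (47, 56),
    "Kink-turn": (25, 39),
    "Hook-turn": (26, 31),
    "C-loop": (16, 21),
    "E-loop": (16, 21),
    "Tandem shear": (22, 30),
    "Reverse kink-turn": (19, 20),
}

# Precision (%) = true instances / size of the representative cluster
precision = {name: round(100 * true / size)
             for name, (true, size) in table1.items()}

for name, value in precision.items():
    print(f"{name}: {value}%")
```

The rounded values agree with the precision column as published, for example 98% for GNAA (85 of 87) and 64% for kink-turn (25 of 39).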
Besides the motifs in the table, we have discovered other functional ones in the clustering results. The first example is the well-known tetraloop receptor in group I intron (IL4_1 and IL22_1) (21). Some of them are used in the target molecules to maximize their crystallizability (22). The L1 protuberance of 50S rRNA and mRNA were also clustered together in IL18. It has already been proved that they have both similar 3D structures and binding activities (23). We have also detected the motifs that are conserved both in mitochondrial 16S rRNAs and bacterial 23S rRNAs (24). The identification of these known functional motifs indicates that the clustering results can be applied to further analysis for new motifs.
Novel instances of known motifs
Tetraloop
Tetraloops are the basic building blocks of RNA 3D structures. They are very important for thermodynamic stability and binding activity of the molecules (25,26). The most frequent two types of tetraloops are GNRA loops (27,28) and UUCG loops (29). GNRA loops can be further categorized into GNGA loops and GNAA loops. In our clustering results, the majority of GNAA, GNGA and UUCG motifs are in HL1, HL3 and HL6. Some other GNAAs and GNGAs co-exist with sarcin-ricin motifs in the loops. One instance of this motif module is shown in Figure 1A. This loop is from the region C3120-A3136 in the Homo sapiens mitochondrial 16S rRNA. It can be seen that the 3D structure of A3125-G3131 is highly conserved to a GNAA reference. We also found that the corresponding region in the Haloarcula marismortui 23S rRNA contains a GNGA motif (see HL35). Similar modules for UUCG have been also detected. One example is shown in Figure 1B, which is in the H. marismortui 23S rRNA. The ‘U-shape’ turn in this loop is docked with the blue UUCG tetraloop precisely in the 3D space. Based on the observation, we hypothesize that the combination of sarcin-ricin and tetraloop may be a very common module in RNA 3D structures.
T-loop
T-loop is a compact U-turn-like loop which was originally discovered in tRNA (30). After that, many T-loop instances have been identified in a variety of ncRNAs, ranging from rRNA to riboswitch (31). Our clustering results cover almost all the known T-loops in the hairpin loops. What’s more, we also found two new instances of T-loop in the internal loops (IL26). One of them is in the Thi-box (thiamine pyrophosphate sensing) riboswitch and known for the ligand 4-amino-5-hydroxymethyl-2-methylpyrimidine (32). The other one is in a T-box stem I RNA. Figure 2 shows its structure and the corresponding 3D docking to a T-loop in a tRNA. Note that both secondary structures consist of one trans S/H and one trans W/H base-pairing interactions. The difference is that in the tRNA the base pairs exist in a hairpin loop, while the two interactions in the T-box stem I RNA bend one segment of the internal loop to a U-shape turn. Considering the relatively large size of the twisted loop region, the third interaction at G38/G70 may be important to the stability of the entire loop. This T-loop also works with another T-loop (the corresponding homology is 4MGN_C:51-63 in HL8) in the same RNA to stack on tRNA elbow (33). The similar binding behavior is also found in RNase P and rRNA, so the study of this T-loop and its partner may provide useful information for searching more functional modules.
Kink-turn
Kink-turn is a motif in the internal loop region with an asymmetrical architecture (1). Its key feature is the tight kink in the backbone of the longer segment, which causes the axes of the two helical stems to differ by about 120°. In a real cellular environment, kink-turn may adopt a dynamic conformation (34). To maintain the specific 3D geometry, the motifs require the presence of metal ions (35) or binding with proteins (36). We detected two new kink-turn-like motif instances in the cluster IL37. Note that the loop in the 16S rRNA was detected in our previous work (13). With the newly discovered instance, we can analyze their conserved patterns and the related functions. Figure 3 shows their secondary and 3D structures. It can be seen that all base pairs can be matched. Compared with the base-pairing pattern of common kink-turns (8), these two instances form the kinks by three base pairs (G247/A282, A246/G278, A246/G281 in 1FJG and G18/A48, A17/A44, A17/G47 in 3RW6). In the common kink-turns, the Watson–Crick base pairs, C242/G284 in 1FJG and U11/G50 in 3RW6, should be followed by two continuous non-Watson–Crick base pairs. However, in these two loops, the two interactions are separated by the nucleotides marked in red. In Figure 3, these red nucleotides form the bulges in the shorter segment, which do not exist in the common kink-turns. Additionally, both red regions have long-range interactions. According to the results of MC-Annotate, the nucleotide U244 in 1FJG is paired with A893 in another loop region. On the other hand, the large bulge in 3RW6 containing the flipped-out nucleotides, A13, G14 and A15, is the binding site of the TAP protein and critical to the formation of the CTE–TAP complex (37). Based on the functional similarity, we suggest that the secondary structural pattern is important for the long-range interactions of the kink-turn motifs.
Sarcin-ricin
The sarcin-ricin motif was first found in the large ribosomal subunit as the attacking site of two protein toxins, ricin and α-sarcin. Their catalytic activity at this site impairs the binding between elongation factors and the ribosome, which may result in the cessation of protein synthesis (38). More sarcin-ricin instances with similar structural features have been discovered in other RNAs, including 5S and 16S rRNAs, by using computational methods (7,39). In our clustering results, the majority of sarcin-ricins were also detected in rRNAs (see the cluster IL3). Their secondary structures are almost the same as the widely used consensus (39), and the 3D structures are highly conserved with the known instances. On the other hand, we have also found some new functional loops that share structural features with sarcin-ricin. Here, we present two possible variations of sarcin-ricin whose secondary and 3D structures are shown in Figure 4. The first loop is in the cluster IL38. Based on the secondary structure, the S-shape turn in its 3D structure is mainly supported by two non-canonical base pairs (A415/G428 and A414/A430) and one outward stacking interaction (G428/A430). All three of these interactions are in the secondary structural consensus of sarcin-ricins (8). However, in common sarcin-ricins, the cis H/W base pair U429/A431 should be a cis H/S interaction between U429/A430. The possible reason for this difference is that the segment U427→C433 is longer than that in the consensus. This motif instance also contains a bulge at the segment G409→G416, which interacts with the S4 protein at G410, A411 and A412 (40). So, the two base-pairing interactions not in the consensus, A411/A430 and G413/G428, may be important for maintaining the long-range linkages. We hypothesize that this motif is a sarcin-ricin variation whose secondary and 3D structures are disturbed by the protein binding activity. The comparison of its secondary structure pattern with the sarcin-ricin consensus may therefore help us to detect potential RNA–protein interactions.
Another interesting loop is in the cluster IL62. We call it ‘double S-turns’ because there are two symmetrical S-shape turns in its 3D structure (see Figure 4C). In the existing model for ligand-induced folding of the TPP riboswitch, this loop is the TPP-binding pocket, which is critical for ligand recognition (32). The two nucleotides, U62 and U79, shape the pocket by protruding into solution and weakening the stacking effects on the adjacent bases. From Figure 4D, it can be seen that there are two stacking interactions, A61/C63 and G78/A80, that enforce the local stability around these two nucleotides. They also cause the large turns in the S-shape structures. On the other hand, the other two non-canonical base pairs tighten the two segments together. The analysis of this internal loop, as well as the motif in Figure 4B, indicates that the stacking effect between discontinuous bases is important evidence for detecting specific structural motifs, such as bulge and S-turn. In addition, this specific organization of interactions, including pairing interactions and stacking interactions, may be important for forming pocket-like 3D structures.
Novel motif families
Novel motif families in the hairpin loop regions
The first potential motif family mainly contains four different instances from HL2_1 and HL53_1. One of them is the loop 10 in yeast 18S rRNA (41). The other three are the ‘stem loop 1’ in the sgRNA of the Cas9–sgRNA–DNA ternary complex (42,43). The mutations of residues interacting with stem loop 1 result in decreased DNA cleavage activity of the CRISPR-Cas system, which indicates that the loop is essential for the formation of the functional Cas9–sgRNA complex. Figure 5 shows the high similarity between these two loops in terms of both geometric and base-pairing patterns. Except for C275/G281 in 3U5F and G54/C60 in 4OO8, all the other interacting bases are identical in the two loops. The continuity of the stacks is broken by U280 and U59 (labeled in red in Figure 5). Both of them flip out from the stems and cause the turns in the backbones of the two loops. The most important feature is that they have similar functional roles: U280 interacts with the L24e protein through the eB13 bridge in the hyper-rotated state (44,45); U59 in the sgRNA forms hydrogen bonds with Asn77 in the bridge helix of Cas9 (43). So, these loops are conserved not only in 3D structures but also in functions, which implies a potentially close relationship between the base-pairing pattern and the protein binding activity.
Novel motif families in the internal loop regions
IL16_1 contains four conserved regions in the 16S rRNAs and two instances of loop B in the 5S rRNAs. We choose two representatives and describe their 3D and secondary structures in Figure 6. From Figure 6A and B, we can see that the common base-pairing interactions in the two loops, which are shown in red, are highly conserved in 3D space. The corresponding base pairs in the secondary structures are from the same groups in the isostericity matrices (16): U375-A389 and G56-C26 belong to cis W/W I1; A374/C390 and A55/A27 belong to trans W/S I1; A373/G371 and A54/U52 belong to trans H/S I1. Therefore, they are co-varying mutations, and the interchange between them will maintain the 3D structures of the loops. Moreover, although the interactions of C372/A389 (trans W/H) and U53/C26 (cis H/W) are not from the same isosteric group, the geometric relationship of the bases in them is quite similar (46). Therefore, these two base pairs may also contribute to the 3D structural similarity of these two internal loops.
The major difference between the two motif instances comes from the regions U387-G388 and G21-A25. First, the lengths of the two regions are different, which suggests a potential insertion in the loop of 5S rRNA. A significant feature shared between them is the turn in the phosphate backbone (Figure 6A and B). However, the backbone of the internal loop in 5S rRNA (the blue one) turns with a slightly larger angle. The reason may be the trans H/S base-pairing interaction between G22 and U53. Although the 3D structures of the two regions are not completely the same, they may actually have similar molecular functions. Based on the results of MC-Annotate, the nucleotide G388 in the 16S rRNA, which flips out from the stem, interacts with C58. For the region in the 5S rRNA, a possible contact to helix 89 in 23S rRNA has been identified by a SELEX (systematic evolution of ligands by exponential enrichment) experiment (47). It is hypothesized that A23 is the possible binding site because its base twists further out than the backbone. Moreover, the base-pairing consensus we detected here may be critical for their interlinking functions.
Another possible novel functional motif is discovered in the cluster IL42. One instance in this cluster is from the 16S rRNA of Thermus thermophilus, while the other two are identical internal loops in the GlmS riboswitch of Bacillus anthracis. Riboswitches are metabolite-sensing RNAs that can directly control the expression of downstream genes (48). By binding to specific ligands, their structures are changed to terminate the transcription or hinder the translation. However, unlike other riboswitches, the GlmS riboswitch does not alternate its structure upon the binding of glucosamine-6-phosphate (GlcN6P) (49). Instead, the binding activity results in a cleavage on the GlmS mRNA which reduces the GlcN6P synthetase production greatly (50). So it is also called ‘GlmS ribozyme’. The internal loop studied here interlinks two helices, P4 and P4.1, in the GlmS riboswitch. Its secondary and 3D structures are aligned with the loop in 16S rRNA, and the results are shown in Figure 7. We can see that although both segments of the loop in Figure 7A are shorter than those of the loop in Figure 7B, the consensus marked by red color is highly conserved in sequences, base-pairing interactions and 3D structures. The ‘S-shape’ turns in the regions C1284-A1287 and U96-A98 are important common features of two loops too. In the GlmS riboswitch, the turn is supposed to pack obliquely into the minor groove of P2.1 helix, which is important for the GlcN6P binding (51). On the other hand, we also find that the flipped out nucleotide A1287 in the 16S rRNA also forms two interactions with A1353 and A1370. So, the bulge-like structures may be indicators for the long-range tertiary interactions. The discovered motif may also be critical for the stability of the large internal loops whose structures are disturbed by intra-molecular linkages.
Novel motif families in the multi-loop regions
The first potential novel family in multi-loops is obtained from the sub-cluster ML2_1. Ten members are the orthologous regions from 21S, 23S, 25S and 28S rRNAs, and the last one comes from the Alu domain of an SRP RNA (Bacillus subtilis). SRP is a highly diverse ribonucleoprotein complex existing in all three kingdoms of life (52). Its RNA can be divided into two functional domains, and one of them, the Alu domain, arrests protein biosynthesis by blocking the elongation factor entry site (53,54). By hindering translation in this way, SRP can prevent membrane proteins from being prematurely released from the ribosome. The multi-loop in the cluster is the one interlinking helix 1, helix 2 and helix 5a in the SRP RNA. Figure 8 compares its secondary and 3D structures with those of the loop in the 23S rRNA. Both loops have three segments, which are inter-connected by three highly conserved non-canonical base pairs: G1681/A1414 (trans S/H), A1414/A1682 (trans W/W) and A1682/U1696 (trans H/W) in 1S72; G62/A12 (trans S/H), A12/A64 (trans W/W) and A64/U101 (trans H/W) in 4WFL. The eight interacting nucleotides are marked in red in Figure 8A and B. Note that the 3D geometric patterns of the consensus are quite similar in the two loops, except for an insertion, A63, between G62 and A64 in 4WFL. The structural difference between the Alu domains of mammalian and bacterial SRP RNAs may explain the potential function of the motif. The G-A-A-U four-base platform observed in bacteria (B. subtilis) is absent from the Alu domain in eukaryota. Previous experiments have shown that the 5′ region of the human Alu domain is very flexible and that SRP9/14 proteins are required to stabilize the conformation and induce the binding to 50S rRNA (55). The bacterial Alu domain, in contrast, adopts a closed conformation directly with the help of the four-base platform. This evidence may suggest that the discovered motif is critical to the stabilization of the local structure that binds to proteins.
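The base-pairing consensus shared by the two multi-loops can be written down explicitly. The following minimal sketch encodes the six base pairs listed above and checks that, once residue numbering is ignored, both loops carry the same interaction pattern and the same G-A-A-U chain. The `BasePair` type and the `pattern` and `platform` helpers are hypothetical names introduced only for illustration; they are not part of the clustering pipeline or of any annotation tool.

```python
from typing import NamedTuple


class BasePair(NamedTuple):
    nt1: str          # first nucleotide, e.g. "G1681"
    nt2: str          # second nucleotide
    orientation: str  # "trans" or "cis"
    edges: str        # interacting edges, e.g. "S/H"


# Base pairs exactly as listed in the text (1S72: 23S rRNA; 4WFL: SRP RNA Alu domain).
LOOP_1S72 = [
    BasePair("G1681", "A1414", "trans", "S/H"),
    BasePair("A1414", "A1682", "trans", "W/W"),
    BasePair("A1682", "U1696", "trans", "H/W"),
]
LOOP_4WFL = [
    BasePair("G62", "A12", "trans", "S/H"),
    BasePair("A12", "A64", "trans", "W/W"),
    BasePair("A64", "U101", "trans", "H/W"),
]


def pattern(pairs):
    """Numbering-free signature: (base1, base2, orientation, edges) per pair."""
    return [(p.nt1[0], p.nt2[0], p.orientation, p.edges) for p in pairs]


def platform(pairs):
    """Follow the chain of pairs and return the bases visited, e.g. 'G-A-A-U'.

    Raises ValueError if a pair does not start where the previous one ended.
    """
    chain = [pairs[0].nt1]
    for p in pairs:
        if p.nt1 != chain[-1]:
            raise ValueError("pairs do not form a chain")
        chain.append(p.nt2)
    return "-".join(nt[0] for nt in chain)


same_pattern = pattern(LOOP_1S72) == pattern(LOOP_4WFL)
print(same_pattern, platform(LOOP_1S72), platform(LOOP_4WFL))
```

The comparison is deliberately coarse: it looks only at base identity and interaction type, so the A63 insertion in 4WFL, which takes part in none of the three pairs, does not affect the result.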
ML17_1 contains three conserved regions in 23S rRNAs and one instance in the env22 (type P1) twister ribozyme. Twister is a small self-cleaving ribozyme present in many species of bacteria and eukaryota (56). Further research shows that twister may play a role similar to that of the hammerhead ribozyme in biological systems. The instances of twister are categorized into three groups, type P1, type P3 and type P5, which are circular permutations of one another. The crystal structure of the twister used here comes from a type P1 instance. To compare it with the multi-loop in 23S rRNAs, we picked the one in 1S72 as a representative. Figure 9 shows the secondary structures of the two loops and the 3D superimposition of their extensions. One common feature is that both loops are linked by trans S/S base-pairing interactions (A1492/C1514 and A42/C14). Furthermore, the neighbors of the paired bases (G1512 and C1513 in 1S72, G12 and C13 in 4RGE) form pseudoknots with nucleotides outside the multi-loops (C1450 and G1449 in 1S72, C37 and G36 in 4RGE). Their 3D structures are highly conserved (red in Figure 9). Note that the orange multi-loop in 1S72 has four segments, while the blue one has only three. The extension of one segment in the orange loop (C1455→G1453) is involved in the formation of the pseudoknot. Although it is not a direct substitution, the blue loop has one sharply bent segment (G25→A26) that makes a 180° turn to support the interaction. This interesting case, in which loops with completely different secondary structures have highly similar 3D structures, may suggest that the underlying tertiary structural pattern is very important.
We also extend the 3D docking to the P2 and P4 helices of the twister ribozyme to study its local structural similarity with the 23S rRNA. Figure 9A shows that the two RNAs are quite conserved in these 40-nt regions. The self-cleavage sites in the twister, dU5 and A6, are highlighted in green. During the transcription, guanosine and Mg2+ are coordinated to the non-bridging phosphate oxygen at the U-A step for cleavage catalysis and structural integrity. The corresponding nucleotides in 1S72, U1505 and U1506 (green), share a similar splayed-apart conformation with the cleavage sites in the twister ribozyme. Given so many common features, these two regions should be studied further by experiment to confirm their functional correlation.
DISCUSSION AND CONCLUSION
In this paper, we study the RNA structural motifs in non-redundant (NR) RNA 3D structures by using a de novo clustering approach. The single-stranded regions in the corresponding secondary structures were extracted and categorized into hairpin loops, internal loops and multi-loops. The base-pairing patterns in loops of the same type were compared by RNAMotifScan, and the significant conservations were then assembled into a graph. The densely connected sub-graphs were retrieved to form clusters whose members share common secondary structural features. In each cluster, the alignments were evaluated and the loops not close to any others in 3D space were removed. The remaining loops in the clusters were further analyzed and classified into different sub-clusters if their 3D structures could be distinguished by critical conformations. Finally, we tried to detect homologous sub-clusters in different clusters by measuring the similarity of their secondary and 3D structural patterns. The clustering results for the known motifs indicate the high prediction accuracy of this new pipeline. Some interesting instances, which not only maintain the key features of known motifs but also exhibit specific structural variations, were found in the downstream analysis. We also identified numerous novel motif families, even in the multi-loop regions.
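The graph step summarized above can be illustrated with a small, self-contained sketch. It assumes that pairwise loop comparisons are already available as a dictionary of similarity scores; the loop names, score values, threshold and both function names are hypothetical, and the greedy clique-growing heuristic is a simplified stand-in for dense sub-graph retrieval, not the procedure used in this work.

```python
def build_graph(scores, threshold):
    """Connect two loops when their pairwise score reaches the threshold."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_cliques(graph, min_size=2):
    """Greedy stand-in for dense sub-graph retrieval.

    Repeatedly seeds a clique at the unassigned node with the most unassigned
    neighbours, then adds candidates that are adjacent to every current member.
    """
    unassigned = set(graph)
    clusters = []
    while unassigned:
        seed = max(sorted(unassigned), key=lambda n: len(graph[n] & unassigned))
        clique = {seed}
        candidates = graph[seed] & unassigned
        while candidates:
            nxt = max(sorted(candidates), key=lambda n: len(graph[n] & candidates))
            clique.add(nxt)
            candidates &= graph[nxt]  # keep only nodes adjacent to all members
        unassigned -= clique
        if len(clique) >= min_size:
            clusters.append(sorted(clique))
    return clusters


# Hypothetical similarity scores between five loops (illustrative values only).
scores = {
    ("loopA", "loopB"): 0.90, ("loopA", "loopC"): 0.80, ("loopB", "loopC"): 0.85,
    ("loopC", "loopD"): 0.70, ("loopD", "loopE"): 0.90, ("loopA", "loopE"): 0.10,
}
clusters = greedy_cliques(build_graph(scores, threshold=0.6))
print(clusters)
```

With these made-up scores, loopC is similar to loopD but loopD is not similar to loopA or loopB, so loopD is left out of the first cluster and grouped with loopE instead. The later steps of the pipeline (removing loops that are isolated in 3D space and splitting clusters into sub-clusters) are not modeled here.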
The in-depth investigation of the clusters provides directions for further research. First, RNA structural motifs may work together as a ‘module’, such as the hairpin loops containing sarcin-ricin motifs and tetraloops (see Figure 1), and the two T-loops in the T-box stem I RNA (see Figure 2). However, existing search tools focus only on detecting single motifs in isolation. A new tool for discovering motif modules may therefore provide essential evidence of the relationships among RNA structural motifs, which is important for the study of RNA structures and their functions. Another problem is how to use base-pairing interactions to infer potential binding activities between RNAs and other molecules. The disturbed secondary structures of the kink-turn and sarcin-ricin variations (see Figures 3 and 4) suggest that they may be indicators of long-range linkages. Furthermore, the affected base pairs have specific patterns that can be easily integrated into computational methods. This approach should be more accurate than other methods based on indirect measurements, such as the distances between atoms.
DATA AVAILABILITY
The clustering results, along with 3D figures of each motif instance, are publicly available at http://genome.ucf.edu/RNAMotifClustersNAR2017.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
National Institute of General Medical Sciences of the National Institutes of Health (NIH NIGMS) [R01GM102515]. Funding for open access charge: NIH NIGMS [R01GM102515].
Conflict of interest statement. None declared.
REFERENCES
1. Klein D.J., Schmeing T.M., Moore P.B., Steitz T.A. The kink-turn: a new RNA secondary structure motif. EMBO J. 2001; 20:4214–4221.
2. Garcia-Ortega L., Alvarez-Garcia E., Gavilanes J.G., Martinez-del Pozo A., Joseph S. Cleavage of the sarcin-ricin loop of 23S rRNA differentially affects EF-G and EF-Tu binding. Nucleic Acids Res. 2010; 38:4108–4119.
3. Harrison A.M., South D.R., Willett P., Artymiuk P.J. Representation, searching and discovery of patterns of bases in complex RNA structures. J. Comput. Aided Mol. Des. 2003; 17:537–549.
4. Duarte C.M., Wadley L.M., Pyle A.M. RNA structure comparison, motif search and discovery using a reduced representation of RNA conformational space. Nucleic Acids Res. 2003; 31:4755–4761.
5. Sarver M., Zirbel C.L., Stombaugh J., Mokdad A., Leontis N.B. FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J. Math. Biol. 2008; 56:215–252.
6. Parisien M., Cruz J.A., Westhof E., Major F. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA. 2009; 15:1875–1885.
7. Zhong C., Tang H., Zhang S. RNAMotifScan: automatic identification of RNA structural motifs using secondary structural alignment. Nucleic Acids Res. 2010; 38:e176.
8. Zhong C., Zhang S. RNAMotifScanX: a graph alignment approach for RNA structural motif identification. RNA. 2015; 21:333–346.
9. Wadley L.M., Pyle A.M. The identification of novel RNA structural motifs using COMPADRES: an automated approach to structural discovery. Nucleic Acids Res. 2004; 32:6650–6659.
10. Djelloul M., Denise A. Automated motif extraction and classification in RNA tertiary structures. RNA. 2008; 14:2489–2497.
11. Petrov A.I., Zirbel C.L., Leontis N.B. Automated classification of RNA 3D motifs and the RNA 3D Motif Atlas. RNA. 2013; 19:1327–1340.
12. Chojnowski G., Walen T., Bujnicki J.M. RNA Bricks–a database of RNA 3D motifs and their interactions. Nucleic Acids Res. 2014; 42:D123–D131.
13. Zhong C., Zhang S. Clustering RNA structural motifs in ribosomal RNAs using secondary structural alignment. Nucleic Acids Res. 2012; 40:1307–1317.
14. Lemieux S., Major F. RNA canonical and non-canonical base pairing types: a recognition method and complete repertoire. Nucleic Acids Res. 2002; 30:4250–4263.
15. Yang H., Jossinet F., Leontis N., Chen L., Westbrook J., Berman H., Westhof E. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003; 31:3450–3460.
16. Leontis N.B., Stombaugh J., Westhof E. The non-Watson-Crick base pairs and their associated isostericity matrices. Nucleic Acids Res. 2002; 30:3497–3531.
17. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. The Protein Data Bank. Nucleic Acids Res. 2000; 28:235–242.
18. Leontis N., Zirbel C. Nonredundant 3D structure datasets for RNA knowledge extraction and benchmarking. In: Leontis N., Westhof E. (eds). RNA 3D Structure Analysis and Prediction. Nucleic Acids and Molecular Biology, Vol. 27. Berlin Heidelberg: Springer; 2012; 281–298.
19. Smit S., Rother K., Heringa J., Knight R. From knotted to nested RNA structures: a variety of computational methods for pseudoknot removal. RNA. 2008; 14:410–416.
20. Ben-Dor A., Shamir R., Yakhini Z. Clustering gene expression patterns. J. Comput. Biol. 1999; 6:281–297.
21. Adams P.L., Stahley M.R., Kosek A.B., Wang J., Strobel S.A. Crystal structure of a self-splicing group I intron with both exons. Nature. 2004; 430:45–50.
22. Ferre-D’Amare A.R., Zhou K., Doudna J.A. A general module for RNA crystallization. J. Mol. Biol. 1998; 279:621–631.
23. Nikulin A., Eliseikina I., Tishchenko S., Nevskaya N., Davydova N., Platonova O., Piendl W., Selmer M., Liljas A., Drygin D. et al. Structure of the L1 protuberance in the ribosome. Nat. Struct. Biol. 2003; 10:104–108.
24. Sharma M.R., Koc E.C., Datta P.P., Booth T.M., Spremulli L.L., Agrawal R.K. Structure of the mammalian mitochondrial ribosome reveals an expanded functional role for its component proteins. Cell. 2003; 115:97–108.
25. Fiore J.L., Nesbitt D.J. An RNA folding motif: GNRA tetraloop-receptor interactions. Q. Rev. Biophys. 2013; 46:223–264.
26. Sheehy J.P., Davis A.R., Znosko B.M. Thermodynamic characterization of naturally occurring RNA tetraloops. RNA. 2010; 16:417–429.
27. Woese C.R., Winker S., Gutell R.R. Architecture of ribosomal RNA: constraints on the sequence of “tetra-loops”. Proc. Natl. Acad. Sci. U.S.A. 1990; 87:8467–8471.
28. Lemieux S., Major F. Automated extraction and classification of RNA tertiary structure cyclic motifs. Nucleic Acids Res. 2006; 34:2340–2346.
29. Ennifar E., Nikulin A., Tishchenko S., Serganov A., Nevskaya N., Garber M., Ehresmann B., Ehresmann C., Nikonov S., Dumas P. The crystal structure of UUCG tetraloop. J. Mol. Biol. 2000; 304:35–42.
30. Robertus J.D., Ladner J.E., Finch J.T., Rhodes D., Brown R.S., Clark B.F., Klug A. Structure of yeast phenylalanine tRNA at 3 A resolution. Nature. 1974; 250:546–551.
31. Chan C.W., Chetnani B., Mondragon A. Structure and function of the T-loop structural motif in noncoding RNAs. Wiley Interdiscip. Rev. RNA. 2013; 4:507–522.
32. Serganov A., Polonskaia A., Phan A.T., Breaker R.R., Patel D.J. Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch. Nature. 2006; 441:1167–1171.
33. Zhang J., Ferre-D’Amare A.R. Co-crystal structure of a T-box riboswitch stem I domain in complex with its cognate tRNA. Nature. 2013; 500:363–366.
34. Schroeder K.T., McPhee S.A., Ouellet J., Lilley D.M. A structural database for k-turn motifs in RNA. RNA. 2010; 16:1463–1468.
35. Matsumura S., Ikawa Y., Inoue T. Biochemical characterization of the kink-turn RNA motif. Nucleic Acids Res. 2003; 31:5544–5551.
36. Turner B., Melcher S.E., Wilson T.J., Norman D.G., Lilley D.M. Induced fit of RNA on binding the L7Ae protein to the kink-turn motif. RNA. 2005; 11:1192–1200.
37. Teplova M., Wohlbold L., Khin N.W., Izaurralde E., Patel D.J. Structure-function studies of nucleocytoplasmic transport of retroviral genomic RNA by mRNA export factor TAP. Nat. Struct. Mol. Biol. 2011; 18:990–998.
38. Hausner T.P., Atmadja J., Nierhaus K.H. Evidence that the G2661 region of 23S rRNA is located at the ribosomal binding sites of both elongation factors. Biochimie. 1987; 69:911–923.
39. Leontis N.B., Stombaugh J., Westhof E. Motif prediction in ribosomal RNAs: lessons and prospects for automated motif prediction in homologous RNA molecules. Biochimie. 2002; 84:961–973.
40. Brodersen D.E., Clemons W.M., Carter A.P., Wimberly B.T., Ramakrishnan V. Crystal structure of the 30 S ribosomal subunit from Thermus thermophilus: structure of the proteins and their interactions with 16 S RNA. J. Mol. Biol. 2002; 316:725–768.
41. Lempereur L., Nicoloso M., Riehl N., Ehresmann C., Ehresmann B., Bachellerie J.P. Conformation of yeast 18S rRNA. Direct chemical probing of the 5′ domain in ribosomal subunits and in deproteinized RNA by reverse transcriptase mapping of dimethyl sulfate-accessible. Nucleic Acids Res. 1985; 13:8339–8357.
42. Anders C., Niewoehner O., Duerst A., Jinek M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature. 2014; 513:569–573.
43. Nishimasu H., Ran F.A., Hsu P.D., Konermann S., Shehata S.I., Dohmae N., Ishitani R., Zhang F., Nureki O. Crystal structure of Cas9 in complex with guide RNA and target DNA. Cell. 2014; 156:935–949.
44. Ben-Shem A., Garreau de Loubresse N., Melnikov S., Jenner L., Yusupova G., Yusupov M. The structure of the eukaryotic ribosome at 3.0 Å resolution. Science. 2011; 334:1524–1529.
45. Gulay S. Building a map of the dynamic ribosome. 2015; University of Maryland Department of Cell Biology and Molecular Genetics; Ph.D. Thesis.
46. Stombaugh J., Zirbel C.L., Westhof E., Leontis N.B. Frequency and isostericity of RNA base pairs. Nucleic Acids Res. 2009; 37:2294–2312.
47. Ko J., Lee Y., Park I., Cho B. Identification of a structural motif of 23S rRNA interacting with 5S rRNA. FEBS Lett. 2001; 508:300–304.
48. Winkler W.C., Breaker R.R. Regulation of bacterial gene expression by riboswitches. Annu. Rev. Microbiol. 2005; 59:487–517.
49. Hampel K.J., Tinsley M.M. Evidence for preorganization of the glmS ribozyme ligand binding pocket. Biochemistry. 2006; 45:7861–7871.
50. Winkler W.C., Nahvi A., Roth A., Collins J.A., Breaker R.R. Control of gene expression by a natural metabolite-responsive ribozyme. Nature. 2004; 428:281–286.
51. Cochrane J.C., Lipchock S.V., Strobel S.A. Structural investigation of the GlmS ribozyme bound to its catalytic cofactor. Chem. Biol. 2007; 14:97–105.
52. Rosenblad M.A., Larsen N., Samuelsson T., Zwieb C. Kinship in the SRP RNA family. RNA Biol. 2009; 6:508–516.
53. Siegel V., Walter P. Removal of the Alu structural domain from signal recognition particle leaves its protein translocation activity intact. Nature. 1986; 320:81–84.
54. Wolin S.L., Walter P. Signal recognition particle mediates a transient elongation arrest of preprolactin in reticulocyte lysate. J. Cell Biol. 1989; 109:2617–2622.
55. Weichenrieder O., Wild K., Strub K., Cusack S. Structure and assembly of the Alu domain of the mammalian signal recognition particle. Nature. 2000; 408:167–173.
56. Roth A., Weinberg Z., Chen A.G., Kim P.B., Ames T.D., Breaker R.R. A widespread self-cleaving ribozyme class is revealed by bioinformatics. Nat. Chem. Biol. 2014; 10:56–60.