RNA. 2023 May;29(5):584–595. doi: 10.1261/rna.079211.122

bpRNA-align: improved RNA secondary structure global alignment for comparing and clustering RNA structures

Brittany Lasher 1, David A Hendrix 1,2
PMCID: PMC10159002  PMID: 36759128

Abstract

Ribonucleic acid (RNA) is a polymeric molecule that is fundamental to biological processes, with structure being more highly conserved than primary sequence and often key to its function. Advances in RNA structure characterization have resulted in an increase in the number of accurate secondary structures. The task of uncovering common RNA structural motifs with a collective function through structural comparison, providing a level of similarity, remains challenging and could be used to improve RNA secondary structure databases and discover new RNA families. In this work, we present a novel secondary structure alignment method, bpRNA-align. bpRNA-align is a customized global structural alignment method, utilizing an inverted (gap extend costs more than gap open) and context-specific affine gap penalty along with a structural, feature-specific substitution matrix to provide similarity scores. We evaluate our similarity scores in comparison to other methods, using affinity propagation clustering, applied to a benchmarking data set of known structure types. bpRNA-align shows improvement in clustering performance over a broad range of structure types.

Keywords: RNA, RNA secondary structure, RNA structural alignment, RNA structural clustering, clustering

INTRODUCTION

Ribonucleic acid (RNA) consists of polymeric chains of nucleotides that form structures through base pair interactions. RNA has numerous biological roles including encoding information, as in messenger RNAs (mRNAs), decoding information (e.g., transfer RNAs, tRNAs), and regulation of gene expression. Noncoding RNAs (ncRNAs) are RNAs that, unlike mRNAs, do not encode proteins. ncRNA structure is more highly conserved than primary sequence and defines their function. With advances in RNA structure characterization through X-ray crystallography (Terayama et al. 2018), cryogenic electron microscopy (Zhang et al. 2019), nuclear magnetic resonance (NMR) (Marchanka et al. 2015), and secondary structure (SS) predictions guided by RNA probing data, an increased number of accurate SSs has become available. In the Protein Data Bank (PDB) alone, over 50% of the RNA-only structures have been identified since 2010 (Berman et al. 2000). The past decade has also seen the rise of high-throughput structure probing assays that have generated experimentally supported structures for over 150 unique transcriptomes (Li et al. 2021).

The rising tide of available RNA structure data necessitates improved approaches to compare and identify similar structures, and to organize available data. Although tools such as Infernal exist and have commonly been used to search sequence databases for RNAs with sequence and structure similarity, Infernal is not designed to generate pairwise similarity scores between RNA structures. Recent RNA clustering tools have been developed, such as GraphClust2, which is an iterative clustering approach that uses both Infernal and LocARNA. GraphClust2 provides a good baseline for clustering; however, it requires multiple iterations to compute accurate clustering results and does not output pairwise structural similarity. RNAforester and BEAGLE are two approaches that calculate similarity between two RNA structures. RNAforester is a comparison method that applies a forest-based alignment approach (often computationally expensive) to simultaneously align structure and sequence (Höchsmann et al. 2003; Höchsmann 2005; Blin et al. 2010). Under this method, an SS is represented as a rooted ordered forest, which contains “P” nodes representing the base paired bonds, with sequence in the leaves. The RNAforester global structure alignment algorithm has been substantially improved, resulting in a time complexity of O((n²/k)d²), where n is the number of nodes in the forest, d is the degree of the forest, and k splits a forest at all possible points (Schirmer and Giegerich 2011). This approach can be challenging to understand and lacks distinct representations of unpaired loop types at the nucleotide level (e.g., hairpin loops vs. multiloops).

BEAGLE (Mattei et al. 2015), another structural comparison method, utilizes the BEAR encoding to represent structural information as blocks characterizing different SS types (loop, internal loop, stem, or right/left bulge) and their lengths (Mattei et al. 2014). For example, stems of length five are represented as S5, and are encoded into the BEAR representation using a corresponding letter of the alphabet. This results in an overly complex representation, involving a different letter of the alphabet for each combination of length and structure type, where a limited maximum length is fixed for each structure block. Although structural direction differentiation is implemented for bulges, with right and left bulge options, it is not applied to stem structure blocks, which could lead to misalignments as structures become more complex. BEAGLE aligns its structure representation using a modified Needleman–Wunsch sequence alignment algorithm, with generation of a log-odds substitution matrix for scoring alignments (Mattei et al. 2014, 2015). The large alphabet within the BEAR encoding requires BEAGLE to have a large substitution matrix with many terms. The challenge remains to develop fast and accurate SS comparison algorithms that still capture fine detail in their structure representations.

In our prior work, we developed an automated structure annotation method, bpRNA, to label the structural features of RNA, including multiloops, bulges, hairpin loops, internal loops, external loops, and end regions (Danaee et al. 2018). bpRNA outputs an RNA SS representation with single-character structure codes for each nucleotide, referred to as a “structure array” (Fig. 1A). The structure array is similar to the Washington University Secondary Structure (WUSS) notation (Eddy 2005), but provides differentiation between external loops and multiloops, as well as bulges and internal loops. The structure array has several advantages compared to other representations such as BEAR or the traditional dot bracket notation (DBN). It represents not only paired and unpaired nucleotides, as in the DBN, but also the type of structural feature corresponding to the unpaired nucleotides (hairpin, external loop, internal loop, multiloop, bulge, and end). This results in a feature-dense structural representation providing more information at the nucleotide level than the DBN. In comparison to the BEAR representation used in the BEAGLE comparison approach, the structure array uses a much smaller alphabet, with a single-character code corresponding to each position of the sequence. In other words, a 4-nt hairpin loop would be represented as HHHH, rather than using a different character for loops of different lengths (Mattei et al. 2014). In addition, the structure array explicitly incorporates multiloops, external loops, and ends into the sequence, rather than using the same character for each. An example comparison among the DBN, BEAR encoding, WUSS, and the bpRNA structure array is shown for a simple RNA structure that includes a bulge, internal loop, external loop, stems, hairpin loops, and end regions (Fig. 1B). The simple alphabet in the bpRNA structure array results in a representation that is easily interpreted at first glance.
In contrast, the BEAR notation utilizes a large nonintuitive alphabet, where ends, external loops, and multiloops are all represented with the same character, and many different characters are used to represent internal loops and bulges of different lengths. Within this work we utilize the bpRNA structure array as a more detailed input than the DBN, to globally align RNA structures.

FIGURE 1.

(A) Needleman–Wunsch algorithm customizations. Primary sequence (purple), dot bracket structure notation (blue), and bpRNA array structure representation (green) for the structure bpRNA_PDB_528. (B) Comparison of RNA structure notations BEAGLE, DBN, and bpRNA structure array for a hairpin structure. (C) Scoring strategy used to generate the substitution matrix for the customized NW alignment approach, with positive scores correlated with their contribution to the structure stability, and with mismatch scores negatively correlated with their structure stability. (D) Scoring matrix generated from the scoring strategy. Colors and shades represent similar structural feature types. (E) Approaches to handle gaps in the alignment strategy. (F) Alignment moves between matrices for gaps, extensions, and match and mismatches.

We present bpRNA-align, a fast, easy-to-use program that identifies global similarity between RNA structures, with code and usage instructions available on GitHub (https://github.com/BLasher113/bpRNA_align). bpRNA-align addresses previous RNA comparison limitations by using the bpRNA structure array. The bpRNA structure array has the same length as the RNA and uses a small alphabet to represent all structural feature types. While affine gap penalties normally apply a greater penalty for gap openings than gap extensions, we do the reverse because single-nucleotide insertions or deletions are less destabilizing to RNA structure (e.g., a 1-nt bulge). Furthermore, we have implemented context-specific gap penalties in bpRNA-align, which allow for the natural variability observed within RNA structural features, enabling accurate identification of globally similar RNA structures and helping to connect function to structure.

RESULTS

Needleman–Wunsch structure alignment

To demonstrate the functionality of bpRNA-align, we applied it on an example involving two tRNA structures with bpRNA-1m identifications of bpRNA_RFAM_1154 (Fig. 2A), and bpRNA_PDB_528 (Fig. 2B). These structures were chosen from different originating databases, with each structure being obtained through different methods. Structure bpRNA_PDB_528 (source ID: PDB 3TRA) originated from the PDB and was obtained through X-ray diffraction (Westhof et al. 1988). Structure bpRNA_RFAM_1154 (source ID: RFAM RF00005_AF008220.1_6334-6422) was obtained from the RFAM database and was determined through comparative sequence analysis (Nawrocki and Eddy 2013). The X, Y, and M matrices for nonbanded alignment (Fig. 2C), representing the scores for all possible moves, show a strong high scoring diagonal trend, indicating the similarity between these two tRNA structures. The search space for banded alignment can be visualized in similar scoring matrices, X, Y, and M, demonstrating the same diagonal trend (Fig. 2D). The alignment results (Fig. 2E), obtained through tracing back the ideal path between the three matrices, demonstrate a high similarity between the two structures with most gaps originating in the third branch of the multiloop.

FIGURE 2.

Alignment example between two tRNA structures. (A,B) Structures bpRNA_PDB_528 and bpRNA_RFAM_1154 pulled from the bpRNA-1m database and aligned with bpRNA-align. (C) Alignment matrices for the alignment between the structures shown in A and B, without the banded alignment customization. (D) Alignment matrices for the alignment between A and B, demonstrating the banded alignment customization. For C,D negative values were mapped to zero for visualization purposes. (E) Alignment results for the alignment of A and B with affine gaps. (F) Alignment result for the alignment of A and B with linear gaps.

Banded alignment

When we applied the banded strategy to align the same two tRNA structures, with a bandwidth of w = 20, we obtained the same alignment result (Fig. 2E). Although the same result was reached, the complexity of our method decreased from O(n²) to O(wn). The search space band can be visualized in the scoring matrices, X, Y, and M (Fig. 2D).
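The saving from banding can be illustrated by counting the dynamic programming cells that fall inside a band around the main diagonal. The sketch below is only an illustration: the function name `band_cells`, the band definition |i − j| ≤ w, and the sequence lengths in the usage line are assumptions, not details taken from bpRNA-align.

```python
def band_cells(n, m, w):
    """Count cells (i, j) of an (n+1) x (m+1) matrix with |i - j| <= w.

    Assumed band definition: a fixed-width strip around the main diagonal.
    """
    total = 0
    for i in range(n + 1):
        lo = max(0, i - w)   # leftmost column inside the band on row i
        hi = min(m, i + w)   # rightmost column inside the band on row i
        if lo <= hi:
            total += hi - lo + 1
    return total


# Hypothetical lengths: two structures of 100 nt each, bandwidth w = 20.
full = (100 + 1) * (100 + 1)
banded = band_cells(100, 100, 20)
print(f"full matrix: {full} cells, band: {banded} cells")
```

With a fixed band of this kind, the bottom-right cell is reachable only when the length difference between the two structures does not exceed w, so a practical implementation has to choose w with that in mind.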

“Inverted” affine gaps

To allow for some flexibility between structures, the affine gap strategy was implemented. In contrast to traditional sequence alignment approaches, a small gap opening cost and a greater gap extension cost were applied, which resulted in higher scores for structures with many small gap regions throughout the structure, and lower scores for structures with long continuous gap regions. To examine this application on an example case, bpRNA-align was run with both the linear gap penalty and the affine gap penalty for the pair of tRNAs in Figure 2. The main distinction between the two structures is a 15-length difference in the third branch of the multiloop, also known as the variable loop (Fig. 2A,B). The alignment utilizing affine gaps resulted in 18 gaps and a similarity score of 239.4 (Fig. 2E), in comparison with the application of linear gaps that resulted in 20 gaps, but with a similarity score of 260.9 (Fig. 2F). Therefore, the addition of affine gaps produced a structural alignment using fewer gap characters, but with 10 gapped regions compared with six gapped regions in the linear alignment approach. The lower similarity score obtained with affine gaps reflects the penalty on the large difference in the length of the variable loop. While we do recognize that variable loops in tRNAs can range in length, this example illustrates how the implementation of affine gaps in bpRNA-align penalizes large changes in loop length as opposed to small changes.

Natural variability within structural feature types

Evaluation of the mean absolute deviation (MAD) of the length of each instance of a structural feature across RNA families revealed the feature types that demonstrate greatest length-MAD. End regions and external loops demonstrate the greatest length-MAD overall, as these features do not significantly affect the core structure. Bulges resulted in the lowest overall length-MAD spread, with stems, internal loops and multiloops falling close behind. Mean MAD values for each feature over all RNA families are as follows: 0.42 (Hairpin), 0.88 (End), 0.031 (Bulge), 0.040 (Stem), 0.075 (Internal loop), 0.46 (External Loop), 0.052 (Multiloop).
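As a reminder of the statistic involved, the mean absolute deviation of a set of feature lengths can be computed as below. This is a minimal sketch on raw lengths; this section does not state how lengths were scaled or pooled across families before the MAD was taken, so the function name and the toy input are assumptions for illustration only.

```python
def mean_absolute_deviation(values):
    """Mean of |v - mean(values)| over all values."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Hypothetical hairpin-loop lengths for one feature position across a family.
hairpin_lengths = [4, 4, 5, 7]
print(mean_absolute_deviation(hairpin_lengths))
```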

Context-specific gap penalties

The addition of context-specific affine gap penalties helped improve the performance of bpRNA-align by increasing the accuracy in most of the benchmark data sets (Fig. 3). To separate out the effect of affine gaps from context-specific gaps, the accuracies of bpRNA-align with three different implementations were determined (Fig. 3). First, bpRNA-align was implemented with linear gaps (Linear), next it was implemented with affine gaps (Affine), and last it was implemented with context-specific affine gaps (Contextual Affine). Figure 3 shows the accuracy results for each of these cases. The addition of affine gaps helped improve the accuracy in the Riboswitch data set only, whereas the addition of context-specific gaps improved the accuracy in three of the four data sets examined.

FIGURE 3.

Updates in bpRNA-align performance. Comparison of bpRNA-align before the addition of context-specific affine gaps (purple), bpRNA-align with only affine gaps implemented (blue), optimized bpRNA-align (green) with context-specific affine gaps.

Global alignment

Not only is our method capable of globally aligning shorter RNA structures (Fig. 2) but it is also capable of accurately aligning long RNA structures. To demonstrate this, we applied and evaluated our method on both 16S and 23S structures ranging in length up to 3000 nt. For both subfamilies, we chose one structure from the CRW database, originating from comparative sequence analysis using covariance models (Gutell et al. 1992; Gautheret et al. 1995; Shang et al. 2012), and one structure from the PDB, originating from electron microscopy. For these RNAs, we were able to identify both conserved and nonconserved regions within the structures (Supplemental Figs. 4, 5), showing bpRNA-align's capability to examine larger RNAs.

Method comparison

The performance of bpRNA-align, RNAforester, and BEAGLE was evaluated on the five benchmark data sets generated from the meta-database bpRNA-1m (Danaee et al. 2018). The first data set consisted of RNA riboswitches, the second consisted of microRNAs (miRNAs), the third consisted of three-segment structures, the fourth consisted of four-segment structures, and the fifth consisted of a combination of all four previous data sets. For thoroughness, we also compared bpRNA-align with Infernal (Supplemental Fig. 6). However, since Infernal is not designed for this task, it performed substantially worse in all but one case, where it matched the lowest performing method.

Clustering performance on riboswitch RNA

Evaluation of the Riboswitch data set (Fig. 4A) resulted in bpRNA-align achieving the highest performance, with an accuracy score and Jaccard index of 1.0 and a silhouette score of 0.31 (Fig. 4B). RNAforester and BEAGLE demonstrated lower accuracies of 0.83 and 0.88, respectively, along with lower Jaccard indices of 0.55 and 0.66, and silhouette scores of 0.24 and 0.36 (Fig. 4B). Confusion matrices provide a visualization of the number of clusters predicted and which RNA families are not correctly clustered (Fig. 4C). bpRNA-align was the only method that correctly predicted the number of clusters within the riboswitch data set. BEAGLE predicted an extra two clusters by splitting RNA subtypes SMK_box and preQ1-II into two independent clusters. RNAforester performed the worst, predicting three extra clusters and splitting RNA categories Glycine, SMK_box, and SAM each into two separate clusters.

FIGURE 4.

Comparison of structure comparison approaches applied to riboswitch structures. (A) Riboswitch subfamilies utilized in development of the data set. (B) Affinity silhouette scores, accuracy, and Jaccard index for each structure comparison method. (C) Confusion matrices for method comparison results.

Clustering performance on microRNAs

Evaluation of the microRNA data set (Fig. 5A) revealed that both bpRNA-align and BEAGLE achieved perfect accuracy scores and Jaccard indices of 1.0 (Fig. 5B), with BEAGLE obtaining a higher silhouette score (0.59) compared to bpRNA-align (0.17) (Fig. 5B). Assessment of the confusion matrices showed that RNAforester did not correctly cluster the miRNA subtypes, resulting in a lower accuracy score (0.89), Jaccard index (0.69), and silhouette score (0.11) (Fig. 5B). Clustering with RNAforester resulted in misgrouping of the RNA subtypes mir-166, mir-286, and mir-159 by either splitting them into multiple independent clusters or combining incorrect categories together in a cluster (Fig. 5C).

FIGURE 5.

Comparison of structure comparison approaches applied to microRNA structures. (A) MicroRNA subfamilies utilized in development of the data set. (B) Affinity silhouette scores, accuracy scores, and Jaccard Indices resulting from the structure comparison methods. (C) Confusion matrix results for the methods compared.

Clustering performance on three-segment structures

The three-segment structure data set (Supplemental Fig. 2A) was the only example where bpRNA-align did not achieve or match the highest accuracy. Clustering resulted in RNAforester obtaining a 1.0 accuracy score and Jaccard index, followed by bpRNA-align with an accuracy of 0.95 and Jaccard index of 0.90, and BEAGLE with an accuracy of 0.89 and Jaccard index of 0.69 (Supplemental Fig. 2B). Silhouette scores for RNAforester, bpRNA-align, and BEAGLE were 0.33, 0.37, and 0.51, respectively (Supplemental Fig. 2B). Evaluation of the confusion matrices (Supplemental Fig. 2C) shows that BEAGLE splits both RNA families ROSE and His_leader into two independent clusters. bpRNA-align splits RNA family ROSE into multiple other RNA family clusters besides its own cluster.

Clustering performance on four-segment structures

For the data set consisting of four-segment structures (Supplemental Fig. 3A), RNAforester and bpRNA-align predicted accuracies and Jaccard indices of 1.0, with BEAGLE obtaining an accuracy score and Jaccard index of 0.93 and 0.80 (Supplemental Fig. 3B). Silhouette scores for the three methods were 0.42 for RNAforester, 0.44 for bpRNA-align and 0.58 for BEAGLE (Supplemental Fig. 3B). Visualization of the confusion matrices (Supplemental Fig. 3C) shows that BEAGLE was not able to correctly predict the RNA family BTE, where it clustered the structures into two independent clusters.

Clustering performance on structures from all combined data sets

All four data sets were combined to form a more complex larger data set. RNAforester, bpRNA-align, and BEAGLE obtained accuracies of 0.93, 0.96, and 0.94 and Jaccard indices of 0.89, 0.94, and 0.82. Silhouette scores for the three methods were 0.14 for RNAforester, 0.19 for bpRNA-align, and 0.31 for BEAGLE (Supplemental Fig. 7). The confusion matrices show that BEAGLE struggles to correctly cluster four different RNA subfamilies, which results in additional clusters from splitting subfamilies into two distinct clusters. RNAforester does not correctly cluster five different subfamilies, and results in misclustering structures within the incorrect RNA subfamilies. bpRNA-align results in the highest accuracy, only struggling to correctly cluster one RNA subfamily. Examination of RNAforester alignment results identified that structures within the same predicted cluster had a high number of mismatches, compared to structural alignments within the same subfamily, whereas the number of gaps was comparable between both.

Negative control comparison through bpRNA-align

bpRNA-align prediction of pairwise similarity scores for the negative control data set showed a separation in scores between true alignments of the same RNA families and alignments of a negative control group structure with a structure of a true RNA family. Similarity scores of structures from the same RNA family were much higher, as expected, than those of a true structure with a negative control group structure. The confusion matrix (Supplemental Fig. 8) shows that the negative control structures do not each form their own individual cluster. This is expected, because of the clustering algorithm, which prefers structures to cluster together rather than forming individual single clusters.

DISCUSSION

How bpRNA-align addresses feature-specific length variability

To compare and cluster RNA structures, there needs to be a metric capable of accurately representing the similarity between two structures. This metric should consider the natural variation displayed within each structural feature over a broad range of RNA families. This requires that the structure representation used to compare structures can distinguish different types of unpaired regions. Our method utilizes a compact yet descriptive structure representation which, unlike RNAforester, specifically differentiates between unpaired feature regions of RNA structure and, unlike BEAGLE, distinguishes between left- and right-handed stems using a compact alphabet.

To address the variability in feature length, bpRNA-align utilizes an inverted, context-specific affine gap strategy, which not only allows for differentiation between gap openings and gap extensions, but also allows for gap penalties to depend on the feature type. Context-specific gap penalties have been used previously in protein alignment to better capture the natural variation in constraints on protein structure (Thompson 1995). Our use of context-specific gap penalties improved performance and was guided by the natural variation in the lengths of RNA structural features. These gap penalties are calculated and weighted by the observed mean MAD for each structural feature. By applying lower gap penalties in feature types with more variation, and higher gap penalties in feature types with less variation, our method provides the adaptability to correctly capture structural similarity. Our analysis of the MAD length for each structural feature over all RNA categories in the bpRNA-1m meta-database, excluding only RNA categories in the benchmark data sets, revealed a broad range of variation (Fig. 6). Our analysis shows that stems, multiloops, bulges, and internal loops should be weighted more heavily in the scoring process, a consideration we took into account when developing bpRNA-align.
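The idea of scaling gap penalties by feature variability can be sketched as a simple lookup. The mean MAD values below are the ones reported in the Results; everything else is hypothetical. In particular, the linear scaling rule, the `floor` parameter, the base penalty, and the function name `contextual_penalty` are assumptions chosen for illustration, and this section does not give the formula actually used by bpRNA-align.

```python
# Mean length-MAD per feature type, as reported in the Results.
MEAN_MAD = {
    "H": 0.42,   # hairpin
    "E": 0.88,   # end
    "B": 0.031,  # bulge
    "S": 0.040,  # stem
    "I": 0.075,  # internal loop
    "X": 0.46,   # external loop
    "M": 0.052,  # multiloop
}


def contextual_penalty(feature, base, floor=0.25):
    """Hypothetical rule: shrink a negative base penalty as the MAD grows.

    The most variable feature keeps `floor` times the base penalty;
    a feature with zero MAD would keep the full base penalty.
    """
    ratio = MEAN_MAD[feature] / max(MEAN_MAD.values())
    weight = 1.0 - (1.0 - floor) * ratio
    return base * weight


# Hypothetical base penalty of -4 applied to every feature type.
for feature in "EXHIMSB":
    print(feature, round(contextual_penalty(feature, -4.0), 2))
```

Under this assumed rule, gaps in end regions and external loops are penalized least, while gaps in stems, bulges, internal loops, and multiloops stay close to the full base penalty, which matches the qualitative behavior described above.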

FIGURE 6.

Utilization of structural feature length variation to guide gap penalties. Feature-specific violin plot of length mean values for each character in the feature pattern of each RNA subfamily.

Benchmarking bpRNA-align through biological clustering

We wanted to evaluate our method's ability to identify RNA functional categories. An ideal approach for this is a type of biological clustering (Vendramin et al. 2009). This type of framework allows for evaluation of the alignment quality based on its ability to reach the biological goal (grouping RNAs from the same category) and not just a comparison on the sequence level (Wang et al. 2018). To achieve this, we sought to develop benchmark data sets composed of different RNA families to demonstrate the accuracy of each method's ability to capture the biological context. The five data sets developed in this work represent similar structures from different RNA categories, creating challenging cases to cluster, based on similarity scores. To determine the accuracy, affinity propagation was applied to cluster the data sets. This is an ideal method, as it is based on the similarity scores from each method to perform clustering rather than distance.
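Once a clustering has been obtained, agreement with the known RNA families can be summarized with a pair-counting Jaccard index. The sketch below uses a common definition (pairs grouped together in both partitions divided by pairs grouped together in at least one); this section does not spell out the exact definition used for the reported Jaccard indices, so the definition, the function name `pair_jaccard`, and the toy labels are assumptions.

```python
from itertools import combinations


def pair_jaccard(true_labels, predicted_labels):
    """Pair-counting Jaccard index between two partitions of the same items."""
    both = either = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            both += 1
        if same_true or same_pred:
            either += 1
    # If no pair is grouped in either partition, treat them as identical.
    return both / either if either else 1.0


# Hypothetical example: family "a" is split into two predicted clusters.
families = ["a", "a", "a", "b", "b"]
clusters = [0, 0, 1, 2, 2]
print(pair_jaccard(families, clusters))
```

Splitting a family lowers the index even when no structures from different families are merged, which is the pattern reported for methods that predict extra clusters.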

Assessing method performance and scoring

bpRNA-align matches or outperforms the other methods in four out of the five data sets examined in this work. This is likely due to the type of structural representation used and the inherent adaptability added within this method, which allows it to be more robust over a broader range of RNA structures. The inverted affine gap implementation with a small gap opening allows for flexibility in the alignments without leading to large separations between similar structures. A large portion of this flexibility is also due to the addition of context-specific gap opening and extension penalties. This allows for flexibility in structure features that are less stable and more likely to vary in length.

RNAforester demonstrates large scoring effects for gaps, but mismatches in structure features appear to have little consequence on the score. An example of low penalty for mismatches, resulting in different RNA types scoring highly, is the alignment between bpRNA_RFAM_6133 of category mir-166 and bpRNA_RFAM_18555 of category MIR159 (Supplemental Fig. 9). Although these structures appear to have completely different patterns of base-pairing, they are scored much higher than alignments of the same RNA type in the mir-166 category. In other cases, RNAforester split a single RNA family into multiple clusters due to a large section of gaps in variable-length regions, which leads to a large differentiation between structures of the same RNA type. For example, hairpins are observed to have a larger level of variation (e.g., mir-166, Supplemental Fig. 9) in some RNA families, and in these cases RNAforester scores may be too sensitive to cluster the families correctly.

Within the data sets analyzed in this work, BEAGLE is observed to predict too many clusters in four of the five data sets examined. This result is due to splitting RNA families into multiple clusters. To understand why BEAGLE was splitting categories into multiple clusters, each split family was examined individually. This revealed that the number of gap characters within each alignment varied largely between a single RNA category, most often within an unpaired region that has more variability. As a result, BEAGLE has a larger scoring distinction between certain structures within the same family, and this results in splitting the family into multiple clusters.

Future directions and considerations

An aspect that both RNAforester and BEAGLE apply in their alignment approach is the use of primary sequence and not just structure alone. This is something bpRNA-align does not currently account for but is an addition we plan to make in the future. We know that some types of RNAs are more conserved in their primary sequences than others, and thus it is important to apply a user-weighted adjustable option for sequence information. This could potentially boost the performance of the alignment results, especially for cases that are less structurally conserved, and more sequence conserved.

Another factor that could boost performance is an optimized substitution matrix. bpRNA-align currently utilizes a knowledge-based substitution matrix, but there is room for improvement here. The alignment substitution matrix can be improved using the log-likelihood method, analogous to the generation of the BLOSUM (Henikoff and Henikoff 1996) and point accepted mutation (PAM) substitution matrices for amino acids (Dayhoff et al. 1978) but instead with RNA structural elements. BEAGLE constructs a log-likelihood substitution matrix, but as mentioned previously, its use of different characters for each loop length results in a large number of parameters to calculate. Our approach will rely on a set of high-similarity structural alignments, curated from known homologous RNAs. By examining the empirical frequency of each type of substitution, the structural substitution matrix can be computed.

Another development direction includes the addition of a local alignment approach utilizing the Smith–Waterman algorithm. Although global structural alignments provide vital information about the similarity between overall structures, they do not allow for identification of locally conserved regions within RNAs. Future work could compute a local alignment applying customizations similar to those we currently demonstrate, but with the Smith–Waterman algorithm or other alignment approaches that capture local similarity. This would further address the challenges of aligning more evolutionarily distant RNAs where there may be large insertions or deletions, or where structural conservation is more localized.

MATERIALS AND METHODS

Alignment algorithm

The bpRNA structural annotation tool identifies structural features and creates the structure array sequence that includes stems (S), internal loops (I), bulges (B), hairpins (H), multiloops (M), external loops (X), and ends (E). For our bpRNA-align global alignment method, we further enhanced the bpRNA structure arrays using the dot-bracket sequence to convert the stem characters into left- and right-handed stem characters, L and R. Our approach to identify similarity and generate an alignment uses the Needleman–Wunsch algorithm (Needleman and Wunsch 1970) as a basis for our global structure alignment approach. This is an application of dynamic programming, where a larger task is broken down into smaller subtasks, which are each solved to obtain the overall result. The Needleman–Wunsch algorithm will align sequences x and y, which involves recursively computing a score matrix and a traceback matrix indicating the direction of each recursion step. Tracing backward through this matrix, based on the directional pathway starting from the lower right position, provides the optimal alignment of x to y.
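The conversion of stem characters into left- and right-handed characters can be sketched as follows. This is an illustrative reading of the step described above, not the bpRNA-align source code: the function name `add_stem_direction` is hypothetical, and treating `(`, `[`, `{`, `<` as opening and `)`, `]`, `}`, `>` as closing symbols is an assumption about the dot-bracket input.

```python
OPENING = set("([{<")  # assumed opening symbols in the dot-bracket string
CLOSING = set(")]}>")  # assumed closing symbols in the dot-bracket string


def add_stem_direction(structure_array, dot_bracket):
    """Replace each stem code S with L (opening side) or R (closing side)."""
    if len(structure_array) != len(dot_bracket):
        raise ValueError("structure array and dot-bracket must have equal length")
    out = []
    for code, bracket in zip(structure_array, dot_bracket):
        if code == "S" and bracket in OPENING:
            out.append("L")
        elif code == "S" and bracket in CLOSING:
            out.append("R")
        else:
            out.append(code)  # unpaired feature codes are kept unchanged
    return "".join(out)


# Toy hairpin: two end nucleotides, a 4-bp stem, and a 4-nt hairpin loop.
print(add_stem_direction("EESSSSHHHHSSSSEE", "..((((....)))).."))
# prints EELLLLHHHHRRRREE
```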

Substitution matrix

To identify the scores assigned for match and mismatch results in the Needleman–Wunsch alignment algorithm, a substitution matrix, s(a,b), is required, where a and b are the possible characters within the sequences x and y. Since a substitution matrix consisting of structural features as characters did not currently exist for scoring RNA structure arrays, it was necessary for us to develop one. We accomplished this through an approach based on domain-specific knowledge and physical intuition about RNA structure. The level of structural disruption due to a mismatch was considered when selecting scores. To assign matched and mismatched scores to the matrix, we followed a reward and penalty strategy (Fig. 1C) utilizing whole numbers and following our physical intuition along with examples from literature (Svoboda and di Cara 2006). High rewards were applied for matching elements that were key to the structural integrity, and high penalties for mismatches that would significantly disrupt the structure. Matching right-sided stem (R) characters or matching left-sided stem (L) characters would have a high reward because stems are key to the stability of the structure. Thus, the highest reward, six, was given to RR and LL matches. Conversely, mismatching a right-sided stem with a left-sided stem would result in high structural disruption, and so the highest penalty of −8 was given for LR mismatches. In the case of mismatching an internal loop with a bulge, the structural characters are similar in nature, and so a penalty of 0 was given. Other mismatches within the substitution matrix were assigned either −2 or −4 as a score, depending on the structural features. For example, the lower penalty of −2 was given for mismatches that had similarities in their structural features, such as end regions (E) and external loops (X), which both demonstrate an open region in the RNA. 
A larger penalty of −4 was applied to hairpin loops mismatched to other loop types, given their distinct importance in guiding RNA folding (Svoboda and di Cara 2006). Matches were assigned either a score of two or four depending on our knowledge about observed mutations in that type of structural feature. The resulting substitution matrix is shown in Figure 1D using terms color-coded by similar structure types.
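As an illustration only, the scores stated explicitly above can be collected into a symmetric lookup. This sketch does not reproduce the full matrix of Figure 1D: entries not given in the text are deliberately omitted, the name `stated_score` is hypothetical, and reading "other loop types" for hairpin mismatches as I, B, and M is an assumption.

```python
# Scores stated explicitly in the text. All other entries of the matrix
# (Fig. 1D) are left out rather than guessed.
STATED_SCORES = {
    frozenset({"L"}): 6,        # LL match
    frozenset({"R"}): 6,        # RR match
    frozenset({"L", "R"}): -8,  # left stem vs. right stem
    frozenset({"I", "B"}): 0,   # internal loop vs. bulge
    frozenset({"E", "X"}): -2,  # end region vs. external loop
    # Assumption: "other loop types" for hairpins means I, B, and M.
    frozenset({"H", "I"}): -4,
    frozenset({"H", "B"}): -4,
    frozenset({"H", "M"}): -4,
}


def stated_score(a: str, b: str) -> int:
    """Symmetric lookup: stated_score(a, b) == stated_score(b, a)."""
    key = frozenset((a, b))
    if key not in STATED_SCORES:
        raise KeyError(f"score for {a}{b} is not stated in the text")
    return STATED_SCORES[key]


print(stated_score("L", "R"), stated_score("R", "L"))  # -8 -8
```

Keying on a `frozenset` makes the lookup order-independent, which matches the symmetry expected of a substitution matrix.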

Affine gap penalty

Sequence alignments of DNA, RNA, and proteins use gap penalties that are a function of the length of the gap. Similarly, thermodynamic models of the destabilizing energies for hairpin, internal, and bulge loops are calculated as a function of the length of the loop feature. We were therefore motivated to implement gap penalties for RNA structural alignments that are a function of the linear length of the gap. Unlike a linear gap penalty, the affine gap penalty, a strategy commonly used in sequence alignment methods (Altschul 1998; Altschul and Erickson 1986), penalizes gap openings with a parameter G < 0 that differs from the parameter E < 0 used for gap extensions (Fig. 1E). Implementing affine gaps in a dynamic programming alignment involves keeping track of three individual matrices, one for each type of step in the recursion (Fig. 1F). Computing the optimal alignment includes diagonal steps along the M-matrix that add match or mismatch scores, s(x_i, y_j), horizontal steps along the X-matrix that correspond to gaps in x, and vertical steps along the Y-matrix that correspond to gaps in y (Fig. 1F). Equations 1–3 show the system of recursion relations used for each possible step within the M, X, and Y matrices, where i and j represent the positions along the sequences x and y, respectively.

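$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i,y_j) \\ X(i-1,j-1) + s(x_i,y_j) \\ Y(i-1,j-1) + s(x_i,y_j) \end{cases} \tag{1}$$

$$X(i,j) = \max \begin{cases} M(i,j-1) + G \\ X(i,j-1) + E \\ Y(i,j-1) + G \end{cases} \tag{2}$$

$$Y(i,j) = \max \begin{cases} M(i-1,j) + G \\ X(i-1,j) + G \\ Y(i-1,j) + E. \end{cases} \tag{3}$$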

We implemented the affine gap penalty (Altschul and Erickson 1986) in our method to allow for small gaps intermittently within the structure while still penalizing long stretches of gaps. This was achieved by setting a small gap opening penalty and a large gap extension penalty. This is the opposite of traditional sequence alignment methods, which tend to assign a greater penalty to the insertion of a gap than to the extension of a gap (Zachariah et al. 2005).
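The three-matrix recursion of Equations 1–3 can be sketched in Python as a score-only global alignment. This is a simplified illustration rather than the bpRNA-align code: the boundary initialization, the toy scoring function `simple_score`, and the single pair of penalties G = −1 and E = −2 are assumptions made for the example.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, s, G, E):
    """Best global alignment score using the M, X, Y recursion (Eqs. 1-3)."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: a leading gap of length k costs G + (k - 1) * E.
    M[0][0] = 0
    for j in range(1, m + 1):
        X[0][j] = G + (j - 1) * E
    for i in range(1, n + 1):
        Y[i][0] = G + (i - 1) * E
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = max(M[i][j - 1] + G, X[i][j - 1] + E, Y[i][j - 1] + G)
            Y[i][j] = max(M[i - 1][j] + G, X[i - 1][j] + G, Y[i - 1][j] + E)
    return max(M[n][m], X[n][m], Y[n][m])


def simple_score(a, b):
    # Toy substitution score for illustration, not the matrix of Fig. 1D.
    return 2 if a == b else -2


# Inverted affine penalty: extending a gap costs more than opening one.
print(affine_global_score("LLHHRR", "LLHRR", simple_score, G=-1, E=-2))  # 9
```

With these toy values, two separated single-position gaps (cost −1 each) score better than one gap of length two (cost −1 − 2), which is the behavior the inverted penalty is meant to encourage.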

Natural variability within structural feature types and context-specific gap penalties

To determine the variability within each structural feature type, we analyzed a broad range of structures within a training data set consisting of all RNA subfamilies within the bpRNA-1m meta-database (Danaee et al. 2018), excluding RNA subfamilies within the benchmark test data sets. For each family type, feature patterns were identified within the data. A feature pattern is defined as a series of features (i.e., loop types and S), each with a feature number corresponding to the order in which it appears in the structure; the pattern does not depend on feature length. Thus, a structure composed of two stems on either side of a hairpin loop would be represented as S1H1S1. To qualify as a feature pattern, at least seven structures were required to demonstrate the pattern. All structures in a family that fall within a feature pattern were used to determine the MAD of the length of each instance of a feature (e.g., H1, H2, H3, E1, E2, S1, S2, S3), which resulted in a total of 12,857 structures comprising the gap penalty training data set. Examination of histograms comparing the GC-content, length, and base pair percentage for both the benchmark and training data confirmed that the distribution of values from the benchmark data is comparable to that of the training data (Supplemental Fig. 1A–C). Furthermore, to confirm that there was no topological overlap between the benchmark and training data sets, we identified unique patterns within each data set, which were based on both the feature and the feature length, in the order they appear in the structure. Using these feature-length patterns, we searched for any overlap between the training and benchmark data sets and found none. Using the training data, we quantified the natural length variability within each structural feature (S, H, I, M, E, X, B) (Fig. 6), and computed the mean MAD over all the RNA families.
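A rough sketch of how feature lengths and their variability could be measured is given below. It assumes that MAD denotes the median absolute deviation, which is not defined in this section, and it uses simple run-length encoding rather than the feature numbering described above; the function names and the example lengths are hypothetical.

```python
from itertools import groupby
from statistics import median


def feature_runs(structure_array):
    """Run-length encode a structure array into (feature, length) pairs."""
    return [(feature, len(list(run))) for feature, run in groupby(structure_array)]


def median_abs_dev(values):
    """Median absolute deviation (assumed meaning of MAD in this sketch)."""
    values = list(values)
    center = median(values)
    return median([abs(v - center) for v in values])


print(feature_runs("ESSSHHHHSSSE"))
# [('E', 1), ('S', 3), ('H', 4), ('S', 3), ('E', 1)]

# Hypothetical hairpin lengths for one feature instance across a family.
print(median_abs_dev([4, 5, 4, 7, 6]))  # 1
```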

To compute context-specific gap penalties, we defined gap extension penalty E based on the mean MAD values for each structural feature, and the gap open penalty G to be E/2. To determine gap penalties within an adequate range and weighted based on their variability defined by their mean MAD value, we constructed a linear range of gap extension penalties between Emin = −1 and Emax = −6. These upper and lower bounds were chosen to be of a comparable weight to the scores in the substitution matrix. The trend used to determine the gap extension penalties is shown by the linear equation

$$E(x) = \left( \frac{E_{\min} - E_{\max}}{\max_f \left( \mathrm{mean}_F(\mathrm{MAD}_{f_i}) \right)} \right) x + E_{\max}.$$

The x values represent the mean MAD value of a specific feature f, and the E(x) values are the resulting gap extension penalties for the corresponding feature. In this equation, we first take the MAD of the length of each feature instance f_i (e.g., H1, H2, S1), and then take the mean over all RNA families F. The denominator is the maximum of mean_F(MAD_{f_i}) over all feature types f. Therefore, when x is zero, E(x) = E_max, and when x is max_f(mean_F(MAD_{f_i})), E(x) = E_min. Within our algorithm, we apply these context-specific gap penalties for a particular feature type when the positions in each structure i and j are inside, but not bordering, the same feature type region.
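The linear mapping from mean MAD to gap penalties can be worked through with a short Python sketch. The bounds E_min = −1 and E_max = −6 and the rule G = E/2 come from the text; the mean MAD values in the example are invented for illustration and are not the values measured in this work.

```python
E_MIN = -1.0  # penalty for the most variable feature type
E_MAX = -6.0  # penalty for a feature type with no length variability


def gap_penalties(mean_mad):
    """Map {feature: mean MAD} to {feature: (gap open G, gap extend E)}."""
    slope = (E_MIN - E_MAX) / max(mean_mad.values())
    penalties = {}
    for feature, x in mean_mad.items():
        extend = slope * x + E_MAX
        penalties[feature] = (extend / 2, extend)  # G = E / 2
    return penalties


# Hypothetical mean MAD values, chosen only to show the two end points.
example = gap_penalties({"S": 0.0, "H": 1.0, "X": 4.0})
print(example["S"])  # (-3.0, -6.0)
print(example["H"])  # (-2.375, -4.75)
print(example["X"])  # (-0.5, -1.0)
```

Features whose lengths vary more across a family therefore receive milder penalties, so gaps inside them are tolerated more readily.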

Banded alignment

When structure arrays x and y are similar to each other, we expect the optimal alignment path to be roughly diagonal. In these cases, the banded alignment strategy can be applied, restricting the search space to a band of (half) width w around the diagonal (Fig. 2A–D). In our global alignment approach, we make use of the banded alignment strategy to increase speed. The banded alignment approach can be used with affine gaps, which results in a lower score than with linear gaps because of the greater penalties for long gaps (Fig. 2E,F). First, we select the longer of the two structure arrays, defining n = max(length(x), length(y)). The value of w is set to w = 25 for our benchmark clustering tests. The band width, w, is a user-tunable input that should be selected based on the specific data set being examined. In practice, too narrow a band could limit accuracy, but may also be more selective for highly similar structures. A band width of w ≈ n/4 typically sets a wide enough band to account for length variability. The banded approach speeds up calculations but, as with any alignment method, should be adjusted to accommodate expected insertions and deletions that could occur in distantly homologous RNAs. For large evolutionary distances, detection of homology could be hindered by large inserts when using a small band width.
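To give a sense of the savings, the following sketch counts the dynamic programming cells that fall inside a band of half width w. The helper names are hypothetical, and the band is taken as all cells with |i − j| ≤ w, which is one common convention and may differ from the exact band used in bpRNA-align.

```python
def band_columns(i, m, w):
    """Columns j in 0..m that satisfy |i - j| <= w for row i."""
    return range(max(0, i - w), min(m, i + w) + 1)


def banded_cell_count(n, m, w):
    """Number of cells visited when filling only the band."""
    return sum(len(band_columns(i, m, w)) for i in range(n + 1))


def end_cell_in_band(n, m, w):
    """A global alignment must end at cell (n, m), so it has to lie in the band."""
    return abs(n - m) <= w


n = m = 100
print(banded_cell_count(n, m, 25), (n + 1) * (m + 1))  # 4501 10201
print(end_cell_in_band(100, 130, 25))  # False
```

The second check shows why w has to accommodate the length difference between the two structures: with this band definition, a difference larger than w leaves the final cell unreachable.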

Benchmark data sets

We generated five different data sets, containing a range of RNA families or functional categories within bpRNA-1m (Danaee et al. 2018), to test the performance of bpRNA-align. For each category in each data set, up to eight (dependent on the number of available structures) unique structures were randomly selected. The first data set generated was composed of six different RNA riboswitch categories, including Glycine, PreQ1-II, Purine, drz agam-1, SMK Box, and SAM (Fig. 4A). The second data set is composed of microRNAs including MIR159, mir-544, mir-1937, mir-166, mir-286, mir-720, and mir-684 (Fig. 5A). These eight different categories were composed of seven unique structures for each category, all of which consist of a single segment. A segment is a collection of adjacent or near-adjacent base pairs, only being interrupted by bulges, hairpin loops, or internal loops, but not multiloops (Danaee et al. 2018). Because the segment allows for some unpaired structural features, it is distinct from a stem (Andronescu et al. 2008). For the third and fourth data sets, RNA families were chosen based on a consistent average number of structural segments. Data set three consists of three-segment structures, including HCV_X3, SCARNA16, SNORA22, His_leader, 5S, PyrR, and ROSE (Supplemental Fig. 2A). For each family in this data set, eight unique structures were selected. Data set four consisted of structures with four segments, including radC, tfoR, MicX, RsaD, BTE, and rpsL_psuedo (Supplemental Fig. 3A). For each family in data set four, seven unique RNA structures were included. Data set five consisted of all RNAs combined from the previous data sets.

Method comparison

To evaluate the performance of bpRNA-align, we compared it with two other RNA-comparison methods available, BEAGLE (Mattei et al. 2015) and RNAforester (Höchsmann et al. 2003). Our goal is to compute structural similarity scores, and although this is not itself clustering, we will evaluate our similarity scores based on how well they can be applied in clustering and expect as good a performance as iterative clustering approaches. As a baseline, we initially applied GraphClust2 (Miladi et al. 2019), which is an iterative clustering approach that utilizes Infernal (Nawrocki and Eddy 2013), LocARNA (Will et al. 2012), and structure prediction to cluster RNA sequences. Although the clustering performance of GraphClust2 improves after each iteration, it did not achieve perfect clustering after three iterations (Supplemental Table 1). When a GraphClust2 cluster is formed, it tends to cluster all RNAs from a specific family, without overlap. However, for some family types, it does not identify that they should be grouped, and all RNAs of that family are left unclustered (Supplemental Table 1).

Our clustering approach to evaluate the performance of structure similarity scores was performed using scikit-learn affinity propagation (Jancey 1966; Wu et al. 2007; Pedregosa et al. 2011), which represents each data point as a node in a network and iteratively refines the clustering by sending messages between nodes until an optimal set of exemplars and clusters is determined (Frey and Dueck 2007). Because affinity propagation uses an affinity matrix for clustering, rather than a distance matrix, it was chosen as a means of clustering our structure similarity scores. This method relies on simple formulas to find the minima of an energy function, updating messages that convey the affinity of a data point for selecting another data point as its exemplar (Frey and Dueck 2007). This allowed us to compute the clustering using the affinity scores determined from each method, without converting them into distances. Similarity scores were generated for each data set using each of the three comparison methods and min–max normalized to prevent negative values for the clustering approach.
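The min–max normalization step can be sketched as follows. The pairwise scores are invented, the function name is hypothetical, and leaving the diagonal at zero is a placeholder choice for this example. A matrix of this form could then be passed to scikit-learn's `AffinityPropagation` with `affinity="precomputed"`.

```python
def min_max_affinity(pair_scores, n):
    """Build a symmetric n x n affinity matrix with values scaled to [0, 1].

    pair_scores maps an index pair (i, j), i < j, to a raw similarity score.
    """
    lowest = min(pair_scores.values())
    highest = max(pair_scores.values())
    span = highest - lowest
    affinity = [[0.0] * n for _ in range(n)]
    for (i, j), score in pair_scores.items():
        # If all scores are equal, fall back to 0.0 to avoid division by zero.
        scaled = (score - lowest) / span if span else 0.0
        affinity[i][j] = affinity[j][i] = scaled
    return affinity


# Hypothetical raw alignment scores for three structures.
raw = {(0, 1): 40, (0, 2): -10, (1, 2): 15}
for row in min_max_affinity(raw, 3):
    print(row)
# [0.0, 1.0, 0.0]
# [1.0, 0.0, 0.5]
# [0.0, 0.5, 0.0]
```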

To evaluate our clustering results, we collected quantitative affinity silhouette scores using the SNFpy package (Rousseeuw 1987), and qualitative confusion matrices (Townsend 1971; Kohavi 1998) for each method along with the corresponding accuracy metric and Jaccard index. To compute confusion matrices, it is necessary to match the predicted clusters to the RNA subfamilies in the way that results in the highest possible accuracy. Given the number of subfamilies in each benchmark data set, it is slow to identify the subfamily-cluster matches by calculating the accuracy for all possible combinations and choosing the combination that maximizes accuracy. Thus, we generated a cost matrix that could be applied in the linear sum assignment function in SciPy (Crouse 2016; Virtanen et al. 2020). This function generates an optimal assignment between row and column indices by solving the minimization problem $\min \sum_i \sum_j C_{i,j} X_{i,j}$, where C is the cost matrix and X is a Boolean matrix with terms X_{i,j} describing whether cluster i is assigned to subfamily j. Each term C_{i,j} of our cost matrix is the cost of a cluster-subfamily pair, given by the number of structures in cluster i that would be incorrect if the cluster were assigned to subfamily j. The silhouette score depicts the level of separation between clusters, with a high score resulting from easily distinguishable clusters and a low score from clusters that are not easily separable. The silhouette coefficient of a sample is (a − b)/max(a, b), where a is the mean intra-cluster affinity and b is the mean nearest-cluster affinity for each sample (Rousseeuw 1987). Accuracy and silhouette score metrics were used to evaluate clustering performance, along with confusion matrices, which provide visualization of the RNA families in comparison to the clustering results. One-hundred percent accuracy corresponds to a confusion matrix with a perfect diagonal trend.
For each alignment method, scores were generated for all unique combinations of structural alignments within each data set. The affinity silhouette scores, accuracies, Jaccard indices, and confusion matrices were computed to evaluate each method.
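The cluster-to-subfamily matching can be illustrated on a toy example. This sketch builds a cost matrix as the number of cluster members that do not belong to a given subfamily, which is one reading of the description above, and finds the best assignment by brute force over permutations. At benchmark scale, the same minimization is what SciPy's `linear_sum_assignment` solves efficiently. The labels and function names are hypothetical, and the sketch assumes equal numbers of clusters and subfamilies.

```python
from itertools import permutations


def cost_matrix(cluster_ids, family_ids, clusters, families):
    """C[i][j] = number of members of cluster i that are not in family j."""
    C = []
    for cluster in clusters:
        members = [f for c, f in zip(cluster_ids, family_ids) if c == cluster]
        C.append([sum(1 for f in members if f != family) for family in families])
    return C


def best_assignment(C):
    """Brute-force minimum-cost assignment (fine only for tiny examples)."""
    n = len(C)
    return min(permutations(range(n)),
               key=lambda perm: sum(C[i][perm[i]] for i in range(n)))


# Hypothetical labels: six structures, two subfamilies, two predicted clusters.
true_families = ["A", "A", "A", "B", "B", "B"]
predicted_clusters = [0, 0, 1, 1, 1, 1]

C = cost_matrix(predicted_clusters, true_families, [0, 1], ["A", "B"])
assignment = best_assignment(C)
total_cost = sum(C[i][assignment[i]] for i in range(len(C)))
accuracy = 1 - total_cost / len(true_families)

print(C)           # [[0, 2], [3, 1]]
print(assignment)  # (0, 1)
```

Here cluster 0 is matched to subfamily A and cluster 1 to subfamily B, leaving one misplaced structure out of six.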

Generation of a negative control data set

To identify whether bpRNA-align was able to differentiate between true structures and shuffled sequence-based structures, a negative control group was added to the four-segment data set. The negative control was generated by pulling a sequence from each of the groupings/subfamilies in the four-segment data set. These sequences were shuffled using dinucleotide shuffling. The structure for each sequence in the control group was predicted using RNAfold. This resulted in the negative control data set, composed of the four-segment data set along with the new control grouping.

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

ACKNOWLEDGMENTS

This work was supported by National Institutes of Health grant no. R01GM145986.

Footnotes

Freely available online through the RNA Open Access option.

MEET THE FIRST AUTHOR

Brittany Lasher

Meet the First Author(s) is an editorial feature within RNA, in which the first author(s) of research-based papers in each issue have the opportunity to introduce themselves and their work to readers of RNA and the RNA research community. Brittany Lasher is the first author of this paper, “bpRNA-align: improved RNA secondary structure global alignment for comparing and clustering RNA structures.” Brittany is a graduate student at Oregon State University in the laboratory of David Hendrix. Her research focuses on the development of computational methods to better analyze and understand RNA structure.

What are the major results described in your paper and how do they impact this branch of the field?

RNA structure is more highly conserved than primary sequence; therefore, the development of methods to evaluate RNA structure is fundamental to understanding the many roles that RNA plays in biological processes. Many tools exist to predict secondary structure, but few tools exist to compare and quantify how similar two existing structures are. In this work, we have developed bpRNA-align, a method for identifying similarity between RNA structures. This method is based on a customized global structure alignment algorithm and uses the bpRNA structure array, a representation that provides feature information on the nucleotide level. This method takes into consideration the natural length variation of structural features and applies a structural, feature-specific substitution matrix. bpRNA-align demonstrates high performance on five benchmark data sets composed of RNAs from a variety of RNA subfamilies.

What led you to study RNA or this aspect of RNA science?

Coming from a background of computationally studying proteins, I was excited to move into the field of RNA, which poses many challenges due to its inherent structural flexibility. Originally, I focused on investigating rhythmic genes through gene expression data but was drawn toward RNA structure through the analysis of viral genomic RNA. I found RNA to be fascinating and was specifically interested in developing methods for identifying similarity between RNA structures.

During the course of these experiments, were there any surprising results or particular difficulties that altered your thinking and subsequent focus?

One surprising result was the fact that we are using gap penalties opposite to how they are normally used. For proteins, we often use affine gap penalties with a higher “gap open” cost compared to “gap extension.” For RNA we did the opposite, under the reasoning that a single-nucleotide insertion or deletion is often not highly disruptive to structure, or at least less disruptive compared to a large insertion or deletion.

What are some of the landmark moments that provoked your interest in science or your development as a scientist?

I have always been drawn to understanding biological processes and was originally interested in pursuing a career in the medical field. However, as an undergraduate, I was inspired by my chemistry professor, who encouraged me to pursue my degree in chemical engineering. While an undergraduate, I joined a laboratory that introduced me to molecular dynamics simulations and sparked my interest in computational biology. Gaining a deeper understanding of protein function through computational approaches led me to identifying my passion for biochemistry and biophysics, which has guided me to where I am now.

What are your subsequent near- or long-term career plans?

In the immediate future, I plan to apply RNA computational methods to refine the bpRNA-1m RNA structure meta-database. I aim to enhance the current database by updating the pipeline and filtering out spurious structures carried over from source databases. As for long-term plans, I would like to pursue a postdoc in the field of RNA, possibly focused on the development of molecular dynamics simulations or Monte Carlo–based approaches.

REFERENCES

  1. Altschul SF. 1998. Generalized affine gap costs for protein sequence alignment. Proteins 32: 88–96.
  2. Altschul SF, Erickson BW. 1986. Optimal sequence alignment using affine gap costs. Bull Math Biol 48: 603–616. doi:10.1016/S0092-8240(86)90010-8
  3. Andronescu M, Bereg V, Hoos HH, Condon A. 2008. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics 9: 340. doi:10.1186/1471-2105-9-340
  4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The Protein Data Bank. Nucleic Acids Res 28: 235–242. doi:10.1093/nar/28.1.235
  5. Blin G, Denise A, Dulucq S, Herrbach C, Touzet H. 2010. Alignments of RNA structures. IEEE/ACM Trans Comput Biol Bioinform 7: 309–322. doi:10.1109/TCBB.2008.28
  6. Crouse DF. 2016. On implementing 2D rectangular assignment algorithms. IEEE Trans Aerosp Electron Syst 52: 1679–1696. doi:10.1109/TAES.2016.140952
  7. Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. 2018. BPRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res 46: 5381–5394. doi:10.1093/nar/gky285
  8. Dayhoff MO, Schwartz RM, Orcutt BC. 1978. A model of evolutionary change in proteins. In Atlas of protein sequence and structure (ed. Dayhoff MO), Chap. 22, pp. 345–352. National Biomedical Research Foundation, Washington DC.
  9. Eddy S. 2005. INFERNAL user's guide: sequence analysis using profiles of RNA secondary structure consensus. http://eddylab.org/infernal/
  10. Frey BJ, Dueck D. 2007. Clustering by passing messages between data points. Science 315: 972–976. doi:10.1126/science.1136800
  11. Gautheret D, Damberger SH, Gutell RR. 1995. Identification of base-triples in RNA using comparative sequence analysis. J Mol Biol 248: 27–43. doi:10.1006/jmbi.1995.0200
  12. Gutell RR, Power A, Hertz GZ, Putz EJ, Stormo GD. 1992. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res 20: 5785–5795. doi:10.1093/nar/20.21.5785
  13. Henikoff JG, Henikoff S. 1996. Using substitution probabilities to improve position-specific scoring matrices. Bioinformatics 12: 135–143. doi:10.1093/bioinformatics/12.2.135
  14. Höchsmann M. 2005. “The tree alignment model: algorithms, implementations and applications for the analysis of RNA secondary structures.” PhD thesis, University of Bielefeld.
  15. Höchsmann M, Töller T, Giegerich R, Kurtz S. 2003. Local similarity in RNA secondary structures. In Proceedings of the 2003 IEEE bioinformatics conference, CSB 2003, pp. 159–168. Institute of Electrical and Electronics Engineers, New York.
  16. Jancey R. 1966. Multidimensional group analysis. Aust J Bot 14: 127–130. doi:10.1071/BT9660127
  17. Kohavi R. 1998. Glossary of terms special issue on applications of machine learning and the knowledge discovery process. Mach Learn 30: 271–274. doi:10.1023/A:1017181826899
  18. Li P, Zhou X, Xu K, Zhang QC. 2021. RASP: an atlas of transcriptome-wide RNA secondary structure probing data. Nucleic Acids Res 49: D183–D191. doi:10.1093/nar/gkaa880
  19. Marchanka A, Simon B, Althoff-Ospelt G, Carlomagno T. 2015. RNA structure determination by solid-state NMR spectroscopy. Nat Commun 6: 7024. doi:10.1038/ncomms8024
  20. Mattei E, Ausiello G, Ferrè F, Helmer-Citterich M. 2014. A novel approach to represent and compare RNA secondary structures. Nucleic Acids Res 42: 6146–6157. doi:10.1093/nar/gku283
  21. Mattei E, Pietrosanto M, Ferrè F, Helmer-Citterich M. 2015. Web-Beagle: a web server for the alignment of RNA secondary structures. Nucleic Acids Res 43: W493–W497. doi:10.1093/nar/gkv489
  22. Miladi M, Sokhoyan E, Houwaart T, Heyne S, Costa F, Grüning B, Backofen R. 2019. GraphClust2: annotation and discovery of structured RNAs with scalable and accessible integrative clustering. Gigascience 8: giz150. doi:10.1093/gigascience/giz150
  23. Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29: 2933–2935. doi:10.1093/bioinformatics/btt509
  24. Needleman SB, Wunsch CD. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48: 443–453. doi:10.1016/0022-2836(70)90057-4
  25. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, et al. 2011. Scikit-learn: machine learning in Python. J Mach Learn Res 12: 2825–2830.
  26. Rousseeuw PJ. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20: 53–65. doi:10.1016/0377-0427(87)90125-7
  27. Schirmer S, Giegerich R. 2011. Forest alignment with affine gaps and anchors. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and lecture notes in bioinformatics) (ed. Giancarlo R, Manzini G), pp. 104–117. Springer, Berlin, Heidelberg.
  28. Shang L, Xu W, Ozer S, Gutell RR. 2012. Structural constraints identified with covariation analysis in ribosomal RNA. PLoS ONE 7: e39383. doi:10.1371/journal.pone.0039383
  29. Shapiro BA. 1988. An algorithm for comparing multiple RNA secondary structures. Bioinformatics 4: 387–393. doi:10.1093/bioinformatics/4.3.387
  30. Svoboda P, di Cara A. 2006. Hairpin RNA: a secondary structure of primary importance. Cell Mol Life Sci 63: 901–918. doi:10.1007/s00018-005-5558-5
  31. Terayama K, Yamashita T, Oguchi T, Tsuda K. 2018. Fine-grained optimization method for crystal structure prediction. NPJ Comput Mater 4: 1–8. doi:10.1038/s41524-018-0090-y
  32. Thompson JD. 1995. Introducing variable gap penalties to sequence alignment in linear space. CABIOS 11: 181–186. doi:10.1093/bioinformatics/11.2.181
  33. Townsend JT. 1971. Theoretical analysis of an alphabetic confusion matrix. Percept Psychophys 9: 40–50. doi:10.3758/BF03213026
  34. Vendramin L, Campello RJGB, Hruschka ER. 2009. On the comparison of relative clustering validity criteria. In SIAM international conference on data mining, Sparks, Nevada, pp. 733–744. doi:10.1137/1.9781611972795.63
  35. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17: 261–272. doi:10.1038/s41592-019-0686-2
  36. Wang Y, Wu H, Cai Y. 2018. A benchmark study of sequence alignment methods for protein clustering. BMC Bioinformatics 19: 95–104. doi:10.1186/s12859-018-2088-3
  37. Westhof E, Dumas P, Moras D. 1988. Restrained refinement of two crystalline forms of yeast aspartic acid and phenylalanine transfer RNA crystals. Acta Crystallogr A 44: 112–123. doi:10.1107/S010876738700446X
  38. Will S, Joshi T, Hofacker IL, Stadler PF, Backofen R. 2012. LocARNA-P: accurate boundary prediction and improved detection of structural RNAs. RNA 18: 900. doi:10.1261/rna.029041.111
  39. Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, et al. 2007. Top 10 algorithms in data mining. Knowl Inf Syst 14: 1–37. doi:10.1007/s10115-007-0114-2
  40. Zachariah MA, Crooks GE, Holbrook SR, Brenner SE. 2005. A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins 58: 329–338. doi:10.1002/prot.20299
  41. Zhang K, Li S, Kappel K, Pintilie G, Su Z, Mou T-C, Schmid MF, Das R, Chiu W. 2019. Cryo-EM structure of a 40 kDa SAM-IV riboswitch RNA at 3.7 Å resolution. Nat Commun 10: 5511. doi:10.1038/s41467-019-13494-7

Articles from RNA are provided here courtesy of The RNA Society
