Abstract
Motivation
Simultaneous alignment and folding (SA&F) of RNAs is the indispensable gold standard for inferring the structure of non-coding RNAs and their general analysis. The original algorithm, proposed by Sankoff, solves the theoretical problem exactly with a complexity of in the full energy model. Over the last two decades, several variants and improvements of the Sankoff algorithm have been proposed to reduce its extreme complexity by proposing simplified energy models or imposing restrictions on the predicted alignments.
Results
Here, we introduce a novel variant of Sankoff’s algorithm that reconciles the simplifications of PMcomp, namely moving from the full energy model to a simpler base pair-based model, with the accuracy of the loop-based full energy model. Instead of estimating pseudo-energies from unconditional base pair probabilities, our model calculates energies from conditional base pair probabilities that allow to accurately capture structure probabilities, which obey a conditional dependency. This model gives rise to the fast and highly accurate novel algorithm Pankov (Probabilistic Sankoff-like simultaneous alignment and folding of RNAs inspired by Markov chains).
Conclusions
Pankov benefits from the speed-up of excluding unreliable base-pairing without compromising the loop-based free energy model of the Sankoff’s algorithm. We show that Pankov outperforms its predecessors LocARNA and SPARSE in folding quality and is faster than LocARNA.
Keywords: RNA secondary structure, Alignment and folding of RNAs, Structural bioinformatics
Background
In all forms of life, RNAs play essential roles that go beyond coding as messenger RNAs for the synthesis of proteins. Non-coding RNAs (ncRNAs) directly regulate cellular mechanisms, where some are known to be conserved for billions of years [1]. ncRNAs often have only weak sequence conservation, since their (conserved) structure crucially determines their function. Therefore, inferring the conserved secondary structure of homologs—most often, based on RNA alignments, is central for the discovery and annotation of functional RNAs.
RNA structural alignment algorithms can be classified depending on whether they fold and align simultaneously or in turn [2]. The gold standard for computing reliable alignments (and common structures) of RNAs is still the simultaneous algorithm proposed by Sankoff in 1985 [3]. By simultaneously aligning and folding the RNAs, it resolves the vicious cycle that reliable RNA alignments must consider RNA structures (especially for RNAs of medium to low sequence identity [4]), while computational structure prediction is typically unreliable without comparative information. For a pair of RNA sequences, the algorithm finds the optimal alignment and two compatible secondary structures that minimize the total of sequence-alignment distance score and the free energies of the predicted structures. With a run-time complexity of and for memory, the method requires extreme computational resources, such that its application is largely restricted to small instances and data sets. Efficient alignment algorithms are needed for the multiple alignments of the clusters that can be obtained from large scale clustering of data from high-throughput sequencing experiments [5].
Thus, several approaches have adapted Sankoff’s original algorithm to reduce its computational costs. Two main lines of the variants can be distinguished. Methods like Dynalign [6] and FoldAlign [7] reduce the computational complexity by heuristic restrictions, e.g. introducing banding strategies or limiting the maximum size of the comparable sub-structures. Since such methods need to perform expensive energy computations in the nearest neighbor model [8], their applications are still limited without considering heuristic restrictions that in turn could compromise their accuracy.
A highly viable alternative was proposed with the PMcomp algorithm [9], which replaces the nearest neighbor energy model with a still accurate probabilistic model. This model allows to drastically simplify the algorithms, which strongly reduces the computational overhead and supports further algorithmic optimizations in PMcomp’s successors. For reducing the overhead, PMcomp-type algorithms evaluate structures based on base pair probabilities, which are precomputed by McCaskill’s algorithm [10] instead of calculating nearest neighbor energy terms. Moreover, PMcomp-type algorithms such as LocARNA and similar approaches with a probabilistic energy model [11–14] further speed up by reducing the search space. To this end, LocARNA considers only base-pairs that pass a defined probability threshold. This sparsification improves over PMcomp’s complexity of time and space, each, by a quadratic factor (resulting in the time and space requirements of LocARNA).
With the algorithm SPARSE [15], we introduced a second level of sparsification using loop-closing aware recursions to filter based on the joint probability of base-pairs and associated loop-closing base-pairs. This sparsification reduces the time complexity of the alignment algorithm to , starting from the precomputed probabilities. The joint probabilities were computed based on our extension of the McCaskill’s algorithm [16].
We emphasize that all these methods, starting with the original Sankoff algorithm, consider only non-crossing structures, even if crossing base pairs occur relatively frequently in physical structures [17]. This is generally justified, since most secondary structures are dominated by non-crossing base pairs; in turn, the limitation to non-crossing structures allows dynamic programming techniques, which are far more efficient and flexible than comparable techniques that consider (even limited forms of) base pair crossings.
In this work, we utilize the joint probabilities in a novel way—not only for strong sparsification as in SPARSE but as well to evaluate RNA structure more accurately in a PMcomp-type algorithm. We start with showing that joint (or equivalently, conditional) probabilities allow to precisely capture structure probabilities in the nearest neighbor energy model. This corresponds to the exact capture of the nearest neighbor energies themselves. Remarkably, while previous work discussed only the stacking base-pair helices [18], we cope with all loop types. Based on the exact model, we suggest careful simplifications, that allow incorporation of the model into alignment and folding (in the variant of the SPARSE algorithm). Based on the novel precise probabilistic model, we propose the novel algorithm Pankov with time complexity. As fundamental novelty over its predecessors, it applies an accurate full-loop energy model for evaluating the structures.
Performing an established benchmark, we show that Pankov is in practice faster than LocARNA and significantly improves structure prediction over both SPARSE and LocARNA. Compared to SPARSE, it even improves the sequence alignment quality.
Methods
Preliminaries
Basic notations
An RNA sequence A is a string over the alphabet . A base pair a of A is a pair such that the respective sequence positions are complementary, i.e. AU, GC or GU. A non-crossing RNA structure of A, in the following called structure, is a set of base pairs, where each two different base pairs (i, j) and of do not cross, i.e. , and do not share any end, i.e. and are pairwise different. To treat the external bases pairs of an RNA structure, we introduce a pseudo-base-pair , which formally encloses all base pairs of A.
Tree structure
A nested RNA secondary structure S can be represented as a rooted structure tree, exemplified in Fig. 1a, b, where base-pairs are encoded as nodes and the enclosing base-pairs are the parents of the directly enclosed base-pairs. The function provides the set of children base pairs that are directly enclosed by a given base pair a. Thus, the cardinality of is zero for hairpin loops ( in Fig. 1), at least two for multi-loops () and one otherwise (), which represents stackings, bulges or interior loops. Furthermore, the pseudo base pair recursively encloses all base pairs of any structure that can be formed by A.
Energy and probability of an RNA structure
The energy of a structure S can be estimated using the nearest neighbor energy model, which is based on a loop-decomposition of the structure, where a loop is defined as the substructure defined by a base pair a and its enclosed base pairs . The energy model provides contributions for such loops, which are summed up, i.e.
1 |
By assuming a Boltzmann distribution of the structures based on the principles of statistical mechanics, there is a bijection between energies and probabilities of structures. Thus, the probability P(S) of the structure S is related to its energy E(S) by
2 |
3 |
based on its loops’ Boltzmann weights (Eq. 3) and the partition function , which is efficiently computed by McCaskill’s algorithm [10]. If inverted, this allows transforming structure probabilities back to energies:
4 |
where denotes the ensemble energy. This notation can be further used to derive the common definition of the probability of a base pair a P(a) as
5 |
which can be efficiently computed by McCaskill’s algorithm.
PMcomp assumes independence of base-pair probabilities
PMcomp alignment and folding score
The alignment and folding score of PMcomp [9], which is assigned to an alignment and RNA structures and , can be formulated as
The first two terms define the structural component of the score that is discussed in the next section. The last two terms define the sequence component of the score. is the base similarity for two matched sequence positions and from sequence A and B, resp., and is the gap penalty. is the number of insertions and deletions in .
PMcomp structure scoring model
Here, we focus on the structure component of PMcomp’s alignment and folding score, since we want to investigate how well the probabilistic model reflects the thermodynamic energy model. The score of a structure S of sequence A normalizes and sums up the log-scores of its base pair probabilities, i.e.
6 |
Here, base pair probabilities P(a) are normalized via , the minimal probability of a significant base-pair, such that less probable base-pairs are unfavored. The logarithm transfers the probabilistic model back to the energy space similar to Eq. 4.
Putting the normalization term aside, the PMcomp’s structure score contains a notion of structure probability (see first log term on the right of Eq. 6), which we denote with
7 |
Noticeably, base pairing events are assumed to be independent, which is in violation with the underlying Nearest-Neighbor energy model, as we will show next. Thus, PMcomp’s structure score does not relate well with the energy of the respective structure.
Exact computation of structure probabilities based on conditional loop probabilities
Here we prove that the equilibrium probability of a structure within the ensemble of possible structures can be expressed exactly based on conditional loop probabilities. This provides the theoretical foundation for discussing the Pankov energy model in the subsequent section.
Theorem 1
Let P(S) be the probability of structure S and have as the conditional probability of the loop in S closed by base-pair a, the following equation holds:
8 |
Proof
The free energy of the secondary structure S is composed of its loop energies in the nearest-neighbor thermodynamic model (Eq. 1), which implies that its probability can be computed from the respective Boltzmann weights of loops (Eq. 2). Decomposing the right term of Eq. 8 by the partition function inside the base-pairs a, i.e.
9 |
we get:
10 |
(): Every arc but occurs exactly once as a child in the numerator product.
(): encloses all possible structures, thus .
Eventually, this work extends and generalizes the approach for canonical helices (only stackings) from [18] to arbitrary secondary structures.
Pankov structure scoring model
Here, we want to score structures for simultaneous alignment and folding based on the derived Eq. 8, i.e. based on pre-computed conditional loop probabilities, to better reflect the structures’ energy within the overall alignment score. Due to the exponential number of possible multi-loop branchings, a polynomial pre-computation and storing of respective multi-loop probabilities is not feasible. Thus, we propose an approximation of such terms based on pair-in-loop probabilities introduced next.
Approximate loop probabilities using pair-in-loop probabilities
To handle arbitrary multi-loops (closed by a) with any number and composition of children base pairs , we restrict computation and storage to all parent-child pairs , i.e. we define the pair-in-loop probability , in the following abbreviated as in-loop probabilities, as
11 |
where we calculate the joint probabilities of the form by an extension of McCaskill’s algorithm introduced in [16], which can be performed in time. To this end, the pair-in-loop information is stored in additional matrices during the partition function computation without increasing the computational complexity. The probability of a base-pair being external, i.e. being enclosed by the pseudo-arc , is also computed and stored. Within the following, we discuss how to approximate loop probabilities from in-loop probabilities.
Pair-in-loop approximation of non-branching-loop probabilities
When using pair-in-loop probabilities to approximate non-branching loop probabilities, i.e. loops with exactly one child base pair, the latter is overestimated since Eq. 11 does not distinguish the loop context of the pair. Therefore, also multi-loops contribute to in-loop probabilities such that it follows
12 |
Scoring based on least stable multi-loop branch
An alternative approximation would be to assign the least probable branch to the whole multi-loop. This can be intuited in the energy space as that least stable branching of the multi-loop dominants formation of the multi-loop.
13 |
Due to the same reasons as for non-branching loops, the following relation holds, i.e.
14 |
Assuming multi-loop-branch independence.
As a straightforward approach we assume an independence between the multi-loop branches that are conditioned to be closed under the same base-pair, i.e.
15 |
Weighted overall structure scores
For scoring the structure in the implementation of the Pankov alignment algorithm, we assign the ultimate scoring based on the product of multi-loop contributions following Eq. 15, that we designate as the Pankov’s probability of structure.
16 |
Similar to PMcomp, the probabilities are incorporated on the logarithmic scale via the function, which also includes a normalization via the bonus term . The latter term subsequently balances structure and sequence contributions within the alignment scoring.
17 |
Pankov alignment approach
The Pankov algorithm keeps track of closing-loop base-pairs in an efficient manner during dynamic programming computation of the score matrices. The matrices are defined similar to the dynamic programming recursions of the SPARSE algorithm [15], which achieve a quadratic time complexity for the alignment by exploiting the sparsity of the in-loop probabilities. Thus, Pankov uses the following matrices:
for the scores of matching the two base pairs and and aligning the two enclosed subsequences;
for storing the maximum score of all possible alignments and foldings of the subsequences and that are under the loops enclosed by a and b; and
and for supporting variability in the helix size via deletion and insertion of base-pairs under the loops closed by a and b respectively.
The matrix entries can be calculated recursively:
Here is the probability threshold and is the corresponding threshold for the joint in-loop probabilities. Following PMcomp’s energy model, SPARSE calculates as (independent of the enclosing base pair a)—in Pankov we make use of the full flexibility.
The asymptotic time complexity of the Pankov alignment algorithm is . Similar to SPARSE, the base pairing and joint in-loop probabilities need to be computed only once for each sequence. This is important for computing multiple alignments, using a pairwise aligner like Pankov in a progressive scheme (e.g. via mlocarna tool from the LocARNA software package). The Pankov implementation of the recursions keeps track of both ends of the base-pairs, while in SPARSE the tracking is relaxed and M matrices with common left-ends are combined for further speed-up, although the complexity is not affected. So in practice, compared to SPARSE, Pankov’s run-time is slightly increased.
Alignment with domain insertion and deletion
The recursions of Pankov can be extended to allow for the deletion of an entire branch of one of the two predicted structures, which is then aligned to a gap in the other structure. This is similar to the Dynalign-II support for the insertion/deletion of domains [19]. Below, the additional two cases of Pankov’s M-recursion are shown.
The cost of deleting/inserting both ends of the enclosing base pairs of the domain, or , is . The extended D matrix entries, i.e. and , are initialized with the cost for the deletion/insertion of the enclosed domain. The maximum allowed size of a deleted domain is limited by the parameter , since arbitrary large domain indels are unlikely. This keeps the run-time increase moderate and allows to turn off the feature of domain insertion/deletion by .
Results and discussion
Evaluation of the probabilistic energy models
The evaluation procedure
We evaluated the agreement between the reference and the probabilistic free energy models. Having the Turner [8] nearest-neighbor full energy model as the reference, we compared the performance of PMcomp’s base-pair independence model and Pankov’s loop-based conditional probability model. The Sankoff-like algorithms maximize (or minimize) the sum of structure and sequence alignment scores over the space of possible formations of alignments and structure. Hence, a higher correlation between the calculated energies of a model with the reference free energy values indicates better modeling of the structure score, that is expected to perform better for the task of RNA simultaneous alignment and folding.
We developed an evaluation procedure to measure the level of agreement between the probabilistic models and the reference energy model with correlation coefficients. The procedure performs these steps for an input RNA sequence: (i) suboptimal secondary structures are generated using RNAsubopt method [20], for the range of the minimum free energy structure up to 5kcal/mol (-e=5 -s) and 500 suboptimals. (ii) for each suboptimal structure, the probability or free energy is calculated according to the described energy models and structure scores (iii) Over the set of suboptimal structures, the Spearman’s rank correlation coefficient between the reference RNAsubopt’s free energies and the free energies/scores of the models are computed.
The five approximation variants of the probabilistic models are evaluated which compute the probability or score of a structure. More precisely, these variants are computed according to the equations (Eq. 7), (Eq. 6), using (Eq. 13), using (Eq. 15) and using and terms (Eq. 17). The structure probabilities are transferred to the energy dimension according to Eq. 4. For consistency in the representation, the scores were also scaled with a similar scheme. As a monotonic function, the energy scale transformation does not affect the absolute rank correlations in neither of the cases.
The minimal significance probability in Eq. 6 was set to , this is used for the PMcomp score scheme of LocARNA. The bonus balance term from Eq. 17 was set to 1.5, a bonus of zero makes the rank correlation of same as the rank correlations of using . However, a non-zero value is needed to balance the total alignment score by appropriately shifting the energy. The base-pair and conditional in-loop probabilities were computed using the extended implementation of McCaskill’s algorithm (see methods). The run-time complexity for calculating these values for a structure, using the precomputed matrices, are linear to the number of base-pairs in structure (O(|S|)) as well as the length of the sequence.
Evaluation of real ncRNAs
The described evaluation procedure was repeated for the set of RNA sequences obtained from the RNAstrand database (sequence length [30–200] nucleotides, entries without ambiguous or spurious characters). Figure 2 shows the distribution of the rank correlation evaluation of the sequences.
To inspect the model agreements in details, we visualized the output on a sample tRNA transcript of RNAstrand (ID: SPR_00633) in Fig. 3. An evaluation for the top 500 suboptimal structures of lowest free energies is shown. As can be seen in Fig. 3, the Pankov’s in-loop-probability-based models (bottom row of Fig. 3) perform best in preserving the reference free energy ranks. It is also notable that the energy scaled values are precisely scaling back to the range of reference free energies.
The scatter plot for Pankov energy (ML-min) in Fig. 3 confirms the relation for the probability of multiloops that was presented in the methods section (Eq. 14), with a approximation is bounded by the exact probability of the structure such that the approximated energies are always less than or equal to the reference energies.
Alignment performance evaluation
We evaluated our implementation of Pankov alignment algorithm on the pairwise alignment benchmark set, Bralibase 2.1 [21]. For the evaluated methods, the sparsification probability thresholds were set similar. Namely, LocARNA, SPARSE and Pankov with the minimum base-pair probability to 0.001 (option-p). For SPARSE and Pankov the in-loop probability threshold was set to 0.0001 (option–prob-basepair-in-loop). The sequence-structure balance term of Pankov’s score (Eq. 17) was set to 1.5, the chosen among the values 1, 1.5, 2 and 2.5 posing a fair balance for the average of sequence and structure scores. The Matthews Correlation Coefficient (MCC) performance was stable for ’s value of 1.5 and larger values (Additional file 1: Figure S1).
Figure 4a, b show the performance comparison in term of sequence alignment quality sum-of-pairs-score (SPS) and structure prediction quality by Matthews Correlation Coefficient (MCC) [22]. To mediate the Bralibase curve “dent” effect [23], the visualization was done for sequence pairs of sequence identity (SI) between 30 to 80% to avoid a curve dent around SI-80 that seems to be mainly caused by enforcing a continuous curve over a quasi-heterogeneous distribution. Entries with a higher SI are not of particular interest, as they mostly perform fine also using the structure-unaware alignment algorithms. Furthermore, the dashed curves in Fig. 4 correspond to the subset of the benchmark by excluding the three ribosomal and tRNAs families. These families are shown to be moderately overrepresented in the Braliabase and could overweight in the overall performance, especially on lower sequence identity range [23].
In the aspect of execution time, Pankov is overall faster than LocARNA and slower than SPARSE since Pankov implements the exact loop-closing track of the alignment recursions (see methods). Our implementation of Pankov had an average run time of 1.8 seconds on Bralibase K2 instances, when running on a AMD Opteron 2.1 GHz processor; this compares to respective average run times of 3.9 and 0.7 seconds of LocARNA and SPARSE. As can be seen in Fig. 4, Pankov considerably improves structure prediction over both SPARSE and LocARNA. Compared to its predecessor SPARSE, it even improves the sequence alignment quality.
Family-wise analysis
We further compared the performance of SPARSE and Pankov in a family-wise manner, to find out which families benefit most from the incorporation of a more accurate energy model. Fig. 5 illustrates the difference () between quality metrics of SPARSE and Pankov which are averaged per family for Bralibase K2 entries with an SI less than 80%. The majority of families gain in the term of structure prediction from Pankov’s accurate energy model. Furthermore, IRES_HCV and the riboswitch families yybP-ykoY, Cobalamin and glycine (gcvT) have an improved sequence alignment quality (SPS) as well. A couple of families performed worse on average in structure prediction metric using the more accurate energy model of Pankov. Overall these families tend to be short and accommodate simple non-branching structures, according to Rfam’s reference structures.
The effect of domain indel feature
The possibility of deleting or inserting an entire domain (i.e. a branch of the structure tree) is beyond the original Sankoff’s algorithm. In Sankoff’s model the structures are constrained to have the same branching topology. The Pankov extended version with domain indel support () had about 80% increase of the runtime. On the other side, the average alignment quality metrics (both SPS and MCC) are mainly improved or remained unchanged (Additional file 1: Figure S2).
Conclusion
Sankoff’s algorithm is the reference standard for simultaneous alignment and folding (SA&F) of RNAs. While the theoretical work integrates the full loop-based nearest-neighbor energy model, the derived algorithms mostly implement a simplified or limited structure energy models or restrict on the alignment and structure formation possibilities, to reduce the high computational complexity. PMcomp proposed a probabilistic light-weight energy model. This empowers the PMcomp-like methods to strongly reduce the computational overhead of the exact thermodynamic folding details and allows further algorithmic optimizations and sparsification based on the equilibrium probability of the base-pairs.
Here we showed that PMcomp’s energy model assumes a level of independence between the base-pairing events, which violates the underlying nearest neighbor energy model. To solve this issue, we demonstrated an exact way to compute the probability of an RNA secondary structure from the decomposed loop-probabilities. To circumvent the computational complexity of multi-loop cases, we introduced an energy model to accurately approximate this loop decomposition in an efficient way using the precomputed in-loop probabilities. Our proposed energy model takes care of the nearest-neighbor thermodynamic rule. It was further empirically validated that the novel model has a much closer agreement with the full-loop energy model, based on the dataset of real non-coding RNAs. Using this energy model, we proposed the Pankov algorithm for pairwise simultaneous alignment and folding of RNAs. Benchmark results show that the implementation of Pankov outperforms its predecessors on predicting the secondary structure from the pairs of homologous RNAs.
The concept of conditional and joint in-loop probabilities has some parallels to the production rules of Stochastic Context-Free Grammars (SCFG) that can encode base-pair relations differently [24], they have also been used to solve the SA&F problem [25]. The overhead of treating various nucleotides separately during the alignment procedure is similar to the invocation of the full loop-based energy model, which restrains the implementation towards using simplified grammars that may not benefit from the power of thermodynamic rules. In our proposed model, the probabilistic terms are obtained from the thermodynamic partition function, so the probabilistic transition rules are straightforward and do not need to deal with individual types and combinations of nucleotides separately.
Pankov, to the best of authors’ knowledge, is the first SA&F method that dissociates the loop computation details from the alignment and prediction step to efficiently solve the target problem without substantially compromising the power of underlying thermodynamic rules.
Supplementary information
Acknowledgements
The authors would like to thank the anonymous reviewers. MM and MR are supported by the German Research Foundation (DFG BA2168/14-1 and BA2168/16-1). SW is supported by the Austrian Science Fund (FWF I 2874-N28). RB is supported by the German Research Foundation (DFG) under Germany’s Excellence Strategy (CIBSS - EXC-2189 - Project ID 390939984). The article processing charge was funded by the German Research Foundation (DFG) and the University of Freiburg in the Open-Access Publishing funding programme.
Abbreviations
- MCC
Matthews Correlation Coefficient
- SA&F
Simultaneous alignment and folding
- SI
Sequence identity
- SPS
Sum-of-pairs score
Authors' contributions
MM proposed, designed and implemented the Pankov algorithm and the evaluations. MM, MR and SW participated in establishing the theoretical results. MM, MR, SW and RB participated in writing the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Availability of data and materials
Pankov is developed as a branch of the LocARNA package and available at https://github.com/BackofenLab/Pankov. The data and analysis of the correlation analysis (Figs. 2 and 3) are provided within the Github repository.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Milad Miladi, Email: miladim@cs.uni-freiburg.de.
Martin Raden, Email: mmann@informatik.uni-freiburg.de.
Sebastian Will, Email: will@tbi.univie.ac.at.
Rolf Backofen, Email: backofen@informatik.uni-freiburg.de.
Supplementary information
Supplementary information accompanies this paper at 10.1186/s13015-020-00179-w.
References
- 1.Cech TR, Steitz JA. The noncoding RNA revolution-trashing old rules to forge new ones. Cell. 2014;157(1):77–94. doi: 10.1016/j.cell.2014.03.008. [DOI] [PubMed] [Google Scholar]
- 2.Seetin MG, Mathews DH. RNA structure prediction: an overview of methods. In: Bacterial Regulatory RNA, editor. Totowa: Humana Press; 2012. p. 99–122. [DOI] [PubMed]
- 3.Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math. 1985;45(5):810–825. doi: 10.1137/0145048. [DOI] [Google Scholar]
- 4.Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005;33(8):2433–9. doi: 10.1093/nar/gki541. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Miladi M, Junge A, Costa F, Seemann SE, Havgaard JH, Gorodkin J, et al. RNAscClust: clustering RNA sequences using structure conservation and graph based motifs. Bioinformatics. 2017;33(14):2089–2096. doi: 10.1093/bioinformatics/btx114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002;317(2):191–203. doi: 10.1006/jmbi.2001.5351. [DOI] [PubMed] [Google Scholar]
- 7.Havgaard JH, Lyngsø RB, Stormo GD, Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40% Bioinformatics. 2005;21(9):1815–1824. doi: 10.1093/bioinformatics/bti279. [DOI] [PubMed] [Google Scholar]
- 8.Mathews D, Sabina J, Zuker M, Turner D. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 1999;288(5):911–40. doi: 10.1006/jmbi.1999.2700. [DOI] [PubMed] [Google Scholar]
- 9.Hofacker IL, Bernhart SH, Stadler PF. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20(14):2222–7. doi: 10.1093/bioinformatics/bth229. [DOI] [PubMed] [Google Scholar]
- 10.McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990;29(6–7):1105–19. doi: 10.1002/bip.360290621. [DOI] [PubMed] [Google Scholar]
- 11.Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007;3(4):e65. doi: 10.1371/journal.pcbi.0030065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23(8):926–32. doi: 10.1093/bioinformatics/btm049. [DOI] [PubMed] [Google Scholar]
- 13.Harmanci AO, Sharma G, Mathews DH. Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics. 2007;8:130. doi: 10.1186/1471-2105-8-130. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kiryu H, Tabei Y, Kin T, Asai K. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics. 2007;23(13):1588–1598. doi: 10.1093/bioinformatics/btm146. [DOI] [PubMed] [Google Scholar]
- 15.Will S, Otto C, Miladi M, Möhl M, Backofen R. SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics. Bioinformatics. 2015;31(15):2489–2496. doi: 10.1093/bioinformatics/btv185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Otto C, Mohl M, Heyne S, Amit M, Landau GM, Backofen R, et al. ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs. BMC Bioinformatics. 2014;15(1):6602. doi: 10.1186/s12859-014-0404-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 2018;46(11):5381–5394. doi: 10.1093/nar/gky285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Bompfünewerer AF, Backofen R, Bernhart SH, Hertel J, Hofacker IL, Stadler PF, et al. Variations on RNA folding and alignment: lessons from Benasque. J Math Biol. 2008;56(1–2):129–144. doi: 10.1007/s00285-007-0107-5. [DOI] [PubMed] [Google Scholar]
- 19.Fu Y, Sharma G, Mathews DH. Dynalign II: common secondary structure prediction for RNA homologs with domain insertions. Nucleic Acids Res. 2014;42(22):13939–13948. doi: 10.1093/nar/gku1172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Wuchty S, Fontana W, Hofacker IL, Schuster P. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers Original Res Biomol. 1999;49(2):145–165. doi: 10.1002/(SICI)1097-0282(199902)49:2<145::AID-BIP4>3.0.CO;2-G. [DOI] [PubMed] [Google Scholar]
- 21.Wilm A, Mainz I, Steger G. An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol Biol. 2006;1:19. doi: 10.1186/1748-7188-1-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Gorodkin J, Stricklin SL, Stormo GD. Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res. 2001;29(10):2135–2144. doi: 10.1093/nar/29.10.2135. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Löwes B, Chauve C, Ponty Y, Giegerich R. The BRaliBase dent-a tale of benchmark design and interpretation. Brief Bioinformatics. 2016;18(2):306–311. doi: 10.1093/bib/bbw022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Dowell RD, Eddy SR. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics. 2004;5(1):71. doi: 10.1186/1471-2105-5-71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Dowell RD, Eddy SR. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics. 2006;7:400. doi: 10.1186/1471-2105-7-400. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Pankov is developed as a branch of the LocARNA package and available at https://github.com/BackofenLab/Pankov. The data and analysis of the correlation analysis (Figs. 2 and 3) are provided within the Github repository.