Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2014 Nov 21;42(22):13939–13948. doi: 10.1093/nar/gku1172

Dynalign II: common secondary structure prediction for RNA homologs with domain insertions

Yinghan Fu 1,2, Gaurav Sharma 2,3,4,*, David H Mathews 1,2,4,*
PMCID: PMC4267632  PMID: 25416799

Abstract

Homologous non-coding RNAs frequently exhibit domain insertions, where a branch of secondary structure is inserted in a sequence with respect to its homologs. Dynamic programming algorithms for common secondary structure prediction of multiple RNA homologs, however, do not account for these domain insertions. This paper introduces a novel dynamic programming algorithm methodology that explicitly accounts for the possibility of inserted domains when predicting common RNA secondary structures. The algorithm is implemented as Dynalign II, an update to the Dynalign software package for predicting the common secondary structure of two RNA homologs. This update is accomplished with negligible increase in computational cost. Benchmarks on ncRNA families with domain insertions validate the method. Over base pairs occurring in inserted domains, Dynalign II improves accuracy over Dynalign, attaining 80.8% sensitivity (compared with 14.4% for Dynalign) and 91.4% positive predictive value (PPV) for tRNA; 66.5% sensitivity (compared with 38.9% for Dynalign) and 57.0% PPV for RNase P RNA; and 50.1% sensitivity (compared with 24.3% for Dynalign) and 58.5% PPV for SRP RNA. Compared with Dynalign, Dynalign II also exhibits statistically significant improvements in overall sensitivity and PPV. Dynalign II is available as a component of RNAstructure, which can be downloaded from http://rna.urmc.rochester.edu/RNAstructure.html.

INTRODUCTION

In the past three decades, RNA has been studied not just for its role in protein synthesis, but also for its large number of non-coding roles, where RNA directly controls cellular function (16). Because of the biological significance of non-coding RNAs (ncRNAs), the prediction of RNA secondary structure, i.e. the set of canonical base pairs, is now a commonly employed tool for understanding the mechanism of RNA function. Available approaches are categorized and summarized in a number of reviews (79).

The most accurate approach for modeling secondary structure is comparative analysis, by which the conserved structure is inferred using multiple homologs. To date, there is no approach that fully automates comparative analysis. One barrier that prevented automation is the fact that folding domains can often be inserted in one homolog relative to another. An inserted domain is a subsequence inserted in one homolog relative to one or more homologs that forms a substructure with base pairing between nucleotides that are within the inserted subsequence. For example, 9.2% of the base pairs in 60 pairs of sequences drawn from a bacterial type A RNase P RNA alignment (10) are in inserted domains. Other barriers include variation of helix and loop length and base pair opening caused by nucleotide mutations between homologous sequences.

This paper describes a novel technique that allows and accounts for domain insertions in prediction of conserved structures for two unaligned sequences. The technique was developed and demonstrated with Dynalign II, an update of Dynalign (1114), although the principles apply generally to dynamic programming approaches for conserved structure prediction (1521), including free energy minimization algorithms, partition function algorithms or stochastic context-free grammars. Dynalign is a pairwise RNA secondary structure prediction program that implements the Sankoff algorithm (22) for predicting the conserved structure for two unaligned homologous sequences; it has also been extended to multiple sequences with the Multilign algorithm (23) and to simultaneous structure prediction with three sequences (24). The dynamic programming recursions were updated in Dynalign II to account for the ΔG° in inserted domains. In addition to domain insertions, Dynalign II accommodates other types of structural variations, specifically, base pair openings and stem extensions. Base pair openings represent the situation where one of the homologs has an internal loop with nucleotides that align to base paired nucleotides in the other homolog. Stem extension represent the situation where a helix in one homolog includes a larger number of base pairs than the corresponding helix in the other homolog. The updates to Dynalign handle these structural variations with negligible increase in computational cost by using pre-computed values for the ΔG° for inserted domains, obtained from single sequence folding of each homolog.

The developed methodology is validated by benchmarking Dynalign II on ncRNA families that exhibit domain insertions and other structural variations, tRNA, RNase P RNA and SRP RNA. Dynalign II predicts base pairs in inserted domains with better accuracy as compared to Dynalign, and this improvement is statistically significant. Additional tests with 5S rRNA homologs provide evidence that Dynalign II encounters no degradation in performance for ncRNA families that have highly conserved secondary structure with little or no structural variation.

The following section highlights the methodology for allowing domain insertions and other aforementioned structural variations. Evaluation methods for benchmarking the algorithm and parameter selection are also discussed within the same section. Next, in the Results section, benchmarks evaluating Dynalign II for accuracy and computation time are presented. The Discussion section closes the paper with concluding remarks and a summary.

MATERIALS AND METHODS

Common secondary structure prediction by ΔG° minimization

Dynalign takes two sequences as input and simultaneously predicts the conserved pseudoknot-free secondary structure and the structural alignment of the sequences. A total ΔG°:

graphic file with name M33.gif (1)

is minimized, where ΔG°1 and ΔG°2 are the folding ΔG°s of sequence 1 and 2, respectively, for the common structure, ΔG°gap_penalty is the penalty per gap and ngap is the number of gaps in the alignment between the two sequences, where the alignment is constrained to be consistent with the common structure. The ΔG°s are calculated according to the nearest-neighbor thermodynamic model (2527). While these should technically be referred to as predicted ΔG°s, the qualifier ‘predicted’ is dropped for brevity. The original Dynalign algorithm (1114), considers only common structures for which all base pairs in the two homologs are aligned or for which one homolog has single base pairs inserted between two aligned (conserved) base pairs. Therefore, the original Dynalign algorithm does not account for the domain insertions and other structural variations seen in RNA homologs in nature. The same observation holds true for the original Sankoff algorithm and for alternative implementations of the Sankoff algorithm (22). Dynalign II accounts for (the possibility of) domain insertions in one sequence with respect to the other by modifying the total ΔG° that is minimized in the process of predicting common structures to

graphic file with name M34.gif (2)

where i is the index of the ith inserted domain, xi is the length of the ith inserted domain and ΔG°domain_opening and ΔG°domain_elongation are the newly introduced ΔG° penalties for initiation and per nucleotide elongation of inserted domains. This is an affine model for each domain insertion into the alignment. ΔG°domain_opening and ΔG°domain_elongation were optimized on a training data set of known secondary structures (described later) and the value of ΔG°gap_penalty was kept the same as in (11). The terms ΔG°1 and ΔG°2 correspond, as before, to the ΔG°s of the structures for sequence 1 and sequence 2 according to the nearest-neighbor thermodynamic model.

Algorithm

Dynalign II predicts the conserved structure using a dynamic programming algorithm that generalizes the original Dynalign algorithm. In the following discussion, nucleotide positions in each sequence are indexed in 5′ to 3′ order with i and j denoting indices for sequence 1 and k and l denoting indices for sequence 2, with i < j and k < l. The optimal structure of a conserved fragment [i, j, k, l] of the two input sequences, i.e. the substructure i to j in sequence 1 aligned with the substructure k to l in sequence 2, are determined recursively from smaller to larger fragments by the dynamic programming algorithm. This determines the minimum over all possible pseudoknot-free, common secondary structures and over alignments consistent with those structures. Therefore, the algorithm guarantees the optimal structures will be found given the rules that are implemented. As with Dynalign, Dynalign II predicts structural alignments by only aligning nucleotides that are base paired in conserved base pairs. This is because the ΔG°total in Equations (1) and (2) does not include sequence identity, so nucleotides are not aligned in loop regions.

Given two homologous RNA sequences with lengths N1 and N2, Dynalign fills two, 4D arrays of size N1 × N1 × N2 × N2. These arrays are W(i, j, k, l) and V(i, j, k, l), and they represent the ΔG° of putative conserved fragments of the two sequences with different conformational constraints. V(i, j, k, l) stores the minimum ΔG° of fragments [i, j, k, l], where i is base paired with j, k is base paired with l and fragment [i, j] is aligned to fragment [k, l]. W(i, j, k, l) stores the lowest ΔG° of fragments [i, j, k, l], where fragment [i, j] is aligned to fragment [k, l] and these sequence fragments represent potential branches in multibranch loops. In order to fill the arrays, auxiliary 2D arrays are needed, W3(i, k), W5(i, k), W1single(i, j), W2single(k, l), WE1single(i, j) and WE2single(k, l). W3(i, k) and W5(i, k) are fragments at the 3′ and 5′ end of the two sequences, respectively. W5(i, k) stores the minimum ΔG° of fragments [1, i] and [1, k], with no conformational constraints. W3(i, k) represents the minimum ΔG° of fragments [i, N1] and [k, N2], again with no conformational constraints. W1single(i, j), W2single(k, l), WE1single(i, j) and WE2single(k, l) are newly introduced arrays in Dynalign II for implementing domain insertions. W1single(i, j) represents the minimum ΔG° of fragment [i, j] of sequence 1 given nucleotides from i to j are in a branch in a multibranch loop. W2single(k, l) is analogously the minimum ΔG° for fragment [k, l] of sequence 2 given that nucleotides from k to l are in a branch in a multibranch loop. WE1single(i, j) represents the minimum ΔG° of fragment [i, j], where i, j are exterior nucleotides, i.e. there is no base pair i’-j’ where i’ < i < j < j’. WE2single(k, l) is for fragment [k, l], and is the analog to WE1single for sequence 2. These four arrays are all calculated using single sequence ΔG° minimization routines in the RNAstructure package (28).

V(i, j, k, l) and W(i, j, k, l) are filled for both interior and exterior fragments to facilitate the prediction of suboptimal solutions (12). Interior fragments are those that span nucleotides i to j and k to l. Exterior fragments are those that span nucleotides 1 to i, j to N1, 1 to k and l to N2. For conserved structures with base pairs i-j and k-l, the lowest free energy structure possible is the sum of the V array for the interior and exterior fragments.

Overview

The improvements introduced by the Dynalign II algorithm are illustrated using Figures 14 and an abbreviated set of recursions that omit non-essential details. The full set of recursions is available in the Supplementary Materials.

Figure 1.

Figure 1.

Expansion of W(i, j, k, l) to allow domain insertions. (A) and (B) Represent two of the original filling steps of W(i, j, k, l) that are for conserved domains. (C)–(F) Are expanded steps that allow consideration of inserted domains in four different positions: (C) 3′ side of sequence 2, (D) 3′ side of sequence 1, (E) 5′ side of sequence 2 and (F) 5′ side of sequence 1. The black solid lines represent sequences, black dashed lines represent gaps, black arcs represent base pairs and colored brackets are the substructures represented by the array members.

Figure 4.

Figure 4.

Expansion of V(i, j, k, l) allowing stem extension and internal loop aligning with consecutive stacking base pairs. (A) and (B) Represent an internal loop in one sequence aligned with consecutive stacking base pairs in another, where in (A) the internal loop is in sequence 1 and in (B) it is in sequence 2. (C) and (D) Represent the extension of a conserved stem, where in (C) the internal loop, stacking base pair or bulge loop is inserted in sequence 2 and in (D) it is inserted in sequence 1.

To account for domain insertions, the recursions for W(i, j, k, l), V(i, j, k, l), W5(i, k) and W3(j, l) are modified from the original Dynalign algorithm. V(i, j, k, l) is determined as

graphic file with name M35.gif (3)

where penalty(i, j) is the penalty term applied to A-U or G-U base pairs at the end of a helix (25,26). Vhairpin(i, j, k, l) represents the ΔG° of hairpin loops closed by base pairs i-j and k-l. Vinternal/stack(i, j, k, l) represents the minimum ΔG° of the conserved fragment [i, j, k, l], where internal loops, bulge loops or stacking base pairs are closed by base pairs i-j and k-l. Vinternal/stackII(i, j, k, l) accounts for new structural variations incorporated in Dynalign II that include a set of stacking base pairs aligned with an internal loop and insertion of stacking base pairs, internal loops or bulge loops of unlimited length. Vmultibranch(i, j, k, l) represents the minimum ΔG° of the conserved fragment [i, j, k, l], where multibranch loops are closed by base pairs i-j and k-l. Vdomain_insertion(i, j, k, l) represents the minimum ΔG° of the conserved fragment [i, j, k, l], where an inserted domain is formed in a loop closed by base pair i-j in sequence 1 or k-l in sequence 2. W(i, j, k, l) is determined as

graphic file with name M36.gif (4)

Wextend(i, j, k, l) extends substructures shorter by either one or two nucleotides in sequence 1 and/or sequence 2 with unpaired terminal nucleotides. Wbranch(i, j, k, l) considers the formation of a helical branch. Wbifurcation(i, j, k, l) accounts for bifurcation of W(i, j, k, l) so that more than three helical branches can be formed in multibranch loops. Wdomain_insertion(i, j, k, l) represents the formation of inserted domains to W(i, j, k, l).

W5(i, k) is the minimum ΔG° for substructures from nucleotides 1 to i and from 1 to k. W3(i, k) is the minimum ΔG° for substructures i to N1 and k to N2, where N1 and N2 are the lengths of the sequence 1 and sequence 2, respectively:

graphic file with name M37.gif (5)

where the first three terms account for extending shorter W5 fragments with unpaired nucleotides. W5bifurcation(i, k) represents the formation of conserved helical branches at the 3′ end of W5(i, k). W5domain_insertion(i, k) represents the formation of inserted domains at the 3′-end of W5(i, k). The terms for W3(i, k) are analogous, but involve the 3′ ends of the sequences.

Because the minimum ΔG° calculation for longer sequence fragments depends on the minimum ΔG° of shorter fragments, the array locations representing shorter fragments are filled prior to those representing longer fragments, i.e. an array location [i’, j’, k’, l’] is filled before an array location [i, j, k, l] if the fragment [i’, j’, k’, l’] is completely contained in the fragment [i, j, k, l]. After filling the arrays, the minimum ΔG° of the common structure is W5(N1, N2), which is equal to W3(1,1).

Expansion of W(i, j, k, l) and V(i, j, k, l)

In the original Dynalign algorithm, W(i, j, k, l) was the minimum of Wextend(i, j, k, l), Wbranch(i, j, k, l) and Wbifurcation(i, j, k, l). The last two terms are given by:

graphic file with name M38.gif (6)
graphic file with name M39.gif (7)

where Equation (6) represents a single, conserved branch (Figure 1A) with ΔG°helix_terminating_in_MBL being the ΔG° penalty for terminating a helix of a multibranch loop and Equation (7) represents the bifurcation of the conserved domain (Figure 1B). In order to accommodate domain insertion, the calculation of Wdomain_insertion(i, j, k, l) is introduced in Dynalign II as:

graphic file with name M40.gif (8)
graphic file with name M41.gif (9)
graphic file with name M42.gif (10)
graphic file with name M43.gif (11)
graphic file with name M44.gif (12)

where Equations (9)–(12) are illustrated by Figure 1C–F. They represent four possible positions for forming an inserted domain, the 3′ side of sequence 2 (Wdomain_insertion1), the 3′ side of sequence 1 (Wdomain_insertion2), the 5′ side of sequence 1 (Wdomain_insertion3) and the 5′ side of sequence 2 (Wdomain_insertion4). It is important to note that only one variable (k’ or i’) is enumerated for each equation, and this makes the time scaling O(N12 + N22)for calculating Wdomain_insertion(i, j, k, l). This is in contrast to Wbifurcation(i, j, k, l), which requires O(N12N22) time scaling. Therefore, the expansion to account for domain insertion in Dynalign II does not change the time complexity of Dynalign.

In V(i, j, k, l), Vmultibranch(i, j, k, l) is the minimum ΔG° for pairs closing multibranch loops, i.e.

graphic file with name M45.gif (13)

where two conserved domains form inside base pairs i-j and k-l (Figure 2A). ΔG°closure_MBL is the ΔG° penalty for the closure of a multibranch loop. With just this recursion in V (Equation (3)), a base pair has to close either one conserved domain (forming an internal loop/stacking base pair/bulge loop) or multiple conserved domains (forming a multibranch loop). In order to account for the situation where a conserved base pair closes a different number of branches in one homolog compared to another, the calculation of Vdomain_insertion(i, j, k, l) is needed:

graphic file with name M46.gif (14)
graphic file with name M47.gif (15)
graphic file with name M48.gif (16)
graphic file with name M49.gif (17)
graphic file with name M50.gif (18)

Equations (15)–(18) are illustrated in Figure 2B–E. By using W, W1 and W2 arrays in Equations (14)–(18), any change in the number of branching helices is accommodated because these arrays recursively consider any number of branches (see Equation (7), for example). The form of Equations (15)–(18) allow a multibranch loop in one sequence to structurally align with a single-stem loop rather than a second multibranch loop. In that case, the stem loop would be treated as a branch of a multibranch loop in terms of the energy model. This simplification in the energy model is introduced for computational efficiency.

Figure 2.

Figure 2.

Expansion of V(i, j, k, l) to allows domain insertions. (A) represents the step in the original Dynalign algorithm where two conserved domains form inside a conserved base pair. (B)–(E) Illustrate how the modifications in Dynalign II account for potential inserted domains within the conserved base pair of V(i, j, k, l) at four positions: (B) 5′ side of sequence 2, (C) 3′ side of sequence 1, (D) 5′ side of sequence 2 and (E) the 5′ side of sequence 1.

Expansion of W3(i, k) and W5(i, k)

The two terms in W3(i, k) and W5(i, k) arrays exist for adding conserved branches to exterior loops:

graphic file with name M51.gif (19)
graphic file with name M52.gif (20)

where Equation (19) is demonstrated in Figure 3A.

Figure 3.

Figure 3.

Expansion of W5(i, k) to account for domain insertions. (A) represents the recursion in the original Dynalign algorithm where W5(i, k) considers a conserved domain. (B) and (C) Represent the consideration of an inserted domain in W5(i, k) at two positions: (B) 3′ side of sequence 1 and (C) the 3′ side of sequence 2.

Additional terms in the filling of W3(i, k) and W5(i, k) arrays are added to consider a domain insertion in exterior loops, W5domain_insertion(i, k) and W3domain_insertion(i, k):

graphic file with name M53.gif (21)
graphic file with name M54.gif (22)
graphic file with name M55.gif (23)
graphic file with name M56.gif (24)
graphic file with name M57.gif (25)
graphic file with name M58.gif (26)

Equations (22) and (23) are illustrated in Figure 3B and C.

Additional structural variations

In the original Dynalign algorithm, single base pairs could be inserted in one sequence relative to another only if they were flanked by conserved base pairs. In Dynalign II, the model is more flexible. It allows a set of stacking base pairs aligned with an internal loop and unlimited insertion of nucleotides in stacking base pairs, internal loops or bulge loops. In the original Dynalign:

graphic file with name M59.gif (27)

where ΔG°motif (m, n, p, q) represents the ΔG° contributed by a motif, i.e. a base pair stack, internal loop or bulge loop closed by base pairs m-p and n-q from sequences 1 or 2. In Dynalign II, the additional types of structural alignment are realized (shown in Equations (29)-(32) and Figure 4A–D) by adding Vinternal/stackII(i, j, k, l):

graphic file with name M60.gif (28)
graphic file with name M61.gif (29)
graphic file with name M62.gif (30)

where a set of consecutive base pairs aligned with an internal loop, and ΔG°stack (m, m + 1, p, p − 1) represents the ΔG° contributed by stacking base pair m-p and (m + 1) − (p − 1), which is analogous to ΔG°motif (m, m + 1, p, p − 1).

graphic file with name M63.gif (31)
graphic file with name M64.gif (32)

where a motif kl and (k + c) − (ld) or ij and (i + c) − (jd) is inserted in sequence 2 or 1, respectively, with the gap penalty term added for each unaligned nucleotide.

Implementation considerations and computational complexity

The full Dynalign recursions require O(N13 N23) time and O(N12 N22) memory. For typical ncRNA sequence lengths, heuristics for reducing computational time are essential in order to run on current hardware. Dynalign uses an adaptively determined banded constraint on the space of allowable nucleotide alignments. This is based on a hidden Markov model-based estimation of posterior alignment probabilities from the sequences without accounting for structure (13), which requires O(N1N2) time and memory. If the alignment constraints are approximated by a band with width d, i.e. aligned nucleotide indices are no further apart than (d/2), the algorithm reduces to O(N13 d3) time and O(N12 d2) memory (22). In addition to the original Dynalign, Dynalign II requires the precomputation of W1single(i, j), W2single(k, l), WE1single(i, j) and WE2single(k, l), which require O(N3) computation and O(N2) memory for each sequence. These are calculated from single sequence secondary structure predictions on each sequence, which are already performed to reduce the set of base pairs considered when filling the V array. This heuristic, which excludes base pairs that can only be found in relatively high ΔG° structures, was previously demonstrated to accelerate the calculation with no loss of accuracy (14). Thus, the time and memory complexity of Dynalign II remain the same as Dynalign, despite the additional functionality of handling a greater set of structural variations. Experimental benchmarks presented in the Results section demonstrate that, in agreement with the preceding complexity analysis, the practical time and memory requirements of Dynalign II are also almost identical to those for Dynalign.

Evaluation

Two metrics, sensitivity and positive predictive value (PPV), were used to quantify the accuracy of structure predictions for databases of ncRNA families with known secondary structure. Sensitivity is the fraction of known base pairs that are predicted. PPV is the fraction of base pairs predicted that are in the known structure. A predicted base pair i-j is deemed correct if i-j, (i + 1) − j, (i1) − j, i − (j1) or i − (j + 1) base pair is in the known structure (13,25). This convention is adopted for two important reasons. First, base pairs in RNA structures can be dynamic, for example, single nucleotide bulges can migrate to adjacent nucleotides, as has been observed by nuclear magnetic resonance and by thermodynamic measurements (27,2930). Second, comparative sequence analysis, which provides the ‘ground-truth’ for evaluating accuracy of secondary structure predictions, is not able to distinguish the two cases encountered when base pairs are able to migrate in position (31). For completeness, metrics computed under an exact matching requirement are also computed and reported in the Supplementary Materials. The average absolute difference of all the methods for the four families between exact and flexible matching is 0.031. The maximum difference between exact and the flexible matching is 0.05 and does not change the conclusions for the paper.

For a single sequence pair, sensitivity was calculated as the ratio of the correctly predicted to the total number of known base pairs in the structures of the two sequences, and PPV was computed as the ratio of the correctly predicted to the total number of predicted base pairs in the two sequences. Average sensitivity over an ncRNA family was calculated as the ratio of the correctly predicted to the total number of known base pairs in all the sequence pairs for the family. Average PPV over an ncRNA family was similarly computed as the ratio of the correctly predicted to the total number of predicted base pairs across all the sequence pairs for the family.

Sensitivity and PPV were also computed specifically over base pairs in inserted domains for individual ncRNA families, where complete helices and multibranch loops inserted in one sequence compared to the other homolog in the pair were identified as inserted domains. Here, sensitivity was calculated as the ratio of the correctly predicted to the total number of base pairs in the inserted domains, and PPV was computed as the ratio of the correctly predicted base pairs to the total number of base pairs in the predicted inserted domains.

Because the improvement of accuracy on individual sequence pairs can vary greatly, the one-sided paired t-test procedure of Xu et al. (32) was used to test the null hypothesis that the methods offer identical accuracy against the alternative hypothesis that Dynalign II offers higher accuracy. The one-tail P-value was computed to assess statistical significance of the reported improvement in accuracy.

Dynalign II parameters

In addition to the nearest-neighbor thermodynamic parameters, Dynalign II has three additional ΔG° parameters: ΔG°gap_penalty, ΔG°domain_opening and ΔG°domain_elongation. Among these, ΔG°gap_penalty was determined by maximizing prediction accuracy on 5S rRNA sequences in the original Dynalign (11), and was found to be optimal at 0.4 kcal/mol. ΔG°domain_opening and ΔG°domain_elongation were determined for Dynalign II by a 2D grid search for maximizing prediction accuracy over 66 sequence pairs obtained by selecting all possible pairs from a training data set of 12 group I Intron IC1 subgroup sequences selected from a database of structures (33,34). Based on this procedure, the parameters ΔG°domain_opening and ΔG°domain_elongation were set to 0.5 and 0.1 kcal/mol, respectively. At these chosen values, both sensitivity and PPV were the highest over the training data set. Details of the grid search are provided in the Supplementary Materials.

RESULTS

Structure prediction accuracy was benchmarked using four RNA families: tRNA, RNase P RNA, SRP RNA and 5S rRNA. tRNA sequences can contain variable loops that form inserted stem-loop structures. Forty tRNA sequences were randomly drawn from the Sprinzl database (35) without replacement, and all 780 sequence pairs with these sequences were chosen. The tRNA-inserted base pairs were annotated using tRNAscan-SE 1.21 (36) because the Sprinzl database does not annotate the variable loop base pairs. Base pairs in these inserted domains constitute 2.2% of all base pairs in the sequences. Note that 340 RNase P RNA sequences were randomly drawn without replacement from the bacterial type A RNA alignment on the RNase P database (10) to form 170 non-overlapping sequence pairs. Among all the base pairs in the RNase P RNA data set, 10.7% are in inserted domains. A total of 428 SRP RNA sequences were randomly drawn without replacement from the SRP database (37) to form 214 non-overlapping sequence pairs. Among all the SRP base pairs in the data set, 6.7% are in inserted domains. Twenty 5S rRNA sequences were randomly drawn from the 5S rRNA database (38) without replacement and all 190 possible sequence pairs of these sequences were considered. The 5S rRNA family has no known inserted domains and is included in the benchmark as a test for accuracy of Dynalign II on sequence pairs with little structural variation. Statistics about the pairwise sequence identities for each of the four families are provided as Supplementary Table S7. Four methods were run on the benchmark set: Dynalign II, Dynalign II without domain insertion, (original) Dynalign and Fold (a single sequence ΔG° minimization program from RNAstructure (28)). Dynalign II without domain insertion still included the base pair opening and stem extension functionality in order to separately test the improvement offered by each of the generalizations.

The overall accuracy is illustrated in Figure 5. For the RNase P RNA, SRP RNA and tRNA families, the capability to handle domain insertions and the other two structural variations each improve the sensitivity and PPV. For 5S rRNA, the capability to handle domain insertions does not improve sensitivity or PPV, which is expected given that this family does not have inserted domains. In addition, performance was stratified according to pairwise identity of sequence pairs and the results are reported in Supplementary Table S8.

Figure 5.

Figure 5.

Overall structure prediction accuracy for secondary structure prediction. (A) Shows the sensitivity of the four prediction methods over homologous pairs from tRNA, 5S rRNA, RNase P RNA and SRP RNA data sets. (B) Shows the PPV of the four prediction methods on the four families. Colors represent the program used, as identified by the legends. The numerical values are indicated on the bars. The improvements in performance of Dynalign II over Dynalign and of Dynalign II over Fold are statistically significant for each RNA family Supplementary Tables S9 and S10 in the Supplementary Materials provide the P-values for the tests.

To further investigate the improvement provided by the capability to account for domain insertions, the accuracy was assessed specifically on base pairs in inserted domains. The results, shown in Figure 6, show that the sensitivity of prediction of base pairs in inserted domains is improved over the original Dynalign algorithm for the RNase P RNA, SRP RNA and tRNA families. Dynalign II also achieves a reasonable PPV in predicting base pairs in inserted domains. Note that the corresponding PPV cannot be calculated for the original Dynalign because inserted pairs are not allowed.

Figure 6.

Figure 6.

Structure prediction accuracy over base pairs in inserted domains. (A) Shows the sensitivity of Dynalign II and Dynalign on the tRNA, RNase P and SRP data sets. (B) Shows the PPV of Dynalign II and Dynalign on the tRNA, RNase P and SRP data sets. Colors represent the program used and are identified by the legends. The numerical values of the sensitivities and PPVs are indicated on the bars.

A one-tail paired t-test (32) was performed to test the statistical significance of the improvement in sensitivity and PPV for Dynalign II over Dynalign. The P-values computed for the test are reported in Supplementary Table S9. With the type I error rate, alpha, set to 0.05, the improvements of Dynalign II upon Dynalign are statistically significant in all cases. The statistical significance of improvements of Dynalign II upon Fold (25) were assessed using the same test and corresponding P-values are included in Supplementary Table S10. All the improvements are significant.

To demonstrate the improvement provided by Dynalign II over Dynalign, an example pair of RNA homologs is illustrated in Figures 79. Figure 7 shows the accepted structures for two SRP RNA sequences, Bacillus amyloliquefaciens D11416 (SRP database ID: Baci.amyl._D11416) and Pyrococcus horikoshii BA000001 (SRP database ID: Pyro.hori._BA000001)(37). horikoshii has an inserted domain compared with amyloliquefaciens (indicated by a blue rectangle) in addition to the deletion and insertion of base pairs (Figure 7). The prediction made by the original Dynalign algorithm, shown in Figure 8, achieves a sensitivity of 0.55 and a PPV of 0.57. Because the original Dynalign algorithm cannot account for the domain insertion, the overall structures are incorrectly predicted. The prediction from Dynalign II, shown in Figure 9, has an improved sensitivity of 0.86 and PPV of 0.87 (Figure 9). The inserted domain is correctly identified (indicated by a blue rectangle) and the capability to account for the inserted domain also results in an overall more accurate prediction.

Figure 7.

Figure 7.

Known structures for two SRP homologs with a domain insertion in one homolog. (A) Bacillus amyloliquefaciens D11416 (SRP database ID: Baci.amyl._D11416) and (B) Pyrococcus horikoshii BA000001 (SRP database: Pyro.hori._BA000001) from the SRP database (37). The nucleotides are numbered from 5′-3′. The inserted domain in (B) is marked by a blue rectangle.

Figure 9.

Figure 9.

Structure prediction results for Dynalign II. (A) and (B) are the Dynalign II predictions for the structures of the Bacillus amyloliquefaciens D11416 and the Pyrococcus horikoshii BA000001, respectively. Correctly predicted base pairs are colored green and their pairs are more heavily weighted. The incorrectly predicted base pairs are colored gray and their pairs are less heavily weighted. The correctly identified inserted domain is marked by a blue rectangle.

Figure 8.

Figure 8.

Structure predictions for the homologs in Figure 7 obtained with the original Dynalign algorithm. (A) and (B) are the Dynalign predictions for the structures of the Bacillus amyloliquefaciens D11416 (A) and the Pyrococcus horikoshii BA000001 (B), respectively. The correctly predicted base pairs are colored green and their pairs are more heavily weighted. The incorrectly predicted base pairs are colored gray and their pairs are less heavily weighted.

The results illustrate the improvement that Dynalign II offers over the original Dynalign in secondary structure prediction accuracy. Advantageously, this improvement is achieved with negligible increase of computational cost. To highlight this, the average run times and memory requirements for the original Dynalign and Dynalign II algorithms are listed in Tables 1 and 2 for RNA sequence pairs from the four families that were used in the accuracy benchmarking. The average execution times and memory requirements for Dynalign II compare favorably with those for Dynalign. For example, the average execution times for the sequence pairs from the RNase P and SRP families were 53 min:41 s and 5 h:6 min:30 s for Dynalign II, compared to 50 min:15 s and 4 h:38 min:37 s for Dynalign, on four cores of an Intel Xeon E5–2695 v2 processor. Similarly, average memory requirements for Dynalign II for sequence pairs from RNase P and SRP families was 812 MB and 1791 MB for Dynalign II, compared to 810 MB and 1790 MB for Dynalign.

Table 1. Average wall time required for common secondary structure prediction for 5S rRNA, tRNA, RNase P and SRP RNA homologous RNA sequence pairs.

5S rRNA tRNA SRP RNA RNase P RNA
Dynalign II 55 s 18 s 5 h:6 min:30 s 53 min:41 s
Dynalign 49 s 16 s 4 h:38 min:37 s 50 min:15 s

Four cores of a 12 core Intel Xeon E5–2695 v2 processor (2.4GHz) were used for parallel computations of RNase P RNA and SRP RNA sequence pairs. One core of an Intel Xeon E5–2695 v2 processor (2.4GHz) was used for computations of 5S rRNA and tRNA sequence pairs.

Table 2. Average memory required for common secondary structure prediction for 5S rRNA, tRNA, RNase P and SRP homologous RNA sequence pairs.

5S rRNA tRNA SRP RNA RNase P RNA
Dynalign II 76MB 57MB 1791MB 812MB
Dynalign 74MB 56MB 1790MB 810MB

DISCUSSION

Research aimed at automating comparative sequence analysis has now been ongoing for over a decade. There is still no algorithm, however, that is as accurate at secondary structure determination as manual effort by an expert investigator (7). Two categories of obstacles prevented this. First, computational methods for comparative sequence analysis fail to properly account for structural variations among homologs. These variations include domain insertions, variations of length of helices, insertions of internal loops/bulge loops and base pair openings caused by mutation of nucleotides. Second, current computational models have only an incomplete comprehension of the factors that impact secondary structure. In particular, the influence of tertiary and pseudoknotted interactions is not included (39,40), and the thermodynamic model is imperfect.

In this paper, a novel methodology was presented to incorporate prediction of inserted domains into dynamic programming algorithms for common secondary structure prediction. The methodology was developed and implemented by updating Dynalign to Dynalign II. Figure 6 shows the dramatic impact that the proposed change has on the ability to correctly predict inserted folding domains. The improvements offered by the new technique over Dynalign in overall average prediction sensitivity and PPV are statistically significant although the numerical gains are small on average because the fraction of base pairs encountered in inserted domains in homologous structures is relatively low (Figure 5). The impact of the proposed change on specific structure predictions can be large, as shown by the example in Figures 79, where there is a domain insertion in one sequence relative to the other. Advantageously, the improvement in performance is achieved with negligible increase of computational cost. The new technique generalizes and enhances the overall framework provided by the Sankoff algorithm (22), and is also applicable to other comparative RNA structure analysis tools (79). The algorithm presented in this paper accounts for interior inserted domains in multibranch loops that terminate in one or more hairpin stem-loops. Inserted domains, however, can also be found in exterior loops of sequences with known structure, i.e. loops that contain the ends of the sequence, or they can be interior to structures, i.e. not terminating in hairpin stem loops, but terminating in conserved domains.

Another attractive area for further development is to use these improvements for conserved structure prediction for three or more homologous sequences. The work could be extended to multiple sequences, for example, by extending the Multilign method (23) to use Dynalign II instead of Dynalign. Other progressive structure alignment tools could also be adapted in similar ways.

AVAILABILITY

Dynalign II is freely available as a component of the RNAstructure package at http://rna.urmc.rochester.edu/RNAstructure.html.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

SUPPLEMENTARY DATA

Acknowledgments

The authors thank the Center for Integrated Research Computing, University of Rochester, for providing access to computational resources.

FUNDING

National Institutes of Health (NIH) [GM097334 to G.S.]. Funding for open access charge: NIH [GM097334].

Conflict of interest statement. None declared.

REFERENCES

  • 1.Eddy S.R. Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. 2001;2:919–929. doi: 10.1038/35103511. [DOI] [PubMed] [Google Scholar]
  • 2.Waters L.S., Storz G. Regulatory RNAs in bacteria. Cell. 2009;136:615–628. doi: 10.1016/j.cell.2009.01.043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Doudna J.A., Cech T.R. The chemical repertoire of natural ribozymes. Nature. 2002;418:222–228. doi: 10.1038/418222a. [DOI] [PubMed] [Google Scholar]
  • 4.Tucker B.J., Breaker R.R. Riboswitches as versatile gene control elements. Curr. Opin. Struct. Biol. 2005;15:342–348. doi: 10.1016/j.sbi.2005.05.003. [DOI] [PubMed] [Google Scholar]
  • 5.Marraffini L.A., Sontheimer E.J. CRISPR interference: RNA-directed adaptive immunity in bacteria and archaea. Nat. Rev. Genet. 2010;11:181–190. doi: 10.1038/nrg2749. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wu L., Belasco J.G. Let me count the ways: mechanisms of gene regulation by miRNAs and siRNAs. Mol. Cell. 2008;29:1–7. doi: 10.1016/j.molcel.2007.12.010. [DOI] [PubMed] [Google Scholar]
  • 7.Seetin M.G., Mathews D.H. RNA structure prediction: an overview of methods. Methods Mol. Biol. 2012;905:99–122. doi: 10.1007/978-1-61779-949-5_8. [DOI] [PubMed] [Google Scholar]
  • 8.Havgaard J.H., Gorodkin J. RNA structural alignments, part I: Sankoff-based approaches for structural alignments. Methods Mol. Biol. 2014;1097:275–290. doi: 10.1007/978-1-62703-709-9_13. [DOI] [PubMed] [Google Scholar]
  • 9.Asai K., Hamada M. RNA structural alignments, part II: non-Sankoff approaches for structural alignments. Methods Mol. Biol. 2014;1097:291–301. doi: 10.1007/978-1-62703-709-9_14. [DOI] [PubMed] [Google Scholar]
  • 10.Brown J.W. The Ribonuclease P Database. Nucleic Acids Res. 1999;27:314. doi: 10.1093/nar/27.1.314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Mathews D.H., Turner D.H. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol. 2002;317:191–203. doi: 10.1006/jmbi.2001.5351. [DOI] [PubMed] [Google Scholar]
  • 12.Mathews D.H. Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics. 2005;21:2246–2253. doi: 10.1093/bioinformatics/bti349. [DOI] [PubMed] [Google Scholar]
  • 13.Harmanci A.O., Sharma G., Mathews D.H. Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics. 2007;8:130. doi: 10.1186/1471-2105-8-130. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Uzilov A.V., Keegan J.M., Mathews D.H. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics. 2006;7:173. doi: 10.1186/1471-2105-7-173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Holmes I. A probabilistic model for the evolution of RNA structure. BMC Bioinformatics. 2004;5:166. doi: 10.1186/1471-2105-5-166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Gorodkin J., Heyer L.J., Stormo G.D. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res. 1997;25:3724–3732. doi: 10.1093/nar/25.18.3724. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Will S., Reiche K., Hofacker I.L., Stadler P.F., Backofen R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol. 2007;3:e65. doi: 10.1371/journal.pcbi.0030065. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Harmanci A.O., Sharma G., Mathews D.H. PARTS: probabilistic alignment for RNA joinT secondary structure prediction. Nucleic Acids Res. 2008;36:2406–2417. doi: 10.1093/nar/gkn043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Dowell R., Eddy S.R. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics. 2006;7:400. doi: 10.1186/1471-2105-7-400. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Do C.B., Foo C.S., Batzoglou S. A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics. 2008;24:i68–i76. doi: 10.1093/bioinformatics/btn177. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Hofacker I.L., Bernhart S.H.F., Stadler P.F. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20:2222–2227. doi: 10.1093/bioinformatics/bth229. [DOI] [PubMed] [Google Scholar]
  • 22.Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 1985;45:810–825. [Google Scholar]
  • 23.Xu Z., Mathews D.H. Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics. 2011;27:626–632. doi: 10.1093/bioinformatics/btq726. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Masoumi B., Turcotte M. Simultaneous alignment and structure prediction of three RNA sequences. Int. J. Bioinform. Res. Appl. 2005;1:230–245. doi: 10.1504/IJBRA.2005.007581. [DOI] [PubMed] [Google Scholar]
  • 25.Mathews D.H., Sabina J., Zuker M., Turner D.H. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. doi: 10.1006/jmbi.1999.2700. [DOI] [PubMed] [Google Scholar]
  • 26.Xia T., SantaLucia J., Jr, Burkard M.E., Kierzek R., Schroeder S.J., Jiao X., Cox C., Turner D.H. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1998;37:14719–14735. doi: 10.1021/bi9809425. [DOI] [PubMed] [Google Scholar]
  • 27.Mathews D.H., Disney M.D., Childs J.L., Schroeder S.J., Zuker M., Turner D.H. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Nat. Acad. Sci. U.S.A. 2004;101:7287–7292. doi: 10.1073/pnas.0401799101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Reuter J.S., Mathews D.H. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics. 2010;11:129. doi: 10.1186/1471-2105-11-129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Woodson S.A., Crothers D.M. Proton nuclear magnetic resonance studies on bulge-containing DNA oligonucleotides from a mutational hot-spot sequence. Biochemistry. 1987;26:904–912. doi: 10.1021/bi00377a035. [DOI] [PubMed] [Google Scholar]
  • 30.Znosko B.M., Silvestri S.B., Volkman H., Boswell B., Serra M.J. Thermodynamic parameters for an expanded nearest-neighbor model for the formation of RNA duplexes with single nucleotide bulges. Biochemistry. 2002;41:10406–10417. doi: 10.1021/bi025781q. [DOI] [PubMed] [Google Scholar]
  • 31.Gutell R.R., Lee J.C., Cannone J.J. The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 2002;12:301–310. doi: 10.1016/s0959-440x(02)00339-1. [DOI] [PubMed] [Google Scholar]
  • 32.Xu Z., Almudevar A., Mathews D.H. Statistical evaluation of improvement in RNA secondary structure prediction. Nucleic Acids Res. 2012;40:e26. doi: 10.1093/nar/gkr1081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Damberger S.H., Gutell R.R. A comparative database of group I intron structures. Nucleic Acids Res. 1994;22:3508–3510. doi: 10.1093/nar/22.17.3508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Andronescu M., Bereg V., Hoos H.H., Condon A. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics. 2008;9:340. doi: 10.1186/1471-2105-9-340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Juhling F., Morl M., Hartmann R.K., Sprinzl M., Stadler P.F., Putz J. tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 2009;37:D159–D162. doi: 10.1093/nar/gkn772. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Schattner P., Brooks A.N., Lowe T.M. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 2005;33:W686–W689. doi: 10.1093/nar/gki366. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Rosenblad M.A., Larsen N., Samuelsson T., Zwieb C. Kinship in the SRP RNA family. RNA Biol. 2009;6:508–516. doi: 10.4161/rna.6.5.9753. [DOI] [PubMed] [Google Scholar]
  • 38.Szymanski M., Barciszewska M.Z., Erdmann V.A., Barciszewski J. 5S ribosomal RNA database. Nucleic Acids Res. 2002;30:176–178. doi: 10.1093/nar/30.1.176. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Seetin M.G., Mathews D.H. TurboKnot: rapid prediction of conserved RNA secondary structures including pseudoknots. Bioinformatics. 2012;28:792–798. doi: 10.1093/bioinformatics/bts044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Bellaousov S., Mathews D.H. ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA. 2010;16:1870–1880. doi: 10.1261/rna.2125310. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

SUPPLEMENTARY DATA

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES