Abstract
DIAL (dihedral alignment) is a web server that provides public access to a new dynamic programming algorithm for pairwise 3D structural alignment of RNA. DIAL achieves quadratic time by performing an alignment that accounts for (i) pseudo-dihedral and/or dihedral angle similarity, (ii) nucleotide sequence similarity and (iii) nucleotide base-pairing similarity.
DIAL provides access to three alignment algorithms: global (Needleman–Wunsch), local (Smith–Waterman) and semiglobal (modified to yield motif search). Suboptimal alignments are optionally returned, and also Boltzmann pair probabilities Pr(ai,bj) for aligned positions ai , bj from the optimal alignment. If a non-zero suboptimal alignment score ratio is entered, then the semiglobal alignment algorithm may be used to detect structurally similar occurrences of a user-specified 3D motif. The query motif may be contiguous in the linear chain or fragmented in a number of noncontiguous regions.
The DIAL web server provides graphical output which allows the user to view, rotate and enlarge the 3D superposition for the optimal (and suboptimal) alignment of query to target. Although graphical output is available for all three algorithms, the semiglobal motif search may be of most interest in attempts to identify RNA motifs. DIAL is available at http://bioinformatics.bc.edu/clotelab/DIAL.
INTRODUCTION
During much of the 20th century the structural biology community has focused attention on the study of proteins, leading to a ‘protein-centric’ view of molecular and cellular biology, as manifest in various protein databases and tools: ‘protein sequence’ databases such as SwissProt (1), PIR (2), ‘protein structure’ databases such as the PDB (3), SCOP (4), CATH (5), tools such as PHD secondary structure prediction (6) and DALI structural alignment (7), etc.
In this century, RNA has emerged as an important focus of the structural biology community, as evidenced by the surprising and previously unsuspected roles played by RNA in genomic regulatory processes, such as post-transcriptional regulation with micro RNAs and small interfering RNAs (8), transcriptional and translational gene regulation by allosteric conformational changes in riboswitches (9,10), ribosomal frameshift induced by pseudoknots and slippery sequences (11) and chemical modification of specific nucleotides in the ribosome. Even the peptidyltransferase reaction in peptide bond formation is catalyzed by RNA (12,13).
Within this context, the current article describes the web server, DIAL (dihedral alignment), for pairwise structural alignment of RNA from input PDB files. Depending on the precise formulation of the problem, structural alignment of 3D protein/RNA backbone conformations is known to be NP-complete.1 It follows that all current efficient algorithms either restrict the notion of 3D structural alignment or involve a heuristic. For protein structural alignment, DALI (7) and SSAP (Sequential Structure Alignment Program) (15) are perhaps the most widely used heuristic algorithms, where the latter has been used toward automatic classification in the CATH database (5).
With the current interest in RNA, ribonucleic acid sequence and structure databases have been established, and there is continual development of new algorithms and efficient software. For instance, Rfam (16) is an important sequence database of RNAs grouped by family (tRNA, SAM riboswitch, miRNA, etc.), while the NDB (Nucleic Acid Database) (17) is the primary repository for 3D RNA structures and the SCOR (Structural Classification of RNA) database is a derived data collection of RNA motifs (18).
There are far too many RNA alignment and motif-searching algorithms for us to properly survey the area in this article. To properly situate the contribution of DIAL, we give only the most necessary remarks. Most RNA structural alignment algorithms account for sequence and secondary structure similarity. In pioneering work, Sankoff (19) provided an important $O(n^6)$ algorithm to compute the optimal sequence and secondary structure alignment for two given RNA nucleotide sequences.2 Both Foldalign (20) and Dynalign (21) are important practical implementations of reasonable restrictions of Sankoff's algorithm, using the Turner nearest neighbor energy model (22,23). Quite surprisingly, in the technical report (24) Blin et al. prove that optimal pairwise alignment is NP-complete, when the input consists of two RNA sequences along with their given secondary structures.3 It follows that the precise stipulation of the input can affect the computational complexity of RNA structural alignment, a fact that explains in part the multitude of different algorithms for structural alignment.
In (27), Macke et al. describe the software RNAMotif developed for RNA motif search, allowing a flexible description of motif including any kind of base–base interaction. Liu et al. (28) present a quadratic time algorithm RSmatch for RNA secondary structure alignment and motif detection. Dalli et al. (29) describe the program STRAL, which performs a progressive alignment of non-coding RNA using base-pairing probability vectors in quadratic time. In an unusual approach, Sato and Sakakibara (30) apply conditional random fields to determine optimal RNA alignment.
Turning to RNA 3D structural alignment, in (31) Olson describes two virtual (or pseudo-) dihedral angles, later reintroduced by Duarte and Pyle (32). The pseudo-dihedral angles η respectively θ are determined by the four points C4′(i-1),P(i),C4′(i),P(i+1) respectively P(i),C4′(i),P(i+1),C4′(i+1), where P(i) respectively C4′(i) denotes the phosphorus atom respectively 4′-carbon atom of the ith RNA nucleotide. The program AMIGOS of Duarte and Pyle (32) computes RNA dihedral and pseudo-dihedral angles, used in the program PRIMOS of Duarte, Wadley and Pyle (33) to compute RNA ‘worms’, i.e. a sequence of η, θ angles for the entire RNA molecule. The method COMPADRES of Wadley and Pyle (34) uses PRIMOS to detect new RNA structural motifs, such as the π-turn, Ω-turn, α-loop, C2FA and hook turn (34), by the following procedure: (i) a non-redundant RNA structural data collection of 49 structures, 50 chains and 6697 nt is created; (ii) RNA worms are calculated for each of these structures, and the worms are concatenated into a single sequence; (iii) all maximal gapless matches of at least 5 nt of this sequence with itself are detected4 and (iv) known 3D motifs are removed from the matches, and a frequency count is made of remaining matches, from which the high-frequency motifs are analyzed.
In (35), Hershkovitz et al. compute dihedral angles α, β, γ, δ, ε, ζ, χ and pseudorotational phase P for all nucleotides in the 3D structure of 23S rRNA of Haloarcula marismortui with PDB ID 1S72:0. They identify similar contiguous sequences by ‘torsion matching’; i.e. determining whether dihedral and pseudorotational angles differ by at most an angle-dependent threshold. Hershkovitz et al. then refine this analysis by binning the computed angles in order to determine dihedral and pseudorotational angle preferences.
In (36,37), Dror, Nussinov and Wolfson describe a cubic time RNA tertiary structure alignment algorithm, ARTS, which proceeds by a seed match and greedy global extension to approximately compute the ‘largest common point set’ (LCP) between phosphorus atoms of two RNA molecules. Given PDB files for two RNA molecules A and B, the program ARTS first determines 3D coordinates a1, …, an respectively b1, …, bm of all phosphorus atoms from A respectively B, then applies the software 3DNA of Lu and Olson to determine base pairs of each structure. Given RMSD error bound of ε, ARTS determines all seed matches of ‘base quadrats’5 (i,i+1,j-1,j) and (i′,i′+1,j′-1,j′) for which there is a rigid transformation (rotation and/or translation) T such that
Since there are $O(n)$ base pairs in an RNA molecule of length n, the computation of all seed matches is done in $O(n^2)$ time. Subsequently, a greedy extension of seed matches approximately computes the LCP $A' = \{a_{i_1}, \ldots, a_{i_k}\} \subseteq A$ and $B' = \{b_{i'_1}, \ldots, b_{i'_k}\} \subseteq B$ of phosphorus atoms between both RNA molecules, such that $\|a_{i_x} - T(b_{i'_x})\| \le \varepsilon$ for all $1 \le x \le k$. The extension is done to maximize a score $F(\ell, k) = w_1 \cdot k + w_2 \cdot \ell$, where there are ℓ base pairs and k nucleotides, for appropriate weights $w_1, w_2$. Note that ARTS does not necessarily respect the order of nucleotides in the linear chain, and that no account is taken of nucleotide identity; i.e. there is no nucleotide bonus for GNRA tetraloops.
In (38), Mokdad and Leontis describe the program Ribostral, which analyzes an RNA 3D alignment, and graphically presents base-pair isostericities. Sarver et al. (39) describe the algorithm FR3D used for RNA motif search by computationally intensive coordinate RMSD computations to determine optimal alignment between motif and target, by using a reduced atom representation of RNA nucleotides.
In this article, we introduce a quadratic time, dynamic programming algorithm, DIAL, able to find the optimal alignment with gaps (i.e. bulges) of two RNAs taking into account sequence, structure and base-pairing information extracted from the PDB file. DIAL provides a number of features not available in other RNA pairwise structural alignment algorithms. While PRIMOS and COMPADRES compute gapless alignments of pseudo-dihedral angles of contiguous segments, DIAL can perform global, local and semiglobal alignment in $O(n^2)$ time with affine gap penalty by taking into account nucleotide similarity,6 dihedral and pseudo-dihedral angles as well as the base-pairing nature of nucleotides (0: unpaired, L: base paired with nucleotide to left, R: base paired with nucleotide to right). The program DIAL can perform alignments of ‘fragmented’ (i.e. non-contiguous or composite) motifs with targets, where the number of fragments is arbitrary. Since the computation of pseudo-dihedral angle η for the ith nucleotide requires atomic coordinates of both the (i − 1)st and (i+1)st nucleotide, DIAL additionally extracts atomic coordinates from 1 nt preceding the start of the region specified and 1 nt following the end of the region specified. Inaccuracies (not checked by DIAL) will occur for the first and last nucleotide in the chain of a PDB file. The web server DIAL is an important extension of the program PRIMOS; indeed, PRIMOS alignments are obtained if DIAL parameters are specified to obtain a gapless alignment of pseudo-dihedral angles. This is done by entering negative gap initiation and gap extension parameters whose absolute value is prohibitively large, and setting to 0 all parameters for dihedral angles, nucleotide similarity, base-pairing nature (0,L,R). For user-specified ‘suboptimal alignment score ratio’ 0 ≤ p ≤ 1, DIAL returns suboptimal alignments for which
, where S denotes the optimal alignment score and S′ denotes the suboptimal alignment score. Additionally, in quadratic time DIAL computes the partition function (41–43) for alignments, hence returns the Boltzmann pair probabilities Pr(ai,bj) for aligned nucleotides ai , bj occurring in the optimal alignment. Boltzmann pair probabilities can suggest the biological significance of portions of the optimal alignment, an idea validated for protein sequences by Vingron and Argos (44).
While ARTS is an excellent cubic time program for ‘motif detection’, yielding an approximation to the LCP set $A' = \{a_{i_1}, \ldots, a_{i_k}\} \subseteq A$ and $B' = \{b_{i'_1}, \ldots, b_{i'_k}\} \subseteq B$ of phosphorus atoms, it should be noted that ARTS does not necessarily preserve linear order within the alignment; i.e. it can happen that $i_j < i_\ell$ and $i'_j > i'_\ell$. Moreover, ARTS takes no account of nucleotide identity or similarity.
The graphical user interface of DIAL is particularly simple, in that PDB accession codes, chain IDs and starting and ending residue sequence numbers can be entered for both RNA molecules; optionally, PDB files can be uploaded. Allowing the user to fine-tune all parameters, DIAL is powerful, flexible and sufficiently accurate to allow the comparison of a large number of molecules for subsequent refinement by other methods.
MATERIALS AND METHODS
We computed RNA backbone dihedral angles by writing Python scripts based on the Biopython Structural Bioinformatics package (45). Six dihedral angles (α, β, γ, δ, ε, ζ) can be defined in the RNA backbone, and one dihedral angle (χ) describes the rotation between backbone and base. The values of these angles are not independent; rather, there is a very high correlation between the values of each pair of angles (35). Two additional virtual angles η and θ, first introduced by Olson (31) and later reintroduced by Duarte, Wadley and Pyle (33), offer a reduced but sufficient conformational description of the RNA backbone (46). To determine the base-pairing status of each nucleotide, we ran RNAVIEW (47) on all the SCOR RNA chains.
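As an illustration of the angle computation, the following is a minimal sketch in plain Python of a dihedral angle determined by four 3D points. It does not reproduce the Biopython-based scripts mentioned above. It uses the signed atan2 formulation rather than the inverse cosine of the normalized normals, which yields the same magnitude but also a sign. The function names `dihedral`, `eta` and `theta` and the tuple representation of coordinates are assumptions made for this example.

```python
import math

def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def dihedral(a, b, c, d):
    """Signed dihedral angle in degrees defined by points a, b, c, d."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1 = _cross(b1, b2)          # normal to the plane through a, b, c
    n2 = _cross(b2, b3)          # normal to the plane through b, c, d
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

def eta(c4_prev, p, c4, p_next):
    """Pseudo-dihedral eta from C4'(i-1), P(i), C4'(i), P(i+1)."""
    return dihedral(c4_prev, p, c4, p_next)

def theta(p, c4, p_next, c4_next):
    """Pseudo-dihedral theta from P(i), C4'(i), P(i+1), C4'(i+1)."""
    return dihedral(p, c4, p_next, c4_next)
```

For example, four points in a cis arrangement give 0°, a trans arrangement gives 180° in magnitude, and a perpendicular arrangement gives 90° in magnitude.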
Algorithm
In addition to quadratic time implementations of the Needleman–Wunsch global alignment (48) and Smith–Waterman local alignment (49) algorithms, the DIAL web server includes an implementation of ‘semiglobal’ alignment (50), suitably modified to perform motif searching for contiguous or fragmented queries. All algorithms have been extended to account for the similarity of matched nucleotides, dihedral angles7 and base-pairing attributes. To illustrate our modification of semiglobal alignment for fragmented motifs, suppose that the query consists of two non-contiguous fragments, $a_1, \ldots, a_m$ and $a'_1, \ldots, a'_{m'}$, and that the target consists of the contiguous sequence $b_1, \ldots, b_n$. In our semiglobal alignment, there is no penalty for gaps occurring to the left of $a_1$, between $a_m$ and $a'_1$ and to the right of $a'_{m'}$, while gaps occurring in $a_1, \ldots, a_m$ or in $a'_1, \ldots, a'_{m'}$ are penalized; i.e. alignment of the query is semiglobal. In contrast, all gaps in $b_1, \ldots, b_n$ are penalized, including those occurring to the left of $b_1$ and to the right of $b_n$. All algorithms run in quadratic time using affine gap penalties, following Gotoh's method (51).
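The idea of a semiglobal motif search can be sketched, under simplifying assumptions, as a ‘fitting’ alignment in which the whole query must be aligned while target nucleotides before and after the matched region are skipped at no cost. The sketch below is not the DIAL implementation: it handles a single contiguous query, compares nucleotide letters only, and uses a linear rather than affine gap penalty. The function name `fit_alignment_score` and the gap value of -2 are hypothetical; the match and mismatch values of 1 and -3 follow the BLAST defaults cited in the text.

```python
def fit_alignment_score(query, target, match=1, mismatch=-3, gap=-2):
    """Best score for aligning all of `query` to some stretch of `target`.

    Returns (score, end), where `end` is the number of target positions
    consumed at the end of the best-scoring stretch.
    """
    m, n = len(query), len(target)
    # D[i][j]: best score aligning query[:i] to a stretch of target ending at j.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap  # query nucleotides aligned to gaps are penalized
    # Row 0 stays at 0: skipping a prefix of the target is free.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,   # align query[i-1] with target[j-1]
                          D[i - 1][j] + gap,     # gap in target
                          D[i][j - 1] + gap)     # gap in query
    # Skipping a suffix of the target is free: take the best cell in the last row.
    end = max(range(n + 1), key=lambda j: D[m][j])
    return D[m][end], end
```

Both loops are bounded by the sequence lengths, so the sketch runs in quadratic time; an affine gap penalty in the style of Gotoh would need two additional matrices but keeps the same asymptotic cost.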
Our scoring function evaluates the similarity between each nucleotide of the query and target, by accounting for (i) dihedral/pseudo-dihedral similarity, (ii) nucleotide sequence similarity and (iii) base-pairing similarity. Each one of these contributions is weighted by default parameters; these parameters can be modified by the user. In particular, if the user enters parameters x, y, z respectively for the dihedral, nucleotide and base-pairing parameters, then the weight of dihedral angle contribution is $x/(x + y + z)$, while that for nucleotide similarity is $y/(x + y + z)$, and that for base pairing is $z/(x + y + z)$. Similarly, one can modify the parameters for the seven dihedral angles and two pseudo-dihedral angles; i.e. if $x_1, \ldots, x_7$ respectively y, z denote the form values for the dihedral and pseudo-dihedral angles, then the first dihedral angle weight is $x_1/(x_1 + \cdots + x_7 + y + z)$, the weight for the first pseudo-dihedral angle η is $y/(x_1 + \cdots + x_7 + y + z)$, etc.
Given query a1, …, an and target b1, …, bm, the ‘similarity’ sim(ai,bj) of aligning ai from the query RNA with bj from the target RNA is given by the weighted sum
where, following BLAST default (40), the nucleotide sequence contribution Sequence(ai,bj) is 1 if nucleotides ai , bj are identical (match) and -3 otherwise (mismatch), and where
Here A is the set of six backbone dihedral angles (α, β, γ, δ, ε, ζ), one dihedral angle (χ) describing the orientation of the base, and two pseudo-dihedral angles η respectively θ determined by the 4 points C4′(i-1),P(i),C4′(i), P(i + 1) respectively P(i),C4′(i), P(i+1),C4′(i +1).8 BasePair(ai,bj) is a penalty applied if the base-pairing attributes of ai and bj differ. Although we have focused discussion on the motif search application of DIAL, global and local 3D structural alignment are also supported. Unless the parameters are set to be permissive, local alignment tends to report very small alignments of only a few nucleotides. Full details of the algorithm and extensions will be given in a forthcoming article.
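To make the weighting scheme concrete, the following sketch combines the three contributions with weights x/(x + y + z), y/(x + y + z) and z/(x + y + z). The sequence term uses the stated values of 1 for a match and -3 for a mismatch. The form of the angle term is an assumption made only for this example (a linear function of the circular angle difference, equal to 1 for identical angles and -1 for opposite angles), as are the base-pairing penalty of -1, the function names and the dictionary layout; the text does not specify these details.

```python
def angle_difference(p, q):
    """Smallest circular difference between two angles, in degrees (0 to 180)."""
    d = abs(p - q) % 360.0
    return min(d, 360.0 - d)

def similarity(nt_a, nt_b, angles_a, angles_b, bp_a, bp_b,
               x=1.0, y=1.0, z=1.0, angle_weights=None, bp_penalty=-1.0):
    """Weighted similarity of two nucleotides (illustrative scoring only)."""
    if angle_weights is None:
        # Hypothetical default: only the two pseudo-dihedral angles count.
        angle_weights = {"eta": 1.0, "theta": 1.0}
    total_w = sum(angle_weights.values())
    # Assumed angle term: 1 for identical angles, -1 for angles 180 degrees apart.
    dihedral = sum(
        (wt / total_w) * (1.0 - angle_difference(angles_a[k], angles_b[k]) / 90.0)
        for k, wt in angle_weights.items()
    )
    sequence = 1.0 if nt_a == nt_b else -3.0         # values stated in the text
    base_pair = 0.0 if bp_a == bp_b else bp_penalty  # attributes are '0', 'L' or 'R'
    return (x * dihedral + y * sequence + z * base_pair) / (x + y + z)
```

With all three form values equal, two identical nucleotides with identical angles and base-pairing attributes score (1 + 1 + 0)/3 under these assumptions.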
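Following Clote, Ferrè and Straubhaar (42,43), we additionally compute the Boltzmann pair probabilities within an optimal alignment (41) by computing a ‘forward’ Boltzmann partition function

$$FZ(i,j) = \sum_{\mathcal{A}} \exp\left(\frac{\mathrm{score}(\mathcal{A})}{RT}\right)$$

where $\mathcal{A}$ ranges over all possible alignments of $a_1, \ldots, a_i$ with $b_1, \ldots, b_j$, R is the universal gas constant and T is absolute temperature.9 In the inductive case, the forward partition function FZ(i,j) can be computed by

$$FZ(i,j) = FZ(i-1,j-1) \cdot e^{\mathrm{sim}(a_i,b_j)/RT} + FZ(i-1,j) \cdot e^{\gamma/RT} + FZ(i,j-1) \cdot e^{\gamma/RT}$$

where for notational simplicity we have assumed a linear gap penalty γ.10 In a similar fashion, the backward Boltzmann partition function BZ can be computed, where

$$BZ(i,j) = \sum_{\mathcal{A}} \exp\left(\frac{\mathrm{score}(\mathcal{A})}{RT}\right)$$

where $\mathcal{A}$ ranges over all possible alignments of $a_i, \ldots, a_n$ with $b_j, \ldots, b_m$. The Boltzmann probability Pr[(ai,bj)] that ai will be aligned with bj is then

$$\Pr[(a_i,b_j)] = \frac{FZ(i-1,j-1) \cdot e^{\mathrm{sim}(a_i,b_j)/RT} \cdot BZ(i+1,j+1)}{FZ(n,m)}$$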
It should be stressed that due to the complexity of RNA 3D structural alignment, one cannot hope that a quadratic time algorithm such as DIAL be highly accurate. However, by using DIAL to compute potential target regions predicted to align well with the query, one can subsequently apply a very accurate, but computationally intensive RNA structural alignment algorithm, such as FR3D (39). We believe that this will be the primary application of DIAL.
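The forward and backward partition functions described above can be sketched for global alignment as follows. This is a simplified illustration rather than the DIAL code: it assumes a linear gap score `gap` (a negative number), treats the product RT as a single parameter `rt`, and takes the per-position similarity as a caller-supplied function `sim`. All names and default values here are hypothetical.

```python
import math

def pair_probabilities(a, b, sim, gap=-2.0, rt=1.0):
    """Return (P, Z): P[i][j] is the Boltzmann probability that a[i] is
    aligned with b[j] (0-based), and Z is the total partition function."""
    n, m = len(a), len(b)
    g = math.exp(gap / rt)  # Boltzmann weight of one gap position
    w = [[math.exp(sim(a[i], b[j]) / rt) for j in range(m)] for i in range(n)]

    # Forward: F[i][j] sums the weights of all alignments of a[:i] with b[:j].
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += F[i - 1][j - 1] * w[i - 1][j - 1]
            if i > 0:
                total += F[i - 1][j] * g
            if j > 0:
                total += F[i][j - 1] * g
            F[i][j] = total

    # Backward: B[i][j] sums the weights of all alignments of a[i:] with b[j:].
    B = [[0.0] * (m + 1) for _ in range(n + 1)]
    B[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                total += B[i + 1][j + 1] * w[i][j]
            if i < n:
                total += B[i + 1][j] * g
            if j < m:
                total += B[i][j + 1] * g
            B[i][j] = total

    Z = F[n][m]
    P = [[F[i][j] * w[i][j] * B[i + 1][j + 1] / Z for j in range(m)]
         for i in range(n)]
    return P, Z
```

Raising `rt` flattens the distribution, while lowering it concentrates probability on the optimal alignment, which is why comparing several temperatures can highlight the more robust portions of an alignment.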
Web server
The web server http://bioinformatics.bc.edu/clotelab/DIAL runs on a Linux cluster with 20 computational nodes, each with double processors of between 1300 and 3000 MHz and 2 GB RAM (6 Dell PowerEdge 1650, 2 × 1300 MHz Pentium III, 2 GB RAM; 11 Dell PowerEdge 1850, 2 × 2800 MHz Xeon EM64T, 2 GB RAM; 5 Dell PowerEdge 1850, 2 × 3000 MHz Xeon EM64T, 2 GB RAM).
The input form for DIAL is shown in Figure 1. The user must either upload or give the four character alphanumeric PDB accession code for both query and target RNA structures, and indicate the chain identifier for each (underscore if the PDB file contains no chain identifier). Optionally, the starting and ending residue sequence number for the query and/or target structure can be given. Default parameters for dihedral and pseudo-dihedral angle contributions to the alignment may be used or modified. The user can choose between the semiglobal motif finding algorithm (default), or global or local alignment. Three temperatures may be chosen for the Boltzmann pair probability computation to determine highly significant portions of the alignment. Figure 2 displays the output of DIAL, when executing semiglobal alignment of query 1J5A (chain A, nucleotides 2530–2536) with target 1HR2 (chain A). Hot links are provided for the alignment, dihedral and pseudo-dihedral angles (and sugar pucker), Boltzmann probabilities and superposition; alignment and a zoomed close-up of the superposition are depicted in Figure 3.
Figure 1.
DIAL input form. The user must either upload query and target PDB files, or give the four character PDB code, and additionally indicate the chain identifier (underscore indicates a blank chain identifier in the PDB file). By modifying the parameter α, the user can appropriately weigh the sequence versus dihedral angle contribution to the alignment score. By default, the alignment takes into account only the two pseudo-dihedral angles η, θ and the base-pairing similarity.
Figure 2.
DIAL screen output, when applying the motif detection (semiglobal alignment) algorithm of query 1J5A (chain A, nucleotides 2530–2536) with target 1HR2 (chain A). The target (respectively query) conformation is depicted in the upper (respectively lower) left corner, along with hot links to the computed dihedral angles. The superposition of optimal query to target alignment is depicted on the right. The images are produced by using a JMOL applet, hence allow the user to rotate, zoom in, zoom out and choose a variety of molecule representations. To the right of this output (not shown) is a pull-down tab for suboptimal alignments, provided the user entered a non-zero parameter for suboptimal alignment score ratio.
Figure 3.
(Top) Optimal alignment produced in this case by the semiglobal alignment of query to target, when applying the motif detection (semiglobal alignment) algorithm of query 1J5A (chain A, nucleotides 2530–2536) with target 1HR2 (chain A). Output includes a computation of the Boltzmann pair probabilities (not shown). (Bottom) An enlarged superposition of query to target; user can rotate and zoom in/out of image, and choose various representations of both query and target.
RESULTS
To illustrate the difference in alignment accuracy of DIAL and ARTS, we applied the motif search algorithm to two transfer RNA structures. The query structure was 1ASZ:R from residue sequence number 620 to 660 and the target structure was 4TRN. While DIAL correctly aligned this 41 nt portion of aspartyl-tRNA 1ASZ with the corresponding portion of 4TRN, the alignment produced by ARTS is incorrect; see Figure 4. For certain examples, this behavior of ARTS is not surprising, since it was designed to compute the largest collection of phosphorus atoms which are ε-close to each other.
Figure 4.
Alignment of contiguous fragment of aspartyl-tRNA 1ASZ:R starting from residue sequence number 620 and ending with 660 with the tRNA 4TRN. Left panel displays the first alignment produced by ARTS; right panel displays output of DIAL using the motif alignment algorithm with default parameters.
To assess the accuracy of the DIAL web server, we computed receiver operating characteristic (ROC) curves (52), which depict the trade-off between sensitivity (true positive rate) and specificity (1 minus false positive rate). For this assessment, we used the SCOR database (18).
SCOR XML dumps were parsed in order to locally reconstruct the SCOR database. Our starting structure data set included all RNA motifs in the SCOR database; i.e. 440 families and altogether 9850 motifs. Of 440 SCOR families, 82 had both fragmented and non-fragmented members while 62 had only fragmented members. Note that even if the number of SCOR families having fragmented members is relatively small, they are often the most populated families. There were 5110 members in families having 2 fragments, 3 members in families having 3 fragments, 1 member in a family of 4 fragments and 1 member in a family of 6 fragments. We filtered the SCOR collection to eliminate the following motifs: (i) motifs shorter than 3 nt; (ii) composite motifs whose fragments belong to different chains and (iii) motifs for which no range (i.e. starting and ending position in the chain) is specified. After this selection, we accepted only SCOR families having more than one remaining motif. This step produced 136 families and altogether 5619 motifs. Of these 136 families, 89 contained only local (contiguous, non-fragmented) motifs, 41 contained only composite (fragmented) motifs, and 6 contained both local and composite examples. Of a total of 5619 motifs, 2836 are composite, all formed by two fragments.
Since SCOR includes RNA structures which may be identical or very similar but have different PDB accession codes, for each SCOR family we produced a sequence non-redundant subcollection using Algorithm 2 described in (53).11 In this process, we additionally discarded structures shorter than 5 nt and having poorer resolution than 3.5 Å. Our final, filtered, non-redundant data set extracted from the SCOR database thus consisted of 78 families and altogether 359 motifs. So few motifs remained because the SCOR database has many identical or very similar motifs occurring in different RNA molecules.
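The greedy redundancy filter outlined in footnote 11 can be sketched as follows. The BLAST comparison is replaced here by a caller-supplied predicate `is_homologous`, and the toy predicate in the usage example (equal length and at most one differing position) is a stand-in chosen only so that the example runs; it is not the E-value test used by the authors. Both names are hypothetical.

```python
def non_redundant(sequences, is_homologous):
    """Greedy filter: keep the first sequence, drop everything homologous to it,
    then repeat with the next surviving sequence."""
    kept = []
    remaining = list(sequences)
    while remaining:
        head = remaining.pop(0)
        kept.append(head)
        remaining = [s for s in remaining if not is_homologous(head, s)]
    return kept

def toy_homologous(s, t):
    """Stand-in for a BLAST test: same length and at most one mismatch."""
    return len(s) == len(t) and sum(p != q for p, q in zip(s, t)) <= 1

if __name__ == "__main__":
    print(non_redundant(["GAAA", "GAAA", "GCAA", "UUCG"], toy_homologous))
```

The result depends on the input order, since each retained sequence removes its neighbors from further consideration.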
Figures 5 and 6 present ROC curves for contiguous and fragmented queries, respectively. These are computed as follows. For each pair (S1, S2) of structures in the non-redundant data collection obtained from SCOR as indicated above, we computed the DIAL similarity
where seqSim represents nucleotide sequence similarity, strSim represents pseudo-dihedral η, θ angle similarity and bpSim represents base-pairing similarity.12 Computations were performed for weights w = 0, 0.2, 0.4, …, 0.8, 1.0. Pairs (S1, S2) from the same (respectively different) SCOR class were considered positives (respectively negatives). This allowed the computation of ROC curves displayed in Figure 5. For the most part, pseudo-dihedral angle similarity is more important for proper SCOR classification than nucleotide sequence similarity: in Table 1, the AUC rises from 0.69 to 0.80 for non-fragmented queries and from 0.78 to 0.86 for fragmented queries as w increases from 0 to 1.
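For readers who wish to reproduce this kind of assessment, the area under a ROC curve can be computed directly from similarity scores and class labels, without plotting the curve. The sketch below uses the rank-based (Mann–Whitney) formulation, in which the AUC equals the fraction of positive/negative pairs where the positive pair scores higher, counting ties as one half. The function name `auc` and the input layout are assumptions for this example, not part of DIAL.

```python
def auc(scores, labels):
    """Area under the ROC curve.

    scores: similarity score for each structure pair.
    labels: truthy for pairs from the same class (positives), falsy otherwise.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a score that does not separate the classes, and 1.0 to perfect separation.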
Figure 5.
Average ROC curves when using the semiglobal DIAL algorithm to align query motifs from the SCOR database with targets from the SCOR database. The x-axis represents false positive rate (1 minus specificity), while the y-axis represents true positive rate (sensitivity). Overlaid curves represent different weighting of dihedral angle versus sequence contributions with weights w = 0,0.2, …, 0.8,1.0. (See Table 1 or text for fuller description of parameters used.) This figure depicts ROC curves for contiguous queries, consisting of an uninterrupted linear sequence of nucleotides.
Figure 6.
Average ROC curves when using the semiglobal DIAL algorithm to align query motifs from the SCOR database with targets from the SCOR database. The x-axis represents false positive rate (1 minus specificity), while the y-axis represents true positive rate (sensitivity). Overlaid curves represent different weighting of dihedral angle versus sequence contributions with weights w = 0,0.2, …, 0.8,1.0. (See Table 1 or text for fuller description of parameters used.) This figure depicts ROC curves for fragmented queries, representing 3D motifs consisting of two or more interrupted linear sequences of nucleotides. In the SCOR database, most fragmented queries consist of two contiguous linear sequences. (See text for fragment breakdown for the SCOR database.)
These data gave rise to the ROC curves shown in Figure 6, which displays overlaid curves with different weights w for the sequence versus structural alignment, w = 0,0.2,0.4, …, 0.8,1.0. Table 1 presents the area under ROC curves, denoted by AUC, for both non-fragmented and fragmented motifs, using the data from the previously discussed ROC curves.
Table 1.
Area under ROC curve (AUC) for ROC curves displayed in Figures 5 and 6. ROC curves were created for a non-redundant data set extracted from the SCOR database; see Section Data set in reference (20). AUC is computed for different values of weight parameter w for both non-fragmented and fragmented queries, for w = 0,0.2, …, 0.8,1.0. This corresponds to setting parameters on the DIAL web form as follows: ‘dihedral’=w, ‘sequence’=(1−w), ‘base-pairing’=1. With these settings, DIAL alignments give the same weight to sequence/structural similarity and base-pairing similarity. By varying weight w, we obtain a trade-off between sequence and dihedral angle similarity. With these settings, DIAL appears to perform slightly better on fragmented motifs.
| w | Non-fragmented AUC | Fragmented AUC |
|---|---|---|
| 0.0 | 0.69 | 0.78 |
| 0.2 | 0.74 | 0.77 |
| 0.4 | 0.73 | 0.78 |
| 0.6 | 0.76 | 0.81 |
| 0.8 | 0.78 | 0.82 |
| 1.0 | 0.80 | 0.86 |
DISCUSSION
In this article, we have described the DIAL web server, which provides access to global, local and semiglobal alignment of RNA structures, presented as PDB files. We believe the semiglobal alignment to be of particular interest as a preprocessing step for RNA motif detection.
The DIAL web server performs a quadratic time, dynamic programming alignment, taking into account similarity of nucleotide identity, (pseudo-) dihedral angles and base pairing in the secondary structure. The algorithm is fully customizable by allowing the user to stipulate different weights for angles and base pairs.
Unlike the PRIMOS algorithm of Duarte et al. (33), which considers a gapless alignment of pseudo-dihedral angles for contiguous sequences, DIAL can handle fragmented queries and alignments with bulging nucleotides by means of gap insertion. DIAL alignment accounts for base-pairing similarity, known to be of primary importance in the manual curation of the SCOR database. Additionally, DIAL computes Boltzmann pair probabilities in the alignment, and can return suboptimal query-target alignments.
ACKNOWLEDGEMENTS
Research of P.C., W.A.L. and Y.P. was supported by National Science Foundation grant DBI-0543506. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We wish to thank Steve Holbrook for pointing out the importance of properly accounting for base pairing in an earlier version of DIAL, Jason Persampieri for technical assistance and both N. Leontis and C. Zirbel for generously providing us with a preprint of the article (39). Funding to pay the Open Access publication charges for this article was provided by National Science Foundation grant DBI-0543506.
Conflict of interest statement. None declared.
Footnotes
1In Lemma 3.3 of (14), Kolodny and Linial prove that ε-approximate optimal structural alignment is NP-complete, when the input consists of two distance matrices over an arbitrary metric space. Over 3D Euclidean space, Kolodny and Linial present an algorithm for ε-approximate optimal structural alignment of two proteins with run time $O(n^{10}/\varepsilon^6)$.
2Sankoff provided a general $O(n^{3k})$ algorithm to determine the optimal multiple sequence/secondary structure alignment for k RNA nucleotide sequences of length n. To the best of our knowledge, there is no publicly available implementation of Sankoff's algorithm.
3In other words, the nested–nested edit-distance problem of Lin et al. (25) is NP-complete. See (26) for an $O(n^4)$ algorithm for a related RNA alignment problem.
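4The worm $\langle(\eta_1, \theta_1), \ldots, (\eta_m, \theta_m)\rangle$ is defined to match the worm $\langle(\eta'_1, \theta'_1), \ldots, (\eta'_m, \theta'_m)\rangle$ if the Euclidean distance between $(\eta_i, \theta_i)$ and $(\eta'_i, \theta'_i)$ is at most 25° for each $1 \le i \le m$.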
5A quadrat is a stack of size 2, i.e. positions i, i + 1, j − 1, j such that (i, j) and (i + 1, j − 1) are base pairs.
6Default RNA nucleotide similarities are taken from BLAST (40); however, the user can modify nucleotide similarity, gap initiation and gap extension costs as well as other parameters.
7A dihedral or torsion angle is determined by four points a, b, c, d in Euclidean 3D space. By taking cross products, compute the normal vectors $n_1 = (b - a) \times (c - b)$ and $n_2 = (c - b) \times (d - c)$ to the planes determined by a, b, c and b, c, d. The dihedral angle is defined to be the inverse cosine of the inner product $n_1 \cdot n_2$ of $n_1, n_2$ normalized by their lengths.
8Given four points, a, b, c, d, the first three and last three determine two planes. The dihedral angle between the planes is computed by taking the inverse cosine of the inner product of the normal to each plane.
9In alignment, temperature is a non-physical parameter; however, as in (42), by taking several temperatures one sees the overall significance of portions of the alignment (44).
10DIAL uses a general affine gap penalty, following Gotoh's algorithm (51).
11Algorithm 2 constructs a sequence non-redundant data set as follows. Given list L of sequences, determine BLAST similarity of the first sequence to all others, removing from L all homologous sequences (with E-value below a given threshold). Take the second sequence from the filtered list L, determine BLAST similarity with all successive sequences from L, removing those which are homologous, etc. In this fashion a set of sequences is obtained, guaranteed not to be pairwise homologous. In our implementation, we used default BLAST values for nucleotide match, mismatch and gap, and set the E-value threshold to 0.001.
12We set parameters on the web form as follows: ‘dihedral’=w, ‘sequence’=(1−w), ‘base pairing’=1. Dihedral angle parameters α through χ are set to 0, while parameters η and θ are set to 1.
REFERENCES
- 1. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370. doi: 10.1093/nar/gkg095.
- 2. Wu CH, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu ZZ, Ledley RS, Lewis KC, Mewes HW, et al. The protein information resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res. 2002;30:35–37. doi: 10.1093/nar/30.1.35.
- 3. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, et al. The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2002;58:899–907. doi: 10.1107/s0907444902003451.
- 4. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 2004;32:D226–D229. doi: 10.1093/nar/gkh039.
- 5. Pearl FM, Bennett CF, Bray JE, Harrison AP, Martin N, Shepherd A, Sillitoe I, Thornton J, Orengo CA. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res. 2003;31:452–455. doi: 10.1093/nar/gkg062.
- 6. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993;232:584–599. doi: 10.1006/jmbi.1993.1413.
- 7. Holm L, Sander C. Mapping the protein universe. Science. 1996;273:595–603. doi: 10.1126/science.273.5275.595.
- 8. Lim L, Glasner M, Yekta S, Burge C, Bartel D. Vertebrate microRNA genes. Science. 2003;299:1540. doi: 10.1126/science.1080372.
- 9. Winkler WC, Cohen-Chalamish S, Breaker RR. An mRNA structure that controls gene expression by binding FMN. Proc. Natl Acad. Sci. USA. 2002;99:15908–15913. doi: 10.1073/pnas.212628899.
- 10. Penchovsky R, Breaker R. Computational design and experimental validation of oligonucleotide-sensing allosteric ribozymes. Nat. Biotechnol. 2005;23:1424–1431. doi: 10.1038/nbt1155.
- 11. Bekaert M, Bidou L, Denise A, Duchateau-Nguyen G, Forest J, Froidevaux C, Hatin I, Rousset J, Termier M. Towards a computational model for -1 eukaryotic frameshifting sites. Bioinformatics. 2003;19:327–335. doi: 10.1093/bioinformatics/btf868.
- 12. Weinger J, Parnell K, Dorner S, Green R, Strobel S. Substrate-assisted catalysis of peptide bond formation by the ribosome. Nat. Struct. Mol. Biol. 2004;11:1101–1106. doi: 10.1038/nsmb841.
- 13. Nissen P, Hansen J, Ban N, Moore P, Steitz T. The structural basis of ribosome activity in peptide bond synthesis. Science. 2000;289:920–923. doi: 10.1126/science.289.5481.920.
- 14. Kolodny R, Linial N. Approximate protein structural alignment in polynomial time. Proc. Natl Acad. Sci. USA. 2004;101:12201–12206. doi: 10.1073/pnas.0404383101.
- 15. Taylor WR, Flores TP, Orengo CA. Multiple protein structure alignment. Protein Sci. 1994;3:1858–1870. doi: 10.1002/pro.5560031025.
- 16. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy S. Rfam: an RNA family database. Nucleic Acids Res. 2003;31:439–441. doi: 10.1093/nar/gkg006.
- 17. Berman HM, Westbrook J, Feng Z, Iype L, Schneider B, Zardecki C. The nucleic acid database. Methods Biochem Anal. 2003;44:199–216.
- 18. Klosterman P, Tamura M, Holbrook S, Brenner S. SCOR: a structural classification of RNA database. Nucleic Acids Res. 2002;30:392–394. doi: 10.1093/nar/30.1.392.
- 19. Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 1985;45:810–825.
- 20. Havgaard JH, Lyngso RB, Gorodkin J. The FOLDALIGN web server for pairwise structural RNA alignment and mutual motif search. Nucleic Acids Res. 2005;33:W650–W653. doi: 10.1093/nar/gki473.
- 21. Mathews D, Turner D. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol. 2002;317:191–203. doi: 10.1006/jmbi.2001.5351.
- 22. Mathews D, Sabina J, Zuker M, Turner D. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. doi: 10.1006/jmbi.1999.2700.
- 23. Xia T, SantaLucia J, Burkard M, Kierzek R, Schroeder S, Jiao X, Cox C, Turner D. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1998;37:14719–14735. doi: 10.1021/bi9809425.
- 24. Blin G, Fertin G, Rusu I, Sinoquet C. RNA sequences and the EDIT(NESTED,NESTED) problem. Technical Report 03.07, Research Report. 2003. Submitted for publication.
- 25. Jiang T, Lin G, Ma B, Zhang K. A general edit distance between two RNA structures. J. Comput. Biol. 2002;9:371–388. doi: 10.1089/10665270252935511.
- 26. Herrbach C, Denise A, Dulucq S, Touzet H. Alignment of RNA secondary structures using a full set of operations. Technical Report 1451, Laboratoire de recherche en informatique (LRI), Research Report. 2006. Submitted for publication.
- 27. Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R. RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res. 2001;29:4724–4735. doi: 10.1093/nar/29.22.4724.
- 28. Liu J, Wang JT, Hu J, Tian B. A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics. 2005;6:89. doi: 10.1186/1471-2105-6-89.
- 29. Dalli D, Wilm A, Mainz I, Steger G. STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006;22:1593–1599. doi: 10.1093/bioinformatics/btl142.
- 30. Sato K, Sakakibara Y. RNA secondary structural alignment with conditional random fields. Bioinformatics. 2005;21:ii237–ii242. doi: 10.1093/bioinformatics/bti1139.
- 31. Olson WK. Configurational statistics of polynucleotide chains. A single virtual bond treatment. Macromolecules. 1975;8:272–275. doi: 10.1021/ma60045a006.
- 32. Duarte C, Pyle A. Stepping through an RNA structure: a novel approach to conformational analysis. J. Mol. Biol. 1998;284:1465–1478. doi: 10.1006/jmbi.1998.2233.
- 33. Duarte CM, Wadley LM, Pyle AM. RNA structure comparison, motif search and discovery using a reduced representation of RNA conformational space. Nucleic Acids Res. 2003;31:4755–4761. doi: 10.1093/nar/gkg682.
- 34. Wadley L, Pyle A. The identification of novel RNA structural motifs using COMPADRES: an automated approach to structural discovery. Nucleic Acids Res. 2004;32:6650–6659. doi: 10.1093/nar/gkh1002.
- 35. Hershkovitz E, Tannenbaum E, Shelley B, Sheth A, Tannenbaum A, Williams L. Automated identification of RNA conformational motifs: theory and application to the HM LSU 23S rRNA. Nucleic Acids Res. 2003;31:6249–6257. doi: 10.1093/nar/gkg835.
- 36. Dror O, Nussinov R, Wolfson H. ARTS: alignment of RNA tertiary structures. Bioinformatics. 2005;21:ii47–ii53. doi: 10.1093/bioinformatics/bti1108.
- 37. Dror O, Nussinov R, Wolfson HJ. The ARTS web server for aligning RNA tertiary structures. Nucleic Acids Res. 2006;34:W412–W415. doi: 10.1093/nar/gkl312.
- 38. Mokdad A, Leontis NB. Ribostral: an RNA 3D alignment analyzer and viewer based on basepair isostericities. Bioinformatics. 2006;22:2168–2170. doi: 10.1093/bioinformatics/btl360.
- 39. Sarver M, Zirbel C, Stombaugh J, Mokdad A, Leontis N. FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J. Math. Biol. 2006. doi: 10.1007/s00285-007-0110-x.
- 40. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 41. Mückstein U, Hofacker I, Stadler P. Stochastic pairwise alignments. Bioinformatics. 2002;18:S153–S160. doi: 10.1093/bioinformatics/18.suppl_2.s153.
- 42. Clote P, Straubhaar J. Symmetric time warping, Boltzmann pair probabilities and functional genomics. J. Math. Biol. 2006;53:135–161. doi: 10.1007/s00285-006-0379-1.
- 43. Ferre F, Clote P. BTW: a web server for Boltzmann time warping of gene expression time series. Nucleic Acids Res. 2006;34:W482–W485. doi: 10.1093/nar/gkl162.
- 44. Vingron M, Argos P. Determination of reliable regions in protein sequence alignments. Protein Eng. 1990;3:565–569. doi: 10.1093/protein/3.7.565.
- 45. Hamelryck T, Manderick B. PDB file parser and structure class implemented in Python. Bioinformatics. 2003;19:2308–2310. doi: 10.1093/bioinformatics/btg299.
- 46. Wadley LM, Pyle AM. The identification of novel RNA structural motifs using COMPADRES: an automated approach to structural discovery. Nucleic Acids Res. 2004;32:6650–6659. doi: 10.1093/nar/gkh1002.
- 47. Yang H, Jossinet F, Leontis N, Chen L, Westbrook J, Berman H, Westhof E. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003;31:3450–3460. doi: 10.1093/nar/gkg529.
- 48. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4.
- 49. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5.
- 50. Gusfield D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press; 1997.
- 51. Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982;162:705–708. doi: 10.1016/0022-2836(82)90398-9.
- 52. Gribskov M, Robinson N. The use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 1996;20:25–34. doi: 10.1016/s0097-8485(96)80004-0.
- 53. Hobohm U, Scharf M, Schneider R, Sander C. Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Sci. 1992;1:409–417. doi: 10.1002/pro.5560010313.






