A max-margin model for efficient simultaneous alignment and folding of RNA sequences

Chuong B Do; Chuan-Sheng Foo; Serafim Batzoglou

doi:10.1093/bioinformatics/btn177

Bioinformatics. 2008 Jul 1;24(13):i68–i76. doi: 10.1093/bioinformatics/btn177

PMCID: PMC2718655 PMID: 18586747

Abstract

Motivation: The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction.

In this work, we present RAF (RNA Alignment and Folding), an efficient algorithm for simultaneous alignment and consensus folding of unaligned RNA sequences. Algorithmically, RAF exploits sparsity in the set of likely pairing and alignment candidates for each nucleotide (as identified by the CONTRAfold or CONTRAlign programs) to achieve an effectively quadratic running time for simultaneous pairwise alignment and folding. RAF's fast sparse dynamic programming, in turn, serves as the inference engine within a discriminative machine learning algorithm for parameter estimation.

Results: In cross-validated benchmark tests, RAF achieves accuracies equaling or surpassing the current best approaches for RNA multiple sequence secondary structure prediction. However, RAF requires nearly an order of magnitude less time than other simultaneous folding and alignment methods, thus making it especially appropriate for high-throughput studies.

Availability: Source code for RAF is available at: http://contra.stanford.edu/contrafold/

Contact: chuongdo@cs.stanford.edu

1 INTRODUCTION

The secondary structure adopted by an RNA molecule in vivo is a vital consideration in many bioinformatics analyses. In PCR primer design, stable secondary structures can obstruct proper binding of the primer to DNA (Dieffenbach et al., 1993); in RNA folding pathway studies, secondary structure forms the basic scaffold on which more complicated 3D structures organize (Brion and Westhof, 1997); and in computational non-coding RNA gene prediction, RNA secondary structural stability provides the characteristic signal for distinguishing real RNA sequence from non-functional transcripts (Eddy, 2002).

To date, the most powerful non-experimental methods for determining RNA secondary structure rely primarily on position-specific patterns of nucleotide covariation in multiple homologous RNA sequences. Specifically, enrichment for complementarity in pairs of columns from an RNA multiple alignment, especially when primary sequence is not conserved, provides strong evidence for potential base-pairings in the RNA's in vivo structure. A primary limitation of covariation analysis, however, is the difficulty of obtaining reliable sequence alignments for divergent RNA families. This shortcoming is especially relevant in the detection of ncRNA genes, as secondary structural constraints often exist even when primary sequence conservation is lacking (Torarinsson et al., 2006).

In this article, we describe RNA alignment and folding (RAF), a new algorithm for predicting RNA secondary structure from a collection of unaligned homologous RNA sequences. Algorithmically, RAF belongs to a category of RNA secondary structure prediction methods which simultaneously align and fold RNA sequences. By optimizing a pair of unaligned RNA sequences for both sequence homology and structural conservation concurrently, simultaneous alignment and folding approaches sidestep the usual problem of needing accurate sequence alignments before the folding is done. By exploiting sparsity in the set of likely base pairings and aligned nucleotides, RAF achieves O(L²) running time for sequences of length L, improving significantly upon the O(L⁴) running times of typical simultaneous folding and alignment approaches.

The main contribution of RAF, however, is its application of discriminative machine learning techniques for parameter estimation to the problem of simultaneous alignment and folding. Unlike previous methods, RAF's scoring model does not rely on ad hoc combinations of thermodynamic free energies for structural features (Mathews et al., 1999) with arbitrary alignment match and gap penalties (Hofacker et al., 2002), nor does RAF attempt the ambitious task of simultaneously modeling the evolutionary history of both sequences and structure (Knudsen and Hein, 2003). Instead, RAF defines a fixed set of basis features describing aspects of the alignment, RNA secondary structure, or both. RAF then poses the task of learning weights for these features as a convex optimization problem, giving rise to efficient algorithms with guaranteed convergence to optimality.

The concept of using discriminative methods for parameter estimation rather than relying solely on parameters compiled from experimental measurements originated with the CONTRAfold (Do et al., 2006b) program, and later also became the basis of the CG (Andronescu et al., 2007) method. In a manner analogous to these two previous methods for single sequence secondary structure prediction, RAF demonstrates that automatic learning of parameters can also confer benefits to multiple sequence structure prediction accuracy.

2 METHODS

The RAF algorithm consists of four components: (1) a simple yet flexible objective function for pairwise alignment and folding of unaligned RNA sequences; (2) a fast Sankoff-style inference engine for maximizing this objective function via sparse dynamic programming; (3) a simple progressive strategy for extending the pairwise algorithm to handle multiple unaligned sequence inputs; and (4) a max-margin framework for automatically learning model parameters from training data. We describe each of these in turn.

2.1 The RAF scoring model

We begin our description of the algorithm by describing a scoring scheme for alignments and consensus foldings of two sequences. Let a and b be a pair of unaligned input RNA sequences. We refer to a candidate alignment and consensus secondary structure of a and b collectively as a parse. Formally, a parse y for a pair of sequences a and b is a set whose elements consist of base pairings (a_i, a_j) belonging to sequence a, base pairings (b_k, b_l) belonging to sequence b, and aligned positions (a_i, b_k) between a and b.

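For a given parse y from the space of all valid¹ parses 𝒴, RAF uses a simple scoring scheme which takes into account aligned positions and conserved base pairings. Specifically, RAF defines the score, Score(y;w), of such a parse y to be

$$\mathrm{Score}(y; w) = \sum_{(a_i, b_k) \in y} \psi^{\mathrm{aligned}}(i, k; w) + \sum_{\langle (a_i, a_j), (b_k, b_l) \rangle \in \mathcal{B}(y)} \psi^{\mathrm{paired}}(i, j; k, l; w),$$

where ψ^aligned(i, k; w) and ψ^paired(i, j; k, l; w) are scoring terms for aligned positions and conserved base pairs, respectively, and where ℬ(y) is the set of all conserved base pairings. In turn, RAF models each scoring term as a linear combination of arbitrary basis features (Appendix A.1):

$$\psi^{\mathrm{aligned}}(i, k; w) = \sum_{p=1}^{n_{\mathrm{aligned}}} w^{\mathrm{aligned}}_p \, \phi^{\mathrm{aligned}}_p(i, k), \qquad \psi^{\mathrm{paired}}(i, j; k, l; w) = \sum_{p=1}^{n_{\mathrm{paired}}} w^{\mathrm{paired}}_p \, \phi^{\mathrm{paired}}_p(i, j; k, l),$$

where w=(w^aligned, w^paired)∈ℝ^{n_aligned+n_paired}=ℝⁿ is a vector of scoring parameters.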
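As a minimal sketch of this scoring scheme, the Python below sums one linear term per aligned position and one per conserved base pair. The feature functions, weights and toy parse are hypothetical placeholders chosen for illustration; they are not RAF's actual features (those are listed in Appendix A.1).

```python
def dot(weights, features):
    """Inner product of a weight vector and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def score(aligned, conserved_pairs, w_aligned, w_paired, phi_aligned, phi_paired):
    """Score of a parse: linear terms summed over aligned positions (i, k)
    and over conserved base pairs (i, j, k, l)."""
    total = 0.0
    for i, k in aligned:
        total += dot(w_aligned, phi_aligned(i, k))
    for i, j, k, l in conserved_pairs:
        total += dot(w_paired, phi_paired(i, j, k, l))
    return total


# Hypothetical constant features, used only to make the sketch runnable.
def toy_phi_aligned(i, k):
    return [1.0, 0.5]


def toy_phi_paired(i, j, k, l):
    return [1.0]


toy_score = score(
    aligned={(1, 1), (2, 2)},
    conserved_pairs={(1, 5, 1, 5)},
    w_aligned=[1.0, 2.0],
    w_paired=[3.0],
    phi_aligned=toy_phi_aligned,
    phi_paired=toy_phi_paired,
)
```

Because the score is linear in the weights, it can equally be written as an inner product between w and a feature-count vector for the whole parse, which is the form used later for parameter learning.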

2.2 Fast pairwise alignment and folding

Given the scoring scheme described in the previous section, the problem of simultaneous alignment and folding reduces to the optimization problem,

$$y^* = \arg\max_{y \in \mathcal{Y}} \mathrm{Score}(y; w). \tag{1}$$

In principle, the solution to (1) follows immediately from the original dynamic programming algorithm for simultaneous alignment and folding presented by Sankoff (1985). Sankoff's algorithm, however, has an O(L^{3K}) time complexity and O(L^{2K}) space complexity for K sequences of length L, rendering it impractical for all but the smallest multiple folding problems. Therefore, most programs for RNA simultaneous alignment and folding use heuristics to reduce time and memory requirements while minimally compromising alignment and structure-prediction quality. Some heuristics used in previous programs have included incorporating structural information into a single alignment scoring matrix (Dalli et al., 2006), disallowing multi-branch loops (Gorodkin et al., 1997), and precomputing potential conserved helices prior to alignment (Tabei et al., 2006; Touzet and Perriquet, 2004).

The most popular heuristics, however, involve reduction of the portion of the dynamic programming matrices (which we call the DP region) that must be computed. For example, some methods restrict the DP region to a strip of fixed width about the diagonal (Hofacker et al., 2004; Mathews and Turner, 2002) or about an initial alignment path (Kiryu et al., 2007). Other methods rely on external single-sequence folding and probabilistic alignment programs to generate base pairing probability matrices (Torarinsson et al., 2007; Will et al., 2007) or alignment match posterior probability matrices (Kiryu et al., 2007), and then exploit the sparsity of these matrices in order to reduce the amount of computation required.

The RAF algorithm adopts the last of these strategies. Namely, RAF uses a single-sequence RNA secondary structure prediction program (CONTRAfold; Do et al., 2006b) and a pairwise RNA sequence alignment program (CONTRAlign; Do et al., 2006a),² respectively, to construct a constraint set 𝒞 of allowed base pairs and aligned positions in a and b. Given a constraint set 𝒞, RAF then replaces (1) with the reduced inference problem,

$$y^* = \arg\max_{y \in \mathcal{Y}_{\mathcal{C}}} \mathrm{Score}(y; w), \tag{2}$$

where 𝒴_𝒞={y∈𝒴:y⊆𝒞} is the space of valid parses, restricted to those which contain only base pairings and alignment matches from the constraint set 𝒞 (Fig. 1).

Fig. 1. — Sparsity patterns in posterior probability matrices. Panels (a) and (b) illustrate the pairwise pairing posterior probabilities for two different sequences (such as generated by a single-sequence probabilistic or partition function–based RNA folding program). Panel (c) shows the alignment match probabilities for these sequences (such as generated by a probabilistic HMM). In each panel, the darkness of each square represents the posterior confidence in the corresponding base pairing or alignment match. While the single sequence folder or the pairwise sequence aligner may not be able to identify the single correct folding or alignment, respectively, the set of likely candidate base pairings and matched positions, nonetheless, is extremely sparse.

To obtain the set of allowed base pairings, RAF uses the implementation of McCaskill's algorithm (McCaskill, 1990) from CONTRAfold in order to compute the posterior probability of each possible base pairing in sequence a, and similarly for sequence b. All base pairs with posterior probability at least ɛ_paired are then retained. Similarly, to determine the set of allowed aligned positions, RAF retains those matches whose posterior probability, according to a version of the CONTRAlign program adapted for RNAs, is at least ɛ_aligned. If these cutoffs ɛ_aligned and ɛ_paired are chosen to be too low, then the reduction of the dynamic programming space achieved for 𝒴_𝒞 will not be significant. Conversely, a higher cutoff could also degrade performance by excluding portions of the DP matrix which actually correspond to the true parse of the input sequences. A similar approach for pruning the space of candidate alignments and folds via fold and alignment envelopes was implemented in the Stemloc (Holmes, 2005) program. A number of other programs exploit either base-pairing sparsity (Torarinsson et al., 2007; Will et al., 2007) or alignment sparsity (Dowell and Eddy, 2006; Harmanci et al., 2007; Kiryu et al., 2007) separately.
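The thresholding step can be sketched in a few lines of Python. This is an illustration only: the posterior matrices below are made-up toy values (in RAF they come from CONTRAfold and CONTRAlign), indices are 0-based, and the function names and dictionary layout are assumptions of this sketch.

```python
def pairing_candidates(posteriors, cutoff):
    """Base pairs (i, j) with i < j whose posterior probability is >= cutoff."""
    n = len(posteriors)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if posteriors[i][j] >= cutoff
    }


def alignment_candidates(posteriors, cutoff):
    """Matches (i, k) between a and b whose posterior probability is >= cutoff."""
    return {
        (i, k)
        for i, row in enumerate(posteriors)
        for k, p in enumerate(row)
        if p >= cutoff
    }


def build_constraint_set(pair_post_a, pair_post_b, align_post, eps_paired, eps_aligned):
    """Collect the allowed base pairs of each sequence and the allowed matches."""
    return {
        "pairs_a": pairing_candidates(pair_post_a, eps_paired),
        "pairs_b": pairing_candidates(pair_post_b, eps_paired),
        "aligned": alignment_candidates(align_post, eps_aligned),
    }


# Toy posteriors: sequence a has 3 positions, sequence b has 2.
PAIR_A = [[0.0, 0.01, 0.90], [0.01, 0.0, 0.20], [0.90, 0.20, 0.0]]
PAIR_B = [[0.0, 0.70], [0.70, 0.0]]
ALIGN = [[0.80, 0.05], [0.02, 0.60], [0.00, 0.30]]

constraints = build_constraint_set(PAIR_A, PAIR_B, ALIGN, eps_paired=0.1, eps_aligned=0.1)
```

Raising either cutoff shrinks the corresponding candidate set, which is the trade-off between speed and coverage of the true parse described above.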

Assuming O(c) and O(d) bounds on the number of candidate base pairing and alignment partners, respectively, per position of both sequences, we show that the time complexity of the RAF algorithm scales quadratically in the length of the sequences, while the space complexity scales linearly (Appendix B.1). A comparison table of asymptotic time and space complexity of a number of modern RNA simultaneous folding and alignment approaches is shown in Table 1. In practice, we find that RAF's scaling reflects the theoretical bounds, achieving running times often an order of magnitude faster than current simultaneous alignment and folding methods.³

Table 1.

Comparison of computational complexity of RNA simultaneous folding and alignment algorithms

Algorithm	Time complexity	Space complexity
Sankoff	O(L⁶)	O(L⁴)
FOLDALIGN	O(L⁴)	O(L⁴)
LocARNA	O(c²L⁴)	O(c²L²)
Murlet	O(d²L²+d³L³/κ⁶)	O(d²L²)
RAF	O(min(c,d)·cd²L²)	O(min(c,d)·cdL)

Here, L denotes the sequence length, c is the number of candidate base pairs per position, d is the number of candidate alignment matches per position and κ is the minimum allowed distance between adjacent helices.

2.3 Extension to multiple alignment

Using the RAF pairwise alignment subroutine, we can also address the problem of aligning two alignments. Let S and T be two sets of sequences that we wish to align; furthermore, we denote their corresponding alignments as A and B.

To align a pair of alignments, we first define new basis features φ^aligned and φ^paired to simply be the average over all pairs of sequences s∈S and t∈T of the basis features for aligning s and t, remapped to the coordinates of the alignments A and B. Second, we define the new constraint set 𝒞 for aligning the two alignments to be the union over all pairs of sequences s∈S and t∈T of the constraint sets for each pair, again remapped to the alignment coordinates. Finally, using these new features and our new constraint set, we simply call the existing RAF subroutine for fast pairwise alignment and folding.
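The remapping and union of constraint sets might look as follows in Python. This is a sketch under stated assumptions: alignments are given as lists of gapped strings, positions and columns are 0-based, and only alignment-match constraints are shown (base-pair constraints would be remapped the same way). The function names and data layout are hypothetical.

```python
def columns_of_positions(aligned_row, gap="-"):
    """For a gapped row, list the alignment column of each residue in order."""
    return [col for col, ch in enumerate(aligned_row) if ch != gap]


def merged_match_constraints(rows_a, rows_b, pairwise_constraints):
    """Union of pairwise match constraints, remapped to alignment columns.

    pairwise_constraints[(s, t)] is the set of allowed matches (i, k) between
    sequence s of alignment A and sequence t of alignment B, in sequence
    coordinates. The result holds (column in A, column in B) pairs.
    """
    cols_a = [columns_of_positions(row) for row in rows_a]
    cols_b = [columns_of_positions(row) for row in rows_b]
    merged = set()
    for (s, t), matches in pairwise_constraints.items():
        for i, k in matches:
            merged.add((cols_a[s][i], cols_b[t][k]))
    return merged


# Toy input: alignment A has two rows, alignment B has one.
ROWS_A = ["AC-G", "A-UG"]
ROWS_B = ["ACG"]
PAIRWISE = {
    (0, 0): {(0, 0), (2, 2)},
    (1, 0): {(1, 1)},
}
merged = merged_match_constraints(ROWS_A, ROWS_B, PAIRWISE)
```

Averaging the basis features over sequence pairs would follow the same remapping, with a sum divided by the number of pairs in place of the set union.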

Using this new subroutine for aligning alignments, we can then perform multiple alignment in RAF using a standard progressive strategy (Feng and Doolittle, 1987). Specifically, we cluster the sequences with a UPGMA (Sneath and Sokal, 1962) tree-building procedure, using the expected accuracy similarity measure (Do et al., 2005). Finally, we perform progressive alignment by aligning subgroups of sequences according to the tree.
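To illustrate how a guide tree determines the order of progressive alignment, here is a small pure-Python UPGMA sketch operating on a similarity matrix (so the most similar clusters are merged first). The similarity values are invented for the example; RAF uses the expected accuracy measure, which is not reproduced here.

```python
def upgma_merge_order(similarity):
    """Return the UPGMA merge order as a list of (leaves_x, leaves_y) tuples.

    similarity is a symmetric matrix (list of lists). Cluster-to-cluster
    similarity is the size-weighted average of member similarities.
    """
    n = len(similarity)
    clusters = {i: (i,) for i in range(n)}
    sim = {(i, j): similarity[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        x, y = max(sim, key=sim.get)  # most similar pair of clusters
        size_x, size_y = len(clusters[x]), len(clusters[y])
        updated = {}
        for z in clusters:
            if z in (x, y):
                continue
            s_xz = sim[(min(x, z), max(x, z))]
            s_yz = sim[(min(y, z), max(y, z))]
            updated[z] = (size_x * s_xz + size_y * s_yz) / (size_x + size_y)
        merges.append((clusters[x], clusters[y]))
        merged_leaves = clusters[x] + clusters[y]
        sim = {key: v for key, v in sim.items() if x not in key and y not in key}
        del clusters[x], clusters[y]
        for z, value in updated.items():
            sim[(z, next_id)] = value  # next_id is larger than every old id
        clusters[next_id] = merged_leaves
        next_id += 1
    return merges


TOY_SIMILARITY = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
merge_order = upgma_merge_order(TOY_SIMILARITY)
```

Each entry of the returned list corresponds to one call to the alignment-of-alignments subroutine, in the order the calls would be made.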

2.4 A max-margin framework

Given a set of training examples, S={(a⁽ⁱ⁾,b⁽ⁱ⁾,y⁽ⁱ⁾)}_{i=1}^m, the parameter estimation problem is the task of identifying a vector of weights w=(w₁,w₂,…,w_n)∈ℝⁿ for which the RAF inference algorithm, as described in Section 2.2, will yield accurate alignments and consensus structures. In this section, we present a max-margin framework for parameter estimation in RAF.

2.4.1 Formulation

In the max-margin framework, our goal is to obtain a parameter vector w for which running the RAF inference algorithm will generate accurate alignments and consensus structures. Clearly, this goal is met if for each training example (a⁽ⁱ⁾,b⁽ⁱ⁾,y⁽ⁱ⁾) from our training set S,⁴

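$$\mathrm{Score}(y^{(i)}; w) > \mathrm{Score}(y'; w) \quad \text{for all } y' \in \mathcal{Y}_{\mathcal{C}},\ y' \neq y^{(i)}. \tag{3}$$

In such a case, we would be guaranteed that the maximum of (2) is attained for y*=y⁽ⁱ⁾ (provided the true parse y⁽ⁱ⁾ belongs to 𝒴_𝒞, the constrained parse space for a⁽ⁱ⁾ and b⁽ⁱ⁾), and hence our inference procedure would necessarily return the correct alignment and consensus folding. This intuition is captured in the following convex optimization problem: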

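$$\min_{w \in \mathbb{R}^n,\ \xi \in \mathbb{R}^m} \ \frac{1}{2} C \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad \mathrm{Score}(y^{(i)}; w) - \mathrm{Score}(y'; w) \ge \Delta(y^{(i)}, y') - \xi_i \quad \text{for all } y' \in \mathcal{Y}_{\mathcal{C}},\ i = 1, \ldots, m. \tag{4}$$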

Here, C is a regularization constant, and Δ(y⁽ⁱ⁾,y′) is a non-negative distance measure between a pair of parses, conventionally referred to as the loss function, which takes value 0 if and only if its two arguments are equal (Section 2.4.2).

The inequality constraints play the role of (3)—they try to ensure that the training output y⁽ⁱ⁾ scores higher than any alternative incorrect parse y′ by some positive amount Δ(y⁽ⁱ⁾,y′). In cases where this condition is not achieved, the objective function incurs a penalty of ξ_i. Finally, the regularization term (½)C‖w‖² is a penalty used to prevent overfitting.⁵

2.4.2 The loss function

The loss function Δ(y⁽ⁱ⁾,y′) in (4) plays two significant roles. Technically, the loss function establishes an appropriate scale for the parameters of the problem and prevents the trivial solution, w=0. Intuitively, however, the loss function also helps to make the max-margin optimization robust. By choosing a loss function that takes large positive values for incorrect candidate outputs y′ that differ from the true output y⁽ⁱ⁾ in a very critical way, but that takes small positive values for incorrect candidate outputs y′ whose errors are more forgivable, the loss function allows the user to implement a notion of ‘cost’ for different types of mistakes in the max-margin model.

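For RAF, we defined the loss function by restricting our attention to four types of parsing errors: (1) false positive base-pairings ((a_i, a_j)∈y′ ∖ y⁽ⁱ⁾, or similarly in sequence b), (2) false negative base-pairings ((a_i, a_j)∈y⁽ⁱ⁾ ∖ y′, or similarly in sequence b), (3) false positive aligned matches ((a_i, b_k)∈y′ ∖ y⁽ⁱ⁾) and (4) false negative aligned matches ((a_i, b_k)∈y⁽ⁱ⁾ ∖ y′). Then, we set

$$\Delta(y^{(i)}, y') = \gamma^{\mathrm{FP\ paired}} \cdot \#\{\text{false positive base-pairings}\} + \gamma^{\mathrm{FN\ paired}} \cdot \#\{\text{false negative base-pairings}\} + \gamma^{\mathrm{FP\ aligned}} \cdot \#\{\text{false positive aligned matches}\} + \gamma^{\mathrm{FN\ aligned}} \cdot \#\{\text{false negative aligned matches}\}.$$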

The numbers γ^{FN paired}, γ^{FP paired}, γ^{FN aligned} and γ^{FP aligned} are hyperparameters, chosen by the user prior to training the RAF algorithm, which allow the user to express her preference for models with either high sensitivity or high specificity for base-pairing positions and aligned nucleotides.⁶
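A loss of this kind can be sketched in Python as a weighted count of the four error types. The assumption here is that each error type contributes its γ weight once per error; the set-based representation of parses and the example weights are hypothetical.

```python
def parse_loss(
    true_pairs,
    pred_pairs,
    true_aligned,
    pred_aligned,
    gamma_fp_paired,
    gamma_fn_paired,
    gamma_fp_aligned,
    gamma_fn_aligned,
):
    """Weighted count of false positive / false negative pairings and matches.

    Base pairs from both sequences can share one set if each element is
    tagged with its sequence, e.g. ("a", i, j) and ("b", k, l).
    """
    fp_paired = len(pred_pairs - true_pairs)
    fn_paired = len(true_pairs - pred_pairs)
    fp_aligned = len(pred_aligned - true_aligned)
    fn_aligned = len(true_aligned - pred_aligned)
    return (
        gamma_fp_paired * fp_paired
        + gamma_fn_paired * fn_paired
        + gamma_fp_aligned * fp_aligned
        + gamma_fn_aligned * fn_aligned
    )


TRUE_PAIRS = {("a", 1, 10), ("a", 2, 9)}
PRED_PAIRS = {("a", 1, 10), ("a", 3, 8)}
TRUE_ALIGNED = {(1, 1), (2, 2), (3, 3)}
PRED_ALIGNED = {(1, 1), (2, 3)}

toy_loss = parse_loss(
    TRUE_PAIRS, PRED_PAIRS, TRUE_ALIGNED, PRED_ALIGNED,
    gamma_fp_paired=1.0, gamma_fn_paired=2.0,
    gamma_fp_aligned=0.5, gamma_fn_aligned=0.25,
)
```

Setting a false negative weight higher than the corresponding false positive weight, as in this toy call for base pairs, pushes training toward higher sensitivity; the reverse favors specificity.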

2.4.3 Optimization algorithm

At first glance, the constrained optimization problem stated in (4) appears to be a standard convex quadratic program and hence solvable using off-the-shelf packages for convex programming. In reality, for each training example, the optimization problem has an exponential number of inequalities, one corresponding to each possible candidate parse y′ of the input sequences! Despite our use of constraint sets to reduce the set of allowed candidate outputs, in most cases, this space is still too large to enumerate.

One approach to deal with this problem is an iterative algorithm known as constraint generation (or column generation), as used in the program CG (Andronescu et al., 2007). In this approach, the parameter vector w_t at each time t is the solution to a reduced version of (4) in which only a small subset of the constraints is retained. Next, one checks if w_t violates any of the constraints of the original full optimization problem by more than a prescribed tolerance of ɛ. If so, the worst violated constraint is added to the current set of constraints to form a new reduced optimization problem, whose solution, in turn, gives the next iterate w_{t+1}. If not, the optimization algorithm terminates. Each of the optimization problems in the sequence requires a quadratic programming solver.

Here, we take a simpler approach based on the recent SVM training algorithm of Shalev-Shwartz and Singer (2007) and Shalev-Shwartz et al. (2007). Omitting details, we begin by converting (4) into an equivalent unconstrained problem: namely, minimize (with respect to w∈ℝⁿ),

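$$f(w) = \frac{1}{2} C \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max_{y' \in \mathcal{Y}_{\mathcal{C}}} \left[ \Delta(y^{(i)}, y') + \mathrm{Score}(y'; w) - \mathrm{Score}(y^{(i)}; w) \right]. \tag{5}$$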

Next, we use strong duality from optimization theory in order to derive an upper bound B on the norm of the optimal solution of our unconstrained problem (Appendix C.1). Finally, we actually run the optimization procedure by applying the simple update rule,

$$w_{t+1} = \Pi_B\!\left[ w_t - \frac{1}{C t} \, g_t \right], \tag{6}$$

starting from w₁=0. Here, g_t∈∂f(w_t) is any subgradient of the objective function f(w) evaluated at w=w_t, and the operator Π_B[·] projects a vector onto an origin-centered ball of radius B (i.e. Π_B[v]=(B/‖v‖)v if ‖v‖>B and Π_B[v]=v otherwise). Intuitively, the algorithm works much like a standard gradient descent procedure adapted for non-differentiable objective functions, but with the added twist that the projection operation ensures that the weight vector iterates stay within a region of the parameter space where the optimum is known to exist.

Given an existing routine for computing subgradients of the unconstrained objective, this algorithm can be implemented in a few lines of code with no complicated numerical optimization software. As shown by Shalev-Shwartz and Singer (2007), the algorithm is also quite efficient, requiring only Õ(m/Cɛ) iterations to achieve ɛ accuracy on a training set of m examples. An online variant of the algorithm, in which the subgradients g_t in each step are computed based only on a randomly sampled subset of the training data (e.g. a single example), achieves an Õ(1/Cɛ) expected running time, independent of m, the size of the training set.

2.4.4 Subgradient computation

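Finally, we show how to compute a subgradient g_t∈∂f(w_t). In order to simplify notation, define an n-dimensional vector Φ(y) whose pth component is

$$\Phi_p(y) = \begin{cases} \sum_{(a_i, b_k) \in y} \phi^{\mathrm{aligned}}_p(i, k) & \text{if } 1 \le p \le n_{\mathrm{aligned}}, \\ \sum_{\langle (a_i, a_j), (b_k, b_l) \rangle \in \mathcal{B}(y)} \phi^{\mathrm{paired}}_{p - n_{\mathrm{aligned}}}(i, j; k, l) & \text{if } n_{\mathrm{aligned}} < p \le n, \end{cases}$$

from which it follows that Score(y;w)=w^TΦ(y). We can apply the usual rules for computing subgradients (see, e.g. Bertsekas et al., 2003) to obtain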

$$g_t = C w_t + \frac{1}{m} \sum_{i=1}^{m} \left[ \Phi(\hat{y}^{(i)}) - \Phi(y^{(i)}) \right], \tag{7}$$

where ŷ⁽ⁱ⁾ is simply any y′ which attains the maximum in the ith term of the summation in (5), for w=w_t. Each ‘loss-augmented’ maximization, in turn, is easily performed by modifying the original RAF inference procedure to incorporate an appropriately defined additional scoring matrix, φ₀(i, j; k, l), with fixed weight w₀=1.
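The structure of this computation can be sketched in Python for the case where the candidate parses of each example are few enough to enumerate. That enumeration is an assumption made for illustration only (RAF performs the loss-augmented maximization with its dynamic programming engine), as is the averaging of the per-example terms over the m examples. Each candidate is represented directly by its feature vector Φ(y′) and its loss.

```python
def dot(w, phi):
    """Inner product, i.e. the score w^T Phi(y) of a parse."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def subgradient(w, examples, C):
    """Subgradient of 0.5*C*||w||^2 + mean_i max_y' [loss + score(y') - score(y_i)].

    examples is a list of (phi_true, candidates); candidates is a list of
    (phi_candidate, loss) pairs and should include the true parse with loss 0.
    """
    m = len(examples)
    g = [C * wi for wi in w]
    for phi_true, candidates in examples:
        # Loss-augmented inference: candidate maximizing score plus loss.
        phi_hat, _ = max(candidates, key=lambda c: dot(w, c[0]) + c[1])
        for p in range(len(w)):
            g[p] += (phi_hat[p] - phi_true[p]) / m
    return g


# One toy example: the true parse has features [1, 0]; a wrong parse has
# features [0, 1] and loss 1.
TOY_EXAMPLES = [([1.0, 0.0], [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)])]

g_at_zero = subgradient([0.0, 0.0], TOY_EXAMPLES, C=1.0)
g_with_margin = subgradient([2.0, 0.0], TOY_EXAMPLES, C=1.0)
```

At w = 0 the wrong parse wins the loss-augmented maximization, so the subgradient points toward raising the weight of the true parse's feature; once the true parse beats it by at least the loss, only the regularization term remains.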

3 RESULTS

To evaluate the performance of RAF on real data, we collected training and testing data from a variety of sources. In particular, for training, we obtained Rfam 8.1 (Griffiths-Jones et al., 2005), a database of alignments and covariance models for RNA families along with annotated secondary structures where available. For testing, we obtained BRAliBASE II (Gardner et al., 2005), a benchmark set for RNA alignment programs. We also obtained a testing set of RNA families used by the authors of the recent program, MASTR (Lindgreen et al., 2007).

An important concern in the validation of RNA alignment programs is the confounding factor that unless cross-validation is properly performed, the performance that one sees on any given validation set is not likely to be a reliable judge of the program's performance on future data. Even in cases where the training and evaluation tests are disjoint but still contain sequences from the same RNA family, evaluation can still give misleading results, because the weights learned for loop lengths and composition will be biased toward specific properties of that RNA family.

To be absolutely sure of no contamination between training and testing data, we preprocessed our Rfam training set of alignments and consensus structures (October 2007 version, 607 families) by excluding all families for which either of the two testing databases contained an example from that family. We then also removed all families for which only automatically predicted consensus structures were known, leaving a total of 154 families. Finally, we generated a training set 𝒯₁ of up to 10 randomly sampled pairwise alignments with consensus structures from each remaining family (1361 pairwise alignments in total), a training set 𝒯₂, of up to 10 randomly sampled sequences with structures from each family (1179 sequences in total), and a training set 𝒯₃, containing one randomly sampled five-way multiple alignment from each family (118 multiple alignments in total).

RAF uses two external programs, CONTRAlign (Do et al., 2006a) and CONTRAfold (Do et al., 2006b), to compute alignment match and base-pairing posterior probabilities, respectively. To ensure proper cross-validation, CONTRAlign was retrained from scratch using 𝒯₁, and CONTRAfold was retrained using 𝒯₂. Finally, the RAF algorithm itself was trained using all pairwise projections of each multiple alignment of 𝒯₃. Our strict cross-validation procedure significantly reduces both the size and coverage of the training sets used for CONTRAlign and CONTRAfold, and thus places RAF at a significant disadvantage in the comparisons shown here. Nonetheless, as shown in the following sections, RAF performs well, indicating its ability to generalize for sequences not present in the training set.

3.1 Alignment and base-pairing constraints

To observe the effects of different cutoffs ɛ_aligned and ɛ_paired, we computed the proportions of reference base pairings and reference aligned matches recovered for varying cutoff constraints. In addition, we also computed the sparsity factor (i.e. the maximum number of pairing partners or matching partners for any nucleotide, averaged over the entire training set) for each cutoff. A plot of coverage against sparsity factor for training set 𝒯₃ is shown in Figure 2. As seen in the figure, nearly complete coverage of base pairings and alignment matches can be retained when each sparsity factor is roughly 10.⁷
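Both quantities are simple to compute from candidate sets. The Python below is a sketch for base-pair candidates within one sequence, where both ends of a pair count as partners of each other; for alignment matches, positions of a and b would need to be counted separately. Function names and toy data are assumptions of this sketch.

```python
from collections import Counter


def max_partners(candidates):
    """Largest number of candidate partners of any single position."""
    counts = Counter()
    for i, j in candidates:
        counts[i] += 1
        counts[j] += 1
    return max(counts.values(), default=0)


def average_sparsity_factor(candidate_sets):
    """Per-example maximum partner count, averaged over a collection of examples."""
    return sum(max_partners(c) for c in candidate_sets) / len(candidate_sets)


def coverage(reference, candidates):
    """Proportion of reference pairs that survive in the candidate set."""
    if not reference:
        return 1.0
    return len(reference & candidates) / len(reference)


TOY_CANDIDATES = {(0, 5), (0, 6), (1, 4)}
TOY_REFERENCE = {(0, 5), (2, 3)}

toy_max = max_partners(TOY_CANDIDATES)
toy_coverage = coverage(TOY_REFERENCE, TOY_CANDIDATES)
toy_average = average_sparsity_factor([{(0, 5), (0, 6)}, {(1, 2)}])
```

Sweeping the cutoff and recording these two numbers at each value traces out a coverage versus sparsity curve of the kind described above.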

Fig. 2. — Trade-off between sparsity factor and proportion of reference base-pairings or aligned matches covered when varying the cutoffs ɛ_paired and ɛ_aligned. This graph was made using training set 𝒯₃.

3.2 Evaluation metrics

To evaluate the quality of the resulting alignments, we used four different scoring measures:

(1) the standard sum-of-pairs (SP) score (Thompson et al., 1999), which computes the proportion of matches in a reference alignment which are present in the predicted alignment,
(2) sensitivity (Sens), the proportion of base pairings in a reference parse which are recovered in the predicted parse,
(3) specificity or positive-predictive value (PPV), the proportion of base pairings in a predicted parse which are also present in the reference parse, and
(4) the Matthews correlation coefficient (MCC) (Matthews, 1975), which we approximate as √(Sens·PPV), following Gorodkin et al. (2001).
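The four measures can be sketched in Python over sets of matches and base pairs. The MCC line assumes the geometric-mean approximation √(Sens·PPV), which agrees with the values reported in the results tables; the set-based representation and the toy data are illustrative.

```python
import math


def fraction_recovered(reference, predicted):
    """Proportion of items in reference that also appear in predicted."""
    if not reference:
        return 0.0
    return len(reference & predicted) / len(reference)


def sum_of_pairs(ref_matches, pred_matches):
    """SP score: reference aligned matches present in the prediction."""
    return fraction_recovered(ref_matches, pred_matches)


def sensitivity(ref_pairs, pred_pairs):
    """Reference base pairs recovered in the predicted parse."""
    return fraction_recovered(ref_pairs, pred_pairs)


def ppv(ref_pairs, pred_pairs):
    """Predicted base pairs that are also in the reference parse."""
    return fraction_recovered(pred_pairs, ref_pairs)


def mcc_approx(sens, ppv_value):
    """Approximate MCC as the geometric mean of sensitivity and PPV."""
    return math.sqrt(sens * ppv_value)


REF_PAIRS = {(1, 10), (2, 9), (3, 8)}
PRED_PAIRS = {(1, 10), (2, 9), (4, 7), (5, 6)}

toy_sens = sensitivity(REF_PAIRS, PRED_PAIRS)
toy_ppv = ppv(REF_PAIRS, PRED_PAIRS)
toy_mcc = mcc_approx(toy_sens, toy_ppv)
```

For example, a sensitivity of 0.68 and a PPV of 0.77 give an approximate MCC of 0.72 after rounding to two decimals.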

3.3 Comparison of accuracy

In our first accuracy assessment, we evaluated RAF as well as a number of other current RNA secondary structure prediction programs using the BRAliBASE II dataset. In particular, the first dataset from BRAliBASE II contains collections of 100 five-sequence subalignments, sampled from five specific Rfam families (5S rRNA, group II intron, SRP, tRNA and U5). For each of these alignments, we ran a number of current multiple-sequence RNA secondary structure prediction programs, including Murlet v0.1.1 (Kiryu et al., 2007), LocARNA v1.2.2a (Will et al., 2007), and RNA Sampler v1.3 (Xu et al., 2007). Wherever any of these programs required access to external pairing-posterior probabilities, we used ViennaRNA v1.7 (Hofacker et al., 1994). The results of the comparison are shown in Table 2.

Table 2.

Performance comparison on BRAliBASE II datasets. The best number in each column is marked in bold

Dataset	Program	Time (s)	SP	Sens	PPV	MCC
5S rRNA	Murlet	687	0.94	0.70	0.70	0.70
	LocARNA	812	0.93	0.55	0.60	0.57
	RNA Sampler	2361	0.90	0.55	0.64	0.59
	RAF	87	0.95	0.66	0.66	0.66
group II intron	Murlet	962	0.78	0.75	0.76	0.75
	LocARNA	250	0.74	0.79	0.65	0.72
	RNA Sampler	1626	0.72	0.77	0.65	0.71
	RAF	48	0.78	0.83	0.65	0.73
SRP	Murlet	20548	0.88	0.75	0.78	0.76
	LocARNA	22467	0.85	0.66	0.70	0.68
	RAF	1290	0.87	0.72	0.71	0.70
tRNA	Murlet	525	0.93	0.86	0.90	0.88
	LocARNA	246	0.95	0.86	0.90	0.88
	RNA Sampler	763	0.92	0.93	0.91	0.92
	RAF	52	0.94	0.81	0.85	0.83
U5	Murlet	1772	0.84	0.69	0.75	0.72
	LocARNA	549	0.80	0.56	0.61	0.58
	RNA Sampler	4084	0.77	0.75	0.70	0.72
	RAF	99	0.82	0.83	0.79	0.81

As seen from the table, on the BRAliBASE II benchmark, RAF attains accuracy comparable to the other methods, achieving either the best or second-best MCC on four of the five datasets. RAF is also dramatically faster than the other algorithms, often by roughly an order of magnitude (e.g. 87 s versus 687 s for Murlet on the 5S rRNA dataset).

We also obtained the dataset used in the benchmarking of the MASTR RNA secondary structure prediction program. For a number of different programs, pre-generated predictions for each input file are available for download on the MASTR website. In addition to scoring these pre-generated predictions, we also generated and scored predictions using Murlet and RAF. The results are shown in Table 3. In this benchmark set, RAF obtains the highest overall MCC.

Table 3.

Performance comparison on MASTR benchmarking sets. The best number in each column is marked in bold.

Program	SP	Sens	PPV	MCC
CLUSTAL W+Alifold	0.81	0.57	0.73	0.65
FoldalignM	0.78	0.38	0.81	0.55
LocARNA	0.75	0.41	0.77	0.56
MASTR	0.84	0.64	0.73	0.68
Murlet	0.89	0.62	0.78	0.70
RNAforester	0.53	0.55	0.55	0.55
RNA Sampler	0.82	0.65	0.70	0.67
RAF	0.88	0.68	0.77	0.72

We emphasize, however, that benchmarking results such as these should be taken with a grain of salt; both the BRAliBASE II and MASTR benchmarking sets are extremely restricted in their coverage of the space of RNA families, choosing to focus on a few individual RNA families only. As a result, methods carefully tuned to the benchmarks may perform less well on diverse RNA families not found in either of these benchmarks. By using cross-validation, we improve the chances that RAF's validation results really do indicate reliable out-of-sample performance.

We also note that the performance of RAF on particular RNA families is often closely related to the accuracy of the underlying alignment and single-sequence models used to derive folding and alignment constraints. Because the tools involved in the RAF pipeline all rely on automatic parameter learning, RAF allows the possibility of learning custom parameter sets well-suited for predictions on particular RNA families.

4 DISCUSSION

We presented RAF, a new tool for simultaneous folding and alignment of RNA sequences which exploits sparsity in base pairing and alignment probability matrices and max-margin training in order to achieve faster running times and higher accuracy than previous tools.

Besides its speed, one principal advantage of the RAF methodology is its use of a flexible scoring function for combining an arbitrary set of functions into a coherent objective function for alignment scoring. The ability to introduce new basis scoring functions into the RAF scoring model means that there remains a rich space of possible features to explore.

In addition, the use of the max-margin framework to identify relevant linear combinations of scoring functions has other promising potential applications. For example, Wallace et al. (2006) recently introduced M-Coffee, a meta-algorithm for protein sequence alignment, which combines the results of several different protein sequence alignment programs using the T-Coffee framework. The difficulty of identifying appropriate weights for the various programs used in the M-Coffee scoring scheme (i.e. some heuristically derived tree-based weights the authors tried did not give a significant improvement in accuracy over flat weights), led the authors to rely on a uniform weight model, treating programs known to be more accurate on equal footing with less accurate aligners. The max-margin framework developed in this paper obviates the need for heuristically-derived weights altogether.

ACKNOWLEDGEMENTS

C.B.D. was supported by an NSF Graduate Research Fellowship. C.S.F. was supported by an A*STAR National Science Scholarship. This material is based in part upon work supported by the NSF under grant number EF-0312459.

Conflict of Interest: none declared.

APPENDIX

A.1 RAF features

The features used by the RAF program, as evaluated in this article, consist of alignment features, φ^aligned, and pairing features, φ^paired. Specifically, the alignment features, φ^aligned(i, k)∈ℝ⁴ for a candidate alignment match (a_i, b_k) are

(A1)

The pairing features, φ^paired(i, j; k, l)∈ℝ⁴ for a conserved base pairing 〈(a_i, a_j), (b_k, b_l)〉 are given by φ^paired(i, j; k, l)=φ^paired(a_i, a_j)+φ^paired(b_k, b_l). In turn, φ^paired(a_i, a_j)∈ℝ⁴ is given by

(A2)

and similarly for φ^paired(b_k, b_l). Thus, the model contains a total of eight features whose weights must be learned. Here, the posterior probabilities for aligned positions and base-pairing positions are computed using the CONTRAlign (Do et al., 2006a) and CONTRAfold (Do et al., 2006b) programs, respectively.

B.1 The RAF inference engine

In this section, we describe the RAF inference engine for fast approximate simultaneous alignment and consensus folding for pairs of sequences. In particular, we first present some exact recurrences for alignment and folding, and then use restrictions on the set of allowed base pairings and aligned positions to achieve an improvement in computational complexity.

B.1.1 Recurrences

First, we describe a straightforward O(L⁶) dynamic programming recurrence for computing the optimal simultaneous alignment and consensus fold for a pair of sequences a and b.

To compute the optimal parse of a and b, we construct two four-dimensional matrices, S and D. Here, S_{i, j; k, l} denotes the optimal score for aligning and folding a_{i+1}a_{i+2}…a_j with b_{k+1}b_{k+2}…b_l. Furthermore, D_{i, j; k, l} denotes the optimal score for aligning and folding these same substrings, subject to the additional constraint that the outermost positions (a_{i+1}, a_j) and (b_{k+1}, b_l) form conserved base pairs.

For 0≤i≤j≤|a| and 0≤k≤l≤|b|, we have

(B1)

and for 0≤i<i+2≤j≤|a| and 0≤k<k+2≤l≤|b|,

(B2)

Here, recurrence (B1) takes the form of a standard Needleman-Wunsch procedure for aligning the substring a_{i+1}a_{i+2}…a_j with b_{k+1}b_{k+2}…b_l, with an extra case to handle bifurcations in the base-pairing structure of the RNAs. At the end of the recurrence, S_{0,|a|;0,|b|} gives the score of the optimal alignment and consensus fold of the input sequences a and b. By using traceback pointers in the standard way, the optimal parse can be recovered easily once the recurrence has been evaluated.
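As an illustration only, the following minimal sketch evaluates recurrences of this general shape by memoized recursion. It is not the RAF implementation: the functions `match_score` and `pair_score` are hypothetical stand-ins for the learned scores, gaps are assumed to score zero, and the score of a conserved base pair is assumed to absorb the alignment of its two endpoints. The recursion is practical only for very short toy inputs.

```python
from functools import lru_cache

NEG_INF = float("-inf")
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def match_score(x, y):
    # Hypothetical stand-in for the learned score of aligning two nucleotides.
    return 1.0 if x == y else 0.0

def pair_score(x1, y1, x2, y2):
    # Hypothetical stand-in for the learned score of a conserved base pair:
    # (x1, y1) pair in the first sequence, (x2, y2) in the second.
    ok = (x1, y1) in WATSON_CRICK and (x2, y2) in WATSON_CRICK
    return 5.0 if ok else NEG_INF

def optimal_score(a, b):
    """Score of the best joint alignment and consensus fold of strings a and b."""

    @lru_cache(maxsize=None)
    def S(i, j, k, l):
        # Best score for a_{i+1}..a_j against b_{k+1}..b_l.
        best = 0.0 if (i == j and k == l) else NEG_INF
        if i < j:                      # a_j left unaligned
            best = max(best, S(i, j - 1, k, l))
        if k < l:                      # b_l left unaligned
            best = max(best, S(i, j, k, l - 1))
        if i < j and k < l:            # a_j aligned with b_l
            best = max(best, S(i, j - 1, k, l - 1) + match_score(a[j - 1], b[l - 1]))
        for jp in range(i, j - 1):     # bifurcation: split at (j', l')
            for lp in range(k, l - 1):
                best = max(best, S(i, jp, k, lp) + D(jp, j, lp, l))
        return best

    @lru_cache(maxsize=None)
    def D(i, j, k, l):
        # As S, but (a_{i+1}, a_j) and (b_{k+1}, b_l) must form conserved pairs.
        return pair_score(a[i], a[j - 1], b[k], b[l - 1]) + S(i + 1, j - 1, k + 1, l - 1)

    return S(0, len(a), 0, len(b))
```

With these toy scores, a pair of sequences with no complementary nucleotides reduces to a longest-common-subsequence computation, which gives a simple sanity check on the alignment cases.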

In the next section, we explore how these recurrences may be sped up considerably if a constraint set 𝒞 of allowed base pairings and aligned positions is known ahead of time. For complexity analysis, we assume O(c) and O(d) bounds on the number of candidate base pairing and alignment partners per sequence position, respectively.

B.1.2 Exploiting base-pairing sparsity

LocARNA (Will et al., 2007) was the first program for simultaneous alignment and folding of RNA to take advantage of base-pairing sparsity in a manner that significantly improved both running time and memory usage. In this section, we recount the innovations of LocARNA as they are applied in RAF. In the next section, we extend these ideas to also account for alignment sparsity.

First, observe that since all parses in 𝒴_𝒞 contain only conserved base pairings, the evaluation of (B2) may be restricted to only those D_{i, j; k, l} cells for which both (a_{i+1}, a_j)∈𝒞 and (b_{k+1}, b_l)∈𝒞. Similarly, the inner loop for considering bifurcations in (B1) may also be restricted to only those j′ and l′ for which both (a_{j′+1}, a_j)∈𝒞 and (b_{l′+1}, b_l)∈𝒞. Since the bottleneck in the dynamic programming complexity is the number of executions of the innermost loop in (B1), it follows that restricting the considered bifurcations in the manner described above yields an O(c²L⁴) running time; in particular, for each i and k, computing all values of S_{i,•;k,•} takes O(c²L²) time as each entry of the D matrix is touched at most once. This optimization was originally implemented as part of the LocARNA (Will et al., 2007) and FoldAlignM (Torarinsson et al., 2007) algorithms.
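The restriction of the bifurcation loop can be sketched as follows. This is a hypothetical illustration rather than the RAF or LocARNA code: it assumes the candidate base pairs of each sequence are given as lists of 1-based position pairs (p, q) with p < q, and the function names are introduced here for illustration only.

```python
from collections import defaultdict

def index_left_partners(pairs):
    """Map each right endpoint q to the left endpoints p with (p, q) allowed."""
    left = defaultdict(list)
    for p, q in pairs:
        left[q].append(p)
    return left

def bifurcation_candidates(left_a, left_b, j, l):
    """List the (j', l') with (a_{j'+1}, a_j) and (b_{l'+1}, b_l) both allowed.

    With at most c partners per position, at most c * c candidates are
    returned, instead of the O(L^2) split points of the unrestricted loop.
    """
    return [(p - 1, q - 1) for p in left_a.get(j, []) for q in left_b.get(l, [])]
```

For example, with `left_a = index_left_partners([(1, 5), (2, 5), (3, 8)])` and `left_b = index_left_partners([(1, 4), (2, 4)])`, only four split points need to be examined for the cell ending at (j, l) = (5, 4).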

Second, consider the task of computing all entries in the D matrix. From (B2), we see that the values D_{i,•;k,•} depend only on S_{i+1,•;k+1,•}. Similarly, from (B1), the values S_{i+1,•;k+1,•} depend only on D_{j′,j;l′,l} for j′≥i+1 and l′≥k+1. Thus, the recurrences can be evaluated in a single pass by iterating over i and k in decreasing order and, at each step, first computing S_{i+1,•;k+1,•} and then D_{i,•;k,•}.
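A minimal sketch of one evaluation order consistent with these dependencies is given below. It is an illustration under stated assumptions, not the RAF implementation: `match_score` and `pair_score` are hypothetical toy scores, gaps are assumed to score zero, and no sparsity constraints are applied. Only one slice of S is alive at a time, while D is kept in full.

```python
NEG_INF = float("-inf")
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def match_score(x, y):
    # Hypothetical stand-in for the learned alignment score.
    return 1.0 if x == y else 0.0

def pair_score(x1, y1, x2, y2):
    # Hypothetical stand-in for the learned conserved base-pair score.
    ok = (x1, y1) in WATSON_CRICK and (x2, y2) in WATSON_CRICK
    return 5.0 if ok else NEG_INF

def s_slice(a, b, D, i0, k0):
    """Return S[j, l] = S_{i0,j;k0,l} for all i0 <= j <= |a|, k0 <= l <= |b|."""
    n, m = len(a), len(b)
    S = {}
    for j in range(i0, n + 1):
        for l in range(k0, m + 1):
            best = 0.0 if (j == i0 and l == k0) else NEG_INF
            if j > i0:
                best = max(best, S[j - 1, l])
            if l > k0:
                best = max(best, S[j, l - 1])
            if j > i0 and l > k0:
                best = max(best, S[j - 1, l - 1] + match_score(a[j - 1], b[l - 1]))
            for jp in range(i0, j - 1):          # bifurcations use stored D only
                for lp in range(k0, l - 1):
                    best = max(best, S[jp, lp] + D[jp, j, lp, l])
            S[j, l] = best
    return S

def align_and_fold(a, b):
    n, m = len(a), len(b)
    D = {}
    for i in range(n - 2, -1, -1):               # decreasing i
        for k in range(m - 2, -1, -1):           # decreasing k
            S = s_slice(a, b, D, i + 1, k + 1)   # needs D[j', ., l', .] with j' > i
            for j in range(i + 2, n + 1):
                for l in range(k + 2, m + 1):
                    D[i, j, k, l] = (pair_score(a[i], a[j - 1], b[k], b[l - 1])
                                     + S[j - 1, l - 1])
            # S goes out of scope here; only D is retained.
    return s_slice(a, b, D, 0, 0)[n, m]          # recompute S_{0,.;0,.} at the end
```

The loops start at |a|−2 and |b|−2 rather than |a| and |b| simply because no D cell exists for larger start indices.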

Furthermore, since S_{i+1,•;k+1,•} is only needed while computing D_{i,•;k,•} (but not for any later values of i and k), we need only retain one S_{i+1,•;k+1,•} matrix in memory at any given time while computing the D matrix. This observation was originally incorporated in the LocARNA program of Will et al. (2007).

Finally, observe that once the D matrix has been computed, the score S_{0,|a|;0,|b|} of the optimal parse is easily obtainable in O(c²L²) time by recomputing S_{0,•;0,•}. Likewise, computing the full traceback requires at most O(c²L³) time, negligible relative to the cost of computing the D matrix itself. Thus, we obtain an overall O(c²L⁴) time complexity with O(c²L²) space complexity (for storing the D matrix).

B.1.3 Exploiting alignment sparsity

To exploit sparsity in the set of allowed aligned positions in 𝒞, we again use the strategy of limiting the DP region. We accomplish this by first considering the simpler problem of computing the reduced DP region 𝒜 (known as the alignment envelope) for pairwise sequence alignment without folding scores. Using 𝒜, we then define a reduced DP region for our original alignment and folding task.

For the first step, consider the following restatement of recurrence (B1) using the notation A_{j,l} = S_{0,j;0,l}, where we have omitted the case involving bifurcations/base pairing:

As before, A_{j,l} represents the optimal score of aligning a₁a₂…a_j to b₁b₂…b_l. Here, our goal is to find 𝒜, the minimal set of cells containing no holes,⁸ such that for every parse y∈𝒴_𝒞, there exists some DP path through 𝒜 corresponding to an alignment with the same set of aligned positions. Under the assumption that 𝒜 contains no holes, we can represent 𝒜 by keeping track of its boundaries: for each j∈{0,1,…,|a|}, let 〈𝒜.First[j], 𝒜.Last[j]〉 denote the first and last positions l∈{0,1,…,|b|} such that A_{j,l}∈𝒜.

We compute these boundaries in linear time using the following procedure. First, we adjust the boundaries to include A_{j−1,l−1} and A_{j,l} for each candidate aligning pair (a_j,b_l)∈𝒞. In addition, we also include the corners A_{0,0} and A_{|a|,|b|} in 𝒜. Finally, we force the boundaries of 𝒜 to satisfy the monotonicity conditions

in such a way that guarantees all DP cells A_{j,l}∈𝒜 are accessible via some DP path from A_{0,0} to A_{|a|,|b|}.

For the second step, we define the reduced DP region for our original simultaneous alignment and folding recurrences as the set ℛ of all positions S_{i, j; k, l} such that A_{i,k}∈𝒜 and A_{j,l}∈𝒜. To use this reduced DP region ℛ, then, we simply force S_{i, j; k, l}=−∞ for all S_{i, j; k, l} ∉ ℛ. Under this restriction, we can reduce the amount of computation performed in the recurrence (B1) by iterating only over cells S_{i, j; k, l}∈ℛ, and similarly, restricting the evaluation of the D matrix in (B2) to only those cells D_{i, j; k, l} for which S_{i+1,j−1;k+1,l−1}∈ℛ. To ensure that each allowed parse belongs to 𝒴_𝒞, we could penalize any base pairing or aligned position not in 𝒞 by −∞. In practice, we instead augment 𝒞 to include all aligned matches allowed by ℛ, since this can be done at no increase in computational complexity.
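The boundary computation for the alignment envelope 𝒜 can be sketched as follows. This is a hedged illustration, not the RAF code: the function name is hypothetical, candidate aligned pairs are assumed to be given as 1-based (j, l) tuples, and the monotonicity conditions are assumed here to be that First and Last are non-decreasing in j and that consecutive rows overlap (First[j+1] ≤ Last[j]). Under these assumptions the region has no holes and every cell lies on some path from the cell (0, 0) to the cell (|a|, |b|), although the region produced is not claimed to be minimal.

```python
def alignment_envelope(n, m, candidates):
    """Return (first, last) row boundaries of an envelope for |a| = n, |b| = m.

    Row j of the envelope consists of the cells (j, l), first[j] <= l <= last[j].
    Runs in O(n + len(candidates)) time.
    """
    first = [m + 1] * (n + 1)   # m + 1 marks "row still empty"
    last = [-1] * (n + 1)

    def include(j, l):
        first[j] = min(first[j], l)
        last[j] = max(last[j], l)

    include(0, 0)               # corners of the DP matrix
    include(n, m)
    for j, l in candidates:     # a match (a_j, b_l) is the step (j-1, l-1) -> (j, l)
        include(j - 1, l - 1)
        include(j, l)

    for j in range(n - 1, -1, -1):   # make first non-decreasing (only lowers values)
        first[j] = min(first[j], first[j + 1])
    for j in range(1, n + 1):        # make last non-decreasing (only raises values)
        last[j] = max(last[j], last[j - 1])
    for j in range(n - 1, -1, -1):   # force row j to reach the start of row j + 1
        last[j] = max(last[j], first[j + 1])
    return first, last
```

Each pass only enlarges the region, so every cell included in the first stage is still present at the end.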

To analyze the new computational complexity of the algorithm, we begin by bounding the size of the D matrix in two different ways. First, for each of the O(cL) base pairs (a_i, a_j)∈𝒞, there are O(d) aligning partners for a_i and O(d) aligning partners for a_j, giving a total size of O(cd²L). Alternatively, for each of the O(dL) aligning pairs (a_i, b_k)∈𝒞, there are O(c) base-pairing partners for a_i and O(c) base-pairing partners for b_k, giving a total size of O(c²dL). Thus, the size of the D matrix is O(min(c,d)·cdL).

As in Section B.1.2, the space complexity of the algorithm is dominated by the cost of storing the D matrix, and hence is O(min(c,d)·cdL). Similarly, the time complexity can be estimated as the number of evaluations of the innermost loop in the bifurcation case of (B1). Since the innermost loop touches each entry of the D matrix at most once for each i and k, and since there are O(dL) choices of (a_i, b_k)∈𝒜, it follows that the time complexity of the algorithm is O(min(c,d)·cd²L²).⁹
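As a rough worked illustration of these bounds, ignoring constant factors, take c = d = 10 (the approximate sparsity factors quoted in footnote 7) and a hypothetical sequence length L = 200, chosen here only for the arithmetic:

```latex
\begin{align*}
\text{space (B.1.2):}\quad & c^2 L^2 = 10^2 \cdot 200^2 = 4 \times 10^{6}, \\
\text{space (B.1.3):}\quad & \min(c,d)\, c\, d\, L = 10 \cdot 10 \cdot 10 \cdot 200 = 2 \times 10^{5}, \\
\text{time (B.1.2):}\quad & c^2 L^4 = 10^2 \cdot 200^4 = 1.6 \times 10^{11}, \\
\text{time (B.1.3):}\quad & \min(c,d)\, c\, d^2 L^2 = 10 \cdot 10 \cdot 10^2 \cdot 200^2 = 4 \times 10^{8}.
\end{align*}
```

Under these assumed values, the bounds differ by a factor of 20 in space and 400 in time; the actual savings depend on the constants and on the true sparsity of 𝒞.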

C.1 Norm bound

In this section, we derive a bound on the maximum norm of the optimal parameter vector w* for (4). From standard arguments (see, e.g. Taskar et al., 2003), the dual optimization problem is

where

By strong duality, for any solutions (w^*,ξ^*) and α^* of the primal and dual optimization problems, respectively, the values of the primal and dual objectives must be equal, i.e.,

(C1)

Now, suppose that D_i∈ℝ for i=1,…,m satisfy

(C2)

In the case of the RAF loss function, for example, we can use

Then the KKT optimality condition w^*=w(α^*), the primal constraint that Inline graphic for i=1,…,m, and (C1) imply that

Therefore, Inline graphic . ▪

Footnotes

¹We say that a parse y of inputs a and b is valid provided that (1) each nucleotide of a and b base pairs with at most one other nucleotide in the same sequence; (2) each nucleotide aligns with at most one nucleotide in the opposite sequence; (3) neither sequence contains pseudo-knotted base pairings; (4) the alignment of the two sequences does not contain rearrangements or repeats; and (5) all base pairings are conserved.

²The original CONTRAlign program was designed for protein sequences. We adapted this for RNAs by removing all protein-specific features (e.g. hydrophobicity), modifying the underlying alphabet (A, C, G and U) and simply retraining on the appropriate training set.

³We note that the method described here bears some relation to the ‘candidate list’ algorithm of Wexler et al. (2007), which maintains sparse lists of potential bifurcation points for single sequence folding. By showing that the number of relevant bifurcation points has a negligible dependence on sequence length, the authors provide an effectively quadratic time algorithm for single-sequence folding. Here, our algorithm also relies on sparsity of bifurcation point candidates when dealing with pairwise alignment and folding, but unlike in the previous algorithm, the candidates are provided explicitly via the constraint set 𝒞.

⁴Note that our notation hides the dependencies of the Score function on each of the input sequences a⁽ⁱ⁾ and b⁽ⁱ⁾, and similarly for the unconstrained and constrained space of parses, 𝒴⁽ⁱ⁾ and 𝒴_𝒞⁽ⁱ⁾.

⁵By default, we used C=1. We found that when running the online Pegasos optimization algorithm (Section 2.4.3) for a fixed number of iterations, the resulting generalization performance for RAF is relatively insensitive to the value of C used, provided that C is not too large.

⁶By default, we used γ^{FN paired}=10, and γ^{FP paired}=γ^{FN aligned}=γ^{FP aligned}=1 in order to emphasize prediction of correct base pairings.

⁷In practice, we found that using cutoffs of ɛ_aligned∼0.01 and ɛ_paired∼0.002 gave a good trade-off between speed and accuracy of our algorithm when using CONTRAlign and CONTRAfold; these cutoffs correspond roughly to average sparsity factors of ∼10 each.

⁸That is, the cell (j, l) belongs to 𝒜 whenever (j₁, l) and (j₂, l) belong to 𝒜 for some j₁<j<j₂, or (j, l₁) and (j, l₂) belong to 𝒜 for some l₁<l<l₂.

⁹Note that in these bounds, we assume an O(c) bound on the number of base-pairing partners per position, and an O(d) bound on the number of aligning partners per position. A weaker condition would be to assume an O(cL) bound on the total number of candidate base-pairing partners for sequences a and b and, similarly, an O(dL) bound on the total number of candidate aligned positions; under these conditions, we obtain a worst-case space complexity of O(min(c,d)²L²) and a worst-case time complexity of O(min(c,d)²dL³).

REFERENCES

Andronescu M, et al. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics. 2007;23:19–28. doi: 10.1093/bioinformatics/btm223.
Bertsekas DP, et al. Convex analysis and optimization. Athena Scientific; 2003.
Brion P, Westhof E. Hierarchy and dynamics of RNA folding. Annu. Rev. Biophys. Biomol. Struct. 1997;26:113–137. doi: 10.1146/annurev.biophys.26.1.113.
Dalli D, et al. STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006;22:1593–1599. doi: 10.1093/bioinformatics/btl142.
Dieffenbach CW, et al. General concepts for PCR primer design. PCR Methods Appl. 1993;3:30–37. doi: 10.1101/gr.3.3.s30.
Do CB, et al. PROBCONS: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–340. doi: 10.1101/gr.2821705.
Do CB, et al. CONTRAlign: discriminative training for protein sequence alignment. RECOMB. 2006a; pp. 160–174.
Do CB, et al. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 2006b;22:e90–e98. doi: 10.1093/bioinformatics/btl246.
Dowell RD, Eddy SR. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics. 2006;7:400. doi: 10.1186/1471-2105-7-400.
Eddy SR. Computational genomics of noncoding RNA genes. Cell. 2002;109:137–140. doi: 10.1016/s0092-8674(02)00727-4.
Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 1987;25:351–360. doi: 10.1007/BF02603120.
Gardner PP, et al. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005;33:2433–2439. doi: 10.1093/nar/gki541.
Gorodkin J, et al. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res. 1997;25:3724–3732. doi: 10.1093/nar/25.18.3724.
Gorodkin J, et al. Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res. 2001;29:2135–2144. doi: 10.1093/nar/29.10.2135.
Griffiths-Jones S, et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005;33:D121–D124. doi: 10.1093/nar/gki081.
Harmanci AO, et al. Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics. 2007;8. doi: 10.1186/1471-2105-8-130.
Hofacker IL, et al. Fast folding and comparison of RNA secondary structures (The Vienna RNA Package). Monatsh. Chem. 1994;125:167–188.
Hofacker IL, et al. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol. 2002;319:1059–1066. doi: 10.1016/S0022-2836(02)00308-X.
Hofacker IL, et al. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20:2222–2227. doi: 10.1093/bioinformatics/bth229.
Holmes I. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005;6. doi: 10.1186/1471-2105-6-73.
Kiryu H, et al. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics. 2007;23:1588–1598. doi: 10.1093/bioinformatics/btm146.
Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 2003;31:3423–3428. doi: 10.1093/nar/gkg614.
Lindgreen S, et al. MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics. 2007;23:3304–3311. doi: 10.1093/bioinformatics/btm525.
Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol. 2002;317:191–203. doi: 10.1006/jmbi.2001.5351.
Mathews DH, et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. doi: 10.1006/jmbi.1999.2700.
Matthews BW. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta. 1975;405:442–451. doi: 10.1016/0005-2795(75)90109-9.
McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990;29:1105–1119. doi: 10.1002/bip.360290621.
Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 1985;45:810–825.
Shalev-Shwartz S, Singer Y. Logarithmic regret algorithms for strongly convex repeated games. 2007.
Shalev-Shwartz S, et al. Pegasos: Primal estimated sub-gradient solver for SVM. ICML. 2007; pp. 807–814.
Sneath PH, Sokal RR. Numerical taxonomy. Nature. 1962;193:855–860. doi: 10.1038/193855a0.
Tabei Y, et al. SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics. 2006;22:1723–1729. doi: 10.1093/bioinformatics/btl177.
Taskar B, et al. Max-margin Markov networks. NIPS 16. 2003.
Thompson JD, et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27:2682–2690. doi: 10.1093/nar/27.13.2682.
Torarinsson E, et al. Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res. 2006;16:885–889. doi: 10.1101/gr.5226606.
Torarinsson E, et al. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23:926–932. doi: 10.1093/bioinformatics/btm049.
Touzet H, Perriquet O. CARNAC: folding families of related RNAs. Nucleic Acids Res. 2004;32(Web Server):W142–W145.
Wallace IM, et al. M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 2006;34:1692–1699. doi: 10.1093/nar/gkl091.
Wexler Y, et al. A study of accessible motifs and RNA folding complexity. J. Comput. Biol. 2007;14:856–872. doi: 10.1089/cmb.2007.R020.
Will S, et al. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol. 2007;3. doi: 10.1371/journal.pcbi.0030065.
Xu X, et al. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics. 2007;23:1883–1891. doi: 10.1093/bioinformatics/btm272.

