Skip to main content
Journal of Computational Biology logoLink to Journal of Computational Biology
. 2014 Aug 1;21(8):609–621. doi: 10.1089/cmb.2013.0044

Simultaneous Folding of Alternative RNA Structures with Mutual Constraints: An Application to Next-Generation Sequencing-based RNA Structure Probing

Cuncong Zhong 1, Shaojie Zhang 1,
PMCID: PMC4116156  PMID: 24689688

Abstract

Recent advances in next-generation sequencing technology have significantly promoted high-throughput experimental probing of RNA secondary structures. The resulting enzymatic or chemical probing information is then incorporated into a minimum free energy folding algorithm to predict more accurate RNA secondary structures. A drawback of this approach is that it does not consider the presence of alternative RNA structures. In addition, the alternative RNA structures may contaminate experimental probing information of each other and direct the minimum free-energy folding to a wrong direction. In this article, we present a combinatorial solution for this problem, where two alternative structures can be folded simultaneously given the experimental probing information regarding the mixture of these two alternative structures. We have tested our algorithm with artificially generated mixture probing data on adenine riboswitch and thiamine pyrophosphate (TPP) riboswitch. The experimental results show that our algorithm can successfully recover the ON and OFF structures of these riboswitches.

Key words: : high-throughput RNA structure probing, next-generation sequencing, RNA alternative structures, RNA folding, RNA SHAPE chemistry

1. Introduction

The structures of noncoding RNAs are critical in understanding the transcriptome, including structure–function relationship, stability of the RNA transcripts, and various regulations that may be applied (Eddy, 2001; Storz, 2002; Martin and Ephrussi, 2009). Recently, many enzymatic and chemical RNA structure-probing techniques have been coupled with next-generation sequencing, aiming at producing genome-wide RNA structure maps. In a high-throughput RNA structure-probing experiment, the RNA samples are treated with restriction enzymes or chemical reagents, which have preferential reactivity with helix or loop regions of the RNA transcripts. The resulting fragments of the reaction are pulled out and sequenced to recover RNA structural information (Wan et al., 2011). After sequencing, one can see discrepancies in the reads mapping profile between the paired and unpaired regions. The major idea of this technique is very similar to the traditional RNA probing techniques (Weeks, 2010), except that the resulting fragments are being sequenced using high-throughput next-generation sequencing rather than electrophoresis.

Kertesz et al. (2010) pioneered whole-genome RNA secondary structure probing and applied this novel technique to the yeast genome. Underwood et al. (2010) modified this technique with an alternative restriction enzyme and applied it to the mouse genome to discover novel ncRNAs. Besides restriction enzymes, chemicals have also been used as probing reagents. A technique named SHAPE uses chemical reagents such as NMIA or 1M7 to preferentially react with single-stranded RNA transcripts (Merino et al., 2005; Wilkinson et al., 2006; Mortimer and Weeks, 2007). Later, SHAPE technique was coupled with next-generation sequencing (SHAPE-seq) by Lucks et al. (2011) to improve its throughput and application. Although not yet applied to genome-wide analysis, SHAPE-seq has demonstrated its strong potential by accurately recovering the secondary structures of 16S and 23S ribosomal RNAs (Deigan et al., 2009).

The resulting output of these techniques is the potential for each site of the RNA transcript to react with the probing enzyme or reagent. The potential, also termed reactivity, is usually derived from the read mappings (normalized with a control experiment) of the chemical probing experiment (Aviran et al., 2011). Take the SHAPE experiment, for example, a site with reactivity ‘1’ indicates that it is highly reactive to the chemical reagent, suggesting a free (unpaired) configuration of the site. On the other hand, a site with reactivity ‘0’ indicates a restricted (paired) configuration of the site. These site-wise reactivities are usually transformed into pseudoenergies and incorporated into the existing minimum free energy (MFE) folding tools, such as RNAstructure (Reuter and Mathews, 2010), to predict the structure of the RNA transcript (Deigan et al., 2009; Low and Weeks, 2010). Recently, Washietl et al. (2012) developed an iterative approach to compute the optimal weight for the pseudoenergy that should be taken into account.

However, none of these approaches considers the presence of alternative RNA structures from the same RNA transcript (such as riboswitch elements). If alternative RNA structures are present, the reactivities for the mixture of RNA structures are generated from the experiment (Fig. 1). Such mixture reactivities may fail to capture the structural information from one, or even both, alternative RNA structures and can lead to misprediction while using current available approaches. Even though combining the experimental pseudoenergy and McCaskill's (1990) algorithm, a complete folding landscape of the RNA of interest can be generated, it is very difficult to obtain the two alternative structures of interest from an exponentially large search space (Wuchty et al., 1999; Li and Zhang, 2011). In this case, developing a new computational approach to partition the mixture of reads and explicitly predict the two alternative structures of interest is extremely important for extending experimental RNA structure-probing applications.

FIG. 1.

FIG. 1.

An overview of the RNA mutually constrained folding problem. A set of mixture of reactivities is generated as the enzymatic/chemical probing information of two alternative structures from the same RNA sequence. Assume the probing enzymes/chemical reagents will preferentially attack the unpaired (loop) regions (the blue arrows) of the RNA structures. High reactivities are expected to be observed at these regions, while low reactivities are expected for the paired regions. With the presence of both alternative structures (assuming the partition is 50%–50%), their individual reactivities are attenuated by each other. Finally, a mixture of reactivities of both alternative structures are observed from the experiment. The problem we aim to solve is how to infer both alternative structures from the mixture of reactivities.

We propose considering the observed reactivity as a mixture before performing the constrained folding algorithm. Ideally, if the correct assignment of the reactivities is given, we can perform the traditional constrained folding using the two sets of reactivities separately and reach satisfying results. However, unlike the reads assignment problems, there is no sequence discrepancy between these two RNA transcripts. Therefore, its very difficult to devise a statistical framework to infer the real partition. Thus, we will simultaneously fold these two alternative structures, such that (1) the sum of their free energies is minimized, and (2) the discrepancy between the expected and the observed reactivity profiles is minimized (Fig. 1).

We refer to this problem as the RNA mutually constrained folding problem, because each of the two alternative structures may exert constraint on, or be constrained by, the other structure. Therefore, to solve the RNA mutually constrained folding problem, we devise a combinatorial algorithm that finds the optimal solution by enumerating all possible constraining structures and constraining orders (which will be formally defined in the Methods section). We first present an algorithm using the Nussinov's energy model (Nussinov et al., 1978) (base-pair maximization). The algorithm can be run within O(l8) time and O(l5) space, where l is the length of the RNA sequence. To make the algorithm more applicable in real cases, we further present an improved algorithm that is guided by stacks (continuously nested base pairs), with the implementation of Turner's energy model (free-energy minimization). The improved algorithm can run within O(n5) time and O(n3l) space, where n is the number of stacks and n < l in most real scenarios (Bafna et al., 2006). In this case, we can significantly reduce the running time and space consumption, and the algorithm can be applied to predict alternative structures of real RNA sequences.

We implemented the improved algorithm into a program called MutualFold using GNU C++. Using artificially generated reactivities on adenine and thiamine pyrophosphate (TPP) riboswitches, we showed that our program, MutualFold, can successfully recover the major scaffold of the true alternative structures given a set of mixture reactivities. On the contrary, the traditional energy minimization approach, both with and without reactivity as an auxiliary information, failed to predict one or both of the alternative structures. In this case, we anticipate that the proposed MutualFold algorithm will significantly promote future RNA alternative structure prediction and related research.

2. Methods

In this section, we will introduce the RNA mutually constrained folding problem, which predicts two alternative RNA structures of the minimum sum of free energies with the consideration of step-wise folding constraint effect. We will first present the basic problem formulation in section 2.1, and then present a straightforward solution to the problem using Nussinov's energy model in section 2.2. We will further introduce a more realistic algorithm with sophisticated Turner's energy model and an improved time complexity in section 2.3. Finally, we will discuss how we handle reactivities and pseudoenergies in section 2.4.

2.1. Problem formulation

We begin the formulation of the RNA mutually constrained folding problem by introducing the inputs of the algorithm. The algorithm requires three inputs: (1) the RNA sequence S with length l, from which the alternative structures are to be predicted; (2) a set of observed mixture reactivities Inline graphic, where ri is the reactivity for the ith nucleotide; and (3) the expected partition of these two alternative structures ψ, where 0 ≤ ψ ≤ 1.

The outputs of the algorithm are two simultaneously predicted RNA structures TA and TB. We expect that the predicted structures are thermodynamically stable, or, the sum of the real free energies of the two structures, say E(TA) + E(TB) (a solid dot is used to indicate real energy), is minimized. At the same time, we also expect that the discrepancy between the expected and the observed reactivity profile, say |Rmixture(R(TA), R(TB))|, is minimized. Here, R(TA) refers to the expected reactivity profile of structure TA alone. The discrepancy of the reactivity profiles is usually quantified by the pseudoenergy (Deigan et al., 2009; Low and Weeks, 2010), and we adopt this measurement as well. For example, let TA(i) be the structural configuration of the ith nucleotide in TA, and assume TA(i) is paired. If the probing enzyme/chemical reagent is not reactive to the paired nucleotides, we should expect rather low reactivity observed at this nucleotide, that is, Inline graphic. The following formula (reformulated from Deigan et al., 2009 and Low and Weeks, 2010) has been proposed to compute the pseudoenergy E (a void dot is used to indicate pseudoenergy):

graphic file with name eq3.gif

where m is positive and b is a negative constant. If Inline graphic (i.e., no discrepancy), Inline graphic, and a favorable pseudoenergy is returned to indicate that the assumption of TA(i) being paired is correct. If Inline graphic, and a less favorable pseudoenergy is returned to question the assumption. We can simplify the notation of pseudoenergy as E(ri) since Inline graphic is known. We also assume the pseudoenergies to be additive as the real energies, that is, Inline graphic. In this case, the pseudo energy can be incorporated into the real free energy, that is, E(TA) = E(TA) + E(TA), and our algorithm will minimize E(TA) + E(TB) to simultaneously consider thermodynamic stability and reactivity discrepancy. Now, we can formally define the RNA mutually constrained folding problem as follows:

Input: an RNA sequence S, a set of mixture reactivities R, and an expected partition ψ.

Output: two RNA structures TA and TB of S, such that the sum of free energies (including both real and pseudo energies) E(TA) + E(TB) is minimized.

The key to the solution of this problem is based on the understanding of how TA and TB are mutually constrained, that is, how the folding of one structure may exert constraint on, or be constrained by, the other structure. Recall that ri is the observed reactivity at the ith nucleotide, and let it be a mixture (with a ratio ψ) of the reactivity from structure A (defined as Inline graphic) and structure B (defined as Inline graphic). In other words, Inline graphic. Also assume that we can represent the observed reactivity Inline graphic by its expectation Inline graphic, which is determined by the structural configuration TB(i), i.e., Inline graphic. Therefore, Inline graphic and finally, Inline graphic. For the sake of simplicity, we will write Inline graphic, as ψ and ri are not variables and are given as the inputs. In this case, the pseudoenergy used for folding TA is affected by the structure of TB and vise versa. In this case, TA and TB are mutually constrained.

With the understanding of mutual constraints, we can easily devise a brute force solution, which takes exponential time. We can enumerate all possible TB as constraints and fold TA correspondingly. The pair of TA and TB that results in the minimum sum of free energies can be taken as the optimal solution. However, this approach is computationally expensive, as there are an exponential number of possible structures that are to be enumerated as the structural constraint TB. To resolve this issue, we can break down the problem into smaller subproblems using dynamic programming formulation. Let Inline graphic be a substructure in TA (which begins with Si1 and ends with Sj1), and Inline graphic be a substructure in TB (which begins with Si2 and ends with Sj2). These two substructures may or may not overlap each other. If the optimal structures (with the minimum sum of energies) adopted by both subregions, say Ti1,j1;i2,j2, are known, we can use them as structural constraints to fold nearby RNA sequences. We can reach the final solution using this approach by extending i1, j1 to 0 and i2, j2 to l−1.

To guarantee optimality, we need to consider all possible constraining orders when computing Ti1,j1;i2,j2. For example, let “−” indicate a void region that exerts no structural constraint, T−,−;i2,k2 → Ti1,j1;i2,k2 → Ti1,j1;i2,j2 (where i2 ≤ k2 ≤ j2) represents the following constraining order: (1) Inline graphic is folded without constraint, (2) Inline graphic is folded by applying Inline graphic as a structural constraint, and (3) Inline graphic is folded by applying Inline graphic as a structural constraint. Other constraining orders are possible as well. For example, T−,−;i2,k2 → Ti1,k1;i2,k2 → Ti1,k1;i2,j2 → Ti1,j1;i2,j2 (where i1 ≤ k1 ≤ j1) can act as an alternative constraining order to be traversed during the computation of Ti1,j1;i2,j2. In summary, all subregions (Ti1,j1;i2,j2) and all constraining orders need to be taken into account in the algorithm. We present a combinatorial solution using the Nussinov's energy model (Nussinov et al., 1978) in the following section.

2.2. An algorithm with Nussinov's energy model

In this section, we will introduce a solution for the RNA mutually constrained folding problem using Nussinov's energy model (Nussinov et al., 1978) to facilitate the understanding of the major idea. The object function of Nussinov's RNA folding formulation is to maximize the number of base pairs, instead of minimizing the free energy, of the predicted RNA structure. Therefore, denote F(i1,j1;i2,j2) as the maximum number of base pairs within the subregions Inline graphic and Inline graphic, where Inline graphic, and Inline graphic is the structure for the subregion Inline graphic of Ti1,j2;i2,j2. Note that F also contains both real and pseudo base pairs, that is, F = F + F. Also denote Inline graphic as the maximal number of base pairs within the subregions Inline graphic given the structure Inline graphic as a structural constraint. For the sake of clarity, we underline the terms that correspond to the terminal cases, whose values can be directly computed or looked up. We can compute F(i1,j1;i2,j2) using the following recursive function:

graphic file with name eq33.gif

The first case described in Equation (2) corresponds to a boundary case where no base pair is formed. The second and third cases correspond to paired cases, where the outmost nucleotides (i1 and j1, or i2 and j2, respectively) form a base pair. In this case, “1” is added to indicate the base pair that has formed, and how well the observed reactivity supports the pair is evaluated by pseudo base pairs (Fs). The last four cases try all possible branching points with different constraining orders. Take the fourth case as an example; the last added structural component (i.e., Inline graphic) will be predicted using the existing optimal substructure (i.e., Inline graphic) as a constraint by the traditional Nussinov's folding algorithm (Nussinov et al., 1978) with soft constraints.

The optimal structural configuration for a region with a structural constraint, for example, Inline graphic [similar for Inline graphic], can be computed as follows:

graphic file with name eq38.gif

A direct implementation of this algorithm leads to an O(l8) time complexity and O(l5) space complexity. Indeed, to complete the algorithm, we need to fill up a four-dimensional dynamic programming table F, which requires O(l4) time. For each entry in F, O(l) time is used for traversing all branching k1 and k2, and O(l3) is used to compute the constrained folding FA and FB. Therefore, the overall time complexity would be O(l8). The space complexity is O(l5). Note that the algorithm needs to maintain the four-dimensional table F, in addition, and O(l) space is also required for each entry of F to record the corresponding optimal structure that would be used as a structural constraint in the future folding steps. Hence the overall space complexity is O(l5).

2.3. An improved algorithm with Turner's energy model

The time and space complexity of the previous algorithm are prohibitively high and are not feasible for most real RNA sequences. Therefore, we need to devise a more efficient algorithm. At the same time, we need to consider the more realistic Turner's energy model (Turner et al., 1988). Inspired by the idea of RNAscf (Bafna et al., 2006), we observe that the major scaffolds of RNA secondary structures can be represented by stacks. A stack, built from a number of continuously nested base pairs, form the regular A-form helix of the RNA structure that stabilizes the structure. Note that we only consider the significant stacks, that is, those with more than four base pairs and eight hydrogen bonds, as the number of these significant stacks is usually small and less than the length of the RNA sequence (Bafna et al., 2006). Therefore, at each folding step we will add a stack or a structural component enclosed by a stack. Thus we can achieve significant speedup compared to the previous algorithm with base-pair resolution.

We begin the exposition of the algorithm by introducing basic definitions of stacks and their relationships. An RNA structure can be represented by a set of significant stacks; denote the set as Inline graphic. A stack p can be uniquely determined by three indices: the leftmost endpoint l(p), the rightmost endpoint r(p), and the width of the stack w(p). The nucleotides at l(p) and r(p) form the outmost (smallest 5′ and largest 3′ indices) base pair of p, while l(p) + w(p) − 1 and r(p) − w(p) + 1 form the innermost base pair of p. To simplify the notations, we also say that lI(p) and rI(p) form the innermost base pair of p. The stacks can be partially ordered by increasing rightmost endpoints and decreasing leftmost endpoints. With such partial ordering, we can denote the ith stack in Inline graphic as pi.

Let pi and pj be two stacks in Inline graphic and assume that i < j. If pi is enclosed by pj, that is, lI(pj) < l(pi) and rI(pj) > r(pi), denote their relationship as pi < I pj. If pi is juxtaposed to pj, that is, r(pi) < l(pj), denote their relationship as pi < J pj. If there is no stack pk such that pi < J pk and pk < J pj, we say that pi is directly before pj. Note that there may exist more than one stack that are directly before pj, therefore denote the stacks that are directly before pj as a set Inline graphic. The size of the set Inline graphic is expected to be a constant when only the significant stacks are considered (Bafna et al., 2006).

Since Turner's energy model also considers the free energies of loops (unpaired regions) formed between stacks, we define the loop regions as follows. Denote the hairpin loop formed by stack pi as L(pi), which refers to the region Inline graphic. Denote the internal/bulge loops formed between stacks pi and pj as Ll(pi, pj) and Lr(pi, pj) if pi < I pj, which refer to the two regions Inline graphic and Inline graphic, respectively. If not specified, L(pi, pj) is used to represent both loops. Denote the multibranch loop formed between stacks pi and pj as L(pi, pj) if pi < J pj, which refers to the region Inline graphic. Finally, we can also represent a loop region by explicitly giving the sequence region, for example, Inline graphic is a loop starting from the ith nucleotide and ending with the jth nucleotide.

Let the minimum free energy of the regions enclosed by stacks pi and pj (including these two stacks) be E(pi; pj), and E = E⋅ + E. If we artificially add a stack p*, where l(p*) = 0, r(p*) = l − 1 and w(p*) = 0, we can retrieve the global optimal solution from E(p*;p*). For clarity, we explicitly write E(pi;pj) as Inline graphic to indicate that pi is presented in the structure A and pj is presented in the structure B. Also, denote Inline graphic as the minimum free energy for the most recent hairpin loop folding event, Inline graphic for the most recent internal/bulge loop folding event, and Inline graphic for the most recent multibranch loop folding event. Therefore:

graphic file with name eq53.gif

where the first case “0” is a boundary case where no structure is formed.

To compute the hairpin loop energy Inline graphic, denote es(p) as the free energy of a stack p, and Inline graphic as the free energy for the hairpin loop Inline graphic (recall that the underlined terms indicate the terminal cases that can be directly computed or looked up). Denote Inline graphic as the minimum free energy for the stack Inline graphic and the region enclosed by it when folded without mutual constraint. The matrix Euc can be precomputed by using the traditional minimum free energy folding algorithms (Zuker and Sankoff, 1984; Hofacker et al., 1994; Reuter and Mathews, 2010), while the reactivities are used as soft constraints [extending the recursive function for computing Inline graphic with Turner's energy model]. Note that we only need to precompute the matrix once, and all required unconstrained folding results can be retrieved. Let the structure that corresponds to Inline graphic be Inline graphic. For pseudoenergies, denote Inline graphic as the pseudoenergy of adopting Inline graphic as a stack into the structure given the constraint TB, and Inline graphic as the pseudoenergy of adopting the loop region given the constraint. (We do not consider loop pseudoenergy if both structures are unpaired at this region.) The recursive function for computing Inline graphic considers two cases, where (1) Inline graphic, or (2) Inline graphic is recently added as a hairpin loop:

graphic file with name eq68.gif

To consider the internal/bulge loop case, denote Inline graphic as the free energy for the internal/bulge loop formed by Inline graphic and Inline graphic, if Inline graphic. The recursive function for computing Inline graphic considers two cases, where (1) Inline graphic or (2) Inline graphic is recently added as an internal/bulge loop:

graphic file with name eq76.gif

To compute the multibranch loop case, we have to introduce a new three-dimensional matrix Em1. Em1 stores the minimum free energy formed between an opened multibranch loop and a closed loop. The opened multibranch loop can be viewed as a chain, which is formally defined as a set of juxtaposing stacks and their enclosed structural components. Therefore, the entry Inline graphic is the optimal structural configuration formed between the chain that is ended with Inline graphic and enclosed by Inline graphic (Inline graphic itself is NOT included in the chain), and the structural component that is enclosed by Inline graphic (where Inline graphic itself is included). Let ema be the multibranch loop closing penalty, emb be the unpaired region extension penalty (applied on the length of the loop L, |L|), and emc be the bonus free energy for adding a new branch. The recursive function for computing Inline graphic considers two cases, where (1) Inline graphic, or (2) Inline graphic is recently added as a multibranch loop:

graphic file with name eq86.gif

To compute Inline graphic, we introduce another matrix Em2 that corresponds to the minimum free energy configuration formed between two chains. The two chains that correspond to the entry Inline graphic are the ones that ended with Inline graphic (enclosed by Inline graphic) and Inline graphic (enclosed by Inline graphic), respectively. Let the term ‘Inline graphic’ refer to the real free energy of the structural component formed in A as recorded in Inline graphic. For boundary cases, denote Inline graphic as the unconstrained free energy of the chain that is enclosed at Inline graphic and ended at Inline graphic. Note that the Inline graphic matrix is auxiliary to Euc matrix (Zuker and Sankoff, 1984; Hofacker et al., 1994), which can also be precomputed for a constant time look-up. Let the corresponding structure of Inline graphic be Inline graphic. The recursive function for computing Inline graphic considers five cases: where the closed loop Inline graphic is recently added as (1) a hairpin loop, (2) an internal/bulge loop, or (3) a multi-branch loop, respectively; or the last component Inline graphic in the chain is recently added as (4) an extension, or (5) the beginning of the chain.

graphic file with name eq104.gif

In the last two cases, note that we do not fold the structure enclosed by Inline graphic from scratch as we did in the naive algorithm. Instead, we assume that its structure is majorally determined by only one structural constraint. Let the structural constraint be enclosed by Inline graphic. By search all Inline graphic, we will identify this structural constraint from Inline graphic. Once we have identified Inline graphic, which encloses the structural constraint, we can retrieve the corresponding structure that is enclosed by Inline graphic from Inline graphic. Note that we omitted the cases where Inline graphic adopts no structure, which can be computed easily by adjusting the length to be applied on emb and discard emc.

Note that the recursive function for computing Inline graphic can be easily derived based on the symmetricity. Therefore, we omit the exposition of this part. Finally, the recursive function for computing Inline graphic considers four cases, where (1) Inline graphic from the chain is recently added as an extension of the existing chain, (2) Inline graphic is added as an extension, (3) Inline graphic is added as the beginning of the chain, or (4) Inline graphic is added as the beginning of the chain:

graphic file with name eq119.gif

Note that we also omitted the cases where Inline graphic or Inline graphic adopts no structure, which can be computed by adjusting the loop size for emb and discarding emc.

The time complexity of the improved algorithm is O(n5), where n is the number of significant stacks predicted from the input RNA sequence. It is shown that n < l (Bafna et al., 2006), and thus the improved algorithm is feasible for most real RNAs. The algorithm will fill up the two-dimensional matrix E. To compute an entry in E, say Inline graphic, one needs to compute three matrices: Inline graphic, Inline graphic, and Inline graphic. Since Inline graphic and Inline graphic are determined, the variables become Inline graphic and Inline graphic. Therefore, we can use O(n) time to compute the Em1 matrix, and O(n2) time to compute the Em2 matrix. Since we need to traverse a number of constraining structural components for computing each entry of Em1 and Em2, the time complexities add up to O(n2) and O(n3), respectively. Hence, the overall time complexity for this algorithm is O(n5).

The space complexity of the improved algorithm is O(n3l). Consider the fact that the matrix Em2 is only referred to by the computation of Em1 with the same enclosing base pairs (Inline graphic and Inline graphic), it can be discarded immediately once the corresponding entries in Em1 are computed. Therefore, we only need to store E and Em1. For each entry in E and Em1, O(l) space is used to record the optimal structures. As a result, the overall space complexity for this algorithm is O(n3l) (note that Em1 is a three-dimensional matrix).

2.4. Inferring reactivities and pseudoenergies

In this section, we mainly discuss how we infer the reactivities and compute the corresponding pseudoenergies. Note that the reactivity is computed as a scaled reads mapping difference between the treated sample and the control sample (Deigan et al., 2009; Low and Weeks, 2010). In this case, when we assume that the number of reads for the control sample is very small (which can be expected from high-quality experiments), we can derive the following naive model to partition the mixture of reactivities. Given the partition for the first transcript ψ, and the expected reactivities for TA and TB at the ith nucleotide alone, that is, Inline graphic and Inline graphic, we approximately model the observed reactivity ri as the weighted (ψ) sum of Inline graphic and Inline graphic. The expected reactivity may vary in experiments, where different enzymes/chemical reagents are used. In this article, we assume Inline graphic and Inline graphic. In cases when one of the structures is not determined, say TB(i) = unknown, we makeInline graphic. That is, an optimistic pseudoenergy is applied no matter what structural configuration TA(i) may adopt. In other words, TA(i) has the “right of free folding” and will constrain the folding of TB in the future. After inferring the reactivity, we can then compute pseudoenergy using the traditional way as described in Equation (1). Note that we only compute the pseudoenergy E(ri) when at least one of TA(i) and TB(i) is paired (Deigan et al., 2009; Low and Weeks, 2010). The parameters m and b are used as suggested in the references, where m = 2.6 kcal/mol and b = −0.8 kcal/mol.

Note that we only present a naive way of handling the reactivities and pseudoenergies. More sophisticated algorithms are encouraged if the characteristics of the probing enzymes/chemical reagents are well understood (Vasa et al., 2008; Aviran et al., 2011). In addition, if the raw reads are available, we can better model their mutual constraints and devise a more accurate estimation of the reactivities. Nevertheless, our focus of this work is to devise a new algorithmic framework for folding RNA structures with mutual constraints, and the reactivity and pseudoenergy handling components are independent of the major algorithmic framework. Different handling techniques are expected to be derived for specific applications.

3. Results

We have implemented the improved algorithm into a program called MutualFold using GNU C++. We searched for real experimental data on RNA alternative structures that cannot be correctly predicted simultaneously. Unfortunately, we cannot find such experimental data, because this technology is only developed recently. Therefore, we generate two artificial examples to demonstrate that MutualFold can correctly predict the alternative structures through partitioning the mixture of reactivities. We artificially assigned the mixture reactivities based on known alternative structures of adenine (Lemay et al., 2011) and TPP (Mironov et al., 2002; Rentmeister et al., 2007) riboswitches to their corresponding sequences, respectively. Following the convention of the SHAPE technology, we assumed that the expected reactivities for the unpaired regions are 1 and for paired regions are 0. Also, we introduced 20% error rate into the expected reactivities to simulate experimental errors. We expected that such high error rate is sufficient to cover most of the real experimental errors, and we also expected to show that the MutualFold algorithm is robust with such errors.

We first generated a set of mixture reactivities from a known adenine riboswitch. We assume that 70% of the transcripts adopt the “ON” structure, and 30% of them adopt the “OFF” structure. Given the mixture reactivities (and the correct partition ψ, see the Discussion section for cases in which the correct ψ is not available), we applied MutualFold to predict the two alternative structures and compared the results with the true alternative structures. We also used RNAstructure to predict the minimum free energy structure of the RNA, both with and without the reactivities as auxiliary information. We summarize the experimental results for adenine riboswitch in Figure 2.

FIG. 2.

FIG. 2.

Alternative structures of adenine riboswitch and structures predicted by MutualFold and RNAstructure. (a and b) ON and OFF structures of adenine riboswitch, respectively. (c and d) Two alternative structures of adenine riboswitch predicted by MutualFold. (e) The minimum free energy (MFE) folding result of adenine riboswitch by RNAstructure. (f) The MFE folding result of adenine riboswitch with artificially generated reactivities by RNAstructure.

The real alternative structures of adenine riboswitch are shown in Figure 2 a and b. The two structures that are simultaneously predicted by MutualFold are shown in Figure 2 c and d. We can see that the two alternative structures predicted by MutualFold are exactly the same as the true structures. Therefore, the proposed algorithm is very powerful in recovering alternative structures when the mixture reactivities are given. In addition, MutualFold is also very robust in handling experimental errors, as both correct structures can be perfectly predicted even when 20% error rate is assumed. On the other hand, we found that the minimum free energy structure, both with and without (Fig. 2e and f, respectively) reactivity, cannot perfectly predict the real structures. Therefore, the alternative structures cannot be predicted separately, and the algorithm that can simultaneously predict both structures is necessary.

For a more challenging test, we artificially generated the mixture reactivities for a TPP riboswitch, while assuming the transcript partition is 50%–50%. This test is challenging because (1) a large fraction of the TPP riboswitch adopts the same structure, thus the mixture reactivities have less distinguishing power to recover both alternative structures; and (2) there exist many insignificant stacks presented in the TPP riboswitch alternative structures that will be considered by MutualFold. We presented the test results of TPP riboswitch in Figure 3.

FIG. 3.

FIG. 3.

Alternative structures of thiamine pyrophosphate (TPP) riboswitch and structures predicted by MutualFold and RNAstructure. (a and b) ON and OFF structures of TPP riboswitch, respectively. (c and d) Two alternative structures of TPP riboswitch predicted by MutualFold. (e) The MFE folding result of TPP riboswitch by RNAstructure. (f) The MFE folding result of TPP riboswitch with artificially generated reactivities by RNAstructure.

The real alternative structures of TPP riboswitch are shown in Figure 3a and b. The two structures that are simultaneously predicted by MutualFold are shown in Figure 3c and d, and the minimum free energy structures, with and without input reactivities, are shown in Figure 3e and f, respectively. Using the reactivities as soft constraint, RNAstructure can only predict the “OFF” structure with high accuracy. On the other hand, MutualFold is able to recover the major scaffold of both “ON” and “OFF” structures, although several insignificant stacks are missed. We argue that MutualFold only considers significant stacks for computational efficiency, and the insignificant stacks can be easily taken back when more powerful computational resource is available. Nevertheless, even with the missed insignificant stacks, MutualFold is still capable of recovering the major scaffold of both “ON” and “OFF” structures (Fig. 3c and d).

4. Discussion

The algorithm presented in this work assumes that the real partition of the alternative structures, ψ, is known. In cases when such information is unknown, we can devise an EM (Expectation Maximization) algorithm to computationally estimate the partition ψ. We start with arbitrarily assigning a value between 0 and 1 to ψ as the a priori estimation, and compute the alternative structures using the MutualFold algorithm as the E-step. In the M-step, we update the partition estimation using the reactivity profiles at regions where the two predicted structures TA and TB adopt different structural configurations. This EM algorithm will terminate when the predictions of TA and TB become invariant.

In cases when the partition is difficult to estimate, we claim that the proposed combinatorial algorithm is not sensitive to the estimation of ψ. The phase transition property of dynamic programming indicates that the results are invariant when the parameters vary only within a certain range. That is, small deviation of the partition estimation will not change the predicted alternative structures significantly. We have tested the adenine example with partition estimation from 50% to 80% (note that the real partition is 70%), and MutualFold can still predict the correct alternative structures. In addition, because of the symmetricity, the partition estimation from 20% to 50% will also generate the correct prediction. In this case, the algorithm can accept a wide range (20% to 80% in this case) of partition estimations without making errors in the prediction.

In summary, we presented a combinatorial algorithm to simultaneously fold two alternative structures from a mixture of their experimental structure-probing results. The algorithm has a time complexity of O(n5) and a space complexity of O(n3l), where n is the number of significant stacks, l is the length of the RNA sequence, and n < l. We implemented the algorithm into a program called MutualFold and have shown that MutualFold is capable of simultaneously predicting both alternative structures with the artificially generated mixture reactivities. The algorithmic framework can be applied to different RNA structure-probing techniques, and only the reactivity and pseudoenergy handling component need to be revised. Therefore, we anticipate that the proposed algorithm will significantly promote future RNA structure-probing studies and related research.

Acknowledgment

This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM102515).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

  1. Aviran S., Trapnell C., Lucks J.B., et al. . 2011. Modeling and automation of sequencing-based characterization of RNA structure. Proc. Natl. Acad. Sci. U.S.A. 108, 11069–11074 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bafna V., Tang H., and Zhang S.2006. Consensus folding of unaligned RNA sequences revisited. J. Comput. Biol. 13, 283–295 [DOI] [PubMed] [Google Scholar]
  3. Deigan K.E., Li T.W., Mathews D.H., and Weeks K.M.2009. Accurate SHAPE-directed RNA structure determination. Proc. Natl. Acad. Sci. U.S.A. 106, 97–102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Eddy S.2001. Non-coding RNA genes and the modern RNA world. Nature Reviews in Genetics 2, 919–929 [DOI] [PubMed] [Google Scholar]
  5. Hofacker I.L., Fontana W., Stadler P.F., et al. . 1994. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188 [Google Scholar]
  6. Kertesz M., Wan Y., Mazor E., et al. . 2010. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Lemay J.F., Desnoyers G., Blouin S., et al. . 2011. Comparative study between transcriptionally- and translationally-acting adenine riboswitches reveals key differences in riboswitch regulatory mechanisms. PLoS Genet. 7, e1001278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Li Y., and Zhang S.2011. Finding stable local optimal RNA secondary structures. Bioinformatics 27, 2994–3001 [DOI] [PubMed] [Google Scholar]
  9. Low J.T., and Weeks K.M.2010. SHAPE-directed RNA secondary structure prediction. Methods 52, 150–158 [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Lucks J.B., Mortimer S.A., Trapnell C., et al. . 2011. Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc. Natl. Acad. Sci. U.S.A. 108, 11063–11068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Martin K.C., and Ephrussi A.2009. mRNA localization: gene expression in the spatial dimension. Cell 136, 719–730 [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. McCaskill J.S.1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105–1119 [DOI] [PubMed] [Google Scholar]
  13. Merino E.J., Wilkinson K.A., Coughlan J.L., and Weeks K.M.2005. RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE). J. Am. Chem. Soc. 127, 4223–4231 [DOI] [PubMed] [Google Scholar]
  14. Mironov A.S., Gusarov I., Rafikov R., et al. . 2002. Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria. Cell 111, 747–756 [DOI] [PubMed] [Google Scholar]
  15. Mortimer S.A., and Weeks K.M.2007. A fast-acting reagent for accurate analysis of RNA secondary and tertiary structure by SHAPE chemistry. J. Am. Chem. Soc. 129, 4144–4145 [DOI] [PubMed] [Google Scholar]
  16. Nussinov R., Pieczenik G., Griggs J., and Kleitman D.1978. Algorithms for loop matchings. SIAM J. Appl. Math. 35, 68–82 [Google Scholar]
  17. Rentmeister A., Mayer G., Kuhn N., and Famulok M.2007. Conformational changes in the expression domain of the Escherichia coli thiM riboswitch. Nucleic Acids Res. 35, 3713–3722 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Reuter J.S., and Mathews D.H.2010. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Storz G.2002. An expanding universe of noncoding RNAs. Science 296, 1260–1263 [DOI] [PubMed] [Google Scholar]
  20. Turner D.H., Sugimoto N., and Freier S.M.1988. RNA structure prediction. Annu. Rev. Biophys. Biophys. Chem. 17, 167–192 [DOI] [PubMed] [Google Scholar]
  21. Underwood J.G., Uzilov A.V., Katzman S., et al. . 2010. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat. Methods 7, 995–1001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Vasa S.M., Guex N., Wilkinson K.A., et al. . 2008. ShapeFinder: a software system for high-throughput quantitative analysis of nucleic acid reactivity information resolved by capillary electrophoresis. RNA 14, 1979–1990 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Wan Y., Kertesz M., Spitale R.C., et al. . 2011. Understanding the transcriptome through RNA structure. Nat. Rev. Genet. 12, 641–655 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Washietl S., Hofacker I.L., Stadler P.F., and Kellis M.2012. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction. Nucleic Acids Res. 40, 4261–4272 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Weeks K.M.2010. Advances in RNA structure analysis by chemical probing. Curr. Opin. Struct. Biol. 20, 295–304 [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Wilkinson K.A., Merino E.J., and Weeks K.M.2006. Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat. Protoc. 1, 1610–1616 [DOI] [PubMed] [Google Scholar]
  27. Wuchty S., Fontana W., Hofacker I.L., and Schuster P.1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49, 145–165 [DOI] [PubMed] [Google Scholar]
  28. Zuker M., and Sankoff D.1984. RNA secondary structures and their prediction. Bull. Math. Biol. 46, 591–621 [Google Scholar]

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.

RESOURCES