Simultaneous Folding of Alternative RNA Structures with Mutual Constraints: An Application to Next-Generation Sequencing-based RNA Structure Probing

Cuncong Zhong; Shaojie Zhang

doi:10.1089/cmb.2013.0044

. 2014 Aug 1;21(8):609–621. doi: 10.1089/cmb.2013.0044

Simultaneous Folding of Alternative RNA Structures with Mutual Constraints: An Application to Next-Generation Sequencing-based RNA Structure Probing

Cuncong Zhong ¹, Shaojie Zhang ^1,^✉

PMCID: PMC4116156 PMID: 24689688

Abstract

Recent advances in next-generation sequencing technology have significantly promoted high-throughput experimental probing of RNA secondary structures. The resulting enzymatic or chemical probing information is then incorporated into a minimum free energy folding algorithm to predict more accurate RNA secondary structures. A drawback of this approach is that it does not consider the presence of alternative RNA structures. In addition, the alternative RNA structures may contaminate experimental probing information of each other and direct the minimum free-energy folding to a wrong direction. In this article, we present a combinatorial solution for this problem, where two alternative structures can be folded simultaneously given the experimental probing information regarding the mixture of these two alternative structures. We have tested our algorithm with artificially generated mixture probing data on adenine riboswitch and thiamine pyrophosphate (TPP) riboswitch. The experimental results show that our algorithm can successfully recover the ON and OFF structures of these riboswitches.

Key words: : high-throughput RNA structure probing, next-generation sequencing, RNA alternative structures, RNA folding, RNA SHAPE chemistry

1. Introduction

The structures of noncoding RNAs are critical in understanding the transcriptome, including structure–function relationship, stability of the RNA transcripts, and various regulations that may be applied (Eddy, 2001; Storz, 2002; Martin and Ephrussi, 2009). Recently, many enzymatic and chemical RNA structure-probing techniques have been coupled with next-generation sequencing, aiming at producing genome-wide RNA structure maps. In a high-throughput RNA structure-probing experiment, the RNA samples are treated with restriction enzymes or chemical reagents, which have preferential reactivity with helix or loop regions of the RNA transcripts. The resulting fragments of the reaction are pulled out and sequenced to recover RNA structural information (Wan et al., 2011). After sequencing, one can see discrepancies in the reads mapping profile between the paired and unpaired regions. The major idea of this technique is very similar to the traditional RNA probing techniques (Weeks, 2010), except that the resulting fragments are being sequenced using high-throughput next-generation sequencing rather than electrophoresis.

Kertesz et al. (2010) pioneered whole-genome RNA secondary structure probing and applied this novel technique to the yeast genome. Underwood et al. (2010) modified this technique with an alternative restriction enzyme and applied it to the mouse genome to discover novel ncRNAs. Besides restriction enzymes, chemicals have also been used as probing reagents. A technique named SHAPE uses chemical reagents such as NMIA or 1M7 to preferentially react with single-stranded RNA transcripts (Merino et al., 2005; Wilkinson et al., 2006; Mortimer and Weeks, 2007). Later, SHAPE technique was coupled with next-generation sequencing (SHAPE-seq) by Lucks et al. (2011) to improve its throughput and application. Although not yet applied to genome-wide analysis, SHAPE-seq has demonstrated its strong potential by accurately recovering the secondary structures of 16S and 23S ribosomal RNAs (Deigan et al., 2009).

The resulting output of these techniques is the potential for each site of the RNA transcript to react with the probing enzyme or reagent. The potential, also termed reactivity, is usually derived from the read mappings (normalized with a control experiment) of the chemical probing experiment (Aviran et al., 2011). Take the SHAPE experiment, for example, a site with reactivity ‘1’ indicates that it is highly reactive to the chemical reagent, suggesting a free (unpaired) configuration of the site. On the other hand, a site with reactivity ‘0’ indicates a restricted (paired) configuration of the site. These site-wise reactivities are usually transformed into pseudoenergies and incorporated into the existing minimum free energy (MFE) folding tools, such as RNAstructure (Reuter and Mathews, 2010), to predict the structure of the RNA transcript (Deigan et al., 2009; Low and Weeks, 2010). Recently, Washietl et al. (2012) developed an iterative approach to compute the optimal weight for the pseudoenergy that should be taken into account.

However, none of these approaches considers the presence of alternative RNA structures from the same RNA transcript (such as riboswitch elements). If alternative RNA structures are present, the reactivities for the mixture of RNA structures are generated from the experiment (Fig. 1). Such mixture reactivities may fail to capture the structural information from one, or even both, alternative RNA structures and can lead to misprediction while using current available approaches. Even though combining the experimental pseudoenergy and McCaskill's (1990) algorithm, a complete folding landscape of the RNA of interest can be generated, it is very difficult to obtain the two alternative structures of interest from an exponentially large search space (Wuchty et al., 1999; Li and Zhang, 2011). In this case, developing a new computational approach to partition the mixture of reads and explicitly predict the two alternative structures of interest is extremely important for extending experimental RNA structure-probing applications.

FIG. 1. — An overview of the RNA mutually constrained folding problem. A set of mixture of reactivities is generated as the enzymatic/chemical probing information of two alternative structures from the same RNA sequence. Assume the probing enzymes/chemical reagents will preferentially attack the unpaired (loop) regions (the blue arrows) of the RNA structures. High reactivities are expected to be observed at these regions, while low reactivities are expected for the paired regions. With the presence of both alternative structures (assuming the partition is 50%–50%), their individual reactivities are attenuated by each other. Finally, a mixture of reactivities of both alternative structures are observed from the experiment. The problem we aim to solve is how to infer both alternative structures from the mixture of reactivities.

We propose considering the observed reactivity as a mixture before performing the constrained folding algorithm. Ideally, if the correct assignment of the reactivities is given, we can perform the traditional constrained folding using the two sets of reactivities separately and reach satisfying results. However, unlike the reads assignment problems, there is no sequence discrepancy between these two RNA transcripts. Therefore, its very difficult to devise a statistical framework to infer the real partition. Thus, we will simultaneously fold these two alternative structures, such that (1) the sum of their free energies is minimized, and (2) the discrepancy between the expected and the observed reactivity profiles is minimized (Fig. 1).

We refer to this problem as the RNA mutually constrained folding problem, because each of the two alternative structures may exert constraint on, or be constrained by, the other structure. Therefore, to solve the RNA mutually constrained folding problem, we devise a combinatorial algorithm that finds the optimal solution by enumerating all possible constraining structures and constraining orders (which will be formally defined in the Methods section). We first present an algorithm using the Nussinov's energy model (Nussinov et al., 1978) (base-pair maximization). The algorithm can be run within O(l⁸) time and O(l⁵) space, where l is the length of the RNA sequence. To make the algorithm more applicable in real cases, we further present an improved algorithm that is guided by stacks (continuously nested base pairs), with the implementation of Turner's energy model (free-energy minimization). The improved algorithm can run within O(n⁵) time and O(n³l) space, where n is the number of stacks and n < l in most real scenarios (Bafna et al., 2006). In this case, we can significantly reduce the running time and space consumption, and the algorithm can be applied to predict alternative structures of real RNA sequences.

We implemented the improved algorithm into a program called MutualFold using GNU C++. Using artificially generated reactivities on adenine and thiamine pyrophosphate (TPP) riboswitches, we showed that our program, MutualFold, can successfully recover the major scaffold of the true alternative structures given a set of mixture reactivities. On the contrary, the traditional energy minimization approach, both with and without reactivity as an auxiliary information, failed to predict one or both of the alternative structures. In this case, we anticipate that the proposed MutualFold algorithm will significantly promote future RNA alternative structure prediction and related research.

2. Methods

In this section, we will introduce the RNA mutually constrained folding problem, which predicts two alternative RNA structures of the minimum sum of free energies with the consideration of step-wise folding constraint effect. We will first present the basic problem formulation in section 2.1, and then present a straightforward solution to the problem using Nussinov's energy model in section 2.2. We will further introduce a more realistic algorithm with sophisticated Turner's energy model and an improved time complexity in section 2.3. Finally, we will discuss how we handle reactivities and pseudoenergies in section 2.4.

2.1. Problem formulation

We begin the formulation of the RNA mutually constrained folding problem by introducing the inputs of the algorithm. The algorithm requires three inputs: (1) the RNA sequence S with length l, from which the alternative structures are to be predicted; (2) a set of observed mixture reactivities Inline graphic , where r_i is the reactivity for the ith nucleotide; and (3) the expected partition of these two alternative structures ψ, where 0 ≤ ψ ≤ 1.

The outputs of the algorithm are two simultaneously predicted RNA structures T^A and T^B. We expect that the predicted structures are thermodynamically stable, or, the sum of the real free energies of the two structures, say E^⋅(T^A) + E^⋅(T^B) (a solid dot is used to indicate real energy), is minimized. At the same time, we also expect that the discrepancy between the expected and the observed reactivity profile, say |R−mixture(R(T^A), R(T^B))|, is minimized. Here, R(T^A) refers to the expected reactivity profile of structure T^A alone. The discrepancy of the reactivity profiles is usually quantified by the pseudoenergy (Deigan et al., 2009; Low and Weeks, 2010), and we adopt this measurement as well. For example, let T^A(i) be the structural configuration of the ith nucleotide in T^A, and assume T^A(i) is paired. If the probing enzyme/chemical reagent is not reactive to the paired nucleotides, we should expect rather low reactivity observed at this nucleotide, that is, Inline graphic . The following formula (reformulated from Deigan et al., 2009 and Low and Weeks, 2010) has been proposed to compute the pseudoenergy E^∘ (a void dot is used to indicate pseudoenergy):

where m is positive and b is a negative constant. If Inline graphic (i.e., no discrepancy), , and a favorable pseudoenergy is returned to indicate that the assumption of T^A(i) being paired is correct. If , and a less favorable pseudoenergy is returned to question the assumption. We can simplify the notation of pseudoenergy as E^∘(r_i) since is known. We also assume the pseudoenergies to be additive as the real energies, that is, Inline graphic . In this case, the pseudo energy can be incorporated into the real free energy, that is, E(T^A) = E^⋅(T^A) + E^∘(T^A), and our algorithm will minimize E(T^A) + E(T^B) to simultaneously consider thermodynamic stability and reactivity discrepancy. Now, we can formally define the RNA mutually constrained folding problem as follows:

Input: an RNA sequence S, a set of mixture reactivities R, and an expected partition ψ.

Output: two RNA structures T^A and T^B of S, such that the sum of free energies (including both real and pseudo energies) E(T^A) + E(T^B) is minimized.

The key to the solution of this problem is based on the understanding of how T^A and T^B are mutually constrained, that is, how the folding of one structure may exert constraint on, or be constrained by, the other structure. Recall that r_i is the observed reactivity at the ith nucleotide, and let it be a mixture (with a ratio ψ) of the reactivity from structure A (defined as Inline graphic ) and structure B (defined as ). In other words, . Also assume that we can represent the observed reactivity by its expectation , which is determined by the structural configuration T^B(i), i.e., . Therefore, and finally, . For the sake of simplicity, we will write , as ψ and r_i are not variables and are given as the inputs. In this case, the pseudoenergy used for folding T^A is affected by the structure of T^B and vise versa. In this case, T^A and T^B are mutually constrained.

With the understanding of mutual constraints, we can easily devise a brute force solution, which takes exponential time. We can enumerate all possible T^B as constraints and fold T^A correspondingly. The pair of T^A and T^B that results in the minimum sum of free energies can be taken as the optimal solution. However, this approach is computationally expensive, as there are an exponential number of possible structures that are to be enumerated as the structural constraint T^B. To resolve this issue, we can break down the problem into smaller subproblems using dynamic programming formulation. Let Inline graphic be a substructure in T^A (which begins with S_i₁ and ends with S_j₁), and be a substructure in T^B (which begins with S_i₂ and ends with S_j₂). These two substructures may or may not overlap each other. If the optimal structures (with the minimum sum of energies) adopted by both subregions, say T_i_1,j1;i2,j2, are known, we can use them as structural constraints to fold nearby RNA sequences. We can reach the final solution using this approach by extending i₁, j₁ to 0 and i₂, j₂ to l−1.

To guarantee optimality, we need to consider all possible constraining orders when computing T_i_1,j1;i2,j2. For example, let “−” indicate a void region that exerts no structural constraint, T_{−,−;i2,k2} → T_i_1,j1;i2,k2 → T_i_1,j1;i2,j2 (where i₂ ≤ k₂ ≤ j₂) represents the following constraining order: (1) Inline graphic is folded without constraint, (2) is folded by applying as a structural constraint, and (3) is folded by applying as a structural constraint. Other constraining orders are possible as well. For example, T_{−,−;i2,k2} → T_i_1,k1;i2,k2 → T_i_1,k1;i2,j2 → T_i_1,j1;i2,j2 (where i₁ ≤ k₁ ≤ j₁) can act as an alternative constraining order to be traversed during the computation of T_i_1,j1;i2,j2. In summary, all subregions (T_i_1,j1;i2,j2) and all constraining orders need to be taken into account in the algorithm. We present a combinatorial solution using the Nussinov's energy model (Nussinov et al., 1978) in the following section.

2.2. An algorithm with Nussinov's energy model

In this section, we will introduce a solution for the RNA mutually constrained folding problem using Nussinov's energy model (Nussinov et al., 1978) to facilitate the understanding of the major idea. The object function of Nussinov's RNA folding formulation is to maximize the number of base pairs, instead of minimizing the free energy, of the predicted RNA structure. Therefore, denote F(i₁,j₁;i₂,j₂) as the maximum number of base pairs within the subregions Inline graphic and , where , and is the structure for the subregion of T_i_1,j2;i2,j2. Note that F also contains both real and pseudo base pairs, that is, F = F^⋅ + F^∘. Also denote as the maximal number of base pairs within the subregions given the structure as a structural constraint. For the sake of clarity, we underline the terms that correspond to the terminal cases, whose values can be directly computed or looked up. We can compute F(i₁,j₁;i₂,j₂) using the following recursive function:

The first case described in Equation (2) corresponds to a boundary case where no base pair is formed. The second and third cases correspond to paired cases, where the outmost nucleotides (i₁ and j₁, or i₂ and j₂, respectively) form a base pair. In this case, “1” is added to indicate the base pair that has formed, and how well the observed reactivity supports the pair is evaluated by pseudo base pairs (F^∘s). The last four cases try all possible branching points with different constraining orders. Take the fourth case as an example; the last added structural component (i.e., Inline graphic ) will be predicted using the existing optimal substructure (i.e., ) as a constraint by the traditional Nussinov's folding algorithm (Nussinov et al., 1978) with soft constraints.

The optimal structural configuration for a region with a structural constraint, for example, Inline graphic [similar for ], can be computed as follows:

A direct implementation of this algorithm leads to an O(l⁸) time complexity and O(l⁵) space complexity. Indeed, to complete the algorithm, we need to fill up a four-dimensional dynamic programming table F, which requires O(l⁴) time. For each entry in F, O(l) time is used for traversing all branching k₁ and k₂, and O(l³) is used to compute the constrained folding F^A and F^B. Therefore, the overall time complexity would be O(l⁸). The space complexity is O(l⁵). Note that the algorithm needs to maintain the four-dimensional table F, in addition, and O(l) space is also required for each entry of F to record the corresponding optimal structure that would be used as a structural constraint in the future folding steps. Hence the overall space complexity is O(l⁵).

2.3. An improved algorithm with Turner's energy model

The time and space complexity of the previous algorithm are prohibitively high and are not feasible for most real RNA sequences. Therefore, we need to devise a more efficient algorithm. At the same time, we need to consider the more realistic Turner's energy model (Turner et al., 1988). Inspired by the idea of RNAscf (Bafna et al., 2006), we observe that the major scaffolds of RNA secondary structures can be represented by stacks. A stack, built from a number of continuously nested base pairs, form the regular A-form helix of the RNA structure that stabilizes the structure. Note that we only consider the significant stacks, that is, those with more than four base pairs and eight hydrogen bonds, as the number of these significant stacks is usually small and less than the length of the RNA sequence (Bafna et al., 2006). Therefore, at each folding step we will add a stack or a structural component enclosed by a stack. Thus we can achieve significant speedup compared to the previous algorithm with base-pair resolution.

We begin the exposition of the algorithm by introducing basic definitions of stacks and their relationships. An RNA structure can be represented by a set of significant stacks; denote the set as Inline graphic . A stack p can be uniquely determined by three indices: the leftmost endpoint l(p), the rightmost endpoint r(p), and the width of the stack w(p). The nucleotides at l(p) and r(p) form the outmost (smallest 5′ and largest 3′ indices) base pair of p, while l(p) + w(p) − 1 and r(p) − w(p) + 1 form the innermost base pair of p. To simplify the notations, we also say that l_I(p) and r_I(p) form the innermost base pair of p. The stacks can be partially ordered by increasing rightmost endpoints and decreasing leftmost endpoints. With such partial ordering, we can denote the ith stack in Inline graphic as p_i.

Let p_i and p_j be two stacks in Inline graphic and assume that i < j. If p_i is enclosed by p_j, that is, l_I(p_j) < l(p_i) and r_I(p_j) > r(p_i), denote their relationship as p_i < _I p_j. If p_i is juxtaposed to p_j, that is, r(p_i) < l(p_j), denote their relationship as p_i < _J p_j. If there is no stack p_k such that p_i < _J p_k and p_k < _J p_j, we say that p_i is directly before p_j. Note that there may exist more than one stack that are directly before p_j, therefore denote the stacks that are directly before p_j as a set Inline graphic . The size of the set is expected to be a constant when only the significant stacks are considered (Bafna et al., 2006).

Since Turner's energy model also considers the free energies of loops (unpaired regions) formed between stacks, we define the loop regions as follows. Denote the hairpin loop formed by stack p_i as L(p_i), which refers to the region Inline graphic . Denote the internal/bulge loops formed between stacks p_i and p_j as L_l(p_i, p_j) and L_r(p_i, p_j) if p_i < _I p_j, which refer to the two regions and , respectively. If not specified, L(p_i, p_j) is used to represent both loops. Denote the multibranch loop formed between stacks p_i and p_j as L(p_i, p_j) if p_i < _J p_j, which refers to the region Inline graphic . Finally, we can also represent a loop region by explicitly giving the sequence region, for example, is a loop starting from the ith nucleotide and ending with the jth nucleotide.

Let the minimum free energy of the regions enclosed by stacks p_i and p_j (including these two stacks) be E(p_i; p_j), and E = E^⋅ +E^∘. If we artificially add a stack p*, where l(p*) = 0, r(p*) = l − 1 and w(p*) = 0, we can retrieve the global optimal solution from E(p*;p*). For clarity, we explicitly write E(p_i;p_j) as Inline graphic to indicate that p_i is presented in the structure A and p_j is presented in the structure B. Also, denote as the minimum free energy for the most recent hairpin loop folding event, for the most recent internal/bulge loop folding event, and for the most recent multibranch loop folding event. Therefore:

where the first case “0” is a boundary case where no structure is formed.

To compute the hairpin loop energy Inline graphic , denote e_s(p) as the free energy of a stack p, and as the free energy for the hairpin loop (recall that the underlined terms indicate the terminal cases that can be directly computed or looked up). Denote as the minimum free energy for the stack and the region enclosed by it when folded without mutual constraint. The matrix E^uc can be precomputed by using the traditional minimum free energy folding algorithms (Zuker and Sankoff, 1984; Hofacker et al., 1994; Reuter and Mathews, 2010), while the reactivities are used as soft constraints [extending the recursive function for computing Inline graphic with Turner's energy model]. Note that we only need to precompute the matrix once, and all required unconstrained folding results can be retrieved. Let the structure that corresponds to be . For pseudoenergies, denote as the pseudoenergy of adopting as a stack into the structure given the constraint T^B, and Inline graphic as the pseudoenergy of adopting the loop region given the constraint. (We do not consider loop pseudoenergy if both structures are unpaired at this region.) The recursive function for computing considers two cases, where (1) , or (2) is recently added as a hairpin loop:

To consider the internal/bulge loop case, denote Inline graphic as the free energy for the internal/bulge loop formed by and , if . The recursive function for computing considers two cases, where (1) or (2) is recently added as an internal/bulge loop:

To compute the multibranch loop case, we have to introduce a new three-dimensional matrix E_m₁. E_m₁ stores the minimum free energy formed between an opened multibranch loop and a closed loop. The opened multibranch loop can be viewed as a chain, which is formally defined as a set of juxtaposing stacks and their enclosed structural components. Therefore, the entry Inline graphic is the optimal structural configuration formed between the chain that is ended with and enclosed by ( itself is NOT included in the chain), and the structural component that is enclosed by (where itself is included). Let e_ma be the multibranch loop closing penalty, e_mb be the unpaired region extension penalty (applied on the length of the loop L, |L|), and e_mc be the bonus free energy for adding a new branch. The recursive function for computing Inline graphic considers two cases, where (1) , or (2) is recently added as a multibranch loop:

To compute Inline graphic , we introduce another matrix E_m₂ that corresponds to the minimum free energy configuration formed between two chains. The two chains that correspond to the entry are the ones that ended with (enclosed by ) and (enclosed by ), respectively. Let the term ‘’ refer to the real free energy of the structural component formed in A as recorded in Inline graphic . For boundary cases, denote as the unconstrained free energy of the chain that is enclosed at and ended at . Note that the matrix is auxiliary to E^uc matrix (Zuker and Sankoff, 1984; Hofacker et al., 1994), which can also be precomputed for a constant time look-up. Let the corresponding structure of Inline graphic be . The recursive function for computing considers five cases: where the closed loop is recently added as (1) a hairpin loop, (2) an internal/bulge loop, or (3) a multi-branch loop, respectively; or the last component in the chain is recently added as (4) an extension, or (5) the beginning of the chain.

In the last two cases, note that we do not fold the structure enclosed by Inline graphic from scratch as we did in the naive algorithm. Instead, we assume that its structure is majorally determined by only one structural constraint. Let the structural constraint be enclosed by . By search all , we will identify this structural constraint from . Once we have identified , which encloses the structural constraint, we can retrieve the corresponding structure that is enclosed by Inline graphic from . Note that we omitted the cases where adopts no structure, which can be computed easily by adjusting the length to be applied on e_mb and discard e_mc.

Note that the recursive function for computing Inline graphic can be easily derived based on the symmetricity. Therefore, we omit the exposition of this part. Finally, the recursive function for computing considers four cases, where (1) from the chain is recently added as an extension of the existing chain, (2) is added as an extension, (3) is added as the beginning of the chain, or (4) Inline graphic is added as the beginning of the chain:

Note that we also omitted the cases where Inline graphic or adopts no structure, which can be computed by adjusting the loop size for e_mb and discarding e_mc.

The time complexity of the improved algorithm is O(n⁵), where n is the number of significant stacks predicted from the input RNA sequence. It is shown that n < l (Bafna et al., 2006), and thus the improved algorithm is feasible for most real RNAs. The algorithm will fill up the two-dimensional matrix E. To compute an entry in E, say Inline graphic , one needs to compute three matrices: , , and . Since and are determined, the variables become and . Therefore, we can use O(n) time to compute the E_m₁ matrix, and O(n²) time to compute the E_m₂ matrix. Since we need to traverse a number of constraining structural components for computing each entry of E_m₁ and E_m₂, the time complexities add up to O(n²) and O(n³), respectively. Hence, the overall time complexity for this algorithm is O(n⁵).

The space complexity of the improved algorithm is O(n³l). Consider the fact that the matrix E_m₂ is only referred to by the computation of E_m₁ with the same enclosing base pairs ( Inline graphic and ), it can be discarded immediately once the corresponding entries in E_m₁ are computed. Therefore, we only need to store E and E_m₁. For each entry in E and E_m₁, O(l) space is used to record the optimal structures. As a result, the overall space complexity for this algorithm is O(n³l) (note that E_m₁ is a three-dimensional matrix).

2.4. Inferring reactivities and pseudoenergies

In this section, we mainly discuss how we infer the reactivities and compute the corresponding pseudoenergies. Note that the reactivity is computed as a scaled reads mapping difference between the treated sample and the control sample (Deigan et al., 2009; Low and Weeks, 2010). In this case, when we assume that the number of reads for the control sample is very small (which can be expected from high-quality experiments), we can derive the following naive model to partition the mixture of reactivities. Given the partition for the first transcript ψ, and the expected reactivities for T^A and T^B at the ith nucleotide alone, that is, Inline graphic and , we approximately model the observed reactivity r_i as the weighted (ψ) sum of and . The expected reactivity may vary in experiments, where different enzymes/chemical reagents are used. In this article, we assume and . In cases when one of the structures is not determined, say T^B(i) = unknown, we make Inline graphic . That is, an optimistic pseudoenergy is applied no matter what structural configuration T^A(i) may adopt. In other words, T^A(i) has the “right of free folding” and will constrain the folding of T^B in the future. After inferring the reactivity, we can then compute pseudoenergy using the traditional way as described in Equation (1). Note that we only compute the pseudoenergy E^∘(r_i) when at least one of T^A(i) and T^B(i) is paired (Deigan et al., 2009; Low and Weeks, 2010). The parameters m and b are used as suggested in the references, where m = 2.6 kcal/mol and b = −0.8 kcal/mol.

Note that we only present a naive way of handling the reactivities and pseudoenergies. More sophisticated algorithms are encouraged if the characteristics of the probing enzymes/chemical reagents are well understood (Vasa et al., 2008; Aviran et al., 2011). In addition, if the raw reads are available, we can better model their mutual constraints and devise a more accurate estimation of the reactivities. Nevertheless, our focus of this work is to devise a new algorithmic framework for folding RNA structures with mutual constraints, and the reactivity and pseudoenergy handling components are independent of the major algorithmic framework. Different handling techniques are expected to be derived for specific applications.

3. Results

We have implemented the improved algorithm into a program called MutualFold using GNU C++. We searched for real experimental data on RNA alternative structures that cannot be correctly predicted simultaneously. Unfortunately, we cannot find such experimental data, because this technology is only developed recently. Therefore, we generate two artificial examples to demonstrate that MutualFold can correctly predict the alternative structures through partitioning the mixture of reactivities. We artificially assigned the mixture reactivities based on known alternative structures of adenine (Lemay et al., 2011) and TPP (Mironov et al., 2002; Rentmeister et al., 2007) riboswitches to their corresponding sequences, respectively. Following the convention of the SHAPE technology, we assumed that the expected reactivities for the unpaired regions are 1 and for paired regions are 0. Also, we introduced 20% error rate into the expected reactivities to simulate experimental errors. We expected that such high error rate is sufficient to cover most of the real experimental errors, and we also expected to show that the MutualFold algorithm is robust with such errors.

We first generated a set of mixture reactivities from a known adenine riboswitch. We assume that 70% of the transcripts adopt the “ON” structure, and 30% of them adopt the “OFF” structure. Given the mixture reactivities (and the correct partition ψ, see the Discussion section for cases in which the correct ψ is not available), we applied MutualFold to predict the two alternative structures and compared the results with the true alternative structures. We also used RNAstructure to predict the minimum free energy structure of the RNA, both with and without the reactivities as auxiliary information. We summarize the experimental results for adenine riboswitch in Figure 2.

FIG. 2. — Alternative structures of adenine riboswitch and structures predicted by MutualFold and RNAstructure. (a and b) ON and OFF structures of adenine riboswitch, respectively. (c and d) Two alternative structures of adenine riboswitch predicted by MutualFold. **(e)** The minimum free energy (MFE) folding result of adenine riboswitch by RNAstructure. **(f)** The MFE folding result of adenine riboswitch with artificially generated reactivities by RNAstructure.

The real alternative structures of adenine riboswitch are shown in Figure 2 a and b. The two structures that are simultaneously predicted by MutualFold are shown in Figure 2 c and d. We can see that the two alternative structures predicted by MutualFold are exactly the same as the true structures. Therefore, the proposed algorithm is very powerful in recovering alternative structures when the mixture reactivities are given. In addition, MutualFold is also very robust in handling experimental errors, as both correct structures can be perfectly predicted even when 20% error rate is assumed. On the other hand, we found that the minimum free energy structure, both with and without (Fig. 2e and f, respectively) reactivity, cannot perfectly predict the real structures. Therefore, the alternative structures cannot be predicted separately, and the algorithm that can simultaneously predict both structures is necessary.

For a more challenging test, we artificially generated the mixture reactivities for a TPP riboswitch, while assuming the transcript partition is 50%–50%. This test is challenging because (1) a large fraction of the TPP riboswitch adopts the same structure, thus the mixture reactivities have less distinguishing power to recover both alternative structures; and (2) there exist many insignificant stacks presented in the TPP riboswitch alternative structures that will be considered by MutualFold. We presented the test results of TPP riboswitch in Figure 3.

FIG. 3. — Alternative structures of thiamine pyrophosphate (TPP) riboswitch and structures predicted by MutualFold and RNAstructure. (a and b) ON and OFF structures of TPP riboswitch, respectively. (c and d) Two alternative structures of TPP riboswitch predicted by MutualFold. **(e)** The MFE folding result of TPP riboswitch by RNAstructure. **(f)** The MFE folding result of TPP riboswitch with artificially generated reactivities by RNAstructure.

The real alternative structures of TPP riboswitch are shown in Figure 3a and b. The two structures that are simultaneously predicted by MutualFold are shown in Figure 3c and d, and the minimum free energy structures, with and without input reactivities, are shown in Figure 3e and f, respectively. Using the reactivities as soft constraint, RNAstructure can only predict the “OFF” structure with high accuracy. On the other hand, MutualFold is able to recover the major scaffold of both “ON” and “OFF” structures, although several insignificant stacks are missed. We argue that MutualFold only considers significant stacks for computational efficiency, and the insignificant stacks can be easily taken back when more powerful computational resource is available. Nevertheless, even with the missed insignificant stacks, MutualFold is still capable of recovering the major scaffold of both “ON” and “OFF” structures (Fig. 3c and d).

4. Discussion

The algorithm presented in this work assumes that the real partition of the alternative structures, ψ, is known. In cases when such information is unknown, we can devise an EM (Expectation Maximization) algorithm to computationally estimate the partition ψ. We start with arbitrarily assigning a value between 0 and 1 to ψ as the a priori estimation, and compute the alternative structures using the MutualFold algorithm as the E-step. In the M-step, we update the partition estimation using the reactivity profiles at regions where the two predicted structures T^A and T^B adopt different structural configurations. This EM algorithm will terminate when the predictions of T^A and T^B become invariant.

In cases when the partition is difficult to estimate, we claim that the proposed combinatorial algorithm is not sensitive to the estimation of ψ. The phase transition property of dynamic programming indicates that the results are invariant when the parameters vary only within a certain range. That is, small deviation of the partition estimation will not change the predicted alternative structures significantly. We have tested the adenine example with partition estimation from 50% to 80% (note that the real partition is 70%), and MutualFold can still predict the correct alternative structures. In addition, because of the symmetricity, the partition estimation from 20% to 50% will also generate the correct prediction. In this case, the algorithm can accept a wide range (20% to 80% in this case) of partition estimations without making errors in the prediction.

In summary, we presented a combinatorial algorithm to simultaneously fold two alternative structures from a mixture of their experimental structure-probing results. The algorithm has a time complexity of O(n⁵) and a space complexity of O(n³l), where n is the number of significant stacks, l is the length of the RNA sequence, and n < l. We implemented the algorithm into a program called MutualFold and have shown that MutualFold is capable of simultaneously predicting both alternative structures with the artificially generated mixture reactivities. The algorithmic framework can be applied to different RNA structure-probing techniques, and only the reactivity and pseudoenergy handling component need to be revised. Therefore, we anticipate that the proposed algorithm will significantly promote future RNA structure-probing studies and related research.

Acknowledgment

This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM102515).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Aviran S., Trapnell C., Lucks J.B., et al. . 2011. Modeling and automation of sequencing-based characterization of RNA structure. Proc. Natl. Acad. Sci. U.S.A. 108, 11069–11074 [DOI] [PMC free article] [PubMed] [Google Scholar]
Bafna V., Tang H., and Zhang S.2006. Consensus folding of unaligned RNA sequences revisited. J. Comput. Biol. 13, 283–295 [DOI] [PubMed] [Google Scholar]
Deigan K.E., Li T.W., Mathews D.H., and Weeks K.M.2009. Accurate SHAPE-directed RNA structure determination. Proc. Natl. Acad. Sci. U.S.A. 106, 97–102 [DOI] [PMC free article] [PubMed] [Google Scholar]
Eddy S.2001. Non-coding RNA genes and the modern RNA world. Nature Reviews in Genetics 2, 919–929 [DOI] [PubMed] [Google Scholar]
Hofacker I.L., Fontana W., Stadler P.F., et al. . 1994. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188 [Google Scholar]
Kertesz M., Wan Y., Mazor E., et al. . 2010. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 [DOI] [PMC free article] [PubMed] [Google Scholar]
Lemay J.F., Desnoyers G., Blouin S., et al. . 2011. Comparative study between transcriptionally- and translationally-acting adenine riboswitches reveals key differences in riboswitch regulatory mechanisms. PLoS Genet. 7, e1001278. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li Y., and Zhang S.2011. Finding stable local optimal RNA secondary structures. Bioinformatics 27, 2994–3001 [DOI] [PubMed] [Google Scholar]
Low J.T., and Weeks K.M.2010. SHAPE-directed RNA secondary structure prediction. Methods 52, 150–158 [DOI] [PMC free article] [PubMed] [Google Scholar]
Lucks J.B., Mortimer S.A., Trapnell C., et al. . 2011. Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc. Natl. Acad. Sci. U.S.A. 108, 11063–11068 [DOI] [PMC free article] [PubMed] [Google Scholar]
Martin K.C., and Ephrussi A.2009. mRNA localization: gene expression in the spatial dimension. Cell 136, 719–730 [DOI] [PMC free article] [PubMed] [Google Scholar]
McCaskill J.S.1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105–1119 [DOI] [PubMed] [Google Scholar]
Merino E.J., Wilkinson K.A., Coughlan J.L., and Weeks K.M.2005. RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE). J. Am. Chem. Soc. 127, 4223–4231 [DOI] [PubMed] [Google Scholar]
Mironov A.S., Gusarov I., Rafikov R., et al. . 2002. Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria. Cell 111, 747–756 [DOI] [PubMed] [Google Scholar]
Mortimer S.A., and Weeks K.M.2007. A fast-acting reagent for accurate analysis of RNA secondary and tertiary structure by SHAPE chemistry. J. Am. Chem. Soc. 129, 4144–4145 [DOI] [PubMed] [Google Scholar]
Nussinov R., Pieczenik G., Griggs J., and Kleitman D.1978. Algorithms for loop matchings. SIAM J. Appl. Math. 35, 68–82 [Google Scholar]
Rentmeister A., Mayer G., Kuhn N., and Famulok M.2007. Conformational changes in the expression domain of the Escherichia coli thiM riboswitch. Nucleic Acids Res. 35, 3713–3722 [DOI] [PMC free article] [PubMed] [Google Scholar]
Reuter J.S., and Mathews D.H.2010. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129. [DOI] [PMC free article] [PubMed] [Google Scholar]
Storz G.2002. An expanding universe of noncoding RNAs. Science 296, 1260–1263 [DOI] [PubMed] [Google Scholar]
Turner D.H., Sugimoto N., and Freier S.M.1988. RNA structure prediction. Annu. Rev. Biophys. Biophys. Chem. 17, 167–192 [DOI] [PubMed] [Google Scholar]
Underwood J.G., Uzilov A.V., Katzman S., et al. . 2010. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat. Methods 7, 995–1001 [DOI] [PMC free article] [PubMed] [Google Scholar]
Vasa S.M., Guex N., Wilkinson K.A., et al. . 2008. ShapeFinder: a software system for high-throughput quantitative analysis of nucleic acid reactivity information resolved by capillary electrophoresis. RNA 14, 1979–1990 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wan Y., Kertesz M., Spitale R.C., et al. . 2011. Understanding the transcriptome through RNA structure. Nat. Rev. Genet. 12, 641–655 [DOI] [PMC free article] [PubMed] [Google Scholar]
Washietl S., Hofacker I.L., Stadler P.F., and Kellis M.2012. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction. Nucleic Acids Res. 40, 4261–4272 [DOI] [PMC free article] [PubMed] [Google Scholar]
Weeks K.M.2010. Advances in RNA structure analysis by chemical probing. Curr. Opin. Struct. Biol. 20, 295–304 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wilkinson K.A., Merino E.J., and Weeks K.M.2006. Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat. Protoc. 1, 1610–1616 [DOI] [PubMed] [Google Scholar]
Wuchty S., Fontana W., Hofacker I.L., and Schuster P.1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49, 145–165 [DOI] [PubMed] [Google Scholar]
Zuker M., and Sankoff D.1984. RNA secondary structures and their prediction. Bull. Math. Biol. 46, 591–621 [Google Scholar]

[B1] Aviran S., Trapnell C., Lucks J.B., et al. . 2011. Modeling and automation of sequencing-based characterization of RNA structure. Proc. Natl. Acad. Sci. U.S.A. 108, 11069–11074 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] Bafna V., Tang H., and Zhang S.2006. Consensus folding of unaligned RNA sequences revisited. J. Comput. Biol. 13, 283–295 [DOI] [PubMed] [Google Scholar]

[B3] Deigan K.E., Li T.W., Mathews D.H., and Weeks K.M.2009. Accurate SHAPE-directed RNA structure determination. Proc. Natl. Acad. Sci. U.S.A. 106, 97–102 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] Eddy S.2001. Non-coding RNA genes and the modern RNA world. Nature Reviews in Genetics 2, 919–929 [DOI] [PubMed] [Google Scholar]

[B5] Hofacker I.L., Fontana W., Stadler P.F., et al. . 1994. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188 [Google Scholar]

[B6] Kertesz M., Wan Y., Mazor E., et al. . 2010. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] Lemay J.F., Desnoyers G., Blouin S., et al. . 2011. Comparative study between transcriptionally- and translationally-acting adenine riboswitches reveals key differences in riboswitch regulatory mechanisms. PLoS Genet. 7, e1001278. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] Li Y., and Zhang S.2011. Finding stable local optimal RNA secondary structures. Bioinformatics 27, 2994–3001 [DOI] [PubMed] [Google Scholar]

[B9] Low J.T., and Weeks K.M.2010. SHAPE-directed RNA secondary structure prediction. Methods 52, 150–158 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] Lucks J.B., Mortimer S.A., Trapnell C., et al. . 2011. Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc. Natl. Acad. Sci. U.S.A. 108, 11063–11068 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] Martin K.C., and Ephrussi A.2009. mRNA localization: gene expression in the spatial dimension. Cell 136, 719–730 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] McCaskill J.S.1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105–1119 [DOI] [PubMed] [Google Scholar]

[B13] Merino E.J., Wilkinson K.A., Coughlan J.L., and Weeks K.M.2005. RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE). J. Am. Chem. Soc. 127, 4223–4231 [DOI] [PubMed] [Google Scholar]

[B14] Mironov A.S., Gusarov I., Rafikov R., et al. . 2002. Sensing small molecules by nascent RNA: a mechanism to control transcription in bacteria. Cell 111, 747–756 [DOI] [PubMed] [Google Scholar]

[B15] Mortimer S.A., and Weeks K.M.2007. A fast-acting reagent for accurate analysis of RNA secondary and tertiary structure by SHAPE chemistry. J. Am. Chem. Soc. 129, 4144–4145 [DOI] [PubMed] [Google Scholar]

[B16] Nussinov R., Pieczenik G., Griggs J., and Kleitman D.1978. Algorithms for loop matchings. SIAM J. Appl. Math. 35, 68–82 [Google Scholar]

[B17] Rentmeister A., Mayer G., Kuhn N., and Famulok M.2007. Conformational changes in the expression domain of the Escherichia coli thiM riboswitch. Nucleic Acids Res. 35, 3713–3722 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B18] Reuter J.S., and Mathews D.H.2010. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] Storz G.2002. An expanding universe of noncoding RNAs. Science 296, 1260–1263 [DOI] [PubMed] [Google Scholar]

[B20] Turner D.H., Sugimoto N., and Freier S.M.1988. RNA structure prediction. Annu. Rev. Biophys. Biophys. Chem. 17, 167–192 [DOI] [PubMed] [Google Scholar]

[B21] Underwood J.G., Uzilov A.V., Katzman S., et al. . 2010. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat. Methods 7, 995–1001 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] Vasa S.M., Guex N., Wilkinson K.A., et al. . 2008. ShapeFinder: a software system for high-throughput quantitative analysis of nucleic acid reactivity information resolved by capillary electrophoresis. RNA 14, 1979–1990 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] Wan Y., Kertesz M., Spitale R.C., et al. . 2011. Understanding the transcriptome through RNA structure. Nat. Rev. Genet. 12, 641–655 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B24] Washietl S., Hofacker I.L., Stadler P.F., and Kellis M.2012. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction. Nucleic Acids Res. 40, 4261–4272 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] Weeks K.M.2010. Advances in RNA structure analysis by chemical probing. Curr. Opin. Struct. Biol. 20, 295–304 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] Wilkinson K.A., Merino E.J., and Weeks K.M.2006. Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat. Protoc. 1, 1610–1616 [DOI] [PubMed] [Google Scholar]

[B27] Wuchty S., Fontana W., Hofacker I.L., and Schuster P.1999. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49, 145–165 [DOI] [PubMed] [Google Scholar]

[B28] Zuker M., and Sankoff D.1984. RNA secondary structures and their prediction. Bull. Math. Biol. 46, 591–621 [Google Scholar]

PERMALINK

Simultaneous Folding of Alternative RNA Structures with Mutual Constraints: An Application to Next-Generation Sequencing-based RNA Structure Probing

Cuncong Zhong

Shaojie Zhang

Abstract

1. Introduction

FIG. 1.

2. Methods

2.1. Problem formulation

2.2. An algorithm with Nussinov's energy model

2.3. An improved algorithm with Turner's energy model

2.4. Inferring reactivities and pseudoenergies

3. Results

FIG. 2.

FIG. 3.

4. Discussion

Acknowledgment

Author Disclosure Statement

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Simultaneous Folding of Alternative RNA Structures with Mutual Constraints: An Application to Next-Generation Sequencing-based RNA Structure Probing

Cuncong Zhong

Shaojie Zhang

Abstract

1. Introduction

FIG. 1.

2. Methods

2.1. Problem formulation

2.2. An algorithm with Nussinov's energy model

2.3. An improved algorithm with Turner's energy model

2.4. Inferring reactivities and pseudoenergies

3. Results

FIG. 2.

FIG. 3.

4. Discussion

Acknowledgment

Author Disclosure Statement

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases