Sieving RNA 3D Structures with SHAPE and Evaluating Mechanisms Driving Sequence-Dependent Reactivity Bias

Travis Hurst; Shi-Jie Chen

doi:10.1021/acs.jpcb.0c11365

Author manuscript; available in PMC: 2021 Mar 3.

Published in final edited form as: J Phys Chem B. 2021 Jan 26;125(4):1156–1166. doi: 10.1021/acs.jpcb.0c11365

PMCID: PMC7927954 NIHMSID: NIHMS1673266 PMID: 33497570

Abstract

Selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical probing provides local RNA flexibility information at single-nucleotide resolution. In general, SHAPE is thought of as a secondary structure (2D) technology, but we find evidence that robust tertiary structure (3D) information is contained in SHAPE data. Here we report a new model that achieves a higher correlation between SHAPE data and native RNA 3D structures than the previous 3D structure-SHAPE relationship model. Furthermore, we demonstrate that the new model improves our ability to discern between SHAPE-compatible and incompatible structures on model decoys. After identifying sequence-dependent bias in SHAPE experiments, we propose a mechanism driving sequence-dependent bias in SHAPE experiments, using replica-exchange umbrella sampling simulations to confirm that SHAPE sequence bias is largely explained by the stability of the unreacted SHAPE reagent in the binding pocket. Taken together, this work represents multiple practical advances in our mechanistic and predictive understanding of SHAPE technology.

Introduction

From the design of nanotherapeutics to the comprehension of fundamental mechanisms driving biological processes, knowledge of RNA structure and function is a cornerstone that we must grasp in order to prevent diseases and harness the power of RNA. In vivo, RNA molecules must adopt specific structural conformations to perform biological functions.¹ Identification of the functional structure of RNA molecules can lead to important insights into their mechanisms of action. Rapid advances in sequencing RNA molecules have led to the identification of more than 18.5 million non-coding RNA sequences in the RNAcentral (version 14) database.² In contrast, as of September 2020, about 5060 RNA-containing structures can be found in the RCSB Protein Data Bank (PDB), and only 2962 of these are high resolution (< 3.5 Å). Myriad challenges associated with experimental structure determination prevent rapid growth of the benchmark structure database.³ To address this stark gap in knowledge, we require accurate computational prediction of RNA structures from their sequences to build our understanding of the role of RNA in biological systems. Combined with constraints from efficient experimental data, such as selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical probing, computational models can characterize RNA structures of interest, addressing the knowledge gap between sequence and structure information and taking full advantage of emerging chemical profiling techniques to drive structure prediction.⁴

The SHAPE chemical probing method provides a single-nucleotide resolution measure of local nucleotide flexibility.^{5, 6} SHAPE reagents are small ligands, e.g. 1-methyl-7-nitroisatoic anhydride (1M7), that preferentially bind to the oxygen of the 2’-hydroxyl group of RNA nucleotides in flexible regions.⁷ High SHAPE reactivity for a nucleotide typically indicates that the nucleotide is located in a flexible region of the RNA, such as a flexible loop, whereas low SHAPE reactivity tends to indicate that the nucleotide is found in a rigid location, such as a helix or rigid loop.^{8, 9} This efficient, solvent-based method has revolutionized 2D structure estimation for RNA.^{10–16} Furthermore, SHAPE can be used to characterize RNA molecules in living cells.^{7, 17} Although SHAPE provides information about 2D structure, because the interactions that determine SHAPE reactivity are intrinsically 3D structure-dependent, SHAPE data contains information pertaining to the tertiary structure of RNA, making it a valuable tool for sieving 3D decoys in computational modeling. Previously, we showed that measures of local nucleotide flexibility generally correlate with SHAPE data and proposed the 3D structure-SHAPE relationship (3DSSR) model to predict SHAPE reactivity from 3D RNA structure.¹⁸ Comparing tools for predicting SHAPE from 3D RNA structure, we also showed that physics-based methods outperform pattern-recognizing machine learning techniques because of data scarcity and our ability to express the SHAPE mechanism as an analytical function of nucleotide interaction strength and structural features.¹⁹ Combining the efficiency of SHAPE with the power of computational modeling can accelerate the search for accurate representations of RNA 3D structures.

In this work, we identify several features of SHAPE that contribute to the underlying reactivity mechanism and use SHAPE data to expand our previous model for sieving likely 3D structure candidates from a pool of decoy structures. Because more SHAPE data is available for sequences than for 3D structures, we analyze an expanded data set to identify sequence-dependent biases in SHAPE experiments. Using replica exchange umbrella sampling (REUS) simulations, we identify a sequence-dependent mechanism that largely explains the bias previously noted in the SHAPE data.²⁰ With interest, we note that previous work has identified the log-normality of SHAPE data and noise.^{21, 22} This implies that SHAPE reactivity directly reflects the underlying energetics of the SHAPE reaction. In comparison to the previously published 3D structure-SHAPE relationship (3DSSR) model,¹⁸ we find performance improvements of 30%, 21%, and 10% in average Pearson correlation (PC) of the near-native ensemble, PC of the top-ranked structure, and Spearman rank correlation (SRC), respectively. These improvements indicate that our model can accurately predict the SHAPE data from the 3D structure and reliably discern between SHAPE-compatible (near-native) and SHAPE-incompatible (non-native) structures. By investigating disparities between crystal structure information and SHAPE reactivity, we explain the discrepancies that lead to below-average prediction performance for some validation structures. The strength of this method lies in exploiting the combination of computational strength and efficient experimental data. Although SHAPE has previously been considered to provide mostly secondary structure information, we reiterate that SHAPE provides a wealth of information at the level of tertiary structure, especially for larger RNA with more long-range interactions. 
Furthermore, while SHAPE provides data at single-nucleotide resolution, we emphasize that interactions between a nucleotide of interest (NOI) and nearby nucleotides, as well as interactions between those nearby nucleotides and the SHAPE ligand, play an important role in estimating the reactivity of an NOI. SHAPE data is autocorrelated, and we can take advantage of the correlations within the data to extract reliable information from SHAPE profiles.

Theory and Methods

In this section, we describe the theory underpinning the physics-based model. First, we describe the energy scoring for base pairing and stacking interactions. These combine to form the interaction energy score. We also describe how local correlations and polarity biases in SHAPE reactivity are accounted for using a weighted averaging method. Using these energy scores and additional structural features, we provide our new estimator of SHAPE reactivity from RNA 3D structure. After that, we detail how experimental noise is factored into our performance calculations and explain improvements to the simulation and training methodology. Finally, we delineate a procedure for detecting sequence-dependent bias and provide free energy calculations from REUS to show that SHAPE sequence-dependent bias can be largely attributed to the relative stability of the unreacted SHAPE reagent in the binding pocket.

Base pairing and stacking interactions

Base pairing interactions can be categorized into 12 types, according to their nucleotide edge geometries. Using updated statistical frequencies of the base pairing interactions (see Table S1 in the Supplemental Material (SM)) extracted from the non-redundant RNA Basepair Catalog,²³ we assigned pseudo-free energy scores to each interaction family on the basis of a quasi-chemical statistical potential²⁴

E_{bp}^{(t)} (a, b) = - k_{B} T ln (\frac{f_{obs}^{(t)} (a, b)}{f_{exp}^{(t)} (a, b)})

(1)

where the expected fraction is calculated as

f_{exp}^{(t)} (a, b) = f_{obs}^{(t)} x_{a} x_{b}

(2)

from the observed fraction of type-t pairs (t = 1, 2, … , 12) and the mole fractions (x_{a, b}) of base pairs involving a and b nucleotide types (A, U, G, C). The observed fractions of type-t pairs between a and b nucleotide types $f_{obs}^{(t)} (a, b)$ are directly calculated from Table S1 in SM. The updated statistical frequencies from the non-redundant database included 20,451 entries, which is more than the 3,917 datapoints used to characterize base pairing in the previous model. Base pair types were identified in structures using RNAView.²⁵
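A minimal Python sketch of Eqs. 1–2 follows. It is an illustration under stated assumptions, not the authors' implementation: the `counts` dictionary, the toy numbers, the value of `KT`, and the way mole fractions are tallied (each base counted once per pair it participates in) are all hypothetical choices made here to produce a runnable example.

```python
import math

KT = 0.593  # kcal/mol near room temperature; illustrative value, not from the paper


def pair_energy(counts, t, a, b, kT=KT):
    """Eqs. 1-2: E = -kT * ln(f_obs(t, a, b) / (f_obs(t) * x_a * x_b)).

    counts maps (pair_type, base_a, base_b) -> number of observed pairs.
    """
    total = sum(counts.values())
    # Observed fraction of this specific (type, a, b) combination.
    f_obs_tab = counts[(t, a, b)] / total
    # Observed fraction of all type-t pairs, regardless of sequence.
    f_obs_t = sum(n for (tt, _, _), n in counts.items() if tt == t) / total
    # Mole fraction of each base among all paired positions (assumed definition).
    base_counts = {}
    for (_, x, y), n in counts.items():
        base_counts[x] = base_counts.get(x, 0) + n
        base_counts[y] = base_counts.get(y, 0) + n
    n_bases = sum(base_counts.values())
    f_exp = f_obs_t * (base_counts[a] / n_bases) * (base_counts[b] / n_bases)
    return -kT * math.log(f_obs_tab / f_exp)


# Toy counts, invented for illustration only.
toy_counts = {("cWW", "G", "C"): 60, ("cWW", "A", "U"): 30, ("tHS", "A", "G"): 10}
e_gc = pair_energy(toy_counts, "cWW", "G", "C")
```

A pair that is observed more often than the reference state predicts receives a negative (favorable) pseudo-free energy, which is the behavior the statistical potential is meant to capture.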

If a pair of nucleotides meets the cutoffs for the distance (d ≤ D = 7.202 Å) between the centers of the planes defined by the triads of C2, C4, and C6 atoms and for the angle between the planes (cos(θ) ≥ C = 0.717), stacking energies are estimated for each nucleotide in the pair as

\begin{array}{l} E_{st}^{(i)} (i, j) = c_{0} \cdot E_{5^{'}} + c_{1} \cdot E_{3^{'}}^{(j)} \\ E_{st}^{(j)} (i, j) = c_{2} \cdot E_{5^{'}} + c_{3} \cdot E_{3^{'}}^{(j)} \end{array}

where c₀ − c₃ are weight parameters that decompose the contribution of the stacking energy from each nucleotide, E_5′ is a parameter for the upstream 5’ nucleotide in a stack, and $E_{3^{'}}^{(j)}$ is the parameter for the downstream 3’ nucleotide j (j > i) in a stack, which depends on the nucleotide type of j. This extension of the original 3DSSR model introduces sequence and 5′→3′ polarity dependence into the stacking energy score. See Fig. 1 for a visual example of base pairing and stacking calculations.

Figure 1: Details of stacking and base pairing calculations. To clarify the details of the stacking and base pairing calculation, we zoom into a section of a validation structure. Nucleotide 84G is indexed as the a (i) nucleotide and is base pairing (stacking) with 112C (85G). The 84G-112C base pair is a cis-Watson-Crick base pair (t = cWW). In the 84G-85G stack, the C2, C4, and C6 atoms are highlighted in magenta, and the distance (d) and angle (θ) between the planes are noted. As long as the stack meets the cutoffs described in the example calculation, we include the 3’ nucleotide type-dependent stacking contributions in the energy score.
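The stacking cutoff test described above can be sketched in Python as follows. The function names and the toy coordinates are hypothetical, and taking the absolute value of the cosine (so that the arbitrary sign of the plane normal does not matter) is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

D_CUT = 7.202  # Å, distance cutoff quoted in the text
C_CUT = 0.717  # cosine cutoff quoted in the text


def ring_plane(c2, c4, c6):
    """Return (center, unit normal) of the plane through the C2, C4, C6 atoms."""
    pts = np.array([c2, c4, c6], dtype=float)
    center = pts.mean(axis=0)
    normal = np.cross(pts[1] - pts[0], pts[2] - pts[0])
    return center, normal / np.linalg.norm(normal)


def is_stacked(base_i, base_j):
    """Apply the distance and angle cutoffs to two (C2, C4, C6) coordinate triads."""
    ci, ni = ring_plane(*base_i)
    cj, nj = ring_plane(*base_j)
    d = np.linalg.norm(ci - cj)
    cos_theta = abs(np.dot(ni, nj))  # assumption: orientation of the normal is ignored
    return bool(d <= D_CUT and cos_theta >= C_CUT)


# Toy coordinates (Å), invented for illustration only.
base_a = [(0.0, 0.0, 0.0), (2.4, 0.0, 0.0), (1.2, 2.1, 0.0)]
base_b = [(0.0, 0.0, 3.4), (2.4, 0.0, 3.4), (1.2, 2.1, 3.4)]    # parallel, 3.4 Å above
base_c = [(0.0, 0.0, 10.0), (2.4, 0.0, 10.0), (1.2, 2.1, 10.0)]  # parallel, too far
base_d = [(0.0, 0.0, 3.4), (2.4, 0.0, 3.4), (1.2, 0.0, 5.5)]    # close but perpendicular
```

Only pairs that pass both cutoffs would go on to receive the 5′ and 3′ stacking contributions weighted by c₀ − c₃.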

Pairing and stacking energies are combined into the interaction energy score (IE) for a given nucleotide n as

E_{IE} (n) = \sum_{m} [A \cdot E_{bp}^{(t)} (n, m) + B] + \sum_{k} E_{st}^{(n)} (n, k)

(3)

where the base pairing energy score $E_{bp}^{(t)} (n, m)$ is sequence and base pairing type t-dependent. A free nucleotide that is next to a rigid nucleotide will be less reactive than a free nucleotide that has flexible neighbors. To account for this type of correlative effect, we further process the interaction energy score by taking a weighted average over the nearby nucleotides to calculate the weighted interaction energy score $({\bar{E}}_{IE})$

{\bar{E}}_{IE} (i) = \frac{\sum_{j = 0}^{3} w_{j} \times E_{IE} (i + j - 1)}{\sum_{j = 0}^{3} w_{j}}

(4)

where w₀ – w₃ are weights accounting for the influence of interactions involving the nucleotide of interest (NOI) and/or neighboring nucleotides on the flexibility and SHAPE reactivity of the NOI. Constraints placed on the neighboring nucleotides by stacking or base pairing interactions will reduce the degrees of freedom and flexibility of the NOI. Because the SHAPE reagent attacks the 2’OH group, the downstream 3’ nucleotide usually has more influence on the NOI than the upstream 5’ nucleotide, and this polarity bias is reflected in the relative magnitudes of the weights. Furthermore, because the SHAPE reagent binding pocket is formed between the NOI and its 3’ neighbor, the next nearest neighbor on the 3’ side is also included in the score. In other words, we are evaluating the nucleotides that form the binding pocket and their neighbors in our assessment of the predicted flexibility and SHAPE reactivity of the NOI.
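The index arithmetic in Eq. 4 is easy to misread, so a short Python sketch may help: the offsets j − 1 for j = 0, …, 3 select the 5’ neighbor, the NOI, the 3’ neighbor, and the next-nearest 3’ neighbor. The weights and scores below are placeholders, not fitted values, and the treatment of chain ends (skipping missing neighbors and renormalizing) is an assumption, since the text does not specify it.

```python
def weighted_average(values, weights):
    """Eq. 4: average over positions i-1, i, i+1, i+2 with weights w0..w3."""
    n = len(values)
    averaged = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for j, w in enumerate(weights):
            k = i + j - 1  # j = 0 is the 5' neighbor, j = 1 is the NOI itself
            if 0 <= k < n:  # assumption: neighbors beyond the chain ends are skipped
                num += w * values[k]
                den += w
        averaged.append(num / den)
    return averaged


# Placeholder per-nucleotide scores and weights, for illustration only.
toy_scores = [0.0, -2.0, -4.0, 0.0, 0.0]
toy_weights = [1.0, 2.0, 3.0, 1.0]
smoothed = weighted_average(toy_scores, toy_weights)
```

Because the 3’ side carries two of the four terms, a rigid downstream neighbor lowers the averaged score of a free nucleotide more than a rigid upstream neighbor does, provided the fitted weights reflect the polarity bias described above.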

2D structure

In order to automate our procedure, we utilized the Dissecting the Spatial Structure of RNA (DSSR) tool to identify the 2D structure from the 3D structures.²⁶ This important addition to the procedure not only automates the process, it also removes human error and allows recognition of formation and breaking of base pairs during the simulations. In the previous work, the 2D structure term was 1.0 for free nucleotides and 0.01 for pairing nucleotides. Converting this concept to an energy-like score, we introduced a fit parameter E_2D(n) to represent the energy contribution of base pairing in the 2D structure. Similar to the IE term, we performed weighted averaging on the 2D structures

{\bar{E}}_{2 D} (i) = \frac{\sum_{j = 0}^{3} d_{j} \times E_{2 D} (i + j - 1)}{\sum_{j = 0}^{3} d_{j}}

(5)

where d₀ – d₃ weight the NOI and neighboring nucleotide 2D structure contributions to the NOI prediction. Nucleotides that neighbor base pairing nucleotides normally have lower SHAPE reactivity than nucleotides neighboring free nucleotides, so the weights account for these correlation effects. Again, the polarity bias in the SHAPE data is accounted for by fitting direction-dependent weights.

Accounting for Other Structural Features

We expect that accessibility of the SHAPE reagent (1M7) to the 2’-hydroxyl of a nucleotide is a necessary condition for reactivity. The free SHAPE ligand has an effective radius of about 2.0 Å, so the ligand accessible surface A_SAS of the 2’-hydroxyl for each nucleotide was calculated using Visual Molecular Dynamics (VMD)²⁷ with a bead radius of 2.0 Å. Similar to the IE and 2D terms, we can perform weighted averaging on the A_SAS term

{\bar{A}}_{SAS} (i) = \frac{\sum_{j = 0}^{3} a_{j} \times A_{SAS} (i + j - 1)}{\sum_{j = 0}^{3} a_{j}}

(6)

where a₀ – a₃ are weights for the NOI and neighboring nucleotides to account for the influence of accessibility of nearby nucleotides on the accessibility and reactivity of the NOI. Since A_SAS is a noisy term (it can fluctuate significantly for any given nucleotide, even with the small structure changes in heavily constrained simulations), this calculation smooths the accessibility estimation, allowing the prediction to home in on broader regions of inaccessibility that may reduce SHAPE reactivity.

In addition, previous studies found that the ribose sugar conformation is important for SHAPE reactivity.^{28, 29} Indeed, we found that addition of this structural feature into our predictive algorithm provides a marginal performance improvement. To account for this effect, we assign a fit correction F_sug determined by the pseudorotation angle of the ribose to modulate the predicted SHAPE reactivity (see Table S2 for details).

The addition of short nucleotide sequences in the tail regions during SHAPE experiments can affect the SHAPE-reactivity of those regions. Although no consensus method has been established for SHAPE, some of these tail effects can be accounted for through the addition of simple fit parameters F_term on terminal nucleotides, so the model also accounts for those effects (see Table S2).

Bound ligands are known to reduce SHAPE reactivity for nucleotides.³¹ To account for these effects, we introduced a fit ligand binding energy penalty E_lig for 45 nucleotides that are interacting with bound ligands.

Predicting SHAPE

Several structural features can be combined into a single structure factor coefficient for each nucleotide

{SF}_{i} = ({\bar{A}}_{SAS} (i) + A_{SAS}^{0}) * F_{sug} (i) * F_{term} (i)

(7)

where $A_{SAS}^{0}$ is a fit parameter accounting for RNA flexibility that may allow an apparently inaccessible nucleotide to become flexible. Similarly, the SHAPE energy score can be written

{SE}_{i} = {\bar{E}}_{2 D} (i) + {\bar{E}}_{IE} (i) + E_{lig} (i)

(8)

where E_lig(i) is the energy-like ligand binding penalty for the ith nucleotide. Combining all of these, we arrive at an estimation for SHAPE reactivity on the basis of the 3D structure

p_{i} = {SF}_{i} * e^{{SE}_{i}}

(9)

The form of this equation takes advantage of a Boltzmann transformation to express that SHAPE profiles are a function of the underlying energetics that drive the reaction.
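Eqs. 7–9 can be combined into a few lines of Python. This is a sketch only: the function name, argument names, and default values are hypothetical, and the inputs are assumed to be the already-averaged per-nucleotide terms defined above.

```python
import math


def predict_reactivity(a_sas_bar, e_2d_bar, e_ie_bar, e_lig=0.0,
                       f_sug=1.0, f_term=1.0, a_sas0=0.0):
    """Predicted SHAPE reactivity p_i for one nucleotide."""
    sf = (a_sas_bar + a_sas0) * f_sug * f_term  # Eq. 7: structure factor
    se = e_2d_bar + e_ie_bar + e_lig            # Eq. 8: SHAPE energy score
    return sf * math.exp(se)                    # Eq. 9: Boltzmann-like form


# Placeholder inputs, for illustration only.
p_free = predict_reactivity(a_sas_bar=2.0, e_2d_bar=0.0, e_ie_bar=0.0)
p_paired = predict_reactivity(a_sas_bar=2.0, e_2d_bar=-1.0, e_ie_bar=-0.5)
```

With this sign convention, more negative (more stabilizing) energy scores reduce the predicted reactivity exponentially, while the structure factor scales it linearly.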

All-atom simulations to form near-native ensembles

Curating starting structures from the PDB (see Table 1), we performed all-atom MD simulations using Amber18³² with the RNA.OL3 force field. Each RNA molecule was solvated in a TIP3P truncated octahedral water box with a buffer size of 15 Å. Neutralizing sodium and chloride counterions were used to maintain a 1 M sodium concentration. Coupling the system to a Langevin thermostat maintained constant temperature (300 K). Two sets of “near-native” simulations were run. In the first, only the backbone phosphate (P) atom positions were strongly constrained with a restraint weight of 500 kcal/(mol·Å²) to maintain the global folding pattern. In the second set, two nucleotide atom positions in the base pairing atoms (C2, C6) were also constrained. Each system was simulated in five steps using a GTX 1080 Ti NVIDIA GPU with an AMD Ryzen Threadripper 1950X 16-core, 4 GHz processor. In the first minimization step, only solvent atoms were allowed to move, and the RNA atoms were fixed. The full system was minimized in the second step. In the third step, the system was warmed to 300 K in constant volume (NVT). Fourth, the system underwent constant pressure equilibration (NPT). Finally, we ran production simulations in the NPT ensemble, writing to the Amber NetCDF file every 10 ps, which yielded 1000 snapshots of each RNA over 10 ns.

Table 1:

RNA structures used for validation of the model. The Protein Data Bank ID (PDBID), length of the RNA in nucleotides (#nt), type of RNA, and organism of origin are displayed. For RNA structures with multiple PDB entries in the database, we chose the one with the highest resolution. The SHAPE profiles for these RNA molecules are from the published experimental data.^{8, 13–15, 30}

PDBID	#nt	RNA structure type	Organism
2L8H	29	TAR	HIV-1
1AUD	30	U1A protein binding site	H. sapiens
2L1V	34	M-box riboswitch	B. subtilis
2K95^*	47	P2B-P3 telomerase pseudoknot	H. sapiens
1Y26	71	Adenine riboswitch	V. vulnificus
1VTQ	74	PreQ1 riboswitch aptamer	B. subtilis
1EHZ	76	Aspartate tRNA	Yeast
1P5O^*	77	IRES domain II	HCV
2GDI	78	TPP riboswitch	E. coli
3MXH^*	91	Cyclic-di-GMP riboswitch	V. cholerae
3IWN	93	Cyclic-di-GMP riboswitch	V. cholerae
4KQY	118	SAM-I riboswitch	B. subtilis
1C2X^*	120	5S rRNA	E. coli
1NBS	149	Ribonuclease P domain	B. subtilis
3PDR	154	M-box riboswitch	B. subtilis
1GID^*	158	Group I ribozyme domain	Synthetic
3DIG	174	Lysine riboswitch	T. maritima

^* Structures used in the new model that were not used to validate the previous 3DSSR model.¹⁸

Generating non-native decoy structures with coarse grained simulations

A pool of 3D decoys with native and non-native 2D structures was generated using coarse-grained (CG) simulations to test the ability of the algorithm to discern between native and non-native 2D and 3D RNA structures (see Tables S3–S8 for 2D structures). Native 2D structures were extracted from PDB structures using DSSR. Low energy, non-native RNA structures were found by providing RNA sequences to Mfold³³ and taking five of the top ten results, excluding the native 2D structure. Native structure PDBs were converted into IsRNA-compatible CG structures, and CG simulations were performed using the IsRNA model in LAMMPS³⁴ to sample the conformational space using a temperature of 300 K and a Langevin thermostat. Each simulation ran for 5 ns with an integration time step of 0.5 fs to provide 1000 structures with RMSD from the native structure ranging from 0.5 Å to 14–35 Å, depending on RNA sequence length. These structures were converted back into all-atom representations and clustered with a mutual RMSD cutoff of 0.5 Å. Representative frames from the top 100 clusters were kept for each simulation. In total, 6 CG simulations were performed for each of the 17 RNA molecules studied: 1 simulation using constraints provided by the native 2D structure and 5 simulations using constraints provided by low-energy, non-native 2D structures. The all-atom representations of the CG structures were minimized in Amber18 to remove clashes due to transforming CG representations into all-atom structures. These structures were processed identically to the Amber18-generated near-native ensembles for the remaining steps.

Log-normal noise

Multiple previous studies describe the log-normality of SHAPE data and noise.^{21, 22} The form of our energy function is partially motivated by this work because log-normality of the data implies that SHAPE is a direct reflection of the underlying reaction energetics. However, using the standard Pearson correlation (PC) to compare our prediction to the SHAPE data does not account for the log-normal noise. Without loss of discerning ability, we can apply a noise term to improve our PC with SHAPE, which allows our training algorithm to focus on more severe shortcomings if a prediction for a nucleotide is close enough to the experimental reactivity. In accordance with processing methods proposed by previous authors, we scaled (“normalized”) the SHAPE data by discarding fewer than 10% of outliers (< 5% for molecules with < 100 nucleotides) and averaging the remaining top 10% to extract the scaling factor. Because there is not yet a consensus method for SHAPE experiments, no universal scaling method that ignores variations in the experimental methodology can perfectly scale the disparate experimental data; nevertheless, we expect data from disparate sources to agree better after scaling than without any scaling. Previous models such as the 3DSSR model¹⁸ did not need to scale the SHAPE data, since a scaling factor does not affect PCs, but when a noise function is included in the training process, scaling the SHAPE data helps maintain consistent noise approximations across different sets of experimental data.

In previous similar models such as 3DSSR,¹⁸ SHAPE data for each nucleotide was naïvely considered to be independent of its surrounding nucleotides, but looking at SHAPE data shows this to be a simplistic assumption, prima facie. Consistent with our view that SHAPE measures local flexibility at single-nucleotide resolution but that the reactivity for a NOI contains non-negligible information about the reactivity of neighboring nucleotides, we also account for the influence of neighboring nucleotides on the SHAPE reactivity for the NOI by performing weighted averaging on the experimental SHAPE reactivity, such that

{\bar{r}}_{i} = \frac{\sum_{j = 0}^{2} s_{j} \times r (i + j - 1)}{\sum_{j = 0}^{2} s_{j}}

(10)

where s₀ – s₂ are fit weights. Although SHAPE data has single-nucleotide resolution, this reweighting accounts for the influence of the flexibility of the surrounding nucleotides on the SHAPE reactivity for the NOI.

From previous studies,^{21, 22} SHAPE reactivity can be modeled as

r_{i} = s_{i} e^{n_{i}}

(11)

where r_i, s_i, and n_i are the SHAPE reactivity for a run, ground truth SHAPE signal, and noise associated with nucleotide i, respectively. This equation implies that we can model the noise for a SHAPE experiment as

n_{i} = ln \frac{r_{i}}{s_{i}}

(12)

Numerically, this equation has some problems: the noise is poorly defined when the reactivity is zero, which commonly occurs in SHAPE data, and the ground truth signal cannot be defined from a single experiment. To overcome these issues, we used the processed reactivity to estimate the noise from a numerically stable noise estimator

n_{i} = ln (\frac{{\bar{r}}_{i}}{η - \bar{r}} + 1.0)

(13)

where $\bar{r}$ is the average reactivity for a given SHAPE profile and η ≈ 12.4 is a fit constant. The large size of η drives down the amount of noise we use, but our noise-adjusted PCs are still substantially improved. Due to their increased flexibility, RNA molecules with higher average SHAPE reactivity will incur more noise than highly structured sequences. Adding 1.0 inside the logarithm sets the minimum of the noise function to 0.0, reached at zero reactivity, which keeps the estimator well defined where Eq. 12 is not.

In order to compare our predictions to the SHAPE data using this noisy paradigm, we minimized the difference according to the analytically exact function

\frac{d}{d f} \sum_{i} {(f \cdot p_{i} - {\bar{r}}_{i})}^{2} = 0 \Rightarrow f = \frac{\sum_{i} p_{i} \cdot {\bar{r}}_{i}}{\sum_{i} p_{i}^{2}}

(14)

where p_i is the predicted reactivity for nucleotide i (see Eq. 9), ${\bar{r}}_{i}$ is the reweighted experimental SHAPE reactivity for nucleotide i (see Eq. 10), and f is a scaling factor for our prediction. We are free to scale our prediction by a constant scaling factor without affecting the PC, but the scaling factor allows us to include noise and calculate the noise-adjusted PC between our prediction and SHAPE.

In calculating the noise-adjusted PC between our predicted SHAPE profile and the SHAPE data, we use the noise factor to judge the prediction accuracy. If the prediction for a nucleotide falls into the estimated noise range set by the reactivity, we say the prediction is identical to the experimental SHAPE data. If the prediction is outside of this estimated error range, we shift our prediction closer to the experimental reactivity by n_i. For the purposes of calculating the noise-adjusted PC, we take our noise-shifted prediction to be

p_{i}^{noise} = \begin{cases} {\bar{r}}_{i} & {\bar{r}}_{i} - n_{i} \leq p_{i} \leq {\bar{r}}_{i} + n_{i} \\ p_{i} - n_{i} & p_{i} > {\bar{r}}_{i} + n_{i} \\ p_{i} + n_{i} & p_{i} < {\bar{r}}_{i} - n_{i} \end{cases}

Accounting for the noise in the experimental SHAPE reduces the sensitivity of our model to near-native fluctuations and improves the PC of the near-native ensemble without sacrificing any ability to discern between the near-native structures and non-native structures generated by CG simulations. Therefore, we find this procedure to be helpful in evaluating the performance of our predictive algorithm.
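The noise-adjusted comparison can be sketched in Python with NumPy as follows. Several choices here are assumptions of the sketch rather than details given in the text: the profile mean in Eq. 13 is taken over the processed reactivities, the prediction is multiplied by the scale factor f of Eq. 14 before the noise shift is applied, and all function names and toy numbers are hypothetical.

```python
import numpy as np

ETA = 12.4  # fit constant quoted in the text


def noise(r_bar, eta=ETA):
    """Eq. 13: n_i = ln(r_bar_i / (eta - mean(r_bar)) + 1)."""
    r_bar = np.asarray(r_bar, dtype=float)
    return np.log(r_bar / (eta - r_bar.mean()) + 1.0)


def scale_factor(p, r_bar):
    """Eq. 14: least-squares scale f = sum(p * r_bar) / sum(p ** 2)."""
    p = np.asarray(p, dtype=float)
    r_bar = np.asarray(r_bar, dtype=float)
    return float(np.dot(p, r_bar) / np.dot(p, p))


def noise_shifted(p, r_bar, eta=ETA):
    """Snap predictions inside the noise band to r_bar; move the rest closer by n_i."""
    p = np.asarray(p, dtype=float)
    r_bar = np.asarray(r_bar, dtype=float)
    n = noise(r_bar, eta)
    scaled = scale_factor(p, r_bar) * p  # assumption: scale before shifting
    shifted = np.where(scaled > r_bar, scaled - n, scaled + n)
    return np.where(np.abs(scaled - r_bar) <= n, r_bar, shifted)


def noise_adjusted_pc(p, r_bar, eta=ETA):
    """Pearson correlation between the noise-shifted prediction and the data."""
    return float(np.corrcoef(noise_shifted(p, r_bar, eta), r_bar)[0, 1])


# Toy profile, invented for illustration only.
toy_r = [0.0, 0.5, 1.0, 2.0]
toy_p = [0.1, 0.4, 1.2, 1.5]
plain_pc = float(np.corrcoef(toy_p, toy_r)[0, 1])
adjusted_pc = noise_adjusted_pc(toy_p, toy_r)
```

Because the noise band grows with reactivity, highly reactive nucleotides tolerate larger absolute prediction errors than unreactive ones, and a nucleotide with zero reactivity has zero tolerance.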

Training the model with simulated annealing

We greatly improved our training strategy and efficiency to optimize the ability of our algorithm to accurately select the near-native ensemble from a pool of decoy structures. The previous model (3DSSR) was trained with few parameters using less efficient random search and genetic algorithms to optimize the PC between the model prediction on the near-native ensemble and experimental SHAPE. This type of algorithm is inefficient as the parameter space grows, and it does not use information from the non-native structure ensemble to improve selectivity. To improve our selectivity and efficiency, we optimized a downhill simplex using simulated annealing.³⁵ Rather than optimizing the PC, we optimized the selectivity of the model by maximizing the Spearman rank correlation (SRC) between the Matthews correlation coefficient³⁶ applied to RNA tertiary structure—i.e. the interaction network fidelity (INF)³⁷—and the RNA-length weighted, noise-filtered PC

{PC}_{N} = PC \cdot \sqrt{N - 1}

(15)

where N is the number of nucleotides in an RNA sequence. The choice of this PC_N metric was inspired by the t-statistic. Our goal is to optimize the model on all of the available structures, but obtaining higher PCs on shorter sequences is easier, which biases the performance in favor of shorter RNA. Weighting the PC by the RNA length encourages the model to optimize the PC for structures of varying length, rather than focusing on optimizing the shorter structures. In addition, our previous model did not use information from non-native decoys to improve the discerning power of the algorithm. This training method enables us to improve the association between highly scored structures and near-native structures, as measured by the SRC, while putting reactivity data from each nucleotide on the same plane of importance, regardless of RNA length. Furthermore, rather than using all of the generated data (2,600 decoy structures for each RNA molecule), we performed random sampling of each generated structural ensemble to curate a subset of 90 structures from each of the 8 simulations (2 all-atom simulations, 6 CG simulations) that we ran for each structure. While improving our computational efficiency, this also allows us to more uniformly sample the possible conformational space, rather than heavily weighting the near-native ensembles and skewing the SRC.
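A small Python sketch of the selectivity metric follows. It uses the plain Pearson correlation in place of the noise-filtered PC for brevity, computes the Spearman rank correlation as the Pearson correlation of ranks without tie handling, and uses made-up INF values; all of these are simplifications assumed here, not the training code itself.

```python
import numpy as np


def length_weighted_pc(pred, shape):
    """Eq. 15: PC_N = PC * sqrt(N - 1), with N the number of nucleotides."""
    pc = np.corrcoef(pred, shape)[0, 1]
    return float(pc * np.sqrt(len(shape) - 1))


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks (ties not handled)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Hypothetical decoy scores: INF against the native structure and PC_N per decoy.
toy_inf = [0.9, 0.5, 0.7]
toy_pc_n = [10.0, 2.0, 3.0]
selectivity = spearman(toy_inf, toy_pc_n)
```

A selectivity of 1 means that ranking decoys by PC_N reproduces their ranking by INF exactly, which is the association the training procedure tries to strengthen.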

We use the leave-one-out method to train the parameters, ensuring that our parameter set is generalizable and not overfit. Performing 200 steps at each sampled temperature, the simulated annealing temperature schedule followed

T_{n} = T_{n - 1} (1 - ϵ)

(16)

where ϵ = 0.1 and the initial temperature T₀ = 2.0. Note that this is not a physical temperature but a temperature relative to the SRC score that serves as an energy-like function to optimize. Termination for a training run occurred when the fractional convergence tolerance was calculated to be under 0.001. Simulated annealing runs typically converged after 15,000 steps. Descriptions of fit parameters, values for each training run, and average values can be found in Tables S2, S9, and S10.
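The geometric cooling schedule of Eq. 16 can be written as a short Python generator. The default of 75 temperature stages is an inference made here from the quoted figures (about 15,000 steps at 200 steps per temperature), not a number stated in the text, and the sketch covers only the schedule, not the downhill simplex moves.

```python
def temperature_schedule(t0=2.0, eps=0.1, n_stages=75):
    """Yield T_n = T_{n-1} * (1 - eps), starting from T_0 = t0 (Eq. 16)."""
    t = t0
    for _ in range(n_stages):
        yield t
        t *= 1.0 - eps


STEPS_PER_TEMPERATURE = 200  # from the text
temperatures = list(temperature_schedule())
total_steps = STEPS_PER_TEMPERATURE * len(temperatures)
```

Under these assumptions the temperature falls by more than three orders of magnitude over a typical run, so late-stage moves are accepted almost only when they improve the SRC score.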

Sequence-dependent bias in SHAPE experiments

While only 17 structures had both SHAPE data and relatively short 3D benchmark structures available (large structures are more difficult to simulate), we wanted to assess other available SHAPE data to identify sequence-dependent bias in SHAPE experiments. To this end, we curated a set of 71 non-redundant sequences with SHAPE data (see Tables S11–S13) from the RMDB³⁰ and other work.^{8, 13–15} Including sequence-dependent bias cannot improve the selectivity of the model (non-native decoys have the same sequence as the native structure), but it may lead to insights concerning the SHAPE mechanism. Previous work pointed out some bias in SHAPE experiments,²⁰ but it identified bias in the inherent reactivity of different nucleotides rather than pairwise sequence-dependent bias in SHAPE data. Here, we use a statistical potential to determine whether significant sequence-dependent bias can be found in SHAPE experiments and assign an energy-like score to these biases.

From the experimental data, we can calculate the bias towards being SHAPE-reactive as

B (ξ, a_{i}) = \frac{f_{SHAPE} (ξ, a_{i})}{f (ξ, a_{i})}

(17)

where ξ is a cutoff for the scaled SHAPE reactivity (only nucleotides with SHAPE > ξ are counted) and the reference state f(ξ, a_i) is the fraction of nucleotides of each type a_i ∈ {A, U, G, C} present above the cutoff. The reference state is calculated using

f (ξ, a_{i}) = \frac{n (ξ, a_{i})}{N (ξ)}

(18)

where N(ξ) is the total number of nucleotides having SHAPE reactivity above ξ and n(ξ, a_i) is the number of nucleotides of type a_i having SHAPE reactivity above ξ. This reference state removes the bias due to the prevalence of a nucleotide type in a specified reactivity range. The fractional cumulative SHAPE reactivity of a nucleotide type a_i is expressed as

f_{SHAPE} (ξ, a_{i}) = \frac{\sum_{X > ξ} X (a_{i})}{\sum_{X > ξ} \sum_{i} X (a_{i})}

(19)

where X(a_i) is the set of SHAPE reactivities for nucleotides of type a_i. In the numerator, we sum up all the SHAPE reactivities above the cutoff for a nucleotide of type a_i, and in the denominator, we sum up all the SHAPE reactivities above the cutoff for nucleotides of all types. Scaling the SHAPE profile for each structure by its respective maximum in this calculation, the cumulation is calculated from ξ to 1. Reflecting the relative reactivity bias, an energy-like score can be calculated from this bias using

E_{bias} (ξ, a_{i}) = - k_{B} T \ln B (ξ, a_{i})

(20)

which can be directly compared to the relative stability of the SHAPE ligand in the binding pocket. Applying the reactivity cutoff to the NOI yields some results of interest, but more interesting results arise from looking at how the SHAPE reactivity of the NOI changes with increasing reactivity of neighboring nucleotides. We can extend this bias estimator to calculate the pairwise bias by replacing a_i with 5’-XY-3’ or 5’-YZ-3’ pairs. If the neighboring nucleotide reactivity meets the cutoff, the NOI (Y) SHAPE reactivity is added into the calculation. This gives us a way to evaluate the impact of increasingly reactive neighbors of a specific type on the SHAPE reactivity of the NOI, and this method reveals sequence-dependent pairwise bias in experimental SHAPE data.
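
As an illustration of Eqs. 17–20, the single-nucleotide bias can be sketched in a few lines of Python. This is a minimal sketch and not the code used in this work: the function name `shape_bias`, the input layout (a list of sequence/reactivity pairs), and the temperature of 298.15 K are assumptions made for the example.

```python
import math
from collections import defaultdict

K_B = 0.0019872041      # Boltzmann constant in kcal/(mol K)
TEMPERATURE = 298.15    # assumed temperature in K (not stated in the text)
K_B_T = K_B * TEMPERATURE


def shape_bias(profiles, xi):
    """Return {nucleotide type: (B, E_bias)} following Eqs. 17-20.

    profiles: list of (sequence, reactivities) pairs (hypothetical layout);
    a reactivity of None marks missing data. Each profile is scaled by its
    own maximum, so scaled reactivities run from 0 to 1.
    """
    count = defaultdict(int)          # n(xi, a_i)
    react_sum = defaultdict(float)    # sum of X(a_i) over X > xi
    for seq, react in profiles:
        peak = max(r for r in react if r is not None)
        for nt, r in zip(seq, react):
            if r is None:
                continue
            x = r / peak
            if x > xi:
                count[nt] += 1
                react_sum[nt] += x

    n_total = sum(count.values())     # N(xi)
    x_total = sum(react_sum.values())
    result = {}
    for nt in count:
        f_ref = count[nt] / n_total          # Eq. 18
        f_shape = react_sum[nt] / x_total    # Eq. 19
        bias = f_shape / f_ref               # Eq. 17
        result[nt] = (bias, -K_B_T * math.log(bias))  # Eq. 20
    return result


# Toy profile: C falls below the cutoff and drops out of the statistics.
example = shape_bias([("AUGC", [1.0, 0.5, 0.5, 0.05])], xi=0.1)
```

Written this way, B is the mean scaled reactivity of type a_i above the cutoff divided by the mean over all types above the cutoff, so B > 1 (negative E_bias) marks a nucleotide type that is more reactive than average.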

REUS to find mechanism for sequence-dependent bias

The sequence-dependent biases in SHAPE reactivity can be largely explained by analyzing the stability of the unreacted 1M7 ligand in the binding pocket between the NOI and its downstream, 3’ neighbor. The more stable the 1M7 ligand is in the binding pocket, the more opportunity it has to react. The 5’ neighbor is also thought to affect the stability of this binding pocket through interactions with the NOI.

To confirm that this mechanism drives the bias seen in SHAPE experiments, we built model 5’UXYZU3’ RNA 5-mers, where the XYZ sequences are the exhaustive set of A, U, G, C nucleotide combinations and terminal uracils (U) were attached to remove bias due to terminal nucleotide force field parameters in simulations. Simulation parameter and topology files for the 1M7 ligand were created using the Antechamber module in Amber18.³² The 64 single-stranded 5-mers with XYZ sequences were created by producing A-form duplex structures in the Nucleic Acid Builder (NAB) module in Amber18 and deleting the extraneous strand from the duplex. For each structure, we placed the reactive carbon of the 1M7 ligand near the 2’-hydroxyl of nucleotide Y to approximate the SHAPE-reactive conformation (see Fig. S1). In tleap, the ligand/5-mer complex was embedded in a 15.0 Å truncated octahedral water box with 80 sodium ions and neutralizing chloride counterions. Minimization and equilibration were performed as detailed before: five steps were used to minimize, warm, and pressurize the system under the RNA.OL3 and GAFF force fields. After the initial equilibration, 12 replicas were created and equilibrated under their respective umbrella potentials for 100 ps. The centers of the potentials were uniformly spaced from 2.9 to 10.6 Å (0.7 Å apart) along the reaction coordinate, defined as the distance between the 2’-oxygen and the reactive 1M7 carbon, with 5.5 kcal/mol spring constants. After each replica reached equilibrium, replica exchange umbrella sampling (REUS) was performed for 5 ns per replica (60 ns total for each 5-mer) with exchanges attempted every 50 steps (0.1 ps).
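
The bookkeeping of this setup can be checked with a short sketch. It only enumerates the sequences and umbrella centers described above; the variable names are illustrative, and nothing here calls Amber.

```python
from itertools import product

NUCLEOTIDES = "AUGC"
N_REPLICAS = 12
FIRST_CENTER, LAST_CENTER = 2.9, 10.6   # umbrella centers in angstroms
NS_PER_REPLICA = 5.0                    # production REUS time per replica
EXCHANGE_INTERVAL_PS = 0.1              # one exchange attempt every 50 steps

# 5'-U-XYZ-U-3' pentamers for every XYZ combination (4^3 = 64 sequences)
pentamers = ["U" + "".join(xyz) + "U" for xyz in product(NUCLEOTIDES, repeat=3)]

# Uniformly spaced umbrella centers along the 2'-O to 1M7-carbon distance
spacing = (LAST_CENTER - FIRST_CENTER) / (N_REPLICAS - 1)
centers = [round(FIRST_CENTER + i * spacing, 3) for i in range(N_REPLICAS)]

# Sampling totals implied by the protocol
total_ns_per_pentamer = N_REPLICAS * NS_PER_REPLICA
exchange_attempts = round(NS_PER_REPLICA * 1000.0 / EXCHANGE_INTERVAL_PS)
```

The number of exchange attempts per replica matches the number of distance data points later extracted from each replica, since the distance is recorded at every exchange step.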

Initial testing indicated that normal umbrella sampling simulations (without replica exchange) were more susceptible to kinetic trapping, with the system tending to get trapped between the Z and 3’U nucleotides for longer umbrella biasing distances (see Figs. S2–S3). We found that REUS tends to reduce this kinetic trapping, enabling better sampling on the 5’ side of nucleotide Y (between X and Y) and giving us better sampling of the transition state, in addition to other advantages. The exchange acceptance rate was above 0.12 for all sampling windows, indicating sufficient umbrella overlap. Adequate umbrella overlap was also visually confirmed by viewing the overlap of the umbrella histograms (see Fig. S4). The Amber18 implementation of REUS also allows generation of high-frequency distance data because the distance is stored at every exchange step, avoiding storage of huge numbers of frames in the trajectory while retaining a large amount of distance data. We used the pymbar Python package to subsample the data, removing correlated data using the autocorrelation time for the distance reaction coordinate, and to generate reliable potentials of mean force (PMFs) from the 50,000 data points extracted from each replica. Although much of the data is discarded by pymbar, the subsampling allows us to retain more data from umbrellas with faster autocorrelation times, maximizing the efficiency of REUS and the accuracy of our PMFs. Each umbrella yielded at least 200 uncorrelated data points, with most umbrellas yielding over 1000 data points.
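
The idea behind subsampling by autocorrelation time can be illustrated with a simplified, standard-library stand-in. This is not the pymbar routine: the estimator below, its truncation rule (stop at the first non-positive autocorrelation), and the synthetic distance series are assumptions chosen to keep the sketch short.

```python
import math
import random


def statistical_inefficiency(series):
    """Estimate g = 1 + 2 * sum_t (1 - t/N) * C(t), where C(t) is the
    normalized autocorrelation; the sum stops at the first C(t) <= 0."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev) / n
    if var == 0.0:
        return 1.0
    g = 1.0
    for t in range(1, n - 1):
        c = sum(dev[i] * dev[i + t] for i in range(n - t)) / ((n - t) * var)
        if c <= 0.0:
            break
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)


def subsample(series):
    """Keep roughly one sample per correlation time."""
    stride = math.ceil(statistical_inefficiency(series))
    return series[::stride]


# Synthetic correlated series standing in for one replica's distance data
rng = random.Random(0)
distances = [6.0]
for _ in range(1999):
    distances.append(6.0 + 0.9 * (distances[-1] - 6.0) + rng.gauss(0.0, 0.1))
uncorrelated = subsample(distances)
```

A window whose distance decorrelates quickly has a small g and therefore keeps more of its points, which is why umbrellas with faster autocorrelation times contribute more data to the PMF.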

We calculated free energy changes from the 64 generated PMFs. First, the partition function for a state (unbound or bound) was calculated by summing over the relevant portion of the PMF, such that

Z_{a \leftrightarrow b} = \sum_{i = a}^{b} e^{- U_{i} / k_{B} T}

(21)

where a and b form the minimum and maximum for the range of the state, respectively, and U_i is the PMF energy at the i-th point of the reaction coordinate. To calculate the free energy for a state, we use G = −k_BT ln Z. Then, the free energy change for traveling from the unbound state to the bound state can be calculated for some XYZ 5-mer as

Δ G_{XYZ} = G_{rx} - G_{ub}

(22)

where G_rx is the free energy of the system when the ligand is bound in the reactive site and G_ub is the free energy of the system when the ligand is free and unbound. This estimation of the absolute free energy is only accurate up to a constant. To give meaning to this number, we provide a reference state—the average of our free energy changes for all XYZ (ΔG_ave; see Fig. S5)—that is in alignment with the spirit of the SHAPE bias analysis to calculate the relative free energy change for binding to the SHAPE-reactive pocket

Δ Δ G_{XYZ} = Δ G_{XYZ} - Δ G_{ave}

(23)

The value of this relative free energy can be compared to the free energy estimate extracted from the SHAPE sequence-bias calculations. However, we do not have enough SHAPE data to spread across all the three-nucleotide combinations when using the statistical potential from experimental data. In order to make pairwise and single-nucleotide comparisons, we average the relative free energies. More details for these calculations can be found in the SM (Section 4).
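
Equations 21–23 can be sketched as follows. This is a minimal sketch under stated assumptions: the PMF is taken as a list of (distance, energy) points, the bound and unbound distance ranges in the example are hypothetical placeholders (the actual ranges are not given here), and the temperature of 298.15 K is assumed.

```python
import math

K_B_T = 0.0019872041 * 298.15   # kcal/mol at an assumed 298.15 K


def state_free_energy(pmf, lo, hi):
    """G = -kT ln Z, with Z summed over PMF points lo <= r <= hi (Eq. 21)."""
    z = sum(math.exp(-u / K_B_T) for r, u in pmf if lo <= r <= hi)
    return -K_B_T * math.log(z)


def binding_free_energy(pmf, bound_range, unbound_range):
    """Delta G_XYZ = G_rx - G_ub (Eq. 22)."""
    g_rx = state_free_energy(pmf, *bound_range)
    g_ub = state_free_energy(pmf, *unbound_range)
    return g_rx - g_ub


def relative_free_energies(delta_g):
    """Delta Delta G_XYZ = Delta G_XYZ - Delta G_ave (Eq. 23)."""
    ave = sum(delta_g.values()) / len(delta_g)
    return {xyz: g - ave for xyz, g in delta_g.items()}


# Toy PMF: two points in a bound well 1 kcal/mol below two unbound points.
toy_pmf = [(3.0, -1.0), (3.5, -1.0), (9.0, 0.0), (9.5, 0.0)]
toy_dg = binding_free_energy(toy_pmf, (2.5, 4.0), (8.5, 10.0))
toy_ddg = relative_free_energies({"AAA": toy_dg, "CCC": 0.0})
```

Because each PMF is defined only up to an additive constant, that constant cancels in the difference G_rx − G_ub, and subtracting the average over all XYZ places every 5-mer on a common relative scale.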

Results

Enhanced performance

By modeling SHAPE as an energy-driven reaction affected by some local structural features, we have substantially improved selectivity and PC in comparison to the previous model (3DSSR).¹⁸ To compare the new model with the 3DSSR model, we used 3DSSR to predict the SHAPE reactivity for the same decoys used in validation. The new model has 38 physical parameters, and the physical description of the model makes it directly generalizable to new structures (see Table S2). Results produced here are for the left-out structure of each training run. The trained parameters are highly stable, converging to nearly the same values for all of the leave-one-out runs, indicating that we have not overfit the model (see Tables S9–S10). The Boltzmann-inspired prediction function gives us a way to directly evaluate the impact of various structure features and energy terms that drive SHAPE reactivity. Three performance metrics were used to determine the selectivity of the near-native ensembles and the agreement with experimental data (see Fig. 2 and Table S14). The SRC between the INF (an objective measure of structural similarity to the native structure) and the weighted Pearson correlation measures the selectivity for the near-native ensembles. On average, we have improved the SRC performance in comparison to the original 3DSSR model by 10%. The average PC of the near-native ensemble and the PC of the top-ranked structure show the agreement of our model with experimental SHAPE data. The most significant advance of this model may be the 30% improvement in average PC of the near-native ensemble: this indicates that the new model is much less sensitive to small fluctuations away from the native structure, while the improved SRC shows the improved selectivity of the model (i.e., higher sensitivity to clearly non-native structures). The PC gains seen in the near-native ensembles and top-ranked structures emerged through optimization of SRC in training, indicating the stability of our model training across multiple measures of performance.
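
For readers unfamiliar with the selectivity metric, the SRC between per-decoy INF and PC values can be sketched with a plain Spearman rank correlation. The decoy values below are made up for illustration, and the PC values are taken as given inputs; the weighted Pearson correlation itself is defined in the Methods and is not reproduced here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain (unweighted) Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical decoys: INF to the native structure and PC to SHAPE data
decoy_inf = [0.50, 0.70, 0.90, 0.95]
decoy_pc = [0.20, 0.40, 0.60, 0.90]
selectivity = spearman(decoy_inf, decoy_pc)
```

An SRC near 1 means that decoys more similar to the native structure are consistently scored as more compatible with the SHAPE data, which is the behavior a sieving model needs.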

Figure 2: — Bar chart showing performance differences. For each validation structure in leave-one-out training, the difference between the new model and 3DSSR model performance is shown. Near-native (top-ranked) structure PC performance differences are denoted ‘NN’ (‘Top’). Average performance comparisons are at the top of the chart, denoted ‘Ave’. Positive differences indicate higher performance by the new model. Absolute performance comparisons can be found in Table S14.

Identifying tertiary contacts with SHAPE

Although SHAPE has been traditionally considered a 2D technology, the RNA molecules it probes live in a 3D environment. To extract all of the meaningful data from SHAPE experiments, we should discard the idea that SHAPE is only useful in secondary structure prediction. To illustrate this point, we generated near-native 3D ensembles and non-native 3D ensembles with native and non-native 2D structures. In our longest RNA, 3DIG (see Fig. 1), the near-native structures occupy the highest PC range; the native 2D, non-native 3D structures occupy a middle regime, having higher PC than the non-native 2D structures and lower PC than the near-native structures; the non-native 2D structures perform the worst (see Fig. 3). Especially in larger structures, where the conformational space between these ensembles is more separated, the spectrum of PCs leaves no doubt that useful 3D structure information is contained in SHAPE data. For short structures without many tertiary contacts, the effect of 3D structure on the SHAPE data is not as pronounced, but even 1P5O (77 nucleotides long) has a spectrum, showing that 3D structure information is clearly contained in the SHAPE data.

Figure 3: — Plot showing trends between PC, INF, and RMSD for the longest validation structure, 3DIG. Each of the 90 points in magenta (red) represents a randomly selected near-native decoy from an all-atom simulation with base pairing (only backbone) restraints. Each of the 90 (450) points in cyan (green) represents a randomly selected non-native 3D (2D) decoy from a CG simulation. As the INF (RMSD) increases (decreases), the PC between SHAPE data and our prediction increases. The spectrum of results shows how our model can use SHAPE to select near-native 3D structures and indicates that SHAPE contains important tertiary structure information, especially for long structures. Shadows are projected onto surfaces for 2D visualization.

The 1M7 stability in the binding pocket largely explains bias

Although SHAPE reactivity is mostly independent of sequence identity, there are some small, sequence-dependent biases in SHAPE data. More SHAPE profiles are available with sequence information than 3D structure information, so information contained in 1D structure-SHAPE relationships might help improve our understanding of the SHAPE mechanism. Previous work has focused on the inherent reactivity of SHAPE with specific nucleotide types.²⁰ For instance, SHAPE is generally less reactive towards cytosine (C) than other nucleotide types. However, our bias calculation detailed in the Methods reveals that the sequence bias of SHAPE is deeper than simple reactivity preference. The main results from these calculations are shown in Fig. 4, where SHAPE sequence-dependent bias is shown to be directly correlated to the stability of the unreacted 1M7 ligand in the binding pocket of the XYZ 5-mers. While inherent reactivity differences can partially explain the SHAPE bias, the majority of this bias is driven by the stability of the SHAPE reagent in the binding pocket between the NOI and the downstream 3’ nucleotide. The upstream 5’ nucleotide can also affect the stability of this pocket. Calculated free energies are provided in Tables S15–S17.

Figure 4: — SHAPE bias energy score compared with stability of unreacted ligand in binding pocket. The SHAPE bias energy score (y-axis) is directly correlated to the free energy change relative to the average calculated from REUS simulations (x-axis). The SHAPE reactivity cutoff used for the bias calculation was ξ = 0.1. Blue (red) data points relate the pairwise bias to the REUS-calculated free energy change, influenced by the previous (next) nucleotide type, X (Z), and the NOI, Y. Inherent low reactivity of SHAPE to C is accounted for by a 0.4 kcal/mol penalty in the REUS calculation.

We found that these biases are too small to justify explicit inclusion in our new prediction model, especially considering that large amounts of sequence-dependence and polarity bias are already included in the interaction energy (IE) calculation. For instance, we found that our stacking energy term for C in the 3’ position is too strong to reflect only the relative stacking energy (see $E_{3^{'}}^{(C)}$ in Table S10); however, this energy parameter is explained by the tendency of C to be less reactive toward SHAPE²⁰ and to suppress the reactivity of its neighbors. While we did not directly include information from this sequence-dependent SHAPE bias calculation in our modeling, we did find evidence that most of the sequence bias in SHAPE reactivity is driven by the stability of the unreacted 1M7 ligand in the binding pocket, a mechanism that extends our understanding of SHAPE. To account for the inherent reactivity difference, we penalize nucleotides with C as the NOI (Y in XYZ 5-mer) by 0.4 kcal/mol. Otherwise, the relative free energies shown on the x-axis of Fig. 4 were directly calculated from the REUS-generated PMFs.
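
The reduction from 64 triplet free energies to the pairwise values compared in Fig. 4 can be sketched as below. Two choices here are assumptions for illustration only: the penalty is applied as a positive (less favorable) shift, and the pairwise values are unweighted means over the remaining position; the actual averaging is detailed in the SM (Section 4).

```python
from itertools import product

C_PENALTY = 0.4   # kcal/mol, applied when the NOI (Y) is cytosine


def pairwise_averages(ddg, c_penalty=C_PENALTY):
    """Average triplet relative free energies into XY and YZ pair values.

    ddg: {"XYZ": relative free energy in kcal/mol} for the 64 triplets.
    Returns {("XY", pair): mean} and {("YZ", pair): mean} entries.
    """
    sums = {}
    for xyz, g in ddg.items():
        # Assumed sign convention: a positive shift makes binding less favorable.
        adjusted = g + (c_penalty if xyz[1] == "C" else 0.0)
        for key in (("XY", xyz[:2]), ("YZ", xyz[1:])):
            total, n = sums.get(key, (0.0, 0))
            sums[key] = (total + adjusted, n + 1)
    return {key: total / n for key, (total, n) in sums.items()}


# Toy input: every triplet at 0.0 so only the C penalty is visible.
flat = {"".join(t): 0.0 for t in product("AUGC", repeat=3)}
pairs = pairwise_averages(flat)
```

With this layout, each XY value averages over the four possible Z nucleotides and each YZ value over the four possible X nucleotides, giving 16 pairs of each kind.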

Discussion

Improved performance

By incorporating an expanded perspective on SHAPE technology into our modeling, we have substantially improved the performance in terms of selectivity of the near-native structure ensemble and correlation with experimental data. The log-normality of SHAPE data and noise indicates that the SHAPE reactivity profile is a reflection of the underlying energetics driving SHAPE reactivity. Using an explicit energy-based form of the prediction function via a Boltzmann transformation has improved our selectivity as measured by SRC by 10% and the PC of the near-native ensemble to SHAPE by 30% (21% improvement for the top-ranked structure). The PC gains are also due to incorporation of a small log-normal noise term that is in accordance with previous work, and this inclusion of a small amount of noise does not hamper the selectivity of the model. We also account for the autocorrelation tendencies in SHAPE data by incorporating structural and reactivity influences of neighboring nucleotides on the prediction and reactivity of the NOI. Furthermore, we better capture the sequence and directionality-dependence of SHAPE data by increasing the sophistication of our stacking calculation. Our new model is less sensitive to small fluctuations away from the native structure but more sensitive to larger 2D and 3D conformational differences than the previous model, which is desirable.

Determining sequence-dependent reactivity bias mechanism

Although there is a lack of easily analyzed 3D structures with SHAPE data, there are more sequences with SHAPE data. This provides an opportunity to evaluate sequence-dependent reactivity biases. Using a statistical potential, we found that identity of neighboring nucleotides can have a notable influence on the relative reactivity of the NOI. Performing REUS simulations, we found that most of the reactivity bias is due to the relative stability of 1M7 in the binding pocket between the NOI and the 3’ downstream nucleotide (at the 2’-hydroxyl of the NOI). When 1M7 is more stable in the binding pocket, it will remain for longer and have more opportunity to react with the NOI. In agreement with previous work, we also saw that C is inherently less reactive to SHAPE, which we accounted for to more clearly show the relationship between the sequence-dependent bias and the stability of 1M7 in the binding pocket.

Detailing experimental artifacts hampering performance

All of the validation structures with below-average performance can be explained by investigating the inconsistencies between the prediction and experimental SHAPE. The major inconsistencies appear to be mainly due to crystallization or experimental SHAPE artifacts. Two of our validation structures have a best PC < 0.7, and six other structures have a best PC between 0.7 and 0.8. Here, we detail the experimental artifacts that appear to cause the inconsistencies for these structures.

Interactions can form in crystalline structures that are not present when RNA molecules are placed in solution. For instance, the 158 nucleotide RNA with PDBID 1GID (best snap PC = 0.63) forms a dimer in the crystal structure that is likely not present during SHAPE experiments. During the crystallization process, several nucleotides may swing out of stable stacking or base pairing conformations to make intermolecular contacts with the sister molecule. The potential structure rearrangement that would occur in solvent largely explains our poor performance on this structure, where we simply delete the extra RNA in the dimer before simulating the structure and do not account for these intermolecular interactions. Without knowing the true structure in the solution environment, it is not possible to point out every discrepancy between the crystal structure and the solution structure. However, we can point out the obvious nucleotides that form interactions in the crystal structure, and we can see that a local conformational change would explain most of the large discrepancies between our prediction and the SHAPE data. Nucleotides 152–154 in this structure all form intermolecular contacts in the dimer, and we substantially overpredict the SHAPE reactivity for all of these. Additional missed predictions for nucleotides 2, 3, and 65–71 are likely due to these effects. Some base pairing partners we think are broken in the crystallization conformational change are overpredicted, while interactions in the 5’ tail region are underpredicted, indicating a significant conformational change may occur when 1GID is placed in solution. Longer range effects of this conformational rearrangement are also suspected, but we only analyze the direct effects here.

The other RNA with a best snap PC < 0.7 was 2K95 (best snap PC = 0.64). This structure appears to have the ability to sample an alternative 2D structure in SHAPE experiments. Some nucleotides that appear to form base pairs in the PDB are reactive, and neighboring those nucleotides, unpaired nucleotides are unreactive to SHAPE. For example, nucleotides 39–40 are base paired and underpredicted, while nucleotide 41 is unpaired and overpredicted. Similarly, nucleotides 27 and 18 are paired in the structure but reactive to SHAPE. Since 2K95 is only 47 nucleotides long, a few of these 2D structure inconsistencies are hugely detrimental to the prediction performance for this structure. Without allowing for fuzziness in the SHAPE data (e.g., by allowing frameshifts in the SHAPE signal by ± 1 nucleotide, which leads to degraded selectivity performance), this kind of effect is difficult to fix without substantially altering the 2D and 3D structure of the molecule. Indeed, if we permit these frameshifts, we can obtain a PC above 0.8 for this structure. Since we are not comfortable with sacrificing our selectivity or tweaking the benchmark structure to better match the SHAPE data, we accept the below-average performance for this structure.

For the other six validation structures with below-average performance, a variety of explanations can be found. While performance for 3MXH (best PC = 0.73) has been substantially improved in comparison to 3DSSR (best PC = 0.57) by accounting for ligand interactions, it remains below average due to protein-RNA interactions in the crystal structure. The SHAPE experiments were probably performed in the absence of the protein, leading to the 10-nucleotide loop forming a more stable, rigid structure. Even if the protein was present in the SHAPE experiments, the model does not consider RNA-protein interactions, leading to below-average performance for 3MXH. In the crystal structure, the loop is extended due to these intermolecular interactions, and we do not permit the loop to sample more stable conformations in the simulations. Therefore, our model overpredicts the reactivity of these nucleotides. Although 1NBS (best PC = 0.75) is improved in comparison to the old model (best PC = 0.57), performance on this structure remains below average because we do not restrain any stacking interactions in the simulations, and the structure is relatively long (149 nucleotides). Restraining base stacking for several nucleotides could improve the performance on the near-native ensemble, but we do not think that is necessary given the decent correlation we already obtained for this long structure. The 4KQY riboswitch (best PC = 0.72) is susceptible to conformational switching in SHAPE that we cannot easily account for by viewing the PDB structure. Several base pairing interactions are absent in the crystal structure that would improve the performance for this structure if enforced in simulations. However, as we mentioned, adjusting the crystal structure to better match the SHAPE data is not a tenable approach for us. Two of these underperforming structures, 1C2X (best PC = 0.71) and 1P5O (best PC = 0.7), have noisy SHAPE data, which may indicate that the structures are better represented by a broader structural ensemble in solution, rather than the tightly constrained near-native ensemble we provide to the model. In order for the 3D PDB structure to match the SHAPE data, large regions of rigidity are helpful, which 1C2X lacks. On the other hand, the 1P5O 3D structure has several base pairing regions that are reactive to SHAPE, indicating that fraying may be a problem in the SHAPE experiment for this structure. The 174-nucleotide structure 3DIG (best PC = 0.74) is the longest structure in our dataset, and the below-average PC performance for this structure is due to the large size.

CONCLUSION

By including autocorrelation effects, log-normal noise, and sequence-dependence in the energy-based SHAPE prediction, we substantially improved our ability to predict SHAPE data from RNA 3D structure. Furthermore, we showed that stability of the unreacted SHAPE reagent in the binding pocket contributes to sequence-dependent reactivity bias in SHAPE experiments. While SHAPE has single-nucleotide resolution, it also has clear autocorrelation effects relating to the character and rigidity of neighboring nucleotides that were not previously considered. The influence of neighboring nucleotides is a small but significant aspect of SHAPE reactivity. Constraints on neighboring nucleotides reduce the conformational propensity and the SHAPE reactivity of the NOI. The log-normality of SHAPE data indicates that SHAPE reactivity profiles may be viewed as a direct expression of the energetics of the underlying reaction mechanism. Accounting for noise in SHAPE experiments leads to improved performance by removing sensitivity to small fluctuations from the native structure that should be present in solution experiments. To explore the mechanism driving sequence-dependent bias in SHAPE experiments, we performed exhaustive REUS simulations on RNA 5-mers, and free energies extracted from these simulations trend with statistical energy estimates of sequence-dependent reactivity bias, supporting the idea that bias in SHAPE reactivity is driven by the stability of the SHAPE reagent in the binding pocket formed between the NOI and the downstream 3’ neighbor.

Our model uses efficient SHAPE data to drive RNA 3D structure prediction. The concepts espoused in our model may be transferred to data from other chemical probing experiments to more efficiently predict 3D structures. The knowledge gap between RNA sequence and structure can be reduced by combining computational modeling with probing experiments. In the absence of experimental input, the rugged free-energy landscape of RNA makes predicting functional, native structures difficult. Our model works to overcome this difficulty by sieving SHAPE-compatible, near-native 3D structure candidates from a pool of decoys, reducing the search space on the basis of efficient SHAPE data.

Supplementary Material

Suppl Info

NIHMS1673266-supplement-Suppl_Info.pdf (1.6 MB, PDF)

Acknowledgement

This research was supported by NSF Graduate Research Fellowship Program under Grant 1443129 to T.H. and NIH Grants R35-GM134919 and R01-GM117059 to S.-J.C.

Footnotes

Supporting Information Available

The following files are available free of charge.

supp.pdf: base pairing frequency table; training parameters; 2D structures; training results; sequences with SHAPE data; performance comparisons table; unreacted 1M7 ligand complexed with 5mer; example normal umbrella sampling PMFs; example REUS PMFs; REUS histogram overlap; average PMF; estimations of 1M7 ligand stability in reactive site; free energy calculations

References

(1).Miao Z; Westhof E RNA Structure: Advances and Assessment of 3D Structure Prediction. Annu. Rev. Biophys 2017, 46, 483–503. [DOI] [PubMed] [Google Scholar]
(2).The RNAcentral Consortium., RNAcentral: a hub of information for non-coding RNA sequences. Nucleic Acids Res. 2019, 47, D221–D229. [DOI] [PMC free article] [PubMed] [Google Scholar]
(3).Stagno JR; Yu P; Dyba MA; Wang Y-X; Liu Y Heavy-atom labeling of RNA by PLOR for de novo crystallographic phasing. PLoS One 2019, 14, 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
(4).Choudhary K; Deng F; Aviran S Comparative and integrative analysis of RNA structural profiling data: current practices and emerging questions. Quant. Biol 2017, 5, 3–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
(5).Merino EJ; Wilkinson KA; Coughlan JL; Weeks KM RNA Structure Analysis at Single Nucleotide Resolution by Selective 2âĂŸ-Hydroxyl Acylation and Primer Extension (SHAPE). J. Am. Chem. Soc 2005, 127, 4223–4231. [DOI] [PubMed] [Google Scholar]
(6).Wilkinson KA; Merino EJ; Weeks KM Selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat. Protoc 2006, 1, 1610–1616. [DOI] [PubMed] [Google Scholar]
(7).Lee B; Flynn RA; Kadina A; Guo JK; Kool ET; Chang HY Comparison of SHAPE reagents for mapping RNA structures inside living cells. RNA 2017, 23, 169–174. [DOI] [PMC free article] [PubMed] [Google Scholar]
(8).Gherghe CM; Shajani Z; Wilkinson KA; Varani G; Weeks KM Strong Correlation between SHAPE Chemistry and the Generalized NMR Order Parameter (S2) in RNA. J. Am. Chem. Soc 2008, 130, 12244–12245. [DOI] [PMC free article] [PubMed] [Google Scholar]
(9).McGinnis JL; Dunkle JA; Cate JHD; Weeks KM The Mechanisms of RNA SHAPE Chemistry. J. Am. Chem. Soc 2012, 134, 6617–6624. [DOI] [PMC free article] [PubMed] [Google Scholar]
(10).Deigan KE; Li TW; Mathews DH; Weeks KM Accurate SHAPE-directed RNA structure determination. Proc. Natl. Acad. Sci 2009, 106, 97–102. [DOI] [PMC free article] [PubMed] [Google Scholar]
(11).Low JT; Weeks KM SHAPE-directed RNA secondary structure prediction. Methods 2010, 52, 150–158. [DOI] [PMC free article] [PubMed] [Google Scholar]
(12).Kladwang W; VanLang CC; Cordero P; Das R Understanding the errors of SHAPE-directed RNA structure modeling. Biochemistry 2011, 50, 8049–8056. [DOI] [PMC free article] [PubMed] [Google Scholar]
(13).Hajdin CE; Bellaousov S; Huggins W; Leonard CW; Mathews DH; Weeks KM Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots. Proc. Natl. Acad. Sci. U. S. A 2013, 110, 5498–5503. [DOI] [PMC free article] [PubMed] [Google Scholar]
(14).Leonard CW; Hajdin CE; Karabiber F; Mathews DH; Favorov OV; Dokholyan NV; Weeks KM Principles for understanding the accuracy of SHAPE-directed RNA structure modeling. Biochemistry 2013, 52, 588–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
(15).Loughrey D; Watters KE; Settle AH; Lucks JB SHAPE-Seq 2.0: systematic optimization and extension of high-throughput chemical probing of RNA secondary structure with next generation sequencing. Nucleic Acids Res. 2014, 42, e165–e165. [DOI] [PMC free article] [PubMed] [Google Scholar]
(16).Spasic A; Assmann SM; Bevilacqua PC; Mathews DH Modeling RNA secondary structure folding ensembles using SHAPE mapping data. Nucleic Acids Res. 2017, 46, 314–323. [DOI] [PMC free article] [PubMed] [Google Scholar]
(17).Smola MJ; Weeks KM In-cell RNA structure probing with SHAPE-MaP. Nat. Protoc 2018, 13, 1181–1195. [DOI] [PMC free article] [PubMed] [Google Scholar]
(18).Hurst T; Xu X; Zhao P; Chen S-J Quantitative Understanding of SHAPE Mechanism from RNA Structure and Dynamics Analysis. J. Phys. Chem. B 2018, 122, 4771–4783. [DOI] [PMC free article] [PubMed] [Google Scholar]
(19).Hurst T; Zhou Y; Chen S-J Analytical modeling and deep learning approaches to estimating RNA SHAPE reactivity from 3D structure. Commun. Inf. Syst 2019, 19, 299–319. [Google Scholar]
(20).Weeks KM; Mauger DM Exploring RNA structural codes with SHAPE chemistry. Acc. Chem. Res 2011, 44, 1280–1291. [DOI] [PMC free article] [PubMed] [Google Scholar]
(21).Deng F; Ledda M; Vaziri S; Aviran S Data-directed RNA secondary structure prediction using probabilistic modeling. RNA 2016, 22, 1109–1119. [DOI] [PMC free article] [PubMed] [Google Scholar]
(22).Vaziri S; Koehl P; Aviran S Extracting information from RNA SHAPE data: Kalman filtering approach. PLoS One 2018, 13, e0207029. [DOI] [PMC free article] [PubMed] [Google Scholar]
(23).Coimbatore Narayanan B; Westbrook J; Ghosh S; Petrov AI; Sweeney B; Zirbel CL; Leontis NB; Berman HM The Nucleic Acid Database: new features and capabilities. Nucleic Acids Res. 2014, 42, D114–D122.
(24).Lu H; Skolnick J A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins Struct. Funct. Bioinforma 2001, 44, 223–232.
(25).Yang H; Jossinet F; Leontis N; Chen L; Westbrook J; Berman H; Westhof E Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 2003, 31, 3450–3460.
(26).Lu X-J; Bussemaker HJ; Olson WK DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res. 2015, 43, e142.
(27).Humphrey W; Dalke A; Schulten K VMD: Visual molecular dynamics. J. Mol. Graph 1996, 14, 33–38.
(28).Vicens Q; Gooding AR; Laederach A; Cech TR Local RNA structural changes induced by crystallization are revealed by SHAPE. RNA 2007, 13, 536–548.
(29).Frezza E; Courban A; Allouche D; Sargueil B; Pasquali S The interplay between molecular flexibility and RNA chemical probing reactivities analyzed at the nucleotide level via an extensive molecular dynamics study. Methods 2019, 162–163, 108–127.
(30).Cordero P; Lucks JB; Das R An RNA Mapping DataBase for curating RNA structure mapping experiments. Bioinformatics 2012, 28, 3006–3008.
(31).Stoddard CD; Montange RK; Hennelly SP; Rambo RP; Sanbonmatsu KY; Batey RT Free state conformational sampling of the SAM-I riboswitch aptamer domain. Structure 2010, 18, 787–797.
(32).Case D et al. AMBER 2018. University of California, San Francisco, 2018.
(33).Zuker M Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003, 31, 3406–3415.
(34).Plimpton S Fast Parallel Algorithms for Short-Range Molecular Dynamics. J. Comput. Phys 1995, 117, 1–19.
(35).Press WH; Teukolsky SA; Vetterling WT; Flannery BP Numerical Recipes: The Art of Scientific Computing, 3rd ed.; Cambridge University Press: New York, 2007; pp 549–554.
(36).Matthews BW Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta - Protein Struct 1975, 405, 442–451.
(37).Parisien M; Cruz JA; Westhof E; Major F New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA 2009, 15, 1875–1885.


Supplementary Materials

Supplementary Information: NIHMS1673266-supplement-Suppl_Info.pdf (1.6 MB, PDF)

