RNA. 2025 Jul;31(7):885–895. doi: 10.1261/rna.080176.124

Unraveling unbreakable hairpins: characterizing RNA secondary structures that are persistent after dinucleotide shuffling

Alyssa A Pratt 1,2, David A Hendrix 1,2
PMCID: PMC12170181  PMID: 40328470

Abstract

The sequence of nucleotides that make up an RNA determines its structure, which determines its function. The RNA hairpin, also known as a stem–loop, is a ubiquitous and fundamental feature of RNA secondary structure. A common method of randomizing an RNA sequence is dinucleotide shuffling with the Altschul–Erickson algorithm, which preserves the dinucleotide content of the sequence. This algorithm generates randomized sequences by sampling Eulerian paths through the de Bruijn graph representation of the original sequence. We identified a subset of RNA hairpins in the bpRNA-1m meta-database that always form hairpins after repeated application of dinucleotide shuffling. We investigated these “unbreakable hairpins” and found several common properties. First, we found that unbreakable hairpins had on average similar folding energies compared to other hairpins of similar lengths, although they frequently contained ultra-stable hairpin loops. We found that they tend to be split by purines and pyrimidines on opposite sides of the stem. Furthermore, we found that this specific sequence feature restricts the number of distinct Eulerian paths through their de Bruijn graph representation, resulting in a small number of distinguishable dinucleotide-shuffled sequences. Beyond this algorithmic means of identification, these distinct sequences may have biological significance because we found that a significant percentage occur in a specific location of 16S ribosomal RNAs.

Keywords: RNA hairpins, RNA secondary structure, bioinformatics, dinucleotide shuffling

INTRODUCTION

RNA has critical roles in translation, gene regulation, genetic sequence encoding, and scaffolding for molecular complexes. The structure of RNA directs its molecular function, and it is valuable to predict how mutations will disrupt structure. While RNA has a three-dimensional tertiary structure, it is often studied with a two-dimensional projection known as the secondary structure. Several approaches have been developed to predict RNA secondary structure including thermodynamic models (Lorenz et al. 2011), machine learning methods including conditional log-linear models (Do et al. 2008), deep neural networks (Chen et al. 2020; Saman Booy et al. 2022), and attention networks (Wang et al. 2020). RNA has several recognizable features that can be predicted by software. These predicted structures are often valuable for interpreting the function and interactions of RNA, and several tools have been produced to annotate three-dimensional RNA features from experimental sequences, including “Dissecting the Spatial Structure of RNA” (DSSR) and “Find RNA 3D” (FR3D) (Sarver et al. 2007; Lu et al. 2015). Methods have also been developed to automatically annotate secondary structural features including hairpin loops, multiloops, internal loops, bulges, and stems (Yang 2003; Danaee et al. 2018).

RNA hairpins, also known as stem–loops, are fundamental features because they are both common and essential, protect messenger RNAs, guide the molecule's tertiary structure, and serve as recognition sites for proteins (Svoboda and Di Cara 2006). RNA hairpins are characterized by a paired stem topped by an unpaired loop. Their diversity in stem length, loop length, and sequence allows for a high specificity in interactions with proteins, especially given their propensity to occur on most types of RNAs in different positions, each of which may serve a different function. For instance, although hairpin loops of length 4, tetraloops, are the most common in known structures, the number of tetraloops in hairpins of 16S-like ribosomal RNA is correlated across taxa with the amount of protein in the ribosome, which may contribute to ribosomal stability because of thermodynamic favorability of tetraloops and potential interactions with proteins (Wolters 1992).

When analyzing RNA, it is common to randomly permute or “shuffle” the RNA sequence, whereby the positions of nucleotides, dinucleotides, or k-mers are rearranged through various algorithms. Shuffled RNA is expected to be free of its original secondary structure, only resembling the original sequence in mononucleotide or dinucleotide frequencies, depending on the method used. Dinucleotide shuffling, often preferred for its ability to retain the original dinucleotide content, has been used as a minimally biased method to create control sequences in bioinformatics for over 40 years (Fitch 1983; Jiang et al. 2008). Clote et al. found that the dinucleotide-shuffled versions of functional RNA sequences had a higher predicted folding energy (negative energy being more stable) than the original sequence (with the exception of mRNA), validating the use of a comparison to shuffled sequences to identify functional features. Due to these findings and others, several software tools and analyses have been built around the use of dinucleotide shuffling as a control (Washietl and Hofacker 2007; Jiang et al. 2008). Because the free energy of a structure is mainly determined by base stacking, dinucleotide shuffling carried out via the Altschul–Erickson algorithm can prevent the control sequences from having a different proportion of thermodynamically favorable dinucleotides (Altschul and Erickson 1985). The Altschul–Erickson algorithm utilizes a de Bruijn graph representation of the RNA sequence. The de Bruijn graph represents the four bases of DNA or RNA as vertices, and each dinucleotide in a sequence is represented by a directed edge from its first nucleotide to its second (see Materials and Methods). The algorithm generates a shuffle by walking along these edges through a randomly selected Eulerian path (a path that visits each edge exactly once) from the first nucleotide in the sequence to the last.
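The random Eulerian walk described above can be sketched in a few lines of Python. This is an illustrative reimplementation of the general Altschul–Erickson idea, not the code used in this study; the function names `dinucleotide_shuffle` and `_reaches` are hypothetical. It assumes the usual formulation in which a random "last edge" is chosen for every vertex except the final nucleotide, the choice is rejected unless those last edges all lead to the final nucleotide, and the remaining edges are then randomly ordered.

```python
import random
from collections import Counter, defaultdict


def _reaches(v, target, last_edge):
    """Follow the chosen last edges from v; True if the walk ends at target."""
    seen = set()
    while v != target:
        if v in seen:          # ran into a cycle that avoids the target
            return False
        seen.add(v)
        v = last_edge[v]
    return True


def dinucleotide_shuffle(seq, rng=random):
    """Return a shuffle of seq with identical dinucleotide counts (sketch)."""
    if len(seq) < 3:
        return seq
    # edges[v] holds one successor per dinucleotide that starts with v.
    edges = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)
    first, last = seq[0], seq[-1]
    vertices = [v for v in edges if v != last]
    # Pick a last edge for each vertex until all of them lead to `last`.
    while True:
        last_edge = {v: rng.choice(edges[v]) for v in vertices}
        if all(_reaches(v, last, last_edge) for v in vertices):
            break
    # Randomly order the other edges, keeping each chosen last edge at the end.
    order = {}
    for v, successors in edges.items():
        successors = list(successors)
        if v in last_edge:
            successors.remove(last_edge[v])
            rng.shuffle(successors)
            successors.append(last_edge[v])
        else:
            rng.shuffle(successors)
        order[v] = successors
    # Walk the graph from the first nucleotide, consuming edges in order.
    used = Counter()
    out, cur = [first], first
    for _ in range(len(seq) - 1):
        nxt = order[cur][used[cur]]
        used[cur] += 1
        out.append(nxt)
        cur = nxt
    return "".join(out)
```

Every output keeps the first and last nucleotide and the full dinucleotide multiset of the input, which is the property that makes the shuffle useful as a control.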

Workman and Krogh found similar results using a heuristic to approximate dinucleotide shuffling, while mononucleotide shuffling was not reliable for the detection of functional RNAs based on different folding energies (Workman and Krogh 1999). This is dictated by the mechanics of base-pairing, for which the energy contributed by a base pair is dependent on the pair it is stacked on. This allows for a statistical comparison between the folding energy of a sequence and its shuffled variations, such that a statistically significant difference indicates the presence of functional RNA. More recently, the ScanFold software tool has used Altschul–Erickson dinucleotide shuffling to detect functional regions in Zika virus, human immunodeficiency virus 1 (HIV-1), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA (Andrews et al. 2018, 2020).

Very little is known about the effectiveness of generating a random sequence (and breaking RNA secondary structures) using dinucleotide shuffling as a control method, even though this is often a base assumption for work detecting functional RNAs. In the work of Babak et al., it was found that tools used to detect functional RNA elements produced widely different results when applied to a standardized test set (Babak et al. 2007). Though this research utilized a different dinucleotide shuffling algorithm, low complexity sequences that were preserved under shuffling had a large impact on the detection of functional RNAs. This raises a question that has yet to be addressed by current research: are there RNA structures that persist upon dinucleotide shuffling?

Here we present a subset of hairpins that we call “unbreakable hairpins,” which were found to retain their secondary structure despite repeated Altschul–Erickson shuffling. Furthermore, we assess whether there is any biological significance to these unbreakable hairpins.

RESULTS

We started with the 708,144 hairpin loops in the bpRNA-1m meta-database, and removed all duplicated RNA sequences, which resulted in 532,012 hairpins. We attached the adjacent segments for each hairpin, where we define a segment as a stretch of neighboring base pairs that are interrupted only by bulges or internal loops, to create what are commonly called “stem loops.” We further required that each hairpin sequence will fold back into a hairpin. We performed dinucleotide shuffling and predicted secondary structure for each shuffled version. We sought to determine how many shuffles it takes for a sequence to no longer be predicted to form a hairpin structure. Figure 1A shows the fraction of hairpin-forming sequences that stop forming hairpin structures as a function of the number of applications of the dinucleotide shuffling algorithm (Altschul and Erickson 1985). After one shuffle, about half are broken, and after two, more than 80% are broken, but the percentage never reaches 100% (Fig. 1A). We identified 2364 hairpin-forming sequences that were predicted to form hairpins for each of 1000 shuffled variants, hereafter “unbreakable hairpins.” We sought to characterize the properties of this subset of sequences to explain why they always form hairpins, regardless of how many times the dinucleotide shuffling algorithm is applied.

FIGURE 1.

(A) The cumulative distribution function describing the number of dinucleotide shuffles applied to hairpin sequences such that the shuffled sequence will result in a non-hairpin sequence by secondary structure prediction. (B) A scatterplot comparing unbreakable hairpins to control hairpins, such that each dot corresponds to a hairpin sequence; the x-axis is the hairpin length, including both the stem and hairpin loop, and the y-axis is the predicted minimum free energy of the structure. (C) A scatterplot showing the average of five minimum free energy values (in kcal/mol) of unbreakable and breakable hairpins before and after Altschul–Erickson dinucleotide shuffling, where the diagonal represents no change in hairpin minimum free energy. (D) A representation of the same results as C, shown as a histogram representing the change in minimum free energy as a Z-score.

Hairpin length distribution

We examined the length distribution of the unbreakable hairpins. We observed that the length distribution (measured by the sequence length from the 5′ end of the hairpin stem to the 3′ end of the hairpin stem, including the loop) for the unbreakable hairpin sequences is shifted toward shorter lengths relative to that of all hairpin sequences (Fig. 1B). We next examined the predicted minimum free energy of the unbreakable hairpins compared to a sample from all hairpins that has the same length distribution. We observed that the unbreakable hairpins have similar energy values compared to other hairpin sequences of the same length distribution (Fig. 1B, inset).

Upon shuffling, the computed minimum free energy structures of unbreakable hairpins were more resistant to change, while hairpins of a similar length distribution were likely to lose structure, or lose thermodynamic stability, on average across five Altschul–Erickson dinucleotide shuffles (Fig. 1C). Many methods use a Z-score to quantify the degree of change in free energy upon dinucleotide shuffling (Washietl and Hofacker 2007). We found that the Z-scores of unbreakable hairpins are shifted toward much higher values, indicating less change upon shuffling (Fig. 1D).
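The text does not spell out the Z-score formula. A common convention, assumed here, is the native minimum free energy minus the mean of the shuffled energies, divided by their standard deviation. The function name `mfe_z_score` and the use of the sample standard deviation are illustrative choices, not details taken from this study.

```python
from statistics import mean, stdev


def mfe_z_score(native_mfe, shuffled_mfes):
    """Z-score of a native MFE against MFEs of shuffled sequences (sketch).

    Assumes at least two shuffled values with nonzero spread. Strongly
    negative scores mean the native sequence is much more stable than its
    shuffles; scores near zero mean shuffling changed little.
    """
    return (native_mfe - mean(shuffled_mfes)) / stdev(shuffled_mfes)
```

Under this convention, a sequence whose shuffles fold about as stably as the original scores close to zero, which is the behavior reported for unbreakable hairpins.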

We next investigated the length distribution of the hairpin loop itself, defined as the unpaired region, as well as the closing base-pair distribution. We found that the unbreakable hairpin loops were predominately of length 4 nt (Fig. 2), similar to the global distribution; the favored closing base pair was found to be C:G, which was previously observed to be the most common for tetraloops (Danaee et al. 2018). In contrast, breakable hairpins of a similar length distribution were most likely to contain loops of length 6 nt (Supplemental Fig. 1).

FIGURE 2.

A bar plot representing the number of unbreakable hairpins sharing loop sequences, with the closing base pair included. The most common sequence is C[UUCG]G.

Hairpin stem and loop sequence content

We next examined the hairpin loop sequences themselves and found the sequence CUUCGG in 75.2% (1814/2413) of the unbreakable hairpin loops in bpRNA-1m (Fig. 2). The CUUCGG tetraloop has been characterized as “ultra-stable” for its experimental stability exceeding thermodynamic predictions to the extent that RNA secondary structure parameters have been modified to account for its additional stability (Tuerk et al. 1988). Additionally, this tetraloop has been found in biologically significant regions of RNA across kingdoms. We found that ∼77% of unbreakable hairpins in bpRNA-1m contain this tetraloop, compared to 24% of total hairpins in the database.

We next observed that the unbreakable hairpins tend to be split into purines and pyrimidines. While one side of the stem was typically enriched for either purine (A or G) or pyrimidine (C or U) nucleotides, the other side of the stem had the opposite base composition. A total of 99.9% of unbreakable hairpins were found to exhibit a split on either side of the segment in purines and pyrimidines. Most of these splits consist primarily of cytosine on one side of the segment and guanine on the other, with an average GC abundance of 79% on the segment. GC-rich hairpin segments have been found to exist more frequently than low-GC content hairpins, which is attributed to the strength of the GC-pair (Chan et al. 2009). The arrangement of this abundance as a purine–pyrimidine split, however, is a more specific property. Furthermore, we found that the number of dinucleotide shuffles required to “break” a hairpin (that is, to produce a shuffled sequence that results in a non-hairpin structure) was inversely related to the number of “split” dinucleotides that contain both a purine and a pyrimidine (Supplemental Fig. 2). Related to this, we observed that on average, the unbreakable hairpins had a lower mononucleotide sequence complexity than both breakable hairpins and those unbreakable hairpins that lack the characteristic features present in the majority of unbreakable hairpins (Supplemental Fig. 3), using Wootton–Federhen sequence complexity, which is described in the Materials and Methods (Wootton and Federhen 1993).

Unbreakable hairpin identifying properties

The unbreakable hairpins tend to represent the intersection of two properties: the presence of an ultra-stable tetraloop sequence and the presence of a purine/pyrimidine split on either side of the hairpin (Fig. 3A).

FIGURE 3.

(A) A Venn diagram representation of the shared features of RNA hairpins. The majority of unbreakable RNA hairpins contain a C[UUCG]G loop and split in purines and pyrimidines. (B) The majority of unbreakable hairpins are located on the 22nd hairpin on 16S ribosomal RNAs.

Next, we examined whether the unbreakable hairpins are enriched in a particular type of RNA structure found in the bpRNA-1m meta-database, which primarily contains noncoding RNAs. We found that 2081 occurred within 16S ribosomal RNAs (rRNAs). We also observed that the most common location of unbreakable hairpins was the same hairpin of 16S ribosomal RNAs, typically hairpin 22 or 23, as defined by numbering each hairpin loop in order from the 5′ end (Fig. 3B). A total of 1622 of the 2081 16S ribosomal RNA unbreakable hairpins are identical in sequence, UCCCCUUCGGGGGA. In addition, the unbreakable hairpins are associated with a small number of unique sequences that can be generated by dinucleotide shuffling, which we quantify in the next section.

Enumeration of distinct sequences generated by dinucleotide shuffling

We observed that the de Bruijn graphs for unbreakable hairpins have a “pinch point,” in that they typically have only one edge spanning purines and pyrimidines (Supplemental Fig. 2). Typically, the unbreakable hairpins have one or a small, odd number of dinucleotides with both a purine and a pyrimidine (Fig. 4A).
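Counting these purine/pyrimidine split dinucleotides is straightforward. The sketch below assumes an RNA sequence over A, C, G, and U; the function name `count_split_dinucleotides` is a hypothetical label for illustration.

```python
PURINES = frozenset("AG")


def count_split_dinucleotides(seq):
    """Count adjacent nucleotide pairs that mix a purine and a pyrimidine."""
    return sum((a in PURINES) != (b in PURINES) for a, b in zip(seq, seq[1:]))
```

For the common 16S sequence UCCCCUUCGGGGGA the count is 1, because the only purine/pyrimidine transition is the single C-to-G step between the pyrimidine run and the purine run.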

FIGURE 4.

(A) A histogram showing the distribution of the number of purine/pyrimidine split dinucleotides for both unbreakable and control hairpins sequences. (B) A histogram comparing the count of distinct shuffling-generatable hairpin sequences for unbreakable and control hairpin sequences. (C) Scatterplot comparing the number of distinct dinucleotide-shuffled sequences and the number of purine/pyrimidine split dinucleotides for unbreakable and control hairpins. The unbreakable hairpins have fewer, typically one, purine/pyrimidine split dinucleotides and fewer distinct dinucleotide generatable sequences.

We examined the sequences of the unbreakable hairpins in the context of the Altschul–Erickson algorithm. This algorithm generates different dinucleotide-shuffled sequences by selecting random Eulerian circuits within the de Bruijn graph for the sequence (Supplemental Fig. 4). The de Bruijn graph G = (E, V) is such that each nucleotide is represented by a vertex v ∈ V, and the edges E between them correspond to dinucleotides that occur in the input sequence [e.g., (A, G) ∈ E]. To ensure that every vertex has the same number of ingoing and outgoing edges, an additional “false edge” is added between the vertices corresponding to the first and last character. Eulerian circuits are defined as closed paths through the graph that visit every edge exactly once. The number of possible Eulerian circuits can be quantified for a given sequence using the de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte (BEST) theorem, as each individual Eulerian circuit represents a dinucleotide shuffle (van Aardenne-Ehrenfest and de Bruijn 1951). This formula can be modified to compute the number of unique sequences that can be generated by dinucleotide shuffling (Kandel et al. 1996). We observed that the number of possible sequences that can be generated is lower by an order of magnitude for unbreakable hairpins (Fig. 4B). When combined with the number of purine/pyrimidine split dinucleotides, we see that the unbreakable hairpins have a small number of generatable sequences, which are constrained by the pinch point created in the de Bruijn graph by a small number of purine/pyrimidine split dinucleotides (Fig. 4C). This “purine–pyrimidine” split limits the combinatorics of vertices visited in an Eulerian circuit traversing the de Bruijn graphs of unbreakable hairpins. We illustrate an example of the secondary structure and de Bruijn graph for an unbreakable (Fig. 5A,B) and breakable example (Fig. 5C,D).

FIGURE 5.

(A) An example of a secondary structure of an unbreakable hairpin. (B) The de Bruijn graph representation of the unbreakable hairpin in A. Graph edges correspond to dinucleotides in the sequence, whereas the red edge indicates a false edge from the last nucleotide to the first. (C) An example of a secondary structure of a breakable hairpin. (D) The de Bruijn graph representation of the breakable hairpin in C, with edges defined as in B.

Biological significance of unbreakable hairpins

Given that these types of unbreakable hairpins exist and that they would probably evade methods of detecting them using dinucleotide shuffling, we sought to determine if there could be any missed biological relevance to them. We conducted a phylogenetic analysis of 16S sequences that contain unbreakable hairpin sequences in bacteria. We found that sequences in the class Bacilli were more likely to contain unbreakable hairpins than those in other classes (Supplemental Fig. 5). Within Bacilli, the distribution among orders, families, and genera was unequal. For example, at the order level, <1% of sequences in Lactobacillales contained unbreakable hairpin sequences as opposed to ∼25% of sequences in Bacillales. Within Bacillales, 85% of Sporolactobacillaceae sequences contained unbreakable hairpins, indicating a potential connection to endospore formation.

To determine whether there is a connection between endospore formation and unbreakable hairpins, we identified the proportion of sequences in each genus within Bacilli and Alphaproteobacteria, which both contained some unbreakable sequences, and compared this to whether the genera were known to form endospores. Approximately 27.3% of genera were not found to have sufficient research to support endospore formation or a lack thereof. With the remaining genera, a one-tailed t-test for independence produced a P-value of <0.01. A Fisher contingency table P-value was also <0.01, supporting the positive association between endospore formation and unbreakable sequences.

Finally, we found that the sequence features that characterize unbreakable hairpins may also confer a robustness to sequence mutations. We observed that unbreakable hairpins on average show less of a change in stem length when a random single-nucleotide insertion or deletion is introduced (Supplemental Figs. 6, 7). This trend is strongest for deletions, but in both cases a greater proportion of unbreakable hairpins are unchanged in length.

DISCUSSION

Dinucleotide shuffling is widely used to generate control sequences in the analysis of RNA. However, this method does not account for the varying potential of dinucleotide shuffling to produce distinct outputs across RNA sequences, which may produce results that miss specific subsets of sequences. Counting the number of possible unique sequences that can be generated through dinucleotide shuffling has significant implications for bioinformatics work with RNA structures where dinucleotide shuffling is used as a control. We demonstrated that it is at least possible for some highly stable hairpin structures to evade detection with dinucleotide shuffling. We quantified the number of potential distinct dinucleotide-shuffled sequences, N_U(G), given by Equation 4. We further found that the unbreakable hairpins have a very small number of purine/pyrimidine split dinucleotides, which forces purines and pyrimidines on opposite sides of the sequence. Calculating N_U(G) and the number of purine/pyrimidine split dinucleotides may be useful in dinucleotide shuffling software to provide a score quantifying the potential for sequences to have a restricted number of distinct shuffled sequences, and possibly similar predicted minimum free energies. As illustrated in Figure 6, more complex RNA features are unlikely to have the unbreakable property due to the necessity of one or fewer transitions between purine and pyrimidine nucleotides. An exception shown is the H-type pseudoknot, which could contain a single purine–pyrimidine split, though there could be no other structure between the closing pair of the hairpin stem and the nucleotides paired to the loop portion of the hairpin, so structure possibilities are limited. An example of a simple H-type pseudoknot with one transition is shown in Figure 6B.

FIGURE 6.

(A) The structure of unbreakable hairpins prevents multiple hairpins from forming an unbreakable structure, as multiple purine/pyrimidine splits must be present. (B) It is feasible for a single H-type RNA pseudoknot to be unbreakable if there are no further splits.

It remains unclear whether unbreakable hairpins are an oddity of the dinucleotide shuffling algorithm, or if they are biologically relevant. As the majority of unbreakable hairpin sequences were located in a specific region within 16S ribosomal RNAs, where the structures of tetraloop hairpins are known to facilitate interactions that enhance protein stability (Wolters 1992), the unbreakable features of the hairpin may allow for sustained stability despite deletions. Since unbreakable hairpins have a lower sequence complexity than breakable hairpins of similar length, and the majority are split between purines and pyrimidines, single-nucleotide or dinucleotide deletions are less likely to affect the secondary structure than in a control hairpin of similar length. We observed that the unbreakable hairpins are enriched in endospore-forming bacteria in the class Bacilli. Unbreakable hairpin sequences may confer a robustness due to their retained structure upon rearrangement, which could be valuable to bacterial endospores, which are known to retain stability upon exposure to radiation, desiccation, and heat (Setlow 2006).

In RNA tertiary structures, it has been found that a split in purines and pyrimidines is associated with Watson–Crick base-pairing in the metastasis-associated lung adenocarcinoma transcript 1 (MALAT1) long noncoding RNA nuclear retention element. Brown et al. described a preference for pyrimidine–purine pairs on this triple helix over purine–pyrimidine pairs, as determined by chemical probing using Watson–Crick base-pair replacement mutations (Brown et al. 2014). All base pairs in the triple-helical region retain a purine–pyrimidine split in the wild-type form. This example is consistent with the observation that purine–pyrimidine splits increase the robustness of RNA structures.

UUCG loops are frequently considered an “experimental tool” that can stabilize the secondary structure of RNAs and form stable structures, despite no protein–RNA binding observed (Hall 2015). Since the initial categorization of 16S rRNA tetraloops, research has found that the ultra-stable UNCG hairpin loops, and particularly those with a C:G closing base pair, are more abundant than their thermodynamic stability and base-pair frequency would indicate (Tuerk et al. 1988; Varani 1995; Hall 2015). Suggested roles of UNCG hairpins in rRNA and other locations include interactions with ribosomal proteins, forming nucleation sites for RNA folding, and protection against degradation by nucleases (Varani 1995; Thapar et al. 2014). Indeed, UNCG hairpins have been found to serve as sites of zinc finger protein interactions, such as the Rous Sarcoma Virus (RSV) packaging signal's UGCG tetraloop, which interacts with a zinc knuckle on the nucleocapsid protein and appears to affect viral assembly without any conformational change in the RNA (Thapar et al. 2014).

Despite the ubiquity of C(UUCG)G loops, no such specific functional interaction has been found for these tetraloops. The observation of their properties and frequency in unbreakable hairpins, as well as their tendency to be missed by functional RNA screens due to their typically low rearrangement count, indicates that it would be valuable to investigate them further.

MATERIALS AND METHODS

Unbreakable hairpin identification

Unbreakable hairpins were identified as a subset of all hairpin segments listed in the bpRNA-1m database, version 1.0 (Danaee et al. 2018), which includes full stems and hairpin loops stored in the FASTA file format. Each segment listed was then processed by RNAfold version 2.1.9 (Lorenz et al. 2011) to determine the dot-bracket sequence of the predicted hairpin structure. To filter against invalid hairpins, hairpin segments that were not composed solely of A, C, G, and U characters were eliminated, as were segments that did not meet two thresholds to qualify as hairpins: balance and base-pair density. Balance scores were computed from the distribution of left and right parentheses in the dot-bracket sequence:

$$\mathrm{Balance} = \frac{\text{left brackets on left}}{\text{total left brackets}} + \frac{\text{right brackets on right}}{\text{total right brackets}}$$

In other words, this describes the fraction of left brackets that are on the left half of the sequence plus the fraction of right brackets that are on the right half of the sequence. The balance score would therefore be 2 for a hairpin structure and a smaller value for an unbalanced structure. For example, a structure containing two hairpins of equal length would have a balance score of 1 because half of the left brackets and half of the right brackets would fall on each half of the dot-bracket sequence. The base-pair density score describes the ratio of base pairs to nucleotide length, and can in practice be computed from the fraction of characters that are left parentheses. More precisely,

$$\text{Base-pair density} = \frac{\text{total base pairs}}{\text{nucleotide length}}$$

In other words, the base-pair density would be 0.0 for a completely unpaired RNA and would be 0.5 for a dot-bracket that lacks any unpaired nucleotides. A base-pair density of 0.25 corresponds to half of the nucleotides being paired. In our analysis, if the balance score exceeded 1.86 and the base-pair density exceeded 0.25, hairpins proceeded to a series of Altschul–Erickson dinucleotide shuffles (Deng et al. 2019).
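The two scores and the filter can be sketched as follows. This is an illustration rather than the code used in the study: the function names are hypothetical, and the handling of the midpoint (a position counts as "left" when its zero-based index is below half the length) is an assumption, since the text does not specify how odd-length structures are divided.

```python
def balance_score(dot_bracket):
    """Fraction of '(' in the left half plus fraction of ')' in the right half."""
    half = len(dot_bracket) / 2
    left = [i for i, c in enumerate(dot_bracket) if c == "("]
    right = [i for i, c in enumerate(dot_bracket) if c == ")"]
    if not left or not right:
        return 0.0  # no base pairs at all
    left_on_left = sum(i < half for i in left) / len(left)
    right_on_right = sum(i >= half for i in right) / len(right)
    return left_on_left + right_on_right


def base_pair_density(dot_bracket):
    """Number of base pairs (counted as '(') divided by sequence length."""
    return dot_bracket.count("(") / len(dot_bracket) if dot_bracket else 0.0


def is_hairpin(dot_bracket, min_balance=1.86, min_density=0.25):
    """Apply both thresholds stated in the text (strictly exceeded)."""
    return (balance_score(dot_bracket) > min_balance
            and base_pair_density(dot_bracket) > min_density)
```

As a check against the text, `balance_score("(((...)))")` is 2.0 and `balance_score("((..))((..))")`, two hairpins of equal length, is 1.0.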

Each hairpin was shuffled repeatedly by the Altschul–Erickson algorithm until the balance score, or base-pair density of the predicted secondary structure of the shuffled sequence, was no longer within the thresholds. If hairpins could be shuffled at least 1000 times and remain a hairpin, they were stored as unbreakable hairpins.
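The repeated-shuffling procedure can be sketched as a small loop. The shuffling, folding, and hairpin-test steps are passed in as callables so the sketch stays independent of any particular tool; all names here are hypothetical, and whether each shuffle was applied to the previous shuffle or to the original sequence is an assumption (the former is shown).

```python
from typing import Callable, Optional


def shuffles_until_broken(
    seq: str,
    shuffle: Callable[[str], str],      # e.g., an Altschul-Erickson shuffler
    fold: Callable[[str], str],         # returns a dot-bracket structure
    is_hairpin: Callable[[str], bool],  # balance and density thresholds
    max_shuffles: int = 1000,
) -> Optional[int]:
    """Return the shuffle count at which the hairpin breaks, or None.

    None means the sequence stayed a hairpin for max_shuffles shuffles,
    which is the "unbreakable" criterion described in the text.
    """
    current = seq
    for n in range(1, max_shuffles + 1):
        current = shuffle(current)
        if not is_hairpin(fold(current)):
            return n
    return None
```

In practice `fold` could be a thin wrapper around a structure predictor such as RNAfold that returns only the dot-bracket string.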

Control hairpin set

Control hairpins were randomly sampled from the hairpin segments in the bpRNA-1m database that were of similar or equal length to each unbreakable hairpin, creating a comparable set of equal size. If an exact match in length could not be found for each unbreakable hairpin, the control hairpin with the next-closest length was kept. The control and unbreakable hairpin sets are of an equal size (Supplemental Fig. 7).
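A length-matched control set of this kind could be drawn as sketched below. The function name `sample_length_matched` is hypothetical, and sampling without replacement is an assumption; the text states only that an exact length match was preferred and the next-closest length was used otherwise.

```python
import random


def sample_length_matched(targets, pool, rng=random):
    """Pick one control per target, preferring equal length (sketch).

    targets: sequences to match (e.g., unbreakable hairpins)
    pool:    candidate control sequences
    """
    remaining = list(pool)
    chosen = []
    for target in targets:
        if not remaining:
            break
        # Smallest available length difference; 0 means an exact match.
        best = min(abs(len(c) - len(target)) for c in remaining)
        candidates = [c for c in remaining if abs(len(c) - len(target)) == best]
        pick = rng.choice(candidates)
        remaining.remove(pick)  # assumed: sample without replacement
        chosen.append(pick)
    return chosen
```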

Hairpin length and sequence content

We found the length distribution of unbreakable hairpin sequences as well as the global length distribution of RNA hairpins in bpRNA-1m from their respective FASTA sequences. We then computed the predicted minimum free energy (MFE) of each unbreakable hairpin and control hairpin from RNAfold 2.1.19 using default parameters.

We also investigated the loop of each hairpin using similar methods, as the loop of each hairpin segment is stored individually. We searched for the known ultra-stable sequence motif “CUUCGG” (Tuerk et al. 1988) in each hairpin in the control and unbreakable regions to find evidence of enrichment. We also determined the propensity of a total purine/pyrimidine split in hairpin segments. We defined a “total split” as a hairpin with nucleotides that were entirely purines or pyrimidines on one side of the hairpin loop and the other on the other side, disregarding the content of the loop. We recorded the content of each split unbreakable and control hairpin.
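Both checks are simple to express in code. In the sketch below the function names and the loop coordinates (`loop_start`, `loop_end`, zero-based, end-exclusive) are hypothetical inputs; in the study the loop of each hairpin segment was available separately.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CU")


def has_ultrastable_motif(seq, motif="CUUCGG"):
    """True if the closing pair plus tetraloop motif occurs in the sequence."""
    return motif in seq


def is_total_split(seq, loop_start, loop_end):
    """True if one side of the loop is all purine and the other all pyrimidine.

    The loop itself, seq[loop_start:loop_end], is ignored.
    """
    five_prime = set(seq[:loop_start])
    three_prime = set(seq[loop_end:])
    if not five_prime or not three_prime:
        return False
    return ((five_prime <= PURINES and three_prime <= PYRIMIDINES)
            or (five_prime <= PYRIMIDINES and three_prime <= PURINES))
```

For UCCCCUUCGGGGGA with the UUCG loop at positions 5 to 9, the 5′ side (UCCCC) is entirely pyrimidine and the 3′ side (GGGGA) is entirely purine, so both checks return true.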

To determine sequence complexity, we used the following equations for Wootton–Federhen sequence complexity (Wootton and Federhen 1993).

Mononucleotide sequence complexity for a sequence x, where n represents the sequence length, and n_A, n_C, n_G, and n_U represent the count of each nucleotide, respectively, is defined as:

$$C(x) = \frac{1}{n}\log_4\left(\frac{n!}{n_A!\,n_C!\,n_G!\,n_U!}\right) \tag{1}$$
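Equation 1 translates directly into code. The sketch below uses exact integer arithmetic for the multinomial coefficient before taking the base-4 logarithm; the function name `wf_complexity` is a hypothetical label.

```python
from collections import Counter
from math import factorial, log


def wf_complexity(seq):
    """Wootton-Federhen mononucleotide complexity of a sequence (Equation 1)."""
    n = len(seq)
    if n == 0:
        return 0.0
    # Multinomial coefficient n! / (n_A! n_C! n_G! n_U!), computed exactly.
    arrangements = factorial(n)
    for count in Counter(seq).values():
        arrangements //= factorial(count)
    return log(arrangements, 4) / n
```

A homopolymer such as AAAA has only one arrangement and therefore a complexity of 0, while sequences with a more even base composition score higher.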

Using the RNA structure annotation tool, bpRNA, we identified the location of each unbreakable hairpin (Danaee et al. 2018). Of 2413 unbreakable hairpins, 1116 (46%) were the 22nd, and 383 were the 23rd hairpin, defined from 5′ to 3′, of 16S ribosomal RNAs. Overall, 2064 unbreakable hairpins (86%) were derived from 16S ribosomal RNAs in the bpRNA-1m meta-database.

Enumeration of number of possible dinucleotide shufflings

We used the de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte (BEST) theorem to calculate the number of unique Eulerian circuits possible for a given sequence, as each individual Eulerian circuit represents a dinucleotide shuffle (van Aardenne-Ehrenfest and de Bruijn 1951). It states that the number of Eulerian circuits N_E(G) is given by the following equation:

$$N_E(G) = t_x(G)\prod_{v \in V}\bigl(\deg(v) - 1\bigr)! \tag{2}$$

In this expression, the term t_x(G) is the number of arborescences, which are trees directed toward the root from a fixed vertex x, which are chosen such that the root is the vertex corresponding to the first nucleotide, and x corresponds to the last nucleotide. A false edge is added from the last nucleotide's vertex to the first to ensure an Eulerian graph. The number of arborescences can be computed by first defining the matrix Q as the difference of the degree matrix D and the adjacency matrix A, so that Q = D − A. The terms A_ij of the adjacency matrix A are equal to the number of edges from vertex i to j. The degree matrix D is a diagonal matrix with terms D_ii equaling the in-degree of each vertex i, where the in-degree is equal to the out-degree for this type of graph. The elements of Q are therefore such that Q_ij equals −m for a graph with m edges from i to j (i ≠ j), and Q_ii equals the number of ingoing edges minus the number of self-edges (Supplemental Fig. 4). Based on Kirchhoff's theorem for directed multigraphs, the number of arborescences at a vertex i is equal to the determinant of the matrix resulting from the removal of the ith row and column of Q (Kirchhoff 1847). In this case, we select i to correspond to the root, which results in a 3 × 3 matrix that does not include terms from the false edge (Supplemental Fig. 4).
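A worked example may make the construction concrete. The following is an illustrative calculation, not reproduced from the article's supplement, for the 16S hairpin sequence UCCCCUUCGGGGGA reported in the Results. Its dinucleotide counts are UC = 2, CC = 3, CU = 1, UU = 1, CG = 1, GG = 4, and GA = 1, and the false edge runs from A (last nucleotide) to U (first nucleotide). The vertex order (A, C, G, U) is an arbitrary choice.

```latex
% Adjacency (with the false edge A -> U), in-degree, and Q = D - A
\[
A = \begin{pmatrix} 0&0&0&1\\ 0&3&1&1\\ 1&0&4&0\\ 0&2&0&1 \end{pmatrix},\qquad
D = \mathrm{diag}(1,\,5,\,5,\,3),\qquad
Q = D - A = \begin{pmatrix} 1&0&0&-1\\ 0&2&-1&-1\\ -1&0&1&0\\ 0&-2&0&2 \end{pmatrix}
\]
% Remove the row and column of the root U (first nucleotide)
\[
t_x(G) = \det\begin{pmatrix} 1&0&0\\ 0&2&-1\\ -1&0&1 \end{pmatrix} = 2
\]
% Equation 2 with deg(A)=1, deg(C)=5, deg(G)=5, deg(U)=3
\[
N_E(G) = 2\cdot(1-1)!\,(5-1)!\,(5-1)!\,(3-1)! = 2\cdot 1\cdot 24\cdot 24\cdot 2 = 2304
\]
```

Each row of Q sums to zero, as expected when in-degree equals out-degree, and the entry for the false edge (row A, column U) disappears once the root column is removed.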

While the BEST theorem counts the number of Eulerian circuits, it does not count the number of distinct possible dinucleotide-shuffled sequences, because many such sequences are the same despite arising from different Eulerian circuits. The equation treats every edge in the graph as distinct, but parallel edges between the same pair of vertices spell the same dinucleotide, so circuits that differ only in the order of their parallel edges produce the same character sequence. The theorem therefore overcounts the number of unique generated sequences.
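The overcounting can be seen directly by brute force on a short sequence. The sketch below is an illustration only, with hypothetical helper names `labelled_trails` and `spell` and a toy sequence. It treats each dinucleotide occurrence as its own labelled edge, enumerates every Eulerian trail by depth-first search, and then collapses the trails to the strings they spell. Exhaustive enumeration of this kind is practical only for very short sequences.

```python
def labelled_trails(seq):
    """Yield each Eulerian trail as a tuple of edge indices, where the
    i-th dinucleotide seq[i:i+2] is treated as its own distinguishable edge."""
    edges = [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]
    used = [False] * len(edges)

    def walk(vertex, trail):
        if len(trail) == len(edges):
            if vertex == seq[-1]:
                yield tuple(trail)
            return
        for k, (src, dst) in enumerate(edges):
            if not used[k] and src == vertex:
                used[k] = True
                trail.append(k)
                yield from walk(dst, trail)
                trail.pop()
                used[k] = False

    yield from walk(seq[0], [])


def spell(seq, trail):
    """Convert a trail of edge indices back into a nucleotide string."""
    return seq[0] + "".join(seq[k + 1] for k in trail)


# Toy sequence chosen for illustration; it is not taken from the paper's data.
seq = "AAACAGA"
trails = list(labelled_trails(seq))
sequences = {spell(seq, t) for t in trails}
print(len(trails), len(sequences))  # prints 24 12
```

The two AA edges are interchangeable, so each distinct string is spelled by exactly two labelled trails, which is why 24 trails collapse to 12 sequences.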

Kandel et al. previously identified a formula to count the number of Eulerian paths in a digraph, such as an RNA de Bruijn graph, using the matrix tree theorem (Kandel et al. 1996). They introduced a multiplicity term, which gives the total number of rearrangements of the indistinguishable edges connecting the same nodes.

The multiplicity M(G) is given by the following product:

$$M(G) = \prod_{v,w \in G} m(v,w)! \tag{3}$$

where m(v, w) is the number of parallel edges in the graph connecting vertices v and w.

We therefore consider a modified form of the BEST theorem to count the total number of unique sequences that includes M(G) (Kandel et al. 1996).

$$N_U(G) = \frac{t_x(G) \prod_{v \in V} (\deg(v) - 1)!}{\prod_{v,w \in G} m(v,w)!} \tag{4}$$

This equation modifies the BEST theorem to compute N_U(G), the number of unique Eulerian circuits, avoiding what would otherwise be multiple counting of circuits that result in the same output sequence. To determine the count of unique sequences rather than circuits, we divided the result for each sequence by the product of the factorials of the numbers of its parallel edges, eliminating circuits that differ only in the order of parallel edges.
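As a worked illustration of the division by the multiplicity, consider the toy sequence AAACAGA, chosen here only for illustration and not taken from the bpRNA-1m data. It uses three nucleotides, so the minor is 2 × 2 rather than the 3 × 3 case described for four nucleotides. The false edge A → A is included in the adjacency and degree matrices but not in the multiplicity, since it is not a dinucleotide of the sequence.

```latex
% Toy sequence AAACAGA; vertices ordered (A, C, G).
% A_{ij} counts edges i -> j and includes the false edge A -> A.
\begin{align*}
D &= \operatorname{diag}(5, 1, 1), \qquad
A = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \qquad
Q = D - A = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix} \\
t_x(G) &= \det \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = 1
  \quad \text{(row and column of the root A removed)} \\
N_E(G) &= t_x(G) \prod_{v \in V} (\deg(v) - 1)! = 1 \cdot 4! \cdot 0! \cdot 0! = 24 \\
M(G) &= \prod_{v,w \in G} m(v,w)! = 2! \cdot 1! \cdot 1! \cdot 1! \cdot 1! = 2 \\
N_U(G) &= \frac{N_E(G)}{M(G)} = \frac{24}{2} = 12
\end{align*}
```

The 12 sequences correspond to the 4!/2! orderings of two AA steps, one ACA excursion, and one AGA excursion, all starting and ending at A.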

SUPPLEMENTAL MATERIAL

Supplemental material is available for this article.

DATA DEPOSITION

RNA sequences used for this analysis were downloaded from bpRNA-1m and filtered using CD-HIT with a sequence identity value of 1.0 (Fu et al. 2012). The unbreakable set of RNA hairpins, as well as the control RNA hairpin set used in the analyses, are available at http://github.com/alyssapratt/unbreakable-hairpins. All Python scripts used to generate the figures in this analysis are available within the same GitHub repository.

ACKNOWLEDGMENTS

The authors would like to thank Dezhong Deng for helpful discussions regarding the first observation of the unbreakable hairpins when designing a negative control data set for the prediction of RNA hairpins.

Footnotes

Freely available online through the RNA Open Access option.

MEET THE FIRST AUTHOR

Alyssa A. Pratt

Meet the First Author(s) is an editorial feature within RNA, in which the first author(s) of research-based papers in each issue have the opportunity to introduce themselves and their work to readers of RNA and the RNA research community. Alyssa A. Pratt is the first author of this paper, “Unraveling unbreakable hairpins: characterizing RNA secondary structures that are persistent after dinucleotide shuffling.” She conducted the research described in this work as an undergraduate researcher in the Hendrix Lab at Oregon State University and is now a first-year doctoral student studying computational biology at the University of California, Berkeley.

What are the major results described in your paper and how do they impact this branch of the field?

This is the first paper describing the phenomenon of RNA sequences folding into the same secondary structure despite repeated dinucleotide shuffling. I expect that the identification and investigation of these structures will be most impactful to computational research in the field that relies on dinucleotide shuffling to break up primary and secondary structure features. However, I think that there are signs of a potential for direct biological significance of these unbreakable hairpins, though future work would be needed to validate this.

What led you to study RNA or this aspect of RNA science?

I was somewhat familiar with secondary structure features before I began studying RNA, because of the importance of DNA secondary structure in some DNA viral genomes. When I joined the Hendrix Lab at Oregon State University, I was drawn to the lab's investigation of RNA structural features because of the far-reaching impact of RNA and its structure through all ranks of life.

What are some of the landmark moments that provoked your interest in science or your development as a scientist?

I had a childhood interest in science that was fostered by spending time exploring the natural world and learning about scientific feats like the Mars rover program and genetic engineering, which both guided me toward an interest in learning about the rules that govern life at a molecular level. The opportunity to be involved in research in high school set me on track toward a scientific career since I was exposed to what it meant to be part of a sustained research project, and I learned what academic steps were needed. Having rigorous academic programs in biochemistry and computer science at Oregon State University, as well as undergraduate research opportunities, helped me to develop the skills to begin this career.

What are your subsequent near- or long-term career plans?

I recently started a PhD program in Computational Biology at the University of California, Berkeley. While I haven't joined a lab yet, I have an interest in evolution through the lenses of genomics and mechanistic discovery. After I graduate, I'm hoping to remain in research and work toward leading a research lab.

REFERENCES

  1. Altschul S, Erickson B. 1985. Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 2: 526–538. 10.1093/oxfordjournals.molbev.a040370
  2. Andrews RJ, Roche J, Moss WN. 2018. ScanFold: an approach for genome-wide discovery of local RNA structural elements—applications to Zika virus and HIV. PeerJ 6: e6136. 10.7717/peerj.6136
  3. Andrews RJ, Baber L, Moss WN. 2020. Mapping the RNA structural landscape of viral genomes. Methods 183: 57–67. 10.1016/j.ymeth.2019.11.001
  4. Babak T, Blencowe BJ, Hughes TR. 2007. Considerations in the identification of functional RNA structural elements in genomic alignments. BMC Bioinformatics 8: 33. 10.1186/1471-2105-8-33
  5. Brown JA, Bulkley D, Wang J, Valenstein ML, Yario TA, Steitz TA, Steitz JA. 2014. Structural insights into the stabilization of MALAT1 noncoding RNA by a bipartite triple helix. Nat Struct Mol Biol 21: 633–640. 10.1038/nsmb.2844
  6. Chan CY, Carmack CS, Long DD, Maliyekkel A, Shao Y, Roninson IB, Ding Y. 2009. A structural interpretation of the effect of GC-content on efficiency of RNA interference. BMC Bioinformatics 10: S33. 10.1186/1471-2105-10-S1-S33
  7. Chen X, Li Y, Umarov R, Gao X, Song L. 2020. RNA secondary structure prediction by learning unrolled algorithms. arXiv 10.48550/ARXIV.2002.05810
  8. Clote P, Ferré F, Kranakis E, Krizanc D. 2005. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 11: 578–591. 10.1261/rna.7220505
  9. Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. 2018. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res 46: 5381–5394. 10.1093/nar/gky285
  10. Deng D, Holman D, Hendrix DA. 2019. DeepSloop: a recurrent neural network learns complex rules to detect stem-loop-forming RNA sequences. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2799–2807. 10.1109/BIBM47256.2019.8983346
  11. Do CB, Foo C-S, Batzoglou S. 2008. A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics 24: i68–i76. 10.1093/bioinformatics/btn177
  12. Fitch WM. 1983. Random sequences. J Mol Biol 163: 171–176. 10.1016/0022-2836(83)90002-5
  13. Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150–3152. 10.1093/bioinformatics/bts565
  14. Hall KB. 2015. Mighty tiny. RNA 21: 630–631. 10.1261/rna.050567.115
  15. Jiang M, Anderson J, Gillespie J, Mayne M. 2008. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics 9: 192. 10.1186/1471-2105-9-192
  16. Kandel D, Matias Y, Unger R, Winkler P. 1996. Shuffling biological sequences. Discrete Appl Math 71: 171–185. 10.1016/S0166-218X(97)81456-4
  17. Kirchhoff G. 1847. Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wird. Ann Phys Chem 148: 497–508. 10.1002/andp.18471481202
  18. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. 2011. ViennaRNA package 2.0. Algorithms Mol Biol 6: 26. 10.1186/1748-7188-6-26
  19. Lu X-J, Bussemaker HJ, Olson WK. 2015. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res gkv716. 10.1093/nar/gkv716
  20. Saman Booy M, Ilin A, Orponen P. 2022. RNA secondary structure prediction with convolutional neural networks. BMC Bioinformatics 23: 58. 10.1186/s12859-021-04540-7
  21. Sarver M, Zirbel CL, Stombaugh J, Mokdad A, Leontis NB. 2007. FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J Math Biol 56: 215–252. 10.1007/s00285-007-0110-x
  22. Setlow P. 2006. Spores of Bacillus subtilis: their resistance to and killing by radiation, heat and chemicals. J Appl Microbiol 101: 514–525. 10.1111/j.1365-2672.2005.02736.x
  23. Svoboda P, Di Cara A. 2006. Hairpin RNA: a secondary structure of primary importance. Cell Mol Life Sci 63: 901–908. 10.1007/s00018-005-5558-5
  24. Thapar R, Denmon AP, Nikonowicz EP. 2014. Recognition modes of RNA tetraloops and tetraloop-like motifs by RNA-binding proteins: recognition modes of RNA tetraloops and tetraloop-like motifs. Wiley Interdiscip Rev RNA 5: 49–67. 10.1002/wrna.1196
  25. Tuerk C, Gauss P, Thermes C, Groebe DR, Gayle M, Guild N, Stormo G, d'Aubenton-Carafa Y, Uhlenbeck OC, Tinoco I. 1988. CUUCGG hairpins: extraordinarily stable RNA secondary structures associated with various biochemical processes. Proc Natl Acad Sci 85: 1364–1368. 10.1073/pnas.85.5.1364
  26. van Aardenne-Ehrenfest T, de Bruijn NG. 1951. Circuits and trees in oriented linear graphs. In Classic papers in combinatorics (ed. Gessel I, Rota G-C), pp. 149–163. Birkhäuser, Boston.
  27. Varani G. 1995. Exceptionally stable nucleic acid hairpins. Annu Rev Biophys Biomol Struct 24: 379–404. 10.1146/annurev.bb.24.060195.002115
  28. Wang Y, Liu Y, Wang S, Liu Z, Gao Y, Zhang H, Dong L. 2020. ATTfold: RNA secondary structure prediction with pseudoknots based on attention mechanism. Front Genet 11: 612086. 10.3389/fgene.2020.612086
  29. Washietl S, Hofacker LI. 2007. Identifying structural noncoding RNAs using RNAz. Curr Protoc Bioinformatics 19: 12.7.1–12.7.18. 10.1002/0471250953.bi1207s19
  30. Wolters J. 1992. The nature of preferred hairpin structures in 16S-like rRNA variable regions. Nucleic Acids Res 20: 1843–1850. 10.1093/nar/20.8.1843
  31. Wootton JC, Federhen S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem 17: 149–163. 10.1016/0097-8485(93)85006-X
  32. Workman C, Krogh A. 1999. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 27: 4816–4822. 10.1093/nar/27.24.4816
  33. Yang H. 2003. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res 31: 3450–3460. 10.1093/nar/gkg529

Articles from RNA are provided here courtesy of The RNA Society
