Decoding RNA triple helices: identification from sequence and secondary structure

Margherita A G Matarrese; Michela Quadrini; Nicole Luchetti; Federico Di Petta; Daniele Durante; Monica Ballarino; Letizia Chiodo; Luca Tesei

doi:10.1093/bib/bbag009

2026 Jan 26;27(1):bbag009. doi: 10.1093/bib/bbag009


PMCID: PMC12834306 PMID: 41587321

Abstract

The discovery of long non-coding RNAs (lncRNAs) has revealed additional layers of gene-expression control. Specific interactions of lncRNAs with DNA, RNAs, and RNA-binding proteins enable regulation in both cytoplasmic and nuclear compartments; for example, a conserved triple-helix motif is essential for MALAT1 stability and oncogenic activity. Here, we present a secondary-structure-based framework to annotate and detect RNA triple helices. First, we extend the dot-bracket formalism with a third annotation line that encodes Hoogsteen contacts. Second, we introduce TripleMatcher, which searches for a triple-helix pattern, filters candidates by C1′–C1′ distance thresholds, and merges overlaps into region-level zones. Using telomerase RNAs and RNA-stability elements with experimentally established triple helices (8 RNAs), TripleMatcher localized all annotated regions (structure-wise detection 8/8); geometric filtering removed most spurious candidates and improved precision (positive predictive value from 0.42 to 0.81) and overall accuracy (F1-score from 0.42 to 0.62) while maintaining sensitivity. Benchmarking eight predictors showed that pseudoknot-aware methods most reliably reproduce the local architecture required for detection, aligning secondary-structure quality with downstream triple-helix recovery. Applied prospectively, the framework identified candidate regions directly from predicted secondary structures and scaled to a screen of 4160 RNAs, where distance filtering reduced 150 990 (median per molecule: 108 [20–270]) raw candidates to 97 geometrically feasible regions across seven molecules, including human telomerase complexes. Together, the notation and TripleMatcher provide a concise route from secondary structure to a small, interpretable set of triple-helix candidates suitable for targeted experimental validation.

Keywords: non-coding RNA, long non-coding RNA, RNA pattern search, RNA secondary structure, RNA structure prediction

Introduction

RNA molecules adopt diverse secondary structures, such as stem-loops, bulges, G-quadruplexes, and pseudoknots, that regulate biosynthesis, stability, localization, and molecular interactions [1–3]. Experimental and computational approaches have linked structures to function [4–7]; chemical probing, nuclear magnetic resonance (NMR), and comparative analysis have clarified folding and dynamics [8]. Yet capturing how these structures adapt to varying cellular conditions and how they modulate interactions with RNA or protein partners remains a significant challenge.

The role of secondary structures is particularly crucial for non-coding RNAs (ncRNAs), whose functionality extends beyond their linear nucleotide sequences and lack of protein-coding capacity. Long non-coding RNAs (lncRNAs) can act as scaffolds, bringing proteins and RNAs into proximity within nuclear and cytoplasmic compartments [9]. Deciphering the intricate folding patterns of these RNAs is essential for understanding their biological functions and could open avenues for therapeutic interventions in various diseases.

A significant example is the polyadenylated nuclear (PAN) RNA from Kaposi’s sarcoma-associated herpesvirus (KSHV); PAN RNA is abundant during the lytic phase, representing a large fraction of the cell’s polyadenylated RNA [10, 11]. Its expression and nuclear retention element (ENE) forms a triple-helix with the poly(A) tail via U·A–U interactions, protecting PAN from exonucleolytic decay [12–15]. Another paradigmatic case is the Metastasis-Associated Lung Adenocarcinoma Transcript 1 (MALAT1), a multifunctional lncRNA with diverse roles in health and disease [16]. Conserved structural domains mediate interactions and nuclear localization [16, 17], and a conserved triple-helix underlies its stability and nuclear accumulation [18, 19].

Beyond lncRNAs, triple helices are widespread in viral RNAs, riboswitches, catalytic ribozymes, and telomerase [17]. First observed by Felsenfeld et al. as U·A–U Hoogsteen triples and later confirmed in tRNA and telomerase RNA [20–22], they contribute to RNA stability and regulation. In telomerase RNA, conserved U·A–U triples stabilize the pseudoknot required for catalytic activity [23–26], and disrupting these triples impairs telomerase function [24, 27, 28].

Despite their biological significance, triple helices remain difficult to detect: their identification relies primarily on crystallographic and NMR studies, and computational methods are largely absent. Furthermore, standard notations for RNA secondary structure do not easily represent triple helices, complicating computational analyses [29, 30].

Here, we present TripleMatcher, a computational framework for identifying and characterizing RNA triple helices from sequence and secondary structure. First, we extend the dot-bracket notation to encode Hoogsteen interactions. Then, we implement a four-stage workflow: (i) structural characterization from experimentally determined 3D RNA structures; (ii) secondary-structure prediction; (iii) development and validation of TripleMatcher, a search tool to detect putative triple-helix regions; (iv) identification and reliability assessment of triple helices. To our knowledge, TripleMatcher is the first tool specifically designed to identify RNA–RNA–RNA major-groove triple helices, whereas previous computational methods focus on RNA–DNA triplex formation [31–33].

Our ultimate goal is to enable the identification of tertiary structural motifs directly from nucleotide sequence via predicted secondary structure. Integrating these higher-order interactions may help bridge the gap between experimental and predicted RNA structural analyses, particularly for lncRNAs, where experimentally resolved structures are limited.

Materials and methods

Validation dataset

We selected RNAs with experimentally determined structures containing triple helices, focusing on telomerase RNAs and two lncRNAs, MALAT1 and PAN (Fig. 1 and Supplementary Table S1). Major-groove interactions were prioritized, while minor-groove interactions and intermolecular triple helices (e.g. NEAT1 [34]) were excluded. As an experimental design choice, we restricted the analysis to unimolecular RNAs so that experimental structures and predicted secondary structures could be treated in a consistent way.

Figure 1. Workflow used to prepare and annotate RNA triple helices, with RNA secondary and 3D structures highlighting triple-helix regions and strand interactions. (A) Four-stage workflow for preparing and annotating RNA triple helices: example shown for MALAT1. (1) Primary and secondary structures were extracted from reference publications and manually verified by two independent curators; only RNAs with an uninterrupted nucleotide sequence and an experimentally resolved triple-helix region in the Protein Data Bank (PDB) entry were retained. (2) Residues were renumbered sequentially; missing segments were modeled with ModeRNA or RNAComposer and merged with experimental coordinates to obtain complete hybrid models. (3) Nucleotides forming the triple-helix were annotated from the secondary structure and 3D model, identifying the major-groove face of the WCF base-paired region (WCF region in red) and the orientation of the third strand (in blue). We further distinguish the two strands of the WCF region, naming *First* the side of the major groove where Hoogsteen interactions occur in known RNA triple helices and *Second* the opposite side of the same WCF stack. The resolved MALAT1 core was complemented by modeled flanking segments. (4) After creating the final hybrid models, the triple-helix is geometrically characterized by measuring C1′–C1′ distances from each third-strand nucleotide to the WCF base-paired region on (i) the interacting major-groove face (*First*) and (ii) the opposite face (*Second*). From the overall 3D structure (third strand, blue; WCF region, red), the zoom of the triple-helix region highlights the First distances (shorter, green dashed lines) and Second distances (longer, yellow dashed lines). (B) Secondary structures for the validation set with triple-helix nucleotides highlighted: red, WCF base-paired region (major-groove face); blue, third strand engaged in Hoogsteen interactions.

MALAT1 is a nuclear lncRNA involved in gene regulation and cancer progression [19, 35]. Its 3′ end forms a bipartite triple-helix composed of two runs of U·A–U base triples, interrupted by a C·G–C triplet and a C–G doublet that induce a “helical reset,” realigning the strands and preventing steric clashes [19]. Two A-minor interactions engage adjacent G–C base pairs [19, 35]. The core region is available as PDB entry 4PLX [19, 36].

PAN RNA from KSHV forms two unimolecular triple helices [15]. The “PAN core triple-helix” features a shortened apical P2 helix, designed to promote triple-helix formation and stability, with mild protection against exonucleolytic degradation [14, 15]. The engineered “GCPAN triple-helix” adds a GC clamp at the ENE base, anchoring the A-rich sequence to the lower base-paired region and increasing exonuclease resistance [15]. High-resolution PDBs for these unimolecular conformations were derived from 3P22 and 6X5N.

Telomerase RNA (TER) provides the scaffold for ribonucleoprotein assembly and the template for telomere elongation. TER is conserved and contains essential features, including a conserved triple-helix in the pseudoknot that supports catalytic activity and stability [24, 37–39]. In Kluyveromyces lactis (K. lactis), the pseudoknot junction forms a triple-helix stabilized by C·G–C and U·A–U base triples, with bound divalent cations [38]. In Homo sapiens, the wild-type pseudoknot (2K95) includes a U41 bulge (U177 in [39]) that modulates catalysis; the deletion mutant (2K96/1YMO) lacks U177 but shares the network of base triples and a minor-groove A·G–C triple.

For all RNAs in the validation set, complete 3D models were generated by combining the available high-resolution segments with computationally reconstructed regions. The triple-helix core was always taken from experimentally resolved structures; only surrounding elements were modeled, if needed, to restore continuity, with ModeRNA [40] and RNAComposer [41] (Fig. 1A).

Triple helices characterization

Secondary structure validation

Experimental secondary structure was obtained from the corresponding reference publication. When PDB structures were available, we extracted sequence and base-pairing information using RNAView through RNApdbee [42] and manually adjusted the output to match the published models (Fig. 1A). We annotated all base pairs and third-strand nucleotides forming the triple-helix and, when applicable, marked the major-groove face involved in Hoogsteen base pairing (Fig. 1B).

Augmented dot-bracket notation for Hoogsteen pairs

Dot-bracket notation [30] represents base pairs and pseudoknots, but not non-canonical interactions. To encode Hoogsteen contacts, we propose a third annotation line where unpaired third-strand nucleotides are marked with lowercase letters (e.g. z, x, y, w, v), and the facing nucleotides in the Watson–Crick–Franklin (WCF) base-paired region are marked with the matching uppercase letters (Z, X, Y, W, V). When a third-strand nucleotide contacts both bases of a pair, the uppercase symbol appears twice; all other positions carry a dash (-). This scheme is flexible and can accommodate other multi-nucleotide interactions. All augmented notations for the Validation Dataset are reported in Supplementary Table S4.
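As an illustration of the third annotation line, the following minimal Python sketch builds it from a list of base triples. The function name `hoogsteen_line`, the 0-based indexing, the order in which letters are assigned, and the toy 19-nucleotide hairpin are all assumptions made for this example; they are not taken from TripleMatcher or from Supplementary Table S4.

```python
def hoogsteen_line(length, triples):
    """Build the third annotation line for an RNA of `length` nucleotides.

    `triples` is a list of (third, partners) tuples with 0-based indices:
    `third` is the unpaired third-strand nucleotide, `partners` holds the one
    or two nucleotides of the WCF base pair that it contacts.
    """
    labels = "zxywv"  # assumed letter order; letters are reused cyclically
    line = ["-"] * length  # every position not in a triple carries a dash
    for k, (third, partners) in enumerate(triples):
        letter = labels[k % len(labels)]
        line[third] = letter  # lowercase marks the third strand
        for p in partners:
            line[p] = letter.upper()  # uppercase marks the facing WCF base(s)
    return "".join(line)


# Hypothetical toy hairpin: A0-A3 pair with U8-U11, U15-U18 act as third strand.
toy_sequence = "AAAACCCCUUUUGGGUUUU"
toy_structure = "((((....))))......."
toy_triples = [(15, (0,)), (16, (1,)), (17, (2,)), (18, (3,))]
toy_annotation = hoogsteen_line(len(toy_sequence), toy_triples)

print(toy_sequence)
print(toy_structure)
print(toy_annotation)
```

Printing the three lines one under the other gives the sequence, the standard dot-bracket line, and the Hoogsteen line in register, so each lowercase letter can be read against its uppercase partner.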

Three-dimensional atomic distance assessment

For each base triple, we measured the Euclidean distance (in Å) between the C1′ atom of the third-strand nucleotide and the C1′ atoms of each nucleotide in the corresponding WCF base pair. To quantify triple-helix geometric consistency, we computed a localized RSI₂ [43] from the distribution of these C1′–C1′ distances. Lower RSI₂ values indicate high geometric regularity of the triple-helix, while higher values reflect structural divergence relative to typical base-pair spacing.
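The distance measurement itself is a plain Euclidean norm between C1′ coordinates. The sketch below is a minimal illustration under stated assumptions: coordinates are given as (x, y, z) tuples in Å, the function names are hypothetical, and the example coordinates are invented. It does not compute RSI₂, whose definition is given in [43].

```python
import math


def c1_distance(atom_a, atom_b):
    """Euclidean distance (in Å) between two C1' atoms given as (x, y, z)."""
    return math.dist(atom_a, atom_b)


def triple_distances(third, wcf_first, wcf_second):
    """Distances from the third-strand C1' to both C1' atoms of a WCF pair.

    `wcf_first` is the base on the interacting major-groove face and
    `wcf_second` the base on the opposite strand of the same pair.
    """
    return {
        "First": c1_distance(third, wcf_first),
        "Second": c1_distance(third, wcf_second),
    }


# Invented coordinates, for illustration only.
example = triple_distances(
    third=(0.0, 0.0, 0.0),
    wcf_first=(3.0, 4.0, 0.0),
    wcf_second=(0.0, 0.0, 12.0),
)
```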

Secondary structure prediction

For each RNA in the validation dataset, we predicted secondary structures using eight folding tools: CentroidFold [44], IPknot++ [45], Mfold [46], pKiss [47], RNAfold [48], RNAshapes [47], RNAstructure [49], and vsfold5 [50].

Predicted structures were compared against experimentally refined secondary structures using standard evaluation metrics: true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), Fowlkes–Mallows index (FM), Matthews correlation coefficient (MCC), and accuracy (ACC) [43].

We also computed ASPRA [51, 52] and SERNA [53, 54] distances, which quantify the dissimilarity between abstractions of secondary structures by minimizing the insertions, deletions, or substitutions needed to align a prediction $P$ to the reference $R$. ASPRA is based on structural tree alignment, while SERNA adapts edit distance to structural sequences. We normalized both distances to $[0, 1]$, using the maximum value observed across predictions for each tool, and we derived normalized similarities for each structure $P$, defined as:

$$\mathrm{sim}_D(P) = 1 - \frac{D(P, R)}{D_{\max}}, \qquad D \in \{\mathrm{ASPRA}, \mathrm{SERNA}\},$$

where $D_{\max}$ is the maximum distance observed across predictions for the corresponding tool.

Metrics were computed per RNA and averaged within telomerase and stability element RNAs to assess method-specific performance trends. Predicted structures and the data used for metric computation are available on Zenodo [55].
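The normalization of the ASPRA and SERNA distances can be sketched as follows. This is a minimal illustration that assumes the similarity is one minus the distance divided by the maximum distance observed for a tool; the function name and the example values are hypothetical and do not reproduce any result of the study.

```python
def normalized_similarities(distances):
    """Convert raw distances to similarities in [0, 1].

    `distances` maps a structure identifier to its raw distance from the
    reference, for the predictions of a single folding tool.
    Assumption: similarity = 1 - distance / maximum observed distance.
    """
    d_max = max(distances.values())
    if d_max == 0:
        # All predictions are identical to their references.
        return {name: 1.0 for name in distances}
    return {name: 1.0 - d / d_max for name, d in distances.items()}


# Invented distances for three hypothetical predictions of one tool.
example_similarities = normalized_similarities({"rna_a": 0, "rna_b": 5, "rna_c": 10})
```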

Triple helices identification

TripleMatcher architecture

We developed TripleMatcher, a Java-based tool composed of two modules: the Matcher and the optional 3DFilter [56]. The Matcher scans secondary structures to detect regions consistent with major-groove triple-helix motifs, defined as a stretch of canonical WCF base pairs (e.g. A–U or C–G) facing a run of unpaired nucleotides (commonly U or C) capable of forming Hoogsteen interactions. This pattern is specific to RNA–RNA–RNA triple helices and is not intended to capture other tertiary motifs. The Matcher operates directly on sequence and standard secondary structure; 3D information is not required at this stage. When a candidate is found, the tool produces an augmented dot-bracket notation that marks Hoogsteen interactions for visualization. Figure 2A shows the architecture of the tool.

Figure 2. TripleMatcher pipeline and distributions of the C1′–C1′ distances used to distinguish feasible from infeasible RNA triple-helix interactions. (A) Schematic representation of the TripleMatcher tool. The Matcher scans RNA primary and secondary structures to report 2D-matches; the 3DFilter applies C1′–C1′ distance thresholds to keep only geometrically feasible triples (3D-matches); the ZoneCombiner (ZC) merges overlapping matches into non-overlapping zones. (B) C1′–C1′ distance and RSI₂ distributions for *First* (distance between the third strand and the interacting WCF base on the major-groove face), *Second* (distance between the third strand and the opposite WCF base), and *Double* (distance in canonical WCF pairs). The separation between categories motivated the 3DFilter cutoff (default 11 Å) used to discard geometrically infeasible candidates. Each point denotes one base pair.

The Matcher uses a dynamic-programming algorithm to identify 2D-matches, defined as pairs of regions in the secondary structure: one unpaired segment acting as a potential third strand and one continuous WCF stack (see Supplementary Table S2). Both regions must satisfy constraints on minimum length, pairing continuity, and the number of tolerated mismatches or indels, as specified by the user-defined options (see Table 1). By default, submatches are not reported, to limit the number of 2D-matches generated, but they can be enabled with the -a option.
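To make the notion of a 2D-match concrete, the following Python sketch pairs every run of unpaired nucleotides with every stack of consecutive base pairs of the requested type. It is a deliberately simplified, exact-match illustration and not the dynamic-programming algorithm of the Matcher: no mismatches, indels, or interruptions are tolerated, a base pair type such as UA is assumed to match in either orientation, and all function names and the toy input are hypothetical.

```python
def pair_table(dot_bracket):
    """Map each paired index to its partner; supports (), [], {} and <>."""
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    open_positions = {opener: [] for opener in closers.values()}
    partner = {}
    for i, ch in enumerate(dot_bracket):
        if ch in open_positions:
            open_positions[ch].append(i)
        elif ch in closers:
            j = open_positions[closers[ch]].pop()
            partner[i], partner[j] = j, i
    return partner


def unpaired_runs(seq, partner, nucleotide, min_len):
    """Maximal runs of unpaired `nucleotide`, as inclusive (start, end)."""
    runs, start = [], None
    for i in range(len(seq) + 1):
        ok = i < len(seq) and seq[i] == nucleotide and i not in partner
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs


def base_pair_stacks(seq, partner, base_pair, min_len):
    """Maximal stacks of consecutive pairs matching `base_pair` (either orientation)."""
    wanted = {base_pair, base_pair[::-1]}
    stacks, current = [], []
    for i in sorted(partner):
        j = partner[i]
        if i > j:
            continue  # visit each pair once, from its 5' nucleotide
        good = seq[i] + seq[j] in wanted
        if good and current and current[-1] == (i - 1, j + 1):
            current.append((i, j))
        else:
            if len(current) >= min_len:
                stacks.append(current)
            current = [(i, j)] if good else []
    if len(current) >= min_len:
        stacks.append(current)
    return stacks


def find_2d_matches(seq, dot_bracket, nucleotide="U", base_pair="UA", min_len=4):
    """Return every (third-strand run, WCF stack) combination."""
    partner = pair_table(dot_bracket)
    runs = unpaired_runs(seq, partner, nucleotide, min_len)
    stacks = base_pair_stacks(seq, partner, base_pair, min_len)
    return [(run, stack) for run in runs for stack in stacks]


# Hypothetical toy hairpin with a 3' U-rich tail.
toy_matches = find_2d_matches("AAAACCCCUUUUGGGUUUU", "((((....)))).......")
```

The default arguments mirror the defaults of the -n, -b, and -ml options; the tolerance options have no counterpart in this sketch.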

Table 1. Matcher and 3DFilter command-line options and their descriptions. For further details, see [56].

Matcher option	Description	Default value
`-n <unpaired-nucleotide>`	Nucleotide type for third strand (U, A, C, or G)	U
`-b <canonical-base-pair>`	Canonical base pair type (AU, UA, GC, or CG)	UA
`-ml <pattern-minimum-length>`	Minimum number of consecutive matches to detect	4
`-st <sequence-tolerance>`	Allowed mismatches/insertions/deletions in the unpaired sequence	1
`-bt <base-pair-tolerance>`	Allowed mismatches/insertions/deletions in the base-pair sequence	1
`-pt <paired-tolerance>`	Maximum number of base-paired nucleotides in unpaired region	1
`-ct <consecutive-tolerance>`	Maximum allowed interruptions in base-pair continuity	1
`-p <pseudoknot-tolerance>`	Maximum allowed bonds that are not part of a pseudoknot	Optional
`-a <allow-all-submatches>`	Find all sub-matches of exact bond matches	Optional
3DFilter option	Description	Default value
`-t <tolerance>`	Tolerance (in Å) added to the default maximum distance of 11 Å	0


When a 3D structure is available, the 3DFilter checks spatial feasibility of each 2D-match by measuring the atomic distances between the third strand nucleotides and the corresponding WCF base pairs (see Supplementary Table S3). Matches exceeding the empirically calibrated thresholds are discarded. A 3D-match is therefore a 2D-match that satisfies these geometric constraints.
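The filtering step can be sketched as a threshold on precomputed distances. The example below is a minimal illustration, assuming that the cutoff (default 11 Å plus the user tolerance) is applied to the distance between the third-strand nucleotide and the interacting WCF base; the actual thresholds are those of Supplementary Table S3. The data layout, the key `first_distance`, and the numeric values are hypothetical.

```python
DEFAULT_MAX_DISTANCE = 11.0  # Å, default 3DFilter cutoff


def filter_match(match, tolerance=0.0):
    """Keep only the base triples of a 2D-match that pass the distance cutoff.

    `match` is a list of dicts, one per candidate base triple, each holding a
    precomputed C1'-C1' distance under the (assumed) key "first_distance".
    Returns the retained triples, or None when no triple survives.
    """
    cutoff = DEFAULT_MAX_DISTANCE + tolerance
    kept = [triple for triple in match if triple["first_distance"] <= cutoff]
    return kept or None


# Invented candidate triples, for illustration only.
candidate = [
    {"third": 15, "pair": (0, 11), "first_distance": 9.6},
    {"third": 16, "pair": (1, 10), "first_distance": 12.4},
]
strict = filter_match(candidate)
relaxed = filter_match(candidate, tolerance=2.0)
```

With the default tolerance only the first triple is retained, which mirrors the case of a 3D-match that contains fewer base triples than the 2D-match it derives from.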

Finally, both 2D- and 3D-matches can be aggregated into non-overlapping zones using the independent ZoneCombiner (ZC) module, which groups nearby matches within the same RNA molecule. This reduces redundancy and highlights broader structural regions likely to host a triple-helix, even when detected as several nearby matches by the Matcher or 3DFilter.
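Zone aggregation can be illustrated by merging matches that share nucleotides. The sketch below is an assumption-laden simplification: each match is reduced to the set of its nucleotide indices, two matches belong to the same zone only if they share at least one index, and the handling of merely nearby matches is not modeled. The function name and the example sets are hypothetical.

```python
def combine_zones(matches):
    """Merge overlapping matches into non-overlapping zones.

    `matches` is an iterable of collections of nucleotide indices. Zones are
    kept pairwise disjoint: a new match absorbs every zone it overlaps.
    """
    zones = []
    for match in matches:
        merged = set(match)
        untouched = []
        for zone in zones:
            if zone & merged:
                merged |= zone  # absorb the overlapping zone
            else:
                untouched.append(zone)
        zones = untouched + [merged]
    return zones


# Invented matches: the first two share nucleotide 3, the third is separate.
example_zones = combine_zones([{1, 2, 3, 10, 11}, {3, 4, 12}, {20, 21, 30}])
```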

TripleMatcher validation

We evaluated TripleMatcher by comparing its predicted 2D-matches against the experimentally annotated base triples for the validation dataset. Let $T$ denote the set of annotated base triples (each consisting of a WCF base pair and one third-strand nucleotide forming a Hoogsteen interaction). For each base triple $t \in T$, we define a true positive (TP) if there exists at least one predicted 2D-match in which all three nucleotide positions involved in $t$ are present. If no such match exists for $t$, we record a false negative (FN). Conversely, any predicted 2D-match that does not fully cover any base triple in $T$ is counted as a false positive (FP). We do not define true negatives (TNs), since we cannot exhaustively list every RNA segment incapable of forming a triple-helix.
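These counting rules translate directly into set containment tests. The following sketch assumes that each annotated base triple is given as three nucleotide positions and each predicted 2D-match as the set of all positions it covers; the function name and the example data are hypothetical.

```python
def score_matches(annotated_triples, predicted_matches):
    """Count TP, FN, and FP following the base-triple-level criteria.

    TP: annotated triple fully contained in at least one predicted match.
    FN: annotated triple contained in no predicted match.
    FP: predicted match that fully covers no annotated triple.
    """
    triples = [frozenset(t) for t in annotated_triples]
    matches = [set(m) for m in predicted_matches]
    tp = sum(1 for t in triples if any(t <= m for m in matches))
    fn = len(triples) - tp
    fp = sum(1 for m in matches if not any(t <= m for t in triples))
    return tp, fn, fp


# Invented annotation and predictions, for illustration only.
example_counts = score_matches(
    annotated_triples=[(0, 11, 15), (1, 10, 16), (2, 9, 40)],
    predicted_matches=[{0, 1, 10, 11, 15, 16}, {30, 31, 32}],
)
```

In this toy case the first match covers two annotated triples, the third triple is missed, and the second match covers none, giving two TPs, one FN, and one FP.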

We quantified the localization accuracy (LA) by comparing the centers of the predicted and annotated regions. Let the RNA length be $N$. For the annotated triple-helix $A$ and a predicted 2D-match $M$, let $S_A, S_M$ be the sets of nucleotide indices belonging to the third strand and $D_A, D_M$ the sets of indices belonging to the WCF double strand. For $X \in \{A, M\}$, we define the normalized centers:

$$c_S(X) = \frac{1}{N\,|S_X|} \sum_{i \in S_X} i, \qquad c_D(X) = \frac{1}{N\,|D_X|} \sum_{i \in D_X} i.$$

The LA score of $M$ relative to $A$ is then

$$\mathrm{LA}(M, A) = 1 - \frac{|c_S(M) - c_S(A)| + |c_D(M) - c_D(A)|}{2},$$

so that $\mathrm{LA} \in [0, 1]$; $\mathrm{LA} = 1$ corresponds to perfect alignment of predicted and annotated centers, and smaller values indicate poorer localization.

Since the Matcher does not test spatial feasibility, we re-scored its output after applying the optional 3DFilter. A 3D-match can contain fewer base triples than the original 2D-match or be excluded entirely if no geometrically valid triples remain. For the 3DFilter output, we applied the same base-triple-level criteria, defining: (i) TP as a retained predicted base triple overlapping an annotated one; (ii) FN as an annotated base triple with no retained match; and (iii) FP as a retained predicted triple with no corresponding annotation.

For both modules, we computed: (i) PPV, measuring the proportion of predicted base triples that are correctly identified; (ii) TPR, measuring the fraction of true base triples recovered; (iii) FM index, the geometric mean of PPV and TPR; (iv) F1-score, the harmonic mean of PPV and TPR; (v) LA, i.e. the center alignment; and (vi) 3DFilter efficiency, defined as the percentage of FP base triples removed by the filter while retaining all TP.
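The count-based metrics follow from their standard definitions. The sketch below is a minimal illustration; the function name, the convention of returning 0 when a denominator is empty, and the example counts are assumptions and do not correspond to results reported in this work.

```python
import math


def detection_metrics(tp, fp, fn):
    """PPV, TPR, FM index, and F1-score from base-triple counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity
    fm = math.sqrt(ppv * tpr)  # geometric mean of PPV and TPR
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0  # harmonic mean
    return {"PPV": ppv, "TPR": tpr, "FM": fm, "F1": f1}


# Invented counts, for illustration only.
example_metrics = detection_metrics(tp=3, fp=1, fn=3)
```

Because the geometric mean is never smaller than the harmonic mean, the FM index is always at least as large as the F1-score for the same PPV and TPR.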

TripleMatcher usage: search dataset

We applied TripleMatcher to a large-scale search dataset composed of RNA secondary structures from the PhyloRNA database (https://bdslab.unicam.it/phylorna [57]) and the PDBe archive [58]. In total, we analyzed 4160 RNA structures, including ribosomal RNAs (2897), transfer RNAs (1208), telomerase RNAs (32), and group I and II introns (23); 2210 of these contain pseudoknots. We selected structures with 3D models available to enable full usage of the 3DFilter module.

The Matcher scanned each secondary structure for patterns compatible with triple-helix motifs. Then, the 3DFilter module was used to verify the spatial feasibility of each match.

TripleMatcher usage: predicted secondary structure

To assess the applicability of TripleMatcher in a fully computational setting, we applied the tool to the structures predicted for each RNA in the validation dataset (see Section "Secondary structure prediction").

Predicted 2D-matches were evaluated against the annotated triple-helix regions using the same base-triple-level criteria described above. We computed the following metrics: (i) the number of TPs recovered per folding algorithm; (ii) the false positive rate (FPR), defined as the number of predicted base triples outside the annotated region; and (iii) structure-wise detection rate defined as the fraction of RNAs for which at least one TP was recovered.

TripleMatcher usage: predicted 3D structure

We applied TripleMatcher in two distinct in silico scenarios. In the first, we kept the experimental secondary structure and paired it with predicted 3D models to test the behavior of the 3DFilter. In the second, we used secondary structures extracted directly from predicted 3D models to evaluate a fully computational workflow. For each RNA in the validation set, we generated models with RhoFold+ [59], 3dRNAv2.0 [60], RNAComposer [41], and Farfar2 [61], and derived their 2D structures using RNAView [62].

Statistical analysis

We used the Wilcoxon signed-rank test to compare atomic distances and RSI₂ values across three categories of interactions in RNA triple helices: (i) between the third strand and the interacting side of the WCF double strand (First), (ii) between the third strand and the opposite side of the major groove (Second), and (iii) between nucleotides within the canonical WCF base pair (Double).

For evaluation metrics (TP, FP, TPR, PPV, F1-score, FM index, LA, and 3DFilter efficiency), per-RNA values were reported as median and interquartile range (IQR: 25th–75th percentile). Differences across groups, such as RNA type or folding tool, were assessed using the Wilcoxon rank-sum test. All statistical analyses were performed in MATLAB R2021b. Two levels of statistical significance were considered, marked (*) and (**).