Abstract
Since the 1980s, dozens of computational methods have addressed the problem of predicting RNA secondary structure. Among them are those that follow standard optimization approaches and, more recently, machine learning (ML) algorithms. The former have been repeatedly benchmarked on various datasets. The latter, on the other hand, have not yet undergone extensive analysis that could suggest to the user which algorithm best fits the problem to be solved. In this review, we compare 15 methods that predict the secondary structure of RNA, of which 6 are based on deep learning (DL), 3 on shallow learning (SL), and 6 are control methods based on non-ML approaches. We discuss the ML strategies implemented and perform three experiments in which we evaluate the prediction of (I) representatives of the RNA equivalence classes, (II) selected Rfam sequences and (III) RNAs from new Rfam families. We show that DL-based algorithms (such as SPOT-RNA and UFold) can outperform SL and traditional methods if the data distribution is similar in the training and test sets. However, when predicting 2D structures for new RNA families, the advantage of DL is no longer clear, and its performance is inferior or equal to that of SL and non-ML methods.
Keywords: RNA 2D structure prediction, machine learning, deep learning, algorithm benchmarking
INTRODUCTION
The world of RNA molecules is characterized by a wide variety of functions and structures [1–5]. The two remain closely related: the structure sets the foundation for RNA interactions with other molecules in the organism, thereby conditioning its function. Therefore, the determination of primary, secondary and tertiary structures is the starting point for many studies in contemporary molecular biology. It is performed experimentally, computationally or, most often, using both approaches in parallel.
Sequence-based prediction of pairings in an RNA molecule is one of the mainstream problems in structural bioinformatics and an attractive alternative to the experimental acquisition of knowledge about the secondary structure of RNA [6–8]. Knowing the latter makes it possible to study the properties of molecules, predict the spatial folding of RNA chains, infer the similarity of structures, evaluate 3D structure prediction algorithms, etc. [9–14]. The first computational algorithms for 2D structure prediction of RNA date back to the 1980s [15, 16]. They applied dynamic programming (DP) and focused exclusively on canonical Watson-Crick-Franklin (WCF) interactions. Since then, dozens of algorithms have been developed to predict RNA base pairs. They followed various strategies, such as ab initio modeling, comparative methods, hybrid approaches or, more recently, machine learning [17]. Machine learning (ML) entered the prediction of 2D RNA structure as early as the 1990s, with simple neural networks and linear regression [18–20]. Recent ML predictors use deep learning (DL), following the progress in the development of DL models and their spectacular results in many areas, such as computer vision (CV) [21, 22], natural language processing (NLP) [23, 24] or the life sciences [25, 26]. The achievements of DL led to enormous expectations of its effectiveness in solving any problem. At the same time, awareness of the limitations of DL methods has increased. One of them is the hunger for data: all known successful CV and NLP models were developed using large training datasets. In 2D RNA structure prediction, collecting enough training data is a problem that can limit the potential of DL. In fact, the set of reliable, experimentally confirmed secondary structure data remains small.
Non-ML predictors of the secondary structure of RNA have been benchmarked and assessed for the quality of their results in a multitude of computational experiments [27–29]. For ML approaches, the focus has mainly been on theoretical analysis [17], describing the differences between the strategies used. Only two works attempted to validate selected DL and traditional approaches based on thermodynamic models [30, 31]. They underlined the difficulty of reliably evaluating ML-based methods due to the lack of common training/test sets and training routines. Although the authors of several algorithms created and shared their datasets [32], only a few competing methods have used them since [33, 34]. Training procedures were not published for all algorithms. Some did not fix the random seed used to split the training and test sets and to initialize the weights of the neural network, so re-trained models may not converge to the original state. As a consequence, the published results of many algorithms are not reproducible and are hard to rely on in a comparative analysis. A possible remedy for this problem is to train the algorithms on one’s own data before testing them, provided that the training procedure is available. This was done in [30, 31], where the authors re-trained the DL models to assess their generalization across various RNA families. They pointed out the problem of bias in the training data, which resulted in higher performance of the ML models if all RNA families were included in the training; otherwise, the prediction quality dropped significantly.
This study is geared toward the user perspective. We test the predictive power of various ML algorithms dedicated to the 2D structure of RNA, compare them to several non-ML methods (both single- and multiple-sequence approaches), and evaluate their suitability for the end user. We explain the rules behind various ML approaches and show their advantages and disadvantages. Without retraining the ML models, we verify their effectiveness in predicting canonical, non-canonical, and pseudoknot interactions. We benchmark 15 algorithms divided into three groups: (a) DL – 6 deep learning algorithms based on deep neural networks, (b) SL – 3 shallow learning methods, which rely on shallow neural networks (up to 3 layers), linear regression and SVMs, and (c) non-ML – 3 traditional approaches applying optimization strategies such as dynamic programming, and 3 approaches using multiple sequence alignment (MSA). According to the literature, more ML-based predictors of the 2D RNA structure have been developed, such as DMFold [35], CDPfold [36], TORNADO [37], Pfold [38] and others [39–45]. For comparison, we select only those for which models and inference scripts are available and run without errors. We evaluate algorithms based on their predictions for representatives of RNA equivalence classes, selected Rfam sequences, and new RNA families. Data for the experiments come from the RNAsolo [46], Rfam [47] and BGSU [48] databases.
METHODS
From all the available ML-based methods that predict the secondary structure of RNA, we selected those that were available for local installation and executed without errors. The collection includes 6 DL algorithms (SPOT-RNA [32], SPOT-RNA2 [49], MXfold2 [33], UFold [34], RNA-state-inf [50], E2efold [51]) and 3 SL methods (MXfold [52], Contextfold [53] and CONTRAfold [54]). Furthermore, 3 traditional non-ML algorithms operating on a single input sequence, IPknot [55], RNAfold [56] and RNAstructure [57], and 3 multiple-sequence approaches, RNAalifold [58], TurboFold II [59] and R-scape [60], are used for comparison.
The basic features of the machine learning algorithms compared in this work are shown in Table 1. First, we give information on the sources of data used to train each model. These include archives and databases such as ArchiveII [61], bpRNA [62], Comparative RNA Web [63], Protein Data Bank [64], Rfam [47], RNAStralign [59] and RNA STRAND [65]. The training data compiled from these resources contained only secondary RNA structures. These were 2D structures obtained experimentally, consensus-built and accurately annotated structures, and 2D structures derived from experimental tertiary structures. All algorithms, except for two, provide their training sets. Next, we state which strategies are used during the prediction. Four algorithms – E2efold, SPOT-RNA, SPOT-RNA2 and UFold – implement the end-to-end approach, in which the secondary structure is predicted by neural networks directly from the sequence. The other algorithms apply machine learning to infer intermediate structural data. Note that SPOT-RNA2 is the only algorithm in the pool to implement multiple sequence alignment; the others operate on a single input sequence. CONTRAfold uses a conditional log-linear model (CLLM), a probabilistic technique, to predict features such as hairpin length, internal loop length, free bases, etc. These features then constitute the input to the Viterbi algorithm [66] that predicts the secondary structure. RNA-state-inf applies DL to infer SHAPE reactivities [67] and runs GTfold [68] to predict base pairs from the RNA sequence and nucleotide reactivity scores. The three remaining ML-based methods generate weighted scores that approximate the total energy based on various structural characteristics. Both versions of MXfold compute a folding score as a weighted thermodynamic energy for each base pair; this score is then used in a Zuker-like [16] dynamic programming algorithm to generate the secondary structure of RNA. MXfold applies Structured Support Vector Machines (SSVM) [69] to train the scoring parameters. In MXfold2, this task belongs to the DNN model, while SSVM optimizes network parameters during training. Finally, Contextfold predicts scores that combine structural and contextual information, such as the stem-closing base pair, unpaired hairpin bases, hairpin length, etc.
Table 1. Basic features of the machine learning algorithms compared in this work.

| Algorithm | Training dataset | ML-based strategy | ML input data representation | ML model type(s) | Prediction by ML model | ML output data representation | Predicted NWCF | Predicted PK |
|---|---|---|---|---|---|---|---|---|
| (a) DL – deep learning-based algorithms | | | | | | | | |
| E2efold | RNAStralign & ArchiveII | End-to-end | 4xL one-hot encoded sequence | Transformer Encoder + 2D CNN | 2D structure | LxL symmetric matrix | + | + |
| MXfold2 | SPOT-RNA & E2efold sets | Weighted | Lxd-dimensional embeddings | 2D CNN + BiLSTM, SSVM | thermodynamic model scoring | 4xLxL matrix of folding scores | - | - |
| RNA-state-inf | Comparative RNA Web | Hybrid | 4xL one-hot encoded sequence | 1D CNN + BiLSTM | SHAPE reactivities | reactivity vector | - | - |
| SPOT-RNA | bpRNA, PDB | End-to-end | 8xLxL one-hot encoded matrix | 2D CNN + BiLSTM | 2D structure | LxL triangular probability matrix | + | + |
| SPOT-RNA2 | bpRNA, PDB | End-to-end | 18xLxL feature matrix | 2D CNN + BiLSTM | 2D structure | LxL triangular probability matrix | + | + |
| UFold | SPOT-RNA, RNAStralign | End-to-end | 16xLxL binary matrix | 2D CNN (UNet architecture) | 2D structure | LxL probability matrix | + | + |
| (b) SL – shallow learning-based algorithms | | | | | | | | |
| Contextfold | RNA STRAND (unavailable) | Weighted | Feature vector | Discriminative Structured-Prediction + Passive-Aggressive Algorithm | feature scoring | weight vector | - | - |
| CONTRAfold | Rfam | Probabilistic | Feature vector | CLLM | 2D structure | log-probability vector | - | - |
| MXfold | Rfam (unavailable) | Weighted | Feature vector | SSVM | thermodynamic model scoring | weight vector | - | - |

NWCF - non-canonical interactions, PK - pseudoknots, L - length of RNA sequence, SSVM - structured support vector machines, CLLM - conditional log-linear model, BiLSTM - bidirectional long short-term memory networks, CNN - convolutional neural networks.
ML models accept input data of various types and representations (Table 1: ML input data representation). In the one-hot encoding of the RNA sequence, each nucleotide is represented as a binary vector of length 4. Its consecutive cells correspond to the four nitrogenous bases: adenine, cytosine, uracil and guanine. If a nucleotide contains a base of a given type, the corresponding cell is set to 1 while the other three cells are zeroed. This representation is used in RNA-state-inf and E2efold. In SPOT-RNA, the 4-element vectors of every pair of nucleotides are merged, which yields an 8 x L x L matrix for the whole sequence. SPOT-RNA2 combines two single-sequence features and two evolutionary-based ones. The one-hot encoded sequence and the base pair probability from LinearPartition [70] are the single-sequence features. The evolutionary-based ones include the position-specific score matrix (PSSM) and the direct coupling analysis (DCA). UFold uses a 16 x L x L matrix, resulting from the Kronecker product transformation of the 4-bit nucleotide vectors. In MXfold2, nucleotides are encoded in a d-dimensional trainable embedding. A sequence of such embeddings is an input to the ML model and has a size of L x d, where d is a hyperparameter set to 64 by default. The ML models operating in MXfold and CONTRAfold accept feature vectors.
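To make these encodings concrete, the snippet below is a minimal sketch (plain NumPy; illustrative code, not taken from any of the benchmarked tools) of the 4xL one-hot representation and of a 16xLxL pairwise map built as the Kronecker (outer) product of the nucleotide vectors, as described above for UFold.

```python
import numpy as np

# order of the cells as described in the text: adenine, cytosine, uracil, guanine
BASES = "ACUG"

def one_hot(seq: str) -> np.ndarray:
    """Encode an RNA sequence as a 4 x L binary matrix (one column per nucleotide)."""
    matrix = np.zeros((4, len(seq)), dtype=np.float32)
    for i, nt in enumerate(seq.upper()):
        if nt in BASES:                       # unknown symbols remain all-zero
            matrix[BASES.index(nt), i] = 1.0
    return matrix

def pairwise_map(seq: str) -> np.ndarray:
    """Build a 16 x L x L map whose channel 4*a+b marks positions (i, j) holding
    bases a and b, i.e. the outer (Kronecker) product of the one-hot vectors."""
    x = one_hot(seq)                          # 4 x L
    pair = np.einsum("ai,bj->abij", x, x)     # 4 x 4 x L x L
    return pair.reshape(16, len(seq), len(seq))

if __name__ == "__main__":
    demo = "GGGAAACCC"
    print(one_hot(demo).shape)                # (4, 9)
    print(pairwise_map(demo).shape)           # (16, 9, 9)
```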
The representation of input and output data is a consequence of the ML model used for prediction (Table 1: ML model type(s)). So, let us take a look at the models operating in the presented study. The core of all benchmarked DL algorithms is the Convolutional Neural Network (CNN) [71], in which RNA is treated as an image and represented as a matrix. E2efold combines a CNN and a Transformer Encoder [72] – a type of neural network successfully applied in natural language models such as BERT [23] or GPT-3 [73]. CNNs alone are used by UFold in the UNet network architecture [74], which commonly serves for image segmentation. The remaining DL algorithms – MXfold2, RNA-state-inf, SPOT-RNA and SPOT-RNA2 – combine a CNN with a bidirectional LSTM [75]. LSTM, the long short-term memory cell, is a recurrent neural network (RNN) [71] designed for sequence processing. In a single-layer LSTM, the information flows from the beginning of the sequence to the end. In the bidirectional one, a second layer provides data flow in the opposite direction. This mechanism allows contextual dependencies to be captured throughout the RNA sequence. Shallow learning methods apply the other ML models mentioned in the previous paragraph: CLLM (conditional log-linear model) and SSVM (structured support vector machines).
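For illustration, the sketch below shows how a 1D CNN can be stacked with a BiLSTM to map a one-hot encoded sequence to a per-nucleotide score, conceptually similar to the hybrid RNA-state-inf pipeline described above. It uses PyTorch; the layer sizes are arbitrary assumptions, not the configuration of any benchmarked tool.

```python
import torch
import torch.nn as nn

class Conv1dBiLSTM(nn.Module):
    """Toy 1D CNN + BiLSTM: one-hot RNA (batch x 4 x L) -> per-nucleotide score (batch x L)."""
    def __init__(self, channels: int = 32, hidden: int = 64):
        super().__init__()
        # 1D convolution captures local sequence context
        self.conv = nn.Sequential(nn.Conv1d(4, channels, kernel_size=5, padding=2), nn.ReLU())
        # bidirectional LSTM propagates information along the sequence in both directions
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # per-position regression head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)                  # batch x channels x L
        h = h.transpose(1, 2)             # batch x L x channels (LSTM expects features last)
        h, _ = self.bilstm(h)             # batch x L x 2*hidden
        return self.head(h).squeeze(-1)   # batch x L

if __name__ == "__main__":
    model = Conv1dBiLSTM()
    x = torch.rand(2, 4, 120)             # two one-hot encoded sequences of length 120
    print(model(x).shape)                 # torch.Size([2, 120])
```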
The rightmost columns of Table 1 show whether the tested algorithms predict non-canonical (NWCF) and pseudoknot-involved (PK) base pairs. Only end-to-end DL methods can learn to infer them. Other approaches carry the limitations of the final-stage algorithms that generate the secondary structure – GTfold, Zuker-like and Viterbi methods predict only canonical base pairs in non-pseudoknotted regions.
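To show why such final-stage algorithms are restricted to nested pairings, below is a minimal Nussinov-style dynamic programming sketch (our illustration, not the actual GTfold or MXfold code) that maximizes the sum of per-pair scores; the recursion combines only non-crossing substructures, so pseudoknots are excluded by construction, and the toy scoring function rewards only Watson-Crick and wobble pairs.

```python
def fold_score(seq: str, score, min_loop: int = 3) -> float:
    """Nussinov-style DP: best total score over nested (pseudoknot-free) structures.
    `score(i, j)` may come from a thermodynamic model or from an ML-derived scorer."""
    n = len(seq)
    dp = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                        # position j left unpaired
            for k in range(i, j - min_loop):           # j paired with k, hairpin >= min_loop
                left = dp[i][k - 1] if k > i else 0.0
                best = max(best, left + dp[k + 1][j - 1] + score(k, j))
            dp[i][j] = best
    return dp[0][n - 1]

# toy scoring: +1 for canonical Watson-Crick and wobble pairs, 0 otherwise
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
seq = "GGGAAAUCCC"
print(fold_score(seq, lambda i, j: 1.0 if (seq[i], seq[j]) in CANONICAL else 0.0))
```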
RESULTS
We present the results of three computational experiments in which we benchmarked 15 algorithms. They were run to predict the secondary structures of RNA sequences divided into three representative subsets prepared by us. Each method was installed and run on a local workstation equipped with an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (6 cores), 16 GB RAM and an Nvidia GeForce RTX 2080 Ti. The GPU was used to evaluate the deep learning models trained by their authors.
Test sets
To build datasets for algorithm evaluation, we downloaded 1646 high-resolution (<3Å) 3D RNA structures deposited in the RNAsolo database [46] in September 2022. With RNApdbee [76, 77], we extracted their sequences and secondary structures and saved them in FASTA and BPSEQ files, respectively. Due to the SPOT-RNA limitation on the length of the input sequence, we removed entries over 500 nts from the collection. We also discarded data that were used to train the algorithms considered. The remaining 1421 RNA sequences were the input to the 2D structure prediction experiments. The corresponding secondary structures were treated as the ground truth in the evaluation of the predictive algorithms.
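For illustration, the sketch below shows how such a reference collection can be parsed and length-filtered. It assumes the standard BPSEQ layout (one line per nucleotide: index, base, index of the paired nucleotide or 0) and the 500-nt threshold mentioned above; it is not the exact preprocessing script used in this study.

```python
from pathlib import Path

def read_bpseq(path: str):
    """Parse a BPSEQ file and return the sequence plus the set of base pairs (i, j), i < j."""
    seq, pairs = [], set()
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue                              # skip header or comment lines
        i, nt, j = int(fields[0]), fields[1], int(fields[2])
        seq.append(nt)
        if j > i:                                 # store each pair once; j == 0 means unpaired
            pairs.add((i, j))
    return "".join(seq), pairs

def keep_for_benchmark(path: str, max_len: int = 500) -> bool:
    """Length filter applied before prediction (SPOT-RNA limits input sequences to 500 nt)."""
    sequence, _ = read_bpseq(path)
    return len(sequence) <= max_len
```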
From the basic collection, we created three test sets that contained (I) representatives of RNA equivalence classes, (II) selected Rfam sequences and (III) representatives of new RNA families. Test set I includes 497 representatives of the equivalence classes created within the BGSU RNA Hub [48]. In the second collection, there are 283 RNAs from Rfam 14.8, whose sequences and their homologs were not present in release 14.2 and earlier (Figure S1 in the Supplementary File presents the distribution of sequences in families). The reason for eliminating the latter is that the 2D structures from Rfam 14.2 were used to train the DL models benchmarked in our experiments. For test set III, we selected representatives of 16 RNA families present in Rfam 14.8 and not included in Rfam 14.2. We verified that no instance in test set II and test set III had a sequence similarity of more than 80% with any member of Rfam 14.2. All these datasets, including sequences and ground truth data, were made available at https://zenodo.org/record/7542063#.Y8j_MnbMJhG.
In each collection, we distinguished three overlapping subsets. The first includes molecules with at least one canonical base pair, the second collects RNAs with at least one non-canonical base pair, and the third contains RNAs with pseudoknots. We independently evaluated the predictive power of the algorithms for each of these pairing categories (Table 2). Note that pseudoknot-involved base pairs can be both canonical and non-canonical. Thus, a canonical pairing in a pseudoknot was considered both as part of the assessment of canonical base pair prediction and as part of the assessment of pseudoknot prediction; analogously for non-canonical base pairs.
Table 2. Number of RNAs in each pairing category per test set.

| Contents of the test set | Test set I | Test set II | Test set III |
|---|---|---|---|
| RNAs with canonical bps | 461 | 283 | 16 |
| RNAs with non-canonical bps | 447 | 283 | 16 |
| RNAs with pseudoknots | 101 | 107 | 5 |
Computational experiments
We conducted three independent experiments to evaluate the performance of 15 selected algorithms in predicting the secondary structure based on the RNA sequence. Different data were used in each experiment: test set I in Experiment I, test set II in Experiment II and test set III in Experiment III. The predicted secondary structures were converted to a common format (BPSEQ) and compared to the ground-truth data (that is, base pairs extracted from the 3D structures).
The accuracy of reproducing base–base interactions was measured in terms of Interaction Network Fidelity (INF) [11]. We determined the INF associated with each run. Given the INF values obtained by an algorithm for all instances in a subset, we calculated the mean of these values and the standard deviation. We computed INF(WCF) for canonical base pairs, INF(ST) for stacking in helices, INF(NWCF) for non-canonical base pairs and INF(PK) for pseudoknot-involved interactions (Table 3). For each experiment, we prepared a violin plot corresponding to the INF values computed for canonical base pairs (see Figures 1–3).
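For reference, INF is the geometric mean of precision (PPV) and sensitivity (STY) computed over the compared sets of interactions [11]. The snippet below is a minimal sketch of this calculation for a single prediction, assuming base pairs are represented as sets of (i, j) index tuples (e.g. read from BPSEQ files as shown earlier).

```python
from math import sqrt

def inf_score(predicted: set, reference: set) -> float:
    """Interaction Network Fidelity: sqrt(PPV * STY) over two sets of base pairs [11]."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)      # correctly reproduced interactions
    ppv = tp / len(predicted)            # precision
    sty = tp / len(reference)            # sensitivity
    return sqrt(ppv * sty)

# toy example: two of three reference pairs recovered, plus one false positive
reference = {(1, 10), (2, 9), (3, 8)}
predicted = {(1, 10), (2, 9), (4, 7)}
print(round(inf_score(predicted, reference), 2))   # 0.67
```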
Table 3. Mean INF ± standard deviation obtained in each experiment for canonical (WCF), stacking (ST), non-canonical (NWCF) and pseudoknot-involved (PK) interactions. The best score in each category of each experiment is marked in bold.

| Algorithm | Exp. I WCF | Exp. I ST | Exp. I NWCF | Exp. I PK | Exp. II WCF | Exp. II ST | Exp. II NWCF | Exp. II PK | Exp. III WCF | Exp. III ST | Exp. III NWCF | Exp. III PK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) DL – deep learning-based algorithms | | | | | | | | | | | | |
| E2efold | 0.26±0.33 | 0.19±0.26 | 0.02±0.07 | 0.01±0.09 | 0.3±0.39 | 0.26±0.31 | 0.02±0.07 | 0.01±0.04 | 0.08±0.16 | 0.03±0.07 | 0.0±0.0 | 0.0±0.0 |
| MXfold2 | 0.82±0.26 | 0.59±0.32 | – | – | 0.81±0.27 | 0.6±0.23 | – | – | 0.65±0.35 | 0.41±0.34 | – | – |
| RNA-state-inf | 0.73±0.31 | 0.41±0.34 | – | – | 0.65±0.31 | 0.49±0.24 | – | – | 0.66±0.34 | 0.39±0.37 | – | – |
| SPOT-RNA | **0.86±0.22** | **0.62±0.31** | **0.22±0.26** | 0.26±0.42 | **0.82±0.25** | **0.64±0.23** | **0.22±0.2** | **0.25±0.44** | 0.69±0.32 | 0.42±0.37 | 0.17±0.24 | 0.2±0.45 |
| SPOT-RNA2 | – | – | – | – | – | – | – | – | 0.83±0.16 | **0.6±0.17** | **0.2±0.2** | 0.0±0.0 |
| UFold | 0.83±0.26 | 0.6±0.32 | 0.11±0.23 | **0.27±0.42** | 0.77±0.25 | 0.58±0.21 | 0.1±0.15 | 0.13±0.34 | 0.72±0.28 | 0.52±0.32 | 0.05±0.13 | **0.4±0.55** |
| (b) SL – shallow learning-based algorithms | | | | | | | | | | | | |
| Contextfold | 0.78±0.29 | 0.57±0.32 | – | – | 0.76±0.29 | 0.57±0.23 | – | – | 0.67±0.3 | 0.41±0.31 | – | – |
| CONTRAfold | 0.77±0.29 | 0.55±0.33 | – | – | 0.63±0.27 | 0.46±0.2 | – | – | 0.76±0.29 | 0.5±0.33 | – | – |
| MXfold | 0.75±0.29 | 0.55±0.33 | – | – | 0.64±0.26 | 0.48±0.21 | – | – | 0.58±0.32 | 0.41±0.38 | – | – |
| (c) Non-ML algorithms | | | | | | | | | | | | |
| IPknot | 0.77±0.26 | 0.56±0.32 | – | 0.1±0.28 | 0.67±0.25 | 0.5±0.19 | – | 0.05±0.22 | 0.74±0.31 | 0.52±0.35 | – | 0.0±0.0 |
| R-scape | – | – | – | – | – | – | – | – | **0.98±0.05** | – | – | – |
| RNAalifold | – | – | – | – | – | – | – | – | 0.86±0.11 | – | – | – |
| RNAfold | 0.75±0.31 | 0.55±0.34 | – | – | 0.67±0.32 | 0.5±0.25 | – | – | 0.67±0.35 | 0.43±0.36 | – | – |
| RNAstructure | 0.69±0.33 | 0.52±0.34 | – | – | 0.64±0.32 | 0.49±0.26 | – | – | 0.66±0.33 | 0.43±0.37 | – | – |
| TurboFold II | – | – | – | – | – | – | – | – | 0.58±0.37 | – | – | – |
Single-sequence algorithms were executed for all sequences in each test collection. The algorithms that apply MSA – SPOT-RNA2, RNAalifold, TurboFold II and R-scape – participated only in Experiment III. They were tested on the sequence alignments provided by the Rfam seed for the tested families, and their predictions were compared with the Rfam reference. Each single-sequence method predicted secondary structures for all input data. As the results show (Table 3), the algorithms performed with varying success. They did best in reproducing canonical interactions in the structure. INF(WCF) values range between 0.6 and 0.98 for all but one algorithm (E2efold); the highest value was obtained by R-scape in Experiment III. Stems are well predicted, which is due to the majority of canonical pairings in these structural elements. At the other extreme are non-canonical base pairs, which none of the tested algorithms handles well; INF(NWCF) values range between 0.0 and 0.22. In Table 3, the best scores in each experiment are marked in bold. Most of the best scores go to the SPOT-RNA algorithm; the remaining ones belong to UFold, SPOT-RNA2 and R-scape.
In Experiments I and II, the DL-based algorithms, such as SPOT-RNA, UFold and MXfold2, scored significantly better than the non-ML and most SL methods. However, the DL technique does not always guarantee success. For example, in Experiment I, both the best and the worst results belong to DL-based algorithms (SPOT-RNA and E2efold, respectively). The prediction accuracy of E2efold and RNA-state-inf is at most comparable to that of non-ML approaches, and in some cases significantly lower. In Experiment III, the performance of the DL algorithms decreases, whereas non-ML and some SL-based predictors maintain a similar level of accuracy. The decrease in accuracy indicates that the ML-based methods do not generalize well to data from an entirely new distribution. Similar conclusions were drawn in [30] and [31]. Experiment III was the most challenging one for machine learning algorithms. On the other hand, it shows the good performance of the MSA-based methods, including those that do not use machine learning. MSA algorithms were benchmarked only in this experiment, due to their computational complexity and the paucity of relevant data in the other experiments. For example, in SPOT-RNA2, searching a sequence database to build a covariance model proved too expensive for larger test sets such as test sets I and II.
Only four algorithms – E2efold, SPOT-RNA, SPOT-RNA2 and UFold – predicted any non-canonical base pairs (Table 3). All of them implement a DL model. However, the INF values are unsatisfactory, ranging between 0.0 (E2efold in Experiment III) and 0.22 (SPOT-RNA in Experiments I and II). Regarding RNA pseudoknots, three of the DL algorithms mentioned above and one non-ML method (IPknot) gave some successful predictions. However, the accuracy in reproducing pseudoknot interactions is as poor as in the case of non-canonical base pairs. In Experiment III, SPOT-RNA2 showed a significant improvement in the prediction of canonical base pairs compared to the first release of SPOT-RNA. However, this improvement is no longer evident in predicting long-range interactions.
CONCLUSIONS
In this work, we analyzed 15 methods for sequence-based prediction of the secondary structure of RNA, available as standalone applications. The collection included six deep learning-based algorithms, three shallow learning-based methods and six non-ML ones. We benchmarked them against each other without retraining and evaluated their performance for various data: representatives of non-redundant sets of RNA structures (Experiment I), selected Rfam sequences in known RNA families (Experiment II) and representatives of new Rfam families (Experiment III). Prediction accuracy was computed separately for canonical, non-canonical, stacking and pseudoknot-involved interactions for all input instances.
Reconstructing non-canonical and pseudoknot interactions has long been a key challenge in predicting RNA secondary structure, and it has remained so despite the advent of ML-based methods. In the case of canonical base pair prediction, DL methods proved to generalize well within known families (Experiments I and II). However, their performance decreased for new RNA families (Experiment III). Such behavior is a consequence of bias in the training data. DL-based algorithms, while predicting solely on the basis of RNA sequences, learn to recognize family-specific sequence patterns. When these algorithms are run in a cross-family manner, the prediction can rely only on the subset of lower-level features shared by many families. Such a subset may be insufficient to reliably predict the structure of a new RNA from an outlier family.
Among the algorithms tested in Experiments I and II, SPOT-RNA stood out slightly but noticeably. It is the only method that implements additional strategies, such as transfer learning and a model ensemble. According to the authors of SPOT-RNA, transfer learning improved prediction quality, in terms of INF, by approximately 6%, while the model ensemble allowed them to obtain results 3% better than any single model alone. This suggests that combining various techniques enhances an algorithm’s performance. SPOT-RNA2, an extension of SPOT-RNA, is able to capture evolutionary features. Experiment III proved that such additional information improves the quality of prediction. However, it comes at a price – SPOT-RNA2 searches for homologous sequences and builds a covariance model, which is a computationally demanding process.
We conclude that ML methods have improved the accuracy of predicting the complete RNA secondary structure relative to traditional methods when not dealing with brand-new RNA families. Importantly, they have the advantage of learning from new data and improving prediction performance. However, taking advantage of this feature is only possible if the authors of ML-based algorithms make them accessible along with the learning sets and training routines. The availability of the latter can contribute to improving the quality of predicted RNA structures and allow a reliable evaluation of prediction algorithms.
What can be done to improve the quality of prediction of RNA structures from entirely new families? Many of the current ML approaches can be classified as shortcut learning [78]. They perform well on certain benchmarks but fail in real-world scenarios. The reason is their dependence on sequence patterns while predicting 2D structures within an RNA family. Consequently, when a new family appears, its patterns are unknown and the algorithms fail to predict a reliable 2D structure. ML approaches will struggle with this issue as long as they rely on sequence patterns. One alternative path may be to combine ML with traditional approaches. For example, similarly to MXfold2, one can apply ML to predict structural features that would then support prediction by non-ML algorithms. However, in such a hybrid approach the performance of the whole system is limited by the second-phase algorithm; for example, long-range interactions are excluded from prediction. Another option is to use a traditional approach as the core algorithm and improve its output with an ML model. In such a system, the non-ML algorithm would guarantee a minimal level of performance, while the DL model would learn to correct the errors of the first-phase algorithm. Such a combination can be particularly effective when used with algorithms based on multiple sequence alignment. Finally, ML models can be applied as an approximation of MSA-based methods.
Key Points
Machine learning (ML) algorithms to predict the 2D structure of RNA are compared on three representative datasets.
In general, deep learning-based methods are superior to non-machine learning ones when the structures are predicted for RNAs from the same data distribution.
Non-ML approaches, especially those using multiple sequence alignment (MSA), remain indispensable in RNA 2D structure prediction.
When predicting 2D structures of RNAs for new RNA families, deep learning-based algorithms perform similarly or even worse than non-ML approaches.
Supplementary Material
Marek Justyna is a PhD student and a member of the Laboratory of RNA Structural Bioinformatics, Poznan University of Technology. Research interests: machine learning, structural bioinformatics and RNA structure modeling. He holds an MSc in bioinformatics (2018).
Maciej Antczak is an assistant professor at Poznan University of Technology and IBCh PAS. Research interests: RNA structure, structural bioinformatics, combinatorial optimization and artificial intelligence. Author of OR/bioinformatics papers in leading scientific journals. PhD (2013) and DSc (2019) in computing science.
Marta Szachniuk is a professor of technical sciences, vice president of the Polish Bioinformatics Society and vice chair of EURO CBBM. Research interests: structural bioinformatics, RNA structure modeling, algorithmics and AI. Author of highly cited papers and over 20 bioinformatics tools. PhD (2005) and DSc (2015) in computing science, ProfTit (2020).
Contributor Information
Marek Justyna, Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
Maciej Antczak, Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
Marta Szachniuk, Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland.
FUNDING
This research was supported by the National Science Centre, Poland (2020/39/O/ST6/01488 to MS); the statutory funds of Poznan University of Technology and the Institute of Bioorganic Chemistry, PAS.
DATA AVAILABILITY
Datasets used to test the algorithms, including sequences and ground truth data, are available at https://zenodo.org/record/7542063#.Y8j_MnbMJhG.
References
1. Mortimer SA, Kidwell MA, Doudna JA. Insights into RNA structure and function from genome-wide studies. Nat Rev Genet 2014;15(7):469–79.
2. Meister G, Tuschl T. Mechanisms of gene silencing by double-stranded RNA. Nature 2004;431:343–9.
3. Serganov A, Nudler E. A decade of riboswitches. Cell 2013;152(1–2):17–24.
4. Wu L, Belasco JG. Let me count the ways: mechanisms of gene regulation by miRNAs and siRNAs. Mol Cell 2008;29(1):1–7.
5. Zou Q, Li J, Hong Q, et al. Prediction of microRNA-disease associations based on social network analysis methods. Biomed Res Int 2015;2015:810514.
6. Tijerina P, Mohr S, Russell R. DMS footprinting of structured RNAs and RNA-protein complexes. Nat Protoc 2007;2(10):2608–23.
7. Antczak M, Zablocki M, Zok T, et al. RNAvista: a webserver to assess RNA secondary structures with non-canonical base pairs. Bioinformatics 2019;35(1):152–5.
8. Gumna J, Zok T, Figurski K, et al. RNAthor - fast, accurate normalization, visualization and statistical analysis of RNA probing data resolved by capillary electrophoresis. PLoS One 2020;15(10):e0239287.
9. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003;31:3406–15.
10. Parisien M, Major F. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 2008;452(7183):51–5.
11. Parisien M, Cruz J, Westhof E, Major F. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA 2009;15(10):1875–85.
12. Szachniuk M. RNApolis: computational platform for RNA structure analysis. Found Comput Decis Sci 2019;2(44):241–57.
13. Popenda M, Zok T, Sarzynska J, et al. Entanglements of structure elements revealed in RNA 3D models. Nucleic Acids Res 2021;49(17):9625–32.
14. Li J, Zhang S, Zhang D, Chen S-J. Vfold-Pipeline: a web server for RNA 3D structure prediction from sequences. Bioinformatics 2022;38:4042–3.
15. Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for loop matchings. SIAM J Appl Math 1978;35:68–82.
16. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 1981;9(1):133–48.
17. Zhao Q, Zhao Z, Fan X, et al. Review of machine learning methods for RNA secondary structure prediction. PLoS Comput Biol 2021;17(8):e1009291.
18. Takefuji Y, Chen L-L, Lee K-C, Huffman J. Parallel algorithms for finding a near-maximum independent set of a circle graph. IEEE Trans Neural Netw 1990;1(3):263–7.
19. Steeg EW. Artificial Intelligence and Molecular Biology. MIT Press, 1993.
20. Xia T, SantaLucia J Jr, Burkard ME, et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 1998;37(42):14719–35.
21. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016;779–88.
22. Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016;2818–26.
23. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019;1:4171–86.
24. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inform Process Syst 2020;33:1877–901.
25. Townshend R, Eismann S, Watkins A, et al. Geometric deep learning of RNA structure. Science 2021;373(6558):1047–51.
26. Gong W, Wee J, Wu M, et al. Persistent spectral simplicial complex-based machine learning for chromosomal structural analysis in cellular differentiation. Brief Bioinform 2022;23(4):bbac168.
27. Seetin MG, Mathews DH. RNA structure prediction: an overview of methods. Methods Mol Biol 2012;905:99–122.
28. Puton T, Kozlowski LP, Rother KM, Bujnicki JM. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res 2013;41(7):4307–23.
29. Wayment-Steele HK, Kladwang W, Strom AI, et al. RNA secondary structure packages evaluated and improved by high-throughput experiments. Nat Methods 2022;19(10):1234–42.
30. Flamm C, Wielach J, Wolfinger MT, et al. Caveats to deep learning approaches to RNA secondary structure prediction. Front Bioinform 2022;2.
31. Szikszai M, Wise M, Datta A, et al. Deep learning models for RNA secondary structure prediction (probably) do not generalize across families. Bioinformatics 2022;38(16):3892–9.
32. Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun 2019;10(1):5407.
33. Sato K, Akiyama M, Sakakibara Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat Commun 2021;12:941.
34. Fu L, Cao Y, Wu J, et al. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Res 2021;50(3):e14.
35. Wang L, Liu Y, Zhong X, et al. DMfold: a novel method to predict RNA secondary structure with pseudoknots based on deep learning and improved base pair maximization principle. Front Genet 2019;10:143.
36. Zhang H, Zhang C, Li Z, et al. A new method of RNA secondary structure prediction based on convolutional neural network and dynamic programming. Front Genet 2019;10:467.
37. Rivas E, Lang R, Eddy SR. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA 2012;18(2):193–212.
38. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 2003;31(13):3423–8.
39. Lu W, Tang Y, Wu H, et al. Predicting RNA secondary structure via adaptive deep recurrent neural networks with energy-based filter. BMC Bioinform 2019;20(Suppl 25):684.
40. Lu W, Cao Y, Wu H, et al. Research on RNA secondary structure predicting via bidirectional recurrent neural network. BMC Bioinform 2021;22(Suppl 3):431.
41. Wu H, Tang Y, Lu W, et al. RNA secondary structure prediction based on long short-term memory model. Intell Comput Theories Appl 2018;595–9.
42. Quan L, Cai L, Chen Y, et al. Developing parallel ant colonies filtered by deep learned constrains for predicting RNA secondary structure with pseudo-knots. Neurocomputing 2020;384:104–14.
43. Calonaci N, Jones A, Cuturello F, et al. Machine learning a model for RNA structure prediction. NAR Genom Bioinform 2020;2(4):lqaa090.
44. Yonemoto H, Asai K, Hamada M. A semi-supervised learning approach for RNA secondary structure prediction. Comput Biol Chem 2015;57:72–9.
45. Qasim R, Kauser N, Jilani DT. Secondary structure prediction of RNA using machine learning method. Int J Comput Appl 2010;10(6):15–22.
46. Adamczyk B, Antczak M, Szachniuk M. RNAsolo: a repository of cleaned PDB-derived RNA 3D structures. Bioinformatics 2022;38(14):3668–70.
47. Kalvari I, Nawrocki EP, Ontiveros-Palacios N, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 2021;49(D1):D192–200.
48. Leontis NB, Zirbel CL. RNA 3D Structure Analysis and Prediction. Berlin Heidelberg: Springer, 2012.
49. Singh J, Paliwal K, Zhang T, et al. Improved RNA secondary structure and tertiary base-pairing prediction using evolutionary profile, mutational coupling and two-dimensional transfer learning. Bioinformatics 2021;37(17):2589–600.
50. Willmott D, Murrugarra D, Ye Q. Improving RNA secondary structure prediction via state inference with deep recurrent neural networks. CMB 2020;8(1):36–50.
51. Chen X, Li Y, Umarov R, et al. RNA secondary structure prediction by learning unrolled algorithms. In: International Conference on Learning Representations. 2020.
52. Akiyama M, Sato K, Sakakibara Y. A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. J Bioinform Comput Biol 2018;16(06):1840025.
53. Zakov S, Goldberg Y, Elhadad M, Ziv-Ukelson M. Rich parameterization improves RNA structure prediction. J Comput Biol 2011;18(11):1525–42.
54. Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006;22(14):e90–8.
55. Sato K, Kato Y, Hamada M, et al. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 2011;27(13):i85–93.
56. Lorenz R, Bernhart SH, Höner zu Siederdissen C, et al. ViennaRNA Package 2.0. Algorithms Mol Biol 2011;6:26.
57. Reuter JS, Mathews DH. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinform 2010;11(1):129.
58. Bernhart SH, Hofacker IL, Will S, et al. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinform 2008;9(1):474.
59. Tan Z, Fu Y, Sharma G, Mathews DH. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res 2017;45(20):11570–81.
60. Rivas E, Clements J, Eddy SR. A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nat Methods 2017;14(1):45–8.
61. Sloma MF, Mathews DH. Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA 2016;22(12):1808–18.
62. Danaee P, Rouches M, Wiley M, et al. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res 2018;46(11):5381–94.
63. Cannone JJ, Subramanian S, Schnare MN, et al. The Comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinform 2002;3:2.
64. Rose PW, Prlić A, Altunkaya A, et al. The RCSB Protein Data Bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res 2017;45(D1):D271–81.
65. Andronescu M, Bereg V, Hoos HH, Condon A. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinform 2008;9(1):340.
66. Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
67. Wilkinson KA, Gorelick RJ, Vasa SM, et al. High-throughput SHAPE analysis reveals structures in HIV-1 genomic RNA strongly conserved across distinct biological states. PLoS Biol 2008;6(4):e96.
68. Swenson MS, Anderson J, Ash A, et al. GTfold: enabling parallel RNA secondary structure prediction on multi-core desktops. BMC Res Notes 2012;5(1).
69. Tsochantaridis I, Joachims T, Hofmann T, Altun Y. Large margin methods for structured and interdependent output variables. J Mach Learn Res 2005;6:1453–84.
70. Zhang H, Zhang L, Mathews DH, Huang L. LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities. Bioinformatics 2020;36:i258–67.
71. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85–117.
72. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst 2017;30:1–11.
73. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inform Process Syst 2020;33:1877–901.
74. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2015;234–41.
75. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
76. Antczak M, Zok T, Popenda M, et al. RNApdbee - a webserver to derive secondary structures from PDB files of knotted and unknotted RNAs. Nucleic Acids Res 2014;42(W1):W368–72.
77. Antczak M, Popenda M, Zok T, et al. New algorithms to represent complex pseudoknotted RNA structures in dot-bracket notation. Bioinformatics 2018;34(8):1304–12.
78. Geirhos R, Jacobsen J-H, Michaelis C, et al. Shortcut learning in deep neural networks. Nat Mach Intell 2020;2(11):665–73.