Abstract
Motivation
RNA consensus secondary structure prediction from aligned sequences is a powerful approach for improving the secondary structure prediction accuracy. However, because the computational complexities of conventional prediction tools scale with the cube of the alignment lengths, their application to long RNA sequences, such as viral RNAs or long non-coding RNAs, requires significant computational time.
Results
In this study, we developed LinAliFold and CentroidLinAliFold, fast RNA consensus secondary structure prediction tools based on minimum free energy and maximum expected accuracy principles, respectively. We achieved software acceleration using beam search methods that were successfully used for fast secondary structure prediction from a single RNA sequence. Benchmark analyses showed that LinAliFold and CentroidLinAliFold were much faster than the existing methods while preserving the prediction accuracy. As an empirical application, we predicted the consensus secondary structure of coronaviruses with approximately 30 000 nt in 5 and 79 min by LinAliFold and CentroidLinAliFold, respectively. We confirmed that the predicted consensus secondary structure of coronaviruses was consistent with the experimental results.
Availability and implementation
The source codes of LinAliFold and CentroidLinAliFold are freely available at https://github.com/fukunagatsu/LinAliFold-CentroidLinAliFold.
Supplementary information
Supplementary data are available at Bioinformatics Advances online.
1 Introduction
RNAs play essential roles in various biological processes, such as transcriptional regulation and nucleotide modification (Statello et al., 2021; Stefani and Slack, 2008). RNA function is closely related to their structures, and as a result many studies have been conducted to determine RNA structures experimentally (Ma et al., 2022). However, these experiments are too expensive and time-consuming to perform the large-scale RNA structure analysis; therefore, a fast computational structure prediction analysis is widely used as an alternative approach. Although protein tertiary structures can be predicted with high accuracy using AlphaFold2 (Jumper et al., 2021), the prediction of RNA tertiary structures with acceptable performance remains a challenge (Miao et al., 2020). Accordingly, RNA secondary structure prediction analyses are frequently performed, and various secondary structure prediction tools have been developed (Do et al., 2006; Hamada et al., 2009b; Lorenz et al., 2011; Reuter and Mathews, 2010; Sato et al., 2021; Singh et al., 2019).
RNA consensus secondary structure prediction from sequence alignments is a powerful approach for accurate secondary structure prediction (Bernhart et al., 2008; Hamada et al., 2011; Hofacker et al., 2002; Kiryu et al., 2007; Tagashira and Asai, 2022). These methods improve prediction performance by taking advantage of the fact that evolutionarily conserved base-pairs tend to form secondary structures (Puton et al., 2013). The conventional method is RNAalifold, which predicts a consensus minimum free energy (MFE) structure while including sequence covariation scores in the energy model (Bernhart et al., 2008; Hofacker et al., 2002). Another representative tool is CentroidAlifold, which computes the centroid structure in the probability distribution of candidate structures based on the maximum expected accuracy (MEA) principle (Hamada et al., 2011).
These prediction methods are highly accurate and are widely used by experimental biologists. However, they are too computationally expensive to be applied to long sequences, such as viral RNAs and long non-coding RNAs, because the computational complexities are proportional to the cube of the alignment lengths. One solution to this problem is the local folding method, which reduces the computational time by ignoring long-range base-pair interactions (Bernhart et al., 2006; Fukunaga et al., 2014; Kawaguchi and Kiryu, 2016; Kiryu et al., 2008). This approach enables analysis of RNA secondary structures at the transcriptome level (Agarwal et al., 2015; Fukunaga and Hamada, 2017). However, ignored long-range interactions are often found in natural RNAs, some of which have functional roles such as regulation of splicing or transcription (Lai et al., 2018; Raker et al., 2009). For example, the SARS-CoV-2 RNA genome forms 29.8-knt spanning long-range base-pairs that are presumed to contribute to genome replication (Ziv et al., 2020).
As another approach for long RNA prediction, LinearFold was recently proposed by Huang et al. for fast secondary structure prediction from a single RNA sequence. LinearFold accelerates structure prediction by pruning candidate structures using a beam search method (Huang et al., 2019). In other words, by using heuristics to obtain approximate solutions, LinearFold can predict secondary structures including long-range interactions, within a reasonable amount of time. Despite the use of the heuristics, Huang et al. demonstrated the superior prediction accuracy of LinearFold over the conventional prediction tools, RNAfold and CONTRAFold. The beam search-based acceleration technique is currently applied to various RNA secondary structure analyses, including partition function calculation (Zhang et al., 2020b), structural alignment (Li et al., 2021), stochastic sampling (Zhang et al., 2021), sequence design (Zhang et al., 2020a) and pseudoknot prediction (Sato and Kato, 2022; Zhang et al., 2019).
In the current study, we developed fast RNA consensus secondary structure prediction tools using beam search methods. We implemented two prediction tools, LinAliFold and CentroidLinAliFold, accelerated methods of RNAalifold and CentroidAlifold by beam search methods, respectively. We confirmed that LinAliFold and CentroidLinAliFold showed comparable prediction accuracy to existing programs and that the two programs were much faster than existing programs.
2 Methods
2.1 The LinAliFold algorithm
We first briefly review the RNAalifold algorithm, which outputs a consensus MFE structure from the input alignment data under a thermodynamic energy model incorporating sequence covariation scores (Bernhart et al., 2008; Hofacker et al., 2002). The prediction algorithm is based on dynamic programming (DP), with time and space complexities of and for a sequence alignment whose the number of sequences is S and the length is N. As a thermodynamic energy model, RNAalifold uses the same neighborhood energy model as RNAfold, which predicts the MFE structure from a single sequence. RNAalifold is unique as it uses sequence covariation scores for structure prediction.
The sequence covariation score is defined for two alignment columns, and the columns with higher scores are less likely to become base-paired in the predicted structure. The score is designed to have a lower value when there are more base mutations such that base pairing is conserved in the two columns. Conversely, when gaps or two bases that do not form a base-pair are included in the two columns, the value is designed to be higher. The sequence covariation score model was firstly designed in RNAalifold, and this score model regarded all substitution rates between base pairs as equal. We refer to this sequence covariation score model as the simple score model. RNAalifold also provides the RIBOSUM score model, which reflects substitution rates between base pairs in real data. Specifically, the model was learned from an alignment of 2492 SSU ribosomal RNAs from the European Ribosomal RNA Database (Bernhart et al., 2008; Klein and Eddy, 2003; Wuyts et al., 2004).
We also review the beam search method in RNA secondary structure analysis (Huang et al., 2019). As an example, consider evaluating the energy score when a column i forms a base-pair with a column on the side of the column i. In conventional exact approaches, the energy scores of all columns that can form base-pairs are investigated, and all energy scores are retained for subsequent calculations in the DP. In contrast, the heuristic beam search method evaluates only the energy scores of columns whose scores are retained. Subsequently, the method only maintains the energy scores of the top b columns with the lowest energy scores, and discards the energy scores of the other columns. Here, before the column i is evaluated, all columns on the side from the column i must have been evaluated. Because this condition is not satisfied by a bottom-up DP, which is conventionally used in RNA structural analysis, LinearFold employed a left-to-right DP instead (Huang and Sagae, 2010; Tomita, 1988). This beam search heuristics was also applied to the evaluation of multi loop formation. LinearFold achieved significant acceleration by pruning the search space with this beam search, and reduced the time and space complexities from and to and O(Nb). b is the beam size and is a user-defined parameter.
LinAliFold is a program that speeds up RNAalifold by using the beam search method. Specifically, the bottom-up DP in RNAalifold was replaced by the left-to-right DP, and the beam pruning was applied to the DP calculation. Compared with RNAalifold, the time complexity is reduced from to . On the other hand, the space complexity is the same as RNAalifold because we used 2D arrays for the data structure to store the sequence covariance scores. Note that we can reduce the spatial complexity by using hashes instead of 2D arrays. However, we did not use hashes because preliminary experiments showed that the usage of hashes worsened the computational speed performance. In addition, we have slightly modified the calculation method of RNAalifold and LinearFold; this is described in the Supplementary Material.
2.2 The CentroidLinAliFold algorithm
We next present an overview of the CentroidAlifold algorithm (Hamada et al., 2011). CentroidAlifold was designed from the viewpoint of MEA, which maximizes the expected prediction accuracy under a probability distribution on candidate solutions. In particular, CentroidAlifold utilizes the γ-centroid estimator (Hamada et al., 2009b), which is theoretically superior to the frequently used ContraFold MEA estimator (Do et al., 2006).
CentroidAlifold first computes , a base pairing probability (BPP) between two columns i and j, for all column pairs, as follows:
where is a mixture weight parameter, is the BPP on an individual sequence x, and is the BPP on the input alignment . Here, and can be computed using the McCaskill algorithm and the extension of RNAalifold with the time complexities and , respectively.
CentroidAlifold then performed Nussinov-style DP using as follows (Nussinov et al., 1978):
| (1) |
where denotes the best score for , which is defined as a subsequence between i and j on alignment . in Eq. (1) represents that two columns i and j are more likely to be included in the predicted secondary structure when the BPP is high. γ is a user-defined hyperparameter that controls the number of base pairs in the predicted structure. A higher γ value is expected to increase the number of base pairs, i.e. increase the sensitivity (SEN) and decrease the positive prediction value (PPV). The first, second and third lines of Eq. (1) represent the case where the column i does not form a base-pair, the column i is base-paired with the column j, and the column i is base-paired with column k satisfying i < k < j, respectively. The time and space complexities of DP are and . CentroidAliFold finally predicted the secondary structure by traceback of the DP matrix. Accordingly, the total time and space complexities of CentroidAlifold were and .
CentroidLinAliFold is the software that speeds up CentroidAlifold by applying the beam search method to each step of CentroidAlifold. The calculation method of using beam search has been proposed as LinearPartition (Zhang et al., 2020b) with a time complexity , and the algorithm was reimplemented and incorporated into CentroidLinAliFold. In addition, to calculating , we developed a novel beam search-based algorithm with time complexity by extending LinAliFold. The beam search method was not applied to the Nussinov-style DP step because this step empirically requires less computational time when using the threshold method. The threshold method is a fast structure prediction method that uses only larger than the threshold in DP recursion. From Eq. (1), we can set the threshold to without affecting the prediction results. The number of base pairs to be considered per sequence position is bounded by , and the number of k to be considered in the third line of Eq. (1) is also . Therefore, the time complexity of this step becomes .
2.3 Datasets and evaluation measures
We evaluated the prediction performance using eight RNA families provided in the RNAStralign dataset: 5S rRNA, 16S rRNA, group I intron, RNase P, SRP RNA, telomerase RNA, tmRNA and tRNA (Tan et al., 2017). We randomly selected 5, 10 or 20 sequences for each RNA family, and aligned the sequences by mafft-xinsi v7.490, which performs multiple alignments considering RNA secondary structure information using SCARNA (Katoh and Standley, 2013; Tabei et al., 2006). Note that most current RNA structural alignment tools require at least the cube of the sequence length in the computation time, so the advantage of the beam search-based consensus structure prediction method may appear to have diminished. However, just as applying beam search to TurboFold enabled the construction of LinearTurboFold, which has linear time complexity with the alignment length (Li et al., 2021), we can develop fast structural alignment tools by applying the beam search methods to conventional tools. Therefore, the drawback will be disappeared in the future. In particular, speeding up SCARNA (Tabei et al., 2006) and CentroidAlign (Hamada et al., 2009a), which are faster than TurboFold (Harmanci et al., 2011), would be practical fast structural alignment tools.
We then predicted the consensus secondary structures from the alignments, and evaluated the prediction quality of individual RNA sequences using three metrics: PPV, SEN and the Matthews correlation coefficient (MCC) with respect to base-pairs (Hamada et al., 2011). We prepared five datasets for each evaluation experiment, and measured the average values of the evaluation metrics.
The maximum alignment length of the RNAStralign dataset is approximately 1500 bp for 16S rRNA, and the alignment length is insufficient to evaluate the computation speed of LinAliFold and CentroidLinAliFold. To test the performance on longer sequences, we created artificial alignment datasets by concatenating multiple alignments of 16S rRNAs two to ten times. For each number of concatenations, we evaluated the average computational time and required memory size for the five datasets. Although the number of concatenations is the same, the alignment lengths differed for each dataset. Therefore, we regarded the average alignment lengths as the representative alignment length for the number of concatenations. In this experiment, the computation was conducted on an Intel Core i5 2.4 GHz CPU with 8GB of memory.
For practical application of long RNA sequences, we predicted consensus secondary structures of coronaviruses with sequence lengths of approximately 30 000 nt. We used the coronavirus genome dataset of 25 sequences compiled by Li et al. (2021) and 1 sequence of 2019-nCoV/USA-WA1/2020 strain, which was added for accuracy evaluation. The dataset consisted of 17 SARS-CoV-2, five human SARS-CoV-1 and four bat coronavirus genomes. We could not apply mafft-xinsi to the coronavirus dataset owing to the shortage of memory; therefore, we used mafft with default settings for multiple sequence alignments. As experimentally determined reference RNA secondary structures for accuracy evaluation, we used two structures in Vero or Huh7 cells (Lan et al., 2022). In addition, for analysis of long-range RNA–RNA interactions, we used a structure determined by Ziv et al. (2020). The computation was performed on an Intel Xeon Gold 6154 4.0 GHz CPU with 128GB memory in this experiment.
We compared the performance of LinAliFold and CentroidLinAliFold with those of RNAalifold and CentroidAlifold, respectively. We also used RNAfold and Centroidfold to predict the secondary structure from a single sequence to compare the prediction performance. For the thermodynamic energy parameter, we used Andronescu’s BL* parameter owing to its high prediction accuracy (Andronescu et al., 2010). As in previous studies, we used 100 and 0.5 as the default values of the beam size b and the mixture weight parameter w, respectively (Hamada et al., 2011; Huang et al., 2019). We also used the simple score model as the default sequence covariation score model.
3 Results
3.1 Evaluation of prediction accuracy
We first evaluated the prediction accuracy of the six RNA structure prediction tools using the RNAStralign dataset. Figure 1 and Supplementary Figure S1 show the average prediction results for all RNA families. MEA-based tools have multiple prediction results depending on γ, and we selected γ from . We found that the consensus secondary structure prediction tools were superior to the prediction tools from single sequences, and that the MEA-based tools performed better than the MFE-based tools. These results are consistent with previous studies. In addition, we verified that LinAliFold and CentroidLinAliFold have comparable prediction accuracy to RNAalifold and CentroidAlifold, respectively. These results were independent of the number of sequences in the alignments (Supplementary Fig. S1).
Fig. 1.

Prediction performance of RNA secondary structure on the RNAStralign dataset with 20 sequences. The x-axis and the y-axis represent PPV and SEN, respectively. Symbols and colors used above represent: RNAfold (black crosses), CentroidFold (black squares), RNAalifold (red crosses), CentroidAlifold (red circles), LinAliFold (blue squares) and CentroidLinAliFold (blue triangles)
Table 1 and Supplementary Tables S1–S8 show the prediction performance of each RNA family. In this analysis, the hyperparameter γ was automatically determined based on the maximization of the pseudo-expected accuracy of the MCC (Hamada et al., 2010). The pseudo-expected accuracy of the MCC is an approximate MCC estimator, and it is computed from the BPPs only, without using a reference structure. We found that the consensus secondary structure prediction tools performed well. However, the prediction from a single sequence was more accurate for some families, such as group I introns and tmRNAs. This result may be due to inaccurate structural alignments. Additionally, we confirmed that LinAliFold and CentroidLinAliFold have equal or better accuracy than RNAalifold and CentroidAlifold, respectively. In particular, CentroidLinAliFold achieved the best prediction performance with an MCC of 0.702 when the number of sequences was 20. These results suggest that the beam search heuristics do not have a negative influence on the prediction accuracy of RNA consensus secondary structure prediction.
Table 1.
The MCC results of prediction accuracy for each RNA family when the number of sequences was 20
| RNA family | RNAfold | CentroidFold | RNAalifold | LinAliFold | CentroidAlifold | CentroidLinAliFold |
|---|---|---|---|---|---|---|
| 5s rRNA | 0.729 | 0.676 | 0.885 | 0.898 | 0.885 | 0.903 |
| 16S rRNA | 0.434 | 0.520 | 0.684 | 0.656 | 0.734 | 0.720 |
| group I intron | 0.708 | 0.752 | 0.598 | 0.632 | 0.627 | 0.644 |
| RNaseP | 0.568 | 0.647 | 0.616 | 0.596 | 0.635 | 0.654 |
| SRP RNA | 0.643 | 0.640 | 0.596 | 0.645 | 0.602 | 0.649 |
| telomerase RNA | 0.455 | 0.493 | 0.436 | 0.555 | 0.551 | 0.572 |
| tmRNA | 0.512 | 0.547 | 0.382 | 0.475 | 0.460 | 0.505 |
| tRNA | 0.809 | 0.784 | 0.962 | 0.966 | 0.962 | 0.966 |
| average | 0.607 | 0.632 | 0.645 | 0.678 | 0.682 | 0.702 |
Note: The bold values are the highest scores in each row.
3.2 Evaluation of computational time and required memory size
We next assessed the computational time required by the developed software (Fig. 2 and Supplementary Fig. S2). We confirmed that LinAliFold and CentroidLinAliFold could perform the prediction in almost linear computational time with respect to the alignment lengths. In addition, LinAliFold and CentroidLinAliFold were considerably faster than RNAalifold and CentroidAlifold when the alignment lengths were long, and the speed efficiency increased as the alignment lengths increased. For example, with an alignment length of 15 600 nt and the number of sequences of 20, LinAliFold was 7.8 times faster than RNAalifold. As another example, with an alignment length of 4680 nt and the number of sequences of 20, CentroidLinAliFold was 20.9 times faster than CentroidAlifold. Furthermore, LinAliFold was found to be largely faster than CentroidLinAliFold. There are various reasons for this difference. First, the computational complexity is different; LinAliFold is O(SNblogb), whereas CentroidLinAliFold is . Second, LinAliFold only needs to perform DP once for an alignment, like the Viterbi algorithm of hidden Markov models (HMMs), whereas CentroidLinAliFold needs to perform DP twice for a sequence or alignment, like the forward-backward algorithm of HMMs. Third, CentroidLinAliFold has to calculate BPPs for individual sequences as well as for alignments. Fourth, CentroidAliFold has to calculate the BPPs in the log space to avoid underflow, but LinAliFold does not use the log space.
Fig. 2.

Evaluation of the computational time when the number of sequence was 20. The x- and y-axis represent the alignment lengths (nt) and computational time (s), respectively. The lines represent RNAalifold (red dashed), CentroidAlifold (red solid), LinAliFold (blue dashed) and CentroidLinAliFold (blue solid). The exact computational time was not measured when it exceeded 2 h
We also investigated the required memory size for the developed software (Fig. 3 and Supplementary Fig. S3). The results show that RNAalifold and LinAliFold required almost the same amount of memory size, but CentroidLinAliFold was more memory efficient than CentroidAlifold. Because CentroidAlifold and CentroidLinAliFold should have equivalent amounts of memory based on the algorithm, this difference in results may be due to implementation rather than the algorithms. In addition, we verified that LinAliFold was better memory efficiency than CentroidLinAliFold. The main reason for this difference is that calculating BPPs requires more internal variables than finding an MFE structure. We further examined the dependencies of computational time and required memory size on the number of sequences (Supplementary Tables S9 and S10). The computational time increased for both LinAliFold and CentroidLinAliFold as the number of sequences increased; however, the required memory size did not change significantly.
Fig. 3.

The required memory size was evaluated when the number of sequences was 20. The x- and y-axis represent the alignment lengths (nt) and required memory size (GB), respectively. The lines represent RNAalifold (red dashed), CentroidAlifold (red solid), LinAliFold (blue dashed) and CentroidLinAliFold (blue solid). The required memory size was not measured when computational time exceeded 2 h
3.3 Analysis of the dependence of the performance on the sequence covariation score model and the beam size
We then analyzed the effects of the sequence covariation score model on the prediction accuracy (Supplementary Table S11). Note that we could not perform the analysis for CentroidAlifold because this tool does not have the option of changing the sequence covariation score model. As expected from the results of a previous study (Bernhart et al., 2008), the RIBOSUM model outperforms the simple score model with any prediction tool. When comparing the prediction performance for each RNA family using the RIBOSUM model, some families improved the prediction accuracy significantly, and no families showed substantial decreases.
We also estimated the influence of the beam size b on the performance of LinAliFold and CentroidLinAliFold. Supplementary Tables S12 and S13 show the MCC changes with beam size b. The prediction accuracy monotonically increased with increasing b as expected, but the performance did not improve significantly over b = 100. Supplementary Tables S14 and S15 show the changes in the computational time and required memory size with the change in beam size b. We found that the computational time increased but the required memory size did not change significantly as b increased. These results are consistent with the results of the computational complexity analysis.
3.4 Prediction of coronavirus RNA consensus secondary structure
We finally predicted the consensus secondary structures of coronaviruses using LinAlifold and CentroidLinAliFold. Table 2 shows the computational time and required memory size for the prediction. The computational times include the execution time of MAFFT for our tools; however, most of the required computational time is the structure prediction because MAFFT completed the computation in eight seconds. We demonstrated that our tools could predict the secondary structure of coronaviruses within a reasonable amount of time and memory. Note that LinearTurboFold, a beam search-based structural alignment and structure prediction tool (Li et al., 2021), required 15 h 44 min 28 s time and 45.9 GB memory for coronavirus secondary structure prediction. This difference in software performance between our tools and LinearTurboFold may be attributed to differences in the purpose of programs; LinearTurboFold performs structural alignment, but our tools do not perform it.
Table 2.
The computational time and required memory size for coronavirus secondary structure prediction
| Software | Computational time | Required memory |
|---|---|---|
| LinAliFold | 4 min 55 s | 4.7 GB |
| CentroidLinAliFold | 1 h 19 min 10 s | 15.1 GB |
We evaluated the secondary structure prediction accuracy using two experimentally determined structures in Vero or Huh7 cells. Table 3 shows prediction performances for each software program. Note that the secondary structure of coronaviruses fluctuates largely in vivo, and there are significant differences even among the experimentally determined structures. When the MCC was calculated for the two experimental structures, with one as the correct answer and the other as the prediction, the value was only 0.752. In addition, the predicted structure of LinearTurboFold is not the consensus structure but the structure from a single sequence based on the structural alignment. As a result, we verified CentroidLinAliFold has higher performances than LinAliFold in any metrics, and CentroidLinAliFold has the highest PPV values among the prediction tools. The reason why CentroidLinAliFold has lower MCC and SEN values than LinearTurboFold may be that CentroidLinAliFold does not perform structural alignment or predict secondary structures of individual sequences.
Table 3.
The prediction performances of coronavirus secondary structures
| Vero |
Huh7 |
|||||
|---|---|---|---|---|---|---|
| Prediction software | PPV | SEN | MCC | PPV | SEN | MCC |
| LinAliFold | 0.496 | 0.548 | 0.521 | 0.525 | 0.500 | 0.512 |
| CentroidLinAliFold | 0.728 | 0.581 | 0.650 | 0.730 | 0.503 | 0.606 |
| LinearTurboFold | 0.641 | 0.748 | 0.693 | 0.683 | 0.687 | 0.685 |
| Experiments | 0.698 | 0.809 | 0.752 | 0.809 | 0.698 | 0.752 |
Note: The bold values are the highest scores among software predictions. The row of ‘Experiments’ means the comparison with two experimentally determined structures when one is the correct answer and the other is the prediction.
We further compared the predicted and experimentally determined structures of canonical -UTRs, which have been well studied. We confirmed that LinAliFold and CentroidLinAliFold produce predictions consistent with the experimental results, with a few exceptions, such as the lack of the SL3 and SL2 motifs, respectively (Fig. 4 and Supplementary Fig. S4). A previous study showed that the beam search-based prediction from a single sequence could not predict the consistent structures with experiments (Li et al., 2021). Therefore, prediction results by LinAliFold and CentroidLinAliFold demonstrated that RNA consensus secondary structure prediction using beam search is an effective approach for viral RNA structure prediction.
Fig. 4.

(A) A predicted RNA consensus secondary structure of coronavirus 5′-UTR by CentroidLinAliFold. (B) An experimentally determined secondary structure of the coronavirus. Graphical views were created using VARNA (Darty et al., 2009)
The region forming the SL3 motif in the 5′-UTR of the coronavirus can form an alternative structure that takes a long-range interaction spanning 29.8 knt. However, most RNA secondary structure prediction tools predict only one of two different structures. For example, while LinearTurboFold predicted a structure with the long-range interaction, our tools predicted the canonical structure. However, prediction methods that calculate BPPs, such as LinearTurboFold and CentroidAliFold, can estimate the possibility that alternative structures form. Therefore, we examined whether CentroidLinAliFold could capture the potential of the long-range interaction by examining BPPs. Specifically, we calculated the average probabilities that the bases constituting the SL3 motif formed the base pairs of each structure. As a result, we found that the probabilities that the region formed the canonical SL3 structure and the long-range interaction were 40% and 7%, respectively. This result indicates that CentroidLinAliFold can capture the long-range interactions in the coronavirus genome.
4 Discussion
In this study, we developed LinAliFold and CentroidLinAliFold, which adopt beam search methods to accelerate RNA consensus secondary structure prediction. We investigated the prediction performance using the RNAStralign dataset and verified that LinAliFold and CentroidLinAliFold have comparable accuracy to the existing programs, RNAalifold and CentroidAlifold. However, the computational speeds of LinAliFold and CentroidLinAliFold were much faster than those of the existing programs, especially for longer alignment lengths. In addition, we demonstrated that these two programs could accurately predict the secondary structures of coronaviruses of approximately 30 000 nt lengths in a reasonable amount of time.
While we were undergoing peer review for our paper, the paper about LinearAliFold, linear-time consensus secondary structure prediction tools based on the same concept as LinAliFold, was published in arxiv (Zhang et al., 2022). Benchmark studies showed that LinearAliFold is also much faster than RNAalifold while preserving the prediction accuracy. Unlike the methods we developed, LinearAliFold can predict structures containing pseudoknots or sample the secondary structures from the stochastic distribution. On the other hand, unlike LinearAliFold, CentroidLinAliFold can predict sophisticated MEA structures based on the γ-centroid estimator. All of these methods will be foundational tools in predicting the common secondary structure of long RNAs.
We used the same score models for the sequence covariation score model as in the RNAalifold software was used. However, previous benchmarks suggested that the score models do not have high detection performance because the score models ignore the effects of stacking energy and nucleotide frequency distribution (Lindgreen et al., 2006; Rivas et al., 2017). Rivas et al. (2017) recently showed that average product corrected (APC) G-test statistics have robust and sensitive performance for sequence covariation detection. This result suggests that the use of the APC G-test statistics may be effective for consensus secondary structure prediction. However, because the computational time of the APC G-test statistics scale with the square of the alignment lengths, the simple integration of the APC G-test statistics into LinAliFold or CentroidAlifold may result in the loss of the high speed advantage of these tools. We envision the development of a fast computable and sensitive sequence covariation score model.
The beam search method has been used to increase the speed of various RNA structure informatics tools, and the scope of its application should continue to expand. For example, CentroidHomfold (Hamada et al., 2009c) and CentroidAlign (Hamada et al., 2009a), which we previously developed for the structure prediction of a single sequence using homologous sequence information and structural alignment, can also be accelerated by beam search methods. The speedup of existing tools based on beam search methods is essential for genome-scale RNA secondary structural analysis in the research area of RNA informatics.
Supplementary Material
Acknowledgements
Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics and the SHIROKANE supercomputer at human genome center, the Institute of Medical Science, The University of Tokyo.
Funding
This work was supported by the Japan Society for the Promotion of Science [JP19K20395, JP20H05582 and JP22H04891 to T.F.]. This research was also supported by Research Support Project for Life Science and Drug Discovery (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) from AMED under Grant Number JP22ama121055.
Conflict of Interest: none declared.
Contributor Information
Tsukasa Fukunaga, Waseda Institute for Advanced Study, Waseda University, Tokyo 1690051, Japan.
Michiaki Hamada, Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 1698555, Japan; Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo 1698555, Japan.
References
- Agarwal V. et al. (2015) Predicting effective microRNA target sites in mammalian mRNAs. Elife, 4, e05005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Andronescu M. et al. (2010) Computational approaches for RNA energy parameter estimation. RNA, 16, 2304–2318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bernhart S.H. et al. (2006) Local RNA base pairing probabilities in large sequences. Bioinformatics, 22, 614–615. [DOI] [PubMed] [Google Scholar]
- Bernhart S.H. et al. (2008) RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9, 474. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Darty K. et al. (2009) Varna: interactive drawing and editing of the RNA secondary structure. Bioinformatics, 25, 1974–1975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Do C.B. et al. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90–e98. [DOI] [PubMed] [Google Scholar]
- Fukunaga T., Hamada M. (2017) RIblast: an ultrafast RNA-RNA interaction prediction system based on a seed-and-extension approach. Bioinformatics, 33, 2666–2674. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fukunaga T. et al. (2014) CapR: revealing structural specificities of RNA-binding protein target recognition using CLIP-seq data. Genome Biol., 15, R16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hamada M. et al. (2009a) CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics, 25, 3236–3243. [DOI] [PubMed] [Google Scholar]
- Hamada M. et al. (2009b) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics, 25, 465–473. [DOI] [PubMed] [Google Scholar]
- Hamada M. et al. (2009c) Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics, 25, i330–338. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hamada M. et al. (2010) Prediction of RNA secondary structure by maximizing pseudo-expected accuracy. BMC Bioinformatics, 11, 586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hamada M. et al. (2011) Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res., 39, 393–402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Harmanci A.O. et al. (2011) TurboFold: iterative probabilistic estimation of secondary structures for multiple RNA sequences. BMC Bioinformatics, 12, 108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hofacker I.L. et al. (2002) Secondary structure prediction for aligned RNA sequences. J. Mol. Biol., 319, 1059–1066. [DOI] [PubMed] [Google Scholar]
- Huang L., Sagae K. (2010) Dynamic programming for linear-time incremental parsing. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 1077–1086.
- Huang L. et al. (2019) LinearFold: linear-time approximate RNA folding by 5′-to-3′ dynamic programming and beam search. Bioinformatics, 35, i295–i304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jumper J. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Katoh K., Standley D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30, 772–780. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kawaguchi R., Kiryu H. (2016) Parallel computation of genome-scale RNA secondary structure to detect structural constraints on human genome. BMC Bioinformatics, 17, 203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kiryu H. et al. (2007) Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics, 23, 434–441. [DOI] [PubMed] [Google Scholar]
- Kiryu H. et al. (2008) Rfold: an exact algorithm for computing local base pairing probabilities. Bioinformatics, 24, 367–373. [DOI] [PubMed] [Google Scholar]
- Klein R.J., Eddy S.R. (2003) RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics, 4, 44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lai W.C. et al. (2018) mRNAs and lncRNAs intrinsically form secondary structures with short end-to-end distances. Nat. Commun., 9, 4328. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lan T. C.T. et al. (2022) Secondary structural ensembles of the SARS-CoV-2 RNA genome in infected cells. Nat. Commun., 13, 1128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li S. et al. (2021) LinearTurboFold: linear-time global prediction of conserved structures for RNA homologs with applications to SARS-CoV-2. Proc. Natl. Acad. Sci. USA, 118, e2116269118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lindgreen S. et al. (2006) Measuring covariation in RNA alignments: physical realism improves information measures. Bioinformatics, 22, 2988–2995. [DOI] [PubMed] [Google Scholar]
- Lorenz R. et al. (2011) ViennaRNA package 2.0. Algorithms Mol. Biol., 6, 26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ma H. et al. (2022) Cryo-EM advances in RNA structure determination. Signal Transduct. Target. Ther., 7, 58. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miao Z. et al. (2020) RNA-Puzzles round IV: 3D structure predictions of four ribozymes and two aptamers. RNA, 26, 982–995. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nussinov R. et al. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., 35, 68–82. [Google Scholar]
- Puton T. et al. (2013) CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res., 41, 4307–4323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raker V.A. et al. (2009) Modulation of alternative splicing by long-range RNA structures in Drosophila. Nucleic Acids Res., 37, 4533–4544. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reuter J.S., Mathews D.H. (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rivas E. et al. (2017) A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nat. Methods, 14, 45–48. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sato K., Kato Y. (2022) Prediction of RNA secondary structure including pseudoknots for long sequences. Brief Bioinform., 23, bbab395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sato K. et al. (2021) RNA secondary structure prediction using deep learning with thermodynamic integration. Nat. Commun., 12, 941. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Singh J. et al. (2019) RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat. Commun., 10, 5407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Statello L. et al. (2021) Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol., 22, 96–118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stefani G., Slack F.J. (2008) Small non-coding RNAs in animal development. Nat. Rev. Mol. Cell Biol., 9, 219–230. [DOI] [PubMed] [Google Scholar]
- Tabei Y. et al. (2006) SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics, 22, 1723–1729. [DOI] [PubMed] [Google Scholar]
- Tagashira M., Asai K. (2022) ConsAlifold: considering RNA structural alignments improves prediction accuracy of RNA consensus secondary structures. Bioinformatics, 38, 710–719. [DOI] [PubMed] [Google Scholar]
- Tan Z. et al. (2017) TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res., 45, 11570–11581. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tomita M. (1988) Graph-structured stack and natural language parsing. In: 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, New York, USA, pp. 249–257.
- Wuyts J. et al. (2004) The European ribosomal RNA database. Nucleic Acids Res., 32, D101–D103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang H. et al. (2020a) Algorithm for optimized mRNA design improves stability and immunogenicity. arXivpreprint arXiv:2004.10177. [DOI] [PMC free article] [PubMed]
- Zhang H. et al. (2020b) LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities. Bioinformatics, 36, i258–i267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang H. et al. (2021) LazySampling and LinearSampling: fast stochastic sampling of RNA secondary structure with applications to SARS-CoV-2. bioRxiv 2020.12.29.424617. [DOI] [PMC free article] [PubMed]
- Zhang L. et al. (2019) Threshknot: thresholded probknot for improved RNA secondary structure prediction. arXiv preprint arXiv:1912.12796.
- Zhang L. et al. (2022) Linearalifold: linear-time consensus structure prediction for RNA alignments. arXiv preprint arXiv:2206.14794. [DOI] [PMC free article] [PubMed]
- Ziv O. et al. (2020) The short- and long-range RNA-RNA interactome of SARS-CoV-2. Mol. Cell., 80, 1067–1077.e5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
