
This is a preprint.

It has not yet been peer reviewed by a journal.


Research Square
[Preprint]. 2024 Feb 23:rs.3.rs-3798842. [Version 1] doi: 10.21203/rs.3.rs-3798842/v1

Graph Convolutional Network for predicting secondary structure of RNA

Palawat Busaranuvong 1,3, Aukkawut Ammartayakun 1, Dmitry Korkin 2,*, Roya Khosravi-Far 3,*
PMCID: PMC10925402  PMID: 38464300

Abstract

The prediction of RNA secondary structure is essential for understanding RNA function and for applications in diverse fields, including molecular diagnostics and RNA-based therapeutic strategies. However, the complexity of the search space presents a challenge. This work proposes a Graph Convolutional Network (GCNfold) for predicting RNA secondary structure. GCNfold treats an RNA sequence as graph-structured data and predicts posterior base-pairing probabilities given the prior base-pairing probabilities calculated with McCaskill’s partition function. The performance of GCNfold surpasses that of state-of-the-art folding algorithms, as we have incorporated minimum free energy information into the richly parameterized network, enhancing its robustness in predicting non-homologous RNA secondary structures. A Symmetric Argmax Post-processing algorithm ensures that GCNfold formulates valid structures. To validate our algorithm, we applied it to the SARS-CoV-2 E gene and determined the secondary structure of the E gene across the Betacoronavirus subgenera.

Keywords: RNA Secondary Structure, Graph Convolutional Neural Network, Energy-based Model, RNAfold, SARS-CoV-2

1. Introduction

RNAs play an essential role in biological processes, such as RNA translation, modification, and protein synthesis. Understanding the secondary structure of RNA is essential for the development of technologies in bioinformatics and medicine, including drug discovery, disease diagnostics, biosensing tools, and many other genomics applications. The secondary structure of RNA impacts its interaction with other cellular components. Accurate secondary structures can be determined through experimental assays, such as Nuclear Magnetic Resonance and X-ray Diffraction. However, these methods are often expensive, have resolution limits, and can be technically challenging.

Computational folding algorithms are alternative approaches to predict the secondary structure of RNA solely from its sequence. Energy-based models, such as RNAfold Lorenz et al (2011), RNAstructure Reuter and Mathews (2010), and UNAFold Markham and Zuker (2008), are commonly used for this purpose and are based on the idea that the most stable structure is the one that minimizes free energy (MFE) Tinoco and Bustamante (1999). Dynamic programming algorithms (DP) Zuker and Stiegler (1981); Trotta (2014) can be used to formulate the optimal MFE secondary structure, given a consistent estimator of the nearest-neighbor model Mathews et al (2004) to estimate the energy. This estimator is typically a linear combination of functions of substructures such as hairpins, internal loops, stems, etc. Zuker and Stiegler (1981). However, structure prediction accuracy using energy-based models is relatively low because the free energy parameters are experimentally determined in advance Sato et al (2021). Machine learning (ML)-based models, such as CONTRAfold Do et al (2006), Contextfold Zakov et al (2011), and MXfold Akiyama et al (2018), have been proposed to improve structure predictions by learning scoring parameters and finding structures with respect to them. CONTRAfold Do et al (2006) is a well-known algorithm that uses a stochastic context-free grammar rule to embed information on thermodynamic energy minimization and biological stability.

With the rise of AI technologies and the explosion of RNA sequence data, recent studies have used deep learning to predict secondary structures. SPOT-RNA Singh et al (2019) and E2Efold Chen et al (2020) both employ deep neural networks (DNN) for end-to-end RNA secondary structure prediction. These approaches can successfully predict pseudoknot structures, which traditional energy minimization techniques cannot. SPOT-RNA uses an ensemble of multiple DNN models to obtain the predicted structures. However, it may produce secondary structures that contain overlapping pairs (i.e., nucleotide pairs with more than one other base). E2Efold, on the other hand, comprises a deep score network and a post-processing network that introduces hard constraints over the DNN model to restrict the output space. However, recent studies Sato et al (2021); Fu et al (2022) have shown that E2Efold is prone to overfitting and is only effective for the specific sequence-wise RNA dataset on which it was initially trained. It is not capable of predicting RNAs from different families. To address this issue, MXfold2 Sato et al (2021) proposed a hybrid approach that combines DNN-based models and energy-based models. Integrating a folding score DNN and Turner’s nearest-neighbor free energy parameters helps prevent overfitting and improves the accuracy of secondary structure predictions. However, as with energy-based models, MXfold2 cannot predict complex pseudoknot structures.

Our work introduces a new graph convolutional network (GCN) model called GCNfold for predicting secondary structures of RNA. GCNfold is also an ensemble model, but unlike SPOT-RNA, which utilizes multiple DNN modules, GCNfold leverages graph neural networks to connect RNAfold’s base pairing probabilities with a DNN. To obtain stable RNA structures, the predictions from the deep score network are enforced by the following three constraints: only canonical base pairing with the inclusion of G-U pairing, no sharp loops, and each base pairing only once.

2. Results

GCNfold utilizes an ensemble model composed of the base pairing probability (BPP) matrix, calculated using McCaskill’s partition function algorithm McCaskill (1990), and an RNA one-hot encoding vector. The BPP matrix represents the structural connections between bases and can be conceptualized as a graph G(V,E). The core idea behind GCNfold is to formulate the prediction of RNA secondary structure as a binary classification problem: predicting whether each pair of nucleotides forms a complementary base pair according to the canonical and wobble base pairing rules. The model architecture and training process are detailed in the Methods (Section 3). GCNfold is trained and evaluated on two benchmark datasets: the Rivas Database Rivas et al (2012) and the bpRNA dataset Danaee et al (2018). The inclusion of the Rivas Database allows an analysis of whether our richly parameterized model is prone to overfitting on a family-wise test dataset, while the bpRNA dataset aids the model in learning generalized RNA secondary structures, including pseudoknot pairings.
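As a concrete illustration of the model's inputs, the node features can be built by one-hot encoding the sequence. A minimal sketch (the BPP edge weights, which GCNfold obtains from McCaskill's partition function via RNAfold, are not computed here; function and variable names are ours):

```python
import numpy as np

NUCLEOTIDES = "ACGU"

def one_hot_encode(seq):
    """Return an L x 4 one-hot node-feature matrix for an RNA sequence."""
    idx = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    X = np.zeros((len(seq), 4))
    for pos, nt in enumerate(seq.upper()):
        X[pos, idx[nt]] = 1.0
    return X

# Toy sequence; in GCNfold this L x 4 matrix would be paired with the
# L x L base pairing probability matrix from RNAfold as the graph edges.
X = one_hot_encode("GGGAAACCC")
```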

2.1. Performance Comparison on TestSet of Rivas Data

A comparative analysis of GCNfold with state-of-the-art algorithms for RNA secondary structure prediction, including MXfold2 Sato et al (2021), CONTRAfold Do et al (2006), RNAfold Zuker and Stiegler (1981), RNAstructure Reuter and Mathews (2010), ContextFold Zakov et al (2011), SimFold Andronescu et al (2007), and TORNADO Rivas et al (2012), was conducted using TestSetA and TestSetB for evaluation. In this experiment, MXfold2, ContextFold, and GCNfold were trained on TrainSetA. Table 1 presents the results, which show that GCNfold achieves the highest F1 score on TestSetA. Furthermore, its F1 score on TestSetB, assessed on a family-wise basis, ranks second only to MXfold2’s. These findings highlight the effectiveness of hybrid approaches, such as MXfold2 and GCNfold, which combine free energy minimization and neural networks, in significantly improving the accuracy of RNA secondary structure prediction while mitigating overfitting concerns.

Table 1.

GCNfold yields performance comparable to MXfold2 on TestSetA and the non-homologous TestSetB, where TestSetA shares less than 70% sequence identity with TestSetB.

                TestSetA (sequence-wise)    TestSetB (family-wise)
                PPV    SEN    F1            PPV    SEN    F1
GCNfold (ours)  0.801  0.773  0.780         0.589  0.603  0.587
MXfold2         0.745  0.778  0.761         0.571  0.650  0.601
CONTRAfold      0.671  0.705  0.682         0.543  0.629  0.575
RNAfold         0.626  0.668  0.642         0.498  0.606  0.540
RNAstructure    0.622  0.650  0.631         0.475  0.584  0.518
ContextFold     0.768  0.750  0.759         0.485  0.534  0.502
SimFold         0.616  0.643  0.629         0.512  0.611  0.551
TORNADO         0.738  0.754  0.746         0.528  0.594  0.552

2.2. Performance Comparison on TS0 of bpRNA Data

Next, we analyzed our model using a more comprehensive RNA dataset, which includes pseudoknot structures and various types of incorrect base pairings, in addition to canonical and wobble base pairings (see Figure 2(A)). Our model was trained on the TR0 dataset and validated on the VL0 dataset (please refer to Section 3 for more details on the bpRNA data). We compared GCNfold with state-of-the-art (SOTA) algorithms for RNA secondary structure prediction on the TS0 dataset. The results, presented in Figure 2(D), reveal that GCNfold achieves the highest F1 score of 0.638 on the bpRNA dataset. GCNfold’s F1 score is approximately 4% higher than SPOT-RNA’s, and it also strikes a better balance between sensitivity (SEN) and positive predictive value (PPV). This indicates that GCNfold is a more reliable algorithm for predicting true base pairings. Additionally, it should be noted that the predictions from SPOT-RNA may not always adhere to the three hard constraints necessary to form valid secondary structures. An example of these constraint violations can be observed in Figure 3(e), where (1) there are pairing types other than (A-U), (C-G), and (G-U), and (2) the structure of the hairpin loop (between the 29th-30th bases) is invalid because a hairpin loop must enclose at least three bases.

Fig. 2. GCNfold achieves the highest F1 score on the bpRNA TS0 dataset. Exploratory and performance comparison of folding algorithms on the test set TS0.

Fig. 2

(A) shows the frequency of each type of base pairing within the bpRNA training set. (B) shows that GCNfold has a distinct distribution of F1 scores along the RNA length; furthermore, GCNfold’s F1 distribution is relatively higher than those of other models. (C) shows the plot of the positive predictive value and sensitivity of each model’s predictions on the TS0 test set. ε represents the cutoff probability threshold of the model. The result shows that GCNfold has higher sensitivity than other models. (D) Box plot of the F1 score among the algorithms. GCNfold has the highest F1 score.

Fig. 3. GCNfold prediction (b) yields the highest evaluation scores of F1=0.905, PPV=0.905, and SEN=0.905.

Fig. 3

Green bases represent nucleotides with the same structure types (pairing/non-pairing) as the bpRNA reference (a) while red bases represent incorrect predictions compared to the bpRNA reference. The scores of other models are as follows: (c) MXfold2: F1=0.481, PPV=0.403, SEN=0.595, (d) CONTRAfold: F1=0.558, PPV=0.468, SEN=0.690, (e) SPOT-RNA: F1=0.781, PPV=0.651, SEN=0.976, and (f) RNAfold: F1=0.257, PPV=0.209, SEN=0.333.

The GCNfold scores significantly exceed those of RNAfold, whose base pairing probabilities serve as our model’s edge features. This implies that GCNfold enhances the predictive capabilities of RNAfold by assimilating its underlying pairing distribution through a graph network, thus formulating a more precise posterior pairing distribution, i.e., predicted base pairing probabilities. We also analyzed our model’s predictive performance across various RNA length intervals. In particular, for shorter RNA sequences (L ≤ 100 bases), all models produce higher evaluation scores than for longer RNA sequences. Intriguingly, our model maintains an F1 score of approximately 0.58–0.61 for RNA sequences with lengths ranging from 100 to 400 bases, only slightly lower than GCNfold’s average F1 score of 0.638.

2.3. Analysis of GCNfold Predictions on Short Segments of the SARS-CoV-2 Viral Genome

Experimental approaches to obtain a highly accurate secondary structure of RNA, especially for long RNA genomes, are time-consuming Wu et al (2020). This can become a substantial bottleneck in clinical applications, specifically when studying genomes of RNA viruses, such as the recently emerging SARS-CoV-2 (RefSeq accession NC_045512.2), a single-stranded betacoronavirus with a genome of approximately 30 kilobases. It is still challenging for computational folding algorithms to predict the structure of a long genome at once due to structural complexity and computational demands. In this experiment, GCNfold is employed to predict the secondary structures of small genes or subsections of the SARS-CoV-2 genome, such as the 5’ untranslated region (UTR), heptapeptide repeat regions (HR) of the spike (S) gene, and the envelope (E) gene. Our model predictions are validated against the available SARS-CoV-2 structure from Lan et al (2022), based on DMS mutational profiling with sequencing (DMS-MaPseq) Zubradt et al (2017).

The GCNfold prediction demonstrates that our algorithm accurately predicts most of the 5’UTR and the beginning of the ORF1a structure, with an F1 score of 0.953, a PPV of 0.988, and a SEN of 0.921 compared to the DMS-MaPseq reference structure (Figure 4(A)). In line with a previous study Zubradt et al (2017), GCNfold perfectly predicted four stem loops (SL1–4) within the 5’ UTR, with minor differences found in the SL5 region, where our model formulated a larger multibranch loop that included SL5A, SL5B and SL5C. The prediction scores for other folding algorithms are RNAfold (F1=0.924, PPV=0.885, SEN=0.966) and MXfold2 (F1=0.960, PPV=0.955, SEN=0.966). Therefore, our model not only enhances the prediction accuracy of RNAfold but also achieves a performance comparable to MXfold2.

Fig. 4. Secondary structures of SARS-CoV-2 sub-genomes generated by GCNfold closely resemble those formulated by DMS-MaPseq.

Fig. 4

(A) Consensus secondary structure predictions of 5’UTR and beginning of ORF1a structure, (B) HR2 region, and (C) E-gene by GCNfold and DMS-MaPseq, respectively. Black boxes indicate the regions where the GCNfold structure predictions differ from the DMS-MaPseq determinations.

We then applied our method to the gene that encodes the SARS-CoV-2 Spike protein, which plays an essential role in the interaction with potential host cells by binding its receptor-binding domain to the host receptor, causing an infection Huang et al (2020); Zhou and Zhao (2020). GCNfold achieved an F1-score of 0.845, a PPV of 0.968, and a SEN of 0.750 for predicting the secondary structure of HR2, a part of the S2 subunit of the spike protein that plays a role in the virus’s membrane fusion with the host Lu et al (2015). Interestingly, the energy-based model (e.g., RNAfold) completely failed to predict the structure of this particular gene (F1=0.118, PPV=0.111, SEN=0.125). MXfold2, whose post-processing heavily relies on Zuker’s algorithm, could not formulate an accurate structure (F1=0.500, PPV=0.563, SEN=0.450). Despite the high evaluation scores, GCNfold’s prediction missed the interactions between bases marked by the black boxes.

Lastly, we predicted the folding structure of the E-gene (Figure 4(C)). We observed that the predictions of the first two stem loops by GCNfold are almost identical to the DMS-MaPseq structure. In this region, MXfold2 performed the best in predicting its secondary structure. The scores for the folding algorithms are as follows: GCNfold (F1=0.855, PPV=0.917, SEN=0.800), MXfold2 (F1=0.924, PPV=0.961, SEN=0.891), RNAfold (F1=0.815, PPV=0.830, SEN=0.800).

2.4. Comparative analysis of secondary structures for GCNfold Predictions of E-gene across Betacoronavirus Subgenera

This study aims to analyze the RNA secondary structure of the envelope protein gene (E gene) across various subgenera within the Betacoronavirus genus. These subgenera include SARS-CoV-1 (Sarbecovirus, lineage B) Thiel et al (2003), RaTG13 (Sarbecovirus, lineage B) Zhou et al (2020), and MERS-CoV (Merbecovirus, lineage C) van Boheemen et al (2012). The selection of the E gene stems from its presence in all Betacoronavirus subgenera and its relatively short genome, rendering it well-suited for comparative analysis. There is a notable absence of experimental data regarding the annotated RNA secondary structure within these subgenera. Consequently, this study aims to shed light on these structural aspects and underscore the effectiveness of the GCNfold model in making these inferences.

Based on the GCNfold predictions (Figure 5), the anticipated structures of the E genes within the Sarbecovirus subgenus (Figure 5(a), (b), and (c)) exhibit striking similarities. Notably, whether the mutations are silent or non-silent, these variations do not significantly impact the overall secondary structure. In contrast, the projected structure of the E gene from the Merbecovirus subgenus (Figure 5(d)) markedly deviates from that of the Sarbecovirus subgenus due to substantial disparities in their nucleotide sequences, potentially influencing protein synthesis. This shows the GCNfold model’s ability to accurately predict the RNA secondary structure of the E gene across diverse subgenera. Furthermore, in the process of secondary structure prediction, this algorithm identifies regions within the E gene that can serve as a basis for designing subgenus-specific primers for molecular diagnostics. Alternatively, these regions can be harnessed to develop target-specific or target-agnostic antisense primers for therapeutic purposes.

Fig. 5. Secondary structures of Betacoronavirus E genes formulated by GCNfold show similarity within each subgenus.

Fig. 5

(a), (b), and (c) are the structures of SARS-CoV-2 (NC_045512.2), SARS-CoV-1 (AY291315.1), and RaTG13 (MN996532.2), which belong to the Sarbecovirus lineage; (d) is the structure of MERS-CoV (NC_019843.3), which belongs to the Merbecovirus lineage.

3. Methods

3.1. Data Descriptions

We employ two benchmark datasets to train our model and assess its performance in comparison to other folding algorithms:

  1. Rivas Database Rivas et al (2012): This dataset comprises TrainSetA, TestSetA, and TestSetB. TrainSetA and TestSetA contain sequences with less than 70% sequence identity compared to the data in TestSetB, ensuring non-homologous sequences. Additionally, the dataset excludes pseudoknot secondary structures Sato et al (2021). In this scenario, our model was trained on TrainSetA and evaluated on TestSetA (sequence-wise) and TestSetB (family-wise). Table 2 summarizes the statistics related to this dataset.

  2. bpRNA-1m dataset Danaee et al (2018): This dataset’s secondary structures are based on experimental data from various studies. To mitigate the risk of overfitting due to dataset redundancy and sequence similarity, we followed SPOT-RNA’s protocol Singh et al (2019). This protocol employed an 80% sequence identity cut-off, computed using the CD-HIT-EST program Fu et al (2012). The filtered dataset is then randomly split into training (TR0), validation (VL0), and testing (TS0) sets, as indicated in Table 2. Exploratory data analysis on the TR0 dataset reveals that approximately 78.5% of base-pairing occurrences are canonical (G-C & A-U), 11.2% are non-canonical wobble pairings (G-U), and the remaining 10.3% consist of other non-canonical pairings.

Table 2.

The summary of datasets used in our experiments.

Dataset                            #sequences   Length
bpRNA-1m (cut-off)   TR0           10,841       33–498
                     VL0            1,300       33–497
                     TS0            1,305       22–499
Rivas Data           TrainSetA      3,166       10–734
                     TestSetA         592       10–768
                     TestSetB         430       27–244

3.2. Modeling

Figure 6 presents the architecture of our deep neural network, denoted Fθ(G), designed for calculating base pairing probability scores. Here, θ represents learnable parameters, and G signifies the input graph. Our model operates as an ensemble, incorporating the base pairing probability (BPP) matrix derived from RNAfold Lorenz et al (2011) and the RNA embedding vector. The BPP encodes structural connections between bases, forming a graph G(V,E). We employ a graph convolutional network (GCN) to harness this information. It takes the L × d dimensional sequence embedding of an RNA sequence with length L as node features (X) and the L × L prior BPP as a quasi-adjacency matrix (ÂR) of graph G. The propagation process in a general GCN layer f(X,A) is defined as follows:

f(X, A) = σ(D̃^{-1/2} Ã D̃^{-1/2} X W + b) (1)

Fig. 6. Deep neural architecture of GCNfold.

Fig. 6

The input of the model is a set of graphs G such that for each graph G(V,E), V is the encoded information of the RNA sequence: an L × 4 matrix in which each row represents the nucleotide at each base. The edges E are given by the pairing probability matrix obtained by passing the RNA sequence through RNAfold. This input is processed by the graph convolutional layer (GCSConv) and a bi-directional LSTM to generate the embedding. This part is pre-trained before training the whole model. The embedding is then converted back to a matrix and passed through a series of convolutional layers with a skip connection. The output of the convolutional layers is post-processed by applying an element-wise product with a constraint mask. Last, the symmetric argmax algorithm is applied to formulate the score matrix, which is the binary representation of the RNA secondary structure.

Here, Ã = A + I, where I represents added self-loops, and D̃^{-1/2} is the inverse square root of the degree matrix corresponding to Ã. The parameters W and b are trainable weights and biases of the function f(X,A). This process approximates the convolution operator on a graph Defferrard et al (2016); Kipf and Welling (2016) (for further details, see Kipf and Welling (2016)). In our work, we have enhanced f(X,A) by introducing a trainable skip connection. Notably, the RNAfold BPP already provides a stochastic quasi-adjacency matrix ÂR. Consequently, there is no need to add the self-loops I, as they are irrelevant for RNA secondary structures (i.e., aij = 0 when i = j). Additionally, we can omit D̃^{-1/2}, since there is no need to renormalize the probability matrix. Thus, our modified f(X, ÂR) (also referred to as the GCSConv layer) can be expressed as follows:

f(X, ÂR) = σ(ÂR X W1 + X W2 + b) (2)

In this equation, W1, W2, and b represent parameters that must be trained, and σ(·) denotes the nonlinear activation function. Specifically, we employ GeLU Hendrycks and Gimpel (2020) as the activation function in this model. Figure 6 depicts our model divided into three stages.
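Equation (2) is simple enough to sketch directly. A minimal NumPy version of one GCSConv forward pass, with randomly initialized weights standing in for the trained parameters (names are ours, not from the GCNfold codebase):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation (Hendrycks & Gimpel)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gcs_conv(X, A_hat, W1, W2, b):
    """One GCSConv layer: sigma(A_hat X W1 + X W2 + b).
    The X W2 term acts as the trainable skip connection; A_hat is the
    stochastic quasi-adjacency matrix of base pairing probabilities."""
    return gelu(A_hat @ X @ W1 + X @ W2 + b)

# Toy forward pass with random stand-in weights
rng = np.random.default_rng(0)
L, d_in, d_out = 6, 4, 8
X = rng.standard_normal((L, d_in))
A_hat = rng.random((L, L))            # would come from RNAfold's BPP
out = gcs_conv(X, A_hat,
               rng.standard_normal((d_in, d_out)),
               rng.standard_normal((d_in, d_out)),
               np.zeros(d_out))
```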

3.2.1. Stage 1: Sequential

In this stage, we start by inputting the graph-structured data into the GCSConv block, followed by the bidirectional long short-term memory (Bi-LSTM) block. To maintain computational efficiency, each direction of the Bi-LSTM network has a total of d/2 hidden units. The combination of GCSConv+Bi-LSTM blocks is repeated a total of N times. The resulting output feature map is then forwarded to a fully connected layer (FC) responsible for learning and predicting the probability matrix. This matrix has dimensions L × (p + 1), where L represents the length of the sequence, and p signifies the number of possible dot-bracket notations. The neural network architecture in this stage can be viewed as a sequence-to-sequence model, where the input sequence is transformed into a corresponding sequence of dot-bracket notations.

3.2.2. Stage 2: Mapping

In this stage, we convert the sequential information acquired from Stage 1 into a binary score matrix with dimensions L × L using a convolutional network. Initially, a Conv1D layer generates a matrix x of size L × d/2. After that, a matrix multiplication between x and its transpose xT results in a matrix of size L × L. Subsequently, a non-linear activation function σ(·) is applied to this result. The output is passed to the Residual Conv2D block, whose network details are illustrated in Figure 6. The sigmoid activation ϕ(·) is then used to compute the output pairing probability matrix of Fθ(G).
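A stripped-down sketch of this mapping, omitting the Conv1D and Residual Conv2D blocks (the outer product alone already yields a symmetric L × L score matrix; function names are ours):

```python
import numpy as np

def map_to_pair_matrix(x):
    """Expand an L x (d/2) embedding into an L x L pairing score matrix:
    outer product x x^T followed by a sigmoid. Sketch only; the actual
    model interposes Conv1D and Residual Conv2D blocks around this step."""
    scores = x @ x.T                      # L x L, symmetric by construction
    return 1.0 / (1.0 + np.exp(-scores))  # sigmoid keeps entries in (0, 1)

rng = np.random.default_rng(1)
P = map_to_pair_matrix(rng.standard_normal((5, 3)))
```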

3.2.3. Stage 3: Post-Processing

As previously mentioned, biologically, we assume that RNA secondary structures must satisfy the following constraints Chen et al (2020); Steeg (1993):

  1. Only Watson-Crick and wobble pairing types are allowed, denoted λ = {(A,U), (U,A), (C,G), (G,C), (U,G), (G,U)}.

  2. No sharp loops: a hairpin loop must contain at least three nucleotides.

  3. Each base pairs with at most one other base (i.e., each row and column of Â must have at most one non-zero element).

We implement constraints (i) and (ii) using a symmetric constraint matrix M ∈ {0, 1}^{L×L}. Here, xi represents the base at the ith position of the RNA sequence x = (x1, …, xL). The matrix M is defined as follows: Mij = 1 if (xi, xj) ∈ λ and |i − j| ≥ 4; otherwise, Mij = 0 (as illustrated in Figure 6). To ensure that the output matrix of Fθ(G) adheres to constraints (i) and (ii), we perform element-wise multiplication between Fθ(G) and M, denoted Â = Fθ(G) ⊗ M. This operation simplifies the search space of GCNfold, allowing the training process to converge significantly faster than training the model without matrix M.
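Constructing the mask is straightforward. A sketch assuming the minimum pairing distance of four positions implied by the no-sharp-loop constraint (names are ours):

```python
import numpy as np

# Canonical Watson-Crick pairs plus the G-U wobble pair (the set lambda)
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("U", "G"), ("G", "U")}

def constraint_mask(seq, min_dist=4):
    """M[i, j] = 1 iff (seq[i], seq[j]) is a canonical/wobble pair and
    |i - j| >= min_dist, enforcing constraints (i) and (ii)."""
    L = len(seq)
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            if (seq[i], seq[j]) in CANONICAL and abs(i - j) >= min_dist:
                M[i, j] = 1.0
    return M

M = constraint_mask("GGGAAACCC")
```

The element-wise product `F_theta @G output * M` then zeroes out every entry that could not correspond to a valid pairing.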


However, the prediction matrix Â may not always satisfy constraint (iii). To address this issue, we propose a Symmetric Argmax Post-processing (SAP) technique inspired by Booy et al (2021). The SAP algorithm, outlined in Algorithm 1, is straightforward. Given a predicted probability matrix Â, we perform an argmax operation along the y-axis to identify the position i of the maximum value in the jth column. Next, we introduce a probability threshold ϵ = 0.35 as a cutoff to determine which Âij values can form pairings. Because the score matrix Â does not guarantee symmetric pairings, we define a transformation τ on {0, 1}^{L×L} as τ(Â) = (Â + Âᵀ)/2, where τ(Â)ij is set to 0 if τ(Â)ij < 1. As a result, the output of GCNfold is the predicted RNA secondary structure in the form of a matrix τ(Â) ∈ {0, 1}^{L×L}, which is symmetric and complies with all three constraints (i), (ii), and (iii). The ϵ value of 0.35 was optimized on the VL0 dataset (candidate values: 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, and 0.6).
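A minimal NumPy rendering of the SAP procedure described above (a sketch with our own names; the published Algorithm 1 may differ in detail):

```python
import numpy as np

def symmetric_argmax(A_hat, eps=0.35):
    """Symmetric Argmax Post-processing: per column, keep only the row
    with the maximum score, threshold at eps, then symmetrize so each
    base pairs with at most one other base (constraint iii)."""
    L = A_hat.shape[0]
    P = np.zeros((L, L))
    for j in range(L):
        i = int(np.argmax(A_hat[:, j]))
        if A_hat[i, j] > eps:
            P[i, j] = 1.0
    # tau(A) = (P + P^T)/2; entries below 1 are zeroed, so only
    # mutually selected pairs survive
    T = (P + P.T) / 2.0
    T[T < 1.0] = 0.0
    return T

# Toy matrix where bases 0 and 2 mutually select each other
A = np.zeros((3, 3))
A[0, 2] = A[2, 0] = 0.9
T = symmetric_argmax(A)
```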

3.3. Training Procedure

We leverage the idea of transfer learning to train GCNfold. Specifically, we divide the training process into two steps: (1) Training GCNfold during Stage 1 to formulate dot-bracket notations and (2) Connecting the neural network from Stage 1 with other parts of the architecture, and fine-tune the whole network to predict the base pairing probability matrix (i.e., the score matrix). The training process is described below.

Step 1: (Pre-training Step). As described in Stage 1 (3.2.1), given RNA graph data G(V,E) as input, we train the sequence-to-sequence model to predict secondary structures in the form of dot-bracket sequences. We use the training dataset (TR0) for training and assess the model on the validation dataset (VL0). In particular, we use categorical cross-entropy loss and the Adam optimizer for training the model. The hyperparameters d and N are also optimized in this step (i.e., selecting the hyperparameter combination that returns the lowest loss after training for 60 epochs).

Step 2: After the initial training of Step 1, we connect the model defined in Stage 1 with the other sections (Figure 6). The neural network weights of Stage 1 are initialized with the trained parameters of the sequence-to-sequence model from Step 1. The target is now the base pairing probability matrix (BPP) rather than the dot-bracket sequence. Since the BPP is a sparse matrix (most entries are 0), we use a weighted binary cross-entropy loss with a positive sample weight of 100 to handle the imbalanced labels during training. The Adam optimizer is used, and the learning rate is reduced on plateau when the monitored metric stops improving for more than 5 epochs.
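The weighted loss can be sketched as follows; the positive sample weight of 100 is the value quoted above, while the function itself is our illustration and the actual training code may differ:

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=100.0, eps=1e-7):
    """Weighted binary cross-entropy: up-weights the sparse positive
    (paired) entries of the target BPP matrix so that missed pairs cost
    far more than false positives on the abundant unpaired entries."""
    p = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    loss = -(pos_weight * y_true * np.log(p)
             + (1.0 - y_true) * np.log(1.0 - p))
    return loss.mean()
```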

3.3.1. Performance Measure

The metrics commonly used for this task are Sensitivity (SEN), Positive Predictive Value (PPV), and F1 score. SEN measures the ability to predict the positive base pairs, while PPV measures the ability not to fold false positive base pairs Wang et al (2019). Finally, the F1 score is the harmonic mean of SEN and PPV, a balanced metric between PPV and SEN. The equations of our metrics are as follows:

SEN = TP / (TP + FN),  PPV = TP / (TP + FP),  and  F1 = (2 × SEN × PPV) / (SEN + PPV) (3)

Here, TP is the number of correctly predicted pairs, FP is the number of incorrectly predicted pairs, and FN is the number of reference pairs that the model fails to predict.
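Using the standard definitions (SEN = TP/(TP+FN), PPV = TP/(TP+FP)), the metrics can be computed directly from binary base-pair matrices; a small sketch with our own function name:

```python
import numpy as np

def pair_metrics(pred, ref):
    """SEN, PPV, and F1 over binary base-pair matrices (1 = paired).
    TP: pairs in both; FP: predicted but absent from the reference;
    FN: reference pairs the prediction missed."""
    tp = np.sum((pred == 1) & (ref == 1))
    fp = np.sum((pred == 1) & (ref == 0))
    fn = np.sum((pred == 0) & (ref == 1))
    sen = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * sen * ppv / (sen + ppv) if sen + ppv else 0.0
    return sen, ppv, f1
```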

4. Conclusion

This study introduces the GCNfold model, a graph convolutional neural network designed to predict RNA secondary structures from RNA sequences. We also present a symmetric argmax post-processing algorithm with linear time complexity integrated into the model. This algorithm enforces the secondary structure constraints and ensures all output predictions are valid. In our initial experiment, we demonstrate that GCNfold outperforms other folding algorithms, particularly on the testing set (TS0) of the bpRNA dataset, where the RNA sequences in the testing set exhibit structural similarity to those in the training set. Notably, GCNfold shows an ability to recognize certain pseudoknot structures.

Concerns regarding overfitting have been raised in previous research, particularly for richly parameterized models Rivas et al (2012); Sato et al (2021). To mitigate this issue, GCNfold leverages base pairing probabilities obtained from RNAfold’s partition function as a known prior distribution over RNA graph data. It then embeds the RNA structural data through multiple graph convolutional layers and optimizes the DNN parameters. As a result, our experiments reveal that GCNfold achieves the best F1 score on the homologous sequences of TestSetA. Furthermore, even for the non-homologous sequences of TestSetB, its F1 score is second only to MXfold2’s, with a difference of approximately 1%. This outcome suggests that incorporating prior information related to free energy and graph structure into deep neural networks can significantly enhance prediction accuracy and model robustness.

The experiment on SARS-CoV-2 genomes also demonstrates the utility of folding algorithms for newly discovered RNA. The results suggest that learning models leveraging thermodynamic energy knowledge (e.g., GCNfold and MXfold2) can formulate structures that more closely resemble the structures obtained from the DMS-MaPseq experiment than a traditional energy-based model (e.g., RNAfold).

Due to limited training data for RNA secondary structure prediction, accuracy remains constrained. One potential improvement path involves using unsupervised learning algorithms to enhance our understanding of RNA sequences. Despite current limitations, GCNfold can still be valuable for RNA structure prediction, especially when combined with expert knowledge such as primer-nucleotide design or identifying RNA-exposed regions. In our study, we employed a basic graph convolutional layer Kipf and Welling (2016). Future research could explore alternative convolutional layer designs Bianchi et al (2021); Thekumparampil et al (2018); Xu et al (2019) used in graph neural networks (GNNs) to potentially boost RNA secondary structure prediction accuracy. Additionally, adapting algorithms such as sliding windows Agarwal et al (2019); Chen et al (2006), or leveraging RNA’s physical properties as demonstrated in AlphaFold Senior et al (2020, 2019), are avenues worth exploring for enhanced performance on long RNA sequences.

Fig. 1. Workflow of the modeling and inference.

The prediction process begins with preprocessing of the RNA sequence. The one-hot encoding of the sequence itself and the probabilistic encoding from Vienna RNAfold Lorenz et al (2011) are used as input to the sequential embedding stage, which consists of a graph convolutional network (GCN) and a bidirectional LSTM (Bi-LSTM). The Bi-LSTM output is then flattened and fed into the pairing probability mapping, which expands the vector into a matrix using a one-dimensional convolutional network and a shape-preserving nonlinear transformation σ of the inner product between the embedding and its transpose. The output then passes through residual networks and undergoes post-processing to yield the RNA secondary structure.
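The core of the pairing probability mapping, turning per-nucleotide embeddings into a symmetric pairwise score matrix, can be sketched as follows. This minimal NumPy version keeps only the symmetric inner-product-plus-nonlinearity idea and omits the one-dimensional convolution and all learned parameters; the embedding sizes and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairing_map(Z):
    """Map per-nucleotide embeddings Z (L x d) to a symmetric L x L
    pairing-score matrix via the inner product Z Z^T and a sigmoid."""
    S = Z @ Z.T                 # pairwise inner products, already symmetric
    P = sigmoid(S)              # elementwise nonlinearity, scores in (0, 1)
    return 0.5 * (P + P.T)      # enforce symmetry explicitly

rng = np.random.default_rng(1)
Z = rng.normal(size=(10, 16))   # embeddings for a 10-nt toy sequence
P = pairing_map(Z)              # shape (10, 10), symmetric
```

The symmetry constraint matters because base pairing is an unordered relation: a score for (i, j) must equal the score for (j, i), which the post-processing step relies on when extracting a valid structure.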

Acknowledgement

We would like to thank Prof. Fatemeh Emdad for advice on this project. Palawat Busaranuvong was partly supported by a Massachusetts Life Sciences Center Internship award. This project was partly supported by The RADx-RAD R42DE030829 to R. Khosravi-Far and by NIH/NLM grant R01LM014017NLM to D. Korkin.

Appendix A. An example of an RNA structure that none of the models predicts correctly

Fig. A1. Comparison of predicted structures of bpRNA-RFAM-6232: Green represents the correct prediction, and red indicates the incorrect prediction at that position compared to (a) the bpRNA reference. The predictions are from (b) GCNfold (F1=0.595, PPV=0.583, SEN=0.609), (c) MXfold2 (F1=0.453, PPV=0.323, SEN=0.304), (d) CONTRAfold (F1=0.444, PPV=0.326, SEN=0.696), (e) SPOT-RNA (F1=0.264, PPV=0.233, SEN=0.400), and (f) RNAfold (F1=0.430, PPV=0.303, SEN=0.739).

Figure A1 shows a case in which none of the models predicts a secondary structure similar to the bpRNA reference.
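The F1, PPV (positive predictive value), and SEN (sensitivity) values reported in these captions follow the standard definitions over sets of predicted and reference base pairs. A minimal sketch, assuming exact pair matching (some evaluations additionally allow one-position slippage); the example pair sets are illustrative, not taken from any figure:

```python
def structure_scores(pred_pairs, ref_pairs):
    """PPV, sensitivity, and F1 from predicted vs. reference base-pair sets."""
    tp = len(pred_pairs & ref_pairs)                     # true positives
    ppv = tp / len(pred_pairs) if pred_pairs else 0.0    # precision
    sen = tp / len(ref_pairs) if ref_pairs else 0.0      # recall
    f1 = 2 * ppv * sen / (ppv + sen) if ppv + sen else 0.0
    return ppv, sen, f1

ref = {(1, 10), (2, 9), (3, 8), (15, 25)}    # reference base pairs (i, j)
pred = {(1, 10), (2, 9), (4, 7), (15, 25)}   # predicted base pairs
ppv, sen, f1 = structure_scores(pred, ref)   # 3 of 4 pairs match on each side
```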

Appendix B. SARS-CoV-2 structure prediction by each folding algorithm

Fig. B2. Secondary structure predictions of the 5’UTR and the beginning of the ORF1a structure by (A) DMS-MaPseq (reference), (B) GCNfold, (C) RNAfold, and (D) MXfold2.


The scores of folding algorithms on this region compared with the DMS-MaPseq reference are as follows: GCNfold (F1=0.953, PPV=0.988, SEN=0.921), RNAfold (F1=0.924, PPV=0.885, SEN=0.966), and MXfold2 (F1=0.960, PPV=0.955, SEN=0.966).

Fig. B3. Secondary structure predictions of Heptapeptide Repeat Sequence 2 (HR2) by (A) DMS-MaPseq (reference), (B) GCNfold, (C) RNAfold, and (D) MXfold2.


The scores of folding algorithms on this region compared with the DMS-MaPseq reference are as follows: GCNfold (F1=0.845, PPV=0.968, SEN=0.750), RNAfold (F1=0.118, PPV=0.111, SEN=0.125), and MXfold2 (F1=0.500, PPV=0.563, SEN=0.450).

Fig. B4. Secondary structure predictions of the envelope (E) gene by (A) DMS-MaPseq (reference), (B) GCNfold, (C) RNAfold, and (D) MXfold2.


The scores of folding algorithms on this region compared with the DMS-MaPseq reference are as follows: GCNfold (F1=0.854, PPV=0.917, SEN=0.800), RNAfold (F1=0.815, PPV=0.830, SEN=0.800), and MXfold2 (F1=0.924, PPV=0.961, SEN=0.891).

Appendix C. Differences in the E-gene RNA sequence within a subgenus

Fig. C5. Comparison of E-gene RNA sequences among selected coronaviruses.

The rectangle indicates the mutation position, and the nucleotide highlighted in red indicates the difference within Sarbecovirus; SARS-CoV-2 and RaTG13 differ by only one nucleotide, at the 72nd position.

Footnotes

Additional Declarations: There is a potential competing interest. Dr. Roya Khosravi-Far is the Chief Executive Officer and Founder of the company InnoTech Precision Medicine. Dr. Dmitry Korkin was a consultant for the company Seismic Therapeutics.

References

1. Agarwal S, Singh V, Agarwal P, et al. (2019) Prediction of secondary structure of proteins using sliding window and backpropagation algorithm. In: Malik H, Srivastava S, Sood YR, et al. (eds) Applications of Artificial Intelligence Techniques in Engineering. Springer, Singapore, pp 533–541
2. Akiyama M, Sato K, Sakakibara Y (2018) A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. Journal of Bioinformatics and Computational Biology 16(06):1840025. 10.1142/S0219720018400255
3. Andronescu M, Condon A, Hoos HH, et al. (2007) Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics 23(13):i19–i28. 10.1093/bioinformatics/btm223
4. Bianchi FM, Grattarola D, Livi L, et al. (2021) Graph neural networks with convolutional ARMA filters. IEEE Transactions on Pattern Analysis and Machine Intelligence. 10.1109/TPAMI.2021.3054830
5. van Boheemen S, de Graaf M, Lauber C, et al. (2012) Genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans. mBio 3(6):e00473-12. 10.1128/mBio.00473-12
6. Booy MS, Ilin A, Orponen P (2021) RNA secondary structure prediction with convolutional neural networks. bioRxiv 2021.05.24.445408. 10.1101/2021.05.24.445408
7. Chen K, Kurgan L, Ruan J (2006) Optimization of the sliding window size for protein structure prediction. In: 2006 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology, pp 1–7. 10.1109/CIBCB.2006.330959
8. Chen X, Li Y, Umarov R, et al. (2020) RNA secondary structure prediction by learning unrolled algorithms. In: 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia. URL https://openreview.net/forum?id=S1eALyrYDH
9. Danaee P, Rouches M, Wiley M, et al. (2018) bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Research 46(11):5381–5394. 10.1093/nar/gky285
10. Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. arXiv:1606.09375
11. Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14):e90–e98. 10.1093/bioinformatics/btl246
12. Fu L, Niu B, Zhu Z, et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23):3150–3152. 10.1093/bioinformatics/bts565
13. Fu L, Cao Y, Wu J, et al. (2022) UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Research 50. 10.1093/nar/gkab1074
14. Hendrycks D, Gimpel K (2020) Gaussian error linear units (GELUs). arXiv:1606.08415
15. Huang Y, Yang C, Xu Xf, et al. (2020) Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta Pharmacologica Sinica 41(9):1141–1149. 10.1038/s41401-020-0485-4
16. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907
17. Lan TCT, Allan MF, Malsick LE, et al. (2022) Secondary structural ensembles of the SARS-CoV-2 RNA genome in infected cells. Nature Communications 13(1):1128. 10.1038/s41467-022-28603-2
18. Lorenz R, Bernhart SH, Höner zu Siederdissen C, et al. (2011) ViennaRNA Package 2.0. Algorithms for Molecular Biology 6(1):26. 10.1186/1748-7188-6-26
19. Lu G, Wang Q, Gao GF (2015) Bat-to-human: spike features determining ‘host jump’ of coronaviruses SARS-CoV, MERS-CoV, and beyond. Trends in Microbiology 23(8):468–478. 10.1016/j.tim.2015.06.003
20. Markham NR, Zuker M (2008) UNAFold: software for nucleic acid folding and hybridization. Methods in Molecular Biology 453:3–31. 10.1007/978-1-60327-429-6_1
21. Mathews DH, Disney MD, Childs JL, et al. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences 101(19):7287–7292. 10.1073/pnas.0401799101
22. McCaskill JS (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29(6–7):1105–1119. 10.1002/bip.360290621
23. Reuter JS, Mathews DH (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11(1):129. 10.1186/1471-2105-11-129
24. Rivas E, Lang R, Eddy SR (2012) A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA 18(2):193–212. 10.1261/rna.030049.111
25. Sato K, Akiyama M, Sakakibara Y (2021) RNA secondary structure prediction using deep learning with thermodynamic integration. Nature Communications 12(1):941. 10.1038/s41467-021-21194-4
26. Senior AW, Evans R, Jumper J, et al. (2019) Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins: Structure, Function, and Bioinformatics 87(12):1141–1148. 10.1002/prot.25834
27. Senior AW, Evans R, Jumper J, et al. (2020) Improved protein structure prediction using potentials from deep learning. Nature 577(7792):706–710. 10.1038/s41586-019-1923-7
28. Singh J, Hanson J, Paliwal K, et al. (2019) RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nature Communications 10(1):5407. 10.1038/s41467-019-13395-9
29. Steeg EW (1993) Neural Networks, Adaptive Optimization, and RNA Secondary Structure Prediction. American Association for Artificial Intelligence, USA, pp 121–160
30. Thekumparampil KK, Wang C, Oh S, et al. (2018) Attention-based graph neural network for semi-supervised learning. arXiv:1803.03735
31. Thiel V, Ivanov KA, Putics A, et al. (2003) Mechanisms and enzymes involved in SARS coronavirus genome expression. Journal of General Virology 84(9):2305–2315. 10.1099/vir.0.19424-0
32. Tinoco I, Bustamante C (1999) How RNA folds. Journal of Molecular Biology 293(2):271–281. 10.1006/jmbi.1999.3001
33. Trotta E (2014) On the normalization of the minimum free energy of RNAs by sequence length. PLOS ONE 9(11):1–9. 10.1371/journal.pone.0113380
34. Wang L, Liu Y, Zhong X, et al. (2019) DMfold: a novel method to predict RNA secondary structure with pseudoknots based on deep learning and improved base pair maximization principle. Frontiers in Genetics 10. 10.3389/fgene.2019.00143
35. Wu F, Zhao S, Yu B, et al. (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265–269. 10.1038/s41586-020-2008-3
36. Xu K, Hu W, Leskovec J, et al. (2019) How powerful are graph neural networks? arXiv:1810.00826
37. Zakov S, Goldberg Y, Elhadad M, et al. (2011) Rich parameterization improves RNA structure prediction. Journal of Computational Biology 18(11):1525–1542. 10.1089/cmb.2011.0184
38. Zhou G, Zhao Q (2020) Perspectives on therapeutic neutralizing antibodies against the novel coronavirus SARS-CoV-2. International Journal of Biological Sciences 16(10):1718–1723. 10.7150/ijbs.45123
39. Zhou P, Yang XL, Wang XG, et al. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798):270–273. 10.1038/s41586-020-2012-7
40. Zubradt M, Gupta P, Persad S, et al. (2017) DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo. Nature Methods 14(1):75–82. 10.1038/nmeth.4057
41. Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 9(1):133–148. 10.1093/nar/9.1.133

Articles from Research Square are provided here courtesy of American Journal Experts
