Abstract
Protein–RNA interactions (PRIs) are essential for many biological processes, so understanding aspects of the sequences and structures involved in PRIs is important for unraveling such processes. Because of the expensive and time-consuming techniques required for experimental determination of complex protein–RNA structures, various computational methods have been developed to predict PRIs. However, most of these methods focus on predicting only RNA-binding regions in proteins or only protein-binding motifs in RNA. Methods for predicting entire residue–base contacts in PRIs have not yet achieved sufficient accuracy. Furthermore, some of these methods require the identification of 3D structures or homologous sequences, which are not available for all protein and RNA sequences. Here, we propose a prediction method for predicting residue–base contacts between proteins and RNAs using only sequence information and structural information predicted from sequences. The method can be applied to any protein–RNA pair, even when rich information such as its 3D structure, is not available. In this method, residue–base contact prediction is formalized as an integer programming problem. We predict a residue–base contact map that maximizes a scoring function based on sequence-based features such as k-mers of sequences and the predicted secondary structure. The scoring function is trained using a max-margin framework from known PRIs with 3D structures. To verify our method, we conducted several computational experiments. The results suggest that our method, which is based on only sequence information, is comparable with RNA-binding residue prediction methods based on known binding data.
Keywords: protein–RNA interaction, RNA secondary structure, structured support vector machine
1. Introduction
Recent studies have begun unraveling the mechanisms of biological processes involving functional non-coding RNAs, most of which interact with RNA-binding proteins (RBPs) in essential roles, such as splicing, transport, localization, and translation. These interactions involve sequence- and structure-specific recognition between proteins and RNAs. Therefore, understanding aspects of sequences and structures involved in protein–RNA interactions (PRIs) is important for understanding many biological processes. To that end, several studies have focused on the analysis and discussion of PRIs [1,2,3].
Compared with deciphering genomic sequences by using high-throughput sequencing technology, experimental determination of protein–RNA joint structures is both more expensive and more time consuming. Accordingly, rapid computational prediction of PRIs from only sequence information is desirable. Existing methods for computational prediction of PRIs can be roughly classified into four groups. The first group predicts whether a given protein–RNA pair interacts or not [4,5,6,7]. A prediction algorithm for this approach can be simply designed from interacting protein–RNA pairs alone, so 3D structures and residue–base contacts are not necessary for use in model training. However, this approach cannot predict binding sites of proteins and RNAs that should be biologically and structurally essential for PRIs. The second group aims to predict RNA-binding residues from protein information. DR_bind1 [8], KYG [9], and OPRA [10] are structure-based methods that use 3D structures from PDB to extract descriptors for prediction. BindN+ [11] and Pprint [12] are sequence-based methods that employ evolutionary information instead of 3D structures. However, this approach ignores the binding partners of target proteins, although some RNA-binding domains in RBPs recognize sequence- and structure-specific motifs in RNA sequences. The third group computes RNA structural motifs recognized by RNA-binding domains in certain proteins and includes MEMERIS [13], RNAcontext [14], CapR [15], and GraphProt [16]. This approach focuses on a certain RBP and extracts RNA motifs as consensus sequences and/or secondary structures of the RBP-binding RNAs. The fourth and final group of methods predicts intermolecular joint structures between proteins and RNAs such as residue–base contacts. To our knowledge, Hayashida et al. [17] have developed the only method of this type. However, it is unfortunately not sufficiently accurate.
Accordingly, we propose a prediction method for residue–base contacts between proteins and RNAs based only on sequence information and structural information predicted from sequences. Our method can be applied to any protein–RNA pair, including those for which rich information, such as 3D structures, are unavailable. Residue–base contact prediction is formalized as an integer programming (IP) problem. Our method predicts a residue–base contact map that maximizes a scoring function based on sequence features such as k-mers of sequences and predicted secondary structures. The scoring function is trained by a max-margin framework from known PRIs with 3D structures. To verify our method, we performed several computational experiments. The results suggest that our method based on only sequence information is comparable with RNA-binding residue prediction methods based on actual known binding data.
2. Methods
We present a novel algorithm for predicting PRIs using IP. Our algorithm consists of the following two parts: (1) prediction of a residue–base contact map given a protein and RNA pair by solving an integer programming problem; and (2) learning a scoring function from a given training dataset using a max-margin framework.
2.1. Preliminaries
Let represent the set of 20 canonical amino acid residues and let denote the set of all finite amino acid sequences consisting of residues in . Similarly, let represent the set of the four canonical ribonucleotide bases (A, C, G, and U) and let denote the set of all finite RNA sequences consisting of bases in . Given a protein consisting of residues and an RNA consisting of bases, let represent the space of all possible residue–base contact maps between P and R. An element is represented as an binary-valued matrix, where indicates that residue interacts with the base (Figure 1). We define the problem of PRI prediction as follows: given a protein P and an RNA R, predict a residue–base contact map .
2.2. Scoring Model
A scoring model f is a function that assigns real-valued scores to protein–RNA pairs and residue–base contact maps . Our aim is to find a residue–base contact map that maximizes the scoring function for a given protein–RNA pair . The scoring function is computed on the basis of various local features of , and z. These features correspond to residue features, base features, and residue–base contact features that describe local contexts around residue–base contacts, respectively.
Residue features, as summarized in Table 1, describe the binding preference in the amino acid sequences by local contexts around residue–base contacts. For this purpose, we employ k-mers of the amino acids centered on the interacting ith residue. For each k-mer of the amino acids, , we define a binary-valued local feature of the ith residue as
where is an indicator function that takes a value of 1 or 0 depending on whether the is true or false, is the k-mer of the substring of P centered on the ith residue , that is, , and is a binary-valued variable such that if and only if residue is a binding site (Figure 1), that is, . We use and to characterize k-mer features.
Table 1.
Type | Context len. | # of Features |
---|---|---|
Residues | 3 | |
5 | ||
Simplified alphabets (10 groups) | 5 | |
7 | ||
Simplified alphabets (4 groups) | 5 | |
7 | ||
Secondary structures | 3 | |
5 |
To reduce the sparsity of amino acid contexts, we consider the k-mers of simplified alphabets of amino acids proposed by Murphy et al. [18], who calculated groups of simplified alphabets based on the BLOSUM50 matrix [19]. Note that Murphy et al. [18] have shown that the simplified alphabets are correlated with physiochemical properties such as hydrophobicity, hydrophilicity, and polarity, which may have important roles in PRIs. We employ the simplified alphabets of 10 groups, , and those of 4 groups, (Table 2).
Table 2.
# | Groups | |
---|---|---|
10 | LVIM, C, A, G, ST, P, FYW, EDNQ, KR, H | |
4 | LVIMC, AGSTP, FYW, EDNQKRH |
For each string (or ), we define a binary-valued local feature of the ith residue as
where is the string of simplified alphabets (or ) converted from P according to Table 2. In contrast with the k-mers used in other part of this algorithm, we instead use and for the k-mers of simplified alphabets.
To consider the structural preference of RNA-binding residues, we employ secondary structures predicted by SSpro8 [20]. We predict one structural element [ -helix (H), 3-helix (G), 5-helix (I), folded (E), -turn (B), corner (T), curl (S), and loop (–)] for each residue. For each string of structural elements of length k, we define a binary-valued local feature of the ith residue as
where is the string of structural elements predicted from P. Here, we again use structural contexts with lengths and .
The collection of occurrences of the residue features are calculated as
(1) |
where is a vector whose elements are the residue features of the ith residue mentioned above.
Base features, as summarized in Table 3, describe the binding preference in the ribonucleotide sequences by local contexts around residue–base contacts. In addition to the residue features, we employ the k-mer contexts of the ribonucleotides centered on the interacting jth base. For each k-mer of the ribonucleotides , we define a binary-valued local feature of the jth base as
where is a binary-valued variable such that if and only if the residue is a binding site (Figure 1), that is, . Here, we once again use and 5 for the k-mer features.
Table 3.
Type | Context len. | # of Features |
---|---|---|
Bases | 3 | |
5 | ||
Secondary structures | 3 | |
5 |
To consider the structural preference of binding sites, we employ secondary structures predicted by CentroidFold [21]. We assign a structural element [external loop (E), hairpin loop (H), internal loop (I), bulge (B), multibranch loop (M), or stack (S), as shown in Figure 2] to each base. Note that to encode secondary structures as a sequence, this encoding of structural profiles loses a portion of the structural information, e.g., base-pairing partners for stacking bases. However, this approach is still efficient for describing structural information [13,14,15]. For each k-length string of structural elements, we define a binary-valued local feature of the jth base as
where is the string of structural elements predicted from R. Here, we use structural contexts with lengths and .
The collection of occurrences of the base features are calculated as
(2) |
where is a vector whose elements are the base features of the jth base mentioned above.
Residue–base contact features, which are summarized in Table 4, describe the binding affinity between the local contexts of amino acids and ribonucleotides. For this purpose, we employ combinations of the residue features and the base features mentioned above. For example, for each pair of k-mers of amino acids and ribonucleotides , we define a binary-valued local feature of the ith residue and the jth base:
Table 4.
Type | |||
---|---|---|---|
Residue | Base | Context len. | # of Features |
Residues | Bases | 3 | |
5 | |||
Secondary structures | Secondary structures | 3 | |
5 | |||
Simplified alphabets (10 groups) | Bases | 3 | |
5 | |||
Simplified alphabets (10 groups) | Secondary structures | 3 | |
5 | |||
Simplified alphabets (4 groups) | Bases | 3 | |
5 | |||
Simplified alphabets (4 groups) | Secondary structures | 3 | |
5 |
The collection of occurrences of the residue–base contact features are calculated as
(3) |
where is a vector whose elements are the residue–base contact features of the ith residue and the jth base mentioned above.
The notation denotes the feature representation of protein–RNA pair and its residue–base contact map , that is, the collection of occurrences of local features in , and z defined as follows:
(4) |
Each feature in is associated with a corresponding parameter, and the score for the feature is defined as the value of the occurrence multiplied by the corresponding parameter. We define the scoring model as a linear function
(5) |
where is the inner product and is the corresponding parameter vector trained with training data as described in Section 2.4.
2.3. IP Formulation
To formulate the problem as an IP problem, we rewrite the scoring function (5) as
(6) |
where , , and represent the binding preferences for , , and , respectively, calculated as
We find a that maximizes the objective function (6) under the following constraints to ensure consistency among the variables , , and as follows:
(7) |
(8) |
(9) |
(10) |
(11) |
(12) |
The constraints defined by Equations (7)–(9) describe the relation between contacts and binding sites . The constraint defined by Equation (10) disallows any isolated interacting bases, which are rare in PRIs. The constraints defined by Equations (11) and (12) define the upper bound on the number of contacts and for each residue and base, respectively.
2.4. Learning Algorithm
To optimize feature parameter , we employ a max-margin framework called structured support vector machines [22]. Given a training dataset , where and are the protein and RNA sequences, respectively, and is their corresponding contact map for the kth datapoint, we aim to find the parameter that minimizes the objective function
(13) |
where is the norm and C is a weight for the regularization term to avoid overfitting to the training data. Here, is a loss function of for z defined as
(14) |
where , and are hyperparameters controlling the trade-off between sensitivity and specificity for learning the parameters. In this case, we can calculate the first term of Equation (13) by replacing scores , and in Equation (6) as follows:
See Section S1 in the Supplementary Material for the derivation.
To minimize the objective function (13), we can apply stochastic subgradient descent (Figure 1) or forward-backward splitting [23].
3. Results
3.1. Implementation
Our method was implemented using the IBM CPLEX optimizer http://www.ibm.com/software/integration/optimization/cplex-optimizer/) (accessed on 21 October 2021) for solving IP problems (6)–(12). To extract the structural feature elements described in Section 2.2, we employed SSpro8 [20] and CentroidFold [21] to predict secondary structures of protein and RNA sequences, respectively. We empirically chose the following hyperparameters: penalty for positives, ; penalty for negatives, ; and the weight for the regularization term, . See Section S2 in the Supplementary Material for details. We implemented AdaGrad [24] to control the learning rate in Algorithm 1. The source code for our algorithm is available at https://github.com/keio-bioinformatics/practip/ (accessed on 21 October 2021).
Algorithm 1 The stochastic subgradient descent algorithm for a structured support vector machine; is the sign function, whereas is the predefined learning rate. |
|
3.2. Dataset
We prepared our datasets in accordance with those of Chen et al. [8] and Miao et al. [25] and extracted RNA-bound proteins with an X-ray resolution of ≤3.0 Å from the Protein Data Bank (PDB) [26]. To reduce dataset redundancy, we discarded some extracted data such that the dataset contained no protein pairs whose sequence identity was >30%. As a result, our test dataset consisted of 98 protein–RNA interacting pairs from 81 protein–RNA complexes from Chen et al. [8] as listed in Table S6 in the Supplementary Matrial, and our training dataset consisted of 4399 protein–RNA interacting pairs from 772 protein–RNA complexes was from Miao et al. [25]. Note that our training data and test data share no common complexes. We considered a residue to bind RNA if at least one non-hydrogen atom was contained within the van der Waals contact (4.0Å) or hydrogen-bonding distance (3.5Å) from the non-hydrogen atom of its binding partner. We employed HBPLUS [27] to detect the hydrogen bonds and van der Waals contacts. Our datasets are available at https://doi.org/10.5281/zenodo.5584470 (accessed on 21 October 2021).
3.3. Prediction of Residue–Base Contacts
To validate our method, we conducted computational experiments on our dataset, comparing the accuracy under several conditions related to the maximum number of contacts for each residue and base, and in Equations (11) and (12) from 1 to 9, and no upper bounds.
We evaluated the accuracy of predicting residue–base contacts between proteins and RNAs using three measures: predicted residue–base contacts, binding residues in proteins, and binding bases in RNA sequences. The accuracy of residue–base contacts is assessed by the positive predictive value (PPV) and the sensitivity (SEN), respectively defined as
where is the number of correctly predicted contacts (true positives), is the number of incorrectly predicted contacts (false positives), and is the number of contacts in the true contact map that were not predicted (false negatives). We also used the F-value as a balanced measure between PPV and SEN, and it is defined as their harmonic mean:
The accuracy of binding residues and binding bases is defined in the same way.
Table 5 shows the accuracy of predicting residue–base contacts in PRIs, binding residues in proteins, and binding bases in RNA sequences for upper bounds of contacts , in Equations (11) and (12) from 1 to 9 and for no upper bounds. The case with the strongest constraint () has a very high PPV because it limits the number of contacts to be predicted, while its SEN is poor because of a lack of coverage of the prediction. On the other hand, if there is no constraint on the number of contacts (corresponding to the row labeled “no limit” in Table 5), both PPV and SEN are not high owing to many incorrect predictions being made. We found that if the upper limit of the number of contacts is set between 4 and 9, reasonably accurate contact prediction, residue binding site prediction, and base binding site prediction can be obtained. As a result, we set as the default constraint for the upper bound of the number of contacts.
Table 5.
Upper Bounds | Contacts | Binding Residues | Binding Bases | ||||||
---|---|---|---|---|---|---|---|---|---|
of # Contacts () | PPV | SEN | F | PPV | SEN | F | PPV | SEN | F |
1 | 0.599 | 0.192 | 0.278 | 0.829 | 0.349 | 0.460 | 0.877 | 0.361 | 0.481 |
2 | 0.552 | 0.347 | 0.414 | 0.736 | 0.509 | 0.578 | 0.796 | 0.506 | 0.597 |
3 | 0.523 | 0.436 | 0.462 | 0.679 | 0.585 | 0.608 | 0.744 | 0.595 | 0.644 |
4 | 0.532 | 0.480 | 0.491 | 0.676 | 0.642 | 0.638 | 0.718 | 0.626 | 0.656 |
5 | 0.534 | 0.506 | 0.507 | 0.655 | 0.667 | 0.641 | 0.693 | 0.656 | 0.657 |
6 | 0.537 | 0.515 | 0.514 | 0.669 | 0.671 | 0.654 | 0.688 | 0.647 | 0.652 |
7 | 0.541 | 0.520 | 0.518 | 0.671 | 0.685 | 0.663 | 0.677 | 0.649 | 0.647 |
8 | 0.539 | 0.525 | 0.519 | 0.664 | 0.688 | 0.657 | 0.684 | 0.655 | 0.652 |
9 | 0.531 | 0.513 | 0.510 | 0.658 | 0.679 | 0.649 | 0.659 | 0.650 | 0.638 |
no limit | 0.321 | 0.367 | 0.328 | 0.481 | 0.556 | 0.493 | 0.535 | 0.530 | 0.508 |
It should be noted that in this experiment, we were unable to compare our method with the method by Hayashida et al. [17], which is the only published method for predicting residue–base contacts in PRIs. Specifically, we were unable to conduct an experiment using the method by Hayashida et al. on the same dataset because their software implementation is not yet available and their method requires homologous sequences with accurate alignments to calculate evolutionary information. In addition, Hayashida et al. [17] have reported that the method is not sufficiently accurate for such analyses.
3.4. Comparison of Binding Residues Predictions among the Present and Existing Methods
We compared our method with existing methods for predicting RNA-binding residues in proteins. DR_bind1 [8], KYG [9], and OPRA [10] are structure-based methods that use 3D structures from PDB to extract descriptors for prediction. BindN+ [11] and Pprint [12] are sequence-based methods that employ evolutionary information instead of 3D structures. Table 6 indicates that our method is comparable to other methods. Recall that our method employs only sequence information and structural information predicted from sequences as well as information on the partner RNAs bound to RNA-binding proteins, rather than 3D structures and evolutionary information.
Table 6.
Our Method | DR_bind1 | KYG | OPRA | BindN+ | Pprint | |
---|---|---|---|---|---|---|
PPV | 0.66 | 0.69 | 0.38 | 0.50 | 0.54 | 0.42 |
SEN | 0.69 | 0.05 | 0.60 | 0.33 | 0.73 | 0.82 |
F | 0.66 | 0.09 | 0.47 | 0.40 | 0.62 | 0.56 |
4. Discussion
Several existing methods for predicting PRIs utilize evolutionary information from homologous sequences, [11,12] for protein sequences and [17] for both protein and RNA sequences. Homologous sequences of target sequences are typically searched for in large databases using a highly sensitive homology search engine such as PSI-BLAST [28]. Furthermore, to extract evolutionary information, homologous sequences must be aligned before PRI prediction. Homology searches are employed in a wide range of analyses, such as functional analysis of proteins, because if homologous proteins can be found in curated databases, the function of the target protein can be easily inferred. However, as described above and by Zhang et al. [29], the secondary structures of proteins play essential roles in residue–base contacts. Similarly, structural elements of RNA secondary structures also serve as key descriptors for residue–base contact prediction [13,14,15,16]. This means that structure-based homology searches are needed for PRI prediction based on evolutionary information. Although efficient structural alignment algorithms for proteins (e.g., [30]) and RNAs (e.g., [31]) have recently been developed, they have not yet been successfully applied to large-scale homology searches.
To our knowledge, Hayashida et al. [17] have developed the only existing method that predicts intermolecular joint structures between proteins and RNAs such as residue–base contacts; however, this method is unfortunately not sufficiently accurate. The method by Hayashida et al. [17] is similar to our method in that its approach is based on a machine learning technique with regularization. The main difference between our method and the method by Hayashida et al. [17] is that our method employs a large number of features, including structural information about proteins and RNAs, which have been shown to serve as key descriptors of PRIs as mentioned above.
We utilized the structural profiles of predicted RNA secondary structures, which does lose an important part of structural information, such as base-pairing partners for stacking bases. Most of the existing RBP-binding RNA motif finding methods [13,14,15] have also utilized similar encoding, which may not be suitable for dealing with the recognition sites of double-stranded RNA-binding proteins. GraphProt [16] is an exceptional algorithm that utilizes graph-based encoding of RNA secondary structures. Our method should be extended by utilizing another structural profile with no loss of base pairing information like the graph-based encoding of GraphProt.
To predict the secondary structure of RNA and amino acid sequences, we employed CentroidFold [21] and SSPro8 [20], which are standard tools, respectively. Since our method takes as input the results of secondary structure prediction, the prediction error may propagate to the residue–base contact prediction and worsen the prediction accuracy. The accuracy of our method could be improved by exploring various combinations of prediction methods, including the state-of-the-art secondary structure prediction methods such as MXfold2 [32] and DeepCNF [33].
As shown in Section 2.3, we formulated the residue–base contact prediction as an IP problem, which enables us to build a flexible model, including, for example, constraints on the upper bound on the number of contacts for each residue and base. In contrast to the RNA–RNA interaction model [34,35] in which each base interacts with at most one base via hydrogen bonds such as Watson–Crick and wobble base pairs, PRIs contain diverse patterns of residue–base contacts. For example, Kondo et al. have classified residue–base contacts with respect to three interaction edges on nucleotides (Watson–Crick, Hoogsteen, and sugar) with side-chains and backbones of their partner residues, and have analyzed their propensities [1]. Thus, there is room for further improvement of our model, which can be extended by using other constraints for each contact between a residue and a base to include such considerations.
In terms of the formulation as the integer programming problem, the RNA–RNA interaction prediction model [34,35] and our model for protein–RNA interaction prediction proposed in this paper are quite similar. In the RNA–RNA interaction prediction model, the probability distribution of RNA–RNA interactions can be calculated (even though it is an approximation), and thus the number of variables to be handled in the integer programming problem can be greatly reduced by using a technique called the threshold cut, which has succeeded in reducing the computation time. However, since such probability distributions are not known so far for protein–RNA interactions, there is no breakthrough technique that can significantly speed up the process like threshold cut. Therefore, speeding up our method is one of the future challenges for large-scale screening of protein–RNA interactions.
The large-scale sequencing data produced by RNA-related high-throughput sequencing technologies, such as Structure-seq [36] and hiCLIP [37], will help us improve our algorithm, especially by providing data for training the model. In the present work, we employed complete joint 3D structures of proteins and RNAs as the training dataset, which was not sufficiently large. We cannot build from large-scale sequencing data a complete dataset with residue–base contact maps, but we can partially calculate structural profiles and binding bases from in vivo chemical probing data such as Structure-seq datasets. This information will significantly help us improve our model.
Deep learning has been increasingly used in various fields, including bioinformatics, in recent years. Wei et al. [38] have provided a review of the use of deep learning in RNA–protein interaction prediction. Yamada et al. [39] have developed a method to accurately identify RNA sequences that interact with a particular protein by using the DNABERT model [40] that is pre-trained using the human genome. Although our method does not use deep learning, we expect to achieve higher accuracy in prediction by using a pre-trained BERT model, which could be improved through the application of deep learning relatively easily.
5. Conclusions
We developed a max-margin framework for predicting residue–base contacts between proteins and RNAs based on integer programming. To verify our method, we performed several computational experiments. The results suggest that our method based only on sequence information and structural information predicted from sequences is comparable with RNA-binding residue prediction methods based on known binding data. Further improvements are needed, such as the incorporation of informative features, the development of a joint prediction model that simultaneously predicts RNA secondary structures and protein contact maps, and the utilization of high-throughput sequencing data that can deal with PRI without residue–base contact information as training data.
Acknowledgments
The supercomputer system used for this research was made available by the National Institute of Genetics (NIG), Research Organization of Information and Systems (ROIS).
Abbreviations
The following abbreviations are used in this manuscript:
IP | Integer programming |
PPV | Positive predictive value |
PRI | Protein–RNA interaction |
RBP | RNA binding protein |
SEN | Sensitivity |
SP | Structural profile |
SVM | Support vector machine |
Supplementary Materials
The following are available at https://www.mdpi.com/article/10.3390/life11111135/s1, A max-margin model for predicting residue-base contacts in protein-RNA interactions. Table S1: Accuracy under varying with fixing and , Table S2: Accuracy under varying C with fixing and , Table S3: Structural profiles coding, Table S4: Coding of simplified alphabets (10 groups), Table S5: Coding of simplified alphabets (4 groups), Table S6 PDB ID and chain IDs used in our test dataset.
Author Contributions
Conceptualization, K.S.; methodology, K.S.; software, S.K. and K.S.; resources, S.K. and K.S.; data curation, S.K.; writing—original draft preparation, S.K. and K.S.; writing—review and editing, K.S. and Y.S.; supervision, K.S. and Y.S.; project administration, K.S.; funding acquisition, K.S. and Y.S. All authors have read and agreed to the publication of this version of the manuscript.
Funding
This work was supported in part by JSPS KAKENHI Grant Number JP19H04210 and JP19K22897 to K.S. This work was also supported in part by JSPS KAKENHI Grant Number JP17H06410 (Frontier Research on Chemical Communications) to K.S. and Y.S.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The datasets used in this study are available at https://doi.org/10.5281/zenodo.5584470.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Kondo J., Westhof E. Classification of pseudo pairs between nucleotide bases and amino acids by analysis of nucleotide-protein complexes. Nucleic Acids Res. 2011;39:8628–8637. doi: 10.1093/nar/gkr452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Iwakiri J., Tateishi H., Chakraborty A., Patil P., Kenmochi N. Dissecting the protein-RNA interface: The role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res. 2012;40:3299–3306. doi: 10.1093/nar/gkr1225. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Iwakiri J., Kameda T., Asai K., Hamada M. Analysis of base-pairing probabilities of RNA molecules involved in protein-RNA interactions. Bioinformatics. 2013;29:2524–2528. doi: 10.1093/bioinformatics/btt453. [DOI] [PubMed] [Google Scholar]
- 4.Pancaldi V., Bahler J. In silico characterization and prediction of global protein-mRNA interactions in yeast. Nucleic Acids Res. 2011;39:5826–5836. doi: 10.1093/nar/gkr160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Muppirala U.K., Honavar V.G., Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinform. 2011;12:489. doi: 10.1186/1471-2105-12-489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Bellucci M., Agostini F., Masin M., Tartaglia G.G. Predicting protein associations with long noncoding RNAs. Nat. Methods. 2011;8:444–445. doi: 10.1038/nmeth.1611. [DOI] [PubMed] [Google Scholar]
- 7.Wang Y., Chen X., Liu Z.P., Huang Q., Wang Y., Xu D., Zhang X.S., Chen R., Chen L. De novo prediction of RNA-protein interactions from sequence information. Mol. Biosyst. 2013;9:133–142. doi: 10.1039/C2MB25292A. [DOI] [PubMed] [Google Scholar]
- 8.Chen Y.C., Sargsyan K., Wright J.D., Huang Y.S., Lim C. Identifying RNA-binding residues based on evolutionary conserved structural and energetic features. Nucleic Acids Res. 2014;42:e15. doi: 10.1093/nar/gkt1299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Kim O.T., Yura K., Go N. Amino acid residue doublet propensity in the protein-RNA interface and its application to RNA interface prediction. Nucleic Acids Res. 2006;34:6450–6460. doi: 10.1093/nar/gkl819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Perez-Cano L., Fernandez-Recio J. Optimal protein-RNA area, OPRA: A propensity-based method to identify RNA-binding sites on proteins. Proteins. 2010;78:25–35. doi: 10.1002/prot.22527. [DOI] [PubMed] [Google Scholar]
- 11.Wang L., Huang C., Yang M.Q., Yang J.Y. BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst. Biol. 2010;4((Suppl. 1)):S3. doi: 10.1186/1752-0509-4-S1-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Kumar M., Gromiha M.M., Raghava G.P. Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins. 2008;71:189–194. doi: 10.1002/prot.21677. [DOI] [PubMed] [Google Scholar]
- 13.Hiller M., Pudimat R., Busch A., Backofen R. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006;34:e117. doi: 10.1093/nar/gkl544. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kazan H., Ray D., Chan E.T., Hughes T.R., Morris Q. RNAcontext: A new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput. Biol. 2010;6:e1000832. doi: 10.1371/journal.pcbi.1000832. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Fukunaga T., Ozaki H., Terai G., Asai K., Iwasaki W., Kiryu H. CapR: Revealing structural specificities of RNA-binding protein target recognition using CLIP-seq data. Genome Biol. 2014;15:R16. doi: 10.1186/gb-2014-15-1-r16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Maticzka D., Lange S.J., Costa F., Backofen R. GraphProt: Modeling binding preferences of RNA-binding proteins. Genome Biol. 2014;15:R17. doi: 10.1186/gb-2014-15-1-r17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Hayashida M., Kamada M., Song J., Akutsu T. Prediction of protein-RNA residue-base contacts using two-dimensional conditional random field with the lasso. BMC Syst. Biol. 2013;7((Suppl. 1)):S15. doi: 10.1186/1752-0509-7-S2-S15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Murphy L.R., Wallqvist A., Levy R.M. Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 2000;13:149–152. doi: 10.1093/protein/13.3.149. [DOI] [PubMed] [Google Scholar]
- 19.Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Magnan C.N., Baldi P. SSpro/ACCpro 5: Almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics. 2014;30:2592–2597. doi: 10.1093/bioinformatics/btu352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Hamada M., Kiryu H., Sato K., Mituyama T., Asai K. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics. 2009;25:465–473. doi: 10.1093/bioinformatics/btn601. [DOI] [PubMed] [Google Scholar]
- 22.Tsochantaridis I., Joachims T., Hofmann T., Altun Y. Large Margin Methods for Structured and Interdependent Output Variables. J. Mach. Learn. Res. 2005;6:1453–1484. [Google Scholar]
- 23.Duchi J., Singer Y. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 2009;10:2899–2934. [Google Scholar]
- 24.Duchi J., Hazan E., Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011;12:2121–2159. [Google Scholar]
- 25.Miao Z., Westhof E. A Large-Scale Assessment of Nucleic Acids Binding Site Prediction Programs. PLoS Comput. Biol. 2015;11:e1004639. doi: 10.1371/journal.pcbi.1004639. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Rose P.W., Beran B., Bi C., Bluhm W.F., Dimitropoulos D., Goodsell D.S., Prlic A., Quesada M., Quinn G.B., Westbrook J.D., et al. The RCSB Protein Data Bank: Redesigned web site and web services. Nucleic. Acids Res. 2011;39:392–401. doi: 10.1093/nar/gkq1021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.McDonald I.K., Thornton J.M. Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 1994;238:777–793. doi: 10.1006/jmbi.1994.1334. [DOI] [PubMed] [Google Scholar]
- 28.Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Zhang T., Zhang H., Chen K., Ruan J., Shen S., Kurgan L. Analysis and prediction of RNA-binding residues using sequence, evolutionary conservation, and predicted secondary structure and solvent accessibility. Curr. Protein Pept. Sci. 2010;11:609–628. doi: 10.2174/138920310794109193. [DOI] [PubMed] [Google Scholar]
- 30.Deng X., Cheng J. MSACompro: Protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts. BMC Bioinform. 2011;12:472. doi: 10.1186/1471-2105-12-472. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Sato K., Kato Y., Akutsu T., Asai K., Sakakibara Y. DAFS: Simultaneous aligning and folding of RNA sequences via dual decomposition. Bioinformatics. 2012;28:3218–3224. doi: 10.1093/bioinformatics/bts612. [DOI] [PubMed] [Google Scholar]
- 32.Sato K., Akiyama M., Sakakibara Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat. Commun. 2021;12:941. doi: 10.1038/s41467-021-21194-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Wang S., Peng J., Ma J., Xu J. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Sci. Rep. 2016;6:18962. doi: 10.1038/srep18962. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Kato Y., Sato K., Hamada M., Watanabe Y., Asai K., Akutsu T. RactIP: Fast and accurate prediction of RNA-RNA interaction using integer programming. Bioinformatics. 2010;26:i460–i466. doi: 10.1093/bioinformatics/btq372. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Kato Y., Mori T., Sato K., Maegawa S., Hosokawa H., Akutsu T. An accessibility-incorporated method for accurate prediction of RNA-RNA interactions from sequence data. Bioinformatics. 2017;33:202–209. doi: 10.1093/bioinformatics/btw603. [DOI] [PubMed] [Google Scholar]
- 36.Ding Y., Tang Y., Kwok C.K., Zhang Y., Bevilacqua P.C., Assmann S.M. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature. 2014;505:696–700. doi: 10.1038/nature12756. [DOI] [PubMed] [Google Scholar]
- 37.Sugimoto Y., Vigilante A., Darbo E., Zirra A., Militti C., D’Ambrogio A., Luscombe N.M., Ule J. hiCLIP reveals the in vivo atlas of mRNA secondary structures recognized by Staufen 1. Nature. 2015;519:491–494. doi: 10.1038/nature14280. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Wei J., Chen S., Zong L., Gao X., Li Y. Protein-RNA Interaction Prediction with Deep Learning: Structure matters. arXiv. 2021 doi: 10.1093/bib/bbab540.2107.12243 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Yamada K., Hamada M. Prediction of RNA-protein Interactions Using a Nucleotide Language Model. bioRxiv. 2021 doi: 10.1101/2021.04.27.441365. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Ji Y., Zhou Z., Liu H., Davuluri R.V. DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112–2120. doi: 10.1093/bioinformatics/btab083. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets used in this study are available at https://doi.org/10.5281/zenodo.5584470.