Abstract
RNA-protein complexes underlie numerous cellular processes, including basic translation and gene regulation. High-resolution structure determination of RNA-protein complexes is essential for elucidating their functions, so computational methods capable of identifying native-like RNA-protein structures are needed. To address this challenge, we develop DRPScore, a deep-learning-based approach for identifying native-like RNA-protein structures. DRPScore is tested on representative sets of RNA-protein complexes with various degrees of binding-induced conformational change, ranging from fully rigid docking (bound-bound) to fully flexible docking (unbound-unbound). Out of the top 20 predictions, DRPScore selects native-like structures with a success rate of 91.67% on the testing set of bound RNA-protein complexes and 56.14% on the unbound complexes. DRPScore consistently outperforms existing methods, with an improvement of roughly 10.53–15.79 percentage points even for the most difficult unbound cases. Furthermore, DRPScore significantly improves the accuracy of the native interface interaction predictions. DRPScore should be broadly useful for modeling and designing RNA-protein complexes.
Subject terms: RNA, Computational biology and bioinformatics, Molecular modelling, Protein structure predictions, Machine learning
RNA-protein docking is a very challenging area. Here, the authors develop a deep-learning based method, DRPScore, to evaluate RNA-protein complexes. DRPScore is robust and consistently performs better than existing methods on representative testing sets.
Introduction
By interacting with proteins, RNA regulates various biological functions, such as DNA repair, RNA splicing, protein synthesis, and gene regulation1–6. RNA-protein complexes are involved in many human diseases, ranging from neurologic disorders7,8 to cancer9. Understanding the biological roles of an RNA-protein complex requires its three-dimensional structure10–12. Unfortunately, flexible RNA molecules are difficult to crystallize well and to resolve by X-ray crystallography13. In addition, electron microscopy is expensive and time-consuming14. Because of these technical limitations, few experimental RNA-protein structures are available15. Some computational methods can predict the RNA-protein complex by homologous fragment modeling, docking, or molecular dynamics simulation16–21. However, predicting highly accurate RNA-protein complexes remains challenging because of the limited RNA-protein scoring functions22.
Several computational methods have been developed to evaluate RNA-protein structures21,23,24. These methods can be divided into propensity-based and atomic-level statistical scoring functions25. The propensity-based scoring functions statistically analyze the interface propensity of nucleotide-residue pairs26,27. A statistical potential is then constructed based on the inverse Boltzmann formula. For example, DARS-RNP24 is a coarse-grained propensity-based scoring function introduced by Tuszynska and Bujnicki. DARS-RNP was developed using a reduced representation of protein and RNA. In this reduced representation, amino acid residues are represented by one to three united atoms24,28,29. For the RNA representation, two united atoms are used for the backbone and one or two for pyrimidines or purines, respectively. DARS-RNP constructs the scoring function from four terms: a steric clash penalty and dependencies on distances, angles, and sites. Subsequently, Xiao et al.30 constructed a novel scoring function, 3dRPC-Score, based on statistical potential energy using the conformations of nucleotide-residue pairs as statistical variables. Unlike DARS-RNP, it uses the relative RMSD (Root Mean Square Deviation) to assess conformational differences between nucleotide-residue pairs as a proxy for energy, and thus uses only one variable. The propensity-based scoring functions can account for pairwise nucleotide-residue interactions, but they struggle to account for conformational changes21,24.
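For orientation, the inverse Boltzmann relation that such statistical potentials build on is commonly written as below. This is the generic textbook form, not the specific functional form of DARS-RNP or 3dRPC-Score; the symbols $x$, $P_{\mathrm{obs}}$, and $P_{\mathrm{ref}}$ are notation introduced here for illustration only.

```latex
E(x) = -k_{\mathrm{B}} T \,\ln \frac{P_{\mathrm{obs}}(x)}{P_{\mathrm{ref}}(x)}
```

Here $x$ is a structural feature (for example, a nucleotide-residue pair type at a given distance), $P_{\mathrm{obs}}(x)$ is its observed frequency in native structures, and $P_{\mathrm{ref}}(x)$ is its frequency in a chosen reference state.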
The atomic-level statistical scoring functions are distance-dependent interaction potentials obeying the Boltzmann distribution, and they are more discriminative than the propensity-based scoring functions in native-like structure evaluation. For example, ITScore-PR23 is an atomic-level statistical scoring function developed by Huang and Zou. The core idea of ITScore-PR is to improve the interatomic pair potential iteratively by comparing the differences between the predicted and native atomic pairs in the training set. However, a significant challenge for RNA-protein prediction is the conformational change upon binding. ITScore-PR is very effective at bound docking but struggles with unbound docking. Deep learning strategies have proved helpful for RNA and protein predictions31–36, but such techniques are not yet in widespread use for RNA-protein complex prediction. Moreover, computational structure evaluation methods for unbound complex structures have not yet been developed.
Here, we propose a deep-learning-based RNA-protein complex scoring function to consider structural flexibility explicitly. We use physics-based simulations to generate the training decoys for training the deep-learning-based scoring function. Then, DRPScore is extensively evaluated on the RNA-protein testing sets, including unbound RNA-protein challenges. The results demonstrate significant improvements and are consistently better than the existing methods in selecting native-like RNA-protein complexes. We expect the technique described here to be useful for RNA-protein complex prediction.
Results
Testing on the bound RNA-protein testing set
DRPScore was evaluated on the three generated bound-bound RNA-protein testing sets. Figure 1 shows the average success rates and standard deviations for DRPScore, ITScore-PR23, DARS-RNP24, and 3dRPC30. Supplementary Figs. 1–3 and Supplementary Data 1–3 show the individual prediction success rates, ranking, and Irmsd. ITScore-PR performs significantly better than DARS-RNP and 3dRPC in success rate from the top 5 to the top 30 predictions, while achieving results similar to DARS-RNP from the top 30 to the top 1000 predictions. DRPScore performs consistently better than ITScore-PR, DARS-RNP, and 3dRPC. In the top 5 predictions, DRPScore achieves an average success rate of 80.56%, compared with 79.63% for ITScore-PR, 70.37% for DARS-RNP, and 64.81% for 3dRPC. When the top 20 predictions were considered, the average success rate of DRPScore increased to 91.67%, compared with 89.81% for ITScore-PR, 86.11% for DARS-RNP, and 83.33% for 3dRPC. DRPScore can therefore identify native-like bound-bound RNA-protein structures with high accuracy.
Fig. 1. The performance of DRPScore and other scoring functions on the bound-bound testing sets.
The average success rates and standard deviations (n = 3 independent tests) of DRPScore (green inverted triangle), ITScore-PR (black square), DARS-RNP (red circle), and 3dRPC (blue triangle) on the bound-bound testing sets. Data are presented as mean values ± SD. Source data are provided as a Source Data file.
Testing on the unbound RNA-protein testing set
We further focused on the relatively difficult unbound cases to evaluate whether DRPScore consistently performs better in challenging cases. The unbound RNA-protein testing set was constructed by Huang and Zou37. At least one partner of each complex was taken from another complex or from homology modeling.
Figure 2 shows the performance of DRPScore, ITScore-PR, DARS-RNP, and 3dRPC for the unbound-bound and unbound-unbound tests. The success rate of DRPScore is 43.86% in the top 5 predictions, compared with 38.60% for ITScore-PR, 35.09% for DARS-RNP, and 36.84% for 3dRPC. When the top 20 predictions were considered, the success rate of DRPScore was 56.14%, compared with 45.61% for ITScore-PR, 40.35% for DARS-RNP, and 42.11% for 3dRPC. The ranking and Irmsd are shown in Supplementary Data 4. We also calculated the score-RMSD scatter plots for the complexes within 8 Å RMSD (Supplementary Fig. 4). Overall, DRPScore performs significantly better than ITScore-PR, DARS-RNP, and 3dRPC.
Fig. 2. The performance of DRPScore and other scoring functions on the unbound-bound and unbound-unbound testing set.
The success rates of DRPScore (green inverted triangle), ITScore-PR (black square), DARS-RNP (red circle), and 3dRPC (blue triangle) on the unbound-bound and unbound-unbound testing sets provided by Huang and Zou37. Source data are provided as a Source Data file.
Figure 3 shows the ranking of unbound-bound RNA-protein models using DRPScore and the other three leading scoring functions. Three complexes were not considered because all the docking decoys’ RMSDs were >4 Å. For each scoring function, we selected the best-scoring models in the top 5, 10, and 20 models. For each RNA-protein complex, we recorded the lowest RMSD across these models. We quantized the predictions into 4 categories by RMSD: below 4 Å (colored in yellow), between 4 Å and 6 Å (colored in green), between 6 Å and 8 Å (colored in blue), and above 8 Å (colored in purple). Figure 3a shows the RNA-protein complexes ranked by the number of nucleotides/residues at the interaction interface. An interesting observation is that DRPScore performs best on RNA-protein models when the number of interface interaction nucleotides/residues is relatively small, where the other three scoring functions give unsatisfactory results. Figure 3b shows that DRPScore has the largest proportions of yellow and green areas compared to the other three scoring functions. The number of predictions with RMSD <4 Å is 20 for DRPScore, compared with 14 for ITScore-PR, 9 for DARS-RNP, and 14 for 3dRPC. Moreover, DRPScore yields only 8 predictions with RMSD >8 Å, compared with 15 for ITScore-PR, 17 for DARS-RNP, and 14 for 3dRPC. For example, the lowest RMSDs in the top 5 predictions of the Bacillus subtilis YxiN protein complexed with a fragment of 23S ribosomal RNA (PDB ID: 3MOJ) are Irmsd = 1.98 Å for DRPScore, compared with Irmsd = 14.74 Å for ITScore-PR, Irmsd = 8.28 Å for DARS-RNP, and Irmsd = 8.44 Å for 3dRPC (Fig. 3c). Together, these results indicate that DRPScore can accurately identify the RNA-protein interface features when the number of interface interactions is relatively small, whereas the other scoring functions are too sensitive to interaction changes, leading to unsatisfactory results.
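The four-way RMSD binning used for Fig. 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the handling of values exactly on a boundary (4, 6, or 8 Å) is an assumption, since the text does not specify it.

```python
def rmsd_category(rmsd):
    """Map an RMSD in angstroms to one of the four color-coded bins.

    Boundary handling (a value of exactly 4, 6, or 8 goes to the higher bin)
    is an assumption made for this sketch.
    """
    if rmsd < 4.0:
        return "below 4 A"
    if rmsd < 6.0:
        return "4-6 A"
    if rmsd < 8.0:
        return "6-8 A"
    return "above 8 A"


def best_in_top_n(ranked_rmsds, n):
    """Lowest RMSD among the n best-scoring models (list is in score order)."""
    return min(ranked_rmsds[:n])


# Top-5 lowest Irmsd values reported in the text for PDB ID 3MOJ.
reported_3moj = {"DRPScore": 1.98, "ITScore-PR": 14.74, "DARS-RNP": 8.28, "3dRPC": 8.44}
bins_3moj = {name: rmsd_category(value) for name, value in reported_3moj.items()}

# Hypothetical decoy RMSDs, already sorted from best to worst score.
ranked_rmsds = [9.3, 5.1, 7.8, 3.2, 12.0, 2.4]
categories = {n: rmsd_category(best_in_top_n(ranked_rmsds, n)) for n in (1, 3, 5)}
```

With the 3MOJ values quoted above, only the DRPScore prediction falls in the lowest bin; the other three fall in the highest one.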
Fig. 3. Detailed analysis of native-like structures ranking in the unbound-bound testing set.
a The N best-scoring structural models for each unbound-bound docking RNA-protein complex for DRPScore, ITScore-PR, DARS-RNP, and 3dRPC in the top 5, 10, and 20 predictions. The RNA-protein complexes are sorted (top to bottom) by the number of nucleotides/residues at the RNA-protein interaction interface. For each scoring function, RNA, and value of N, the lowest RMSD (Root Mean Square Deviation) across structural decoys is recorded. These results are quantized by determining if the RMSD is below 4 Å (yellow), between 4 and 6 Å (green), between 6 and 8 Å (blue), or above 8 Å (purple). b The statistical prediction results of unbound-bound docking testing sets. c One example (PDB ID: 3MOJ) to illustrate the RMSD between the native structure (RNA in red) and the lowest RMSD structures in the top 5 predictions by DRPScore (RNA in cyan), ITScore-PR (RNA in yellow), DARS-RNP (RNA in orange) and 3dRPC (RNA in blue). Source data are provided as a Source Data file.
To further evaluate the performance on poorly sampled structures, we analyzed the top 50 best-scoring structure models of human RIG-I CTD bound to a dsRNA (PDB ID: 3LRR) as an example (Fig. 4a). The lowest RMSD among the DRPScore models is 2.51 Å, compared with 8.10 Å for ITScore-PR, 5.72 Å for DARS-RNP, and 5.72 Å for 3dRPC. DRPScore achieved an average RMSD of 8.94 Å (colored in blue), better than the average RMSDs of around 10.0 Å for the other three scoring functions (colored in gray, purple, and green). Thus, DRPScore shifts the distribution of the predictions toward lower RMSDs.
Fig. 4. Two examples of native-like structures ranking analysis.
a Unbound-bound (PDB ID: 3LRR) and (b) unbound-unbound (PDB ID: 1JID) testing sets. The histograms (left to right) are the RMSD (Root Mean Square Deviation) distributions relative to native structure in the top 50 predictions of DRPScore (blue) compared with ITScore-PR (gray), DARS-RNP (purple), 3dRPC (green), respectively. DRPScore shows a shift in the distribution of the predictions toward lower RMSDs. The lowest RMSD models in the top 50 predictions by DRPScore (RNA in cyan) are more similar to native complexes (RNA in red) than ITScore-PR (RNA in yellow), DARS-RNP (RNA in green) and 3dRPC (RNA in pink), respectively. Source data are provided as a Source Data file.
Physics-based interaction contributions
The existing RNA-protein complex evaluation methods are all statistical potential functions based on Boltzmann’s formula23,24,30. DRPScore uses an alternative approach, deep learning, to evaluate the RNA-protein complex. Thus, DRPScore can accurately extract more features from the individual frames in the training sets. The separate local features can be used to infer the interface interactions. For example, we calculated the interface hydrogen bonds of the Bacillus subtilis YxiN protein complexed with a 23S ribosomal RNA fragment (PDB ID: 3MOJ) using HBPLUS38. The results show that the lowest RMSD structure in the top 5 predictions evaluated by DRPScore contains 8 hydrogen bonds at the RNA-protein interaction interface, compared with 17 for ITScore-PR, 10 for DARS-RNP, and 16 for 3dRPC (Supplementary Table 1). However, 62.5% of the hydrogen bonds identified by DRPScore agree with the experimental structure, whereas none of the native hydrogen bonds are identified by ITScore-PR, DARS-RNP, or 3dRPC.
Learning the global features is much more challenging due to the lack of individual frames. Our approach contains an extra dimension to consider the relations between the secondary structure frames. Therefore, DRPScore can consider the global features, including α-helix and β-sheet for protein, hairpin loop, internal loop, bulge loop, junction, and pseudoknot for RNA. For example, we calculated the secondary structures of protein and RNA using PSIPRED39,40 and forna41 (Supplementary Figs. 5 and 6). The results show that 25% of hydrogen bonds identified by DRPScore are loop-helix secondary structure interactions, compared with 0.0% for ITScore-PR, 0.0% for DARS-RNP, and 12.5% for 3dRPC (Supplementary Table 1). The secondary structure interactions on the RNA-protein interface would improve the assessment accuracy.
Advances compared to the traditional deep learning model
DRPScore applied the 4DCNN to reduce the mean squared error between the true and predicted results. In the back-propagation-based mini-batch gradient descent optimization algorithm, the learning rate of the 4DCNN was initially set to 0.0001. Training was stopped once the loss had decreased to near 0 and stabilized. Training ran for 20,000 steps at around 6 seconds per iteration and required 6.6 GB with ‘--batch_size 1’ on a GeForce RTX 3070. Finally, the model from step 12,500 was selected. DRPScore takes about 8 min to evaluate 1000 RNA-protein complex structures.
We compared our deep-learning-based model with the traditional 3DCNN model. 3DCNN utilizes three dimensions to extract and transfer the X, Y, and Z coordinate information to one image. There are no connections between any two images. Thus, 3DCNN learns the local structural features but not the global ones. In other words, it captures the intra-nucleotide/residue information while ignoring the inter-nucleotide/residue interactions. In contrast, the representations learned by our model capture both the intra- (local) and inter-nucleotide/residue (global) information. This is done by adding convolutional layers at the sequence dimension, each layer gradually modeling a longer range of interactions between the nucleotides/residues. Therefore, our model can learn both local and global structural features, including secondary structure interactions. Supplementary Fig. 7 shows the average success rates and standard deviations for DRPScore, 3DCNN, ITScore-PR, DARS-RNP, and 3dRPC on the three generated bound-bound RNA-protein testing sets. In these tests, 3DCNN is not able to identify the native-like RNA-protein complex accurately.
Discussion
RNA-protein structure evaluation is still a relatively unexplored research field. Previous efforts focused on RNA-protein rigid-body docking without consideration of structural flexibility. A critical bottleneck is sampling the dynamical conformations that RNAs or proteins adopt when interacting. In the case of fully flexible unbound-unbound docking, the interaction interface changes dramatically, and the scoring functions have never learned similar structures before. Supplementary Fig. 8 shows the success rates of DRPScore and the three other scoring functions for the unbound-unbound cases in testing set II when considering the top 10, 20, 30, and 40 predictions. Although the performance of the four scoring functions is not satisfactory compared with the bound-bound and unbound-bound cases, DRPScore still achieves better results. For example, when the top 20 predictions are considered, the average success rate of DRPScore is 58.5%, compared with 51.2% for ITScore-PR, 48.8% for DARS-RNP, and 46.3% for 3dRPC. Figure 4b shows the human SRP19 complex with SRP RNA (PDB ID: 1JID) as one unbound-unbound RNA-protein example. DRPScore achieves an average RMSD of 8.85 Å (colored in blue), better than the average RMSDs of around 11.0 Å for the other three scoring functions (colored in gray, purple, and green). The lowest RMSD model in the top 50 predictions of DRPScore is 1.92 Å, compared with 6.58 Å for ITScore-PR, 2.89 Å for DARS-RNP, and 4.82 Å for 3dRPC.
To test the robustness of DRPScore, we also compared its performance with ITScore-PR, DARS-RNP, and 3dRPC on testing set II, both in full and at a 0.8 sequence redundancy cutoff (Supplementary Figs. 9 and 10)42–44. DRPScore still shows lower RMSDs and performs better than the other three scoring functions (ITScore-PR, DARS-RNP, and 3dRPC) for all testing sets at different redundancy levels. Traditional scoring functions based on statistical potential energy use native structures to learn features of RNA-protein interactions. DRPScore samples many native-like decoys to consider the RNA-protein dynamical features implicitly. As a result, DRPScore performs better in the unbound test than the other three scoring functions. In future work, we will further consider structural flexibility, for example by adding molecular dynamics simulation to account for the flexibility of RNA-protein complexes explicitly.
In summary, we have developed an efficient scoring function for evaluating RNA–protein complexes using a deep-learning-based method. DRPScore has been extensively assessed on its ability to identify native-like structures of RNA-protein complexes on diverse testing sets. DRPScore showed success rates as high as 80.56% (91.67%) for bound docking and 43.86% (56.14%) for unbound docking when the top 5 (20) predictions were considered, higher than those of the other available methods. The significant improvements indicate that DRPScore captures critical flexibility in the structural evaluation of an RNA-protein complex. We expect the method to be helpful for RNA-related prediction and drug development.
Methods
Convolution neural network for RNA-protein complex scoring function
Previous RNA-protein complex scoring functions used statistical potentials to identify the native-like structures (Fig. 5a). A statistical potential function assumes that the distributions of different native structural features obey the Boltzmann distribution. These methods then calculated the probability of the interface interactions to construct an energy function and identified the native-like complex structure as the one with the lowest energy. Instead of using the entire RNA-protein structure, DRPScore focuses on the RNA-protein interaction interface within a 6 Å distance (Fig. 5b). First, we extracted the RNA-protein interface structures with a 6 Å cut-off. Second, we utilized 85 atom types with mass and charge in the RNA nucleotides (Supplementary Data 5 and Supplementary Table 2) and 225 atom types with mass and charge in the protein residues (Supplementary Data 6 and Supplementary Table 2) to consider the atomic-level interactions. Then, we fed the interaction interface information, as accumulations of the occupation number, mass, and charge of the atoms in the grid, to a convolution neural network. A 32 Å grid was created on each nucleotide and residue with a local Cartesian coordinate frame specified by atoms36. The X, Y, and Z axes were determined by Eqs. 1–6, in which the position vectors point from the origin of the global coordinate system to the atoms N, CB, C1', CA', O5', O', C5' and C. As expected, the convolution neural network approach learned many more features than the statistical scoring function.
[Equations 1–6: definitions of the local X, Y, and Z axes; the equation bodies are not recoverable from this text.]
Fig. 5. Flow charts comparing DRPScore and traditional scoring functions.
a The steps of traditional scoring function based on statistical potential energy to evaluate decoys. b The steps of DRPScore to evaluate the decoys. c The local coordinate framework construction in the generated 32 Å grid.
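The interface-extraction step with a 6 Å cut-off can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say how the distance between a nucleotide and a residue is measured, so the sketch assumes that any atom pair within the cut-off qualifies; the function name, the data layout, and the toy coordinates are hypothetical.

```python
import math


def interface_units(rna_units, protein_units, cutoff=6.0):
    """Return nucleotides and residues that have any atom pair within cutoff (angstroms).

    Each argument maps a unit name to a list of (x, y, z) atom coordinates.
    The any-atom-pair criterion is an assumption of this sketch.
    """
    rna_hits, protein_hits = set(), set()
    for nt_name, nt_atoms in rna_units.items():
        for res_name, res_atoms in protein_units.items():
            close = any(math.dist(a, b) <= cutoff for a in nt_atoms for b in res_atoms)
            if close:
                rna_hits.add(nt_name)
                protein_hits.add(res_name)
    return sorted(rna_hits), sorted(protein_hits)


# Toy coordinates (hypothetical), in angstroms.
rna = {"G1": [(0.0, 0.0, 0.0)], "C2": [(30.0, 0.0, 0.0)]}
protein = {"ARG10": [(4.0, 3.0, 0.0)], "LYS50": [(0.0, 20.0, 0.0)]}
rna_iface, protein_iface = interface_units(rna, protein)
```

In the toy data, only G1 and ARG10 lie within 6 Å of each other (5 Å apart), so only they are kept as interface units.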
DRPScore used a 4D convolution neural network approach to train and identify the native-like structures. The input of DRPScore is the set of RNA-protein complex nucleotides/residues at the interaction interface within a 6 Å distance. The output of DRPScore is a potential score used to evaluate native-like RNA-protein structures. We compared the 4D convolution neural network with a 3D convolution neural network. In the pre-processing procedures, each RNA is represented by a tensor of shape $3 \times L \times H \times W \times D$, where 3 is the number of features (the accumulations of the occupation number, mass, and charge of the atoms in the grid box), L = 128 is the maximum length of RNA sequences, and H, W, and D denote the height, width, and depth of a 3D cube for each nucleotide in the RNA sequence. In this study, we set H = W = D = 32 (ref. 36).
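The three input features (occupation number, mass, and charge accumulated per grid cell) can be illustrated with the sketch below. It is not the authors' pre-processing code: it assumes a 1 Å voxel spacing so that a 32 Å box gives 32 cells per axis, assumes the atoms are already expressed in the local coordinate frame, and stores the grid sparsely in a dictionary for brevity. The function name and the example atoms are hypothetical.

```python
def voxelize(atoms, size=32, spacing=1.0):
    """Accumulate occupancy, mass, and charge on a size^3 grid centred on the local origin.

    atoms: iterable of (x, y, z, mass, charge) in the local frame, in angstroms.
    Returns a dict mapping (channel, i, j, k) to the accumulated value, with
    channel 0 = occupation number, 1 = mass, 2 = charge.
    """
    half = size * spacing / 2.0
    grid = {}
    for x, y, z, mass, charge in atoms:
        idx = tuple(int((c + half) // spacing) for c in (x, y, z))
        if not all(0 <= i < size for i in idx):
            continue  # atom falls outside the box
        for channel, value in enumerate((1.0, mass, charge)):
            key = (channel,) + idx
            grid[key] = grid.get(key, 0.0) + value
    return grid


# Hypothetical atoms: two share a voxel, the third lies outside the 32 A box.
atoms = [
    (0.2, 0.3, 0.1, 12.011, -0.1),
    (0.7, 0.4, 0.9, 14.007, 0.3),
    (40.0, 0.0, 0.0, 1.008, 0.0),
]
grid = voxelize(atoms)
```

Stacking one such grid per nucleotide or residue along a sequence axis would give the five-dimensional layout described above.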
3D convolution neural network approach. The 3D approach processes each nucleotide independently to generate local representations and then applies an average pooling at the end to generate a global representation for the sequence. Concretely, for a sequence comprised of 128 nucleotides, the shape of the input is $3 \times 128 \times 32 \times 32 \times 32$ (3 is the number of channels or features $C$, $32 \times 32 \times 32$ is the spatial dimension $H \times W \times D$, and 128 is the number of nucleotides in the sequence); each nucleotide is processed independently by the 3D convolutional module:
$$x_i^{(l+1)} = \mathrm{Conv3D}\left(x_i^{(l)}\right) \qquad (7)$$
where $x_i^{(l)}$ is a tensor of shape $C_l \times H_l \times W_l \times D_l$, being the representation of RNA nucleotide $i$ at layer $l$ of the neural network (e.g., $3 \times 32 \times 32 \times 32$ at layer 0, which is the initial representation of the RNA nucleotide). Conv3D projects the channels from $C_l$ to $C_{l+1}$ and down-samples the spatial dimension. The output representation of the RNA nucleotide at layer $l+1$ (i.e., $x_i^{(l+1)}$) is a tensor of shape $C_{l+1} \times H_{l+1} \times W_{l+1} \times D_{l+1}$. The down-sampling enables more compact representations of RNAs, and increasing the number of channels at each layer of the CNN allows a larger number of more expressive features to be learned. After applying $N$ layers of Conv3D modules, each nucleotide of the RNA, $x_i^{(N)}$, is represented by a tensor with shape $1024 \times H_N \times W_N \times D_N$, where 1024 is the number of channels (i.e., features) in the last layer and $H_N \times W_N \times D_N$ is the down-sampled spatial dimension. Finally, a global representation of the RNA sequence can be generated by averaging the individual representations of each nucleotide: $\bar{x} = \frac{1}{L}\sum_{i=1}^{L} x_i^{(N)}$. This representation is general-purpose and can be used for both classification and regression tasks for the RNA sequence by adding a single linear layer at the end of the model.
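The final averaging step of the 3D approach can be illustrated with a few lines of plain Python. This is only a sketch of mean pooling over per-nucleotide feature vectors, with hypothetical two-feature vectors standing in for the last-layer representations; it is not the network itself.

```python
def mean_pool(per_nucleotide):
    """Average per-nucleotide feature vectors into one global vector."""
    n = len(per_nucleotide)
    dim = len(per_nucleotide[0])
    return [sum(vec[d] for vec in per_nucleotide) / n for d in range(dim)]


# Hypothetical two-feature representations of three nucleotides.
features = [[1.0, 0.0], [0.0, 2.0], [5.0, 4.0]]
# The same nucleotides in a different order.
shuffled = [features[2], features[0], features[1]]

pooled = mean_pool(features)
pooled_shuffled = mean_pool(shuffled)
```

Both orderings give the same pooled vector, which shows that a plain average carries no information about where each nucleotide sits in the sequence.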
4D convolution neural network approach. Though 3D CNN for modeling RNA has shown success in some tasks36, it importantly ignores the sequential nature of RNAs. By naively averaging the independent representations of individual nucleotides to generate a global representation, crucial information about the interactions of nucleotides may be lost. Our proposed 4D approach addresses this shortcoming by incorporating an additional convolution operation at the sequential dimension. In other words, our method captures not only the spatial information, but also the sequential information (i.e., interactions between the nucleotides/residues). Our Conv4D uses a non-overlapping moving window of size 3 nucleotides/residues (i.e., a kernel size of 3) to capture interactions between the nucleotides/residues at each layer of the convolution. By using multi-layer CNNs, we can capture the interactions between more distant nucleotides/residues since the input to each layer is the output of the convolution of the last layer. As can be seen in the bottom left panel of Fig. 6, the deeper the CNN, the more long-range interactions can be learned. For instance, in a two-layer CNN, the first layer would capture the interactions between 3 consecutive nucleotides/residues while the second layer would capture the higher-level interactions between each of the 3-nucleotide/residue segments. Thus, at the second layer, we are effectively modeling the nucleotides/residues in RNA-protein complex with distance of 6. Further stacking of multi-layer CNN allows modeling of nucleotides/residues with longer distances (in Computer Vision, this is called increasing the receptive field).
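The growth of the modeled range with depth can be made concrete with the standard receptive-field recursion for stacked 1D convolutions. The sketch below assumes a kernel size of 3 along the sequence and, as an assumption not fixed by the paragraph above, a stride of 2 per layer; the function name is hypothetical.

```python
def receptive_fields(kernel_sizes, strides):
    """Receptive field (in input positions) after each 1D convolution layer."""
    fields, rf, jump = [], 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) input-level steps
        jump *= s              # the stride compounds the step between neighbouring outputs
        fields.append(rf)
    return fields


# Two layers, kernel size 3, stride 2 (the stride is an assumption of this sketch).
fields = receptive_fields([3, 3], [2, 2])
```

Under these assumptions the second layer sees 7 consecutive positions, that is, nucleotides/residues up to 6 positions apart, and adding layers widens the range further.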
Fig. 6. The input features of the DRPScore model.
DRPScore considers both local and global features from the RNA-protein complex. The local features are sequence, mass, and charge for each atom. The global features are secondary structures, nucleotide-residue distances, and interface interactions.
Specifically, our network has six layers, with the last being a fully connected layer for classification. Each of the first five layers has a Conv4d module, a BatchNorm module (optional), and a MaxPooling module. The numbers of channels in these Conv4d modules are [64, 128, 256, 512, 512]. The strides in these Conv4d modules are [2, 2, 2, 1, 1], i.e., the effective length of the RNA-protein complex features is reduced by half in each of the first three blocks (and kept the same in the last two blocks). All the pooling modules use a kernel size of two and a stride of two, i.e., each pooling module reduces the effective spatial dimensions (height, width, depth) by half. The last pooling is a global average pooling that reduces the spatial dimensions to one. Thus, the final representation of an RNA-protein complex is a vector of size 8,192 (512 channels × an effective length of 16). For example, for an RNA-protein complex initially represented by a tensor $X^{(0)}$ of shape $3 \times 128 \times 32 \times 32 \times 32$, we apply a 4D convolution with stride 2 at the sequential dimension:
$$X^{(1)} = \mathrm{Conv4D}\left(X^{(0)}\right) \qquad (8)$$
where $X^{(1)}$ has shape $64 \times 64 \times 32 \times 32 \times 32$ (the number of channels increases from 3 to 64, and the effective length of the RNA-protein complex reduces from 128 to 64). Note that the calculation of each position of $X^{(1)}$ along the sequence dimension relies on a window of three consecutive positions of $X^{(0)}$. We further apply max-pooling, i.e., $X^{(1)} \leftarrow \mathrm{MaxPool}\left(X^{(1)}\right)$, reducing the spatial dimension from $32 \times 32 \times 32$ to $16 \times 16 \times 16$. Moving forward to the next layer:
$$X^{(2)} = \mathrm{Conv4D}\left(X^{(1)}\right) \qquad (9)$$
where $X^{(2)}$ has shape $128 \times 32 \times 16 \times 16 \times 16$ (the number of channels increases from 64 to 128, and the effective length of the RNA-protein complex decreases from 64 to 32). Note that the calculation of $X^{(2)}$ relies on $X^{(1)}$, which in turn relies on $X^{(0)}$. Thus, at the second layer, we are effectively modeling the residues in the RNA-protein complex with a distance of 6. Again, after applying max-pooling, i.e., $X^{(2)} \leftarrow \mathrm{MaxPool}\left(X^{(2)}\right)$, the output feature has shape $128 \times 32 \times 8 \times 8 \times 8$. Similarly, the outputs $X^{(3)}$ and $X^{(4)}$ have shapes $256 \times 16 \times 4 \times 4 \times 4$ and $512 \times 16 \times 2 \times 2 \times 2$, respectively.
Finally, at the last layer, we have a tensor ($X^{(5)}$) with shape $512 \times 16 \times 2 \times 2 \times 2$, to which we further apply an adaptive pooling at the spatial dimension to get the final overall representation of the RNA-protein complex:
$$z = \mathrm{AdaptiveAvgPool}\left(X^{(5)}\right) \qquad (10)$$
where $z$ has shape 8,192 (after flattening $512 \times 16 \times 1 \times 1 \times 1$). Similar to the 3D convolution neural network, this representation is general-purpose and can be used for both classification and regression tasks for the RNA-protein complex sequence by adding a single regression or classification linear layer at the end of the model.
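The shape bookkeeping described above can be checked with a short script. This is a sketch under stated assumptions rather than the network definition: it takes the channel and stride lists from the text, assumes that the Conv4d modules leave the spatial size unchanged so that only the pooling modules halve it, and treats the last pooling as the global one. The function name is hypothetical.

```python
def track_shapes(length=128, spatial=32,
                 channels=(64, 128, 256, 512, 512),
                 strides=(2, 2, 2, 1, 1)):
    """Follow (channels, length, H, W, D) through the five conv/pool blocks."""
    shapes = []
    last = len(channels) - 1
    for block, (c, s) in enumerate(zip(channels, strides)):
        length //= s                                     # Conv4d stride acts on the sequence axis
        spatial = 1 if block == last else spatial // 2   # final pooling is global
        shapes.append((c, length, spatial, spatial, spatial))
    return shapes


shapes = track_shapes()
c, l, h, w, d = shapes[-1]
flat_size = c * l * h * w * d
```

Under these assumptions the flattened output has 512 × 16 = 8,192 entries, matching the vector size stated in the text.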
The 4DCNN in this work captures the sequence, secondary structural characteristics, and tertiary structural characteristics of protein and RNA (Fig. 6). The sequence, mass, and charge of heavy atoms are considered as local features in the 4DCNN. On the other hand, the secondary structure (α-helix and β-sheet for protein; stem, pseudoknot, internal loop, hairpin loop, single-stranded, and junction for RNA) and the distances between each nucleotide and residue are considered as global features in the 4DCNN. In addition, the interactions at the RNA-protein binding interface are also fully extracted, including electrostatic interactions, van der Waals interactions, hydrogen bonds, and π-π stacking interactions.
Training sets
To construct a diverse training dataset of RNA-protein complex structures, we extracted 951 available RNA-protein complex structures from the NDB database (before July 13, 2022) with the search options “only RNA and Protein” and “Resolution cutoff 3.5 Å (X-ray)”45,46. Second, we removed the short RNAs with lengths of <10 nucleotides. Third, we considered the cases with no more than six chains of protein or RNA as described in ITScore-PR23. Fourth, we removed the RNA redundancy by 0.95 sequence similarity cutoff as RASP47 and DRNA11 using CD-HIT42–44. Finally, we obtained a non-redundant RNA-protein dataset with 346 structures. We randomly selected 277 RNA-protein complex structures for training from the 346 non-redundant RNA-protein structures (Supplementary Data 7). The remaining RNA-protein complexes are further processed to build bound-bound testing sets.
We used 3dRPC21,30,48 to generate RNA-protein structural decoys. 3dRPC first generates the RNA-protein complex by the RPDOCK algorithm and then evaluates the structures by RPRANK. For each RNA-protein complex in the training set, 10,000 decoys were generated using the command ‘3dRPC -mode 9 -system 8 -par RPDOCK.par’. Then, we calculated the RMSDs of the complex structures using the command ‘3dRPC -mode 2 -system 0 -par RMSD.par’. Finally, we selected the top 500 structures from the 10,000 decoys by RMSD ranking. Thus, there are 1 native structure and 500 docking structures for each RNA-protein complex.
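The decoy-selection step can be sketched as below. The sketch assumes that "top 500 by RMSD ranking" means the 500 decoys with the lowest RMSD to the native structure, which the text does not state explicitly; the function name, decoy identifiers, and RMSD values are hypothetical.

```python
import heapq


def select_training_decoys(decoy_rmsds, keep=500):
    """Return the `keep` decoy ids with the lowest RMSD, lowest first."""
    return heapq.nsmallest(keep, decoy_rmsds, key=decoy_rmsds.get)


# Hypothetical RMSDs (angstroms) keyed by decoy id.
decoy_rmsds = {"d1": 12.4, "d2": 3.1, "d3": 7.9, "d4": 1.8, "d5": 20.5}
selected = select_training_decoys(decoy_rmsds, keep=3)
```

In practice the dictionary would hold the 10,000 RMSD values computed for one complex, and `keep` would stay at its default of 500.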
Testing sets
We tested DRPScore with two independent RNA-protein docking testing sets. An unbound structure is defined as a structure in free form or one taken as a binding partner from a different complex37. Thus, in a “bound-bound” structure, the two binding partners are taken from the same complex structure. In an “unbound-bound” structure, one of the two binding partners (RNA or protein) is either in apo form or taken from another complex. In an “unbound-unbound” structure, both binding partners (RNA and protein) are either in apo form or taken from a different complex.
Testing set I is the non-redundant bound-bound RNA-protein docking benchmark. We randomly selected 36 RNA-protein complexes from the remaining non-redundant RNA-protein complexes (Supplementary Data 8). We generated three bound-bound RNA-protein sets for a fair comparison. Then, we generated 1000 decoys for each RNA-protein complex in those three sets by 3dRPC21,30,48.
Testing set II is the non-redundant unbound RNA-protein docking benchmark provided by Huang and Zou37. We removed the redundancy between the training set and testing set II by a 0.95 sequence similarity cutoff using CD-HIT. After this filtering, the benchmark retains 57 unbound RNA-protein complex structures, consisting of 41 unbound-unbound complexes and 16 unbound-bound complexes (Supplementary Data 9). For each RNA-protein complex in this benchmark, 1000 decoys were generated by 3dRPC21,30,48 using the command ‘3dRPC -mode 9 -system 8 -par RPDOCK.par’. The relative RMSDs between decoys and native complex structures were calculated using the command ‘3dRPC -mode 2 -system 0 -par RMSD.par’. The maximum, minimum, and average RMSDs are provided in Supplementary Data 10–13.
To rank the RNA-protein complexes, ITScore-PR was run with the command ‘itscorepr protein.pdb RNA.pdb -nomin’, DARS-RNP with ‘python DARS_potential_v3.py -s complex.pdb’, and 3dRPC with ‘3dRPC -mode 8 -system 9 -par scoring.par’.
Criteria for the assessment of the prediction quality
The quality of the RNA-protein complex prediction is evaluated by the CAPRI criterion49,50. The metric is the interface RMSD between the native and predicted structures after superposition of the corresponding proteins. The RMSD is defined as
$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[(X_i - X_i')^2 + (Y_i - Y_i')^2 + (Z_i - Z_i')^2\right]} \tag{11}$$
where $(X_i, Y_i, Z_i)$ and $(X_i', Y_i', Z_i')$ are the coordinates of atom $i$ in the native and predicted structures, respectively, and $N$ is the total number of atoms. All RNA-protein superimposition and RMSD calculations were performed by 3dRPC21,30,48. We define an RNA-protein complex as successfully predicted if the interface RMSD between the predicted and native complexes is less than or equal to 4.0 Å.
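The RMSD definition and the 4.0 Å success criterion can be sketched in Python as follows. This is a minimal illustration, not the 3dRPC implementation: it assumes the two structures are already superimposed and that atoms are paired in the same order, and the helper names `rmsd`, `is_success`, and `top_n_hit` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]

def rmsd(native: Sequence[Coord], predicted: Sequence[Coord]) -> float:
    """Root-mean-square deviation over N paired, pre-superimposed atoms."""
    if len(native) != len(predicted) or not native:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (x - xp) ** 2 + (y - yp) ** 2 + (z - zp) ** 2
        for (x, y, z), (xp, yp, zp) in zip(native, predicted)
    )
    return math.sqrt(squared / len(native))

def is_success(value: float, cutoff: float = 4.0) -> bool:
    """A prediction counts as successful if its RMSD is <= cutoff (in Å)."""
    return value <= cutoff

def top_n_hit(ranked_rmsds: Sequence[float], n: int = 20) -> bool:
    """True if any of the n best-scored decoys meets the success criterion."""
    return any(is_success(r) for r in ranked_rmsds[:n])

# Two atoms, each displaced by 3 Å along z, give an RMSD of 3.0 Å.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
value = rmsd(native, predicted)
print(value, is_success(value))
```

Here `ranked_rmsds` is assumed to hold the RMSDs of the decoys in the order assigned by a scoring function, so `top_n_hit(ranked_rmsds, n=20)` corresponds to asking whether a native-like structure appears among the top 20 predictions for one complex.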
We also analyzed hydrogen bonds and secondary structure interactions at the RNA-protein interface. We used HBPLUS38 to identify intermolecular hydrogen bonds between the RNA and protein molecules. A hydrogen bond was defined by a maximum donor-acceptor distance of 3.35 Å and a maximum hydrogen-acceptor distance of 2.7 Å. Proteins often fold into α-helix and β-sheet secondary structures, while RNA contains single strands, stems, pseudoknots, internal loops, hairpin loops, and junctions. PSIPRED39,40 and forna41 were used to identify the protein and RNA secondary structures, respectively.
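The two distance cutoffs can be expressed as a simple check in Python. This sketch covers only the distance criteria stated here and omits any angular criteria that a full tool such as HBPLUS may also apply; the function name `passes_distance_criteria` and the example coordinates are hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from typing import Tuple

Coord = Tuple[float, float, float]

MAX_DONOR_ACCEPTOR = 3.35     # Å, maximum donor-acceptor distance
MAX_HYDROGEN_ACCEPTOR = 2.7   # Å, maximum hydrogen-acceptor distance

def passes_distance_criteria(donor: Coord, hydrogen: Coord, acceptor: Coord) -> bool:
    """Check only the two distance cutoffs; no angle checks are performed."""
    donor_acceptor = math.dist(donor, acceptor)
    hydrogen_acceptor = math.dist(hydrogen, acceptor)
    return (donor_acceptor <= MAX_DONOR_ACCEPTOR
            and hydrogen_acceptor <= MAX_HYDROGEN_ACCEPTOR)

# Collinear toy geometry: donor at the origin, hydrogen 1.0 Å away.
donor = (0.0, 0.0, 0.0)
hydrogen = (1.0, 0.0, 0.0)
close_acceptor = (2.9, 0.0, 0.0)   # D-A 2.9 Å, H-A 1.9 Å
far_acceptor = (3.6, 0.0, 0.0)     # D-A 3.6 Å exceeds the 3.35 Å cutoff

print(passes_distance_criteria(donor, hydrogen, close_acceptor))
print(passes_distance_criteria(donor, hydrogen, far_acceptor))
```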
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
This work is supported by the National Natural Science Foundation of China 12175081 (Y.Z.) and Fundamental Research Funds for the Central Universities CCNU22QN004 (Y.Z.).
Author contributions
C.Z. performed the majority of computational analysis; Y.J. built the deep learning model under the supervision of S.V.; C.Z. helped with the deep learning modeling; Y.Z. designed the project and supervised the overall study. All authors have read, edited, and approved the final manuscript.
Peer review
Peer review information
Nature Communications thanks Jian Wang, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Data availability
A full list of the PDB codes used in this study, with links, is available in Supplementary Data 7–9. All PDB data sets used in this paper can be downloaded from the Nucleic Acid Database (http://ndbserver.rutgers.edu/) and Zoulab (http://zoulab.dalton.missouri.edu/RNAbenchmark/). The data that support the findings of this study, including the scoring function, training set, testing sets, and examples, are available for download at https://github.com/Zhaolab-GitHub/DRPScore_v1.0. Source data are provided with this paper.
Code availability
The DRPScore code is freely available for academic or non-commercial users via https://github.com/Zhaolab-GitHub/DRPScore_v1.0 (ref. 51).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Chengwei Zeng, Yiren Jian.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-023-36720-9.
References
- 1. Chung CS, et al. Dynamic protein-RNA interactions in mediating splicing catalysis. Nucleic Acids Res. 2019;47:899–910. doi: 10.1093/nar/gky1089.
- 2. Glisovic T, Bachorik JL, Yong J, Dreyfuss G. RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett. 2008;582:1977–1986. doi: 10.1016/j.febslet.2008.03.004.
- 3. Licatalosi DD, Darnell RB. RNA processing and its regulation: global insights into biological networks. Nat. Rev. Genet. 2010;11:75–87. doi: 10.1038/nrg2673.
- 4. Lunde BM, Moore C, Varani G. RNA-binding proteins: modular design for efficient function. Nat. Rev. Mol. Cell Biol. 2007;8:479–490. doi: 10.1038/nrm2178.
- 5. Mittal N, Roy N, Babu MM, Janga SC. Dissecting the expression dynamics of RNA-binding proteins in posttranscriptional regulatory networks. Proc. Natl Acad. Sci. USA. 2009;106:20300–20305. doi: 10.1073/pnas.0906940106.
- 6. Muller-McNicoll M, Neugebauer KM. How cells get the message: dynamic assembly and function of mRNA-protein complexes. Nat. Rev. Genet. 2013;14:275–287. doi: 10.1038/nrg3434.
- 7. Modic M, Ule J, Sibley CR. CLIPing the brain: studies of protein-RNA interactions important for neurodegenerative disorders. Mol. Cell Neurosci. 2013;56:429–435. doi: 10.1016/j.mcn.2013.04.002.
- 8. De Conti L, Baralle M, Buratti E. Neurodegeneration and RNA-binding proteins. Wiley Interdiscip. Rev. RNA. 2017;8:e1394.
- 9. Khalil AM, Rinn JL. RNA-protein interactions in human health and disease. Semin Cell Dev. Biol. 2011;22:359–365. doi: 10.1016/j.semcdb.2011.02.016.
- 10. Chen Y, Kortemme T, Robertson T, Baker D, Varani G. A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Res. 2004;32:5147–5162. doi: 10.1093/nar/gkh785.
- 11. Zhao H, Yang Y, Zhou Y. Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Res. 2011;39:3017–3025. doi: 10.1093/nar/gkq1266.
- 12. Zhao H, Yang Y, Zhou Y. Highly accurate and high-resolution function prediction of RNA binding proteins by fold recognition and binding affinity prediction. RNA Biol. 2011;8:988–996. doi: 10.4161/rna.8.6.17813.
- 13. Ke A, Doudna JA. Crystallization of RNA and RNA-protein complexes. Methods. 2004;34:408–414. doi: 10.1016/j.ymeth.2004.03.027.
- 14. Khatter H, Myasnikov AG, Natchiar SK, Klaholz BP. Structure of the human 80S ribosome. Nature. 2015;520:640–645. doi: 10.1038/nature14427.
- 15. Berman HM, et al. The protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
- 16. Arnautova YA, Abagyan R, Totrov M. Protein-RNA docking Using ICM. J. Chem. Theory Comput. 2018;14:4971–4984. doi: 10.1021/acs.jctc.8b00293.
- 17. Zheng J, Hong X, Xie J, Tong X, Liu S. P3DOCK: a protein-RNA docking webserver based on template-based and template-free docking. Bioinformatics. 2020;36:96–103. doi: 10.1093/bioinformatics/btz478.
- 18. Zhang Z, et al. A combinatorial scoring function for protein-RNA docking. Proteins. 2017;85:741–752. doi: 10.1002/prot.25253.
- 19. Perez-Cano L, Romero-Durana M, Fernandez-Recio J. Structural and energy determinants in protein-RNA docking. Methods. 2017;118-119:163–170. doi: 10.1016/j.ymeth.2016.11.001.
- 20. Tuszynska I, Magnus M, Jonak K, Dawson W, Bujnicki JM. NPDock: a web server for protein-nucleic acid docking. Nucleic Acids Res. 2015;43:W425–W430. doi: 10.1093/nar/gkv493.
- 21. Huang Y, Liu S, Guo D, Li L, Xiao Y. A novel protocol for three-dimensional structure prediction of RNA-protein complexes. Sci. Rep. 2013;3:1887. doi: 10.1038/srep01887.
- 22. Nithin C, Ghosh P, Bujnicki JM. Bioinformatics tools and benchmarks for computational docking and 3D structure prediction of RNA-protein complexes. Genes (Basel). 2018;9:432. doi: 10.3390/genes9090432.
- 23. Huang SY, Zou X. A knowledge-based scoring function for protein-RNA interactions derived from a statistical mechanics-based iterative method. Nucleic Acids Res. 2014;42:e55. doi: 10.1093/nar/gku077.
- 24. Tuszynska I, Bujnicki JM. DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinforma. 2011;12:348. doi: 10.1186/1471-2105-12-348.
- 25. Qiu L, Zou X. Scoring functions for protein-RNA complex structure prediction: advances, applications, and future directions. Commun. Inf. Syst. 2020;20:1–22. doi: 10.4310/CIS.2020.v20.n1.a1.
- 26. Perez-Cano L, Fernandez-Recio J. Optimal protein-RNA area, OPRA: a propensity-based method to identify RNA-binding sites on proteins. Proteins. 2010;78:25–35. doi: 10.1002/prot.22527.
- 27. Perez-Cano L, Solernou A, Pons C, Fernandez-Recio J. Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput. 2010;2010:293–301.
- 28. Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A. Protein fragment reconstruction using various modeling techniques. J. Comput. Aided Mol. Des. 2003;17:725–738. doi: 10.1023/B:JCAM.0000017486.83645.a0.
- 29. Malolepsza E, Boniecki M, Kolinski A, Piela L. Theoretical model of prion propagation: a misfolded protein induces misfolding. Proc. Natl Acad. Sci. USA. 2005;102:7835–7840. doi: 10.1073/pnas.0409389102.
- 30. Li H, Huang Y, Xiao Y. A pair-conformation-dependent scoring function for evaluating 3D RNA-protein complex structures. PLoS One. 2017;12:e0174662. doi: 10.1371/journal.pone.0174662.
- 31. Townshend RJL, et al. Geometric deep learning of RNA structure. Science. 2021;373:1047–1051. doi: 10.1126/science.abe5650.
- 32. Sato K, Akiyama M, Sakakibara Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat. Commun. 2021;12:941. doi: 10.1038/s41467-021-21194-4.
- 33. Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
- 34. Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
- 35. Senior AW, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577:706–710. doi: 10.1038/s41586-019-1923-7.
- 36. Li J, et al. RNA3DCNN: Local and global quality assessments of RNA 3D structures using 3D deep convolutional neural networks. PLoS Comput. Biol. 2018;14:e1006514. doi: 10.1371/journal.pcbi.1006514.
- 37. Huang SY, Zou X. A nonredundant structure dataset for benchmarking protein-RNA computational docking. J. Comput. Chem. 2013;34:311–318. doi: 10.1002/jcc.23149.
- 38. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 1994;238:777–793. doi: 10.1006/jmbi.1994.1334.
- 39. Buchan DWA, Jones DT. The PSIPRED protein analysis workbench: 20 years on. Nucleic Acids Res. 2019;47:W402–W407. doi: 10.1093/nar/gkz297.
- 40. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091.
- 41. Kerpedjiev P, Hammer S, Hofacker IL. Forna (force-directed RNA): Simple and effective online RNA secondary structure diagrams. Bioinformatics. 2015;31:3377–3379. doi: 10.1093/bioinformatics/btv372.
- 42. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
- 43. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17:282–283. doi: 10.1093/bioinformatics/17.3.282.
- 44. Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18:77–82. doi: 10.1093/bioinformatics/18.1.77.
- 45. Berman HM, et al. The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys. J. 1992;63:751–759. doi: 10.1016/S0006-3495(92)81649-1.
- 46. Coimbatore Narayanan B, et al. The Nucleic Acid Database: new features and capabilities. Nucleic Acids Res. 2014;42:D114–D122. doi: 10.1093/nar/gkt980.
- 47. Capriotti E, Norambuena T, Marti-Renom MA, Melo F. All-atom knowledge-based potential for RNA structure prediction and assessment. Bioinformatics. 2011;27:1086–1093. doi: 10.1093/bioinformatics/btr093.
- 48. Huang Y, Li H, Xiao Y. Using 3dRPC for RNA-protein complex structure prediction. Biophys. Rep. 2016;2:95–99. doi: 10.1007/s41048-017-0034-y.
- 49. Janin J, et al. CAPRI: a Critical assessment of PRedicted interactions. Proteins. 2003;52:2–9. doi: 10.1002/prot.10381.
- 50. Mendez R, Leplae R, Lensink MF, Wodak SJ. Assessment of CAPRI predictions in rounds 3-5 shows progress in docking procedures. Proteins. 2005;60:150–169. doi: 10.1002/prot.20551.
- 51. Zeng CW, Jian YR, Vosoughi S, Zeng C, Zhao YJ. Evaluating native-like structures of RNA-protein complexes through the deep learning method. Structure. 2023. doi: 10.5281/zenodo.7614606.