Data in Brief. 2016 Dec 29;10:561–563. doi: 10.1016/j.dib.2016.12.041

Data of protein-RNA binding sites

Wook Lee, Byungkyu Park, Daesik Choi, Kyungsook Han
PMCID: PMC5219607  PMID: 28070546

Abstract

Despite the increasing number of protein-RNA complexes in structure databases, few data resources are available that can be readily used in developing or testing a method for predicting either protein-binding sites in RNA sequences or RNA-binding sites in protein sequences. The problem of predicting protein-binding sites in RNA has received much less attention than the problem of predicting RNA-binding sites in protein. The data presented in this paper are related to the article entitled “PRIdictor: Protein-RNA Interaction predictor” (Tuvshinjargal et al. 2016) [1]. PRIdictor can predict protein-binding sites in RNA as well as RNA-binding sites in protein at the nucleotide and residue levels. This paper presents four datasets that were used to test four prediction models of PRIdictor: (1) model RP for predicting protein-binding sites in RNA from protein and RNA sequences, (2) model RaP for predicting protein-binding sites in RNA from RNA sequence alone, (3) model PR for predicting RNA-binding sites in protein from protein and RNA sequences, and (4) model PaR for predicting RNA-binding sites in protein from protein sequence alone. The datasets supplied in this article can serve as a resource for evaluating and comparing different methods for predicting protein-RNA binding sites.

Keywords: Protein-RNA interactions, Binding sites, Prediction


Specifications Table

Subject area: Bioinformatics, computational biology
More specific subject area: Molecular structures
Type of data: Text files in XML format
How data was acquired: Protein Data Bank (PDB) [2] and Nucleic acid-Protein Interaction DataBase (NPIDB) [3]
Data format: Filtered and processed
Experimental factors:
Experimental features:
Data source location: Department of Computer Science and Engineering, Inha University, Incheon, South Korea
Data accessibility: Data is provided with this article.

Value of the data

  • Few data resources have been available which can be readily used in developing or assessing a method for predicting protein-binding sites in RNA sequences or RNA-binding sites in protein sequences.

  • Protein-RNA binding sites annotated at the nucleotide and residue levels can facilitate the development of new methods for predicting protein-RNA binding sites.

  • The protein-RNA binding sites provided here can serve as a resource for evaluating and comparing different methods for predicting protein-binding nucleotides in RNAs and/or RNA-binding residues in proteins.

1. Data

The four datasets S1-S4 in XML format can be used to evaluate various methods for predicting: (1) protein-binding nucleotides from protein and RNA sequences, (2) protein-binding nucleotides from RNA sequence alone, (3) RNA-binding amino acids from protein and RNA sequences, and (4) RNA-binding amino acids from protein sequence alone.
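As an illustration of how such an evaluation might look, the sketch below compares predicted per-position labels against reference labels and reports common binary classification measures. It is a minimal example and not the evaluation code used for PRIdictor: the function name `binding_site_metrics`, the 0/1 label encoding (1 = binding site) and the choice of measures are assumptions made here, and the step of reading labels from the XML files is omitted because the file schema is not described in this article.

```python
from math import sqrt


def binding_site_metrics(true_labels, predicted_labels):
    """Compare two equal-length 0/1 label sequences (1 = binding site).

    Hypothetical helper for illustration; not part of PRIdictor.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero for degenerate inputs.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Toy labels for one six-nucleotide RNA (made-up values, not taken from S1-S4).
truth = [1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 1]
metrics = binding_site_metrics(truth, predicted)
print(metrics)
```

Because binding positions are the minority class in all four datasets, a measure such as the Matthews correlation coefficient is usually more informative than accuracy alone.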

2. Experimental design, materials and methods

From the Protein Data Bank (PDB) [2], we collected structures of protein-RNA complexes which do not include ribosomal RNAs and were determined by X-ray crystallography with a resolution ≤3.0 Å.
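The selection criteria above can be expressed as a simple filter. The sketch below assumes that each complex has already been summarized in a small record; the `ComplexRecord` class, its field names, the method string and the example identifiers are hypothetical and only serve to make the three conditions explicit.

```python
from dataclasses import dataclass


@dataclass
class ComplexRecord:
    """Hypothetical summary of one protein-RNA complex."""
    pdb_id: str
    method: str              # experimental method as a plain string
    resolution: float        # resolution in angstroms
    has_ribosomal_rna: bool


def passes_selection(record, max_resolution=3.0):
    """Apply the three criteria: X-ray structure, resolution <= 3.0 A, no rRNA."""
    return (
        record.method.upper() == "X-RAY DIFFRACTION"
        and record.resolution <= max_resolution
        and not record.has_ribosomal_rna
    )


# Made-up records; the identifiers are placeholders, not real PDB entries.
candidates = [
    ComplexRecord("AAAA", "X-RAY DIFFRACTION", 2.4, False),
    ComplexRecord("BBBB", "X-RAY DIFFRACTION", 3.5, False),   # resolution too low
    ComplexRecord("CCCC", "SOLUTION NMR", 0.0, False),        # not X-ray
    ComplexRecord("DDDD", "X-RAY DIFFRACTION", 2.9, True),    # contains rRNA
]
selected = [record.pdb_id for record in candidates if passes_selection(record)]
print(selected)
```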

As of September 2013, there were a total of 542 protein-RNA complexes, which contained 546 protein-RNA sequence pairs between 376 protein sequences and 439 RNA sequences.

We defined a protein-RNA binding site using three types of protein-RNA interactions (hydrogen bonds, water bridges and hydrophobic interactions). A nucleotide (or amino acid) involved in at least one of the interactions was classified as a protein-binding (or RNA-binding) site. For each of the protein–RNA complexes from PDB, we obtained the three types of interactions from the Nucleic acid–Protein Interaction DataBase (NPIDB) [3] and incorporated them into the RNA and protein sequences.
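The labeling rule can be sketched as follows: a position is marked as a binding site as soon as it takes part in at least one interaction of any of the three types. The input format (pairs of a 0-based position and an interaction type), the type names and the function name are assumptions for illustration; they do not reflect the NPIDB file format.

```python
INTERACTION_TYPES = {"hydrogen_bond", "water_bridge", "hydrophobic"}


def label_binding_positions(sequence_length, interactions):
    """Return a 0/1 label per position (1 = involved in >= 1 interaction).

    `interactions` is an iterable of (position, interaction_type) pairs with
    0-based positions; this layout is hypothetical.
    """
    labels = [0] * sequence_length
    for position, kind in interactions:
        if kind not in INTERACTION_TYPES:
            raise ValueError(f"unknown interaction type: {kind}")
        if not 0 <= position < sequence_length:
            raise IndexError(f"position {position} is outside the sequence")
        labels[position] = 1  # one interaction of any type is enough
    return labels


# Toy example: position 1 has two interactions, position 4 has one.
example = label_binding_positions(
    6,
    [(1, "hydrogen_bond"), (1, "water_bridge"), (4, "hydrophobic")],
)
print(example)
```

The same function applies to both sides of a complex: called on an RNA chain it yields protein-binding nucleotides, and called on a protein chain it yields RNA-binding residues.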

To reduce overlap between the training and test datasets, we ran CD-HIT-EST on the RNA sequences, retained RNA sequences that shared at most 80% similarity with the other RNA sequences, and constructed the test datasets S1 and S2 for models RP and RaP [1], respectively. Datasets S1 and S2 contain the same RNA sequences but differ as follows:

  1. Protein sequences were included in dataset S1 only.

  2. In dataset S2, the protein-binding sites that the same RNA sequence forms with different protein partners were combined into the annotation of that RNA sequence.
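One plausible reading of the second difference is that the binding annotations of an RNA sequence are combined across its protein partners, so that a nucleotide counts as protein-binding if it binds any partner. The sketch below implements that union; the interpretation, the function name and the 0/1 label lists are assumptions, not a description of the authors' procedure.

```python
def merge_partner_labels(label_sets):
    """Combine per-partner 0/1 labels of one RNA sequence by position-wise OR."""
    if not label_sets:
        raise ValueError("at least one label list is required")
    length = len(label_sets[0])
    if any(len(labels) != length for labels in label_sets):
        raise ValueError("all label lists must describe the same RNA sequence")
    # zip(*label_sets) yields, for each position, the labels from every partner.
    return [1 if any(column) else 0 for column in zip(*label_sets)]


# Toy example: the same four-nucleotide RNA annotated against two partners.
with_partner_a = [1, 0, 0, 1]
with_partner_b = [0, 0, 1, 1]
merged = merge_partner_labels([with_partner_a, with_partner_b])
print(merged)
```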

The dataset S1 contains 130 protein sequences and 155 RNA sequences with 1848 protein-binding nucleotides and 4631 non-binding nucleotides. The dataset S2 contains 155 RNA sequences with 1795 protein-binding nucleotides and 4235 non-binding nucleotides.

The test datasets S3 and S4 for models PR and PaR were constructed in a similar way. The dataset S3 contains 44 RNA sequences and 46 protein sequences with 923 RNA-binding residues and 7578 non-binding residues. The dataset S4 contains 49 protein sequences with 1349 RNA-binding residues and 11,217 non-binding residues.
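The counts reported above imply a marked class imbalance, which matters when choosing evaluation measures. The short calculation below derives the total number of labeled positions and the fraction of binding positions for each dataset directly from the reported counts; only the variable and function names are introduced here.

```python
# (binding, non-binding) counts as reported for each dataset.
DATASET_COUNTS = {
    "S1": (1848, 4631),    # nucleotides
    "S2": (1795, 4235),    # nucleotides
    "S3": (923, 7578),     # amino acid residues
    "S4": (1349, 11217),   # amino acid residues
}


def binding_fraction(binding, non_binding):
    """Fraction of labeled positions that are binding sites."""
    return binding / (binding + non_binding)


for name, (binding, non_binding) in DATASET_COUNTS.items():
    total = binding + non_binding
    fraction = binding_fraction(binding, non_binding)
    print(f"{name}: {total} positions, {fraction:.1%} binding")
```

Binding nucleotides make up somewhat under 30% of the positions in S1 and S2, whereas binding residues make up roughly 11% of the positions in S3 and S4, so a predictor that labels everything as non-binding would already reach a high accuracy on the protein datasets.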

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2015R1A1A3A04001243), and in part by the Ministry of Education (2010-0020163).

Footnotes

Transparency document

Transparency data associated with this paper can be found in the online version at 10.1016/j.dib.2016.12.041.

Appendix A

Supplementary data associated with this paper can be found in the online version at 10.1016/j.dib.2016.12.041.

Transparency document. Supplementary material

  • mmc1.pdf (7.6KB, pdf)

Appendix A. Supplementary material

  • mmc2.xml (151KB, xml)
  • mmc3.xml (36.6KB, xml)
  • mmc4.xml (28.8KB, xml)
  • mmc5.xml (32.5KB, xml)

References

  • 1. Tuvshinjargal N., Lee W., Park B., Han K. PRIdictor: Protein-RNA Interaction predictor. BioSystems. 2016;139:17–22. doi: 10.1016/j.biosystems.2015.10.004.
  • 2. Rose P.W., Beran B., Bi C.X., Bluhm W.F., Dimitropoulos D., Goodsell D.S., Prlic A., Quesada M., Quinn G.B., Westbrook J.D., Young J., Yukich B., Zardecki C., Berman H.M., Bourne P.E. The RCSB protein data bank: redesigned web site and web services. Nucleic Acids Res. 2011;39:D392–D401. doi: 10.1093/nar/gkq1021.
  • 3. Kirsanov D.D., Zanegina O.N., Aksianov E.A., Spirin S.A., Karyagina A.S., Alexeevski A.V. NPIDB: nucleic acid-protein interaction database. Nucleic Acids Res. 2013;41:D517–D523. doi: 10.1093/nar/gks1199.
