2015 Nov 3;16(11):26303–26317. doi: 10.3390/ijms161125952

Table 1.

Commonly used datasets for RNA-binding site identification.

| ID | Reference | Publication Year | Notes |
| --- | --- | --- | --- |
| PRIPU dataset | [27] | 2015 | Contains positive and unlabeled examples, whereas earlier datasets usually contain positive and negative samples. Those negative samples are not verified negatives, and some may be undiscovered positives |
| RB344 (a) | [26] | 2015 | 344 RNA-binding proteins, almost entirely non-redundant at 30% sequence identity |
| RB172 | [28] | 2014 | 172 protein entries with sequence identity of less than 25% |
| RB75 | [8] | 2012 | 75 RNP complexes released in the PDB (b) between 1 January and 28 April 2011, non-redundant at 40% sequence identity |
| RB199 | [25,29] | 2011 | Extracted from the PDB in May 2010. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed |
| RB164 | [30] | 2010 | Downloaded from RsiteDB. After removing protein and RNA chains with sequence identity above 25% and 60%, respectively, 205 non-redundant protein–RNA chains in 164 complexes were obtained |
| RB86 | [31] | 2008 | 86 RNA-binding protein chains collected for training and fivefold cross-validation |
| RB147 | [32] | 2007 | Based on RB109, with novel RNA-binding complexes added since 2006 |
| RB109 | [33] | 2006 | 109 RNA–protein complexes extracted from X-ray crystal structures in the PDB. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed |

a RB: Abbreviation of RNA-binding dataset; b PDB: Protein Data Bank.
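
Several of the datasets in Table 1 are described as built by discarding poorly resolved structures and then removing redundant proteins above a sequence-identity cutoff. The Python sketch below illustrates that general idea only; it is not the procedure used by any of the cited works. The `Chain` record, the function names, and the greedy selection order are all hypothetical, and `approx_identity` is a crude stand-in based on `difflib`, whereas real pipelines compute identity from proper sequence alignments or dedicated clustering tools.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Chain:
    """Hypothetical record for one protein chain (not from the source)."""
    chain_id: str       # illustrative identifier
    sequence: str       # one-letter amino-acid sequence
    resolution: float   # structure resolution in angstroms


def approx_identity(a: str, b: str) -> float:
    """Crude identity proxy: matched residues / length of the shorter sequence.

    Assumption: this only approximates alignment-based sequence identity.
    """
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def filter_nonredundant(chains, max_identity=0.30, max_resolution=3.5):
    """Drop chains with resolution worse than `max_resolution`, then greedily
    keep chains whose identity to every kept chain is at most `max_identity`."""
    kept = []
    # Visit the best-resolved structures first so they become representatives.
    for chain in sorted(chains, key=lambda c: c.resolution):
        if chain.resolution > max_resolution:
            continue  # resolution worse than the cutoff
        if all(approx_identity(chain.sequence, k.sequence) <= max_identity
               for k in kept):
            kept.append(chain)
    return kept


# Toy input, invented purely for illustration.
toy_chains = [
    Chain("A", "MKTAYIAKQRQISFVKSHFS", 2.0),
    Chain("B", "MKTAYIAKQRQISFVKSHFA", 2.5),  # near-duplicate of A
    Chain("C", "WWWWWWWWWW", 4.0),            # resolution worse than 3.5
    Chain("D", "GGGGGGGGGGPPPPPPPPPP", 3.0),  # unrelated to A
]
selected = [c.chain_id for c in filter_nonredundant(toy_chains)]
```

With the default cutoffs (30% identity, 3.5 Å), the toy input keeps chains A and D: B is dropped as redundant with A, and C is dropped for its resolution. Changing `max_identity` to 0.25 or 0.40 mimics the other thresholds listed in the table.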