2015 Nov 3;16(11):26303–26317. doi: 10.3390/ijms161125952

Table 1.

Commonly used data sets for RNA-binding site identification.

ID	Reference	Publication Year	Notes
PRIPU dataset	[27]	2015	Contains positive and unlabeled examples, unlike earlier datasets, which usually contain positive and negative examples. The negative samples in those earlier datasets are not verified negatives; some may be unknown positives
RB344 ^a	[26]	2015	344 RNA-binding proteins, almost entirely non-redundant at 30% sequence identity
RB172	[28]	2014	172 protein entries with sequence identity of less than 25%
RB75	[8]	2012	75 RNP complexes released in the PDB ^b between 1 January and 28 April 2011, non-redundant at 40% sequence identity
RB199	[25,29]	2011	Extracted from the PDB in May 2010. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed
RB164	[30]	2010	The data were downloaded from RsiteDB. After removing protein and RNA chains with sequence identity above 25% and 60%, respectively, 205 non-redundant protein–RNA chains in 164 complexes were obtained
RB86	[31]	2008	86 RNA-binding protein chains collected for training and five-fold cross-validation
RB147	[32]	2007	RB109 extended with novel RNA-binding complexes that became available since 2006
RB109	[33]	2006	109 RNA–protein complexes extracted from structures of known RNA–protein complexes solved by X-ray crystallography in the PDB. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed

^a RB: Abbreviation of RNA-binding dataset; ^b PDB: Protein Data Bank.
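
Several of the data sets in Table 1 were built by discarding low-resolution structures and then removing redundant proteins above a sequence-identity cutoff. The following minimal sketch illustrates that kind of filtering under stated assumptions; it is not the procedure of any cited study. The `Chain` class, the `filter_chains` function, and the toy `naive_identity` measure are hypothetical names introduced here for illustration, and the defaults (30% identity, 3.5 Å) simply mirror the RB109 and RB199 rows. Real pipelines compute identity from sequence alignments rather than from position-by-position comparison.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Chain:
    chain_id: str
    sequence: str
    resolution: float  # in angstroms; lower values mean better resolution


def naive_identity(a: str, b: str) -> float:
    """Toy identity (assumption): percent of matching positions, no alignment."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def filter_chains(
    chains: List[Chain],
    max_identity: float = 30.0,
    max_resolution: float = 3.5,
    identity: Callable[[str, str], float] = naive_identity,
) -> List[Chain]:
    """Greedy filter: drop poor-resolution chains, then drop redundant ones."""
    kept: List[Chain] = []
    # Visit the best-resolved chains first so they act as representatives.
    for chain in sorted(chains, key=lambda c: c.resolution):
        if chain.resolution > max_resolution:
            continue  # resolution worse than the cutoff
        # Keep the chain only if it is dissimilar to every chain kept so far.
        if all(identity(chain.sequence, k.sequence) <= max_identity for k in kept):
            kept.append(chain)
    return kept


if __name__ == "__main__":
    example = [
        Chain("A", "MKVLAAGIVG", 2.0),
        Chain("B", "MKVLAAGIVG", 2.5),  # identical to A, so redundant
        Chain("C", "QWERTYPSDF", 3.0),  # no positions shared with A
        Chain("D", "HHHHHHHHHH", 4.0),  # resolution worse than 3.5
    ]
    print([c.chain_id for c in filter_chains(example)])  # prints ['A', 'C']
```

The greedy order matters: because chains are visited from best to worst resolution, the representative retained from each group of similar sequences is the best-resolved one. Other choices of representative are equally possible.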