Table 1.
Commonly used datasets for RNA-binding site identification.
ID | Reference | Publication Year | Notes |
---|---|---|---|
PRIPU dataset | [27] | 2015 | Contains positive and unlabeled examples rather than positive and negative ones. Earlier datasets usually include negative samples, but these are not true negatives; some may be unidentified positives |
RB344ᵃ | [26] | 2015 | 344 RNA-binding proteins, almost entirely non-redundant at 30% sequence identity |
RB172 | [28] | 2014 | 172 protein entries with sequence identity of less than 25% |
RB75 | [8] | 2012 | 75 RNP complexes released between 1 January and 28 April 2011 from the PDB databaseᵇ, non-redundant at 40% sequence identity |
RB199 | [25,29] | 2011 | Extracted dataset (May 2010) from PDB database. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed |
RB164 | [30] | 2010 | The data were downloaded from RsiteDB. After removing protein and RNA chains with sequence identity above 25% and 60%, respectively, 205 non-redundant protein–RNA chains in 164 complexes were obtained |
RB86 | [31] | 2008 | 86 RNA-binding protein chains collected for training and five-fold cross-validation |
RB147 | [32] | 2007 | RB109 extended with novel RNA-binding complexes added since 2006 |
RB109 | [33] | 2006 | 109 RNA–protein complexes extracted from structures of known RNA–protein complexes solved by X-ray crystallography in the PDB. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed |
ᵃ RB: abbreviation of RNA-binding dataset; ᵇ PDB: Protein Data Bank.
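
Several entries in Table 1 (for example RB199 and RB109) apply the same two filters: a resolution cutoff of 3.5 Å and a sequence identity cutoff of 30%. The sketch below illustrates how such a filter could work. It is a minimal example and not the procedure used by any of the cited studies: the `Chain` class, the `filter_chains` function, the greedy selection order, and the precomputed `identity` lookup are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    chain_id: str      # hypothetical identifier, not a real PDB entry
    resolution: float  # in angstroms; lower values mean better resolution


def filter_chains(chains, identity, max_resolution=3.5, max_identity=30.0):
    """Greedy non-redundant selection (an assumed strategy, for illustration).

    identity maps frozenset({id_a, id_b}) to a precomputed percent sequence
    identity; pairs missing from the mapping are treated as 0% identical.
    """
    kept = []
    # Visit the best-resolved chains first so they become the representatives.
    for chain in sorted(chains, key=lambda c: c.resolution):
        if chain.resolution > max_resolution:
            continue  # resolution worse than the cutoff
        redundant = any(
            identity.get(frozenset((chain.chain_id, k.chain_id)), 0.0) > max_identity
            for k in kept
        )
        if not redundant:
            kept.append(chain)
    return kept


# Toy data: identifiers and values are made up.
chains = [Chain("A", 2.0), Chain("B", 2.5), Chain("C", 4.0), Chain("D", 3.0)]
identity = {
    frozenset(("A", "B")): 45.0,
    frozenset(("A", "D")): 20.0,
    frozenset(("B", "D")): 28.0,
}
selected = [c.chain_id for c in filter_chains(chains, identity)]
print(selected)
```

In this toy input, chain C is dropped for its 4.0 Å resolution and chain B is dropped because it is 45% identical to the better-resolved chain A. In practice the pairwise identities would come from a dedicated clustering or alignment tool rather than a hand-written dictionary.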