TABLE 5.
Strategies for the construction of a negative dataset for RNA-protein interaction prediction.
| Strategy | Assumption | Description |
|---|---|---|
| Random pairing | A randomly paired RNA and protein are unlikely to interact | Starting from the known interacting pairs, an equal number of non-interacting pairs is generated by randomly pairing RNAs and proteins from the positive set; candidate pairs that are similar to interactions already present in the positive set are discarded |
| FIRE method | Given a known RNA-protein interacting pair (p1, r) and a second protein p2, the lower the similarity between p1 and p2, the lower the likelihood that r interacts with p2 | For each positive RNA-protein interaction (p1, r), the protein p2 that is most dissimilar to p1 is selected, and the pair (p2, r) is taken as a negative example. Similarity between each pair of proteins is computed from functional annotations and protein domain information in addition to sequence similarity |
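| Subcellular localization method | RNAs and proteins that are not in the same subcellular compartment do not interact with each other | RNA-protein pairs whose members are located in different subcellular compartments are taken as non-interacting. This method requires subcellular localization data |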
| Least atom distance criterion | Only applicable to interactions derived from complexes of known structure | Given a multimolecular RNA-protein complex, each pairwise combination of its constituent RNA and protein molecules is examined: if at least one atom of the RNA lies closer than a threshold distance to at least one atom of the protein, the pair is considered to be interacting; otherwise, it is included in the negative dataset |
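
Two of the strategies in Table 5, random pairing and the least atom distance criterion, are simple enough to sketch in a few lines of Python. The code below is only an illustration under stated assumptions and does not reproduce any published implementation. The function names, the `is_similar` callback used to discard pairs resembling known interactions, the retry limit, and the input layout (pairs as `(protein_id, rna_id)` tuples, atoms as `(x, y, z)` coordinates) are all hypothetical. No default distance threshold is given, since the table does not specify one.

```python
import math
import random


def random_negative_pairs(positives, seed=0, is_similar=None, max_tries=100_000):
    """Random pairing: draw as many negatives as there are positives.

    positives  : list of (protein_id, rna_id) tuples known to interact.
    is_similar : hypothetical callback (candidate, positive) -> bool that
                 flags candidates too similar to a known interaction.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    proteins = sorted({p for p, _ in positives})
    rnas = sorted({r for _, r in positives})

    negatives = set()
    tries = 0
    # Stop early if the pool of admissible pairs is too small (assumed limit).
    while len(negatives) < len(positives) and tries < max_tries:
        tries += 1
        candidate = (rng.choice(proteins), rng.choice(rnas))
        if candidate in positive_set:
            continue  # already a known interaction
        if is_similar is not None and any(
            is_similar(candidate, pos) for pos in positives
        ):
            continue  # too close to a known interaction, discard
        negatives.add(candidate)
    return sorted(negatives)


def split_complex_by_distance(rna_chains, protein_chains, threshold):
    """Least atom distance criterion for one complex of known structure.

    rna_chains, protein_chains : dict mapping chain id -> list of (x, y, z).
    threshold                  : distance cutoff, in the units of the coordinates.
    Returns (interacting_pairs, negative_pairs) as lists of (protein_id, rna_id).
    """
    interacting, negative = [], []
    for rna_id, rna_atoms in rna_chains.items():
        for prot_id, prot_atoms in protein_chains.items():
            # Interacting if any RNA atom is closer than the threshold
            # to any protein atom.
            close = any(
                math.dist(a, b) < threshold
                for a in rna_atoms
                for b in prot_atoms
            )
            (interacting if close else negative).append((prot_id, rna_id))
    return interacting, negative
```

The all-against-all atom comparison is quadratic in the number of atoms; for large complexes a spatial index would be a natural replacement, but the brute-force form keeps the criterion easy to read.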