Table 2.
Summary statistics of the different datasets analysed in this study
Dataset | Type | # Structures | # Sites | # Ligands | Overlap (%) | Methods |
LIGYSIS | NEW | 3448 | 8244 | 65,116+ | – | – |
LIGYSISNI | NEW | 2275 | 4572 | 38,595 | – | – |
sc-PDBFULL | TRAIN | 17,594+ | 17,594+ | 17,594 | 801− (9.7) | VN-EGNN, GrASP, PUResNet, DeepPocket |
bMOADSUB | TRAIN | 5899 | 11,184 | 11,184 | 606 (7.6) | IF-SitePred |
CHEN11 | TRAIN | 244− | 479− | 479− | 40+ (0.5) | PRANK, P2Rank |
PDBbindREF | TEST | 5316 | 5316 | 5316 | 310 (3.8) | VN-EGNN |
SC6K | TEST | 6147 | 6147 | 6147 | 259 (3.1) | DeepPocket |
HOLO4K | TEST | 4009 | 10,175 | 10,175 | 207 (2.5) | ALL* |
COACH420 | TEST | 413 | 624 | 624 | 41 (0.5) | VN-EGNN, GrASP, DeepPocket, P2Rank, PUResNet |
JOINED | TEST | 557 | 752 | 752 | 110 (1.3) | PRANK |
LIGYSIS is our reference dataset and LIGYSISNI is its subset with no ion (NI) ligand binding sites; sc-PDBFULL, bMOADSUB and CHEN11 constitute the training datasets, whereas PDBbindREF, SC6K, HOLO4K, COACH420 and JOINED represent test sets. # Structures, # Sites and # Ligands are the number of PDB structures, ligand sites and ligands in each dataset. Note that for LIGYSIS and LIGYSISNI, 3448 and 2775 are the numbers of human structural segments considered, each represented by a single chain. For each segment, all biologically relevant ligand-binding structures were considered: N = 23,321 (LIGYSIS) and N = 19,012 (LIGYSISNI). The number of ligands, or protein–ligand complexes, is not equal to the number of sites for LIGYSIS because data from multiple structures of the same protein are aggregated into unique sites, i.e., a LIGYSIS site often includes multiple ligands. Overlap is the number of LIGYSIS binding sites represented by at least one protein–ligand complex in a given dataset; the percentage relative to LIGYSIS is reported in parentheses. Methods lists the ligand site predictors that use each dataset for training or testing. Only the original version of each dataset is considered in the analysis, e.g., HOLO4K is analysed, but not HOLO4KMlig, HOLO4KMlig+, HAP or HAP-small. The same goes for the Mlig and Mlig+ versions of COACH420, sc-PDBSUB and sc-PDBRICH. ALL* represents all the methods compared in this work except for PRANK, fpocket, PocketFinder+, Ligsite+ and Surfnet+. For # Structures, # Sites and # Ligands, the highest values are indicated with a bold “+” superscript and the lowest with “−”; for Overlap, the markers are reversed.
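As an illustration of how the Overlap percentages can be read, the following minimal Python sketch recomputes them for three of the datasets. It assumes that the denominator is the 8244 LIGYSIS sites listed in the # Sites column and that values are rounded to one decimal place; the function and variable names are hypothetical and are not taken from the study's code.

```python
# Number of LIGYSIS binding sites, taken from the "# Sites" column of Table 2.
# Using it as the denominator is an assumption based on the caption.
LIGYSIS_SITES = 8244


def overlap_percentage(overlap_sites: int, total_sites: int = LIGYSIS_SITES) -> float:
    """Return the share of LIGYSIS sites covered by a dataset, in percent (one decimal)."""
    return round(100 * overlap_sites / total_sites, 1)


# Overlap counts copied from Table 2 for three of the datasets.
overlaps = {"sc-PDBFULL": 801, "PDBbindREF": 310, "COACH420": 41}
percentages = {name: overlap_percentage(n) for name, n in overlaps.items()}
print(percentages)
```

Under these assumptions the sketch gives 9.7, 3.8 and 0.5 for the three datasets, matching the values in parentheses in the Overlap column.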