. 2023 Dec 13;15(9):3130–3139. doi: 10.1039/d3sc04185a

Data sets used to train the five selected machine learning-based docking methods. All five DL-based methods were trained on subsets of the PDBbind 2020 General Set.

Method	Training and validation set
DeepDock	PDBbind 2019 General Set, excluding complexes that are included in CASF-2016 or that fail pre-processing (16 367 complexes)
DiffDock, EquiBind	PDBbind 2020 General Set, keeping complexes published before 2019 and excluding those whose ligands are found in the test set (17 347 complexes)
TankBind	PDBbind 2020 General Set, keeping complexes published before 2019 and excluding those that fail pre-processing (18 755 complexes)
Uni-Mol	PDBbind 2020 General Set, excluding complexes whose protein sequence identity (MMseqs2) with CASF-2016 is above 40% and whose ligand fingerprint similarity is above 80% (18 404 complexes)
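
The two kinds of filtering rule in the table, a time-based split and a similarity-based exclusion, can be illustrated with a minimal Python sketch. It is not the code used by any of the listed methods. The `Complex` record, its field names, the function names and the example entries are all hypothetical, and the similarity values are assumed to have been precomputed elsewhere (for example with MMseqs2 for sequences and a fingerprint tool for ligands). Reading the Uni-Mol rule as "exclude only when both thresholds are exceeded" is an interpretation of the table wording, not a confirmed detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complex:
    """Hypothetical record for one protein-ligand complex."""
    pdb_id: str
    year: int                     # publication year
    ligand_id: str                # illustrative ligand identifier
    max_seq_identity: float       # assumed precomputed, 0-1, vs. reference proteins
    max_ligand_similarity: float  # assumed precomputed, 0-1, vs. reference ligands


def time_split(complexes, test_ligands, cutoff_year=2019):
    """Keep complexes published before cutoff_year whose ligand is not in the test set."""
    return [
        c for c in complexes
        if c.year < cutoff_year and c.ligand_id not in test_ligands
    ]


def similarity_filter(complexes, seq_cutoff=0.40, lig_cutoff=0.80):
    """Drop complexes that exceed BOTH the sequence and the ligand cutoff (assumed reading)."""
    return [
        c for c in complexes
        if not (c.max_seq_identity > seq_cutoff
                and c.max_ligand_similarity > lig_cutoff)
    ]


# Made-up entries purely for illustration; these are not real PDBbind records.
examples = [
    Complex("aaaa", 2015, "LIG1", 0.95, 0.90),
    Complex("bbbb", 2018, "LIG2", 0.95, 0.30),
    Complex("cccc", 2019, "LIG3", 0.20, 0.10),
    Complex("dddd", 2017, "LIG4", 0.10, 0.85),
]

kept_by_time = [c.pdb_id for c in time_split(examples, test_ligands={"LIG4"})]
kept_by_similarity = [c.pdb_id for c in similarity_filter(examples)]
print(kept_by_time)
print(kept_by_similarity)
```

In this toy input the time split removes the 2019 entry and the entry whose ligand is in the test set, while the similarity filter removes only the entry that is above both cutoffs.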