Author manuscript; available in PMC: 2022 Sep 7.

Published in final edited form as: Expert Opin Drug Discov. 2021 Jun 15;16(9):1009–1023. doi: 10.1080/17460441.2021.1925247

TABLE I.

Some of the common datasets used for molecular machine learning and data-driven property prediction.

Name	Description	Number of molecules	Possible bias
ZINC [1]	Database of commercially available compounds together with very simple estimated molecular properties for virtual screening.	1.4 billion	Inherently biased by currently synthesizable chemical space. Consequently, its molecular shapes have been shown to be strongly biased against sphere-like molecules.
QM9 [2]	Electronic properties estimated using density functional theory (DFT) simulations.	134 thousand	Biased towards small molecules only containing the elements C, H, N, O and F.
PubChemQC [3, 4]	Geometries and electronic properties of molecules with short string representations taken from PubChem.	221 million	Biased towards small molecules that have been reported in the literature before.
Tox21 [5]	Toxicological properties of molecules with respect to 12 different assays.	13 thousand	Biased towards environmental compounds and approved drugs.
ToxCast [6]	High-throughput screening and computational data for the toxicology of molecules from industry, consumer products and the food industry based on cell assays.	1.8 thousand	Biased towards molecules used in industry, consumer products and the food industry.
ClinTox [7]	Drugs and drug candidates that made it to clinical trials and were either approved or failed.	1.5 thousand	Biased towards drugs that made it to clinical trials.
SIDER [8]	Recorded adverse drug reactions of marketed drugs.	1.4 thousand	Biased towards marketed drugs.
ChEMBL [9]	Bioactive small molecules and their activities extracted from the literature, from clinical trials and from other databases.	2.0 million	Biased towards compounds for which bioactivity was published in the scientific literature.
DUD-E [10]	Ligand binding affinities against 102 distinct target proteins with both strong and weak binders.	23 thousand	Biased towards molecules that have been synthesized and evaluated for binding affinity.
AqSolDB [11]	Aqueous solubility data of organic molecules taken from 9 different datasets.	10 thousand	Biased towards organic molecules with relatively high aqueous solubility.
Olfaction Prediction Challenge [11]	Olfactory perception of organic molecules at different concentrations.	0.5 thousand	Biased towards small and volatile organic molecules. Results biased by familiarity of smells.
FreeSolv [12]	Experimental and computed hydration free energies of small and neutral molecules.	0.6 thousand	Biased towards small and neutral molecules that have been studied in the literature both computationally and experimentally for hydration free energies.
ESOL [13]	Experimental aqueous solubility combining datasets for small molecules from the literature, for medium-sized molecules used as pesticides and larger proprietary compounds from the pharmaceutical industry.	2.9 thousand	The sub-groups each have a different bias as they each have different application domains.
Lipophilicity [14, 15]	Experimental n-octanol/water (buffered at pH 7.4) distribution coefficient of organic molecules taken from other databases.	1.1 thousand	Biased towards molecules with distribution coefficients between −10 and 10.
PubChem Bioassay [16]	Bioactivity outcomes from high-throughput screenings of molecules.	2.3 million	Biased towards molecules of interest and molecules that are synthesizable.
PDBbind [17, 18]	Experimental binding affinity for biomolecular complexes deposited in the Protein Data Bank (PDB).	21.4 thousand	Biased towards complexes with available crystal structures.
BBBP [19]	The blood-brain barrier penetration partition coefficient for molecules collected from the literature.	2.1 thousand	Biased towards molecules studied in the literature for blood-brain barrier penetration.
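
The biases listed in the table bear directly on where a model trained on a given dataset can be trusted. As a minimal sketch, and not something taken from the original work, the element restriction noted for QM9 (C, H, N, O and F) could be turned into a simple applicability check before a trained model is applied to a new molecule. The function names and the use of plain molecular formulas as input are assumptions made for illustration only; a real workflow would more likely parse structures with a cheminformatics toolkit.

```python
import re

# Elements covered by QM9 according to the table (C, H, N, O and F).
QM9_ELEMENTS = frozenset({"C", "H", "N", "O", "F"})


def elements_in_formula(formula):
    """Return the set of element symbols in a simple molecular formula.

    Hypothetical helper: it assumes a conventional formula such as
    'C2H6O', where each symbol is one uppercase letter optionally
    followed by one lowercase letter. Charges, isotopes and
    parentheses are not handled.
    """
    return set(re.findall(r"[A-Z][a-z]?", formula))


def within_element_domain(formula, allowed=QM9_ELEMENTS):
    """Return True if every element in the formula is in the allowed set."""
    return elements_in_formula(formula) <= allowed


# Ethanol contains only C, H and O, so it lies inside the element domain.
print(within_element_domain("C2H6O"))
# Chloroethane contains Cl, which the table lists as outside QM9's elements.
print(within_element_domain("C2H5Cl"))
```

A check of this kind addresses only one of the biases in the table (elemental composition). Molecular size, synthesizability and literature-reporting biases would need separate and usually less clear-cut tests.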