ACS Omega. 2022 Jul 24;7(30):25958–25973. doi: 10.1021/acsomega.2c03264

Table 1. Nonexhaustive Set of Publicly Available Datasets That Can Be Used for a Wide Range of Machine-Learning Algorithms in Materials Science.

| Data set | Features | References |
| --- | --- | --- |
| PubChem | Information about 109,907,032 unique chemical structures, 271,129,167 chemical entities, and 1,366,265 bioassays | (69) |
| ZINC | Over 230,000,000 commercially available compounds for virtual screening in drug discovery | (70) |
| ChEMBL | Over 2,086,898 bioactive molecules and 17,726,334 bioactivities for effective drug discovery | (71) |
| ChemDB | Over 5,000,000 commercially available small molecules intended for drug discovery | (72) |
| ChemSpider | Chemical structure data for over 100,000,000 structures from 276 data sources | (73) |
| DrugBank | Over 200 data fields each for 14,853 small drugs (including 2687 approved small drugs) | (74) |
| The Materials Project | Information about 131,613 inorganic compounds, 49,705 molecules, 530,243 nanoporous materials, and 76,194 band structures | (75) |
| GDB9 | 17 properties each of 134,000 neutral molecules with up to nine atoms (C, O, N, F), not counting hydrogen | (76) |
| Pauling File | Information about 310,000 crystal structures for 140,000 different phases, 44,000 phase diagrams, and 120,000 physical properties | (77) |
| AFLOW | Information about 3,513,989 material compounds with over 695,769,822 computed properties | (78) |
| HTEM DB | 37,093 compositional, 47,213 structural, 26,577 optical, and 12,849 electrical properties of thin films | (79) |
| OQMD | DFT-calculated structural and thermodynamic properties of 815,654 materials | (80), (81) |
| SuperCon | Superconducting properties of 33,284 oxide and metallic samples and 564 organic samples from the literature | (82) |