2024 Oct 9;10(20):e39038. doi: 10.1016/j.heliyon.2024.e39038

Table 2.

Datasets used in cheminformatics for pre-training and fine-tuning transformer-based CLMs.

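Pre-training (SMILES databases): ZINC [5], [56]; PubChem [57], [58]; ChEMBL [59], [60].

Fine-tuning:

| Category | Dataset | Task type | # Tasks | # Compounds |
|---|---|---|---|---|
| Physical Chemistry [61] | ESOL | R | 1 | 1128 |
| Physical Chemistry [61] | FreeSolv | R | 1 | 642 |
| Physical Chemistry [61] | Lipophilicity | R | 1 | 4200 |
| Biophysics [61] | PCBA | C | 128 | 437929 |
| Biophysics [61] | MUV | C | 17 | 93087 |
| Biophysics [61] | HIV | C | 1 | 41127 |
| Biophysics [61] | PDBbind | R | 1 | 11908 |
| Biophysics [61] | BACE | C | 1 | 15513 |
| Physiology [61] | BBBP | C | 1 | 2039 |
| Physiology [61] | Tox21 | C | 12 | 7831 |
| Physiology [61] | ToxCast | C | 617 | 8575 |
| Physiology [61] | SIDER | C | 27 | 1427 |
| Physiology [61] | ClinTox | C | 2 | 1478 |
| Proposed [62], [63], [64] | Antimalarial | C | 1 | 4794 |
| Proposed [62], [63], [64] | Cocrystals | C | 1 | 3282 |
| Proposed [62], [63], [64] | Covid | C | 1 | 740 |
| Proposed [62], [63], [64] | Genes | R | 1 | 201 |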

R - Regression.

C - Classification.
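
As an illustration only, the fine-tuning rows of Table 2 can be held in a small Python structure and summarized by task type or category. This is a minimal sketch, not code from the paper: the names `Benchmark`, `FINE_TUNING`, `summarize_by_task_type`, and `compounds_by_category` are hypothetical, and the numbers are copied from the table as printed.

```python
from collections import defaultdict
from typing import NamedTuple


class Benchmark(NamedTuple):
    """One fine-tuning dataset row from Table 2 (hypothetical container)."""
    category: str
    name: str
    task_type: str  # "R" = regression, "C" = classification
    n_tasks: int
    n_compounds: int


# Values transcribed from Table 2 as printed; nothing here is re-derived.
FINE_TUNING = [
    Benchmark("Physical Chemistry", "ESOL", "R", 1, 1128),
    Benchmark("Physical Chemistry", "FreeSolv", "R", 1, 642),
    Benchmark("Physical Chemistry", "Lipophilicity", "R", 1, 4200),
    Benchmark("Biophysics", "PCBA", "C", 128, 437929),
    Benchmark("Biophysics", "MUV", "C", 17, 93087),
    Benchmark("Biophysics", "HIV", "C", 1, 41127),
    Benchmark("Biophysics", "PDBbind", "R", 1, 11908),
    Benchmark("Biophysics", "BACE", "C", 1, 15513),
    Benchmark("Physiology", "BBBP", "C", 1, 2039),
    Benchmark("Physiology", "Tox21", "C", 12, 7831),
    Benchmark("Physiology", "ToxCast", "C", 617, 8575),
    Benchmark("Physiology", "SIDER", "C", 27, 1427),
    Benchmark("Physiology", "ClinTox", "C", 2, 1478),
    Benchmark("Proposed", "Antimalarial", "C", 1, 4794),
    Benchmark("Proposed", "Cocrystals", "C", 1, 3282),
    Benchmark("Proposed", "Covid", "C", 1, 740),
    Benchmark("Proposed", "Genes", "R", 1, 201),
]


def summarize_by_task_type(rows):
    """Return {task_type: (number of datasets, total number of tasks)}."""
    datasets = defaultdict(int)
    tasks = defaultdict(int)
    for row in rows:
        datasets[row.task_type] += 1
        tasks[row.task_type] += row.n_tasks
    return {t: (datasets[t], tasks[t]) for t in datasets}


def compounds_by_category(rows):
    """Return {category: sum of the '# Compounds' column for that category}."""
    totals = defaultdict(int)
    for row in rows:
        totals[row.category] += row.n_compounds
    return dict(totals)


if __name__ == "__main__":
    for task_type, (n_sets, n_tasks) in summarize_by_task_type(FINE_TUNING).items():
        print(f"{task_type}: {n_sets} datasets, {n_tasks} tasks")
    for category, total in compounds_by_category(FINE_TUNING).items():
        print(f"{category}: {total} compounds")
```

A flat list of records keeps each row of the table intact, so filtering (for example, only the single-task classification sets) is a one-line comprehension. The per-category totals simply add the compound counts of each dataset and do not account for molecules shared between datasets.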