. 2024 Mar 13;3(4):786–795. doi: 10.1039/d3dd00217a

Metrics for the best models found in the current study (upper section) and for other state-of-the-art models available in the literature (lower section). Literature values were taken from the cited references. Missing values correspond to entries that the cited authors did not report. SolChal columns refer to the solubility challenges: 2_1 is the tight dataset (set-1) and 2_2 is the loose dataset (set-2), as described in the original paper (see ref. 30). The best-performing model in each dataset has its RMSE value in bold.

Model Solubility challenge 1 Solubility challenge 2_1 Solubility challenge 2_2
RMSE MAE r RMSE MAE r RMSE MAE r
RF 1.121 0.914 0.950 0.727 1.205 1.002
DNN 1.540 1.214 1.315 1.035 1.879 1.381
DNNAug 1.261 1.007 1.371 1.085 2.189 1.710
kde4LSTMAug 1.273 0.984 1.137 0.932 1.511 1.128 1.397 1.131
kde8LSTMAug 1.247 0.984 1.044 0.846 1.418 1.118 1.676 1.339
kde10LSTMAug 1.095 0.843 0.983 0.793 1.263 1.051 1.316 1.089
Linear regression (ref. 25) 0.75
UG-RNN (ref. 34) 0.90 0.74
RF w/CDF descriptors (ref. 27) 0.93
RF w/Morgan fingerprints (ref. 36) 0.64
Consensus (ref. 88) 0.91
GNN (ref. 89) ∼1.10 0.91 1.17
SolvBert (ref. 90) 0.925
SolTranNet^a (ref. 41) 1.004 1.295 2.99
SMILES-BERT^b (ref. 91) 0.47
MolBERT^b (ref. 40) 0.531
RT^b (ref. 42) 0.73
MolFormer^b (ref. 43) 0.278
^a Has overlap between training and test sets.
^b Pre-trained model was fine-tuned on ESOL.
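
For readers unfamiliar with the three metrics named in the table header, the following is a minimal sketch of how RMSE, MAE, and the Pearson correlation coefficient r are conventionally computed from paired measured and predicted values. It uses the standard textbook definitions and is not taken from the authors' code; the function names and the four sample values are hypothetical and chosen only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical measured and predicted values (illustration only, not data
# from the study).
y_true = [-2.0, -3.5, -1.0, -4.0]
y_pred = [-2.5, -3.0, -1.5, -3.5]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"MAE  = {mae(y_true, y_pred):.3f}")
print(f"r    = {pearson_r(y_true, y_pred):.3f}")
```

Because RMSE squares each residual before averaging, it penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions; the two coincide only when every residual has the same magnitude, as in this toy example.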