Abstract
Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computation times for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, which explains the sustained popularity of group contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves the computing burden onto the website visitor's device without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is usable at https://mol.dev.
We propose a new way of deploying deep learning models to improve reproducibility and usability, making predictions with uncertainty.
1. Introduction
Aqueous solubility measures the maximum quantity of matter that can be dissolved in a given volume of water. It depends on several conditions, such as temperature, pressure, pH, and the physicochemical properties of the compound being solvated.1 The solubility of molecules is essential in many chemistry-related fields, including drug development,2–5 protein design,6 chemical7,8 and separation9 processes. In drug development, for instance, compounds with biological activity may not have enough bioavailability due to inadequate aqueous solubility.
Solubility prediction is essential, driving the development of various methods, from physics-based approaches—including first principles,10,11 semi-empirical equations,12–14 molecular dynamics (MD),15–18 and quantum computations19—to empirical methods like quantitative structure–property relationship (QSPR)20–23 and multiple linear regression (MLR).24,25 Despite their sophistication, physics-based models often present complexity that limits accessibility to advanced users26 and do not guarantee higher accuracy than empirical methods.27 Data-driven models emerge as efficient alternatives, capable of outperforming physics-based models.26 However, achieving accurate and reliable solubility predictions remains a significant challenge.26,28
To address the persistent issues of systematic bias and non-reproducibility in aqueous solubility datasets, Llinàs et al.29,30 introduced two solubility challenges featuring consistent data. The first challenge ranked participants based on the root mean square error (RMSE) and the accuracy within a ±0.5 log S error range.31 The second challenge revealed that, despite the freedom in method selection, all entries relied on QSPR or machine learning (ML) techniques,32 yet did not achieve a notable improvement over the first challenge.31 These challenges highlighted the importance of data quality over model selection for accurate solubility predictions.32 Sorkun et al.28 further emphasized this by demonstrating how data quality assessments on subsets of AqSolDB1 significantly impacted model performance.
McDonagh et al.27 demonstrated that cheminformatics methods surpass first-principles theoretical calculations for computing solubilization free energies, highlighting the superior accuracy of cheminformatics and the efficacy of random forest models, evidenced by an RMSE of 0.93 on Llinàs' first dataset.29 Data-driven approaches, particularly feature-based models, have contributed to accurate aqueous solubility prediction. Delaney25 used MLR to develop a model called Estimated SOLubility (ESOL), fitted on a dataset of 2874 small organic molecules with an average absolute error (AAE) of 0.83. Comparable performance has been achieved using various methods including MLR,24 Gaussian processes,33 undirected graph recurrent neural networks (UG-RNN),34 deep neural networks (DNN),35 and random forests (RF).36,37
Recently, transformer38 models have been applied to compute the solubility of small molecules.39–43 Francoeur and Koes41 developed SolTranNet, a transformer model trained on AqSolDB1 solubility data. Notably, this architecture achieves an RMSE of only 0.278 when trained and evaluated on the original ESOL25 dataset using a random split. Nevertheless, it shows an RMSE of 2.99 when trained on AqSolDB1 and evaluated on ESOL. This suggests that the molecules present in ESOL may have low variability, meaning that samples in the test set are similar to samples in the training set. Hence, models trained on the ESOL training set performed excellently when evaluated on the ESOL test set.
Solubility models should ideally combine accuracy with ease of access. Thus, a common idea is to use web servers to provide easier public access. However, web servers demand continuous financial and time investments for maintenance, leading to the eventual disappearance of some, despite having institutional or government backing.44 For instance, eight out of 89 web server tools featured in the 2020 Nucleic Acids Research special web server issue were offline by the end of 2022.45 Moreover, computational demands can be significant, with tools like RoseTTAFold46 and ATB47 requiring hours to days for job completion, thus creating potential delays due to long queues and wait times.48
An alternative approach is to perform the computation directly on the user's device, removing the need for server maintenance and cost. This method allows hosting the website as a static file on platforms such as GitHub, with potential archiving on the Internet Archive.† We explored this approach in Ansari and White49 for bioinformatics. Our web application implements a deep ensemble50 recurrent neural network (RNN) capable of extracting information directly from molecular string representations, such as SMILES51 or SELFIES,52 which can be quickly and easily obtained.53,54
The primary difficulty lies in the application's dependence on the device's capabilities, which is crucial for smartphones with limited resources. When balancing performance against low-resource settings, transformer models38 become impractical due to their large size, which exceeds smartphone memory, and their prolonged inference times. Additionally, our model implements a deep ensemble to calibrate uncertainties, making the application of transformers even more infeasible. In contrast, using descriptors is an easy way to convey physical information to the model and, consequently, enables smaller models. However, descriptor computation is time-intensive: in our tests, using PaDEL to compute descriptors for all molecules in AqSolDB took roughly 20 hours. Furthermore, feature-based model development requires specialized knowledge for feature selection55 and is limited by the regions of chemical space these descriptors cover.56 Even application usage may require specialized data, as Kurotani et al.37 illustrate. RNNs present an alternative that extracts properties directly from string representations while allowing for adaptable computational resource management.
In this work, we developed a front-end application using the JavaScript (JS) implementation of the TensorFlow framework.57 Our application can be used to predict the solubility of small molecules with uncertainty. To calibrate the confidence of its predictions, our model implements a deep ensemble approach,50 which allows model uncertainty to be reported alongside each prediction. Our solution implements a deep ensemble of RNN models specifically designed to achieve satisfactory performance while running in an environment without strong computational resources. This application runs locally on the user's device and can be accessed at https://mol.dev/. Mol.dev does not store the data input for predictions in any way.
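To illustrate how a model trained in Python ends up running in the browser, the sketch below converts a trained Keras model into the TensorFlow.js web format (a model.json file plus binary weight shards) that a static page can fetch and execute client-side. It is a minimal example assuming the tensorflowjs Python package; the file and directory names are placeholders, not the ones used in the mol.dev repository.

```python
# Minimal sketch: export one trained ensemble member for in-browser inference.
# Assumes `pip install tensorflowjs`; paths are illustrative placeholders.
import tensorflow as tf
import tensorflowjs as tfjs

member = tf.keras.models.load_model("ensemble_member_0.h5")   # hypothetical file
tfjs.converters.save_keras_model(member, "web/model_0")       # writes model.json + weight shards
```

The exported directory can then be served alongside the static HTML and loaded with TensorFlow.js, so all inference happens on the visitor's device.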
2. Methods
2.1. Dataset
The data used for training the models were obtained from AqSolDB.1 This database combined and curated data from 9 different aqueous solubility datasets. The main concern in using a large, curated database is to avoid problems with the generalizability of the model58 and with the fidelity of the data.59 AqSolDB consists of aqueous solubility (Log S) values for 9982 unique molecules extended with 17 topological and physicochemical 2D descriptors calculated by RDKit.60
We augmented AqSolDB to 96 625 molecules using SMILES randomization.61,62 Each entry of AqSolDB was used to generate at most ten new unique randomized SMILES strings. Training the model on multiple representations of the same molecule improves its ability to learn the chemical space constraints of the training set, as demonstrated in previous studies.61,62 Duplicates were removed.
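The augmentation step can be sketched as follows, assuming RDKit for SMILES parsing and randomization; the function name and the oversampling factor are illustrative, not taken from the training code.

```python
# Sketch of SMILES randomization: generate up to `n_max` unique, non-canonical
# SMILES per molecule, each reused with the same solubility label.
from rdkit import Chem

def randomized_smiles(smiles: str, n_max: int = 10) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(10 * n_max):  # oversample, then deduplicate
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_max:
            break
    return sorted(variants)

print(randomized_smiles("OC(=O)c1ccccc1O"))  # up to 10 renderings of salicylic acid
```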
After shuffling, the augmented dataset was split into 80%/20% training and test sets, respectively. The curated datasets from the solubility challenges29,32 were used as withheld validation data to evaluate the model's ability to predict solubility for unseen compounds. To refer to the validation datasets, we labeled the first solubility challenge dataset as “solubility challenge 1” and the two sets from the second solubility challenge as “solubility challenge 2_1” and “solubility challenge 2_2”, respectively. Molecules in these three datasets were not present in the training and test datasets.
2.2. Model architecture
Our model uses a deep ensemble approach as described by Lakshminarayanan et al.50. This technique was selected due to its ability to estimate prediction uncertainty, thus enhancing the predictive capability of our model. The uncertainty of a model can be divided into two sources: aleatoric uncertainty (AU) and epistemic uncertainty (EU).63,64 These uncertainties quantify the intrinsic uncertainty inherent in data observations and the disagreement among model estimations, respectively.65
Given a model that outputs two values, $\mu_m(x)$ and $\sigma_m^2(x)$, that characterize a normal distribution $\mathcal{N}\left(\mu_m(x), \sigma_m^2(x)\right)$, a deep ensemble combines N such models to estimate prediction uncertainty. For a given data point $x$, the ensemble prediction and its variance are computed as follows:

$$\mu_*(x) = \frac{1}{N}\sum_{m=1}^{N}\mu_m(x) \qquad (1)$$

$$\sigma_*^2(x) = \mathrm{ale}^2(x) + \mathrm{epi}^2(x) = \frac{1}{N}\sum_{m=1}^{N}\sigma_m^2(x) + \frac{1}{N}\sum_{m=1}^{N}\left(\mu_m(x) - \mu_*(x)\right)^2 \qquad (2)$$

where ale² is the AU, epi² is the EU, N is the ensemble size, and m indexes the models in the ensemble.
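A minimal sketch of this aggregation, assuming each of the N trained members returns a mean and a variance for the same input (array shapes and example numbers are illustrative):

```python
# Combine per-model means/variances into the ensemble prediction of eqn (1)-(2).
import numpy as np

def ensemble_predict(mus: np.ndarray, sigma2s: np.ndarray):
    """mus, sigma2s: shape (N,) arrays of per-model mu_m(x) and sigma_m^2(x)."""
    mu_star = mus.mean()                    # eqn (1): ensemble mean
    ale2 = sigma2s.mean()                   # aleatoric term of eqn (2)
    epi2 = np.mean((mus - mu_star) ** 2)    # epistemic term of eqn (2)
    return mu_star, ale2, epi2

mu, ale2, epi2 = ensemble_predict(np.array([-2.1, -1.8, -2.4, -2.0]),
                                  np.array([0.30, 0.25, 0.40, 0.35]))
print(mu, ale2 + epi2)  # predicted log S and total predictive variance
```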
We used a deep neural network (DNN) implemented with Keras66 and TensorFlow67 to build the deep ensemble. Our DNN model uses Self-referencing embedded strings (SELFIES)52 tokens as input. A pre-defined vocabulary was created by analyzing all training data: each unique SELFIES symbol was assigned a corresponding integer, yielding 273 distinct tokens. Simplified molecular-input line-entry system (SMILES)51 or SELFIES52 molecule representations are converted to tokens based on this pre-defined vocabulary. Fig. 1 illustrates the model architecture. The network can be divided into three sections: (i) embedding, (ii) bi-RNN, and (iii) fully connected NN.
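A sketch of this tokenization, assuming the selfies Python package; the vocabulary here is built from a toy training list and the variable names are ours:

```python
# Build an integer vocabulary from training SELFIES and tokenize new molecules.
import selfies as sf

train_selfies = [sf.encoder(s) for s in ["CCO", "c1ccccc1O", "CC(=O)O"]]
alphabet = sorted(sf.get_alphabet_from_selfies(train_selfies))
vocab = {symbol: i + 1 for i, symbol in enumerate(alphabet)}  # 0 reserved for padding

def tokenize(smiles: str) -> list[int]:
    symbols = sf.split_selfies(sf.encoder(smiles))   # SMILES -> SELFIES -> symbols
    return [vocab[s] for s in symbols]

print(tokenize("CCO"))  # integer ids for the symbols [C][C][O]
```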
Fig. 1. Scheme of the deep learning DNN. The molecule is input using the SMILES or SELFIES representation. This representation is converted to a tokenized input based on a vocabulary obtained from the training dataset. A set of such models forms the deep ensemble. Each model consists of an embedding layer, two bidirectional RNN (bi-RNN) layers, a normalization layer, and a fully connected stack that reduces the dimensionality in three steps. Dropout layers are present after the embedding layer and after each fully connected layer during training, but they are not shown in this scheme. Predictions of the models in the ensemble are then aggregated.
The embedding layer converts a list of discrete tokens into a fixed-length vector space. Working on a continuous vector space has two main advantages: it uses a more compact representation, and semantically similar symbols can be described closely in vector space. Our embedding layer has an input dimension of 273 (vocabulary size) and an output dimension of 64.
Following the embedding layer, the data are fed into the bidirectional recurrent neural network (bi-RNN) layers. We used two RNN layers, each containing 64 units. The effects of using Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM)68 layers as the RNN layers were investigated (refer to Section 3.1). The use of bi-RNN layers was motivated by our previous work,49 in which LSTMs improved the model's performance for predicting peptide properties from their sequences. More details regarding RNN, LSTM, and GRU layers can be found in ref. 69.
The output of the bi-RNN stack undergoes normalization via Layer Normalization.70 There is no agreement on why Layer Normalization improves model performance.71–74 The absence of a comprehensive theoretical understanding of normalization effects hinders the development of novel regularization schemes.75 Despite this limited understanding, Layer Normalization is employed due to its demonstrated effectiveness.74
After normalization, the data are processed through dense layers containing 32 and 16 units. The 16-unit layer's output is fed to two parallel 1-unit layers: one uses a linear activation and the other a softplus activation, producing $\mu$ and $\sigma^2$, respectively.
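For concreteness, the sketch below builds one ensemble member in Keras following this description. The vocabulary size, layer widths, and dropout rate match the text; the hidden-layer activations and the masking of padded tokens are not specified in the paper and are assumptions here.

```python
# Sketch of a single ensemble member: embedding -> two bi-RNN layers ->
# layer normalization -> dense stack -> parallel mu / sigma^2 heads.
import tensorflow as tf
from tensorflow.keras import layers

def build_member(vocab_size: int = 273, rnn=layers.LSTM) -> tf.keras.Model:
    tokens = layers.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(vocab_size, 64, mask_zero=True)(tokens)  # masking assumed
    x = layers.Dropout(0.35)(x)
    x = layers.Bidirectional(rnn(64, return_sequences=True))(x)
    x = layers.Bidirectional(rnn(64))(x)
    x = layers.LayerNormalization()(x)
    x = layers.Dropout(0.35)(layers.Dense(32, activation="relu")(x))  # activation assumed
    x = layers.Dropout(0.35)(layers.Dense(16, activation="relu")(x))
    mu = layers.Dense(1)(x)                              # linear head
    sigma2 = layers.Dense(1, activation="softplus")(x)   # positive-variance head
    return tf.keras.Model(tokens, [mu, sigma2])

member = build_member()          # pass rnn=layers.GRU for the GRU variant
member.summary()
```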
The negative log-likelihood loss $\ell$ was used to train the model. It is the negative logarithm of the probability of observing the label $y$ given the input $x$:

$$\ell(x, y) = \frac{\log \sigma^2(x)}{2} + \frac{\left(y - \mu(x)\right)^2}{2\sigma^2(x)} + \text{constant} \qquad (3)$$
During the training phase, dropout layers with a dropout rate of 0.35 were placed after the embedding layer and after each dense layer to mitigate over-fitting.76 Models were trained using the Adam77 optimizer with a fixed learning rate of 0.0001 and default values for β1 and β2 (0.9 and 0.999, respectively).
Our model employs adversarial training, following the approach proposed by Lakshminarayanan et al.,50 to improve the robustness of our model predictions. Because the input for our model is a discrete sequence, we generate adversarial examples by modifying the embedded representation of the input data. Each iteration in the training phase consists of first computing the loss using eqn (3) and a second step with a new input $x'$ to smooth the model's prediction:

$$x' = x + \varepsilon\,\operatorname{sign}\left(\nabla_x \ell(x, y)\right) \qquad (4)$$

where ε is the strength of the adversarial perturbation.
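The training objective and the adversarial step can be sketched as below, assuming the member network is split into an embedding stage (`embed`) and the remaining layers (`head`) so the perturbation can be applied to the embedded input; the split, the function names, and the value of ε are illustrative, not the repository's actual training loop.

```python
# One training step: heteroscedastic NLL (eqn (3)) plus the fast-gradient-sign
# perturbation of the embedded input (eqn (4)).
import tensorflow as tf

def nll(y, mu, sigma2):
    # negative log-likelihood of y under N(mu, sigma2), up to an additive constant
    return tf.reduce_mean(0.5 * tf.math.log(sigma2) + 0.5 * tf.square(y - mu) / sigma2)

def train_step(embed, head, optimizer, tokens, y, eps=1e-3):
    # first pass: clean loss and its gradient w.r.t. the embedded representation
    with tf.GradientTape() as tape:
        e = embed(tokens)
        tape.watch(e)
        mu, sigma2 = head(e, training=True)
        clean_loss = nll(y, mu, sigma2)
    e_adv = e + eps * tf.sign(tape.gradient(clean_loss, e))   # eqn (4)

    # second pass: optimize the sum of the clean and adversarial losses
    variables = embed.trainable_variables + head.trainable_variables
    with tf.GradientTape() as tape2:
        mu_c, s_c = head(embed(tokens), training=True)
        mu_a, s_a = head(e_adv, training=True)
        total = nll(y, mu_c, s_c) + nll(y, mu_a, s_a)
    optimizer.apply_gradients(zip(tape2.gradient(total, variables), variables))
    return total
```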
Details of the model performance, limitations, training data, ethical considerations, and caveats are available as a model card78 at http://mol.dev/.
3. Results
In order to evaluate the performance of our model using deep ensembles, two baseline models were created: (i) an XGBoost random forest (RF) model using the 17 descriptors available in AqSolDB plus 1809 molecular descriptors calculated by PaDELPy, a Python wrapper for the PaDEL-descriptor79 software, and (ii) a model with the same architecture used in our deep ensemble but trained with an RMSE loss and no ensemble (referred to as DNN). Descriptor-based RFs are considered the state of the art (SOTA) for solubility prediction; we used this baseline to show that our model can achieve SOTA performance using only molecular string representations. In addition, we evaluate the effects of (i) the bi-RNN layer type (GRU or LSTM), (ii) training on an augmented dataset, (iii) adversarial training, and (iv) the ensemble size on the model's performance. Table 1 shows the performance of each of our trained models.
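As a sketch of the descriptor-based baseline (i), the snippet below computes PaDEL descriptors with padelpy and fits an XGBoost random forest. It assumes padelpy, a Java runtime, and xgboost are installed; the molecule list and labels are illustrative only, and descriptor computation over the full AqSolDB is far slower (roughly 20 hours in our tests).

```python
# Toy version of the RF baseline: PaDEL descriptors -> XGBoost random forest.
import pandas as pd
from padelpy import from_smiles          # requires a Java runtime
from xgboost import XGBRFRegressor

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
log_s = [-0.1, -0.6, -1.0]               # made-up labels for illustration

rows = [from_smiles(s, descriptors=True, fingerprints=False) for s in smiles]
X = pd.DataFrame(rows).apply(pd.to_numeric, errors="coerce").fillna(0.0)

rf = XGBRFRegressor(n_estimators=500, random_state=0)
rf.fit(X, log_s)
print(rf.predict(X.iloc[:1]))
```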
Summary of the metrics for each trained model. We used the root mean squared error (RMSE(↓)), mean absolute error (MAE(↓)), and Pearson correlation coefficient (r(↑)) to evaluate our models; the arrows indicate the direction of improvement. Deep ensemble models are referred to as “kdeN”, where N is the ensemble size. Baseline models using a random forest (RF) and the single network employed in the deep ensemble (DNN) are also displayed; the DNN model was trained as described in Section 2. Models trained with data augmentation carry the subscript Aug, a superscript indicates whether the bidirectional layer implements a GRU or an LSTM layer, and models trained without adversarial perturbation are flagged with “-NoAdv”. The columns show the results of each model evaluated on each solubility challenge (SolChal) dataset; 2_1 represents the tight dataset (set-1), while 2_2 represents the loose dataset (set-2), as described in the original paper (see ref. 30). The best-performing model in each dataset is displayed in bold.
| Model | SolChal 1 RMSE | SolChal 1 MAE | SolChal 1 r | SolChal 2_1 RMSE | SolChal 2_1 MAE | SolChal 2_1 r | SolChal 2_2 RMSE | SolChal 2_2 MAE | SolChal 2_2 r |
|---|---|---|---|---|---|---|---|---|---|
| RF | 1.121 | 0.914 | 0.547 | 0.950 | 0.727 | 0.725 | 1.205 | 1.002 | 0.840 |
| DNN | 1.540 | 1.214 | 0.433 | 1.315 | 1.035 | 0.651 | 1.879 | 1.381 | 0.736 |
| DNNAug | 1.261 | 1.007 | 0.453 | 1.371 | 1.085 | 0.453 | 2.189 | 1.710 | 0.386 |
| kde4GRU | 1.610 | 1.145 | 0.462 | 1.413 | 1.114 | 0.604 | 1.488 | 1.220 | 0.704 |
| kde4LSTM | 1.554 | 1.191 | 0.507 | 1.469 | 1.188 | 0.650 | 1.523 | 1.161 | 0.706 |
| kde4GRU-NoAdv | 1.729 | 1.348 | 0.525 | 1.483 | 1.235 | 0.622 | 1.954 | 1.599 | 0.517 |
| kde4LSTM-NoAdv | 1.425 | 1.114 | 0.505 | 1.258 | 0.972 | 0.610 | 1.719 | 1.439 | 0.609 |
| kde4GRUAug | 1.329 | 1.148 | 0.426 | 1.354 | 1.157 | 0.674 | 1.626 | 1.340 | 0.623 |
| kde4LSTMAug | 1.273 | 0.984 | 0.473 | 1.137 | 0.932 | 0.639 | 1.511 | 1.128 | 0.717 |
| kde8LSTMAug | 1.247 | 0.984 | 0.542 | 1.044 | 0.846 | 0.701 | 1.418 | 1.118 | 0.729 |
| kde10LSTMAug-NoAdv | 1.689 | 1.437 | 0.471 | 1.451 | 1.238 | 0.676 | 1.599 | 1.405 | 0.699 |
| kde10LSTMAug | 1.095 | 0.843 | 0.559 | 0.983 | 0.793 | 0.724 | 1.263 | 1.051 | 0.792 |
3.1. Gated layer
The most common RNN layers are the GRU and the LSTM. GRU layers use two gates, reset and update, to control the cell's internal state. On the other hand, LSTM layers use three gates: forget, input, and output, with the same objective. Available studies compare GRU and LSTM performances in RNNs for different applications, for instance: forecasting,80 cryptocurrency,81,82 wind speed,83,84 condition of a paper press,85 motive classification in thematic apperception tests86 and music and raw speech.87 Nevertheless, it is not clear which of those layers would perform better at a given task.
We trained models with four members in the deep ensemble using either GRU or LSTM layers. Metrics can be found in Table 1; for an explanation of the naming syntax used in this work, refer to the Table 1 caption. Using LSTM resulted in a decrease in RMSE and MAE and an increase in the correlation coefficient, indicating better performance. For solubility challenges 1, 2_1, and 2_2, the kde4GRUAug model yielded RMSE values of 1.329, 1.354, and 1.626, respectively, while the kde4LSTMAug model achieved 1.273, 1.137, and 1.511. This trend was also observed for the models trained without data augmentation (see Table 1). Given that LSTM performs better for this model and data, only bi-LSTM layers are considered in the following discussion. These results agree with our previous work,49 in which LSTM layers improved the model's performance.
3.2. Data augmentation
Our model is not intrinsically invariant with respect to the string representation of the input. For instance, both "C(C(C1C(=C(C(=O)O1)O)O)O)O" and "O=C1OC(C(O)CO)C(O)=C1O" are valid SMILES representations of ascorbic acid (see Fig. 1) that are encoded as different SELFIES token sequences. Hence, the model must learn to be invariant to changes in the string representation during training. This can be achieved by augmenting the dataset with SMILES randomization and training the model on different representations with the same label, so that the model learns relations in the chemical space instead of correlating the label with a specific representation.61 With this aim, we evaluated the effects of augmenting the dataset by generating new randomized SMILES representations for each sample.
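This non-invariance is easy to check directly; the snippet below, assuming RDKit and the selfies package, confirms that the two strings above describe the same molecule (identical canonical SMILES) while producing different SELFIES token sequences.

```python
# Same molecule, different token sequences: the model must learn this invariance.
from rdkit import Chem
import selfies as sf

a = "C(C(C1C(=C(C(=O)O1)O)O)O)O"   # ascorbic acid
b = "O=C1OC(C(O)CO)C(O)=C1O"       # ascorbic acid, written differently

canon = lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s))
print(canon(a) == canon(b))                   # True: same molecule
print(list(sf.split_selfies(sf.encoder(a))))
print(list(sf.split_selfies(sf.encoder(b))))  # a different symbol sequence
```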
Augmenting the dataset had a significant impact on the metrics. Improvements of ∼0.5 in RMSE were observed when evaluating on challenge datasets 1 and 2_1, and a gain of ∼0.2 on 2_2 (see Table 1). Concerning the first two datasets, augmenting the data improved every deep ensemble model used in this study. However, surprisingly, data augmentation led to a degradation of the DNN model's performance on the solubility challenge 2_2 dataset. This behavior was not further investigated.
3.3. Adversarial training
Using adversarial training improved performance in the studies of Lakshminarayanan et al.,50 who suggested that it should be used in future applications of their deep ensemble algorithm. Thus, we tested the effects of adversarial perturbation when training models with ensemble sizes of 4 and 10.
Comparing kde4LSTM-NoAdv and kde4LSTM, adversarial training decreased model performance: as seen in Table 1, adversarial perturbation increased the RMSE from 1.425 to 1.554 and from 1.258 to 1.469 on solubility challenge datasets 1 and 2_1, respectively. However, the RMSE decreased from 1.719 to 1.523 on dataset 2_2. Overall, adversarial perturbation changed kde4LSTM's RMSE by roughly ±0.2.
The inconsistent improvement observed with adversarial training was further investigated using models trained on the augmented dataset. Because kde4LSTM was trained without multiple string representations per molecule, it may suffer from generalization problems. Such a generalization issue could direct the adversarial perturbation in a non-physical direction, because the model does not have complete knowledge of the chemical representation space. This hypothesis is reinforced when we compare kde10LSTMAug-NoAdv and kde10LSTMAug: when adversarial training is applied to a model trained on the augmented dataset, the performance improvement is more evident (∼0.5) and consistent across all the test datasets.
3.4. Deep ensemble size
To investigate the effects of increasing the ensemble size, we trained models with an ensemble of 4, 8, and 10 models. Given the previous results, these models used LSTM as the bi-RNN layer and were trained on the augmented dataset. Specifically for the solubility challenge 2_2, the most complex set to predict, these models presented an RMSE of 1.511, 1.418, and 1.263, respectively. Therefore, increasing the ensemble size consistently improved performance. We also observed this improvement on the other datasets (see Table 1).
Besides the immediate improvement in RMSE, increasing the ensemble size also improves the uncertainty of the model. Fig. 2 shows the density distribution of the aleatoric variance and the epistemic variance (respectively related to AU and EU) for kde4LSTMAug (top 6 panels) and kde10LSTMAug (bottom six panels).
Fig. 2. Density distribution of the aleatoric (AU) and epistemic variances (EU) for the: (i) kde4LSTMAug (top six panels) and (ii) kde10LSTMAug (bottom six panels). Increasing ensemble size reduces the extent of the distribution's tail, decreasing uncertainty about predictions. However, the ensemble size does not noticeably affect the distribution center.
The increase in ensemble size led to a decrease in both uncertainties. AU distributions for kde4LSTMAug are centered around 4 (log S)², displaying a long tail that extends to values as high as 20 (log S)² in the worst case (solubility challenge 2_2). A similar trend is observed in the EU distributions. In contrast, the kde10LSTMAug model results in narrower distributions: the means of these distributions remain relatively unchanged, but a noticeable reduction in the extent of their tails can be observed, with the AU distribution ending at values around 10 (log S)².
4. Discussion
After extensively investigating the hyperparameter selection, we compared our model with state-of-the-art models available in the literature. Performance metrics on the solubility challenge datasets can be found in Table 2. Parity plots for our chosen models are presented in Fig. 3.
Metrics for the best models found in the current study (upper section) and for other state-of-the-art models available in the literature (lower section). Values were taken from the cited references; missing values correspond to datasets that the cited authors did not evaluate. SolChal columns stand for the solubility challenges; 2_1 represents the tight dataset (set-1), while 2_2 represents the loose dataset (set-2), as described in the original paper (see ref. 30). The ESOL columns report performance on the ESOL dataset.25 The best-performing model in each dataset has its RMSE value in bold.
| Model | SolChal 1 RMSE | SolChal 1 MAE | SolChal 2_1 RMSE | SolChal 2_1 MAE | SolChal 2_2 RMSE | SolChal 2_2 MAE | ESOL RMSE | ESOL MAE |
|---|---|---|---|---|---|---|---|---|
| RF | 1.121 | 0.914 | 0.950 | 0.727 | 1.205 | 1.002 | | |
| DNN | 1.540 | 1.214 | 1.315 | 1.035 | 1.879 | 1.381 | | |
| DNNAug | 1.261 | 1.007 | 1.371 | 1.085 | 2.189 | 1.710 | | |
| kde4LSTMAug | 1.273 | 0.984 | 1.137 | 0.932 | 1.511 | 1.128 | 1.397 | 1.131 |
| kde8LSTMAug | 1.247 | 0.984 | 1.044 | 0.846 | 1.418 | 1.118 | 1.676 | 1.339 |
| kde10LSTMAug | 1.095 | 0.843 | 0.983 | 0.793 | 1.263 | 1.051 | 1.316 | 1.089 |
| Linear regression25 | | | | | | | 0.75 | |
| UG-RNN34 | 0.90 | 0.74 | | | | | | |
| RF w/CDF descriptors27 | 0.93 | | | | | | | |
| RF w/Morgan fingerprints36 | | 0.64 | | | | | | |
| Consensus88 | 0.91 | | | | | | | |
| GNN89 | ∼1.10 | | 0.91 | | 1.17 | | | |
| SolvBERT90 | 0.925 | | | | | | | |
| SolTranNet41 (a) | | | 1.004 | | 1.295 | | 2.99 | |
| SMILES-BERT91 (b) | | | | | | | 0.47 | |
| MolBERT40 (b) | | | | | | | 0.531 | |
| RT42 (b) | | | | | | | 0.73 | |
| MolFormer43 (b) | | | | | | | 0.278 | |
(a) Has overlap between training and test sets.
(b) Pre-trained model was fine-tuned on ESOL.
Fig. 3. Parity plots for two selected models evaluated on the solubility challenge datasets: (i) kde4LSTMAug (top row) and (ii) kde10LSTMAug (bottom row). The left, middle, and right columns show the parity plots for solubility challenge 1,29 2-set1, and 2-set2,30 respectively. The Pearson correlation coefficient is displayed together with the RMSE and MAE. “acc-0.5” stands for the ±0.5 log% metric, and the red dashed lines show the limits within which a prediction is counted as correct when computing the ±0.5 log%. The correlation between predicted values and labels increases as more models are added to the ensemble, and the RMSE and MAE follow the same pattern. However, the ±0.5 log% decreases on set-2 of the second solubility challenge (SolChal2-set2): while kde10LSTMAug improved the predictions of molecules that kde4LSTMAug predicted poorly, the predictions of molecules with smaller errors did not improve as much.
Comparing the performance of different models is a complex task, as performance metrics cannot be directly compared across models evaluated on distinct datasets. To address this issue, Panapitiya et al.89 curated a large and diverse dataset to train models with various architectures and molecular representations, and compared the performance of these models on datasets from the literature.24,25,29,30,88,92–97 Although their models achieved an RMSE of ∼1.1 on their own test set, using descriptors as molecular representations resulted in RMSE values ranging from 0.55 to ∼1.35 when applied to other datasets from the literature. According to their study, the solubility challenge datasets by Llinàs et al.29,30 were found to be particularly challenging due to their larger reproducibility error. Therefore, we focused on the Llinàs datasets to compare our performance with the literature.
Focusing on the solubility challenge 1 dataset,29 kde10LSTMAug is only ∼0.2 RMSE units worse than the best model available in the literature.34 The RMSE of the challenge participants was not reported;31 the primary metric used to evaluate entries was the percentage of predictions within an error of 0.5 log S units (called ±0.5 log%). Computing the same metric, kde10LSTMAug reaches a correct-prediction percentage of 44.4%, which would place our model among the top 35% of participants. The participant with the best performance presented a ±0.5 log% of 60.7%.
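As a small helper, the ±0.5 log% metric can be computed as the fraction of predictions whose absolute error in log S is at most 0.5 (the numbers below are illustrative, not challenge data):

```python
# Percentage of predictions within +/-0.5 log S of the experimental value.
import numpy as np

def within_half_log(y_true, y_pred):
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return 100.0 * np.mean(err <= 0.5)

print(within_half_log([-3.2, -1.0, -4.5], [-2.9, -1.8, -4.4]))  # 66.7 (%)
```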
The architectures of the models were not published in the findings of the first challenge.31 Nevertheless, the findings of the second challenge32 investigated the participants more thoroughly: participants were asked to identify their models' architecture and the descriptors used. The challenge is divided into two datasets. Set-1 contains log S values with an average interlaboratory reproducibility of 0.17 log S. Our kde10LSTMAug achieves an RMSE of 0.983 and a ±0.5 log% of 40.0% on this dataset; therefore, our model performs better than 62% of the published RMSE values and 50% of the published ±0.5 log% values. The model with the best performance is an artificial neural network (ANN) that correctly predicted 61% (±0.5 log%) of the molecules' log S using a combination of molecular descriptors and fingerprints. The second dataset (set-2) contains molecules whose solubility measurements are more challenging, with a reported average reproducibility error of 0.62 log S. On it, kde10LSTMAug achieves an RMSE of 1.263 and a ±0.5 log% of 23.3%, performing better than 82% of the candidates when considering the RMSE. Surprisingly, the ±0.5 log% does not follow this outstanding performance, being better than only 32% of the entries. Regarding the literature, kde10LSTMAug has an RMSE only ∼0.1 higher than a GNN that used an extensive set of numeric and one-hot descriptors in its feature vector,89 and our model performs better than a transformer model that uses SMILES strings and an adjacency matrix as inputs.41 The performance of those models is available in Table 2.
Notably, all participants in the second solubility challenge submitted some kind of QSPR or descriptor-based ML model. Using descriptors provides an easy way to ensure model invariance with respect to the molecular representation and is more informative, since descriptors can be physical quantities. However, selecting appropriate descriptors is crucial for developing descriptor-based ML models and often requires specialists with strong intuition about which physical and chemical properties are relevant for predicting the target quantity. Feature-based models are still considered the SOTA for solubility prediction. Recently, studies investigating different descriptors and fingerprints were performed.36,98 These studies showed that, similarly to the impact of data quality,28 the molecular representation also has a great impact on model performance. Although Tayyebi et al.36 achieved an MAE of 0.64 on solubility challenge 1 using Morgan fingerprints (MF), Zagidullin et al.98 reported poor performance when using MF. Our approach, on the other hand, is based on extracting information from simple string representations, a more straightforward form of raw data. Furthermore, we could achieve state-of-the-art performance while balancing model size and complexity and using raw input (a simple string). This simplified usage enables running the model on devices with limited computing power.
Lastly, transformer models have been used to address the issue of accurately predicting the solubility of small compounds. The typical workflow for transformers involves pre-training the model on a large dataset and subsequently fine-tuning it for a specific downstream task using a smaller dataset. Most existing models were either trained on the ESOL25 dataset or pre-trained on a larger dataset and fine-tuned using ESOL; hence, the generalizability of those models cannot be verified. Francoeur and Koes41 considered two versions of their model, SolTranNet. The first version was trained on the ESOL dataset using random splits and achieved an RMSE of 0.278. The deployed version of SolTranNet was instead trained on AqSolDB;1 when evaluated on ESOL, this deployed version presented an RMSE of 2.99. While our model achieved an RMSE of 1.316 on ESOL, outperforming the deployed version of SolTranNet, it cannot be directly compared with models trained on ESOL.
5. Conclusions
We used the JavaScript implementation of TensorFlow (TensorFlow.js) to implement a deep ensemble recurrent neural network (RNN) that can accurately predict log S values directly from SMILES or SELFIES string representations. This model is hosted on a static website and can be accessed at https://mol.dev/. The contributions of this work are as follows: (1) we show that it is possible to use string representations to predict solubility; (2) we show that using strings does not lead to an unacceptable decrease in performance, with our models performing comparably to state-of-the-art (SOTA) models on the Llinàs et al. datasets; (3) our model performs predictions with uncertainties, increasing the reliability and practical utility of the predictions; (4) we largely improve ease of use by implementing a static website that does not require domain-specific data or knowledge to be used.
Our model, based on a deep ensemble of recurrent neural networks (RNNs), was trained on the AqSolDB dataset using SMILES randomization for data augmentation and validated using the solubility challenges by Llinàs et al.29,30 It directly processes molecular string representations, such as SMILES or SELFIES, to predict solubility without relying on pre-selected descriptors. This approach not only simplifies the prediction process but also enhances its applicability across a broader chemical space. In addition, we show that this deep ensemble RNN model achieves performance similar to that of a random forest (RF) using PaDEL descriptors; RFs with descriptors have been shown to perform relatively well on other datasets.
By carefully balancing performance and complexity, we developed a model with acceptable performance that is not computationally intensive. This enables us to host the model on a static website using TensorFlow.js. Our model was designed to operate on devices with limited computational resources, aiming to broaden the accessibility of advanced solubility prediction tools: it runs satisfactorily even on laptops and smartphones. This approach ensures wider applicability, catering to users without access to high-performance computing facilities, improving usability and flexibility, and decreasing implementation costs. We believe this is a considerable step toward improving the usability of deep learning models and promoting such models to a broader scientific community.
Data availability
The code for training and evaluating the models discussed in this manuscript is available at https://github.com/ur-whitelab/mol.dev/tree/main/ml. The code employed for this study was the version deployed on February 17. This study used publicly available data from the AqSolDB dataset (https://doi.org/10.1038/s41597-019-0151-1), the ESOL dataset (https://doi.org/10.1021/ci034243x), and the first (https://doi.org/10.1021/ci800436c) and second (https://doi.org/10.1021/acs.jcim.0c00701) solubility challenges. These datasets are compiled and available at https://github.com/ur-whitelab/mol.dev/blob/main/ml/data.zip.
Author contributions
M. C. R. implemented the deep learning model, performed the training and hyperparameters optimization, tested the model's performance, analyzed the results, and wrote this manuscript. A. D. W. idealized the project, proposed the model to be used, implemented the deep learning approach, and developed http://mol.dev/.
Conflicts of interest
There are no conflicts to declare.
Supplementary Material
Acknowledgments
The authors acknowledge the National Institute of General Medical Sciences of the National Institutes of Health (NIH) under award number R35GM137966. This research used the computational resources and structure provided by the Center for Integrated Research Computing (CIRC) at the University of Rochester.
Footnotes
Notes and references
- Sorkun M. C. Khetan A. Er S. Sci. Data. 2019;6:143. doi: 10.1038/s41597-019-0151-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dajas F. J. Ethnopharmacol. 2012;143:383–396. doi: 10.1016/j.jep.2012.07.005. [DOI] [PubMed] [Google Scholar]
- Di L. Fish P. V. Mano T. Drug Discovery Today. 2012;17:486–495. doi: 10.1016/j.drudis.2011.11.007. [DOI] [PubMed] [Google Scholar]
- Docherty R. Pencheva K. Abramov Y. A. J. Pharm. Pharmacol. 2015;67:847–856. doi: 10.1111/jphp.12393. [DOI] [PubMed] [Google Scholar]
- Barrett J. A. Yang W. Skolnik S. M. Belliveau L. M. Patros K. M. Drug Discovery Today. 2022;27:1315–1325. doi: 10.1016/j.drudis.2022.01.017. [DOI] [PubMed] [Google Scholar]
- Sormanni P. Aprile F. A. Vendruscolo M. J. Mol. Biol. 2015;427:478–490. doi: 10.1016/j.jmb.2014.09.026. [DOI] [PubMed] [Google Scholar]
- Herrero-Martínez J. M. Sanmartin M. Rosés M. Bosch E. Ràfols C. Electrophoresis. 2005;26:1886–1895. doi: 10.1002/elps.200410258. [DOI] [PubMed] [Google Scholar]
- Diorazio L. J. Hose D. R. J. Adlington N. K. Org. Process Res. Dev. 2016;20:760–773. doi: 10.1021/acs.oprd.6b00015. [DOI] [Google Scholar]
- Sheikholeslamzadeh E. Rohani S. Ind. Eng. Chem. Res. 2012;51:464–473. doi: 10.1021/ie201344k. [DOI] [Google Scholar]
- Yalkowsky S. H. Valvani S. C. J. Pharm. Sci. 1980;69:912–922. doi: 10.1002/jps.2600690814. [DOI] [PubMed] [Google Scholar]
- Ran Y. Yalkowsky S. H. J. Chem. Inf. Comput. Sci. 2001;41:354–357. doi: 10.1021/ci000338c. [DOI] [PubMed] [Google Scholar]
- Fredenslund A. Jones R. L. Prausnitz J. M. AIChE J. 1975;21(6):1086–1099. doi: 10.1002/aic.690210607. [DOI] [Google Scholar]
- Abrams D. S. Prausnitz J. M. AIChE J. 1975;21:116–128. doi: 10.1002/aic.690210115. [DOI] [Google Scholar]
- Maurer G. Prausnitz J. M. Fluid Phase Equilib. 1978;2:91–99. doi: 10.1016/0378-3812(78)85002-X. [DOI] [Google Scholar]
- Lüder K. Lindfors L. Westergren J. Nordholm S. Kjellander R. J. Phys. Chem. 2007;111(25):7303–7311. doi: 10.1021/jp071687d. [DOI] [PubMed] [Google Scholar]
- Lüder K. Lindfors L. Westergren J. Nordholm S. Kjellander R. J. Phys. Chem. B. 2007;111(7):1883–1892. doi: 10.1021/jp0642239. [DOI] [PubMed] [Google Scholar]
- Boothroyd S. Kerridge A. Broo A. Buttar D. Anwar J. Phys. Chem. Chem. Phys. 2018;20:20981–20987. doi: 10.1039/C8CP01786G. [DOI] [PubMed] [Google Scholar]
- Boothroyd S. Anwar J. J. Chem. Phys. 2019;151:184113. doi: 10.1063/1.5117281. [DOI] [PubMed] [Google Scholar]
- Tomasi J. Mennucci B. Cammi R. Chem. Rev. 2005;105:2999–3093. doi: 10.1021/cr9904009. [DOI] [PubMed] [Google Scholar]
- Yu X. Wang X. Wang H. Li X. Gao J. QSAR Comb. Sci. 2006;25:156–161. doi: 10.1002/qsar.200530138. [DOI] [Google Scholar]
- Ghasemi J. Saaidpour S. Chem. Pharm. Bull. 2007;55(4):669–674. doi: 10.1248/cpb.55.669. [DOI] [PubMed] [Google Scholar]
- Duchowicz P. R. Castro E. A. Int. J. Mol. Sci. 2009;10:2558–2577. doi: 10.3390/ijms10062558. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Louis B. Singh J. Shaik B. Agrawal V. K. Khadikar P. V. Chem. Biol. Drug Des. 2009;74(2):190–195. doi: 10.1111/j.1747-0285.2009.00844.x. [DOI] [PubMed] [Google Scholar]
- Huuskonen J. J. Chem. Inf. Comput. Sci. 2000;40:773–777. doi: 10.1021/ci9901338. [DOI] [PubMed] [Google Scholar]
- Delaney J. S. J. Chem. Inf. Comput. Sci. 2004;44:1000–1005. doi: 10.1021/ci034243x. [DOI] [PubMed] [Google Scholar]
- Skyner R. E. McDonagh J. L. Groom C. R. van Mourik T. Mitchell J. B. O. Phys. Chem. Chem. Phys. 2015;17:6174–6191. doi: 10.1039/C5CP00288E. [DOI] [PubMed] [Google Scholar]
- McDonagh J. L. Nath N. De Ferrari L. van Mourik T. Mitchell J. B. O. J. Chem. Inf. Model. 2014;54:844–856. doi: 10.1021/ci4005805. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sorkun M. C. Koelman J. M. V. A. Er S. iScience. 2021;24:101961. doi: 10.1016/j.isci.2020.101961. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Llinàs A. Glen R. C. Goodman J. M. J. Chem. Inf. Model. 2008;48:1289–1303. doi: 10.1021/ci800058v. [DOI] [PubMed] [Google Scholar]
- Llinas A. Avdeef A. J. Chem. Inf. Model. 2019;59:3036–3040. doi: 10.1021/acs.jcim.9b00345. [DOI] [PubMed] [Google Scholar]
- Hopfinger A. J. Esposito E. X. Llinàs A. Glen R. C. Goodman J. M. J. Chem. Inf. Model. 2009;49:1–5. doi: 10.1021/ci800436c. [DOI] [PubMed] [Google Scholar]
- Llinas A. Oprisiu I. Avdeef A. J. Chem. Inf. Model. 2020;60:4791–4803. doi: 10.1021/acs.jcim.0c00701. [DOI] [PubMed] [Google Scholar]
- Schwaighofer A. Schroeter T. Mika S. Laub J. ter Laak A. Sülzle D. Ganzer U. Heinrich N. Müller K.-R. J. Chem. Inf. Model. 2007;47:407–424. doi: 10.1021/ci600205g. [DOI] [PubMed] [Google Scholar]
- Lusci A. Pollastri G. Baldi P. J. Chem. Inf. Model. 2013;53:1563–1575. doi: 10.1021/ci400187y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ye Z. Ouyang D. J. Cheminf. 2021;13:98. doi: 10.1186/s13321-021-00575-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tayyebi A. Alshami A. S. Rabiei Z. Yu X. Ismail N. Talukder M. J. Power J. J. Cheminf. 2023;15:99. doi: 10.1186/s13321-023-00752-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kurotani A. Kakiuchi T. Kikuchi J. ACS Omega. 2021;6:14278–14287. doi: 10.1021/acsomega.1c01035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L. and Polosukhin I., arXiv, 2017, preprint, arXiv:1706.03762v7, 10.48550/arXiv.1706.03762 [DOI]
- Wang S., Guo Y., Wang Y., Sun H. and Huang J., Proceedings of the 10th ACM, 2019 [Google Scholar]
- Fabian B., Edlich T., Gaspar H. and Segler M. and Others, arXiv, 2020, preprint, arXiv:2011.13230v1, 10.48550/arXiv.2011.13230 [DOI]
- Francoeur P. G. Koes D. R. J. Chem. Inf. Model. 2021;61:2530–2536. doi: 10.1021/acs.jcim.1c00331. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Born J. and Manica M., arXiv, 2022, preprint, arXiv:2202.01338v3, 10.48550/arXiv.2202.01338 [DOI]
- Ross J. Belgodere B. Chenthamarakshan V. Padhi I. Mroueh Y. Das P. Res. Sq. 2022 doi: 10.21203/rs.3.rs-1570270/v1. [DOI] [Google Scholar]
- Zdrazil B. Guha R. J. Med. Chem. 2017;61:4688–4703. doi: 10.1021/acs.jmedchem.7b00954. [DOI] [PubMed] [Google Scholar]
- Seelow D. Nucleic Acids Res. 2020;48:W1–W4. doi: 10.1093/nar/gkaa528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baek M. DiMaio F. Anishchenko I. Dauparas J. Ovchinnikov S. Lee G. R. Wang J. Cong Q. Kinch L. N. Schaeffer R. D. Millán C. Park H. Adams C. Glassman C. R. DeGiovanni A. Pereira J. H. Rodrigues A. V. van Dijk A. A. Ebrecht A. C. Opperman D. J. Sagmeister T. Buhlheller C. Pavkov-Keller T. Rathinaswamy M. K. Dalwadi U. Yip C. K. Burke J. E. Garcia K. C. Grishin N. V. Adams P. D. Read R. J. Baker D. Science. 2021;373:871–876. doi: 10.1126/science.abj8754. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stroet M. Caron B. Visscher K. M. Geerke D. P. Malde A. K. Mark A. E. J. Chem. Theory Comput. 2018;14:5834–5845. doi: 10.1021/acs.jctc.8b00768. [DOI] [PubMed] [Google Scholar]
- Smith D. G. A. Altarawy D. Burns L. A. Welborn M. Naden L. N. Ward L. Ellis S. Pritchard B. P. Crawford T. D. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2021;11(2) doi: 10.1002/wcms.1491. [DOI] [Google Scholar]
- Ansari M. White A. D. J. Chem. Inf. Model. 2023;63:2546–2553. doi: 10.1021/acs.jcim.2c01317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lakshminarayanan B., Pritzel A. and Blundell C., arXiv, 2016, preprint, arXiv:1612.01474v3, 10.48550/arXiv.1612.01474 [DOI]
- Weininger D. J. Chem. Inf. Model. 1988;28:31–36. doi: 10.1021/ci00057a005. [DOI] [Google Scholar]
- Krenn M. Ai Q. Barthel S. Carson N. Frei A. Frey N. C. Friederich P. Gaudin T. Gayle A. A. Jablonka K. M. Lameiro R. F. Lemm D. Lo A. Moosavi S. M. Nápoles-Duarte J. M. Nigam A. Pollice R. Rajan K. Schatzschneider U. Schwaller P. Skreta M. Smit B. Strieth-Kalthoff F. Sun C. Tom G. Falk von Rudorff G. Wang A. White A. D. Young A. Yu R. Aspuru-Guzik A. Patterns. 2022;3:100588. doi: 10.1016/j.patter.2022.100588. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim S. Thiessen P. A. Cheng T. Yu B. Bolton E. E. Nucleic Acids Res. 2018;46:W563–W570. doi: 10.1093/nar/gky294. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schilter O. T. Laino T. Schwaller P. Appl. AI Lett. 2024;5(1) doi: 10.1002/ail2.91. [DOI] [Google Scholar]
- Beltran J. A. Aguilera-Mendoza L. Brizuela C. A. BMC Genomics. 2018;19:672. doi: 10.1186/s12864-018-5030-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maggiora G. M. On outliers and activity cliffs why QSAR often disappoints. J. Chem. Inf. Model. 2006;46(4):1535. doi: 10.1021/ci060117s. [DOI] [PubMed] [Google Scholar]
- Smilkov D. Thorat N. Assogba Y. Nicholson C. Kreeger N. Yu P. Cai S. Nielsen E. Soegel D. Bileschi S. Others Proc. Mach. Learn. 2019;1:309–321. [Google Scholar]
- Wang J. Hou T. Xu X. J. Chem. Inf. Model. 2009;49:571–581. doi: 10.1021/ci800406y. [DOI] [PubMed] [Google Scholar]
- Wang J. Hou T. Comb. Chem. High Throughput Screening. 2011;14:328–338. doi: 10.2174/138620711795508331. [DOI] [PubMed] [Google Scholar]
- Landrum G., RDKit: Open-source cheminformatics, Release 1.0
- Arús-Pous J. Johansson S. V. Prykhodko O. Bjerrum E. J. Tyrchan C. Reymond J.-L. Chen H. Engkvist O. J. Cheminf. 2019;11:71. doi: 10.1186/s13321-019-0393-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schwaller P., Vaucher A. C., Laino T. and Reymond J.-L., ChemRxiv, 2020, preprint, 10.26434/chemrxiv.13286741.v1 [DOI]
- Shaker M. H. and Hüllermeier E., Advances in Intelligent Data Analysis XVIII, 2020, pp. 444–456 [Google Scholar]
- Ghoshal B. Tucker A. Sanghera B. Lup Wong W. Comput. Intell. 2021;37:701–734. doi: 10.1111/coin.12411. [DOI] [Google Scholar]
- Scalia G. Grambow C. A. Pernici B. Li Y.-P. Green W. H. J. Chem. Inf. Model. 2020;60:2697–2717. doi: 10.1021/acs.jcim.9b00975. [DOI] [PubMed] [Google Scholar]
- Chollet F. and Others, Keras: The Python Deep Learning library, 2018
- Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G. S., Davis A., Dean J., Devin M. and Others, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015 [Google Scholar]
- Hochreiter S. Schmidhuber J. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
- Zhang A., Lipton Z. C., Li M. and Smola A. J., Dive into Deep Learning, Cambridge University Press, 2023 [Google Scholar]
- Ba J. L., Kiros J. R. and Hinton G. E., arXiv, 2016, preprint, arXiv:1607.06450v1, 10.48550/arXiv.1607.06450 [DOI]
- Ioffe S. and Szegedy C., Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015, pp. 448–456 [Google Scholar]
- Awais M., Iqbal M. T. B. and Bae S.-H., Revisiting Internal Covariate Shift for Batch Normalization, 2021 [DOI] [PubMed] [Google Scholar]
- Santurkar S., Tsipras D., Ilyas A. and Madry A., Advances in Neural Information Processing Systems, 2018 [Google Scholar]
- Xu J., Sun X., Zhang Z., Zhao G. and Lin J., arXiv, 2019, preprint, arXiv:1911.07013v1, 10.48550/arXiv.1911.07013 [DOI]
- Tian Y. Zhang Y. Inf. Fusion. 2022;80:146–166. doi: 10.1016/j.inffus.2021.11.005. [DOI] [Google Scholar]
- Gal Y. and Ghahramani Z., Proceedings of The 33rd International Conference on Machine Learning, New York, USA, 2016, pp. 1050–1059 [Google Scholar]
- Kingma D. P. and Ba J., arXiv, 2014, preprint, arXiv:1412.6980v9, 10.48550/arXiv.1412.6980 [DOI]
- Mitchell M., Wu S., Zaldivar A., Barnes P., Vasserman L., Hutchinson B., Spitzer E., Raji I. D. and Gebru T., Proceedings of the Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 2019, pp. 220–229 [Google Scholar]
- Yap C. W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011;32(7):1466–1474. doi: 10.1002/jcc.21707. [DOI] [PubMed] [Google Scholar]
- Gao S. Huang Y. Zhang S. Han J. Wang G. Zhang M. Lin Q. J. Hydrol. 2020;589:125188. doi: 10.1016/j.jhydrol.2020.125188. [DOI] [Google Scholar]
- Kim J., Kim S., Wimmer H. and Liu H., 2021 IEEE/ACIS 6th International Conference on Big Data, Cloud Computing, and Data Science, BCD, 2021, pp. 37–44 [Google Scholar]
- Encean A.-A. and Zinca D., Cryptocurrency Price Prediction Using LSTM and GRU Networks, 2022 [Google Scholar]
- Kumar V. B., Bharat Kumar V., Mallikarjuna Nookesh V., Satya Saketh B., Syama S. and Ramprabhakar J., Wind Speed Prediction Using Deep Learning-LSTM and GRU, 2021 [Google Scholar]
- Liu X. Lin Z. Feng Z. Energy. 2021;227:120492. doi: 10.1016/j.energy.2021.120492. [DOI] [Google Scholar]
- Mateus B. C. Mendes M. Farinha J. T. Assis R. Cardoso A. M. Energies. 2021;14:6958. doi: 10.3390/en14216958. [DOI] [Google Scholar]
- Gruber N. Jockisch A. Front. Artif. Intell. 2020;3:40. doi: 10.3389/frai.2020.00040. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chung J., Gulcehre C., Cho K. and Bengio Y., arXiv, 2014, preprint, arXiv:1412.3555, 10.48550/arXiv.1412.3555 [DOI]
- Boobier S. Osbourn A. Mitchell J. B. O. J. Cheminf. 2017;9:63. doi: 10.1186/s13321-017-0250-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Panapitiya G. Girard M. Hollas A. Sepulveda J. Murugesan V. Wang W. Saldanha E. ACS Omega. 2022;7:15695–15710. doi: 10.1021/acsomega.2c00642. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yu J. Zhang C. Cheng Y. Yang Y.-F. She Y.-B. Liu F. Su W. Su A. Digital Discovery. 2023;2:409–421. doi: 10.1039/D2DD00107A. [DOI] [Google Scholar]
- Kim H. Lee J. Ahn S. Lee J. R. Sci. Rep. 2021;11:11028. doi: 10.1038/s41598-021-90259-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Klopman G. Zhu H. J. Chem. Inf. Comput. Sci. 2001;41:439–445. doi: 10.1021/ci000152d. [DOI] [PubMed] [Google Scholar]
- Hou T. J. Xia K. Zhang W. Xu X. J. J. Chem. Inf. Comput. Sci. 2004;44:266–275. doi: 10.1021/ci034184n. [DOI] [PubMed] [Google Scholar]
- Wang J. Krudy G. Hou T. Zhang W. Holland G. Xu X. J. Chem. Inf. Model. 2007;47:1395–1404. doi: 10.1021/ci700096r. [DOI] [PubMed] [Google Scholar]
- Boobier S. Hose D. R. J. Blacker A. J. Nguyen B. N. Nat. Commun. 2020;11:5753. doi: 10.1038/s41467-020-19594-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tang B. Kramer S. T. Fang M. Qiu Y. Wu Z. Xu D. J. Cheminf. 2020;12:15. doi: 10.1186/s13321-020-0414-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cui Q. Lu S. Ni B. Zeng X. Tan Y. Chen Y. D. Zhao H. Front. Oncol. 2020;10:121. doi: 10.3389/fonc.2020.00121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zagidullin B. Wang Z. Guan Y. Pitkänen E. Tang J. Briefings Bioinf. 2021;22(6) doi: 10.1093/bib/bbab291. [DOI] [PMC free article] [PubMed] [Google Scholar]