Abstract
Water is a ubiquitous solvent in chemistry and life. It is therefore no surprise that the aqueous solubility of compounds has a key role in various domains, including but not limited to drug discovery, paint, coating, and battery materials design. Measurement and prediction of aqueous solubility is a complex and prevailing challenge in chemistry. For the latter, different data-driven prediction models have recently been developed to augment the physics-based modeling approaches. To construct accurate data-driven estimation models, it is essential that the underlying experimental calibration data used by these models is of high fidelity and quality. Existing solubility datasets show variance in the chemical space of compounds covered, measurement methods, experimental conditions, but also in the non-standard representations, size, and accessibility of data. To address this problem, we generated a new database of compounds, AqSolDB, by merging a total of nine different aqueous solubility datasets, curating the merged data, standardizing and validating the compound representation formats, marking with reliability labels, and providing 2D descriptors of compounds as a Supplementary Resource.
Subject terms: Thermodynamics, Cheminformatics
Design Type(s) | chemical reaction data analysis objective • data integration objective • data validation objective |
Measurement Type(s) | aqueous solubility |
Technology Type(s) | digital curation |
Factor Type(s) | physical state |
Sample Characteristic(s) |
Background & Summary
Aqueous solubility is a crucial property of chemical substances that governs phenomena in areas such as geochemistry, climate prediction, biochemistry, drug design, agrochemical design, and protein-ligand binding. It is defined as the maximum amount of a compound, i.e., the solute, that can dissolve in a given volume of water, and it depends on physical conditions such as temperature and pressure. It is especially critical in pharmaceutical drug design, where poor aqueous solubility is likely to lead to precipitation of compounds from the screening buffer, which may create a high risk of erroneous results, false leads, increased costs, and formulation difficulties during clinical development.
Although the aqueous solubility of a compound can be related to its other structural and physico-chemical properties such as shape, polar surface area (PSA), acid dissociation constant (pKa), lipophilicity (logD), and the number of hydrogen bond donors and acceptors, theoretical predictions are often inaccurate. To overcome these challenges, several data-driven models have been developed over the last couple of decades to predict the aqueous solubility of compounds1–6.
The development of reliable data-driven models, however, has been hindered by uncertainties and disagreements in the underlying data, which are obtained from many disparate sources. Unsystematic errors between different experimental methodologies potentially limit the accuracy with which the models can be trained and validated. To develop generalizable prediction models, accurate datasets are needed that are diverse and large at the same time7.
In this work, we assess the quality of aqueous solubility datasets under two categories: generalizability and fidelity. Generalizability can be interpreted in terms of the chemical diversity of the dataset, as well as its size. Machine learning models developed using datasets that are small and lack chemical diversity show poor predictive capability on external test sets, as shown in the study by Wang et al.8. Another very important indicator of dataset quality is fidelity. Fidelity can be understood as the accuracy of the data in terms of the reliability of the experimental technique and the absence of human errors in conducting the experiments or recording the measured values. In their review, Wang et al. reported inconsistencies of experimental values in different databases9. Balakin et al. also reported the same problem, finding standard deviations (SD) of experimental solubility values of the same compounds as large as 0.5 in LogS units7. These errors may result from experimental noise or unintentional misprints. Data verification is therefore important to increase the reliability of datasets9.
The aim of this study is to curate a large experimental aqueous solubility dataset, AqSolDB, for data-driven model development. For this purpose, we searched for and collected nine open-source datasets on aqueous solubility. To merge the datasets, we followed systematic steps of identifier generation, in which CAS numbers and SLN identifiers were converted into SMILES representations, and validation10–12. All identifiers were converted to the SMILES format and all experimental solubility values were standardized to LogS units. After standardization, we merged all the datasets into one and grouped the entries based on their reliability label and the number of occurrences in the merged dataset.
In this data descriptor, we provided a general algorithm for selection of the statistically most reliable values from a set of competing values. AqSolDB consists of aqueous solubility values of 9,982 unique compounds, along with some relevant topological and physico-chemical 2D descriptors. Additionally, the dataset contains validated representations of each of the compounds.
AqSolDB is an openly accessible, easy-to-use, and well-structured database of compounds. We expect it to serve a broad community as a reference aqueous solubility dataset for the benchmarking of new experimental and physics-based modelling results, and additionally as a machine-readable ancillary resource to improve the prediction capability of future machine learning approaches.
Methods
To curate our dataset we followed three steps. First, we collected nine publicly available aqueous solubility datasets and converted them into a standardized format. Second, we combined the datasets into a single dataset by applying a data verification algorithm that selects the statistically most reliable experimental value among multiple occurrences. Finally, we added topological and physico-chemical 2D descriptors to the merged dataset. Figure 1 shows the flow of the curation process.
Step 1: Collecting and pre-processing datasets
Solubility data was first collected from nine publicly available datasets as shown in Table 1. A set of three pre-processing steps were applied to each of the datasets in order to standardize the representation format and solubility values in the same units. These steps also describe our exclusion criteria on the basis of unique identifier validation. The steps are as follows:
Identifier generation: We chose the SMILES representation as the standard identifier for compounds for our curated dataset. In external datasets, where SMILES representations were not available, we used the name and the CAS Registry Number of compounds as inputs to retrieve the SMILES strings from the Chemical Identifier Resolver web service of the National Cancer Institute (https://cactus.nci.nih.gov/chemical/structure). Lastly, SLN identifiers available from some datasets were converted to SMILES using RDKit open-source cheminformatics software.
Unit Conversion: The chosen unit of solubility in this dataset is LogS, where S is the aqueous solubility in mol/L (or M). Units such as g/L and mg/L were converted to LogS using the molecular mass of the compounds.
SMILES Validation: To ensure consistency and robustness of the SMILES representations, we used InChI representations13 in the scheme shown in Fig. 2. First, SMILES strings were converted into RDKit mol objects. If an error occurred during the conversion, the input SMILES string was considered to be invalid. Next, the obtained RDKit mol objects were converted to InChI representations. The InChI representations were used to regenerate the RDKit mol objects. Finally, the regenerated mol objects were converted back to InChI representations. The original and regenerated InChI were checked for consistency to ensure that the generated InChI were reproducible. This step also validated that both the SMILES and InChI representations led to the same RDKit mol object, and thus to the same chemical compound.
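The Unit Conversion step above can be illustrated with a minimal sketch. It assumes that the source units are those listed in Table 1, that the molecular mass is supplied in g/mol, and that the function name `to_log_s` and the ASCII unit labels (`ug/L`, `uM`) are hypothetical choices made for this example rather than names from the curation code.

```python
import math

# Multipliers that bring each mass-concentration unit to g/L.
# The ASCII labels "ug/L" and "uM" stand in for the micro sign (assumption).
UNIT_TO_G_PER_L = {"g/L": 1.0, "mg/L": 1e-3, "ug/L": 1e-6}


def to_log_s(value, unit, mol_weight=None):
    """Convert a solubility value to LogS, i.e. log10 of solubility in mol/L."""
    if unit == "LogS":
        return value  # already in the target unit
    if value <= 0:
        raise ValueError("solubility must be positive to take a logarithm")
    if unit == "uM":
        # Micromolar is already a molar unit: 1 uM = 1e-6 mol/L.
        return math.log10(value * 1e-6)
    if unit not in UNIT_TO_G_PER_L:
        raise ValueError(f"unsupported unit: {unit}")
    if mol_weight is None or mol_weight <= 0:
        raise ValueError("a positive molecular mass (g/mol) is required")
    grams_per_litre = value * UNIT_TO_G_PER_L[unit]
    # Mass concentration divided by molar mass gives mol/L.
    return math.log10(grams_per_litre / mol_weight)
```

For instance, a compound with a molecular mass of 180.16 g/mol and a solubility of 180.16 g/L corresponds to 1 mol/L, so the function returns a LogS of 0.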
Table 1.
Dataset ID | Original Size | Filtered Size | Compound Representations | Solubility Units |
---|---|---|---|---|
A14 | 14,180 | 6,110 | name, CAS | g/L, mg/L, μg/L |
B15 | 5,764 | 4,651 | name, CAS | LogS |
C16 | 2,603 | 2,603 | name, SMILES | LogS |
D17 | 2,267 | 2,115 | name, CAS | LogS |
E1 | 1,291 | 1,291 | name, SMILES, CAS | LogS |
F8 | 1,210 | 1,210 | SLN | LogS |
G2 | 1,144 | 1,144 | name, SMILES | LogS |
H8 | 578 | 578 | SLN | LogS |
I20 | 105 | 94 | name, SMILES, InChI | μM |
Dataset ID: identifier of the dataset during the curation process. Original Size: number of instances in the dataset as collected. Filtered Size: number of instances after pre-processing. Compound Representations: compound representations available in the dataset as collected. Solubility Units: units of the experimental solubility values in the dataset.
Table 1 shows the type of information contained in the datasets. Every dataset was processed separately in order to standardize them. The extraction process and standardization methods applied for each dataset, along with the temperature based exclusion criteria, are explained below. We named the datasets from A to I according to the number of instances they have in descending order.
Dataset A (6,110 instances)
Dataset A was obtained from eChemPortal14, which is an open-source chemical property database developed by the Organisation for Economic Co-operation and Development (OECD). Solubility data was extracted after applying the filters “experimental studies” and “water solubility”. This yielded several lines of bulk text, which were then parsed to obtain the CAS number, name, and experimental solubility results, including temperature and pH conditions. A total of 14,180 instances were thus obtained, and these were further filtered to the temperature range 25 ± 5 °C, leaving 8,419 instances. In the identifier generation step, 6,183 of the 8,419 compounds were successfully converted into SMILES. Finally, after applying SMILES validation, 6,110 instances were obtained.
Dataset B (4,651 instances)
Dataset B was downloaded from the EPI Suite Data website15. This open-source dataset consisted of 5,764 liquid and crystalline organic compounds with the following properties: CAS number, name, molecular weight, water solubility, and temperature. SMILES identifiers were successfully generated for 5,367 of these compounds. We then filtered the data to the temperature range 25 ± 5 °C to obtain 5,206 compounds. In the final step, the InChI and InChIKey were validated to obtain 4,651 compounds.
Dataset C (2,603 instances)
Dataset C was collected from the work of Raevsky et al.16 and it contains solubility data measured at 25 ± 5 °C. The dataset consists of solubility of 2,603 crystalline solid compounds along with SMILES strings. All compounds were successfully recreated after pre-processing steps.
Dataset D (2,115 instances)
Dataset D was downloaded from the EPI Suite Data website17. This open-source dataset consisted of 2,267 liquid and crystalline organic compounds, of which 2,115 compounds remained after applying the pre-processing steps.
Dataset E (1,291 instances)
Dataset E was taken from the work of Huuskonen et al.1. In that study, experimental aqueous solubility values measured between 20 and 25 °C were obtained from the AQUASOL database of the University of Arizona and SRC’s PHYSPROP Database. The extended version of this dataset, with 1,291 solubility values and SMILES, was downloaded from the Cheminformatics website (http://cheminformatics.org/). All compounds were successfully recreated after the pre-processing steps.
Dataset F (1,210 instances)
Dataset F was taken from the work of Wang et al.8. They extracted 1,210 compounds from the Beilstein database and sanitized them. However, the dataset contains compound identifiers only in the SLN format12. We converted SLN to SMILES representations using the RDKit SLN parser. During the conversion, 93 of the 1,210 compounds could not be produced. Using the Molview web tool (http://molview.org/), we obtained valid SMILES for these 93 missing compounds. Name information was collected from the NCI Chemical Identifier Resolver service, Molview, and the ChemSpider18 website. InChI and InChIKey values were produced and validated using the pre-processing steps, and all compounds were successfully recreated.
Dataset G (1,144 instances)
Dataset G was taken from the work of Delaney et al.2. The dataset consists of 1,144 small compounds with experimental solubility measured at 25 °C and SMILES information. All compounds were successfully recreated after applying the pre-processing steps.
Dataset H (578 instances)
Dataset H was taken from the work of Wang et al.8, who sanitized the dataset used by Jain and Yalkowsky by removing duplicate entries19. This dataset consists of 322 liquid and 256 solid compounds. The dataset contained only SLN as the compound identifier, and after applying the pre-processing steps all compounds were successfully recreated.
Dataset I (94 instances)
Dataset I was taken from the Goodman Group website (http://www-jmg.ch.cam.ac.uk/data/solubility/) as the corrected version of the solubility challenge dataset20. The dataset consists of 105 drug-like compounds with name, SMILES, and solubility information. The solubility values were measured at 25 °C. Eleven of the 105 entries had to be removed because their solubility values were missing. All remaining compounds were successfully recreated after applying the pre-processing steps.
Step 2: Merging datasets
The purpose of this step is to combine all datasets into a single repository that contains only unique compounds, each paired with its most reliable aqueous solubility value. The InChI representation was used to identify compounds uniquely, and solubility values within 0.01 LogS units of each other were deemed to be identical. Based on these conditions, a preliminary analysis of the combined repository revealed two different kinds of redundancy: (1) a given compound was found to repeat with a different solubility value, or (2) a given compound was found to repeat with the same solubility value. To quantify the relative uniqueness of each of these datasets, redundancy matrices are plotted in Fig. 3, where the rows and columns of these matrices represent the various datasets. Redundant compounds of kind (1) and kind (2) between any two datasets i and j are represented as fractional values in Fig. 3a and Fig. 3b, respectively. As an example, a value of 0.13 in Fig. 3a for the pair of datasets B and A means that 13% of the compounds from dataset B can be found in dataset A, but with a different solubility value. In a similar way, a value of 0.09 in Fig. 3b for the same pair means that 9% of the compounds from dataset B can be found in dataset A with the same solubility value. The matrices are not symmetric in the fractional representation because of the different sizes of the datasets. While redundancies of kind (2) can be handled simply by removing identical copies, Fig. 3a shows that the datasets possess a high degree of redundancy of kind (1), which necessitates a strategy for selecting the most reliable value.
There are a total of 19,796 instances in the merged repository, with 9,982 unique compounds, before redundant values are eliminated. To curate this dataset with a unique solubility value for every compound, we designed an algorithm to select the most reliable experimental value. The selection is performed by first classifying the compounds into five distinct groups, which are defined based on the statistics of occurrence of a compound in the dataset. The flow chart of the curation algorithm is shown in Fig. 4 and described as follows:
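The fractional redundancies described above can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce Fig. 3: each dataset is assumed to be a non-empty mapping from InChI string to a single LogS value, and the function name `redundancy_fractions` and the dictionary layout are hypothetical.

```python
def redundancy_fractions(datasets, tol=0.01):
    """Return two dicts keyed by (i, j) with the fraction of compounds in
    dataset i that also occur in dataset j with a different value (kind 1)
    or with the same value within `tol` LogS units (kind 2)."""
    kind1, kind2 = {}, {}
    for i, data_i in datasets.items():
        for j, data_j in datasets.items():
            if i == j:
                continue
            # Compounds are matched on their InChI keys.
            shared = data_i.keys() & data_j.keys()
            same = sum(1 for k in shared if abs(data_i[k] - data_j[k]) <= tol)
            # Normalising by the size of dataset i makes the result asymmetric.
            kind1[(i, j)] = (len(shared) - same) / len(data_i)
            kind2[(i, j)] = same / len(data_i)
    return kind1, kind2


# Toy input with placeholder keys instead of real InChI strings.
toy = {
    "A": {"x": -1.0, "y": -2.0, "z": -3.0, "w": -4.0},
    "B": {"x": -1.0, "y": -2.5},
}
kind1, kind2 = redundancy_fractions(toy)
```

In the toy input, half of the compounds in B occur in A with a different value, whereas only a quarter of the compounds in A occur in B with a different value, which reproduces the asymmetry caused by the different dataset sizes.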
For every compound in the dataset, the number of occurrences is determined. If the compound has a unique value or multiple values which are within 0.01 LogS units, the value is simply accepted. This step leads to the curation of 7,746 unique instances, which were assigned to group G1.
Next, for compounds with an occurrence count >2 and differing solubility values (819), we used the closest-to-the-mean method to select the value. In this method, the mean value is first calculated and then the candidate value closest to the mean is selected. If the standard deviation (SD) of the set of values was >=0.5 LogS units, we assigned the compound to group G4 (183), otherwise to G5 (636).
For compounds with exactly 2 values (1,417), the closest-to-the-mean method cannot be applied because the mean always lies midway between the two values, so both are equally close to it. For this case, we used an alternative method, closest to the reference. We selected the value closest to an external reference value, which was obtained using the solubility prediction tool ALOGPS21. ALOGPS is an open-source online solubility prediction tool that is based on artificial neural networks and has an overall root mean squared error (RMSE) of 0.49 LogS units22. If the SD of the two values was >=0.5 LogS units, we assigned the compound to group G2 (235), otherwise to G3 (1,182).
We selected 0.5 as the threshold for the degree of agreement between multiple values based on the predictive capabilities of some of the state-of-the-art models3,5,6. It must also be noted that the average SD of experimental solubility values for a given compound from different sources has been reported to be 0.5 LogS7,23. Using this threshold, the grouping of compounds into five different groups provides a credible way of assessing reliability for data-driven modeling.
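The selection and grouping rules described in this step can be summarized in a short sketch. It follows the thresholds given in this section (0.01 LogS for identical values, SD >= 0.5 LogS for the low-agreement groups), but several details are assumptions: the function name `select_value` is hypothetical, the sample standard deviation is used because the text does not state which SD definition was applied, and the reference value (obtained from ALOGPS in this work) is simply passed in as a number.

```python
import statistics


def select_value(values, reference=None, tol=0.01, sd_threshold=0.5):
    """Return (selected LogS, SD, group label) for one compound."""
    if not values:
        raise ValueError("at least one solubility value is required")
    # Sample SD is an assumption; a single value has no spread.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0

    # G1: a single value, or several values that agree within `tol`.
    if max(values) - min(values) <= tol:
        return values[0], sd, "G1"

    if len(values) == 2:
        # Two differing values: the mean is equidistant from both,
        # so the value closest to an external reference is taken.
        if reference is None:
            raise ValueError("two differing values need a reference value")
        chosen = min(values, key=lambda v: abs(v - reference))
        group = "G2" if sd >= sd_threshold else "G3"
    else:
        # Three or more values: take the one closest to the mean.
        mean = statistics.mean(values)
        chosen = min(values, key=lambda v: abs(v - mean))
        group = "G4" if sd >= sd_threshold else "G5"
    return chosen, sd, group
```

Called with two values such as `select_value([-1.0, -3.0], reference=-2.8)`, the sketch picks the value nearer the reference and labels the compound G2 because the two measurements disagree strongly.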
Step 3: Extending the dataset with 2D descriptors
The purpose of this step is to extend the information space of the compounds by adding basic topological and physico-chemical information. For this purpose, we calculated all the relevant 2D descriptors available from RDKit. The last 17 rows of Table 2 show the name, description, and data type of the 2D descriptors.
Table 2.
Column Name | Description | Type |
---|---|---|
ID | ID from source (also shows the source) | string |
Name | Name of compound | string |
InChI | The IUPAC International Chemical Identifier | string |
InChIKey | Hashed form of InChI value | string |
SMILES | SMILES representation of compound | string |
Solubility | Experimental aqueous solubility value (LogS) | float |
SD | Standard deviation of multiple occurrences | float |
Occurrences | Number of occurrences of compound | integer |
Group | Generated reliability group (G1, G2, G3, G4, G5) | string |
Mol Wt | Molecular weight | float |
Mol LogP | Octanol-water partition coefficient | float |
Mol MR | Molar refractivity | float |
Heavy Atom Count | Number of non-H atoms | integer |
Num H Acceptors | Number of H acceptors | integer |
Num H Donors | Number of H donors | integer |
Num Heteroatoms | Number of atoms not carbon or hydrogen | integer |
Num Rotatable Bonds | Number of rotatable bonds | integer |
Num Valence Electrons | Number of valence electrons | integer |
Num Aromatic Rings | Number of aromatic rings | integer |
Num Saturated Rings | Number of saturated rings | integer |
Num Aliphatic Rings | Number of aliphatic rings | integer |
Ring Count | Number of total rings | integer |
TPSA | Topological polar surface area | float |
Labute ASA | Labute’s Approximate Surface Area | float |
Balaban J | Balaban’s J index (graph index) | float |
Bertz CT | A topological complexity index of compound | float |
Data Records
AqSolDB consists of 9,982 unique compounds. AqSolDB data is stored in the comma-separated values (CSV) format and contains representations, experimental aqueous solubility and calculated 2D descriptor data of all compounds, as described in Table 2. AqSolDB is openly accessible at the Harvard Dataverse Repository24.
Technical Validation
Analysis of solubility values
Compounds can be classified according to their solubility values (LogS): compounds with a value of 0 or higher are highly soluble, those in the range of 0 to −2 are soluble, those in the range of −2 to −4 are slightly soluble, and those below −4 are insoluble. Figure 5c shows the distribution of solubility values.
As no information about experimental errors was available from the original data sources, we assessed reliability with a statistical approach. As described in the Methods section, each compound was labeled according to the selection process. Figure 5b shows the distribution of compounds across the five groups. The G1 group constitutes the largest part of the data and consists of compounds for which only a single distinct solubility value was found across all datasets. These compounds were selected directly and it is not possible to comment on their reliability. The G2 and G3 groups are composed of compounds that were found exactly twice across all datasets. Those with SD values greater than 0.5 were assigned to the G2 group and those with smaller or equal values were assigned to the G3 group. The G4 and G5 groups are composed of compounds that were found three times or more across all datasets. Following the same process, compounds with an SD greater than 0.5 were included in the G4 group and those with a smaller or equal value in the G5 group. The spread between the results of independent experiments indicates the reliability of the selected value. Because reliability increases statistically with the number of samples, it can be concluded that the G5 group is more reliable than the G3 group and that the G4 group is more reliable than the G2 group.
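The classification by LogS ranges can be written as a small helper. The text does not say to which class the boundary values −2 and −4 belong, so the sketch below assumes that each lower bound is inclusive; the function name `solubility_class` is hypothetical.

```python
def solubility_class(log_s):
    """Map a LogS value to a qualitative solubility class.

    Boundary handling (lower bounds inclusive) is an assumption.
    """
    if log_s >= 0:
        return "highly soluble"
    if log_s >= -2:
        return "soluble"
    if log_s >= -4:
        return "slightly soluble"
    return "insoluble"
```

Applied to every row of the dataset, such a helper yields class counts of the kind summarized by the distribution of solubility values.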
Usage Notes
Users are recommended to consider the group description when using the data as input to other models. The availability of the calculated 2D descriptors makes it possible to use the data directly for developing machine learning models. To create more complex representations of compounds, such as graphs or circular fingerprints, we recommend using RDKit. We provide both SMILES and InChI representations of the compounds, which are validated and can easily be converted into RDKit mol objects. Further methodological notes on data processing can be found in the Code Ocean repository25.
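As one possible way of taking the group labels into account, the sketch below reads CSV rows and keeps only selected reliability groups. The column names follow Table 2, but the exact header spelling of the released file is an assumption, the function name `load_rows` is hypothetical, and the embedded sample rows are dummy placeholders rather than AqSolDB entries.

```python
import csv
import io

# Dummy rows for illustration only; these are not AqSolDB entries.
SAMPLE_CSV = """ID,SMILES,Solubility,SD,Occurrences,Group
X-1,C,-1.0,0.0,1,G1
X-2,CC,-2.0,0.7,2,G2
X-3,CCC,-3.0,0.1,4,G5
"""


def load_rows(handle, keep_groups):
    """Read CSV rows and keep those whose Group label is in `keep_groups`."""
    rows = []
    for row in csv.DictReader(handle):
        if row["Group"] in keep_groups:
            # csv yields strings, so numeric columns are converted explicitly.
            row["Solubility"] = float(row["Solubility"])
            row["SD"] = float(row["SD"])
            rows.append(row)
    return rows


# Example choice: drop the groups with SD at or above the threshold (G2, G4).
rows = load_rows(io.StringIO(SAMPLE_CSV), keep_groups={"G1", "G3", "G5"})
```

Which groups to keep depends on the application; excluding G2 and G4 is only one option, and passing an open file handle instead of `io.StringIO` would apply the same filter to a downloaded file.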
Acknowledgements
M.C.S. and S.E. acknowledge funding from the initiative “Computational Sciences for Energy Research” of Shell and the Netherlands Organisation for Scientific Research (NWO). We acknowledge Elif Sorkun for helping in software development. Last but not least, we acknowledge researchers who generated the experimental aqueous solubility data and released them for public use.
Author Contributions
M.C.S. collected, analyzed, merged and curated the data, developed codes for these purposes and computed all additional 2D descriptors, A.K. aided in analyzing data. S.E. devised and supervised the project. All authors contributed to writing of the manuscript.
Code Availability
The reproducibility of the curation algorithm can be verified by executing the provided scripts on Code Ocean25. The code has been developed and tested using Python 3.5 on a Linux operating system and is available under the MIT license.
The RDKit cheminformatics software is freely available under the BSD licence (http://www.rdkit.org).
ALOGPS 2.1 used for reference value generation is freely available online (http://www.vcclab.org/lab/alogps/).
Competing Interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
ISA-Tab metadata is available for this paper at 10.1038/s41597-019-0151-1.
References
- 1. Huuskonen J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. Journal of Chemical Information and Computer Sciences. 2000;40:773–777. doi: 10.1021/ci9901338.
- 2. Delaney JS. ESOL: estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences. 2004;44:1000–1005. doi: 10.1021/ci034243x.
- 3. Lusci A, Pollastri G, Baldi P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling. 2013;53:1563–1575. doi: 10.1021/ci400187y.
- 4. McDonagh JL, Nath N, De Ferrari L, Van Mourik T, Mitchell JB. Uniting cheminformatics and chemical theory to predict the intrinsic aqueous solubility of crystalline druglike molecules. Journal of Chemical Information and Modeling. 2014;54:844–856. doi: 10.1021/ci4005805.
- 5. Duvenaud D, et al. Convolutional networks on graphs for learning molecular fingerprints. Advances in Neural Information Processing Systems. 2015;28:2224–2232.
- 6. Wu Z, et al. MoleculeNet: a benchmark for molecular machine learning. Chemical Science. 2018;9:513–530. doi: 10.1039/C7SC02664A.
- 7. Balakin KV, Savchuk NP, Tetko IV. In silico approaches to prediction of aqueous and DMSO solubility of drug-like compounds: trends, problems and solutions. Current Medicinal Chemistry. 2006;13:223–241. doi: 10.2174/092986706775197917.
- 8. Wang J, Hou T, Xu X. Aqueous solubility prediction based on weighted atom type counts and solvent accessible surface areas. Journal of Chemical Information and Modeling. 2009;49:571–581. doi: 10.1021/ci800406y.
- 9. Wang J, Hou T. Recent advances on aqueous solubility prediction. Combinatorial Chemistry & High Throughput Screening. 2011;14:328–338. doi: 10.2174/138620711795508331.
- 10. Weisgerber DW. Chemical abstracts service chemical registry system: history, scope, and impacts. Journal of the American Society for Information Science. 1997;48:349–360. doi: 10.1002/(SICI)1097-4571(199704)48:4<349::AID-ASI8>3.0.CO;2-W.
- 11. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences. 1988;28:31–36.
- 12. Ash S, Cline MA, Homer RW, Hurst T, Smith GB. SYBYL line notation (SLN): A versatile language for chemical structure representation. Journal of Chemical Information and Computer Sciences. 1997;37:71–79. doi: 10.1021/ci960109j.
- 13. Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D. InChI, the IUPAC international chemical identifier. Journal of Cheminformatics. 2015;7:23. doi: 10.1186/s13321-015-0068-4.
- 14. OECD. eChemPortal - The Global Portal to Information on Chemical Substances, https://www.echemportal.org/echemportal/propertysearch/addblock_input.action (2019).
- 15. US EPA. EPI Suite Data. WATERNT (Water Solubility Fragment) Program Methodology & Validation Documents, http://esc.syrres.com/interkow/Download/WaterFragmentDataFiles.zip (1995).
- 16. Raevsky OA, Grigor’ev VY, Polianczyk DE, Raevskaja OE, Dearden JC. Calculation of aqueous solubility of crystalline un-ionized organic chemicals and drugs based on structural similarity and physicochemical descriptors. Journal of Chemical Information and Modeling. 2014;54:683–691. doi: 10.1021/ci400692n.
- 17. US EPA. EPI Suite Data. WSKOWWIN Program Methodology & Validation Documents, http://esc.syrres.com/interkow/Download/WSKOWWIN_Datasets.zip (1994).
- 18. Pence HE, Williams A. ChemSpider: an online chemical information resource. Journal of Chemical Education. 2010;87:1123–1124. doi: 10.1021/ed100697w.
- 19. Jain N, Yalkowsky SH. Estimation of the aqueous solubility I: application to organic nonelectrolytes. Journal of Pharmaceutical Sciences. 2001;90:234–252. doi: 10.1002/1520-6017(200102)90:2<234::AID-JPS14>3.0.CO;2-V.
- 20. Llinas A, Glen RC, Goodman JM. Solubility challenge: can you predict solubilities of 32 molecules using a database of 100 reliable measurements? Journal of Chemical Information and Modeling. 2008;48:1289–1303. doi: 10.1021/ci800058v.
- 21. Tetko IV, et al. Virtual computational chemistry laboratory–design and description. Journal of Computer-aided Molecular Design. 2005;19:453–463. doi: 10.1007/s10822-005-8694-y.
- 22. Tetko IV, Tanchuk VY, Kasheva TN, Villa AE. Estimation of aqueous solubility of chemical compounds using E-state indices. Journal of Chemical Information and Computer Sciences. 2001;41:1488–1493. doi: 10.1021/ci000392t.
- 23. Abraham MH, Le J. The correlation and prediction of the solubility of compounds in water using an amended solvation energy relationship. Journal of Pharmaceutical Sciences. 1999;88:868–880. doi: 10.1021/js9901007.
- 24. Sorkun MC, Khetan A, Er S. 2019. Harvard Dataverse.
- 25. Sorkun MC, Khetan A, Er S. AqSolDB (Aqueous Solubility Data Curation). Code Ocean. 2019. doi: 10.24433/CO.1992938.v1.