Author manuscript; available in PMC: 2025 Feb 28.
Published in final edited form as: Phys Chem Chem Phys. 2024 Feb 28;26(9):7907–7919. doi: 10.1039/d3cp04140a

Integrating Multiscale and Machine Learning Approaches towards the SAMPL9 LogP Challenge

Michael R Draper, Asa Waterman, Jonathan Dannatt, Prajay Patel
PMCID: PMC10938873  NIHMSID: NIHMS1969837  PMID: 38376855

Abstract

The partition coefficient (logP) is an important physicochemical property which provides information regarding a molecule’s pharmacokinetics, toxicity, and bioavailability. Methods to accurately predict the partition coefficient have the potential to accelerate drug design. In an effort to test current methods and explore new computational techniques, the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) has established a blind prediction challenge. The ninth iteration challenge was to predict the toluene-water partition coefficient (logPtol/w) of sixteen drug molecules. Herein, three approaches are reported broadly under the categories of quantum mechanics (QM), molecular mechanics (MM), and data-driven machine learning (ML). The three blind submissions yield mean unsigned errors (MUE) ranging from 1.53-2.93 logPtol/w units. The MUEs were reduced to 1.00 logPtol/w for the QM methods. While MM and ML methods outperformed DFT approaches for challenge molecules with fewer rotational degrees of freedom, they suffered for the larger molecules in this dataset. Overall, DFT functionals paired with a triple-ζ basis set were the simplest and most effective tool to obtain quantitatively accurate partition coefficients.


Introduction

In the pharmaceutical and medicinal industries, the partition coefficient (P), which represents the ratio of the concentrations of an un-ionized solute between two immiscible phases, is an extremely important physical parameter. This coefficient, usually measured between octanol and water (logPo/w), is a tool applied in lead candidate optimization. Lipinski’s Rule of 5,1 aimed at identifying orally viable drug candidates, states that an ideal candidate will have a logPo/w < 5. While there are clear classes of drugs outside the parameters set by the Rule of 5,2 these guidelines are an excellent starting place to help locate molecules that strike the balance between solubility and permeability. Molecules with a logPo/w between 1.35 and 1.80 generally have good oral and intestinal absorption, and drugs with a logPo/w around 2.00 are lipophilic enough to enter the central nervous system. If a drug is designed to act outside the central nervous system, a logPo/w around 2.00 should be avoided to prevent undesirable side effects.3 In this context, rational drug design based on thermodynamic properties is critical to the development and success of drug candidates, since only about 10% of drug candidates that enter Phase 1 clinical trials are ultimately approved by the U.S. Food and Drug Administration.4

Traditionally, drug candidates are found via high-throughput screening (HTS) in which a library of up to a few million molecules is screened for a desired biological activity.5,6 Hits from this screen provide a general starting motif to elaborate, and chemical analogs are produced to optimize the desired characteristics.7 This resource-intensive process can take several years. Moreover, the original molecular library commonly covers only a relatively small sector of chemical space.8 Computational advances in both storage capacity and processing speed have allowed the generation of virtual chemical libraries on the order of billions of entries.9,10 While this does cover a larger swath of chemical space, many of these molecules are synthetically challenging to access. In response to this limitation, virtual libraries that contain make-on-demand molecules, which can be synthesized at an ~80% success rate, have been generated.11 While a larger percentage of chemical space is now accessible, techniques to rapidly identify targets are necessary, as seen with efforts to screen therapeutic agents for COVID-19.12-15 Once a subset of potential targets has been identified, rigorous computational methods can be applied advantageously to narrow the subset further. Thus, computational methods that achieve a sufficient level of accuracy and reliability when predicting thermodynamic properties for which experimental values have not been measured are desirable to expedite drug screening efforts.16-19 One avenue that focuses on this idea is the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) challenges. These challenges are an efficient way to identify and analyze new computational approaches for drug discovery through predicting various properties associated with pharmaceutical chemistry such as host-guest binding, protein-ligand interactions, and thermodynamic properties, e.g., pKa and logP.20-22 Because the challenges are blind, they remove methodological choices that may bias the results and thereby misrepresent the true predictive ability of a computational method. In the ninth iteration, the challenge was divided into multiple parts including host-guest challenges, a protein-ligand challenge on nanoluciferase, and a toluene-water logP (logPtol/w) challenge.

For the SAMPL challenges in general, methods submitted as part of the blind portion are classified into the following categories: empirical (data-driven methods, e.g., machine learning), physical quantum mechanics (QM)-based, physical molecular mechanics (MM)-based, and mixed (a combination of the other categories). The SAMPL9 logP prediction challenge focused on the toluene-water partition coefficient (logPtol/w), and most submissions followed nonequilibrium alchemical approaches23 and/or empirical/machine learning approaches.24

Machine learning has become increasingly prevalent in chemistry, particularly in cheminformatics, where it is applied to drug screening and conformational searches.25,26 Pertaining to the SAMPL competitions, neural network approaches have been developed as tools for understanding quantitative structure-property relationships (QSPR).27,28 For SAMPL6, a deep learning approach with five hidden layers using extended connectivity fingerprints was constructed to predict the octanol-water partition coefficient (logPo/w).27 The neural network was trained on 14176 data points to predict the logPo/w for 11 kinase-fragment molecules, which were well-represented in the training data, with a mean absolute error (MAE) of 0.51 logPo/w units. As part of SAMPL7, Donyapour and Dickson used a multilayer perceptron (MLP) neural network to train their Geometric Scattering for Graphs models. This neural network approach achieved an MAE of 0.44 logPo/w units for a series of drug-like molecules with a sulfonyl moiety.28,29 These methods showcase how the larger datasets available for logPo/w can be used to train models that effectively predict logPo/w for a series of drug-like molecules with similar scaffolds and moieties.

When characterizing 3D geometries, subtle structural changes, like a rotation or elongation of a bond, impact predicted thermodynamic and kinetic properties; hence the recommendation to represent a structure with a thermodynamic ensemble. For example, Pinheiro et al. created a database called WS22, which includes 1.18 million equilibrium geometries to obtain a more complete picture of the intrinsic high dimensionality of potential energy surfaces for ten flexible organic molecules through Wigner sampling, geometry interpolation, and density functional theory (DFT).30 Principal component analysis (PCA) of all structures yielded clusters of the considered conformers. Additionally, in surface organometallic chemistry, a study of active site heterogeneity, e.g., metal-siloxane bond lengths, in organovanadium(III) catalysts grafted onto amorphous silica supports showed that elongating the bond lowers the activation barrier, whereas a standard QM geometry optimization yielded the largest activation barriers for styrene hydrogenation.31 These works showcase why QM, MM, and data-driven approaches need to be used together in multiscale models of an ensemble of molecules: a distribution of structures leads to a more robust description of the chemical environment and, ideally, to more accurate descriptions of macroscopic thermochemical properties.

Sixteen drug molecules were selected for the SAMPL9 logPtol/w challenge and are labeled 1-16 as shown in Figure 1. Each of the sixteen molecules falls within the parameters set by the Rule of 5, with a spread of experimentally determined logPo/w from −1.37 for epinephrine (6) to 4.92 for amitriptyline (3). These molecules hit a broad array of biological targets. As such, they have an extensive range of effects, from anthelmintic properties in albendazole (1) to antibiotic properties in nalidixic acid (11) and sulfamethazine (15) to antidepressant properties in amitriptyline (3), imipramine (9), and trazodone (16). Structurally, this set of molecules has a wide range of expected polarities, from highly polar epinephrine (6) to non-polar amitriptyline (3). In addition, there is a wide range of flexibility within the set, from fairly rigid molecules with only a few likely conformers in nalidixic acid (11) and paracetamol (12) to highly flexible systems with many energetically similar conformers in glyburide (8) and trazodone (16). Each molecule in the set has at least one aromatic ring, with the most being four aromatic rings in bifonazole (4). While the number of aromatic rings plays an important role in drug design,32,33 for this work, it affects the degree and type of noncovalent interactions available between the drug and toluene, thus affecting the logPtol/w values. The experimental measurements of the logPtol/w values are described by Zamora et al.24

Figure 1.


2D structures of the sixteen SAMPL9 challenge molecules.

This study explores how to leverage QM, MM, and data-driven methods to construct multiscale approaches for predicting thermodynamic properties like logPtol/w. Three thrusts of investigation are used to predict the logPtol/w: physical QM via both wavefunction-based and density-based methods, physical MM using alchemical approaches, and mixed empirical/physical multiscale models integrating QM, MM, and unsupervised machine learning. Finally, in line with our research philosophy, computational methodologies that are readily accessible were explored. Specifically, this means that 1) the computational space could be explored on high-performance PC workstations, which are increasingly cost-effective instruments, and 2) the methods used are readily applied given freely available online resources (see ESI).

Methods

All work for the results submitted as part of the SAMPL9 challenge was done on two Dell Precision 5820 Tower Workstations running Windows 11 and housed on-site. A vertical solvation method has been shown to be effective for predicting logPo/w as part of the SAMPL534 and SAMPL6 competitions when combined with more rigorous quantum mechanical methods, e.g., DLPNO-CCSD(T), or ab initio composite approaches, e.g., the correlation consistent Composite Approach (ccCA).35,36 The vertical solvation method is shown as Equation 1,

$$\log P_{\mathrm{tol/w}} = \frac{\Delta G^{\circ}_{\mathrm{water}} - \Delta G^{\circ}_{\mathrm{tol}}}{\ln(10)\,RT} \qquad (1)$$

where R is the gas constant, T is the temperature, and ΔG°water and ΔG°tol are the free energies of solvation in water and toluene, respectively. Equation 1 is used to compute the logPtol/w for all approaches.
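As a minimal sketch of how Equation 1 can be evaluated, the snippet below converts a pair of solvation free energies into a logPtol/w value. The function name, the choice of kcal/mol as the energy unit, and the example free energies are illustrative assumptions and are not values from this work.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption for this sketch.
R_KCAL = 1.987204e-3


def log_p_tol_w(dg_water, dg_tol, temperature=298.15):
    """Evaluate Equation 1 from solvation free energies given in kcal/mol."""
    return (dg_water - dg_tol) / (math.log(10) * R_KCAL * temperature)


# Hypothetical solvation free energies (kcal/mol), for illustration only.
example_log_p = log_p_tol_w(dg_water=-8.0, dg_tol=-12.0)
print(f"logP(tol/w) = {example_log_p:.2f}")
```

At 298.15 K the denominator is about 1.36 kcal/mol, so each 1.36 kcal/mol of additional stabilization in toluene relative to water raises logPtol/w by roughly one unit.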

Quantum Mechanics (QM)

Geometry Optimization.

ORCA 5.0.3 was used for these electronic structure calculations.37 The initial 3D coordinates were generated with the Gen3D38 operation in Open Babel from the SMILES strings provided for molecules 1-16.39 A conformer scan using Open Babel was not done at this stage in favor of modeling explicit solute-solvent interactions and the effect of explicit solvent on the conformational space. Table S1 contains the name, label, and SMILES string of each drug molecule. The single-conformer structures were then optimized at the B3LYP-D3BJ/def2-SVP level of theory in the solvent phase using the SMD implicit solvent model for water and toluene independently.40-43 Gas-phase structures were also considered but were not part of the submission. Grimme’s D3 atom-pairwise dispersion corrections with Becke-Johnson damping (D3BJ) were included to account for long-range intramolecular interactions and the presence of polycyclic aromatic rings.44 All free energy corrections were computed at 298.15 K with a vibrational scaling factor of 1.0044 to account for anharmonicity.45 All optimized structures were verified to be local minima with no imaginary frequencies.

Blind Submission.

The optimized structures in the water and toluene phases were used with DLPNO-CCSD(T)/def2-SVP energies to compute the logPtol/w using a vertical solvation method (Equation 1).41,46 The domain-based local pair natural orbital (DLPNO) methods are desirable methods for describing organic, organometallic, and supramolecular systems of increasing size (30+ atoms) with a wavefunction-based approach.46 The resolution-of-the-identity chain-of-spheres exchange (RIJCOSX)47 approximation has been shown to be more efficient than the canonical resolution-of-the-identity (RI) approximation for larger molecules and is used along with the appropriate auxiliary basis sets (def2-SVP/J and def2-SVP/C).48,49 Results in Table 1 (QM) refer to this method.

Table 1.

Computed logPtol/w values for all challenge molecules from all blind submissions with the mean unsigned error (MUE), root mean square deviation (RMSD), and the standard deviation (STDEV) of the unsigned errors between each submission and the experimental values.

| Label | QM (a) | MM (b) | Mixed (c) | Experiment (ref. 24) |
|---|---|---|---|---|
| 1 | 4.23 | 3.38±0.16 | 3.36±0.42 | 3.76 |
| 2 | 4.20 | 3.94±1.33 | 1.75±0.31 | 2.40 |
| 3 | 6.64 | 6.04±0.20 | 7.03±0.74 | 5.51 |
| 4 | 8.36 | 1.71±0.11 | 5.97±0.92 | 5.47 |
| 5 | 6.52 | 2.96±0.14 | 3.99±1.18 | 3.61 |
| 6 | −3.86 | −2.94±0.44 | −4.01±1.08 | −1.23 |
| 7 | 4.23 | 0.64±1.98 | 2.39±1.10 | 4.37 |
| 8 | 1.06 | 8.85±0.62 | −2.22±1.91 | 2.79 |
| 9 | 6.84 | 5.57±0.27 | 6.16±0.99 | 5.05 |
| 10 | 2.20 | 1.17±0.33 | 3.04±0.96 | 2.47 |
| 11 | 1.21 | 5.30±0.30 | 6.63±0.43 | 1.46 |
| 12 | −1.27 | −0.97±0.08 | −2.10±0.82 | −1.59 |
| 13 | 1.40 | 1.56±0.58 | 1.12±1.16 | 0.36 |
| 14 | 1.88 | 6.33±1.02 | 0.52±0.73 | 1.41 |
| 15 | 2.51 | 9.07±0.18 | 9.62±0.93 | −0.74 |
| 16 | 7.58 | 5.31±1.82 | 0.99±1.90 | 3.77 |
| MUE | 1.56 | 2.63 | 2.21 | |
| RMSD | 1.96 | 3.64 | 3.40 | |
| STDEV | 1.23 | 2.60 | 2.67 | |

(a) QM method corresponds to DLPNO-CCSD(T)/def2-SVP (logP_QM).

(b) MM method corresponds to the free energy perturbation alchemy method (logP_MD).

(c) Mixed corresponds to the histogram clustering approach (logP_Mixed). MUE = mean unsigned error; RMSD = root mean square deviation; STDEV = standard deviation.

DFT and DLPNO-Solv-ccCA.

The Stampede2 supercomputer was used to perform subsequent calculations using density functional theory (DFT) with the correlation consistent basis sets50 and DLPNO-Solv-ccCA,35 a variant of the correlation consistent Composite Approach (ccCA)51-53 that targets solvated properties for larger organic molecules, with ORCA 5.0.3 to widen the scope of investigation with more available resources. The recommended cc-pV(n+d)Z basis sets were used for all S and Cl atoms.54 See ESI for detailed information on choices for DFT and on ccCA, which has been extensively described in the literature.51-53

Conformation Search.

The Spartan ‘20 program55,56 was used to perform a molecular mechanics (MM)-based (Merck Molecular Force Field57) conformational search of each molecule to screen up to 500 rotational conformers within approximately 10 kcal/mol. Three weighting schemes based on DFT energies were considered for the generated rotational conformers: an equal weighting amongst all conformations; a Boltzmann-weighted scheme, i.e., conformations are weighted based on their energies; and an RMSD-weighted scheme, i.e., conformations are weighted based on their root mean square deviation (RMSD) from the DFT-optimized structure in each solvent. This was done to complement the unsupervised machine learning methods that use a larger set of conformations. Screening various conformations allows the examination of thermodynamic ensembles, which are suggested for predicting thermodynamic properties when using methods like DFT.58 This method will be denoted as the S20 method in subsequent analysis.
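The equal and Boltzmann weighting schemes can be illustrated with a short sketch. It assumes relative conformer energies in kcal/mol and a per-conformer property to average; the function names and the numerical inputs are hypothetical, and the RMSD-weighted scheme is omitted because its functional form is not specified here.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def equal_weights(n_conformers):
    """Every conformer contributes equally."""
    return [1.0 / n_conformers] * n_conformers


def boltzmann_weights(energies, temperature=298.15):
    """Weights proportional to exp(-E/RT), with energies in kcal/mol."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_average(values, weights):
    """Ensemble average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical conformer energies (kcal/mol) and per-conformer logP values.
energies = [0.0, 0.5, 2.0]
log_p_values = [3.1, 2.7, 4.0]
weights = boltzmann_weights(energies)
print(weighted_average(log_p_values, weights))
print(weighted_average(log_p_values, equal_weights(len(log_p_values))))
```

With these inputs the Boltzmann average is dominated by the two low-energy conformers, whereas the equal weighting gives the 2.0 kcal/mol conformer the same influence as the global minimum.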

Molecular Mechanics (MM)

Simulation Parameters.

All molecular dynamics (MD) production runs were done using GROMACS 2021.2.59 The TIP4P water model60,61 was used for water in a 3 nm cubic box, since the solvation free energy shows no dependence on the box size.62 For toluene, the density of the solvent was used to determine the number of solvent molecules, which yielded approximately 150 toluene molecules for the size of the box; the appropriate toluene force field was obtained via the CHARMM-GUI63 when generating simulation boxes. Force fields for the challenge molecules were generated using the CHARMM General Force Field (CGenFF)64,65 with the DFT-optimized structures as input structures. All simulations were set to an equilibrium temperature of 300 K and a pressure of 1 atm.
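The count of roughly 150 toluene molecules can be checked with a back-of-the-envelope calculation. The sketch below assumes an approximate literature density for liquid toluene of 0.867 g/cm³ and a molar mass of 92.14 g/mol; both numbers, and the function name, are assumptions introduced for this illustration.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_in_cubic_box(edge_nm, density_g_per_cm3, molar_mass_g_per_mol):
    """Number of molecules of a pure liquid that fill a cubic box."""
    volume_cm3 = (edge_nm * 1e-7) ** 3  # 1 nm = 1e-7 cm
    mass_g = density_g_per_cm3 * volume_cm3
    return mass_g * AVOGADRO / molar_mass_g_per_mol


# Assumed toluene properties: density ~0.867 g/cm^3, molar mass 92.14 g/mol.
n_toluene = molecules_in_cubic_box(3.0, 0.867, 92.14)
print(round(n_toluene))
```

Under these assumptions a 3 nm cubic box holds about 153 molecules, consistent with the approximate value quoted above.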

Blind Submission.

Alchemical solvation free energies in water and toluene were computed with the accelerated weight histogram (AWH) method. The AWH method is an enhanced sampling technique that explores the free energy landscape via an adaptive bias to improve sampling efficiency.66 A coupling parameter λ takes finely spaced values from a discrete set and describes the van der Waals and Coulomb interactions along the alchemical path connecting the end states of the solute molecule in vacuum and in solution. While these interactions can be treated independently, the blind submission used a 13-point λ path (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, and 1.0) that couples both types of interactions simultaneously, with one walker given the size of the system. A potential bias is applied by controlling the position of a harmonic potential using Monte Carlo umbrella sampling. Simulations were initially run for 1.4 ns, and uncertainties in the free energies were obtained by running each simulation in triplicate. Results in the MM column in Table 1 correspond to this method.

Alchemical Methods.

Simulation parameters were reconsidered after submission to examine which variables are more reliable for quantitatively accurate predictions. While the box size and type have negligible effects on the solvation free energy, the parameters reconsidered include using more walkers, longer and consistent simulation times for all systems, and changing the van der Waals and Coulomb interactions independently along a longer λ path (see ESI). These parameters were shown to increase the robustness of the AWH method in the original publication, which modeled the solvation free energy of ethanol in water. These simulations ran for 5 ns with a timestep of 2 fs in quintuplicate to generate error estimates. This method will be denoted as the MM-1 method.

Data Science and Machine Learning

Simulation Parameters.

MD simulations in GROMACS59 were used to scan the conformational space within a solvated environment, as opposed to using ab initio molecular dynamics (AIMD) with a smaller number of solvent molecules as AIMD has a significantly higher computational cost (memory, disk space, CPU time) than conventional MD. The same parameters and force fields for creating the solvated environment for the physical MM submission were used for the Mixed submission. For all single point calculations following the simulation, non-equilibrium structures were used to prevent constrained optimizations to a local minimum from affecting the features that distinguish rotamers. Free energy corrections for the sampled non-equilibrium structures were calculated at the RIJCOSX-B3LYP-D3/def2-SVP level of theory with the same vibrational scaling factor of 1.0044.

Blind Submission.

The MD simulations ran for 4 ns with a timestep of 4 fs, and snapshots of the trajectory were collected every 2000 steps after an equilibration period of 4 ps. The logPtol/w was calculated for the matrix of 499 frames in water and 499 frames in toluene. The structures were not optimized because the rotational and vibrational motions present did not exceed the allowed geometric displacements, assessed quantitatively via an RMSD of 1.5 Å from an optimized structure and qualitatively via chemical intuition; exceeding both metrics implies instability. An example of such a geometric displacement for aromatic rings would be ring puckering, albeit rather small. The presence of multiple non-equilibrium rotamers from the MD simulation leads to the formation of a thermodynamic ensemble that can be used to characterize the range of structures present in solution at any given time. Pairing the non-optimized structures in a 499x499 grid created a distribution of 249001 logPtol/w values at the B97-3c level of theory. B97-3c was chosen as a low-cost method for accurate thermochemistry.67 Five pairs in the logPtol/w matrix that were closest to the mean of the identified distributions, i.e., histogram clusters, were included for calculating ΔG°solv at the RI-B3LYP/def2-TZVP level of theory. The RIJCOSX approximation was used along with the appropriate auxiliary basis sets (def2/J, def2-TZVP/C). The uncertainties are generated as the inverse weighting of the uncertainty based on the standard deviation of each distribution. The weighting of the histogram clusters was based on the number of computed logPtol/w values that comprised each statistically different distribution. This histogram cluster approach is the basis of the Mixed column in Table 1 and was chosen as the ranked submission.
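A sketch of how the 499x499 grid might be assembled is given below. It assumes that each frame carries a single free energy in its own solvent and that every water frame is paired with every toluene frame through Equation 1; the randomly generated energies are placeholders and do not represent B97-3c results.

```python
import numpy as np

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def log_p_grid(dg_water, dg_tol, temperature=298.15):
    """Pair every water frame with every toluene frame via Equation 1."""
    dg_water = np.asarray(dg_water, dtype=float)
    dg_tol = np.asarray(dg_tol, dtype=float)
    # Broadcasting builds an (n_water, n_tol) matrix of free energy differences.
    delta = dg_water[:, None] - dg_tol[None, :]
    return delta / (np.log(10) * R_KCAL * temperature)


# Placeholder per-frame free energies (kcal/mol); not data from this work.
rng = np.random.default_rng(0)
frames_water = rng.normal(-10.0, 2.0, size=499)
frames_tol = rng.normal(-12.0, 2.0, size=499)

grid = log_p_grid(frames_water, frames_tol)
counts, bin_edges = np.histogram(grid.ravel(), bins=50)
print(grid.shape, grid.size)
```

The flattened grid is what would be histogrammed to identify clusters; the bin count of 50 is an arbitrary choice for the sketch.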

Unsupervised Machine Learning.

Expanding on the simulation parameters from the histogram cluster approach, the timestep was decreased from 4 fs to 1 fs and the simulation time was increased to 10 ns. With snapshots collected every 5 ps, this allows more structures to be included in the conformational search. The challenge molecules were extracted from the frames of the MD trajectory and converted for use with the RDKit and scikit-learn68 Python toolkits with the xyz2mol code.69 In total, 5000 structures were generated for each molecule in water and toluene. The lower triangle of the pairwise nucleus-nucleus distance matrix was the descriptor used to characterize the 3D representation of the molecule. The premise of using this matrix descriptor is to correlate the interatomic distances of all atoms to identify clusters of conformations.
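The distance descriptor can be sketched in a few lines of NumPy. The function name and the three-atom test geometry are hypothetical; the sketch only shows how the strictly lower triangle of the distance matrix flattens into N(N-1)/2 features per structure.

```python
import numpy as np


def distance_descriptor(coords):
    """Flattened strictly lower triangle of the pairwise distance matrix."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    rows, cols = np.tril_indices(len(coords), k=-1)  # excludes the diagonal
    return distances[rows, cols]


# Hypothetical three-atom geometry (angstroms), for illustration only.
toy_coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_descriptor = distance_descriptor(toy_coords)
print(toy_descriptor)
```

Because the descriptor depends only on interatomic distances, it is unchanged by overall translation and rotation of the molecule, so frames do not need to be aligned before clustering.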

Given the high dimensionality of the data (N(N-1)/2 features, where N is the number of atoms in the molecule), the data was standardized using the standard scaler, and principal component analysis (PCA) was then used for pre-processing to reduce the data to three principal components. A Gaussian Mixture Model was used to identify unique clusters. The optimal number of clusters was calibrated via elbow curves and by maximizing the silhouette score when using the K-means clustering algorithm. The number of clusters predicted with both the K-means clusters and the Gaussian Mixture Models was verified via five-fold cross validation using varying sizes of training/test splits (see ESI for details). Using the Gaussian Mixture Models on the reduced data served as an indicator of the larger structural changes, based on identifying variances in the interatomic distances between flexible ligands and the more rigid carbon-based backbones or bond rotations that cause significant structural changes. Because an ML model is generated for each molecule in each solvent, the population of these clusters correlates to the relative population of the conformations present in solution. The 3D shape of the distribution within each cluster approximates the shape of the potential energy well for that local minimum, i.e., the densest regions of the cluster correspond to the structures closer to the local minimum of that potential energy well. For each proposed cluster, the five data points with the smallest distance to the cluster mean were chosen as representative structures to sample the cluster. After verification of the rotamers via the identified structures, the logPtol/w was computed using the density functional and basis set combination that yielded the lowest mean unsigned error (MUE) with respect to the experimental logPtol/w values. This method is referred to as the ML method.
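The clustering workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: the function name, the random seed, and the synthetic two-cluster descriptor matrix are hypothetical, and the number of clusters is passed in directly instead of being calibrated with elbow curves, silhouette scores, or cross validation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def cluster_conformers(descriptors, n_clusters, n_representatives=5, seed=0):
    """Standardize, reduce to three principal components, and fit a GMM."""
    scaled = StandardScaler().fit_transform(descriptors)
    reduced = PCA(n_components=3, random_state=seed).fit_transform(scaled)
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(reduced)
    labels = gmm.predict(reduced)
    # Fraction of structures assigned to each cluster.
    populations = np.bincount(labels, minlength=n_clusters) / len(labels)
    representatives = {}
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        dist = np.linalg.norm(reduced[members] - gmm.means_[k], axis=1)
        # Indices of the structures closest to the cluster mean.
        representatives[k] = members[np.argsort(dist)[:n_representatives]]
    return labels, populations, representatives


# Synthetic descriptor matrix with two well-separated groups (illustration only).
rng = np.random.default_rng(0)
synthetic = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 10)),
    rng.normal(10.0, 1.0, size=(100, 10)),
])
labels, populations, representatives = cluster_conformers(synthetic, n_clusters=2)
print(populations)
```

The returned populations play the role of the relative conformer populations in solution, and the representative indices identify which frames would be carried forward to the DFT calculations.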

Results and Discussion

The results are divided into the blind submissions and the post-submission analysis done after the experimental results were made available. The results submitted as part of the SAMPL9 competition are shown in Table 1. The formal analysis for the competition was the transfer free energies, which are different from the experimental logPtol/w values reported, and the statistics reported here reflect comparisons to the experimental logPtol/w values. The mean unsigned error (MUE), root mean square deviation (RMSD), and the standard deviation (STDEV) of the unsigned errors between the calculated and experimental logPtol/w values were used to gauge the effectiveness of each method as a predictive tool for logPtol/w.
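The three statistics can be reproduced from the tabulated values. The sketch below uses the QM and Experiment columns of Table 1; the function name is illustrative, and the use of the sample (n − 1) standard deviation is an assumption, chosen because it reproduces the tabulated STDEV.

```python
import math
import statistics


def error_statistics(calculated, experimental):
    """MUE, RMSD, and sample standard deviation of the unsigned errors."""
    unsigned = [abs(c - e) for c, e in zip(calculated, experimental)]
    mue = statistics.fmean(unsigned)
    rmsd = math.sqrt(statistics.fmean([u * u for u in unsigned]))
    stdev = statistics.stdev(unsigned)  # n - 1 denominator (assumed)
    return mue, rmsd, stdev


# QM and Experiment columns of Table 1, molecules 1-16.
qm = [4.23, 4.20, 6.64, 8.36, 6.52, -3.86, 4.23, 1.06,
      6.84, 2.20, 1.21, -1.27, 1.40, 1.88, 2.51, 7.58]
experiment = [3.76, 2.40, 5.51, 5.47, 3.61, -1.23, 4.37, 2.79,
              5.05, 2.47, 1.46, -1.59, 0.36, 1.41, -0.74, 3.77]

mue, rmsd, stdev = error_statistics(qm, experiment)
print(f"MUE = {mue:.2f}, RMSD = {rmsd:.2f}, STDEV = {stdev:.2f}")
```

Rounded to two decimals, these inputs give the MUE, RMSD, and STDEV listed for the QM column of Table 1.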

Blind Submission Analysis

Among the submitted methods, the physical QM method (DLPNO-CCSD(T)/def2-SVP) and the physical MM method (AWH 1.4 ns) generated the lowest and highest MUE of 1.56 and 2.63 logPtol/w units, respectively. Compound 15 was predicted to be lipophilic with positive logPtol/w values, whereas the experimental results indicated a logPtol/w value of −0.74. Among all submissions, 15 yielded the largest deviation from experiment. The CGenFF force field generated for 15 may be a contributor, as the largest errors arise from the structures used in the physical MM and mixed method submissions, implying that the 3D structures generated via the MD simulations of 15 heavily favored toluene over TIP4P water. The generation of structures through MD, while a quick way to generate rotamers at 300 K, scans too wide a region of the potential energy surface of each molecule to provide meaningful insight into the subtle changes in the 3D structure that lead to enhanced thermodynamic stability in a particular solvent.

The physical MM approach used a 13-point λ path that simultaneously coupled both van der Waals and Coulomb interactions. The sampling of 8 in both water and toluene to compute the respective solvation free energy led to an overestimation by 5.01 logPtol/w units. These particular settings for the λ path led to the same qualitative assessments of which challenge molecules had increased thermodynamic stability in their respective solvents, indicating that the MM methods are qualitatively consistent with the more quantitatively accurate QM methods.

The logPtol/w histograms (Figure 2) used 499 structures extracted from classical MD simulations in each solvent to generate a histogram containing 249001 logPtol/w values. The wide range exhibited showcases how the ensemble of structures generated through MD simulations can strongly favor either solvent depending solely on the 3D geometry. Note that the distributions are not symmetric since the extracted structures were generated in each solvent. As shown in Figure 2 for 14, the three clusters were centered at −44±7, 0±7, and 43±7 logPtol/w units and were weighted at 0.128, 0.720, and 0.152, respectively, when computing the weighted average. The standard deviation of ±7 for these clusters indicates how significantly the rotational flexibility of these molecules affects the relative stability and favorability in a particular solvent. The seemingly unphysical range of calculated logP values can be accounted for by the choice of method (B97-3c), which relies on semiempirical corrections to achieve lower errors for thermodynamic properties of molecules. Using 14 as an example, the near-symmetry of the trimodal distribution led to an effective error cancellation when treated as a linear combination weighted by population. The weighting of all the distributions used for the Mixed submission is shown in Table S9. Based on the results in Table 1 and the presence of multiple distinct distributions for 1, 11, 14, and 15, there is no conclusive trend between the presence of multiple near-symmetric distributions and quantitative accuracy for logPtol/w. These distributions also suggest that there are both structural and electronic features of the challenge molecules that increase the hydrophilic or lipophilic properties.
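The error cancellation for 14 can be made concrete with the cluster centers and weights quoted above. The short sketch below only combines the B97-3c cluster centers; it is not expected to reproduce the Mixed entry in Table 1, which was built from higher-level energies of representative pairs.

```python
# Cluster centers (logP units) and population weights for 14, as quoted in the text.
centers = [-44.0, 0.0, 43.0]
weights = [0.128, 0.720, 0.152]

# Population-weighted linear combination of the cluster centers.
weighted_log_p = sum(c * w for c, w in zip(centers, weights))
print(f"weighted logP = {weighted_log_p:.3f}")
```

The two outer clusters contribute about −5.6 and +6.5 logP units, which nearly cancel and leave a weighted value below one logP unit despite cluster centers more than 40 units from zero.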

Figure 2.


logPtol/w histogram containing 249001 logPtol/w values for 14. The populations of the three clusters were used as weights in the weighted average of logPtol/w.

Post-submission Analysis

After the experimental results were published, additional computations were done to enhance the understanding of logPtol/w predictions with various QM, MM, and ML methods, further categorized into single conformer and ensemble-based methods.

Single conformer QM methods.

Using the optimized structures from the blind submission, additional QM methods were examined. The DLPNO-Solv-ccCA method, which approximates DLPNO-CCSD(T)/aug-cc-pCV∞Z, yielded a MUE of 1.19 logPtol/w units. This is an improvement to the blind submissions by at least 0.37 logPtol/w units in the case of DLPNO-CCSD(T)/def2-SVP (Table 2). The improvement is most likely due to the inclusion of additional corrections built into the DLPNO-Solv-ccCA methodology that enhance the description of the electronic energy beyond a DLPNO-CCSD(T) single point with a smaller basis set such as def2-SVP. These corrections include the use of higher-ζ correlation consistent basis sets for a reference energy, core-valence interactions, scalar relativistic corrections, and the zero-point vibrational energy.

Table 2.

Calculated logPtol/w values with DLPNO-Solv-ccCA and PBE/cc-pV∞Z with the mean unsigned error (MUE), root mean square deviation (RMSD), and the standard deviation (STDEV) of the unsigned errors between each method and the experimental values.

| Label | DLPNO-Solv-ccCA | PBE/cc-pV∞Z (a) | Experiment |
|---|---|---|---|
| 1 | 3.95 | 4.22 | 3.76 |
| 2 | 3.99 | 4.03 | 2.40 |
| 3 | 6.74 | 6.24 | 5.51 |
| 4 | 7.76 | 7.72 | 5.47 |
| 5 | 6.42 | 6.69 | 3.61 |
| 6 | −3.73 | −2.85 | −1.23 |
| 7 | 4.45 | 4.19 | 4.37 |
| 8 | 0.88 | 1.35 | 2.79 |
| 9 | 6.72 | 7.25 | 5.05 |
| 10 | 2.51 | 2.30 | 2.47 |
| 11 | 1.69 | 1.22 | 1.46 |
| 12 | −1.10 | −1.43 | −1.59 |
| 13 | 1.28 | 1.25 | 0.36 |
| 14 | 1.58 | 0.82 | 1.41 |
| 15 | 0.67 | −0.83 | −0.74 |
| 16 | 5.25 | 4.00 | 3.77 |
| MUE | 1.19 | 1.00 | |
| RMSD | 1.48 | 1.34 | |
| STDEV | 0.89 | 0.90 | |

(a) Other density functionals were examined and are shown in the ESI. This column reports the functional and basis set combination that yielded the lowest MUE of those examined.

A scan of fifteen density functionals from five developer families with three basis sets showed that PBE/cc-pV∞Z yielded the lowest MUE of 1.00 logPtol/w units (more details in ESI). The analysis of basis sets of increasing ζ-level for DFT functionals is also consistent with the improvement observed from DLPNO-CCSD(T)/def2-SVP to DLPNO-Solv-ccCA. This analysis showed that, per molecule, the mean signed errors between the calculated and experimental logPtol/w values across all functionals were highest when using cc-pVDZ versus cc-pVTZ and cc-pVQZ (Figure S2). While hybrid functionals like B3LYP or PBE0 are sufficient for DFT prediction of octanol-water partition coefficients (logPo/w), the local GGA functionals tended to predict the logPtol/w values more accurately by approximately 0.3 logPtol/w units when considering basis sets of higher quality, i.e., cc-pVTZ and cc-pVQZ. This trend in the performance of GGA functionals also holds for 15 and 16, where DLPNO-Solv-ccCA yielded errors of 1.41 and 1.48 logPtol/w units, respectively, and the PBE/cc-pV∞Z method yielded errors of 0.09 and 0.23 logPtol/w units, respectively. These observations suggest that the inclusion of exact exchange, whether as a parameter in the functional or in the context of ab initio methods, negatively impacts the prediction of logPtol/w when paired with triple- or quadruple-ζ basis sets.

Figure 3 shows the correlation plots between the calculated and experimental logPtol/w values for the QM methods shown in Tables 1 and 2. All QM methods tended to exaggerate the experimentally observed character of the challenge molecules: molecules measured to be lipophilic were predicted to be too lipophilic, and molecules measured to be hydrophilic were predicted to be too hydrophilic. This is indicated by the slopes of the linear trendlines being 0.597, 0.691, and 0.680 for Figure 3a, 3b, and 3c, respectively. With correlation coefficients ranging from 0.75 to 0.88, a linear correlation is suitable to describe the trend between calculated and experimental logPtol/w values. This would indicate that, while the majority of the challenge molecules chosen are lipophilic in general, the SMD implicit model parameters used to describe toluene may overcorrect the nonpolar solute-solvent interactions, as well as the noncovalent interactions between the aromatic rings within the challenge molecules and the nonpolar toluene solvent, relative to the parameters that are more aligned for long-range solute-solvent interactions with water.
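The trendline for one panel can be reproduced from the Table 2 data with an ordinary least-squares fit. The sketch below is illustrative only: the axis assignment (calculated values on x, experimental values on y) is an assumption, adopted because it gives a slope close to the 0.680 quoted for the PBE/cc-pV∞Z panel, and the function name is hypothetical.

```python
# logPtol/w values for molecules 1-16, transcribed from Table 2.
experiment = [3.76, 2.40, 5.51, 5.47, 3.61, -1.23, 4.37, 2.79,
              5.05, 2.47, 1.46, -1.59, 0.36, 1.41, -0.74, 3.77]
pbe_cbs = [4.22, 4.03, 6.24, 7.72, 6.69, -2.85, 4.19, 1.35,
           7.25, 2.30, 1.22, -1.43, 1.25, 0.82, -0.83, 4.00]


def least_squares(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Assumption: x = calculated, y = experimental.
slope, intercept, r_squared = least_squares(pbe_cbs, experiment)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.2f}")
```

A slope below one with this axis choice means the calculated values span a wider range than the measured ones, which is the exaggeration described above.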

Figure 3.

Correlation plots between calculated and experimental logP values for (A) the QM column in Table 1 (DLPNO-CCSD(T)/def2-SVP), (B) the DLPNO-Solv-ccCA results in Table 2, and (C) the PBE/cc-pV∞Z results from Table 2. The grey dashed line is the y=x line. Best fit trendlines are shown.

Ensemble-based methods.

Predicting the solvation free energy in each solvent using alchemical nonequilibrium simulations is denoted as MM-1. Decoupling the Coulomb and van der Waals parameters along a longer 25-step λ path exaggerated the underestimation and overestimation of logPtol/w for 7 and 8, respectively. Based on Dixon’s Q test,70 these values were removed when calculating MUE, RMSD, and STDEV as shown in Table 3. One cause is the large rotational flexibility of these molecules, together with the wide range of structures captured by the points sampled from the umbrella sampling when simulating the solvation free energies. The alchemical approach was not qualitatively consistent with the experimental logPtol/w for 15, suggesting that the solute-solvent Coulomb and van der Waals interactions between 15 and toluene were overestimated, which favored toluene over water.
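For reference, converting a pair of solvation free energies into a partition coefficient follows the standard transfer free energy relation. The sketch below assumes energies in kcal/mol, a temperature of 298.15 K, and independent uncertainties combined in quadrature; the function names and the numerical inputs are hypothetical and are not taken from the MM-1 simulations.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def logp_tol_w(dg_water, dg_toluene, temperature=298.15):
    """logP(tol/w) from solvation free energies (kcal/mol).

    A more negative solvation free energy in toluene than in water
    gives a positive logP, i.e. a lipophilic solute.
    """
    return (dg_water - dg_toluene) / (R_KCAL * temperature * math.log(10))


def logp_uncertainty(sd_water, sd_toluene, temperature=298.15):
    """Propagate independent free energy uncertainties into logP units."""
    return math.hypot(sd_water, sd_toluene) / (
        R_KCAL * temperature * math.log(10))


# Hypothetical solvation free energies and uncertainties (kcal/mol).
value = logp_tol_w(dg_water=-8.0, dg_toluene=-12.0)
error = logp_uncertainty(sd_water=0.2, sd_toluene=0.3)
print(f"logP(tol/w) = {value:.2f} +/- {error:.2f}")
```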

Table 3.

Calculated logPtol/w values using the alchemical MM method (MM-1) with the extended λ path and the unsupervised machine learning clusters (ML). The mean unsigned error (MUE), root mean square deviation (RMSD), and the standard deviation (STDEV) between each method and the experimental values are reported.

Label MM-1a MLb,c Experiment
1 2.81±0.57 1.79 3.76
2 2.84±0.26 1.59 2.40
3 6.73±0.02 −1.71 5.51
4 1.97±0.17 −1.74 5.47
5 4.12±0.04 4.33 3.61
6 −2.53±0.47 −2.72 −1.23
7 −8.44±0.26 0.31 4.37
8 13.36±0.27 4.95 2.79
9 6.48±0.06 −2.33 5.05
10 3.35±0.17 −0.55 2.47
11 5.51±0.60 2.65 1.46
12 −0.90±0.18 −1.18 −1.59
13 2.43±0.61 −1.40 0.36
14 2.63±0.70 0.68 1.41
15 5.99±0.17 −5.09 −0.74
16 −3.33±0.90 2.91 3.77
MUE 2.29 2.83
RMSD 3.14 3.72
STDEV 2.14 2.40
a The results for 7 and 8 were removed from the calculation of the MUE, RMSD, and STDEV because Dixon’s Q test70 identified their errors as outliers.

b The logPtol/w was computed using PBE/cc-pVTZ on the clusters and structures generated using these methods.

c The dataset used to train the Gaussian Mixture Model consists of 4000 non-equilibrium structures.

Unsupervised machine learning techniques, e.g., Gaussian Mixture Models, were used to assign clusters based on the first three principal components of the lower triangular pairwise distance matrix (Figure 4). Based on the DFT calibration (see ESI), PBE/cc-pVTZ was chosen as the method basis set combination to compute logPtol/w with the structures sampled from the clusters generated via unsupervised machine learning and the conformations from Spartan.

Figure 4.

A schematic of the ML method workflow. The data is preprocessed by computing the pairwise nuclear-nuclear distance matrix descriptor for 5000 conformers and transforming the matrix to a (N(N-1)/2)-dimensional row vector. The first three principal components are then computed via PCA and visualized. With that information, Gaussian Mixture Models (GMM) are used to identify clusters of unique conformers. Representative structures from four of eight identified clusters for 16 in explicit water are shown.

Performing PCA on the lower triangular pairwise nuclear-nuclear distance matrix descriptor identifies the variances in interatomic distances that would indicate significant structural changes. As seen in Figure 4 by the four representative conformers of the eight identified clusters for 16 in explicit water, the identified clusters represent various stable conformations on the rotational potential energy surface. The clusters are sampled via five structures closest to the cluster mean.
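The descriptor and the cluster sampling step can be sketched as follows. This is an illustrative outline, not the workflow used in this work: it omits the PCA and Gaussian Mixture Model steps, the function names are hypothetical, and `closest_to_mean` is assumed to be applied to the members of a single cluster in whatever feature space the clustering was performed.

```python
import math
from itertools import combinations


def distance_descriptor(coords):
    """Flatten the pairwise distance matrix into N(N-1)/2 unique entries.

    The matrix is symmetric, so iterating over unique atom pairs yields
    the same set of values as its lower triangle.
    """
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def closest_to_mean(vectors, k=5):
    """Indices of the k feature vectors closest to their mean."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    ranked = sorted(range(len(vectors)),
                    key=lambda i: math.dist(vectors[i], mean))
    return ranked[:k]


# Hypothetical three-atom geometries standing in for MD snapshots.
snapshots = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 0.9, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
]
features = [distance_descriptor(s) for s in snapshots]
print(len(features[0]), closest_to_mean(features, k=2))
```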

While bond rotations typically have small energy barriers of a few kcal/mol, the variability in these rotamers can lead to significantly different logPtol/w values and therefore needs to be considered. Bond rotations around the CH2 groups near the disubstituted cyclohexane ring can cause intramolecular interactions between the two cyclic structures at the ends of the molecule, which in turn create a distribution of unique conformations that can be weighted based on the population of structures in the cluster, i.e., the cluster size.
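Weighting by cluster size amounts to a population-weighted average of the per-cluster logPtol/w values. A minimal sketch, with a hypothetical function name and made-up cluster data:

```python
def cluster_weighted_logp(cluster_logps, cluster_sizes):
    """Average per-cluster logP values, weighted by cluster population."""
    total = sum(cluster_sizes)
    return sum(lp * n / total for lp, n in zip(cluster_logps, cluster_sizes))


# Hypothetical values: three clusters found among 1000 structures.
logps = [2.0, 4.0, -1.0]
sizes = [500, 300, 200]
print(cluster_weighted_logp(logps, sizes))
```

Because the weights depend on how many structures fall into each cluster, the result is sensitive to the size and composition of the training set.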

As the dataset size would affect the weighting of the clusters, five-fold cross validation was done for randomized training sets of 2000, 3000, 4000, and 5000 structures (see ESI), which corresponds to a training/test split of 40:60, 60:40, 80:20, and 100:0, respectively, when training neural networks. The silhouette score analysis and subsequent calculation of the logPtol/w based on the weighted average of various clusters yielded that while 4000 structures provided the most consistent silhouette scores, the MUE of the computed logPtol/w values only marginally improved over the MUE when considering 2000 structures. Therefore, there are several factors of randomness within this ML method that contribute to the predicted logPtol/w.

The different weighting schemes (equal weighting, Boltzmann-weighted, and RMSD-weighted) of the S20 method are shown in Table 4. The equal weighting scheme yielded the lowest MUE of the three schemes, 1.31 logPtol/w units. Boltzmann weighting of the conformation energies increased the MUE by 0.21 logPtol/w units. The Boltzmann-weighted scheme considers the energies at the PBE/cc-pVTZ level of theory for each conformer and weights each conformation based on its electronic energy relative to the minimum energy. This weighting varies for both solvents (see ESI). The RMSD-weighted scheme utilizes the RMSD of the Spartan-generated conformers relative to the DFT optimized structure in each solvent. Assigning higher weights to the conformations with smaller RMSD values reflects that conformers structurally closer to the DFT optimized local minimum have more significance towards the overall energy. The MUE for the RMSD-weighted scheme was 1.65 logPtol/w units. The RMSD values had similar distributions with respect to each solvent (see ESI). The Boltzmann-weighted logPtol/w values yielded lower weighted standard deviations than both the equal and RMSD-weighted schemes. The decreased standard deviation of the Boltzmann weighting is attributed to the spread of DFT energies that can correct for inaccuracies in the force field. Note that the standard deviation presented as STDEV in Table 4 is the standard deviation of the unsigned errors and not the standard deviation of the predicted logPtol/w values. Overall, the logPtol/w values computed with the Boltzmann-weighted conformation energies were not statistically different from those of the other weighting methods, both for the whole set and for each molecule. Therefore, in subsequent analysis of ensemble-based methods (Table 5), the representative S20 method will utilize equal weighting of conformers so as not to bias the comparison with the other ensemble-based methods.
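The three weighting schemes can be compared with a short sketch. Several choices here are assumptions rather than details given in the text: relative energies are taken in kcal/mol at 298.15 K, the RMSD weights use a simple inverse-RMSD form, a single set of weights is applied instead of separate weights per solvent, and all names and conformer data are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def normalize(raw):
    total = sum(raw)
    return [x / total for x in raw]


def equal_weights(n):
    return [1.0 / n] * n


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Weights from energies relative to the lowest-energy conformer."""
    e_min = min(energies_kcal)
    return normalize([math.exp(-(e - e_min) / (R_KCAL * temperature))
                      for e in energies_kcal])


def inverse_rmsd_weights(rmsds, floor=1e-3):
    """Assumed form: weight proportional to 1/RMSD (floored to avoid 1/0)."""
    return normalize([1.0 / max(r, floor) for r in rmsds])


def weighted_mean_std(values, weights):
    """Weighted mean and weighted standard deviation (weights sum to 1)."""
    mean = sum(w * v for w, v in zip(weights, values))
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
    return mean, math.sqrt(var)


# Hypothetical conformer data for one molecule.
logp = [1.0, 3.0, 5.0]
rel_energy = [0.0, 0.5, 3.0]   # kcal/mol
rmsd = [0.2, 0.4, 1.0]         # angstrom

for label, w in [("equal", equal_weights(len(logp))),
                 ("Boltzmann", boltzmann_weights(rel_energy)),
                 ("RMSD", inverse_rmsd_weights(rmsd))]:
    mean, std = weighted_mean_std(logp, w)
    print(f"{label}: {mean:.2f} +/- {std:.2f}")
```

When one conformer dominates the Boltzmann weights, the weighted standard deviation shrinks toward zero, which is one way a narrower spread can arise relative to equal weighting.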

Table 4.

Calculated logPtol/w values with the thermodynamic ensemble of rotamers found using the Spartan ’20 package (S20) for different weighting schemes (equal, Boltzmann, RMSD) using PBE/cc-pVTZ. The mean unsigned error (MUE), root mean square deviation (RMSD), and the standard deviation (STDEV) between each method and the experimental values are reported.

Label Equal Weight Boltzmann RMSD Experiment
1 3.76±2.48 4.01±1.85 3.92±2.27 3.76
2 −0.12±4.50 −0.20±4.19 −0.10±4.55 2.40
3 6.30±2.10 6.38±1.63 6.26±1.82 5.51
4 7.82±0.23 7.89±0.00 7.98±0.00 5.47
5 4.25±2.95 4.40±2.21 4.34±2.65 3.61
6 −1.48±4.06 −1.23±3.47 −1.51±3.97 −1.23
7 1.05±3.96 0.25±3.86 −0.02±3.88 4.37
8 0.54±5.66 −1.39±5.74 −1.42±5.68 2.79
9 4.01±6.11 4.01±5.07 3.50±6.26 5.05
10 1.87±0.93 2.10±0.54 1.85±0.76 2.47
11 2.12±0.00a 2.12±0.00a 2.12±0.00a 1.46
12 −0.97±2.71 −0.76±0.40 −0.91±1.89 −1.59
13 −0.07±3.00 −0.20±3.07 −0.10±3.03 0.36
14 0.73±2.39 0.55±1.98 −0.04±2.42 1.41
15 −0.41±3.21 −0.53±0.91 0.08±2.85 −0.74
16 −0.78±4.96 −0.80±4.47 −0.80±5.01 3.77
MUE 1.31 1.52 1.65
RMSD 1.82 2.13 2.21
STDEV 1.25 1.50 1.48
a Only one rotational conformation was found for 11.

Table 5.

The mean and standard deviation of the unsigned errors, excluding the uncertainty, for all the single conformer QM methods (QM, DLPNO-Solv-ccCA, PBE/cc-pV∞Z) and the ensemble-based methods (MM, Mixed, MM-1, ML, S20) for each challenge molecule. The difference between these methods is calculated to assess whether the single conformer QM methods and the ensemble-based methods differ significantly.

Label Single conformer QM methods (N=3) Ensemble-based methods (N=5) Difference
1 0.37±0.16 0.74±0.76 −0.37±0.78
2 1.67±0.11 1.19±0.85 0.48±0.86
3 1.03±0.26 1.75±1.68 −0.72±1.71
4 2.48±0.36 3.38±2.30 −0.91±2.33
5a 2.93±0.14 0.58±0.14 2.35±0.19
6 2.25±0.55 1.51±0.91 0.74±1.06
7 0.13±0.05 5.18±4.34 −5.05±4.34
8 1.69±0.24 5.21±3.45 −3.52±3.46
9 1.89±0.28 1.11±0.38 0.78±0.47
10 0.16±0.12 1.07±0.59 −0.91±0.60
11 0.24±0.01 2.98±1.95 −2.74±1.95
12 0.32±0.17 0.57±0.11 −0.25±0.20
13 0.95±0.08 0.95±0.72 0.00±0.73
14 0.41±0.22 1.69±1.82 −1.28±1.83
15 1.58±1.59 5.93±4.45 −4.34±4.72
16 1.84±1.82 3.25±2.67 −1.41±3.23
a An unpaired two-tailed t-test showed that the distributions of the single conformer QM methods and ensemble-based methods were statistically different. All other pairs of distributions were verified to be not statistically different.
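The Difference column of Table 5 and the accompanying significance test can be illustrated from the summary statistics alone. This sketch rests on assumptions: the uncertainty of the difference is taken as the quadrature sum of the two standard deviations (which reproduces the tabulated value for 1), and the t statistic uses the pooled-variance (Student) form with the tabulated values treated as sample standard deviations. The text does not specify which variant of the unpaired t-test was used, and the function names are hypothetical.

```python
import math


def difference_with_uncertainty(mean_a, sd_a, mean_b, sd_b):
    """Difference of two means with uncertainties combined in quadrature."""
    return mean_a - mean_b, math.hypot(sd_a, sd_b)


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Unpaired Student t statistic and degrees of freedom from summaries."""
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, dof


# Row for molecule 1 in Table 5: QM 0.37 +/- 0.16, ensemble 0.74 +/- 0.76.
diff, unc = difference_with_uncertainty(0.37, 0.16, 0.74, 0.76)
print(f"difference = {diff:.2f} +/- {unc:.2f}")

# Row for molecule 5: QM 2.93 +/- 0.14 (N=3), ensemble 0.58 +/- 0.14 (N=5).
t_value, dof = pooled_t_statistic(2.93, 0.14, 3, 0.58, 0.14, 5)
T_CRITICAL = 2.447  # two-tailed critical value for 6 degrees of freedom, alpha = 0.05
print(f"t = {t_value:.1f}, dof = {dof}, significant = {abs(t_value) > T_CRITICAL}")
```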

The standard entry point for QM calculations is to draw the structure or use a SMILES string to generate a 3D model, which can then be fed into any electronic structure program to optimize with respect to each solvent. One potential issue with this approach is the use of a single input structure, as optimization algorithms search for a local minimum and do not consider the ensemble of rotamers present in any given environment. As shown by the various results involving rotamers generated via MD simulations, the predictions are tenuous given the wide spread of possible logPtol/w values (Figure 2). When considering all the structures generated through MD simulations, the force field parameters for the molecules were not sufficient to capture the subtle electronic changes of the molecules in solution that dictate whether rotational conformations are favored in water or toluene.

The Spartan software package has the capability to screen conformers based on bond rotation and relative energy through a graphical user interface. This capability provides an ensemble of structures enabling a more guided inquiry into the effect of rotamers on predicting logPtol/w. Note that the structures generated via Spartan are calculated in the gas phase and thus are independent of solute-solvent interactions. The conformations generated by Spartan are also based on electronic energies after rotating the bonds and are not Boltzmann-weighted with respect to conformation. Therefore, with the ensemble of structures, the same 3D structures were considered with only a change in the implicit solvent when calculating the PBE/cc-pVTZ free energy. With the energies of each structure in the ensemble, a heat map (Figure 5) can be generated. A heat map matrix of predicted logPtol/w values provides more information into which rotamers are likely to be more soluble in water or toluene while the histogram can show the overall distribution of predicted logPtol/w values. Figure 5 shows that for 5, the predicted logPtol/w value can vary by up to 8 logPtol/w units based on the choice of rotamer.

Figure 5.

Heat map showing the range of calculated logPtol/w values from the S20 method for the conformers of 5.

Since the same structures were used with the only change being the implicit solvent chosen to generate the heat map, there are symmetry elements present evidenced by a partitioning of the heat map based on lighter (yellow) and darker (purple) colors. Considering 5 (Figure 5), the values along the diagonal that were used to compute the logPtol/w for S20 are closer to the median logPtol/w values, while the extremes occur when the structures vary considerably, which are at the beginning and end of the scan.
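One way such a heat map matrix could be assembled is sketched below. The construction is an assumption inferred from the description above: each entry pairs the free energy of conformer i in water with that of conformer j in toluene, so the diagonal corresponds to the same structure in both solvents. The free energies are hypothetical and the function name is illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def logp_matrix(g_water, g_toluene, temperature=298.15):
    """Matrix of logP(tol/w) values for every conformer pairing.

    Rows index the conformer used in water, columns the conformer used
    in toluene; energies are in kcal/mol.
    """
    rt_ln10 = R_KCAL * temperature * math.log(10)
    return [[(gw - gt) / rt_ln10 for gt in g_toluene] for gw in g_water]


# Hypothetical relative free energies (kcal/mol) for three conformers.
g_water = [0.0, 1.5, 4.0]
g_toluene = [-3.0, -2.0, -4.5]

matrix = logp_matrix(g_water, g_toluene)
diagonal = [matrix[i][i] for i in range(len(matrix))]
flat = [value for row in matrix for value in row]
print("diagonal:", [round(v, 2) for v in diagonal])
print("range:", round(max(flat) - min(flat), 2))
```

The spread of the full matrix relative to its diagonal indicates how strongly the predicted logPtol/w depends on which rotamer is assumed in each solvent.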

When the same unsupervised machine learning workflow was applied to the ensemble of structures in the S20 method, the heat maps that were generated for the structures in the proposed cluster retained similar ranges of predicted logPtol/w values (Figure S4). The ranges present indicate that the unsupervised clustering based on the pairwise nucleus-nucleus lower triangular matrix does not cluster the structures in a way that transfers the knowledge of rotational changes that are favorable in a particular solvent. Error cancellation from averaging the large range of logPtol/w values as seen in the histogram clusters (Figure 2) and the heat map (Figure 5) is a factor for the success of the ensemble-based methods. For the QM/MD portions of the ML method, more rigorous methods to survey the potential energy surface and generate rotamers such as ab initio molecular dynamics (AIMD) would improve the ratios of likely conformations since AIMD forces are calculated ab initio rather than with parameterized force fields. The use of more contemporary force field generators for small molecules, e.g., OpenFF,71-73 that are trained on larger QM datasets may also contribute to generating more thermodynamically favorable conformations during MD simulations. On the ML side, density-based clustering algorithms,74 such as the Density-Based Spatial Clustering of Applications with Noise (DBSCAN), can be used to identify clusters of structures and eliminate transitory structures from being considered as part of the weighting. Other descriptors of 3D geometry may also serve as additional dimensions to create clusters with higher silhouette scores.

Across all methods examined in this work, 1, 12, and 13 were the most consistent, with predictions within 1 logPtol/w unit, with the exception of the MM method shown in Table 1, which yielded an error of 1.20 for 13. For the single conformer QM methods (QM, DLPNO-Solv-ccCA, PBE/cc-pV∞Z), the unsigned errors for 7, 11, and 14 were less than 0.18, 0.25, and 0.60 logPtol/w units, respectively, while for the ensemble-based methods (MM, Mixed, MM-1, ML, S20), the mean unsigned errors were 5.18, 2.98, and 1.69 logPtol/w units, respectively (Table 5). In the case of 7, the flexibility of both the propylene linker between the tertiary amine and the phenothiazine moiety and the ethylene linker between the tertiary amine and the alcohol group makes it possible to generate numerous conformers within 10 kcal/mol of the absolute minimum. Moreover, these conformers may have distinct polarity and steric profiles, which can skew the relative solvation free energies by up to 10 kcal/mol. This translates to a logPtol/w difference of 7.36 units, thus explaining the distributions observed in the logP histograms (Figure 2) and heat maps (Figure 5). However, for 5, all the ensemble-based methods yielded lower errors than all the QM methods by an average of 2.35±0.41 logPtol/w units, which was verified to be statistically different via the two-tailed unpaired t-test (Table 5). For 6 and 9, while the average unsigned error for the ensemble-based methods was lower than that for the single conformer QM methods by 0.74±1.06 and 0.78±0.47 logP units, respectively, the differences were not statistically significant. These molecules have fewer rotational degrees of freedom, which would suggest that utilizing ensemble-based methods is advantageous for more unsaturated, rigid drug-like molecules. Therefore, a careful selection or weighting of conformations in the thermodynamic ensemble is essential for predicting solvated properties that depend on calculations in multiple environments, such as logPtol/w or pKa.
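The conversion from a shift in relative solvation free energy to a shift in logPtol/w follows from the standard relation between a transfer free energy and a partition coefficient. As a worked example, assuming T = 298.15 K and R = 1.987 × 10⁻³ kcal mol⁻¹ K⁻¹ (the text does not state the values used):

```latex
\Delta \log P_{\mathrm{tol/w}}
  = \frac{\Delta\Delta G_{\mathrm{solv}}}{RT \ln 10}
  = \frac{10\ \mathrm{kcal\,mol^{-1}}}
         {(1.987\times10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}})(298.15\ \mathrm{K})(2.303)}
  \approx 7.3
```

This is close to the 7.36 units quoted above; the small difference presumably reflects the temperature or constants adopted. Equivalently, each 1.36 kcal/mol of relative solvation free energy corresponds to roughly one logPtol/w unit at room temperature.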

In screening viable conformations, consider the 3D structures for 7 and 16; both have N,N-disubstituted piperazine rings.75 One key feature of the piperazine functionality is that the net dipole is significantly affected by the conformation of the ring. The structure shown on the left in Figure 6 shows the pseudo-diequatorial moieties about the piperazine, while the structure on the right shows the pseudo-diaxial conformer. In the pseudo-diequatorial case, the nitrogen lone pairs are in a trans orientation, thus minimizing the net dipole. This conformer is favored in the less polar solvent, toluene. The pseudo-diaxial conformer is favored in water as the nitrogen lone pairs are cis, thus maximizing the net dipole. The nitrogen lone pairs are localized as HOMO-2 and HOMO-1, and so only one of the MOs (HOMO-2) is visualized in Figure 6. This feature for 7 and 16 was observed for pairs of rotamers generated through the S20 method in which the calculated logP was within 0.01 logP units of the reported experimental values. Depending on the choice of descriptors, machine learning techniques may or may not capture this subtle electronic effect. Likewise, when treating the solvent implicitly for QM methods that use a single conformer, this positioning of the lone pairs may not be captured, as QM optimization algorithms focus on finding the nearest local minimum, which may not reflect these subtle chemical changes that can impact the dipole moment. Therefore, if care is not taken to capture these features when generating 3D structures from SMILES strings with tools like Open Babel, these subtle chemical features can be lost, especially in larger scale quantum mechanical machine learning workflows.

Figure 6.

The B3LYP-D3/cc-pV(T+d)Z optimized structures of 7 in toluene (left) and water (right). The HOMO-2 orbital displays the nitrogen lone pairs. Hydrogens bound to carbon atoms are omitted for clarity. An isovalue of 0.1 was used to render the MO. C = grey, N = blue, O = red, F = fuchsia, S = yellow, H = white.

Conclusions

The logPtol/w was predicted using a range of computational approaches in the broad categories of quantum mechanics (QM), molecular mechanics (MM), and data-driven machine learning (ML) approaches in multiscale models. The initial aim was to use techniques that can be readily applied by a non-expert. These methods need some refining to be more applicable for logPtol/w, but QM methods like DFT functionals paired with a triple-ζ basis set are the simplest and arguably the most effective tool presented here in terms of reliability and approachability to obtain quantitatively accurate logPtol/w values within 1.00 logPtol/w unit. MM and ML ensemble-based approaches outperformed single conformer QM approaches for the challenge molecules with fewer rotational degrees of freedom, since the average errors of the ensemble-based methods were lower than those of the single conformer QM methods; the converse held for the molecules with more rotational degrees of freedom. This behavior for the larger and more flexible molecules skewed the results and reduces the reliability of these specific MM and ML approaches. Error cancellation is also a factor for the success of the MM and ML methods over the single conformer QM approaches. With the pairwise nucleus-nucleus distance matrix descriptor, the ML techniques did not cluster the structures in a way that transfers the knowledge of rotational changes that are favorable in a particular solvent. All the methods demonstrate how the 3D geometry and rotational changes caused by the solute-solvent environment affect the predicted values of logPtol/w and how subtle structural changes like the orientation of the nitrogen lone pairs in pseudo-diaxial or pseudo-diequatorial conformations impact the stability of drug molecules in water and toluene. These subtle features were not captured in a way that differentiates this effect in either the MM or ML results, and required the detailed analysis of the electronic structure best provided through QM methods.
The selection of rotamers that comprise the thermodynamic ensemble is crucial for predicting changes in solvation free energies between polar and non-polar solvents, as the large range in logPtol/w values generated using the ensembles and clusters assigned through machine learning yielded results that would need to be meaningfully interpreted to gain insight into how clustering techniques can be used effectively. Refinement and experience with each of these techniques would yield more quantitatively accurate and qualitatively consistent results, but even without such expertise, the accessible methods used herein can readily facilitate thermodynamic property prediction essential for cost-effective rational drug design.

Supplementary Material

SI

Acknowledgements

The authors gratefully acknowledge the University of Dallas, the John B. O’Hara Institute, the Nancy Cain and Jeffrey A. Marcus Foundation, and the Robert A. Welch Foundation (award number BA-0015-20201025) for funds to purchase equipment and support student researcher MRD. Portions of the post-competition ORCA calculations were done utilizing resources provided by the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. We also appreciate the National Institutes of Health (NIH) for its support of the SAMPL project via R01GM124270 to David L. Mobley (UC Irvine).

Footnotes

Conflicts of interest

There are no conflicts to declare.

Electronic Supplementary Information (ESI) available: ESI contains information on the online resources used, details on the DFT calibration, the 25-point λ path, and analysis of the ensemble-based methods. See DOI: 10.1039/x0xx00000x

References

  • 1.Lipinski CA, Lead- and drug-like compounds: the rule-of-five revolution, Drug Discov Today Technol, 2004, 1, 337–341. [DOI] [PubMed] [Google Scholar]
  • 2.DeGoey DA, Chen H-J, Cox PB and Wendt MD, Beyond the Rule of 5: Lessons Learned from AbbVie’s Drugs and Compound Collection, J Med Chem, 2018, 61, 2636–2651. [DOI] [PubMed] [Google Scholar]
  • 3.Patrick GL, An introduction to medicinal chemistry, Oxford University Press, 7th edn., 2023. [Google Scholar]
  • 4.Blomme EAG and Will Y, Toxicology Strategies for Drug Discovery: Present and Future, Chem Res Toxicol, 2016, 29, 473–504. [DOI] [PubMed] [Google Scholar]
  • 5.Wang Y, Cheng T and Bryant SH, PubChem BioAssay: A Decade’s Development toward Open High-Throughput Screening Data Sharing, SLAS Discovery, 2017, 22, 655–666. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Pereira DA and Williams JA, Origin and evolution of high throughput screening, Br J Pharmacol, 2007, 152, 53–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Jorgensen WL, Efficient Drug Lead Discovery and Optimization, Acc Chem Res, 2009, 42, 724–733. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Gloriam DE, Bigger is better in virtual drug screens, Nature, 2019, 566, 193–194. [DOI] [PubMed] [Google Scholar]
  • 9.Walters WP, Virtual Chemical Libraries, J Med Chem, 2019, 62, 1116–1124. [DOI] [PubMed] [Google Scholar]
  • 10.Hoffmann T and Gastreich M, The next level in chemical space navigation: going far beyond enumerable compound libraries, Drug Discov Today, 2019, 24, 1148–1156. [DOI] [PubMed] [Google Scholar]
  • 11.Grygorenko OO, Radchenko DS, Dziuba I, Chuprina A, Gubina KE and Moroz YS, Generating Multibillion Chemical Space of Readily Accessible Screening Compounds, iScience, 2020, 23, 101681. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Tran-Nguyen VK, Junaid M, Simeon S and Ballester PJ, A practical guide to machine-learning scoring for structure-based virtual screening, Nature Protocols 2023 18:11, 2023, 18, 3460–3511. [DOI] [PubMed] [Google Scholar]
  • 13.Batra R, Chan H, Kamath G, Ramprasad R, Cherukara MJ and Sankaranarayanan SKRS, Screening of Therapeutic Agents for COVID-19 Using Machine Learning and Ensemble Docking Studies, Journal of Physical Chemistry Letters, 2020, 11, 7058–7065. [DOI] [PubMed] [Google Scholar]
  • 14.Srinivasan S, Batra R, Chan H, Kamath G, Cherukara MJ and Sankaranarayanan SKRS, Artificial Intelligence-Guided de Novo Molecular Design Targeting COVID-19, ACS Omega, 2021, 6, 12557–12566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Ghosh N, Saha I and Gambin A, Interactome-Based Machine Learning Predicts Potential Therapeutics for COVID-19, ACS Omega, 2023, 8, 13840–13854. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Talele T, Khedkar S and Rigby A, Successful Applications of Computer Aided Drug Discovery: Moving Drugs from Concept to the Clinic, Curr Top Med Chem, 2010, 10, 127–141. [DOI] [PubMed] [Google Scholar]
  • 17.Macalino SJY, Gosu V, Hong S and Choi S, Role of computer-aided drug design in modern drug discovery, Arch Pharm Res, 2015, 38, 1686–1701. [DOI] [PubMed] [Google Scholar]
  • 18.Sabe VT, Ntombela T, Jhamba LA, Maguire GEM, Govender T, Naicker T and Kruger HG, Current trends in computer aided drug design and a highlight of drugs discovered via computational techniques: A review, Eur J Med Chem, 2021, 224, 113705. [DOI] [PubMed] [Google Scholar]
  • 19.Sadybekov AV and Katritch V, Computational approaches streamlining drug discovery, Nature, 2023, 616, 673–685. [DOI] [PubMed] [Google Scholar]
  • 20.Nicholls A, Wlodek S and Grant JA, The SAMP1 Solvation Challenge: Further Lessons Regarding the Pitfalls of Parametrization †, J. Phys. Chem. B, 2009, 113, 4521–4532. [DOI] [PubMed] [Google Scholar]
  • 21.Geballe MT, Skillman AG, Nicholls A, Guthrie JP and Taylor PJ, The SAMPL2 blind prediction challenge: introduction and overview, J. Comput. Aided. Mol. Des, 2010, 24, 259–279. [DOI] [PubMed] [Google Scholar]
  • 22.Nicholls A, Mobley DL, Peter Guthrie J, Chodera JD, Bayly CI, Cooper MD and Pande VS, Predicting Small-Molecule Solvation Free Energies: An Informal Blind Test for Computational Chemistry, J Med Chem, 2008, 51, 769–779. [DOI] [PubMed] [Google Scholar]
  • 23.Procacci P and Guarnieri G, SAMPL9 blind predictions for toluene/water partition coefficients using nonequilibrium alchemical approaches, J. Chem. Phys, 2023, 158, 124117. [DOI] [PubMed] [Google Scholar]
  • 24.Zamora WJ, Viayna A, Pinheiro S, Curutchet C, Bisbal L, Ruiz R, Ràfols C and Luque FJ, Prediction of toluene/water partition coefficients in the SAMPL9 blind challenge: assessment of machine learning and IEF-PCM/MST continuum solvation models, Physical Chemistry Chemical Physics, 2023, 25, 17952–17965. [DOI] [PubMed] [Google Scholar]
  • 25.Kitchin JR, Machine learning in catalysis, Nat Catal, 2018, 1, 230–232. [Google Scholar]
  • 26.Kuntz D and Wilson AK, Machine learning, artificial intelligence, and chemistry: How smart algorithms are reshaping simulation and the laboratory, Pure and Applied Chemistry, 2022, 94, 1019–1054. [Google Scholar]
  • 27.Prasad S and Brooks BR, A deep learning approach for the blind logP prediction in SAMPL6 challenge, J. Comput. Aided. Mol. Des, 2020, 34, 535–542. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Donyapour N and Dickson A, Predicting partition coefficients for the SAMPL7 physical property challenge using the ClassicalGSG method, J. Comput. Aided. Mol. Des, 2021, 35, 819–830. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Bergazin TD, Tielker N, Zhang Y, Mao J, Gunner MR, Francisco K, Ballatore C, Kast SM and Mobley DL, Evaluation of log P, pKa, and log D predictions from the SAMPL7 blind challenge, J. Comput. Aided. Mol. Des, 2021, 35, 771–802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Pinheiro M Jr, Zhang S, Dral PO and Barbatti M, WS22 database, Wigner Sampling and geometry interpolation for configurationally diverse molecular datasets, Sci Data, 2023, 10, 95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Patel P, Wells RH, Kaphan DM, Delferro M, Skodje RT and Liu C, Computational Investigation of the Role of Active Site Heterogeneity for a Supported Organovanadium(III) Hydrogenation Catalyst, ACS Catal, 2021, 11, 7257–7269. [Google Scholar]
  • 32.Ritchie TJ and Macdonald SJF, The impact of aromatic ring count on compound developability – are too many aromatic rings a liability in drug design?, Drug Discov Today, 2009, 14, 1011–1020. [DOI] [PubMed] [Google Scholar]
  • 33.Shearer J, Castro JL, Lawson ADG, MacCoss M and Taylor RD, Rings in Clinical Trials and Drugs: Present and Future, J. Med. Chem, 2022, 65, 8699–8712. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Jones MR, Brooks BR and Wilson AK, Partition coefficients for the SAMPL5 challenge using transfer free energies, J. Comput. Aided. Mol. Des, 2016, 30, 1129–1138. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Patel P, Kuntz DM, Jones MR, Brooks BR and Wilson AK, SAMPL6 logP challenge: machine learning and quantum mechanical approaches, J. Comput. Aided. Mol. Des, 2020, 34, 495–510. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Jones MR and Brooks BR, Quantum chemical predictions of water–octanol partition coefficients applied to the SAMPL6 logP blind challenge, J. Comput. Aided. Mol. Des, 2020, 34, 485–493. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Neese F, Software update: The ORCA program system—Version 5.0, WIREs Computational Molecular Science, 2022, 12, e1606. [Google Scholar]
  • 38.Yoshikawa N and Hutchison GR, Fast, efficient fragment-based coordinate generation for Open Babel, J Cheminform, 2019, 11, 49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T and Hutchison GR, Open Babel: An open chemical toolbox, J. Cheminform, 2011, 3, 33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Marenich AV, Cramer CJ and Truhlar DG, Universal Solvation Model Based on Solute Electron Density and on a Continuum Model of the Solvent Defined by the Bulk Dielectric Constant and Atomic Surface Tensions, J. Phys. Chem. B, 2009, 113, 6378–6396. [DOI] [PubMed] [Google Scholar]
  • 41.A. R, Weigend F., Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy, Phys. Chem. Chem. Phys, 2005, 7, 3297. [DOI] [PubMed] [Google Scholar]
  • 42.Becke AD, Density-functional thermochemistry. III. The role of exact exchange, J. Chem. Phys, 1993, 98, 5648–5652.
  • 43.Lee C, Yang W and Parr RG, Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density, Phys. Rev. B, 1988, 37, 785–789.
  • 44.Grimme S, Antony J, Ehrlich S and Krieg H, A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu, J. Chem. Phys, 2010, 132, 154104.
  • 45.Kesharwani MK, Brauer B and Martin JML, Frequency and zero-point vibrational energy scale factors for double-hybrid density functionals (and other selected methods): Can anharmonic force fields be avoided?, J. Phys. Chem. A, 2015, 119, 1701–1714.
  • 46.Riplinger C, Pinski P, Becker U, Valeev EF and Neese F, Sparse maps—A systematic infrastructure for reduced-scaling electronic structure methods. II. Linear scaling domain based pair natural orbital coupled cluster theory, J. Chem. Phys, 2016, 144, 024109.
  • 47.Izsák R, Neese F and Klopper W, Robust fitting techniques in the chain of spheres approximation to the Fock exchange: The role of the complementary space, J. Chem. Phys, 2013, 139, 094111.
  • 48.Weigend F, Accurate Coulomb-fitting basis sets for H to Rn, Phys. Chem. Chem. Phys, 2006, 8, 1057.
  • 49.Hellweg A, Hättig C, Höfener S and Klopper W, Optimized accurate auxiliary basis sets for RI-MP2 and RI-CC2 calculations for the atoms Rb to Rn, Theor. Chem. Acc, 2007, 117, 587–597.
  • 50.Dunning TH Jr, Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen, J. Chem. Phys, 1989, 90, 1007–1023.
  • 51.DeYonker NJ, Cundari TR and Wilson AK, The correlation consistent composite approach (ccCA): An alternative to the Gaussian-n methods, J. Chem. Phys, 2006, 124, 114104.
  • 52.DeYonker NJ, Wilson BR, Pierpont AW, Cundari TR and Wilson AK, Towards the intrinsic error of the correlation consistent Composite Approach (ccCA), Mol. Phys, 2009, 107, 1107–1121.
  • 53.Patel P, Melin TRL, North SC and Wilson AK, Ab initio composite methodologies: Their significance for the chemistry community, 2021, vol. 17.
  • 54.Dunning TH Jr, Peterson KA and Wilson AK, Gaussian basis sets for use in correlated molecular calculations. X. The atoms aluminum through argon revisited, J. Chem. Phys, 2001, 114, 9244–9253.
  • 55.Shao Y, Molnar LF, Jung Y, Kussmann J, Ochsenfeld C, Brown ST, Gilbert ATB, Slipchenko LV, Levchenko SV, O’Neill DP, DiStasio RA Jr, Lochan RC, Wang T, Beran GJO, Besley NA, Herbert JM, Yeh Lin C, Van Voorhis T, Hung Chien S, Sodt A, Steele RP, Rassolov VA, Maslen PE, Korambath PP, Adamson RD, Austin B, Baker J, Byrd EFC, Dachsel H, Doerksen RJ, Dreuw A, Dunietz BD, Dutoi AD, Furlani TR, Gwaltney SR, Heyden A, Hirata S, Hsu C-P, Kedziora G, Khalliulin RZ, Klunzinger P, Lee AM, Lee MS, Liang W, Lotan I, Nair N, Peters B, Proynov EI, Pieniazek PA, Min Rhee Y, Ritchie J, Rosta E, David Sherrill C, Simmonett AC, Subotnik JE, Lee Woodcock III H, Zhang W, Bell AT, Chakraborty AK, Chipman DM, Keil FJ, Warshel A, Hehre WJ, Schaefer III HF, Kong J, Krylov AI, Gill PMW and Head-Gordon M, Advances in methods and algorithms in a modern quantum chemistry program package, Phys. Chem. Chem. Phys, 2006, 8, 3172–3191.
  • 56.Spartan ’20, Wavefunction Inc., Irvine, CA, 1.1.4.
  • 57.Halgren TA, Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94, J. Comput. Chem, 1996, 17, 490–519.
  • 58.Bursch M, Mewes J, Hansen A and Grimme S, Best-Practice DFT Protocols for Basic Molecular Computational Chemistry, Angew. Chem. Int. Ed, 2022, 61, e202205735.
  • 59.Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B and Lindahl E, GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers, SoftwareX, 2015, 1–2, 19–25.
  • 60.Horn HW, Swope WC, Pitera JW, Madura JD, Dick TJ, Hura GL and Head-Gordon T, Development of an improved four-site water model for biomolecular simulations: TIP4P-Ew, J. Chem. Phys, 2004, 120, 9665–9678.
  • 61.Hess B and Van Der Vegt NFA, Hydration thermodynamic properties of amino acid analogues: A systematic comparison of biomolecular force fields and water models, J. Phys. Chem. B, 2006, 110, 17616–17626.
  • 62.Gapsys V and de Groot BL, On the importance of statistics in molecular simulations for thermodynamics, kinetics and simulation box size, eLife, 2020, 9, 1–21.
  • 63.Jo S, Kim T, Iyer VG and Im W, CHARMM-GUI: A web-based graphical user interface for CHARMM, J. Comput. Chem, 2008, 29, 1859–1865.
  • 64.Vanommeslaeghe K, Hatcher E, Acharya C, Kundu S, Zhong S, Shim J, Darian E, Guvench O, Lopes P, Vorobyov I and Mackerell AD, CHARMM general force field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields, J. Comput. Chem, 2010, 31, 671–690.
  • 65.Yu W, He X, Vanommeslaeghe K and MacKerell AD, Extension of the CHARMM general force field to sulfonyl-containing compounds and its utility in biomolecular simulations, J. Comput. Chem, 2012, 33, 2451–2468.
  • 66.Lundborg M, Lidmar J and Hess B, The accelerated weight histogram method for alchemical free energy calculations, J. Chem. Phys, DOI: 10.1063/5.0044352.
  • 67.Brandenburg JG, Bannwarth C, Hansen A and Grimme S, B97-3c: A revised low-cost variant of the B97-D density functional method, J. Chem. Phys, 2018, 148, 064104.
  • 68.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M and Duchesnay E, Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res, 2011, 12, 2825–2830.
  • 69.Kim Y and Kim WY, Universal Structure Conversion Method for Organic Molecules: From Atomic Connectivity to Three-Dimensional Geometry, Bull. Korean Chem. Soc, 2015, 36, 1769–1777.
  • 70.Dean RB and Dixon WJ, Simplified Statistics for Small Numbers of Observations, Anal. Chem, 1951, 23, 636–638.
  • 71.Mobley DL, Bannan CC, Rizzi A, Bayly CI, Chodera JD, Lim VT, Lim NM, Beauchamp KA, Slochower DR, Shirts MR, Gilson MK and Eastman PK, Escaping Atom Types in Force Fields Using Direct Chemical Perception, J. Chem. Theory Comput, 2018, 14, 6076–6092.
  • 72.Qiu Y, Smith DGA, Boothroyd S, Jang H, Hahn DF, Wagner J, Bannan CC, Gokey T, Lim VT, Stern CD, Rizzi A, Tjanaka B, Tresadern G, Lucas X, Shirts MR, Gilson MK, Chodera JD, Bayly CI, Mobley DL and Wang L-P, Development and Benchmarking of Open Force Field v1.0.0—the Parsley Small-Molecule Force Field, J. Chem. Theory Comput, 2021, 17, 6262–6280.
  • 73.Boothroyd S, Behara PK, Madin OC, Hahn DF, Jang H, Gapsys V, Wagner JR, Horton JT, Dotson DL, Thompson MW, Maat J, Gokey T, Wang L-P, Cole DJ, Gilson MK, Chodera JD, Bayly CI, Shirts MR and Mobley DL, Development and Benchmarking of Open Force Field 2.0.0: The Sage Small Molecule Force Field, J. Chem. Theory Comput, 2023, 19, 3251–3275.
  • 74.Schubert E, Sander J, Ester M, Kriegel HP and Xu X, DBSCAN Revisited, Revisited, ACM Transactions on Database Systems, 2017, 42, 1–21.
  • 75.Pandit D, Roosma W, Misra M, Gilbert KM, Skawinski WJ and Venanzi CA, Conformational analysis of piperazine and piperidine analogs of GBR 12909: stochastic approach to evaluating the effects of force fields and solvent, J. Mol. Model, 2011, 17, 181–200.

