Implications of Additivity and Nonadditivity for Machine Learning and Deep Learning Models in Drug Design

Karolina Kwapien; Eva Nittinger; Jiazhen He; Christian Margreitter; Alexey Voronov; Christian Tyrchan

doi:10.1021/acsomega.2c02738

ACS Omega. 2022 Jul 19;7(30):26573–26581. doi: 10.1021/acsomega.2c02738


PMCID: PMC9352238 PMID: 35936431

Abstract


Matched molecular pairs (MMPs) are nowadays a commonly applied concept in drug design. They are used in many computational tools for structure–activity relationship analysis, biological activity prediction, or optimization of physicochemical properties. However, until now it has not been shown in a rigorous way that MMPs, that is, pairs of molecules that differ in only one substituent, can be predicted with higher accuracy and precision than any other chemical compound pair. It is expected that any model should be able to predict such a defined change with high accuracy and reasonable precision. In this study, we examine the predictability of four classical properties relevant for drug design, ranging from simple physicochemical parameters (log D and solubility) to more complex cell-based ones (permeability and clearance), using different data sets and machine learning algorithms. Our study confirms that additive data are the easiest to predict, which highlights the importance of recognizing nonadditivity events and the challenging complexity of predicting properties in the case of scaffold hopping. Although deep learning is well suited to model nonlinear events, these methods do not seem to be an exception to this observation. Though they generally perform better than classical machine learning methods, this leaves the field with a still-standing challenge.

Introduction

A matched molecular pair (MMP) describes a pair of molecules that differs in one substituent only. Such a structural transformation is associated with a potential property change. MMP analysis is often used by medicinal chemists to compare properties in order to understand the structure–activity relationship (SAR) for a series of compounds. An extension from a pair to a series of molecules that differ in a single transformation forms a matched molecular series (MMS). MMS have been used to investigate automatic ways to derive an SAR similarity score^1,2 and to predict ADME properties.³

The reason for the popularity of MMP analysis is its intuitiveness: a particular change in a molecular structure introduces a certain change in a biological activity or physical property. However, this simple concept works only under the assumptions of linearity and additivity. Linearity means that the change in property due to a particular change in structure is constant. Additivity means that the effect of a structural change on a property is independent of other variables. It is important to take these assumptions into consideration before performing an MMP analysis or building a quantitative structure–activity relationship (QSAR) model.⁴ Unfortunately, most publications in the field do not report any such analysis of their data sets. We advocate that this step become good practice in the QSAR/ML field, because a linear model fails to capture the trend of nonadditive data mathematically, resulting in erroneous predictions.

Another important aspect of checking the validity of the additivity assumption is the identification of outliers. Outliers indicate so-called activity cliffs, a pair of molecules or even a single observation where a small structural change causes a significant change in property or biological activity.⁵⁻⁷ Analyzing and understanding outliers can lead to a more efficient and effective design of molecules. The interpretation of activity cliffs is hampered by the complexity of the underlying effects and the fact that they can arise from any combination of these.⁸⁻¹⁰ A common example of such an activity cliff is the so-called magic methyl, where a single methyl group has a large effect on the bioactivity or selectivity of a molecule.^11,12

Nonadditive data highlight critical changes in the SAR and are therefore the most interesting for a medicinal chemist. The most common causes of nonadditive SAR are interactions between substituents, different binding modes, and changes in protein conformation.^8,9 Identifying and analyzing nonadditive effects is important and can lead to an understanding of changes in binding modes or ligand conformation. Additionally, such analysis prevents chemists from missing good compounds and can change the direction of ligand optimization.

Especially with the recent advancement of deep learning, many methods have become available to predict molecular properties. State-of-the-art property prediction models make use of fingerprints as molecular representations.¹³⁻¹⁷ Furthermore, models can be trained on SMILES representations or molecular graphs so that the network learns the important features itself, without the need for precalculated molecular descriptors.¹⁸⁻²⁴

In this publication, we examine several machine learning and deep learning algorithms to predict four properties (log D, solubility, permeability, and clearance) using different data sets obtained from AstraZeneca’s (AZ) internal database. First, we determine the experimental uncertainty for each property, as this sets an upper limit for the predictability of in silico models.^25,26 Then, we perform a nonadditivity analysis (NAA) using the algorithm published by Kramer²⁷ to identify nonadditive datapoints. Based on this analysis, we generate four data sets: (1) all data, additive and nonadditive; (2) all MMPs; (3) additive MMPs (MMPs A); and (4) nonadditive MMPs (MMPs N). By comparing the different data sets, we analyze the influence of nonadditivity on the modeling and check whether using only MMPs is beneficial for the performance of a model. A variety of methods are considered, starting from simple partial least squares (PLS, serving as a benchmark), through random forest (RF), support vector regressor (SVR), and gradient-boosted trees (XGBoost), to deep learning algorithms (single-task and multitask deep neural networks). The quality of the models was evaluated using statistical parameters (R² and RMSE). Other metrics common in QSAR studies, such as receiver operating characteristic curves or precision-recall curves, are not taken into account because our intent is not to judge the performance in a virtual screening setting. Our aim is to evaluate the capability of machine learning methods to qualify and predict MMPs, the smallest possible compound change in a medicinal chemistry project.

Methods

An overview of the workflow of the whole study is presented in Figure S1. In the following sections, we describe each step in more detail.

Data Sets

In-house AstraZeneca data were used for all four properties: log D, solubility in DMSO, cell permeability, and liver microsome clearance. By using in-house data, a continuous assay setup is guaranteed for each property, which reduces the influence of systematic errors on the analysis.

All in-house data were collected on September 14, 2020. Data were curated based on our previously developed pipeline.⁴ Herein, molecules were standardized using PipelinePilot (standardization of stereoisomers, neutralization of charges, and removal of unknown stereochemistry), and the canonical tautomer was generated and kept for further analysis. All properties were converted to log values (SI Table S1). Further data curation involved the removal of unknown or uncertain (“<”, “>”) values and of molecules with more than 70 heavy atoms (data_all). Subsequently, for compounds measured multiple times, the median was calculated (data_stereo). Finally, compounds with large differences between their multiple measurements (>2.5 log units) were discarded, and compounds varying only in their stereochemistry were combined, keeping the more active compound (Table 1 and Set 1, Table 2).
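The aggregation of repeated measurements described above can be sketched as follows. This is a minimal illustration, not the curation pipeline itself: the record layout, the function name, and the reading of "large differences" as the range (maximum minus minimum) of the repeats are assumptions made here.

```python
from statistics import median


def aggregate_measurements(records, max_spread=2.5):
    """Collapse repeated log-scale measurements to one value per compound.

    records: iterable of (compound_id, value_string) tuples (hypothetical layout).
    Qualified values such as "<4.0" or ">6.5" and empty strings are skipped.
    Compounds whose repeats span more than max_spread log units are dropped
    (assumption: the spread is measured as max - min).
    """
    grouped = {}
    for compound_id, raw in records:
        raw = raw.strip()
        if not raw or raw[0] in "<>":
            continue  # unknown or censored value
        grouped.setdefault(compound_id, []).append(float(raw))

    curated = {}
    for compound_id, values in grouped.items():
        if max(values) - min(values) > max_spread:
            continue  # repeats disagree too strongly
        curated[compound_id] = median(values)
    return curated
```

A compound measured once passes through unchanged, because the median of a single value is that value.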

Table 1. Number of (Nof) Compounds (cpds) after the Different Curation Steps.

property	data all | w/o outliers	Nof multimeasures^a	Nof stereoduplicates^a	Nof cpds in Set 1
log D	215,418 | 214,320	18,429	6510	207,306
solubility	226,955 | 226,189	21,444	5527	219,987
permeability	18,076 | 18,051	2282	646	17,257
clearance	179,637 | 179,495	24,493	5408	172,947


^a Compounds measured ≥2 times.

Table 2. Number of (Nof) Compounds (cpds) in Each Data Set^a.

property	data	Nof cpds	training	test
log D	Set 1 (all data)	207,306	165,844	41,462
	Set 2 (all MMPs)	187,162	149,729	37,433
	Set 3 (MMPs A)	47,380	37,904	9476
	Set 4 (MMPs N)	24,775	19,820	4955
solubility	Set 1 (all data)	219,987	175,989	43,998
	Set 2 (all MMPs)	196,451	157,160	39,291
	Set 3 (MMPs A)	45,976	36,780	9196
	Set 4 (MMPs N)	27,650	22,120	5530
permeability	Set 1 (all data)	17,257	13,805	3452
	Set 2 (all MMPs)	14,612	11,689	2923
	Set 3 (MMPs A)	4443	3554	889
	Set 4 (MMPs N)	909	727	182
clearance	Set 1 (all data)	172,947	138,357	34,590
	Set 2 (all MMPs)	155,043	124,034	31,009
	Set 3 (MMPs A)	33,755	27,004	6751
	Set 4 (MMPs N)	21,471	17,176	4295


^a A, additive data; N, nonadditive data.

Using the open-source package mmpdb,²⁸ all MMPs were obtained (Set 2, Table 2). Based on the NAA, two additional sets were generated, one containing only additive compounds (Set 3) and one containing only nonadditive ones (Set 4). To determine (non-)additivity, a double transformation cycle (DTC) must be generated. Because not all MMPs are also in a DTC, the number of MMPs (Set 2) is larger than the combination of Set 3 and Set 4.

For log D and solubility, the sizes of the corresponding sets are similar, with clearance generally having slightly fewer compounds. The data sets for permeability are about 10 times smaller. The exception is permeability Set 4, which is smaller still, with only 909 compounds in total.

For machine learning approaches, we would expect Set 4 to be the most difficult to predict, followed by Set 1. Set 3 should be the easiest, because all compounds are additive.

The training and test sets for the machine learning approaches were obtained using a classical stratified training/test split with a ratio of 0.8 to 0.2.
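For a continuous property, one common way to stratify is to bin the target values and split each bin separately. The sketch below assumes that reading; the text does not state how the strata were defined, and the function name, the number of bins, and the seed are illustrative choices.

```python
import random


def stratified_split(values, test_fraction=0.2, n_bins=10, seed=0):
    """Return (train_idx, test_idx) so that each target-value bin is split ~80/20.

    values: list of continuous target values (e.g. log D).
    Binning by rank into n_bins equally sized bins is an assumption of this sketch.
    """
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_size = max(1, -(-len(order) // n_bins))  # ceiling division
    train_idx, test_idx = [], []
    for start in range(0, len(order), bin_size):
        bin_members = order[start:start + bin_size]
        rng.shuffle(bin_members)
        n_test = round(len(bin_members) * test_fraction)
        test_idx.extend(bin_members[:n_test])
        train_idx.extend(bin_members[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting within bins keeps the distribution of the property similar in the training and test sets, which matters when the extremes of the range are sparsely populated.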

Experimental Uncertainty and R_max²

For all selected properties, data were collected for (a) multiple measurements of the same compound (data_all) and (b) measurements of compounds differing only in their stereochemistry (data_stereo). These data were used to calculate the experimental uncertainty of each respective assay.

Herein, the weighted mean was used to derive the experimental uncertainty for each property:

with x being the bin where 2.5% (0.5%) of the datapoints for multimeasures (stereoduplicates) are included. A smaller number of datapoints per bin would only lead to an artificial increase in the experimental uncertainty.
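As a reading aid, a count-weighted mean of per-bin uncertainties could take the following form. The symbols are assumptions introduced here and are not taken from the source: $\sigma_x$ is the standard deviation of the repeated measurements in bin $x$, $n_x$ is the number of datapoints in that bin, and $\sigma_E$ is the resulting experimental uncertainty. The exact weighting used in the study may differ.

```latex
\sigma_E = \frac{\sum_{x} n_x \, \sigma_x}{\sum_{x} n_x}
```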

Based on the experimental uncertainty, the maximum R² achievable for a machine learning approach can be determined:²⁹
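A relation commonly used for this purpose is R_max² = 1 − (σ_E/σ_y)², where σ_E is the experimental uncertainty and σ_y is the standard deviation of the measured values in the data set. The sketch below assumes this form; it is not confirmed by the text above, and the function and parameter names are illustrative.

```python
from statistics import pstdev


def r2_max(measured_values, experimental_sigma):
    """Upper bound on R^2 under the assumed relation 1 - (sigma_E / sigma_y)^2.

    measured_values: property values of the data set (log scale).
    experimental_sigma: experimental uncertainty of the assay (same units).
    """
    spread = pstdev(measured_values)  # population standard deviation of the data
    if spread == 0:
        raise ValueError("data set has no variance; R^2 is undefined")
    return 1.0 - (experimental_sigma / spread) ** 2
```

Under this relation, a data set with a narrow property range has a low achievable R² even when the assay itself is precise, because the noise makes up a larger share of the total variance.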

Nonadditivity Analysis

NAA was performed to determine (non-)additivity in a compound data set. For this purpose, the open-source NA analysis code published by Christian Kramer was used (available on GitHub: https://github.com/KramerChristian/NonadditivityAnalysis).²⁷ The code is written in Python and makes use of the cheminformatics library RDKit³⁰ as well as Pandas and NumPy. NA calculations are based on matched molecular squares, so-called DTCs, which consist of four MMPs (four compounds) linked by two distinct transformations. The MMPs in the NA code are generated by the open-source code developed by Dalke et al.,²⁸ an implementation of the MMPA algorithm developed by Hussain and Rea.³¹ The NA value of each DTC is calculated as the difference in logged biological activities (pAct_1–4) of the four compounds assembling the cycle:
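The nonadditivity of a cycle is usually written as ΔΔpAct = (pAct_2 − pAct_1) − (pAct_3 − pAct_4). The sketch below assumes this form and the following numbering, which is a labeling convention chosen here: compounds 1→2 and 4→3 are related by the same transformation, and 1→4 and 2→3 by the other.

```python
def nonadditivity(p_act_1, p_act_2, p_act_3, p_act_4):
    """Nonadditivity of a double transformation cycle (assumed convention).

    Transformation A maps compound 1 -> 2 and compound 4 -> 3.
    If A has the same effect in both contexts, the result is zero.
    """
    effect_without_b = p_act_2 - p_act_1  # A applied before transformation B
    effect_with_b = p_act_3 - p_act_4     # A applied after transformation B
    return effect_without_b - effect_with_b


# Illustrative (made-up) values: A adds 1.0 log unit in both contexts.
additive_cycle = nonadditivity(5.0, 6.0, 6.5, 5.5)
# Here A adds 1.0 without B but 2.0 with B, so the cycle is nonadditive.
nonadditive_cycle = nonadditivity(5.0, 6.0, 7.5, 5.5)
```

In practice a cycle is only called nonadditive when the absolute value clearly exceeds what the experimental uncertainty of four measurements can explain.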

Machine/Deep Learning

Machine Learning Models Using Optuna

PLS, RF, SVR, and gradient-boosted trees (XGBoost) models were built using Optuna (https://optuna.org).³² Optuna is a hyperparameter optimization framework and forms the basis of our in-house QPTUNA framework (available on GitHub: https://github.com/MolecularAI/Qptuna), which extends Optuna by adding chemoinformatics functionality. Optuna allows the hyperparameter search space to be specified for a plethora of machine learning algorithms and automatically tries to optimize the hyperparameters with respect to a defined output metric for a specified number of trials. Because it uses a surrogate model, such a search should be more efficient than a mere random or grid search.

For each of the data sets provided, we trained a number of regressors for a minimum of 300 iterations each. This was done with threefold cross-validation (see Table S2 for details) to avoid overfitting during training, and models were then built from the entire training sets. Finally, the models were evaluated on the respective test sets.
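The procedure of scoring hyperparameter candidates by threefold cross-validation and then refitting on the entire training set can be sketched with a toy model. This is not the Optuna or QPTUNA workflow: the surrogate-guided sampler is replaced by a plain candidate list, and the one-parameter ridge regressor and all function names are stand-ins chosen to keep the example self-contained.

```python
import math
import random


def k_fold_indices(n_samples, k=3, seed=0):
    """Shuffle the sample indices once and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, lam):
    """Slope of y ~ w * x with an L2 penalty lam on w (toy regressor)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def rmse(xs, ys, w):
    return math.sqrt(sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs))


def cv_rmse(xs, ys, lam, k=3):
    """Mean RMSE over k held-out folds for one hyperparameter value."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(rmse([xs[i] for i in held_out], [ys[i] for i in held_out], w))
    return sum(scores) / k


def select_and_refit(xs, ys, candidates):
    """Pick the candidate with the lowest CV RMSE, then refit on all training data."""
    best = min(candidates, key=lambda lam: cv_rmse(xs, ys, lam))
    return best, fit_ridge_1d(xs, ys, best)
```

The test set plays no role in either step; it is touched only once, after the refit, to report the final R² and RMSE.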

For some of the SVR runs, we had to use a “downsampled” data set (10% of the corresponding original size) to be able to obtain optimized hyperparameters within a reasonable time frame. This was done for log D, solubility, and clearance (Sets 1–3). For the remaining sets (all permeability data sets and Set 4 for each property), all datapoints were used for hyperparameter optimization. The subsequent steps, model training and prediction of the respective test sets, were performed on the full-size sets for all properties.

Graph Neural Network Deep Learning Model

The message passing neural network (MPNN)³³ framework operates on molecular graphs with atoms as nodes and bonds as edges. There are two main phases: (1) message passing phase, in which the node information is propagated and updated across the graph in order to build a neural representation of the whole graph, and (2) readout phase, when a final feature vector/representation describing the whole graph is created. Then a feed-forward neural network can be applied to this feature vector for prediction tasks.
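The two phases can be illustrated with a deliberately simplified sketch that contains no learned weights or activation functions: each atom adds the summed states of its neighbors to its own state, and the readout sums over atoms. The function names and the toy atom features are assumptions of this example; a real MPNN applies trainable message and update functions at each step.

```python
def message_passing(node_features, bonds, depth=2):
    """Propagate atom states over the molecular graph for `depth` steps.

    node_features: one feature list per atom.
    bonds: list of (atom_a, atom_b) index pairs (undirected).
    Update rule of this sketch: new_state = old_state + sum(neighbor states).
    """
    neighbors = {i: [] for i in range(len(node_features))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    hidden = [list(f) for f in node_features]
    for _ in range(depth):
        updated = []
        for atom, state in enumerate(hidden):
            message = [sum(hidden[n][d] for n in neighbors[atom])
                       for d in range(len(state))]
            updated.append([s + m for s, m in zip(state, message)])
        hidden = updated
    return hidden


def readout(hidden):
    """Sum the atom states into one vector describing the whole graph."""
    return [sum(column) for column in zip(*hidden)]


# Three-atom chain C-C-O with one-hot element features [is_C, is_O].
graph_vector = readout(
    message_passing([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [(0, 1), (1, 2)], depth=2)
)
```

After two steps every atom has received information from atoms up to two bonds away, which is why the depth is one of the tuned hyperparameters.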

The directed message passing neural network (D-MPNN)²⁴ (available on GitHub: https://github.com/chemprop/chemprop) builds upon the MPNN framework with the difference that during the message passing phase, the directed edge information is used instead of node information.
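The difference to node-based message passing can be sketched in the same simplified style. Hidden states live on directed bonds a→b, and the state of a→b is updated only from bonds k→a with k ≠ b, so a message never flows straight back along the bond it arrived on. The code is a weight-free toy with scalar atom features, not the chemprop implementation, and the function name is illustrative.

```python
def directed_message_passing(atom_features, bonds, depth=2):
    """Toy directed-edge message passing with scalar atom features.

    Each directed bond (a, b) starts with the feature of its source atom a.
    Update: state(a, b) = initial(a, b) + sum of state(k, a) for neighbors k of a, k != b.
    Returns one value per atom: the sum of the states on its incoming bonds.
    """
    neighbors = {i: [] for i in range(len(atom_features))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    initial = {(a, b): atom_features[a] for a in neighbors for b in neighbors[a]}
    hidden = dict(initial)
    for _ in range(depth):
        # The comprehension reads the previous `hidden` before it is rebound.
        hidden = {
            (a, b): initial[(a, b)]
            + sum(hidden[(k, a)] for k in neighbors[a] if k != b)
            for (a, b) in initial
        }
    return [sum(hidden[(k, a)] for k in neighbors[a])
            for a in range(len(atom_features))]


# Three-atom chain with scalar features 1, 2, 3.
atom_states = directed_message_passing([1.0, 2.0, 3.0], [(0, 1), (1, 2)], depth=2)
```

In this chain the bond 0→1 never picks up anything from atom 1, which is the behavior that avoids messages echoing back and forth between two bonded atoms.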

In this study, the D-MPNN model was trained in a single-task setting and a multitask setting. In the single-task setting, the model was trained individually for each property task, while in the multitask setting, a multitask model was trained on the union of the training sets from all the property tasks where each molecule has four target values. Therefore, after training, the multitask model can predict the four properties simultaneously for the molecules of the test set.
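Building the multitask training table from the four single-task sets can be sketched as below. The property keys, the SMILES, and the values are made up for illustration, and the handling of molecules that lack a measurement for some property (left empty and excluded from the loss) is an assumption of this sketch rather than a detail stated in the text.

```python
PROPERTIES = ("logD", "solubility", "permeability", "clearance")  # hypothetical keys


def build_multitask_table(single_task_sets):
    """Union of the single-task training sets, one row of four targets per molecule.

    single_task_sets: {property_name: {smiles: value}}.
    Targets that were never measured for a molecule stay None.
    """
    table = {}
    for column, prop in enumerate(PROPERTIES):
        for smiles, value in single_task_sets.get(prop, {}).items():
            row = table.setdefault(smiles, [None] * len(PROPERTIES))
            row[column] = value
    return table


def masked_mse(predictions, targets):
    """Mean squared error over the targets that are actually present."""
    pairs = [(p, t) for p, t in zip(predictions, targets) if t is not None]
    if not pairs:
        return 0.0
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


# Made-up example values.
table = build_multitask_table({
    "logD": {"CCO": -0.3, "c1ccccc1": 2.1},
    "clearance": {"CCO": 1.0},
})
```

Masking lets a molecule contribute to the shared representation through the properties it does have, without penalizing the model for targets that were never measured.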

Hyperparameter optimization was performed for each data set using Bayesian optimization (i.e., Hyperopt³⁴) as provided by chemprop, which searches for the optimal parameters (hidden size, depth, dropout, and the number of feed-forward layers; details about the search space can be found in chemprop) through multiple trials. In particular, 20 and 50 hyperparameter trial settings were tried in the single-task setting, which results in two models for each data set, hereafter named DNN-S_20 and DNN-S_50, respectively. For the multitask setting, only 20 hyperparameter trial settings were tried (DNN-M_20).

During the hyperparameter optimization, the original training set in Table 2 was split into training, validation, and test sets with a ratio of 0.8, 0.1, and 0.1 to find the best parameter configuration based on the RMSE metric. The model was then trained using this parameter configuration, with the original training set in Table 2 split into training and validation sets with a ratio of 0.8 and 0.2. Finally, the trained model was applied to the test set in Table 2 to obtain the predictions.