J Chem Inf Model. 2020 Apr 3;60(4):1955–1968. doi: 10.1021/acs.jcim.9b01053

AMPL: A Data-Driven Modeling Pipeline for Drug Discovery

Amanda J Minnich , Kevin McLoughlin , Margaret Tse , Jason Deng , Andrew Weber , Neha Murad , Benjamin D Madej , Bharath Ramsundar §, Tom Rush , Stacie Calad-Thomson , Jim Brase , Jonathan E Allen †,*
PMCID: PMC7189366  PMID: 32243153

Abstract


One of the key requirements for incorporating machine learning (ML) into the drug discovery process is complete traceability and reproducibility of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing ML models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of ML and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical data sets covering a wide range of parameters. Our key findings indicate that traditional molecular fingerprints underperform other feature representation methods. We also find that data set size correlates directly with prediction performance, which points to the need to expand public data sets. Uncertainty quantification can help predict model error, but correlation with error varies considerably between data sets and model types. Our findings point to the need for an extensible pipeline that can be shared to make model building more widely accessible and reproducible. This software is open source and available at: https://github.com/ATOMconsortium/AMPL.

Introduction

Discovery of new compounds to treat human disease is a multifaceted process involving the selection of chemicals with favorable pharmacological properties: high potency against the desired target, elimination or minimization of safety liabilities, and a favorable pharmacokinetic (PK) profile. To address this challenge, the drug discoverer has a wealth of choices, with total “drug-like” chemical matter estimated at between 10^22 and 10^60 unique molecules. However, evaluating the desirability of these molecules with respect to potency, pharmacokinetics, and safety liabilities is a time-consuming and expensive process. Many of these molecules require de novo synthesis, which is a rate-limiting step. Furthermore, evaluation of pharmacological properties both in vitro and especially in vivo is prohibitively expensive given the universe of possible choices.

To aid in this design challenge, the field of computer-aided drug design has evolved to rapidly predict the properties of pharmacological matter in silico, allowing for rational selection of a feasible set of compounds for synthesis and evaluation. These techniques generally fall into two categories: (1) structure-based drug design, which relies on knowledge of the target structure (i.e., docking, molecular dynamics, free energy perturbation), and (2) ligand-based drug design, which uses known properties of molecules to develop models of quantitative structure–activity relationships (QSAR).

Ligand-based drug design generally relies on machine learning-based techniques to identify the link between structure and the property of interest. Recently, a proliferation of advanced machine learning techniques has shown great promise in increasing the predictive performance of QSAR models. A deep learning model first won the Merck Kaggle multiactivity challenge in 2014,1 and since then these models have continued to show increased predictive accuracy over QSAR models based on classical machine learning techniques in many studies.2 A recent example of success with deep learning is the paper by Feinberg et al., which compared the PotentialNet deep learning method with existing shallow learners on a wide array of pharmaceutically relevant data sets.3 These results showed dramatic improvements for deep learning based on temporal splits using data collected from a pharmaceutical company. Another evaluation showed that a directed message-passing neural network model can provide robust performance over a range of experimental data sets characterizing molecular properties.4 The authors provide open-source deep learning software to accompany this paper that has been tested on a wide range of parameters. However, this software does not include any type of modular pipeline that would allow for the incorporation of different models and chemical representations. Overall, there has been a lack of publicly available suites of software tools that support transparent and reproducible generation of a diverse array of deep and classical machine learning models, especially ones that can scale to model the large set of pharmaceutically relevant parameters. A major advance toward this goal was made with the introduction of DeepChem,5 which supports the building of a variety of machine learning models for small molecule property prediction. DeepChem contains a variety of very helpful modules and tools, but has limitations in its ability to robustly train models from a wide selection of hyperparameters, and published performance evaluation is limited to a small number of public data sets with less diverse pharmaceutical relevance.6

In this paper, we introduce a new small molecule property prediction pipeline, AMPL. This software was developed through the Accelerating Therapeutics Opportunities in Medicine (ATOM) Consortium as the ATOM Modeling PipeLine. The key contributions of this work are to automate deep learning training, particularly in hyperparameter search; to enable extensive performance benchmarking; and to apply AMPL to a large collection of pharmaceutically relevant property-prediction data sets. Most notably, AMPL is available as open source to benefit the drug discovery community.

The closest existing pipeline tools are BIOVIA Pipeline Pilot7 and KNIME.8 Pipeline Pilot is a license-based graphical tool for machine learning pipelining. It has capabilities for data cleaning, splitting, training, and model deployment, but these are all mainly GUI-based, limiting the customizability of the software. Furthermore, it is only available for a licensing fee, so it does not target the open source community. In terms of free and open source software suites for data analytics, the main alternative is KNIME. This software provides an environment for creating general data flows to process data, use predictive models, and analyze complex data sets. An ecosystem of open source and commercial KNIME node extensions has developed which enables workflows for library analysis, virtual screening, model fitting, and prediction. In contrast, AMPL is tightly focused on integrating modern machine learning methods with best practices for chemical activity and property prediction. Important issues with machine learning models, such as data set characterization, model validation, and uncertainty quantification, are addressed by AMPL in automated and reproducible ways. The code suite also provides high performance computing modules for model fitting, hyperparameter optimization, and prediction. AMPL currently targets job submission-based clusters to scale training runs; however, AMPL could be adapted to operate on other scalable platforms such as Spark in the future. Furthermore, AMPL is implemented as a modular and reusable Python library to allow for easy integration with other data science software platforms.

An extensive set of experiments were conducted with AMPL, and key observations include the following:

  • Physicochemical descriptors and deep learning-based graph representations are significantly better than traditional fingerprints at characterizing molecular features.

  • Data set size is directly correlated with prediction performance: single-task deep learning models only outperform shallow learners if there is enough data. Likewise, data set size has a direct impact on model predictivity, independent of comprehensive hyperparameter tuning. Our findings point to the need for public data set integration or multitask/transfer learning approaches.

  • DeepChem uncertainty quantification (UQ) analysis may help identify model error; however, efficacy of UQ to filter predictions varies considerably between data sets and model types.

The aim of this paper is to present the rigorous and transparent open source software pipeline AMPL to build global and local ‘baseline’ models for a wide array of molecular properties that are needed for in silico drug discovery. This new software will support reproducible training and testing protocols that enable the broader modeling community to evaluate and improve modeling approaches over time.

Methods

Figure 1 shows the overall architecture of AMPL. This end-to-end pipeline supports all functions needed to generate, evaluate, and save machine learning models: data ingestion and curation, featurization of chemical structures into feature vectors, training and tuning of models, storage of serialized models and metadata, and visualization and analysis of results. It also contains modules for parallelized hyperparameter search on high-performance computing (HPC) clusters.

Figure 1. Overview of AMPL.

Data Curation

AMPL includes several modules to curate data into machine learning-ready data sets. Functions are provided to represent small molecules with canonicalized SMILES strings using RDKit9 and the MolVS package,10 by default stripping salts and preserving isomeric forms. Data curation procedures are provided with AMPL as Jupyter notebooks,11 which can be used as examples for curating new data sets. Procedures allow for averaging response values for compounds with replicate measurements and filtering compounds with high variability in their measured response values. AMPL also provides functions to assess the structural diversity of the data set, using either Tanimoto distances between fingerprints or Euclidean distances between descriptor feature vectors.
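As a concrete illustration of these curation steps, the following is a minimal sketch (function and column names are hypothetical, not AMPL's actual API) of canonicalizing SMILES with RDKit and MolVS, stripping salts, averaging replicate measurements, and filtering compounds with highly variable responses:

```python
# Minimal curation sketch (hypothetical names, not AMPL's API): canonicalize
# SMILES with RDKit/MolVS, strip salts, keep isomeric forms, average replicate
# measurements, and drop compounds whose replicates disagree strongly.
import pandas as pd
from molvs import Standardizer
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

def curate(df, smiles_col="smiles", value_col="value", max_std=1.0):
    standardizer, salt_remover = Standardizer(), SaltRemover()

    def canonical_smiles(smi):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            return None
        mol = salt_remover.StripMol(standardizer.standardize(mol))
        return Chem.MolToSmiles(mol, isomericSmiles=True)  # preserve stereochemistry

    df = df.assign(base_smiles=df[smiles_col].map(canonical_smiles))
    df = df.dropna(subset=["base_smiles"])
    stats = (df.groupby("base_smiles")[value_col]
               .agg(value="mean", value_std="std", n="count")
               .reset_index())
    # Keep singletons and compounds whose replicate standard deviation is acceptable
    return stats[(stats["n"] == 1) | (stats["value_std"].fillna(0.0) <= max_std)]
```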

Data ingestion and curation-related parameters include the following:

  • Unique human readable name for training file

  • Data privilege access group

  • Parameter for overriding the output files/data set object names

  • ID for the metadata + data set

  • Boolean flag for using an input file from the file system

  • Name of column containing compound IDs

  • Name of column containing SMILES strings

  • List of prediction task names

  • Number of classes for classification

  • User specified list of names of each class

  • Boolean switch for using transformation on regression output. Default is True

  • Response column normalization type

  • Minimum number of data set compounds considered adequate for model training. A warning message will be issued if the data set size is less than this.

Featurization

AMPL provides an extensible featurization module that can generate a variety of molecular feature types, given SMILES strings as input. They include the following:

  • Extended connectivity fingerprints (ECFP) with arbitrary radius and bit vector length12

  • Graph convolution latent vectors, as implemented in DeepChem13

  • Chemical descriptors generated by the Mordred open source package14

  • Descriptors generated by the commercial software package Molecular Operating Environment (MOE)15

  • User-defined custom feature classes

Because some types of features are expensive to compute, AMPL supports two kinds of interaction with external featurizers: a dynamic mode in which features are computed on-the-fly and a persistent mode whereby features are read from precomputed tables and matched by compound ID or SMILES string. In the persistent mode, when SMILES strings are available as inputs, the featurization module matches them against the precomputed features where possible and computes features dynamically for the remainder. Because precomputed feature tables may span hundreds or thousands of feature columns for millions of compounds, the module uses the feather format16 to speed up access.
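A simplified sketch of these two modes is shown below; the feather table layout and its “smiles” index column are assumptions for illustration, not AMPL's internal schema:

```python
# Dynamic mode: compute ECFP fingerprints on the fly with RDKit.
# Persistent mode: look up precomputed descriptors stored in feather format.
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp_features(smiles_list, radius=2, n_bits=1024):
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int32)
        DataStructs.ConvertToNumpyArray(bv, arr)
        rows.append(arr)
    return np.vstack(rows)

def precomputed_features(smiles_list, feather_path="descriptors.feather"):
    table = pd.read_feather(feather_path).set_index("smiles")
    found = table.reindex(smiles_list)
    missing = list(found.index[found.isna().all(axis=1)])  # featurize these dynamically
    return found, missing
```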

Featurized data sets for feature types that support persistent mode (currently, all except ECFP fingerprints and graph convolution format) are saved in the filesystem or remote datastore, so that multiple models can be trained on the same data set. This also facilitates reproducibility of model results.

Chemical descriptor sets such as those generated by MOE often contain descriptors that are exact duplicates or simple functions of other descriptors. In addition, large blocks of descriptors may be strongly correlated with one another, often because they scale with the size of the molecule. The featurization module deals with this redundancy by providing an option to remove duplicate descriptors and to scale a subset of descriptors by the number of atoms in the molecule (while preserving the atom count as a distinct feature). Factoring out the size dependency often leads to better predictivity of models.
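A rough sketch of this redundancy handling, with illustrative column names, is:

```python
# Drop exactly duplicated descriptor columns, then factor out molecule-size
# dependence by dividing selected descriptors by the atom count, which is
# itself kept as a feature. Column names here are illustrative only.
import pandas as pd  # desc_df is assumed to be a pandas DataFrame

def dedup_and_scale(desc_df, size_dependent_cols, atom_count_col="natoms"):
    desc_df = desc_df.T.drop_duplicates().T          # remove duplicate columns
    for col in size_dependent_cols:
        if col in desc_df.columns and col != atom_count_col:
            desc_df[col] = desc_df[col] / desc_df[atom_count_col]
    return desc_df
```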

The featurization module can be easily extended to handle descriptors generated by other software packages, latent vectors generated by autoencoders, and other types of chemical fingerprints. In most cases, this can be accomplished by writing a small function to invoke the external feature generation software, and by adding an entry to a table of descriptor types, listing the generated feature columns to be used. In more complicated cases, one may need to write a custom subclass of one of the base featurization classes.

Featurization-relevant input parameters include the following:

  • Type of molecule featurizer

  • Feature matrix normalization type

  • Boolean flag for loading in previously featurized data files

  • Type of transformation for the features

  • Radius used for ECFP generation

  • Size of ECFP bit vectors

  • Type of autoencoder, e.g., molvae, jt

  • Trained model HDF5 file path, only needed for MolVAE featurizer

  • Type of descriptors, e.g., MOE, Mordred

  • Maximum number of CPUs to use for Mordred descriptor computations. None means use all available.

  • Base of key for descriptor table file

Data Set Partitioning

AMPL supports several options for partitioning data sets for model training and evaluation. By default, splitting follows an approach similar to nested cluster-cross-validation.17 Data sets are divided into three nonoverlapping bins: (1) training, (2) model selection (validation), and (3) performance evaluation (test). Nested cluster-cross-validation uses single-linkage clustering and Jaccard distance from ECFP4 fingerprints. This option is available in AMPL as the “fingerprint” splitter. The results reported here are based on a different clustering method, which clusters molecules by shared scaffolds using the Bemis–Murcko definition. Both methods show similar reduced prediction accuracy when compared to randomly separating molecules into one of the three nonoverlapping bins. Alternatively, AMPL offers a k-fold cross-validation option, to assess the performance impact of sampling from the modeled data set. Under k-fold cross-validation, the holdout test set is selected first, and the remainder is divided into k-fold sets for training and validation.

AMPL offers a number of data set splitting algorithms, which offer different approaches to the problem of building models that generalize from training data to novel chemical space. It supports several of the methods included in DeepChem, including random splits, Butina clustering, Bemis–Murcko scaffold splitting, and the algorithm based on fingerprint dissimilarity.6 In addition, we implemented temporal splitting and a modified version of the asymmetric validation embedding (AVE) debiasing algorithm.18 We compared random splitting with Bemis–Murcko scaffold splitting for our benchmarking experiments.
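As an illustration of the scaffold-based approach used in our benchmarks, the following is a simplified Bemis–Murcko scaffold split: a stand-in for the DeepChem splitter that AMPL wraps, not the exact implementation:

```python
# Simplified scaffold split: group compounds by generic Bemis-Murcko scaffold,
# then fill train/validation/test bins with whole scaffold groups so that no
# scaffold is shared across bins. Assumes all SMILES strings are valid.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.7, frac_valid=0.1):
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(i)
    # Largest scaffold groups first, so common scaffolds land in the training set
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```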

Input parameters related to data splitting include the following:

  • Type of splitter to use: index, random, scaffold, Butina, ave_min, temporal, fingerprint, or stratified

  • Boolean flag for loading in previously split train, validation, and test CSV files

  • UUID for CSV file containing train, validation, and test split information

  • Choice of splitting type between k-fold cross validation and a normal train/valid/test split

  • Number of k-folds to use in k-fold cross validation

  • Type of splitter to use for train/validation split if temporal split used for test set (random, scaffold, or ave_min)

  • Cutoff Tanimoto similarity for clustering in Butina splitter

  • Cutoff date for test set compounds in temporal splitter

  • Column in data set containing dates for temporal splitter

  • Fraction of data to put in validation set for train/valid/test split strategy

  • Fraction of data to put in held-out test set for train/valid/test split strategy

Model Training and Tuning

AMPL includes a train/tune/predict framework to create high-quality models. This framework supports a variety of model types from two main libraries: scikit-learn19 and DeepChem 2.1.0.5 Currently, specific input parameters are supported for the following:

  • Random forest models from scikit-learn

  • XGBoost models20

  • Fully connected neural network models

  • Graph convolution neural network models21

As with the featurization module, AMPL supports integration of custom model subclasses. Parameters for additional models can be easily added to the parameter parser module.
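A minimal sketch of fitting one supported baseline model type (a scikit-learn random forest regressor) on a featurized, presplit data set is shown below; AMPL wraps this behind its own model classes and parameter parser:

```python
# Fit a baseline random forest regressor and score it on the validation set,
# which AMPL uses for model selection.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def train_rf(X_train, y_train, X_valid, y_valid, max_depth=None):
    model = RandomForestRegressor(n_estimators=500, max_depth=max_depth, n_jobs=-1)
    model.fit(X_train, y_train)
    valid_r2 = r2_score(y_valid, model.predict(X_valid))  # model-selection metric
    return model, valid_r2
```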

Model-relevant input parameters include the following:

  • Type of model to fit (neural network, random forest, or xgboost)

  • Prediction type (regression or classification)

  • Singletask or multitask model

  • Number of decision trees in the forest for random forest models

  • Maximum number of features to split random forest nodes

  • Number of estimators to use in random forest models

  • Batch size for neural network model

  • Optimizer type for neural network model

  • Optimizer specific for graph convolutional models, defaults to “adam”

  • Model batch size for neural network model

  • List of hidden layer sizes for neural network model

  • List of dropout rates per layer for neural network model

  • List of standard deviations per layer for initializing weights for neural network model

  • The type of penalty to use for weight decay, either “l1” or “l2”

  • The magnitude of the weight decay penalty to use

  • List of initial bias parameters per layer for neural network model

  • Learning rate for dense neural network models

  • Epoch for evaluating baseline neural network model performance, if desired

  • Maximum number of training epochs for neural network model

  • Type of score function used to choose best epoch and/or hyperparameters

  • Boolean flag for computing uncertainty estimates for regression model predictions

Epoch Selection for Neural Net Models

Early stopping is an essential strategy to avoid overfitting of neural networks; thus, the number of training epochs is one of the key hyperparameters that must be optimized. To implement early stopping, AMPL trains neural network models for a user-specified maximum number of epochs, evaluating the model on the validation set after each epoch, and identifies the epoch at which a specified performance metric is maximized. By default, this metric is the coefficient of determination R2 for regression models and the area under the receiver operating characteristic curve (ROC AUC) for classification models.
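A sketch of this selection loop, assuming a Keras-style model interface (AMPL itself drives DeepChem's training loop), is:

```python
# Train for up to max_epochs, score the validation set after each epoch, and
# restore the weights from the best-scoring epoch. Assumes `model` exposes
# Keras-style fit/predict/get_weights/set_weights methods.
import numpy as np
from sklearn.metrics import r2_score

def train_with_early_stopping(model, X_train, y_train, X_valid, y_valid, max_epochs=500):
    best_score, best_epoch, best_weights = -np.inf, 0, None
    for epoch in range(1, max_epochs + 1):
        model.fit(X_train, y_train, epochs=1, verbose=0)     # one epoch at a time
        score = r2_score(y_valid, model.predict(X_valid).ravel())
        if score > best_score:
            best_score, best_epoch = score, epoch
            best_weights = model.get_weights()               # snapshot best state
    model.set_weights(best_weights)                          # restore best epoch
    return model, best_epoch, best_score
```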

Model Persistence

Serialized models are saved after training and prediction generation are complete, along with detailed metadata to describe the model. This supports traceability and reproducibility, as well as model sharing. AMPL supports saving models and results either using the file system or optionally through a collection of database services. The metadata can be stored in a mongoDB database22 or as JSON files. AMPL has functions for saving models and loading prebuilt models for prediction generation.
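A minimal illustration of this pattern, with hypothetical file names and metadata fields rather than AMPL's actual schema, is:

```python
# Persist a trained model plus a JSON metadata record describing how it was
# built; field names here are hypothetical.
import json
import joblib
from datetime import datetime, timezone

def save_model(model, params, metrics, path_prefix="my_model"):
    joblib.dump(model, f"{path_prefix}.joblib")              # serialized model
    metadata = {
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "parameters": params,                                # featurizer, splitter, hyperparameters
        "performance": metrics,                              # train/valid/test metrics
    }
    with open(f"{path_prefix}_metadata.json", "w") as fh:
        json.dump(metadata, fh, indent=2)
```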

Model Performance Metrics

AMPL calculates a variety of performance metrics for predictions on the training, validation and test sets. Metrics may be saved in a mongoDB database or in JSON files. For regression models, we calculate the following:

  • Coefficient of determination (R2). This is calculated using sklearn’s metrics function. Note that this score can be negative if the model is arbitrarily worse than random.

    $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$  (1)

  • Mean absolute error (MAE). An advantage of MAE is that it has a clear interpretation: the average absolute difference between the measured value $y_i$ and predicted value $\hat{y}_i$. This works well for cellular activity assay data sets, which use log-normalized dose concentration values with similar concentration ranges across different assays. PK parameters are measured on different scales for some assays, which prevents comparison between assays with this metric.

    $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$  (2)

  • Mean square error (MSE). This is a risk metric corresponding to the expected value of the squared error (or loss).

    $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$  (3)

For classification models, we calculate the following:

  • Area under the receiver operating characteristic curve (ROC AUC). The ROC curve plots the true positive rate versus the false positive rate as a binary classifier’s discrimination threshold is varied. The ROC AUC score is the area under the ROC curve; it ranges from 0 to 1, where 1 is the best score.

  • Precision (positive predictive value)

    $\mathrm{Precision} = \frac{TP}{TP + FP}$  (4)

    where TP = number of true positives and FP = number of false positives.

  • Recall (true positive rate/sensitivity)

    $\mathrm{Recall} = \frac{TP}{TP + FN}$  (5)

    where FN = number of false negatives.

  • Area under the precision-recall curve (PRC-AUC). The precision-recall curve plots precision versus recall as a binary classifier’s discrimination threshold is varied. It is a good measure of prediction success when classes are very imbalanced. High scores show that the classifier is returning accurate results (high precision) as well as returning a majority of all positive results (high recall).

  • Negative predictive value (NPV)

    $\mathrm{NPV} = \frac{TN}{TN + FN}$  (6)

    where TN = number of true negatives and FN = number of false negatives.

  • Cross entropy (log loss)

    $\mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$  (7)

    where $\hat{p}_i$ is the predicted probability that sample $i$ belongs to the positive class.

  • Accuracy

    $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (8)

    where terms are defined as above.
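As a rough sketch (not necessarily AMPL's exact implementation), these metrics map directly onto scikit-learn's metrics module; here y_true and y_prob are assumed to be NumPy arrays of binary labels and predicted positive-class probabilities:

```python
# Compute the regression and classification metrics listed above with sklearn.
from sklearn import metrics

def regression_metrics(y_true, y_pred):
    return {"r2": metrics.r2_score(y_true, y_pred),
            "mae": metrics.mean_absolute_error(y_true, y_pred),
            "mse": metrics.mean_squared_error(y_true, y_pred)}

def classification_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return {"roc_auc": metrics.roc_auc_score(y_true, y_prob),
            "prc_auc": metrics.average_precision_score(y_true, y_prob),  # PRC-AUC approximation
            "precision": metrics.precision_score(y_true, y_pred),
            "recall": metrics.recall_score(y_true, y_pred),
            "npv": metrics.precision_score(1 - y_true, 1 - y_pred),      # NPV = precision of the negative class
            "log_loss": metrics.log_loss(y_true, y_prob),
            "accuracy": metrics.accuracy_score(y_true, y_pred)}
```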

Uncertainty Quantification

Uncertainty quantification (UQ) attempts to measure confidence in a model’s prediction accuracy by characterizing variance in model predictions. Some common objectives for UQ are to use it to guide active learning or to weight model ensembles. AMPL generates UQ values for both random forest and neural network models.

Uncertainty Quantification for Random Forest

Generating a value quantifying uncertainty is straightforward for random forest and is taken to be the standard deviation of predictions from individual trees. This quantifies how variable these predictions are and thus how uncertain the model is in its prediction for a given sample.
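In scikit-learn terms this amounts to the following sketch:

```python
# Random forest uncertainty: standard deviation of per-tree predictions.
import numpy as np

def rf_uncertainty(rf_model, X):
    per_tree = np.stack([tree.predict(X) for tree in rf_model.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)   # prediction, uncertainty
```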

Uncertainty Quantification for Neural Networks

Our neural network-based UQ uses the Kendall and Gal method23 as implemented in DeepChem 2.1.0. This method combines aleatoric and epistemic uncertainty values to compute an estimate of the overall model error:

$\sigma_{\mathrm{total}}^{2} = \sigma_{\mathrm{aleatoric}}^{2} + \sigma_{\mathrm{epistemic}}^{2}$

Aleatoric uncertainty, which is the inherent noise in the data, cannot be reduced by adding more data, but it can be estimated. The aleatoric uncertainty can be predicted concurrently with the response variable by modifying the loss function of the model to include a term for the observation noise.

Epistemic uncertainty, the uncertainty over the model parameters, arises because of limited data. Normally, this is calculated in a bootstrapped manner, as in the case of a random forest. For models that are expensive to train, such as neural networks, we instead train one network to generate a set of predictions by applying a set of dropout masks during prediction. Prediction variability is then quantified to assess epistemic uncertainty.
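A hedged sketch of this Monte Carlo dropout procedure, assuming a tf.keras-style model with dropout layers (a simplified stand-in for the DeepChem 2.1.0 implementation), is shown below; if the network also outputs a per-sample aleatoric variance, the two variances are summed:

```python
# Monte Carlo dropout in the spirit of Kendall and Gal: repeated stochastic
# forward passes with dropout left active give an epistemic variance, which is
# combined with an (optional) aleatoric variance predicted by the network.
import numpy as np

def nn_uncertainty(model, X, n_passes=50, aleatoric_var=None):
    preds = np.stack([np.asarray(model(X, training=True)).ravel()
                      for _ in range(n_passes)])
    mean = preds.mean(axis=0)
    epistemic_var = preds.var(axis=0)
    total_var = epistemic_var if aleatoric_var is None else epistemic_var + aleatoric_var
    return mean, np.sqrt(total_var)
```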

Visualization and Analysis

Plots generated by AMPL’s visualization and analysis module are shown in the Results. Additional options include plots of predicted vs actual values, learning curves, ROC curves, precision vs recall curves, and 2-D projections of feature vectors using UMAP.24 The module also includes functions for characterizing and visualizing chemical diversity. Chemical diversity analysis is crucial for analyzing domain of applicability, bias in data set splitting, and novelty of de novo compounds. This module supports a wide range of input feature types, distance metrics, and clustering methods.

Hyperparameter Optimization

A module is available to support distributed hyperparameter search for HPC clusters. This module currently supports linear grid, logistic grid, and random hyperparameter searches, as well as iteration over user-specified values. To run the hyperparameter search, the user specifies the desired range of configurations in a JSON file. The user can either specify a single data set file or a CSV file with a list of data sets. The script generates all valid combinations of the specified hyperparameters, accounting for model and featurization type, and submits jobs for each combination to the HPC job scheduler. The module includes an option to generate a prefeaturized and presplit data set before launching the model training runs, so that all runs operate on the same data set split. The user can specify a list of layer sizes and dropouts to combine, along with the maximum final layer size and a list of the numbers of possible layers for a given model, and the module combines these different options based on the input constraints to generate a variety of model architectures. The search module can check the model tracker database to avoid retraining models that are already available. It also provides users the option to exclude hyperparameter combinations that lead to overparameterized models, by checking the number of weight and bias parameters for a proposed neural network architecture against the size of the training data set. Finally, the search module throttles job submissions to prevent the user from monopolizing the HPC cluster.
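A simplified sketch of how such a constrained grid of neural network configurations could be enumerated is shown below; AMPL's module additionally handles job submission, throttling, and model-tracker checks:

```python
# Enumerate neural-network hyperparameter combinations and filter out
# architectures whose (approximate) weight count is large relative to the
# training set size. Names and the size heuristic are illustrative.
from itertools import product

def grid_configs(layer_sizes, dropouts, learning_rates, n_layers_options,
                 n_features, n_train, max_params_ratio=0.5):
    configs = []
    for n_layers in n_layers_options:
        for layers in product(layer_sizes, repeat=n_layers):
            # Approximate hidden-layer weight count (ignores biases and output layer)
            n_params = n_features * layers[0] + sum(a * b for a, b in zip(layers, layers[1:]))
            if n_params > max_params_ratio * n_train:
                continue                      # skip over-parameterized architectures
            for dropout, lr in product(dropouts, learning_rates):
                configs.append({"layer_sizes": list(layers),
                                "dropout": dropout,
                                "learning_rate": lr})
    return configs
```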

Input parameters for hyperparameter search include the following:

  • Boolean flag indicating whether we are running the hyperparameter search script

  • UUID of hyperparam search run model was generated in

  • Comma-separated list of number of layers for permutation of NN layers

  • Comma-separated list of dropout rates for permutation of neural network layers

  • The maximum number of nodes in the last layer

  • Comma-separated list of number of nodes per layer for permutation of neural network layers

  • Maximum number of jobs to be in the queue at one time for an HPC cluster

  • Scaling factor for constraining network size based on number of parameters in the network

  • Boolean flag directing whether to check model tracker to see if a model with that particular param combination has already been built

  • Path where pipeline file you want to run hyperparam search from is located

  • Type of hyperparameter search to do. Options are grid, random, geometric, and user_specified

  • CSV file containing list of data sets of interest

Running AMPL

There are three ways to run AMPL:

  • Using a config file: Create a JSON file with desired model parameters and run full pipeline via command line.

  • Using command line arguments: Specify model parameters via standard command line arguments.

  • Interactively in a Jupyter notebook using an argparse.Namespace object or a dictionary

Results

Benchmark experiments were run to evaluate and validate components of the pipeline.

Data

Experimental data sets were made available by ATOM Consortium member GlaxoSmithKline from a variety of bioactivity and pharmacokinetics experiments. Selected data sets were used for training and evaluating models. These data sets are summarized in Table 1.

Table 1. Pharmacokinetics Datasets Used to Benchmark AMPL.

data set units species data set size minimum maximum mean median
blood to plasma ratio   human 101 0.47 10.5 0.85 0.77
blood to plasma ratio   dog 71 0.37 6.85 0.85 0.88
plasma protein binding HSA fraction unbound human 123734 0.0001 1 0.05 0.044
plasma protein binding HSA fraction unbound rat 2086 0.0001 1 0.036 0.033
plasma clearance (in vivo) mL/min/kg dog 1181 0.1 2946 12.6 15.2
plasma clearance (in vivo) mL/min/kg rat 10431 0.001 8763 30.2 38.2
Vd,ss L/kg dog 1054 0.07 569 1.9 1.9
Vd,ss L/kg rat 9681 0.01 2080 2.3 2.4
hepatocyte clearance mL/min/g liver tissue human 1695 0.01 97 1.6 1.5
hepatocyte clearance mL/min/g liver tissue dog 630 0.1 504 2 1.8
hepatocyte clearance mL/min/g liver tissue rat 2098 0.02 878 2.9 2.9
microsomal clearance mL/min/g liver tissue human 29162 0 156 2.8 2.4
microsomal clearance mL/min/g liver tissue dog 2080 0.03 150 2.5 1.8
microsomal clearance mL/min/g liver tissue rat 30563 0.01 198 3.9 3.7
log D     27345 0.01 53703 258 407

Pharmacokinetic (PK) and safety data sets (Table 2) were curated separately, as they contain different types of experimental data and thus require different processing. The raw data sets were cleaned to remove rows with outlying, missing, and duplicate measurements, and processed to yield machine learning data sets with a single aggregate value per unique compound. These procedures informed the design of curation functions included in the pipeline. Curation of the PK data sets required the conversion of values to standard units, the removal of compounds with stability or recovery issues, and the exclusion of data that was generated using significantly different assay protocols. Subsequently, replicate experimental measurements were identified by matching duplicate canonical SMILES strings and averaged to produce a single value per compound.

Table 2. Safety Datasets Used to Benchmark AMPL.

assay target primary liability experimental system detection compounds
BSEP pIC50 Bile Salt Export Pump hepatic membrane vesicles   187
ADRA1B pIC50 adrenergic α1B pIC50 CNS intracell Ca   3537
ADRA2C pIC50 α2C adrenoceptor CNS CHO K1   2873
ADRB2 pEC50 β2 adrenoceptor CNS FRET   2815
CHRM1 pEC50 cholinergic receptor muscarinic 1 CNS CHO intracellular Ca fluorescence 5315
CHRM1 pIC50 cholinergic receptor muscarinic 1 CNS CHO intracellular Ca fluorescence 4547
CHRM2 pEC50 cholinergic receptor muscarinic 2 CNS CHO intracellular Ca fluorescence 4742
DRD2 pEC50 dopamine D2 CNS HEK293F low Na GTPgS SPA 14450
GRIN1 pIC50 GRIN1 GRIN2B NR2B NR1A 2B subunit pIC50 CNS     2663
HRH1 pIC50 histamine receptor H1 CNS   luminescence 7971
HTR1B pIC50 5-hydroxytryptamine receptor 1B CNS 10 μL LEADseeker GTPgS   6300
HTR2A pEC50 5-hydroxytryptamine receptor 2A CNS HEK luminescence 4259
HTR2A pIC50 5-hydroxytryptamine receptor 2A CNS HEK luminescence 4250
HTR2C pEC50 5-hydroxytryptamine receptor 2C CNS CHO luminescence 2938
HTR2C pIC50 5-hydroxytryptamine receptor 2C CNS CHO luminescence 2939
HTR3A pIC50 5-hydroxytryptamine receptor 3A CNS FLIPR   6645
KCNA5 (Kv1.5) pIC50 KCNA5 (Kv1.5) cardiovascular CHO electrophys 4178
KCNE1 KCNQ1 (Kv7.1) pIC50 KCNE1 KCNQ1 (Kv7.1) Cardiovascular MinK human blocker CHO electrophys 2373
MAOA pIC50 monoamine oxidase A CNS   FLINT 344
PDE3A pIC50 phosphodiesterase 3A Cardiac   SPA (cAMP inhibition) 614
PDE4B pIC50 phosphodiesterase 4B CNS   SPA 564
Phospholipidosis pEC50 phospholipidosis induction cellular tox HEPG2 FLINT 278
PI3K pIC50 phosphoinositide 3-kinase (pI3K) cellular tox   TR FRET 9173
COX 2 pIC50 cyclooxygenase 2 cardiovascular   FLINT SAR 3865
SCN5A (NaV1.5) pIC50 SCN5A (NaV1.5) cardiovascular     3577
SLC6A2 pIC50 noradrenaline transporter NET CNS   BacMam bind SPA 2805
SLC6A4 pIC50 serotonin transporter (SERT) CNS   BacMam binding SPA 3520
OATP1B1 pIC50 organic anion transport polypeptide (SLCO1B1) hepatic HEK image 1789

For the safety data sets, censored measurements were an additional concern. Since bioactivity assays are typically performed over a limited range of compound concentrations, IC50 or EC50 values may be reported as being above or below a maximum or minimum concentration, so that the measurements are censored. When all measurements for a compound are censored in the same direction, the user is given the option to either exclude the compound from the data set, or include it with a relational operator indicating the direction together with the censoring threshold. In the case where some replicate measurements are censored and some are not, AMPL computes a maximum likelihood estimate for the mean activity, assuming a Gaussian distribution of measurements around the true mean. The distribution of response values is reported along with the mean and standard deviation.
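A sketch of this maximum likelihood estimate, assuming a Gaussian distribution around the true mean and using SciPy for the optimization (not AMPL's exact implementation), is:

```python
# MLE of the mean activity when some replicates are censored ("< c" or "> c"),
# assuming a Gaussian around the true mean. Assumes at least one uncensored
# replicate is available to initialize the optimizer.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def censored_mle(exact, left_censored, right_censored, sigma0=1.0):
    """exact: uncensored values; left_censored: '<' thresholds; right_censored: '>' thresholds."""
    def neg_log_lik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll = stats.norm.logpdf(exact, mu, sigma).sum()
        ll += stats.norm.logcdf(left_censored, mu, sigma).sum()   # P(X < c)
        ll += stats.norm.logsf(right_censored, mu, sigma).sum()   # P(X > c)
        return -ll
    res = minimize(neg_log_lik, x0=[np.mean(exact), np.log(sigma0)])
    return res.x[0], np.exp(res.x[1])   # estimated mean and standard deviation
```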

Experimental Design for Regression Pharmacokinetic Models

To evaluate AMPL’s performance, we built a total of 11,552 models on 15 pharmacokinetic data sets and 26 bioactivity data sets. These models comprise 9,422 regression models and 2,130 classification models. The data sets are proprietary and were not released; however, benchmarking results on public data sets are included in the Supporting Information.

We evaluated a variety of deep learning model types and architectures and compared them to baseline random forest models. We explored the performance of four feature types: ECFP fingerprints, MOE descriptors, Mordred descriptors, and graph convolution-based latent vectors. We used the RDKit25 implementation of ECFP, which is defined there as roughly equivalent to the Morgan fingerprint. Each data set was divided into a 70% (train), 10% (validation), 20% (test) split. For the neural network models, we searched over many combinations of learning rates, numbers of layers, and nodes per layer. For each combination of neural network hyperparameters, we trained for up to 500 epochs and used a validation set performance metric (R2 for regression, ROC AUC for classification) to choose an early stopping epoch for the final model. For random forest models, the only hyperparameter varied was the maximum tree depth, as previous experiments showed that other hyperparameters had a minimal effect for our data sets. The complete set of hyperparameters that were varied is as follows:

  • Splitter Types: scaffold and random

  • Feature types: ECFP, MOE, mordred, and graph convolution

  • Model types: neural network and random forest

  • Neural network learning rates: 0.0001, 0.00032, 0.001, 0.0032, 0.01

  • Maximum number of epochs: 500

  • Number of layers: 1, 2

  • Layer size options: 1024, 256, 128, 64, 32, 16, 8, 4, 1

  • Dropout rate: 0.1, 0.2, 0.4

Analysis of Modeling Performance

To identify which featurization type generated the most predictive models for each model type, models with the largest validation set R2 score were selected for each model/splitter/data set combination. The number of times each feature type yielded the highest R2 score on the test set (also known as the holdout set) is plotted in Figure 2. The figure shows that the chemical descriptors generated by the commercial MOE software outperformed those produced by the open source Mordred package in most cases. DeepChem’s graph convolution networks outperform all other feature types for neural network models.

Figure 2. Number of times each featurization type produces the best model for the 15 PK data sets.

The model/featurization combination with the most accurate predictions on the holdout set is shown in Figure 3. MOE featurization with random forest models most frequently outperformed other featurization/model type combinations for both types of splitters.

Figure 3. Number of times each featurization type/model type combination produces the best model for the 15 PK data sets.

Figure 4 confirms that random forest models tend to outperform neural network models for the evaluated data sets.

Figure 4. Number of times each model type produces the best model for the 15 PK data sets.

Investigation into Neural Network Performance

Neural networks are known to perform more poorly on smaller data sets, so we examined the relationship between the size of a data set and the test set R2 values for the best random forest and neural network models for that data set. We chose the best model from our hyperparameter search using the validation set, but we present the test set R2 values as our final evaluation of model performance. The purpose of this two-step process is to present a less biased evaluation of our models. When R2 values from multiple models evaluated on a single data set are similar, the models are taken to be equal in performance. Figure 5 shows the test set R2 values for the best neural network and random forest models for each data set. The figure shows that as the data set size increases, the R2 score for the test set increases as well. The same pattern holds for the overall best model, regardless of type, for both regression and classification, as shown in Figure 6. These results indicate that we will need to augment our data sets to further improve model performance. We plan to explore multiple avenues to address this requirement: conducting additional experiments, running simulations, sourcing public data, building multitask models, and experimenting with transfer learning approaches.

Figure 5. Plot of best model test set R2 values versus the data set size for neural network and random forest models.

Figure 6. Per data set model accuracy versus data set size.

We also examined the architectures that yielded the best model for each feature type for the neural network models. Our hypothesis was that larger data sets would perform better with larger networks. The number of hidden layer parameters for the 2-layer networks was calculated by multiplying the first and second layer sizes together. Figure 7 shows the number of parameters in the hidden layers of the model versus the size of the data set. The color indicates the data set and the shape indicates the featurizer type. We can see a clear lower bound on the number of parameters for the best network for all featurizer types as the data set size increases.

Figure 7. Number of hidden layer parameters versus number of samples for the best model for each data set/featurizer combination.

Summary of Model Performance

Figures 8 and 9 show the full set of test set R2 values for the best model for each molecular featurization and model type, for random and scaffold splits respectively (picked, as before, by the largest validation set R2 value). Random sampling inflates the R2 values of the holdout set, as expected, since there is greater structural overlap between the compounds in the training and holdout sets. For scaffold split-generated holdout sets, there is a very clear relationship between data set size and R2 value, although the complexity of the predicted property and the quality of the data set also have an effect.

Figure 8. Performance accuracy for regression for random split.

Figure 9. Performance accuracy for regression for scaffold split.

Model Tuning Results

To evaluate whether hyperparameter search improves model performance, the test set R2 for a baseline model was compared with the test set R2 from the best-performing model, selected by its validation set R2 value. The distribution of improvement in test set R2, categorized by featurizer type, is shown in Figure 10. Small data sets and ECFP-based models, which showed poor neural network performance overall, showed little to no improvement, while better-performing data sets and featurizers showed greater improvement with hyperparameter search. This suggests that data augmentation will be necessary to improve prediction performance on the smaller, problematic data sets, and that ECFP is a poor featurizer regardless of the hyperparameters.

Figure 10. Histogram of improvement in R2 values for the test set for the four featurizers for neural network models.

Classification Experiments

A set of classification model experiments was also conducted for a panel of 28 bioactivity data sets, without any hyperparameter tuning. In total, 2,130 neural network and random forest models were generated. A dose concentration threshold was used to label active and inactive compounds on a per-data set basis, using thresholds provided by domain experts at GlaxoSmithKline. The classes were extremely imbalanced, which partially explains the high ROC AUC scores shown in Figure 11.

Figure 11. Performance accuracy for classification.

Uncertainty Quantification

To explore the utility of the uncertainty quantification values produced by neural network and random forest models, a case study is presented for three representative PK parameter data sets: rat plasma clearance (in vivo), human microsomal clearance, and human plasma protein binding HSA. These data sets were selected to represent small, medium, and large sized data sets with low, medium, and high R2 values.

Precision-Recall Plot Analysis

Precision-recall curves measure the fraction of low-error predictions made at varying UQ thresholds. Precision is defined as the fraction of predictions with UQ values below the UQ threshold that also have error below some predefined threshold. For this analysis we use the mean logged error and define “low error” as samples with logged error below the mean (the log serves to normalize the distribution). Recall reports the fraction of low-error samples that pass the UQ filter threshold. Overall, we would like to use the UQ value as a threshold to identify low-error samples at a higher rate than in the overall test set. Table 3 shows the percentage of low-error samples in the test set as a whole for each data set/model/featurizer combination.

Table 3. Percent of Total Low-Error Samples in the Test Set for the Specified Dataset, Model/Featurizer Combinations.
data set model and featurizer type percent of total low error samples
rat plasma clearance (in vivo) neural network + ECFP 41.4
rat plasma clearance (in vivo) neural network + GraphConv 41.8
rat plasma clearance (in vivo) neural network + MOE 42.9
rat plasma clearance (in vivo) neural network + Mordred 40.5
rat plasma clearance (in vivo) random forest + ECFP 42.5
rat plasma clearance (in vivo) random forest + MOE 41.7
rat plasma clearance (in vivo) random forest + Mordred 42.0
human microsomal clearance neural network + ECFP 41.0
human microsomal clearance neural network + GraphConv 41.0
human microsomal clearance neural network + MOE 39.0
human microsomal clearance neural network + Mordred 39.8
human microsomal clearance random forest + ECFP 39.5
human microsomal clearance random forest + MOE 38.5
human microsomal clearance random forest + Mordred 39.6
human plasma protein binding HSA neural network + ECFP 43.4
human plasma protein binding HSA neural network + GraphConv 43.0
human plasma protein binding HSA neural network + MOE 43.1
human plasma protein binding HSA neural network + Mordred 43.5
human plasma protein binding HSA random forest + ECFP 42.0
human plasma protein binding HSA random forest + MOE 42.8
human plasma protein binding HSA random forest + Mordred 42.5

In general, a low UQ threshold with accurate uncertainty would correspond to a precision of 1, meaning that confident predictions correspond to low-error predictions. To have the greatest utility, the curve should maintain fairly high precision as recall increases. UQ successfully filters out low-confidence predictions in some cases, but performance varies widely with the model/featurization type and the data set. Figures 12, 13, and 14 show that precision drops quickly as recall increases, and for some models precision is poor even at the lowest UQ threshold. Nevertheless, for each data set there exists a UQ threshold for at least one model that could be used to increase the fraction of low-error predictions over the baseline percentages shown in Table 3. For example, Figure 14 suggests that applying a UQ threshold could increase precision from around 42% to 65% at a recall of 10%. As shown later, for the human plasma protein binding HSA data set this could still yield a collection of compounds with a diverse range of response values.
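For reference, a sketch of the precision and recall computation used in this analysis, following the definitions given above, is:

```python
# For each UQ threshold: precision = fraction of UQ-passing predictions that
# are low error; recall = fraction of all low-error predictions that pass the
# UQ filter. "Low error" = logged absolute error below the mean logged error.
import numpy as np

def uq_precision_recall(uq, abs_error):
    log_err = np.log(abs_error + 1e-9)          # log-normalize the error distribution
    low_error = log_err < log_err.mean()
    precisions, recalls = [], []
    for t in np.sort(np.unique(uq)):
        passed = uq <= t
        if passed.sum() == 0:
            continue
        precisions.append((low_error & passed).sum() / passed.sum())
        recalls.append((low_error & passed).sum() / low_error.sum())
    return np.array(precisions), np.array(recalls)
```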

Figure 12. Precision-recall plot for rat plasma clearance (in vivo), varying UQ value.

Figure 13. Precision-recall plot for human microsomal clearance, varying UQ value.

Figure 14. Precision-recall plot for human plasma protein binding HSA, varying UQ value.

Calibration Curves

To further investigate how error changes as uncertainty increases, we plotted calibration curves of mean error per uncertainty bucket, with the 95% confidence interval of the error shown for each bucket as error bars. We would like uncertainty to serve as a proxy for error, so we would hope to see the mean error for the samples in a bucket increase as the bucket’s UQ range increases. Calibration curves were computed for neural network and random forest models built on MOE feature vectors, and for neural network graph convolution models, to demonstrate the variation in performance.
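A sketch of the calibration-curve computation (equal-frequency UQ buckets with an approximate 95% confidence interval per bucket) is:

```python
# Bin predictions by UQ value and report the mean absolute error per bucket
# with an approximate 95% confidence interval.
import numpy as np
from scipy import stats

def calibration_curve(uq, abs_error, n_buckets=10):
    edges = np.quantile(uq, np.linspace(0, 1, n_buckets + 1))
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (uq >= lo) & (uq <= hi)
        if not mask.any():
            continue
        errs = abs_error[mask]
        ci = 1.96 * stats.sem(errs)              # approximate 95% CI half-width
        rows.append((0.5 * (lo + hi), errs.mean(), ci))
    return rows   # (bucket center, mean error, CI half-width)
```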

For rat plasma clearance (in vivo), there is an overall upward trend in all three calibration curves, but none of them is completely monotonic. Figure 15 shows the calibration curve for the neural network model with MOE features as an example (see Figure S1 for RF/MOE and Figure S2 for NN/GraphConv). This is the smallest data set in our case study, so increasing the bucket size may smooth these curves, but overall UQ does not appear to be a reliable proxy for error for this data set.

Figure 15. Mean error per uncertainty bucket for rat plasma clearance (in vivo) neural network model with MOE features.

Human microsomal clearance shows greater variation in the calibration curves. MOE features with a neural network model (Figure 16) show an inverse pattern, where the error actually decreases as the uncertainty increases. MOE features with a random forest model (Figure S3) show essentially no correlation, except in the very highest bucket.

Figure 16. Mean error per uncertainty bucket for human microsomal clearance neural network model with MOE features.

The graph convolution model (Figure S4), conversely, shows an upward trend, although it is not monotonically increasing. These curves show that the featurizer and model type have a strong effect on the relationship between UQ and error.

For human plasma protein binding HSA, which is the largest data set with over 123,000 compounds, all calibration curves display the desired behavior (Figures 17, S5, and S6): error increases as uncertainty increases, and the 95% confidence intervals are small.

Figure 17. Mean error per uncertainty bucket for human plasma protein binding HSA neural network model with MOE features.

Examining the Relationship between UQ and Predicted Value

Since the UQ values quantify the variation in predictions, the relationship between UQ and the predicted values was checked for evidence of a correlation by plotting UQ versus predicted values. Results for three model types were plotted, but only neural network models with MOE features are shown as an example (see the Supporting Information spreadsheet for RF/MOE and NN/GraphConv results).

Rat plasma clearance (in vivo) shows a somewhat negative relationship, where the variation in predictions decreases as the magnitude of the predicted value increases. We found a similar though much less pronounced trend when examining error versus predicted value, so overall the model appears to predict better for compounds with higher clearance values (Figures 18, S7, and S8).

Figure 18. Uncertainty value versus predicted value for rat plasma clearance (in vivo) neural network model with MOE features.

For human microsomal clearance, MOE feature vectors yield models in which the UQ is strongly biased by the predicted value, especially for the neural network model, as seen in Figure 19. Error versus predicted value does not show this trend, so this likely indicates that the UQ carries little real information for this model. The trend exists for the MOE random forest model as well (Figure S9), although it levels off, suggesting slightly less biased UQ values. The graph convolution model (Figure S10) displays a more balanced relationship between UQ and predicted value, mirroring the finding from the previous two subsections that this model’s UQ is more informative of error than the MOE models’ UQ.

Figure 19. Uncertainty value versus predicted value for human microsomal clearance neural network model with MOE features.

Human plasma protein binding HSA, which showed the best calibration curves, also shows the least correlation between UQ and predicted value. Figure 20 shows UQ versus predicted value for the human plasma protein binding HSA neural network model with MOE features (Figures S11 and S12 show the random forest model with MOE features and the neural network model with graph convolution features). UQ spans a wide range of values across all predicted values.

Figure 20. Uncertainty value versus predicted value for human plasma protein binding HSA neural network model with MOE features.

Correlation between UQ and Error

While these plots provide useful ways to visualize the behavior of uncertainty quantification, we wanted to identify a single value that summarizes whether a given model’s UQ results can be trusted. Since we want the model’s certainty to be reflected in accurate predictions, we calculated the Spearman correlation coefficient between binned prediction error and UQ. Results are shown in Figure 21. Correlations range from −0.088 to 0.33. While these correlations are fairly low, all p-values are <0.05, and all but one are ≪0.01. The correlation is thus statistically significant but weak, and we conclude that UQ cannot be used to reliably predict error in the general case.
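The summary statistic can be computed with SciPy as in the following sketch:

```python
# Spearman rank correlation between binned prediction error and UQ values.
import numpy as np
from scipy import stats

def uq_error_spearman(uq, abs_error, n_bins=20):
    inner_edges = np.quantile(abs_error, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned_error = np.digitize(abs_error, inner_edges)   # quantile-binned error
    rho, p_value = stats.spearmanr(binned_error, uq)
    return rho, p_value
```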

Figure 21. Spearman correlation coefficient between error and uncertainty values.

Discussion

Key observations from the extensive series of model evaluations are summarized here:

  • Neural networks generally produced more accurate models only on the larger data sets.

  • The proprietary MOE descriptors outperformed the open-source Mordred descriptors for both random forest and neural networks. Among neural network representations, graph convolutions outperformed ECFP.

  • A range of neural network architectures performed best, depending on data set size; small networks were prominent among the best models for many data sets.

  • Model performance generally improved as data set size increased, suggesting the need for public data set integration or multitask/transfer learning approaches.

  • Hyperparameter tuning generally improved performance, in some cases dramatically.

  • Uncertainty quantification showed a weak correlation with error, and the efficacy of using UQ to filter predictions varied considerably between data sets and model types.

In general, the choice of chemical representation has a strong effect on model performance. Models built on physicochemical and graph convolution descriptors tended to outperform models built on ECFP. We found that a wide sweep over many hyperparameter combinations is necessary to tune a model, rather than relying on the importance of any individual hyperparameter.

The differences in prediction accuracy show that the parameters needed for in silico drug discovery present a diverse set of data-driven modeling challenges. The extensive benchmarking suggests that there is no single best modeling approach for every predicted parameter. The differences in performance show the importance of having a rigorous model-building pipeline that can be readily adapted and reapplied to build parameter-specific models as new data become available.

Conclusions

In this paper, we present the ATOM Modeling PipeLine, or AMPL. This open-source software suite allows the user to build models for a wide array of molecular properties that are needed for in silico drug discovery. Results of extensive benchmarking on a wide variety of pharmacokinetic and safety data sets were also presented, with an exploration of the effects of different featurization and model types on model accuracy. While the data sets used for developing and testing the pipeline are not publicly available, the software used to curate data and train, evaluate, and share new models is available as open source and benefits from having been tested on a wide array of pharmaceutically relevant parameters. Additional public data sets are included with the pipeline release to support applying reproducible training and testing protocols that enable the broader modeling community to evaluate and improve modeling approaches over time.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.9b01053.

  • PK best model parameters (XLSX)

  • Benchmarking of AMPL on public datasets (PDF)

This work represents a multi-institutional effort. Funding sources include the following: Lawrence Livermore National Laboratory internal funds; the National Nuclear Security Administration; GlaxoSmithKline, LLC; and federal funds from the National Cancer Institute, National Institutes of Health, and the Department of Health and Human Services, under Contract No. 75N91019D00024. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Disclaimer. This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

The authors declare no competing financial interest.

Supplementary Material

ci9b01053_si_001.xlsx (25.8KB, xlsx)
ci9b01053_si_002.pdf (1.3MB, pdf)

References

  1. Dahl G. E.; Jaitly N.; Salakhutdinov R. Multi-task Neural Networks for QSAR Predictions. arXiv:1406.1231v1, 2014; pp 1–21.
  2. Gilmer J.; Schoenholz S. S.; Riley P. F.; Vinyals O.; Dahl G. E. Neural Message Passing for Quantum Chemistry. Proceedings of the 34th International Conference on Machine Learning 2017, 70, 1263–1272.
  3. Feinberg E. N.; Sheridan R.; Joshi E.; Pande V. S.; Cheng A. C. Step Change Improvement in ADMET Prediction with PotentialNet Deep Featurization. arXiv:1903.11789 [cs, stat], 2019.
  4. Yang K.; Swanson K.; Jin W.; Coley C. W.; Eiden P.; Gao H.; Guzman-Perez A.; Hopper T.; Kelley B.; Mathea M.; Palmer A.; Settels V.; Jaakkola T. S.; Jensen K. F.; Barzilay R. Analyzing Learned Molecular Representations for Property Prediction. J. Chem. Inf. Model. 2019, 59, 3370–3388. DOI: 10.1021/acs.jcim.9b00237.
  5. Democratizing Deep-Learning for Drug Discovery, Quantum Chemistry, Materials Science and Biology: deepchem/deepchem. https://github.com/deepchem/deepchem (accessed 2019-07-31).
  6. Wu Z.; Ramsundar B.; Feinberg E.; Gomes J.; Geniesse C.; Pappu A. S.; Leswing K.; Pande V. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 2018, 9, 513–530. DOI: 10.1039/C7SC02664A.
  7. BIOVIA Pipeline Pilot. https://www.3dsbiovia.com/products/collaborative-science/biovia-pipeline-pilot/ (accessed 2019-09-12).
  8. Berthold M. R.; Cebron N.; Dill F.; Gabriel T. R.; Kötter T.; Meinl T.; Ohl P.; Sieb C.; Thiel K.; Wiswedel B. KNIME: The Konstanz Information Miner. Data Analysis, Machine Learning and Applications 2008, 319. DOI: 10.1007/978-3-540-78246-9_38.
  9. Landrum G. RDKit: Open-source cheminformatics. http://www.rdkit.org (accessed 2020-02-13).
  10. Swain M. MolVS: molecule validation and standardization. 2017. DOI: 10.5281/zenodo.260237.
  11. JupyterLab. https://jupyterlab.readthedocs.io/en/stable/ (accessed 2019-09-12).
  12. Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. DOI: 10.1021/ci100050t.
  13. Duvenaud D.; Maclaurin D.; Aguilera-Iparraguirre J.; Gomez-Bombarelli R.; Hirzel T.; Aspuru-Guzik A.; Adams R. P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. arXiv:1509.09292v2, 2015; pp 1–9.
  14. Moriwaki H.; Tian Y.-S.; Kawashita N.; Takagi T. Mordred: a molecular descriptor calculator. J. Cheminf. 2018, 10, 4. DOI: 10.1186/s13321-018-0258-y.
  15. Chemical Computing Group (CCG), Computer-Aided Molecular Design. https://www.chemcomp.com/ (accessed 2019-09-12).
  16. Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow. https://blog.rstudio.com/2016/03/29/feather/ (accessed 2019-09-13).
  17. Mayr A.; Klambauer G.; Unterthiner T.; Steijaert M.; Wegner J. K.; Ceulemans H.; Clevert D.-A.; Hochreiter S. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci. 2018, 9, 5441–5451. DOI: 10.1039/C8SC00148K.
  18. Wallach I.; Heifets A. Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization. J. Chem. Inf. Model. 2018, 58, 916–932. DOI: 10.1021/acs.jcim.7b00403.
  19. Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  20. Chen T.; Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, 785–794. DOI: 10.1145/2939672.2939785.
  21. Kearnes S.; McCloskey K.; Berndl M.; Pande V.; Riley P. Molecular graph convolutions: moving beyond fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595–608. DOI: 10.1007/s10822-016-9938-8.
  22. mongoDB. https://www.mongodb.com/ (accessed 2019-09-13).
  23. Kendall A.; Gal Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Advances in Neural Information Processing Systems (NIPS) 2017, 5580–5590.
  24. McInnes L.; Healy J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426, 2018.
  25. RDKit: Open-Source Cheminformatics Software. https://www.rdkit.org (accessed 2019-09-13).

