Published in final edited form as: J Chem Inf Model. 2017 Jan 9;57(1):36–49. doi: 10.1021/acs.jcim.6b00625. Author manuscript; available in PMC 2018 Sep 11.

In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning

Qingda Zang, Kamel Mansouri, Antony J Williams, Richard S Judson, David G Allen, Warren M Casey, Nicole C Kleinstreuer*
PMCID: PMC6131700  NIHMSID: NIHMS1504120  PMID: 28006899

Abstract

Little toxicity data are available for the vast majority of chemicals in commerce. High-throughput screening (HTS) studies, such as those being carried out by the U.S. Environmental Protection Agency (EPA) ToxCast program in partnership with the federal Tox21 research program, can generate biological data to inform models for predicting potential toxicity. However, physicochemical properties are also needed to model environmental fate and transport, as well as exposure potential. The purpose of the present study was to generate an open-source Quantitative Structure-Property Relationship (QSPR) workflow to predict a variety of physicochemical properties, with cross-platform compatibility so it can be integrated into existing cheminformatics workflows. In this effort, decades-old experimental property data sets available within EPA EPI Suite™ were reanalyzed using modern cheminformatics workflows to build updated QSPR models capable of supplying computationally efficient, open, and transparent HTS property predictions in support of environmental modeling efforts. Models were built using updated EPI Suite data sets for the prediction of six physicochemical properties: octanol-water partition coefficient (log P), water solubility (log S), boiling point (BP), melting point (MP), vapor pressure (log VP) and bioconcentration factor (log BCF). The coefficient of determination (R2) between the estimated values and experimental data for the six predicted properties ranged from 0.826 (MP) to 0.965 (BP), with model performance for five of the six properties exceeding that of the original EPI Suite™ models. The newly derived models can be employed for rapid estimation of physicochemical properties within an open-source HTS workflow to inform fate and toxicity prediction models of environmental chemicals.

INTRODUCTION

The U.S. Environmental Protection Agency (EPA) has identified ~32,000 chemicals with the potential for human exposure, ranging from pesticides and industrial chemicals to food additives and personal care products.1–6 As only a fraction of these chemicals have been fully assessed and characterized,7–9 there is a need for more rapid and inexpensive approaches to prioritize thousands of chemicals for mechanistically relevant toxicity testing. The cross-agency U.S. federal Tox21 and EPA's ToxCast research programs have developed promising tools for chemical hazard characterization and prioritization, such as in vitro high-throughput screening (HTS) assays, computational toxicology approaches, and quantitative structure-activity relationship (QSAR) models.10–18 What these tools lack, however, is the ability to provide insight into the fate and transport of the chemicals, which is needed for chemical risk assessment.

The behavior of chemicals in humans and the environment often depends on key physicochemical properties, such as the octanol-water partition coefficient (log P), water solubility (log S), melting point (MP), boiling point (BP), vapor pressure (VP) and bioconcentration factor (BCF).19–23 These properties affect the bioavailability, permeability, absorption, transport, and persistence of chemicals in the body and in the environment, and are used extensively in exposure, toxicological hazard, and risk assessments of organic chemicals. Certain properties, such as BCF, are required by regulations such as REACH (Registration, Evaluation, Authorization and Restriction of Chemicals) and the United Nations Globally Harmonized System of Classification and Labeling of Hazardous Chemicals.24–26 These properties are often used in evaluating new or problematic chemicals and are valuable parameters in developing QSAR models for toxicity endpoints.27–29 These parameters and models support initiatives driven by the OECD (Organisation for Economic Co-operation and Development) and ICATM (International Cooperation on Alternative Test Methods) to reduce or waive animal tests by using alternative methods such as in silico modeling.30,31

Whereas physicochemical properties have been experimentally determined for some chemicals, the majority lack freely available experimental data. For new chemicals, such data are often considered confidential business information and are thus not available to regulatory authorities. Obtaining needed data via experimental measurements can be expensive and time-consuming, it may be difficult to handle hazardous or reactive chemicals, and some pre-manufacturing chemicals are unavailable for testing.

Quantitative structure-property relationship (QSPR) methods are designed to identify the relationship between the physicochemical property of interest and the chemical molecular structure without testing, and are widely used to provide inputs for toxicity prediction models.32–35 Numerous computational approaches have been proposed to construct QSPR models, and these methods can generally be categorized into three classes: models based on other experimentally determined physicochemical properties;35–37 models based on calculated molecular descriptors;32–34,38 and models based on group contributions.39–43 The third approach was the foundation of initial work in QSPR development. In this approach, a molecule is divided into basic structural building blocks, such as atoms or larger functional groups, that constitute unique descriptors. Such methods are conceptually simple and computationally efficient for a wide range of chemicals because they only require counting the occurrences of functional fragments in a molecule. Nevertheless, the model may suffer from the "missing fragment problem" when new fragments are encountered that were not present in the training set. This issue may be addressed by employing a large learning data set and considering the applicability domain (AD) of the model. The predictive performance of published models depends on the size, diversity, and composition of the data, making it difficult to directly compare models built using different training sets.

EPA’s Estimation Program Interface (EPI) Suite™ Data program (EPI Suite) provides QSPR models to predict a variety of key fate and transport parameters for environmental chemicals.44 However, the decades-old datasets upon which EPI Suite models were originally built contain numerous errors and the models used in predictions are not open-source. Accordingly, we used structure-curation and data-curation workflows to develop an updated dataset45,46 with which to build new open source models using EPI Suite data sets for log P, log S, BP, MP, log VP and log BCF. This allowed us to reanalyze EPI Suite experimental property data sets and build updated QSPR models capable of supplying computationally efficient, open, and transparent high-throughput property predictions for environmental chemicals.

The goal of the present study was to build QSPR models for in silico prediction of six physicochemical properties using diverse data sets of environmental chemicals, based exclusively on analysis of their binary molecular fingerprints. We aimed to improve upon the existing EPI Suite platform and provide the community with an open-source model built in R that can be leveraged in workflows for several computational languages (e.g., Python and Java). We applied various computational methods, ranging from simple linear regression to sophisticated machine learning approaches, and make recommendations on which models are most appropriate for predicting each property. Ultimately, we sought to develop an open-source approach that provides reliable and accurate estimation of physicochemical properties for a wide range of environmental chemicals, requiring only chemical structures in SMILES (Simplified Molecular Input Line Entry System) notation as input to generate predictions.47 To facilitate their acceptance and application, we adhered to the validation principles defined by the Organisation for Economic Co-operation and Development (OECD) when building and evaluating the QSPR models.48 These models allow physicochemical property predictions to be readily generated and integrated with other types of information for regulatory and research purposes.

MATERIALS AND METHODS

Data Sets.

The experimentally measured physicochemical property values of the structurally diverse sets of environmental chemicals used in this study were taken from a publicly available data source: Estimation Program Interface (EPI) Suite™ Data.49 These organic chemicals cover diverse chemical structures and a broad variety of use classes, including industrial compounds, pharmaceuticals, pesticides, and food additives. Prior to modeling, all chemical structures were subjected to a curation step to prepare QSAR-ready structures free from structural ambiguities (e.g., unconnected fragments, mixtures, salts, inorganics, duplicates), and to reconcile inconsistencies in the EPI Suite dataset. The chemical information was processed by structure-curation and data-curation workflows developed at EPA to provide a QSAR-ready dataset.45,46 This process found and fixed a number of errors in the EPI Suite chemical library and discarded chemicals that could not be fixed (e.g., entries for which the chemical name and structure could not be reconciled).

The chemical names, CAS Registry Numbers, SMILES notations and experimental data corresponding to the six properties are given in the Supporting Information. The data sets were randomly partitioned into training sets (75% of the chemicals) and test sets (25% of the chemicals) to build the models and externally validate their predictive power, respectively. Table 1 lists the summary statistics for the training and test sets. As shown in Figure 1 and Figure S1, several property values are approximately normally distributed, where log P spans nearly 17 log units from -5.40 to 11.29 with a median of 1.99 while MP ranges from -196 to 385 °C and is centered at 79 °C. The data for log S, log VP and log BCF are skewed since there is an upper limit to log S and log VP and a lower limit to log BCF.
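A minimal sketch of the random 75%/25% partition described above, written in R (the language used for all modeling in this study); the data frame and seed below are synthetic stand-ins for a curated property set.

set.seed(2016)                                        # arbitrary seed for reproducibility
logp_data <- data.frame(id = 1:1000,                  # synthetic stand-in for a curated set
                        logP = rnorm(1000, mean = 2, sd = 1.8))
train_idx <- sample(seq_len(nrow(logp_data)),
                    size = round(0.75 * nrow(logp_data)))
train_set <- logp_data[train_idx, ]                   # 75%: model building
test_set  <- logp_data[-train_idx, ]                  # 25%: external validation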

Table 1.

Distribution of Six Physicochemical Property Values for Training and Test Sets

Property Data Seta Minimum Maximum Mean Median Standard Deviation
log P Training (11370) −5.40 11.29 2.07 1.99 1.83
Test (2837) −5.08 9.36 2.06 1.96 1.82
log S Training (1507) −12.06 1.58 −2.56 −2.40 2.15
Test (503) −11.25 1.35 −2.71 −2.39 2.30
log BCF Training (456) −0.35 5.97 1.88 1.71 1.25
Test (152) −0.30 5.82 1.90 1.68 1.29
BP Training (4074) −88.60 548.00 188.09 188.50 85.20
Test (1358) −84.70 536.00 188.66 190.80 84.69
MP Training (6485) −196.00 385.00 79.60 79.00 98.45
Test (2163) −187.00 385.00 82.60 83.00 101.09
log VP Training (2034) −13.68 5.67 −2.01 −1.22 3.58
Test (679) −11.80 4.72 −2.15 −1.40 3.56
a Numbers of chemicals in each set are indicated in parentheses.

Figure 1. Data distribution of log P (A) and log S (B).

Molecular Fingerprints.

The chemicals were represented by fingerprints derived from their molecular structures. Fingerprints were calculated using a wide variety of publicly available SMARTS systems implemented in PaDEL:50,51 Estate (79 bits), Extended (1024 bits), Substructure (307 bits), Klekota-Roth (4860 bits), PubChem (881 bits), Atom Pairs 2D (780 bits), and MACCS (166 bits). A total of 8097 binary bits were generated, with 1 and 0 denoting the presence or absence, respectively, of a specific structural fragment. Fingerprint bits with zero variance (i.e., uniform observations across the set) were removed. To obtain reliable models, sufficient occurrences of the fingerprint bits throughout the entire data sets are necessary; thus, bits with low occurrences (< 2%) were eliminated. Following removal of highly correlated and infrequently occurring bits, the numbers of bits retained and employed to build the regression models were: 1681 for log P; 1061 for log S; 450 for log BCF; 1050 for BP; 1424 for MP; and 1145 for log VP. A genetic algorithm (GA)52,53 was used to reduce the feature space, with the initial population of chromosomes set to two times the number of variables (fingerprint bits). The crossover probability for each chromosome in a population and the mutation rate for each gene in a chromosome were set to 50% and 1%, respectively. No improvements in the fitness score were observed after 1000 generations.
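The paper's GA feature selection used the subselect package (see Statistical Analysis below); the sketch that follows illustrates the same idea with the GA package instead, where a binary chromosome marks which fingerprint bits enter a linear model and the fitness is the negative cross-validated RMSE. The population size (two times the number of bits), crossover probability (50%), and mutation rate (1%) mirror the text; the synthetic data, fold count, and iteration limit are illustrative only.

library(GA)

set.seed(42)
n_chem <- 300; n_bits <- 60                           # illustrative sizes
fp <- matrix(rbinom(n_chem * n_bits, 1, 0.3), n_chem, n_bits)
colnames(fp) <- paste0("bit", seq_len(n_bits))
logp <- as.vector(fp %*% runif(n_bits, -0.5, 0.5)) + rnorm(n_chem, sd = 0.3)

cv_rmse <- function(mask) {
  keep <- which(mask == 1)
  if (length(keep) < 2) return(-Inf)                  # penalize near-empty subsets
  folds <- sample(rep(1:5, length.out = n_chem))      # 5-fold CV for speed
  err <- sapply(1:5, function(k) {
    tr   <- folds != k
    fit  <- lm(y ~ ., data = data.frame(y = logp[tr], fp[tr, keep, drop = FALSE]))
    pred <- predict(fit, newdata = data.frame(fp[!tr, keep, drop = FALSE]))
    sqrt(mean((logp[!tr] - pred)^2, na.rm = TRUE))
  })
  -mean(err)                                          # GA maximizes fitness
}

ga_fit <- ga(type = "binary", fitness = cv_rmse, nBits = n_bits,
             popSize = 2 * n_bits,                    # population: 2x the number of bits
             pcrossover = 0.5, pmutation = 0.01,      # 50% crossover, 1% mutation
             maxiter = 25, monitor = FALSE)
selected_bits <- which(ga_fit@solution[1, ] == 1)     # retained fingerprint bits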

Multiple Linear Regression.

Multiple linear regression (MLR) is widely used in the modeling of property data.40,54 We used MLR to produce a linear model to describe the relationship between a physicochemical property and the molecular fingerprint bits:

$\mathrm{Property} = \sum_{j=1}^{m} c_j f_j$  (1)

In Equation 1, Property is one of the six physicochemical properties (log P, log S, log BCF, BP, MP or log VP); cj is the contribution coefficient, which is determined by regression analysis; and fj is the binary bit of the jth fingerprint, with its presence or absence represented by the numeric values 1 or 0 respectively. Any fragment that occurred in a molecule was counted only once for that molecule, no matter how many times it occurred in the molecule.
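A minimal sketch of Equation 1 as an ordinary least-squares fit in R: each fingerprint bit is a 0/1 predictor and its fitted coefficient plays the role of the fragment contribution c_j. The data are synthetic, and the intercept is omitted to match the form of Equation 1.

set.seed(1)
n <- 500; m <- 20
fp <- as.data.frame(matrix(rbinom(n * m, 1, 0.3), n, m,
                           dimnames = list(NULL, paste0("bit", 1:m))))
true_c  <- runif(m, -1, 1)                            # hypothetical fragment contributions
fp$logP <- as.vector(as.matrix(fp) %*% true_c) + rnorm(n, sd = 0.2)

mlr_fit <- lm(logP ~ . - 1, data = fp)                # no intercept, as in Equation 1
head(coef(mlr_fit))                                   # estimated contributions c_j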

Partial Least Squares Regression.

Partial least squares regression (PLSR) is a widely used multivariate analytical technique in QSPR studies.55,56 The advantage of PLSR over MLR lies in its ability to build a regression model based on highly correlated descriptors, extract the relevant information, and reduce data dimensions. We employed PLSR to generate linear statistical models based on the fingerprint bits and the physicochemical property being predicted. A set of orthogonal latent variables or principal components (PCs) were first generated through a linear combination of the original molecular fingerprint bits, which served as new variables for regression with the response variables (i.e., the physicochemical properties) to build QSPR models. The optimal number of PCs was determined by 10-fold cross validation (CV).
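A hedged sketch of the PLSR step using the pls package (named in the Statistical Analysis subsection below): the model is fit with 10-fold cross-validation and a parsimonious number of components is chosen from the CV error curve. The synthetic data, the component ceiling, and the use of selectNcomp() rather than the minimum-SEP rule described in the text are illustrative assumptions.

library(pls)

set.seed(7)
n <- 400; m <- 60
X <- matrix(rbinom(n * m, 1, 0.3), n, m)              # synthetic fingerprint bits
y <- as.vector(X %*% runif(m, -0.5, 0.5)) + rnorm(n, sd = 0.3)
dat <- data.frame(y = y, X = I(X))                    # keep X as a single matrix column

pls_fit <- plsr(y ~ X, data = dat, ncomp = 30,
                validation = "CV", segments = 10)     # 10-fold cross-validation
n_pc <- selectNcomp(pls_fit, method = "onesigma")     # number of PCs from the CV curve
pred <- predict(pls_fit, ncomp = n_pc, newdata = dat) # predictions with the chosen PCs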

Random Forest Regression.

Random forest (RF) is a non-linear consensus method based upon an ensemble of decision trees which are grown from separate bootstrap samples of the training data.57 Bootstrap sampling is conducted via random selection with replacement from the training chemicals during tree growth. The chemicals that are not selected in the construction of the forest are called out-of-bag (OOB) samples, which are used to evaluate the prediction accuracy as trees are added to the forest. Each tree gives a prediction for its OOB chemicals, and the average of these results over all trees provides an overall unbiased external validation. There are three possible model parameters for RF regression: ntree - the number of trees in the forest; mtry - the number of variables randomly sampled at each tree node; and nodesize - the minimum node size below which nodes are not further subdivided. In the present study, the RF model was trained based upon a parameter combination of ntree = 500, nodesize = 5 and mtry = 1/3 the number of fingerprint bits.
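A minimal sketch of these RF settings with the randomForest package (named in the Statistical Analysis subsection below): ntree = 500, nodesize = 5, and mtry equal to one third of the fingerprint bits. The data are synthetic; the OOB statistics extracted at the end correspond to the built-in validation described above.

library(randomForest)

set.seed(7)
n <- 400; m <- 60
X <- matrix(rbinom(n * m, 1, 0.3), n, m,
            dimnames = list(NULL, paste0("bit", 1:m)))
y <- as.vector(X %*% runif(m, -0.5, 0.5)) + rnorm(n, sd = 0.3)

rf_fit <- randomForest(x = X, y = y,
                       ntree = 500,                   # number of trees in the forest
                       mtry = floor(ncol(X) / 3),     # variables sampled at each node
                       nodesize = 5)                  # minimum terminal node size
sqrt(rf_fit$mse[500])                                 # OOB RMSE after all 500 trees
rf_fit$rsq[500]                                       # OOB pseudo R-squared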

Support Vector Regression.

Support vector regression (SVR) models a non-linear relationship between the property and molecular descriptors by utilizing an appropriate kernel function to map the input variables from a lower dimensional space to a higher dimensional feature space and transform the non-linear relationship into a linear form.35,58,59 An ε-insensitive loss function was used for the SVR modeling, in which the training chemical samples were represented as a tube with radius ε and a Gaussian radial basis function (RBF) was employed as a kernel function. The accuracy of SVR relies on the optimization of the model parameters. An ε-based SVR analysis needs to tune the RBF kernel parameter γ, the radius of the tube ε, and the regularization parameter C which determines the trade-off between model complexity and the training error. Thus, 10-fold CV via parallel grid search was performed in order to find the optimal combination of the three parameters.
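A hedged sketch of the ε-SVR tuning step with the e1071 package (named in the Statistical Analysis subsection below): a grid search over γ, C, and ε evaluated by 10-fold cross-validation. The parameter grids and synthetic data are illustrative, not the values used in the paper.

library(e1071)

set.seed(7)
n <- 300; m <- 40
X <- matrix(rbinom(n * m, 1, 0.3), n, m)              # synthetic fingerprint bits
y <- as.vector(X %*% runif(m, -0.5, 0.5)) + rnorm(n, sd = 0.3)

svr_tune <- tune(svm, train.x = X, train.y = y,
                 type = "eps-regression", kernel = "radial",
                 ranges = list(gamma   = 2^(-5:-3),            # RBF kernel width
                               cost    = 2^(0:4),              # regularization parameter C
                               epsilon = c(0.05, 0.1, 0.2)),   # tube radius
                 tunecontrol = tune.control(sampling = "cross", cross = 10))
best_svr <- svr_tune$best.model                       # model refit at the best grid point
pred     <- predict(best_svr, X)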

Model Validation.

The performance of each QSPR model was evaluated by examining the correlation between the experimental and predicted values using the following parameters:33,60 R2 (coefficient of determination) and RMSE (root mean squared error) for training or test sets with n chemicals; Q2 (cross-validated coefficient of determination) and RMSEcv for 10-fold CV, where v is the number of chemicals held out from the CV model-building set. The 10-fold CV procedure was performed using only the training set.

$R^2 = 1 - \dfrac{\sum_{i=1}^{n}(p_i - \hat{p}_i)^2}{\sum_{i=1}^{n}(p_i - \bar{p})^2}$  (2)
$Q^2 = 1 - \dfrac{\sum_{i=1}^{v}(p_i - \hat{p}_i)^2}{\sum_{i=1}^{v}(p_i - \bar{p})^2}$  (3)
$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(p_i - \hat{p}_i)^2}$  (4)
$\mathrm{RMSE_{cv}} = \sqrt{\dfrac{1}{v}\sum_{i=1}^{v}(p_i - \hat{p}_i)^2}$  (5)

In Equations 2 through 5, $p_i$ and $\hat{p}_i$ are the measured and predicted property values for chemical $i$, respectively, and $\bar{p}$ is the mean of all chemicals in the data set. In addition, standard error of prediction (SEP) was employed as a criterion to select the optimal principal components in the PLSR analysis.55

$\mathrm{SEP} = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(p_i - \hat{p}_i - \mathrm{bias})^2}$  (6)
$\mathrm{bias} = \dfrac{1}{n}\sum_{i=1}^{n}(p_i - \hat{p}_i)$  (7)
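A small R helper implementing Equations 2–7 for vectors of measured (p) and predicted (p_hat) values; the same calculations apply to the training set, the test set, or a cross-validation fold. The example values are synthetic.

regression_stats <- function(p, p_hat) {
  resid <- p - p_hat
  bias  <- mean(resid)                                        # Equation 7
  list(R2   = 1 - sum(resid^2) / sum((p - mean(p))^2),        # Equations 2 and 3
       RMSE = sqrt(mean(resid^2)),                            # Equations 4 and 5
       SEP  = sqrt(sum((resid - bias)^2) / (length(p) - 1)),  # Equation 6
       bias = bias)
}

set.seed(1)
p     <- rnorm(100, mean = 2, sd = 1.8)               # synthetic "measured" values
p_hat <- p + rnorm(100, sd = 0.5)                     # synthetic "predicted" values
regression_stats(p, p_hat)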

Applicability Domain.

Three distance-based measures (leverage, distance from the centroid, and k-nearest neighbors (kNN)) were applied to assess the applicability domain (AD) of each regression model. The distance of a test chemical from a defined point in the descriptor space of the training set was calculated and compared to a predefined threshold.61,62 A test chemical is considered to be within the AD if its distance is less than or equal to the threshold. Leverage is defined as the corresponding diagonal element of the hat matrix for a given dataset, and the leverage of a test chemical is proportional to its Hotelling's T2 statistic and Mahalanobis distance. The threshold was set to three times the average leverage (3m/n, with m being the number of variables and n the number of training chemicals). For the distance-from-centroid measure, the distance of a test chemical from the training set centroid is compared with a threshold, which is determined as follows: (1) calculate the distances of the training chemicals from their centroid; (2) sort the vector of distances in ascending order; (3) set the distance value corresponding to the 95th percentile as the threshold. The kNN measure defines the model's AD based on the similarity between a test chemical and the training chemicals. The average distance of the test chemical from its five nearest neighbors in the training set is compared with a threshold, which is the 95th percentile of the average distances of training chemicals from their five nearest neighbors.63
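A hedged sketch of the three AD measures applied to synthetic binary fingerprints. The kNN distances use the FNN package (named in the Statistical Analysis subsection below); the leverage and centroid measures are written out directly, with thresholds following the text (3m/n for leverage, the 95th percentile of training distances otherwise). The small ridge added before matrix inversion is our numerical-stability assumption.

library(FNN)

set.seed(11)
n_tr <- 300; n_te <- 80; m <- 40
X_tr <- matrix(rbinom(n_tr * m, 1, 0.3), n_tr, m)     # synthetic training fingerprints
X_te <- matrix(rbinom(n_te * m, 1, 0.3), n_te, m)     # synthetic test fingerprints

# 1) Leverage: diagonal of X (X'X)^-1 X', threshold 3m/n
XtX_inv <- solve(crossprod(X_tr) + diag(1e-8, m))     # small ridge for invertibility
lev_te  <- rowSums((X_te %*% XtX_inv) * X_te)
in_lev  <- lev_te <= 3 * m / n_tr

# 2) Euclidean distance from the training centroid, 95th-percentile threshold
centroid <- colMeans(X_tr)
d_tr <- sqrt(rowSums(sweep(X_tr, 2, centroid)^2))
d_te <- sqrt(rowSums(sweep(X_te, 2, centroid)^2))
in_cen <- d_te <= quantile(d_tr, 0.95)

# 3) Mean distance to the five nearest training neighbors, 95th-percentile threshold
d5_tr <- rowMeans(knn.dist(X_tr, k = 5))              # within-training distances
d5_te <- rowMeans(knnx.dist(X_tr, X_te, k = 5))       # test-to-training distances
in_knn <- d5_te <= quantile(d5_tr, 0.95)

# Outside the AD only if all three thresholds are exceeded, as in the text
outside_AD <- !(in_lev | in_cen | in_knn)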

Statistical Analysis.

Data standardization, multivariate regression analysis, and statistical model building were performed using the R statistical computing environment for Windows (version 3.2.1).64 The genetic algorithm, multiple linear regression, partial least squares regression, random forest regression, support vector regression, and k-nearest neighbor distances were implemented using the R packages subselect, stats, pls, randomForest, e1071, and FNN, respectively. The R code for feature selection and regression analysis is provided in the Supporting Information.

Online Resource.

The CompTox Dashboard from the EPA National Center for Computational Toxicology integrates experimental and predicted physicochemical endpoint data for over 700,000 individual chemical structures (https://comptox.epa.gov/dashboard/). It also integrates other data, including bioassay screening data, exposure models, and product categories. Tens of thousands of experimental data points have been curated and are being used to develop a new suite of prediction models across the database.46 Integrated calculation reports are available for each property prediction associated with a chemical; these provide details on the performance of the prediction algorithm, whether the chemical falls within the local applicability domain of the algorithm, and a series of nearest neighbors based on the prediction algorithm descriptors. The data generated from the predictions reported in this manuscript are available via the CompTox Dashboard and are listed as NICEATM models.

RESULTS AND DISCUSSION

Prior to regression modeling, we investigated the correlation among the six properties as well as their relationship with molecular weight (MW) to potentially inform additional feature sets for model building. For example, if the BP prediction is very accurate and has high correlation to MP, one could use the BP as a predictor for MP. Table 2 gives Pearson correlation coefficients (r) for each combination of physicochemical properties, which were calculated using the following formula:60

$r = \dfrac{n\sum p_k p_l - \sum p_k \sum p_l}{\sqrt{n\sum p_k^2 - \left(\sum p_k\right)^2}\,\sqrt{n\sum p_l^2 - \left(\sum p_l\right)^2}}$  (8)

Table 2.

Correlations (r) among Molecular Weight (MW) and the Six Physicochemical Propertiesa

         MW       log P    log S    log BCF   BP       MP       log VP
MW       1.000    0.256   −0.648    0.367     0.475    0.460   −0.721
log P             1.000   −0.873    0.830     0.365   −0.043   −0.387
log S                      1.000   −0.825    −0.444   −0.285    0.564
log BCF                             1.000     0.355    0.163    0.351
BP                                            1.000    0.733   −0.959
MP                                                     1.000   −0.833
log VP                                                          1.000
a The number of chemicals for each pair: 1664 (log P vs log S), 3531 (log P vs MP), 975 (log S vs BP), 285 (log BCF vs BP), 3421 (BP vs MP), 482 (log P vs log BCF), 1560 (log P vs log VP), 1811 (log S vs MP), 473 (log BCF vs MP), 1775 (BP vs log VP), 1609 (log P vs BP), 330 (log S vs log BCF), 1156 (log S vs log VP), 334 (log BCF vs log VP), 2143 (MP vs log VP).

In Equation 8, $p_k$ and $p_l$ represent different physicochemical properties and n is the number of chemicals in each pair of properties. As shown in Table 2, MW is moderately correlated to log VP (r = -0.721) and log S (r = -0.648), poorly correlated to BP (r = 0.475), MP (r = 0.460) and log BCF (r = 0.367), and nearly uncorrelated with log P (r = 0.256). According to the correlation results reported in Table 2, the six properties can be divided into two groups: one including log P, log S and log BCF, with r of -0.873, 0.830 and -0.825 for log P vs log S, log P vs log BCF, and log S vs log BCF, respectively; the other including BP, MP and log VP, where MP is significantly correlated to BP and log VP with r of 0.733 and -0.833, respectively. BP is also highly correlated to log VP with r of -0.959. In contrast, the experimental data of log P, log S and log BCF are uncorrelated with those of MP, BP and log VP, with low r for each pair.
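Because each chemical has measured values for only some of the properties, the correlations in Table 2 are pairwise: each r uses only the chemicals with both measurements available. A minimal R sketch with a hypothetical data frame props (one column per property, NA where no measurement exists):

set.seed(3)
props <- data.frame(MW   = runif(100, 50, 500),
                    logP = c(rnorm(70, 2, 1.8), rep(NA, 30)),
                    logS = c(rep(NA, 20), rnorm(80, -2.5, 2)))
cor(props, use = "pairwise.complete.obs", method = "pearson")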

Our dataset included a large number of independent variables (molecular fingerprint bits), but not all of them made substantial contributions to modeling the physicochemical properties. The presence of irrelevant or redundant features limits the applicability of a model and may result in overfitting. When too many variables are employed, a complicated regression model can fit the training data extremely well, with very low deviations between experimental and predicted values. However, an over-adapted or overfitted model yields large prediction errors for test chemicals and thus loses its ability to generalize. Therefore, it is critical to identify appropriate subsets of informative variables from the original set of molecular fingerprint bits. We applied the well-established GA dimension-reduction method to our dataset for feature selection. Prediction models were built by regression against the training sets with different subsets of fingerprint bits using four contrasting approaches, and model performance was evaluated using 10-fold cross validation and an independent external test set. The results for each predicted property and the regression statistics are discussed in detail below.

Octanol-Water Partition Coefficient (log P) Model

Among the six properties, log P had the largest data set of chemicals (over 14,000) and fingerprint bits (1681). As shown in Table 3 and Figure 2A, model performance varied with respect to the number of fingerprint bits used. When models were trained using the entire set of fingerprint bits, some of the variables were unrelated to the variation of the property. When using 600 fingerprint bits selected by GA, the MLR results show a significant correlation between the estimated and measured values on the test set with an R2 of 0.888 and a minimum RMSE of 0.576. These statistics are very similar to those for the training set (R2 of 0.901 and RMSE of 0.546), indicating the stability of the model. The inclusion of MW marginally improved the prediction on the test set with R2 increased to 0.891.

Table 3.

Regression Statistics of log P Using Subsets of Fingerprint Bits and MW

Variable Model Statistics Data Set MLR PLSR RF SVR
1681 FP bits R2 Training 0.916 0.915 0.880 0.991
Test 0.879 0.878 0.876 0.920
RMSE Training 0.509 0.510 0.552 0.176
Test 0.607 0.608 0.564 0.502
600 FP bits R2 Training 0.901 0.902 0.885 0.983
Test 0.888 0.889 0.879 0.932
RMSE Training 0.546 0.546 0.548 0.264
Test 0.576 0.575 0.569 0.457
600 FP bits + MW R2 Training 0.904 0.905 0.891 0.987
Test 0.891 0.892 0.886 0.935
RMSE Training 0.539 0.537 0.539 0.207
Test 0.569 0.568 0.562 0.451

Figure 2. The relationship between model complexity and prediction errors (RMSE) (A) and plot of experimental data versus estimated values by SVR using 600 fingerprint bits (B) for log P.

When PLSR was utilized to build models, the number of significant principal components (PCs) was determined using a 10-fold cross-validation (CV) procedure on the training set. The relationship of the standard error of prediction (SEP) versus the number of PCs is displayed in Figure 3. The gray lines were produced by repeating this procedure 100 times, while the black line depicts the lowest SEP value from a single 10-fold CV; the dashed vertical lines represent the optimal number of PCs and the dashed horizontal lines indicate the SEP value for the test set when the optimal PCs are applied. The variation of SEP is much larger for the all-variable (1681 fingerprint bits) model than the model of 600 fingerprint bits selected by GA, implying that the optimized feature set exhibits greater model stability. For the all-variable model, SEP initially decreases with PCs; the trend then reverses when noise emerges as the complexity of the model increases. Although more components improve the fitting quality of the model, they also lower the predictive power due to overfitting; the minimal cross-validation error was obtained using 13 PCs. For the 600-bit model, the SEP decreases monotonically and gradually approaches a stable value, and the use of 42 PCs yielded the optimal model, which gave a minimum RMSE of 0.575 corresponding to an R2 of 0.889 for the test chemicals.

Figure 3. The relationship between the number of principal components (PCs) and the standard error of prediction (SEP) for the PLSR model of log P. The black lines were produced from a single 10-fold CV while the gray lines correspond to 100 repetitions of the 10-fold CV. (A) Plot of SEP versus PCs for the all-bit model. (B) Plot of SEP versus PCs for the 600-bit model selected by GA.

Unlike other modeling approaches, RF regression was insensitive to the number of fingerprint bits. Figure 2A shows that the model statistics do not vary with the number of variables. A RF model trained on the 600 fingerprint bits produced nearly the same prediction as that trained on all bits. Furthermore, in each case the correlation coefficient of the test set was found to be similar to that of the training set, and the RMSE values were also very close to each other. Since RF is not sensitive to the number of variables, feature selection did not improve the predictive performance, and hence it was unnecessary to remove irrelevant fingerprint bits from the model. The RF algorithm encompassed a large number of simple tree models, and thus greatly mitigated overfitting. Additionally, it was not essential to perform cross or independent validation on RF models as this was inherently provided by the OOB estimate.

The optimum RBF kernel parameter γ, the radius of the tube ε, and the regularization parameter C that collectively yielded the lowest RMSE from cross validation were chosen to build the SVR model. SVR is robust against the existence of irrelevant or of mutually correlated variables, and here even using all fingerprint bits yielded satisfactory results. Feature selection did further improve the predictive performance with test set R2 increasing from 0.920 to 0.932 using 600 fingerprint bits. Figure 2B shows a scatter plot of predicted versus experimental values from SVR modeling. In general, the measured and predicted values were highly correlated and the majority of the data points are concentrated around the regression line with a small deviation over a large range for both training and test sets.

Water Solubility (log S) Model

Compared to the log P data set, the log S data has a low ratio of training chemicals to bits (1507/1061). As shown in Table 4, small sets of variables achieved better predictive power than large ones. When all fingerprint bits were employed, the model yielded R2 of 0.983, 0.933 and 0.995 for the training set, but only 0.604, 0.869 and 0.910 for the test set corresponding to MLR, PLSR and SVR, respectively, implying that overfitting occurred. The predictive performance of the models was improved remarkably when a more appropriate number of fingerprint bits was selected using GA (Figure 4A). The optimal model with 350 bits produced test R2 of 0.931 for both MLR and PLSR, and 0.932 for SVR, which were very close to the training set values (Figure 4B). These results confirm the effectiveness of the GA feature selection in capturing the relevant information and building stable models. Using an optimal combination of fingerprint bits selected using GA, the results from MLR and PLSR are almost identical.

Table 4.

Regression Statistics of log S Using Subsets of Fingerprint Bits, MW and log P

Variable Model Statistics Data Set MLR PLSR RF SVR
1061 FP bits R2 Training 0.983 0.933 0.878 0.995
Test 0.604 0.869 0.896 0.910
RMSE Training 0.275 0.540 0.673 0.156
Test 1.805 0.791 0.663 0.649
350 FP bits R2 Training 0.953 0.952 0.873 0.959
Test 0.931 0.931 0.892 0.932
RMSE Training 0.454 0.456 0.687 0.422
Test 0.588 0.587 0.671 0.579
350 FP bits + MW R2 Training 0.957 0.956 0.887 0.960
Test 0.932 0.933 0.899 0.933
RMSE Training 0.436 0.437 0.648 0.414
Test 0.584 0.580 0.655 0.575
350 FP bits + MW + log P R2 Training 0.961 0.960 0.925 0.966
Test 0.935 0.938 0.932 0.939
RMSE Training 0.415 0.416 0.547 0.388
Test 0.552 0.548 0.554 0.542

Figure 4. The relationship between model complexity and prediction errors (RMSE) (A) and plot of experimental data versus estimated values by SVR using 350 fingerprint bits (B) for log S.

Many factors can affect the solubility of a molecule in an aqueous solution, including the molecule's size, shape, polarity, and hydrophobicity. For example, a cavity must be formed in water for a chemical to dissolve in an aqueous solution. The larger the molecular weight or size, the larger the cavity required, the greater the energy needed to form it, and thus the lower the solubility. Hence, log S is negatively correlated to MW with r = −0.648. The partition coefficient measures how strongly a chemical molecule interacts with n-octanol in comparison to water. Aqueous solubility decreases with increasing hydrophobicity, and hence log S has a negative correlation with log P (r = -0.873). When MW and log P were incorporated as additional descriptors to model water solubility, both were found to play important roles and led to improved predictions compared to the fingerprint-only model. A significant improvement was achieved for all four regression approaches, and SVR yielded the highest test R2 of 0.939 and lowest RMSE of 0.542. This improvement was most pronounced for RF, with R2 increasing from 0.892 to 0.932, making it perform nearly as well as the other models and suggesting that the use of numeric descriptors favors RF modeling.
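A hedged sketch of augmenting the binary fingerprint matrix with MW and log P as continuous descriptors before SVR, as described for the log S model above. The data are synthetic; note that e1071::svm centers and scales all columns by default (scale = TRUE), so no explicit rescaling of the mixed binary/continuous matrix is shown.

library(e1071)

set.seed(5)
n <- 300; m <- 40
fp   <- matrix(rbinom(n * m, 1, 0.3), n, m)           # synthetic fingerprint bits
MW   <- runif(n, 50, 500)                             # synthetic molecular weights
logP <- rnorm(n, 2, 1.8)                              # synthetic log P values
logS <- -0.8 * logP - 0.004 * MW +
        as.vector(fp %*% runif(m, -0.3, 0.3)) + rnorm(n, sd = 0.4)

X_aug <- cbind(fp, MW, logP)                          # bits plus two continuous descriptors
svr_s <- svm(x = X_aug, y = logS, type = "eps-regression", kernel = "radial")
pred  <- predict(svr_s, X_aug)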

Bioconcentration Factor (log BCF) Model

Because log BCF had the smallest data set, with only 456 training chemicals, its validation statistics depended strongly on the ratio of the number of chemicals to the number of fingerprint bits. When all 450 bits were employed for MLR, for example, the fit to the training data was excellent with R2 close to 1, but for the test chemicals R2 was close to zero (Table S1). This was expected, as overly complex models have a tendency to overfit the training set and lose their generalization capability. Unlike the all-variable model from MLR, which yielded extremely low prediction accuracy for the test set, PLSR, RF and SVR achieved acceptable correlation with test R2 between 0.723 and 0.725.

As with the previous results, when the most information-rich variables were retained and redundant ones were discarded via feature selection, the predictive performance was enhanced significantly. The quality of the model depended heavily on the number of selected fingerprint bits, and Figure S2A shows the initial decline of the RMSE for the test set until attaining a minimum at a medium number of bits, and then a gradual increase with the number of bits. While using too many variables resulted in overfitting and hence lowered the predictive ability of the model, too few variables cannot capture sufficient structural information, and with a smaller set of chemicals it was more challenging to derive a stable and meaningful model due to underfitting. The lowest prediction errors occurred on the models with moderate complexity around 200 bits. MLR, PLSR, and SVR yielded very similar results in terms of the predictive ability on the test set with R2 of 0.863–0.879 (Figure S2B). Using MW and log P as additional descriptors to model log BCF did not improve model performance except when using RF, which yielded an increase in R2 from 0.734 to 0.816. Among the four approaches, SVR achieved the best predictive performance with R2 of 0.885 and RMSE of 0.405 for the test set.

Boiling Point (BP) Model

Since the ratio of training chemicals to bits (4074/1050) was large for the BP data set, the modeling statistics were not sensitive to the number of bits, and the model performance did not vary considerably with different subsets of fingerprint bits for the test set (Figure S3A). The BP model with 400 bits had the highest test R2 of 0.943 and lowest RMSE of 19.72 using SVR (Table S2). Our regression analysis suggested that MW is an important contributing factor for boiling point. When MW was incorporated into the model, the predictive performance improved, with the test set R2 increasing to 0.965 for the SVR model, which outperformed the other three approaches (R2 ranging from 0.935 to 0.940). The model had difficulty accurately predicting high boiling points; large errors were observed near the upper extreme of the experimental range, whereas chemicals with medium and low experimental values were predicted more accurately (Figure S3B). These errors may result from experimental measurement uncertainty or from the scarcity of data points at the high end of the range and the resulting lack of model coverage.

Melting Point (MP) Model

When modeling MP, feature selection did not have a significant impact on the external validation results (Figure S4A) due to a large ratio of training chemicals to bits (6485/1424). Similar to the other physicochemical property modeling, SVR achieved the best results with a test R2 of 0.813 (Figure S4B), followed by RF (0.802), PLSR (0.781) and MLR (0.780) (Table S3) using 500 fingerprint bits. To improve the predictive ability, we considered employing estimated BP as an additional descriptor to predict MP. This consideration was based on the fact that the BP dataset was large (5400 chemicals) and good predictive performance with R2 greater than 0.960 was achieved from the fingerprint model, leading to reliable BP estimates.

A series of models were derived by correlating the experimental MP with MW and estimated BP. Table S3 compares the regression statistics obtained using various combinations of fingerprint bits, MW, and BP. It is evident that the regression models were very sensitive to BP, which provided useful information and more accurate estimation of MP. The inclusion of both BP and MW as descriptors in the MP models enhanced the predictive performance for all four regression approaches. The test R2 increased to 0.826 while RMSE decreased to 39.14 using SVR. Overall, SVR substantially outperformed the other approaches, and RF was slightly superior to MLR and PLSR. These results reflect the advantage of nonlinear approaches over linear approaches for modeling MP. The RMSE reported here compares well with that reported by Tetko et al., who modeled a much larger set of over 200,000 data points extracted from patents.65

Vapor Pressure (log VP) Model

The data set for log VP was relatively small, with 2034 training chemicals. Hence, feature selection by GA remarkably improved the predictive ability of the model when fewer fingerprint bits were used instead of all 1145 bits, particularly for MLR (Figure S5A). The four approaches had low prediction errors and comparatively similar statistical performance when 350 bits were employed to build the models, with test R2 values ranging from 0.902 to 0.930 (Table S4). A significant correlation between log VP and BP was observed, with an r of -0.959, and MW was also correlated to log VP with an r of -0.721 (Table 2). After introducing MW as a feature, the correlation of the model was greatly improved, and the inclusion of BP further enhanced the predictive performance (test set R2 values between 0.941 and 0.946). The linear methods MLR and PLSR performed nearly as well as the nonlinear methods RF and SVR. MLR is a simple regression method that does not require time-consuming parameter optimization to achieve good performance; if computational cost were a concern, the MLR model would be particularly suitable for predicting log VP.

Although these models yielded accurate predictions for most chemicals, some predictions deviated considerably from the experimental values. As indicated in Figure S5B, the model performed poorly for chemicals with low vapor pressure (log VP < -5.0 log units). It is difficult to accurately measure vapor pressure for chemicals with very low volatility and experimental errors tend to be larger. The lack of high quality experimental data may be a driving factor for the failure of the models at extreme values.

Comparison to the Prediction from EPI Suite.

Several software programs are available for predicting physicochemical properties. Among them, EPI Suite (Estimation Program Interface),44 developed by Syracuse Research Corporation (SRC), is a standalone, reproducible, EPA-endorsed, and branded product with a long history of use. It is widely used by governmental regulatory agencies in the United States, Canada, and Europe to predict physicochemical properties of environmental chemicals. However, it uses proprietary prediction models. To compare our models with EPI Suite predictions, the KOWWIN (version 1.67), WATERNT (version 1.01), BCFWINNT (version 3.00) and MPBPNT (version 1.43) modules were employed to estimate log P, log S, and log BCF, as well as MP, BP and log VP, respectively. For the six property sets, not all of the information about the training and test chemicals from the original EPI Suite data is available, so we split the entire data set into training and test sets as described in Table 1. As shown in Table 5, the entire, training and test sets have very similar correlation statistics between experimental data and estimated values for all six properties. When compared on the test sets using the coefficient of determination R2, our log S model is inferior to EPI Suite's (0.939 vs 0.955), while our other models exhibit better predictions for log P (0.935 vs 0.904), BP (0.965 vs 0.937), log VP (0.946 vs 0.900), and log BCF (0.885 vs 0.813). Our estimation of melting point is substantially superior to EPI Suite's, with an R2 of 0.826 compared to 0.638 from EPI Suite.

Table 5.

Correlation Statistics between Experimental Data for Chemicals Used in This Study and EPI Suite Predictions

Property  All Chemicals (Number, R2, RMSE)  Training Set (Number, R2, RMSE)  Test Set (Number, R2, RMSE)
logP 14207 0.895 0.605 11370 0.893 0.612 2837 0.904 0.576
logS 2010 0.948 0.490 1507 0.945 0.495 503 0.955 0.472
log BCF 608 0.818 0.484 456 0.820 0.485 152 0.813 0.481
BP 5432 0.937 21.73 4074 0.938 21.56 1358 0.937 21.78
MP 8648 0.635 57.25 6485 0.634 57.58 2163 0.638 57.14
logVP 2713 0.917 0.965 2034 0.923 0.928 679 0.900 1.071

Analysis of Applicability Domain.

The QSPR models were developed using training sets and, thus, their applicability to external chemicals depends on the structural similarity between the training chemicals and the external test chemicals. The models would be expected to provide more reliable predictions for chemicals that fall within the AD, as defined by the three distance measures described earlier. In this study, a test chemical is deemed to be outside the AD only if the thresholds from all three distance measures are exceeded. If only one or two thresholds are exceeded, the chemical is considered to be potentially outside the AD. Table 6 and Table S5 summarize the number of test chemicals outside the AD for each regression model identified by the individual distance measures and their combinations. These tables also show R2 and RMSE for the test sets after removing the chemicals considered outside the AD. In all cases, the number of test chemicals completely outside the AD was very small (0 to 9), and their removal did not have a significant influence on the model statistics. Using the kNN and distance-from-centroid measures, a much larger number of chemicals were flagged, and the model statistics improved remarkably, with kNN exhibiting better performance than the distance from the centroid. Taking the log P model as an example, R2 increased from 0.935 to 0.940 and RMSE decreased from 0.451 to 0.434 after removing the 121 test chemicals identified as outside the domain by the five-nearest-neighbors measure. kNN is a more appropriate measure of distance between training and test chemicals because the descriptors employed in this study are binary variables. The leverage measure did not considerably impact the model statistics, in part because the number of chemicals outside the AD identified by leverage was small. It is also important to consider that not all chemicals outside the AD are wrongly predicted and not all chemicals inside the AD are correctly predicted, which is in agreement with the literature.61 This can be observed in the plots of leverage versus standardized residuals (Figure 5 and Figure S6), which show that many test chemicals outside the AD were correctly predicted and some test chemicals inside the AD were not accurately predicted.

Table 6.

Applicability Domain (AD) of log P and log S Models: Test Set Evaluationa

Property  Measure  Chemicals outside AD  Chemicals inside AD  R2  RMSE
(R2 and RMSE: experimental vs predicted for test chemicals inside the AD)
log P Leverage (I) 10 2827 0.935 0.448
Distance from centroid (II) 136 2701 0.936 0.443
Distance by kNN (III) 121 2716 0.940 0.434
I and II and III 3 2834 0.935 0.450
I or II or III 247 2590 0.938 0.439
log S Leverage (I) 7 496 0.939 0.541
Distance from centroid (II) 28 475 0.941 0.538
Distance by kNN (III) 22 481 0.943 0.531
I and II and III 0 503 0.939 0.542
I or II or III 43 460 0.941 0.534
a The models were built by SVR using 600 FP bits + MW for log P and 350 FP bits + MW + log P for log S.

Figure 5. Plots of leverage versus standardized residuals for the log P (A) and log S (B) models' training and test sets. The models were built by SVR using 350 and 600 fingerprint bits for log S and log P, respectively. The vertical dashed line marks the AD threshold based on the leverage value; the horizontal dashed lines define the region where predictions fall within two standardized residuals.

Adherence to OECD (Q)SAR Validation Principles.

OECD has defined five validation principles to facilitate the consideration of a (Q)SAR model for regulatory purposes. To encourage the acceptance and application of the models presented here, we have provided the following information in conjunction with the OECD validation principles. The QSAR Model Reporting Format (QMRF) is available in the Supporting Information.

  • 1)

    A defined endpoint. We have built models for six well-defined physicochemical properties (log P, log S, log BCF, BP, MP and log VP).

  • 2)

    An unambiguous algorithm. We have evaluated four unambiguous machine learning algorithms (MLR, PLSR, RF and SVR). To increase transparency, the R code associated with these modeling approaches is made available in the supplemental material.

  • 3)

    A defined domain of applicability. We have evaluated the AD using three different distance measures.

  • 4)

    Appropriate measure of goodness-of-fit, robustness, and predictivity. We have evaluated our models based on cross-validation and external validation sets, iteratively varied the feature sets to examine robustness, and compared them to currently available gold standard physicochemical property prediction software.

  • 5)

    A mechanistic interpretation, if possible. Here it is not practical to make an interpretation linking each and every selected fingerprint bit to the modeled endpoints. However, we assume that the statistically selected fingerprint bits represent fragments that are relevant to the studied endpoints. The physicochemical properties being predicted by the QSPR models presented here are critical inputs to environmental fate and transport and toxicity prediction models.

Accurate prediction of physicochemical properties using validated QSPR models will ensure a much clearer understanding of chemical hazard and exposure potential.

CONCLUSIONS

In the present study, QSPR models were implemented to predict six physicochemical properties from binary molecular fingerprints on the basis of large and structurally diverse sets of environmental chemicals using four distinctly different modeling approaches. Satisfactory predictive performance was achieved using optimal subsets of fingerprint bits and optimized regression parameters, and the estimated values correlated very well with experimental values. Both linear and nonlinear approaches provided accurate property predictions with similar regression statistics.

Since not all fingerprint bits contain useful or unique information related to the physicochemical properties, it was crucial to select the most relevant variables to construct the model. On one hand, an insufficient number of fingerprint bits led to underfitted and statistically unstable models and a potential lack of coverage of new chemistry space. On the other hand, use of excessive fingerprint bits produced models overfitted to the training data, resulting in poor prediction of the test set values. Feature selection using GA was found to substantially improve the predictive ability of MLR models, influence PLSR and SVR models only slightly, and exert no effect on RF models.

The four regression approaches exhibited different modeling characteristics. MLR is a simple linear regression method and does not require a time-consuming parameter optimization procedure. PLSR yields relatively reasonable accuracy of predictions and performs well even in the presence of noise variables. RF is robust against overfitting since the algorithm constructed an ensemble of regression trees to model the data, and we found it predicted the training and test sets equally well. Nevertheless, the RF model did not perform as well as the other models when binary fingerprints were used as descriptors.

SVR, a complex and nonlinear modeling technique, was shown to be superior to the other three approaches for modeling these properties. SVR coupled with GA in selecting the most significant descriptors achieved excellent results and predicted all the six properties accurately in terms of the coefficient of determination and RMSE for both 10-fold cross validation and the external test sets. Table 7 summarizes the best performing models and shows that the statistics between cross validation and test sets are close to each other, confirming the stability, robustness and reliability of these regression models. The correlation between experimental and predicted values for the test sets was found to vary over a range of R2 for different properties. The BP model had the highest test R2 of 0.965, followed by log VP (R2 = 0.946), log S (R2 = 0.939) and log P (R2 = 0.935). MP and log BCF model predictions were less accurate, with test R2 = 0.826 and 0.885, respectively.

Table 7.

Regression Statistics of Best Performing SVR Models for Each Property

Property  Variables  10-Fold CV Q2 (Mean, Interval)  10-Fold CV RMSEcv (Mean, Interval)  R2 for Test Set  RMSE for Test Set
log P 600 bits + MW 0.932 0.926–0.935 0.478 0.428–0.558 0.935 0.451
log S 350 bits + MW + log P 0.928 0.921–0.936 0.580 0.487–0.666 0.939 0.542
log BCF 200 bits + MW + log P 0.863 0.851–0.877 0.465 0.361–0.525 0.885 0.444
BP 400 bits + MW 0.955 0.941–0.968 17.98 14.30–20.64 0.965 15.63
MP 500 bits + MW + BP 0.824 0.812–0.831 41.41 38.37–45.29 0.826 39.14
log VP 350 bits + MW + BP 0.929 0.915–0.939 0.975 0.806–1.080 0.946 0.810

Taken together, the results of this study demonstrate that the combination of careful data curation, binary molecular fingerprints, and machine learning approaches provides a rapid and efficient way to estimate physicochemical properties of environmental chemicals. We have thus developed QSPR models that are highly stable and reliable. The models conform with most of the validation principles put forth by the OECD and therefore have broad applicability for property estimation of many classes of compounds. The models are freely available as downloadable workflows via the NICEATM website to support high-throughput calculations, and can be used by researchers and regulators to make predictions on new chemical sets, improve toxicity models, and inform hazard/risk characterization. In the near future, the models developed here will be used to predict data for the six reported properties for the CompTox dashboard website content of >720,000 chemicals and made available to the public.

Supplementary Material

Supplement1
Supplement2

ACKNOWLEDGMENTS

We would like to express our deep appreciation to Dr. Ann Richard (EPA National Center for Computational Toxicology) and Dr. Shannon Bell (Integrated Laboratory Systems, Inc.) for their constructive suggestions for this manuscript. We also thank Ms. Catherine Sprankle (Integrated Laboratory Systems, Inc.) for editorial review. This project was funded in whole or in part with federal funds from the National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH) under contract HHSN273201500010C to Integrated Laboratory Systems in support of NICEATM.

Footnotes

The authors declare no competing financial interest.

Disclaimer: The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency and National Institute of Environmental Health Sciences. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

ASSOCIATED CONTENT

Supporting Information

Examples of R code: feature selection and regression analysis (The complete R code can be found at GitHub: https://github.com/zang1123/Physicochemical-Property-Prediction/). QMRFs: the QSAR Model Reporting Formats. Excel file: chemical names and CAS registry number of the training and test sets, as well as experimentally measured and estimated property values. Tables S1-S4: Regression statistics for log BCF, BP, MP and log VP, respectively. Table S5: Applicability domains for log BCF, BP, MP and log VP, respectively. Figure S1: Data distribution of log BCF, BP, MP and log VP. Figures S2-S5: The relationship between model complexity and prediction errors as well as the plots of estimated values versus experimental data for log BCF, BP, MP and log VP, respectively. Figure S6: Plots of leverage versus standardized residuals for log BCF, BP, MP and log VP models. This material is available free of charge via the Internet at http://pubs.acs.org.

REFERENCES

  • (1). U.S. EPA, Office of Pollution Prevention and Toxics (OPPT) Chemical Reviews and Tools Case Study. http://www.who.int/ifcs/documents/forums/forum5/precaution/epa_en.pdf (accessed September 22, 2016).
  • (2). Chemicals under the Toxic Substances Control Act (TSCA). https://www.epa.gov/chemicals-under-tsca (accessed September 22, 2016).
  • (3). Egeghy PP; Judson RS; Gangwal S; Mosher S; Smith D; Vail J; Cohen-Hubal EA The exposure data landscape for manufactured chemicals. Sci. Total Environ. 2012, 414(1), 159–166. [DOI] [PubMed] [Google Scholar]
  • (4). Judson RS; Martin MT; Egeghy PP; Gangwal S; Reif DM; Kothiya P; Wolf MA; Cathey T; Transue TR; Smith D; Vail J; Frame A; Mosher S; Cohen-Hubal EA; Richard AM Aggregating data for computational toxicology applications: The U.S. Environmental Protection Agency (EPA) Aggregated Computational Toxicology Resource (ACToR) system. Int. J. Mol. Sci. 2012, 13(2), 1805–1831. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (5). Judson RS; Richard AM; Dix DJ; Houck KA; Elloumi F; Martin MT; Cathey T; Transue TR; Spencer R; Wolf MA ACToR – Aggregated Computational Toxicology Resource. Toxicol. Appl. Pharmacol. 2008, 233(1), 7–13. [DOI] [PubMed] [Google Scholar]
  • (6). Mansouri K; Abdelaziz A; Rybacka A; Roncaglioni A;Tropsha A; Varnek A; Zakharov A; Worth A; Richard AM; Grulke CM; Trisciuzzi D; Fourches D; Horvath D; Benfenati E; Muratov E; Wedebye EB ; Grisoni F.; Mangiatordi GF.; Incisivo GM; Hong H.; Ng HW.; Tetko IV.; Balabin I.; Kancherla J.; Shen J.; Burton J.; Nicklaus M.; Cassotti M.;Nikolov NG.; Nicolotti O.; Andersson PL.; Zang Q.;Politi R.;Beger RD.; Todeschini R.; Huang R.; Farag S.; Rosenberg SA.;Slavov S.; Hu X.; Judson RS. CERAPP: collaborative estrogen receptor activity prediction project. Environ. Health Perspect. DOI: 10.1289/ehp.1510267. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (7). Cohen-Hubal EA; Richard AM; Aylward L; Edwards SW; Gallagher J; Goldsmith JM; Isukapalli S; Tornero-Velez R; Weber EJ; Kavlock RJ Advancing exposure characterization for chemical evaluation and risk assessment. J. Toxicol. Environ. Health B. Crit. Rev. 2010, 13(2), 299–313. [DOI] [PubMed] [Google Scholar]
  • (8). Knudsen TB; Houck KA; Sipes N; Singh AV; Judson RS; Martin MT; Weissman A; Kleinstreuer N; Mortensen HM; Reif DM; Rabinowitz JR; Setzer W; Richard AM; Dix DJ; Kavlock RJ Activity profiles of 309 ToxCast™ chemicals evaluated across 292 biochemical targets. Toxicology 2011, 282(1–2), 1–15. [DOI] [PubMed] [Google Scholar]
  • (9). Judson RS; Richard AM; Dix DJ; Houck KA; Martin MT; Kavlock RJ; Dellarco V; Henry T; Holderman T; Sayre P; Tan S; Carpenter T; Smith E The toxicity data landscape for environmental chemicals. Environ. Health Perspect. 2009, 117(5), 685–695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (10). Kavlock RJ; Dix DJ Computational toxicology as implemented by the U.S. EPA: providing high throughput decision support tools for screening and assessing chemical exposure, hazard and risk. J. Toxicol. Environ. Health B. Crit. Rev. 2010, 13(2–4), 197–217. [DOI] [PubMed] [Google Scholar]
  • (11). Wetmore BA; Wambaugh JF; Ferguson SS; Sochaski MA; Rotroff DM; Freeman K; Clewell III HJ; Dix DJ; Andersen ME; Houck KA; Allen B; Judson RS; Singh R; Kavlock RJ; Richard AM; Thomas RS Integration of dosimetry, exposure and high-throughput screening data in chemical toxicity assessment. Toxicol. Sci. 2012, 125(1), 157–174. [DOI] [PubMed] [Google Scholar]
  • (12). Judson RS; Kavlock RJ; Setzer RW; Cohen-Hubal EA; Martin MT; Knudsen TB; Houck KA; Thomas RS; Wetmore BA; Dix DJ Estimating toxicity-related biological pathway altering doses for high-throughput chemical risk assessment. Chem. Res. Toxicol. 2011, 24(4), 451–462. [DOI] [PubMed] [Google Scholar]
  • (13). Martin MT; Dix DJ; Judson RS; Kavlock RJ; Reif DM; Richard AM; Rotroff DM; Romanov S; Medvedev A; Poltoratskaya N; Gambarian M; Moeser M; Makarov SS; Houck KA Impact of environmental chemicals on key transcription regulators and correlation to toxicity end points within EPA's ToxCast program. Chem. Res. Toxicol. 2010, 23(3), 578–590. [DOI] [PubMed] [Google Scholar]
  • (14). Judson RS; Houck KA; Kavlock RJ; Knudsen TB; Martin MT; Mortensen HM; Reif DM; Rotroff DM; Shah IA; Richard AM; Dix DJ In vitro screening of environmental chemicals for targeted testing prioritization: the ToxCast project. Environ. Health Perspect. 2010, 118(4), 485–492. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (15). Dix DJ; Houck KA; Martin MT; Richard AM; Setzer RW; Kavlock RJ The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol. Sci. 2007, 95(1), 5–12. [DOI] [PubMed] [Google Scholar]
  • (16). Sipes NS; Martin MT; Kothiya P; Reif DM; Judson RS; Richard AM; Houck KA; Dix DJ; Kavlock RJ; Knudsen TB Profiling 976 ToxCast chemicals across 331 enzymatic and receptor signaling assays. Chem. Res. Toxicol. 2013, 26(6), 878–895. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (17). Browne P; Judson RS; Casey WM; Kleinstreuer NC; Thomas RS Screening chemicals for estrogen receptor bioactivity using a computational model. Environ. Sci. Technol. 2015, 49(14), 8804–8814. [DOI] [PubMed] [Google Scholar]
  • (18). Kleinstreuer NC; Yang J; Berg EL; Knudsen TB; Richard AM; Martin MT; Reif DM; Judson RS; Polokoff M; Dix DJ; Kavlock R; Houck KA Phenotypic screening of the ToxCast chemical library to classify toxic and therapeutic mechanisms. Nature Biotechnol. 2014, 32, 583–591. [DOI] [PubMed] [Google Scholar]
  • (19). Hermens JL; de Bruijn JH; Brooke DN The octanol-water partition coefficient: strengths and limitations. Environ. Toxicol. Chem. 2013, 32(4), 732–733. [DOI] [PubMed] [Google Scholar]
  • (20). Wang J; Hou T Recent advances on aqueous solubility prediction. Comb. Chem. High Throughput Screen. 2011, 14(5), 328–338. [DOI] [PubMed] [Google Scholar]
  • (21). Hewitt M; Cronin MTD; Enoch SJ; Madden JC; Roberts DW; Dearden JC In silico prediction of aqueous solubility: the solubility challenge. J. Chem. Inf. Model. 2009, 49(11), 2572–2587. [DOI] [PubMed] [Google Scholar]
  • (22). Hopfinger AJ; Esposito EX; Llinàs A; Glen RC; Goodman JM Findings of the challenge to predict aqueous solubility. J. Chem. Inf. Model. 2009, 49(1), 1–5. [DOI] [PubMed] [Google Scholar]
  • (23). Gissi A; Gadaleta D; Floris M; Olla S; Carotti A; Novellino E; Benfenati E; Nicolotti O An alternative QSAR-based approach for predicting the bioconcentration factor for regulatory purpose. ALTEX 2014, 31, 23–36. [DOI] [PubMed] [Google Scholar]
  • (24). Rudén C; Hansson SO Registration, evaluation, and authorization of chemicals (REACH) is but the first step - how far will it take us? Six further steps to improve the European chemicals legislation. Environ. Health Perspect. 2010, 118(1), 6–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (25). Schoeters G. The REACH perspective: toward a new concept of toxicity testing. J. Toxicol. Environ. Health B Crit. Rev. 2010, 13(2–4), 232–241. [DOI] [PubMed] [Google Scholar]
  • (26). Winder C; Azzi R; Wagner D. The development of the globally harmonized system (GHS) of classification and labelling of hazardous chemicals. J. Hazard Mater. 2005, 125(1–3), 29–44. [DOI] [PubMed] [Google Scholar]
  • (27). Kujawski J; Popielarska H; Myka A; Drabińska B; Bernard MK The log P parameter as a molecular descriptor in the computer-aided drug design - an overview. Comput. Methods Sci. Technol. 2012, 18(2), 81–88. [Google Scholar]
  • (28). Zang Q; Rotroff DM; Judson RS Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure-activity relationship and machine learning methods. J. Chem. Inf. Model. 2013, 53(12), 3244–3261. [DOI] [PubMed] [Google Scholar]
  • (29). Vinggaard AM; Niemelä J; Wedebye EB; Jensen GE Screening of 397 chemicals and development of a quantitative structure-activity relationship model for androgen receptor antagonism. Chem. Res. Toxicol. 2008, 21, 813–823. [DOI] [PubMed] [Google Scholar]
  • (30). Scholz S; Sela E; Blaha L; Braunbeck T; Galay-Burgos M; García-Franco M; Guinea J; Klüver N; Schirmer K; Tanneberger K; Tobor-Kapłon M; Witters H; Belanger S; Benfenati E; Creton S; Cronin MTD; Eggen RIL; Embry M; Ekman D; Gourmelon A; Halder M; Hardy B; Hartung T; Hubesch B; Jungmann D; Lampi MA; Lee L; Léonard M; Küster E; Lillicrap A; Luckenbach T; Murk AJ; Navas JM; Peijnenburg W; Repetto G; Salinas E; Schüürmann G; Spielmann H; Tollefsen KE; Walter-Rohde S; Whale G; Wheeler JR; Winter MJ A European perspective on alternatives to animal testing for environmental hazard identification and risk assessment. Regul. Toxicol. Pharmacol. 2013, 67(3), 506–530. [DOI] [PubMed] [Google Scholar]
  • (31). Burden N; Sewell F; Chapman K. Testing chemical safety: what is needed to ensure the widespread application of nonanimal approaches? PLoS Biol. 2015, 13(5), e1002156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (32). Bhhatarai B; Teetz W; Liu T; Öberg T; Jeliazkova N; Kochev N; Pukalov O; Tetko IV; Kovarich S; Papa E; Gramatica P CADASTER QSPR models for predictions of melting and boiling points of perfluorinated chemicals. Mol. Inf. 2011, 30, 189–204. [DOI] [PubMed] [Google Scholar]
  • (33). Zhang J; Liu Z; Liu W. QSPR study for prediction of boiling points of 2475 organic compounds using stochastic gradient boosting. J. Chemom. 2014, 28, 161–167. [Google Scholar]
  • (34). Liang C; Gallagher DA QSPR prediction of vapor pressure from solely theoretically-derived descriptors. J. Chem. Inf. Comput. Sci. 1998, 38, 321–324. [Google Scholar]
  • (35). Hughes LD; Palmer DS; Nigsch F; Mitchell JBO Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and log P. J. Chem. Inf. Model. 2008, 48(1), 220–232. [DOI] [PubMed] [Google Scholar]
  • (36). Ran Y; Jain N; Yalkowsky SH Prediction of aqueous solubility of organic compounds by the general solubility equation (GSE). J. Chem. Inf. Comput. Sci. 2001, 41(5), 1208–1217. [DOI] [PubMed] [Google Scholar]
  • (37). Lombardo A; Roncaglioni A; Boriani E; Milan C; Benfenati E Assessment and validation of the CAESAR predictive model for bioconcentration factor (BCF) in fish. Chem. Cent. J. 2010, 4(Suppl 1), S1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (38). Huuskonen J Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. J. Chem. Inf. Comput. Sci. 2000, 40(3), 773–777. [DOI] [PubMed] [Google Scholar]
  • (39). Clark M Generalized fragment-substructure based property prediction method. J. Chem. Inf. Model. 2005, 45, 30–38. [DOI] [PubMed] [Google Scholar]
  • (40). Cheng T; Zhao Y; Li X; Lin F; Xu Y; Zhang X; Li Y; Wang R; Lai L Computation of octanol-water partition coefficients by guiding an additive model with knowledge. J. Chem. Inf. Model. 2007, 47(6), 2140–2148. [DOI] [PubMed] [Google Scholar]
  • (41). Meylan WM; Howard PH Estimating log P with atom/fragments and water solubility with log P. Perspect. Drug Discovery Des. 2000, 19(1), 67–84. [Google Scholar]
  • (42). Hou TJ; Xia K; Zhang W; Xu XJ ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach. J. Chem. Inf. Comput. Sci. 2004, 44, 266–275. [DOI] [PubMed] [Google Scholar]
  • (43). Klopman G; Zhu H Estimation of the aqueous solubility of organic molecules by the group contribution approach. J. Chem. Inf. Comput. Sci. 2001, 41(2), 439–445. [DOI] [PubMed] [Google Scholar]
  • (44). EPI Suite™ - Estimation Program Interface. https://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface (accessed September 22, 2016).
  • (45). The influence of data curation on QSAR Modeling – examining issues of quality versus quantity of data. https://cfpub.epa.gov/si/si_public_record_report.cfm?dirEntryId=311418 (accessed September 22, 2016).
  • (46). Mansouri K. et al., Manuscript submitted for publication.
  • (47). Weininger D SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28(1), 31–36. [Google Scholar]
  • (48). OECD Principles for the Validation of (Q)SARs. https://eurl-ecvam.jrc.ec.europa.eu/laboratories-research/predictive_toxicology/background/oecd-principles (accessed September 22, 2016).
  • (49). EPI Suite Data. http://esc.syrres.com/interkow/EPiSuiteData.htm (accessed September 22, 2016).
  • (50). Yap CW PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32(7), 1466–1474. [DOI] [PubMed] [Google Scholar]
  • (51). PaDEL-Descriptor. http://www.yapcwsoft.com/dd/padeldescriptor/ (accessed September 22, 2016).
  • (52). Judson R Genetic Algorithms and Their Use in Chemistry. In Reviews in Computational Chemistry, first edition; Lipkowitz KB, Boyd DB, Eds.; VCH Publishers, Inc.: New York, USA, 1997; Vol. 10, pp 1–73. [Google Scholar]
  • (53). Wegner JK; Zell A Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method. J. Chem. Inf. Comput. Sci. 2003, 43(3), 1077–1084. [DOI] [PubMed] [Google Scholar]
  • (54). Zang Q; Keire DA; Wood RD; Buhse LF; Moore CMV; Nasr M; Al-Hakim A; Trehy ML; Welsh WJ Determination of galactosamine impurities in heparin samples by multivariate regression analysis of their 1H NMR spectra. Anal. Bioanal. Chem. 2011, 399(2), 635–649. [DOI] [PubMed] [Google Scholar]
  • (55). Varmuza K; Filzmoser P Introduction to multivariate statistical analysis in chemometrics; CRC Press: Boca Raton, FL, USA, 2009. [Google Scholar]
  • (56). Cao SS; Xu QS; Liang YZ; Chen X; Li HD Prediction of aqueous solubility of druglike organic compounds using partial least squares, back-propagation network and support vector machine. J. Chemom. 2010, 24(9), 584–595. [Google Scholar]
  • (57). Palmer DS; O’Boyle NM; Glen RC; Mitchell JBO Random forest models to predict aqueous solubility. J. Chem. Inf. Model. 2007, 47(1), 150–158. [DOI] [PubMed] [Google Scholar]
  • (58). Lind P; Maltseva T Support vector machines for the estimation of aqueous solubility. J. Chem. Inf. Comput. Sci. 2003, 43(6), 1855–1859. [DOI] [PubMed] [Google Scholar]
  • (59). Zang Q; Keire DA; Buhse LF; Wood RD; Mital DP; Haque S; Srinivasan S; Moore CMV; Nasr M; Al-Hakim A; Trehy ML; Welsh WJ Identification of heparin samples that contain impurities or contaminants by chemometric pattern recognition analysis of proton NMR spectral data. Anal. Bioanal. Chem. 2011, 401(3), 939–955. [DOI] [PubMed] [Google Scholar]
  • (60). Veerasamy R; Rajak H; Jain A; Sivadasan S; Varghese CV; Agrawa RK Validation of QSAR models - strategies and importance. Int. J. Drug Des. Discovery 2011, 2(3), 511–519. [Google Scholar]
  • (61). Sahigara F; Mansouri K; Ballabio D; Mauri A; Consonni V; Todeschini R. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 2012, 17, 4791–4810. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (62). Sahigara F; Ballabio D; Todeschini R; Consonni V Assessing the validity of QSARs for ready biodegradability of chemicals: an applicability domain perspective. Curr. Comput. Aided Drug Des. 2013, 10, 137–147. [DOI] [PubMed] [Google Scholar]
  • (63). Sahigara F; Ballabio D; Todeschini R; Consonni V Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions. J. Cheminform. 2013, 5:27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (64). R Development Core Team R: A language and environment for statistical computing; R Foundation for Statistical Computing: Vienna, Austria, 2011. http://www.R-project.org/ (accessed September 22, 2016). [Google Scholar]
  • (65). Tetko IV; Lowe DM; Williams AJ The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS. J. Cheminform. 2016, 8:2. [DOI] [PMC free article] [PubMed] [Google Scholar]


Supplementary Materials

Supplement1
Supplement2
