Six machine learning methods combined with descriptors or fingerprints were employed to predict chemical toxicity on marine crustaceans.
Abstract
Aquatic toxicity is a crucial endpoint for evaluating the adverse effects of chemicals on ecosystems. Therefore, we developed in silico methods for predicting chemical aquatic toxicity in the marine environment. First, a diverse data set covering different crustacean species was constructed. We then built local binary models using Mysidae data and global binary models using Mysidae, Palaemonidae, and Penaeidae data. Molecular fingerprints and descriptors were used separately to represent chemical structures. All models were built with six machine learning methods. The AUC (area under the receiver operating characteristic curve) values of the better local and global models were around 0.8 and 0.9 for the test sets, respectively. We also identified several chemicals with selective toxicity toward different species. Analyzing selective toxicity could support the design of greener chemicals for a specific environment. Finally, to understand and interpret the models, we explored the relationships between chemical aquatic toxicity and the molecular descriptors. Our study should be helpful for gaining further insights into marine organisms, predicting chemical aquatic toxicity, and prioritizing environmental hazard assessment.
Introduction
Nowadays, industrial chemicals, petrochemicals, pharmaceuticals, and agrochemicals are used frequently in daily life. The exposure of humans and environmental organisms to these chemicals, and the resulting impacts, have become a problem. Stringent chemical legislation and ambitious risk assessment are therefore urgently needed.1–3 Many international organizations, such as the United States Environmental Protection Agency (US-EPA) and the Organization for Economic Cooperation and Development (OECD), have invested heavily in early risk assessment. Within the framework of these organizations, it is necessary to limit animal tests for reasons of animal welfare, cost, and management.4–6 Accordingly, more effort should be devoted to developing alternative estimation techniques for risk assessment, such as QSARs (quantitative structure–activity relationships) or in vitro methods.7,8
In eco-toxicological risk assessment, acute toxicity is a key endpoint, and QSARs are applicable to predicting it. Crustacean species are widely chosen in aquatic toxicity studies.9 Recently, a few QSARs have been developed for estimating the aquatic toxicity of chemicals in multiple test species, with the majority of data coming from freshwater organisms.10–15 However, there are few QSAR studies on the toxicity of chemicals in diverse crustacean species, especially marine organisms. Basant et al. established linear models to predict the toxicity of pesticides using toxicity data for Daphnia magna, Americamysis bahia, Gammarus fasciatus and Penaeus duorarum. The models yielded correlations (R²) of 0.941 for the test sets.16 Nonetheless, these models were constructed from small datasets with limited chemical diversity and offered little interpretation of the mechanism of chemical aquatic toxicity.
For ecological risk assessment, US-EPA employs different toxicity categories to classify chemical molecules.17 As shown in Table 1, chemical aquatic toxicity can be divided into five categories, i.e., nontoxic, slightly toxic, moderately toxic, highly toxic, and very highly toxic. Accordingly, we can develop SAR (structure–activity relationship) models for fast decision making.
Table 1. Chemical toxicity categories in aquatic organisms.
| Toxicity category | Aquatic organisms acute concentration (PPM) | Binary classification |
| Very highly toxic | <0.1 | 1 |
| Highly toxic | 0.1–1 | 1 |
| Moderately toxic | >1–10 | 1 |
| Slightly toxic | >10–100 | 0 |
| Nontoxic | >100 | 0 |
In practice, the saltwater environment is the final sink of many anthropogenic pollutants.18–20 However, to our knowledge, there have been no published studies on QSAR models for the toxicological evaluation of diverse chemicals in saltwater crustaceans. Mysids are relatively small shrimp-like crustaceans that live in both saltwater and freshwater environments.21,22 They have been used as a crustacean model for toxicity regulation for around two decades. US-EPA and the American Society for Testing of Materials have employed Americamysis bahia as a crucial test species for coastal and estuarine monitoring.23–25
In this study, we first constructed a database containing different marine crustaceans, with the majority of the data being from mysid shrimp, which is the common name of Americamysis bahia. Next, we developed computational models to predict the toxicity of chemicals for mysid shrimp. Then, we tried to determine whether mysid shrimp models can be applied to other marine crustaceans. Finally, we tried to determine whether combined models, which use the data of Mysidae, Penaeidae, and Palaemonidae in the training set, could be built for the toxicity prediction of different marine crustaceans. Here, we named the mysid shrimp models as local models and the combined models as global models. In computational methods, we used either molecular fingerprints or key molecular descriptors to represent chemical structures, and developed binary classification models through several machine learning methods. We also applied several indices to evaluate our models, and compared the results with the Ecological Structure–Activity Relationship (ECOSAR). In addition, we explored the relationships between key molecular descriptors and chemical aquatic toxicity. These models would quickly predict chemical aquatic toxicity and provide crucial tools for the prioritization of environmental hazard assessment.
Methods
Data collection and preparation
We collected the marine crustacean data from the US-EPA ECOTOX database (published on 14 September 2017),26 and selected records on marine crustaceans that reported LC50 values observed in a four-day experimental observation period. Specifically, as shown in Table 2, we collected five different datasets of marine organisms for this research. Mysidae, Palaemonidae, and Penaeidae are different families of crustaceans, while Amphipoda and Harpacticoida are orders of crustaceans. Since mysid shrimp account for the majority of the data, we built local models with the Mysidae data set. To determine whether mysid shrimp models can be applied to other marine crustaceans, we used the Palaemonidae and Penaeidae data as two separate external validation sets. Since the external validation of the local models achieved favorable results, we then built global models using Mysidae, Palaemonidae, and Penaeidae data to improve structural diversity. Finally, to determine whether the better global models can be applied to other marine organisms, the Amphipoda and Harpacticoida data were used as two separate external validation sets. The detailed relationships among the crustacean species, and the species used in the local and global models, are shown in Fig. 1.
Table 2. Five different datasets of marine organisms.
| Datasets | Original data | Standardized data | Toxic | Nontoxic |
| Mysidae (e.g., Americamysis bahia) | 1075 | 386 | 307 | 79 |
| Palaemonidae (e.g., Palaemonetes pugio) | 623 | 110 | 79 | 31 |
| Penaeidae (e.g., Penaeus duorarum) | 467 | 74 | 56 | 18 |
| Amphipoda (e.g., Chaetogammarus marinus) | 397 | 34 | 20 | 14 |
| Harpacticoida (e.g., Nitocra spinipes) | 297 | 81 | 45 | 36 |
Fig. 1. The detailed relationships of different crustacean species and different species used in local and global models. (A) Relationships of different crustacean species. (B) Data sets used in this study.
All datasets were processed as follows. First, we converted all units into PPM and removed data whose units could not be converted directly, such as mol L⁻¹, AI (active ingredient) ppm, and AI μg L⁻¹. Then, following the US-EPA toxicity categories (Table 1), we divided all chemicals into five categories. For binary classification, we merged the five categories into two, toxic and nontoxic. If the experimental data for one crustacean species placed a molecule in different toxicity categories, we removed the compound. When different experimental data placed a molecule in the same category, the most toxic value was kept in order to reduce the false negative rate. Next, searching by CAS Registry Number, we collected chemical structures from the US-EPA Aggregated Computational Toxicology Resource (ACToR) database.27 All structures were checked again by comparing their SMILES with those in the PubChem database.28 Finally, mixtures, inorganic salts, and organometallic compounds were removed. After standardization, the Mysidae dataset was randomly split into a training set (80% of the data) and a test set (the remaining 20%) for the local models. Furthermore, we removed from the external validation sets any compounds that also appeared in the training set, and we found some chemicals with selective toxicity, meaning that a chemical is toxic to one species and nontoxic to another. Similarly, for the global models, the combined Mysidae, Palaemonidae, and Penaeidae data were randomly split into a training set (80%) and a test set (20%), and compounds in the external validation sets that duplicated training set data were removed.
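The labeling and de-duplication rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names and the placeholder compound identifiers are hypothetical, the category bounds are read from Table 1, and the conflict check is assumed to operate on the binary label (the text could also be read as comparing the five categories).

```python
from collections import defaultdict

# Categories that Table 1 maps to the binary label 1 (toxic).
TOXIC_CATEGORIES = {"very highly toxic", "highly toxic", "moderately toxic"}

def toxicity_category(lc50_ppm):
    """Map an LC50 value in ppm to the US-EPA category of Table 1."""
    if lc50_ppm < 0.1:
        return "very highly toxic"
    if lc50_ppm <= 1:
        return "highly toxic"
    if lc50_ppm <= 10:
        return "moderately toxic"
    if lc50_ppm <= 100:
        return "slightly toxic"
    return "nontoxic"

def binary_label(lc50_ppm):
    """1 = toxic (LC50 <= 10 ppm), 0 = nontoxic (LC50 > 10 ppm)."""
    return 1 if toxicity_category(lc50_ppm) in TOXIC_CATEGORIES else 0

def curate(records):
    """records: (compound_id, lc50_ppm) pairs for ONE species.

    Compounds with conflicting labels are dropped (assumption: the conflict
    is judged on the binary label); otherwise the lowest, i.e. most toxic,
    LC50 value is kept.
    """
    by_compound = defaultdict(list)
    for compound_id, lc50 in records:
        by_compound[compound_id].append(lc50)
    curated = {}
    for compound_id, values in by_compound.items():
        labels = {binary_label(v) for v in values}
        if len(labels) > 1:
            continue  # contradictory measurements: remove the compound
        curated[compound_id] = (min(values), labels.pop())
    return curated

# Placeholder identifiers, not real CAS numbers or measured values.
example = [("A", 0.05), ("A", 2.0), ("B", 5.0), ("B", 50.0), ("C", 150.0)]
print(curate(example))
```

Compound B is dropped here because its two measurements fall on opposite sides of the 10 ppm boundary, whereas compound A keeps its lowest value.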
The distribution of chemicals across the toxic and nontoxic classes was imbalanced, so a clustering analysis method (as implemented in Discovery Studio 3.5) was used to balance the toxic and nontoxic data in the training set for both local and global models. We reduced the number of similar toxic compounds through sampling, keeping the central molecule of each cluster for modeling.
Molecular representation
Using PaDEL-Descriptor,29 1444 1D/2D molecular descriptors were generated, covering physicochemical, topological and electronic properties. Nine types of fingerprints were also calculated with PaDEL-Descriptor: MACCS fingerprint (166 bits, MACCS), Substructure fingerprint (307 bits, SubFP), CDK fingerprint (1024 bits, FP), Estate fingerprint (79 bits, EStateFP), Klekota-Roth fingerprint (4860 bits, KRFP), PubChem fingerprint (881 bits, PubChemFP), CDK extended fingerprint (1024 bits, ExtFP), CDK graph only fingerprint (1024 bits, GraphFP) and 2D atom pairs (780 bits, AP2D). Detailed explanations of these fingerprints can be found in previous studies.29–38
For fingerprint-based models, the fingerprint bits were used directly as features. For descriptor-based models, we first removed descriptors that could not be calculated for all chemicals and those with more than 80% zero values. Then, we used the F-score and the Pearson correlation coefficient for further feature reduction. We retained the top twenty descriptors ranked by F-score, which measures how well a descriptor discriminates between the classes of the endpoint. Finally, we selected the final descriptors using the Pearson correlation coefficient, which measures the correlation between pairs of descriptors.
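A minimal sketch of this two-stage filter is given below. It rests on assumptions the text does not state: the F-score is taken to be the common two-class definition (between-class separation of the means divided by the sum of the within-class sample variances), the correlation cut-off `r_max=0.9` is a hypothetical value, and correlated descriptors are removed greedily in order of decreasing F-score. All function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def f_score(values, labels):
    """Two-class F-score of one descriptor (assumed definition, see lead-in)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    overall = mean(values)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def pearson(x, y):
    """Pearson correlation coefficient between two descriptor columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def select_descriptors(columns, labels, top_k=20, r_max=0.9):
    """columns: {descriptor name: list of values}; labels: 1 = toxic, 0 = nontoxic.

    Step 1 keeps the top_k descriptors by F-score; step 2 drops any descriptor
    whose |r| with an already kept descriptor exceeds r_max (hypothetical cut-off).
    """
    ranked = sorted(columns, key=lambda n: f_score(columns[n], labels), reverse=True)
    kept = []
    for name in ranked[:top_k]:
        if all(abs(pearson(columns[name], columns[k])) <= r_max for k in kept):
            kept.append(name)
    return kept

# Toy data with placeholder descriptor names, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "d1": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "d2": [10.0, 12.0, 14.0, 2.0, 4.0, 7.0],  # nearly a multiple of d1
    "d3": [1.0, 5.0, 3.0, 2.0, 4.0, 3.0],
}
print(select_descriptors(columns, labels, top_k=3))
```

In the toy data, `d2` ranks second by F-score but is discarded because it is almost perfectly correlated with `d1`.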
Model building
We employed six machine learning methods, which are random forest (RF),39 naïve Bayes (NB),40 k-nearest neighbor (kNN),41 C4.5 decision tree (CT),42 support vector machine (SVM),43 and artificial neural network (ANN).44 These methods were implemented in Orange.45
For the SVM method, we adopted a Python script in the LIBSVM package46 to optimize the hyper-parameters C and γ. For kNN, ANN, and RF, the hyper-parameters were optimized by scanning for the highest total accuracy in cross-validation. For NB, we used relative frequency for probability estimation. For CT, we applied information gain as the attribute selection criterion.
Evaluation of model performance
The 10-fold cross validation was applied to evaluate the robustness of the models and to eliminate the inappropriate algorithms and feature sets. Test sets were used to assess the performance of the models. In addition, external sets, from different species for inter-species prediction, were used to explore a wider range of applications for models.
We applied several indices to evaluate our models. True positive (TP) refers to toxic chemicals predicted as toxic, while false negative (FN) refers to toxic chemicals predicted as nontoxic. True negative (TN) refers to nontoxic chemicals predicted as nontoxic, while false positive (FP) refers to nontoxic chemicals predicted as toxic. From these counts, we calculated the overall predictive accuracy Q (= (TP + TN)/(TP + FN + TN + FP)), the sensitivity SE (= TP/(TP + FN)), and the specificity SP (= TN/(TN + FP)), together with the area under the receiver operating characteristic curve (AUC).47
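The indices defined above can be computed as in the following sketch. The function names are illustrative, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which is one standard way to obtain it and not necessarily the routine used in Orange. The confusion-matrix counts in the example are not reported in the paper: they are back-calculated here from the SE and SP values reported for the local MD-SVM model on external validation set 1 (79 toxic and 31 nontoxic compounds).

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (Q, SE, SP) from confusion-matrix counts."""
    q = (tp + tn) / (tp + fn + tn + fp)   # overall predictive accuracy
    se = tp / (tp + fn)                   # sensitivity: toxic predicted as toxic
    sp = tn / (tn + fp)                   # specificity: nontoxic predicted as nontoxic
    return q, se, sp

def auc_from_scores(scores, labels):
    """Rank-based AUC: probability that a toxic compound receives a higher
    score than a nontoxic one, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Counts inferred from the reported rates (assumption, see lead-in).
q, se, sp = classification_metrics(tp=74, fn=5, tn=17, fp=14)
print(round(q, 3), round(se, 3), round(sp, 3))
```

With these inferred counts the three indices round to 0.827, 0.937 and 0.548, matching the values reported for that model.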
Definition of model applicability domain
An applicability domain (AD) method was applied to avoid making predictions for chemicals that differ considerably from the training set. Here, we used a distance-based applicability domain method48 that derives a threshold $D_T$ from the average distance between compounds in the training set and its standard deviation, and uses it to judge whether a compound in the test set lies within the AD. $D_T$ is defined in eqn (1), where $\bar{\gamma}$ and $\sigma$ are the mean and standard deviation of the Euclidean distances between chemicals, and $Z$ is an adjustable parameter that sets the “size” of the applicability domain. In this study, we set $Z$ to 0.5.
$$D_T = \bar{\gamma} + Z\sigma \qquad (1)$$
If the distance between a query molecule and any of its nearest neighbors in the training set exceeds $D_T$, the prediction is treated as questionable.49
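A minimal sketch of this distance-based check follows. Several details are assumptions, because the text does not fix them: the mean and standard deviation are taken over all pairwise Euclidean distances in the training set (the cited method may restrict them to nearest-neighbor distances), the population standard deviation is used, and a query is judged by its single nearest training neighbor. Function names and the toy coordinates are illustrative.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev

def distance_threshold(training, z=0.5):
    """D_T = mean + Z * std of the pairwise Euclidean distances
    between training compounds (descriptor vectors)."""
    distances = [dist(a, b) for a, b in combinations(training, 2)]
    return mean(distances) + z * pstdev(distances)

def in_domain(query, training, d_t):
    """True if the nearest training compound lies within D_T of the query."""
    return min(dist(query, t) for t in training) <= d_t

# Toy 2D descriptor space, for illustration only.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
d_t = distance_threshold(training, z=0.5)
print(round(d_t, 3))
print(in_domain((0.5, 0.5), training, d_t))   # close to the training set
print(in_domain((3.0, 3.0), training, d_t))   # far from the training set
```

A larger Z widens the domain, so more query compounds are accepted at the cost of less reliable predictions near the boundary.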
Results
Data collection and analysis
All crustacean data were collected from the US-EPA ECOTOX database. We chose records on marine crustaceans with LC50 values observed in a four-day experimental observation period. The numbers of original and standardized data points are shown in Table 2; the standardized data were used to develop the models. After random division in a ratio of 8:2, the training and test sets contained 309 and 77 compounds for the local models, and 326 and 82 compounds for the global models, respectively. After clustering of the toxic molecules in the training set, the number of training compounds was reduced to 192 for local models and 261 for global models. The total number of compounds used for modeling is shown in Table 3. We found ten chemicals with selective toxicity toward different species in this research, and their structures are shown in Fig. 2. Among them, compounds a, b, c, d, and e had selective toxicity between Mysidae and Palaemonidae, whereas compounds f, g, and h were selective between Mysidae and Penaeidae. Compound i showed selective toxicity between Mysidae and Amphipoda, and compound j between Mysidae and Harpacticoida. The CAS numbers of the chemicals and their acute concentration values in the original saltwater crustacean datasets are listed in Table S1.†
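The 8:2 division can be illustrated with a short sketch. The function name and the fixed seed are hypothetical, and rounding the training-set size to the nearest integer is an assumption; it does, however, reproduce the set sizes reported above for 386 and 408 compounds.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle with a fixed seed and split into 80% training / 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(0.8 * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Integer indices stand in for compounds.
local_train, local_test = split_80_20(range(386))
global_train, global_test = split_80_20(range(408))
print(len(local_train), len(local_test))     # 309 77
print(len(global_train), len(global_test))   # 326 82
```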
Table 3. The data points used in modelling.
| Model | Data set | Toxic | Non-toxic | Total |
| Local model | Training set (original) | 245 | 64 | 309 |
| | Training set (after clustering) | 128 | 64 | 192 |
| | Test set | 62 | 15 | 77 |
| | External validation 1 | 79 | 31 | 110 |
| | External validation 2 | 56 | 18 | 74 |
| Global model | Training set (original) | 239 | 87 | 326 |
| | Training set (after clustering) | 174 | 87 | 261 |
| | Test set | 64 | 18 | 82 |
| | External validation 1 | 20 | 14 | 34 |
| | External validation 2 | 45 | 36 | 81 |
Fig. 2. Ten selective toxicity chemicals. Chemical ID a–e: five selective toxicity chemicals of Mysidae and Palaemonidae; chemical ID f–h: three selective toxicity chemicals of Mysidae and Penaeidae; chemical ID i: a selective toxicity chemical of Mysidae and Amphipoda; and chemical ID j: a selective toxicity chemical of Mysidae and Harpacticoida.
In this study, the chemical space was characterized by molecular descriptors. Details of the descriptors used in model building are shown in Table S2.† Principal component analysis (PCA) was used to investigate the distribution of the different data sets in the chemical space. As presented in Fig. 3, the chemical space of the training set molecules was similar to that of the test set molecules in both local and global models. In addition, to explore the structural diversity of the training set molecules, the Tanimoto similarity index of the chemicals was calculated using MACCS fingerprints. The average Tanimoto similarity index was 0.233 and 0.232 for the data sets of local and global models, respectively, and most values were distributed between 0.0 and 0.5 (Fig. 4A). The heat maps of the Tanimoto similarity index (Fig. 4B and C) revealed that the chemicals in the training sets were diverse.
Fig. 3. Chemical space defined by the three principal components from several physicochemical properties, 10 kinds of descriptors for local models and 11 kinds of descriptors for global models. Training set data are colored red, and test set data are colored yellow. (A) For local models. (B) For global models.
Fig. 4. (A) Distribution of molecular similarity calculated with MACCS fingerprints and the Tanimoto coefficient, red for local models and black for global models. (B and C) Similarity matrices of the compounds, in which 0 (blue) means the least similarity and 1 (red) means the highest similarity (which does not necessarily mean that the compounds are identical). (B) For local models. (C) For global models.
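For reference, the Tanimoto similarity between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents each fingerprint as a set of on-bit indices; the function names and the toy fingerprints are illustrative and do not correspond to real MACCS keys.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints share no information; 0.0 is a convention here.
    return len(a & b) / union if union else 0.0

def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all distinct pairs of compounds."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy fingerprints, for illustration only.
fps = [{1, 2, 3}, {2, 3, 4}, {5}]
print(tanimoto(fps[0], fps[1]))                 # 0.5
print(round(mean_pairwise_tanimoto(fps), 3))    # 0.167
```

A low average value, such as the roughly 0.23 reported here, indicates that most pairs of compounds share few substructure keys.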
Selection of molecular descriptors
After feature reduction of the molecular descriptors, the 10 descriptors with the highest scores were used to develop the local models: mindssC, CrippenLogP, maxHBint2, maxwHBa, GATS1i, hmin, SwHBa, nT6HeteroRing, MDEO-11, and GATS4c. The 11 descriptors with the highest scores were used to develop the global models: XLogP, SHBd, maxHBint2, maxwHBa, mindssC, ALogP, SwHBa, GATS1i, GATS4c, AATSC1s, and MDEC-23. Detailed descriptions of these descriptors are given in Table S2.†
Performance of local models
For local models, both descriptor-based and fingerprint-based binary models were built using 6 machine learning methods.
The performance of the three better descriptor-based models, namely MD-SVM, MD-RF and MD-ANN, corresponding to Model ID 1–3, is shown in Fig. 5A. The sensitivity values of these three models were around 0.900, their AUC values were higher than 0.800, and their overall predictive accuracy values were around 0.850. The performance of all descriptor-based classification models is summarized in Table S3.†
Fig. 5. The results of the test set and cross validation with six better local models and six better global models. Model ID 1–3: descriptor-based local models; Model ID a–c: fingerprint-based local models; Model ID 4–6: descriptor-based global models; Model ID d–f: fingerprint-based global models. (A) The results of the test set of local models. (B) The results of 10-fold cross validation of local models. (C) The results of the test set of global models. (D) The results of 10-fold cross validation of global models.
Fifty-four different models were built using the combination of 9 kinds of fingerprints and 6 kinds of machine learning methods. The performance of three better fingerprint-based models, namely GraphFP-ANN, PubchemFP-RF and EStateFP-ANN, corresponding to Model ID a–c, is shown in Fig. 5A. The sensitivity values of these three models were around 0.820 to 0.870, the AUC values were more than 0.830, and the overall predictive accuracy values were around 0.820 to 0.860. The top ten fingerprint-based models are summarized in Table S4.†
We employed the 10-fold cross validation method to evaluate the robustness of models. The complete results of cross validation are shown in Tables S3 and S4,† and the results of the six better local models are shown in Fig. 5B. The AUC values were more than 0.850 for the three better descriptor-based models, and around 0.830 to 0.850 for the three better fingerprint-based models.
To further explore the predictive accuracy of these local models, two external validation sets on Palaemonidae and Penaeidae were used separately. The detailed results are shown in Table 4. The AUC and SE values of the descriptor-based models were better than those of the fingerprint-based ones for both external validation sets. All of these indicated that these local models with high predictive performance could provide useful tools for the prediction of chemical aquatic toxicity in the environmental hazard assessment.
Table 4. The results of external validation for local and global models (EV1, external validation 1; EV2, external validation 2).
| Model type | Model | EV1 Q | EV1 SE | EV1 SP | EV1 AUC | EV2 Q | EV2 SE | EV2 SP | EV2 AUC |
| Local models | MD-SVM | 0.827 | 0.937 | 0.548 | 0.897 | 0.797 | 0.893 | 0.500 | 0.840 |
| | MD-RF | 0.818 | 0.899 | 0.613 | 0.857 | 0.851 | 0.893 | 0.722 | 0.825 |
| | MD-ANN | 0.818 | 0.899 | 0.613 | 0.897 | 0.838 | 0.911 | 0.611 | 0.845 |
| | GraphFP-ANN | 0.700 | 0.823 | 0.387 | 0.748 | 0.797 | 0.857 | 0.611 | 0.822 |
| | PubchemFP-RF | 0.755 | 0.899 | 0.387 | 0.810 | 0.784 | 0.893 | 0.444 | 0.667 |
| | EStateFP-ANN | 0.736 | 0.810 | 0.548 | 0.847 | 0.797 | 0.911 | 0.444 | 0.811 |
| Global models | MD-SVM | 0.794 | 1.000 | 0.500 | 0.846 | 0.753 | 0.889 | 0.583 | 0.815 |
| | MD-NB | 0.794 | 0.950 | 0.571 | 0.925 | 0.728 | 0.800 | 0.639 | 0.859 |
| | MD-ANN | 0.824 | 0.950 | 0.643 | 0.864 | 0.753 | 0.844 | 0.639 | 0.828 |
| | GraphFP-ANN | 0.824 | 1.000 | 0.571 | 0.907 | 0.741 | 0.933 | 0.500 | 0.765 |
| | SubFP-ANN | 0.824 | 0.900 | 0.714 | 0.900 | 0.765 | 0.867 | 0.639 | 0.840 |
| | ExtFP-ANN | 0.735 | 1.000 | 0.357 | 0.850 | 0.778 | 0.867 | 0.667 | 0.817 |
Performance of global models
Similarly to local models, both descriptor-based and fingerprint-based global models were built using the 6 machine learning methods.
The performance of the three better descriptor-based models is shown in Fig. 5C. They are MD-SVM, MD-NB and MD-ANN, corresponding to Model ID 4–6. The sensitivity values of these three models were around 0.840 to 0.900, the AUC values were around 0.900 and the overall predictive accuracy values were more than 0.850. The performance of all descriptor-based classification models is summarized in Table S5.†
Fifty-four fingerprint-based global models were also built using the combination of 9 kinds of fingerprints and 6 kinds of machine learning methods. The performance of the three better fingerprint-based models, namely GraphFP-ANN, SubFP-ANN, and ExtFP-ANN, corresponding to Model ID d–f, is shown in Fig. 5C. The sensitivity values of these three models were around 0.900, the AUC values were around 0.850 to 0.920, and the overall predictive accuracy values were more than 0.850. The top ten fingerprint-based models are summarized in Table S6.†
The 10-fold cross validation method was used to evaluate the model robustness. The complete results of the 10-fold cross validation are shown in Tables S5 and S6,† and the results of the six better global models are shown in Fig. 5D. The AUC values were more than 0.840 for these six better models.
To further explore the predictive accuracy of these global models, two external validation sets on Amphipoda and Harpacticoida were used separately. The detailed results are also shown in Table 4. The AUC values of the descriptor-based models were comparable to those of fingerprint-based ones for both external validation sets, while the SE values of the fingerprint-based models were better than those of the descriptor-based ones. These global models with high predictive performance could provide useful tools to predict chemical aquatic toxicity in the environmental hazard assessment.
Analysis of the applicability domain
The applicability domains of the six better local models and six better global models were further analyzed. The detailed results are shown in Table 5. For local models, compounds outside the applicability domain (OD) had AUC values of 0.619 to 0.747, while compounds within the applicability domain (ID) had AUC values of 0.885 to 0.990. For global models, OD compounds had AUC values of 0.701 to 0.868, whereas ID compounds had AUC values of 0.889 to 0.980. The specific ID and OD chemicals in the test set are given in Table S7.† These results indicate that applying the applicability domain clearly improved the predictive accuracy of the models, although the improvement came at the expense of lower chemical diversity.
Table 5. Performance of in domain (ID) and out of domain (OD) chemicals in the test set for local and global models after applying domain assessment.
| Model type | Model | ID Q | ID SE | ID SP | ID AUC | OD Q | OD SE | OD SP | OD AUC |
| Local models | MD-SVM | 0.938 | 0.951 | 0.857 | 0.965 | 0.724 | 0.810 | 0.500 | 0.708 |
| | MD-RF | 0.896 | 0.902 | 0.857 | 0.990 | 0.759 | 0.857 | 0.500 | 0.685 |
| | MD-ANN | 0.938 | 0.951 | 0.857 | 0.958 | 0.724 | 0.810 | 0.500 | 0.685 |
| | GraphFP-ANN | 0.896 | 0.878 | 1.000 | 0.976 | 0.690 | 0.714 | 0.625 | 0.619 |
| | PubchemFP-RF | 0.958 | 0.951 | 1.000 | 0.979 | 0.690 | 0.714 | 0.625 | 0.714 |
| | EStateFP-ANN | 0.875 | 0.878 | 0.857 | 0.885 | 0.724 | 0.857 | 0.375 | 0.747 |
| Global models | MD-SVM | 0.946 | 0.957 | 0.900 | 0.963 | 0.692 | 0.778 | 0.500 | 0.701 |
| | MD-NB | 0.911 | 0.891 | 1.000 | 0.980 | 0.731 | 0.722 | 0.750 | 0.771 |
| | MD-ANN | 0.911 | 0.935 | 0.800 | 0.980 | 0.731 | 0.778 | 0.625 | 0.799 |
| | GraphFP-ANN | 0.929 | 0.935 | 0.900 | 0.928 | 0.808 | 0.833 | 0.750 | 0.854 |
| | SubFP-ANN | 0.893 | 0.913 | 0.800 | 0.948 | 0.846 | 0.833 | 0.875 | 0.868 |
| | ExtFP-ANN | 0.875 | 0.891 | 0.800 | 0.889 | 0.808 | 0.944 | 0.500 | 0.778 |
Comparison with ECOSAR
The ECOSAR Class Program is a computerized predictive system that estimates aquatic toxicity, and it uses mysid shrimp as the surrogate species for saltwater invertebrates.50 To compare ECOSAR with our local models, we used it to predict chemical aquatic toxicity for mysid shrimp. The 77 compounds from the test set of the local models were imported into ECOSAR. We obtained results for 54 compounds; the others fell outside the applicability domain of ECOSAR and could not be predicted. For a fair comparison, we imported the same 54 compounds into our better local models, and the detailed results are shown in Table 6. Furthermore, we matched the 77 molecules against the training set of the global models and removed the duplicates, which gave a mysid shrimp data set of 72 molecules. To compare ECOSAR with our global models, we imported these 72 unique molecules into ECOSAR and obtained toxicity predictions on mysid shrimp for 52 of them. Again for a fair comparison, we imported the same 52 molecules into our better global models, and the detailed results are shown in Table 6.
Table 6. Comparison of performance with ECOSAR for local and global models.
| Model type | Model | Q | SE | SP | AUC |
| Local models | MD-SVM | 0.852 | 0.927 | 0.615 | 0.826 |
| | MD-RF | 0.852 | 0.927 | 0.615 | 0.878 |
| | MD-ANN | 0.852 | 0.927 | 0.615 | 0.826 |
| | GraphFP-ANN | 0.833 | 0.854 | 0.769 | 0.818 |
| | PubchemFP-RF | 0.870 | 0.902 | 0.769 | 0.874 |
| | EStateFP-ANN | 0.870 | 0.951 | 0.615 | 0.871 |
| | ECOSAR | 0.796 | 0.854 | 0.615 | |
| Global models | MD-SVM | 0.846 | 0.923 | 0.615 | 0.830 |
| | MD-NB | 0.865 | 0.923 | 0.692 | 0.862 |
| | MD-ANN | 0.846 | 0.923 | 0.615 | 0.797 |
| | GraphFP-ANN | 0.788 | 0.897 | 0.462 | 0.700 |
| | SubFP-ANN | 0.846 | 0.897 | 0.692 | 0.825 |
| | ExtFP-ANN | 0.846 | 0.923 | 0.615 | 0.834 |
| | ECOSAR | 0.808 | 0.897 | 0.539 | |
In terms of overall accuracy (Q) on the shared compounds, all descriptor-based models outperformed ECOSAR, as did every fingerprint-based model except the global GraphFP-ANN model (Q = 0.788 versus 0.808 for ECOSAR).
Discussion
Analysis of selective toxicity chemicals
We found a total of 10 chemicals with selective toxicity toward different species; their structures are shown in Fig. 2. For example, compound a is toxic to Palaemonidae but nontoxic to Mysidae. Compound a is tetrachlorophenol, which has a high Kow and therefore tends to partition into soil rather than remain dissolved in water. Most Mysidae are planktonic crustaceans, so they have less contact with fat-soluble chemicals, which usually deposit in the soil. Most Palaemonidae are benthic and feed on decaying material, so they are more frequently exposed to such contaminated environments. Conversely, compound b is toxic to Mysidae and nontoxic to Palaemonidae. Compound b is a sulfonic acid derivative with a low Kow; it is more soluble in water, so it may come into more contact with free-swimming organisms. Therefore, differences in habitat may be a main reason why chemicals show different toxicity toward different organisms.
Comparison of different machine learning methods
In this study, we employed six machine learning methods to develop highly predictive local and global binary models for chemical aquatic toxicity prediction.
The results of cross-validation, the test sets and external validation showed that most of the models performed reliably. Comparing the six methods, SVM, RF and ANN performed better overall in the local descriptor-based models, while SVM, NB and ANN performed better overall in the global descriptor-based models. ANN performed well in both local and global fingerprint-based models.
For local models, descriptor-based models outperformed fingerprint-based ones. For global models, fingerprint-based models were comparable to the descriptor-based ones. The main reason for this phenomenon might be the different sizes of the training sets. The number of molecules in the training set of global models was larger than that of the local models. When the volume of data in the training set was not large enough, full descriptors were competitive against molecular fingerprints to describe the chemicals. Otherwise, fingerprints were more competitive.
Relevance of key chemical descriptors to aquatic toxicity
To identify compounds that are potentially toxic, we investigated the correlation between the toxic chemicals and some descriptors. The distributions of toxic chemicals are presented in Fig. 6.
Fig. 6. The distributions of 6 physicochemical descriptors, including XLogP, ALogP, SHBd, maxHBint2, maxwHBa, and AATSC1s for toxic chemicals.
XLogP and AlogP were generated by different algorithms. As shown in Fig. 6, we found that the distributions of both XLogP and AlogP were similar to a “bell shaped curve”, and the LogP values of the majority of toxic compounds were distributed between –1 and 8. In addition, the mean value of XLogP was 3.72 and 1.63 for toxic chemicals and non-toxic chemicals, respectively, with a p-value of 1.42 × 10–9, and the mean value of ALogP was 1.12 and –0.08 for toxic and non-toxic chemicals, with a p-value of 2.46 × 10–8. This indicated that the toxic chemicals tended to be more hydrophobic than non-toxic.
SHBd, maxHBint2, and maxwHBa describe the electropological state (E-state) in different aspects. E-state is a way of representing atoms and fragments. It uses intrinsic state values to describe the different states of atoms in different fields.51 SHBd, maxHBint2, and maxwHBa were used as a representation of hydrogen binding ability. Actually, polarity or hydrogen bonding has always been considered to have a great connection with lipophilicity.52 SHBd represents the sum of E-states for (strong) hydrogen bond donors. As shown in Fig. 6, the SHBd values of the majority of toxic molecules were distributed between 0 and 1.40, so we speculated that the molecules in this range tended to be more hydrophobic. In addition, the mean value of SHBd was 0.25 and 0.74 for toxic and non-toxic compounds, separately, and the p-value is 2.64 × 10–8. It indicated that these toxic compounds tended to have a lower SHBd value. maxHBint2 refers to the maximum E-state descriptors of strength for the potential hydrogen bonds of path length 2. The maxHBint2 values of the majority of toxic molecules were distributed between 0 and 1, so we speculated that the molecules in this range tended to be more hydrophobic. In addition, the mean value of maxHBint2 was 1.02 and 3.58 for toxic and non-toxic molecules, separately, and the p-value is 2.80 × 10–8. It indicated that these toxic molecules tended to have a lower maxHBint2 value. maxwHBa represents the maximum E-states for weak hydrogen bond acceptors. The maxwHBa values of the majority of toxic molecules were distributed between 0 and 2.5, so we speculated that the molecules in this range tended to be more hydrophobic. In addition, the mean value of maxwHBa was 1.86 and 1.15 for toxic and non-toxic molecules, separately, and the p-value is 3.62 × 10–9. It indicated that these toxic molecules tended to have a higher maxwHBa value.
AATSC1s is an autocorrelation descriptor. Autocorrelation descriptors are topological descriptors that encode chemical structures together with their properties; their values are defined by atomic properties such as hydrophobicity, electronegativity, and volume.53 AATSC1s is the average centered Broto-Moreau autocorrelation of lag 1, weighted by I-state. As shown in Fig. 6, the AATSC1s values of most toxic molecules fell between −0.6 and 0.2, so we speculated that molecules in this range tended to be more hydrophobic. The mean AATSC1s was −0.14 for toxic and −0.40 for non-toxic compounds (p = 1.43 × 10⁻⁵), indicating that toxic compounds tended to have higher AATSC1s values.
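As an illustration of what an average centered autocorrelation measures, the sketch below computes it for a toy hydrogen-suppressed graph. The atomic weights are centered on their molecular mean, multiplied over all atom pairs separated by the given lag, and averaged over those pairs. The function name `aatsc`, the toy weights, and the heavy-atom-only graph are assumptions for illustration; conventions of the descriptor software (for example, the treatment of hydrogens) are not reproduced.

```python
from collections import deque


def bond_distances(adjacency):
    """All-pairs topological distances (number of bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in adjacency[a]:
                if dist[start][b] is None:
                    dist[start][b] = dist[start][a] + 1
                    queue.append(b)
    return dist


def aatsc(weights, adjacency, lag):
    """Average centered Broto-Moreau autocorrelation at a given lag.

    Sums (w_i - mean) * (w_j - mean) over atom pairs whose topological
    distance equals `lag`, then divides by the number of such pairs.
    """
    n = len(weights)
    mean_w = sum(weights) / n
    centered = [w - mean_w for w in weights]
    dist = bond_distances(adjacency)
    pairs = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist[i][j] == lag
    ]
    if not pairs:
        return 0.0  # no atom pairs at this lag
    return sum(centered[i] * centered[j] for i, j in pairs) / len(pairs)


# Toy input (assumption): three-atom chain with intrinsic-state-like weights.
toy_weights = [2.0, 1.5, 6.0]
toy_bonds = [[1], [0, 2], [1]]
aatsc_lag1 = aatsc(toy_weights, toy_bonds, lag=1)
print(round(aatsc_lag1, 4))
```

A negative value means that bonded atoms tend to lie on opposite sides of the molecular mean of the weighting property, whereas a value near zero or above means neighbouring atoms carry similar weights.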
All the p-values reported above suggested that the descriptor distributions of toxic and non-toxic chemicals were significantly different. Therefore, it was appropriate to select these descriptors to depict the physicochemical properties of the compounds. Moreover, modifying the functional groups of a small molecule shifts these physicochemical properties, which can guide the design of greener compounds. However, chemical aquatic toxicity is a complex process that cannot be explained fully by a handful of descriptors. Many other factors, including chemical, biological, and environmental circumstances, should also be considered.
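The kind of comparison behind such p-values can be sketched with a two-sided permutation test on the difference in group means. This section does not state which statistical test was used, so the permutation test here is a stand-in chosen because it needs no distributional assumptions; the function name and the descriptor values are hypothetical and are not data from this study.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Labels are shuffled repeatedly; the p-value is the fraction of shuffles
    whose absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical logP-like values for illustration only.
toxic_values = [3.1, 4.2, 5.0, 2.8, 3.9, 4.6, 3.3, 4.1]
non_toxic_values = [1.2, 0.4, 2.1, 1.8, 0.9, 1.5, 2.4, 1.1]
p_value = permutation_p_value(toxic_values, non_toxic_values)
print(p_value)
```

With real descriptor columns, the two lists would be replaced by the values for the toxic and non-toxic classes; a small p-value indicates that a mean difference this large is unlikely under random class labels.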
Comparison with others’ work
Acute toxicity is one of the most important endpoints in eco-toxicological risk assessment, so there are many studies on aquatic acute toxicity prediction. Singh et al. used decision tree boost (DTB) and decision tree forest (DTF) methods to develop QSAR models for algae, daphnia, fish, and marine bacteria.11 The DTB-QSAR models yielded an R² of 0.793 for the test set (P. subcapitata) and 0.575 to 0.672 for the other four external validation species, while the DTF-QSAR models yielded an R² of 0.753 for the test set and 0.605 to 0.689 for the other four species. Belanger et al. developed QSAR models of amine oxide toxicity for algae (Desmodesmus subspicatus), an invertebrate (Daphnia magna), and fish (Danio rerio),54 with R² values of around 0.920 to 0.980. Gramatica et al. developed externally validated QSAR models of acute PCP toxicity for algae (Pseudokirchneriella subcapitata), a crustacean (Daphnia magna), and fish (Pimephales promelas).55 Although these models performed well for different aquatic toxicity endpoints, few classification models exist for saltwater crustaceans. In this study, we developed local models using one species and global models using different species for saltwater crustacean toxicity prediction. We also considered a more diverse set of organic chemicals, including pesticides and industrial chemicals.
Conclusions
Here, we first constructed a database covering different marine crustaceans, with most of the data coming from mysid shrimp. We then developed local models using the Mysidae data and found that the better local models could also be applied to other marine crustaceans. We therefore developed global models using data from different marine crustaceans. The better local and global models outperformed ECOSAR. The applicability domain (AD) analysis confirmed that applying the AD improved the prediction accuracy of both local and global models. In addition, we found that molecular descriptors were more suitable for modeling small data sets, while fingerprints were more favorable for modeling bigger data sets. Finally, we analyzed the relationships between chemical aquatic toxicity and several physicochemical descriptors to interpret the models and identify potentially toxic compounds.
We also analyzed some chemicals with selective toxicity and speculated that differences in habitat could be a main reason why these chemicals fall into different toxicity categories for different organisms. The better local and global models will be integrated into our web server admetSAR, which is freely available at http://lmmd.ecust.edu.cn/admetsar2/. Our study fills a gap in acute aquatic toxicity research on saltwater crustaceans. Both the local and global models developed here will provide critical information and useful tools for predicting chemical aquatic toxicity in saltwater environmental hazard assessment.
Conflicts of interest
There are no conflicts of interest to declare.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant 81872800).
Footnotes
†Electronic supplementary information (ESI) available. See DOI: 10.1039/c8tx00331a
References
- Rohr J. R., Schotthoefer A. M., Raffel T. R., Carrick H. J., Halstead N., Hoverman J. T., Johnson C. M., Johnson L. B., Lieske C., Piwoni M. D., Schoff P. K., Beasley V. R. Nature. 2008;455:1235–1239. doi: 10.1038/nature07281.
- Planson A. G., Carbonell P., Paillard E., Pollet N., Faulon J. L. Biotechnol. Bioeng. 2012;109:846–850. doi: 10.1002/bit.24356.
- Worth A. P., Netzeva T. I., Patlewicz G. Risk Assess. Chem. 2007:427–465.
- Lilienblum W., Dekant W., Foth H., Gebel T., Hengstler J. G., Kahl R., Kramer P. J., Schweinfurth H., Wollin K. M. Arch. Toxicol. 2008;82:211–236. doi: 10.1007/s00204-008-0279-9.
- Rorije E., Aldenberg T., Buist H., Kroese D., Schuurmann G. Regul. Toxicol. Pharmacol. 2013;67:146–156. doi: 10.1016/j.yrtph.2013.06.003.
- Vermeire T., van de Bovenkamp M., de Bruin Y. B., Delmaar C., van Engelen J., Escher S., Marquart H., Meijster T. Regul. Toxicol. Pharmacol. 2010;58:408–420. doi: 10.1016/j.yrtph.2010.08.007.
- El Mahdi A. M. and Aziz H. A., in Toxicity and Biodegradation Testing, 2018, ch. 18, pp. 349–388. doi: 10.1007/978-1-4939-7425-2_18.
- von der Ohe P. C., Kühne R., Ebert R. U., Altenburger R., Liess M., Schüürmann G. Chem. Res. Toxicol. 2005;18:536–555. doi: 10.1021/tx0497954.
- Verslycke T., Ghekiere A., Raimondo S., Janssen C. Ecotoxicology. 2007;16:205–219. doi: 10.1007/s10646-006-0122-0.
- Singh K. P., Gupta S., Rai P. Ecotoxicol. Environ. Saf. 2013;95:221–233. doi: 10.1016/j.ecoenv.2013.05.017.
- Singh K. P., Gupta S., Kumar A., Mohan D. Chem. Res. Toxicol. 2014;27:741–753. doi: 10.1021/tx400371w.
- Singh K. P., Gupta S., Basant N. RSC Adv. 2014;4:64443–64456.
- Singh K. P., Gupta S., Basant N. Chemosphere. 2015;120:680–689. doi: 10.1016/j.chemosphere.2014.10.025.
- Basant N., Gupta S., Singh K. P. Chemosphere. 2015;139:246–255. doi: 10.1016/j.chemosphere.2015.06.063.
- Basant N., Gupta S., Singh K. P. J. Chem. Inf. Model. 2015;55:1337–1348. doi: 10.1021/acs.jcim.5b00139.
- Basant N., Gupta S., Singh K. P. Toxicol. Res. 2016;5:340–353. doi: 10.1039/c5tx00321k.
- United States Environmental Protection Agency (EPA), https://www.epa.gov/pesticide-science-and-assessing-pesticide-risks/technical-overview-ecological-risk-assessment-0.
- Oberdörster E., Cheek A. O. Environ. Toxicol. Chem. 2001;20:23–36.
- Dale V. H., Beyeler S. C. Ecol. Indic. 2002;1:3–10.
- Nimmo D. R., Hamaker T. L. Hydrobiologia. 1982;93:171–178.
- Verslycke T. A., Fockedey N., Jr M. K. C., Roast S. D., Jones M. B., Mees J., Janssen C. R. Environ. Toxicol. Chem. 2004;23:1219–1234. doi: 10.1897/03-332.
- Mauchline J., Murano M. J. Tokyo Univ. Fish. 1977;64:39–88.
- Lussier S. M., Kuhn A., Comeleo R. Environ. Toxicol. Chem. 1999;18:2888–2893.
- Roast S. D., Thompson R. S., Donkin P., Widdows J., Jones M. B. Water Res. 1999;33:319–326.
- Harmon V. L., Langdon C. J. Environ. Toxicol. Chem. 2010;15:1824–1830.
- ECOTOX Database, https://cfpub.epa.gov/ecotox/.
- Judson R., Richard A., Dix D., Houck K., Elloumi F., Martin M., Cathey T., Transue T. R., Spencer R., Wolf M. Toxicol. Appl. Pharmacol. 2008;233:7–13. doi: 10.1016/j.taap.2007.12.037.
- Wang Y., Xiao J., Suzek T. O., Zhang J., Wang J., Bryant S. H. Nucleic Acids Res. 2009;37:W623–W633. doi: 10.1093/nar/gkp456.
- Yap C. W. J. Comput. Chem. 2011;32:1466–1474. doi: 10.1002/jcc.21707.
- Dong J., Cao D. S., Miao H. Y., Liu S., Deng B. C., Yun Y. H., Wang N. N., Lu A. P., Zeng W. B., Chen A. F. J. Cheminf. 2015;7:60. doi: 10.1186/s13321-015-0109-z.
- Cao D. S., Liang Y. Z., Yan J., Tan G. S., Xu Q. S., Liu S. J. Chem. Inf. Model. 2013;53:3086–3096. doi: 10.1021/ci400127q.
- Cao D. S., Xu Q. S., Hu Q. N., Liang Y. Z. Bioinformatics. 2013;29:1092–1094. doi: 10.1093/bioinformatics/btt105.
- Dong J., Yao Z. J., Zhang L., Luo F., Lin Q., Lu A. P., Chen A. F., Cao D. S. J. Cheminf. 2018;10:16. doi: 10.1186/s13321-018-0270-2.
- Yeh C. H. Chemom. Intell. Lab. Syst. 1991;12:95–96.
- Cheng F., Yu Y., Shen J., Yang L., Li W., Liu G., Lee P. W., Tang Y. J. Chem. Inf. Model. 2011;51:996–1011. doi: 10.1021/ci200028n.
- Shen J., Du Y., Zhao Y., Liu G., Tang Y. QSAR Comb. Sci. 2008;27:704–717.
- Klekota J., Roth F. P. Bioinformatics. 2008;24:2518–2525. doi: 10.1093/bioinformatics/btn479.
- Sonquist J. A. and Morgan J. N., The Detection of Interaction Effects: A Report on a Computer Program for the Selection of Optimal Combinations of Explanatory Variables, In-house reproduction, 1964.
- Breiman L. Mach. Learn. 2001;45:5–32.
- Watson P. J. Chem. Inf. Model. 2008;48:166–178. doi: 10.1021/ci7003253.
- Cover T., Hart P. IEEE Trans. Inf. Theory. 1967;13:21–27.
- Ma L., Destercke S., Wang Y. Pattern Recognit. 2016;52:33–45.
- Cortes C., Vapnik V. Mach. Learn. 1995;20:273–297.
- Basheer I. A., Hajmeer M. J. Microbiol. Methods. 2000;43:3–31. doi: 10.1016/s0167-7012(00)00201-3.
- Demšar J., Curk T., Erjavec A., Gorup Č., Hočevar T., Milutinovič M., Možina M., Polajnar M., Toplak M., Starič A. J. Mach. Learn. Res. 2013;14:2349–2353.
- Chang C. C., Lin C. J. ACM Trans. Intell. Syst. Technol. 2011;2:1–27.
- Baldi P., Brunak S., Andersen Y. C., Nielsen H. Bioinformatics. 2000;16:412–424. doi: 10.1093/bioinformatics/16.5.412.
- Cheng F., Ikenaga Y., Zhou Y., Yu Y., Li W., Shen J., Du Z., Chen L., Xu C., Liu G., Lee P. W., Tang Y. J. Chem. Inf. Model. 2012;52:655–669. doi: 10.1021/ci200622d.
- Tropsha A., Golbraikh A. Curr. Pharm. Des. 2007;13:3494–3504. doi: 10.2174/138161207782794257.
- Ecological Structure Activity Relationships (ECOSAR), https://www.epa.gov/tsca-screening-tools/ecological-structure-activity-relationships-ecosar-predictive-model.
- Kier L. B., Hall L. H. Il Farmaco. 1999;54:346–353.
- Winiwarter S., Ridderström M., Ungell A. L., Andersson T. B., Zamora I. Compr. Med. Chem. II. 2007;15:531–554.
- Hollas B. J. Math. Chem. 2003;33:91–101.
- Belanger S. E., Brill J. L., Rawlings J. M., McDonough K. M., Zoller A. C., Wehmeyer K. R. Ecotoxicol. Environ. Saf. 2016;134P1:95–105. doi: 10.1016/j.ecoenv.2016.08.023.
- Gramatica P., Cassani S., Sangion A. Green Chem. 2016;18:4393–4406.