Abstract

The widespread use of chemical products inevitably brings many side effects as environmental pollutants. Toxicological assessment of compounds to aquatic life plays an important role in protecting the environment from their hazards. However, in vivo animal testing approaches for aquatic toxicity evaluation are time-consuming, expensive, and ethically limited, especially when there are a great number of compounds. In silico modeling methods can effectively improve the toxicity evaluation efficiency and save costs. Here, we present a web-based server, AquaticTox, which incorporates a series of ensemble models to predict acute toxicity of organic compounds in aquatic organisms, covering Oncorhynchus mykiss, Pimephales promelas, Daphnia magna, Pseudokirchneriella subcapitata, and Tetrahymena pyriformis. The predictive models are built through ensemble learning algorithms based on six base learners. These ensemble models outperform all corresponding single models, achieving area under the curve (AUC) scores of 0.75–0.92. Compared to the best single models, the average precisions of the ensemble models have been increased by 12–22%. Additionally, a self-built knowledge base of the structure-aquatic toxic mode of action (MOA) relationship was integrated into AquaticTox for toxicity mechanism analysis. Hopefully, the user-friendly tool (https://chemyang.ccnu.edu.cn/ccb/server/AquaticTox) could facilitate the identification of aquatic toxic chemicals and the design of green molecules.
Keywords: ecotoxicity, aquatic toxicity, water environment protection, toxicity prediction, structure−toxicity relationship, deep learning
1. Introduction
Chemicals play an indispensable role in enhancing the quality of human life. The global industry is producing an increasing number of chemicals to meet daily human needs, covering medications, pesticides, cosmetics, and more. According to the registered CAS numbers, more than 100,000 chemicals had been produced as early as 2008. However, the deluge of chemicals will likely cause unexpected negative impacts on human health and the environment. The rise in the incidence of many human diseases, such as cancers, neurodegenerative disorders, infertility, birth defects, and diabetes, has been shown to be associated with overexposure to chemical pesticides.1,2 Examples of environmental problems induced by chemical toxicities include reduction of biodiversity in farmland ecosystems,3 water environment pollution,4 and soil degradation.5 In the context of growing public concerns about the hazards of chemicals, it is essential to discover safer compounds as novel green chemical products to promote sustainable development.
Aquatic ecosystems are highly vulnerable to chemical contamination. Any small change in a complicated aquatic food web has a high potential to affect multiple trophic levels and disrupt the entire ecological balance. From the perspective of the toxicity mechanism, many chemicals, such as pesticides, are developed as toxic chemical agents against specific target species. Nevertheless, it is also commonplace for chemicals to unintentionally cause harmful effects on nontarget species.6 Historically, the deterioration of water quality in many agricultural and urban areas has been reported to be induced by the toxicity of active ingredients or metabolites of chemical products to aquatic organisms.7−9 Surveys from some regions of India have revealed the presence of high levels of pesticides in freshwater ecosystems and bottled drinking water.10 Throughout the Pacific, pesticides such as chlorpyrifos and carbaryl have been detected in many water bodies where salmon were endangered, reducing salmon populations by restraining feeding behavior.11 To decrease the negative impact of chemicals on aquatic ecosystems and water quality, it is more meaningful to make chemicals benign and nontoxic by design rather than treatment after pollution. Hence, assessment of potential hazards for chemicals plays an important role in protecting human health and the environment from their undesirable side effects.12,13
Toxicity assessments of chemicals toward aquatic organisms could intuitively reflect their impacts on the aquatic environment. Aquatic toxicity data for candidate compounds are required by relevant regulatory authorities, such as the US Environmental Protection Agency, for the registration of chemical pesticides.14,15 However, with the increasing number of synthesized compounds during chemical discovery, extensive implementation of in vivo animal testing is challenging and often infeasible due to its high costs and ethical limits. Computational toxicology methods can effectively alleviate dependence on animal experiments and facilitate hazard identification, and their use has been encouraged by many public authorities such as the European Union.16−18 A number of in silico methods for predicting chemical aquatic toxicity have been developed during recent decades.19,20 For example, the Japanese National Institute for Environmental Studies developed two models for predicting acute toxicity in Oryzias latipes and Daphnia magna (D. magna), assuming a linear correlation between log P and aquatic toxicity.21 Mazzatorta et al. proposed a hierarchical model to predict pesticide aquatic toxicity using seven molecular descriptors.22 Ding et al. used the k-nearest neighbor algorithm to develop binary classification models of chronic toxicity in D. magna and Pseudokirchneriella subcapitata (P. subcapitata).23 However, the current models for aquatic toxicity prediction still have some limitations. For example, applicability domains were relatively limited, and most of the prediction models could only predict toxicity for one species.24 In addition, the data sets were sometimes narrow, which easily led to overfitting of the models.25 The improper selection of training and test sets also often caused unreliable prediction performance.26,27 Further, most of these models have not been developed into effective and easy-to-use tools to support the molecular design of green chemicals.
It is an appealing goal to present a set of robust predictive models for different aquatic species to guide the prioritization of toxicity testing and the design of safer compounds. Herein, we used the ensemble learning technique to study and predict a range of aquatic toxicities for chemical entities. The noise and variance are the major sources of error, especially when the number of chemicals in the data set was limited. The ensemble learning combines the advantages of different ML algorithms, which could help to minimize these error-causing factors, thereby ensuring the accuracy and stability of the prediction model.28,29 A stacked ensemble of six machine or deep learning methods was implemented in which our previously developed graph attention convolutional neural network (GACNN) model weighted most heavily. The ensemble models were designed to probabilistically classify chemical toxicities in five aquatic species, including Oncorhynchus mykiss (O. mykiss), Pimephales promelas (P. promelas), D. magna, P. subcapitata, and Tetrahymena pyriformis (T. pyriformis). For any of the five species, the ensemble models outperformed their single models, with higher area under the curve (AUC) scores of 0.75–0.92 and average precisions of 0.66–0.89. The independent pesticide data set was used for external validation, and 78.18% of aquatic-friendly and 71.11% of unfriendly compounds were successfully identified. Additionally, we summarized chemical structures with known aquatic toxicity modes of action (MOA) and built a knowledge database for further toxicity mechanism analysis. These ensemble models and databases were integrated into the web-based server AquaticTox (https://chemyang.ccnu.edu.cn/ccb/server/AquaticTox/), providing an efficient platform for identification of harmful compounds in aquatic environment and screening of safer molecules.
2. Materials and Methods
2.1. Toxicity Data Preparation
The data sets comprise chemical structures and experimental acute aquatic toxicity. Five data sets of different sizes were collected for model training (available in Tables S1–S5), including the 96 h O. mykiss (rainbow trout) LC50 data set with 1,060 molecules, 96 h P. promelas (fathead minnow) LC50 data set with 1,181 molecules, 48 h D. magna EC50 data set with 1,264 molecules, 72 h P. subcapitata EC50 data set with 527 molecules, and 48 h T. pyriformis EC50 data set with 1,418 molecules. Most of the toxicity data were derived from the ECOTOX database,30 ChemIDplus database (https://chem.nlm.nih.gov/chemidplus/), Chemical Toxicity Database (https://www.drugfuture.com/toxic/), EnviroTox database,31 and the work of Cheng et al. on collecting T. pyriformis toxicity data.32 The testing set of 100 pesticides (available in Tables S6–S10) was manually curated from the literature and the U.S. Environmental Protection Agency’s Pesticide Ecotoxicity Database (https://ecotox.ipmcenters.org/) for external validation. The chemical space of the toxicity data sets was analyzed by applying Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction to Morgan fingerprints.
2.2. Chemical Representation
Chemical compounds were represented by a structure graph and a series of molecular descriptors. A graph consists of a set of nodes representing the constituent atoms and a set of edges connecting those nodes and denoting the chemical bonding (Figure 1a). The characteristic information on each atom was encoded with a feature vector $x_i \in \mathbb{R}^{D}$, obtained by calculating important atom descriptors using RDKit. The feature vectors of all $N$ atoms constitute an atom feature matrix $X \in \mathbb{R}^{N \times D}$. The topological information in a graph was encoded in an atomic adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $a_{ij} = 1$ if atoms $i$ and $j$ were connected and $a_{ij} = 0$ if not. Further, a weight matrix $W \in \mathbb{R}^{D \times C}$ was used to determine the weight of each neighbor node, which was obtained by incorporating an attention mechanism in our models. Finally, the three matrices were multiplied to give $Y = AXW \in \mathbb{R}^{N \times C}$, which describes the structure graph in each update. Additionally, a total of 1,875 molecular descriptors were calculated using PaDEL,33 including electron topological state descriptors, autocorrelation descriptors, topological descriptors, radial distribution function (RDF) descriptors, constitutional descriptors, molecular property descriptors, etc.
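The matrix product behind one graph update can be illustrated with a minimal sketch. It assumes a toy three-atom chain, made-up atom features, and a fixed weight matrix; in the models described here the weights come from the attention mechanism, which is not reproduced in this example.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]

# Adjacency matrix A (N x N) for a toy chain of N = 3 atoms: 0-1-2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Atom feature matrix X (N x D, D = 2); hypothetical descriptor values.
X = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
# Weight matrix W (D x C, C = 2); fixed here as an assumption,
# whereas the paper obtains it through attention.
W = [
    [0.5, 0.0],
    [0.0, 2.0],
]

# One update: each atom aggregates its neighbors' features, then W mixes them.
Y = matmul(matmul(A, X), W)  # shape N x C
print(Y)
```

The shapes are the only arrangement in which the three matrices are conformable: (N x N)(N x D)(D x C) gives an N x C result.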
Figure 1.
Overall architecture of our proposed ensemble models for aquatic toxicity prediction. (a) The workflow of a molecular graph to encode molecular features combined with attention mechanism and convolution operation. (b) The workflow of our ensemble method for aquatic toxicity prediction. Molecular descriptors and graphs were used for training Random Forest, AdaBoost, Gradient Boosting, Support Vector Machine, FCNet, and Graph Attention Convolutional Neural Network. A stacked ensemble of six models with different weights was implemented.
2.3. Machine or Deep Learning Methods
A total of six different machine or deep learning methods were used to build models, namely, random forest (RF),34 AdaBoost,35 gradient boosting,36 support vector machine (SVM),37 FCNet,38 and GACNN.39 The RF, AdaBoost, and gradient boosting approaches were chosen due to their insensitivity to parameters and robustness to redundant features. The SVM method was used due to its generally good performance for classification. FCNet and GACNN are two convolutional neural network methods, and they were reported to have excellent discriminative power on important local features. Each toxicity data set was randomly divided into five subsets. For each learning method, the fitting procedure was repeated five times; in each repetition, four subsets were used for model training and the remaining one was used for model testing. The six single models were trained individually as baselines in our performance comparisons. They also served as the base learners behind the ensemble models.
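The five-subset procedure described above can be sketched as follows. This is only an illustration using the Python standard library; the function name, the seed, and the way indices are dealt into folds are assumptions rather than details taken from the study.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal subsets.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# 1,060 is the size of the O. mykiss data set reported in Section 2.1.
splits = list(five_fold_splits(1060))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every compound appears in exactly one test subset, so the five repetitions together yield one held-out prediction per compound.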
The RF, AdaBoost, gradient boosting, and SVM models were implemented using the scikit-learn package.40 The modeling framework of FCNet was derived from the GitHub repository shared by Raiz et al.38 The GACNN model was developed previously by our team using TensorFlow. The number of estimators was set to 400 for RF, 400 for AdaBoost, and 400 for gradient boosting. For SVM, the radial basis function (RBF) kernel was used. For FCNet and GACNN, the Leaky Rectified Linear Unit (Leaky ReLU) activation function, the binary cross-entropy loss function, and batch normalization were adopted. In FCNet, a total of five convolution layers were constructed, and the number of filters for each layer was set to 32, 64, 96, 64, and 64, respectively. In GACNN, 128 filters were used for the two convolution layers. The learning rate and the number of epochs were 0.001 and 300, respectively, for both FCNet and GACNN. The learning rate was optimized for all of the models. All other hyperparameters took the package defaults.
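For orientation, the four scikit-learn base learners with the settings stated above might be instantiated roughly as follows. The dictionary name and the `probability=True` flag are assumptions added for illustration; the text only specifies the number of estimators and the RBF kernel.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.svm import SVC

# Settings taken from the text: 400 estimators for the three tree ensembles,
# RBF kernel for the SVM; everything else is left at the library defaults.
base_learners = {
    "RF": RandomForestClassifier(n_estimators=400),
    "AdaBoost": AdaBoostClassifier(n_estimators=400),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=400),
    # probability=True is an assumption so the SVM can output class probabilities.
    "SVM": SVC(kernel="rbf", probability=True),
}

for name, model in base_learners.items():
    print(name, model.get_params().get("n_estimators", "n/a"))
```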
The stacking method was chosen to ensemble the six base learners and a multiple linear regression algorithm was used to generate a meta-learner.41 The meta-learner was trained based on 5-fold cross-validation predictions of the six base models, and an appropriate weight for each base learner was fitted. In the ensemble model, the final predictions were generated by averaging the multiplied products between each base learner’s prediction and its corresponding fixed weight (Figure 1b).
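The stacking step can be sketched with NumPy under stated assumptions: the out-of-fold predictions below are synthetic stand-ins for the six base learners, the noise levels are invented, and the meta-learner is an ordinary least-squares fit with an intercept, so the final score is a weighted sum rather than the exact combination rule used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_learners = 200, 6

# Synthetic binary labels (1 = toxic) and noisy out-of-fold probabilities,
# one column per base learner; the noise levels are hypothetical.
y = rng.integers(0, 2, size=n_samples).astype(float)
noise_levels = np.array([0.15, 0.25, 0.25, 0.30, 0.30, 0.35])
oof = np.clip(
    y[:, None] + rng.normal(0.0, noise_levels, size=(n_samples, n_learners)),
    0.0,
    1.0,
)

# Meta-learner: multiple linear regression fitted by least squares.
design = np.column_stack([np.ones(n_samples), oof])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, weights = coef[0], coef[1:]

def ensemble_predict(base_probs):
    """Combine base-learner probabilities (n x 6) using the fixed fitted weights."""
    return intercept + base_probs @ weights

scores = ensemble_predict(oof)
print(np.round(weights, 3))
```

Because the weights are fitted on held-out predictions, a base learner that generalizes better tends to receive a larger weight, which matches the role the text describes for the fixed per-learner weights.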
2.4. Ensemble Learning for Aquatic Toxicity Prediction
Ensemble learning is one of the most popular ML algorithms for improving predictive performance and has been successfully applied in many fields, such as cancer prediction42 and air quality prediction.43 Ensemble modeling refers to a general metaprocess, in which the predictions output from other machine or deep learning models are learned, and these single models are intelligently combined. The full utilization of multiple models contributes to the advantages of an ensemble model in minimizing modeling approach bias, reducing generalization error, and decreasing the possibility of overfitting. Together with experimental data on a particular aquatic toxicity of interest, ensemble learning can be used to develop high-performance predictive models based on a series of distinctive base models. Such an improved capability of predicting toxicity will likely make the ensemble model a better computational tool for hazard identification in a water environment.
In predictive modeling, molecular graphs are an alternative to molecular descriptors for representing chemical molecules.44 Previously, we introduced undirected graphs (UGs) and multihead attention mechanism into convolutional neural networks and proposed a GACNN model for predicting chemical toxicity in a honeybee.39 It was shown that the GACNN model achieved a 4.6% higher AUC value than did the DNN model for bee toxicity prediction. Using such a structural graph (Figure 1a) positively contributes to improving the predictive model’s performance. Given the successful application of the GACNN model in honeybee toxicity classification, we incorporated GACNN into a group of base models for ensemble modeling to predict particular aquatic toxicity.
We proposed a set of stacked ensemble models for predicting chemical toxicity in five aquatic species including O. mykiss, P. promelas, D. magna, P. subcapitata, and T. pyriformis. The six models (GACNN, RF, AdaBoost, gradient boosting, SVM, and FCNet) were used as base learners for ensemble learning. Except for the GACNN model, the other five base models were trained based on general chemical properties represented by molecular descriptors. It can be observed that the GACNN base model consistently had the highest weight in all five ensemble models. The workflow of the ensemble models for aquatic toxicity prediction is shown in Figure 1b.
2.5. Performance Evaluation
To evaluate the discriminative ability of the prediction models, the receiver operating characteristic (ROC) curves and precision-recall (PR) curves were plotted by iterative adjustment of the probability threshold. The areas under the ROC curves (AUC) were calculated to measure the overall performance of the models, and the areas under the PR curves were calculated to compare the models’ average precision (AP). The ROC curve is a plot of sensitivity vs false positive rate (1-specificity), while the PR curve is a plot of precision vs sensitivity (recall).
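Both summary metrics can be computed directly from ranked scores, as in the minimal sketch below. It uses the rank formulation of AUC, which equals the area under the ROC curve, and a step-wise AP; the example labels and scores are invented, and tied scores are assumed absent in the AP calculation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count toxic/nontoxic pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (no tied scores assumed)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank  # precision at the threshold recalling this positive
    return total / n_pos

# Invented example: 1 = toxic, 0 = nontoxic.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores), average_precision(labels, scores))
```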
Sensitivity (SE = TP/(TP + FN)), also known as recall, indicates how likely a toxic compound is to be correctly detected within a group of toxic chemicals, while specificity (SP = TN/(TN + FP)) is the corresponding probability for nontoxic chemicals. Precision (PR = TP/(TP + FP)) represents how often a chemical predicted to be toxic is actually toxic.
A true positive (TP) is a test case where a toxic chemical is classified correctly, while a true negative (TN) is an outcome where a nontoxic one is classified correctly. A false positive (FP) is a test case where a nontoxic chemical is predicted incorrectly as toxic. In contrast, a false negative (FN) refers to a case where a toxic chemical is classified incorrectly as nontoxic.
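The three threshold-dependent metrics follow directly from the four confusion-matrix counts. The sketch below is illustrative only; the function name and the example counts are assumptions, with toxic treated as the positive class.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Sensitivity, specificity, and precision with 'toxic' as the positive class.

    tp: toxic predicted toxic       fn: toxic predicted nontoxic
    tn: nontoxic predicted nontoxic fp: nontoxic predicted toxic
    """
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PR": tp / (tp + fp),
    }

# Hypothetical test set: 10 toxic chemicals (8 detected), 10 nontoxic (7 detected).
result = metrics_from_counts(tp=8, fn=2, tn=7, fp=3)
print(result)
```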
2.6. Aquatic Toxicity Mode of Action Assignment
A knowledge base was established to analyze the aquatic toxicity MOA underlying the chemical structure, by collecting and summarizing reported MOA assignments from published literature.45−48 In the knowledge base, four structure-based classification schemes were selected to assign aquatic toxicity MOAs for chemicals, including Assessment Tool for Evaluating Risk (ASTER),49 MOAtox,45 Verhaar,50 and Russom.46 ASTER was designed based on structural alerts developed by the US EPA, encompassing 31 specific MOAs. MOAtox is a database developed by Barron et al. for providing MOA categories reported in aquatic invertebrates and fish, covering six broad and 30 specific MOA categories. The Verhaar predictive method classified chemicals into five categories according to general toxicological responses. The Russom classification rule was built based on quantitative relationships between acute toxicity and toxic behavioral response in fathead minnow, involving eight MOA categories. There is overlap among the four classification frameworks (Table S11).
2.7. Website Design and Implementation
The five ensemble models of acute toxicity classification for compounds and the structure-aquatic MOA knowledge base are integrated into a web-based server, AquaticTox (https://chemyang.ccnu.edu.cn/ccb/server/AquaticTox/). On the website, these five models run automatically once a chemical structure is given, and the five predictions are shown directly on the webpage. The platform was built on an Apache HTTP server running on a CentOS system. The website was built using JavaScript, HTML, CSS, and PHP. The platform is compatible with most common web browsers (e.g., Google Chrome, Mozilla Firefox, and Microsoft Edge).
3. Results
3.1. Categorization of Toxicity Data
The quality of the acute toxicity data sets as training data for model development was analyzed. As defined by the US Environmental Protection Agency (EPA),51 the chemicals in the five data sets were divided into five categories based on their reported toxicity values under specific end points (Figure 2a). The chemicals categorized as very highly toxic, highly toxic, or moderately toxic were merged into toxic positive samples. In contrast, those categorized as slightly toxic or practically nontoxic were merged into nontoxic negative samples. The chemical space of toxic and nontoxic subsets is visualized in Figure 2b through UMAP analysis. The chemicals in each toxicity data set are distributed fairly uniformly overall without excessive isolated clusters, indicating the overall structural diversity of these data sets. According to the numbers of chemicals in different toxicity categories (Figure 2c), the toxic and nontoxic subsets in each of the five data sets are imbalanced. The distribution of acute toxicity values is shown in Figure 2d, suggesting that most chemicals in each data set span approximately 2 orders of magnitude. There are several outliers in toxicity values; removing those chemicals was tried during model training but had no significant effect on the model performance. Accordingly, the five data sets contain diverse structures and cover an unbalanced proportion between toxic and nontoxic compounds.
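The category merging described above can be sketched as a small labeling function. The numeric cutoffs are an assumption: they are the acute aquatic toxicity classes commonly published by the US EPA (in mg/L) and are not restated in this text, so they should be checked against Figure 2a before use.

```python
def epa_category(value_mg_per_l):
    """Map an acute LC50/EC50 (mg/L) to an assumed EPA aquatic toxicity class."""
    if value_mg_per_l < 0.1:
        return "very highly toxic"
    if value_mg_per_l <= 1:
        return "highly toxic"
    if value_mg_per_l <= 10:
        return "moderately toxic"
    if value_mg_per_l <= 100:
        return "slightly toxic"
    return "practically nontoxic"

# The three most toxic classes are merged into the positive (toxic) label.
TOXIC_CLASSES = {"very highly toxic", "highly toxic", "moderately toxic"}

def binary_label(value_mg_per_l):
    """Return 1 for toxic (positive) samples and 0 for nontoxic (negative) ones."""
    return 1 if epa_category(value_mg_per_l) in TOXIC_CLASSES else 0

for value in (0.05, 0.5, 5.0, 50.0, 250.0):
    print(value, epa_category(value), binary_label(value))
```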
Figure 2.
Classification criteria and chemical distribution of the toxicity data sets for O. mykiss, P. promelas, D. magna, P. subcapitata, and T. pyriformis. (a) Classification criteria of acute toxicity for aquatic organisms according to the U.S. EPA. (b) The chemical space distribution of toxic and nontoxic chemicals in each of the five data sets through UMAP dimension reduction based on molecular fingerprints. (c) The number of very highly toxic, highly toxic, moderately toxic, slightly toxic, and practically nontoxic chemicals in the five data sets. (d) Distribution of LC50 or EC50 values for compounds in the five data sets.
3.2. Model Performance and Comparative Analysis
The predictive performances of single and ensemble models in each species group to discriminate toxic compounds were investigated and compared. The AP and AUC values of different models on 5-fold cross-validation are shown in Figure 3. The ROC and PR curves of all models are provided in Figures S1–S5. The five ensemble models achieve an AP range of 0.66–0.89 and an AUC range of 0.75–0.92 to classify aquatic toxicity for chemical entities. In most cases, the GACNN models perform better than the other single models, indicating their important role as a base model for ensemble learning. In terms of comprehensive AP and AUC, ensemble models always perform the best in any of the five model groups, indicating the effectiveness of the ensemble method for aquatic toxicity prediction. In most practical application scenarios, the capabilities of these models to identify toxic categories are of greater concern, as measured by AP. Although our ensemble models did not always significantly improve AUC values, ensemble learning greatly improved AP compared to any single model within the same group. Compared to the GACNN, the ensemble algorithm improved the performance to predict these aquatic toxicities, with an increase in AP of 12–22% and an average increase in AUC value of 1.4%. Overall, our ensemble models are suitable for identifying harmful compounds in the aquatic environment, even with imbalanced data sets.
Figure 3.
Average precision (AP) values and area under the ROC curve (AUC) values of single or ensemble models when predicting five different aquatic toxicities.
3.3. External Validation and Case Studies of Pesticides
Pesticide residues are a class of water contaminants of emerging concern, posing serious risks to the ecosystem and human health by poisoning aquatic life.52−54 A series of pesticides were collected to independently investigate the performance of our five ensemble models as external cases. The ensemble model for predicting O. mykiss was tested on 20 agrochemicals, including 10 toxic chemicals and 10 nontoxic chemicals classified by their experimental acute toxicity (Table S1). As a result, the O. mykiss ensemble model had an overall accuracy of 75%, an accuracy of 80.00% for predicting toxic chemicals, and an accuracy of 70.00% for predicting nontoxic chemicals. The ensemble model for predicting P. promelas toxicity was tested on 9 experimentally toxic and 11 nontoxic agrochemicals (Table S2). It could achieve an overall accuracy of 80.00%, an accuracy of 77.78% for predicting toxic chemicals, and an accuracy of 81.82% for predicting nontoxic chemicals. Further, the ensemble models for D. magna, P. subcapitata, and T. pyriformis could reach accuracies of 76.00%, 70.59%, and 72.22%, respectively, tested on 25, 17, and 18 independent chemicals (Tables S3–S5). The detailed performance of the five models in this external validation was summarized in Figure 4a. Based on a total of 100 external cases, the five ensemble models achieved an average accuracy of 75.00% for overall prediction, 71.11% for predicting the toxic category, and 78.18% for predicting the nontoxic category. Hence, our ensemble models are effective for the aquatic toxicity prediction of pesticides.
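The pooled figure of 75.00% can be checked from the per-species numbers reported above. In the sketch below the counts of correct predictions are back-calculated by rounding, which is an inference from the reported percentages rather than data taken from the Supporting Information.

```python
# (species, number of external chemicals, reported overall accuracy in %)
reported = [
    ("O. mykiss", 20, 75.00),
    ("P. promelas", 20, 80.00),
    ("D. magna", 25, 76.00),
    ("P. subcapitata", 17, 70.59),
    ("T. pyriformis", 18, 72.22),
]

# Back-calculate how many chemicals each model classified correctly.
correct = {name: round(n * acc / 100) for name, n, acc in reported}
total = sum(n for _, n, _ in reported)
pooled_accuracy = 100 * sum(correct.values()) / total

print(correct)
print(total, pooled_accuracy)
```

Pooling over all 100 cases weights each species by its number of test chemicals, which is why the result differs slightly from the unweighted mean of the five percentages.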
Figure 4.
Validation results of the ensemble models aimed at pesticides and case studies of pesticides using them. (a) External validation results of the ensemble models for aquatic toxicity prediction. (b) Two cases in which the aquatic toxicities of benquitrione and flubeneteram were studied using the ensemble models.
To further test the practicability of the ensemble models, two candidate compounds discovered during our team’s pesticide development research were chosen as representative cases for studying aquatic toxicities (Figure 4b). Quinotrione is the active ingredient of a herbicide designed by our team for sorghum fields.55 Using the ensemble models, quinotrione was predicted to be nontoxic to O. mykiss, nontoxic to D. magna, and toxic to P. subcapitata. Taken together, there was a high possibility that quinotrione was aquatic-friendly. This inference corresponded with the experimental results that quinotrione was indeed nontoxic to all three species. Flubeneteram is a candidate fungicide to control many fungal diseases, such as rice sheath blight disease.56 It was first synthesized in our laboratory, and we predictively assessed its ecotoxicity. According to the outcomes of our ensemble models, flubeneteram was toxic to O. mykiss, D. magna, and P. subcapitata, which was verified through toxicological experiments. Although there is no guarantee that the predictions of these ensemble models are always correct, their use facilitates the discovery of green chemicals. They can rapidly screen for ecofriendly molecules and indicate where structural optimization is needed to obtain safer molecules.
3.4. Analysis of the Aquatic Toxicity Mode of Action Based on Structures
The relationship between chemical structure and aquatic toxicity was further analyzed based on an understanding of toxicological mechanisms. In toxicology, MOA refers to the adverse responses of an organism identified from the primary mechanism of toxicity initiated at the receptor level, or the effects at the cellular or organ level.57 However, for millions of chemicals put on the market, little information is available other than their structures. Hence, we investigated the aquatic toxicity MOAs of various chemicals based on their structures, attempting to provide a pipeline to understand aquatic toxicity mechanisms. From this investigation, a knowledge base of chemical aquatic toxicity MOAs was established. The known chemical aquatic toxicity MOAs assigned by one of four structure-based classification methodologies (Table S6) were collected from references. Currently, this knowledge base compiles a total of 5,749 aquatic toxicity MOA records for 3,482 chemicals (available in Table S7), covering 981 chemicals recorded with ASTER MOAs, 1,208 chemicals recorded with MOAtox MOAs, 3,286 chemicals recorded with Verhaar MOAs, and 274 chemicals recorded with Russom MOAs.
The MOA categories assigned by the different classification schemes are analogous for most of the chemicals in the knowledge base. For example, azinphos-methyl is classified as exhibiting an MOA of acetylcholinesterase inhibition according to the ASTER, MOAtox, and Russom schemes. It is classified as Class 4 (compounds and groups of compounds acting by a specific mechanism) according to the Verhaar scheme. Based on our collected records, some general patterns of chemical MOAs were found. For instance, aliphatic alkanes, alkenes, alcohols, ethers, ketones, and halobenzenes generally have nonpolar narcosis or baseline toxicity, while phenol or aniline compounds have polar narcosis toxicity. Another instance is that halocarbonyl compounds, isocyanates, and polarized alkenes are common chemicals that exhibit reactivity toxicity. Furthermore, structurally similar chemicals or chemicals containing the same specific substructures tend to exhibit similar MOA classes. Hence, based on the rule of structure similarity and substructure matching, compounds outside the knowledge base can be predictively assigned MOAs through structural mapping with chemicals in the knowledge base (Figure 5). To facilitate an understanding of aquatic toxicity based on toxicological mechanisms, the MOA prediction function is also integrated into the AquaticTox Web server.
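Similarity-based MOA assignment can be illustrated with a nearest-neighbor lookup. Everything specific in this sketch is an assumption: the Tanimoto coefficient, the 0.5 cutoff, and the integer "fingerprint bits" are placeholders, since the text does not state which similarity measure or threshold the server uses. Only the MOA labels follow the patterns described above.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

# Toy knowledge base: (name, hypothetical fingerprint bits, MOA label).
KNOWLEDGE_BASE = [
    ("azinphos-methyl", frozenset({1, 4, 7, 9, 12}), "acetylcholinesterase inhibition"),
    ("phenol", frozenset({2, 3, 5}), "polar narcosis"),
    ("hexane", frozenset({6, 8}), "nonpolar narcosis"),
]

def assign_moa(query_bits, threshold=0.5):
    """Return (MOA, matched chemical, similarity), or (None, None, similarity)."""
    name, bits, moa = max(KNOWLEDGE_BASE, key=lambda entry: tanimoto(query_bits, entry[1]))
    similarity = tanimoto(query_bits, bits)
    if similarity >= threshold:
        return moa, name, similarity
    return None, None, similarity

# A query sharing three of its four bits with phenol inherits phenol's MOA.
print(assign_moa(frozenset({2, 3, 5, 10})))
# A query with no overlap stays unassigned.
print(assign_moa(frozenset({20, 21})))
```

Returning no assignment below the cutoff keeps the lookup from extrapolating to chemicals that resemble nothing in the knowledge base.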
Figure 5.
Workflow for predicting the aquatic toxicity mode of action (MOA). A knowledge base was established to provide the basis for analyzing mode of action from chemical structures, in which 5,749 aquatic toxicity MOA records for 3,482 chemicals were summarized. Four different MOA classification schemes were collected, including ASTER (an assessment tool for evaluating risk), MOAtox (a comprehensive MOA and acute aquatic toxicity database), the scheme proposed by Verhaar et al., and the scheme proposed by Russom et al. Based on substructure mapping and structure similarity, the chemicals not in the base were assigned their MOAs.
3.5. Website Usage
AquaticTox is a web-based server that is designed to provide the public with an interface to predict the toxicities of compounds in O. mykiss, P. promelas, D. magna, P. subcapitata, and T. pyriformis using our ensemble models and to query aquatic MOA of compounds. The prediction task is allowed in the Prediction Tab. As shown in Figure 6a, if a chemical is inputted by drawing its 2D structure or uploading its structure file in SDF/PDB/MOL2 format, the five classified aquatic toxicities and MOA categories of this chemical will be output as predictive results in tens of seconds. Additionally, in the Toxicity Data tab, all chemical toxicity data used in our modeling are available to browse and download for free by selecting a particular toxicity level (Figure 6b). This website is generally very user-friendly as an aquatic toxicity prediction server. It can complete the task rapidly, which could help guide decision-making in the compound discovery process.
Figure 6.
Screenshots of the AquaticTox webpages. (a) The input page for submitting a prediction and the result page of this prediction. (b) The browser page of aquatic acute toxicity data.
4. Conclusion
In silico prediction of chemical aquatic toxicity can facilitate hazardous compound identification and ecological risk assessment in an efficient, low-cost, and ethical manner, thereby playing an important role in discovering green small molecules. Here, we developed a set of ensemble models to classify aquatic acute toxicity for compounds using a stacked ensemble of GACNN, RF, AdaBoost, gradient boosting, SVM, and FCNet. Five aquatic acute toxicity end points were designed, covering O. mykiss, P. promelas, D. magna, P. subcapitata, and T. pyriformis. These ensemble models outperformed all corresponding single models, with AUC scores of 0.75–0.92 and AP values of 0.66–0.89. Compared to the best single models, our ensemble models presented an average 1.4% improvement in AUC value and 12–22% improvement in AP. Tested on independent cases of pesticides, the five ensemble models could achieve an average accuracy of 75.00% for overall prediction. Understanding toxicological mechanisms contributes to increasing the assessment efficiency and improving toxicity extrapolations through chemical grouping. In this work, we also collected 5,749 aquatic toxicity MOA records for 3,482 chemicals from the published literature and integrated them into a knowledge base. Based on it, the MOAs of other chemicals are allowed to be further predictively assigned through structure mapping rules. Modern industry requires the design and production of environmentally friendly chemicals. These ensemble models provide a rapid and easy approach for the early screening of safe chemicals to the aquatic environment.
To facilitate public access to our ensemble models and MOA knowledge base, we established a web-based integrated platform, AquaticTox. On the website, users can predictively classify aquatic toxicity and MOA categories for chemicals of interest in seconds. Admittedly, our ensemble models still have some limitations, including the unsatisfactory performance for P. subcapitata toxicity prediction, the limited performance improvement compared to the best single models, and the exclusion of aquatic chronic toxicity prediction. In future work, we will aim to overcome these deficiencies. As a useful tool for predicting aquatic toxicity, AquaticTox could assist chemists, toxicologists, and environmentalists in prioritizing chemicals for toxicity testing and designing environmentally safer molecules, thereby guiding decision-making in the chemical discovery process.
Acknowledgments
This work was supported by the National Key Research and Development Program of China (2023YFD1700500), the National Natural Science Foundation of China (21907036), and the Postdoctoral Fellowship Program of CPSF (No. GZB20230198).
Data Availability Statement
Data will be made available on request.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/envhealth.4c00014.
ROCs and PR curves of models for different aquatic species, validation set for different prediction models, and the knowledge base of known MOA (PDF)
Author Contributions
Xing-Xing Shi: Conceptualization, Methodology, Software, Formal analysis, Writing - original draft. Zhi-Zheng Wang: Resources, Visualization, Writing - review and editing. Yu-Liang Wang: Investigation, Data curation. Fan Wang: Supervision, Project administration, Funding acquisition. Guang-Fu Yang: Supervision, Project administration, Funding acquisition.
The authors declare no competing financial interest.
Special Issue
Published as part of Environment & Health virtual special issue “Artificial Intelligence and Machine Learning for Environmental Health”.
References
- Mostafalou S.; Abdollahi M. Pesticides: an update of human exposure and toxicity. Arch. Toxicol. 2017, 91 (2), 549–599. 10.1007/s00204-016-1849-x.
- Ebenstein A. The Consequences of Industrialization: Evidence from Water Pollution and Digestive Cancers in China. Review of Economics and Statistics 2012, 94 (1), 186–201. 10.1162/REST_a_00150.
- Beketov M. A.; Kefford B. J.; Schäfer R. B.; Liess M. Pesticides reduce regional biodiversity of stream invertebrates. Proc. Natl. Acad. Sci. U.S.A. 2013, 110 (27), 11039–11043. 10.1073/pnas.1305618110.
- Syafrudin M.; Kristanti R. A.; Yuniarto A.; Hadibarata T.; Rhee J.; Al-onazi W. A.; Algarni T. S.; Almarri A. H.; Al-Mohaimeed A. M. Pesticides in Drinking Water—A Review. Int. J. Env. Res. Public Health 2021, 18 (2), 468. 10.3390/ijerph18020468.
- Bhattacharyya R.; Ghosh B. N.; Mishra P. K.; Mandal B.; Rao C. S.; Sarkar D.; Das K.; Anil K. S.; Lalitha M.; Hati K. M.; et al. Soil Degradation in India: Challenges and Potential Solutions. Sustainability 2015, 7 (4), 3528–3570. 10.3390/su7043528.
- Sharma A.; Kumar V.; Shahzad B.; Tanveer M.; Sidhu G. P. S.; Handa N.; Kohli S. K.; Yadav P.; Bali A. S.; Parihar R. D.; et al. Worldwide pesticide usage and its impacts on ecosystem. SN Applied Sciences 2019, 1 (11), 1446. 10.1007/s42452-019-1485-1.
- Moran K.; Anderson B.; Phillips B.; Luo Y.; Singhasemanon N.; Breuer R.; Tadesse D. Water Quality Impairments Due to Aquatic Life Pesticide Toxicity: Prevention and Mitigation in California, USA. Environ. Toxicol. Chem. 2020, 39 (5), 953–966. 10.1002/etc.4699.
- Relyea R. A. The Impact of Insecticides and Herbicides on the Biodiversity and Productivity of Aquatic Communities. Ecol. Appl. 2005, 15 (2), 618–627. 10.1890/03-5342.
- Wolfram J.; Stehle S.; Bub S.; Petschick L. L.; Schulz R. Insecticide Risk in US Surface Waters: Drivers and Spatiotemporal Modeling. Environ. Sci. Technol. 2019, 53 (20), 12071–12080. 10.1021/acs.est.9b04285.
- Agrawal A.; Pandey R. S.; Sharma B. Water Pollution with Special Reference to Pesticide Contamination in India. Journal of Water Resource and Protection 2010, 02 (05), 432. 10.4236/jwarp.2010.25050.
- Macneale K. H.; Spromberg J. A.; Baldwin D. H.; Scholz N. L. A Modeled Comparison of Direct and Food Web-Mediated Impacts of Common Pesticides on Pacific Salmon. PLoS One 2014, 9 (3), e92436. 10.1371/journal.pone.0092436.
- Topping C. J.; Aldrich A.; Berny P. Overhaul environmental risk assessment for pesticides. Science 2020, 367 (6476), 360–363. 10.1126/science.aay1144.
- Vijver M. G.; Hunting E. R.; Nederstigt T. A. P.; Tamis W. L. M.; van den Brink P. J.; van Bodegom P. M. Postregistration monitoring of pesticides is urgently required to protect ecosystems. Environ. Toxicol. Chem. 2017, 36 (4), 860–865. 10.1002/etc.3721.
- Ceger P.; Allen D.; Blankinship A.; Choksi N.; Daniel A.; Eckel W. P.; Hamm J.; Harwood D. E.; Johnson T.; Kleinstreuer N.; et al. Evaluation of the fish acute toxicity test for pesticide registration. Regul. Toxicol. Pharmacol. 2023, 139, 105340. 10.1016/j.yrtph.2023.105340.
- Krewski D.; Acosta Jr D.; Andersen M.; Anderson H.; Bailar J. C.; Boekelheide K.; Brent R.; Charnley G.; Cheung V. G.; Green Jr S.; et al. Toxicity Testing in the 21st Century: A Vision and a Strategy. Journal of Toxicology and Environmental Health, Part B 2010, 13 (2–4), 51–138. 10.1080/10937404.2010.483176.
- Krewski D.; Acosta D.; Andersen M.; Anderson H.; Bailar J. C.; Boekelheide K.; Brent R.; Charnley G.; Cheung V. G.; Green S.; et al. Toxicity Testing in the 21st Century: A Vision and a Strategy. Journal of Toxicology and Environmental Health, Part B 2010, 13 (2–4), 51–138. 10.1080/10937404.2010.483176.
- Berg N.; De Wever B.; Fuchs H. W.; Gaca M.; Krul C.; Roggen E. L. Toxicology in the 21st century - Working our way towards a visionary reality. Toxicol. in Vitro 2011, 25 (4), 874–881. 10.1016/j.tiv.2011.02.008.
- Lillicrap A.; Belanger S.; Burden N.; Pasquier D. D.; Embry M. R.; Halder M.; Lampi M. A.; Lee L.; Norberg-King T.; Rattner B. A.; et al. Alternative approaches to vertebrate ecotoxicity tests in the 21st century: A review of developments over the last 2 decades and current status. Environ. Toxicol. Chem. 2016, 35 (11), 2637–2646. 10.1002/etc.3603.
- Cronin M. T. D.; Jaworska J. S.; Walker J. D.; Comber M. H. I.; Watts C. D.; Worth A. P. Use of QSARs in international decision-making frameworks to predict health effects of chemical substances. Environ. Health Perspect. 2003, 111 (10), 1391–1401. 10.1289/ehp.5760.
- Wang Y.-L.; Li J.-Y.; Shi X.-X.; Wang Z.; Hao G.-F.; Yang G.-F. Web-Based Quantitative Structure-Activity Relationship Resources Facilitate Effective Drug Discovery. Top. Curr. Chem. 2021, 379 (6), 37. 10.1007/s41061-021-00349-3.
- Furuhama A.; Toida T.; Nishikawa N.; Aoki Y.; Yoshioka Y.; Shiraishi H. Development of an ecotoxicity QSAR model for the KAshinhou Tool for Ecotoxicity (KATE) system, March 2009 version. SAR QSAR Environ. Res. 2010, 21 (5–6), 403–413. 10.1080/1062936X.2010.501815.
- Mazzatorta P.; Smiesko M.; Lo Piparo E.; Benfenati E. QSAR Model for Predicting Pesticide Aquatic Toxicity. J. Chem. Inf. Model. 2005, 45 (6), 1767–1774. 10.1021/ci050247l.
- Ding F.; Wang Z.; Yang X.; Shi L.; Liu J.; Chen G. Development of classification models for predicting chronic toxicity of chemicals to Daphnia magna and Pseudokirchneriella subcapitata. SAR QSAR Environ. Res. 2019, 30 (1), 39–50. 10.1080/1062936X.2018.1545694.
- Gadaleta D.; Mangiatordi G. F.; Catto M.; Carotti A.; Nicolotti O. Applicability Domain for QSAR Models: Where Theory Meets Reality. International Journal of Quantitative Structure-Property Relationships (IJQSPR) 2016, 1 (1), 45–63. 10.4018/IJQSPR.2016010102.
- Tetko I. V.; Sushko I.; Pandey A. K.; Zhu H.; Tropsha A.; Papa E.; Öberg T.; Todeschini R.; Fourches D.; Varnek A. Critical Assessment of QSAR Models of Environmental Toxicity against Tetrahymena pyriformis: Focusing on Applicability Domain and Overfitting by Variable Selection. J. Chem. Inf. Model. 2008, 48 (9), 1733–1746. 10.1021/ci800151m.
- Young D.; Martin T.; Venkatapathy R.; Harten P. Are the Chemical Structures in Your QSAR Correct? QSAR Comb. Sci. 2008, 27 (11–12), 1337–1345. 10.1002/qsar.200810084.
- Martin T. M.; Harten P.; Young D. M.; Muratov E. N.; Golbraikh A.; Zhu H.; Tropsha A. Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling? J. Chem. Inf. Model. 2012, 52 (10), 2570–2578. 10.1021/ci300338w.
- Dong X.; Yu Z.; Cao W.; Shi Y.; Ma Q. A survey on ensemble learning. Frontiers of Computer Science 2020, 14 (2), 241–258. 10.1007/s11704-019-8208-z.
- Sagi O.; Rokach L. Ensemble learning: A survey. WIREs Data Mining and Knowledge Discovery 2018, 8 (4), e1249. 10.1002/widm.1249.
- Olker J. H.; Elonen C. M.; Pilli A.; Anderson A.; Kinziger B.; Erickson S.; Skopinski M.; Pomplun A.; LaLone C. A.; Russom C. L.; et al. The ECOTOXicology Knowledgebase: A Curated Database of Ecologically Relevant Toxicity Tests to Support Environmental Research and Risk Assessment. Environ. Toxicol. Chem. 2022, 41 (6), 1520–1539. 10.1002/etc.5324.
- Connors K. A.; Beasley A.; Barron M. G.; Belanger S. E.; Bonnell M.; Brill J. L.; de Zwart D.; Kienzler A.; Krailler J.; Otter R.; et al. Creation of a Curated Aquatic Toxicology Database: EnviroTox. Environ. Toxicol. Chem. 2019, 38 (5), 1062–1073. 10.1002/etc.4382.
- Cheng F.; Shen J.; Yu Y.; Li W.; Liu G.; Lee P. W.; Tang Y. In silico prediction of Tetrahymena pyriformis toxicity for diverse industrial chemicals with substructure pattern recognition and machine learning methods. Chemosphere 2011, 82 (11), 1636–1643. 10.1016/j.chemosphere.2010.11.043.
- Yap C. W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32 (7), 1466–1474. 10.1002/jcc.21707.
- Fawagreh K.; Gaber M. M.; Elyan E. Random forests: from early developments to recent advancements. Systems Science & Control Engineering 2014, 2 (1), 602–609. 10.1080/21642583.2014.956265.
- Schapire R. E. Explaining AdaBoost. In Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik; Schölkopf B., Luo Z., Vovk V., Eds.; Springer Berlin Heidelberg: 2013; pp 37–52.
- Natekin A.; Knoll A. Gradient boosting machines, a tutorial. Front. Neurorob. 2013, 7, 21. 10.3389/fnbot.2013.00021.
- Noble W. S. What is a support vector machine? Nat. Biotechnol. 2006, 24 (12), 1565–1567. 10.1038/nbt1206-1565.
- Riaz A.; Asad M.; Al-Arif S. M. M. R.; Alonso E.; Dima D.; Corr P.; Slabaugh G. FCNet: A Convolutional Neural Network for Calculating Functional Connectivity from Functional MRI. In Connectomics in NeuroImaging; Wu G., Laurienti P., Bonilha L., Munsell B. C., Eds.; Springer International Publishing: 2017; pp 70–78.
- Wang F.; Yang J.-F.; Wang M.-Y.; Jia C.-Y.; Shi X.-X.; Hao G.-F.; Yang G.-F. Graph attention convolutional neural network model for chemical poisoning of honey bees’ prediction. Sci. Bull. 2020, 65 (14), 1184–1191. 10.1016/j.scib.2020.04.006.
- Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 2011, 12, 2825–2830.
- Wolpert D. H. Stacked generalization. Neural Networks 1992, 5 (2), 241–259. 10.1016/S0893-6080(05)80023-1.
- Xiao Y.; Wu J.; Lin Z.; Zhao X. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. 2018, 153, 1–9. 10.1016/j.cmpb.2017.09.005.
- Wang J.; Song G. A Deep Spatial-Temporal Ensemble Model for Air Quality Prediction. Neurocomputing 2018, 314, 198–206. 10.1016/j.neucom.2018.06.049.
- David L.; Thakkar A.; Mercado R.; Engkvist O. Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminf. 2020, 12 (1), 56. 10.1186/s13321-020-00460-5.
- Barron M. G.; Lilavois C. R.; Martin T. M. MOAtox: A comprehensive mode of action and acute aquatic toxicity database for predictive model development. Aquat. Toxicol. 2015, 161, 102–107. 10.1016/j.aquatox.2015.02.001.
- Russom C. L.; Bradbury S. P.; Broderius S. J.; Hammermeister D. E.; Drummond R. A. Predicting modes of toxic action from chemical structure: Acute toxicity in the fathead minnow (Pimephales promelas). Environ. Toxicol. Chem. 1997, 16 (5), 948–967. 10.1002/etc.5620160514.
- Kienzler A.; Barron M. G.; Belanger S. E.; Beasley A.; Embry M. R. Mode of Action (MOA) Assignment Classifications for Ecotoxicology: An Evaluation of Approaches. Environ. Sci. Technol. 2017, 51 (17), 10203–10211. 10.1021/acs.est.7b02337.
- Kienzler A.; Connors K. A.; Bonnell M.; Barron M. G.; Beasley A.; Inglis C. G.; Norberg-King T. J.; Martin T.; Sanderson H.; Vallotton N.; et al. Mode of Action Classifications in the EnviroTox Database: Development and Implementation of a Consensus MOA Classification. Environ. Toxicol. Chem. 2019, 38 (10), 2294–2304. 10.1002/etc.4531.
- USEPA. ASTER User Guide: Assessment Tools for the Evaluation of Risk, Ver 2.0; US Environmental Protection Agency: Duluth, MN, 2012.
- Verhaar H. J. M.; Solbé J.; Speksnijder J.; van Leeuwen C. J.; Hermens J. L. M. Classifying environmental pollutants: Part 3. External validation of the classification system. Chemosphere 2000, 40 (8), 875–883. 10.1016/S0045-6535(99)00317-3.
- USEPA. Technical Overview of Ecological Risk Assessment - Analysis Phase: Ecological Effects Characterization. 2021. https://www.epa.gov/pesticide-science-and-assessing-pesticide-risks/technical-overview-ecological-risk-assessment-0 (accessed in December, 2023).
- Carvalho F. P. Pesticides, environment, and food safety. Food Energy Secur. 2017, 6 (2), 48–60. 10.1002/fes3.108.
- Martyniuk C. J.; Mehinto A. C.; Denslow N. D. Organochlorine pesticides: Agrochemicals with potent endocrine-disrupting properties in fish. Mol. Cell. Endocrinol. 2020, 507, 110764. 10.1016/j.mce.2020.110764.
- Shefali; Kumar R.; Sankhla M. S.; Kumar R.; Sonone S. S. Impact of pesticide toxicity in aquatic environment. Biointerface Research in Applied Chemistry 2021, 11 (3), 10131–10140. 10.33263/briac113.1013110140.
- Lin H.-Y.; Chen X.; Chen J.-N.; Wang D.-W.; Wu F.-X.; Lin S.-Y.; Zhan C.-G.; Wu J.-W.; Yang W.-C.; Yang G.-F. Crystal Structure of 4-Hydroxyphenylpyruvate Dioxygenase in Complex with Substrate Reveals a New Starting Point for Herbicide Discovery. Research 2019, 2019, 2602414. 10.34133/2019/2602414.
- Xiong L.; Li H.; Jiang L.-N.; Ge J.-M.; Yang W.-C.; Zhu X. L.; Yang G.-F. Structure-Based Discovery of Potential Fungicides as Succinate Ubiquinone Oxidoreductase Inhibitors. J. Agric. Food. Chem. 2017, 65 (5), 1021–1029. 10.1021/acs.jafc.6b05134.
- Keller D. A.; Juberg D. R.; Catlin N.; Farland W. H.; Hess F. G.; Wolf D. C.; Doerrer N. G. Identification and Characterization of Adverse Effects in 21st Century Toxicology. Toxicol. Sci. 2012, 126 (2), 291–297. 10.1093/toxsci/kfr350.