Abstract
Regulating chemicals to protect the environment based on ecotoxicological assessments is a major challenge. Experimental ecotoxicity tests are time-consuming and expensive, which underscores the need for accurate prediction methods. In this study, we conducted a comprehensive analysis of the application of machine learning and graph-based learning techniques to the ecotoxicological prediction of chemicals. A total of 161 models were constructed using a combination of three molecular representations (Morgan, MACCS, and Mol2vec), six machine learning algorithms (KNN, NB, RF, SVM, XGB, and DNN), and five graph neural networks (GAT, GCN, MPNN, Attentive FP, and FPGNN). In predicting the ecotoxicity of three aquatic taxonomic groups, fish, crustaceans, and algae, GCN achieved the best performance overall. In same-species predictions, GCN models achieved the highest values of the area under the ROC curve (AUC), ranging between 0.982 and 0.992. In cross-species predictions, GAT and GCN achieved the best and second-best performance, respectively; however, both models exhibited a reduction of approximately 17% in AUC when predicting the fish group with models trained on crustacean and algae data covering the same chemicals. Interestingly, for cross-species predictions on unseen chemicals, only DNN with the MACCS fingerprint performed well, yielding an AUC of 0.821. Our findings underscore the critical need to further advance computational prediction methods in order to accurately predict the ecotoxicity of chemicals across species. The ecotoxicology prediction web server for fish, algae, and crustaceans is accessible at https://app.cbbio.online/ecotoxicology/home.
1. Introduction
More than 350,000 chemicals and mixtures are currently registered on the market worldwide. These chemicals pose a constant threat to the environment, yet only a small proportion of them have been assessed ecotoxicologically. Ecotoxicology deals with the effects of anthropogenic chemicals on ecosystems at various biological levels, from molecules and cells to entire ecosystems. It aims to assess the risks posed by pollutants to environmental health and to develop strategies to reduce them. Aquatic ecosystems are often affected by chemical pollutants originating from industrial effluents, airborne deposition, and runoff from agricultural areas. Ultimately, ecotoxicology plays a crucial role in understanding and managing the complex interactions between chemicals resulting from human activities and the natural environment.
In the course of global industrialization, numerous chemicals have been used in a wide variety of areas, but only a small proportion of them have been subjected to comprehensive ecotoxicological assessment. Regulating chemicals to protect the environment based on ecotoxicological assessment is a major challenge, and experimental ecotoxicological tests are time-consuming and expensive. In these tests, various organisms, including aquatic invertebrates, fish, algae, and plants, are exposed to varying concentrations of chemicals, and toxicity end points such as mortality rate, growth inhibition, reproductive success, and biochemical responses are measured. For example, the EC50 value, the effective concentration that causes 50% inhibition of algal growth, serves as a key metric for assessing ecotoxicity to algae. These ecotoxicological experiments primarily focus on toxicity tests for individual species.
Over the last decade, the amount of chemical ecotoxicology data has increased tremendously due to advances in high-throughput techniques. The availability of published datasets has made computational screening methods a financially viable and efficient alternative for ecotoxicological assessment. It has been shown that in silico methods can reduce the duration and cost of environmental toxicity tests. In 2021, Rodrigues et al. applied several machine learning (ML) and deep learning (DL) models to chlorophyll fluorescence induction curves and achieved up to 97.65% accuracy in predicting the type of pollutant. However, few DL models have been applied to multi-species ecotoxicological data.
Recently, deep graph learning methods have been used for the property prediction of chemicals. Our team showed that the Graph Convolutional Network (GCN) is a superior method for predicting chemical toxicity when trained on a larger dataset via semi-supervised learning. To further assess the effectiveness of different deep graph learning methods for ecotoxicological prediction, we conducted a comparative study including the Graph Attention Network (GAT), GCN, Message Passing Neural Network (MPNN), Attentive FP, and Fingerprints and Graph Neural Network (FPGNN). Extensive empirical benchmarking confirmed that GCN also performs best in ecotoxicity prediction for single species. For cross-species prediction, although GNN algorithms show certain advantages, there is still a gap compared to their single-species performance. Therefore, further exploration of species-specific feature engineering and model selection is needed to enhance the performance of GNN algorithms in this field. Our study suggests that GNN-based in silico methods are a step forward in addressing challenges in chemical toxicity prediction.
2. Materials and Methods
2.1. Dataset
All datasets used in this experiment are from ADORE, a comprehensive and well-described dataset focusing on acute aquatic toxicity in three relevant taxonomic groups: fish, crustaceans, and algae. The core dataset contains information on ecotoxicological experiments and is enriched with phylogenetic and species-specific data for each species as well as chemical properties and molecular representations.
ADORE provides three separate datasets for the analysis of individual species, namely F2F (fish), A2A (algae) and C2C (crustaceans). Each of these datasets contains both training and test sets. In the case of the fish dataset, three different test sets are offered. In addition, ADORE offers datasets for mixed species called AC2F-same and AC2F-diff. These datasets are for training on algae and crustaceans but testing on fish. They are subdivided according to whether the chemical compounds in the training and test datasets are the same or different.
The ADORE dataset comprises ecotoxicological data for 203 species, with two-thirds of them being fish (Table 1). Specifically, among the fish species included are rainbow trout (Oncorhynchus mykiss), minnow (Pimephales promelas) and sunfish (Lepomis macrochirus). Among the crustaceans, there are Daphnia magna and Daphnia pulex. Among the algae, the microalgae Chlorella vulgaris and the green alga Chlamydomonas reinhardtii are the most frequently occurring species in the dataset.
Table 1. Description of the ADORE Datasets.

| dataset | train on | test on | number of species | N | data split |
|---|---|---|---|---|---|
| Training-F2F | fish | | 140 | 4818 | by occurrence of chemical compounds |
| F2F-1 | | Oncorhynchus mykiss | | | |
| F2F-2 | | Pimephales promelas | | | |
| F2F-3 | | Lepomis macrochirus | | | |
| Training-A2A | algae | | 46 | 321 | by occurrence of chemical compounds |
| A2A | | Chlorella vulgaris | | | |
| Training-C2C | crustaceans | | 17 | 3062 | by occurrence of chemical compounds |
| C2C | | Daphnia magna | | | |
ADORE classifies the toxicity of chemical compounds as outlined in Supporting Table S1. Toxicity classification is based on the EC50, the concentration at which 50% of the effect is observed compared to the control. Compounds with EC50 values in the ranges (−∞, 10⁻¹) and (10⁻¹, 10⁰) mg/L are considered more toxic, while those within (10⁰, 10¹), (10¹, 10²), and (10², +∞) mg/L are deemed less toxic. The number of samples in each category of each dataset is provided in Table 2.
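Under this scheme, binary classification reduces to a single threshold at 10⁰ = 1 mg/L. A minimal sketch of the binarization in Python, assuming a pandas DataFrame with a hypothetical `EC50` column in mg/L:

```python
import pandas as pd

def label_toxicity(ec50_mg_per_l: float) -> int:
    """More toxic (1): EC50 below 10**0 = 1 mg/L; less toxic (0): 1 mg/L and above."""
    return 1 if ec50_mg_per_l < 1.0 else 0

# Illustrative values only; column name "EC50" is an assumption
df = pd.DataFrame({"EC50": [0.05, 0.8, 5.0, 150.0]})
df["more_toxic"] = df["EC50"].apply(label_toxicity)
print(df)
```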
Table 2. Number of Samples in Each Dataset.

| dataset | more toxic | less toxic | total | P:N ratio |
|---|---|---|---|---|
| Training-F2F | 597 | 1600 | 2197 | 1:2.68 |
| F2F-1 | 327 | 543 | 870 | 1:1.66 |
| F2F-2 | 188 | 775 | 963 | 1:4.12 |
| F2F-3 | 272 | 516 | 788 | 1:1.89 |
| Training-A2A | 63 | 140 | 203 | 1:2.22 |
| A2A | 24 | 94 | 118 | 1:3.91 |
| Training-C2C | 504 | 1088 | 1592 | 1:2.15 |
| C2C | 452 | 1018 | 1472 | 1:2.52 |
| AC2F-same | 824 | 1594 | 2418 | 1:1.93 |
| AC2F-diff | 749 | 1894 | 2643 | 1:2.52 |
The ADORE datasets were split by occurrence of chemical compounds, using two approaches to facilitate ML for different purposes. The first approach generated datasets that support training and testing on the same taxonomic group. The training sets are referred to as Training-F2F, Training-A2A, and Training-C2C, where F stands for fish, A for algae, and C for crustaceans. For fish, the test sets are referred to as F2F-1, F2F-2, and F2F-3 for the three different fish species, while the test sets for algae and crustaceans are referred to as A2A and C2C.
To evaluate cross-species prediction, the second approach created datasets with crustacean and algae data for training and fish data for testing. These datasets, labeled AC2F-same and AC2F-diff, indicate whether the test data contain the same chemicals as the training set or different ones. This splitting strategy makes AC2F-diff more challenging than AC2F-same. By using data from alternative sources, it becomes possible to reduce the number of experiments on fish when accurate toxicity results can be predicted, alleviating ethical concerns.
The structural diversity and chemical space of the compounds in a dataset play a key role in the predictive ability of ML models. The chemical space of the compounds in each dataset can be described in two dimensions using molecular weight (MW) and logP. As shown in Supporting Figure S1, the compounds in the training and test datasets are distributed over a wide range of MW (41.05–577.93) and logP (−4.98 to 6.95), indicating that the modeled datasets cover a wide chemical space. To evaluate the chemical space relationship between the training and test sets, we calculated Tanimoto similarity coefficients based on ECFP4 molecular fingerprints for three datasets: F2F, A2A, and C2C. The average Tanimoto similarities are 0.088 for F2F, 0.108 for A2A, and 0.084 for C2C. t-SNE visualizations of the chemical space distributions and similarity heatmaps (Supporting Figures S2–S16) further demonstrate clear clustering and distinction between training and test compounds.
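For reference, a sketch of this similarity calculation with RDKit is shown below; ECFP4 corresponds to a Morgan fingerprint of radius 2, and the SMILES strings are placeholders rather than actual ADORE compounds:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles: str, n_bits: int = 1024):
    """ECFP4 = Morgan fingerprint with radius 2."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)

train_smiles = ["CCO", "c1ccccc1O", "CC(=O)O"]  # placeholder training compounds
test_smiles = ["c1ccccc1Cl", "CCCCO"]           # placeholder test compounds

train_fps = [ecfp4(s) for s in train_smiles]
test_fps = [ecfp4(s) for s in test_smiles]

# Mean pairwise train-test Tanimoto similarity
sims = [DataStructs.TanimotoSimilarity(a, b) for a in train_fps for b in test_fps]
print(f"mean Tanimoto similarity: {np.mean(sims):.3f}")
```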
2.2. Molecular Representations
The feature representation of molecules determines the maximum predictive capability of a Quantitative Structure-Activity Relationship (QSAR) model, so choosing a suitable molecular representation is of critical importance. To fully utilize the chemical information of molecules, three distinct molecular representations, namely molecular fingerprints, molecular embeddings, and graph-based representations, were explored in this study. Two widely used fingerprints, Morgan (ECFP-like, 1024 bits) and MACCS, were calculated using RDKit. The former uses a circular fingerprinting algorithm to create a vector representation of a molecule, while the latter encodes detailed information about substructures and stereochemistry. Both are well suited for representing molecules prior to ML. Meanwhile, molecular embedding was performed using Mol2vec, an unsupervised method trained on a large corpus of molecules (analogous to sentences) to learn feature vectors of molecular substructures (analogous to words). The feature vector of a molecule is then computed as the sum of the feature vectors of its substructures. To capture the spatial relationships among these substructures within a molecule, the skip-gram model was employed, which learns distributed representations and contextual relationships from large-scale unlabeled data.
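A minimal sketch of the two fingerprint calculations with RDKit (the example molecule is illustrative, not taken from the dataset):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("Oc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl")  # pentachlorophenol (illustrative)

morgan = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)  # Morgan, 1024 bits
maccs = MACCSkeys.GenMACCSKeys(mol)                                        # MACCS structural keys

x_morgan = np.array(morgan)  # bit vectors usable directly as ML feature vectors
x_maccs = np.array(maccs)
print(x_morgan.shape, x_maccs.shape)
```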
Compared to molecular fingerprints, molecular graphs capture the spatial relationships between atoms in a molecule by directly encoding the atoms, bonds, and distances available in the molecular structure. Atoms form the nodes of the graph, and the bonds between atoms form its edges; both carry associated properties, such as atom type, atomic number, and bond type. A molecular graph consists of two matrices: an N × N connectivity matrix A, which represents the structure of the graph, and an N × F node feature matrix X, where N is the number of nodes and F is the number of node features. The node feature matrix typically includes atomic features such as atom type, formal charge, hybridization, aromaticity, number of hydrogen atoms, chirality, and local charge. The edge features indicate the type of chemical bond between atoms, whether the bond lies on a ring, whether it is conjugated, and its stereo configuration. These molecular features were mainly calculated with the deepchem.feat.MolGraphConvFeaturizer and deepchem.feat.ConvMolFeaturizer modules of the open-source software package DeepChem (https://deepchem.io/, version 2.6.1).
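A short sketch of the graph featurization with DeepChem, assuming version 2.6.1 as stated above:

```python
import deepchem as dc

# Convert SMILES into graph objects with per-atom and per-bond features
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)
graphs = featurizer.featurize(["CCO", "c1ccccc1O"])

g = graphs[0]  # a GraphData object
print(g.node_features.shape)  # (num_atoms, F): node feature matrix X
print(g.edge_index.shape)     # (2, num_directed_bonds): graph connectivity
print(g.edge_features.shape)  # bond features (bond type, ring, conjugation, stereo)
```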
2.3. Machine Learning Algorithms and Model Construction
Five conventional ML algorithms (KNN, NB, RF, SVM, and XGBoost) and six DL algorithms (DNN, GCN, GAT, MPNN, Attentive FP, and FPGNN) were investigated for chemical ecotoxicity prediction. The KNN, NB, RF, and SVM models were built using scikit-learn in Python, while the XGBoost model used the XGBoost package (https://github.com/dmlc/xgboost). The DNN, GCN, GAT, MPNN, and Attentive FP models were developed using DeepChem (https://github.com/deepchem/deepchem), while the FPGNN model was developed with the FP-GNN package (https://github.com/idrugLab/FP-GNN). All models were trained on a CPU [Intel Xeon 8358 (32C, 250W, 2.6 GHz)] and a GPU [NVIDIA A800 80 GB PCIe]. In addition, we conducted a grid search to optimize the hyperparameters of each model.
2.3.1. K-Nearest Neighbor (KNN)
The KNN is based on the principle that data points in close proximity tend to have similar labels. This proximity is measured using distance metrics such as Euclidean, Manhattan, and Jaccard distances. This algorithm predicts the labels of new data points by examining the labels of spatially nearby data points in the training set. KNN is a nonparametric algorithm that does not require any assumptions about the data, making it relatively simple and easy to use. Three hyperparameters were optimized: n_neighbors (1,3,5,7,9), p (1,2), and weight function (“uniform”, “distance”).
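As an illustration of the grid search procedure used throughout Section 2.3, a sketch for the KNN model with scikit-learn, using the hyperparameter grid above; the feature matrix and labels are random stand-ins for the fingerprint data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Random stand-ins for a 1024-bit fingerprint matrix and binary toxicity labels
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 1024))
y_train = rng.integers(0, 2, size=200)

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "p": [1, 2],                        # 1 = Manhattan, 2 = Euclidean distance
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```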
2.3.2. Naive Bayes (NB)
The NB algorithm is a simple yet effective method for classification tasks. Utilizing Bayes’ theorem and assuming conditional independence of features, NB learns the joint probability distribution of inputs and outputs for a given dataset. NB is suitable for both large and small datasets since it does not require complex iterative parameter estimation. Two hyperparameters were optimized: α (0.01–1) and binarize (0, 0.5, 0.7).
2.3.3. Random Forest (RF)
RF is a nonlinear ensemble method based on decision trees. It improves the generalization ability of the final ensemble by introducing random feature selection during training. RF is characterized by high prediction accuracy, robustness to outliers and noise, and a low probability of overfitting. In addition, RF can handle input samples with high-dimensional features without dimensionality reduction, making it one of the most popular algorithms for QSAR modeling. Four hyperparameters were optimized: n_estimators (10–500), criterion (“gini”, “entropy”), max_depth (0–15), and max_features (“log2”, “auto”, “sqrt”).
2.3.4. Support Vector Machine (SVM)
SVM is a supervised ML algorithm commonly employed for binary classification tasks. Through the kernel trick, it can effectively address nonlinearity in the input data to achieve good performance in classification and regression. The fundamental concept of SVM is to identify an optimal hyperplane in the N-dimensional feature space that distinguishes data of different classes by maximizing the margin between them. Since the final decision function relies on only a few selected data points, the support vectors, SVM can circumvent the curse of dimensionality and efficiently handle high-dimensional data. SVM is often used in drug research, for example to predict the properties of active ingredients. In training the SVM models, two hyperparameters, kernel (“linear”, “poly”, “rbf”, “sigmoid”) and the penalty parameter C (0.1, 1, 10), were optimized.
2.3.5. XGBoost
XGBoost belongs to the gradient boosting family of ensemble learning methods. It iteratively builds multiple weak learners and combines them into a powerful prediction model. XGBoost enhances model performance by optimizing the objective function with a gradient boosting algorithm. Additionally, it offers regularization options, such as L1 and L2 regularization, to control model complexity, reduce variance, and prevent overfitting. Five hyperparameters were optimized: learning_rate (0.01–0.1), γ (0–0.1), min_child_weight (1–3), max_depth (3–5), and n_estimators (10–100).
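An analogous grid search sketch for XGBoost using the ranges above (again with random stand-in data; the full grid amounts to 243 candidate settings per cross-validation fold):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))  # stand-in fingerprint matrix
y = rng.integers(0, 2, size=200)          # stand-in binary labels

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "gamma": [0, 0.05, 0.1],
    "min_child_weight": [1, 2, 3],
    "max_depth": [3, 4, 5],
    "n_estimators": [10, 50, 100],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"), param_grid,
                      scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```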
2.3.6. Deep Neural Network (DNN)
A DNN is a DL framework comprising multiple layers of computational nodes. These layers can be categorized into input, hidden and output layers based on their position within the network. Connections between neurons carry numerical weights, contributing to the network’s ability to learn from training data. The size and structure of DNNs, including the number of neurons and layers, are determined by factors such as the number of features, connections and output types of the problem. Three hyperparameters were optimized: dropouts (0.1, 0.2, 0.5), layer_sizes (8, 32, 64, 128), and weight_decay_penalty (0.1, 0.01, 0.001, 0.0001).
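A sketch of training such a DNN on fingerprint features with DeepChem's MultitaskClassifier, with layer size and dropout drawn from the grid above and random stand-in data:

```python
import numpy as np
import deepchem as dc

# Random stand-ins for 1024-bit fingerprints and binary labels
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024)).astype(np.float32)
y = rng.integers(0, 2, size=(200, 1))
dataset = dc.data.NumpyDataset(X, y)

model = dc.models.MultitaskClassifier(
    n_tasks=1, n_features=1024,
    layer_sizes=[64],             # hidden layer size from the grid above
    dropouts=0.2,
    weight_decay_penalty=0.001,
)
model.fit(dataset, nb_epoch=30)
score = model.evaluate(dataset, [dc.metrics.Metric(dc.metrics.roc_auc_score)])
print(score)
```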
2.3.7. Graph Convolutional Network (GCN)
GCN is a neural network method that applies the idea of convolution to graph-structured data, proposed by Kipf and Welling in 2016. GCN serves as a connection point between spectral-domain and spatial-domain Graph Neural Networks (GNNs). Comprising input layers, convolutional layers, fully connected layers, and output layers, GCN represents a small molecule as an undirected graph of atoms, with the molecular graph structure serving as input. Its fundamental concept is to update node representations by aggregating neighborhood information. By extracting significant features from graph structures encompassing atomic and chemical bond attributes, GCN constructs representations at the molecular level. Four hyperparameters were optimized: weight_decay (0, 10 × 10⁻⁸, 10 × 10⁻⁶, 10 × 10⁻⁴), graph_conv_layers [(64, 64), (128, 128), (256, 256)], learning rate (0.01, 0.001, 0.0001), and dense_layer_size (32, 64, 128).
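A sketch of a GCN classifier with DeepChem's GCNModel (which relies on the DGL backend), using hyperparameter values from the grid above and placeholder molecules and labels:

```python
import deepchem as dc

smiles = ["CCO", "c1ccccc1O", "CC(=O)O", "c1ccccc1Cl"]  # placeholder molecules
labels = [[0], [1], [0], [1]]                           # dummy binary toxicity labels

featurizer = dc.feat.MolGraphConvFeaturizer()
dataset = dc.data.NumpyDataset(featurizer.featurize(smiles), labels)

model = dc.models.GCNModel(
    mode="classification", n_tasks=1,
    graph_conv_layers=[64, 64],  # from the grid above
    learning_rate=0.001,
)
model.fit(dataset, nb_epoch=50)
```

DeepChem also provides analogous GATModel, MPNNModel, and AttentiveFPModel classes for the other graph architectures described below.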
2.3.8. Graph Attention Network (GAT)
GAT introduces the attention mechanism to spatial-domain GNNs. Instead of using the Laplacian and other matrices for complex computations, GAT simplifies node feature updates by focusing solely on the neighboring nodes. Each node’s state update in GAT considers the states of its neighbors by computing attention weights between the node and its neighbors. The attention mechanism allows GAT to dynamically adjust the importance of each neighboring node, effectively capturing the most relevant information from the neighborhood. Four hyperparameters were optimized: weight_decay (0, 10 × 10⁻⁸, 10 × 10⁻⁶, 10 × 10⁻⁴), learning rate (0.1, 0.01, 0.001), n_attention_heads (8, 16, 32), and dropouts (0.1, 0.3, 0.5).
2.3.9. Message Passing Neural Network (MPNN)
MPNN is a graph neural network model based on the message passing mechanism. Its core idea is to treat each node in a graph as a message transmitter and to realize deep feature extraction from graph-structured data by casting the message passing between a node and its neighbors as the forward propagation of a neural network. MPNN is very versatile and can be applied to many different types of graph data and tasks. Three hyperparameters were optimized: weight_decay (10 × 10⁻⁸, 10 × 10⁻⁶, 10 × 10⁻⁴), learning rate (0.1, 0.01, 0.001), and graph_conv_layers [(32, 32), (64, 64), (128, 128)].
2.3.10. Attentive FP
Attentive FP is another graph neural network model based on the attention mechanism that characterizes molecules at both the atomic and whole-molecule levels. It is capable of learning both local and non-local properties of chemical structures, and what it learns is interpretable, allowing users to understand the complexity of the underlying data. The main hyperparameters were optimized as follows: dropout (0.1, 0.3, 0.5), graph_feat_size (50, 100, 200), learning rate (0.1, 0.01, 0.001), and weight_decay (0, 0.01, 0.0001).
2.3.11. FPGNN
FPGNN combines molecular fingerprint representation with molecular graph representation based on graph neural networks. This fusion results in FPGNN models with improved prediction accuracy, suggesting complementarity between features represented in graphs and fingerprints. Notably, FPGNN shows excellent noise immunity, making it well-suited for real-world scenarios with abundant noisy data. Three hyperparameters were optimized: dropout (0.1, 0.3, 0.5), learning rate (0.1, 0.01, 0.001) and weight_decay (0, 0.01, 0.0001).
2.4. Performance Evaluation of Models
The following metrics were used to evaluate model performance: Sensitivity (SE, also known as Recall), Specificity (SP), Matthews Correlation Coefficient (MCC), Accuracy (ACC), Area Under the Receiver Operating Characteristic Curve (AUC), F1-measure (F1), and Balanced Accuracy (BA). AUC is computed directly from the ROC curve; the remaining metrics are defined as follows
$$\mathrm{SE} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{BA} = \frac{\mathrm{SE} + \mathrm{SP}}{2} \tag{4}$$

$$\mathrm{F1} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{5}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}$$
where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.
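These metrics can be computed directly from the confusion matrix; a short sketch with illustrative predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # illustrative labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
se = tp / (tp + fn)                     # Sensitivity / Recall, eq 1
sp = tn / (tn + fp)                     # Specificity, eq 2
acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy, eq 3
ba = (se + sp) / 2                      # Balanced Accuracy, eq 4
f1 = 2 * tp / (2 * tp + fp + fn)        # F1-measure, eq 5
mcc = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)                                       # Matthews Correlation Coefficient, eq 6
print(f"SE={se:.3f} SP={sp:.3f} ACC={acc:.3f} BA={ba:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```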
3. Results
Based on three different types of molecular features (i.e., molecular fingerprints, embedding, and graph-based features) and 11 selected ML and DL algorithms, 161 models were developed. All models were optimized using the training and validation datasets. Specifically, 10% of the original training dataset was randomly held out as the validation dataset for hyperparameter tuning. Model selection was performed based on the validation dataset’s AUC scores to ensure robust performance before final evaluation on the independent test dataset.
3.1. Performance of Fingerprint-Based Prediction Models
Chemical ecotoxicity prediction models were constructed for same-species and cross-species predictions using all seven ADORE test datasets, namely F2F-1, F2F-2, F2F-3, A2A, C2C, AC2F-same, and AC2F-diff. Utilizing two molecular fingerprint representations, Morgan and MACCS, in combination with six ML algorithms (KNN, NB, RF, SVM, XGBoost, and DNN), a total of eighty-four prediction models were trained. The detailed performance of each model on the test datasets is presented in Supporting Tables S2 to S15, with the F1, BA, and AUC values of the models depicted in Figures 1, 2, and 3, respectively. Overall, the majority of the molecular fingerprint-based prediction models demonstrate strong performance, as evidenced by their F1, AUC, and BA scores on the test datasets. Only on the AC2F-diff dataset does the difference in chemical compounds between the training and test sets lead to poorer modeling results.
Figure 1. Test performance of fingerprint-based ecotoxicity prediction models. (A) F1 scores of the Morgan-based models. (B) F1 scores of the MACCS-based models.
Figure 2. Test performance of fingerprint-based ecotoxicity prediction models. (A) BA scores of the Morgan-based models. (B) BA scores of the MACCS-based models.
Figure 3. Test performance of fingerprint-based ecotoxicity prediction models. (A) AUC scores of the Morgan-based models. (B) AUC scores of the MACCS-based models.
Considering the AUC score as a metric, no single model performed best on all datasets, indicating that it is necessary to test different algorithms for each dataset. For the two molecular fingerprint representations, Morgan and MACCS, and the seven datasets, we selected 14 (2 × 7) optimal models. Of these 14 prediction models, 12 are DNN models, indicating that DNN outperforms the other five algorithms in ecotoxicity prediction. The results show that, using the Morgan fingerprint, DNN performs best in the predictions for the F2F-1, F2F-2, F2F-3, A2A, C2C, and AC2F-diff datasets, with AUC scores of 0.972, 0.991, 0.972, 0.988, 0.984, and 0.728, respectively, and F1 scores of 0.951, 0.901, 0.932, 0.937, 0.902, and 0.660. Additionally, the XGBoost model performs best on AC2F-same, while it is comparable to DNN on F2F-3 and A2A. In the MACCS fingerprint results, DNN performs best in the predictions for the F2F-1, F2F-2, A2A, C2C, AC2F-same, and AC2F-diff datasets, with AUC scores of 0.973, 0.985, 0.987, 0.977, 0.822, and 0.821, respectively. The corresponding F1 scores are 0.949, 0.881, 0.894, 0.901, 0.619, and 0.698. The XGBoost model exhibits the best predictive performance on the F2F-3 dataset, with an AUC score of 0.968 and an F1 score of 0.871.
Overall, based on the experimental results, both the DNN and XGBoost models demonstrate robust performance in ecotoxicity classification. Meanwhile, the Morgan fingerprint is slightly better than the MACCS fingerprint in same-species predictions, i.e., out of the five datasets, the Morgan fingerprint achieves the highest AUC scores in four of them. Nevertheless, for cross-species cases, MACCS shows enhanced prediction compared to Morgan on both the AC2F-same (+2%) and AC2F-diff (+13%) datasets.
3.2. Performance of Mol2vec-Based Prediction Models
The Mol2vec model is an unsupervised machine learning method inspired by natural language processing techniques and is primarily used to learn vector representations of molecular structures. Like the Word2vec model, Mol2vec learns vector representations of molecular substructures such that chemically related substructures point in similar directions in the vector space. In this study, 42 ecotoxicity prediction models were built for the seven datasets using Mol2vec for vector representation of molecular structures with five traditional machine learning algorithms (KNN, NB, RF, SVM, and XGBoost) and one deep learning algorithm (DNN).
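A sketch of the embedding step, assuming the open-source mol2vec package and a pretrained 300-dimensional model file; the API names and model path below are assumptions for illustration, not guaranteed by this study:

```python
import numpy as np
from gensim.models import word2vec
from rdkit import Chem
from mol2vec.features import MolSentence, mol2alt_sentence, sentences2vec

# Pretrained 300-dimensional Mol2vec model (hypothetical local path)
model = word2vec.Word2Vec.load("model_300dim.pkl")

mol = Chem.MolFromSmiles("CCO")
sentence = MolSentence(mol2alt_sentence(mol, 1))      # Morgan substructure "words"
vec = sentences2vec([sentence], model, unseen="UNK")  # sum of substructure vectors
print(np.array(vec).shape)  # expected (1, 300)
```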
The F1, AUC, and BA values of the Mol2vec prediction models on the test datasets are shown in Figure 4. Overall, most of the Mol2vec-based models perform well in ecotoxicology prediction: except for the AC2F-diff dataset, both the AUC and F1 scores on the test datasets are above 0.5. The XGBoost model performs best in the predictions on the F2F-1, F2F-2, F2F-3, A2A, C2C, and AC2F-same datasets, with AUC scores of 0.973, 0.992, 0.973, 0.986, 0.984, and 0.796, respectively. The corresponding F1 scores are 0.898, 0.907, 0.881, 0.907, 0.908, and 0.758. Additionally, the KNN model shows good predictive results for the AC2F-diff dataset, with an AUC score of 0.732 and an F1 score of 0.612. The RF model also achieves favorable outcomes. The results of the individual models on the test datasets are shown in Supporting Tables S16–S22.
Figure 4. Test performance of the Mol2vec-based ecotoxicity prediction models. (A) F1 scores, (B) BA scores, and (C) AUC scores.
3.3. Performance of Graph-Based Prediction Models
While both molecular fingerprint-based and Mol2vec-based representations convert molecular SMILES into feature vectors, graph neural networks work directly on the natural input representation of molecules, i.e., the chemical graph of atoms and bonds. GNNs therefore have access to a complete representation of molecules at the atomic level and can extract physicochemical and structural features directly from the molecular graph. In this investigation, thirty-five molecular graph-based models were built using five GNN algorithms (GCN, GAT, MPNN, Attentive FP, and FPGNN). As shown in Figure 5, considering the AUC score as a metric, GCN has the best overall performance among the GNN methods. GCN performs best on the F2F-1, F2F-2, F2F-3, A2A, and C2C datasets, with AUC scores of 0.982, 0.991, 0.982, 0.989, and 0.984, respectively, and F1 scores of 0.916, 0.900, 0.903, 0.886, and 0.902. On the other hand, GAT exhibits good predictive performance on the AC2F-same dataset, with an AUC of 0.844, slightly higher than GCN’s 0.838. Meanwhile, the FPGNN model performs well on the AC2F-diff dataset, with an AUC of 0.795, slightly better than Attentive FP’s 0.791. The Attentive FP model demonstrates predictive performance comparable to GCN on the C2C dataset. The detailed performance results of the molecular graph-based models are shown in Supporting Tables S23–S29.
Figure 5. Test performance of the graph-based ecotoxicity prediction models. (A) F1 scores, (B) BA scores, and (C) AUC scores.
3.4. Optimal Model for Each Dataset
Table 3 compares the best-performing models for each molecular representation across the seven test sets. Taking the AUC metric as a measure of overall model performance, GCN consistently outperforms the other models in same-species predictions (F2F-1/2/3, A2A, and C2C), with AUC values between 0.982 and 0.992. Cross-species predictions prove to be challenging. The best-performing model, GAT, only achieves an AUC value of 0.844 for AC2F-same, a reduction of approximately 17% in the ability to distinguish toxic from non-toxic chemicals compared to the same-species prediction cases. We note that GCN, the second-best graph model, comes close to GAT with an AUC of 0.838. Interestingly, for AC2F-diff prediction, the best model is DNN with the MACCS fingerprint, with an AUC of 0.821, while all other models are inferior (less than 0.74). This suggests that for chemicals that have not undergone ecotoxicity testing in algae or crustaceans, ecotoxicity to fish can be more accurately predicted using MACCS with DNN. This representation encodes the substructure and stereochemistry of the molecules, which presumably offers better generalizability.
Table 3. Optimal Predictive Models Utilizing Fingerprints, Molecular Embedding, and Graph Features in 7 Test Sets.
| dataset | molecular feature | model | Acc | BA | SE | SP | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| F2F-1 | Morgan | DNN | 0.915 | 0.907 | 0.847 | 0.967 | 0.951 | 0.829 | 0.972 |
| | MACCS | DNN | 0.912 | 0.907 | 0.857 | 0.957 | 0.949 | 0.824 | 0.973 |
| | Mol2vec | XGBoost | 0.914 | 0.908 | 0.846 | 0.970 | 0.898 | 0.830 | 0.973 |
| | Graph | GCN | 0.921 | 0.923 | 0.923 | 0.923 | 0.916 | 0.844 | 0.982 |
| F2F-2 | Morgan | DNN | 0.942 | 0.943 | 0.922 | 0.954 | 0.901 | 0.862 | 0.991 |
| | MACCS | DNN | 0.933 | 0.932 | 0.929 | 0.935 | 0.881 | 0.835 | 0.985 |
| | Mol2vec | XGBoost | 0.950 | 0.948 | 0.942 | 0.953 | 0.907 | 0.874 | 0.992 |
| | Graph | GCN | 0.942 | 0.933 | 0.914 | 0.961 | 0.900 | 0.864 | 0.991 |
| F2F-3 | Morgan | XGBoost | 0.899 | 0.893 | 0.837 | 0.950 | 0.882 | 0.797 | 0.972 |
| | MACCS | XGBoost | 0.889 | 0.883 | 0.828 | 0.939 | 0.871 | 0.777 | 0.968 |
| | Mol2vec | XGBoost | 0.897 | 0.892 | 0.843 | 0.941 | 0.881 | 0.792 | 0.973 |
| | Graph | GCN | 0.921 | 0.913 | 0.913 | 0.924 | 0.903 | 0.833 | 0.982 |
| A2A | Morgan | DNN | 0.964 | 0.960 | 0.952 | 0.967 | 0.937 | 0.906 | 0.988 |
| | MACCS | DNN | 0.945 | 0.947 | 0.952 | 0.942 | 0.894 | 0.864 | 0.987 |
| | Mol2vec | XGBoost | 0.951 | 0.944 | 0.929 | 0.959 | 0.907 | 0.874 | 0.986 |
| | Graph | GCN | 0.939 | 0.935 | 0.929 | 0.942 | 0.886 | 0.846 | 0.989 |
| C2C | Morgan | DNN | 0.928 | 0.930 | 0.935 | 0.924 | 0.902 | 0.850 | 0.984 |
| | MACCS | DNN | 0.913 | 0.909 | 0.894 | 0.925 | 0.901 | 0.816 | 0.977 |
| | Mol2vec | XGBoost | 0.929 | 0.929 | 0.927 | 0.930 | 0.908 | 0.851 | 0.984 |
| | Graph | GCN | 0.923 | 0.927 | 0.945 | 0.910 | 0.902 | 0.841 | 0.984 |
| AC2F-same | Morgan | XGBoost | 0.750 | 0.765 | 0.884 | 0.646 | 0.756 | 0.533 | 0.807 |
| | MACCS | DNN | 0.729 | 0.746 | 0.890 | 0.602 | 0.619 | 0.502 | 0.822 |
| | Mol2vec | XGBoost | 0.751 | 0.766 | 0.889 | 0.644 | 0.758 | 0.537 | 0.796 |
| | Graph | GAT | 0.711 | 0.726 | 0.847 | 0.604 | 0.720 | 0.456 | 0.844 |
| AC2F-diff | Morgan | DNN | 0.728 | 0.609 | 0.337 | 0.881 | 0.660 | 0.256 | 0.728 |
| | MACCS | DNN | 0.787 | 0.762 | 0.702 | 0.821 | 0.698 | 0.502 | 0.821 |
| | Mol2vec | KNN | 0.760 | 0.732 | 0.669 | 0.796 | 0.612 | 0.444 | 0.732 |
| | Graph | FPGNN | 0.735 | 0.701 | 0.622 | 0.779 | 0.570 | 0.383 | 0.735 |
The model was selected based on AUC values.
Overall, the predictive performance of molecular graph-based models is generally superior to that of ML and DL models based on molecular fingerprints and molecular embeddings. Consequently, using molecular graph-based models allows for more effective capturing of complex relationships and features within chemical structures, leading to improved predictive performance.
The Wilcoxon signed-rank test was applied to statistically evaluate performance differences among various models across tasks and molecular features. In the F2F-1 task, the accuracy of GCN (graph-based features, accuracy = 0.921) was significantly higher than that of XGBoost (Mol2vec features, accuracy = 0.914) (Wilcoxon W = 32, p = 0.006; Bonferroni-corrected significance level α = 0.0083), whereas the differences between DNN (Morgan fingerprints, accuracy = 0.915) and both models were not statistically significant (p = 0.042 and p = 0.021, respectively). For the A2A task evaluated by the AUC metric, DNN (Morgan fingerprints, AUC = 0.988) significantly outperformed XGBoost (Mol2vec features, AUC = 0.986) (W = 28, p = 0.023), but showed no significant difference compared to GCN (graph-based features, AUC = 0.989; p = 0.156). After adjusting for multiple comparisons, a statistically significant difference was found only in the F2F-3 task, where the AUC of XGBoost (Mol2vec features, 0.973) was lower than that of GCN (graph-based features, 0.982) (p = 0.003), suggesting superior generalization capability of the graph neural network for this task. These results were obtained from 10 repeated experiments, with the family-wise error rate controlled by the Bonferroni method to ensure the reliability of the statistical conclusions.
To compare the performance of the different methods across datasets more comprehensively, we conducted a robustness analysis. Specifically, for a particular algorithm A, the robustness on dataset D is defined as the ratio between the accuracy of algorithm A on dataset D and the lowest accuracy observed among all methods on that dataset
$$r_A(D) = \frac{\mathrm{acc}_A(D)}{\min_{\alpha} \mathrm{acc}_{\alpha}(D)} \tag{7}$$
where $\mathrm{acc}_A(D)$ denotes the accuracy of algorithm $A$ on dataset $D$, and $\min_{\alpha} \mathrm{acc}_{\alpha}(D)$ is the minimum accuracy among all algorithms (including $A$ and all comparative algorithms) on dataset $D$. Therefore, the worst-performing algorithm $A_0$ on dataset $D$ has robustness $r_{A_0}(D) = 1$, while every other algorithm $A$ has $r_A(D) \geq 1$, and the largest value indicates the best-performing algorithm on that dataset. Given $n$ datasets $D_1, \dots, D_n$, the overall robustness of algorithm $A$ is defined as the sum of its robustness over all datasets
$$R_A = \sum_{i=1}^{n} r_A(D_i) \tag{8}$$
A higher value of robustness indicates better performance of the algorithm.
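A compact sketch of eqs 7 and 8 over a table of per-dataset accuracies (the numbers are illustrative, not our actual results):

```python
import numpy as np

# accuracies[algorithm] = accuracy on each of the seven datasets (illustrative values)
accuracies = {
    "GCN":     [0.92, 0.94, 0.92, 0.94, 0.92, 0.71, 0.70],
    "XGBoost": [0.91, 0.95, 0.90, 0.95, 0.93, 0.75, 0.72],
    "NB":      [0.80, 0.82, 0.79, 0.84, 0.81, 0.65, 0.60],
}
acc = np.array(list(accuracies.values()))
min_acc = acc.min(axis=0)          # worst accuracy per dataset (denominator in eq 7)
robustness = acc / min_acc         # r_A(D), eq 7
overall = robustness.sum(axis=1)   # R_A, eq 8
for name, r in zip(accuracies, overall):
    print(f"{name}: overall robustness = {r:.3f}")
```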
The robustness comparison of the algorithms on the seven datasets is shown in Figure 6. GCN has the highest overall robustness among all algorithms, while MPNN, FPGNN, and Attentive FP demonstrate performance close to that of GCN. It is worth noting that GCN, MPNN, FPGNN, and Attentive FP all belong to the category of graph neural network algorithms. In contrast, the robustness of traditional ML algorithms is slightly inferior to that of the graph-based methods, as traditional ML algorithms do not consider explicit molecular graph structure and may therefore miss subtle structural properties that are not encoded in standard fingerprints or generic embedding representations.
Figure 6. Robustness analysis of models based on test accuracy.
3.5. Interpretability of GCN
We analyzed the GCN fish model to understand its interpretability. From the GCN model, one can assess the importance of adjacent atoms by learning and assigning a set of soft masks to each edge in the graph. These masks, represented as weights between 0 and 1, serve as quantitative indicators that measure the relative importance of each edge in the graph for prediction. Specifically, this involves calculating the difference between the predictions of the masked graph and the original graph, ensuring that the prediction results remain as close as possible to the original after applying the masks. The assessment then identifies the edges most crucial to the prediction outcome. Higher mask values correspond directly to edge features that make a greater contribution to the final prediction. These importance coefficients are then mapped onto the chemical bonds associated with the atoms, thereby providing insights into the molecular structure related to the predicted toxicity.
Take sodium pentachlorophenate, a toxic chemical, as an example (Figure 7A). The molecule was predicted to be toxic by the GCN model trained on the F2F dataset. Darker-colored edges, corresponding to higher mask values, represent components that are more influential for its toxicity, and vice versa. The highly chlorinated benzene ring (with five chlorine atoms) of sodium pentachlorophenate is found to contribute significantly to the predicted toxicity of the substance. This is in agreement with the knowledge that chlorinated benzenes exhibit higher stability and lower biodegradability than their non-chlorinated counterparts.
Figure 7. Visualization of attention weights from the GCN model (trained on the F2F dataset) for four chemical compounds. (A) Sodium pentachlorophenate (toxic); (B) n-octylphenol (toxic); (C) 4-chloro-3-tert-butylphenyl cyano dimethoxyphosphonate (less toxic); and (D) 2,4,6-trinitrophenol (less toxic).
For another toxic molecule, n-octylphenol (Figure 7B), the model highlights its long hydrophobic alkyl chain (eight carbon atoms). This chain gives n-octylphenol strong lipophilicity, making it prone to accumulation in aquatic organisms, particularly in adipose tissues. Such accumulation can lead to heightened toxic effects, especially when organisms are exposed for extended periods in aquatic environments.
The molecules with lighter shading in the visualization (Figure 7C,D) were predicted to be less toxic: the model identified no specific toxicity-associated substructures in these compounds, consistent with a reduced potential for harmful effects on aquatic organisms.
These examples not only demonstrate the interpretability of the GCN model but also suggest that the GCN can learn relationships between molecular substructures (chemical fragments) and their molecular properties.
4. Discussion and Conclusion
In this study, we performed a comparative analysis of ecotoxicity prediction models using the recently released benchmark dataset ADORE. This dataset is curated from major ecotoxicity databases and contains acute aquatic toxicity data for three relevant taxonomic groups: fish, crustaceans, and algae. The data have been preprocessed and partitioned into training and test sets for ML purposes. In the quest to find the optimal predictive ecotoxicological models for single-species and cross-species predictions, we first assessed the performance of traditional ML models using two molecular fingerprint representations (Morgan and MACCS) and one embedding representation (Mol2vec). These were combined with six widely used ML algorithms to establish the baseline predictive performance. We then evaluated multiple GNN models, including GAT, GCN, MPNN, Attentive FP, and FPGNN, on the same prediction tasks. Among the five single-species prediction tasks, GCN achieved the highest AUC of 0.982 on the F2F-1 and F2F-3 datasets, while XGBoost with Mol2vec achieved the highest AUC of 0.992 on the F2F-2 dataset. Additionally, GCN achieved the highest AUC of 0.984 on the C2C dataset and 0.989 on the A2A dataset. Meanwhile, all models performed significantly worse in cross-species predictions. The graph models still attained an AUC above 0.8 for AC2F-same and above 0.76 for AC2F-diff. Notably, the best model for AC2F-diff was DNN with the MACCS fingerprint, with an AUC of 0.821. The inferior performance in cross-species predictions compared to single-species predictions is likely due to the different toxicity mechanisms in the different organisms.
Overall, the optimal predictive models for same-species predictions show remarkable accuracy. The hope has been to make inferences using prediction models trained on ecotoxicity data from species of lesser ethical concern (like algae and crustaceans). Unfortunately, our study indicates that, despite this ambition, cross-species predictions remain a formidable challenge with the existing dataset. Model performance is easiest to interpret when focusing on positive-sample (i.e., toxic compound) predictions. A cross-species model that has previously encountered a toxic compound may still have a considerable chance (11 to 15%) of misclassifying it as non-toxic (AC2F-same models, with SE values ranging from 0.847 to 0.890). Moreover, a cross-species model that has never encountered the compound may misclassify it even more often (>30%) (AC2F-diff models, with SE values ranging from 0.337 to 0.702). Given that misclassifying a toxic compound as non-toxic can result in severe damage to ecosystems, there remains an urgent need for substantial improvement in cross-species prediction for both known and novel compounds.
One important factor contributing to the difficulty of accurate cross-species prediction is the phylogenetic distance among species. Phylogenetic distance reflects evolutionary divergence that affects molecular interactions and toxicity mechanisms. For instance, crustaceans, algae, and fish are separated by considerable evolutionary gaps, leading to differences in physiological processes and molecular recognition patterns. This biological divergence increases heterogeneity in cross-species datasets and reduces model transferability. Therefore, considering phylogenetic distance helps deepen the understanding of performance drops in cross-species prediction and offers theoretical support for developing models that better incorporate multi-species evolutionary information.
Supplementary Material
Acknowledgments
This study is supported by Macao Polytechnic University. As part of the thesis work of X.L., this paper can be referred to by the submission code (s/c fca.272d.ecd6.c).
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.5c03753.
Table S1: EC50 intervals for binary toxicity classification; Tables S2–S8: Results of Morgan fingerprint-based prediction models on Fish-1, Fish-2, Fish-3, A2A, C2C, AC2F-same, and AC2F-diff test datasets; Tables S9–S15: Results of MACCS fingerprint-based prediction models on F2F-1, F2F-2, F2F-3, A2A, C2C, AC2F-same, and AC2F-diff test datasets; Tables S16–S22: Results of Mol2vec-based prediction models on F2F-1, F2F-2, F2F-3, A2A, C2C, AC2F-same, and AC2F-diff test datasets; Tables S23–S29: Results of graph-based prediction models on Fish-1, Fish-2, Fish-3, A2A, C2C, AC2F-same, and AC2F-diff test datasets; Figure S1: Distribution of data samples for F2F, A2A, C2C, AC2F-same, and AC2F-diff in molecular chemical space (LogP vs. molecular weight, calculated by RDKit); Figures S2–S4: Chemical space distribution, Tanimoto similarity frequency, and similarity matrices for the F2F dataset; Figures S5–S7: Chemical space distribution, Tanimoto similarity frequency, and similarity matrices for the A2A dataset; Figures S8–S10: Chemical space distribution, Tanimoto similarity frequency, and similarity matrices for the C2C dataset; Figures S11–S13: Chemical space distribution, Tanimoto similarity frequency, and similarity matrices for the AC2F-same dataset; Figures S14–S16: Chemical space distribution, Tanimoto similarity frequency, and similarity matrices for the AC2F-diff dataset (PDF)
S.W.I.S. conceived the idea. X.L., C.-w.U. and J.C. were involved in the data collection process, implementation, and experimentation. X.L., J.C., and S.W.I.S. were involved in the writing of the manuscript and in the interpretation of the results. All authors read and approved the final manuscript.
This work was supported by Macao Polytechnic University (grant no. RP/FCA-06/2024) and National Natural Science Foundation of China (grant no. 62172140).
The authors declare no competing financial interest.
References
- Wang, Z.; Walker, G. W.; Muir, D. C. G.; et al. Toward a Global Understanding of Chemical Pollution: A First Comprehensive Analysis of National and Regional Chemical Inventories. Environ. Sci. Technol. 2020, 54 (5), 2575–2584. DOI: 10.1021/acs.est.9b06379.
- Tlili, S.; Mouneyrac, C. New challenges of marine ecotoxicology in a global change context. Mar. Pollut. Bull. 2021, 166, 112242. DOI: 10.1016/j.marpolbul.2021.112242.
- Hellal, J.; Barthelmebs, L.; Bérard, A.; et al. Unlocking secrets of microbial ecotoxicology: Recent achievements and future challenges. FEMS Microbiol. Ecol. 2023, 99 (10), fiad102. DOI: 10.1093/femsec/fiad102.
- Gross, E. M. Aquatic chemical ecology meets ecotoxicology. Aquat. Ecol. 2022, 56 (2), 493–511. DOI: 10.1007/s10452-021-09938-2.
- De Boeck, G.; Rodgers, E.; Town, R. M. Using Ecotoxicology for Conservation: from Biomarkers to Modeling. In Fish Physiology; Academic Press, 2022; Vol. 39, pp 111–174.
- Thoré, E. S. J.; Philippe, C.; Brendonck, L.; et al. Towards improved fish tests in ecotoxicology - efficient chronic and multi-generational testing with the killifish Nothobranchius furzeri. Chemosphere 2021, 273, 129697. DOI: 10.1016/j.chemosphere.2021.129697.
- Jin, S.; Zeng, X.; Xia, F.; et al. Application of deep learning methods in biological networks. Briefings Bioinf. 2021, 22 (2), 1902–1917. DOI: 10.1093/bib/bbaa043.
- Wu, L.; Cui, P.; Pei, J.; et al. Graph Neural Networks: Foundation, Frontiers and Applications. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022; pp 4840–4841.
- Feng, Y. H.; Zhang, S. W. Prediction of drug-drug interaction using an attention-based graph neural network on drug molecular graphs. Molecules 2022, 27 (9), 3004. DOI: 10.3390/molecules27093004.
- Kipf, T. N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; et al. Neural Message Passing for Quantum Chemistry. In International Conference on Machine Learning; PMLR, 2017; pp 1263–1272.
- Xiong, Z.; Wang, D.; Liu, X.; et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 2020, 63 (16), 8749–8760. DOI: 10.1021/acs.jmedchem.9b00959.
- Cai, H.; Zhang, H.; Zhao, D.; et al. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Briefings Bioinf. 2022, 23 (6), bbac408. DOI: 10.1093/bib/bbac408.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016; pp 785–794.
- Schür, C.; Gasser, L.; Perez-Cruz, F.; et al. A benchmark dataset for machine learning in ecotoxicology. Sci. Data 2023, 10 (1), 718. DOI: 10.1038/s41597-023-02612-2.
- Rodrigues, N. M.; Batista, J. E.; Mariano, P.; et al. Artificial intelligence meets marine ecotoxicology: applying deep learning to bio-optical data from marine diatoms exposed to legacy and emerging contaminants. Biology 2021, 10 (9), 932. DOI: 10.3390/biology10090932.
- Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742–754. DOI: 10.1021/ci100050t.
- Durant, J. L.; Leland, B. A.; Henry, D. R.; et al. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42 (6), 1273–1280. DOI: 10.1021/ci010132r.
- Bento, A. P.; Hersey, A.; Félix, E.; et al. An open source chemical structure curation pipeline using RDKit. J. Cheminf. 2020, 12, 51. DOI: 10.1186/s13321-020-00456-1.
- Jaeger, S.; Fulle, S.; Turk, S. Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 2018, 58 (1), 27–35. DOI: 10.1021/acs.jcim.7b00616.
- Mikolov, T. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015; Vol. 28.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Ramsundar, B. Molecular machine learning with DeepChem. Ph.D. Thesis, Stanford University, 2018.
- Hu, L. Y.; Huang, M. W.; Ke, S. W.; et al. The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus 2016, 5, 1304. DOI: 10.1186/s40064-016-2941-7.
- Clarke, M. R. B. Pattern classification and scene analysis. 1974.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. DOI: 10.1023/A:1010933404324.
- Suthaharan, S. Support vector machine. In Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, 2016; Vol. 36, pp 207–235. DOI: 10.1007/978-1-4899-7641-3_9.
- Heikamp, K.; Bajorath, J. Support vector machines for drug discovery. Expert Opin. Drug Discovery 2014, 9 (1), 93–104. DOI: 10.1517/17460441.2014.866943.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521 (7553), 436–444. DOI: 10.1038/nature14539.
- Crandall, C. A.; Goodnight, C. J. The Effect of Various Factors on the Toxicity of Sodium Pentachlorophenate to Fish. Limnol. Oceanogr. 1959, 4 (1), 53–56. DOI: 10.4319/lo.1959.4.1.0053.
- Prasad, G. S.; Rout, S. K.; Malik, M. M.; et al. Occurrence of xenoestrogen alkylphenols (octylphenols and nonylphenol) and its impact on the aquatic ecosystem. In Xenobiotics in Aquatic Animals: Reproductive and Developmental Impacts, 2023; pp 275–284. DOI: 10.1007/978-981-99-1214-8_13.
Data Availability Statement
The experimental code and data are provided at https://github.com/pdssunny/ecotoxicology. The ecotoxicology prediction web server for fish, algae, and crustaceans is accessible at https://app.cbbio.online/ecotoxicology/home.