Author manuscript; available in PMC: 2025 Apr 15.
Published in final edited form as: Chem Res Toxicol. 2024 Mar 18;37(4):600–619. doi: 10.1021/acs.chemrestox.3c00411

Systematic approaches for the encoding of chemical groups: a case study

Panagiotis G. Karamertzanis†,*, Grace Patlewicz, Marta Sannicola, Katie Paul-Friedman, Imran Shah
PMCID: PMC11258607  NIHMSID: NIHMS1999147  PMID: 38498310

Abstract

Regulatory authorities aim to organize substances into groups to facilitate prioritization within hazard and risk assessment processes. Often, such chemical groupings are not explicitly defined by structural rules or physicochemical property information. This is largely due to how these groupings are developed, namely through a manual expert curation process, which in turn makes updating and refining groupings, as new substances are evaluated, a practical challenge. Herein, machine learning methods were leveraged to build models that could preliminarily assign substances into predefined groups. A set of 86 groupings containing 2,184 substances, as published on the European Chemicals Agency (ECHA) website, was mapped to the U.S. Environmental Protection Agency (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database content to extract chemical and structural information. Substances were represented using Morgan fingerprints, and two machine learning approaches were used to classify test substances into 56 groups containing at least 10 substances with a structural representation in the dataset: k-nearest neighbor (k-NN) and random forest (RF), which led to mean 5-fold cross-validation test accuracies (average F1 scores) of 0.781 and 0.853, respectively. With a 9% improvement, the RF classifier was significantly more accurate than k-NN (p-value = 0.001). The approach offers promise as a means of initial profiling of new substances into predefined groups to facilitate prioritization efforts and streamline the assessment of new substances where earlier groupings are available. The algorithm to fit and use these models has been made available in the accompanying repository, enabling both use of the produced models and refitting of these models as new groupings become available from regulatory authorities or industry.

Graphical Abstract


Introduction

Grouping approaches to help inform chemical categories and associated read-across have been in practical use in regulatory programmes for many years by both industry and regulatory authorities1-3. There are clear benefits in that risk assessment and management of chemicals can be performed more efficiently on the basis of groups of chemicals rather than on a chemical-by-chemical basis. Notable examples include the High Production Volume (HPV) categories published over 20 years ago under the auspices of the Organisation for Economic Co-operation and Development (OECD). Most of these HPV categories were underpinned by the principle that the hazard profile was related to some function of chemical structure. Indeed, structural similarity is by far the most common grouping basis, although other criteria such as composition, manufacturing process, and commonality in precursor or breakdown products are also used4.

In the case of a new category, its definition in terms of its endpoints and members can benefit from computational approaches. The choice of computational approach is dependent on the decision context and starting point. There are two main routes by which categories might be developed, a top-down or a bottom-up approach5. In the bottom-up approach, the starting point is a single chemical or small group of chemicals with the intention of “building up a category”. This is the typical read-across use case, where a chemical of interest (target) is lacking specific data points and candidate source analogues with relevant data are identified and evaluated. The workflow for this analogue approach is well described in the existing OECD grouping guidance4 and elsewhere6,7, whilst in silico tools that can facilitate this process have significantly evolved in the last decade. Tools such as the OECD Toolbox8, ToxRead9 and Generalised Read-Across10,11 are just a few examples which encode this type of workflow and facilitate the analogue identification and evaluation steps before a read-across prediction can be made.

In contrast, the top-down approach starts from a predefined set of chemicals with the intention of grouping some or all substances into one or more categories. The starting set of chemicals might comprise tens of substances sharing one common structural functional moiety. For example, a set of per- and polyfluoroalkyl substances (PFAS) could be sub-categorized into groups of more closely structurally related PFAS. Examples where such sub-categorizations have been used include the selection of diverse PFAS for testing12. Other examples include phthalates for exposure reduction13. In the top-down approach, the starting group of chemicals might even comprise a large diverse inventory of substances, such as an extensive regulatory inventory from which ‘categories’ are extracted that could help prioritize further action. Such structural categories serve as initial hypotheses that require further assessment, for example using mechanistic or metabolic information, to better refine the category and ensure robustness for toxicity inference. One example of this type of prioritization has made use of the Threshold of Toxicological Concern (TTC), a scheme for establishing health-protective exposure thresholds, to partition substances into different categories including the three Cramer structural classes14,15. Comparable categorizations have been applied in regulatory frameworks, notably Health Canada’s Domestic Substances List16. In the EU, groups of industrial chemicals have been developed to facilitate possible risk management measures under REACH17. For each of these groups, an assessment is performed and published to support a more consistent approach for possible risk management measures. Within the EPA, efforts to help identify potential priority candidates under the Toxic Substances Control Act (TSCA) have been undertaken using a combination of toxicity data and in silico model predictions of toxicity and exposure18.

In practice, the top-down approach may rely on unsupervised or supervised machine learning methods. Unsupervised approaches rely on characterizing the substances within the inventory by some molecular representation and computing categorical or continuous structural descriptors or binary chemical fingerprints. Commonly used statistical techniques such as principal components analysis (PCA), hierarchical clustering, self-organizing maps or t-distributed stochastic neighbor embeddings are then used to subset the descriptor/fingerprint information into subgroups, possibly after reducing the dimensionality of the features. Supervised approaches to grouping are also feasible, though here information about the activity/toxicity of the chemicals is taken into account in addition to the structural/fingerprint information. Recursive partitioning19 that can discriminate on the basis of toxicity might be employed. Ranking methods20 are yet another means by which sub-categorization incorporating toxicity profiles might be performed. Total and partial order ranking have also been successfully employed to demonstrate potential utility in prioritizing substances by their Persistence, Bioaccumulation and Toxicity (PBT) profile20.

The aim of this study is to develop a generic and automated approach that takes as input a set of substance groups generated by a regulatory authority or industry as part of a chemical safety assessment or regulatory process, and builds a supervised machine learning model that can predict the group membership of any given structure. By automating the model building we essentially allow the systematic encoding of an evolving substance group repository so that new substances can be provisionally added to existing groups. The potential of the approach is demonstrated by using as features binary fingerprints generated from the structures of the substances in the groups. In principle, the features can contain independent variables other than fingerprint bits, provided that this information is available for the substances in the training set and can be readily generated for new structures. Such variables could be, for example, physicochemical or ADME properties. This would make the model more selective when placing substances in groups, but requires that the groups in the training set were generated so that the members have similar properties. Hence, although the method introduced in this paper is purely based on chemical structure as encoded by binary fingerprints, it can be readily extended to accommodate more elaborate grouping hypotheses. The assessment of the toxicological robustness of the groups is out of scope for this analysis, as this depends on the regulatory purpose and especially on the level of certainty, which differs between screening and definitive hazard assessment for risk management purposes. The method introduced here takes the groups in the training set as a given and attempts to deduce, encode and replicate the grouping logic, without analyzing, questioning or improving its regulatory rationale, using the same structural information used when the groups were formed. Such classification methods are frequently used for developing models to predict toxicity21, but have not been used for encoding the grouping of substances.

Methods

Substance group and chemical structure data

ECHA has formulated groupings to accelerate a screening level assessment of all high-volume substances registered under REACH and systematize the identification of substance groups for which risk management measures may be needed for one or more of the group members. At the time of writing, assessments have been published for 86 of these groupings. The substance groupings were downloaded from the ECHA website in February 2023 as an Excel file22 and can be found in the SI (S1_2023_02_03_assessment-of-regulatory-needs--arn—export.xlsx). As a first step, we removed all entries that corresponded to other assessment types, such as risk management option analysis and screening carried out by Member States prior to the introduction of group assessments by ECHA. This produced a dataset with 2,184 substances in 86 substance groups.

Although the assessment reports contain a narrative description of the principles and criteria underpinning each group in terms of the common chemical structural moieties of the group members, the export included neither the chemical structures needed for modelling nor a structural representation of the group definition, such as the functional groups that need to be present or absent. Hence, we first sought to derive molecular structures from the provided identifiers. Names and Chemical Abstracts Registry Numbers (CASRN) were used to query EPA’s Distributed Structure-Searchable Toxicity database23 (DSSTox) to retrieve DSSTox Substance identifiers (DTXSIDs), primary CASRN and structural information in the simplified molecular-input line-entry system (SMILES) format. ChemReg, EPA’s internal registration system for DSSTox, was used to conduct the queries to access the most current version of the DSSTox database. A tag for DSSTox Quality Control (QC) level was also retrieved, which provided a measure of the level of manual vs programmatic curation that was performed to verify unique chemical identifiers, i.e., CASRN-name-structure mappings. Of the 2,184 substances, 2,138 could be mapped to content in DSSTox, and 1,541 of these were associated with a molecular structure through SMILES. DSSTox also provides QSAR-ready molecular structures that have been extensively used for building toxicity models24. However, they were not used for this analysis because the desalting step would remove fingerprint bits that are critical for the correct assignment of a structure to a group. One of the DSSTox SMILES strings could not be resolved into a molecular object by RDKit25, the open-source cheminformatics toolkit used in this study, and hence the final dataset for subsequent analysis contained 1,540 substances. Each substance group is identified by a name prepended with the group number; this designation is used in all subsequent figures and tables herein. The result of the curation procedure has also been deposited in the SI (S2_ARN_groups.xlsx).

The 1,540 substances with an available structure had various levels of quality control (QC) flags in DSSTox. There were 1,341 substances with a high-confidence QC tag, denoting high accuracy and consistency of chemical identifiers achieved either by expert curation or by programmatic curation from high-quality sources (929 DSSTox_High, 402 Public_High_CAS, 10 Public_High), whereas 199 substances had a medium or low QC tag for the consistency of chemical identifiers. Due to the relatively small dataset, we used all molecular structures regardless of their quality, which could have a detrimental effect on the predictive ability of the models. Future refitting of the models could apply such structure quality filters as the number of groups and group members grows.

Figure 1 shows the 86 substance groups and visualizes the availability of structural representations for each group. Some groups contained only a low number of substances or a low number of substances with a structural representation. Given that in this study the models learned group membership by relying exclusively on the molecular structure, we attempted to predict group membership only for the 56 groups that had at least ten structures. The remaining 30 groups had their structures pooled together into an artificial “miscellaneous chemistry” group. This gives the model the opportunity to assign a class to all structures in the training set. This procedure gave a reasonably balanced dataset. The median number of structures in the 56 explicitly modelled groups was 21.5, whilst 50% of the groups had between 14 (inclusive) and 32 (exclusive) structures. A quarter of the explicitly modelled groups had sizes in the range [10, 14] and another quarter had 32 or more structures, up to the maximum of 57. The “miscellaneous chemistry” group had 178 structures (S3_ARN_stats.xlsx in the SI contains detailed statistics on group sizes and structural availability). In principle, the substances in the “miscellaneous chemistry” group could be removed from the training set, relying on the applicability domain to ensure that no predictions are generated for them. In practice, however, some of these substances would be within the applicability domain, given that the latter primarily relies on overall structural similarity (see section “Applicability domain”), and hence they might erroneously be predicted to belong to the structurally closest large group. Retaining the catch-all group is useful because the model can generate a “miscellaneous chemistry” prediction, indicating that this chemistry has been seen in the training set but that the structural information is not sufficient to support an explicit group (class) in the classification problem. Over time, as grouping work by authorities progresses, smaller groups will acquire more members and will then be taken out of the miscellaneous chemistry group and explicitly modelled.

Figure 1.

Figure 1

Each pie chart represents one substance group; the number in the center denotes the group size, i.e., the number of substances in the chemical group; green denotes that a structure is available and black that no structure is available. The groups are arranged according to the structural availability percentage. Groups labelled in red comprised fewer than 10 substances with an available structure; these were pooled together into a separate class termed “miscellaneous chemistry” for modelling purposes. All other groups were included as their own class in the model and possessed structural coverage ranging from 28% to 100% with a median of 80%.

Some of the assessment reports partition the substances into subgroups based on known or expected differences in toxicity, mode of action or use patterns. The subgrouping information could tighten the structural diversity of the modelled groups and increase the model performance. However, it would also lead to an even smaller set of groups with a sufficient number of structures, and in addition subgrouping information is not available in the exported dataset. Hence, subgrouping may be investigated in the future as the number of groups grows further.

Following the retrieval of a representative chemical structure for as many substances as possible, we used the group membership as the dependent variable in a supervised machine learning approach to systematically codify the groups for re-use, as explained below. Every substance belongs to only one of the 57 modelled groups, leading to a multiclass, but not multilabel, classification problem. In this way, the existing groupings can be used prospectively to assign structural group membership to new chemicals and to compare the groupings to category schemes in use by other regulatory jurisdictions.

Chemical fingerprints

Morgan chemical fingerprints26 were calculated for all substances within the dataset using the open-source library RDKit25. Morgan fingerprints, also known as Extended Connectivity Fingerprints (ECFPs), are widely used in machine learning applications for cheminformatics. These circular fingerprints encode the molecular environment of every atom, with designations like ECFP2 or ECFP4 denoting the bond diameter of the environment. While ECFP2 provides a localized snapshot within a one-bond radius, tasks demanding a wider molecular perspective might lean towards ECFP4 or ECFP6. Critically, in drug discovery applications, ECFPs consistently demonstrate performance on par with, if not superior to, other fingerprinting methods27. On the other hand, Morgan fingerprints may poorly capture global features of the molecule, such as size and shape, and may fail to distinguish all positional isomers, linkers of different lengths or the exact amino acid sequence in a peptide28. Grouping of industrial chemicals does not typically have such requirements, and some of these challenges could be addressed by using a suitably tailored applicability domain, e.g., by deriving bounds for the range of molecular weights, degree of branching or maximum interatomic distances in the molecule.
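As an illustration of this step, the sketch below shows how binary Morgan fingerprints can be generated with RDKit; the radius and bit length are placeholders for the values explored later in the hyperparameter tuning, and the SMILES string is an arbitrary example rather than a dataset entry.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return a binary Morgan fingerprint as a NumPy array, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # e.g., the one unresolvable DSSTox SMILES mentioned above
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

x = morgan_bits("CCOC(=O)c1ccccc1")  # ethyl benzoate, an arbitrary example structure
```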

The performance of fingerprints can vary with the type of classification problem and the properties of the data sets29. Hence, the bit length and radius of the fingerprints were both included in the hyperparameter tuning of the classification models instead of being fixed. This is because the optimal fingerprint characteristics depend strongly on the modelled structures and because we intended to develop a group encoding approach that is sufficiently general to be used with inventories other than industrial chemicals.

The fingerprints in this study are binary, which means that they only encode the presence or absence of structural moieties and not the number of their occurrences. A consequence is that the developed models cannot distinguish between structures that differ only in the number of occurrences of a structural feature when the neighborhood of the feature is the same, as illustrated by the sketch below.
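For instance, n-heptane and n-octane contain the same set of radius-2 atom environments and so are expected to produce identical binary Morgan fingerprints, whereas a count-based fingerprint separates them because octane has one additional interior methylene environment:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

heptane = Chem.MolFromSmiles("CCCCCCC")
octane = Chem.MolFromSmiles("CCCCCCCC")

# Binary (presence/absence) fingerprints: the two alkanes share the same set of
# radius-2 atom environments, so the hashed bit vectors should coincide.
bits_heptane = AllChem.GetMorganFingerprintAsBitVect(heptane, 2, nBits=2048)
bits_octane = AllChem.GetMorganFingerprintAsBitVect(octane, 2, nBits=2048)
print(list(bits_heptane) == list(bits_octane))  # expected: True

# Count-based environments differ: octane has one more interior -CH2- environment.
counts_heptane = AllChem.GetMorganFingerprint(heptane, 2).GetNonzeroElements()
counts_octane = AllChem.GetMorganFingerprint(octane, 2).GetNonzeroElements()
print(counts_heptane == counts_octane)  # expected: False
```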

The molecular structures were not standardized beyond their standardization in DSSTox prior to the fingerprint calculation. Because standardization, such as the removal of explicit hydrogen atoms, affects the fingerprint calculation, it would be required in future applications of the developed algorithm whenever the molecular structures have not been generated in a consistent way.

Supervised machine learning

Two different machine learning approaches were employed, k-nearest neighbors (kNN) and a random forest (RF) classifier. The kNN model predicts group membership based on the similarity between chemicals, which was calculated using the Tanimoto (Jaccard) distance between Morgan binary fingerprints. The kNN model served as the baseline model, as it uses all structural information encoded in the fingerprints without being able to assign greater importance to individual bits. For example, the consistent presence of a structural moiety may be reflected in certain fingerprint bits being consistently on for all group members, but this information is not fully exploited by the kNN model, which only relies on the overall distance between fingerprint pairs. Given that predicting substance group membership is essentially a single-label, multiclass classification problem, ensemble methods that inherently support multiclass prediction were also tried. Although ensemble methods based on boosting are generally expected to perform better than bagging methods, this was found not to be the case in this study. In preliminary investigations using limited tuning of the default parameters as implemented in the gradient boosting Scikit-learn classifier30, gradient boosting did not seem to outperform random forest and, given its computational cost and the greater complexity of avoiding overfitting, was not pursued further. Hence, random forest was the ensemble method that was finally subject to more extensive hyperparameter tuning.

Both classification methods were used to predict group membership for the 56 large groups with ten or more structures and the miscellaneous chemistry group. Prior to fitting the models, the dataset was split into a training (80%) and an external (hold-out) set (20%) using random stratified sampling. Given that each class had at least ten structures, and due to the stratified sampling, all classes were present in both the training set (and its folds during cross-validation) and the external (hold-out) set. The external (hold-out) set was not used during cross-validation. It was only used for assessing the performance of the models.
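A minimal sketch of this split, assuming X holds the fingerprint matrix and y the group labels (variable names and the random seed are illustrative):

```python
from sklearn.model_selection import train_test_split

# Stratified 80/20 split: every one of the 57 classes is represented
# in both the training and the external (hold-out) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```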

For all models we applied a pre-processing step to remove fingerprint bits with zero variance, which also removed fingerprint bits that were zero for all points in the training set. It could well be that the removed bits are present in the structures of the external (hold-out) set, but these bits will not be used by the model, which relies on the same Scikit-learn pipeline developed during training. The consequence is that the model will ignore structural features that it has not seen during training, and it may predict a group membership with high probability even though the predicted structure contains more elements or functional groups than the structures in the training set for this class. To avoid spurious group assignments, we introduced an applicability domain as explained in the next section.
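A minimal sketch of this pre-processing, using scikit-learn's VarianceThreshold inside a Pipeline so that the same bits are dropped at training and at prediction time (the classifier settings are placeholders, not the tuned values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

# VarianceThreshold with its default threshold of 0.0 removes constant bits,
# including bits that are zero for every training structure; because it sits
# inside the Pipeline, the identical bit mask is reused at prediction time.
pipe = Pipeline([
    ("drop_constant_bits", VarianceThreshold()),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)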

Hyperparameter tuning used only the training set and was conducted using a nested grid search. The outer loop attempted different fingerprint characteristics, with the radius varied between 2 and 5 (inclusive) and the fingerprint length taking the values 1,536, 2,048 and 2,560. The inner loop optimized the classifier parameters using 5-fold cross-validation over a grid of hyperparameter values. Cross-validation performance was evaluated using the F1 score (F-measure), an accuracy metric for binary classification tasks computed as the harmonic mean of precision and recall. Its use in a multiclass problem requires its treatment as a collection of one-vs-rest binary problems, one for each class. The F1 score of the model in the inner grid search was computed as the arithmetic mean of the F1 scores of each class (macro-average), i.e., all classes were treated equally regardless of their support values. We then computed the average of these F1 scores over the five folds, and the resulting score was used to find the optimal fingerprint parameterization within the outer loop. Once the nested grid search was complete, we refitted the model using the optimal fingerprint and classifier parameters on the entire training set. Hence, the external (hold-out) set was only used to assess whether the above procedure produced a model that generalizes well with a good bias/variance balance, avoiding overfitting.
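A sketch of the nested search is shown below, reusing the pipe pipeline sketched above; fingerprint_matrix is a hypothetical helper that recomputes the fingerprints for each outer-loop combination, and the inner parameter grid is abbreviated relative to the grids reported in Table 1.

```python
from sklearn.model_selection import GridSearchCV

best = {"score": -1.0}
for radius in (2, 3, 4, 5):            # outer loop: fingerprint radius
    for n_bits in (1536, 2048, 2560):  # outer loop: fingerprint length
        X_fp = fingerprint_matrix(train_smiles, radius, n_bits)  # hypothetical helper
        inner = GridSearchCV(          # inner loop: classifier parameters, 5-fold CV
            pipe,
            param_grid={
                "clf__n_estimators": [50, 150, 300],       # abbreviated grid
                "clf__max_features": [0.002, 0.01, 0.02],
                "clf__min_samples_split": [2, 3, 4],
            },
            scoring="f1_macro",        # macro-averaged F1 over the 57 classes
            cv=5,
        )
        inner.fit(X_fp, y_train)
        if inner.best_score_ > best["score"]:
            best = {"score": inner.best_score_, "radius": radius,
                    "n_bits": n_bits, **inner.best_params_}
```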

The hyperparameter tuning of the kNN model in the inner grid search included the number of neighbors and the weight function used in the prediction, based either on uniform weights or weighing the points by the inverse of their distance. We relied on the standard Euclidean metric to measure distance between points.

For the RF model, the hyperparameter tuning in the inner grid search optimized the number of trees, the number of features considered when looking for the best split (max_features), expressed as a fraction of the total number of features, and the minimum number of samples required to split an internal node (min_samples_split). The latter ensures that, regardless of the depth, a node is considered for splitting only if it contains at least min_samples_split training samples. The number of features to consider at each split is generally an important parameter to tune in an RF ensemble and hence we increased the grid density accordingly. In principle, this parameter depends on the number of informative features, which is difficult to estimate upfront as it depends on the structural diversity and structural coherence of the substance groups. Increasing it decreases bias but increases variance and computational cost. The number of trees was varied up to 300. Using an extensive grid search with these three parameters was sufficient to strike a reasonable bias-variance trade-off during cross-validation, as shown by the comparison of the accuracy metrics during cross-validation and on the external (hold-out) set.

Scikit-learn offers several additional parameters to control the depth and overall complexity of the trees: the maximum depth of the tree (we did not use a limit), the minimum number of samples to be at a leaf node (we used the default of 1), the maximum number of leaf nodes (we did not use a limit) and the minimum impurity decrease required for each split (we used the default of no constraint). We did not tune these parameters because including the minimum number of samples required to split a node was deemed sufficient for controlling the tree complexity.

In the RF model, bootstrapping drew as many samples as the size of the training set. The quality of each split was measured using the Gini impurity. For both models all classes (substance groups) were assigned equal weights of one, because imposing a minimum number of substances in each explicitly modelled group avoided a significantly imbalanced training set.

Applicability domain

Several approaches have been proposed for assessing the applicability domain of models and ensuring reliable predictions.31,32 Here, we assume that a new structure is within the applicability domain if two conditions are simultaneously fulfilled. The first condition requires that the structure does not contain elements not seen in the training set. This was necessary because of the occurrence of toxicologically relevant metal cations not seen in the training set in substances whose organic part was clearly structurally similar to substances in the training set, leading to an overall high similarity. The second condition is based on a kNN approach and requires that the average distance of the new structure from the k closest neighbors in the training set is less than a defined threshold

$$\bar{d} = \frac{1}{k} \sum_{i=1}^{k} d_i \le \bar{d}_{\mathrm{thr}}$$

where the distances $d_i$ to the training set molecules are arranged in ascending order, i.e., $d_1 \le d_2 \le \dots \le d_N$, and $N$ is the number of molecules in the training set. The threshold $\bar{d}_{\mathrm{thr}}$ was computed so that 95% of the chemicals in the training set are within the domain, i.e., as the 95th percentile of the average distances

$$\bar{d}_j = \frac{1}{k} \sum_{\substack{i=1 \\ i \neq j}}^{k} d_{ij}, \qquad j = 1, \dots, N$$

for all molecules in the training set. Here, the distance was calculated from the Tanimoto (Jaccard) similarity using Morgan fingerprints of radius 2 and length 2,560, which were found to be optimal during the model tuning for both the baseline model and the random forest classifier. The whole fingerprint was used, i.e., we did not remove fingerprint bits not seen in the training set, because these bits should lead to higher distances when the applicability domain is evaluated. The k parameter was set equal to three. In this work, we modelled groups as separate classes when they had at least ten members, and hence we would expect that most substances in the training set have at least three structurally close analogues. In principle, the k parameter could be refined by examining how well the AD performs when the model is run on a large inventory of diverse chemical structures, but this was beyond the scope of this study.

The applicability domain used in this study is model agnostic, i.e., it applies equally to the baseline model and the random forest classifier.
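A minimal sketch of the second, distance-based condition under the stated settings (k = 3, Jaccard distance, threshold at the 95th percentile of the training set's own average k-NN distances); the fingerprint matrix names are illustrative, and the element check of the first condition would be applied separately:

```python
import numpy as np
from scipy.spatial.distance import cdist

K = 3  # number of nearest neighbors used in the domain definition

def mean_knn_distance(queries, train, k=K, exclude_self=False):
    """Mean Jaccard distance from each query to its k nearest training structures."""
    d = np.sort(cdist(queries, train, metric="jaccard"), axis=1)
    start = 1 if exclude_self else 0  # skip the zero self-distance on the training set
    return d[:, start:start + k].mean(axis=1)

# Threshold: 95th percentile of the training structures' own average distances.
d_thr = np.percentile(mean_knn_distance(X_train_fp, X_train_fp, exclude_self=True), 95)

# A new structure satisfies the distance condition if it lies close enough
# to the training set.
in_domain = mean_knn_distance(X_new_fp, X_train_fp) <= d_thr
```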

Feature importance

The feature importance of an RF can be computed through the mean decrease in impurity (MDI) or the mean decrease in accuracy (MDA)33. The MDI method is based on the mean accumulation of the impurity decrease within each tree and can be readily computed during the model fitting using Scikit-learn. The MDA method requires the permutation of feature values and the calculation of its effect on the overall model performance. Both methods provide an estimate of the global feature importance. In this study we relied on the MDI method due to the large number of features, which makes the MDA method computationally intensive. Moreover, by leveraging the impurity decrease as trees are built, we shed light on the features the RF is using to make splits. We also used a local feature importance method based on Shapley Additive exPlanations (SHAP) values34,35. The SHAP values are computed separately for each prediction but, for the purposes of this analysis, were aggregated at the substance group level so that the feature importance is estimated for each substance group separately.
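A sketch of both importance views is shown below, assuming a fitted classifier rf; the SHAP usage follows the classic TreeExplainer interface, which returns one array per class for multiclass models (newer SHAP versions return a single stacked array instead), and group_label is a placeholder for one of the group names.

```python
import numpy as np
import shap

# Global importance: MDI is available for free from the fitted forest.
mdi_importance = rf.feature_importances_
top_global_bits = np.argsort(mdi_importance)[::-1][:200]  # the ~200 important bits

# Local importance: per-prediction SHAP values, aggregated at the group level.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_train_fp)  # one array per class (classic API)
cls = list(rf.classes_).index(group_label)       # group_label is a placeholder
members = np.asarray(y_train) == group_label
group_importance = np.abs(shap_values[cls][members]).mean(axis=0)
```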

Using the model for a large inventory

The model building approach estimated the model generalization error using an external (hold-out) set. Having achieved this, we refitted the RF model once more, freezing the optimal fingerprint parameters identified in the nested grid search for computational efficiency and tuning the classifier parameters using all 1,540 substances with an available structure. The model was then used to predict the group membership of all REACH registered substances other than the substances in the assessed groups used to train the model. The inventory was constructed by retrieving the latest submissions of all 102,212 active and inactive registrations in the REACH-IT database36 as of 17 March 2023. The total number of unique substances was 22,345, out of which 16,161 had a CAS number and were processed further. Active and inactive registrations include all substances that are currently manufactured or imported in the EU as well as substances for which manufacture and import have ceased but may restart in the future without the need for a new registration.

Due to confidentiality constraints that prevented sharing of the full set of REACH registered substances and their identifiers between the two collaborating jurisdictions in this study, rather than using EPA’s internal ChemReg application to source DSSTox content directly, DSSTox structures were extracted from the download made available on the EPA website, which contains a structure-data file of the 2022 snapshot of the DSSTox database23. This file was processed by merging on the CAS number of the registered substance without further curation. For 12,626 of the substances, we obtained a chemical structure from DSSTox and predicted the group membership. The results were manually examined to check the plausibility of the predicted group memberships, given that the model was applied to an inventory of much wider structural diversity than the grouping dataset in the public assessments. In addition to predicting the most probable group, we also predicted the probability of the three most probable groups to assess how often the model predicted the correct group but not with the highest probability.
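A sketch of the top-3 extraction with predict_proba, assuming a fitted classifier rf and a fingerprint matrix for the inventory (names are illustrative):

```python
import numpy as np

proba = rf.predict_proba(X_inventory_fp)          # shape (n_substances, n_classes)
top3 = np.argsort(proba, axis=1)[:, ::-1][:, :3]  # indices of the 3 most probable groups
top3_groups = rf.classes_[top3]                   # group labels, most probable first
top3_proba = np.take_along_axis(proba, top3, axis=1)
```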

Software

Data processing was performed using NumPy37 and Pandas38. Fingerprints were generated using RDKit39. Model building and evaluation used Scikit-learn30, whilst all visualizations were generated with Matplotlib40 and Seaborn41. All Python modules to build and use the models are available in the accompanying repository42.

Results

Summary of chemical groups

The curation procedure resulted in 1,540 substances with a molecular structure, out of which 1,362 belong to 56 groups with ten or more substances each. Of the remaining 30 substance groups, 28 had between 2 and 9 substances with a structural representation and were pooled together into a structurally diverse group named “miscellaneous chemistry” with 178 substances. The remaining two groups, “montan, carnauba and rice bran waxes and their derivatives” and “cardanols”, had no structure available and were used neither in model development nor in validation (see S3_ARN_stats.xlsx in SI). Figure 3 shows one of the 56 groups that were given their own label in the classification models. It was not feasible to manually examine the correctness of the molecular structures in DSSTox and it may be that some structural representations are erroneous. As an illustration, checking the CAS number 58485-68-0 in SciFinder shows that it corresponds to 1-chloro-1,2,3,4-tetrahydronaphthalene, and hence the substance does not contain an ether or hydroxyl group as shown in Figure 3. Such errors are common in large inventories of substances despite curation efforts, and hence we wanted to explore the usefulness of the modelling methodology in realistic use cases.

Figure 3.

Figure 3

“Chlorinated aromatic hydrocarbon” group showing the 49 substances for which a structure is available in DSSTox. The group contained 57 substances in total.

As a first exploratory step, we calculated the pairwise similarity between all chemicals using Morgan fingerprints with a radius of 3 and a length of 2,048 bits. The distance matrix for all substance pairs for the 1,540 substances across the 56 groups is shown in Figure 4. Substances in the distance matrix are sorted by group membership to visualize the similarities and differences across all groups. The heatmap has an evident block-diagonal structure, highlighting the extent of structural coherence within the groups. Off-diagonal areas with low structural distance indicate that nearest neighbor approaches could result in poor group membership prediction in certain cases.
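A sketch of how such a Tanimoto distance matrix can be assembled with RDKit, assuming smiles_by_group is a list of SMILES pre-sorted by group membership (a hypothetical name):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 3, nBits=2048)
       for s in smiles_by_group]
n = len(fps)
dist = np.zeros((n, n))
for i in range(n):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps)  # one row of similarities
    dist[i] = 1.0 - np.asarray(sims)                        # Tanimoto distance
```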

Figure 4.

Figure 4

Pairwise distance matrix for structure pairs (Morgan fingerprints, radius 3, 2,048 bits). Darker color corresponds to smaller structural distance. The structures are arranged according to the group they belong to and in the same order for both rows and columns. Only groups with at least 10 structures are included. To improve readability, the group names are only shown for groups with at least 30 structures (Figure S7 in the SI includes all group names). The group names are prepended with the group number.

To further examine the coherence of groups, we reduced the dimensionality of the data set using the t-SNE algorithm43 as implemented in Scikit-learn30 and projected all 1,540 substances from 2,048 dimensions (using the same Morgan fingerprints as in Figure 4) into two. The projection used the default learning rate (‘auto’), the perplexity was set to 30, the metric was the Jaccard distance, and the initialization of the embedding relied on PCA. The t-SNE visualization in Figure 5 illustrates how several groups, such as “dihydropurinedione derivatives” (group number 53), are distinct from other groups. This is also evident in the pairwise distance heatmap (Figure 4 and Figure S7 in the SI), where the only close structural analogues for such groups lie in the block diagonal. Easily separable groups can likely be encoded using nearest neighbor clustering, while less separable groups will likely benefit from using all fingerprint bits as features. The t-SNE also suggests that a more complex ensemble machine learning method, like RF, could classify the substances into the 56 groups better than kNN.
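A sketch of these t-SNE settings with scikit-learn (version 1.0 or later for learning_rate='auto'; the random seed is illustrative):

```python
from sklearn.manifold import TSNE

embedding = TSNE(
    n_components=2,
    perplexity=30,
    metric="jaccard",      # Jaccard distance on the binary fingerprints
    init="pca",            # PCA initialization of the embedding
    learning_rate="auto",
    random_state=0,
).fit_transform(X_fp)      # X_fp: 1,540 x 2,048 binary fingerprint matrix
```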

Figure 5.

Figure 5

Visualization of substance proximity in 2 dimensions using t-SNE dimensionality reduction of the fingerprints (Morgan fingerprints, radius 3, 2,048 bits). Only groups with at least 10 structures are included. Clearly separable groups are annotated using the group number.

Model tuning and predictive performance

Fitting the models relied on a nested grid search that tuned both the fingerprint characteristics and the classifier parameters. At the optimal parameterization the baseline kNN model achieved a mean 5-fold cross-validation F1 (macro) score of 0.781. The RF clearly outperformed kNN, achieving a mean 5-fold cross-validation F1 (macro) score of 0.853, which is satisfactory performance for using the model in practice, e.g., for provisionally allocating newly registered substances under REACH into existing groups.

The F1 (macro) score does not substantially vary from one fold to another for the RF model. With the optimal RF parameterization the standard deviation of F1 (macro) score across the folds is 0.021 (see S6_outer_inner_grid_details_rf.xlsx in the SI). The standard deviation is consistently low for all cross-validation iterations. This confirms that the model generalizes stably across different subsets of data. With the optimal kNN parameterization the standard deviation of F1 (macro) score across the folds is somewhat higher at 0.028, but still low (see S7_outer_inner_grid_details_kn.xlsx in the SI).

Table 1 shows that the F1 (macro) score on the external (hold-out) set is higher than the cross-validation one for both models, with the difference being larger for the RF model. This further corroborates the good model generalization with no evidence of overfitting. The hyperparameter tuning fits a set of models on fewer data points than the full training set (80% of the training set for each fold, i.e., 986 data points) to identify the optimal fingerprint and model parameters. Following standard practice, we then performed a final model fitting on the whole training set (1,232 data points) using the optimal model parameters. The obtained model was then used to evaluate the model performance on the external (hold-out) set (308 data points). This means that the performance assessment on the external (hold-out) set used a model built with more data than during the cross-validation model fitting, which explains the increased F1 (macro) score for a model that generalizes well. It also suggests that the proposed modelling approach will improve further in the future as more groups become available and the training set size increases.

Table 1.

Hyperparameter tuning of the baseline kNN and RF classifier models. The parameter values of the optimal models obtained in the nested grid search are marked with an asterisk.

Model: kNN
    fingerprint radius: [2*, 3, 4, 5]
    fingerprint length: [1536, 2048, 2560*]
    number of nearest neighbors: [1, 2, 3*, 4, 5, 6, 7, 8, 9, 10]
    weights: [‘uniform’, ‘distance’*]
    Cross-validation F1 (macro) score: 0.784
    External (hold-out) set F1 (macro) score: 0.796

Model: RF
    fingerprint radius: [2*, 3, 4, 5]
    fingerprint length: [1536, 2048, 2560*]
    n_estimators: [50, 100, 150*, 200, 250, 300]
    max_features: [0.001, 0.002, 0.005, 0.01*, 0.02, .., 1.]
    min_samples_split: [2, 3*, 4]
    Cross-validation F1 (macro) score: 0.853
    External (hold-out) set F1 (macro) score: 0.861

Figure 6 shows the classifier grid search results for the optimal fingerprint characteristics (radius 2, length 2,560 for both models, as shown in Table 1). Neither model is very sensitive to its parameterization, with RF achieving almost equal performance for several parameterizations. The kNN method identified k=3 as the optimal number of nearest neighbors in the cross-validation. The groups that were explicitly modelled as a separate class in the classifier had at least ten group members, and hence one might have expected the cross-validation to lead to an optimal k value higher than three. However, even with this small k parameter there is no indication of overfitting, as also shown by the comparison of the F1 (macro) scores of the training and test sets. As shown in Figure 6, higher k values (n_neighbors) reduce the F1 (macro) score during cross-validation, indicating over-smoothing and hence a reduced ability to capture the local structure of the data. The cross-validation identified an odd optimal k value, which is beneficial in avoiding ties.

Figure 6.

Figure 6

Hyperparameter tuning results (a) Hyperparameter grid search results for the baseline model using the optimal fingerprint options (radius=2, length=2,560), (b), (c), (d) Hyperparameter grid search for the random forest classifier using the optimal fingerprint options (radius=2, length=2,560) for min_samples_split = 2, 3, and 4.

The model building approach varied the fingerprint parameters in the outer grid search. This was worthwhile and justified the additional computational cost, as it led to a more accurate model. Table 2 shows the cross-validation F1 (macro) score of the optimal RF and kNN models as the fingerprint radius and length were varied. In the case of the RF model, the span of the cross-validation F1 (macro) score between the best and worst performing fingerprint parameters is 0.03. Although this is a subtle effect, it is approximately half of the predictive performance difference between the optimal kNN and RF models. Moreover, the RF model tuning (inner grid search) using the optimal fingerprint parameters (radius 2 and length 2,560) varied the cross-validation F1 (macro) score between 0.808 and 0.853, i.e., a span of 0.045 (see S6_outer_inner_grid_details_rf.xlsx in the SI). This suggests that optimizing the fingerprint parameters is almost as important as fine-tuning the classifier parameters. As the fingerprint parameters vary in the outer grid search, the inner grid search proposes different optimal classifier parameters. This can be attributed to an interplay between fingerprint and classifier parameters and to the fact that the cross-validation metric does not seem to depend strongly on the classifier parameters.

Table 2.

Effect of the fingerprint parameters on model predictive performance. The row marked with an asterisk corresponds to the optimal fingerprint characteristics identified in the outer grid search. The SI provides detailed information for every cross-validation iteration (S6_outer_inner_grid_details_rf.xlsx and S7_outer_inner_grid_details_kn.xlsx).

Radius  Length  kNN optimal parameters   kNN cross-validation  RF optimal parameters               RF cross-validation
                (n_neighbors, weights)   F1 (macro) score      (max_features, min_samples_split,   F1 (macro) score
                                                               n_estimators)
2       1536    5, distance              0.771                 0.01, 4, 200                        0.850
2       2048    4, distance              0.781                 0.002, 4, 250                       0.845
2       2560    3, distance              0.784                 0.01, 3, 150                        0.853  *
3       1536    4, distance              0.766                 0.02, 3, 150                        0.838
3       2048    4, distance              0.781                 0.005, 4, 200                       0.834
3       2560    4, distance              0.780                 0.02, 3, 250                        0.842
4       1536    3, distance              0.771                 0.005, 4, 250                       0.825
4       2048    4, distance              0.779                 0.2, 3, 250                         0.830
4       2560    4, distance              0.778                 0.02, 3, 100                        0.839
5       1536    4, distance              0.773                 0.02, 2, 150                        0.823
5       2048    4, distance              0.776                 0.15, 2, 200                        0.821
5       2560    4, distance              0.776                 0.1, 2, 300                         0.831

The importance of optimizing the fingerprint parameters is also observed with kNN, albeit more weakly, and hence does not seem to be specific to the classification model. Both RF and kNN led to the same optimal fingerprint parameters, which further corroborates the hypothesis that the effect of the fingerprint parameters on model performance is genuine and worth considering.

We also used the gradient tree boosting classifier (GB) as implemented in Scikit-learn, fixing the fingerprint parameters to the optimal values identified as explained above (radius 2 and length 2,560) and conducting the inner grid search (see S8_outer_inner_grid_details_gb.xlsx for the model parameters included in the hyperparameter tuning). The best cross-validation F1 (macro) score obtained was 0.856, which is very close to that obtained with RF, but the computational cost per cross-validation iteration was over an order of magnitude larger than with RF. Moreover, GB requires more cross-validation iterations because a larger number of parameters is expected to affect performance and because the cross-validation F1 (macro) score varies more strongly during the inner grid search, which requires a denser grid. The higher model tuning computational cost would be challenging in a regulatory setting, given that the model would need to be rebuilt periodically as more groups are generated and evaluated. Using a histogram-based gradient boosting algorithm would not help given that our features are binary. Hence, we did not investigate boosting methods further.

Classification performance across groups

The previous section analyzed the overall predictive performance of the models. In this section we analyze how the models perform in predicting the group membership of the 56 explicitly modelled groups. Following the nested grid search, we fitted the models on the whole training set and used the fitted models to derive the metrics for both the training and external (hold-out) sets. Figure 7 shows the confusion matrices of the kNN and RF models for the external (hold-out) set, whilst Table 3 shows the precision, recall and F1 metrics (classification report) for each group, also for the external (hold-out) set. Figures S1-S4 in the SI provide a visualization of the precision, recall and F1 metrics for both the training and external (hold-out) sets. In this section, we do not discuss the metrics for the training set in detail and focus on the external (hold-out) set only, in order to assess how the models would predict the individual group membership of previously unseen structures.

Figure 7.

Figure 7


Confusion matrix of the (a) kNN, and (b) RF model for the external (hold-out) set. The rows correspond to the true groups and the columns to the predicted ones.

Table 3.

Classification report for the external (hold-out) set with optimal fingerprint and model parameterization. The rows are arranged in descending F1-score with the RF model. The SI contains a visualization of the classification report for both models and training/test datasets (Figures S1-S4).

Columns: group number and name; precision (kNN, RF); recall (kNN, RF); F1-score (kNN, RF); support
(85) (tetrahydro)furan primary alcohol derivatives and their oxidation products 0.833 1.000 1.000 1.000 0.909 1.000 5
(84) 1,2-ethanediols and their carbonates 0.750 1.000 1.000 1.000 0.857 1.000 3
(80) Aliphatic nitriles 1.000 1.000 0.900 1.000 0.947 1.000 10
(79) Aliphatic primary amides 1.000 1.000 1.000 1.000 1.000 1.000 6
(78) Alkyl aryl and cyclic diaryl esters of phosphoric acid 0.667 1.000 1.000 1.000 0.800 1.000 2
(73) Aralkylaldehydes 1.000 1.000 0.800 1.000 0.889 1.000 5
(72) Aromatic nitriles 0.857 1.000 0.857 1.000 0.857 1.000 7
(64) Brominated cycloalkanes, alcohols, phosphates, triazine triones, diphenyl ethers and diphenyl alkyls (flame retardants related substances) 0.833 1.000 0.833 1.000 0.833 1.000 6
(57) Dialkyl (and diaryl) dithiophosphates (DDP) 1.000 1.000 1.000 1.000 1.000 1.000 2
(53) Dihydropurinedione derivatives 0.750 1.000 1.000 1.000 0.857 1.000 3
(52) Ditriazine stilbenedisulfonic acid dyes (optical brighteners) 1.000 1.000 1.000 1.000 1.000 1.000 4
(47) Glycidyl ethers and esters 1.000 1.000 0.750 1.000 0.857 1.000 8
(42) Linear aliphatic ketones 1.000 1.000 1.000 1.000 1.000 1.000 3
(41) Linear and branched alpha-beta unsaturated ketones 1.000 1.000 1.000 1.000 1.000 1.000 3
(31) Ortho-phthalates 0.714 1.000 0.714 1.000 0.714 1.000 7
(26) Polyol amines 1.000 1.000 0.750 1.000 0.857 1.000 4
(24) Salicylate esters 0.833 1.000 1.000 1.000 0.909 1.000 5
(23) Salicylic acid, its salts and alkylated derivatives 0.500 1.000 0.333 1.000 0.400 1.000 3
(18) Thioureas 0.667 1.000 1.000 1.000 0.800 1.000 2
(16) Vinylbenzene derivatives 1.000 1.000 1.000 1.000 1.000 1.000 3
(14) acrylate and methacrylate amines 1.000 1.000 1.000 1.000 1.000 1.000 3
(11) chlorinated aromatic hydrocarbons 0.833 1.000 1.000 1.000 0.909 1.000 10
(8) imidazoles 1.000 1.000 1.000 1.000 1.000 1.000 5
(3) tetrahydroxymethyl and tetraalkyl phosphonium salts 1.000 1.000 1.000 1.000 1.000 1.000 2
(17) Unsubstituted and linear aliphatic-substituted cyclic ketones 0.917 0.917 1.000 1.000 0.957 0.957 11
(13) aralkylamines 0.818 0.917 0.818 1.000 0.818 0.957 11
(6) primary aliphatic diamines and their salts 0.900 1.000 1.000 0.889 0.947 0.941 9
(58) Cyclic ethers 0.857 0.875 0.857 1.000 0.857 0.933 7
(32) Organic phosphonic acids, salts and esters 1.000 1.000 0.500 0.875 0.667 0.933 8
(9) hydrocarbyl siloxanes 1.000 1.000 0.833 0.833 0.909 0.909 6
(81) Acyl glycinates and sarcosinates 0.800 0.800 1.000 1.000 0.889 0.889 4
(43) Isophthalates, Terephthalates and Trimellitates 0.500 0.800 1.000 1.000 0.667 0.889 4
(21) Simple manganese compounds 0.600 1.000 0.600 0.800 0.600 0.889 5
(12) aromatic ethers 0.800 0.800 1.000 1.000 0.889 0.889 4
(5) simple vanadium compounds 1.000 1.000 0.600 0.800 0.750 0.889 5
(−) miscellaneous chemistry 0.906 0.865 0.806 0.889 0.853 0.877 36
(22) Simple Lithium compounds 0.857 0.778 0.857 1.000 0.857 0.875 7
(65) Branched/cyclic dialiphatic ethers (excluding alpha,beta-unsaturated ethers) 1.000 1.000 1.000 0.750 1.000 0.857 4
(50) Esters from linear saturated dicarboxylic acids and branched aliphatic alcohols 0.600 1.000 0.750 0.750 0.667 0.857 4
(37) Molybdenum and its simple compounds 0.600 0.750 1.000 1.000 0.750 0.857 3
(27) Polycarboxylic acid monoamines, hydroxy derivatives and their salts with monovalent cations 1.000 1.000 0.750 0.750 0.857 0.857 4
(7) nitroalkanes 0.857 0.750 1.000 1.000 0.923 0.857 6
(75) Alpha-chloro aliphatic carboxylate derivatives 0.583 0.727 0.875 1.000 0.700 0.842 8
(66) Branched carboxylic acids and its salts 0.750 0.727 0.750 1.000 0.750 0.842 8
(71) Benzoates 0.800 0.667 1.000 1.000 0.889 0.800 4
(49) Ethoxylated < C6 alcohols (other than methanol and ethanol); ethoxylated aromatic alcohols 1.000 0.667 1.000 1.000 1.000 0.800 2
(28) Phthalic anhydrides and hydrogenated phthalic anhydrides 0.857 0.714 1.000 0.833 0.923 0.769 6
(70) Bisphenol A (BPA) derivatives 0.667 0.750 0.500 0.750 0.571 0.750 4
(51) Esters from branched or non-aromatic cyclic dicarboxylic acids and aliphatic alcohols 0.750 1.000 0.600 0.600 0.667 0.750 5
(29) Paraben acid, salts and esters 0.750 0.750 0.750 0.750 0.750 0.750 4
(62) Caesium compounds 0.200 1.000 0.500 0.500 0.286 0.667 2
(1) thioxanthenones 1.000 1.000 1.000 0.500 1.000 0.667 2
(82) Acyl derivatives from alpha-amino acids other than glutamic acid, glycine or sarcosine 1.000 1.000 0.667 0.333 0.800 0.500 3
(38) Miscellaneous bisphenols 0.500 1.000 0.333 0.333 0.400 0.500 3
(15) Zirconium and its simple inorganic compounds 1.000 0.500 0.250 0.250 0.400 0.333 4
(59) Cyclic acetals from aldehydes 0.000 0.000 0.000 0.000 0.000 0.000 2
(36) Mono-, di-phenyl phosphite derivatives 0.000 0.000 0.000 0.000 0.000 0.000 2
Accuracy 0.8312 0.9026
macro avg 0.8089 0.8904 0.8164 0.8629 0.7963 0.8611 308
weighted avg 0.8448 0.9031 0.8312 0.9026 0.8246 0.8924 308

The confusion matrix of the RF model shows fewer group misclassifications than kNN, consistent with the overall better predictive performance seen in the previous section. Two groups, namely “(59) cyclic acetals from aldehydes” and “(36) mono-, di-phenyl phosphite derivatives”, have no true positives in the external (hold-out) set with either model, and for this reason their metrics in Table 3 are zero. However, this is not significant because there are only 2 substances in the external (hold-out) set for each of them, both of which are misclassified. Table 3 shows that, in addition to these 2 groups, there are only 9 other groups for which the F1 score with the RF model is less than 0.8. With the kNN model there are 18 groups with an F1 score less than 0.8.

We examined some of the substances in the test set that were misclassified by the RF model. The test set contains 5 substances in the group “esters from branched or non-aromatic cyclic dicarboxylic acids and aliphatic alcohols”, out of which 2 were erroneously predicted to belong to the group “alpha-chloro aliphatic carboxylate derivatives”. In both cases the predicted probabilities for the correct and wrongly predicted groups were similar, i.e., the model would have identified the correct group as the second most probable one. The misclassified substances are 1,3-diethyl 2-methylpropanedioate (CAS 609-08-5) and 1,3-diethyl 2-ethylpropanedioate (CAS 133-13-1), both structurally similar to the training set substance 1,3-diethyl 2-chloropropanedioate (CAS 14064-10-9) that belongs to the “alpha-chloro aliphatic carboxylate derivatives” group. Hexamethylcyclotrisiloxane (CAS 541-05-9) was erroneously predicted to belong to the group “aralkylamines” instead of the correct “hydrocarbyl siloxanes” due to a wrong structure in our database, based on which the predicted group is plausible. Given the overall classification performance of the model and the explicability of the few analyzed misclassifications, we did not further analyze individual predictions.

Figure 8 shows a swarm plot with the average F1 scores of the different groups for the substances in the external (hold-out) set with the kNN and RF models. The groups were partitioned into five pools depending on the group size, with the inner bin boundaries corresponding to the 20th, 40th, 60th and 80th percentiles. As the group size increases, the model has sufficient data to achieve higher F1 scores. Moreover, the RF model performs better than kNN, with the improvement being more visible for the larger groups.

Figure 8.

Figure 8

Average F1 score for each group for the substances in the external (hold-out) group as a function of the group size. The groups were partitioned into five pools according to the group size with the inner bin boundaries corresponding to the 20th, 40th, 60th, and 80th percentiles (the outer boundaries, 10 and 57, correspond to the smallest and largest group).

With the definition of the applicability domain given in the Methods, 35 out of 308 substances in the external (hold-out) set were out of the applicability domain. Of these 35 out-of-domain substances, 8 (or 23%) are misclassified by the RF model. Of the 273 in-domain substances, 22 (or 8%) are misclassified. This indicates that using the domain can help eliminate spurious group membership predictions. However, the prediction accuracy of the model does not deteriorate significantly even for out-of-domain substances, and hence the domain can provide secondary evidence for the accuracy of the prediction together with the predicted group membership probability. In the analysis so far, substances have been assigned to the group with the highest probability. However, the RF model generates a group membership probability for each modelled group and for each predicted structure, which can be leveraged to achieve the desired precision/recall trade-off.

We ran a paired (dependent) t-test on the F1 scores for the external (hold-out) set for the 57 classes with the optimal RF and kNN models shown in Table 3 using SciPy44. The alternative hypothesis was that the mean of the distribution underlying the RF F1 scores is greater than the mean of the distribution underlying the kNN F1 scores. The p-value is 0.001 and hence the null hypothesis is clearly rejected, i.e., the increase in F1 scores seen with the RF model is statistically significant.
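A sketch of this test with SciPy (version 1.6 or later for the alternative argument), assuming rf_f1 and knn_f1 hold the per-group F1 scores aligned by group (57 entries each):

```python
from scipy import stats

# One-sided paired t-test: is the mean RF F1 score greater than the mean kNN F1 score?
t_stat, p_value = stats.ttest_rel(rf_f1, knn_f1, alternative="greater")
```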

Feature importance using the random forest classifier

An analysis of the feature importance for the best random forest model with the MDI method shows that approximately 200 fingerprint bits are important (Figures S5 and S6 in the SI). However, many bits are “on” for particular groups even though they are not included in the 200 globally most important features. As an example, the group “ditriazine stilbenedisulfonic acid dyes (optical brighteners)”, with group number 52, has consistent “on” bits for several features that are globally not very important, as shown in Figure 9. This illustrates the need to keep all features and not reduce their number based on their global importance, as doing so would likely be detrimental to the predictive performance of the model for groups that contain rather infrequent structural moieties.

Figure 9. Visualization of the 500 globally most important features (fingerprint bits) as calculated for the optimal random forest classifier for the 1090 substances in the training set. The features are arranged so that their (MDI) global importance decreases from left to right, with the "on" bits shown in black. The color bar on the y-axis shows the group membership using the same colors as in Figure 2; the group number is shown on the right. Some groups are robustly identified by a small number of globally less important bits that are nevertheless consistently "on" for all group members, as is the case for groups 47, 52 and 70 (see fingerprint bits in the red box). The (MDI) global feature importance is also shown in Figures S5 and S6 in the supporting information.

Figure 9 shows the global feature importance for all substances in the training set. By leveraging SHAP values we can identify the most important features for each group separately. Figure 10 shows the mean absolute SHAP values of the 20 most important features for the 15 substances in the training set that belong to the "paraben acid, salts and esters" group. The plot does not distinguish between features with positive and negative SHAP values, which respectively increase or decrease the probability of a substance belonging to the group. The feature importance gradually levels off, with the 20th most important feature having a mean absolute SHAP value approximately 7 times smaller than that of the most important feature. A sketch of this per-group SHAP computation is given below.
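
The sketch below uses synthetic data and hypothetical names (X_group, class_idx); it is not the paper's actual code, but illustrates ranking fingerprint bits for one group by mean absolute SHAP value.

    import numpy as np
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # In the paper's setting X_group would hold the fingerprints of the 15
    # training substances of one group, and class_idx the position of that
    # group in rf.classes_.
    X, y = make_classification(n_samples=300, n_features=32, n_informative=10,
                               n_classes=4, n_clusters_per_class=1, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    class_idx = 2
    X_group = X[y == class_idx]

    explainer = shap.TreeExplainer(rf)
    shap_values = explainer.shap_values(X_group)
    # Older SHAP versions return one (n_samples, n_features) array per class;
    # newer versions return a single (n_samples, n_features, n_classes) array.
    sv = (shap_values[class_idx] if isinstance(shap_values, list)
          else shap_values[:, :, class_idx])

    mean_abs = np.abs(sv).mean(axis=0)           # mean |SHAP| per fingerprint bit
    top20 = np.argsort(mean_abs)[::-1][:20]      # the group's 20 key features
    print(top20)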

Figure 10. Feature importance (fingerprint bits) for the "paraben acid, salts and esters" group as calculated using the 15 substances in the training set. The structures of the 10 most important features are visualized in Figure 11.

In Figure 11 we can identify parts of the aromatic ring with the carboxylic and hydroxyl groups at the para position, but a Morgan fingerprint with radius 2 is not sufficient to capture all these structural moieties in a single fingerprint bit. A fingerprint radius of 2 also means that atom-centred fragments of both radius 1 and radius 2 are included in the fingerprint, so the features are naturally nested: when a carbonyl fragment is present, an sp2-hybridized oxygen atom fragment will also be present. Using nested features is not detrimental when fitting a random forest model, as each decision tree sees only a random subset of observations and a random subset of features, which mitigates overfitting. However, nesting somewhat reduces the interpretability of the feature importance, because each of the nested features is assigned an importance that is lower than the importance we would assign if the model used only one of them. The nesting can be inspected directly with RDKit, as in the sketch below.
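
The sketch below uses RDKit's bitInfo mechanism to list, for each "on" bit, the (atom index, radius) environments that set it; methylparaben is used as an example structure.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Methyl 4-hydroxybenzoate (methylparaben) as an example structure.
    mol = Chem.MolFromSmiles("COC(=O)c1ccc(O)cc1")

    bit_info = {}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048,
                                               bitInfo=bit_info)

    # Each bit maps to (atom index, radius) tuples; radius-1 environments are
    # nested inside the radius-2 environments centred on the same atoms.
    for bit, envs in sorted(bit_info.items()):
        print(bit, envs)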

Figure 11. Structural representation of the 10 most important features (fingerprint bits) for the "paraben acid, salts and esters" group.

All 15 substances belonging to the "paraben acid, salts and esters" group in the training set were classified correctly. Four more substances belonging to the same group are in the external (hold-out) set. The SHAP values for these four substances and for the nine most important features are shown in Figure 12. The higher the SHAP value of a feature, the more it contributes to placing the substance in the group; in other words, a SHAP value represents the marginal effect that the corresponding feature value of a given substance has on the final predicted probability of the substance belonging to the "paraben acid, salts and esters" group. With one exception, all SHAP values are positive and increase the probability when the feature is present (indicated with "1 =" in Figure 12). There are no important features with a high absolute SHAP value when absent.

Figure 12. Waterfall plot of the SHAP values for four substances in the external (hold-out) set belonging to the group "paraben acid, salts and esters", namely CAS numbers 53201-62-0 (a), 94-13-3 (b), 96682-10-9 (c) and 109236-76-2 (d). The sum of the SHAP values and the expected value E[f(X)] is shown in the upper right corner of each waterfall plot and is equal to the predicted probability of assigning the substance to the "paraben acid, salts and esters" group.

Three of the substances in the external (hold-out) set were classified correctly and one was incorrectly predicted to belong to the "benzoates" group with a probability of 0.455, whilst the "paraben acid, salts and esters" group had the second highest probability at 0.236. For the three correctly predicted structures (Figure 12 a, b and d), the 8 features with the highest SHAP values are among the 10 most important features for the "paraben acid, salts and esters" group. For the incorrectly classified substance (Figure 12 c), only the first 4 features with the highest SHAP values are among the 10 most important features for the group. Notably, features x2114, x242, x1257 and x1831 are all absent because the hydroxyl group of the central paraben acid is esterified and hence the oxygen atom does not carry a hydrogen atom as in Figure 11. The lack of a hydroxyl group affects many important bits that collectively lead to a severe reduction of the predicted probability. Moreover, the structure is indeed also a benzoate. In this sense, the model prediction is overall plausible, and in practice it will be useful to apply expert judgement when the probability gap between the first and the second predicted group is small.

The SHAP values of the most important bits assist with the interpretation of the model predictions by indicating whether the presence or absence of the corresponding structural moiety increases or decreases the predicted probability of the structure belonging to a given group. The interpretation is facilitated by using a waterfall plot of the SHAP values, as in Figure 12, and visualizing the structure of the corresponding features. As a note of caution, due to fingerprint bit collisions there may be situations where the structural representation of a fingerprint bit seems erroneous. In such cases it is important to visualize all atomic environments in the training set that correspond to the investigated fingerprint bit, and not only the first one; a sketch of this check with RDKit follows.
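
The sketch below uses RDKit's Draw.DrawMorganBit with a few illustrative paraben-like SMILES and an arbitrarily chosen bit; drawing every matching environment across the molecules makes bit collisions visible.

    from rdkit import Chem
    from rdkit.Chem import AllChem, Draw

    smiles = ["COC(=O)c1ccc(O)cc1", "CCOC(=O)c1ccc(O)cc1", "O=C(O)c1ccc(O)cc1"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # Pick a bit of interest; here simply the lowest-numbered "on" bit of the
    # first molecule. In practice this would be a bit with a large |SHAP| value.
    bit_info = {}
    AllChem.GetMorganFingerprintAsBitVect(mols[0], radius=2, nBits=2048,
                                          bitInfo=bit_info)
    bit_of_interest = sorted(bit_info)[0]

    images = []
    for mol in mols:
        info = {}
        AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048,
                                              bitInfo=info)
        # Draw every atomic environment that sets the bit, not only the first.
        for example in range(len(info.get(bit_of_interest, []))):
            images.append(Draw.DrawMorganBit(mol, bit_of_interest, info,
                                             whichExample=example))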

Using the model against a large inventory

To further assess the predictive performance of the model, the RF model was applied to a large inventory of substances, namely the database of industrial chemicals registered under the EU REACH Regulation. As described in the methodology section, the model was fitted once more on all 1,540 substances by freezing the fingerprint characteristics to their optimal values given in Table 1 and tuning the classifier parameters. The mean 5-fold cross-validation F1 (macro) score of the best model was somewhat lower, 0.833 compared to 0.853, a difference that is within one standard deviation of the F1 (macro) score across the folds.

Predictions were generated for 12,624 substances, but the supporting information only contains substance identifiers for 12,560 substances due to confidentiality constraints (see S5_rf_application_1_results_redacted.xlsx in the SI). The results were manually analyzed. This is a more stringent test than the external (hold-out) set, because the latter only contains substances with chemistries similar to those in the training set owing to the stratified test/train split. The inventory used in this section also contains substances with previously unseen structural features, some of which may correspond to fingerprint bits that were eliminated when the model was built because they were never "on" in the training set and hence had zero variance.

For each substance, the RF model predicted the three most likely groups and the associated probabilities. For example, potassium 2-ethylhexanoate (CASRN 3164-85-0) was predicted to belong to the group "branched carboxylic acids and its salts" with probability 0.987, with the probabilities of the second group, "polycarboxylic acid monoamines, hydroxy derivatives and their salts with monovalent cations", and the third group, "simple manganese compounds", being negligible. A minimal sketch of extracting such a top-3 ranking from a fitted classifier is shown below.
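
The sketch below uses synthetic stand-ins; in the paper's setting x would be the Morgan fingerprint of one substance and rf.classes_ would hold the group names.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=400, n_features=64, n_informative=15,
                               n_classes=6, n_clusters_per_class=1, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    x = X[0]
    proba = rf.predict_proba(x.reshape(1, -1))[0]
    # Report the three most likely groups with their probabilities.
    for rank, idx in enumerate(np.argsort(proba)[::-1][:3], start=1):
        print(f"{rank}. group {rf.classes_[idx]} (p = {proba[idx]:.3f})")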

Of the 12,624 substances with a structural representation, 1,102 were either in the training set or in the external (hold-out) set and were not analyzed further. This number is lower than 1,540 because the groups in ECHA's assessments also contain substances that do not have an active or inactive registration, and because of subtle differences in retrieving structural information from DSSTox. Hence, there are 11,522 substances with a DSSTox structure that were neither in the training set nor in the external (hold-out) set; this is the inventory to which all results presented in this section refer.

From the 11,522 predictions, we retained the 4,267 (~37%) that were within the applicability domain. Of these, 3,600 were assigned to one of the explicitly modelled groups, with the remainder assigned to the group "miscellaneous chemistry". We manually examined the 47 substances (out of the 4,267) for which the probability of the most likely group was 0.8 or more. Despite the limited number of substances fulfilling these criteria, the manual investigation was informative in providing a glimpse of the model behavior in a realistic use setting. The group generation work at ECHA is still in progress and hence not all generated groups have been disseminated yet; this analysis used internally available finalized or provisional groups.

Of the 47 substances, 19 are substances for which ECHA has not yet created a group. Of these, two substances (CASRN 59272-84-3 and 73772-46-0) were correctly assigned to the group "miscellaneous chemistry". The model assigned 13 of the substances to one of the 56 explicitly modelled groups, which in all cases was the closest structural match; however, in the majority of cases the structure contained additional functional groups, elements such as fluorine, or counterions not present in the substance group. As an example, 4-chlorobenzenethiol (CASRN 106-54-7) and o-chlorobenzenethiol (CASRN 6320-03-2) are both assigned to the "chlorinated aromatic hydrocarbons" group; none of the members of the original group contains a thiol functional group, but the remaining structural features match the group. The remaining 4 substances were correctly assigned to the group if we ignore the number of functional groups. As an example, aminocaproic acid (CASRN 60-32-2) is predicted to belong to the "primary aliphatic diamines and their salts" group, whilst it is in fact a mono-amine. This is expected given that the model uses binary fingerprints that capture only the presence of each atomic environment and not the number of its occurrences. Hence, in all cases the model generates a plausible prediction given the modelling methodology and the limited size of the training set. Considering only the groups in the training set, the model always proposes the most plausible group assignment, whilst any additional structural moieties would need to be examined by the grouping experts before confirming or refuting the predicted group membership.

For the remaining 28 of the 47 substances, the model prediction was compared to the final or provisional group assigned by expert judgement at ECHA. For 5 of the substances the prediction of the model agreed with the expert judgement. For the remaining 23 substances, the model once more correctly assigned the substance to the structurally closest group, but due to functional groups not known to the model, or known groups being present multiple times, experts chose to generate a new group. As an example, disodium dihydrogen ethylenediaminetetraacetate (CASRN 139-33-3) is predicted to belong to the group "polycarboxylic acid monoamines, hydroxy derivatives and their salts with monovalent cations"; based on expert judgement, however, the substance was instead placed in the newly created group "EDTA derivatives".

For none of the substances discussed in this section was the second or third most likely group more plausible than the first, but this is to be expected given the 0.8 probability threshold. A cursory look at predictions for which the most likely group has a probability below 0.8 shows that subjecting the second and third groups to expert judgement could be beneficial. Figure 12 shows an example (CASRN 96682-10-9) where the first two predicted groups are both structurally plausible.

Discussion

Using a set of substance groupings recently published by ECHA, we were able to build machine learning models that successfully predict group membership based on the molecular structure as represented by Morgan fingerprints. The optimal RF model achieved an F1 score of 0.853 on the external (hold-out) set and was used to predict the group membership of the substances in a large inventory of REACH registered substances. An added benefit of the RF model is its interpretability: by leveraging SHAP values, it is possible to quantify the importance of atomic environments globally, for a given substance group and for an individual prediction. Understanding how the model assigns probabilities to groups can assist with improving feature engineering and model tuning. This data-driven methodology has advantages over the more traditional approach of manually compiling SMARTS patterns, which are difficult to maintain as the set of available groupings evolves. The methodology assigns substances to existing groups by replicating the expert group assignment logic, regardless of whether the latter was based on overall structural similarity or on endpoint-specific considerations.

Leaving aside substances in the training and external (hold-out) sets, the model successfully identified a number of substances in the inventory of REACH registered substances that would be added to existing groups based on expert judgement. Approximately 37% of the REACH registered substances with a structure available were within the applicability domain of the model. For the majority of in-domain group membership predictions with probability exceeding 0.8, manual examination showed that it would have been more appropriate to place the substance in the "miscellaneous chemistry" group or to generate a new group. In all cases, the model assigned the substance to a structurally plausible group, whilst the overall structural similarity to the substances in the training set was sufficient for obtaining an in-domain prediction. This, and the limited in-domain coverage, does not indicate a low predictive performance of the model but is a consequence of the limited size of the training set. The model used for predicting the group membership for this large inventory was trained on 1,540 substances, i.e., approximately 10% of the 12,624 REACH substances for which a structure is available and a prediction was generated. This means that many of the predicted structures contain structural moieties not seen at all in the training set, which are therefore ignored by the model. Our RF model can only predict labels that are present in its training set, and it consistently assigns the most meaningful group among the groups it is aware of; as a supervised machine learning method, however, it is naturally incapable of generating new groups. Hence, we expect that refitting the model with a larger training set covering a larger part of the structural space in the inventory would alleviate this challenge. Given that grouping work by authorities is constantly advancing, with more groups being generated and disseminated, the usefulness of the model will continue to grow, whilst its predictions are already valuable as an aid to the experts generating the groups.

An important finding of the study is that fingerprint parameter optimization has a considerable effect on the model predictive performance, which justifies the additional computational cost of including the fingerprint characteristics in the grid search. Varying the fingerprint parameters affects the predictive performance in a similar way for the different classifiers, and hence the fingerprint parameter optimization could be conducted with only the least computationally demanding model. The model performance increases with fingerprint length, possibly because longer fingerprints reduce the number of bit collisions. To investigate this further, we computed the number of "on" bits for all 1,540 molecules in the starting dataset as a function of the Morgan fingerprint length and radius (Figure S8 in the SI); a sketch of such a convergence check is given below. The number of "on" bits stabilizes when the fingerprint length exceeds 8,000 with radius 2, whilst convergence requires even longer fingerprints for larger radii and is expected to become more demanding as the number of molecules in the dataset increases. Increasing the fingerprint radius is also expected to increase model performance, as the fingerprint then more accurately encodes the atomic environments in the molecule. We can hypothesize that the outer grid did not lead to a larger optimal fingerprint radius because the benefits of increasing the radius are counterbalanced by more frequent bit collisions. Allowing the outer grid to cover even longer fingerprints would increase the computational cost of model fitting and would require more elaborate feature selection; hence it is left as a topic for future research. Fingerprint collisions were also noted when visualizing the structure of fingerprint bits with large absolute SHAP values, as for some fingerprint bits the resulting structure of the atomic environment generated with RDKit depended on the training set molecule used.
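
The sketch below counts the distinct "on" bits across a handful of molecules for several fingerprint lengths and radii; the paper performed the analogous computation on all 1,540 dataset structures (Figure S8), and the exact counting convention there may differ.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    smiles = ["CCO", "c1ccccc1O", "CCOC(=O)CC(=O)OCC", "COC(=O)c1ccc(O)cc1"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # When the count of distinct "on" bits stops growing with the fingerprint
    # length, bit collisions have effectively been resolved.
    for radius in (2, 3):
        for n_bits in (1024, 2048, 4096, 8192, 16384):
            on_bits = set()
            for mol in mols:
                fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius,
                                                           nBits=n_bits)
                on_bits.update(fp.GetOnBits())
            print(f"radius={radius}, length={n_bits}: {len(on_bits)} distinct on bits")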

In this study we used binary fingerprints, which means that two structures with identical atomic environments that differ only in the number of functional groups present will be represented by the same binary fingerprint and hence will be predicted to belong to the same group. One possibility to alleviate this limitation is to represent bit counts by setting multiple bits for each feature, where the number of bits set is determined by the count. RDKit25 offers such a simple mechanism for capturing counts; a sketch is shown below. For example, a simple implementation could use two bits per feature so that if the count is one only the first bit is set, and if the count is two or more both bits are set. Although such an approach seems convenient, further investigation is needed because the two feature bits are not truly independent and, in addition, the effective length of the fingerprint is reduced by a factor of two, which increases the frequency of bit collisions. Beyond representing bit counts, we could also use other fingerprints, such as fingerprints based on atom pairs, alleviating some of the limitations of circular fingerprints, including their inability to encode the overall shape and size of the molecule45. Using such fingerprints can be investigated as a future research topic.
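
As a sketch, RDKit's fingerprint generator API exposes a count simulation option that sets additional bits as feature counts pass predefined thresholds; the default thresholds may differ between RDKit versions.

    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    # Count simulation: each feature can set several bits depending on how
    # many times the corresponding atomic environment occurs.
    gen = rdFingerprintGenerator.GetMorganGenerator(
        radius=2, fpSize=2048, countSimulation=True)
    mol = Chem.MolFromSmiles("CCOC(=O)CC(=O)OCC")  # diethyl malonate
    fp = gen.GetFingerprint(mol)
    print(fp.GetNumOnBits())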

The developed models currently use a feature vector comprising exclusively fingerprint bits, and hence the model predicts group membership based on structural moieties alone. The set of features could be expanded with measured or predicted physicochemical or ADME properties, or with mechanistic structural alerts from tools such as Derek Nexus46 or the QSAR Toolbox8,47. For such features to be used, they would also need to be considered systematically during the creation of the substance groups in the training set, which is not the case for the groupings used in this paper. However, this is not a limitation of the proposed approach and can be pursued in the future, leading to more toxicologically robust grouping encodings. Another interesting approach to explore would be to use kNN as a feature engine: kNN is used to predict the probability of a substance belonging to the most probable groups based on the fingerprints, and these probabilities are then used as features in a subsequent non-kNN classifier. This can be understood as a form of ensemble learning in which the first, nearest-neighbor stage adds local knowledge that is exploited at a second stage, which can also include features other than fingerprint bits; a sketch follows.
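
The sketch below illustrates this two-stage idea on synthetic stand-ins for fingerprints; out-of-fold kNN probabilities are used so that the training labels do not leak into the second-stage features.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=600, n_features=64, n_informative=20,
                               n_classes=5, n_clusters_per_class=1, random_state=0)

    # Stage 1: out-of-fold kNN group membership probabilities.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn_proba = cross_val_predict(knn, X, y, cv=5, method="predict_proba")

    # Stage 2: the classifier sees the original features plus the local
    # group-membership estimates contributed by the nearest-neighbor stage.
    X_stacked = np.hstack([X, knn_proba])
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_stacked, y)
    print(X_stacked.shape)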

Next steps will aim to apply the group modelling methodology to other landscapes of regulatory interest, such as the EPA's TSCA non-confidential active inventory, to explore the coverage of the groups. Furthermore, for the TSCA inventory, a comparison could be made between the group modelling methodology developed here and the EPA's own existing New Chemical Categories (NCC)1. Using the methodology with large inventories may benefit from further methodological improvements until the available substance groupings cover a larger part of the structural diversity in the inventories. One possibility is to use the whole inventory when building the model, by including the ungrouped substances in the miscellaneous chemistry group and handling the class imbalance with techniques such as over- or under-sampling. Another possibility is to make the applicability domain class specific, possibly by automatically leveraging the class-specific feature importance to build structural patterns that must be present or absent for each class. Such measures will reduce the frequency with which substances with structural features unseen in the training set are placed in an existing group, by either predicting the miscellaneous chemistry group as most probable or indicating that the substance is out of domain.

Conclusions

We developed a comprehensive machine learning methodology for encoding substance groups. The models capture the grouping hypothesis based solely on the molecular structure as represented by Morgan fingerprints. The RF model outperformed the kNN model, especially as the number of substances in the group increases. Using a training set with 1,540 substances it was possible to encode 56 publicly available substance groups with an F1 score of 0.853 on the external (hold-out) set. The RF model is interpretable both globally and locally, i.e., it allows quantifying the effect of each structural moiety on the final prediction. By applying the model to a large inventory, ten times larger than the training set, we obtained plausible group membership predictions in the sense that the model consistently identified the group that is structurally closest to the predicted structure. However, due to the limited size of the training set, predicted structures were often assigned to existing groups even when they contained functional groups that would warrant the generation of a new group in practice.

Supplementary Material

Supplement1

Figure 2. Dataset and model fitting methodology.

Acknowledgements

We gratefully acknowledge internal reviewers, Dr Nathaniel Charest and Dr Richard Judson, for critical review of this manuscript.

Footnotes

Disclaimer

The views expressed in this paper are those of the authors and do not necessarily reflect the views of the US Environmental Protection Agency or the European Chemicals Agency. Mention of trade names, algorithm packages or commercial products does not constitute endorsement or recommendation for use.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work presented in this paper.

Supporting information

The supporting information contains the following material:
  • S1_2023_02_03_assessment-of-regulatory-needs--arn—export.xlsx
    Dataset with ARN groups downloaded from https://echa.europa.eu/assessment-regulatory-needs in Feb 2023
  • S2_ARN_groups.xlsx
    Curated dataset with ARN groups with molecular structures and their quality scores, that was used for building the models.
  • S3_ARN_stats.xlsx
    Descriptive statistics for the 86 substance groups. For each group we provide the number of substances as in the ARN group, the number of substances matched in DSSTox, the DSSTox substance type and the number of substances with structural information and its quality.
  • S4_SystematicGroupingSI.docx
    Document with additional figures and explanations referred to in the manuscript.
  • S5_rf_application_1_results_redacted.xlsx
    Predicted groups, probabilities and domain assessment for all non-confidential substances registered under REACH.
  • S6_outer_inner_grid_details_rf.xlsx
    Cross-validation scoring results obtained in every iteration of outer and inner grid search for the random forest (RF) model.
  • S7_outer_inner_grid_details_kn.xlsx
    Cross-validation scoring results obtained in every iteration of outer and inner grid search for the nearest neighbor (kNN) model.
  • S8_outer_inner_grid_details_gb.xlsx
    Cross-validation scoring results obtained for the gradient boosting (GB) model. This dataset is only provided for completeness because the GB model was evaluated but not used further. Due to the computational cost we only performed the inner grid search using the optimal fingerprint parameters identified by the outer grid search with kNN and RF (radius 2, length 2,560).

References

  • (1).Chemical Categories Used to Review New Chemicals under TSCA ∣ US EPA. https://www.epa.gov/reviewing-new-chemicals-under-toxic-substances-control-act-tsca/chemical-categories-used-review-new (accessed 2023-12-19).
  • (2).Substance Groupings Initiative - Canada.ca. https://www.canada.ca/en/health-canada/services/chemical-substances/substance-groupings-initiative.html (accessed 2023-12-19).
  • (3).Assessment of regulatory needs list - ECHA. https://echa.europa.eu/assessment-regulatory-needs (accessed 2023-12-19).
  • (4).Guidance on Grouping of Chemicals, Second Edition; OECD Series on Testing and Assessment; OECD, 2017. 10.1787/9789264274679-en. [DOI] [Google Scholar]
  • (5).Tier G; Gallegos Saliner A; Pavan M; Worth A; Benigni R; Aptula A; Bassan A; Bossa C; Falk-Filipsson A; Gillet V; Jeliazkova N; Mcdougal A; Mestres J; Munro A; Netzeva T; Safford B; Simon-Hettich B; Tsakovska I; Wallén M. Chemical Similarity and Threshold of Toxicological Concern (TTC) Approaches: Report of an ECB Workshop Held in Ispra; 2005. [Google Scholar]
  • (6).Escher SE; Kamp H; Bennekou SH; Bitsch A; Fisher C; Graepel R; Hengstler JG; Herzler M; Knight D; Leist M; Norinder U; Ouédraogo G; Pastor M; Stuard S; White A; Zdrazil B; van de Water B; Kroese D Towards Grouping Concepts Based on New Approach Methodologies in Chemical Hazard Assessment: The Read-across Approach of the EU-ToxRisk Project. Arch Toxicol 2019, 93 (12), 3643–3667. 10.1007/s00204-019-02591-7. [DOI] [PubMed] [Google Scholar]
  • (7).Patlewicz G; Cronin MTD; Helman G; Lambert JC; Lizarraga LE; Shah I Navigating through the Minefield of Read-across Frameworks: A Commentary Perspective. Computational Toxicology 2018, 6, 39–54. 10.1016/j.comtox.2018.04.002. [DOI] [Google Scholar]
  • (8).Schultz TW; Diderich R; Kuseva CD; Mekenyan OG The OECD QSAR Toolbox Starts Its Second Decade. In Computational Toxicology: Methods and Protocols; Nicolotti O, Ed.; Springer New York: New York, NY, 2018; pp 55–77. 10.1007/978-1-4939-7899-1_2. [DOI] [PubMed] [Google Scholar]
  • (9).Benfenati E; Roncaglioni A; Petoumenou MI; Cappelli CI; Gini G Integrating QSAR and Read-across for Environmental Assessment. SAR QSAR Environ Res 2015, 26 (7–9), 605–618. 10.1080/1062936X.2015.1078408. [DOI] [PubMed] [Google Scholar]
  • (10).Patlewicz G; Shah I Towards Systematic Read-across Using Generalised Read-Across (GenRA). Computational Toxicology 2023, 25, 100258. 10.1016/j.comtox.2022.100258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (11).Shah I; Tate T; Patlewicz G Generalized Read-Across Prediction Using Genra-Py. Bioinformatics 2021, 37 (19), 3380–3381. 10.1093/bioinformatics/btab210. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (12).Patlewicz G; Richard AM; Williams AJ; Grulke CM; Sams R; Lambert J; Noyes PD; DeVito MJ; Hines RN; Strynar M; Guiseppi-Elie A; Thomas RS A Chemical Category-Based Prioritization Approach for Selecting 75 Per- and Polyfluoroalkyl Substances (PFAS) for Tiered Toxicity and Toxicokinetic Testing. Environ Health Perspect 2019, 127 (1). 10.1289/EHP4555. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (13).Maffini MV; Rayasam SDG; Axelrad DA; Birnbaum LS; Cooper C; Franjevic S; MacRoy PM; Nachman KE; Patisaul HB; Rodgers KM; Rossi MS; Schettler T; Solomon GM; Woodruff TJ Advancing the Science on Chemical Classes. Environmental Health 2023, 21 (S1), 120. 10.1186/s12940-022-00919-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (14).Nicolas CI; Linakis MW; Minto MS; Mansouri K; Clewell RA; Yoon M; Wambaugh JF; Patlewicz G; McMullen PD; Andersen ME; Clewell HJ III Estimating Provisional Margins of Exposure for Data-Poor Chemicals Using High-Throughput Computational Methods. Front Pharmacol 2022, 13. 10.3389/fphar.2022.980747. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (15).Patlewicz G; Wambaugh JF; Felter SP; Simon TW; Becker RA Utilizing Threshold of Toxicological Concern (TTC) with High Throughput Exposure Predictions (HTE) as a Risk-Based Prioritization Approach for Thousands of Chemicals. Computational Toxicology 2018, 7, 58–67. 10.1016/j.comtox.2018.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (16).Categorization of chemical substances - Canada.ca. https://www.canada.ca/en/health-canada/services/chemical-substances/canada-approach-chemicals/categorization-chemical-substances.html (accessed 2023-12-19).
  • (17).European Commission. Regulation (EC) 1907/2006 of the European Parliament and of the Council of 18 December 2006 - REACH. Official Journal of the European Union 2006, 396–849. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2006:396:0001:0849:EN:PDF. [Google Scholar]
  • (18).The Frank R. Lautenberg Chemical Safety for the 21st Century Act ∣ US EPA. https://www.epa.gov/assessing-and-managing-chemicals-under-tsca/frank-r-lautenberg-chemical-safety-21st-century-act (accessed 2023-12-19).
  • (19).Yang C; Rathman JF; Bienfait B; Burbank M; Detroyer A; Enoch SJ; Firman JW; Gutsell S; Hewitt NJ; Hobocienski B; Kenna G; Madden JC; Magdziarz T; Marusczyk J; Mostrag-Szlichtyng A; Krueger C-T; Lester C; Mahoney C; Najjar A; Ouedraogo G; Przybylak KR; Ribeiro JV; Cronin MTD The Role of a Molecular Informatics Platform to Support next Generation Risk Assessment. Computational Toxicology 2023, 26, 100272. 10.1016/j.comtox.2023.100272. [DOI] [Google Scholar]
  • (20).Worth A; Pavan M A Set of Case Studies to Illustrate the Applicability of DART (Decision Analysis by Ranking Techniques) in the Ranking of Chemicals. https://publications.jrc.ec.europa.eu/repository/handle/JRC47007 (accessed 2023-08-19). [Google Scholar]
  • (21).Liu J; Patlewicz G; Williams AJ; Thomas RS; Shah I Predicting Organ Toxicity Using in Vitro Bioactivity Data and Chemical Structure. Chem Res Toxicol 2017, 30 (11), 2046–2059. 10.1021/acs.chemrestox.7b00084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (22).Assessment of regulatory needs list. https://echa.europa.eu/assessment-regulatory-needs.
  • (23).Grulke CM; Williams AJ; Thillanadarajah I; Richard AM EPA’s DSSTox Database: History of Development of a Curated Chemistry Resource Supporting Computational Toxicology Research. Computational Toxicology 2019, 12, 100096. 10.1016/j.comtox.2019.100096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (24).Mansouri K; Grulke CM; Richard AM; Judson RS; Williams AJ An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling. SAR QSAR Environ Res 2016, 27 (11), 911–937. 10.1080/1062936X.2016.1253611. [DOI] [PubMed] [Google Scholar]
  • (25).Landrum G; Tosco P; Kelley B; Ric; sriniker; gedeck; Cosgrove D; Vianello R; Schneider Nadine; Kawashima E; N D; Dalke A; Jones G; Cole B; Swain M; Turk S; Savelyev Alexander; Vaucher A; Wójcikowski M; Take I; Probst D; Scalfani VF; Ujihara K; godin guillaume; Pahl A; Berenger F; JLVarjo; jasondbiggs; strets123; JP. Rdkit/Rdkit: 2022_09_5 (Q3 2022) Release. 2023. 10.5281/ZENODO.7671152. [DOI] [Google Scholar]
  • (26).Rogers D; Hahn M Extended-Connectivity Fingerprints. J Chem Inf Model 2010, 50 (5), 742–754. 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]
  • (27).O’Boyle NM; Sayle RA Comparing Structural Fingerprints Using a Literature-Based Similarity Benchmark. J Cheminform 2016, 8 (1), 36. 10.1186/s13321-016-0148-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (28).Capecchi A; Probst D; Reymond J-L One Molecular Fingerprint to Rule Them All: Drugs, Biomolecules, and the Metabolome. J Cheminform 2020, 12 (1), 43. 10.1186/s13321-020-00445-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (29).Orosz Á; Héberger K; Rácz A Comparison of Descriptor- and Fingerprint Sets in Machine Learning Models for ADME-Tox Targets. Front Chem 2022, 10. 10.3389/fchem.2022.852893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (30).Pedregosa F; Varoquaux G; Gramfort A; Michel V; Thirion B; Grisel O; Blondel M; Prettenhofer P; Weiss R; Dubourg V; Vanderplas J; Passos A; Cournapeau D; Brucher M; Perrot M; Duchesnay E Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830. [Google Scholar]
  • (31).Sahigara F; Mansouri K; Ballabio D; Mauri A; Consonni V; Todeschini R Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. MOLECULES 2012, 17 (5), 4791–4810. 10.3390/molecules17054791. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (32).Sahigara F; Ballabio D; Todeschini R; Consonni V Defining a Novel K-Nearest Neighbours Approach to Assess the Applicability Domain of a QSAR Model for Reliable Predictions. J Cheminform 2013, 5. 10.1186/1758-2946-5-27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (33).Bruce P; Bruce A; Gedeck P Practical Statistics for Data Scientists, second.; O’Reilly, 2020. [Google Scholar]
  • (34).Lundberg SM; Lee S-I A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30; Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, Eds.; Curran Associates, Inc., 2017; pp 4765–4774. [Google Scholar]
  • (35).Lundberg SM; Erion G; Chen H; DeGrave A; Prutkin JM; Nair B; Katz R; Himmelfarb J; Bansal N; Lee S-I From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell 2020, 2 (1), 2522–5839. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (36).European Chemicals Agency. REACH-IT. European Chemicals Agency: Helsinki. https://echa.europa.eu/web/guest/support/dossier-submission-tools/reach-it. [Google Scholar]
  • (37).Harris CR; Millman KJ; van der Walt SJ; Gommers R; Virtanen P; Cournapeau D; Wieser E; Taylor J; Berg S; Smith NJ; Kern R; Picus M; Hoyer S; van Kerkwijk MH; Brett M; Haldane A; del Río JF; Wiebe M; Peterson P; Gérard-Marchant P; Sheppard K; Reddy T; Weckesser W; Abbasi H; Gohlke C; Oliphant TE Array Programming with NumPy. Nature 2020, 585 (7825), 357–362. 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (38).pandas development team, T. Pandas-Dev/Pandas: Pandas. Zenodo; November 2023. 10.5281/zenodo.10107975. [DOI] [Google Scholar]
  • (39).RDKit: Open-Source Cheminformatics. February 23, 2023.
  • (40).Hunter JD Matplotlib: A 2D Graphics Environment. Comput Sci Eng 2007, 9 (3), 90–95. 10.1109/MCSE.2007.55. [DOI] [Google Scholar]
  • (41).Waskom ML Seaborn: Statistical Data Visualization. J Open Source Softw 2021, 6 (60), 3021. 10.21105/joss.03021. [DOI] [Google Scholar]
  • (42).pkaramertzanis/regulatory_grouping: a collaboration project to encode the ECHA (ARN) and EPA groups. https://github.com/pkaramertzanis/regulatory_grouping (accessed 2023-12-19).
  • (43).van der Maaten L; Hinton G Visualizing Data Using T-SNE. Journal of Machine Learning Research 2008, 9 (86), 2579–2605. [Google Scholar]
  • (44).Virtanen P; Gommers R; Oliphant TE; Haberland M; Reddy T; Cournapeau D; Burovski E; Peterson P; Weckesser W; Bright J; van der Walt SJ; Brett M; Wilson J; Millman KJ; Mayorov N; Nelson ARJ; Jones E; Kern R; Larson E; Carey CJ; Polat İ; Feng Y; Moore EW; VanderPlas J; Laxalde D; Perktold J; Cimrman R; Henriksen I; Quintero EA; Harris CR; Archibald AM; Ribeiro AH; Pedregosa F; van Mulbregt P; SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat Methods 2020, 17, 261–272. 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (45).Carhart RE; Smith DH; Venkataraghavan R Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J Chem Inf Comput Sci 1985, 25 (2), 64–73. 10.1021/ci00046a002. [DOI] [Google Scholar]
  • (46).Lhasa Limited ∣ Shared Knowledge, Shared Progress. https://www.lhasalimited.org/ (accessed 2023-12-19).
  • (47).QSAR Toolbox. https://qsartoolbox.org/ (accessed 2023-12-19).
