Author manuscript; available in PMC: 2025 Jun 1.
Published in final edited form as: Comput Toxicol. 2024 Jun;30:1–15. doi: 10.1016/j.comtox.2024.100304

A systematic analysis of read-across within REACH registration dossiers

G Patlewicz a,*, P Karamertzanis b, K Paul Friedman a, M Sannicola b, I Shah a
PMCID: PMC11235147  NIHMSID: NIHMS1999146  PMID: 38993812

Abstract

Read-across is a well-established data-gap filling technique used within analogue or category approaches. Acceptance remains an issue, mainly due to the difficulties of addressing residual uncertainties associated with a read-across prediction and because assessments are expert-driven. Frameworks to develop, assess and document read-across may help reduce variability in read-across results. Data-driven read-across approaches such as Generalised Read-Across (GenRA) include quantification of uncertainties and performance. GenRA also affords opportunities to systematically incorporate New Approach Method (NAM) data in support of the read-across hypothesis. Herein, a systematic investigation of differences between expert-driven read-across and data-driven approaches was pursued in terms of establishing scientific confidence in the use of read-across. A dataset of expert-driven read-across assessments that made use of registration data as disseminated in the public International Uniform Chemical Information Database (IUCLID) (version 6) of Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) Study Results was compiled. A dataset of ~5000 read-across cases pertaining to repeated dose and developmental toxicity was extracted and mapped to content within EPA’s Distributed Structure Searchable Toxicity database (DSSTox) to retrieve chemical name and structural identification information. Content could be mapped to ~3600 cases which, when filtered for unique cases with curated quantitative structure-activity relationship-ready SMILES, resulted in 389 target-source analogue pairs. The similarity between target and source analogues on the basis of different contexts, from structural similarity using chemical fingerprints to metabolic similarity using predicted metabolic information, was evaluated. 
An attempt was also made to quantify the relative contribution each similarity context played relative to the target-source analogue pairs by deriving a model which predicted known analogue pairs. Finally, point of departure values (PODs) were predicted using the GenRA approach underpinned by data extracted from the EPA’s Toxicity Values Database (ToxValDB). The GenRA predicted PODs were compared with those reported within the REACH dossiers themselves. This study offers generalisable insights on how read-across is already applied for regulatory submissions and expectations on the levels of similarity necessary to make decisions.

Keywords: read-across, GenRA, REACH, similarity context, New Approach Methods (NAMs)

1. INTRODUCTION

Read-across remains a popular data gap filling technique to meet information requirements for different regulatory purposes. Although there is much technical guidance for developing read-across assessments, notably the Organisation for Economic Co-operation and Development (OECD) guidance, last revised in 2014,1 as well as the European Chemicals Agency’s (ECHA) Read-Across Assessment Framework,2 regulatory acceptance remains an issue. One challenge hindering acceptance relates to what an acceptable level of uncertainty is for a read-across prediction. This is a multi-faceted issue, requiring consideration of the endpoint of interest, the inherent chemistry of the target substance, the decision context, and, to an extent, whether the read-across is predicting the absence or presence of toxicity. These challenges have been investigated over the last decade, resulting in the development of frameworks and templates to identify and document the sources of uncertainty in read-across3,4,5,6,7. Using data from in vitro New Approach Methods to reduce uncertainties (see references8,9,10) has also been an area of research. Complementary strategies have explored ways of quantifying uncertainty and assessing the performance of read-across. One such approach is Generalised Read-Across (GenRA), which aims to systematically quantify the contribution of different similarity contexts, such as bioactivity and physicochemical information, in predicting in vivo toxicity outcomes11,12,13. Related efforts have also been undertaken by Lester et al14 and Gadaleta et al15 to explore the impact of different similarity contexts, including metabolism, in analogue identification and evaluation. One difficulty in evaluating the similarity context contribution, and indeed the performance of read-across, is the paucity of actual read-across examples. 
A number of read-across examples have been developed under the auspices of the OECD’s Integrated Approaches to Testing and Assessment (IATA) case studies programme (see https://www.oecd.org/chemicalsafety/risk-assessment/iata/), and historically there have been categories developed to facilitate read-across under the OECD High Production Volume (HPV) programme (see https://www.oecd.org/env/ehs/risk-assessment/history-cocap-cooperative-chemicals-assessmentprogramme.htm). Within the US Environmental Protection Agency (EPA), a number of read-across assessments have been performed as part of the Provisional Peer-Reviewed Toxicity Values (PPRTV) programme. These fall within the scope of the Superfund Risk Assessment requirements (https://www.epa.gov/pprtv). However, there has not been a concerted effort to compile these cases collectively to help identify generalisable guiding principles for improved read-across application. Read-across has also been extensively applied under the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) regulation16 by industry registrants, but algorithmically identifying all dossiers where read-across has been performed, including the identity of both source and target substance, is not a trivial undertaking. The dissemination information made available on the ECHA website (https://echa.europa.eu/en-US/information-on-chemicals) enables a single chemical to be queried and its dossier browsed. Querying eChemPortal (https://www.echemportal.org/echemportal/), an OECD effort to provide free public access to information on chemical properties, allows for the identification of REACH dossiers where read-across might have been performed, but the summary output only provides information on which substances have relied upon read-across to satisfy specific information requirements, not what the associated candidate source analogues were or how these were rationalised for any read-across performed. 
This presents challenges in identifying ways in which read-across can be improved and at scale.

In this study, the published REACH Study Results (RSR) (https://iuclid6.echa.europa.eu/en/rsr-dossiers) from ECHA’s IUCLID website were leveraged to identify REACH registration dossiers that had included read-across for selected higher-tier endpoints, specifically oral repeated dose toxicity and developmental toxicity studies. These were used to explore the similarity between target and source substances through the lens of different contexts: structural similarity, structural alerts, physicochemical properties and metabolism predictions. Through the systematic identification of analogues using GenRA, the variability of the point of departure (POD) values for source analogues was assessed, which in practice can be subject to data availability constraints. In doing so, this study helped to characterise the level of uncertainty that might be present in currently applied read-across assessments, which in turn informs reasonable expectations for read-across performance, especially if performed systematically. The analysis workflow that formed the main basis of this study is summarised in Figure 1.

Figure 1:

Analysis workflow used in this study. Two lines of investigation were primarily followed. In the first, the RSR analogue pairs were evaluated through the lens of different similarity contexts with a view to evaluating the relative contribution each played in the selection of source analogues. In the second, the RSR POD values were compared with GenRA predictions derived using ToxValDB data.

2. METHODS

2.1. Data Sources

The RSR dataset was downloaded from the ECHA website in January 2023 (https://iuclid6.echa.europa.eu/reach-study-results). These data are formatted and stored according to the International Uniform Chemical Information Database (IUCLID), and were imported into a local IUCLID server instance deployed over a PostgreSQL database [https://www.postgresql.org/]. The data extraction relied on database queries to locate the endpoint study records related to oral repeated dose toxicity and developmental toxicity studies, followed by the extraction of the study details through the IUCLID REST Web Services API [https://iuclid6.echa.europa.eu/public-api]. Since the purpose of this study was to construct target-source analogue pairs for read-across, data retrieval focused on the administrative part of the endpoint study record, the test material composition, the test guideline, and the effect levels. Endpoint study record pairs containing the source experimental study with the source substance and the linked read-across application to the target substance were extracted. The identifiers of the read-across source analogues were retrieved from the test material composition of the experimental endpoint study record, by processing all reference substances linked to it. The read-across target identifiers were retrieved from the reference substance that was linked to the substance that was the dossier subject. This approach retrieved only the target-analogue pairs for which read-across had been reported using the functionality offered with the release of IUCLID v6. The RSR dataset also contains older dossiers that have not been updated since the introduction of IUCLID v6. The read-across information in these older dossiers was not extracted because of the reduced clarity of the source substance identifiers. Only endpoint study records that were tagged by the registrant as reliable with/without restriction (Klimisch et al17 score 1 or 2) were retained for analysis. 
In addition, both the registered substance and the test material had to have a CASRN within the set of provided chemical identifiers to facilitate retrieval of additional identifiers from EPA’s Distributed Structure Searchable Toxicity database (DSSTox)18. It is acknowledged that the source analogues identified may have arisen from both analogue and category approaches. In this study, all endpoint study records for the substance being registered were extracted, irrespective of whether there had been a category or not. The working assumption was that, if a category approach had been applied, the registrant would have identified the closest category analogue(s) and included their data in the dossier for the registered substance.

2.2. Chemical Information

The data extracted as described in Section 2.1 provided a matrix of target (registered substance) and source analogue substances (test substance) associations. These were manually inspected to verify the names, CASRN and other identifiers for both target and source analogue substances. Processing included harmonising names, filling in missing CASRN (e.g. where CASRN was captured in the name field), correcting typographical errors, etc. All target and source substances were then queried against the EPA’s DSSTox database using the internal ChemReg application (chemical registration database) in order to retrieve DSSTox Substance identifier information (DTXSID), CASRN (preferred CASRN) and structural information (simplified molecular-input line-entry system (SMILES) and QSAR-ready SMILES (i.e. desalted, stereochemistry removed, standardised SMILES etc.)). Searches were performed using a combination of CASRN and names to match as many of the read-across target and source identifiers as possible. CASRN was used for the initial search; if no matches were found, a name search was then performed. DSSTox information, including QSAR-ready SMILES, is publicly available at the CompTox Chemicals Dashboard19. The list of REACH registered substances was additionally downloaded from https://echa.europa.eu/universe-of-registered-substances and mapped to DSSTox content using CASRN.

2.3. Similarity context evaluation

The different similarity contexts evaluated included an assessment of structural similarity, physicochemical similarity, structural alert similarity and metabolic similarity between target and source analogue associations. Structural similarity relied upon deriving 2D chemical structure descriptors. Here, Morgan chemical fingerprints with a bit vector length of 1024 and a radius of 3 were generated using the freely available python library RDKit20. A data matrix of binary chemical fingerprints for all target and source substances was constructed from which a pairwise Jaccard similarity matrix was computed using the python scipy21 library. Physicochemical similarity relied upon estimates of the log of the octanol-water partition coefficient (LogP), number of hydrogen bond donors and acceptors (HBD, HBA) and calculated molecular weight (MW) using the open-source OPERA tool22. Although some of these data may have been available in the registration dossiers for the registered substances, predicted properties were relied upon to ensure a complete dataset for as many of the target-analogue pairs as possible. These parameters are often used as a surrogate to model likely bioavailability, which is an important consideration in evaluating similarity for read-across. A pairwise normalised Euclidean similarity matrix was computed. Structural alert similarity relied upon batch processing all substances using the default settings within the commercial expert system, Derek Nexus v2.5 (Lhasa Ltd, https://www.lhasalimited.org/). The output was first processed to extract a standardised matrix of chemicals as rows and all toxicity-toxicophore (Derek alerts) combinations as columns. A binary fingerprint representation reflecting the presence and absence of structural alerts for all substances permitted a Jaccard similarity matrix to be computed. The confidence level associated with a structural alert (equivocal, plausible, probable etc.) was not used in the fingerprint representation. 
Metabolic similarity made use of the commercial expert system TIssue Metabolism System (TIMES) (Laboratory of Mathematical Chemistry, University As Zlatarov) and its in vivo rat liver model to simulate metabolites. Three different approaches were used to quantify metabolic similarity. The Weisfeiler-Lehman (WL) Kernel23, a measure of graph kernel similarity, was computed for the simulated metabolic graphs. The transformation pathways from the TIMES output and the union of all metabolites simulated were represented as binary bit vectors from which pairwise Jaccard similarity matrices were derived. The latter two representations of metabolic information are consistent with the approaches described previously in Boyce et al24. The three approaches aimed to capture different aspects of metabolic similarity: a metric that quantified the similarity between the simulated metabolic maps for the target-source analogues, the similarity of the transformation pathways (e.g. presence of an oxidation or hydrolysis reaction) and the overlap of the actual metabolites simulated for a given target-source analogue pair.
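The Jaccard similarity used for the fingerprint-based contexts (structure, structural alerts, transformation pathways and metabolites) can be illustrated with a minimal sketch. It uses only the Python standard library and represents each fingerprint as the set of its 'on' bit indices. The toy bit sets and function names are hypothetical; in the study the fingerprints came from RDKit, Derek Nexus and TIMES, and the matrix was computed with scipy, whose Jaccard metric returns a distance (one minus the similarity shown here).

```python
from itertools import combinations


def jaccard_similarity(bits_a, bits_b):
    """Jaccard similarity between two sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention assumed for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)


def pairwise_similarity(fingerprints):
    """Return {(id_i, id_j): similarity} for every unordered pair of substances."""
    return {
        (i, j): jaccard_similarity(fingerprints[i], fingerprints[j])
        for i, j in combinations(sorted(fingerprints), 2)
    }


# Toy fingerprints (hypothetical on-bit indices, not real Morgan bits).
toy_fps = {
    "target": {1, 5, 9, 12},
    "source_a": {1, 5, 9, 30},
    "source_b": {2, 40},
}
sims = pairwise_similarity(toy_fps)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

The same function applies unchanged to any binary representation, which is why one metric can serve several similarity contexts.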

Distribution plots for each of the different similarity contexts provided a summary view of the range of pairwise similarities observed across the target-analogue associations. Boxplots were also constructed for those target-source associations where there were multiple source analogues per target substance. The boxplots provided a perspective of the extent of variation observed across the different source analogues proposed for a given target substance.

2.4. Modelling the similarity contexts

A matrix of all possible target and source analogue combinations with their respective pairwise similarity metrics (structural, physicochemical, structural alert and the 3 metabolic similarity indices) as descriptors was constructed. All true target-source pairings were labelled as 1 and all false target-source pairings were labelled as 0. The dataset was then split randomly by stratified sampling (80%:20%) into a training and test set. The test set was reserved as an external validation set. Four different machine learning models (Logistic Regression, Ridge Regression, Linear Discriminant Analysis and Random Forest) as implemented in the scikit-learn python library25 were then applied to relate the similarity metrics to the labels (true or false target-source pairing). The aim was to explore the relative importance of the different similarity contexts in the target-source pair, i.e. were the different contexts weighted equally or was one similarity context more dominant than another? Models were trained to optimise for balanced accuracy given the imbalance in the dataset. Based on the mean cross-validation balanced accuracy from the initial machine learning approaches attempted, one model was carried forward for hyperparameter tuning. A nested 10-fold cross validation procedure was then undertaken to identify the best parameters for the model in the inner loop whilst evaluating the cross validation performance on the outer loop. The model performance was finally evaluated on the 20% test set that had not been used during the model training/testing phase.
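A minimal sketch of the labelling step and of the balanced accuracy metric is given below, using only the Python standard library. The substance identifiers and function names are hypothetical, and the study itself relied on scikit-learn for model fitting and scoring. The toy example shows why balanced accuracy was a sensible choice: a classifier that never predicts a true pairing still scores well on plain accuracy when false pairings dominate.

```python
from itertools import product


def label_pairs(targets, sources, known_pairs):
    """Label every target-source combination: 1 if reported as a pair, else 0."""
    known = set(known_pairs)
    return {(t, s): int((t, s) in known) for t, s in product(targets, sources)}


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical identifiers: two targets, three candidate sources, two true pairs.
labels = label_pairs(["T1", "T2"], ["S1", "S2", "S3"], [("T1", "S1"), ("T2", "S3")])
y_true = list(labels.values())
all_negative = [0] * len(y_true)

plain_accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
print(round(plain_accuracy, 3))                 # plain accuracy of "always 0"
print(balanced_accuracy(y_true, all_negative))  # balanced accuracy of "always 0"
```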

2.5. GenRA analysis - baseline performance

All studies conducted by the oral route for which a No Observable Adverse Effect Level (NOAEL) or Lowest Observable Adverse Effect Level (LOAEL) point of departure (POD) was available, and where the units were mg/kg-day, were first extracted from the EPA Toxicity Values database (ToxValDB v9.4) [https://doi.org/10.23645/epacomptox.20394501.v5]. It should be noted that ToxValDB v9.4 includes public REACH dossier data, such that in some cases analogue data could be present for a given substance, which would lead to the performance of GenRA being overestimated. The DSSTox substance identifier (DTXSID) was used to query the DSSTox database and retrieve associated chemical structure information for all substances in ToxValDB. Studies were grouped by substance, POD and study type to calculate a 10th percentile POD per study. Studies were then grouped by substance to derive the 10th percentile POD across studies. NOAEL values were used preferentially, but if no NOAEL studies were available across studies for a given substance, LOAEL values were used and adjusted by dividing by a factor of 10. Lastly, the 10th percentile POD derived across studies for a given substance was divided by its MW (giving units of mmol/kg-day) and the −log10 was taken. This was used as the “modelled” endpoint for all subsequent model development. The dataset was split into a training and test set, where 80% was used for all training and testing whereas the final 20% was reserved as the hold out test set to evaluate final performance. A nested 10-fold CV procedure was applied with GenRA making use of the genra-py python library26, where the inner loop was tuned to determine the optimal number of neighbours whereas the outer loop evaluated performance using root mean squared error (RMSE). Morgan fingerprints were used (as already described in Section 2.3) as chemical fingerprint inputs. The GenRA approach applied to the ToxValDB hold out set provided a measure of final performance. 
Predictions were then made for the REACH dossier target substances to estimate their POD values. These were compared with the PODs cited in the dossiers for the associated source analogues, i.e. were predictions derived using chemical fingerprints more or less conservative than the PODs in the REACH dossiers?
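The derivation of the modelled endpoint can be sketched as follows, using only the Python standard library. Several simplifications are assumed and are not taken from the study: the two percentile steps (per study, then across studies) are collapsed into one, the percentile uses linear interpolation between ranks, and the function names and input values are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("no values supplied")
    pos = (len(data) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


def modelled_endpoint(noaels, loaels, mol_weight):
    """-log10 of the 10th percentile POD expressed in mmol/kg-day.

    NOAELs (mg/kg-day) are used when available; otherwise LOAELs are
    divided by 10, mirroring the adjustment described in the text.
    """
    if noaels:
        pods = list(noaels)
    elif loaels:
        pods = [value / 10.0 for value in loaels]
    else:
        raise ValueError("no POD values available")
    pod_mg = percentile(pods, 10)
    # mg/kg-day divided by g/mol gives mmol/kg-day.
    return -math.log10(pod_mg / mol_weight)


# Hypothetical substance with three NOAELs and a molecular weight of 100 g/mol.
print(round(percentile([10, 100, 1000], 10), 3))
print(round(modelled_endpoint([10, 100, 1000], [], 100.0), 3))
# Hypothetical substance with a single LOAEL only.
print(round(modelled_endpoint([], [100], 100.0), 3))
```

Because of the negative log transformation, larger endpoint values correspond to lower (more potent) PODs.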

2.6. Comparison of GenRA predictions with REACH dossier toxicity values

All PODs for repeated dose oral toxicity and developmental toxicity endpoints for the source analogues cited in the dossiers were extracted. For the repeated dose oral toxicity values, only POD information where the units were mg/kg bw/day and the POD type was either NO(A)EL or LO(A)EL was retrieved. For consistency with the ToxValDB data processing, LO(A)EL values from the REACH dossiers were adjusted by a factor of 10. All developmental toxicity values were processed similarly, restricting PODs to those only by the oral route and in mg/kg bw/day units. The repeated dose oral toxicity and oral developmental toxicity values were then pooled together. For each source analogue the 10th percentile of all values was computed, regardless of which dossier the values were taken from. The source analogue data were then merged with the target substances (described in Section 2.1) to infer the likely read-across prediction. If a target substance was associated with more than one source analogue POD, the minimum value was taken as the prediction. This manipulation resulted in a single value derived from the REACH dossiers for a given target (referred to as the RSR toxicity POD) that could be compared with the GenRA prediction that had been generated using the ToxValDB data. Empirical cumulative distribution functions were plotted to enable a visual comparison of the 2 sets of POD values.
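The inference of a single RSR toxicity POD per target, and the empirical cumulative distribution used for the visual comparison, can be sketched as below. This is an illustrative standard-library example with hypothetical identifiers and values; the per-source PODs are assumed to have already been reduced to their 10th percentile.

```python
def rsr_pod_per_target(pairs, source_pod):
    """Minimum source analogue POD per target.

    pairs: iterable of (target, source) tuples.
    source_pod: {source: POD}; sources without a POD are skipped.
    """
    result = {}
    for target, source in pairs:
        if source not in source_pod:
            continue
        pod = source_pod[source]
        result[target] = min(pod, result.get(target, pod))
    return result


def ecdf(values):
    """Sorted values paired with their cumulative fraction."""
    data = sorted(values)
    n = len(data)
    return [(x, (i + 1) / n) for i, x in enumerate(data)]


# Hypothetical pairs and source PODs (mg/kg bw/day).
toy_pairs = [("T1", "S1"), ("T1", "S2"), ("T2", "S3"), ("T3", "S4")]
toy_source_pod = {"S1": 50.0, "S2": 5.0, "S3": 200.0}

rsr_pods = rsr_pod_per_target(toy_pairs, toy_source_pod)
print(rsr_pods)
print(ecdf(rsr_pods.values()))
```

Taking the minimum across source analogues is the conservative choice described in the text; a target whose only source lacks a usable POD simply drops out of the comparison.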

In the first comparison, discrepancies between the RSR toxicity POD and the GenRA predicted POD could be potentially due to the selected source analogue as well as the breadth of different studies captured within ToxValDB. To probe the contribution further, a second comparison was undertaken where the source of toxicity data was fixed.

In this second investigation, all REACH registered substances that were associated with a ToxValDB POD and structure in DSSTox were pooled with the source and target substances. GenRA predictions were then made for as many of the target substances as possible using the REACH registered substances landscape with POD values in ToxValDB as the source analogue pool. GenRA predictions were made for each target which were compared with the ToxValDB PODs for the dossier source analogues. The cumulative distribution function of the difference between the source POD (i.e. the inferred POD for the target) and the GenRA prediction was plotted.
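As an illustration of how a prediction can be drawn from a pool of candidate source analogues, the sketch below computes a similarity-weighted average over the k most similar pool members. This is a simplified stand-in written for this example with the standard library; it is not the genra-py API, and the weighting scheme, function names, fingerprints and values are all assumptions.

```python
def jaccard(bits_a, bits_b):
    """Jaccard similarity between two sets of 'on' bit indices."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


def similarity_weighted_prediction(target_bits, pool, k=3):
    """Similarity-weighted mean of the k nearest neighbours' values.

    pool: {substance_id: (bit_set, value)}. Returns None when every
    neighbour has zero similarity to the target.
    """
    scored = sorted(
        ((jaccard(target_bits, bits), value) for bits, value in pool.values()),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None
    return sum(sim * value for sim, value in scored) / total


# Hypothetical pool: fingerprints as on-bit sets, values as -log10 PODs.
toy_pool = {
    "A": ({1, 2, 3, 4}, 2.0),
    "B": ({1, 2, 5, 6}, 1.0),
    "C": ({7, 8}, 5.0),
}
prediction = similarity_weighted_prediction({1, 2, 3, 4}, toy_pool, k=2)
print(prediction)
```

Restricting the pool to REACH registered substances with ToxValDB PODs, as in the second investigation, changes only the contents of the pool and not the prediction step.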

To determine whether certain types of substances tended to be associated with ‘more conservative’ read-across predictions using REACH or GenRA, an enrichment analysis was performed using ToxPrint chemotypes27. First, the standard deviation (sd) of the absolute difference between the REACH and GenRA read-across outcomes was determined and set as a threshold of ‘conservativeness’. Then for each substance, the absolute difference between their REACH and GenRA read-across prediction was computed and compared with this threshold. If the difference exceeded the sd threshold, that substance was labelled with a 1 to denote a ‘more conservative’ outcome, otherwise it was labelled with a 0. ToxPrints27 were then used to determine whether there were certain structural features that were ‘more enriched’ in the group of substances which resulted in large differences between GenRA predictions and REACH outcomes (i.e. those substances which exceeded the sd threshold) or not. ToxPrints27 were generated from SMILES using the command line version of the Corina Symphony software from Molecular Networks, GmbH (https://mn-am.com/). The Fisher’s exact test was then used for each ToxPrint to compute the odds ratio and the p-value. The approach was comparable to the methodology discussed in Wang et al28. A ToxPrint was considered enriched if it had an odds ratio greater than or equal to 3, a p-value less than 0.05 and at least three read-across pairs exhibiting a large difference (> sd) with the ToxPrint present.
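The enrichment criteria can be illustrated with a small standard-library sketch that computes the odds ratio and a Fisher's exact p-value from a 2x2 contingency table. The text does not state which alternative hypothesis was tested, so a one-sided ("greater") test is assumed here; the function names and counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """Odds ratio and one-sided Fisher's exact p-value for a 2x2 table.

    a: large-difference pairs with the ToxPrint present
    b: large-difference pairs without the ToxPrint
    c: remaining pairs with the ToxPrint present
    d: remaining pairs without the ToxPrint
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    # Hypergeometric tail: probability of observing a or more in the top-left cell.
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    p_value = tail / comb(n, col1)
    if b * c == 0:
        odds_ratio = float("inf") if a * d > 0 else float("nan")
    else:
        odds_ratio = (a * d) / (b * c)
    return odds_ratio, p_value


def is_enriched(a, b, c, d):
    """Apply the three thresholds described in the text."""
    odds_ratio, p_value = fisher_exact_greater(a, b, c, d)
    return odds_ratio >= 3 and p_value < 0.05 and a >= 3


# Hypothetical counts for a single ToxPrint.
odds, p = fisher_exact_greater(8, 2, 1, 9)
print(odds, round(p, 5), is_enriched(8, 2, 1, 9))
```

The requirement of at least three large-difference pairs guards against a ToxPrint being flagged on the strength of one or two substances.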

A final investigation explored the structural similarity between the source analogue and the target and compared whether a more similar analogue could have been identified based on the landscape of REACH registered substances themselves (taken from https://echa.europa.eu/universe-of-registered-substances). Here, only those registered substances that were mapped to DSSTox content were considered in scope. A histogram of the difference in Jaccard similarities was plotted to evaluate the frequency of cases where a much more similar analogue could have been chosen.

3. Data analysis and code

All analysis was performed using Python 3.9 and standard packages within Jupyter notebooks. The associated notebooks are available at https://github.com/patlewig/read-across/, whereas all data files are available at figshare https://doi.org/10.23645/epacomptox.25343383. Morgan chemical fingerprints were calculated using the RDKit python package. Physicochemical parameters were estimated using the OPERA software tool version 2.922. Metabolism predictions were made using TIMES v12 (Laboratory As. Zlatarov, Bourgas, Bulgaria) and structural alerts were generated using Derek Nexus 2.5 (Lhasa Ltd).

4. RESULTS

4.1. Dataset

Extraction of target-source analogue pairings from the public REACH dossiers in IUCLID format where read-across had been performed to satisfy information requirements for repeated dose toxicity or developmental toxicity resulted in 5021 associations. Mapping to DSSTox content and removal of cases where source and target substances were the same resulted in 3655 associations. Profiling the target substances based on their substance type annotation as tagged by DSSTox found over 55% of target substances to be ‘Mixtures or Formulations’. Only 39% of substances were designated ‘single compounds’ by DSSTox, with the remainder as polymers or mineral/composites. Figure 2 shows the DSSTox substance type designation for the target substances. Under REACH, a substance is a chemical element and its compounds in their natural state or as a result of a manufacturing process, and chemically may correspond to more than one molecular structure. This difference accounts for the mixture annotations within DSSTox.

Figure 2:

Profile of DSSTox substance type for target substances

Target substances that were designated as mixtures included ‘Alkenes, C11–12’, ‘Alkenes, C13–14’, ‘Alkenes, C8–10, C9-rich’, ‘castor oil, ester with trimethylolpropane’, ‘Hydrocarbons, C16–20, n-alkanes, isoalkanes, cyclics’ amongst others.

Limiting target-source associations to those with defined structure only netted 1088 pairings of which 511 were unique. Filtering further to only consider target-source analogue pairs with defined organic structures and QSAR-ready SMILES gave rise to 389 unique associations. These comprised 270 unique target substances and 259 source substances such that each target could be associated with 1 or more source analogues. Most of the targets mapped to one source analogue each (182 out of 270 target substances). The remaining 88 targets were associated with more than 1 source analogue. Figure 3 shows the distribution of number of source analogues per target substance.

Figure 3:

Number of substances as a function of the number of source analogues

4.2. Structural similarity assessment between target and source substances

Morgan chemical fingerprints were computed for all target and source substances from which a pairwise similarity matrix was derived. The pairwise structural similarity for each target-source analogue set was computed to explore the overall distribution of structural similarities. Figure 4 shows the broad variation in structural similarity distribution across the target-source sets with the median pairwise similarity being 0.43. Given that REACH stipulates that ‘any read-across approach must be based on structural similarity between the source and target substances’ amongst other requirements, a median value of 0.43 across all target-analogue pairs is unexpectedly low. The metric derived depends on the manner in which the substances have been characterised using Morgan chemical fingerprints and the use of QSAR-ready structures, and these are potential sources of uncertainty; other chemical fingerprint approaches may yield different levels of similarity. A torsion fingerprint was used as an alternative fingerprint type and the median value was found to be 0.51. The lower than expected pairwise similarity is likely to be a combination of three aspects: firstly, the extent to which a submitter might have been constrained in their selection of source analogue given the requirement to have an appropriate letter of access to use the underlying study information; secondly, the paucity of analogues with relevant toxicity data might result in structural analogues being identified that had a lower similarity; and, thirdly, structural similarity might not be the largest contributing factor in the analogue selection. From the distribution across all the target-source analogue pairs, a proportion of the pairs were associated with very high similarities, reflecting those target-source analogue pairs where the pairs might only differ in chain length. 
Examples where the similarities were particularly high included the QSAR-ready form of octanoic acid derived from target substance Calcium octanoate (DTXSID7052280) and its corresponding source analogue, Docosanoic acid (DTXSID3026930) (pairwise similarity of 0.956). The chain length difference of 14 carbons (C8 versus C22) accounts for the drop in similarity score; the calcium cation was not included in the QSAR-ready SMILES representation and therefore did not contribute to the similarity calculation. Another example was target Pentane, 1,5-diisocyanato- (DTXSID4024143) and its source 1,6-Diisocyanatohexane (DTXSID501031491) with a pairwise similarity of 0.89. These substances contain an isocyanate moiety at either end of a carbon chain of either 5 or 6 carbons in length. An example where there was complete similarity between target and source analogue was target Dioctyldimethylammonium chloride (DTXSID6035491) and its source, Didecyldimethylammonium chloride (DTXSID9032537), differing in chain length by only 2 carbons. Examples where the similarities were quite low included target 1-(2-Hydroxy-3-sulphonatopropyl)pyridinium (DTXSID001014636) and source 3-(Pyridinium-1-yl)propane-1-sulfonate (DTXSID3044592), which had a pairwise similarity of only 0.378. The scaffold underpinning these substances was the same, a pyridinium ring with a 3 carbon chain terminating in a sulfonyl group substituent. The only difference between the substances was a hydroxy substituent on the carbon chain. Another example was target 1-(3-Chlorophenyl)-4-(3-chloropropyl)piperazine hydrochloride (DTXSID9057761) and source 1-Methylpiperazine (DTXSID4021898) with a pairwise similarity of 0.094. It is more likely that the rationale for grouping these 2 together was on account of a transformation pathway.

Figure 4:

Distribution of pairwise structural similarities for target-source associations

Exploring the pairwise similarities across target-source associations where there was more than 1 source analogue provided some perspective of the degree of variation across source analogues proposed for a given target and how this may differ between industry registrants. Figure 5 shows the distribution of pairwise similarities for all 88 targets where there was more than 1 source substance. Overall, a left shift in structural similarity was observed, although there were cases where there was a high structural similarity between the target and all prospective source analogues. Many more target substances were associated with source analogues that spanned a broad range in structural similarity with median Jaccard metrics being less than 0.5. Examples included Eosin (DTXSID0025234) and (4-(alpha-(4-(Dimethylamino)phenyl)benzylidene)cyclohexa-2,5-dien-1-ylidene)dimethylammonium acetate (DTXSID9068295) as listed in Table 1.

Figure 5:

Distribution of pairwise structural similarities for target-source associations with more than 1 source analogue per target. The y-axis gives the DSSTox substance identity of the target. The targets are sorted so that the median structural similarity of the target with the source analogues increases going from top to bottom.

Table 1:

Examples of target substances with large variations in pairwise Jaccard structural similarities

| Target name | Target DTXSID | Source name | Source DTXSID | Pairwise similarity |
|---|---|---|---|---|
| Eosin | DTXSID0025234 | FD&C Red 3 | DTXSID7021233 | 0.544 |
| Eosin | DTXSID0025234 | Fluorescein sodium | DTXSID9025328 | 0.134 |
| Eosin | DTXSID0025234 | C.I. Pigment Orange 13 | DTXSID6052031 | 0.471 |
| (4-(alpha-(4-(Dimethylamino)phenyl)benzylidene)cyclohexa-2,5-dien-1-ylidene)dimethylammonium acetate | DTXSID9068295 | C.I. Basic Violet 14 | DTXSID6021246 | 0.414 |
| (4-(alpha-(4-(Dimethylamino)phenyl)benzylidene)cyclohexa-2,5-dien-1-ylidene)dimethylammonium acetate | DTXSID9068295 | Acid green 50 | DTXSID4046577 | 0.818 |
| (4-(alpha-(4-(Dimethylamino)phenyl)benzylidene)cyclohexa-2,5-dien-1-ylidene)dimethylammonium acetate | DTXSID9068295 | Gentian Violet | DTXSID5020653 | 0.219 |
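As a rough illustration of how pairwise values such as those in Table 1 arise and can be summarised per target, the sketch below computes a Jaccard similarity between two fingerprints represented as sets of "on" bit indices, and then reports the median and range of the Table 1 similarities for each target. The bit sets are made-up placeholders rather than real Morgan fingerprints, and the function and variable names are illustrative only.

```python
from statistics import median


def jaccard(bits_a, bits_b):
    """Jaccard similarity between two fingerprints given as sets of 'on' bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Hypothetical 'on' bit indices for two fingerprints (not real Morgan bits).
fp_target = {1, 5, 9, 12, 40}
fp_source = {1, 5, 9, 33}
print(round(jaccard(fp_target, fp_source), 3))

# Pairwise similarities copied from Table 1, keyed by target DTXSID.
table1 = {
    "DTXSID0025234": [0.544, 0.134, 0.471],
    "DTXSID9068295": [0.414, 0.818, 0.219],
}

# Median and spread of the similarities reported for each target.
summary = {
    dtxsid: {
        "median": median(values),
        "range": round(max(values) - min(values), 3),
    }
    for dtxsid, values in table1.items()
}
for dtxsid, stats in summary.items():
    print(dtxsid, stats)
```

For both targets the median falls below 0.5 while the individual values span a wide range, which is the pattern described for Figure 5.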

4.3. Physicochemical similarity

OPERA-derived features of LogP and the number of hydrogen bond donors (HBDs) and acceptors (HBAs), as well as MW, were calculated for all substances with QSAR-ready SMILES. These parameters were normalised using sklearn’s MinMaxScaler to ensure that all pairwise similarities ranged between 0 and 1. A Euclidean similarity matrix was then created, from which the distribution of similarities for a given target substance could be computed. The physicochemical similarity across target-source analogues was much higher than that observed for structural similarity, with a median of 0.9, perhaps reflecting that this domain of regulated substances is less likely to vary widely in terms of physicochemical properties. Figure 6 shows the distribution of the physicochemical properties for all target and source analogues. The distribution of the pairwise similarities and the boxplot representations for targets with more than 1 source analogue are captured in the supplementary information (see Figure A1 and Figure A2).
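The normalisation and similarity steps can be sketched as follows. The text does not state how the Euclidean distance was converted into a similarity, so the conversion used here (1 minus the distance divided by the square root of the number of descriptors) is an assumption chosen only because it keeps values between 0 and 1 after min-max scaling. The scaling is written out by hand to keep the example dependency-free, and the descriptor values are invented.

```python
import math


def min_max_scale(rows):
    """Scale each column to [0, 1], mirroring what a min-max scaler does."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def euclidean_similarity(a, b):
    """Assumed conversion: 1 - d / sqrt(n), so the result lies in [0, 1]."""
    distance = math.dist(a, b)
    return 1.0 - distance / math.sqrt(len(a))


# Hypothetical descriptor rows: [LogP, HBD, HBA, MW]; values are made up.
substances = {
    "target": [3.0, 1, 2, 144.2],
    "source_1": [3.4, 1, 2, 172.3],
    "source_2": [-0.5, 3, 6, 340.5],
}
names = list(substances)
scaled = dict(zip(names, min_max_scale([substances[n] for n in names])))

for name in names[1:]:
    sim = euclidean_similarity(scaled["target"], scaled[name])
    print(name, round(sim, 3))
```

Because every scaled descriptor lies between 0 and 1, the largest possible distance over n descriptors is the square root of n, which is what bounds the similarity in this sketch.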

Figure 6:

Distribution of the physicochemical properties for all the target and source analogues.

4.4. Structural alert similarity

Derek Nexus v2.5 was used to profile all target and source substances across its broad suite of endpoints and toxicophores. This captured 131 endpoint-toxicophore combinations ranging from adrenal gland toxicity and carcinogenicity to thyroid toxicity. The number of alerts identified for each target-source pair was determined to enable a comparison of how many alerts were fired for a target substance relative to its source substance. The majority of target-source associations were associated with no structural alerts. Only 50 pairings were associated with alerts, with target substances typically firing the same number of alerts as, or fewer alerts than, their corresponding source analogues (see Figure 7). Of the 50 associations where alerts were fired, source analogues fired more alerts than their corresponding target substances in 24 cases, the number of alerts fired was the same in 20 cases, and there were only 6 instances where the target substance fired more alerts.

Figure 7:

Distribution of the alert count difference between target substance and source analogue.

Most of the target-source pairs in this set did not fire alerts, which is not altogether surprising given the number of substances that were alkanes or alkenes and contained no features indicative of overt reactivity. For the proportion of target-source associations with alerts, the source analogue was typically comparable or more conservative in terms of its structural alert profile, although a more thorough analysis would require examining each alert individually instead of the total number of alerts. Moreover, the absence of an alert does not necessarily imply a lack of toxicity, as it could also reflect the scope of the current knowledge base within Derek.

4.5. Metabolic similarity

In vivo TIMES simulated metabolites were processed in three ways: (a) a WL kernel, (b) a bit vector of transformation pathways and (c) a bit vector derived from the union of actual metabolites simulated. The similarities across these aspects of metabolism were particularly low, as shown in Figure 8. It is plausible that the transformation similarity is higher than the similarity of metabolites, as simple substitutions may not necessarily affect the overall sequence of transformations, although the actual metabolites may be different. The similarity in actual metabolites between a target and an associated analogue could be expected to be low, as there may be only a handful of common metabolites, if any, within a target-analogue pair or across a target-group of substances. Evaluating the structural similarity of the metabolites simulated may prove to be more impactful in codifying the similarity of the metabolites themselves. The WL kernel provides a measure of similarity for the metabolic graph, but there are a number of reasons why low scores might arise, including different graph structures between substances, different graph sizes and noisy features. It is plausible that the simulated metabolic graphs might merit some pruning to remove downstream metabolites that have very low probabilities of being formed or are produced at extremely low levels.
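To illustrate why transformation similarity can exceed metabolite similarity, the sketch below compares a hypothetical target and source analogue whose simulated pathways mostly coincide but whose metabolites differ in chain length. The transformation labels and metabolite SMILES are invented for illustration and are not TIMES output; sets stand in for the bit vectors, and Jaccard is assumed as the comparison metric.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 is assumed when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical simulator output for a target and its source analogue.
# Transformation labels and metabolite SMILES are invented for illustration.
target = {
    "transformations": {"omega-hydroxylation", "alcohol oxidation", "beta-oxidation"},
    "metabolites": {"OCCCCCCCC(=O)O", "O=CCCCCCCC(=O)O", "OC(=O)CCCCCC(=O)O"},
}
source = {
    "transformations": {"omega-hydroxylation", "alcohol oxidation", "glucuronidation"},
    "metabolites": {"OCCCCCCCCCC(=O)O", "O=CCCCCCCCCC(=O)O", "OC(=O)CCCCCC(=O)O"},
}

transformation_similarity = jaccard(target["transformations"], source["transformations"])
metabolite_similarity = jaccard(target["metabolites"], source["metabolites"])

print(f"transformations: {transformation_similarity:.2f}")
print(f"metabolites:     {metabolite_similarity:.2f}")
```

In this toy case two of four distinct transformations are shared, whereas only one of five distinct metabolites is common to both substances, so exact-match metabolite similarity is the lower of the two.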

Figure 8:

Distribution of the metabolism similarity for all of the target and source analogues.

4.6. Quantifying the contribution of each similarity context

A matrix of all target and source substance combinations, with their different similarity metrics as descriptors, was constructed. All actual target-source pairings were labelled as ‘1’ and all other combinations were labelled as ‘0’. Of the four machine learning models applied, the Random Forest Classifier (RFC) gave the best 10-fold stratified Balanced Accuracy (BA), with a mean BA score of 0.79 (std 0.032). The RFC was initially carried forward for hyperparameter tuning but its mean BA from the 10-fold stratified cross validation was only 0.61. The next best performing model was a Linear Discriminant Analysis (LDA) model, which had a comparable mean BA score of 0.78 (std 0.03) and was, in addition, more intuitive to interpret. Table 2 shows the similarity contexts and their associated coefficients from the LDA model. Structural similarity and metabolic similarity had the greatest impact on assigning target-analogue pairs correctly, whereas similarity in structural alerts had the lowest impact. The high positive coefficient for structural similarity is intuitive: it is likely that a source analogue was identified based on structure. The high negative coefficient for metabolite similarity is counterintuitive, since it implies that a higher metabolite similarity made the model less likely to correctly predict a target-analogue pair, whereas a higher metabolite similarity ought to be desirable for identifying an appropriate source analogue. On the other hand, the other two measures of metabolic similarity showed positive contributions in terms of predicting target-analogue pairs correctly.

To explore this further, the model was retrained using the mean of the 3 metabolic similarities. Table 2 also reflects the updated coefficients from the retrained LDA model. There was no discernible difference in the mean cross validation balanced accuracy when using the mean metabolic metric instead of all 3 metabolic similarity metrics. The mean cross validation balanced accuracy of the updated LDA model was found to be 0.77 (std 0.02). The balanced accuracy on the final test set was determined to be 0.77. In the updated model, structural similarity still had the largest positive contribution, metabolic similarity was associated with a small positive contribution, and neither physicochemical nor structural alert similarity played much of a role. The low positive contribution of the metabolic similarities is likely accounted for by the low pairwise similarity distributions, as shown in Figure 8. Structural similarity dominates the analogue identification but metabolic similarity plays a small role.
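The labelling scheme produces far more ‘0’ rows than ‘1’ rows, which is one reason a balanced metric is a sensible choice for comparing models. The sketch below builds the label vector for a toy set of hypothetical substances and contrasts plain accuracy with balanced accuracy for a trivial classifier that never predicts a pair. The identifiers and the helper function are illustrative assumptions, not the code used in the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on 1s) and specificity (recall on 0s)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    return 0.5 * (tp / positives + tn / negatives)


# Hypothetical substances and reported target-source pairs.
targets = ["T1", "T2"]
sources = ["S1", "S2", "S3", "S4", "S5"]
reported_pairs = {("T1", "S1"), ("T2", "S4")}

# Every target-source combination gets a label: 1 if reported, else 0.
rows = [(t, s) for t in targets for s in sources]
labels = [1 if pair in reported_pairs else 0 for pair in rows]

# A trivial classifier that always predicts 0 looks accurate but is useless.
always_zero = [0] * len(labels)
accuracy = sum(1 for t, p in zip(labels, always_zero) if t == p) / len(labels)
print(f"accuracy: {accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy(labels, always_zero):.2f}")
```

Plain accuracy rewards the always-zero classifier for the class imbalance, whereas balanced accuracy sits at the chance level of 0.5.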

Table 2:

Coefficients from the original and updated Linear Discriminant Analysis model

| Similarity context | Original coefficients | Updated coefficients |
|---|---|---|
| Structure | 59.77 | 55.94 |
| Structural alert | 0.18 | 0.16 |
| Metabolites | −48.8 | |
| Transformation | 22.44 | |
| WL | 4.89 | |
| Mean of the 3 metabolism similarities | N/A | 1.62 |
| Physicochemical | −0.33 | 0.37 |

4.7. GenRA analysis

Querying ToxValDB 9.4 for all repeated dose studies conducted by the oral route, where the point of departure (POD) took the form of a NO(A)EL or LO(A)EL, resulted in 99,406 records. PODs were first aggregated by substance, study type and POD type, whereby the 10th percentile of all PODs was computed. Studies were then aggregated by substance: any PODs that were LOAEL-like in form were first adjusted by a factor of 10 before their 10th percentile was taken, whereas the 10th percentile of NOAEL-like values was taken directly. For a given substance, NOAEL-like derived PODs were preferentially used unless only LOAEL-like values were available. This resulted in a single POD value for a given substance. This was merged with structural identifier information, from which a −log10 mmolar POD (pPOD) was derived. PODs were available for 7635 substances, whereas only 5321 substances had structural information (SMILES) and hence permitted the conversion of the POD to its mmolar basis. The resulting pPOD (−log10 of the mmol/kg/day POD) was the modelled endpoint used in the GenRA analysis.
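A simplified sketch of the aggregation and unit conversion is given below. It makes several assumptions that are not stated in the text: the factor-of-10 adjustment is applied as a division, the percentile uses linear interpolation between ranks, and the intermediate aggregation by study type is skipped for brevity. The records and molecular weight are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def substance_ppod(records, mol_weight):
    """Collapse (pod_type, mg/kg/day) records to a single pPOD.

    Assumptions: LOAEL-like values are divided by 10, and the study-type
    level of aggregation described in the text is skipped for brevity.
    """
    noael_like = [v for kind, v in records if kind in ("NOAEL", "NOEL")]
    loael_like = [v / 10.0 for kind, v in records if kind in ("LOAEL", "LOEL")]
    # NOAEL-like values take precedence; LOAEL-like values are the fallback.
    chosen = noael_like if noael_like else loael_like
    pod_mg = percentile(chosen, 10)
    pod_mmol = pod_mg / mol_weight  # mg/kg/day divided by g/mol gives mmol/kg/day
    return -math.log10(pod_mmol)


# Hypothetical records for one substance with a molecular weight of 200 g/mol.
records = [("NOAEL", 100.0), ("NOAEL", 300.0), ("LOAEL", 500.0), ("NOEL", 50.0)]
print(round(substance_ppod(records, mol_weight=200.0), 3))
```

On this scale a larger pPOD corresponds to a lower POD, i.e. a more potent substance.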

4.8. GenRA baseline model development

Using a 10-fold cross validation (CV) approach, the mean 10-CV RMSE (std) was found to be 0.913 (0.0515). The optimal number of neighbours to use in the GenRA approach was determined to be 7. Figure 9 shows the performance of the inner CV results confirming that 7 neighbours gave rise to the optimal RMSE. The RMSE for the test set was determined to be 0.956 with a coefficient of determination (r2) of 0.360.
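The core of a GenRA-style prediction, a similarity-weighted average over the k most similar source substances, can be sketched as below. This is a simplified stand-in rather than the genra-py implementation: fingerprints are plain sets of hypothetical bit indices, Jaccard similarity supplies the weights, and the neighbour pPOD values are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 is assumed when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def genra_predict(target_bits, neighbours, k):
    """Similarity-weighted mean pPOD of the k most similar source substances."""
    scored = sorted(
        ((jaccard(target_bits, bits), ppod) for bits, ppod in neighbours),
        key=lambda item: item[0],
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in scored)
    if total_weight == 0:
        raise ValueError("no neighbour shares any fingerprint bit with the target")
    return sum(sim * ppod for sim, ppod in scored) / total_weight


# Hypothetical fingerprints (sets of 'on' bits) paired with made-up pPOD values.
target_bits = {1, 2, 3, 4}
neighbours = [
    ({1, 2, 3, 4, 5}, 1.0),
    ({1, 2, 9}, 2.0),
    ({7, 8}, 3.0),
]
prediction = genra_predict(target_bits, neighbours, k=2)
print(round(prediction, 3))
```

The number of neighbours k is the hyperparameter tuned in the inner cross validation; in this sketch it is simply passed in.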

Figure 9:

Inner CV results (10-folds on the y-axis) with number of neighbours in the neighbourhood on the x-axis

4.9. Comparison of GenRA predictions with REACH dossier toxicity values

Predictions were made for the target substances and compared with the single point estimates derived from the REACH dossiers, the RSR PODs. GenRA outcomes were typically more conservative (65% of the time) whereas the REACH outcomes were more conservative the remaining 35% of the time. The empirical cumulative distribution functions for the pPOD outcomes from both GenRA and the REACH dossiers show this shift (see Figure 10).
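The comparison can be reproduced in outline with an empirical cumulative distribution function and a simple count of the cases in which one approach gives the higher pPOD. The values below are hypothetical and only illustrate the mechanics; on the pPOD scale a higher value corresponds to a lower, more conservative POD.

```python
def ecdf(values):
    """Return sorted values and the cumulative fraction at or below each."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


# Hypothetical pPOD values for five targets (higher pPOD = lower POD).
genra_ppod = [1.2, 0.8, 2.1, 1.5, 0.4]
reach_ppod = [0.9, 1.0, 1.6, 1.1, 0.2]

# Count the targets for which GenRA gives the higher (more conservative) pPOD.
more_conservative = sum(g > r for g, r in zip(genra_ppod, reach_ppod))
print(f"GenRA more conservative in {more_conservative / len(genra_ppod):.0%} of cases")

for label, values in (("GenRA", genra_ppod), ("REACH", reach_ppod)):
    xs, ps = ecdf(values)
    print(label, list(zip(xs, ps)))
```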

Figure 10:

ECDFs for GenRA and REACH predictions for the target substances. Lower POD values (in mg/kg-bw), i.e. higher potency, are right shifted. As can be seen in the ECDF, most of the GenRA predictions are right shifted (more toxic) relative to those from the REACH dossiers until the cumulative probability reaches 0.8 and over.

A similar comparison was also made by making predictions for the target substances drawing source analogues from REACH registered substances with ToxValDB data. The intent was to account for the variability in the toxicity data by relying on a single source of data, namely ToxValDB, so that a comparison of the read-across predictions would only evaluate the differences in the source analogues themselves. Figure 11 shows the empirical cumulative distribution functions (ECDFs) of the difference between source and GenRA pPOD outcomes as well as the ECDFs for the source and GenRA pPOD values themselves. GenRA predictions are in many cases right shifted (more toxic) relative to the source analogues from the REACH dossiers using ToxValDB data. 56% of examples have a source-GenRA pPOD difference of 0 or less.

Figure 11:

ECDFs for GenRA predictions for the target substances using ToxValDB data for REACH registered substances only and ToxValDB data for the source analogues. Both a GenRA prediction and a ToxValDB outcome were available for only 153 cases.

The ToxPrint enrichment analysis identified that there were 2 enriched ToxPrints associated with the largest discrepancies between the REACH and GenRA predictions. Both of these were 5-membered heterocyclic ring structures as shown in Table 3. The std of the pPODs of all source analogues of the target substances containing these ToxPrints was lower when compared with the pPODs of all source analogues.

Table 3:

Enriched ToxPrints: Structural features found to be overrepresented in substances where there were large differences between REACH and GenRA read-across predictions.

| OR | P | TxP | TP | TxP representation |
|---|---|---|---|---|
| inf | 0.003832 | ring:hetero_[5]_Z_1-Z | 8 | [image: nihms-1999146-t0001.jpg] |
| inf | 0.031941 | ring:hetero_[5]_O_oxolane | 5 | [image: nihms-1999146-t0002.jpg] |
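The infinite odds ratios in Table 3 follow directly from the 2x2 contingency table when no substance carries the feature without also showing a large discrepancy. The sketch below computes an odds ratio and a one-sided Fisher exact p-value from such a table. The TP and FP counts mirror the first row of Table 3 (8 true positives, no false positives), but the FN and TN counts are hypothetical, so the printed p-value is not expected to match the tabulated one; the use of a one-sided test is also an assumption.

```python
from math import comb, inf


def odds_ratio(tp, fp, fn, tn):
    """Odds ratio (tp * tn) / (fp * fn); infinite when only the denominator is zero."""
    numerator = tp * tn
    denominator = fp * fn
    if denominator == 0:
        return inf if numerator > 0 else float("nan")
    return numerator / denominator


def fisher_one_sided(tp, fp, fn, tn):
    """P(at least tp feature-positive hits) under the hypergeometric null."""
    total = tp + fp + fn + tn
    hits = tp + fn          # substances with a large discrepancy
    with_feature = tp + fp  # substances containing the ToxPrint
    upper = min(hits, with_feature)
    tail = sum(
        comb(hits, x) * comb(total - hits, with_feature - x)
        for x in range(tp, upper + 1)
    )
    return tail / comb(total, with_feature)


# TP = 8 and FP = 0 follow the first row of Table 3; FN and TN are hypothetical.
tp, fp, fn, tn = 8, 0, 30, 100
print(odds_ratio(tp, fp, fn, tn))
print(fisher_one_sided(tp, fp, fn, tn))
```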

For the downloaded REACH registered substances, only 10,034 were associated with a CASRN that enabled ease of querying for DSSTox content. Of the 10,034 substances, there were 9515 matches to DTXSID identifiers, of which 8480 were associated with defined SMILES. Morgan fingerprints could be generated for 8424 of the substances with SMILES. The list of target and original source substances was added to the set before a pairwise Jaccard similarity matrix was computed. For each of the target substances, the pairwise similarity with the chosen source analogue was derived, and the closest source analogue that could have been identified from the REACH registered substances was determined. For about 24% of the target substances, there was no difference in the similarities between the actual source analogue identified and the closest source analogue possible. However, in the remainder of cases, the mean difference was 0.22, demonstrating that a structurally more similar analogue might have been identified. Figure 12 shows the distribution of the Jaccard similarity differences between the source analogue that had been selected and the most structurally similar analogue that could potentially have been drawn from the REACH registered substances inventory. It should be recognised that some of the substances may not have been registered at the time and so could not have been considered candidates for analogue selection. Furthermore, those structurally closer analogues might not necessarily be data rich.
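The comparison between the chosen source analogue and the closest available analogue can be sketched as a similarity gap per target. In the example below, fingerprints are again hypothetical sets of "on" bits, the inventory is assumed to exclude the target itself, and the helper name is illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 is assumed when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_gap(target_bits, chosen_bits, inventory):
    """Best similarity available in the inventory minus that of the chosen source."""
    chosen = jaccard(target_bits, chosen_bits)
    best = max(jaccard(target_bits, bits) for bits in inventory.values())
    # If the chosen source is already the closest option, the gap is zero.
    return max(best, chosen) - chosen


# Hypothetical fingerprints; the inventory is assumed to exclude the target.
target_bits = {1, 2, 3, 4}
chosen_bits = {1, 2, 7, 8}
inventory = {
    "candidate_A": {1, 2, 3, 9},
    "candidate_B": {1, 2, 7, 8},
}

gap = similarity_gap(target_bits, chosen_bits, inventory)
print(round(gap, 3))
```

A gap of zero corresponds to the cases where the registrant's source analogue was already the most structurally similar candidate.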

Figure 12:

Histogram and ECDF for the difference in Jaccard similarities between the source analogue selected for each target substance and the most structurally similar analogue that could potentially have been drawn from the REACH registered inventory.

5. CONCLUSIONS

Using the REACH Study Results from IUCLID and the structural information from DSSTox, it was possible to extract a large number of target-source read-across associations from repeated dose oral toxicity and developmental studies. An assumption was made that all the target-source pairs were valid, although further triage of the cases identified would be needed to verify whether the read-across proposed was ultimately appropriate. A handful of dossiers were reviewed to see what justifications, if any, were provided (see supplementary information). In the few cases identified, the extent to which sufficient and adequate supporting data were provided seemed to determine whether or not they were successful.

A significant percentage of target-source pairs appeared quite different when evaluating their pairwise similarities, particularly their structural similarity (median 0.43) and metabolic similarity (from a median of 0 for similarity in metabolites to a median of 0.085 for the WL kernel and a median of 0.14 for the transformations). An attempt was made to quantify the potential contribution of each similarity context for identifying the target-source pairs using an LDA model. Structural similarity, followed by metabolic similarity, played the largest role in rationalising the relationship. Though pairwise similarity of physicochemical properties was high for target-source analogue pairs (median 0.9), physicochemical property similarity appeared relatively uninformative for defining source analogues. Evaluating similarity for read-across associations across this large dataset suggests that whilst structural similarity is the most determining factor in read-across, the actual structural similarity was quite variable.

A baseline model was then developed using the GenRA approach to predict the 10th percentile of POD outcomes from oral repeated dose studies extracted from ToxValDB. The coefficient of determination (r2) for the test set was 0.36 with an RMSE of 0.956. The RMSE of this model seems reasonable, given that multi-linear regression models of in vivo PODs for replicate repeated dose toxicity studies of a single chemical perform with RMSE values approaching 0.5 [29,30]. When constructing a minimum prediction interval for a new repeated dose toxicity study POD using an RMSE of approximately 0.5, one might expect that a new value could fall within ±1.96 multiplied by this RMSE, or roughly speaking within ±1 log10 unit of the “true” POD value. Quantitative predictions will be associated with estimates of variance that include not only the inherent variability in replicate in vivo studies of single chemicals but also the variance in the multiple chemicals used to construct a particular local neighbourhood. Predictions of the target substances were then made using 7 nearest neighbours extracted from the ToxValDB database on the basis of Morgan chemical fingerprints. The PODs from the IUCLID dossiers for the source analogues used in reading across the target substances were extracted and compared with the GenRA predictions. In 65% of cases the GenRA predictions were more conservative, whereas in 35% of cases the RSR POD values were more conservative. A ToxPrint enrichment analysis was performed in an attempt to explore whether there were specific chemical features more likely to explain the larger discrepancies between the GenRA and RSR POD values. The results were not particularly insightful given that only 2 structural features were identified, both of which were heterocyclic ring structures. Whilst the p-values were significant, there were no examples of substances that did not exceed the threshold but contained these structural features, which is why the odds ratio values were represented by infinity. Overall, there were no obvious types of structures which might account for the differences in the read-across predictions generated by the two approaches. It is likely that the differences in predictions were more related to the differences and variability in the toxicity PODs themselves.

This examination of read-across cases reported in the REACH dossiers highlighted several practical challenges that are common to other regulatory decision contexts. The target-source analogue pairs reported in the REACH dossiers by industry registrants provided some insight into the read-across submissions that might be expected. Most notable was that the analogues identified were often associated with low structural similarity, presumably because adequate and relevant empirical data (to address the specific information requirement) were lacking for source analogues that might have been more structurally similar. Further, over half of the REACH dossier dataset extracted corresponded to target substances that were not single structure, organic compounds. Such chemicals present a cheminformatic challenge in terms of identifying the most representative structure. Future work will consider the impact of bioactivity similarity as well as other ways of characterising metabolic similarity. As demonstrated in the current work, the pairwise similarity for target-source analogues was relatively poor for WL, transformation and metabolite similarity, with metabolite similarity being the poorest. It is unclear whether improvements in predicting metabolites would improve these contexts of similarity, or whether the metabolites simulated should have been pruned based on their predicted probability of formation or quantity produced. More data to benchmark metabolism predictions would ideally be needed to progress this research. Characterising metabolic similarity is an evolving area of study, and these three approaches represented a pragmatic starting point to objectively capture relevant aspects of metabolism and establish a baseline in quantifying the contribution metabolic similarity plays in analogue identification and evaluation.

GenRA systematically provided quantitative POD estimates that tended to be more conservative than the typical use of one source analogue to derive a POD estimate. This may be due to the use of a higher number of source analogues to construct a similarity-weighted average. Further, though more conservative, the GenRA POD estimates were typically within 1 log10-mmolar of the REACH dossier estimates, suggesting that without additional information or intensive expert review, GenRA could still provide an informative value for chemical assessment. A similar conclusion was reached when the comparison was made constraining the source analogues to substances that were registered substances under REACH and containing toxicity data within ToxValDB. If these structure-based POD estimates are sufficiently informative for chemical assessment, it does suggest that whilst the source analogues identified and used in submissions are imperfect, their read-across predictions are actually quite reasonable given the variability that can be expected in the underlying toxicity data relied upon as well as the variability observed in the different analogue similarities themselves.

Finally, the source analogues identified for each of the target substances were compared with the REACH registered substances inventory to determine whether more structurally similar analogues might have been chosen (at least for those substances which could be mapped to DSSTox content). In 24% of cases, there was no difference in the similarity, i.e. the proposed source substance was the closest analogue that could have been selected. However, the remaining cases demonstrate that more similar analogues could have been chosen, assuming registrants had access to the whole REACH database, though it is also acknowledged that those analogues might not be associated with the relevant empirical data.

Supplementary Material

Supplement1

FUNDING

The work presented in this manuscript was supported by funding from the US Environmental Protection Agency and the European Chemicals Agency.

Footnotes

DISCLAIMER

The views expressed in this manuscript are those of the authors and do not necessarily reflect the views or policies of the US Environmental Protection Agency or the European Chemicals Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

References

  • [1].OECD, Guidance Document for the Use of Adverse Outcome Pathways in Developing Integrated Approaches to Testing and Assessment (IATA), Organisation for Economic Co-operation and Development, Paris, 2017. URL https://www.oecd-ilibrary.org/environment/guidance-document-for-the-use-of-adverse-outcome-pathways-in-developing-integrated-approaches-to-testing-and-assessment-iata_44bb06c1-en;jsessionid=qIxTrvRIM6C5cT-QZyfB0GFgUAChc_ZMpz9Tt5GK.ip-10-240-5-4 [Google Scholar]
  • [2].ECHA, Read-Across Assessment Framework (RAAF) (2017). doi: 10.2823/619212. URL https://echa.europa.eu/support/registration/how-to-avoid-unnecessary-testing-on-animals/grouping-of-substances-and-read-across [DOI] [Google Scholar]
  • [3].Ball N, Bartels M, Budinsky R, Klapacz J, Hays S, Kirman C, Patlewicz G, The challenge of using read-across within the EU REACH regulatory framework; how much uncertainty is too much? Dipropylene glycol methyl ether acetate, an exemplary case study, Regulatory Toxicology and Pharmacology 68 (2) (2014) 212–221, number: 2. doi: 10.1016/j.yrtph.2013.12.007. URL https://linkinghub.elsevier.com/retrieve/pii/S0273230013002225 [DOI] [PubMed] [Google Scholar]
  • [4].Schultz TW, Amcoff P, Berggren E, Gautier F, Klaric M, Knight DJ, Mahony C, Schwarz M, White A, Cronin MTD, A strategy for structuring and reporting a read-across prediction of toxicity, Regulatory Toxicology and Pharmacology 72 (3) (2015) 586–601. doi: 10.1016/j.yrtph.2015.05.016. URL https://www.sciencedirect.com/science/article/pii/S0273230015001154 [DOI] [PubMed] [Google Scholar]
  • [5].Ball N, Cronin MTD, Shen J, Blackburn K, Booth ED, Bouhifd M, Donley E, Egnash L, Hastings C, Juberg DR, Kleensang A, Kleinstreuer N, Kroese ED, Lee AC, Luechtefeld T, Maertens A, Marty S, Naciff JM, Palmer J, Pamies D, Penman M, Richarz A-N, Russo DP, Stuard SB, Patlewicz G, van Ravenzwaay B, Wu S, Zhu H, Hartung T, Toward Good Read-Across Practice (GRAP) guidance, ALTEX 33 (2) (2016) 149–166, number: 2. doi: 10.14573/altex.1601251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Patlewicz G, Lizarraga LE, Rua D, Allen DG, Daniel AB, Fitzpatrick SC, Garcia-Reyero N, Gordon J, Hakkinen P, Howard AS, Karmaus A, Matheson J, Mumtaz M, Richarz A-N, Ruiz P, Scarano L, Yamada T, Kleinstreuer N, Exploring current read-across applications and needs among selected U.S. Federal Agencies, Regulatory toxicology and pharmacology: RTP 106 (2019) 197–209. doi: 10.1016/j.yrtph.2019.05.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Schultz TW, Richarz A-N, Cronin MTD, Assessing uncertainty in read-across: Questions to evaluate toxicity predictions based on knowledge gained from case studies, Computational Toxicology 9 (2019) 1–11. doi: 10.1016/j.comtox.2018.10.003. URL https://www.sciencedirect.com/science/article/pii/S2468111318300811 [DOI] [Google Scholar]
  • [8].Beal MA, Gagne M, Kulkarni SA, Patlewicz G, Thomas RS, Barton-Maclaren TS, Implementing in vitro bioactivity data to modernize priority setting of chemical inventories, ALTEX - Alternatives to animal experimentation 39 (1) (2022) 123–139, number: 1. doi: 10.14573/altex.2106171. URL https://www.altex.org/index.php/altex/article/view/2293 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Escher SE, Kamp H, Bennekou SH, Bitsch A, Fisher C, Graepel R, Hengstler JG, Herzler M, Knight D, Leist M, Norinder U, Ouédraogo G, Pastor M, Stuard S, White A, Zdrazil B, van de Water B, Kroese D, Towards grouping concepts based on new approach methodologies in chemical hazard assessment: the read-across approach of the eu-toxrisk project, Archives of Toxicology 93 (12) (2019) 3643–3667, number: 12. doi: 10.1007/s00204-019-02591-7. URL http://link.springer.com/10.1007/s00204-019-02591-7 [DOI] [PubMed] [Google Scholar]
  • [10].Pradeep P, Mansouri K, Patlewicz G, Judson R, A systematic evaluation of analogs and automated read-across prediction of estrogenicity: A case study using hindered phenols, Computational Toxicology (Amsterdam, Netherlands) 4 (2017) 22–30. doi: 10.1016/j.comtox.2017.09.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].Helman G, Shah I, Patlewicz G, Extending the Generalised Read-Across approach (GenRA): A systematic analysis of the impact of physicochemical property information on read-across performance, Computational Toxicology (Amsterdam, Netherlands) 8 (2018) 34–50. doi: 10.1016/j.comtox.2018.07.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Helman G, Patlewicz G, Shah I, Quantitative prediction of repeat dose toxicity values using GenRA, Regulatory Toxicology and Pharmacology 109 (2019) 104480. doi: 10.1016/j.yrtph.2019.104480. URL https://linkinghub.elsevier.com/retrieve/pii/S0273230019302442 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Shah I, Liu J, Judson RS, Thomas RS, Patlewicz G, Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information, Regulatory toxicology and pharmacology: RTP 79 (2016) 12–24. doi: 10.1016/j.yrtph.2016.05.008. [DOI] [PubMed] [Google Scholar]
  • [14].Lester C, Byrd E, Shobair M, Yan G, Quantifying Analogue Suitability for SAR-Based Read-Across Toxicological Assessment, Chemical Research in Toxicology 36 (2) (2023) 230–242. doi: 10.1021/acs.chemrestox.2c00311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Gadaleta D, Golbamaki Bakhtyari A, Lavado GJ, Roncaglioni A, Benfenati E, Automated integration of structural, biological and metabolic similarities to improve read-across, ALTEX 37 (3) (2020) 469–481. doi: 10.14573/altex.2002281. [DOI] [PubMed] [Google Scholar]
  • [16].EC, Regulation (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, amending Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC, legislative Body: CONSIL, EP (Dec. 2006). URL http://data.europa.eu/eli/reg/2006/1907/oj/eng
  • [17].Klimisch HJ, Andreae M, Tillmann U, A systematic approach for evaluating the quality of experimental toxicological and ecotoxicological data, Regulatory toxicology and pharmacology: RTP 25 (1) (1997) 1–5. doi: 10.1006/rtph.1996.1076. [DOI] [PubMed] [Google Scholar]
  • [18].Grulke CM, Williams AJ, Thillanadarajah I, Richard AM, EPA’s DSSTox database: History of development of a curated chemistry resource supporting computational toxicology research, Computational Toxicology (Amsterdam, Netherlands) 12 (Nov. 2019). doi: 10.1016/j.comtox.2019.100096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [19].Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC, Patlewicz G, Shah I, Wambaugh JF, Judson RS, Richard AM, The CompTox Chemistry Dashboard: a community data resource for environmental chemistry, Journal of Cheminformatics 9 (1) (2017) 61. doi: 10.1186/s13321-017-0247-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Landrum G, RDKit: Open-source cheminformatics; http://www.rdkit.org, programmers: _:n10337.
  • [21].Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat İ, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, SciPy 1.0: fundamental algorithms for scientific computing in Python, Nature Methods 17 (3) (2020) 261–272, number: 3 Publisher: Nature Publishing Group. doi: 10.1038/s41592-019-0686-2. URL https://www.nature.com/articles/s41592-019-0686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [22].Mansouri K, Grulke CM, Judson RS, Williams AJ, OPERA models for predicting physicochemical properties and environmental fate endpoints, Journal of Cheminformatics 10 (1) (2018) 10. doi: 10.1186/s13321-018-0263-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Shervashidze N, Weisfeiler-Lehman Graph Kernels, Journal of Machine Learning Research 12 (2011) 2539–2561. [Google Scholar]
  • [24].Boyce M, Meyer B, Grulke C, Lizarraga L, Patlewicz G, Comparing the performance and coverage of selected in silico (liver) metabolism tools relative to reported studies in the literature to inform analogue selection in read-across: A case study, Computational Toxicology 21 (2022) 100208. doi: 10.1016/j.comtox.2021.100208. URL https://www.sciencedirect.com/science/article/pii/S2468111321000542 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [25].Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É, Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research 12 (2011) 2825–2830.
  • [26].Shah I, Tate T, Patlewicz G, Generalized Read-Across prediction using genra-py, Bioinformatics (Oxford, England) 37 (19) (2021) 3380–3381. doi: 10.1093/bioinformatics/btab210.
  • [27].Yang C, Tarkhov A, Marusczyk J, Bienfait B, Gasteiger J, Kleinoeder T, Magdziarz T, Sacher O, Schwab CH, Schwoebel J, Terfloth L, Arvidson K, Richard A, Worth A, Rathman J, New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modeling, Journal of Chemical Information and Modeling 55 (3) (2015) 510–528. doi: 10.1021/ci500667v.
  • [28].Wang J, Hallinger DR, Murr AS, Buckalew AR, Lougee RR, Richard AM, Laws SC, Stoker TE, High-throughput screening and chemotype-enrichment analysis of ToxCast phase II chemicals evaluated for human sodium-iodide symporter (NIS) inhibition, Environment International 126 (2019) 377–386. doi: 10.1016/j.envint.2019.02.024.
  • [29].Ly Pham L, Watford S, Pradeep P, Martin MT, Thomas R, Judson R, Setzer RW, Paul Friedman K, Variability in in vivo studies: Defining the upper limit of performance for predictions of systemic effect levels, Computational Toxicology (Amsterdam, Netherlands) 15 (2020) 100126. doi: 10.1016/j.comtox.2020.100126.
  • [30].Pradeep P, Friedman KP, Judson R, Structure-based QSAR Models to Predict Repeat Dose Toxicity Points of Departure, Computational Toxicology (Amsterdam, Netherlands) 16 (2020). doi: 10.1016/j.comtox.2020.100139.

Associated Data

Supplementary Materials

Supplement1
