Author manuscript; available in PMC: 2026 Mar 11.
Published in final edited form as: Chem Res Toxicol. 2022 May 13;35(6):992–1000. doi: 10.1021/acs.chemrestox.1c00443

Mixtures-inclusive in silico models of ocular toxicity based on US and international hazard categories

Alexander Sedykh †,*, Neepa Choksi , David G Allen , Warren M Casey §, Ruchir Shah , Nicole C Kleinstreuer §
PMCID: PMC12973229  NIHMSID: NIHMS2109055  PMID: 35549170

Abstract

Computational modeling grounded in reliable experimental data can help design effective nonanimal approaches to predict eye irritation and corrosion potential of chemicals. The National Toxicology Program (NTP) Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) has compiled and curated a database of in vivo eye irritation studies from the scientific literature and from stakeholder-provided data. The database contains 810 annotated records of 593 unique substances, including mixtures, categorized according to UN GHS and US EPA hazard classifications. This study reports a set of in silico models to predict EPA and GHS hazard classifications for chemicals and mixtures, accounting for purity by setting thresholds of 100% and 10% concentration.

We used two approaches to predict classification of mixtures: conventional and mixture-based. Conventional models evaluated each substance based on the chemical structure of its major component. These models achieved balanced accuracy in the range of 68–80% and 87–96% for the 100% and 10% test concentration thresholds, respectively. Mixture-based models, which accounted for all known components in the substance by weighted feature averaging, showed similar or slightly higher accuracy of 72–79% and 89–94% for the respective thresholds.

We also noted a strong trend between the calculated pH feature of each substance and its activity. Across all the models, the calculated pH of inactive substances was, on average, within one log10 unit of neutral pH, while for active substances, pH deviated from neutral by at least 2 log10 units. This pH dependency is especially important for complex mixtures.

Additional evaluation on an external test set of 673 substances obtained from ECHA dossiers achieved balanced accuracies of 64–71%, which suggests that these models can be useful in screening compounds for ocular irritation potential. Negative predictive value was particularly high and indicates the potential application of these models in a bottom-up approach to identify non-irritant substances.

Keywords: eye irritation, eye corrosion, Draize test, machine learning

Graphical Abstract


1. INTRODUCTION

Eye irritation testing is conducted as part of the overall safety assessment of a wide variety of regulated substances, including industrial and agricultural use chemicals, cosmetics, and ophthalmic care products1. For the past 75 years, the in vivo rabbit eye test (Draize test) has been used to assess eye irritation and corrosion potential2. In this test, substances are applied to a single rabbit eye, and the severity of the responses of the ocular tissues (i.e., cornea, conjunctiva, and iris) is used to assign ocular hazard classifications. However, studies have suggested that the responses observed in the rabbit eye test are not always relevant to the responses observed in humans3. Potential lack of human relevance, combined with animal welfare concerns and the implementation of international regulations banning animal testing of chemicals, cosmetic formulations, and ingredients, has led to an increase in the development and evaluation of ocular irritation methods that may reduce or replace animal testing4, 5. Several in vitro and ex vivo methods have been developed for the identification of severe eye irritants and corrosives and for identification of chemicals that do not require eye irritation hazard labelling. However, there is currently no single alternative, nonanimal eye irritation test that is accepted as a complete replacement for the rabbit eye test.

Computational approaches offer several advantages over in vitro and ex vivo alternatives, including reduced time and cost. Additionally, studies suggest that in silico tools could outperform in vitro and ex vivo methods when sufficient mechanistic information is available for model development6. In silico model reliability and relevance is dependent upon a variety of factors including, but not limited to, well-defined chemical structures with sufficient diversity, test concentration information, and results from high-quality in vivo and/or in vitro studies to serve as model training and evaluation data.

The National Toxicology Program (NTP) Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) collected high quality rabbit eye irritation data to assist in the evaluation of alternative methods. Using these data, we developed quantitative structure-activity relationship (QSAR) models to predict ocular irritation categories. Since the database includes information on the concentration tested in vivo, we were able to develop models based on concentration thresholds.

Conventional QSAR modeling7 uses chemical descriptors derived from single organic molecules (such as the largest chemical component of a salt). When applied to chemical datasets that contain mixtures, this can result in identical chemical descriptor vectors for different substances if they share the same main component. Hence, mixtures are often excluded from toxicity modeling studies altogether8,9. Nevertheless, modeling of mixtures is an active area of research10,11. Its development is hindered mostly by the limited availability of experimental data for large datasets. We refer the reader to a decade-old but still very relevant review on QSAR analysis of mixtures by Muratov et al12. In an effort to improve on the conventional approach, we developed QSAR models that used a stoichiometry-based averaging of chemical descriptors12 over the components of a test substance identified as a salt or mixture.

The models jointly represent a panel of binary classifiers, each trained to discriminate between substances on a spectrum of eye irritation potential based on U.S. Environmental Protection Agency (EPA) and U.N. Globally Harmonized System (GHS) hazard categories (Tables 1 and 2, respectively).

Table 1.

EPA Hazard Categories for Eye Irritation

EPA Hazard Category | Classification Criteria | Personal Protective Equipment (PPE)
Category I | Corrosive (irreversible destruction of ocular tissue), corneal involvement or irritation persisting for more than 21 days | Protective eyewear (1)
Category II | Corneal involvement or irritation clearing in 8–21 days | Protective eyewear (1)
Category III | Corneal involvement or irritation clearing in 7 days or less | No minimum (2)
Category IV | Minimal effects clearing in less than 24 hours | No minimum (2)

Positive response criteria: corneal opacity or iritis ≥1, or conjunctival redness or chemosis ≥2, in a single animal at any observed time point up to 21 days after substance administration.

Label Review Manual. Chapter 7: Precautionary Statements. Revised July 2014.

(1) “Protective eyewear” is described for eye protection, unless a specific type of eyewear protection is needed to ensure adequate protection.

(2) Agency may require PPE on a product-specific basis.

Table 2.

GHS Hazard Categories for Eye Irritation

GHS Hazard Category | Positive Response

Category 1:
A response in at least one animal (effects on the cornea, iris, or conjunctivae) that is not expected to reverse or does not fully reverse within an observation period of normally 21 days; or, in at least 2 of 3 animals, a positive response of
(i) corneal opacity ≥3; and/or
(ii) iritis >1.5;
calculated as the mean of scores following grading at 24, 48, and 72 hours after instillation of the test material.

Category 2A:
In at least 2 of 3 animals, a positive response of
(i) corneal opacity ≥1; and/or
(ii) iritis ≥1; and/or
(iii) conjunctival redness ≥2; and/or
(iv) conjunctival oedema ≥2;
calculated as the mean of scores following grading at 24, 48, and 72 hours after instillation of the test material, and which fully reverses within an observation period of normally 21 days.

Category 2B:
As for Category 2A, but which fully reverses within 7 days of observation.

Globally Harmonized System of Classification and Labelling of Chemicals (GHS). United Nations. New York. 2019

2. EXPERIMENTAL PROCEDURES

2.1. Classification of Database Substances

The NICEATM ocular toxicity database (OCUTOXDB; Supplemental Table 1) contains 810 curated records for 593 chemical substances. Data were obtained from a variety of sources including, but not limited to, published literature and data provided in response to requests for data from NICEATM. For each substance, CAS Registry Number (CASRN), name, test concentration (if provided), data source, and provided in vivo data were recorded. If test concentration was not provided, it was presumed to be tested at 100% in accordance with testing guidelines. Individual animal data were reviewed and curated by NICEATM staff for accuracy. Ocular toxicity hazard categories (EPA and GHS classification systems) were either provided by the data source or assigned by NICEATM based on individual animal data. If a discrepancy was noted between the provided classification and the NICEATM-assigned classification based on the individual animal data, the classification based on individual animal data was used for database and model development. SMILES structures were retrieved from the US EPA Chemistry Dashboard (https://comptox.epa.gov/dashboard) using the provided CASRN. Additionally, each chemical substance is linked (via auxiliary data tables) to a set of its simple chemical components with stoichiometry information. The chemical components table contained 567 unique chemicals.

Table 3 shows the distribution of EPA and GHS classifications for the 593 chemical substances in the 810 database records. The EPA and GHS ocular toxicity hazard classification systems do not correlate directly between categories. For example, 114 records labeled as “No Category” in GHS (i.e., not requiring ocular hazard labeling under GHS) were classified as EPA Category III, while another 204 “No Category” records were classified as EPA Category IV. However, Table 3 also shows that there is high concordance between the two systems for classification of chemicals as corrosives (98% of entries classified as GHS Category 1 were also classified as EPA Category I). Given this high degree of overlap, the 57 entries that were classified as GHS Category 1 but lacked the requisite individual animal data for a definitive EPA classification were treated as EPA Category I for model development. A similar assignment of EPA hazard classification based on GHS category was done to a limited extent for EPA Categories II, III, and IV; details are provided in Supplemental Table 2.

Table 3.

Comparison of ocular irritation hazard categories for EPA and GHS systemsa.

GHS Category \ US EPA Category | I | II | III | IV | (no call) (b)
1 | 151 | 3 | – | – | 57
2A | 3 | 30 | 10 | – | 37
2B | – | 3 | 38 | – | 8
No Category | – | 5 | 114 | 204 | 71
(no call) (b) | 2 | 2 | 10 | 1 | 61
a

Hazard categories correspond to corrosive (EPA Category I and GHS Category 1), irritant (EPA Category II and GHS Category 2A), mild irritant (EPA Category III and GHS Category 2B), and practically non-irritant/not classified (EPA Category IV and GHS No Category). Shaded cells reflect concordant calls between classification systems.

b

“No call” summarizes unclassifiable studies or missing data.

2.2. Modeling Datasets

For modeling purposes, we created six modeling categories (Figure 1), three per hazard classification system. The EPA-based categories were: “EPA_CORR” (to represent substances that produce a corrosive effect; EPA Category I), “EPA_IRR” (to represent substances that produce irritant effects requiring use of protective eyewear; EPA Category I or II), and “EPA_ANY” (to represent substances that produce any ocular irritation; EPA Category I, II, or III).

Figure 1.


Three modeling categories developed (A) based on the EPA hazard classifications for ocular irritation and (B) based on GHS classification.

  • For the modeling category “EPA_CORR”, a substance was assigned the status
    • “Active” (and numerical value of 1) if it was classified as EPA Category I or classified as EPA Category I based on a GHS Category 1 classification
    • “Inactive” (and numerical value of 0) if the substance was classified as Category II, III, or IV, or
    • omitted from the category if the EPA classification could not be assigned.
  • For the modeling category “EPA_IRR”, a substance was assigned as
    • “Active” if it was classified as EPA Category I or II, or classified as EPA Category I based on a GHS Category 1 classification
    • “Inactive” if the substance was classified as EPA Categories III or IV, or
    • omitted if the EPA classification could not be assigned. This binary decision was selected based on the labeling requirements for eye protection according to hazard category (i.e., eye protection is required for Categories I-II, but not for Categories III-IV, see Table 1)
  • For the modeling category “EPA_ANY”, a substance was assigned as
    • “Active” if it was classified as EPA Category I, II, or III, or classified as EPA Category I based on a GHS Category 1 classification
    • “Inactive” if it was classified as EPA Category IV, or
    • omitted if the EPA classification could not be assigned.
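The EPA-based labeling rules above can be sketched as a small helper (an illustrative sketch only; the function and dictionary names below are ours, not the authors' code):

```python
# Illustrative sketch of the EPA-based binary labeling scheme.
# ACTIVE_SETS and label() are assumed names, not the authors' code.

ACTIVE_SETS = {
    "EPA_CORR": {"I"},             # corrosives only (EPA Category I)
    "EPA_IRR": {"I", "II"},        # corrosives and irritants (eyewear required)
    "EPA_ANY": {"I", "II", "III"}, # any ocular irritation
}

def label(epa_category, endpoint):
    """Return 1 (active), 0 (inactive), or None (omitted from the category)."""
    if epa_category is None:  # EPA classification could not be assigned
        return None
    return 1 if epa_category in ACTIVE_SETS[endpoint] else 0
```

For example, a Category II substance is active for "EPA_IRR" but inactive for "EPA_CORR".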

Likewise, three modeling categories were also created for GHS-based hazard classifications (Figure 1) using GHS classes instead of their EPA counterparts (as seen in Table 3).

These three modeling categories were further subdivided based on the concentration tested in the assay. Two concentration levels were evaluated: 10% (low concentration; “L”) and 100% (high concentration, “H”). The distribution of reported test concentrations across OCUTOXDB entries is provided in Supplemental Table 3.

For 100% concentration datasets (i.e., EPA_CORR_H, EPA_IRR_H, EPA_ANY_H, and corresponding GHS versions), substances were identified as “active” when (a) assigned an “active” categorization based on the modeling category (see above) and (b) tested at any concentration. A substance was identified as “inactive” when (a) assigned an “inactive” categorization based on the modeling category and (b) tested at concentration ≥90%. To reduce potential bias due to the large number of inactive substances in the CORR_H category, only test substances tested at 100% were included in the “inactive” classification. Therefore, a test substance reported as corrosive at a 10% concentration was identified as an active substance in the CORR_H dataset. However, a test substance identified as non-corrosive at 10% concentration was omitted.

For 10% concentration datasets (EPA_CORR_L, EPA_IRR_L, EPA_ANY_L, and corresponding GHS versions), substances were identified as “active” when (a) assigned an “active” categorization based on the modeling category (see above) and (b) tested at a 10% concentration. A substance was identified as “inactive” when (a) assigned an “inactive” categorization based on the modeling category and (b) tested at concentration ≥10%.
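The concentration-threshold rules for the high- (100%) and low- (10%) threshold datasets can be summarized in code (a minimal sketch with hypothetical function names; `call` is the substance's 1/0 modeling-category call and `conc` the tested concentration in percent):

```python
# Illustrative sketch of the dataset-assignment rules (assumed helper
# names, not the authors' code).

def assign_high(call, conc):
    """100% threshold datasets (e.g., EPA_CORR_H)."""
    if call == 1:                  # actives: tested at any concentration
        return "active"
    if call == 0 and conc >= 90:   # inactives: tested at >= 90%
        return "inactive"
    return None                    # omitted (e.g., inactive at 10% only)

def assign_low(call, conc):
    """10% threshold datasets (e.g., EPA_CORR_L)."""
    if call == 1 and conc <= 10:   # actives: tested at 10% (or less)
        return "active"
    if call == 0 and conc >= 10:   # inactives: tested at >= 10%
        return "inactive"
    return None
```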

Category labels “active” and “inactive” were assigned to each tested substance using the above criteria. All entries for the same substance (as identified by its CASRN) were assigned a single activity call for each dataset; the most severe classification was assigned (i.e., if any of the entries was active, the overall call for the substance was active). The distribution of actives and inactives for each dataset modeled is shown in Table 4.

Table 4.

Ocular Toxicity datasetsa of US EPA and EU GHS hazard categories

Dataset Name | Active Concentration Threshold | Inactive Concentration Threshold | Activity Type | Total Inactives | Total Actives
OCU_EPA_CORR_H | ≤100% | ≥90% | EPA_CORR | 311 | 155
OCU_EPA_IRR_H | ≤100% | ≥90% | EPA_IRR | 258 | 184
OCU_EPA_ANY_H | ≤100% | ≥90% | EPA_ANY | 142 | 333
OCU_EPA_CORR_L | ≤10% | >10% and <100% (b) | EPA_CORR | 45 | 32
OCU_EPA_IRR_L | ≤10% | >10% and <100% (b) | EPA_IRR | 39 | 35
OCU_EPA_ANY_L | ≤10% | >10% and <100% (b) | EPA_ANY | 152 | 46
OCU_GHS_CORR_H | ≤100% | ≥90% | GHS_CORR | 330 | 152
OCU_GHS_IRR_H | ≤100% | ≥90% | GHS_IRR | 284 | 205
OCU_GHS_ANY_H | ≤100% | ≥90% | GHS_ANY | 261 | 230
OCU_GHS_CORR_L | ≤10% | >10% and <100% (b) | GHS_CORR | 49 | 32
OCU_GHS_IRR_L | ≤10% | >10% and <100% (b) | GHS_IRR | 43 | 35
OCU_GHS_ANY_L | ≤10% | >10% and <100% (b) | GHS_ANY | 282 | 38
a

See Supplemental Materials for a list of substances in each dataset;

b

See Supplemental Table 3 for additional details.

This approach, which we term maximum-based merging, ensured that, among multiple data entries, those reporting the highest ocular hazard were given priority. About 20% of the activity labels in each dataset were based on multiple data entries and thus required such merging. For comparison (Supplemental Table 4), we also performed mean-based merging, in which the activity labels for all entries for a single substance were averaged and rounded to either 1 or 0. We noted around 8% discrepancy between the merged calls of these two methods. The modeling datasets (Table 4) have varying compositions of active and inactive substances. Counts of multi-component substances (e.g., salts and mixtures) in each dataset are given in Supplemental Table 5.
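The two merging schemes can be sketched as follows (illustrative helpers operating on a substance's list of 0/1 entry labels; not the authors' implementation):

```python
# Illustrative helpers for the two merging schemes; `labels` is the list
# of 0/1 activity labels from all entries for one substance.

def merge_max(labels):
    """Maximum-based merging: any active entry makes the substance active."""
    return max(labels)

def merge_mean(labels):
    """Mean-based merging: average the entry labels and round to 0 or 1.
    (Tie-breaking at exactly 0.5 is a choice; here ties go to inactive.)"""
    return 1 if sum(labels) / len(labels) > 0.5 else 0
```

The two schemes disagree whenever a minority of a substance's entries are active: merge_max([1, 0, 0]) returns 1, while merge_mean([1, 0, 0]) returns 0.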

2.3. Chemical Features

Our main sources of chemical structural features were 1613 Mordred 2D descriptors13 and 729 ToxPrint chemotypes14. Both structural feature sets are free for public use. ToxPrint chemotypes were developed to represent chemical structural moieties most often leading to toxicity. This combined set was reduced to 811 descriptors after removal of redundant variables: highly correlated variables (Pearson r2 greater than 0.99) and nearly constant variables (fewer than two unique values for continuous variables, or fewer than two occurrences for discrete variables). Two descriptors based on the SMARTS query language (Daylight Chemical Information Systems Inc.) were developed and coded to represent heavy metals and electrophiles. We also added one pH-related descriptor based on calculations from ADMET Predictor software (V9.5, Simulations Plus). The final set of 814 descriptors calculated for the 567 substance components is provided in Supplemental Table 6.
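The redundancy-removal step can be approximated as below (an assumed implementation sketch, not the authors' code; descriptor columns are plain lists, and the thresholds mirror those stated above):

```python
# Assumed sketch of the descriptor-reduction step: drop near-constant
# columns, then drop any column whose squared Pearson correlation with
# an already-kept column exceeds 0.99.

def pearson_r2(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def reduce_descriptors(columns, r2_cutoff=0.99, min_unique=2):
    """columns: list of descriptor columns (equal-length lists of values)."""
    varying = [c for c in columns if len(set(c)) >= min_unique]
    kept = []
    for c in varying:
        if all(pearson_r2(c, k) <= r2_cutoff for k in kept):
            kept.append(c)
    return kept
```

For example, a column that is an exact multiple of another (r2 = 1) is dropped, as is a constant column.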

Finally, we prepared two versions of descriptor matrices based on chemical substance-to-component mapping in OCUTOXDB. For the set of MAIN descriptors, each substance was represented by chemical descriptors calculated for its largest component (conventional approach). For the set of MIX descriptors, the descriptors for all the known components of a mixture were combined as a weighted average12 using the following equation.

X_subst = Σ_i w_i X_ci;  w_i = N_i H_ci / Σ_i N_i H_ci  (Eq. 1)

In this equation, X_subst and X_ci are the chemical descriptor vectors for the substance and its i-th component, w_i is the weight, H_ci is the number of heavy atoms in the i-th component, and N_i is the occurrence count of the i-th component in the substance. For the pH descriptor, the median pH value of the substance’s components was taken. These calculations for the example of sodium oxalate (CASRN 62-76-0) are given in Supplemental Table 7.
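A minimal sketch of Eq. 1 in code (illustrative helper, not the authors' implementation; `X_components` holds one descriptor vector per component, `N` the occurrence counts, and `H` the heavy-atom counts):

```python
# Minimal sketch of Eq. 1: component descriptor vectors are combined
# as an N*H-weighted average (assumed helper names).

def mix_descriptors(X_components, N, H):
    """Return the substance descriptor vector X_subst."""
    raw = [n * h for n, h in zip(N, H)]   # N_i * H_ci
    total = sum(raw)
    w = [r / total for r in raw]          # w_i = N_i*H_ci / sum_i N_i*H_ci
    n_desc = len(X_components[0])
    return [sum(w[i] * X_components[i][j] for i in range(len(w)))
            for j in range(n_desc)]
```

For a hypothetical two-component substance with N = [2, 1] and H = [1, 6], the weights become 0.25 and 0.75.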

2.4. Modeling Procedure

The following machine learning algorithms were used for the analysis: random forest (RF), support vector machines with radial kernel (SVM), extremely randomized trees (XT), and generalized linear models (GLM). These algorithms were selected based on previously demonstrated utility and the capability of capturing complex relationships in the data15. All modeling procedures were implemented using the “caret” R package, with parameter tuning via random search and with 5-fold cross-validation repeated three times to collect cross-validation prediction results for individual models and their consensus (Supplemental Table 8). The models were trained to maximize (on the training set) the area under the receiver operating characteristic curve (ROC AUC; a ranking metric, with 0.5 corresponding to random performance) and were excluded from the consensus average if training ROC AUC was less than 0.65.
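The study itself used the caret R package; as an illustrative stand-in, the consensus rule can be sketched in Python (assumed helper names; the 0.65 gate mirrors the training ROC AUC cut-off above):

```python
# Illustrative stand-in for the consensus rule. Models whose training
# ROC AUC falls below the 0.65 gate are excluded from the average.

def roc_auc(y_true, scores):
    """Rank-based ROC AUC: probability that a random active outscores
    a random inactive (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def consensus(model_scores, train_aucs, auc_cutoff=0.65):
    """Average prediction scores across models passing the AUC gate."""
    kept = [s for s, a in zip(model_scores, train_aucs) if a >= auc_cutoff]
    return [sum(col) / len(kept) for col in zip(*kept)]
```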

2.5. European Chemicals Agency (ECHA) Test Set

To further evaluate the models developed in this study, we used a set of chemicals extracted from registration dossiers provided to NICEATM by the European Chemicals Agency. We started with 920 substances that had ocular toxicity data in some form (either as a GHS category or a free-text description of animal data outcomes). We removed substances with a missing ocular toxicity category, an undefined chemical structure, or unlikely relevance (e.g., pure element entries, insoluble metal oxides). We also removed 93 substances (based on CASRN), represented by 127 rows of data, that overlapped with OCUTOXDB. Comparison of these overlapping substances indicated that 75/127 (59%) entries reported the same response.

The final external validation set contained 673 substances (515 non-irritants, 108 of GHS Category 1, 35 of Category 2A, and 15 of Category 2B). The curated list of substances, along with available ocular toxicity information and prediction results, is given in Supplemental Table 9. These data were reviewed, and GHS hazard classification categories were assigned or confirmed. Since no test dose information was available, we assumed a 100% test dose. Around 20% of these entries had more than one structural component (most were salts). We implemented a global applicability domain (AD) for ECHA predictions based on the Euclidean distance in descriptor space from each test chemical to its nearest neighbor in the OCUTOXDB database16. The distribution of nearest-neighbor distances within OCUTOXDB was used as a reference to convert test chemical distances into Z-scores (e.g., Z = 0 signifies the average nearest-neighbor distance within OCUTOXDB), which are provided in Supplemental Table 9. Based on a Z cut-off of 3.0, only 12 substances were deemed out of the AD, being structurally very dissimilar from the modeling set compounds (see Supplemental Table 9).
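The applicability-domain Z-score calculation can be sketched as follows (an assumed implementation, not the authors' code; descriptor vectors are plain lists, and the reference distribution comes from the training set itself):

```python
# Sketch of the global applicability-domain (AD) Z-score: a test
# chemical's nearest-neighbor distance to the training set is
# standardized against the distribution of nearest-neighbor distances
# *within* the training set.

import math

def nn_distance(x, others):
    """Euclidean distance from vector x to the nearest vector in `others`."""
    return min(math.dist(x, o) for o in others)

def ad_z_scores(train, test):
    # reference: each training chemical's distance to its nearest other neighbor
    ref = [nn_distance(t, train[:i] + train[i + 1:]) for i, t in enumerate(train)]
    mean = sum(ref) / len(ref)
    sd = math.sqrt(sum((d - mean) ** 2 for d in ref) / len(ref))
    return [(nn_distance(x, train) - mean) / sd for x in test]
```

Z = 0 corresponds to the average nearest-neighbor distance within the training set; substances with Z above the 3.0 cut-off would be flagged as out of the AD.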

2.6. “Everything Out” Validation

Since distinct multi-component substances may still share some chemical components, randomly splitting modeling sets into training and test sets on the substance level alone could lead to overly optimistic results12. We therefore performed an additional validation scenario, a so-called “everything out” scheme, in which neither substances nor their components are shared between training and test sets, as opposed to the “mixtures out” scheme12 of randomly splitting unique substances. The validation partitioning groups are provided in the Supplemental Information. Briefly, 90 substances were assigned to “PART1” (always in the modeling set) and 22 substances to “PART2” (always in the external validation set). Each group was defined based on interlinking components shared across the substances of that group. The remaining 404 independent substances were assigned to the “ANY” partitioning group, from which an appropriate number was randomly sampled into the external validation set to bring it to about 20% for each validation iteration (repeated five times). The same modeling procedure described in Methods 2.4 was performed on the reduced modeling sets; the validation results are provided in Table 5 and Supplemental Table 10.
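The grouping of substances by shared components, which underlies such partitioning, can be sketched with a small union-find (illustrative helper, not the study's code):

```python
# Illustrative union-find grouping of substances that share chemical
# components. Whole groups can then be kept on one side of the
# train/test split.

def component_groups(substances):
    """substances: dict of substance name -> set of component ids.
    Returns a dict mapping each substance to its group representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    owner = {}  # component id -> first substance seen containing it
    for name, comps in substances.items():
        for c in comps:
            if c in owner:
                union(name, owner[c])
            else:
                owner[c] = name
    return {name: find(name) for name in substances}
```

In this sketch, two hypothetical substances "A" and "B" that share a component land in one group and therefore on the same side of the split.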

Table 5.

Comparison of external validation (20% held out) performance of mixture-based (“MIX”) and conventional (“MAIN”) models of ocular toxicity a

Dose Threshold | Activity Type | Model Type | Sensitivity | Specificity | Balanced Accuracy (b) | ROC AUC (c)
H (High, 100%) | EPA_CORR | MAIN | 0.80 (0.13) | 0.64 (0.13) | 0.72 (0.05) | 0.75 (0.06)
H (High, 100%) | EPA_CORR | MIX | 0.72 (0.10) | 0.75 (0.05) | 0.74 (0.04) | 0.77 (0.03)
H (High, 100%) | EPA_IRR | MAIN | 0.85 (0.10) | 0.66 (0.05) | 0.76 (0.04) | 0.80 (0.04)
H (High, 100%) | EPA_IRR | MIX | 0.85 (0.04) | 0.72 (0.04) | 0.79 (0.04) | 0.83 (0.03)
H (High, 100%) | EPA_ANY | MAIN | 0.71 (0.15) | 0.65 (0.12) | 0.68 (0.04) | 0.71 (0.06)
H (High, 100%) | EPA_ANY | MIX | 0.72 (0.12) | 0.72 (0.14) | 0.72 (0.02) | 0.75 (0.04)
L (Low, 10%) | EPA_CORR | MAIN | 0.85 (0.17) | 0.89 (0.15) | 0.87 (0.11) | 0.89 (0.10)
L (Low, 10%) | EPA_CORR | MIX | 0.93 (0.09) | 0.87 (0.14) | 0.90 (0.11) | 0.92 (0.09)
L (Low, 10%) | EPA_IRR | MAIN | 0.95 (0.06) | 0.97 (0.07) | 0.96 (0.06) | 0.95 (0.08)
L (Low, 10%) | EPA_IRR | MIX | 0.97 (0.06) | 0.92 (0.08) | 0.94 (0.03) | 0.97 (0.04)
L (Low, 10%) | EPA_ANY | MAIN | 0.90 (0.11) | 0.87 (0.06) | 0.89 (0.03) | 0.89 (0.05)
L (Low, 10%) | EPA_ANY | MIX | 0.86 (0.13) | 0.92 (0.05) | 0.89 (0.04) | 0.91 (0.06)
H (High, 100%) | GHS_CORR | MAIN | 0.83 (0.06) | 0.59 (0.08) | 0.71 (0.03) | 0.74 (0.02)
H (High, 100%) | GHS_CORR | MIX | 0.72 (0.19) | 0.76 (0.15) | 0.74 (0.03) | 0.77 (0.04)
H (High, 100%) | GHS_IRR | MAIN | 0.85 (0.13) | 0.75 (0.06) | 0.80 (0.04) | 0.83 (0.04)
H (High, 100%) | GHS_IRR | MIX | 0.81 (0.09) | 0.74 (0.05) | 0.77 (0.02) | 0.80 (0.02)
H (High, 100%) | GHS_ANY | MAIN | 0.81 (0.09) | 0.71 (0.10) | 0.76 (0.04) | 0.81 (0.05)
H (High, 100%) | GHS_ANY | MIX | 0.78 (0.10) | 0.68 (0.12) | 0.73 (0.03) | 0.77 (0.05)
L (Low, 10%) | GHS_CORR | MAIN | 0.96 (0.09) | 0.91 (0.14) | 0.93 (0.07) | 0.93 (0.07)
L (Low, 10%) | GHS_CORR | MIX | 0.97 (0.07) | 0.87 (0.08) | 0.92 (0.05) | 0.93 (0.06)
L (Low, 10%) | GHS_IRR | MAIN | 1.00 (0.00) | 0.80 (0.14) | 0.90 (0.07) | 0.90 (0.11)
L (Low, 10%) | GHS_IRR | MIX | 0.98 (0.06) | 0.88 (0.15) | 0.93 (0.07) | 0.92 (0.09)
L (Low, 10%) | GHS_ANY | MAIN | 0.94 (0.09) | 0.88 (0.07) | 0.91 (0.04) | 0.94 (0.04)
L (Low, 10%) | GHS_ANY | MIX | 0.89 (0.11) | 0.93 (0.03) | 0.91 (0.06) | 0.93 (0.05)
a

Table shows prediction statistics (mean, with standard deviation in parentheses) over five repetitions of external validation (20% held out) for the consensus across individual models under the “everything out” validation scheme (see Methods 2.6). Individual models’ external validation predictions are given in Supplemental Table 10, while their cross-validation statistics on the entire modeling sets are given in Supplemental Table 8.

b

Average of sensitivity and specificity.

c

Area under receiver operating characteristic curve.

3. RESULTS AND DISCUSSION

Based on curated data entries in OCUTOXDB, we compiled, per hazard classification system (GHS or EPA), six datasets covering three progressively inclusive activity endpoints, each at test concentration thresholds of 10% (designated “_L”) and 100% (designated “_H”), by mass or volume. The EPA_CORR datasets were designed to identify corrosive substances (US EPA Category I), while the EPA_IRR datasets also identified irritants (US EPA Categories I and II) and the EPA_ANY datasets covered corrosive, irritant, and mildly irritant substances as actives (US EPA Categories I-III). Likewise, the respective GHS categories (Figure 1) were used for the GHS-based datasets. As expected, these activity endpoint definitions had a direct influence on data composition biases (Table 4). For example, the EPA_CORR dataset contained fewer actives relative to the number of inactives. This was especially pronounced for datasets at the 100% concentration threshold, which essentially represents test outcomes for pure substances. For the 10% concentration threshold, the bias was much smaller because there were fewer studies overall at that concentration. A notable exception was the EPA_ANY_L dataset, for which nearly all qualified inactives were used, including substances that were inactive at 10% but active as pure substances (i.e., active in EPA_ANY_H).

It can also be noted that the EPA datasets have lower numbers of inactive substances than their GHS counterparts (Table 4). This is due to the slightly more conservative approach used by EPA, whereby a single animal with a positive irritation score (regardless of severity) can drive the classification, resulting in a larger number of active calls relative to GHS (see also Table 3).

For each of the twelve datasets in Table 4, we built models based on two different representations, in terms of the chemical descriptor matrices used as input variables: 1) the mixture-based representation (“MIX”), which employed for each substance a weighted average across its components; 2) the conventional representation (“MAIN”), which considered for each substance only the properties of its largest component. The obvious drawback of the “MAIN” representation was that it resulted in identical chemical descriptor vectors being used for different substances that shared the same main component. For example, different metal salts of the same organic acid were treated identically by this representation. A typical approach to addressing this would be to remove redundant data points and harmonize any discrepant activity values. However, in this study, we left redundancies in place. This could bias the accuracy of “MAIN”-based models in either direction (i.e., higher or lower than the real accuracy, depending on whether redundant data points had mostly matching or contradictory activity labels), but we considered it more important to ensure modeling comparisons on the same data set.

Table 5 summarizes the modeling validation results for both dosing thresholds (100% “High” and 10% “Low”) and both representation approaches (“MIX” and “MAIN”) for the two hazard classification systems (EPA and GHS) and three ocular toxicity endpoints (“CORR”, “IRR”, and “ANY”). Briefly, predictions of the individual models (RF, XT, SVM, GLM) from five repetitions were combined, and the resulting consensus performance was calculated.

As shown in Table 5, both the “MAIN” and “MIX” model types exhibited better balanced accuracy and ROC AUC than would be expected from random chance, which would yield a value of 0.5 for both statistics. All the low-dose threshold models showed higher validation accuracy (around 0.9) than that of their respective high-dose counterparts (around 0.7 – 0.8). This is likely overly optimistic, as these datasets have a very limited pool of available active substances (30–40 on average, Table 4), some of which were easy to generalize (e.g., inorganic acids or bases).

Interestingly, regardless of the representation used, the EPA models exhibited higher modeling accuracy for the “EPA_IRR” activity endpoint. This may indicate a stronger mechanistic relationship for activity binning schemes that combine corrosives and irritants (US EPA Categories I-II). For the GHS datasets, validation results for all three endpoints were quite similar. Comparing respective high-dose models between the two hazard classification systems (i.e., EPA and GHS), performance was similar for the corrosives (“CORR”) and irritants (“IRR”) endpoints, but for all irritants, including weak ones (“ANY”), the GHS models were more accurate by 2–10%.

We also observed that, in general, models based on the mixture-based representation of substances (Table 5, “MIX”) slightly outperformed their corresponding conventional QSAR models, but these differences cannot be deemed significant. This small boost is likely due to the more accurate representation that the mixture-based approach provides for multi-component substances, which constitute around 20% of OCUTOXDB. However, additional computational studies on larger datasets would be needed to confirm this trend with confidence. The modeling datasets employed in this study are limited in size, especially those for the 10% dose threshold, because the in vivo method is typically conducted with undiluted chemicals. Consequently, observations based on comparative modeling performance will need to be supported by additional validation studies or by remodeling on expanded data sets containing greater numbers of curated ocular toxicity data points.

We assessed the performance of our ocular toxicity models on an additional large external set of 673 substances from European Chemicals Agency (ECHA) dossiers (see Methods 2.5). We applied the high-dose threshold models (on the assumption that the ECHA data were mostly for pure substances) and calculated their consensus for the CORR, IRR, and ANY activity endpoints per hazard classification system. Consensus scores of 0.55 and greater were binarized as active calls, consensus scores of 0.45 and lower as inactive calls, and borderline predictions (0.45 – 0.55) were considered out of coverage. Table 6 shows a summary of the ECHA test set results, while detailed predictions and applicability domain Z-scores are given in Supplemental Table 9.
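The binarization rule amounts to a three-way decision; a minimal sketch with assumed helper names:

```python
# Sketch of the consensus-score binarization rule (thresholds as stated
# in the text; helper names are illustrative).

def binarize(score, lo=0.45, hi=0.55):
    if score >= hi:
        return 1      # active call
    if score <= lo:
        return 0      # inactive call
    return None       # borderline (0.45-0.55): out of coverage

def coverage(scores):
    calls = [binarize(s) for s in scores]
    return sum(c is not None for c in calls) / len(calls)
```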

Table 6.

ECHA test set (n=673) predictions by high-dose ocular toxicity models.

Activity type   Model type   Sensitivity   Specificity   Balanced accuracy   PPV a   NPV    Coverage
EPA_CORR        MAIN         0.68          0.65          0.67                0.26    0.92   0.89
EPA_CORR        MIX          0.74          0.66          0.70                0.28    0.93   0.90
EPA_IRR         MAIN         0.74          0.60          0.67                0.32    0.90   0.86
EPA_IRR         MIX          0.68          0.67          0.68                0.33    0.90   0.88
EPA_ANY         MAIN         0.83          0.51          0.67                0.37    0.90   0.83
EPA_ANY         MIX          0.73          0.62          0.68                0.37    0.88   0.78
GHS_CORR        MAIN         0.51          0.76          0.64                0.27    0.90   0.86
GHS_CORR        MIX          0.67          0.75          0.71                0.32    0.93   0.85
GHS_IRR         MAIN         0.70          0.66          0.68                0.34    0.90   0.89
GHS_IRR         MIX          0.74          0.66          0.70                0.35    0.91   0.79
GHS_ANY         MAIN         0.72          0.68          0.70                0.39    0.89   0.82
GHS_ANY         MIX          0.62          0.70          0.66                0.38    0.86   0.83
a PPV: positive predictive value, the fraction of correctly predicted positives; NPV: negative predictive value, the fraction of correctly predicted negatives; Coverage: the fraction of substances that received a prediction.

In general, the sensitivity of the high-dose models on the ECHA test set is similar to that seen in cross-validation. However, the specificity is notably lower (by 5–10%), even for the GHS models, which should be directly comparable because they share ECHA’s underlying category definitions. This could be due in part to data noise in the ECHA test set, as well as to learning limitations of our modeling sets, which are about 1.5 times smaller than the ECHA test set. The two other reported metrics (PPV and NPV, Table 6) indicate relatively high confidence in negative predictions (only 5–15% false negatives), while positive predictions have a high fraction of false positives (up to 74% for the corrosive category). This overprediction could be due in part to outdated toxicity records and/or missing test dose information (e.g., irritant substances can test as negative at very low doses). At the same time, mixture-based models generally performed slightly better than conventional models (by 1–8% balanced accuracy), especially for the corrosive category (“GHS_CORR” and “EPA_CORR” models). Although this test set is based on the GHS ocular hazard classification system, the results nevertheless indicate good reliability of negative predictions from all the models developed in this study. Because a high false positive rate was found for this test set, positive predictions may need further investigation, especially with regard to the test dose used.
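The performance metrics reported in Table 6 can be computed from per-substance truth labels and binarized model calls with a sketch like the one below; the function and toy data are hypothetical, with `None` standing for an out-of-coverage prediction.

```python
# Sketch of the Table 6 metrics: out-of-coverage predictions (None) are
# excluded from the confusion matrix but counted against coverage.

def summarize(truth, pred):
    pairs = [(t, p) for t, p in zip(truth, pred) if p is not None]
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": (sens + spec) / 2,
        "PPV": tp / (tp + fp),  # fraction of positive calls that are correct
        "NPV": tn / (tn + fn),  # fraction of negative calls that are correct
        "coverage": len(pairs) / len(truth),
    }

# Toy example: 5 substances, one of which is out of coverage.
stats = summarize([True, True, False, False, False],
                  [True, False, False, True, None])
print(stats["coverage"])  # -> 0.8
```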

Chemical descriptors found important by each model (see Supplemental Table 8) generally confirmed that features related to acidity and basicity were quite prominent (e.g., pH, nAcids, nBases, OH-related descriptors). We therefore examined the pH descriptor value distribution across active and inactive chemical substances for the EPA datasets employed in this study (Table 7). The average deviation from neutral pH for inactive compounds was much smaller than for active compounds. For instance, inactive substances in the EPA_ANY sets had mean pH values closest to neutral, which is consistent with this endpoint’s stricter definition of inactivity, which includes only substances in EPA Category IV. Conversely, active substances had mean pH values that were, on average, further from neutral. These results suggest that many of these chemicals may damage ocular tissues at least in part via direct acidic or basic action, which is consistent with Guidance Document 263, which states that substances with pH extremes (i.e., ≤2 or ≥11.5) may be assumed to be corrosive and need no additional animal testing17. Similar results were observed for the GHS classification system.

Table 7.

Absolute difference from neutral pH (7.0) for active and inactive substances in EPA ocular toxicity modeling datasets.

Dose threshold   Activity type   Model type   Inactive mean   Inactive SD   Active mean   Active SD
H (High, 100%)   EPA_CORR        MAIN         0.88            1.71          2.90          2.50
H (High, 100%)   EPA_CORR        MIX          0.56            1.08          2.35          2.36
H (High, 100%)   EPA_IRR         MAIN         0.76            1.56          2.65          2.49
H (High, 100%)   EPA_IRR         MIX          0.50            1.01          2.11          2.30
H (High, 100%)   EPA_ANY         MAIN         0.54            1.26          2.05          2.38
H (High, 100%)   EPA_ANY         MIX          0.43            0.97          1.59          2.08
L (Low, 10%)     EPA_CORR        MAIN         1.28            1.64          4.97          2.60
L (Low, 10%)     EPA_CORR        MIX          1.13            1.49          3.07          2.96
L (Low, 10%)     EPA_IRR         MAIN         1.15            1.57          4.77          2.66
L (Low, 10%)     EPA_IRR         MIX          1.05            1.43          2.94          2.88
L (Low, 10%)     EPA_ANY         MAIN         0.60            1.29          4.33          2.74
L (Low, 10%)     EPA_ANY         MIX          0.49            1.04          2.91          2.81
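The summary statistic tabulated in Table 7 (mean and standard deviation of the absolute deviation of calculated pH from 7.0, per activity class) can be sketched as below; the pH values shown are hypothetical, not drawn from the modeling datasets.

```python
# Sketch of the Table 7 metric: |calculated pH - 7.0|, summarized as
# (mean, SD) per activity class. Input pH values are illustrative only.
from statistics import mean, stdev

def ph_deviation_summary(ph_by_class):
    out = {}
    for label, ph_values in ph_by_class.items():
        devs = [abs(ph - 7.0) for ph in ph_values]
        out[label] = (mean(devs), stdev(devs))
    return out

# Hypothetical calculated pH values for a handful of substances.
summary = ph_deviation_summary({
    "inactive": [6.5, 7.2, 7.8, 6.9],    # near-neutral
    "active":   [2.0, 12.1, 1.5, 11.0],  # strongly acidic or basic
})
print(summary["inactive"][0])  # mean deviation for inactives, ~0.4 pH units
```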

Figure 2 illustrates the distribution of pH values for the EPA_IRR data sets in more detail. Active substances (red) cluster further away from neutral pH, while inactive substances (blue) tend to accumulate near the middle of the pH range.

Figure 2.

Distribution of calculated pH values among active (red) and inactive (blue) substances in EPA_IRR_H_MIX (A) and EPA_IRR_H_MAIN (B) data sets.

We also noted that the mixture-based approach mostly affects the calculated pH distribution of inactive substances (Table 7 and Figure 2A vs 2B). The calculated pH values of inactive substances cluster more toward neutral pH and are less spread out than in the conventional (“MAIN”) approach. At the same time, the distribution of active substances toward more acidic and basic pH values was similar for both approaches. Thus, the mixture-based approach better reflected the interaction of the components of a complex substance, such as the tendency of a salt to have a more neutral pH than its constituents. A conventional modeling approach based on the main chemical component of a mixture would likely mispredict such substances as false positives.

4. CONCLUSIONS

We developed a set of in silico models for three different EPA and GHS hazard classification endpoints, at 100% and 10% test concentration thresholds, for the chemical substances in the NICEATM OCUTOXDB database, many of which are mixtures. Conventional models that were based on chemical structure of the largest component of the test substance achieved validated balanced accuracies in the ranges of 68–80% for the 100% dose threshold and 87–96% for the 10% dose threshold. Comparatively, the mixture-based models, which account for all components in the substance by weighted feature averaging, showed similar or slightly higher accuracies of 72–79% and 89–94% for the 100% and 10% dose thresholds, respectively. Evaluation on the ECHA external test set exhibited balanced accuracies of 64–71% for the high-dose models.

Most of our models identified the calculated pH descriptor as an important feature. This pH dependency can be especially important for test substances and mixtures with multiple chemical components that affect the overall pH of the mixture in opposite ways (e.g., soaps).

To our knowledge, this study represents the first attempt to build QSAR models of ocular irritation that take dose into account. Additionally, these models are an attempt to computationally characterize complex chemical substances such as mixtures and salts, which are typically excluded entirely from computational analysis (as in Wang et al.8).

We believe the main limitation of the current study is the relatively small size of the modeling datasets. We therefore restricted our modeling attempts to binary classification endpoints and two concentration thresholds, as there were not enough data to support stricter thresholds (such as a 1% test concentration threshold) or to obtain reliable models for each of the four EPA or GHS hazard categories separately. Much more data may become available in the future9 if public data sources are curated using EPA-compatible guidelines. Given such data, we would expect the performance of in silico models based on larger datasets to converge to a level of accuracy limited only by experimental errors and remaining discordances in the underlying training data.

The simplest and most intuitive way to apply the developed models for hazard classification would be a tiered prediction strategy. An unknown substance is first evaluated by the most general endpoint model (e.g., “EPA_ANY”); if a positive prediction is obtained, it is then evaluated by a more specific model (e.g., “EPA_IRR”); and if still positive, it is further evaluated by the most specific model (e.g., “EPA_CORR”). This approach provides an unambiguous assignment of a hazard category, but in the case of potentially contradicting predictions (if all three models are run jointly) it defers to the more generic endpoints, which may not be the most accurate.
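The tiered strategy above can be sketched as a simple cascade. The model callables here are hypothetical stand-ins for the trained classifiers, each returning an active/inactive flag; the category mapping follows the endpoint definitions used in this study (ANY separates Category IV, IRR covers Categories I–II, CORR covers Category I).

```python
# Sketch of the tiered prediction strategy: evaluate models from most
# general (ANY) to most specific (CORR); the first negative call fixes
# the hazard category. The predict_* callables are hypothetical.

def tiered_category(substance, predict_any, predict_irr, predict_corr):
    if not predict_any(substance):
        return "Category IV"   # not an irritant
    if not predict_irr(substance):
        return "Category III"  # weak irritant only
    if not predict_corr(substance):
        return "Category II"   # irritant, not corrosive
    return "Category I"        # corrosive

# Toy usage: a substance flagged active by ANY and IRR but not by CORR.
call = tiered_category("substance_X",
                       predict_any=lambda s: True,
                       predict_irr=lambda s: True,
                       predict_corr=lambda s: False)
print(call)  # -> Category II
```

Because each tier is consulted only after the previous one returns positive, contradictory joint predictions (e.g., CORR positive but ANY negative) cannot occur in this scheme.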

One likely valuable application of these in silico models is predicting ocular irritation hazard for a wide variety of chemical structures and their mixtures. In particular, several of the EPA and GHS classification models presented here distinguish non-irritants with high accuracy and negative predictive value (>90%) and could therefore be used in a bottom-up testing approach. Additionally, these models could potentially be combined with other nonanimal methods in a defined approach to fully replace in vivo testing for ocular irritation.

Supplementary Material

Supplemental info and tables

SUPPORTING INFORMATION: Curated ocular toxicity data and modeling datasets used in this study, as well as additional details on the statistical analysis of data and models. Data files, R objects of models, and auxiliary files can also be obtained from https://github.com/NIEHS/OcularToxicityQSAR.git.

FUNDING

The project was funded in part with federal funds from the National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH) under Contract No. HHSN273201500010C.

REFERENCES

(1) Prior H; Casey W; Kimber I; Whelan M; Sewell F. Reflections on the Progress towards Non-Animal Methods for Acute Toxicity Testing of Chemicals. Regul. Toxicol. Pharmacol. 2019, 102, 30–33. 10.1016/j.yrtph.2018.12.008
(2) Wilhelmus KR. The Draize Eye Test. Surv. Ophthalmol. 2001, 493–515. 10.1016/S0039-6257(01)00211-9
(3) Verstraelen S; Jacobs A; De Wever B; Vanparys P. Improvement of the Bovine Corneal Opacity and Permeability (BCOP) Assay as an in Vitro Alternative to the Draize Rabbit Eye Irritation Test. Toxicol. Vitr. 2013, 27(4), 1298–1311. 10.1016/j.tiv.2013.02.018
(4) Oliveira GAR; Ducas RN; Teixeira GC; Batista AC; Oliveira DP; Valadares MC. Short Time Exposure (STE) Test in Conjunction with Bovine Corneal Opacity and Permeability (BCOP) Assay Including Histopathology to Evaluate Correspondence with the Globally Harmonized System (GHS) Eye Irritation Classification of Textile Dyes. Toxicol. Vitr. 2015, 29(6), 1283–1288. 10.1016/j.tiv.2015.05.007
(5) Regulation (EC) No 1223/2009; https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32009R1223 (accessed Dec 12, 2019).
(6) Asturiol D; Casati S; Worth A. Consensus of Classification Trees for Skin Sensitisation Hazard Prediction. Toxicol. Vitr. 2016, 36, 197–209. 10.1016/j.tiv.2016.07.014
(7) Fourches D; Muratov E; Tropsha A. Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model. 2010, 50(7), 1189–1203. 10.1021/ci100176x
(8) Wang Q; Li X; Yang H; Cai Y; Wang Y; Wang Z; Li W; Tang Y; Liu G. In Silico Prediction of Serious Eye Irritation or Corrosion Potential of Chemicals. RSC Adv. 2017, 7(11), 6697–6703. 10.1039/c6ra25267b
(9) Luechtefeld T; Marsh D; Rowlands C; Hartung T. Machine Learning of Toxicological Big Data Enables Read-across Structure Activity Relationships (RASAR) Outperforming Animal Test Reproducibility. Toxicol. Sci. 2018, 165(1), 198–212. 10.1093/toxsci/kfy152
(10) Chatterjee M; Roy K. Prediction of Aquatic Toxicity of Chemical Mixtures by the QSAR Approach Using 2D Structural Descriptors. J. Hazard. Mater. 2021, 408, 124936. 10.1016/j.jhazmat.2020.124936
(11) Qin L-T; Chen Y-H; Zhang X; Mo L-Y; Zeng H-H; Liang Y-P. QSAR Prediction of Additive and Non-Additive Mixture Toxicities of Antibiotics and Pesticide. Chemosphere 2018, 198, 122–129. 10.1016/j.chemosphere.2018.01.142
(12) Muratov E; Varlamova E; Artemenko A; Polishchuk P; Kuz’min V. Existing and Developing Approaches for QSAR Analysis of Mixtures. Mol. Inform. 2012, 31(3–4), 202–221. 10.1002/minf.201100129
(13) Moriwaki H; Tian YS; Kawashita N; Takagi T. Mordred: A Molecular Descriptor Calculator. J. Cheminform. 2018, 10(1). 10.1186/s13321-018-0258-y
(14) Yang C; Tarkhov A; Marusczyk J; Bienfait B; Gasteiger J; Kleinoeder T; Magdziarz T; Sacher O; Schwab CH; Schwoebel J; et al. New Publicly Available Chemical Query Language, CSRML, to Support Chemotype Representations for Application to Data Mining and Modeling. J. Chem. Inf. Model. 2015, 55(3), 510–528. 10.1021/ci500667v
(15) Wu Z; Zhu M; Kang Y; Leung EL-H; Lei T; Shen C; Jiang D; Wang Z; Cao D; Hou T. Do We Need Different Machine Learning Algorithms for QSAR Modeling? A Comprehensive Assessment of 16 Machine Learning Algorithms on 14 QSAR Data Sets. Brief. Bioinform. 2021, 22(4), bbaa321. 10.1093/bib/bbaa321
(16) Tropsha A; Golbraikh A. Predictive QSAR Modeling Workflow, Model Applicability Domains, and Virtual Screening. Curr. Pharm. Des. 2007, 13(34), 3494–3504. 10.2174/138161207782794257
(17) OECD. Guidance Document No. 263 on Integrated Approaches to Testing and Assessment (IATA) for Serious Eye Damage and Eye Irritation. Series on Testing and Assessment, No. 263. 10.1787/84b83321-en
