. 2024 Jan 24;58(7):3386–3398. doi: 10.1021/acs.est.3c05643

Chemical Space Covered by Applicability Domains of Quantitative Structure–Property Relationships and Semiempirical Relationships in Chemical Assessments

Zhizhen Zhang, Alessandro Sangion, Shenghong Wang, Todd Gouin §, Trevor Brown, Jon A Arnot ‡,∥, Li Li †,*
PMCID: PMC10882972  PMID: 38263624

Abstract

A significant number of chemicals registered in national and regional chemical inventories require assessments of their potential “hazard” concerns posed to humans and ecological receptors. This warrants knowledge of their partitioning and reactivity properties, which are often predicted by quantitative structure–property relationships (QSPRs) and other semiempirical relationships. It is imperative to evaluate the applicability domain (AD) of these tools to ensure their suitability for assessment purposes. Here, we investigate the extent to which the ADs of commonly used QSPRs and semiempirical relationships cover seven partitioning and reactivity properties of a chemical “space” comprising 81,000+ organic chemicals registered in regulatory and academic chemical inventories. Our findings show that around or more than half of the chemicals studied are covered by at least one of the commonly used QSPRs. The investigated QSPRs demonstrate adequate AD coverage for organochlorides and organobromines but limited AD coverage for chemicals containing fluorine and phosphorus. These QSPRs exhibit limited AD coverage for atmospheric reactivity, biodegradation, and octanol–air partitioning, particularly for ionizable organic chemicals compared to nonionizable ones, challenging assessments of environmental persistence, bioaccumulation capability, and long-range transport potential. We also find that a predictive tool’s AD coverage of chemicals depends on how the AD is defined, for example, by the distance of a predicted chemical from the centroid of the training chemicals or by the presence or absence of structural features.

Keywords: chemical inventories, chemical property, hazard, quantitative structure−property relationships (QSPRs), applicability domain, chemical assessment

Short abstract

Existing prediction techniques show the potential for predicting partitioning and reactivity properties for over half of known organic chemicals.

Introduction

We inhabit a chemically intensive world where the number of substances on the market is constantly growing. More than 350,000 chemicals and mixtures have been registered for production and use in national and regional chemical inventories.1 Nearly 20,000 chemicals have been measured and detected in environmental media.2 However, mounting evidence suggests that the production and use of many of these chemicals can pose adverse effects on both the environment and human health.3,4 A prominent example is chemicals that demonstrate “hazard” potentials, e.g., to resist biotic and abiotic breakdown (persistence or P), accumulate in biota (bioaccumulation or B), permeate natural barriers to contaminate surface and subsurface aquatic environments (mobility or M), and spread to remote areas worldwide (long-range environmental transport or LRTP).5,6

The sheer number of chemicals in commerce necessitates an efficient evaluation of their hazard potentials. This evaluation can be achieved through high-throughput assessments using either (i) a direct comparison with cutoff thresholds for key physicochemical properties or (ii) computational models that require the input of physicochemical property data. Table 1 illustrates examples of cutoff thresholds in partitioning and reactivity properties used frequently for hazard assessments. Partitioning and reactivity properties are also input parameters to multimedia fate and exposure models, e.g., European Union System for the Evaluation of Substances (EUSES),7 Risk Assessment IDentification And Ranking (RAIDAR),8,9 PROduction-To-EXposure (PROTEX),10,11 and the UNEP-SETAC toxicity (USEtox) model.12 These fate and exposure models evaluate the environmental concerns from a mechanistic multimedia perspective. Overall, these threshold-based and model-based approaches are cost-effective, providing the opportunity for high-throughput screening-level assessments for tens of thousands of chemicals, even before their production and commercialization, which helps guide chemical risk assessment and management as well as achieve safety and sustainability objectives.13–15

Table 1. Cutoff Thresholds of Partitioning and Reactivity Properties Commonly Used in the Assessment of Chemical Hazards.

| hazard criterion | chemical properties | example cutoff thresholds (not exhaustive) |
| --- | --- | --- |
| persistence | degradation half-life (HLdeg) | HLdeg > 40 days in freshwater or estuarine water, 60 days in marine water, 120 days in soil, 120 days in freshwater or estuarine sediment, or 180 days in marine sediment16 |
| bioaccumulation | octanol–water partition coefficient (KOW), octanol–air partition coefficient (KOA), and biotransformation half-life (HLB,animal or HLB,human) | KOW > 10⁵ for water-breathing animals;5 KOW > 10² and KOA > 10⁵ for air-breathing animals;17 HLB,human > 1200 h for humans18 (see note) |
| mobility | organic carbon-normalized sorption coefficient (KOC) | lowest KOC over the pH range of 4–9 < 10⁴ (ref (19)) |
| long-range atmospheric transport potential | rate constant for reaction with the hydroxyl radical (kOH) | kOH < 4.13 × 10⁻¹² cm³/molecule/s if assuming the global weighted-average lower atmospheric OH concentration of 970,000 molecules/cm³,20 and applying the cutoff of the atmospheric half-life of 2 days21 |

Note: Regulatory assessment frameworks have yet to incorporate prescribed cutoff values for biotransformation half-life. Nonetheless, scientific studies have demonstrated that a compound possessing a biotransformation half-life of fewer than 700 h in mammals and humans is unlikely to accumulate in the food chain containing a combination of aquatic and terrestrial organisms, regardless of its partitioning properties.22,23 A recent discussion paper proposes a threshold from 168 h (rats) to 1200 h (humans) for air-breathing organisms.18 The human threshold is used here for illustration.
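The kOH cutoff in Table 1 can be reproduced from the stated assumptions. The sketch below assumes pseudo-first-order loss by reaction with OH, so that the atmospheric half-life equals ln 2/(kOH × [OH]); the function names are illustrative and do not come from any of the cited tools.

```python
import math

SECONDS_PER_DAY = 86_400

def koh_cutoff(half_life_days, oh_conc):
    """kOH (cm^3/molecule/s) that yields the given atmospheric half-life,
    assuming pseudo-first-order kinetics: t_half = ln(2) / (kOH * [OH])."""
    return math.log(2) / (half_life_days * SECONDS_PER_DAY * oh_conc)

def atmospheric_half_life_days(k_oh, oh_conc):
    """Inverse relation: half-life in days for a given kOH and OH level."""
    return math.log(2) / (k_oh * oh_conc) / SECONDS_PER_DAY

# Values taken from Table 1: 2-day half-life, 970,000 molecules/cm^3 of OH.
cutoff = koh_cutoff(half_life_days=2.0, oh_conc=9.7e5)
print(f"kOH cutoff = {cutoff:.3e} cm^3/molecule/s")
```

The result is close to 4.135 × 10⁻¹² cm³/molecule/s, consistent with the 4.13 × 10⁻¹² value listed in Table 1; a chemical with a smaller kOH would have a half-life longer than 2 days under the same assumptions.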

Since property data are often insufficient for chemicals on the market, notably premanufactured chemicals or those in the design stage, hazard assessments often rely on model-predicted values. In general, there are two major prediction approaches:

  • (I)

    Quantitative structure–property relationships (QSPRs) correlating a chemical property of interest with variables describing chemical structure or molecule-level interactions. Examples of chemical structure variables include the “fragments and atoms” used by the Estimation Programs Interface (EPI) Suite,24 “fragments” used by the Iterative Fragment Selection-QSAR (IFS-QSAR),25–27 and the constitutional, topological, and geometrical “molecular descriptors” used by the OPEn structure–activity/property Relationship App (OPERA)28 and QSAR-INSubria (QSARINS).29,30 Examples of molecule-level interactions include Abraham solute descriptors reflecting van der Waals interactions and H-bond interactions, used frequently in polyparameter linear free energy relationships (pp-LFERs).31,32

  • (II)

    Semiempirical relationships correlating a chemical property of interest solely with other properties, without considering chemical structure or molecule-level interactions. A prominent example is the single-parameter linear free energy relationship (sp-LFER) such as the various forms of the Karickhoff equation correlating the organic carbon-normalized sorption coefficient (KOC) with KOW.33–37 Here, we refer to these relationships as “semiempirical” rather than “empirical” because the development of these relationships involves certain mechanistic considerations (e.g., the analogy between soil amorphous organic matter and octanol in the sorption of neutral chemicals), rather than being entirely data-driven.

The development of QSPRs and semiempirical relationships relies on statistical or mechanistic insights gained from chemicals with known information (especially experimentally derived data), known as the training set. These training chemicals define the applicability domain (AD) of a model. The AD refers to a theoretical space defined by relevant structural features, physicochemical descriptor values, or the range of prediction end points, in which the chemical of interest to the user is compliant with the model’s specifications.38–40 Statistically, if a chemical falls within the AD, the chemical is deemed to be “similar” to chemicals in the training set, and the model’s predictions are based on interpolation rather than extrapolation.41,42 The concept of “applicability” is grounded in the philosophy that QSPRs rely on the principle of “analogy”, which means that a QSPR is considered valid only “within a series of chemicals” whose properties are controlled and influenced by a shared set of “relevant and consistent” chemical descriptors.43 Since extrapolation is typically more error-prone than interpolation for a given training set size,44 predictions within AD are likely to be systematically closer to true values compared to those outside AD from a statistical standpoint. There is a broad consensus that if a prediction falls within AD, the use of this prediction approach is more “appropriate”,45 “suitable”,46 and more “acceptable”47 for regulatory purposes. For this reason, “a defined domain of applicability” becomes one of the important prerequisites for the regulatory use of chemical property prediction techniques.48

“Applicability” represents the first step in establishing confidence in predictions from a QSPR or semiempirical relationship, which, in collaboration with two further steps, “reliability” and “decidability”, collectively determines the trustworthiness of predictions.46 It should be emphasized that “applicability” mainly confirms whether the model should be used to make predictions for a specific chemical, but it does not guarantee the model’s “predictivity”, reflected by the certainty, fidelity, or accuracy of every single prediction. Sometimes, a chemical may appear to fall within the AD, but the model prediction may deviate substantially from its true value. Conversely, a chemical may seem to be outside the AD, but the prediction could still be trustworthy.38 According to the OECD Principles for the Validation, for Regulatory Purposes, of (Q)SAR Models, both “applicability” (evaluated based on AD) and “predictivity” (evaluated based on internal and external validations) are integral parts of the regulatory acceptance of QSPRs or semiempirical relationships. Moreover, in many cases, QSAR developers and users may need to consider a trade-off between the “breadth of applicability” and the “level of predictivity”, that is, “one would either aim to develop a model with broad applicability, sacrificing to some extent the level of predictivity or one would aim to develop a model with narrow applicability (for example, a specific class of chemicals) but with greater predictivity”.47

Although several reasons may contribute to a chemical being outside the ADs of prediction approaches, the insufficient representation of certain chemical categories in the training set is the most common reason. This is often attributed to a lack of reliable experimentally determined data for these specific categories. For instance, organofluoride and organosilicon compounds exceed the ADs of most prediction approaches due to a scarcity of experimentally determined data for training models. Incorporating these chemicals into training sets can expand the ADs of these models.49,50 In addition, whether a chemical falls within or outside the AD is also influenced by how the AD or the similarity between the chemical of interest and the training chemicals is defined and calculated. Currently, there is a lack of a universally accepted definition of the AD for QSPRs and semiempirical relationships.

Given that chemical assessments ultimately aim to inform and support decision-making regarding a large number of chemicals, two main concerns can arise in this process. The first is whether our current knowledge, especially when it comes to experimentally derived information, is sufficient for developing QSPRs and semiempirical relationships that have broad enough ADs to support regulatory assessments of a vast array of chemicals. Existing high-throughput chemical screening efforts have rarely considered whether predictions by employed QSPRs and semiempirical relationships fall within the AD.13,14,51 If the answer to the first concern is “no”, then the second concern is to identify the primary data gaps in cases in which these models are deemed inadequate. For example, nonionizable hydrophobic organochlorides and organobromines have long been the main focus of regulatory assessments, with the Stockholm Convention being one of the most prominent examples.5 However, it remains uncertain whether the ADs of existing QSPRs and semiempirical relationships are inclusive enough to encompass emerging categories of environmental chemicals, such as ionizable and water-soluble compounds (especially those with the mobility concern52,53), as well as organofluorine, organosilicon, and organophosphorus compounds. Furthermore, it remains unclear which specific chemical properties are the least covered by existing QSPRs and semiempirical relationships, making them potential weak points in the regulatory assessment of chemical substances.

To address these two issues, here, we conduct a comprehensive assessment of the properties of chemicals registered in different regional or national inventories in relation to the ADs of commonly used QSPRs and semiempirical relationships. This innovative analysis not only quantifies the AD coverage of these data prediction approaches but also reveals chemical categories that have received inadequate attention in these data prediction approaches. To this end, we analyze a comprehensive chemical data set (a chemical “space”) comprising over 81,000 discrete chemicals compiled and curated from several national and regional chemical inventories. Our analysis involves an examination of the distribution of key chemical properties, including partitioning and reactivity properties, and an assessment of the relationship between these properties and their corresponding ADs. The insights generated by our investigation are expected to be of significant value in identifying methodological gaps and informing the further development and application of property predictive tools and chemical assessment models, with the aim of evaluating and managing chemicals in a scientifically reliable and environmentally sustainable manner.

Methods and Data

Development of a Comprehensive Chemical Inventory for Analysis

To develop the chemical space, we collect 260,830 chemical records from the U.S. Toxic Substances Control Act (TSCA) Chemical Substance Inventory, the European Union Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) Registered Substances List, the Domestic Substances List (DSL) in Canada, the Inventory of Existing Chemical Substances in China (IECSC), and the Network of Reference Laboratories, Research Centers, and Related Organizations for Monitoring of Emerging Environmental Substances (NORMAN). After removing duplicates and retrieving structural information, we obtain 112,258 discrete chemicals with their Simplified Molecular Input Line Entry System (SMILES) strings54 and International Chemical Identifier (InChI) Keys. For details on the data processing and curation, see Supporting Information Text S1.

These 112,258 discrete chemicals include 3597 inorganics, 1234 metalorganics, 1780 polymers, 1 radical, and 24,368 chemicals with SMILES that cannot be recognized or processed by QSPRs (including permanently charged compounds; mixtures; unknown or variable composition, complex reaction products or biological materials known as UVCBs; SMILES strings exceeding 360 characters; and SMILES identified to be “incorrect” by QSPR toolkits). Since these chemicals are not supported by QSPRs, the exclusion of them leads to a final set of 81,278 chemicals that serve as the starting point for our assessment.
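As a quick consistency check on the curation numbers above, the following sketch subtracts the excluded categories from the 112,258 discrete chemicals. The dictionary labels are shorthand introduced here for illustration.

```python
discrete_chemicals = 112_258

# Categories excluded because QSPRs do not support them (counts from the text).
excluded = {
    "inorganics": 3_597,
    "metalorganics": 1_234,
    "polymers": 1_780,
    "radicals": 1,
    "SMILES not recognized or processed by QSPRs": 24_368,
}

final_set = discrete_chemicals - sum(excluded.values())
print(final_set)  # 81278
```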

Calculation of Chemical Properties

Figure 1 presents an overview of the relationship between the chemical properties of our interest and the QSPRs and semiempirical relationships used for predictions. We focus on seven chemical properties intensively used in chemical assessment: KOW (based on wet octanol), KOA, KOC (in L/kg), fish and human whole-body biotransformation half-lives (HLB,fish and HLB,human; in hours), the rate constant for reaction with the hydroxyl radical (kOH; in cm³/molecule/s), and biodegradation half-life (HLbiodeg; in days) (Figure 1). Here, the air–water partition coefficient (KAW) is not predicted since it can be calculated by the thermodynamic triangular relationship KAW = KOW (dry octanol)/KOA. To predict these seven chemical properties, we consider four QSPR toolkits (Figure 1), namely, OPERA (version 2.6),28 EPI Suite (version 4.11),24 IFS-QSAR (version 1.0.0),25 and QSARINS-Chem (standalone version, 2021),55 given that these four are freely accessible and widely used in scholarly and regulatory chemical assessment. Supporting Information Text S2 details the algorithms used by the four QSPR toolkits. We also consider different “class-specific” Karickhoff relationships for the “predominantly hydrophobic” (the term used by ref (37) to refer to nonionizable chemicals containing only carbon, hydrogen, and halogen atoms), “nonhydrophobic” (the term used by ref (37) to refer to nonionizable chemicals that cannot be classified as “predominantly hydrophobic chemicals”), anionic, and cationic chemicals (Figure 1).
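In log units, the thermodynamic triangle mentioned above becomes a simple subtraction, log KAW = log KOW (dry octanol) − log KOA. The sketch below illustrates this with hypothetical input values; it does not convert between wet-octanol and dry-octanol KOW, which would need a separate correction not described in this section.

```python
def log_kaw(log_kow_dry, log_koa):
    """log10 KAW from the thermodynamic triangle KAW = KOW(dry octanol) / KOA."""
    return log_kow_dry - log_koa

# Hypothetical chemical: log KOW (dry octanol) = 5.0, log KOA = 8.0
print(log_kaw(5.0, 8.0))  # -3.0
```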

Figure 1.

Overview of the chemical properties investigated in this work, QSPR toolkits and empirical correlations used for predictions, and the definitions of applicability domains (AD) for the predictions (“L” = the leverage approach; “D” = the denylist approach). Most of the investigated chemical properties are predicted from variables describing the chemical structure or molecule-level interactions (fragments, atoms, PaDEL descriptors, etc.). In addition, the IFS-QSAR predicted Abraham solute descriptors are fed into pp-LFER correlations (eqs 1 to 3; with the system constants being recalibrated based on training sets provided in refs (31,56–58)) for predictions of KOW, KOA, and KOC; the EPI Suite-predicted rapid/ready biodegradability probabilities or ratings are fed into the Arnot relationships (eqs 4 to 7)59 for predictions of HLbiodeg; the OPERA-predicted KOW (for neutral species) is fed into four “class-specific” Karickhoff relationships36,37 for predictions of KOC (eqs 8 to 11, where φn indicates the fraction of neutral species). Here, the OPERA KOW predictions are used because OPERA’s AD exhibits a wider coverage of the investigated chemicals than other investigated QSPRs (for details, see Figure 2o).

Characterization of the Applicability Domains

While various AD definitions exist in the literature, our investigated QSPRs typically build on two approaches to establish their ADs.

  • (I)

    A “leverage” approach (marked “(L)” in Figure 1) measures the “leverage” of a query chemical, i.e., its Mahalanobis distance from the centroid of the training chemicals in a multidimensional space defined by the variables statistically selected for making predictions.60,61

    Specifically, for a QSPR developed based on a training set composed of n chemicals and m selected variables (fragments, atoms, PaDEL descriptors, Abraham solute descriptors, etc.), a design matrix X can be defined by arranging each training chemical as a row and each selected variable (for linear regression models such as pp-LFER, an additional constant of 1 needs to be included to reflect the regression intercept; same below) as a column. A query chemical can be represented by a column vector x composed of values of the selected variables, based on which a “leverage” (h) can be calculated using the inverse and transpose operations of X and x as follows
    $$h = x^{\mathsf{T}} \left( X^{\mathsf{T}} X \right)^{-1} x$$

    A query chemical is deemed to be “out of domain” if its leverage is greater than a threshold distance (or a “warning leverage”), typically 3 × p/n (where p = m + 1 for linear regression models or p = m for others).62–64 If the leverage is directly provided by a QSPR toolkit, e.g., IFS-QSAR,25–27 QSARINS-Chem,29,30 and OPERA,40,65 we adopt the given leverage values without any modifications. However, in cases where the leverage is not explicitly calculated by the QSPR toolkit, such as in Karickhoff relationships and pp-LFER correlations, we calculate the leverage ourselves based on the training chemicals used to develop these correlations. SI Text S3 provides a detailed, step-by-step illustration on how to calculate leverage, exemplified through pp-LFER prediction of KOW for a case chemical named polychlorinated biphenyl (PCB)-153.

  • (II)

    A “denylist” approach (marked “(D)” in Figure 1) uses a series of predefined bright-line rules to exclude chemicals that violate one or more of them from the AD. For example, EPI Suite documentation provides general guidance to identify “less reliable predictions”, including chemicals that (i) exceed the defined boundaries of the property value of a training set (termed as the “bounding box” approach40), (ii) fall outside the molar mass range of the training chemicals, (iii) have more instances of a given structural feature than the highest count among all of the training chemicals, or (iv) contain structural features that are not represented in the training set. We have developed an “in-house” algorithm, as described in Li et al.,66 to evaluate whether a prediction meets one or more rules outlined above and hence should be excluded from the AD. SI Text S4 illustrates how these bright-line rules are applied to evaluate whether PCB-153 falls within the AD of EPI Suite’s KOW module. IFS-QSAR also flags chemicals that (i) exceed the defined boundaries of the property value of a training set, (ii) contain uncalibrated atom types, or (iii) contain structural features that do not overlap with fragments derived from the training set (for Abraham solute descriptor L and biotransformation half-lives only; for detailed information, see Supporting Information Table S2).
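The two AD approaches described above can be sketched as follows. This is a minimal illustration, not the code of any toolkit or of the in-house algorithm: the leverage is computed as x-transpose times the inverse of (X-transpose X) times x and compared with the warning leverage 3 × p/n, while the denylist function applies simplified versions of rules (ii) to (iv) only. All function names, feature names, and the toy training data are assumptions made for this example.

```python
import numpy as np

def leverage(X_train, x_query, intercept=True):
    """Leverage h = x^T (X^T X)^-1 x of one query chemical."""
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_query, dtype=float)
    if intercept:
        # Linear regression models carry an extra constant of 1 (intercept).
        X = np.column_stack([np.ones(X.shape[0]), X])
        x = np.concatenate(([1.0], x))
    # pinv is used here so a near-singular X^T X does not raise an error.
    return float(x @ np.linalg.pinv(X.T @ X) @ x)

def warning_leverage(n, m, intercept=True):
    """Threshold 3 * p / n, with p = m + 1 for linear regression models."""
    p = m + 1 if intercept else m
    return 3.0 * p / n

def in_leverage_ad(X_train, x_query, intercept=True):
    """True unless the leverage is greater than the warning leverage."""
    n, m = np.asarray(X_train).shape
    return leverage(X_train, x_query, intercept) <= warning_leverage(n, m, intercept)

def denylist_flags(feature_counts, max_training_counts, molar_mass, mass_range):
    """Return the bright-line rules violated by a query chemical (simplified)."""
    flags = []
    low, high = mass_range
    if not low <= molar_mass <= high:
        flags.append("molar mass outside training range")
    for feature, count in feature_counts.items():
        if feature not in max_training_counts:
            flags.append(f"{feature} not represented in training set")
        elif count > max_training_counts[feature]:
            flags.append(f"more instances of {feature} than any training chemical")
    return flags

# Toy training set: 50 hypothetical chemicals described by 3 variables.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
centroid = X_train.mean(axis=0)
far_away = centroid + 10.0

print(in_leverage_ad(X_train, centroid))  # True
print(in_leverage_ad(X_train, far_away))  # False
print(denylist_flags({"frag_A": 6, "frag_B": 1}, {"frag_A": 5}, 360.0, (18.0, 700.0)))
```

With an intercept, a query located exactly at the centroid of the training set has the minimum possible leverage of 1/n, which is why it falls inside the AD here; the denylist call flags both the excess count of the hypothetical `frag_A` and the unseen `frag_B`.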

We characterize AD using the following two principles:

  • (I)

    When a chemical property is directly predicted from a chemical’s structure (e.g., fragments and atoms and compositional, structural, and topological features; obtaining such quantities does not rely on training sets), we use the AD definition as provided in each QSPR toolkit, without making any modifications. Specifically, QSARINS-Chem, OPERA, and IFS-QSAR define ADs based on the leverage approach, whereas EPI Suite’s AD definition uses the denylist approach (see Figure 1); for IFS-QSAR predictions, we additionally output AD information based on the denylist approach, which allows us to explore the impact of the selected definition on the AD’s coverage.

  • (II)

    When a chemical property is predicted through a correlation with one or more quantities predicted by other QSPRs (e.g., Abraham solute descriptors, the KOW of neutral species, probabilities or ratings of rapid/ready biodegradation; obtaining such quantities relies on training sets), the overall AD is defined as the “intersection” between (i) the AD(s) of the QSPR(s) that generate these input parameters and (ii) the AD of the correlation itself. In other words, a chemical is within the AD only when both conditions are simultaneously met. Examples of such a situation include the derivation of KOW, KOA, HLbiodeg, and KOC using eqs 1 to 11 (Figure 1).

For principle II, we ensure that both the QSPR’s AD and the correlation’s AD are defined consistently, using either the “leverage” approach for both or the “denylist” approach for both. For example, since the ADs of IFS-QSAR-predicted Abraham solute descriptors and OPERA-predicted KOW are defined using the “leverage” approach in respective QSPR toolkits, we define the ADs of the pp-LFER correlations (eqs 1 to 3)44 and the Karickhoff correlations (eqs 8 to 11) using the “leverage” approach as well (Figure 1). Likewise, since the ADs of EPI Suite-predicted rapid/ready biodegradability probabilities and ratings are defined using the “denylist” approach in EPI Suite, we also define the HLbiodeg correlations (eqs 4 to 7) using the “denylist” approach (Figure 1).
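Principle II amounts to a logical AND over the AD flags of every input quantity and of the correlation itself. A minimal sketch, with a hypothetical function name and hypothetical flags for one chemical:

```python
from typing import Mapping

def overall_in_ad(input_in_ad: Mapping[str, bool], correlation_in_ad: bool) -> bool:
    """A chemical is in the overall AD only if every input quantity is within
    the AD of the QSPR that produced it AND the correlation's own AD is met."""
    return all(input_in_ad.values()) and correlation_in_ad

# Hypothetical flags for Abraham solute descriptors of one chemical;
# descriptor B is assumed to fall outside its QSPR's AD.
descriptor_flags = {"S": True, "A": True, "B": False, "L": True}
print(overall_in_ad(descriptor_flags, correlation_in_ad=True))  # False
```

Because a single out-of-AD input is enough to exclude a chemical, chaining several predictions in series can only keep or shrink the overall coverage, never enlarge it.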

Results and Discussion

Overview of the Distribution of Chemical Properties in the Chemical “Space”

Among the 81,278 discrete chemicals with available identification and structural information, OPERA predicted that 23,916 (29%) are nonionizable, whereas the rest includes 31,874 (39%) anionic chemicals (chemicals with a predicted “acidic pKa”), 22,174 (27%) cationic chemicals (chemicals with a predicted “basic pKa”), 3,236 (4%) zwitterions (chemicals with both predicted “acidic pKa” and “basic pKa”), and 78 chemicals with the ionization status to be “ionizable” but without pKa predictions. Supporting Information Figure S1 shows that 71.3% of these anionic chemicals, 83.1% of these cationic chemicals, and 75.8% of the zwitterions fall within the AD of the OPERA algorithm. Among the anionic and cationic chemicals, 29,528 (93%) and 19,897 (90%), respectively, are predicted to possess a pKa between 2 and 12, for which dissociation can be relevant in the water environment and biota.67 Our finding that the investigated chemicals are dominated by ionizable chemicals agrees with an earlier finding68 that ionizable chemicals account for 49% of a random sample of 1500+ chemicals preregistered under REACH. Given the different partitioning and sorption behaviors between ionizable and nonionizable chemicals,67,69,70 such a large portion of ionizable chemicals underscores the necessity and importance of better characterizing ionizable chemicals in chemical assessments.
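The class counts and percentages quoted above can be cross-checked with a few lines of arithmetic. The dictionary labels below are shorthand introduced for this sketch.

```python
total = 81_278

ionization_classes = {
    "nonionizable": 23_916,
    "anionic": 31_874,
    "cationic": 22_174,
    "zwitterionic": 3_236,
    "ionizable, no pKa prediction": 78,
}

# The five classes should account for every investigated chemical.
assert sum(ionization_classes.values()) == total

for name, count in ionization_classes.items():
    print(f"{name}: {count} ({100 * count / total:.0f}%)")

# Share of all ionizable chemicals, and of those with pKa between 2 and 12.
ionizable_share = 100 * (total - ionization_classes["nonionizable"]) / total
anionic_pka_share = 100 * 29_528 / ionization_classes["anionic"]
cationic_pka_share = 100 * 19_897 / ionization_classes["cationic"]
print(f"ionizable overall: {ionizable_share:.0f}%")
```

Under these counts, ionizable chemicals make up about 71% of the chemical space, which is the basis for the statement that the investigated chemicals are dominated by ionizable chemicals.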

Figure 2a–g displays the variabilities in predictions (consensus values of predictions from multiple QSPRs) of partitioning and reactivity properties. Partitioning properties are found to have wider variations in their predictions, in comparison to reactivity properties (less than 2 orders of magnitude). The property KOA displays the highest variability, spanning almost 20 orders of magnitude among the investigated chemicals.

Figure 2.

Distribution of partitioning and reactivity properties of 81,278 investigated chemicals: (a–g) Distribution of “consensus values” (geometric means) of predictions from multiple QSPRs, regardless of whether the prediction falls within or outside the AD of a QSPR (a previous study66 shows that both in-AD and out-AD predictions need to be considered in the calculation of “consensus values”), within the AD of at least one QSPR (“in-AD”) and outside the ADs of all QSPRs (“out-AD”), respectively. White lines indicate the medians of “in-AD” predictions and “out-AD” predictions. (h–n) Distribution of predictions from individual QSPRs (represented by different colors), within (“in-AD”) and outside (“out-AD”) the AD of corresponding QSPR. Numbers indicate the percentages of predictions in the ADs of individual QSPRs. Percentages of “in-AD” predictions and “out-AD” predictions may not add up to 100% for reactivity properties, as some of the investigated chemicals do not have predictions available from the QSPRs used here. The density distributions are generated with kernel density estimation. (o) Numbers of investigated chemicals falling within the ADs of individual QSPRs or correlations.

Figure 2a–g indicates that when evaluated using the consensus values of predictions from multiple QSPRs, the investigated chemicals generally have low hydrophobicity (evidenced by a median log KOW of 2.4), low volatility (evidenced by a median log KOA of 9.2), and low persistence in vertebrate species (evidenced by median biotransformation half-lives of 5.1 and 4.5 h in fish and humans, respectively) and water (a median biodegradation half-life shorter than 50 days).
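The caption of Figure 2 defines the consensus value as the geometric mean of predictions from multiple QSPRs. A minimal sketch of that calculation follows; the function names and the example predictions are hypothetical, and missing predictions are represented by `None`.

```python
import math
from statistics import geometric_mean

def consensus(predictions):
    """Geometric mean of the available (positive) predictions, or None."""
    available = [p for p in predictions if p is not None]
    return geometric_mean(available) if available else None

def consensus_log10(log10_predictions):
    """Same consensus for log10-scale properties: the geometric mean of the
    raw values equals the arithmetic mean of their logarithms."""
    available = [v for v in log10_predictions if v is not None]
    return sum(available) / len(available) if available else None

# Hypothetical half-life predictions (h) from three tools, one unavailable.
half_life = consensus([2.0, 8.0, None])
# Hypothetical log KOW predictions from three tools.
log_kow = consensus_log10([2.1, 2.4, 2.7])
print(half_life, log_kow)
```

Averaging in log space keeps a single prediction that differs by orders of magnitude from dominating the consensus, which suits properties such as KOA that span many orders of magnitude.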

Coverage of Different Definitions of the Applicability Domain

It is imperative to note that QSARINS-Chem, OPERA, and IFS-QSAR define ADs based on the leverage approach, whereas EPI Suite’s AD definition uses the denylist approach. Such inconsistency is a result of the absence of a universal approach to establish the ADs of QSPRs.40,43 Therefore, understanding the extent of deviation between these two approaches is an intriguing aspect of QSPR analysis. Fortunately, IFS-QSAR also outputs AD information by using the denylist approach, making it possible and reasonable to perform a comparison between the two approaches. Supporting Information Table S1 compares the number of chemicals identified as being outside the AD of IFS-QSAR when AD is defined using the leverage and denylist approaches. The comparison indicates that the leverage approach excludes more chemicals from the AD for (i) KOW, KOA, and KOC that are calculated from Abraham solute descriptors (19,039–28,287 chemicals based on the leverage approach vs 672–2834 chemicals based on the denylist approach) and (ii) HLB,human (7002 chemicals based on the leverage approach vs 3427 chemicals based on the denylist approach). By contrast, the denylist approach excludes more chemicals from the AD for HLB,fish (8294 chemicals based on the leverage approach vs 22,233 chemicals based on the denylist approach). This contrast indicates that there is no general conclusion as to which approach is more stringent and hence defines a narrower coverage of the chemical space. As a result, any comparisons between the coverage of EPI Suite and other QSPR toolkits should be approached with caution.

Coverage of the Applicability Domains of Commonly Used QSPR Toolkits

Figure 2 shows the extent to which the ADs of QSPRs in four QSPR toolkits cover the investigated chemicals. Specifically, Figure 2h–n additionally shows the distributions of in-AD chemicals and out-AD chemicals predicted by these QSPRs; for comparison, Figure 2a–g displays the distributions of investigated chemicals falling within the AD of at least one QSPR and outside the ADs of all investigated QSPRs.

As Figure 2 shows, when partitioning properties are predicted (KOW, KOA, and KOC), OPERA exhibits the broadest AD, covering around or over 50% of the investigated chemicals. In contrast, the use of pp-LFERs in combination with IFS-QSAR-predicted Abraham solute descriptors shows the narrowest AD coverage.

Since the overall AD of pp-LFER predictions is defined as the intersection of (i) the AD of IFS-QSAR’s Abraham solute descriptor predictions and (ii) the AD of the pp-LFER correlation (Figure 1), we further explore which of these two conditions to a greater extent limits the overall AD of pp-LFER predictions. Figure 2o shows that 65–77% of the 81,278 investigated chemicals fall within the IFS-QSAR’s AD (defined by the leverage approach) when Abraham solute descriptors S, A, B, and L are predicted separately (the AD concept does not apply to descriptor V, see the footnote of Supporting Information Table S1). However, since the pp-LFER prediction requires all used Abraham solute descriptors to fall within their respective ADs, such an overlap significantly decreases the AD coverage to 32,400 investigated chemicals (Figure 2o). In addition, overlap with the AD of pp-LFER correlations further slightly reduces the coverage of the overall AD to 22,751, 20,720, and 22,388 chemicals for KOW, KOA, and KOC, respectively (Figure 2o). These findings suggest that the integration of multiple QSPRs and conversions in series may greatly reduce the AD coverage of chemicals compared to the direct prediction of chemical properties from the molecular structure.
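Expressed as shares of the 81,278 investigated chemicals, the counts above show how each additional intersection narrows the coverage. The labels in this sketch are shorthand for the steps described in the text.

```python
total = 81_278

# Counts reported for Figure 2o.
coverage_counts = {
    "all Abraham descriptors in AD": 32_400,
    "plus pp-LFER AD, KOW": 22_751,
    "plus pp-LFER AD, KOA": 20_720,
    "plus pp-LFER AD, KOC": 22_388,
}

coverage_percent = {
    step: round(100 * count / total, 1) for step, count in coverage_counts.items()
}
for step, percent in coverage_percent.items():
    print(f"{step}: {percent}%")
```

Compared with the 65–77% coverage of the individual descriptor predictions, requiring all descriptors to be in-AD leaves about 40% of the chemicals, and the pp-LFER correlation's own AD brings the overall coverage to roughly 25–28%.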

Figure 2 also indicates that the examined QSPR toolkits exhibit comparable AD coverage for HLB,fish and HLB,human, encompassing around or over half of the investigated chemicals for both properties. IFS-QSAR exhibits the widest AD coverage among the QSPR toolkits.

A considerable portion of the investigated chemicals falls outside the ADs of the examined QSPR toolkits for kOH and HLbiodeg predictions (Figure 2). It is important to note that the combination of EPI Suite-predicted rapid/ready biodegradability probabilities or ratings with the Arnot et al. correlations (eqs 4 through 7; Figure 1)59 gives a broad coverage of 54,040 (66%) chemicals (Figure 2o): 60,134 (74%) investigated chemicals fall within the intersection of the ADs of all four EPI Suite’s biodegradability models, and the overlap with the ADs of Arnot correlations only slightly changes the coverage of chemicals (Figure 2o). This indicates that EPI Suite’s biodegradability models are based on a diverse training set. By contrast, OPERA has a QSPR model specifically designed to predict HLbiodeg for hydrocarbons, building on a training set comprising 112 petroleum chemicals alone (111 hydrocarbons and benzothiophene).28 Therefore, it is not surprising that this model covers only 16.9% of the investigated chemicals (Figure 2n). In fact, OPERA also includes a QSPR for predicting a chemical’s rapid/ready biodegradability, as a binary outcome of “yes” or “no”, which builds on a training set nearly identical to that used by EPI Suite. However, what currently limits the use of this QSPR in hazard and risk assessments is the absence of an algorithm that supports translating these binary results into HLbiodeg.

A comparison between the percentages in Figure 2a–g and h–n indicates that around or more than half of the chemicals studied are covered by at least one of the four commonly used QSPR toolkits, likely because different algorithms consider different structural features or assign different weights to different molecular descriptors in their predictions.64 Nevertheless, Figure 2a–g shows that the joint AD of multiple QSPRs has much narrower coverage for predicting kOH, HLbiodeg, and KOA, compared to other properties. Since kOH, HLbiodeg, and KOA are crucial in assessing the hazards of LRTP, P, and B, respectively, such a narrower coverage may pose challenges to the acceptability of assessment results in regulatory and academic settings.

Interestingly, for KOA, KOC, and HLbiodeg (Figure 2b,c,g), the medians of out-AD predictions are consistently higher than the medians of in-AD predictions. This discrepancy is especially pronounced for KOA. Likewise, Figure 2i demonstrates that for all three KOA QSPRs, the out-AD predictions generally cluster at a significantly higher level than the in-AD predictions. In contrast, for other properties (Figure 2a,e,f), the medians of in-AD predictions and out-AD predictions are close to each other, with the in-AD predictions generally following a bell-shaped distribution, while the out-AD predictions are evenly distributed. These findings imply that one cannot draw a universal conclusion as to whether extreme predictions are more likely to fall outside the AD.

A detailed analysis of the AD coverage of chemicals containing different nonmetal atoms reveals disparities in AD coverage among commonly used QSPR toolkits (Supporting Information Table S2). As shown in SI Table S2, nitrogen-containing chemicals (42,671 chemicals) are most abundant among the 81,278 investigated chemicals, whereas boron-containing chemicals (156 chemicals) are the least prevalent. For the investigated partitioning and reactivity properties, the ADs of the four investigated QSPR toolkits exhibit the highest coverage of chemicals containing chlorine and bromine, followed by chemicals containing fluorine and silicon (SI Table S2). Such wide AD coverage is not unexpected, given that organochlorides and organobromines have long been the focus of chemical research and regulations and current QSPR toolkits incorporate a significant number of these chemicals into their training sets. Recently, organosilicons and organofluorides have also been incorporated into property data prediction tools such as pp-LFER, thereby enhancing the AD coverage for predicting partitioning coefficients specifically for these types of chemicals.49 Interestingly, the AD of the KOA QSPRs exhibits the best coverage of silicon-containing chemicals (92.8%), compared to other chemicals (SI Table S2). Nevertheless, quite limited numbers of fluorine-containing chemicals are covered by the AD of the HLB,human, kOH, and HLbiodeg QSPRs studied here. In contrast, for partitioning coefficients, phosphorus-containing chemicals are least covered by the AD of the four investigated QSPR toolkits; notably, 29.3 and 56.7% of these chemicals fall outside of the AD of all KOW and KOA QSPRs, respectively (SI Table S2). Such limited AD coverage poses challenges to the regulatory assessment of organophosphorus compounds. In addition, chemicals with different nonmetal atoms seem to be well represented and balanced in the training sets of KOC QSPRs, with fewer than 17% of all categories of chemicals falling outside the ADs of the studied QSPRs (SI Table S2).

Furthermore, Supporting Information Table S3 reveals that similar percentages of ionizable and nonionizable organic chemicals are covered by the ADs of the commonly used KOW, KOC, HLB,human, and HLbiodeg QSPRs. This is not surprising, given that their training sets encompass both ionizable and nonionizable organic chemicals. However, a much larger proportion of ionizable than nonionizable organic chemicals falls outside the ADs of the KOA QSPRs (51.9 vs 6.5%) and the kOH QSPRs (54.7 vs 22.9%). This is understandable because partitioning and reaction in air for ionized forms of chemicals may not be as relevant as those for neutral chemicals that can substantially exist in the gas phase. This disparity highlights that ionizable structures are significantly under-represented in the training sets of commonly used QSPRs, which poses substantial challenges for the hazard and risk assessments, e.g., evaluations of bioaccumulation and long-range transport potential, of ionizable organic chemicals.

It is important to reiterate that applicability does not necessarily indicate the accuracy of a QSPR. For example, despite the KOA pp-LFER (with IFS-QSAR-predicted Abraham solute descriptors) having a much narrower AD coverage than OPERA’s (Figure 2i), Baskaran et al.71 showed that the two approaches performed similarly well, with comparable mean absolute errors (0.34 and 0.33) and root-mean-square errors (0.50 and 0.52) for pp-LFER and OPERA, respectively. Three reasons may be responsible for the divergence between applicability and accuracy. First, applicability reflects the degree of similarity between the query chemical and chemicals in the training set. Mathematically, this similarity is quantified in a multidimensional space defined by a limited number of selected variables that statistically explain the majority of variance in the property of interest among training chemicals. However, since the selection of variables cannot be exhaustive, unselected variables may still be responsible for the variance in the property of interest. An example is that OPERA’s KOA prediction builds on merely the number of hydrogen bond donors and the hexadecane–air partition coefficient,72 which are chemically equivalent to pp-LFER’s descriptors A and L, respectively. However, the KOA pp-LFER (eq 2) indicates that descriptors S and B additionally contribute to the chemical-to-chemical variance in KOA. Second, the AD coverage by the KOA pp-LFER is mostly limited by the requirement for overlapping ADs among individual Abraham solute descriptors (Figure 2o). However, the calculation of OPERA’s AD does not consider the overlapping of ADs of individual PaDEL descriptors. On the one hand, most PaDEL descriptors (e.g., the number of hydrogen bond donors) are structural, geometrical, or compositional descriptors that are derived from molecular structures or compositions through predefined rules, and therefore, they do not possess statistically defined ADs. On the other hand, PaDEL does not output AD information for other descriptors (e.g., hexadecane–air partition coefficient). These two aspects prevent the consideration of ADs of PaDEL descriptors in determining OPERA’s AD. In other words, pp-LFER’s narrower AD coverage results from a more stringent definition of AD being applied. Third, the leverage threshold of 3 × p/n may not always reflect the differences in accuracy between results obtained from extrapolation and interpolation. For example, a pp-LFER model developed from a large training set (more than 100 chemicals) can be robust even if the leverage far exceeds the 3 × p/n threshold. In this case, the threshold may be overly stringent, and its use could exclude accurate values obtained through extrapolation.44
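For readers unfamiliar with the leverage approach, the following Python sketch shows one common way of computing it: the standard hat-matrix leverage h = xᵀ(XᵀX)⁻¹x compared against the 3 × p/n threshold. It assumes that p counts the descriptors plus an intercept term; the training grid, the query points, and the function names are hypothetical and do not reproduce the implementation of any toolkit discussed here.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h = x^T (X^T X)^-1 x of each query row, with an intercept column added."""
    X = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, dtype=float)])
    Q = np.column_stack([np.ones(len(X_query)), np.asarray(X_query, dtype=float)])
    xtx_inv = np.linalg.pinv(X.T @ X)  # pseudo-inverse tolerates collinear descriptors
    return np.einsum("ij,jk,ik->i", Q, xtx_inv, Q)


def leverage_threshold(X_train, factor=3.0):
    """Threshold factor * p / n, assuming p = number of descriptors + 1."""
    n, k = np.asarray(X_train).shape
    return factor * (k + 1) / n


# Hypothetical training set: 25 chemicals on a 5 x 5 grid of two descriptors.
X_train = [[i, j] for i in range(5) for j in range(5)]
# Query chemicals: training centroid, corner of the training range, far outside it.
X_query = [[2, 2], [4, 4], [10, 10]]

h = leverages(X_train, X_query)
h_star = leverage_threshold(X_train)
for x, h_i in zip(X_query, h):
    status = "in AD" if h_i <= h_star else "out of AD"
    print(x, round(float(h_i), 3), status)
```

Because the threshold scales with 1/n, it tightens as the training set grows, which is one way to see why a fixed 3 × p/n rule can become stringent for models built on large training sets.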

Impacts of Coverage of the Applicability Domains on Hazard Assessment

As Table 1 shows, a comparison of chemical-specific partitioning and reactivity properties with corresponding cutoff thresholds of P, B, M, and LRTP indicators enables flagging chemicals that may exhibit potential hazards to the environment. However, the limited coverage of QSPRs may constrain their suitability for such regulatory assessment efforts. For example, among the 81,278 investigated chemicals, 24,653 (30.3%) chemicals have a HLbiodeg greater than 40 days and, hence, could be potentially labeled as “P” chemicals. Nevertheless, since Figure 2g illustrates that 12,688 (51.5%) of these chemicals fall outside the ADs of all of the QSPRs investigated here, the extent to which we can confidently interpret the P assessment results remains unclear. Likewise, 49,205 (60.5%) chemicals have a KOW > 10^2 and KOA > 10^5 and hence demonstrate potential for bioaccumulation in air-breathing animals if persistent. However, Figure 2b illustrates that 18,511 (37.6%) of these chemicals fall outside the ADs of all of the KOA QSPRs investigated here, which limits our confidence in the B assessment results.
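The screening logic and the percentages quoted above can be sketched in a few lines of Python. The cutoffs are read here as HLbiodeg > 40 days, KOW > 10^2, and KOA > 10^5 (expressed as log values); the function names and the two example chemicals are hypothetical, while the counts are those given in the text.

```python
def is_potentially_persistent(hl_biodeg_days, cutoff_days=40.0):
    """Flag 'P' when the biodegradation half-life exceeds the cutoff."""
    return hl_biodeg_days > cutoff_days


def is_potentially_bioaccumulative_air(log_kow, log_koa, log_kow_cutoff=2.0, log_koa_cutoff=5.0):
    """Flag 'B' for air-breathing animals when both partition ratios exceed their cutoffs."""
    return log_kow > log_kow_cutoff and log_koa > log_koa_cutoff


def share(part, whole):
    """Percentage of 'part' in 'whole'."""
    return 100.0 * part / whole


# Two hypothetical chemicals, for illustration only.
print(is_potentially_persistent(75.0), is_potentially_bioaccumulative_air(4.1, 7.3))
print(is_potentially_persistent(12.0), is_potentially_bioaccumulative_air(1.5, 6.0))

# Counts quoted in the text.
total = 81_278
p_flagged, p_flagged_out_of_ad = 24_653, 12_688
b_flagged, b_flagged_out_of_ad = 49_205, 18_511

print(f"P-flagged: {share(p_flagged, total):.1f}% of all chemicals")
print(f"P-flagged and outside all ADs: {share(p_flagged_out_of_ad, p_flagged):.1f}%")
print(f"B-flagged: {share(b_flagged, total):.1f}% of all chemicals")
print(f"B-flagged and outside all KOA ADs: {share(b_flagged_out_of_ad, b_flagged):.1f}%")
```

Note that the out-of-AD shares are computed relative to the flagged chemicals, not relative to the full set of 81,278 chemicals.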

Coverage of the Applicability Domains of Semiempirical Relationships

We use four “class-specific” Karickhoff relationships predicting KOC as an illustrative example (Table 2), noting that similar analysis can also be done for other semiempirical relationships. The overall AD of Karickhoff relationships is defined as the intersection between the ADs of (i) the QSPR that predicts the KOW and (ii) the sp-LFER correlating KOW and KOC. Here, OPERA is used to predict KOW, whose AD covers 86.4% (67,324) of the 77,958 “predominantly hydrophobic”, “nonhydrophobic”, anionic, and cationic chemicals that are supported by the Karickhoff relationships (Table 2). On the other hand, the sp-LFER’s AD (defined by the leverage approach) covers 89.5% (69,776) of the 77,958 chemicals. The intersection of these two ADs encompasses 80.5% (62,787) of the 77,958 investigated chemicals (the “KOW-based leverage” results in Table 2). A comparison between Table 2 and Figure 2j indicates that the Karickhoff relationships’ AD (62,787) covers more chemicals than any of the three QSPRs investigated here (62,749, 61,178, and 22,388 by QSARINS-Chem, OPERA, and IFS-QSAR, respectively). Notably, the Karickhoff relationships provide coverage for 79.1, 79.9, 77.4, and 85.8% of the “predominantly hydrophobic”, “nonhydrophobic”, anionic, and cationic chemicals examined in this study, respectively (the “KOW-based leverage” results in Table 2).

Table 2. Numbers of Chemicals Falling within the AD of Karickhoff Relationships37 for Nonionizable Organic Chemicals Containing only Carbon, Hydrogen, and/or Halogen Atoms, other Nonionizable Organic Chemicals, Anionic Chemicals, and Cationic Chemicals (the Total Numbers Sum up to 77,958, Less Than 81,278, Because of the Exclusion of Zwitterions and Other Chemicals), with the AD Defined by Considering the Input Parameter (KOW) Only or the Categories of Training Chemicals (Defined by the Abraham Solute Descriptors).

| “class” of Karickhoff relationships | “predominantly hydrophobic” (nonionizable, containing only carbon, hydrogen, and/or halogen atoms) | “nonhydrophobic” (other nonionizable) | anionic | cationic | total |
| --- | --- | --- | --- | --- | --- |
| no. of investigated chemicals | 4112 | 19,798 | 31,874 | 22,174 | 77,958 |
| no. of chemicals in OPERA’s AD | 3439 | 17,442 | 26,914 | 19,529 | 67,324 |
| no. of chemicals in sp-LFER’s AD (defined by KOW-based leverage) | 3660 | 16,773 | 28,568 | 20,775 | 69,776 |
| no. of chemicals in the combined OPERA and the sp-LFER AD (defined by KOW-based leverage) | 3251 | 15,822 | 24,679 | 19,035 | 62,787 |
| no. of chemicals in sp-LFER’s AD (defined by Abraham solute descriptors-based leverage) | 1850 | 8474 | 18,994 | 15,403 | 44,721 |
| no. of chemicals in the combined OPERA and the sp-LFER AD (defined by Abraham solute descriptors-based leverage) | 1678 | 7365 | 18,630 | 15,182 | 42,855 |
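The class-specific coverage percentages can be reproduced directly from the counts in Table 2, as in the short Python sketch below. The variable and function names are illustrative; the numbers are copied from the table.

```python
CLASSES = ["predominantly hydrophobic", "nonhydrophobic", "anionic", "cationic"]

# Counts copied from Table 2, in the order of CLASSES.
investigated = [4112, 19798, 31874, 22174]
combined_kow_leverage = [3251, 15822, 24679, 19035]
combined_abraham_leverage = [1678, 7365, 18630, 15182]


def percent(part, whole):
    """Coverage as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


kow_coverage = [percent(c, n) for c, n in zip(combined_kow_leverage, investigated)]
abraham_coverage = [percent(c, n) for c, n in zip(combined_abraham_leverage, investigated)]

for name, kow, abraham in zip(CLASSES, kow_coverage, abraham_coverage):
    print(f"{name}: KOW-based {kow}%, Abraham-based {abraham}%")

total = sum(investigated)
print("overall KOW-based:", percent(sum(combined_kow_leverage), total), "%")
print("overall Abraham-based:", percent(sum(combined_abraham_leverage), total), "%")
print("difference in chemicals:", sum(combined_kow_leverage) - sum(combined_abraham_leverage))
```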

At first glance, the above results give the impression that the “class-specific” Karickhoff relationships outperform QSPRs in terms of broader AD coverage. However, despite the better statistical performance, it is essential to recognize that it may not always be mechanistically sound to correlate KOC and KOW. This is because the definition of KOC involves the normalization of the sorption coefficient (Kd) by the fraction of organic carbon (fOC), which builds on the behavior of nonionizable chemicals and assumes that sorption is primarily controlled by hydrophobic effects and hydrogen bonding. Yet, this assumption may not hold for other classes of chemicals; for example, the sorption of cations can be primarily controlled by ion exchanges.73,74 In addition, the sorption coefficient combines a chemical’s affinities for multiple geosorbents in sediment or soil, such as carbonaceous organic matter, phyllosilicate clay minerals, and amorphous natural organic matter. For these reasons, the relative importance of the sorption mechanisms may differ among chemicals, even within the same class, such as between strong and weak acids or bases, and among chemicals with varied functional groups. Therefore, it becomes imperative to further categorize chemicals within each class based on, e.g., similar sorption mechanisms, and ensure that different categories of chemicals are adequately represented in the training set used to develop the Karickhoff relationships. This explains why varied forms of Karickhoff correlations (with different regression coefficients and intercepts) are available for chemicals with different functional groups or functional uses37,75 and why users (e.g., users of EUSES) are advised to select the equation that best suits the category of their chemical of interest.

However, quantifying the leverage, or the similarity, between the categories of query chemicals and those in the training set can be challenging in the case of Karickhoff relationships. This is because Karickhoff relationships rely solely on KOW as the independent variable, which is not indicative of a chemical’s sorption mechanism. For expediency, we here explore quantifying the mechanism-based categories using a leverage defined by Abraham solute descriptors (S, A, B, V, and L) and a threshold distance of 3 × p/n, where p is the number of Abraham solute descriptors + 1 and n is the number of chemicals in a Karickhoff relationship training set. The Abraham solute descriptors were selected as indicative of the sorption mechanism because they (i) expand the coverage of molecular interactions beyond what sp-LFERs provide76 and (ii) have been used previously to quantify the similarity between chemicals with similar sorption mechanisms.77 If such an “Abraham-solute-descriptor-based leverage” exceeds the threshold distance of 3 × p/n, then the chemical is outside the AD even if its “KOW-based leverage” falls within its corresponding threshold. As Table 2 shows, when considering the categories of chemicals in the training set using the “Abraham solute descriptors-based leverage”, the sp-LFER’s AD covers only 44,721 of the investigated chemicals, and if combined with the AD of OPERA’s KOW predictions, only 42,855 chemicals are within the overall AD of the Karickhoff correlations. Therefore, the use of Abraham solute descriptors (42,855 chemicals) defines a much narrower AD, compared to the use of KOW (62,787 chemicals). This means that 19,932 chemicals (62,787 – 42,855) are not “categorically” or “mechanistically” similar to the training chemicals, although they have KOW values close to the centroid of KOW of the training chemicals by chance. The question then arises as to whether these chemicals should be classified within the AD or not. One may argue that the AD should be defined solely by the parameter used for prediction without considering additional information, given that the AD “should be described in terms of the most relevant parameters, i.e., usually those that are descriptors of the model”.78 However, others may contend that the similarity between the categories of training chemicals and query chemicals should also be considered in the definition of AD. We thus leave this issue open to further discussion.
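The two competing AD definitions can be contrasted with a minimal Python sketch. It assumes that the two leverages have already been computed for each query chemical; the training set size, the leverage values, and the function names are hypothetical and serve only to illustrate how the same chemical can be classified differently.

```python
N_ABRAHAM_DESCRIPTORS = 5  # S, A, B, V, and L


def warning_leverage(n_descriptors, n_train, factor=3.0):
    """Threshold factor * p / n, where p = number of descriptors + 1."""
    return factor * (n_descriptors + 1) / n_train


def classify(h_kow, h_abraham, n_train):
    """Label a query chemical under the KOW-based and Abraham-based AD definitions."""
    in_kow = h_kow <= warning_leverage(1, n_train)
    in_abraham = h_abraham <= warning_leverage(N_ABRAHAM_DESCRIPTORS, n_train)
    if in_kow and in_abraham:
        return "inside AD under both definitions"
    if in_kow:
        return "inside KOW-based AD only"
    return "outside KOW-based AD"


# Hypothetical training set size and (KOW-based, Abraham-based) leverages.
n_train = 100
queries = {
    "chemical 1": (0.02, 0.10),
    "chemical 2": (0.03, 0.45),
    "chemical 3": (0.20, 0.12),
}
for name, (h_kow, h_abraham) in queries.items():
    print(name, "->", classify(h_kow, h_abraham, n_train))
```

Chemicals of the second kind, with a KOW close to the training centroid but dissimilar Abraham solute descriptors, correspond to the group whose AD membership is left open above.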

Closing Remarks

When assessing the potential hazards of chemicals on human health and the environment, the AD of the QSPRs and semiempirical relationships determines the suitability and desirability of applying these tools to a particular chemical. Our research shows that the use of multiple QSPRs results in covering approximately half or more of the chemicals investigated here within the AD of at least one QSPR. Note that this finding should not be interpreted as implying that the combined use of multiple QSPRs necessarily results in more accurate estimates, given that “applicability” does not necessarily equate to “accuracy”. However, current QSPRs have more limited AD coverage for predicting atmospheric reactivity, biodegradation, and octanol–air partitioning, compared to the AD coverage of other partitioning and biotransformation predictions. Therefore, caution should be exercised when using QSPRs in assessments of persistence, bioaccumulation capability, and long-range transport potential. Future laboratory efforts should focus on measurements for more diverse categories of chemicals to improve the AD coverage of predictive tools aiming for these properties. In addition, previous studies have suggested that mechanistically based theoretical models, such as those derived from quantum chemistry, can outperform QSPRs or semiempirical relationships in terms of prediction accuracy when no data are available.79 Therefore, for chemicals falling outside the AD of QSPRs or semiempirical relationships, the use of mechanistically based theoretical models may offer a viable alternative.

The definition of AD also affects the coverage of chemicals. For example, in the case of IFS-QSAR, we show that the leverage approach and denylist approach give different AD coverages for partitioning and reactivity predictions. In the case of the sp-LFER-type Karickhoff relationships, we show that considering only the parameter used for making the prediction leads to wider AD coverage than additionally considering categorical information. Hence, there is no single QSPR that has superior coverage compared to others. It is essential to note that the use of more stringent conditions to define the AD may exclude more query chemicals from the AD, thus limiting the applicability of the predictive tool.

It is crucial to note that information about the AD mainly indicates whether a prediction is interpolated or extrapolated from values in the training set. While interpolated values may be statistically more reliable when the training set is small, this does not mean that all interpolated values are superior to extrapolated values, especially for parameters with a large training set. Therefore, the size of the training set and the nature of the prediction must be considered carefully when judging the reliability of the results.

Finally, note that this work builds on a set of chemicals sourced from several national and regional chemical inventories. Despite their large number, these chemicals may not necessarily represent the spectrum of chemicals present in specific markets or jurisdictions. Future research with more comprehensive data sets on national or regional markets is warranted.

Acknowledgments

The authors acknowledge funding from project ECO-54 “Developing A Tiered Modeling Framework in Support of Risk Assessment of Chemical Substances Associated with Mobility Concerns” of the Long-range Research Initiative of the European Chemical Industry Council (CEFIC). The views expressed in this paper are solely those of the authors and do not necessarily reflect those of the CEFIC.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.est.3c05643.

  • Development of the chemical inventory for analysis; QSPRs used for analysis; examples of calculations using the leverage and denylist approaches; comparison between applicability domains with different definitions; percentages of chemicals containing different nonmetal atoms and with different ionization status; and distribution of dissociation constants (PDF)

The authors declare no competing financial interest.


References

  1. Wang Z.; Walker G. W.; Muir D. C. G.; Nagatani-Yoshida K. Toward a Global Understanding of Chemical Pollution: A First Comprehensive Analysis of National and Regional Chemical Inventories. Environ. Sci. Technol. 2020, 54 (5), 2575–2584. 10.1021/acs.est.9b06379. [DOI] [PubMed] [Google Scholar]
  2. Muir D. C. G.; Getzinger G. J.; McBride M.; Ferguson P. L. How Many Chemicals in Commerce Have Been Analyzed in Environmental Media? A 50 Year Bibliometric Analysis. Environ. Sci. Technol. 2023, 57 (25), 9119–9129. 10.1021/acs.est.2c09353. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Naidu R.; Biswas B.; Willett I. R.; Cribb J.; Kumar Singh B.; Paul Nathanail C.; Coulon F.; Semple K. T.; Jones K. C.; Barclay A.; Aitken R. Chemical Pollution: A Growing Peril and Potential Catastrophic Risk to Humanity. Environ. Int. 2021, 156, 106616. 10.1016/j.envint.2021.106616. [DOI] [PubMed] [Google Scholar]
  4. Manisalidis I.; Stavropoulou E.; Stavropoulos A.; Bezirtzoglou E. Environmental and Health Impacts of Air Pollution: A Review. Front. Public Health 2020, 8, 14 10.3389/fpubh.2020.00014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. UNEP . Stockholm Convention on Persistent Organic Pollutants 2021http://www.pops.int (accessed Jun 2, 2023).
  6. Hale S. E.; Arp H. P. H.; Schliebner I.; Neumann M. Persistent, Mobile and Toxic (PMT) and Very Persistent and Very Mobile (VPvM) Substances Pose an Equivalent Level of Concern to Persistent, Bioaccumulative and Toxic (PBT) and Very Persistent and Very Bioaccumulative (VPvB) Substances under REACH. Environ. Sci. Eur. 2020, 32, 155 10.1186/s12302-020-00440-4. [DOI] [PubMed] [Google Scholar]
  7. European Commission . European Union System for the Evaluation of Substances 2.0 (EUSES 2.0). Prepared for the European Chemicals Bureau by the National Institute of Public Health and the Environment (RIVM), (Report no. 601900005): Bilthoven; Europea Chemicals Agency: The Netherlands; 2004.
  8. Arnot J. A.; Mackay D.; Webster E.; Southwood J. M. Screening Level Risk Assessment Model for Chemical Fate and Effects in the Environment. Environ. Sci. Technol. 2006, 40 (7), 2316–2323. 10.1021/es0514085. [DOI] [PubMed] [Google Scholar]
  9. Arnot J. A.; Mackay D. Policies for Chemical Hazard and Risk Priority Setting : Can Toxicity, and Quantity Information Be Combined ?. Environ. Sci. Technol. 2008, 42 (13), 4648–4654. 10.1021/es800106g. [DOI] [PubMed] [Google Scholar]
  10. Li L.; Arnot J.; Wania F. Revisiting the Contributions of Far- and Near-Field Routes to Aggregate Human Exposure to Polychlorinated Biphenyls (PCBs). Environ. Sci. Technol. 2018, 52 (12), 6974–6984. 10.1021/acs.est.8b00151. [DOI] [PubMed] [Google Scholar]
  11. Li L.; Arnot J. A.; Wania F. Towards a Systematic Understanding of the Dynamic Fate of Polychlorinated Biphenyls in Indoor, Urban and Rural Environments. Environ. Int. 2018, 117, 57–68. 10.1016/j.envint.2018.04.038. [DOI] [PubMed] [Google Scholar]
  12. Fantke P.; Bijster M.; Guignard C.; Hauschild M.; Huijbregts M.; Jolliet O.; Kounina A.; Magaud V.; Margni M.; McKone T.; Posthuma L.; Rosenbaum R.; van de Meent D.; van Zelm R.. USEtox 2.0 Documentation, (Version 1.1), 2017.
  13. Zhang X.; Sun X.; Jiang R.; Zeng E. Y.; Sunderland E. M.; Muir D. C. G. Screening New Persistent and Bioaccumulative Organics in China’s Inventory of Industrial Chemicals. Environ. Sci. Technol. 2020, 54 (12), 7398–7408. 10.1021/acs.est.0c01898. [DOI] [PubMed] [Google Scholar]
  14. Muir D. C. G.; Howard P. H. Are There Other Persistent Organic Pollutants? A Challenge for Environmental Chemists. Environ. Sci. Technol. 2006, 40 (23), 7157–7166. 10.1021/es061677a. [DOI] [PubMed] [Google Scholar]
  15. Arnot J. A.; Brown T. N.; Wania F.; Breivik K.; McLachlan M. S. Prioritizing Chemicals and Data Requirements for Screening-Level Exposure and Risk Assessment. Environ. Health Perspect. 2012, 120 (11), 1565–1570. 10.1289/ehp.1205355. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. European Union Commission Regulation (EU) No 253/2011 of 15 March 2011 Amending Regulation (EC) No 1907/2006 of the European Parliament and of the Council on the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) as Regards Annex XIII Off. J. Eur. Union, 2011, 7–12. [Google Scholar]
  17. European Chemicals Agency . Guidance on Information Requirements and Chemical Safety Assessment Chapter R. 7b: Endpoint Specific, Version 4.0: Helsinki, Finland, 2017.
  18. Arnot J. A.; Brik B.; Curtis-Jackson P.; Gobas F.; Goss K.-U.; Habekost M.; Bonnomet V.; Hirmann D.; Hofer T.; Jacobi S.; Krause S.; Laue H.; Laurentie M.; Aparicio A. M.; van der Mescht M.; Rauert C.; Treu G.; Redman A.; Saunders L.; Verbruggen E.; Wania F.; Whalley P.. Bioaccumulation Assessment of Air-Breathing Mammals: A Discussion Paper. 2022.
  19. Neumann M.; Schliebner I.. Protecting the Sources of Our Drinking Water: The Criteria for Identifying Persistent, Mobile and Toxic (PMT) Substances and Very Persistent and Very Mobile (VPvM) Substances Under EU Regulation REACH (EC) No 1907/2006. 2019.
  20. Prinn R. G.; Weiss R. F.; Miller B. R.; Huang J.; Alyea F. N.; Cunnold D. M.; Fraser P. J.; Hartley D. E.; Simmonds P. G. Atmospheric Trends and Lifetime of CH3CCl3 and Global OH Concentrations. Science 1995, 269 (5221), 187–192. 10.1126/science.269.5221.187. [DOI] [PubMed] [Google Scholar]
  21. United Nations: Economic Commission for Europe . Handbook for the 1979 Convention on Long-Range Transboundary Air Pollution and Its Protocols; United Nations Publications: Geneva, Switzerland, 2004. [Google Scholar]
  22. Zhang Z.; Wang S.; Li L. Emerging Investigator Series: The Role of Chemical Properties in Human Exposure to Environmental Chemicals. Environ. Sci. Process. Impacts 2021, 23, 1839–1862. 10.1039/D1EM00252J. [DOI] [PubMed] [Google Scholar]
  23. McLachlan M. S.; Czub G.; MacLeod M.; Arnot J. A. Bioaccumulation of Organic Contaminants in Humans: A Multimedia Perspective and the Importance of Biotransformation. Environ. Sci. Technol. 2011, 45 (1), 197–202. 10.1021/es101000w. [DOI] [PubMed] [Google Scholar]
  24. Office of Pollution Prevention and Toxics . Exposure Assessment Tools and Models, Estimation Program Interface (EPI) Suite, version 4.11, http://www.epa.gov/opptintr/exposure/pubs/episuite.htm.
  25. Brown T. N.; Arnot J. A.; Wania F. Iterative Fragment Selection: A Group Contribution Approach to Predicting Fish Biotransformation Half-Lives. Environ. Sci. Technol. 2012, 46 (15), 8253–8260. 10.1021/es301182a. [DOI] [PubMed] [Google Scholar]
  26. Brown T. N. QSPRs for Predicting Equilibrium Partitioning in Solvent–Air Systems from the Chemical Structures of Solutes and Solvents. J. Solution Chem. 2022, 51 (9), 1101–1132. 10.1007/s10953-022-01162-2. [DOI] [Google Scholar]
  27. Brown T. N.; Armitage J. M.; Arnot J. A. Application of an Iterative Fragment Selection (IFS) Method to Estimate Entropies of Fusion and Melting Points of Organic Chemicals. Mol. Inf. 2019, 38 (8–9), 1800160 10.1002/minf.201800160. [DOI] [PubMed] [Google Scholar]
  28. Mansouri K.; Grulke C. M.; Judson R. S.; Williams A. J. OPERA Models for Predicting Physicochemical Properties and Environmental Fate Endpoints. J. Cheminf. 2018, 10, 10 10.1186/s13321-018-0263-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Gramatica P.; Cassani S.; Chirico N. QSARINS-chem: Insubria Datasets and New QSAR/QSPR Models for Environmental Pollutants in QSARINS. J. Comput. Chem. 2014, 35 (13), 1036–1044. 10.1002/jcc.23576. [DOI] [PubMed] [Google Scholar]
  30. Gramatica P.; Chirico N.; Papa E.; Cassani S.; Kovarich S. QSARINS: A New Software for the Development, Analysis, and Validation of QSAR MLR Models. J. Comput. Chem. 2013, 34 (24), 2121–2132. 10.1002/jcc.23361. [DOI] [Google Scholar]
  31. Brown T. N. Empirical Regressions Between System Parameters and Solute Descriptors of Polyparameter Linear Free Energy Relationships (PPLFERs) for Predicting Solvent-Air Partitioning. Fluid Phase Equilib. 2021, 540, 113035. 10.1016/j.fluid.2021.113035. [DOI] [Google Scholar]
  32. Endo S.; Goss K. Applications of Polyparameter Linear Free Energy Relationships in Environmental Chemistry. Environ. Sci. Technol. 2014, 48 (21), 12477–12491. 10.1021/es503369t. [DOI] [PubMed] [Google Scholar]
  33. Karickhoff S. W.; Brown D. S.; Scott T. A. Sorption of Hydrophobic Pollutants on Natural Sediments. Water Res. 1979, 13 (3), 241–248. 10.1016/0043-1354(79)90201-X.
  34. Karickhoff S. W.; Brown D. S. Determination of Octanol/Water Distribution Coefficients, Water Solubilities, and Sediment/Water Partition Coefficients for Hydrophobic Organic Pollutants; Environmental Research Laboratory, U.S. Environmental Protection Agency: Athens, GA, 1979.
  35. Karickhoff S. W. Semi-Empirical Estimation of Sorption of Hydrophobic Pollutants on Natural Sediments and Soils. Chemosphere 1981, 10 (8), 833–846. 10.1016/0045-6535(81)90083-7.
  36. Franco A.; Trapp S. Estimation of the Soil–Water Partition Coefficient Normalized to Organic Carbon for Ionizable Organic Chemicals. Environ. Toxicol. Chem. Int. J. 2008, 27 (10), 1995–2004. 10.1897/07-583.1.
  37. Sabljić A.; Güsten H.; Verhaar H.; Hermens J. QSAR Modelling of Soil Sorption. Improvements and Systematics of log KOC vs. log KOW Correlations. Chemosphere 1995, 31 (11), 4489–4514. 10.1016/0045-6535(95)00327-5.
  38. OECD. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q) SAR] Models ENV/JM/MONO 2007; Vol. 2, pp 1–154.
  39. Netzeva T. I.; Aptula A. O.; Benfenati E.; Cronin M. T. D.; Gini G.; Lessigiarska I.; Maran U.; Vračko M.; Schüürmann G. Description of the Electronic Structure of Organic Chemicals Using Semiempirical and Ab Initio Methods for Development of Toxicological QSARs. J. Chem. Inf. Model. 2005, 45 (1), 106–114. 10.1021/ci049747p.
  40. Sahigara F.; Mansouri K.; Ballabio D.; Mauri A.; Consonni V.; Todeschini R. Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. Molecules 2012, 17, 4791–4810. 10.3390/molecules17054791.
  41. Bruneau P.; McElroy N. R. Generalized Fragment Substructure Based Property Prediction Method. J. Chem. Inf. Model. 2006, 46, 1379–1387. 10.1021/ci0504014.
  42. Nikolova-Jeliazkova N.; Jaworska J. An Approach to Determining Applicability Domains for QSAR Group Contribution Models: An Analysis of SRC KOWWIN. Altern. Lab. Anim. 2005, 33 (5), 461–470. 10.1177/026119290503300510.
  43. Eriksson L.; Jaworska J.; Worth A. P.; Cronin M. T. D.; McDowell R. M.; Gramatica P. Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environ. Health Perspect. 2003, 111 (10), 1361–1375. 10.1289/ehp.5758.
  44. Endo S. Applicability Domain of Polyparameter Linear Free Energy Relationship Models Evaluated by Leverage and Prediction Interval Calculation. Environ. Sci. Technol. 2022, 56 (9), 5572–5579. 10.1021/acs.est.2c00865.
  45. Kar S.; Roy K.; Leszczynski J.; Nicolotti O. Applicability Domain: A Step Toward Confident Predictions and Decidability for QSAR Modeling. In Computational Toxicology: Methods and Protocols; Springer New York: New York, NY, 2018; pp 141–169.
  46. Hanser T.; Barber C.; Marchaland J. F.; Werner S. Applicability Domain: Towards A More Formal Definition. SAR QSAR Environ. Res. 2016, 27 (11), 865–881. 10.1080/1062936X.2016.1250229.
  47. Netzeva T. I.; Worth A. P.; Aldenberg T.; Benigni R.; Cronin M. T. D.; Gramatica P.; Jaworska J. S.; Kahn S.; Klopman G.; Marchant C. A.; Myatt G.; Nikolova-Jeliazkova N.; Patlewicz G.; Perkins R.; Roberts D. W.; Schultz T. W.; Stanton D. T.; van de Sandt J. J. M.; Tong W.; Veith G.; Yang C. Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure-Activity Relationships: The Report and Recommendations of ECVAM Workshop 52. Altern. Lab. Anim. 2005, 33 (2), 155–173. 10.1177/026119290503300209.
  48. Organisation for Economic Co-operation and Development (OECD). OECD Principles for the Validation, for Regulatory Purposes, of (Quantitative) Structure-Activity Relationship Models. 2004.
  49. Endo S.; Goss K.-U. Predicting Partition Coefficients of Polyfluorinated and Organosilicon Compounds Using Polyparameter Linear Free Energy Relationships (PP-LFERs). Environ. Sci. Technol. 2014, 48 (5), 2776–2784. 10.1021/es405091h.
  50. Zhang X.; Sun X.; Jiang R.; Zeng E. Y.; Sunderland E. M.; Muir D. C. G. Response to Comment on “Screening New Persistent and Bioaccumulative Organics in China’s Inventory of Industrial Chemicals”: A Call for Further Environmental Research on Organosilicons Produced in China. Environ. Sci. Technol. 2022, 56 (1), 693–696. 10.1021/acs.est.1c05528.
  51. Strempel S.; Scheringer M.; Ng C. A.; Hungerbühler K. Screening for PBT Chemicals among the “Existing” and “New” Chemicals of the EU. Environ. Sci. Technol. 2012, 46 (11), 5680–5687. 10.1021/es3002713.
  52. Reemtsma T.; Berger U.; Arp H. P. H.; Gallard H.; Knepper T. P.; Neumann M.; Quintana J. B.; de Voogt P. Mind the Gap: Persistent and Mobile Organic Compounds - Water Contaminants That Slip Through. Environ. Sci. Technol. 2016, 50 (19), 10308–10315. 10.1021/acs.est.6b03338.
  53. Arp H. P. H.; Brown T. N.; Berger U.; Hale S. E. Ranking REACH Registered Neutral, Ionizable and Ionic Organic Chemicals Based on Their Aquatic Persistency and Mobility. Environ. Sci. Process. Impacts 2017, 19, 939–955. 10.1039/C7EM00158D.
  54. Weininger D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28 (1), 31–36. 10.1021/ci00057a005.
  55. Chirico N.; Sangion A.; Gramatica P.; Bertato L.; Casartelli I.; Papa E. QSARINS-Chem Standalone Version: A New Platform-independent Software to Profile Chemicals for Physico-chemical Properties, Fate, and Toxicity. J. Comput. Chem. 2021, 42 (20), 1452–1460. 10.1002/jcc.26551.
  56. Abraham M. H.; Chadha H. S.; Whiting G. S.; Mitchell R. C. Hydrogen Bonding. 32. An Analysis of Water-Octanol and Water-Alkane Partitioning and the Δlog P Parameter of Seiler. J. Pharm. Sci. 1994, 83 (8), 1085–1100. 10.1002/jps.2600830806.
  57. Grubbs L. M.; Saifullah M.; De La Rosa N. E.; Ye S.; Achi S. S.; Acree W. E.; Abraham M. H. Mathematical Correlations for Describing Solute Transfer into Functionalized Alkane Solvents Containing Hydroxyl, Ether, Ester or Ketone Solvents. Fluid Phase Equilib. 2010, 298 (1), 48–53. 10.1016/j.fluid.2010.07.007.
  58. Bronner G.; Goss K. U. Predicting Sorption of Pesticides and Other Multifunctional Organic Chemicals to Soil Organic Carbon. Environ. Sci. Technol. 2011, 45 (4), 1313–1319. 10.1021/es102553y.
  59. Arnot J. A.; Gouin T.; Mackay D. Development and Application of Models of Chemical Fate in Canada: Practical Methods for Estimating Environmental Biodegradation Rates. Report to Environment Canada; Canadian Environmental Modelling Network: Peterborough, ON, 2005.
  60. Tetko I. V.; Sushko I.; Pandey A. K.; Zhu H.; Tropsha A.; Papa E.; Öberg T.; Todeschini R.; Fourches D.; Varnek A. Critical Assessment of QSAR Models of Environmental Toxicity against Tetrahymena Pyriformis: Focusing on Applicability Domain and Overfitting by Variable Selection. J. Chem. Inf. Model. 2008, 48 (9), 1733–1746. 10.1021/ci800151m.
  61. Atkinson A. C. Plots, Transformations and Regression; Clarendon Press: Oxford (UK), 1985.
  62. Gramatica P.; Pilutti P.; Papa E. QSAR Prediction of Ozone Tropospheric Degradation. QSAR Comb. Sci. 2003, 22 (3), 364–373. 10.1002/qsar.200390026.
  63. Gramatica P.; Papa E. QSAR Modeling of Bioconcentration Factor by Theoretical Molecular Descriptors. QSAR Comb. Sci. 2003, 22 (3), 374–385. 10.1002/qsar.200390027.
  64. Gramatica P.; Pilutti P.; Papa E. Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into Training–Test Sets and Consensus Modeling. J. Chem. Inf. Comput. Sci. 2004, 44 (5), 1794–1802. 10.1021/ci049923u.
  65. Naseem S.; Zushi Y.; Nabi D. Development and Evaluation of Two-Parameter Linear Free Energy Models for the Prediction of Human Skin Permeability Coefficient of Neutral Organic Chemicals. J. Cheminf. 2021, 13 (1), 25 10.1186/s13321-021-00503-5.
  66. Li L.; Zhang Z.; Men Y.; Baskaran S.; Sangion A.; Wang S.; Arnot J. A.; Wania F. Retrieval, Selection, and Evaluation of Chemical Property Data for Assessments of Chemical Emissions, Fate, Hazard, Exposure, and Risks. ACS Environ. Au 2022, 2 (5), 376–395. 10.1021/acsenvironau.2c00010.
  67. Manallack D. T. The pKa Distribution of Drugs: Application to Drug Discovery. Perspect. Med. Chem. 2007, 1, 25–38.
  68. Franco A.; Ferranti A.; Davidsen C.; Trapp S. An Unexpected Challenge: Ionizable Compounds in the REACH Chemical Space. Int. J. Life Cycle Assess. 2010, 15 (4), 321–325. 10.1007/s11367-010-0165-6.
  69. Bonnell M. A.; Zidek A.; Griffiths A.; Gutzman D. Fate and Exposure Modeling in Regulatory Chemical Evaluation: New Directions from Retrospection. Environ. Sci. Process. Impacts 2018, 20 (1), 20–31. 10.1039/C7EM00510E.
  70. Armitage J. M.; Erickson R. J.; Luckenbach T.; Ng C. A.; Prosser R. S.; Arnot J. A.; Schirmer K.; Nichols J. W. Assessing the Bioaccumulation Potential of Ionizable Organic Compounds: Current Knowledge and Research Priorities. Environ. Toxicol. Chem. 2017, 36 (4), 882–897. 10.1002/etc.3680.
  71. Baskaran S.; Lei Y. D.; Wania F. Reliable Prediction of the Octanol–Air Partition Ratio. Environ. Toxicol. Chem. 2021, 40 (11), 3166–3180. 10.1002/etc.5201.
  72. QSAR Model Reporting Format (QMRF): OPERA-Model for Octanol/Air Partition Coefficient, available at the JRC QSAR Model Database. European Commission, Joint Research Centre [Dataset]. 2020. http://data.europa.eu/89h/e4ef8d13-d743-4524-a6eb-80e18b58cba4 (accessed Jun 2, 2023).
  73. Droge S. T. J.; Goss K.-U. Sorption of Organic Cations to Phyllosilicate Clay Minerals: CEC-Normalization, Salt Dependency, and the Role of Electrostatic and Hydrophobic Effects. Environ. Sci. Technol. 2013, 47 (24), 14224–14232. 10.1021/es403187w.
  74. Sigmund G.; Arp H. P. H.; Aumeier B. M.; Bucheli T. D.; Chefetz B.; Chen W.; Droge S. T. J.; Endo S.; Escher B. I.; Hale S. E.; Hofmann T.; Pignatello J.; Reemtsma T.; Schmidt T. C.; Schönsee C. D.; Scheringer M. Sorption and Mobility of Charged Organic Compounds: How to Confront and Overcome Limitations in Their Assessment. Environ. Sci. Technol. 2022, 56 (8), 4702–4710. 10.1021/acs.est.2c00570.
  75. Schwarzenbach R. P.; Gschwend P. M.; Imboden D. M. Environmental Organic Chemistry, 2nd ed.; John Wiley & Sons: Hoboken, NJ, 2003.
  76. Reppas-Chrysovitsinos E.; Sobek A.; MacLeod M. Screening-Level Models to Estimate Partition Ratios of Organic Chemicals between Polymeric Materials, Air and Water. Environ. Sci. Process. Impacts 2016, 18 (6), 667–676. 10.1039/C5EM00664C.
  77. Baskaran S.; Lei Y. D.; Wania F. A Database of Experimentally Derived and Estimated Octanol–Air Partition Ratios (KOA). J. Phys. Chem. Ref. Data 2021, 50 (4), 43101 10.1063/5.0059652.
  78. Jaworska J. S.; Comber M.; Auer C.; Van Leeuwen C. J. Summary of a Workshop on Regulatory Acceptance of (Q) SARs for Human Health and Environmental Endpoints. Environ. Health Perspect. 2003, 111 (10), 1358–1360. 10.1289/ehp.5757.
  79. Endo S.; Hammer J.; Matsuzawa S. Experimental Determination of Air/Water Partition Coefficients for 21 Per- and Polyfluoroalkyl Substances Reveals Variable Performance of Property Prediction Models. Environ. Sci. Technol. 2023, 57 (22), 8406–8413. 10.1021/acs.est.3c02545.

Associated Data

Supplementary Materials

es3c05643_si_001.pdf (480.8KB, pdf)

Articles from Environmental Science & Technology are provided here courtesy of American Chemical Society