Author manuscript; available in PMC: 2021 Apr 15.
Published in final edited form as: Clean Technol Environ Policy. 2020 Mar 1;22(2):441–458. doi: 10.1007/s10098-019-01795-w

An automated framework for compiling and integrating chemical hazard data

Leora Vegosen 1,2, Todd M Martin 2
PMCID: PMC8048128  NIHMSID: NIHMS1674430  PMID: 33867908

Abstract

Comparative chemical hazard assessment, which compares hazards for several endpoints across several chemicals, can be used for a variety of purposes including alternatives assessment and the prioritization of chemicals for further assessment. A new framework was developed to compile and integrate chemical hazard data for several human health and ecotoxicity endpoints from public online sources including hazardous chemical lists, Globally Harmonized System hazard codes (H-codes) or hazard categories from government health agencies, experimental quantitative toxicity values, and predicted values using Quantitative Structure–Activity Relationship (QSAR) models. QSAR model predictions were obtained using EPA’s Toxicity Estimation Software Tool. Java programming was used to download hazard data, convert data from each source into a consistent score record format, and store the data in a database. Scoring criteria based on the EPA’s Design for the Environment Program Alternatives Assessment Criteria for Hazard Evaluation were used to determine ordinal hazard scores (i.e., low, medium, high, or very high) for each score record. Different methodologies were assessed for integrating data from multiple sources into one score for each hazard endpoint for each chemical. The chemical hazard assessment (CHA) Database developed in this study currently contains more than 990,000 score records for more than 85,000 chemicals. The CHA Database and the methods used in its development may contribute to several cheminformatics, public health, and environmental activities.

Keywords: Chemical hazard assessment, Quantitative Structure–Activity Relationships (QSARs), Globally Harmonized System (GHS), Environmental health, Cheminformatics, Computational toxicology


Introduction

The Frank R. Lautenberg Chemical Safety for the 21st Century Act (LCSA), which amends and modernizes the Toxic Substances Control Act (TSCA), was passed by the United States Congress in 2016 (Public Law 114–182; 15 USC 2601). Under Section 6(b) of the amended TSCA, the United States Environmental Protection Agency (EPA) is required to prioritize existing chemical substances for risk evaluation. Prioritization is a public process with deadlines in which the EPA is required to designate at least 20 chemical substances as high priority and 20 chemical substances as low priority for risk evaluation. LCSA includes deadlines for completing risk assessments for high-priority chemicals and then designating additional chemicals as high priority (Public Law 114–182; 15 USC 2601).

The implementation of LCSA is an important development that is beginning to address the long-standing problem that only a small percentage of the chemicals in commerce have been assessed for toxicity (GAO 2005). In 1984, the National Research Council (NRC) estimated that no toxicity information was available for approximately 80% of the chemicals in commerce (NRC 1984). In 2007, the NRC produced Toxicity Testing in the 21st Century: A Vision and a Strategy, which described how a paradigm shift implementing new and emerging approaches, including high-throughput in vitro assays and in silico methods such as Quantitative Structure–Activity Relationship (QSAR) models, could provide toxicity data for a larger number of chemicals using fewer animals, less time, and less money than traditional toxicology methods (NRC 2007). Since then, the use of such new approach methodologies (NAMs), particularly in EPA projects such as ToxCast, has begun to provide some of this toxicity information (Kavlock et al. 2012; Richard et al. 2016).

Although NAMs are improving toxicity assessment and LCSA is improving the risk assessment process, missing data are currently still a limitation in these areas. Furthermore, existing chemical hazard data are in several different locations and in several different formats. This paper describes the development of a new comparative chemical hazard assessment database that integrates chemical hazard data from a variety of sources and formats into ordinal hazard scores.

The compilation and integration of chemical hazard data from multiple sources can provide information that can aid in the prioritization of chemicals under TSCA. Comparative chemical hazard assessment, which compares hazards for several endpoints across several chemicals, is also an important component of chemical alternatives assessment (Whittaker and Heine 2013). According to the NRC, the “goal of an alternatives assessment is to facilitate an informed consideration of the advantages and disadvantages of alternatives to a chemical of concern, resulting in the identification of safer alternatives” (NRC 2014).

There are several methods and tools for conducting comparative chemical hazard assessments as part of alternatives assessments, but few of these tools are fully transparent and publicly available (NRC 2014; Whittaker and Heine 2013). Design for the Environment (DfE), which includes the Safer Choice Program, is a non-regulatory EPA initiative that began in the 1990s, which provides transparent publicly available criteria for comparing chemical alternatives based on several human health, ecotoxicity, and fate endpoints (US EPA 2011a). The DfE Alternatives Assessment Criteria for Hazard Evaluation provides a method for classifying chemicals based on the United Nations (UN) Globally Harmonized System of Classification and Labeling of Chemicals (GHS). The GHS was developed by the UN to provide an internationally consistent approach for categorizing and communicating the human health and environmental hazards of chemicals (UN 2017). The GHS includes hazard codes (H-codes) that correspond to hazard statements, hazard classes, and hazard categories. For example, the code H300 corresponds to the health hazard statement “Fatal if swallowed,” the hazard class “acute toxicity, oral,” and hazard categories 1 and 2 (UN 2017). Different endpoints have different GHS H-codes. For example, H300 indicates oral acute toxicity, whereas H310 indicates dermal acute toxicity. The DfE Alternatives Assessment Criteria for Hazard Evaluation provides guidance for converting values from a variety of different sources and a variety of different formats (including, but not limited to, GHS H-codes) into consistent ordinal hazard scores of low (L), moderate (M), high (H), or very high (VH) hazard (US EPA 2011b). Frameworks such as the GreenScreen for Safer Chemicals Hazard Assessment Guidance (Clean Production Action 2018a) build upon the DfE approach.

Conducting an alternatives assessment using tools such as DfE or GreenScreen can require substantial time and resources (Wehage et al. 2017). Automated software tools can improve the efficiency of the alternatives assessment process (Wehage et al. 2017). There are some online tools for automating comparative chemical hazard assessment, including the Chemical Hazard and Alternatives Toolbox (ChemHAT) (Chemhat.org 2018), the Licensed GreenScreen List Translator Automators (Clean Production Action 2018b), the Pharos Chemical and Material Library (Pharos CML) (which was combined with the Chemical Hazard Data Commons in September 2019) (Healthy Building Network 2018, 2019), and Toxnot PBC (Toxnot PBC 2018). However, these sources have limitations in the amount of chemical hazard comparison information that is publicly available without a fee. In 2013, an Organization for Economic Cooperation and Development (OECD) meta-review of the alternatives assessment landscape identified gaps including a need for improved accessibility to “automated tools and methods to reduce hours of highly technical work” (OECD 2013).

This paper describes the development of a publicly available chemical hazard assessment (CHA) Database, which includes hazard scores for 19 hazard endpoints based on EPA’s DfE Alternatives Assessment Criteria for Hazard Evaluation (US EPA 2011a), with some modifications to facilitate the automation of hazard score generation from a variety of data sources. Automated methods based on the framework developed by Wehage et al. (2017) were used to merge data from GHS H-codes or hazard categories from several international government agencies, chemical hazard lists, quantitative experimental toxicity data, and predicted values from QSAR models into a single flat database table. The CHA Database and its underlying methodology for compiling and integrating hazard data, which are presented here, may aid in chemical prioritization under TSCA (US EPA 2018h) and may be utilized in decision-support tools such as RapidTox, which is currently being developed by the EPA “to integrate chemistry, toxicity and exposure information” for decision-specific workflows (US EPA 2018c).

Methods

Hazard assessment endpoints

Hazard scores were determined for the following human health endpoints: acute toxicity (via the oral, inhalation, and dermal routes of exposure), carcinogenicity, genotoxicity/mutagenicity, endocrine disruption, reproductive toxicity, developmental toxicity, neurotoxicity (single exposure and repeat exposure), systemic toxicity (single exposure and repeat exposure), skin sensitization, skin irritation, and eye irritation. Hazard scores were determined for the ecotoxicity endpoints of acute aquatic toxicity and chronic aquatic toxicity and the fate endpoints of persistence and bioaccumulation.

Data sources

Chemical hazard data were obtained from the 25 sources listed in Table 1. The number of unique chemicals with data for each hazard category from each source in the CHA Database is provided in Supplemental Information (SI) Table S1. The original formats of the data that were used to determine the hazard scores for each source are summarized in SI Table S2.

Table 1.

Data sources for the CHA Database

| Source | Abbreviation | Data type | Authority level | # Hazard endpoints | References |
| --- | --- | --- | --- | --- | --- |
| Safe Work Australia Hazardous Chemical Information System (HCIS) | Australia | GHS codes | Screening | 14 | Safe Work Australia (2018) |
| Canada CNESST Workplace Hazardous Materials Information System (WHMIS) | Canada | GHS codes | Screening | 12 | CNESST (2015) |
| ChemIDplus | ChemIDplus | Experimental (quantitative continuous) toxicity values | Screening | 3 | U.S. National Library of Medicine (2018) |
| Ministry of Environment and Food of Denmark Advisory List for Self-Classification of Dangerous Substances | Denmark | QSAR Predicted GHS categories | Predicted | 8 | Ministry of Environment and Food of Denmark (2010) |
| Environment and Climate Change Canada Domestic Substances List (DSL) | DSL | Yes/No | Screening | 3 | Environment and Climate Change Canada (2006) |
| European Chemicals Agency (ECHA) Classification Labeling and Packaging (CLP) Annex VI | ECHA CLP | GHS codes | Authoritative | 14 | ECHA (2018a) |
| EPA mid-Atlantic Region Human Health Risk-Based Concentrations | EPA mid-Atlantic Region Human Health Risk-Based Concentrations | Cancer slope factor, mutagen yes/no | Authoritative | 2 | US EPA mid-Atlantic (2018) |
| Germany Permanent Senate Commission for the Investigation of Health Hazards of Chemical Compounds in the Work Area (MAK Commission) | Germany | GHS categories, pregnancy risk groups, Sah/Sh | Authoritative | 4 | Germany MAK Commission (2017) |
| Health Canada Priority Substances Lists (2006) (Carcinogenicity) (via Actor) | Health Canada Priority Substance Lists (2006) (Carcinogenicity) | Presence on List | Screening | 1 | US EPA (2006a) |
| Health Canada Priority Substances Lists (2006) (Reproductive Toxicity) (via Actor) | Health Canada Priority Substance Lists (2006) (Reproductive Toxicity) | Presence on List | Screening | 1 | US EPA (2006b) |
| World Health Organization International Agency for Research on Cancer (IARC) Monographs on the Evaluation of Carcinogenic Risks to Humans | IARC | Cancer categories | Authoritative | 1 | WHO IARC (2018) |
| Integrated Risk Information System (IRIS) (via DSSTOX) | IRIS | Cancer categories | Authoritative | 1 | US EPA (2008) |
| National Institute of Technology and Evaluation (NITE) of Japan GHS Classification Results | Japan | GHS codes | Screening | 16 | NITE of Japan (2018) |
| Department of Occupational Safety and Health Ministry of Human Resources Malaysia Industry Code of Practice on Chemicals Classification and Hazard Communication | Malaysia | GHS codes | Screening | 14 | Department of Occupational Safety and Health Ministry of Human Resources Malaysia (2014) |
| New Zealand Environmental Protection Authority | New Zealand | GHS codes | Screening | 10 | New Zealand Environmental Protection Authority (2018) |
| US National Institute for Occupational Safety and Health (NIOSH) list of potential occupational carcinogens (via ACTOR) | NIOSH list of potential occupational carcinogens | Presence on List | Authoritative | 1 | NIOSH (2012) and US EPA (2018b) |
| California Office of Environmental Health Hazard Assessment Proposition 65 List | Prop 65 | Presence on List | Authoritative | 3 | California (2018) |
| EU European Chemicals Agency (ECHA) Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) Candidate List of Substances of Very High Concern for Authorization | REACH Very High Concern List | Presence on List | Authoritative | 7 | ECHA (2018b) |
| US Department of Health and Human Services National Toxicology Program Report on Carcinogens | Report on Carcinogens | Known Human Carcinogen (KHC) or Reasonably Anticipated to be a Human Carcinogen (RAHC) | Authoritative | 1 | National Toxicology Program (2016) |
| ChemSec Substitute It Now (SIN) List | SIN | Presence on List | Screening | 1 | Chemsec (2018) |
| The Endocrine Disruption Exchange (TEDX) List of Potential Endocrine Disruptors | TEDX | Presence on List | Screening | 1 | TEDX (2018) |
| US EPA Toxicity Estimation Software Tool (T.E.S.T.) Experimental | T.E.S.T. Experimental | Quantitative continuous and binary data | Screening | 6 | US EPA (2016b) |
| US EPA Toxicity Estimation Software Tool (T.E.S.T.) Predicted | T.E.S.T. Predicted | Predicted toxicity values from QSAR models | Predicted | 6 | US EPA (2016b) |
| US EPA Toxic Substances Control Act (TSCA) Work Plan for Chemical Assessments: 2014 Update | TSCA work plan | Categorical data and yes values | Screening | 12 | US EPA (2014) |
| University of Maryland (UMD) List of Acute Toxins, Teratogens, Carcinogens, or Mutagens (via ACTOR) | UMD | Presence on List | Screening | 3 | University of Maryland (2018) |

GHS data from publicly available databases from the European Chemicals Agency (ECHA) and the governments of several countries were included (see Table 1). Records from the Safe Work Australia Hazardous Chemical Information System (HCIS) that had ECHA listed as the source were omitted to avoid redundancy. In addition to H-codes, the National Institute of Technology and Evaluation (NITE) of Japan includes a category of “not classified,” indicating that a chemical does not meet the GHS criteria for being classified as hazardous (i.e., the hazard potential is low), a category of “not classifiable” or “classification not possible,” indicating there is not enough information to classify the chemical, and a category of “not applicable,” indicating the hazard does not apply to the chemical (e.g., inhalation hazard from a chemical that is in the solid phase in the relevant temperature range) (NITE of Japan 2018). During data curation, NITE was found to have incorrectly assigned a category of “not classified” for carcinogenicity for several chemicals that should have been assigned as “not classifiable.” These misclassifications were corrected in the CHA Database. NITE was the only source that included data on the endpoint of neurotoxicity single exposure. In addition to GHS categories, pregnancy risk group and skin sensitization data were obtained from The Permanent Senate Commission for the Investigation of Health Hazards of Chemical Compounds in the Work Area (called the MAK Commission) of Germany (Germany MAK Commission 2017).

Chemicals were classified based on the hazard lists described below and are included in Table 1. Some of the data for these lists were obtained via EPA’s Aggregated Computational Toxicology Online Resource (ACToR), which “aggregates data from thousands of public sources on over 500,000 chemicals” (US EPA 2006a, b, 2018a, b, 2019c).

Categorization of chemicals on the Environment and Climate Change Canada Domestic Substances List (DSL) was completed in 2006 under the requirements of the Canadian Environmental Protection Act (CEPA) (Environment and Climate Change Canada 2006). DSL includes classifications of chemicals for acute aquatic toxicity, persistence, and bioaccumulation. The Health Canada Priority Substances Lists for Carcinogenicity and Reproductive Toxicity identify “substances to be assessed on a priority basis to determine whether they are toxic” under CEPA (Government of Canada 1995). The US National Institute for Occupational Safety and Health (NIOSH) list of potential carcinogens is a list of substances that NIOSH has identified as potential occupational carcinogens (NIOSH 2012). California’s Proposition 65 list includes chemicals the state of California has determined are “known to cause cancer or birth defects or other reproductive harm” (California 2018). The ECHA Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) Candidate List of Substances of Very High Concern (SVHC) for Authorization (ECHA 2018b) is published in accordance with article 59(10) of the REACH Regulation of the European Union (EU) (European Union 2006). The SIN List, produced by the nonprofit ChemSec, “consists of chemicals that have been identified by ChemSec as being SVHCs, based on the criteria defined within REACH” (ChemSec 2018). The Endocrine Disruption Exchange (TEDX) List of Potential Endocrine Disruptors identifies “chemicals that have shown evidence of endocrine disruption in scientific research” (TEDX 2018). The University of Maryland (UMD) List of Acute Toxins, Teratogens, Carcinogens, or Mutagens includes chemicals that meet the UMD definition of these classifications (US EPA 2018a). If the exposure route was not specified for UMD acute toxins, the route was assumed to be oral for classification in the CHA Database.

Data were obtained from the EPA mid-Atlantic Region (Region 3) Human Health Risk-Based Concentrations, which is “a collection of Reference Doses (RfDs) and cancer slope factors and other values developed by Region 3 of the US EPA” (US EPA mid-Atlantic Region 2018). Data were obtained from the World Health Organization (WHO) International Agency for Research on Cancer (IARC) Monographs on the Evaluation of Carcinogenic Risks to Humans, which are critical reviews and evaluations of evidence on the carcinogenicity of substances prepared with the input of international working groups of experts (WHO IARC 2018). The IARC monographs indicate cancer categories for chemicals. Cancer category data were also obtained from the US EPA’s Integrated Risk Information System (IRIS) database (US EPA 2019d). Data were obtained from the US Department of Health and Human Services National Toxicology Program Report on Carcinogens, which is “a congressionally mandated, science-based, public health document” that currently includes “248 listings of agents, substances, mixtures, and exposure circumstances that are known or reasonably anticipated to cause cancer in humans” (National Toxicology Program 2016). Hazard data for 90 chemicals were obtained from the 2014 EPA TSCA Work Plan for Chemical Assessments (US EPA 2014). The TSCA Work Plan and NITE of Japan were the only two sources that included data on the endpoint of neurotoxicity repeat exposure.

Experimental and predicted toxicity values from WebTEST (US EPA 2018g) were included. EPA’s Toxicity Estimation Software Tool (T.E.S.T.) predicts toxicity values and physical properties of chemicals using QSAR models based on the Hierarchical Clustering, Single Model, Group Contribution, Nearest Neighbor, and Consensus methods (Martin 2016).

T.E.S.T. is available as a downloadable software tool (US EPA 2016b) and as part of a Java-based web service called Web-services Toxicity Estimation Software Tool (WebTEST) (US EPA 2018g). Molecular structures for compounds in the CHA Database were obtained from the EPA’s CompTox Chemicals Dashboard, which provides compiled chemistry, toxicity, and exposure data (US EPA 2019a; Williams et al. 2017) and from ChemIDplus, which is a TOXNET database of the U.S. National Library of Medicine (2018). These molecular structures were used to generate WebTEST predictions (using the Consensus method) for the following endpoints: acute toxicity, developmental toxicity, genotoxicity/mutagenicity, acute aquatic toxicity, endocrine disruption, and bioaccumulation (US EPA 2016b). The experimental toxicity data that were used to develop the models are included in WebTEST (US EPA 2018g). The acute mammalian oral toxicity models in WebTEST were based on oral LD50 values for rats from ChemIDplus (U.S. National Library of Medicine 2018). The developmental toxicity models in WebTEST were based on data compiled by Arena et al. (2004). The genotoxicity/mutagenicity models in WebTEST were based on a dataset of Ames Salmonella typhimurium reverse mutation assay results compiled by Hansen et al. (2009). The acute aquatic toxicity models used from WebTEST were based on median 96-h fathead minnow LC50 and 48-h Daphnia magna LC50 data from ECOTOX (US EPA 2016a). Bioaccumulation models in T.E.S.T. were based on bioconcentration factor (BCF) data compiled from several databases (US EPA 2016b). The endocrine disruption models in WebTEST were based on rat estrogen receptor (ER) binding assay data of Tong et al. (2004). Models were developed for ER binding activity (whether the chemical binds to the ER) and relative binding affinity (binding of a chemical to the ER relative to the binding of the endogenous ER ligand, 17β-estradiol (E2), which is set to 100).

Data from The Advisory List for Self-Classification of Dangerous Substances produced by the Ministry of Environment and Food of Denmark, which provides predicted GHS hazard categories based on QSAR models (Ministry of Environment and Food of Denmark 2010), were included in the CHA Database.

Quantitative toxicity data for acute mammalian toxicity were obtained from ChemIDplus, which contains data from more than 100 sources (U.S. National Library of Medicine 2018). Based on the Series 870 Harmonized Health Effects Test Guidelines (US EPA 2018d) and the species included in ChemIDplus (U.S. National Library of Medicine 2018), acute mammalian toxicity lethal dose 50% (LD50) and lethal concentration 50% (LC50) data from rats, mice, rabbits, and guinea pigs were included in the CHA Database. Data from other species and data on other toxicity measures such as the lowest observed lethal dose (LDLo) or the lowest observed lethal concentration (LCLo) were excluded.

Determination of hazard scores from individual sources

Chemical hazard data are in different formats due to being from different types of sources such as GHS H-codes or hazard categories, the presence of a chemical on a hazardous chemical list, quantitative toxicity data, and predicted values based on QSAR models. These heterogeneous hazard data were converted into ordinal hazard scores to enable the comparison of chemicals on a consistent basis. For each data source, hazard information for each chemical was converted into scores of low, medium, high, or very high (L, M, H, or VH, respectively) based on a modified version of the DfE Alternatives Assessment Criteria for Hazard Evaluation (US EPA 2011b). In cases where there were no data available (or the data were insufficient to assign a score), a score of not available (N/A) was assigned.

The CHA Database criteria for converting acute mammalian toxicity data into hazard scores are shown in Table 2. The dictionary for assigning scores for all hazard endpoints is provided in SI Table S3. For acute mammalian toxicity, DfE criteria include quantitative cutoff points, which were derived from GHS criteria. The cutoff points are based on LD50 values for the oral and dermal routes of exposure and LC50 values for the inhalation route of exposure (US EPA 2011b). For example, an oral LD50 less than or equal to 50 mg/kg was categorized as “VH.” The LD50 and LC50 data from ChemIDplus were converted to hazard scores based on these criteria. A hazard endpoint (such as acute mammalian oral toxicity) could have more than one record from ChemIDplus due to the presence of data from more than one species or more than one study.
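The cutoff-based conversion described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class and method names (`OralLd50Scorer`, `score`) are hypothetical, and only the oral LD50 cutoffs listed in Table 2 are used.

```java
// Minimal sketch (hypothetical names): oral LD50 in mg/kg -> ordinal hazard score,
// using the oral cutoffs listed in Table 2 (<= 50, > 50-300, > 300-2000, > 2000).
final class OralLd50Scorer {
    enum Score { L, M, H, VH }

    /** Returns the hazard score for an oral LD50 value expressed in mg/kg. */
    static Score score(double ld50MgPerKg) {
        if (Double.isNaN(ld50MgPerKg) || ld50MgPerKg <= 0) {
            throw new IllegalArgumentException("LD50 must be a positive number");
        }
        if (ld50MgPerKg <= 50) {
            return Score.VH;
        }
        if (ld50MgPerKg <= 300) {
            return Score.H;
        }
        if (ld50MgPerKg <= 2000) {
            return Score.M;
        }
        return Score.L;
    }

    public static void main(String[] args) {
        // Illustrative input values only; they are not taken from any data source.
        double[] examples = {25, 50, 175, 1200, 5000};
        for (double ld50 : examples) {
            System.out.println(ld50 + " mg/kg -> " + score(ld50));
        }
    }
}
```

Because a chemical can have several ChemIDplus records for the same endpoint, a function like this would be applied once per record, with the integration step deciding afterwards which score to keep.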

Table 2.

Criteria for converting acute mammalian toxicity data into hazard scores

| Source | Endpoint | VH | H | M | L | N/A |
| --- | --- | --- | --- | --- | --- | --- |
| DfE criteria | Oral LD50 (mg/kg) | ≤ 50 | > 50–300 | > 300–2000 | > 2000 | |
| | Hazard Code | H300 | H301 | H302 | | |
| ChemIDplus; T.E.S.T. Predicted^a | Oral LD50 (mg/kg)^a | ≤ 50 | > 50–300 | > 300–2000 | > 2000 | |
| Australia; Canada; ECHA CLP; Japan; Malaysia^b | Hazard Code | H300 | H301 | H302 | H303 | |
| Denmark | Category | AcuteTox1 and AcuteTox2 | AcuteTox3 | AcuteTox4 | | |
| New Zealand | Category | Category 6.1A; Category 6.1B | Category 6.1C | Category 6.1D | Category 6.1E | |
TSCA Work Plan Acute toxicity
UMD Acute toxin
^a T.E.S.T. Predicted predicts rat LD50 values. ChemIDplus LD50 values for rats, mice, rabbits, and guinea pigs were included.

^b Japan is the only source that included H303.

For sources that did not include continuous quantitative toxicity data, the CHA Database categorizations were based on GHS H-codes or hazard categories and corresponding DfE criteria. Other categorization schemes were matched as closely as possible to DfE categories. For example, the New Zealand Environmental Protection Authority categories of 6.1A and 6.1B correspond to an oral LD50 of less than or equal to 50 mg/kg, so these categories were classified as VH.
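For categorical sources, the conversion amounts to a dictionary lookup. The sketch below assumes a simple in-memory map for the oral acute toxicity endpoint only; the class name `OralCategoryLookup` and the key strings are hypothetical, while the code-to-score pairs follow Table 2.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal sketch (hypothetical names and key strings): categorical oral acute
// toxicity classifications -> ordinal hazard scores, following Table 2.
final class OralCategoryLookup {
    enum Score { L, M, H, VH }

    private static final Map<String, Score> ORAL_ACUTE = new HashMap<>();

    static {
        // GHS H-codes
        ORAL_ACUTE.put("H300", Score.VH);
        ORAL_ACUTE.put("H301", Score.H);
        ORAL_ACUTE.put("H302", Score.M);
        ORAL_ACUTE.put("H303", Score.L);
        // Denmark predicted GHS categories
        ORAL_ACUTE.put("AcuteTox1", Score.VH);
        ORAL_ACUTE.put("AcuteTox2", Score.VH);
        ORAL_ACUTE.put("AcuteTox3", Score.H);
        ORAL_ACUTE.put("AcuteTox4", Score.M);
        // New Zealand categories
        ORAL_ACUTE.put("6.1A", Score.VH);
        ORAL_ACUTE.put("6.1B", Score.VH);
        ORAL_ACUTE.put("6.1C", Score.H);
        ORAL_ACUTE.put("6.1D", Score.M);
        ORAL_ACUTE.put("6.1E", Score.L);
    }

    /** Returns the mapped score, or an empty Optional (treated as N/A) for unknown codes. */
    static Optional<Score> lookup(String code) {
        if (code == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(ORAL_ACUTE.get(code.trim()));
    }

    public static void main(String[] args) {
        for (String code : new String[] {"H301", "6.1B", "H999"}) {
            System.out.println(code + " -> " + lookup(code).map(Score::name).orElse("N/A"));
        }
    }
}
```

Returning an empty `Optional` for unmapped codes mirrors the N/A score that is assigned when the data are insufficient to determine a score.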

The DfE criteria include classifications for some chemical hazard lists, and these criteria were used in the CHA Database where applicable, with a modification for DfE hazard designations that spanned a range of scores. The DfE criteria translate presence on certain lists into a range of possible scores rather than a single score due to uncertainty in hazard levels. To be conservative and consistent in assigning a single score, if the DfE criteria indicated a range of two possible scores, the higher of the two scores was used as the CHA Database score. For example, the DfE criteria classify presence on the NIOSH Occupational Carcinogen List and the California Proposition 65 list (for carcinogenicity) as VH or H (US EPA 2011b). Based on this range of VH or H, the CHA Database assigned a score of VH if a chemical was present on any list of carcinogens.
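The "higher of the two scores" rule can be expressed compactly if the scores are declared in ascending order of hazard. The following is a minimal sketch with hypothetical names (`RangeToSingleScore`, `conservative`), not the authors' implementation.

```java
// Minimal sketch (hypothetical names): collapse a two-score DfE range into the
// single, more conservative (higher) score.
final class RangeToSingleScore {
    // Declared in ascending order of hazard so that compareTo reflects severity.
    enum Score { L, M, H, VH }

    /** Returns whichever of the two scores indicates the higher hazard. */
    static Score conservative(Score lower, Score upper) {
        return lower.compareTo(upper) >= 0 ? lower : upper;
    }

    public static void main(String[] args) {
        // A list whose DfE designation is "VH or H" receives VH.
        System.out.println(conservative(Score.VH, Score.H));
    }
}
```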

For other endpoints, if the DfE criteria did not include presence on a specific list, then presence on that list was categorized as “H” unless additional information was available. For example, chemicals that were listed by UMD as “acute toxins” were classified as “VH” because the UMD definition of “acute toxin” corresponds to the criteria for H300.

For carcinogenicity, to be consistent with DfE criteria (US EPA 2011b), H350 and corresponding classifications such as IARC Groups 1 and 2A were classified as VH, and H351 and corresponding classifications such as IARC Group 2B were classified as H. In contrast, the GreenScreen criteria for GHS [country] classify H350 as H and H351 as M (Clean Production Action 2018a). If a chemical had a cancer slope factor listed in the EPA mid-Atlantic Region Human Health Risk-Based Concentrations, then that chemical was given a score of VH for carcinogenicity because the determination of a slope factor indicates a cancer risk. Similar to the CHA Database scoring mechanism for carcinogenicity, the CHA Database scoring mechanism for genotoxicity/mutagenicity, based on DfE criteria (US EPA 2011b), differed from GreenScreen’s scoring mechanism (Clean Production Action 2018a).

DfE lists several “Authoritative Lists or Reports That Do Not Include Threshold Levels and Therefore Do Not Correlate with DfE’s Hazard and Potency-Based Criteria” including EU CLP H360 and H361 for reproductive and developmental toxicity (US EPA 2011b). In contrast to DfE, the CHA Database classified H360 as H and H361 as M based on the GHS definitions for these categories.

Combining hazard scores from multiple sources into an integrated score

Hazard scores from different sources may differ from each other. Three methods for combining information from multiple sources into a single integrated hazard score were considered: a trumping method, a weighted average nearest integer method, and a conservative weighted average method. These three methods were assessed by comparing the integrated hazard score results for the 20 chemicals with the most complete hazard data in the database.

A trumping scheme designates a particular order of preference for the use of different types of data sources. The trumping method considered here is a modified version of the trumping scheme from the GreenScreen List Translator, which classifies hazardous chemical lists as “Authoritative” or “Screening” (see GreenScreen v1.4, pages 42–43) (Clean Production Action 2018a). The scores based on Authoritative lists are considered to have a higher level of confidence because these lists “are generated by recognized experts, often as part of a government regulatory process to identify chemicals and known associated hazards” (Clean Production Action 2018a). GreenScreen classified lists as Screening rather than Authoritative if the list had any of the following characteristics: “(1) developed using a less comprehensive review, (2) compiled by an organization that is not considered to be authoritative, (3) developed using predominantly or exclusively estimated data, or (4) developed to identify chemicals for further review and/or testing” (Clean Production Action 2018a).

The trumping method considered here was based on the GreenScreen criteria for Authoritative and Screening lists (Clean Production Action 2018a), with the addition of a third category of Predicted for values predicted from QSAR models. Predicted sources were considered less authoritative than Authoritative and Screening sources. Two data sources were designated as Predicted: T.E.S.T. predicted values (US EPA 2016b) and GHS categories from The Advisory List for Self-Classification of Dangerous Substances (Ministry of Environment and Food of Denmark 2010). The Ministry of Environment and Food of Denmark estimated that the predicted GHS categories are correct for approximately 80% of cases. Therefore, the Danish Environmental Agency recommends that these predicted GHS categories should only be used if a substance does not have an EU-harmonized classification (Ministry of Environment and Food of Denmark 2010). The designation of a Predicted category in the present method aims to be consistent with these recommendations.

For lists that were included in GreenScreen v 1.4, the Authoritative or Screening designations from GreenScreen were used (Clean Production Action 2018a). Lists that were not included in GreenScreen were designated as Screening unless the lists met the criteria for Authoritative or Predicted described above.

ChemIDplus and T.E.S.T. experimental values were designated as Screening because these databases include data from multiple sources and the automation process did not provide the ability to easily go back to review all of the original sources to assess data quality. Previous research has “encountered high rates of inaccuracies and mismapped chemical identifiers” in public domain sources including ChemIDplus (Williams et al. 2017). For chemicals with multiple entries for an endpoint in ChemIDplus, the trumping method effectively selects the data from the study with the highest observed toxicity level.

In descending order, the authority ranking in the present method was: Authoritative, Screening, and Predicted. Within those three levels, the list that produces the highest score takes precedence in the trumping method. Similar to GreenScreen’s trumping method (Clean Production Action 2018a), the present trumping method selects the highest score from the most authoritative source as the integrated score. The trumping level assigned to each source is shown in Table 1. The trumping method is illustrated in Fig. 2.

Fig. 2. Illustration of the trumping method (based on a modified version of the GreenScreen List Translator trumping scheme (Clean Production Action 2018a))
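The trumping rule described above can be sketched in a few lines of Java. This is a minimal illustration, not the code from the authors' repository: the class, enum, and method names are hypothetical, and skipping N/A records when looking for the winning level is an assumption (it is consistent with the aniline carcinogenicity example discussed later, where an Authoritative N/A record does not prevent a score of VH).

```java
import java.util.List;

/** Hypothetical sketch of the trumping rule; all names are illustrative. */
class TrumpingSketch {

    /** Authority levels in descending order of preference, as given in the text. */
    enum Authority { AUTHORITATIVE, SCREENING, PREDICTED }

    /** Ordinal hazard scores; NA marks a record without a usable score. */
    enum Hazard { NA, L, M, H, VH }

    /** One score record, reduced to the two fields the rule needs. */
    static final class Rec {
        final Authority authority;
        final Hazard hazard;

        Rec(Authority authority, Hazard hazard) {
            this.authority = authority;
            this.hazard = hazard;
        }
    }

    /**
     * Returns the highest score found at the most authoritative level that has
     * at least one non-N/A record. Ignoring N/A records is an assumption.
     */
    static Hazard integrate(List<Rec> records) {
        for (Authority level : Authority.values()) {   // enum order = trumping order
            Hazard best = Hazard.NA;
            for (Rec r : records) {
                if (r.authority == level && r.hazard.compareTo(best) > 0) {
                    best = r.hazard;                   // keep the highest score at this level
                }
            }
            if (best != Hazard.NA) {
                return best;                           // this level trumps all lower levels
            }
        }
        return Hazard.NA;                              // no usable record at any level
    }
}
```

Because the enum constants are declared in trumping order, the loop stops at the first level with a usable score, so lower-level records never influence the result.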

As a potential alternative to having one score completely trump the others, another method was developed to include the scores from all available sources in a weighted average. Individual L, M, H, and VH scores were converted into the integers 1, 2, 3, and 4, respectively. The same Authoritative, Screening, and Predicted designations as the trumping method were used for the weighted average method. Authoritative sources were assigned a weight of 10, Screening sources a weight of 5, and Predicted sources a weight of 1. A weighted average is calculated as follows (where w = weight):

\[
\text{score}_{\text{final}} = \frac{\sum_{i=1}^{\#\text{sources}} \text{score}_i \, w_i}{\sum_{i=1}^{\#\text{sources}} w_i} \tag{1}
\]

To convert back to a letter score, the integrated score needs to be rounded to an integer. In the weighted average nearest integer (WANI) method, scorefinal from Eq. 1 was rounded to the nearest integer. For example, using the WANI method, a score of 1.6 would be rounded to 2 and a score of 1.1 would be rounded to 1. In the conservative weighted average (WAC) method, scorefinal from Eq. 1 was rounded up to the next integer. Thus, using the WAC method, both a score of 1.6 and a score of 1.1 would be rounded to 2.
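The two rounding variants can be sketched as follows. The weights (10, 5, 1) and the mapping of L, M, H, and VH to 1 through 4 come from the text; the class and method names are hypothetical, and the handling of exact ties (for example 2.5) is an assumption, since the text does not say how they are resolved. Here `Math.round` rounds halves upward.

```java
/** Hypothetical sketch of the WANI and WAC methods; all names are illustrative. */
class WeightedAverageSketch {

    // Weights by authority level, as stated in the text.
    static final int AUTHORITATIVE = 10;
    static final int SCREENING = 5;
    static final int PREDICTED = 1;

    // Letter scores indexed by (integer score - 1).
    private static final String[] LETTERS = {"L", "M", "H", "VH"};

    /** Eq. 1: sum(score_i * w_i) / sum(w_i), with scores given as integers 1..4. */
    static double weightedAverage(int[] scores, int[] weights) {
        if (scores.length == 0 || scores.length != weights.length) {
            throw new IllegalArgumentException("need one weight per score");
        }
        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < scores.length; i++) {
            numerator += scores[i] * weights[i];
            denominator += weights[i];
        }
        return numerator / denominator;
    }

    /** WANI: round to the nearest integer (ties rounded up, an assumption). */
    static String wani(double average) {
        return LETTERS[(int) Math.round(average) - 1];
    }

    /** WAC: round up to the next integer; an exact integer is left unchanged. */
    static String wac(double average) {
        return LETTERS[(int) Math.ceil(average) - 1];
    }
}
```

With the values from the text, `wani(1.6)` and `wac(1.6)` both map to M, while `wani(1.1)` maps to L and `wac(1.1)` maps to M.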

Storage of hazard data

An object-oriented approach (in Java) was developed to extract hazard data from a variety of sources and store these data in a common format. The code was based on the method described by Wehage et al. for automating the generation of GreenScreen List Translator scores from Japan’s NITE data (Wehage et al. 2017).

For a given source, data for each chemical were stored in a “Chemical” class (see graphical representation in Fig. 1). Each chemical has structural identifiers such as the chemical name, the CAS number, and the chemical structure in terms of a simplified molecular-input line-entry system (SMILES) string or MDL mol file. Each chemical has an array of scores for each toxicity category (e.g., acute mammalian toxicity, carcinogenicity, etc.). Each Score class contains the hazard name, the integrated score (L, M, H, VH, or N/A), and the source of the integrated score. Each Score class contains an array of score records that were used to determine the integrated score. Each score record contains fields such as name (the chemical name from the original source), source (the name of the source), score (L, M, H, VH, or N/A for the score record), category (e.g., Category 1), hazard code (e.g., H300), hazard statement (e.g., “Fatal if swallowed”), rationale (how the score was assigned), route (exposure route), note (additional metadata for the score record), and additional fields (valueMassOperator, valueMass, and valueMassUnits) to store quantitative toxicity data when available.

Fig. 1. Representation of hazard data as Java-based classes
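A minimal sketch of the class layout described above is given below. The field names follow the fields listed in the text, but the exact types, the wrapper class, and the `scoreFor` helper are assumptions made for illustration; the authors' actual classes may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the Chemical / Score / ScoreRecord layout. */
class HazardModelSketch {

    /** One record from one source for one hazard endpoint. */
    static class ScoreRecord {
        String name;              // chemical name as given by the original source
        String source;            // name of the data source
        String score;             // L, M, H, VH, or N/A
        String category;          // e.g., "Category 1"
        String hazardCode;        // e.g., "H300"
        String hazardStatement;   // e.g., "Fatal if swallowed"
        String rationale;         // how the score was assigned
        String route;             // exposure route
        String note;              // additional metadata
        String valueMassOperator; // quantitative toxicity data, when available
        Double valueMass;
        String valueMassUnits;
    }

    /** Integrated score for one hazard endpoint plus its supporting records. */
    static class Score {
        String hazardName;
        String integratedScore = "N/A";
        String integratedScoreSource;
        final List<ScoreRecord> records = new ArrayList<>();
    }

    /** A chemical with identifiers and one Score per hazard endpoint. */
    static class Chemical {
        String name;
        String cas;
        String smiles;
        final List<Score> scores = new ArrayList<>();

        /** Returns the Score for an endpoint, creating it on first use (illustrative helper). */
        Score scoreFor(String hazardName) {
            for (Score s : scores) {
                if (hazardName.equals(s.hazardName)) {
                    return s;
                }
            }
            Score created = new Score();
            created.hazardName = hazardName;
            scores.add(created);
            return created;
        }
    }
}
```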

For a given data source, each Chemical class was stored within a Chemicals object, which was then exported to a JSON text file. The data in the Chemicals object were also exported to a single flat file. The fields in the flat file combine data from the Chemical, Score, and ScoreRecord classes: CAS, name, hazard_name, source, score, route, category, hazard_code, hazard_statement, rationale, note, note2, valueMassOperator, valueMass, and valueMassUnits.
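The flat-file layout can be illustrated with a short sketch that writes one score record as a delimited row. The column names and their order are taken from the list above; the tab delimiter, the class name, and the rule of writing missing values as empty strings are assumptions, because the text does not specify them.

```java
import java.util.Map;

/** Hypothetical sketch of writing one score record as a flat-file row. */
class FlatRowSketch {

    /** Column order as listed in the text. */
    static final String[] COLUMNS = {
        "CAS", "name", "hazard_name", "source", "score", "route", "category",
        "hazard_code", "hazard_statement", "rationale", "note", "note2",
        "valueMassOperator", "valueMass", "valueMassUnits"
    };

    /** Header line for the flat file (tab-delimited by assumption). */
    static String header() {
        return String.join("\t", COLUMNS);
    }

    /** Builds one row; fields absent from the map are written as empty strings. */
    static String toRow(Map<String, String> fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < COLUMNS.length; i++) {
            if (i > 0) {
                row.append('\t');
            }
            String value = fields.get(COLUMNS[i]);
            // Replace embedded tabs so the column count stays fixed.
            row.append(value == null ? "" : value.replace('\t', ' '));
        }
        return row.toString();
    }
}
```

Keeping every row at a fixed column count is what allows the per-source files to be concatenated into one table.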

The flat text files for all the data sources were combined into a single flat text file, which was stored as a single table in an SQLite (SQLite Consortium and Hipp Wyrick & Company Inc. 2018) database for compact storage and easy access. (Hazard score records for a given chemical can be retrieved in a single simple query.) This flat file that comprises the main CHA Database was converted to an Excel spreadsheet so the table can easily be accessed as SI Table S4. Each row in SI Table S4 represents a score record. The Java code to create this database (starting from the raw data from each source) is being made publicly available in a GitHub repository (Martin 2019).

Results and discussion

Compiled hazard information in the CHA database

The CHA Database currently includes information on approximately 85,880 chemicals. The score records in the CHA Database, including records from T.E.S.T., are provided in SI Table S4. The CHA Database provided in SI Table S4 currently contains over 690,000 score records for more than 85,000 chemicals from 23 sources, plus approximately 300,000 additional score records from T.E.S.T. (excluding instances in which T.E.S.T. cannot make a prediction due to the limited applicability domains of models or due to a lack of molecular structure information). Color-coded integrated scores for each chemical (using the trumping method for score record prioritization) for each hazard category are provided in SI Table S5. The results in SI Table S5 are estimates that are subject to change with the addition of more score record sources or with the use of a method other than the trumping method for integrating score data from multiple sources.

The number of unique chemicals (based on CAS numbers) with hazard scores for each hazard endpoint from each data source is shown in SI Table S1. Using multiple sources enabled the CHA Database to have data for more chemicals than it would have had utilizing the data from any individual source. For example, the largest number of chemicals with an acute mammalian toxicity oral score from any individual source was 29,110 chemicals with data from ChemIDplus, but the total number of chemicals with acute mammalian toxicity oral data from at least one source was 44,903 (a 54% increase).

TSCA as amended by LCSA requires the EPA “to designate chemical substances on the TSCA Chemical Substance Inventory as either ‘active’ or ‘inactive’ in U.S. commerce” (US EPA 2018e). The active non-confidential portion of the TSCA Inventory that has been unambiguously mapped to the DSSTOX database (US EPA 2019b) included 18,696 chemicals as of March 2018 (US EPA 2018f). Table 3 shows the availability of information on these 18,696 chemicals in the CHA Database. The endpoints of acute aquatic toxicity, persistence, and bioaccumulation have the most coverage, with data on approximately 40% of these 18,696 chemicals. In contrast, there are data for less than 5% of the chemicals on the endpoints of carcinogenicity, endocrine disruption, reproductive toxicity, developmental toxicity, and neurotoxicity (single and repeat exposure).

Table 3. CHA Database coverage for the active non-confidential portion of the TSCA inventory (n = 18,696)

Endpoint | % Coverage (a)
Human Health Outcomes
Acute Mammalian Toxicity Oral | 18.5
Acute Mammalian Toxicity Inhalation | 6.7
Acute Mammalian Toxicity Dermal | 8.6
Carcinogenicity | 3.6
Genotoxicity Mutagenicity | 9.6
Endocrine Disruption | 2.3
Reproductive | 3.4
Developmental | 4.1
Neurotoxicity Repeat Exposure | 1.5
Neurotoxicity Single Exposure | 1.9
Systemic Toxicity Repeat Exposure | 5.6
Systemic Toxicity Single Exposure | 5.0
Skin Sensitization | 3.6
Skin Irritation | 12.3
Eye Irritation | 13.1
Ecotoxicity
Acute Aquatic Toxicity | 40.3
Chronic Aquatic Toxicity | 9.4
Fate
Persistence | 40.4
Bioaccumulation | 40.1

(a) Omits QSAR predictions from T.E.S.T. and Denmark

Advantages and disadvantages of methods for integrating scores from multiple sources

Curating data for a large number of chemicals and integrating information from multiple sources into one overall hazard score poses challenges. In a review of 20 alternatives assessment frameworks, Jacobs et al. noted that most of these frameworks “offer examples of publicly available resources where information can be collected but do not suggest preferred sources or any data hierarchy wherein certain data types might be considered of higher value than others” (Jacobs et al. 2016). However, the use of different sources can produce different hazard scores for the same hazard endpoint, and methods are needed to determine the most appropriate score.

The results for the 20 chemicals with the most data in the CHA Database for the trumping method are shown in Table 4. The results for these 20 chemicals for the weighted average nearest integer and conservative weighted average methods are shown in SI Tables S6 and S7, respectively. For these 20 chemicals, there were a total of 275 integrated hazard scores compiled from more than one source for which the scores were not all the same and at least two scores were not N/A. Of these 275 integrated scores, 44 scores differed when computed using the weighted average nearest integer method versus the trumping method. The trumping method resulted in a higher score than the weighted average nearest integer method for 32 out of these 44 (73%) integrated scores. Of the 275 integrated scores, 55 scores differed when computed using the conservative weighted average method versus the trumping method. The trumping method resulted in a higher score than the conservative weighted average for 23 out of these 55 (42%) integrated scores.

Table 4. Output* from the CHA Database for the 20 chemicals with the most records

Chemical | Acute Oral Toxicity | Acute Inhalation Toxicity | Acute Dermal Toxicity | Carcinogenicity | Genotoxicity Mutagenicity | Endocrine Disruption | Reproductive | Developmental | Repeat Dose Neurotoxicity | Single Dose Neurotoxicity | Repeat Dose Systemic Toxicity | Single Dose Systemic Toxicity | Skin Sensitization | Skin Irritation | Eye Irritation | Acute Aquatic Toxicity | Chronic Aquatic Toxicity | Persistence | Bioaccumulation
Acrylamide | H | M | M | VH | VH | L | M | H | H | H | H | H | H | H | H | H | N/A | L | L
Trichloroethylene | L | M | L | VH | VH | N/A | H | H | H | H | H | M | H | H | H | M | M | H | L
Phenol | H | H | H | H | H | H | H | H | H | H | M | H | H | VH | VH | H | L | L | L
Formaldehyde | H | H | H | VH | H | H | N/A | L | N/A | N/A | H | M | H | VH | VH | H | L | L | L
Glutaral | H | VH | H | M | H | H | L | L | N/A | H | H | M | H | VH | VH | VH | H | L | L
Hydrazine | H | H | H | VH | VH | N/A | M | M | H | H | H | H | H | VH | VH | VH | VH | L | L
Ethylene Oxide | VH | H | N/A | VH | VH | H | H | H | H | H | H | M | H | H | H | M | L | H | L
Hydrazine hydrate | VH | VH | H | VH | VH | N/A | M | N/A | H | H | H | H | H | VH | VH | VH | VH | N/A | N/A
4,4’-Methylenedianiline | H | N/A | VH | VH | H | L | N/A | H | N/A | H | M | H | H | L | H | L | H | L | L
Sodium dichromate | H | VH | M | VH | VH | N/A | H | H | N/A | N/A | H | H | H | VH | VH | VH | VH | H | N/A
2-Propenenitrile | H | H | H | VH | H | L | H | H | H | H | H | M | H | H | VH | H | H | H | L
Morpholine | M | M | M | N/A | L | L | N/A | L | N/A | N/A | H | H | N/A | VH | VH | L | M | L | L
Ethylene dibromide | H | H | H | VH | H | H | M | H | N/A | H | M | M | N/A | H | H | H | H | H | L
Methyl alcohol | H | H | H | N/A | VH | H | H | H | H | H | H | H | L | N/A | H | L | L | H | L
Hydrofluoric acid | VH | VH | VH | N/A | N/A | N/A | N/A | L | H | N/A | H | H | N/A | VH | VH | L | L | H | L
Glycidol | M | H | M | VH | H | H | H | M | M | H | M | M | N/A | H | H | M | L | L | L
Pentachlorophenol | H | VH | H | VH | L | H | H | H | H | H | H | M | N/A | H | H | VH | VH | H | M
Aniline | H | H | H | VH | H | L | M | L | H | H | H | H | H | H | VH | VH | VH | L | L
Epichlorohydrin | H | VH | H | VH | M | H | H | M | N/A | N/A | H | H | H | VH | VH | M | L | H | L
Potassium dichromate | H | VH | M | VH | VH | N/A | H | H | N/A | H | H | H | H | VH | VH | VH | VH | H | N/A

* Scores were determined using the trumping method

In some cases, the score for the most authoritative source was lower than the scores from several less authoritative sources. For example, for pentachlorophenol, there are six records for acute mammalian toxicity dermal as shown in SI Table S8. Of these records, one is from an authoritative source, ECHA CLP, and the rest are from screening sources. The score for ECHA CLP was H, and the score for one of the screening sources, the Department of Occupational Safety and Health Ministry of Human Resources Malaysia, was H. However, the score for each of the other four screening sources—Canada CNESST, ChemIDplus, NITE of Japan, and the New Zealand Environmental Protection Authority—was VH. Therefore, the trumping method resulted in a score of H because the score from ECHA CLP trumped the scores from other sources, whereas the weighted average nearest integer and the conservative weighted average methods resulted in a score of VH because there were more scores of VH than H.
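The weighted average for this pentachlorophenol example can be checked by hand with Eq. 1, using the weights stated earlier (10 for Authoritative, 5 for Screening) and the integer mapping H = 3 and VH = 4. The record counts are those given in the paragraph above: one Authoritative H, one Screening H, and four Screening VH.

```latex
\[
\text{score}_{\text{final}}
  = \frac{10(3) + 5(3) + 4 \cdot 5(4)}{10 + 5 + 4 \cdot 5}
  = \frac{125}{35}
  \approx 3.57
\]
```

Rounding 3.57 to the nearest integer and rounding it up both give 4, so the weighted average nearest integer and conservative weighted average methods both return VH, whereas the single Authoritative record fixes the trumping result at H.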

Given uncertainty in determining an integrated hazard score, assigning a higher score for an individual chemical will be more protective of public health if policies for that chemical are based on the hazard score. The trumping method results in a higher score more often than the weighted average nearest integer method and therefore may be more protective. For example, for aniline, as shown in SI Table S9, there are 12 records for carcinogenicity, including 4 authoritative scores of VH, 1 authoritative score of H, 1 authoritative score of M, 1 authoritative score of N/A, 2 screening scores of VH, and 3 screening scores of H. The weighted average nearest integer method results in a score of H, whereas the trumping method results in a score of VH. The conservative weighted average method results in a higher score more often than both the weighted average method and the trumping method. Thus, the conservative weighted average might be the most protective for assessing chemicals individually. However, in contrast to assessing the risk of one chemical, comparative chemical hazard assessment and alternatives assessment aim to compare different chemicals with the goal of choosing the best alternative. The conservative weighted average method might result in the loss of information that is relevant for comparison because the scores for too many chemicals might get rounded up, reducing information for distinguishing differences between chemicals.
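The aniline carcinogenicity example can be worked through in the same way. The calculation below assumes that the record with a score of N/A is excluded from the weighted average (the text does not state this explicitly) and uses the mapping M = 2, H = 3, VH = 4 with weights of 10 for Authoritative and 5 for Screening records.

```latex
\[
\text{score}_{\text{final}}
  = \frac{10\,(4 \cdot 4 + 3 + 2) + 5\,(2 \cdot 4 + 3 \cdot 3)}{10 \cdot 6 + 5 \cdot 5}
  = \frac{210 + 85}{85}
  = \frac{295}{85}
  \approx 3.47
\]
```

Under this assumption, rounding to the nearest integer gives 3 (H), which matches the weighted average nearest integer result reported above, while rounding up gives 4 (VH), the same score as the trumping method.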

Using the highest score from the most authoritative source gives the trumping method the advantage of increased confidence in the data source because scores from less authoritative sources will be filtered out. Because the integrated hazard score for the trumping method is determined by the score from one source, the score records can be sorted in trumping order so the source of the integrated hazard score can easily be viewed. Although the source of the integrated score is transparent, a limitation of having one source trump the others is that the most authoritative source might not have easily accessible information on how the individual hazard score was determined. For example, in many cases ECHA CLP was the most authoritative source, but information on how ECHA CLP determined the hazard classification (H-code) was not easily accessible. A further disadvantage of having one score trump the others is that the winning score could be an outlier that differs from most of the scores from other sources. If the integrated score for a chemical is driven by an outlier, then it might not be the most representative score for comparison to other chemicals.

The use of one consistent method for determining the integrated score enables decision makers to have access to hazard scores that remain constant at a given point in time, enabling consistent comparison of chemicals. However, some scores may change over time if the underlying database is updated (such as through the addition of new data sources or the updating of data within existing sources).

The challenges of missing data and uncertainty in hazard scoring

Whittaker and Heine noted that “it is imperative to the integrity” of an alternatives assessment “that all hazard scores are based on sound scientific knowledge and can be properly supported and defended” (Whittaker and Heine 2013). Although sound science is indeed crucial to alternatives assessment, the limited availability of high-quality data can impact the process. The variability in the quality and quantity of data for different chemicals and different endpoints poses a fundamental challenge in comparative chemical hazard assessment due to the differences in the uncertainty underlying the determination of chemical hazard scores.

Some endpoints may have only one data source. For example, NITE of Japan is the only data source in the CHA Database that provides information on neurotoxicity via single exposure. However, Japan’s GHS H-codes may have been determined based on multiple toxicity studies, or multiple underlying data sources, and NITE of Japan’s categorizations generally include explanations of how the category was determined. Still, the lack of additional sources limits the confidence in the neurotoxicity scores.

For endpoints with multiple data sources, the integrated hazard score can be impacted by the amount of variability in scores between data sources. For example, for CAS 79–06-1 (2-propenamide or acrylamide), there are 11 score records for acute mammalian toxicity via the oral route of exposure. Each of these data sources has a score of H. In contrast, for CAS 62–53-3 (aniline), there are 10 score records for acute mammalian toxicity via the oral route of exposure. One of these records has a score of VH, four have a score of H, and five have a score of M, resulting in the trumping method and the weighted average method producing an integrated score of H. Thus, although the integrated scores are the same, the score for aniline has more uncertainty, or a lower level of confidence, than the score for acrylamide.
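One simple way to make this difference in confidence visible is to report the share of usable records that agree with the integrated score. The sketch below is a hypothetical indicator added for illustration only; it is not part of the CHA Database, and the class and method names are assumptions.

```java
import java.util.List;

/** Hypothetical agreement indicator; not part of the CHA Database. */
class AgreementSketch {

    /**
     * Fraction of non-N/A record scores equal to the integrated score.
     * Returns NaN when there are no usable records.
     */
    static double agreement(List<String> recordScores, String integratedScore) {
        int usable = 0;
        int matching = 0;
        for (String score : recordScores) {
            if ("N/A".equals(score)) {
                continue;                 // N/A records carry no score information
            }
            usable++;
            if (score.equals(integratedScore)) {
                matching++;
            }
        }
        return usable == 0 ? Double.NaN : (double) matching / usable;
    }
}
```

For the two examples above, the eleven acrylamide records all match the integrated score of H (agreement of 1.0), whereas only four of the ten aniline records do (agreement of 0.4).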

Another challenge is posed by endpoints that lack available scores from any data sources in the CHA Database. As shown in Table 4, of the 20 chemicals with the most records, 17 have at least one score of N/A. The inclusion of additional data sources might reduce the frequency of scores of N/A, but for many chemicals and endpoints hazard data do not exist in any currently available source. Thus, the CHA Database has two types of missing data: (1) data that are missing from the CHA Database but that exist in sources that are not yet included in the Database and (2) data that do not currently exist anywhere or in any publicly available data source. The latter type of missing data has been described by GreenScreen as a data gap that “indicates that measured data and authoritative and screening lists have been reviewed, and expert judgment and estimation such as modeling and analog data have been applied, and there is still insufficient information to assign a hazard level to an endpoint” (Clean Production Action 2018a). Future updates to the CHA Database may include adding data sources to reduce missing data. However, gaps in existing available chemical hazard data currently remain a problem. NAMs are vital to the process of filling in these data gaps. Therefore, QSAR model predictions from T.E.S.T. for six hazard endpoints are included in the CHA Database, and we are currently developing T.E.S.T. QSAR models for additional endpoints. Predicted toxicity values from these models may be added to the CHA Database in the future to further reduce missing data.

Missing data are an important limitation of alternatives assessment in general. For example, if a chemical of concern has a score of VH for carcinogenicity, a goal of an alternatives assessment might be to find a replacement that is not a potent carcinogen. A chemical with a score of N/A for carcinogenicity might not be an ideal replacement because the carcinogenicity of the potential replacement chemical is unknown and might (or might not) be worse than that of the chemical of concern. However, due to the limited existing chemical hazard data, decision makers might find that carcinogenicity data are not available for any of the chemicals that are being considered as a replacement. Comprehensive automated comparative chemical hazard assessment can highlight areas where additional toxicologic research, including NAMs, is needed to fill in data gaps (Wehage et al. 2017).

Converting scores from multiple sources into a common format poses challenges. The original format of the scores for each data source is shown in SI Table S2. Although the CHA Database aims to be consistent with DfE and consistent across sources in the criteria for determining scores, the determination of the scoring criteria was not straightforward for some sources. For example, Germany’s MAK Commission categorizes developmental toxicity based on pregnancy risk groups (Germany MAK Commission 2017). These categorizations are defined on a different basis than the GHS H-codes and categories provided by many of the other developmental toxicity data sources. The CHA Database aimed to be as consistent as possible in the methods for converting these pregnancy risk groups and other hazard categorizations into ordinal hazard scores.

Comprehensive inclusion of all available high-quality data sources can improve the accuracy of integrated hazard scores. If fewer sources are used or if the sources are not authoritative or not based on studies with sufficient quality assurance, then confidence in the integrated hazard score may be diminished.

The inclusion of different sources and the use of different scoring criteria can result in hazard scores differing between chemical assessment tools. A study that compared results for seven chemicals assessed using a total of eight screening modules from five screening tools—DfE, GreenScreen, GreenWERCS, GreenSuite, and SciVera Lens—found differences in hazard scores between all of the tools (Panko et al. 2017). For example, six out of seven chemicals had inconsistent scores for the endpoint of reproductive toxicity (Panko et al. 2017). In addition to hazard scores for single endpoints, the tools generated an overall hazard score across all endpoints for each chemical, and these overall chemical hazard scores differed between tools (Panko et al. 2017). A lack of consistent hazard scores across comparative chemical hazard assessment tools poses a substantial problem for decision makers because there is not a gold standard to determine which score is likely to be the most accurate estimate of the actual hazard. To adequately protect public health and the environment, an objective scientifically based consistent comparative chemical hazard assessment scoring method is needed. Furthermore, comparative chemical hazard assessment screening tools should be distinguished from in-depth expert manual assessment because some of the differences in results between tools might be due to differences in the purpose of the tools and the extent of manual analysis involved (Panko et al. 2017). The CHA Database is a comparative chemical hazard assessment tool that was developed using automation. Thus, the integrated hazard scores from the CHA Database should not be regarded as final hazard determinations, but rather, as estimates that can be used for purposes such as prioritizing chemicals for in-depth risk assessment.

An important difference between alternatives assessment and risk assessment is that alternatives assessment aims to compare chemicals, whereas risk assessment aims to assess individual chemicals (Whittaker 2015). Thus, uncertainty should be approached differently in each of these types of assessment. The incorporation of safety factors into risk assessment aims to be protective of public health and the environment. In alternatives assessment, such a conservative approach might not always be appropriate. For example, in the comparative chemical hazard component of an alternatives assessment, if all the chemicals under consideration end up being categorized as VH for an endpoint, this information does not help in the selection of the least hazardous chemical from among this group. A scoring method that provides more resolution to distinguish between categories (i.e., a method for which the score distribution is not clustered at one score) would be more appropriate for alternatives assessment. Because hazard scores generated for alternatives assessment have the intended purpose of facilitating the comparison of chemicals relative to other chemicals, alternatives assessment scores should not be assumed to indicate absolute hazard levels for individual chemicals.

Strengths and limitations of the CHA database

Strengths of the CHA Database include that it is a publicly available database that integrates hazard data from multiple different formats into ordinal hazard scores that are based on GHS and EPA DfE criteria. Furthermore, the CHA Database provides hazard scores for several hazard endpoints and enables quick comparison of hazard profiles across large numbers of chemicals simultaneously. The CHA Database keeps hazard endpoints separate rather than generating a single combined hazard score across endpoints, thus avoiding introducing the bias that would be associated with weighting hazard endpoints.

The CHA Database was developed using Java code to automate hazard data integration, which enables the CHA Database to efficiently provide information for a large number of chemicals from a large number of sources. However, the automation of the generation of hazard scores limits the scope of the search for and assessment of available data. Ideally, an alternatives assessment would include a comprehensive search of all available data sources including primary literature and an assessment of the quality assurance for individual studies. The GreenScreen List Translator is only intended for lists and is not as comprehensive as a full GreenScreen assessment, which requires expert manual data review (Clean Production Action 2018a). Thus, Wehage et al. note that the use of automation “is intended to augment, not replace, the role of a GreenScreen Licensed Profiler or GreenScreen Certified Practitioner” (Wehage et al. 2017). Likewise, although the scoring criteria for the CHA Database are based on DfE criteria, the CHA Database does not constitute a full DfE alternatives assessment, which would involve a more in-depth analysis as well as stakeholder participation (US EPA 2011b). Thus, the integrated hazard scores from the CHA Database should not be regarded as final hazard determinations, but rather, as hazard estimates that can provide practitioners with information that can be used for alternatives assessment and the prioritization of chemicals for further evaluation.

Another limitation of the automation of the generation of hazard scores is that records from the same primary source could be duplicated by several secondary sources. For example, examination of the sources for the 20 chemicals with the most records showed there is some duplication between ChemIDplus and GHS data sources. This duplication could impact weighted average scores but does not impact trumping method scores. The use of automation and secondary source data rather than primary source toxicity studies from the literature limits the ability to directly conduct in-depth reviews of the studies. The designation of authority levels for the secondary sources aims to prioritize the use of Authoritative sources that were curated by recognized experts.

The CHA Database includes data from multiple sources including QSAR models, which adds information on hazard endpoints. However, a limitation of the CHA Database is that it currently does not include some of the data sources included in online GreenScreen List Translator Automator tools (Clean Production Action 2018b) such as Pharos (Healthy Building Network 2018, 2019), so some integrated scores might change as additional sources are added. Conversely, the CHA Database includes sources that are not in these tools, such as the quantitative toxicity data from ChemIDplus and predicted values from WebTEST QSAR models. Because the CHA Database and the Java code that was used to create this database are being made publicly available (Martin 2019), users can add data sources and make other modifications to personalize the database for their own purposes. Comparative chemical hazard assessment tools such as the CHA Database may be useful for several purposes. For example, users may be interested in comparing predicted hazard scores for custom-made molecules to hazard scores for other chemicals.

Future research needs

Future plans for the CHA Database include adding data sources such as experimental data from ECOTOX (US EPA 2016a) and ToxRefDB (Knudsen et al. 2009; Martin et al. 2009; Williams et al. 2017), REACH registration data via the International Uniform ChemicaL Information Database (IUCLID) (ECHA and OECD 2018) and the Global Portal to Information on Chemical Substances (eChemPortal) (OECD 2018), the Hazardous Substances Data Bank (HSDB) (US EPA 2018h), and other large databases. Additionally, work currently in progress includes the development of new QSAR models for endpoints such as skin sensitization. These models, along with additional existing models such as persistence models from EPA’s Estimation Program Interface Suite (EPI Suite) (US EPA 2012), will be added to the CHA Database. Future research could assess the extent to which scores generated from the CHA Database differ from scores from tools such as Pharos (Healthy Building Network 2018, 2019). A challenge of chemical hazard assessment is that hazard lists may be updated as additional data become available. Future work will include checking for updates to underlying hazard data.

Perhaps a future version of the CHA Database could include a quantitative measurement of the level of uncertainty underlying each score. Uncertainty and variability have been analyzed for other chemical assessment methods such as fate and exposure models (Hertwich et al. 1999). Faludi et al. have described an alternatives assessment method that generates an overall chemical score across hazard endpoints that includes a quantitative graphical estimate of uncertainty (Faludi et al. 2016). Some chemical assessment software tools include an indicator of uncertainty. For example, the Chemical Life Cycle Collaborative (CLiCC) Tool provides a quantitative estimate of uncertainty for toxicity endpoints and other measures for individual chemicals (Yang et al. 2018). Because the CHA Database determines ordinal hazard scores, the assessment of variability would need to be appropriate for ordinal categorical data. New measures of variability for categorical data have recently been proposed (Allaj 2018).

The CHA Database can be integrated with other EPA cheminformatics tools to increase the scope and applicability of accessible chemical data. For example, the EPA is currently developing an online tool called RapidTox, which will integrate available chemistry, toxicity, and exposure information into workflows to aid in specific decision-making tasks (US EPA 2018c). In the future, methods and data from the CHA Database and RapidTox may be combined to aid in workflows for purposes such as the prioritization of chemicals under TSCA. Additionally, the EPA has recently developed a tool that links the molecular structure of a chemical to information about the synthesis route and manufacturing process for that chemical (Barrett et al. 2019). The CHA Database can be integrated with this tool and used to aid in identifying less hazardous synthesis routes for the desired chemical.

Conclusions

The CHA Database provides a publicly available user-friendly database for comparative chemical hazard assessment for more than 85,000 chemicals. The CHA Database integrates data from GHS sources, hazardous chemical lists, experimental toxicity values, and QSAR models into ordinal hazard scores. The availability of the CHA Database will enable users to readily compare compiled and integrated chemical hazard data in a central location. Source code is being provided to enable users to update the database or add additional sources. The CHA Database can contribute to improving public health and environmental quality by aiding in chemical management activities such as the selection of less hazardous alternatives or the prioritization of chemicals for further assessment.


Acknowledgements

The authors thank Valery Tkachenko and Jonathan Fox for their assistance with code development and data mining. The authors thank Michael Gonzalez, William Barrett, Richard Judson, and Maureen Gwinn for helpful feedback on earlier drafts of the manuscript, and Sudhakar Takkellapati, Kidus Tadele, Paul Harten, Jane Bare, Grace Patlewicz, and Antony Williams for helpful suggestions and insights on comparative chemical hazard assessment. Dr. Vegosen gratefully acknowledges support by an appointment to the Research Participation Program at the U.S. Environmental Protection Agency (EPA), administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy and EPA.

Footnotes

The views expressed in this journal article are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency.

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s10098-019-01795-w) contains supplementary material, which is available to authorized users.

References

  1. Allaj E (2018) Two simple measures of variability for categorical data. J Appl Stat 45:1497–1516. 10.1080/02664763.2017.1380787
  2. Arena VC, Sussman NB, Mazumdar S, Yu S, Macina OT (2004) The utility of structure–activity relationship (SAR) models for prediction and covariate selection in developmental toxicity: comparative analysis of logistic regression and decision tree models. SAR QSAR Environ Res 15:1–18. 10.1080/1062936032000169633
  3. Barrett WM, Takkellapati S, Tadele K, Martin TM, Gonzalez MA (2019) Linking molecular structure via functional group to chemical literature for establishing a reaction lineage for application to alternatives assessment. ACS Sustain Chem Eng 7:7630–7641. 10.1021/acssuschemeng.8b05983
  4. California (2018) The proposition 65 list https://oehha.ca.gov/proposition-65/proposition-65-list. Accessed 2 Oct 2018
  5. Chemhat.org (2018) Chemical hazards and alternatives Toolbox http://www.chemhat.org/en. Accessed 24 Sep 2018
  6. Chemsec (2018) Substitute it now (SIN) list http://sinlist.chemsec.org/search/searchall. Accessed 4 Oct 2018
  7. Clean Production Action (2018a) GreenScreen for safer chemicals hazard assessment guidance; Version 1.4, January 2018
  8. Clean Production Action (2018b) Greenscreen list translator™ https://www.greenscreenchemicals.org/learn/greenscreen-list-translator. Accessed 24 Sep 2018
  9. CNESST (2015) WHMIS 2015 classification http://www.csst.qc.ca/en/prevention/reptox/Pages/list-whmis-2015-a.aspx Accessed 2 Oct 2018
  10. Department of Occupational Safety and Health Ministry of Human Resources Malaysia (2014) Industry code of practice on chemicals classification and hazard communication http://www.dosh.gov.my/index.php/en/list-of-documents/osh-info/chemical-management-1/2217-industry-code-of-practice-on-chemicals-classification-and-hazard-communication-2014-pdf/file. Accessed 3 Oct 2018
  11. ECHA (2018a) Classification labeling and packaging (CLP) Annex VI https://echa.europa.eu/information-on-chemicals/annex-vi-to-clp. Accessed 2 Oct 2018
  12. ECHA (2018b) Registration, evaluation, authorization and restriction of chemicals (REACH) candidate list of substances of very high concern for authorization https://echa.europa.eu/candidate-listtable. Accessed 4 Oct 2018
  13. ECHA and OECD (2018) International uniform chemical information database (IUCLID) https://iuclid6.echa.europa.eu/. Accessed 16 Oct 2018
  14. Environment and Climate Change Canada (2006) Domestic substance list (DSL) categorizations. Document available upon request from: eccc.substances.eccc@canada.ca
  15. European Union (2006) Regulation (EC) No. 1907/2006 of the european parliament and of the council concerning the registration, evaluation, authorisation and restriction of chemicals (REACH) and establishing a european chemicals agency
  16. Faludi J, Hoang T, Gorman P, Mulvihill M (2016) Aiding alternatives assessment with an uncertainty-tolerant hazard scoring method. J Environ Manag 182:111–125. 10.1016/j.jenvman.2016.07.028
  17. GAO (2005) Chemical regulation: options exist to improve EPA’s ability to assess health risks and manage its chemical review program GAO-05–458
  18. Germany MAK Commission (2017) List of MAK and BAT values 2017: permanent senate commission for the investigation of health hazards of chemical compounds in the work area; Report 53
  19. Government of Canada (1995) Canadian environmental protection act: priority substances list https://www.canada.ca/en/environment-climate-change/services/canadian-environmental-protection-actregistry/substances-list/priority-list.html. Accessed 11 Oct 2018
  20. Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, Steger-Hartmann T, Heinrich N, Müller KR (2009) Benchmark data set for in silico prediction of Ames mutagenicity. J Chem Inf Model 49:2077–2081. 10.1021/ci900161g
  21. Healthy Building Network (2018) Pharos chemical and material library (CML) full system description, June 2018. https://www.pharosproject.net/uploads/files/library/Pharos_CML_System_Description.pdf. Accessed 9 Sep 2019
  22. Healthy Building Network (2019) Pharos https://pharosproject.net. Accessed 9 Sept 2019
  23. Hertwich EG, McKone TE, Pease WS (1999) Parameter uncertainty and variability in evaluative fate and exposure models. Risk Anal 19:1193–1204. 10.1111/j.1539-6924.1999.tb01138.x
  24. Jacobs MM, Malloy TF, Tickner JA, Edwards S (2016) Alternatives assessment frameworks: research needs for the informed substitution of hazardous chemicals. Environ Health Perspect 124:265–280. 10.1289/ehp.1409581
  25. Kavlock R, Chandler K, Houck K, Hunter S, Judson R, Kleinstreuer N, Knudsen T, Martin M, Padilla S, Reif D, Richard A, Rotroff D, Sipes N, Dix D (2012) Update on EPA’s ToxCast program: providing high throughput decision support tools for chemical risk management. Chem Res Toxicol 25:1287–1302. 10.1021/tx3000939
  26. Knudsen TB, Martin MT, Kavlock RJ, Judson RS, Dix DJ, Singh AV (2009) Profiling the activity of environmental chemicals in prenatal developmental toxicity studies using the U.S. EPA’s ToxRefDB. Reprod Toxicol 28:209–219. 10.1016/j.reprotox.2009.03.016
  27. Martin MT, Mendez E, Corum DG, Judson RS, Kavlock RJ, Rotroff DM, Dix DJ (2009) Profiling the reproductive toxicity of chemicals from multigeneration studies in the toxicity reference database. Toxicol Sci 110:181–190. 10.1093/toxsci/kfp080
  28. Martin TM (2016) User’s guide for T.E.S.T. (version 4.2) (Toxicity Estimation Software Tool) https://www.epa.gov/sites/production/files/2016-05/documents/600r16058.pdf
  29. Martin TM (2019) Github Repository: GHS data gathering https://github.com/tmarti02/ghs-data-gathering. Accessed 18 Nov 2019
  30. Ministry of Environment and Food of Denmark (2010) Advisory list for self-classification of dangerous substances https://eng.mst.dk/chemicals/chemicals-in-products/assessment-of-chemicals/the-advisory-list-for-self-classification-of-hazardous-substances/. Accessed 9 Aug 2019
  31. National Toxicology Program (2016) 14th Report on carcinogens https://ntp.niehs.nih.gov/pubhealth/roc/index-1.html. Accessed 4 Oct 2018
  32. New Zealand Environmental Protection Authority (2018) Chemical classification and information database (CCID) https://www.epa.govt.nz/database-search/chemical-classification-and-information-database-ccid/. Accessed 4 Oct 2018
  33. NIOSH (2012) Occupational cancer carcinogen list https://www.cdc.gov/niosh/topics/cancer/npotocca.html. Accessed 4 Oct 2018
  34. NITE of Japan (2018) GHS classification results http://www.safe.nite.go.jp/english/ghs/all_fy_e.html. Accessed 3 Oct 2018
  35. NRC (1984) Toxicity testing: strategies to determine needs and priorities. The National Academies Press, Washington, D.C. 10.17226/317
  36. NRC (2007) Toxicity testing in the 21st Century: a vision and a strategy. The National Academies Press, Washington, D.C. 10.17226/11970
  37. NRC (2014) A framework to guide selection of chemical alternatives. The National Academies Press, Washington, D.C. 10.17226/18872
  38. OECD (2013) Current landscape of alternatives assessment practice: a meta-review http://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=ENV/JM/MONO%282013%2924&docLanguage=En. Accessed 22 Nov 2019
  39. OECD (2018) The global portal to information on chemical substances (eChemPortal) https://www.echemportal.org/echemportal/index.action. Accessed 16 Oct 2018
  40. Panko J, Hitchcock K, Fung M, Spencer P, Kingsbury T, Mason A (2017) A comparative evaluation of five hazard screening tools. Integr Environ Assess Manag 13:139–154. 10.1002/ieam.1757
  41. Public Law 114–182; 15 USC 2601 Frank R. Lautenberg Chemical Safety for the 21st Century Act (2016)
  42. Richard A, Judson RS, Houck KA, Grulke CM, Volarath P, Thillainadarajah I, Yang C, Rathman J, Martin MT, Wambaugh JF, Knudsen TB, Kancherla J, Mansouri K, Patlewicz G, Williams AJ, Little SB, Crofton KM, Thomas RS (2016) ToxCast chemical landscape: paving the road to 21st century toxicology. Chem Res Toxicol 29:1225–1251. 10.1021/acs.chemrestox.6b00135
  43. Safe Work Australia (2018) Hazardous chemical information system (HCIS) http://hcis.safeworkaustralia.gov.au/HazardousChemical. Accessed 2 Oct 2018
  44. SQLite Consortium and Hipp Wyrick & Company Inc. (2018) SQLite https://www.sqlite.org/index.html. Accessed 14 Nov 2018
  45. TEDX (2018) The TEDX List of potential endocrine disruptors https://endocrinedisruption.org/interactive-tools/tedx-list-of-potentialendocrine-disruptors/search-the-tedx-list. Accessed 4 Oct 2018
  46. Tong W, Fang H, Hong H, Xie Q, Perkins R, Sheehan DM (2004) Receptor-mediated toxicity: QSARs for estrogen receptor binding and priority setting of potential estrogenic endocrine disruptors. In: Cronin MT, Livingstone DJ (eds) Predicting chemical toxicity and fate. CRC Press, Boca Raton
  47. Toxnot PBC (2018). https://toxnot.com/. Accessed 24 Sept 2018
  48. UN (2017) Globally harmonized system of classification and labelling of chemicals (GHS). Rev 7
  49. University of Maryland (2018) List of Acute Toxins, Teratogens, Carcinogens, or Mutagens (via ACToR) https://actor.epa.gov/actor/assay.xhtml?assayId=2198. Accessed 4 Oct 2018
  50. US EPA (2006a) ACToR link to Health Canada Priority Substance Lists (2006) (Carcinogenicity) https://actor.epa.gov/actor/assay.xhtml?assayId=1558. Accessed 20 Nov 2019
  51. US EPA (2006b) ACToR link to Health Canada Priority Substance Lists (2006) (Reproductive Toxicity) https://actor.epa.gov/actor/assay.xhtml?assayId=1561. Accessed 20 Nov 2019
  52. US EPA (2008) DSSTox EPA integrated risk information system structure-index locator file: SDF file and documentation https://cfpub.epa.gov/si/si_public_record_report.cfm?dirEntryId=186904&Lab=NCCT. Accessed 10 Dec 2019
  53. US EPA (2011a) Comments on the design for the environment (DfE) program alternatives assessment criteria for hazard evaluation https://www.epa.gov/sites/production/files/2014-01/documents/aa_criteria_comments.pdf
  54. US EPA (2011b) Design for the environment program alternatives assessment criteria for hazard evaluation Version 2.0 https://www.epa.gov/sites/production/files/2014-01/documents/aa_criteria_v2.pdf. Accessed 20 Nov 2019
  55. US EPA (2012) Estimation program interface suite (EPI Suite) Version 4.11. syracuse research corporation https://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface. Accessed 16 Oct 2018
  56. US EPA (2014) TSCA Work plan for chemical assessments: 2014 update https://www.epa.gov/assessing-and-managing-chemicals-under-tsca/tsca-work-plan-chemical-assessments-2014-update. Accessed 4 Oct 2018
  57. US EPA (2016a) ECOTOX Knowledgebase https://cfpub.epa.gov/ecotox/. Accessed 22 July 2016
  58. US EPA (2016b) T.E.S.T. Version 4.2 http://www2.epa.gov/chemical-research/toxicity-estimation-software-tool-test. Accessed 20 Nov 2019
  59. US EPA (2018a) ACToR link to Chemical substances that meet the university of maryland definition of an acute toxin, teratogen, carcinogen, or mutagen https://actor.epa.gov/actor/collection.xhtml?dataCollectionId=2111. Accessed 18 Dec 2018
  60. US EPA (2018b) ACToR link to NIOSH list of potential occupational carcinogens https://actor.epa.gov/actor/assay.xhtml?assayId=1936. Accessed 4 Oct 2018
  61. US EPA (2018c) RapidTox Dashboard https://www.epa.gov/chemical-research/rapidtox-dashboard. Accessed 19 Dec 2018
  62. US EPA (2018d) Series 870 Health Effects Test Guidelines https://www.epa.gov/test-guidelines-pesticides-and-toxic-substances/series-870-health-effects-test-guidelines
  63. US EPA (2018e) TSCA Inventory Notification (Active-Inactive) Rule https://www.epa.gov/tsca-inventory/tsca-inventory-notification-active-inactive-rule. Accessed 20 Nov 2019
  64. US EPA (2018f) TSCA Inventory, active non-confidential portion https://comptox.epa.gov/dashboard/chemical_lists/tscaactivenonconf. Accessed 7 Nov 2018
  65. US EPA (2018g) WebTEST https://comptox.epa.gov/dashboard/predictions/index
  66. US EPA (2018h) A working approach for identifying potential candidate chemicals for prioritization https://www.epa.gov/sites/production/files/2018-09/documents/preprioritization_white_paper_9272018.pdf. Accessed 10 Oct 2018
  67. US EPA (2019a) CompTox chemicals dashboard https://www.epa.gov/chemical-research/comptox-chemicals-dashboard. Accessed 3 Sep 2019
  68. US EPA (2019b) Distributed structure-searchable toxicity (DSSTox) database https://www.epa.gov/chemical-research/distributed-structure-searchable-toxicity-dsstox-database. Accessed 20 Nov 2019
  69. US EPA (2019c) ACToR (Aggregated Computational Toxicology Resource) https://actor.epa.gov. Accessed 20 Nov 2019
  70. US EPA (2019d) Integrated risk information system (IRIS) https://www.epa.gov/iris. Accessed 20 Nov 2019
  71. US EPA mid-Atlantic Region (2018) Human health risk-based concentrations https://actor.epa.gov/actor/assay.xhtml?assayId=1092. Accessed 2 Oct 2018
  72. U.S. National Library of Medicine (2018) ChemIDplus. https://chem.nlm.nih.gov/chemidplus/. Accessed 2 Oct 2018
  73. Wehage K, Chenhansa P, Schoenung JM (2017) An open framework for automated chemical hazard assessment based on GreenScreen for Safer Chemicals: a proof of concept. Integr Environ Assess Manag 10.1002/ieam.1763
  74. Whittaker MH (2015) Risk assessment and alternatives assessment: comparing two methodologies. Risk Anal 35:2129–2136. 10.1111/risa.12549
  75. Whittaker MH, Heine LG (2013) Chemical alternatives assessment (CAA): tools for selecting less hazardous chemicals. In: Hester RE, Harrison R (eds) Chemical Alternatives Assessments. Issues in Environmental Science and Technology, Burlington, pp 1–43
  76. WHO IARC (2018) IARC monographs on the evaluation of carcinogenic risks to humans https://monographs.iarc.fr/list-of-classifications. Accessed 9 Aug 2019
  77. Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC, Patlewicz G, Shah I, Wambaugh JF, Judson RS, Richard AM (2017) The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. J Cheminformatics 10.1186/s13321-017-0247-6
  78. Yang Y, Tao M, Suh S (2018) Geographic variability of agriculture requires sector-specific uncertainty characterization. Int J Life Cycle Assess 23:1581–1589. 10.1007/s11367-017-1388-6
