Abstract
Background:
Systematic evaluation of literature data on the cancer hazards of human exposures is an essential process underlying cancer prevention strategies. The scope and volume of evidence for suspected carcinogens can range from very few to thousands of publications, requiring a complex, systematically planned, and critical procedure to nominate, prioritize and evaluate carcinogenic agents. To aid in this process, database fusion, cheminformatics and text mining techniques can be combined into an integrated approach to inform agent prioritization, selection, and grouping.
Results:
We have applied these techniques to agents recommended for the IARC Monographs evaluations during 2020–2024. An integration of PubMed filters covering cancer epidemiology and the key characteristics of carcinogens, chemical lists from 34 databases relevant for cancer research, chemical structure grouping and literature-based clustering was applied to 119 agents recommended by an advisory group for future IARC Monographs evaluations. The approach also facilitated a rational grouping of these agents and aided in understanding the volume and complexity of relevant information, as well as important gaps in coverage of the available studies on cancer etiology and carcinogenesis.
Conclusion:
A new data-science approach has been applied to diverse agents recommended for cancer hazard assessments, and its applications for the IARC Monographs are demonstrated. The prioritization approach has been made available at www.cancer.idsl.me for ranking cancer agents.
Keywords: IARC Monographs, Text mining, Hazard identification, Database fusion, Chemoinformatics
1. Introduction
Cancer is largely a preventable environmental disease, and understanding the collective or individual roles of exposure agents (the ‘exposome’) in causation of the disease is a key step toward prevention (Saracci and Wild 2016; Vermeulen et al. 2020; Wild 2005). An essential component of cancer prevention strategies is identifying carcinogenic agents (defined broadly as chemical, physical, or biological agents, complex mixtures or exposure circumstances, and other exposures of everyday life) using epidemiological studies of cancer, experimental animal bioassay studies, as well as mechanistic research studies. However, complexities, scale, agent diversity, evidence strength and generalization of findings differ across the thousands of studies conducted each year relevant to carcinogen hazard identification (Cogliano et al. 2004). To address this challenge, the IARC Monographs on the Identification of Carcinogenic Hazards to Humans have recently updated and enhanced the robustness of the methodology for evaluation of the available information to reach uniform conclusions regarding the carcinogenic hazard of a given agent (Samet et al. 2020). This method, published as a Preamble to every Monographs Volume since 1971, provides the guidelines for systematic evaluations of carcinogenic hazard by expert working groups, which meet to evaluate selected agents (IARC 2019). The review is conducted by multidisciplinary and international working groups of global leading experts, who are free from conflicts of interest. The working groups assemble and evaluate the available information and data on 1) human exposure, 2) cancer epidemiology evidence, 3) animal bioassays on carcinogenicity, and 4) mechanistic evidence built on the recently published ‘key characteristics of carcinogens’ (KCs) (Guyton et al. 2018a; Smith et al. 2016; Smith et al. 2020). Agents are assigned a carcinogenicity assessment after a thorough review and consensus process.
Through the history of the program, IARC has evaluated more than 1020 agents and shared the results with the global health community. National and international authorities and organizations use this information on causes of cancer to support actions to reduce human exposure to agents classified by IARC working groups as carcinogenic, probably carcinogenic, or possibly carcinogenic to humans.
The IARC Monographs process for identifying topics consists of five major phases (Fig. 1): 1) IARC receives nominations from the general public, the scientific community, national and international health agencies, and other organizations; 2) IARC sorts those nominations into different categories by function or origin; 3) an independent advisory group is convened by IARC to review the nominations and recommend priorities for a five-year period; 4) IARC groups the priorities for evaluation in future meetings; and 5) IARC organizes regular expert working group meetings to evaluate these priorities and shares the results through its list of classifications and by publishing fully referenced Monographs summarizing the resulting classifications.
Fig. 1.

An overview of IARC Monographs agent identification and selection process.
IARC periodically convenes independent advisory groups to recommend priorities for the IARC Monographs to ensure that the program’s future evaluations reflect the current state of scientific evidence relevant to carcinogenicity. The Advisory Group to Recommend Priorities for the IARC Monographs during 2020–2024 recommended agents for evaluation with high, medium or low priority on the basis of (i) evidence of human exposure and (ii) evidence or suspicion of carcinogenicity (for agents not previously evaluated) or evidence that may lead to a change in classification (for agents previously evaluated) (IARC Monographs Priorities Group, 2019). Agents not meeting these criteria were not recommended for evaluation.
Cancer hazard assessment of an agent relies on information available in published literature, authoritative reports, government registries and toxicological databases. Experts and data analysts manually mine through these resources to extract a range of information about an agent’s carcinogenic potential. This manual process can have limitations for understanding the type and extent of available information across a large number of agents prioritized for evaluation. Fortunately, the main literature database NCBI PubMed can be queried automatically using R scripts for hundreds of keyword searches. Such queries provide counts of papers in the PubMed database for each executed search, and provide an indication of the volume of published literature available for each agent on the topics relevant to cancer hazard identification, including cancer epidemiology, cancer bioassays and mechanistic information regarding key characteristics of carcinogens (Guyton et al. 2018b). We have demonstrated application of such an approach for prioritizing pesticides for IARC Monographs evaluation by the Advisory Group for 2015–2019 (Guha et al. 2016). Recently, the approach of chemical text mining has been extended to construct a database of blood-related compounds by mining over 800,000 PubMed abstracts (Barupal and Fiehn 2019). In order to further improve the approach, text mining can be complemented by assessing the coverage of agents in toxicology and bioinformatics databases. Additionally, toxicological databases provide their chemical agent lists, which can be cross-referenced using a chemical identifier such as the Chemical Abstracts Service (CAS) number. Such a combination of text mining and database fusion forms a promising complementary approach for prioritizing agents, especially chemicals, for cancer hazard assessments by expert working groups.
This report describes an application of a text mining and database fusion approach to supplement reviews by the IARC Monographs Advisory Group for 2020–2024 for assigning priorities to the nominated agents.
2. Materials and methods
2.1. Agent nominations and priorities
IARC solicited nominations of agents via the website of the IARC Monographs and IARC RSS news feed, and through direct contact with the IARC Governing Council and members of the IARC Scientific Council, WHO headquarters and regional offices, and previous participants of Monographs Working Groups. Nominations were also developed by IARC scientists and included the recommended priorities remaining from a similar Advisory Group meeting convened in 2014 (Straif et al. 2014). Submissions could include chemicals, mixtures, physical and biological agents, and other non-chemical agents including complex exposures. The nominated agents were compiled into a single list of unique agents prior to the Advisory Group meeting. Agents’ names and identities were then mapped to standardized scientific concepts and terms using the medical subject heading database. Then, agents were classified into categories as per their source and nature of the exposure. Agents already covered in previous IARC Monographs classifications were identified by matching their CAS number (or name, for non-chemical agents) in the IARC list. IARC convened an independent advisory group of 29 scientists from 18 countries in Lyon, France, on 25–27 March 2019. According to a standardized review procedure (IARC, 2019), this expert panel provided four priority recommendations: 1) High and ready for evaluation within 2.5 years; 2) High and ready for evaluation within 5 years; 3) Medium; and 4) Low.
2.2. Text mining
This step involves a systematic mining of the biomedical literature with a set of NCBI PubMed search filters to automatically retrieve articles aligned with the IARC Monographs classification system. Query patterns were created in consultation with the IARC knowledge manager/librarian for ten literature searches which covered cancer (comprising studies in humans and in experimental animals) and key characteristics of carcinogens (Guyton et al. 2018a; Guyton et al. 2018b; Ward et al. 2019) (Fig. 3 and Table 1). These filters, covering ‘cancer all’ (Q1), ‘cancer epidemiology’ (Q2), ‘key characteristics of carcinogens’ KC1–3 (Q3) and seven other key characteristics (Q4-Q10) of carcinogens (Smith et al. 2016), were systematically and automatically searched in the PubMed database in combination with search filters for each agent. Before their implementation, these search filters were also endorsed by the members of the Advisory Group.
Fig. 3.

Scope of literature queries. Search filters Q3-Q10 (red) refer to the key characteristics of carcinogens (Smith et al. 2016). See Table 1 for search terms for these filters. KC 1 to 3 were combined into one filter (Q3) because of an overlap of MeSH and keyword terms; it includes KC-1: Is electrophilic or can be metabolically activated to an electrophile, KC-2: Is genotoxic and KC-3: Alters DNA repair or causes genomic instability.
Table 1.
Comprehensive queries to capture cancer epidemiological and cancer mechanism literature on the PubMed database.
| Query ID | Description | Modified Search Pattern |
|---|---|---|
| Query 1 | Cancer all | (neoplasm*[All Fields] OR carcinogen*[All Fields] OR malignan*[All Fields] OR tumor[All Fields] OR tumors[All Fields] OR tumour[All Fields] OR tumours[All Fields] OR cancer[All Fields] OR cancers[All Fields]) |
| Query 2 | Cancer epidemiology (new on 23 October 2019) | ((neoplasm*[All Fields] OR carcinogen*[All Fields] OR malignan*[All Fields] OR tumor[All Fields] OR tumors[All Fields] OR tumour[All Fields] OR tumours[All Fields] OR cancer[All Fields] OR cancers[All Fields]) AND (cohort [All Fields] OR case*[All Fields] OR epidemiolog*[All Fields] OR Epidemiology[Mesh] OR “Epidemiologic Studies”[Mesh] OR “Occupational Exposure”[Mesh] OR workers[All Fields])) |
| Key characteristics of carcinogens | | |
| Query 3 | 1- Is electrophilic or can be metabolically activated to an electrophile; 2- Is genotoxic; 3- Alters DNA repair or causes genomic instability | (“Mutation” [Mesh] OR “Cytogenetic Analysis”[Mesh] OR “Mutagens” [Mesh] OR “Oncogenes”[Mesh] OR “Genetic Processes”[All Fields] OR “genomic instability”[MesH] OR chromosom* [All Fields] OR clastogen*[All Fields] OR “genetic toxicology”[All Fields] OR “strand break”[All Fields] OR “unscheduled DNA synthesis”[All Fields] OR “DNA damage”[All Fields] OR “DNA adducts”[All Fields] OR “SCE”[All Fields] OR “chromatid”[All Fields] OR micronucle*[All Fields] OR mutagen*[All Fields] OR “DNA repair”[All Fields] OR “UDS”[All Fields] OR “DNA fragmentation”[All Fields] OR “DNA cleavage”[All Fields]) |
| Query 4 | 4- Induces epigenetic alterations | (“rna”[MeSH] OR “epigenesis, genetic”[MesH] OR rna OR “rna, messenger”[MeSH] OR “rna”[All Fields] OR “messenger rna”[All Fields] OR mrna[All Fields] OR “histones”[MeSH] OR histones[All Fields] OR epigenetic[All Fields] OR miRNA[All Fields] OR methylation [All Fields]) |
| Query 5 | 5- Induces oxidative stress | (“reactive oxygen species”[MeSH Terms] OR “reactive oxygen species”[All Fields] OR “oxygen radicals”[All Fields] OR “oxidative stress”[MeSH Terms] OR “oxidative”[All Fields] OR “oxidative stress”[All Fields] OR “free radicals”[All Fields]) |
| Query 6 | 6- Induces chronic inflammation | ((chronic[All Fields] AND “inflammation”[MeSH Terms]) OR (chronic inflamm*[All Fields])) |
| Query 7 | 7- Is immunosuppressive | (Immunosuppression[MH] OR “Killer Cells, Natural”[MH] OR “CD4-Positive T- Lymphocytes”[MH] OR immunosuppress*[tw] OR immune response*[tw] OR immune function*[tw] OR “immune status”[tw] OR “immune state*”[tw] OR “immune competence”[tw] OR “immune impairment”[tw] OR “immune dysregulation”[tw] OR “humoral immunity”[tw] OR “cell-mediated immunity”[tw] OR NK[tw] OR “Natural Killer”[tw] OR CD4[tw] OR “T4 Cell*”[tw] OR T4 Lymphocyte[tw]) |
| Query 8 | 8- Modulates receptor-mediated effects | (“Androgen Antagonists”[Mesh: NoExp] OR “Androgen Receptor Antagonists”[Mesh:NoExp] OR “Estrogen Antagonists”[MH] OR “Estrogen Receptor Modulators”[MH:NoExp] OR “Gonadal Hormones”[MH] OR “Thyroid Hormones”[MH] OR “Endocrine Disruptors”[MH] OR “Receptors, Steroid”[MH] OR “Receptors, Cytoplasmic and Nuclear”[MH] OR “Receptors, Aryl Hydrocarbon”[MH] OR Androgen* [tw] OR Estradiol[tw] OR Estrogen* [tw] OR Progesterone[tw] OR Testosterone[tw] OR thyroid[tw] OR “Endocrine disrupt*”[tw] OR “Peroxisome Proliferator-Activated Receptor”[tw] OR PPAR[tw] OR “constitutive androstane receptor”[tw] OR “farnesoid X- activated receptor”[tw] OR “liver X receptor”[tw] OR “Retinoid X receptor”[tw] OR “Aryl hydrocarbon receptor”[tw] OR “Ah receptor”[tw]) |
| Query 9 | 9- Causes immortalization | (“Cell Transformation, Neoplastic”[MH:NoExp] OR “Cell Transformation, Viral”[MH] OR Telomere[MH] OR “Telomere Shortening”[MH] OR “Telomere Homeostasis”[MH] OR “cell transformation”[tw] OR “tumorigen transformation”[tw] “tumorigenic transformation”[tw] OR “neoplastic transformation”[tw] OR “carcinogen transformation”[tw] OR “carcinogenic transformation”[tw] OR “viral transformation”[tw] OR immortalization[tw] OR Telomer* [tw]) |
| Query 10 | 10- Alters cell proliferation, cell death, or nutrient supply | (“Cell Proliferation”[MH] OR “DNA Replication”[MH] OR “Cell Cycle”[MH] OR Hyperplasia[MH] OR Metaplasia[MH:NoExp] OR “Neovascularization, Pathologic”[MH:NoExp] OR Apoptosis[MH] OR “Angiogenesis Modulating Agents”[MH:NoExp] OR “Angiogenesis Inducing Agents”[MH] OR “Heat-Shock Proteins”[MH] OR “Extracellular Matrix”[MH:NoExp] OR “Cell proliferation”[tw] OR “Cellular proliferation”[tw] OR “Cell multiplication”[tw] OR “Cell division”[tw] OR “Proliferative activity”[tw] OR “Sustained proliferation”[tw] OR “DNA replication”[tw] OR “DNA synthesis”[tw] OR “tumor growth”[tw] OR “neoplastic growth”[tw] OR “malignant growth”[tw] OR Hyperplasia[tw] OR Metaplasia[tw] OR “Apoptosis inhibition”[tw] OR Angiogenesis [tw] OR “heat shock protein”[tw] OR “extracellular matrix”[tw]) |
For agents (both chemical and non-chemical) that had been linked with a medical subject heading (MeSH), the MeSH term was included in the search; otherwise, the provided name was used for executing searches using the NCBI Entrez web-services (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi). For chemical agents, the MeSH term or CAS Registry Number (CASRN) was used. In order to generate a literature search result matrix, we constructed a URL for each PubMed query and submitted it to the NCBI Entrez web-services to identify the number of papers retrieved by each search filter for each agent. For each query, we used the RCurl package in R to communicate with the NCBI Entrez API and retrieve the query results. Queries were not restricted by language or publication type. As we were not interested in the actual publication list, we added the limit = 0 parameter in the RCurl command, which forces the API to return only the count of the PubMed papers for a query. The resulting data from the NCBI API were parsed in R and the paper count information was exported to a tab-separated file. Randomly selected literature searches were repeated manually on the PubMed main site to confirm that the paper counts from the API matched those from manual searches. The resulting matrix can be found in Supplementary Table S4 and provides a rough overview of the availability of scientific literature for each agent with respect to the individual queries.
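The count-only query described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R/RCurl script. It assumes that the agent term and the filter are joined with AND, and it uses the E-utilities `rettype=count` option as a stand-in for the count-only setting mentioned in the text. The function names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(agent_term, filter_pattern):
    """Build an ESearch URL that asks PubMed only for the number of hits.

    Assumption: the agent term and the search filter are combined with AND.
    """
    term = f"({agent_term}) AND {filter_pattern}"
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_count(xml_text):
    """Extract the integer in the top-level <Count> element of an ESearch reply."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def fetch_count(agent_term, filter_pattern):
    """Run one query against the live service (requires network access)."""
    url = build_count_url(agent_term, filter_pattern)
    with urllib.request.urlopen(url) as response:
        return parse_count(response.read().decode("utf-8"))


# Example with the Query 6 pattern from Table 1 and a MeSH-tagged agent name.
q6 = '((chronic[All Fields] AND "inflammation"[MeSH Terms]) OR (chronic inflamm*[All Fields]))'
example_url = build_count_url('"Acrylonitrile"[MeSH]', q6)
```

Looping `fetch_count` over every agent and every filter would fill one cell of the agent-by-query count matrix per call; NCBI rate limits apply to such loops.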
2.3. Database fusion
Chemical agent lists from the websites of the selected 34 databases (Table S1) were downloaded. Such curated databases were not available for non-chemical agents. A database was selected if it was curated by an expert panel or maintained by a governmental agency or international organization, or if it provided relevant information that can be incorporated into an IARC Monographs assessment report for an agent. The relevant information included toxicology studies, chemical and physical properties, industrial production, occupational exposure and health information, substance registry, metabolism, risk assessment and biomonitoring. The compiled list included 64 databases in total. Of those, 34 databases provided access to downloadable chemical lists, which are shown in Table S2. To instantly query these chemical lists, Table S3 is provided as a tool. CAS numbers of the agents were queried against all 34 databases using an Excel formula search in Table S3. The search yielded “1” for presence and “0” for absence of a chemical agent in a database (shown by columns). We did not curate CASRN for the recommended agents. In case of multiple CASRN for an agent, all were queried against the 34 databases and presence or absence was noted.
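The presence/absence lookup can be illustrated with a short Python sketch. It is not the Excel formula used in Table S3. The agent names, database names and identifiers below are placeholders, and collapsing multiple CASRNs with "present if any identifier matches" is an assumption.

```python
def presence_matrix(agent_casrns, database_lists):
    """Return {agent: {database: 1 or 0}} for CASRN membership.

    agent_casrns:   dict mapping an agent name to a list of its CASRNs
    database_lists: dict mapping a database name to a set of CASRNs
    Assumption: an agent counts as present if any of its CASRNs is listed.
    """
    matrix = {}
    for agent, casrns in agent_casrns.items():
        matrix[agent] = {
            db: int(any(casrn in members for casrn in casrns))
            for db, members in database_lists.items()
        }
    return matrix


# Placeholder inputs (not real registry data).
agents = {"agent_1": ["ID-001"], "agent_2": ["ID-002", "ID-003"], "agent_3": ["ID-004"]}
databases = {"db_A": {"ID-001", "ID-003"}, "db_B": {"ID-001"}}

fusion = presence_matrix(agents, databases)
uncovered = [name for name, row in fusion.items() if sum(row.values()) == 0]
```

Here `uncovered` lists agents absent from every database, the same kind of check that identifies agents not covered by any of the chemical lists.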
2.4. Unsupervised clustering of agents
Chemical and non-chemical agents were separately clustered by a hierarchical clustering method using R software version 4.0.0. Distance and linkage parameters for the clustering were “spearman” distance and “ward.d2” linkage. Input data for the unsupervised clustering were literature counts, database coverage or substructure fingerprint matrix (Barupal et al. 2012), depending on the nature of the agent (chemical vs. non-chemical agent). No filtering of data was applied.
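As an illustration of the distance step only, the sketch below computes a Spearman-based distance between two agents' count profiles in plain Python. It assumes the distance is defined as one minus the Spearman rank correlation, which is a common convention but is not confirmed by the text. The Ward linkage step would then be applied to the resulting distance matrix by a clustering library.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_distance(x, y):
    """Assumed definition: 1 - Spearman rho (0 = same ordering, 2 = reversed)."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))


def distance_matrix(profiles):
    """Pairwise distances for a dict {agent: [count per query]}."""
    names = list(profiles)
    return {a: {b: spearman_distance(profiles[a], profiles[b]) for b in names} for a in names}
```

Because ranks rather than raw counts are compared, two agents with very different literature volumes but the same ordering across queries end up close together.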
2.5. Statistics and graphs
Descriptive statistics and graphs were constructed in R and Microsoft Excel 2016.
We provide an online resource www.cancer.idsl.me to perform the literature queries against the PubMed database and to query chemical agents against these 34 databases. Both non-chemical and chemical agent categories can be analyzed by this online tool.
3. Results
3.1. Overview of the prioritized agents
A total of 178 agent nominations were received by the IARC Monographs Group in 2019 for prioritization by the Advisory Group to Recommend Priorities for the IARC Monographs during 2020–2024 (Table S4). Nominations were submitted 1) by the public using an online interface provided on the IARC Monographs website, 2) by IARC Monographs personnel and 3) by the Advisory Group members. Leftover prioritized agents from the previous Advisory Group meeting in 2014 were also considered. Standardization of agent names and identifiers was performed to avoid redundant entries.
The Advisory Group recommended priorities for 119 agents (Fig. 2a), and there was no recommendation made for 59 agents because of a lack of relevant information in the public domain to support a new or updated evaluation or for other reasons. The 119 prioritized agents reflected seven major categories by their source and nature (Fig. 2a). Among those categories, “Chemicals” was the largest, constituting 43% of all prioritized agents. All other categories ranged between 9 and 14 agents. These agent groups were created to facilitate discussion by expertise within the Advisory Group; for example, all smoking-related agents were considered by experts engaged in the hazard assessment of smoking. The Advisory Group assigned the agents four different priorities for inclusion in IARC Monographs evaluations in the next five years. These priorities were 1) high, and ready for evaluation within 2.5 years; 2) high, and ready for evaluation within the latter half of the 5-year period; 3) medium; and 4) low. Forty-three percent of agents received a recommendation for evaluation with high priority within 2.5 years (Fig. 2b). Overall, 55% of the prioritized agents had not been previously evaluated by the IARC Monographs (Fig. 2c).
Fig. 2.

Overview of the prioritized agents. a) Priority recommendations for seven major categories of the prioritized agents for the IARC Monographs evaluations during 2020–2024. Inner number in the pie-chart shows the count of nominations, the number inside a donut chart shows prioritized agents within each category. Donut color shows different priorities. b) Distribution of agents by priorities. c) Coverage of prioritized agents by the IARC Monographs classification.
Among the chemicals, 35 (68%) received low or medium priority (Fig. 2a), while among healthcare related agents 8 (73%) received high (2.5 year) priority, which was highest among all categories. The smoking-related category and the category containing metals, particles, fibers, and physical agents had more than 50% agents with a high (within 2.5 years) priority.
In recommending these priorities, the Advisory Group considered peer-reviewed published literature data as well as information from over 60 databases (Supplementary Tables S1–S3) and discussed the impact of exposure to each agent or its source on cancer causation. To summarize the large volume of information in these databases in reference to the prioritized agents, a computational approach of text mining and database fusion was developed. Previously, IARC utilized an earlier version of this approach to aid prioritization and selection of pesticides for Monographs evaluations during 2016–2019 (Guha et al. 2016), which we have extended by including non-chemical agents, utilizing KCs and by including 34 additional databases.
3.2. Development of a text mining and database fusion resource
3.2.1. Text mining
A descriptive summary of the literature data is provided in Table 2. Up to 90% of prioritized agents were reported in > 5 cancer-related papers, and up to 72% of agents were reported in > 5 cancer epidemiology papers. Among the KC queries (Q3-Q10), Q4 yielded the highest average number of publications per agent, while Q9 (KC9, “causes immortalization”) retrieved fewer than 5 publications for 72% of the prioritized agents.
Table 2.
Summary of the literature count data for all nominated agents. Query details are provided in Table 1. SD – standard deviation.
Publication data for all prioritized agents (n = 119)
| Query | Mean | Median | Max | SD | >1000 | >100 | >10 | >5 | >=1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Q1 | 3634 | 218 | 114,126 | 13,150 | 32 | 77 | 101 | 107 | 116 | 3 |
| Q2 | 793 | 45 | 17,892 | 2566 | 14 | 45 | 83 | 86 | 112 | 7 |
| Q3 | 1550 | 74 | 89,750 | 8734 | 18 | 54 | 93 | 99 | 116 | 3 |
| Q4 | 1749 | 59 | 130,953 | 12,085 | 17 | 46 | 86 | 92 | 104 | 15 |
| Q5 | 1026 | 62 | 50,191 | 4971 | 16 | 51 | 89 | 95 | 113 | 6 |
| Q6 | 72 | 3 | 3163 | 331 | 2 | 11 | 38 | 50 | 79 | 40 |
| Q7 | 478 | 12 | 16,685 | 1815 | 11 | 30 | 66 | 75 | 96 | 23 |
| Q8 | 1404 | 26 | 83,049 | 8196 | 12 | 36 | 77 | 85 | 103 | 16 |
| Q9 | 41 | 1 | 2286 | 224 | 1 | 6 | 25 | 30 | 68 | 51 |
| Q10 | 961 | 40 | 43,128 | 4360 | 13 | 43 | 86 | 92 | 109 | 10 |
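The columns of Table 2 can be derived from a vector of per-agent paper counts for one query. The Python sketch below shows one way to do so. The function name is hypothetical, the input counts are made-up values for illustration, and the use of the sample standard deviation is an assumption.

```python
import statistics


def summarize_counts(counts):
    """Summarize per-agent paper counts for a single query.

    Returns the mean, median, maximum, sample SD (assumption) and the number
    of agents above each threshold, mirroring the columns of Table 2.
    """
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "sd": statistics.stdev(counts),
        ">1000": sum(c > 1000 for c in counts),
        ">100": sum(c > 100 for c in counts),
        ">10": sum(c > 10 for c in counts),
        ">5": sum(c > 5 for c in counts),
        ">=1": sum(c >= 1 for c in counts),
        "0": sum(c == 0 for c in counts),
    }


# Made-up counts for five hypothetical agents.
toy_summary = summarize_counts([0, 3, 12, 150, 2000])
```

The threshold columns are cumulative, so an agent with 2000 papers is counted in every column from `>1000` down to `>=1`, and the `>=1` and `0` columns together sum to the number of agents.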
3.2.2. Database fusion
In addition to the literature searches, the hazard assessment process relies on information available in specialized and curated databases such as HazMap (Fitzpatrick 2004). These databases may contain unstructured text on key findings that indicate an agent’s potential role in cancer initiation or progression. To utilize these databases, the relevant details need to be accessed on the corresponding websites or through downloadable reports. However, the number of relevant databases can be large (>10), and manually querying each agent in these databases can be cumbersome. Therefore, we undertook the task of developing an integrated resource for databases and the covered agents. It should be noted that only the agents with a CASRN (60% of nominations) could be queried against the database resource.
A list of 64 databases and information resources was compiled by reviewing their websites for cancer hazard-related materials. They have been developed by agencies of 9 countries, the EU and the UN, and they were grouped into five major categories (Fig. 4). Hazard assessment was the largest category with 25 databases. US, EU and UN agencies provided more than half of these databases. Of those 64 total databases, 34 provided access to downloadable chemical lists that included the CASRN, and from the remaining 29, chemical information was available online by manual queries. The approach covers only the 34 downloadable chemical lists. Database size ranges from 114 (PEC Australia) to 67,270 (The Safe Chemical Act) chemicals. The list of databases and corresponding chemical lists are provided in Table S2.
Fig. 4.

Database categories and their source countries/regions.
When 74 prioritized chemical agents with CASRN were queried against the 34 databases, a binary matrix was produced, indicating each agent’s presence or absence in each database. The Toxnet-Hazardous Substance Database (HSDB) had the highest coverage of agents assigned high priorities by the Advisory Group (Table 3), and it also covered up to 78% of the chemicals listed in the latest IARC Monographs database. The Japan Existing Chemical Database (JECDB) covered only one compound, which was of low priority. California-Prop65 and National Toxicology Program long-term reports are specialized databases for cancer hazard assessment, which could explain their highest proportional coverage of high-priority agents. Overall, up to 90% of chemicals reviewed in the past by the IARC Monographs were covered by these 34 databases. Only three agents that were assigned priorities for 2020–2024 (CASRN 519-02-8, 63-75-2 and 65666-07-1) were not covered by any of the 34 databases.
Table 3.
Coverage of prioritized agents with CAS numbers in 34 different databases. 74 agents had CAS numbers. These agents by categories were: chemicals (45), medical device or pharmaceutical (5), metals, particles, fibres, physical (7), smoking or tobacco related (9) and dietary (8). The IMO coverage column shows how many agents in the latest IARC Monographs list (n = 852) were covered in a database. Full names and websites of databases are provided in Supplementary Table S1.
Column groups: database property (Database size, IMO coverage) and database coverage (All agents, High (2.5 years), High (5 years), Medium, Low).
| Database | Database size | IMO coverage | All agents | High (2.5 years) | High (5 years) | Medium | Low |
|---|---|---|---|---|---|---|---|
| IARC-IMO | 852 | 852 | 33 | 16 | 2 | 9 | 6 |
| Toxnet-HSDB | 5893 | 662 | 68 | 26 | 3 | 23 | 16 |
| TSCA-Act | 67,270 | 489 | 52 | 20 | 3 | 17 | 12 |
| Gestis-Germany | 8014 | 479 | 62 | 20 | 3 | 22 | 17 |
| CDC NIOSH NOES | 9318 | 458 | 52 | 19 | 3 | 19 | 11 |
| CA-Prop65 | 776 | 407 | 39 | 15 | 3 | 12 | 9 |
| ICSC | 1704 | 326 | 48 | 16 | 2 | 18 | 12 |
| DFG- MAK and BAT values | 1411 | 323 | 41 | 15 | 3 | 14 | 9 |
| EU CLP classification | 4007 | 300 | 50 | 18 | 3 | 15 | 14 |
| NTP - long term reports | 544 | 279 | 34 | 13 | 3 | 13 | 5 |
| ECHA | 14,983 | 271 | 44 | 15 | 3 | 14 | 12 |
| UN Hazard Numbers | 2381 | 271 | 36 | 12 | 1 | 13 | 10 |
| OECD-HPV | 4636 | 247 | 43 | 17 | 2 | 16 | 8 |
| RISCAD - Japan | 4482 | 247 | 37 | 14 | 3 | 11 | 9 |
| OSHA | 901 | 247 | 40 | 16 | 2 | 13 | 9 |
| China-CHC | 2844 | 246 | 32 | 11 | 1 | 10 | 10 |
| EPA_HPV | 4308 | 246 | 43 | 18 | 2 | 15 | 8 |
| EPA-CDAT | 2626 | 238 | 31 | 10 | 2 | 11 | 8 |
| NTP-ROC | 227 | 198 | 19 | 9 | 2 | 4 | 4 |
| EPA-IRIS | 555 | 193 | 34 | 11 | 2 | 12 | 9 |
| NIOSH-NPG | 648 | 187 | 31 | 13 | 1 | 7 | 10 |
| Korea - Chemical Control Act | 2106 | 187 | 25 | 12 | 2 | 7 | 4 |
| EU: Carcinogenicity Assessment | 1243 | 186 | 24 | 9 | 1 | 7 | 7 |
| EFSA-foodTox | 5501 | 179 | 31 | 11 | 3 | 11 | 6 |
| OECD-SIDS | 1569 | 137 | 25 | 9 | 2 | 8 | 6 |
| EPA- Carcino Assess | 253 | 124 | 19 | 8 | 1 | 4 | 6 |
| ATSDR | 275 | 111 | 14 | 8 | 0 | 2 | 4 |
| BUA - German Chemical Society | 358 | 103 | 15 | 7 | 2 | 4 | 2 |
| JBRC-JETOC | 412 | 98 | 17 | 6 | 2 | 5 | 4 |
| EU-SCOEL | 269 | 81 | 14 | 8 | 2 | 2 | 2 |
| WHO JPMR | 325 | 60 | 14 | 5 | 0 | 8 | 1 |
| WHO CICADs | 346 | 49 | 9 | 6 | 0 | 2 | 1 |
| Australia PEC | 114 | 19 | 6 | 3 | 1 | 2 | 0 |
| JECDB | 445 | 16 | 1 | 0 | 0 | 0 | 1 |
This database fusion and coverage analysis sorted the selected databases by their relevance for cancer hazard assessment. As long as an agent has been assigned a CASRN, a simple query of the database fusion can identify the specific databases that cover the agent. The fusion approach also provided an overview of databases that lack assessment of chemicals by IARC Monographs. For example, half of the chemicals from the EPA carcinogenicity assessment database never received an IARC Monographs evaluation (data not shown).
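A coverage table of the kind shown in Table 3 can be tallied from the binary matrix and the assigned priorities. The Python sketch below is illustrative only. The function name, agents, databases and priorities are placeholders, and sorting databases by total coverage is an assumed ranking rule.

```python
from collections import Counter


def coverage_by_priority(matrix, priorities):
    """Count covered agents per database, split by priority label.

    matrix:     {agent: {database: 1 or 0}} presence/absence values
    priorities: {agent: priority label}
    Returns a list of (database, Counter) pairs sorted by total coverage,
    highest first. Databases covering no agent are omitted.
    """
    table = {}
    for agent, row in matrix.items():
        for db, present in row.items():
            if present:
                table.setdefault(db, Counter())[priorities[agent]] += 1
    return sorted(table.items(), key=lambda item: sum(item[1].values()), reverse=True)


# Placeholder inputs (not real coverage data).
toy_matrix = {
    "agent_1": {"db_A": 1, "db_B": 1},
    "agent_2": {"db_A": 1, "db_B": 0},
    "agent_3": {"db_A": 1, "db_B": 1},
}
toy_priorities = {"agent_1": "high", "agent_2": "medium", "agent_3": "low"}
ranked = coverage_by_priority(toy_matrix, toy_priorities)
```

Each entry of `ranked` corresponds to one row of the coverage table: the sum of the counter gives the "All agents" value and the individual keys give the per-priority columns.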
3.2.3. Application of the new informatics approach in cancer hazard assessments
In order to facilitate interaction with the data, we generated unsupervised clustering using literature data and chemical substructure fingerprints (Figure S1–S3). Hierarchical clustering analysis suggested clustering of the nominated chemical agents by three major structural domains: aliphatic, aromatic and elements. The tree diagram also highlighted the chemical agents with a high volume of literature on key characteristics of carcinogens. For example, many of the heavy metals have been reported in several thousand cancer epidemiology studies (Figure S3), suggesting that they should be given a higher priority for future IARC Monographs evaluations, aligning with the high priority assigned to lead, cobalt and nickel. Several compounds with few epidemiological studies had good coverage in top cancer-related databases (Table 3) and had a substantial number of papers on mechanistic studies (Fig. 5A). For instance, heavy metals (cobalt, lead and nickel), estradiol, bisphenol A, arecoline and acrylonitrile were given higher priorities in light of the published literature on KCs.
Fig. 5.

Unsupervised clustering of prioritized agents for the IARC Monographs evaluation. Label and point colors indicate the priorities: red = high, black = medium and blue = low. Agents having similar priorities and similar structures can be evaluated in a single working group meeting by the IARC Monographs. Dot size indicates the paper count, where the biggest dot indicates over 500 papers and the smallest dot indicates 1 paper. A- Advisory Group priorities for the clustered chemicals. Node size indicates the sum of papers for KC1–10 filters. B- Non-chemical agents with priorities, clustered by the literature data for queries 1–10.
While the Advisory Group used the text mining and database fusion to inform assignment of the priority ranking for the nominated agents, future IARC working group evaluations can take advantage of clustering of agents for joint evaluation of similar agents. Therefore, we explored an unsupervised clustering analysis to find data-driven groups of agents using literature data and chemical similarity. When epidemiological paper count data were visualized using the chemical and non-chemical cluster trees (Figure S4–S6), several groups of agents emerged that had similar priority, structure and literature counts. To further incorporate the literature data on key characteristics of carcinogens, we mapped the node size in the chemical clustering tree to the sum of papers of all KC filters in PubMed. This highlighted the chemically similar compounds that may have a sufficient volume of literature data to be jointly evaluated (Fig. 5A), and it also provided insight into the availability of mechanistic data for non-chemical agents (Fig. 5B). As a result, aniline, anisidine, nitroanisole and cupferron were recommended to be evaluated together in the IARC Monographs Meeting 127 (IARC Monographs Vol 127 group).
Next, we visualized the literature data for the ten PubMed filters in a grid view (Fig. 6) to provide information on how many and which agents could be evaluated in a single meeting. This representation enabled identification of agents within a category that have adequate papers on cancer epidemiology and key characteristics of carcinogens. For example, for the high-priority agents pyrethrins, malachite green and gentian violet, a considerable number of papers indicate availability of information relevant to KCs (Q3–10). The graphical display provides a quantitative basis for planning the number of agents that can reasonably be reviewed in one working group meeting. If the number of papers is very large, only 1–2 candidates can be evaluated in a single meeting. Several medium-priority agents such as TCPP and terbufos have a small number of papers, indicating they could be evaluated together in one working group meeting (Fig. 6). For agents such as nano-structures, a large number of papers on cancer epidemiology as well as KCs have been published, and this broad agent category may require sub-division. The grid view also highlights the gaps in scientific knowledge for implicating an agent in cancer etiology (Krewski et al. 2019; IARC Monographs Vol 124 group 2019).
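As a rough illustration of how paper counts could inform meeting planning, the sketch below packs agents into meetings with a greedy first-fit-decreasing rule. The per-meeting paper budget, the agent names and the counts are all hypothetical assumptions; this is not the procedure used by the Advisory Group, which relies on expert judgment informed by the grid view.

```python
def plan_meetings(paper_counts: dict, budget: int) -> list:
    """Greedy first-fit-decreasing grouping of agents into meetings.

    paper_counts maps an agent name to its total paper count.
    budget is a hypothetical cap on the papers reviewed per meeting.
    An agent whose count exceeds the budget gets a meeting of its own.
    """
    meetings = []  # each entry: [remaining_capacity, [agent names]]
    # Largest literature volume first; ties broken alphabetically.
    ordered = sorted(paper_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for agent, count in ordered:
        for meeting in meetings:
            if count <= meeting[0]:
                meeting[0] -= count
                meeting[1].append(agent)
                break
        else:
            # No existing meeting has room, so open a new one.
            meetings.append([max(budget - count, 0), [agent]])
    return [agents for _, agents in meetings]


# Placeholder counts, not values from this study.
toy_counts = {"agent_A": 900, "agent_B": 40, "agent_C": 25, "agent_D": 300}
plan = plan_meetings(toy_counts, budget=500)
print(plan)
```

With these placeholder numbers, the agent with the largest literature volume occupies a meeting alone, while the three agents with fewer papers share a second meeting.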
Fig. 6.

Overview of the volume of literature on agents with high, medium and low priorities. C1-C7 represent agent classes: C1 – bio-agent, C2 – chemical, C3 – medical device or pharmaceutical, C4 – smoking and tobacco-related, C5 – dietary, C6 – metals, particles, fibres, physical, and C7 – complex exposures. Dot size indicates the paper count: the largest dot indicates over 500 papers and the smallest dot indicates 1 paper. Q1-Q10 are the PubMed filters shown in Table 1. Point color indicates the current IARC Monographs classification: red = Group 1, yellow = Group 2A, orange = Group 2B, green = Group 3, and black = not evaluated. The reference year for the time-frames in priorities is 2019, when the Advisory Group recommended the priorities.
For the chemical agents, we generated a separate grid view summarizing coverage in various relevant databases (Fig. 7). This depiction may point to existing (literature) reviews of chemical agents by another agency or regulatory body. Those reports may be available online for inclusion in the hazard assessment process. For example, a high-priority agent, hydrochlorothiazide, is a high production volume chemical covered by EPA and OECD databases, and it has been evaluated by the Safe Chemical Act, HSDB, IARC Monographs and National Toxicology Program long-term reports. Hence, for this chemical agent, reports from these sources should be considered by the working group. Another example is estradiol, a reproductive hormone that showed a large body of literature using the PubMed filters (Fig. 6). However, this chemical was evaluated only in food, occupational exposure and toxicology assessment databases such as HSDB and California Prop65 in the US. Its presence in Prop65, the California carcinogen warning database, and its high-volume nature may warrant a meeting fully dedicated to just this compound.
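The presence/absence grid underlying such a coverage view can be sketched as follows. The sketch assumes that each database has already been reduced to a set of agent identifiers; the database names, agent names and memberships are placeholders and do not reflect the actual content of any source.

```python
def coverage_matrix(agents: list, database_lists: dict) -> dict:
    """Return {agent: {database: bool}} marking presence in each list."""
    return {
        agent: {db: agent in members for db, members in database_lists.items()}
        for agent in agents
    }


def render_grid(matrix: dict) -> str:
    """Render the matrix as tab-separated text: 'x' = present, '.' = absent."""
    databases = list(next(iter(matrix.values()))) if matrix else []
    lines = ["agent\t" + "\t".join(databases)]
    for agent, row in matrix.items():
        cells = ["x" if row[db] else "." for db in databases]
        lines.append(agent + "\t" + "\t".join(cells))
    return "\n".join(lines)


# Placeholder database contents, for illustration only.
toy_lists = {
    "database_X": {"agent_A", "agent_B"},
    "database_Y": {"agent_A"},
    "database_Z": set(),
}

matrix = coverage_matrix(["agent_A", "agent_B", "agent_C"], toy_lists)
# Number of databases covering each agent.
coverage_counts = {agent: sum(row.values()) for agent, row in matrix.items()}
print(render_grid(matrix))
print(coverage_counts)
```

Agents with no coverage at all, such as agent_C in this toy input, correspond to rows of white cells in the grid view.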
Fig. 7.

Coverage of agents with priorities in different databases. White cells indicate absence. C1-C7 represent agent classes: C1 – bio-agent, C2 – chemical, C3 – medical device or pharmaceutical, C4 – smoking and tobacco-related, C5 – dietary, C6 – metals, particles, fibres, physical, and C7 – complex exposures.
4. Discussion
Due to the large amount of available data, there is a clear need to develop integrated data science approaches to aid the operation of the key cancer hazard assessment programs (Atwood et al. 2019). These approaches can utilize omics assays (Zhivagui et al. 2019), literature, chemical assays (Chiu et al. 2018), toxicological databases and natural language processing. While it will be a major undertaking to combine all these resources into a unified and consolidated suite for applications at all stages of the cancer hazard assessment process, we have developed a novel data science approach to improve selection and grouping of suspected cancer agents for the IARC Monographs evaluations. We have built this method as a continuation of our previous work on prioritizing pesticides (Guha et al. 2016) for cancer hazard assessments and on ranking chemicals in blood specimen-related publications (Barupal and Fiehn 2019). With the ever-growing literature, computational approaches such as the one presented in this study can be valuable for providing an overview of the literature on scopes and contexts that are relevant for cancer hazard assessment. Expert panels can use these overviews to guide the prioritization of agents.
The new integrated data science approach added several new aspects: 1) text mining is optimized for key characteristics of carcinogens (Smith et al. 2020) and the other evidence streams used in the Monographs evaluations; 2) it can be used for both chemical and non-chemical agents; 3) the database fusion covers over 30 sources; 4) a website enables querying of new agents; and 5) the approach uses chemical structure and data-driven agent grouping/clustering (e.g. for future IARC Monographs evaluations).
The text mining approach highlighted that a large amount of published literature on cancer epidemiology and cancer-related mechanisms is available for a majority of the prioritized agents. Among the mechanistic data, the number of publications varies considerably, which may reflect differences in importance of individual mechanisms during cancer development, assay availability or research priorities. The reliability of the text mining approach focusing on key characteristics of carcinogens is underscored by the association of agents with known mechanisms of action with the respective literature queries. For aflatoxins, a group of strongly genotoxic mycotoxins (Smela et al. 2001), text mining retrieved predominantly results for Q3 (‘causes DNA alterations’), while human papillomavirus (HPV), a well-established human cancer virus with the ability to cause cell immortalization (zur Hausen 2000), resulted in the largest number of publications for Q9 (KC9, ‘causes immortalization’) among all queried agents.
A key feature of the approach is that it can rank both chemical and non-chemical agents. So far, chemicals have made up the major proportion of agents in hazard assessment. However, non-chemical agents and complex exposures (e.g., tobacco smoking) have long been recognized as important causes of cancer, and they may have a large body of human cancer evidence. There is also increasing interest in evaluating other non-chemical agents, which requires improved approaches for agent prioritization. This might also be impactful from a cancer prevention point of view if exposure to certain non-chemical agents could be easily modified by a policy implementation or lifestyle change. One limitation is the lack of good databases that use standardized ontology terms for non-chemical agents, which constrains the transfer of learning from one database to another. In this case, most of the analysis can only focus on literature data, which is facilitated by the MeSH ontology terms. It is therefore recommended that MeSH ontology terms be used for non-chemical agents for reporting and indexing in future databases.
The approach can be extended in the future to include natural language processing (NLP) (Crichton et al. 2020) of full-text publications, abstracts, previous IARC Monographs reports and reports from other agencies. We can foresee the development of an artificial intelligence-driven database to retrieve exact information (Baker et al. 2017; Pyysalo et al. 2019) that can be systematically presented to the expert panels. A future version may include development of a novel NLP algorithm to identify positive and negative associations of an agent with cancer hazards at individual sentence level (Allot et al. 2019). Additionally, the approach can be extended to rank publications themselves to recommend the papers that should be cited in the IARC Monographs reports.
Overall, we have shown that the new informatics approach has great potential to improve ranking of agents for cancer hazard assessment on two levels: 1) ranking the overall nominations for an expert panel review and 2) ranking and grouping prioritized agents for future IARC Monographs working group meetings. The IARC Monographs Advisory Group utilized the analysis results to improve the ranking of several agents, especially in light of the availability of mechanistic studies. To assist in the Advisory Group’s process of assigning priorities, the new informatics approach provided an overview of relevant information for each agent. The approach was used for all the nominated agents, and the text mining results and the database coverage were provided to the members of the Advisory Group. The data provided a quick overview of agents with available information in databases, highlighting the ones that could deserve higher priorities. This view of the information availability was a useful factor in selecting new agents that had not received a previous IARC Monographs evaluation but may have a sufficient volume of published evidence to assess their carcinogenic potential. Similarities among agents based on their structure, literature data and database coverage can be utilized for assigning their priorities and for grouping agents together for future IARC Monographs working group meetings.
We expect the newly developed approach to be regularly used by the IARC Monographs Programme for prioritizing agents using the available website. The analysis also highlighted several candidate chemical agents for nomination for the next Advisory Group meeting in 2024, such as those present in toxicological databases, but not yet nominated for cancer hazard assessment by the IARC Monographs Programme. Since the described approach is easily reproducible via the provided website, other cancer hazard assessment agencies can immediately start applying it for their purposes.
When using the described approach, it is important to keep in mind that it relies on the publication count for an agent and therefore well-known agents such as coffee or benzene may result in higher counts due to existing forms of research and publication bias in the scientific community. However, queries can be constrained for publication years, for example, to cover only literature that has emerged since a previous Monographs evaluation. Moreover, of the chemical databases that contained relevant information for cancer hazard assessment, about half provided downloadable data, highlighting the need to further improve procedures ensuring more balanced access and contents of scientific and regulatory databases.
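A date-constrained literature count can be requested from PubMed through the NCBI E-utilities ESearch endpoint, which accepts `datetype`, `mindate` and `maxdate` parameters. The sketch below only builds the request URL and does not contact the server. The search term and the year range are placeholder assumptions, not the filters from Table 1 or a query used in this study.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_query_url(term: str, min_year: int, max_year: int) -> str:
    """Build an ESearch URL restricted to a publication-year window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # restrict by publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": "0",  # no record IDs needed, only the hit count
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)


# Placeholder term and years, chosen only to illustrate the idea of
# counting literature published since an earlier evaluation.
url = count_query_url("benzene AND neoplasms[MeSH Terms]", 2018, 2024)
print(url)
```

Fetching such a URL should return a JSON document that includes the number of matching records, which can then be compared with the unconstrained count for the same agent.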
5. Conclusions
Prioritization of chemical and non-chemical agents for cancer hazard assessment can be achieved by a combination of text mining and database fusion approaches. Relevant information for cancer hazard assessment can be found in curated databases, unstructured published text, toxicological priority lists and expert reports. We have utilized an innovative and integrated computational approach of text mining and database fusion to retrieve and organize the heterogeneous information on a set of agents in a systematic and structured way. For the IARC Monographs Programme, such an approach has been successfully implemented for ranking nominated agents for future evaluations.
Supplementary Material
Funding
The IARC Monographs Programme is partially funded by NIH-NCI Grant U01CA033193.
Footnotes
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Where authors are identified as personnel of the International Agency for Research on Cancer / World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer / World Health Organization.
Appendix A. Supplementary material
Supplementary data to this article can be found online at https://doi.org/10.1016/j.envint.2021.106624.
References
- Allot A, Chen Q, Kim S, Vera Alvarez R, Comeau DC, Wilbur WJ, et al., 2019. LitSense: Making sense of biomedical literature at sentence level. Nucleic Acids Res 47, W594–W599.
- Atwood ST, Lunn RM, Garner SC, Jahnke GD, 2019. New perspectives for cancer hazard evaluation by the report on carcinogens: A case study using read-across methods in the evaluation of haloacetic acids found as water disinfection by-products. Environ Health Perspect 127, 125003.
- Baker S, Ali I, Silins I, Pyysalo S, Guo Y, Hogberg J, et al., 2017. Cancer hallmarks analytics tool (CHAT): A text mining approach to organize and evaluate scientific literature on cancer. Bioinformatics 33, 3973–3981.
- Barupal DK, Haldiya PK, Wohlgemuth G, Kind T, Kothari SL, Pinkerton KE, et al., 2012. MetaMapp: Mapping and visualizing metabolomic data by integrating information from biochemical pathways and chemical and mass spectral similarity. BMC Bioinf. 13, 99.
- Barupal DK, Fiehn O, 2019. Generating the blood exposome database using a comprehensive text mining and database fusion approach. Environ Health Perspect 127, 97008.
- Chiu WA, Guyton KZ, Martin MT, Reif DM, Rusyn I, 2018. Use of high-throughput in vitro toxicity screening data in cancer hazard evaluations by IARC Monograph working groups. ALTEX 35, 51–64.
- Cogliano VJ, Baan RA, Straif K, Grosse Y, Secretan MB, El Ghissassi F, et al., 2004. The science and practice of carcinogen identification and evaluation. Environ Health Perspect 112, 1269–1274.
- Crichton G, Baker S, Guo Y, Korhonen A, 2020. Neural networks for open and closed literature-based discovery. PLoS ONE 15, e0232891.
- Fitzpatrick RB, 2004. Haz-Map: Information on hazardous chemicals and occupational diseases. Med Ref Serv Q 23, 49–56.
- Guha N, Guyton KZ, Loomis D, Barupal DK, 2016. Prioritizing chemicals for risk assessment using chemoinformatics: Examples from the IARC Monographs on pesticides. Environ Health Perspect 124, 1823–1829.
- Guyton KZ, Rieswijk L, Wang A, Chiu WA, Smith MT, 2018a. Key characteristics approach to carcinogenic hazard identification. Chem Res Toxicol 31, 1290–1292.
- Guyton KZ, Rusyn I, Chiu WA, Corpet DE, van den Berg M, Ross MK, et al., 2018b. Application of the key characteristics of carcinogens in cancer hazard identification. Carcinogenesis 39, 614–622.
- IARC, 2019. IARC Monographs on the identification of carcinogenic hazards to humans. Preamble.
- IARC Monographs Priorities Group, 2019. Advisory group recommendations on priorities for the IARC Monographs. Lancet Oncol 20, 763–764.
- IARC Monographs Vol 124 group, 2019. Carcinogenicity of night shift work. Lancet Oncol 20, 1058–1059. https://pubmed.ncbi.nlm.nih.gov/31281097.
- IARC Monographs Vol 127 group, 2020. Carcinogenicity of some aromatic amines and related compounds. Lancet Oncol 21, 1017–1018. https://pubmed.ncbi.nlm.nih.gov/32593317.
- Krewski D, Bird M, Al-Zoughool M, Birkett N, Billard M, Milton B, et al., 2019. Key characteristics of 86 agents known to cause cancer in humans. J Toxicol Environ Health B Crit Rev 22, 244–263.
- Pyysalo S, Baker S, Ali I, Haselwimmer S, Shah T, Young A, et al., 2019. LION LBD: A literature-based discovery system for cancer biology. Bioinformatics 35, 1553–1561.
- Samet JM, Chiu WA, Cogliano V, Jinot J, Kriebel D, Lunn RM, et al., 2020. The IARC Monographs: Updated procedures for modern and transparent evidence synthesis in cancer hazard identification. J Natl Cancer Inst 112, 30–37.
- Saracci R, Wild CP, 2016. Fifty years of the International Agency for Research on Cancer (1965 to 2015). Int J Cancer 138, 1309–1311.
- Smela ME, Currier SS, Bailey EA, Essigmann JM, 2001. The chemistry and biology of aflatoxin B(1): From mutational spectrometry to carcinogenesis. Carcinogenesis 22, 535–545.
- Smith MT, Guyton KZ, Gibbons CF, Fritz JM, Portier CJ, Rusyn I, et al., 2016. Key characteristics of carcinogens as a basis for organizing data on mechanisms of carcinogenesis. Environ Health Perspect 124, 713–721.
- Smith MT, Guyton KZ, Kleinstreuer N, Borrel A, Cardenas A, Chiu WA, et al., 2020. The key characteristics of carcinogens: Relationship to the hallmarks of cancer, relevant biomarkers, and assays to measure them. Cancer Epidemiol Biomarkers Prev 29, 1887–1903.
- Straif K, Loomis D, Guyton K, Grosse Y, Lauby-Secretan B, El Ghissassi F, et al., 2014. Future priorities for the IARC Monographs. Lancet Oncol. 15, 683–684.
- Vermeulen R, Schymanski EL, Barabasi AL, Miller GW, 2020. The exposome and health: Where chemistry meets biology. Science 367, 392–396.
- Ward EM, Germolec D, Kogevinas M, McCormick D, Vermeulen R, Anisimov VN, et al., 2019. Carcinogenicity of night shift work. Lancet Oncol. 20, 1058–1059.
- Wild CP, 2005. Complementing the genome with an “exposome”: The outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 14, 1847–1850.
- Zhivagui M, Ng AWT, Ardin M, Churchwell MI, Pandey M, Renard C, et al., 2019. Experimental and pan-cancer genome analyses reveal widespread contribution of acrylamide exposure to carcinogenesis in humans. Genome Res 29, 521–531.
- zur Hausen H, 2000. Papillomaviruses causing cancer: Evasion from host-cell control in early events in carcinogenesis. J Natl Cancer Inst 92, 690–698.
