EPA Author Manuscripts
Author manuscript; available in PMC: 2023 Mar 21.
Published in final edited form as: Environ Int. 2019 Nov 8;134:105228. doi: 10.1016/j.envint.2019.105228

Novel text analytics approach to identify relevant literature for human health risk assessments: a pilot study with health effects of in utero exposures

Michelle Cawley 1, Renee Beardsley 2, Brandy Beverly 2, Andrew Hotchkiss 2, Ellen Kirrane 2, Reeder Sams II 2, Arun Varghese 1, Jessica Wignall 1, John Cowden 3,4
PMCID: PMC10029921  NIHMSID: NIHMS1699399  PMID: 31711016

Abstract

Background:

Systematic reviews involve mining literature databases to identify relevant studies. Identifying potentially relevant studies can be informed by computational tools comparing text similarity between candidate studies and selected key (i.e., seed) references.

Challenge:

Using computational approaches to identify relevant studies for risk assessments is challenging, as these assessments examine multiple chemical effects across lifestages (e.g., human health risk assessments) or specific effects of multiple chemicals (e.g., cumulative risk). The broad scope of potentially relevant literature can make selection of seed references difficult.

Approach:

We developed a generalized computational scoping strategy to identify human health relevant studies for multiple chemicals and multiple effects. We used semi-supervised machine learning to prioritize studies to review manually with training data derived from references cited in the hazard identification sections of several US EPA Integrated Risk Information System (IRIS) assessments. These generic training data or seed studies were clustered with the unclassified corpus to group studies based on text similarity. Clusters containing a high proportion of seed studies were prioritized for manual review. Chemical names were removed from seed studies prior to clustering resulting in a generic, chemical-independent method for identifying potentially human health relevant studies. We developed a case study that focused on identifying the array of chemicals that have been studied with respect to in utero exposure to test the recall of this novel literature searching strategy. We then evaluated the general strategy of using generic, chemical-independent training data with two previous IRIS assessments by comparing studies predicted relevant to those used in the assessments (i.e., total relevant).

Outcome:

A keyword search designed to retrieve studies that examined the in utero effects of environmental chemicals identified over 54,000 candidate references. Clustering algorithms were applied using 1,456 studies from multiple IRIS assessments with chemical names removed as training data or seeds (i.e., semi-supervised learning). Using a six-algorithm ensemble approach, 2,602 articles, or approximately 5% of candidate references, were “voted” relevant by four or more clustering algorithms, and manual review confirmed nearly 50% of these studies were relevant. Further evaluations on two IRIS assessments, using a nine-algorithm ensemble approach and a set of generic, chemical-independent, externally-derived seed studies, correctly identified 77 to 83% of hazard identification studies published in the assessments and eliminated the need to manually screen more than 75% of search results on average.

Limitations:

The chemical-independent approach used to build the training literature set provides a broad and unbiased picture across a variety of endpoints and environmental exposures but does not systematically identify all available data. Variance between the actual and predicted numbers of relevant studies will be greater because seed studies were selected externally and non-randomly. This approach depends on access to readily available generic training data that can be used to locate relevant references in an unclassified corpus.

Impact:

A generic approach to identifying human health relevant studies could be an important first step in literature evaluation for risk assessments. This initial scoping approach could facilitate faster literature evaluation by focusing reviewer efforts, as well as potentially minimize reviewer bias in selection of key studies. Using externally-derived training data has applicability particularly for databases with very low search precision where identifying training data may be cost-prohibitive.

Background and Challenge

Systematic review methods identify, select, critically assess, and synthesize scientific evidence for literature-based evaluations. Multiple environmental organizations have adopted (Birnbaum et al. 2013; European Food Safety Authority (EFSA) 2010; National Academy of Sciences (NAS) 2018; Rooney et al. 2014; US EPA 2013; Woodruff and Sutton 2011) or recommended (Institute of Medicine (IOM) et al. 2011; National Research Council (NRC) 2013; 2014) the use of systematic review methods for evaluating the association between health effects and environmental exposures. These approaches address many challenges of adapting evidence-based medicine approaches to risk assessment, but a critical need remains largely unaddressed: identifying relevant studies from large literature databases.

Manual screening of literature is labor and time-intensive, often relying on multiple reviewers to identify relevant studies. To enhance screening throughput, computational tools using text analytics have been developed, including applications of semi-supervised learning. Semi-supervised learning-based classification works by using a small training set in a variety of possible ways, including generative models, low-density separation, and graph-based models (Chapelle et al. 2006). Specifically, semi-supervised learning that uses clustering algorithms and a set of key studies or “seeds” as training data can be used to prioritize clusters for manual review (Varghese et al. 2018). Clustering algorithms group studies based on text similarities. Clusters with a relatively high concentration of seed studies have been found to contain a high concentration of relevant studies. Studies in the same cluster as seed studies are more likely to be relevant as they share more language in common. Varghese et al. (2018) provides a detailed explanation of semi-supervised learning using seed studies, which the authors refer to as supervised clustering. Similarly, machine learning algorithms use training data to predict relevance in an unclassified corpus. These computational approaches have been demonstrated to reduce the burden of manually screening literature for systematic reviews and other large-scale literature reviews (Aphinyanaphongs et al. 2005; Bekhuis and Demner-Fushman 2012; Cohen et al. 2006; Frunza et al. 2011; Jonnalagadda and Petitti 2013; Shemilt et al. 2014; Varghese et al. 2018; Wallace et al. 2010). A review by O’Mara-Eves et al. (2015) suggested that machine learning could replace the second reviewer typically used in systematic reviews. The technical details of how machine learning algorithms are applied to textual classification tasks are reviewed and demonstrated in numerous sources (Aphinyanaphongs et al. 2005; Bekhuis and Demner-Fushman 2012; O’Mara-Eves et al. 2015; Shemilt et al. 2014; Varghese et al. 2018; Wallace et al. 2010) and are not summarized in this paper.

A key feature of the existing computational approaches is selection of training data or seed studies. Training data are often selected using a priori knowledge, such as expert judgement or previous literature reviews. It can be challenging to identify training data for assessments that consider the health effects across multiple lifestages and chemicals. For instance, the United States Environmental Protection Agency (US EPA) Integrated Risk Information System (IRIS) program reviews human, animal, and mechanistic literature to evaluate potential toxicity from lifetime chemical exposure. Similarly, cumulative risk assessments consider the effect of multiple chemical exposures on specific health outcomes (US EPA 2007). These large literature databases (e.g., tens of thousands of studies) with a broad array of chemical effects and/or chemical exposures make selecting key studies challenging. Further, the cost of developing a training dataset may negate the value of the solution (Frunza et al. 2011). This is particularly true in a literature corpus that contains few relevant studies (i.e., low search precision). Shekelle et al. (2017) tested the efficacy of using citations from original systematic reviews as training data when conducting search updates for three topics. Using this method, they identified 96% of the relevant articles on average and reduced the number of articles requiring manual screening by 78% on average, indicating that externally-derived training data can effectively identify relevant articles in an unclassified corpus. In general, methods are needed to identify studies that are relevant for inclusion in assessments that are designed to inform cumulative risk or the array of health endpoints associated with exposures during different lifestages.

To address this challenge, we developed a broad scoping strategy to identify relevant literature for the hazard identification section of a human health risk assessment. The hazard identification section identifies and evaluates epidemiological and toxicological evidence of health effects following chemical exposure. A pilot study using human health impact(s) of in utero exposure to environmental chemicals was used to develop and evaluate the strategy. A novel, chemical-independent approach was used to identify seed studies across a variety of endpoints and environmental exposures. These seed studies were clustered with literature search results and used as a signal to discover relevant studies in the unclassified corpus. From these clusters, an ensemble approach was used to predict hazard identification relevant studies, which were then manually reviewed for relevance. In addition, this strategy was used to predict relevant studies published in two existing IRIS assessments with low search precision. Our results suggest that a chemical-independent seed set can effectively identify relevant studies from large literature searches.

Approach and Outcomes

Selecting a Pilot Study

We developed a literature scoping strategy using a pilot study approach. In selecting the pilot study, two criteria were required. First, the literature database must focus on health effects of chemical exposures. Second, the literature database must include multiple chemicals and multiple health effect outcomes.

These criteria were selected to reflect the challenges of identifying relevant literature for human health and cumulative risk assessments. Based on these criteria, we selected health effects from in utero exposures as our pilot study.

Health effects resulting from in utero exposure to environmental pollutants have been widely studied. Epidemiological studies have characterized relationships between various health effects and in utero exposure to environmental pollutants, including polybrominated diphenyl ether (Chen et al. 2013a; Eskenazi et al. 2013), polyaromatic hydrocarbons (Perera et al. 2009; Perera et al. 2012), arsenic (Nadeau et al. 2014; Recio-Vega et al. 2015; Steinmaus et al. 2014; Wasserman et al. 2014), lead (Nye et al. 2014), methylmercury (Ryan 2008; Yorifuji et al. 2015; Zeilmaker et al. 2013), perfluorooctanoic acid (Chen et al. 2013b), and organochlorines (Eskenazi et al. 2008; Vested et al. 2014). These studies indicate a broad range of chemicals and health effects associated with in utero exposure, consistent with our criteria for pilot study selection.

Pilot Study Literature Search Strategy

A gray literature search was conducted in Google to capture relevant reports or other reviews of observed health effects associated with in utero exposure to environmental pollutants. Results from this effort, combined with input from subject matter experts, informed the keywords for health outcomes and pollutants. Although this strategy is unlikely to identify all existing literature on potential health effects, it was sufficient to generate a large literature database consisting of multiple chemicals and multiple health effects.

The keyword search was conducted in PubMed in May 2015. The search was designed to broadly, rather than comprehensively, identify literature evaluating health impact(s) of in utero exposure to an environmental chemical (Supplemental Table 1). Three characteristics were considered for search term selection: 1) window of exposure, 2) pollutant, and 3) health outcome.

To develop the window of exposure terms, terms used in literature specific to in utero exposure timeframes were identified and compiled in an iterative process based on consensus from subject matter experts from the US EPA.

To limit the in utero exposure literature to environmental chemicals, a set of pollutant terms was developed and combined with the window of exposure terms. We compiled a list of chemical terms using the National Health and Nutrition Examination Survey (NHANES), the US EPA Office of Research and Development (ORD) Children’s Environmental Health Research Roadmap, Human Health Benchmarks for Pesticides (HHBP), and a broad internet search. However, this list included more than 600 terms and returned over 100,000 results, so we refined the pollutant term set (Set 2) to include general categories (e.g., chemical, pollutant, metal) to prevent bias toward any specific chemical.

To filter literature to human health relevant outcomes, a list of health outcomes of interest terms was developed based on effects referenced in the Children’s Environmental Health Research Roadmap. These terms were refined based on expert knowledge and review of other relevant references. This final set of terms was designed to reduce bias resulting from pre-existing knowledge. Additional limits were applied to narrow the search results to studies most likely to be relevant and reviewable (e.g., results with abstracts, journal articles).

Novel Approach to Seed Study Selection for Supervised Clustering

Over 54,000 candidate references (Supplemental Material 1 – PMIDs used in Supervised Clustering) were identified from our literature search. To identify studies relevant for human health hazard identification, supervised clustering was used. Supervised clustering is a form of semi-supervised learning that groups an unclassified corpus of studies with a set of known relevant (i.e., “seed”) studies (Varghese et al. 2018). Clusters containing a high proportion of seed studies are expected to contain a high proportion of relevant studies and are prioritized for manual screening (Varghese et al. 2018). Seed studies were identified using a novel, chemical-independent method. For our pilot study, seeds for supervised clustering were derived from the hazard identification literature of the chromium (VI), inorganic arsenic, and polychlorinated biphenyls IRIS assessments. These hazard identification studies include a wide variety of health effect data across several endpoints. In addition, the studies have been manually reviewed and identified as relevant for human health risk assessments. Specific chemical names and synonyms were removed from titles and abstracts of seed studies to prevent clustering bias from chemical-specific terms in the seed studies (Supplemental Table 2). The resulting seed studies provide a generalized approach to supervised clustering that is not based upon specific chemical identity or health effect.
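The chemical-name removal step can be sketched in a few lines. This is an illustrative Python sketch, not the authors' actual preprocessing code: the synonym list here stands in for Supplemental Table 2, and the function name is hypothetical.

```python
import re

# Stand-in for Supplemental Table 2: chemical names and synonyms for the
# chromium (VI), inorganic arsenic, and PCB seed assessments (illustrative).
CHEMICAL_SYNONYMS = [
    "hexavalent chromium", "chromium vi", "chromium (vi)",
    "inorganic arsenic", "arsenic",
    "polychlorinated biphenyls", "pcbs", "pcb",
]

def remove_chemical_terms(text, synonyms=CHEMICAL_SYNONYMS):
    """Case-insensitively delete each synonym, longest first, then tidy whitespace."""
    for term in sorted(synonyms, key=len, reverse=True):
        text = re.sub(re.escape(term), " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

title = "Inorganic arsenic exposure in utero and cardiovascular outcomes"
print(remove_chemical_terms(title))
# -> "exposure in utero and cardiovascular outcomes"
```

Deleting the longest synonyms first prevents a short term (e.g., "arsenic") from leaving behind fragments of a longer one (e.g., "inorganic arsenic").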

Supervised Clustering of Pilot Study Literature

After combining the PubMed search results (54,134 studies) with the chemical-independent seed set (1,456 seed studies from the IRIS hazard identification sections), we used text analytics, specifically supervised clustering, to prioritize a subset of PubMed search results for manual screening (Figure 1). The DoCTER software (Document Classification and Topic Extraction Resource) (ICF, Virginia, USA) was used to conduct supervised clustering as described by Varghese et al. (2018). This method uses two algorithms (k-means and nonnegative matrix factorization, NMF) and three cluster sizes (10, 20, and 30). Using each algorithm with the three different cluster numbers yields six different clustering algorithms (e.g., KM-10 is the k-means algorithm with 10 clusters and KM-20 is the k-means algorithm with 20 clusters). The six clustering algorithms were applied to text in titles and abstracts of the in utero literature search result studies and the 1,456 seed studies. Specifically, we input the full corpus of unclassified studies along with the seed studies into the software as a CSV file. The CSV file contained a unique identifier, the title and abstract in a single cell, and a label indicating whether each study was a seed or unclassified. Using this input file, DoCTER ran clustering for all six algorithms and output a CSV file containing a cluster number for each seed and unclassified study under each algorithm.
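The clustering step above can be sketched with scikit-learn standing in for DoCTER's internals. This is a minimal illustration, assuming a standard TF-IDF representation of titles plus abstracts; the toy corpus, variable names, and use of 2 clusters (rather than the paper's 10/20/30) are ours, not the authors'.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

# Tiny illustrative corpus: "seed" rows are known-relevant hazard ID studies,
# "unclassified" rows come from the keyword search.
docs = [
    ("seed", "prenatal exposure associated with reduced birth weight in cohort"),
    ("seed", "gestational exposure and neurodevelopmental deficits in offspring"),
    ("unclassified", "maternal serum levels and infant cognitive outcomes"),
    ("unclassified", "synthesis and catalytic properties of novel nanomaterials"),
]
labels = [d[0] for d in docs]
texts = [d[1] for d in docs]

# Title and abstract are vectorized together, mirroring the single-cell CSV input.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

def cluster_assignments(X, n_clusters, seed=0):
    """Return (k-means labels, NMF labels) for one cluster size."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    W = NMF(n_components=n_clusters, init="nndsvd", max_iter=500).fit_transform(X)
    nmf = W.argmax(axis=1)  # assign each document to its dominant NMF component
    return km, nmf

# The paper's six algorithms = {k-means, NMF} x {10, 20, 30} clusters;
# 2 clusters here so the four-document toy corpus is usable.
km_labels, nmf_labels = cluster_assignments(tfidf, n_clusters=2)
```

In the full pipeline this loop would run for each of the six algorithm/cluster-size combinations, and each study's cluster number per algorithm would be written back out for the seed-density analysis that follows.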

Figure 1: Summary of process to prioritize search results using supervised clustering and chemicalindependent seeds.

The distribution of seed studies throughout the clusters is not uniform. Because seed studies are similar in language used (i.e., they are all hazard identification studies) they tend to group into a few clusters and some clusters do not contain any seeds. For each algorithm, we selected the cluster containing the highest percentage of seed studies. If this single cluster contained fewer than 30% of seeds, we also selected the cluster with the second highest number of seeds. Table 1 reports the numbers of clusters used to retrieve at least 30% of the seed studies, with the total number of studies within the cluster(s). In four of six algorithms a single cluster contained more than 30% of seeds. Only NMF-20 and NMF-30 required the top two clusters to reach 30% seed capture. In total 66% of the seed studies were captured by one or more algorithms. While a seed capture rate of 90% or higher is recommended, this lower rate of seed capture was appropriate given the seed studies were external to the search results and included more studies than typically required. There were 6,691 unique studies identified as potentially relevant based on this supervised clustering approach.
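The cluster-prioritization rule just described can be expressed compactly. This is a sketch under our own naming, not DoCTER code: take the cluster holding the largest share of seeds, and if it captures less than 30% of all seeds, add the cluster with the next-highest seed count.

```python
from collections import Counter

def prioritized_clusters(cluster_ids, is_seed, threshold=0.30):
    """Select the cluster with the most seeds; add the runner-up if the top
    cluster holds less than `threshold` of all seeds."""
    total_seeds = sum(is_seed)
    seeds_per_cluster = Counter(c for c, s in zip(cluster_ids, is_seed) if s)
    ranked = [c for c, _ in seeds_per_cluster.most_common()]
    chosen = [ranked[0]]
    if seeds_per_cluster[ranked[0]] / total_seeds < threshold and len(ranked) > 1:
        chosen.append(ranked[1])
    return chosen

# Toy example: 10 seeds spread over clusters 0-2, plus 3 unclassified studies.
clusters = [0] * 4 + [1] * 3 + [2] * 3 + [0, 1, 2]
seeds = [True] * 10 + [False] * 3
print(prioritized_clusters(clusters, seeds))  # cluster 0 holds 4/10 = 40% of seeds
```

With the 30% threshold only cluster 0 is selected; raising the threshold above 40% would pull in the runner-up cluster as well, mirroring the NMF-20 and NMF-30 cases in Table 1.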

We minimized screening bias from any individual algorithm by using an ensemble approach to further refine the potentially relevant literature. Of the total literature identified by the keyword search, 88% were not predicted relevant by any supervised clustering algorithm and were eliminated from screening.

The remaining studies were grouped based on how many clustering algorithms “voted” them relevant for hazard identification (Figure 2). Studies in Group A were voted relevant by all six clustering algorithms, whereas studies in Group F were only identified as relevant by a single algorithm.

Figure 2: Summary of supervised clustering with 6-algorithm ensemble approach.

Numbers indicate the total number of references in each group. References are grouped based on number of algorithms that “voted” them relevant.
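The vote-grouping step can be sketched as follows. This is an illustrative tally with hypothetical study IDs, not the authors' code; the letter labels follow Figure 2 (Group A = 6 votes, ..., Group F = 1 vote).

```python
def vote_groups(votes_per_study):
    """Bucket study IDs by how many of the six algorithms voted them relevant."""
    letter = {6: "A", 5: "B", 4: "C", 3: "D", 2: "E", 1: "F"}
    groups = {}
    for study_id, n_votes in votes_per_study.items():
        if n_votes:  # studies with zero votes are eliminated from screening
            groups.setdefault(letter[n_votes], []).append(study_id)
    return groups

# Toy vote counts for five studies (hypothetical IDs).
votes = {"pmid1": 6, "pmid2": 6, "pmid3": 4, "pmid4": 1, "pmid5": 0}
print(vote_groups(votes))
```

In the pilot study, only Groups A, B, and C (four or more votes) went on to manual screening, and zero-vote studies, 88% of the corpus, were dropped without review.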

Evaluating the Supervised Clustering Results

Most seed studies were identified by multiple algorithms, suggesting that relevant studies would be identified by more than one algorithm. Therefore, we selected the supervised clustering results voted relevant by at least four algorithms (i.e., Groups A, B, and C) for evaluation. These studies were evaluated for relevancy using manual review (Supplemental Text 1: Manual Literature Review Instructions). Manual screening consisted of reviewing titles and abstracts. A study was considered relevant if it estimated or measured chemical exposure during pregnancy and evaluated health effects observed in the fetus, in children, or in adults later in life. Studies were considered not relevant for reasons including out-of-scope effects (e.g., study only included maternal health effects unrelated to fetal health, such as reproductive outcomes), exposure measured or estimated outside the relevant timeframe, out-of-scope exposures (e.g., alcohol, caffeine, tobacco), or out-of-scope study design (e.g., case report, methods paper). Screeners also indicated whether a study was about a particular evidence stream (human, animal, mechanistic), exposure scenario (e.g., occupational, paternal exposure), or a review or meta-analysis.

Some studies were neither clearly relevant nor not relevant and were tagged as “maybe” relevant or supporting data, depending on the information available. (See Supplemental Text 1: Manual Literature Review Instructions for full details on the screening instructions and the tags applied to the literature.)

In total, 2,602 studies in Groups A, B, and C were manually screened for relevance for in utero effects following chemical exposure. While we manually screened less than 5% of the search results, nearly 50% of the screened studies were relevant (Figure 3a). In contrast, Belter (2016) evaluated the precision of 14 published systematic reviews and found search precision on average to be less than 5%. When one outlier was removed (i.e., one search retrieved 25 results of which 10 were relevant), search precision dropped to 1.5% on average. Golder et al. (2006) reported low search precision (less than 3%) for sensitive search strategies in PubMed and Embase to identify studies of adverse effects. Further, for the 15 datasets used by Cohen et al. (2006), search precision ranged from 2 to 32% and was 18% on average.

Figure 3: Manual screening results.

3a: Number of and relative percentage of relevant, maybe relevant, and not relevant studies identified in Groups A, B, and C based on manual screening. 3b: Relative percentage of hazard identification, in vitro, and supporting studies for studies identified in Groups A, B, and C. Hazard identification includes animal and human studies.

Of the relevant studies identified during manual review, over 99% of studies voted relevant by six algorithms were human or animal studies that would be relevant for hazard identification (Figure 3b). A similarly high number of hazard identification relevant studies were found in Group B studies, although nearly 40% of studies in Group C were not relevant for hazard identification. Group A includes studies that were identified by all six algorithms, therefore, we expected this group to be most similar to the seed studies. The concentration of hazard identification studies steadily declined from Group A to Group C (Figure 3b). This decrease in concentration was offset by capture of supporting and in vitro studies which increased in Groups B and C. This increase in supporting and mechanistic studies is not surprising, as these studies use language similar to that found in hazard identification studies.

Evaluating Recall of Supervised Clustering Strategy

The pilot study results suggest that using chemical-independent seed studies for supervised clustering can effectively identify relevant studies for hazard identification. This is demonstrated by the fact that we identified 1,300 relevant studies with a lower percentage of false positives than would be expected from a traditional comprehensive keyword search (Belter 2016; Cohen et al. 2006; Golder et al. 2006). This increased efficiency may be useful for scoping literature in risk assessments. However, by manually evaluating only a subset of the total literature search for relevancy, it is not possible to quantify how effectively our approach identifies relevant studies and screens out non-relevant studies from a large literature database.

To demonstrate the utility of using an externally-derived, chemical-independent seed set, we evaluated our approach on two published IRIS toxicological reviews using simulations. For each simulation we calculated recall and elimination rate. Recall indicates the percent of relevant studies that were captured using the method (e.g., 80% recall means that 20% of the relevant studies were missed). Elimination rate indicates the percent of non-relevant studies that were excluded without manual review and is an indication of efficiency or time saved. The toxicological reviews for n-butanol and RDX were selected because screening results were available in US EPA’s Health and Environmental Research Online (HERO) database and each has low search precision (i.e., a high number of false positives and few relevant results). RDX is a small database with approximately 1,600 studies and n-butanol is moderate in size with approximately 5,900 total studies (Table 2). In both instances, search precision is less than 1%. Search results with few relevant studies represent a scenario in which gathering appropriate training data may be cost-prohibitive and a general, chemical-independent seed set can provide greater efficiency. The HERO database provides both the complete literature search results and the studies identified as relevant for hazard identification. These literature lists provide both known relevant and known non-relevant studies for evaluating our approach.

Table 2:

Recall and elimination rate for two simulations using supervised clustering and externally-derived, chemical-independent training data.

Assessment   Total Studies   Actual Recall   Predicted Recall   Elimination Rate
n-Butanol    5,955           83%             99%                77%
RDX          1,609           77%             99%                79%

Actual recall indicates the percent of known relevant hazard identification studies retrieved using supervised clustering. Predicted recall indicates DoCTER prediction of recall. Elimination rate refers to the total number of true negatives (i.e., not relevant studies) that did not require manual review.
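The two metrics reduce to simple ratios. A minimal sketch with illustrative counts (not the paper's raw screening data):

```python
def recall(relevant_captured, relevant_total):
    """Fraction of known-relevant studies retrieved by supervised clustering."""
    return relevant_captured / relevant_total

def elimination_rate(true_negatives, not_relevant_total):
    """Fraction of non-relevant studies excluded without manual review."""
    return true_negatives / not_relevant_total

# Hypothetical corpus: 5,955 studies of which 50 are relevant (precision < 1%);
# clustering captures 41 of them and screens out 4,550 of 5,905 non-relevant ones.
print(f"recall: {recall(41, 50):.0%}")                     # recall: 82%
print(f"elimination: {elimination_rate(4550, 5905):.0%}")  # elimination: 77%
```

Note that recall is computed only over the known-relevant studies from HERO, while the elimination rate is computed only over the known non-relevant ones, so the two denominators differ.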

As in the in utero pilot study, we developed a set of chemical-independent seed studies using hazard identification sections of IRIS assessments. For this follow-up study, we used hazard identification studies identified in HERO for the chromium (VI), inorganic arsenic, tetrachloroethane, and benzo(a)pyrene assessments. We ran simulations on two sets of search results (n-butanol and RDX) using a nine-algorithm ensemble approach to group studies for relevance. These simulations used Latent Dirichlet Allocation (LDA) in addition to the k-means and NMF algorithms used in the pilot study. Using chemical-independent seed studies, we achieved recall rates of 83% and 77% for n-butanol and RDX, respectively (Table 2). Preliminary efforts to improve recall for the simulations of n-butanol and RDX were successful. Specifically, using machine learning to prioritize a subset of studies not predicted relevant by supervised clustering improved recall to over 93% for both n-butanol and RDX (unpublished observations). These results demonstrate that supervised clustering using chemical-independent seed studies is effective in identifying relevant studies in search results, even for literature databases with a very low number of relevant studies.

Discussion

Systematic literature review is a well-established method to collect and summarize empirical evidence. A critical aspect of systematic review is identification of relevant literature. Literature search results for human health risk assessments can be large (hundreds to thousands of articles), covering a range of hazards and exposure scenarios and often have very low search precision. Typically, literature search results require manual review to identify relevant studies, which is time and resource intensive. We developed a literature search and study selection strategy using text analytics that reduces the time and cost associated with developing training data. This approach uses semi-supervised machine learning with externally derived, chemical-independent training data in an ensemble approach to predict relevant literature and reduce the volume of literature requiring manual review.

Limitations of our Approach

We conducted a traditional literature search in PubMed using keywords to identify epidemiologic studies of in utero chemical exposures. This search retrieved over 50,000 potentially relevant references. We developed and applied a novel “chemical agnostic” approach to identify studies relevant to hazard identification in human health risk assessments. Using US EPA IRIS assessments, we developed a set of chemical-independent seed studies by removing the chemical name from the titles and abstracts of hazard identification studies from IRIS assessments. Initial supervised clustering using chemical agnostic seeds was effective in identifying relevant studies, as demonstrated by a precision rate of approximately 50%. We recognize that reviewing all search results would improve the recall of relevant studies; however, this reduction in manual screening represents approximately 1,700 hours of time saved. This time savings assumes 1 minute/study on average and 2 reviewers/study, estimates based on the authors’ experience manually screening literature for various types of regulatory risk assessments. This approach was intended to scope literature across a variety of endpoints and environmental exposures, not to systematically identify all available data. As a literature scoping approach, we suggest it is unlikely that manually reviewing additional studies would identify outcomes not already represented in the prioritized studies.
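The time-savings estimate above follows directly from the stated assumptions. A back-of-envelope check:

```python
# Check of the ~1,700-hour estimate under the stated assumptions:
# 1 minute per study per reviewer, 2 reviewers per study.
total_search_results = 54_134  # candidate references from the keyword search
manually_screened = 2_602      # Groups A, B, and C

minutes_saved = (total_search_results - manually_screened) * 2 * 1
print(f"~{minutes_saved / 60:,.0f} hours saved")  # ~1,718 hours saved
```

The result (about 1,718 hours) is consistent with the "approximately 1,700 hours" figure in the text.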

Our pilot study of supervised clustering performed well despite using seed studies not specifically designed to identify hazards of in utero exposures to environmental chemicals. A more precise application of this approach could use seed sets for the specific health outcomes expected to be found in the unclassified corpus. For instance, using generalized seed studies from US EPA’s Integrated Science Assessments (ISAs) may provide a better approach to identifying effects associated with air pollution. Further, while selecting seeds based on US EPA IRIS assessments may bias toward studies with human health effects rather than ecological impacts, this chemical-independent approach could be applied to any chemical, exposure, and/or endpoint literature search results for hazard identification.

Our results were obtained even with selecting clusters that captured only 66% of the total seed set (Table 1), likely because we included seed studies external to the search results and included more studies than typically required. Varghese et al. (2018) demonstrated that as few as 50 seeds are sufficient, whereas we included 1,456 seeds. Higher recall could be obtained by including additional clusters to raise the total number of seeds captured. In this study we employed an ensemble approach using three different numbers of clusters (i.e., 10, 20, and 30) to maximize recall. Additional work could also consider how to define an optimal number of clusters to maximize both recall and precision. In addition, our literature search for this pilot project was limited to publications in PubMed and restricted to approximately 50,000 studies. Identification of relevant studies could be improved by expanding the literature search dataset to additional databases and including more studies.

Table 1:

Percent of seeds captured and number of studies returned by algorithm.

Algorithm      Number of Clusters Used   Percent of Seeds Captured   Number of Studies to Screen^a
KM-10          1                         58%                         4,157
NMF-10         1                         44%                         2,829
KM-20          1                         54%                         3,011
NMF-20         2                         57%                         5,145
KM-30          1                         31%                         1,249
NMF-30         2                         38%                         3,455
Total Unique                             66%                         6,691

^a A given study might appear more than once. KM = k-means algorithm, NMF = nonnegative matrix factorization; numbers in the algorithm column indicate number of clusters (e.g., KM-10 = k-means algorithm with 10 clusters, KM-20 = k-means algorithm with 20 clusters).

Simulations using two independent scenarios (IRIS Toxicological Reviews of n-butanol and RDX) demonstrated the efficacy of an externally-derived, chemical-independent seed set to identify relevant literature in an unclassified corpus. We achieved recall of 77 to 83% in these two scenarios; however, we note that the predicted and actual recall differed substantially (Table 2) and that recall rates below 90% are often unacceptable. Preliminary results indicate that machine learning algorithms may be used to improve overall recall. Given that we eliminated approximately 75% of results from manual screening, a significant portion of studies are likely to be excluded without manual review following a second pass. For both simulations the overall database was small to moderate in size and had low search precision. Validating our approach in other scenarios (e.g., larger databases with higher precision) may provide additional insight.

In this study, externally derived training data were relatively easy to acquire and convert to generic studies (i.e., by removing chemical names and synonyms). While application of this approach need not be limited to chemical risk assessments, its utility relies on access to readily available generic training data and is therefore best suited to scenarios similar to previous literature searches whose results have been manually classified. Search updates for systematic reviews are an obvious application, and Shekelle et al. (2017) demonstrated the promise of this method for reducing manual screening without significantly sacrificing recall of relevant literature.
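The conversion step mentioned above, removing chemical names and synonyms so that seed abstracts describe effects rather than specific chemicals, can be sketched as a simple term-masking pass. This is an assumed illustration, not the authors' code; the function name `make_generic` and the example term list are hypothetical:

```python
import re

def make_generic(text, chemical_terms):
    """Strip chemical names/synonyms (case-insensitive, whole-word) from a
    seed abstract, yielding a chemical-independent training document."""
    # Replace longer terms first so multi-word synonyms are removed whole.
    for term in sorted(chemical_terms, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(term) + r"\b", " ", text,
                      flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()
```

After masking, the remaining text (exposure route, lifestage, health effect) is what drives the text-similarity clustering, which is what makes the seed set reusable across chemicals.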

Impact

A scoping method was developed for evaluating literature search results that relies on automated methods, including text analytics. Semi-supervised learning with an ensemble approach offers a way to prioritize broad literature search results for subsequent manual review. Such an application might be useful for scoping human health risk assessments with large literature databases and for evaluating literature across multiple chemicals or exposures for cumulative risk. We demonstrate that relevant studies in an unclassified corpus can be identified using externally derived training data. This approach may be a suitable solution in a variety of scenarios, particularly when training data are difficult to locate, as in the case of low-precision literature searches.

Machine learning techniques afford the ability to assess a large set of literature search results quickly by prioritizing a subset for manual review. Reviewing all results from a more narrowly designed literature search conceivably could have returned fewer relevant studies than we were able to discover with this novel approach.

Supplementary Material

1

Footnotes

Disclaimer: The views expressed in this document are those of the authors, and do not represent the official views or policies of the US EPA.

References

  1. Aphinyanaphongs Y; Tsamardinos I; Statnikov A; Hardin D; Aliferis CF. Text categorization models for high-quality article retrieval in internal medicine. Journal of the American Medical Informatics Association 2005;12:207–216.
  2. Bekhuis T; Demner-Fushman D. Screening nonrandomized studies for medical systematic reviews: a comparative study of classifiers. Artificial Intelligence in Medicine 2012;55:197–207.
  3. Belter CW. Citation analysis as a literature search method for systematic reviews. Journal of the Association for Information Science and Technology 2016;67:2766–2777.
  4. Birnbaum LS; Thayer KA; Bucher JR; Wolfe MS. Implementing systematic review at the National Toxicology Program: status and next steps. Environmental Health Perspectives 2013;121:a108–a109.
  5. Chapelle O; Scholkopf B; Zien A. Semi-supervised learning. Cambridge: MIT Press; 2006.
  6. Chen A; Park JS; Linderholm L; Rhee A; Petreas M; DeFranco EA; Dietrich KN; Ho SM. Hydroxylated polybrominated diphenyl ethers in paired maternal and cord sera. Environ Sci Technol 2013a;47:3902–3908.
  7. Chen MH; Ha EH; Liao HF; Jeng SF; Su YN; Wen TW; Lien GW; Chen CY; Hsieh WS; Chen PC. Perfluorinated compound levels in cord blood and neurodevelopment at 2 years of age. Epidemiology 2013b;24:800–808.
  8. Cohen AM; Hersh WR; Peterson K; Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association 2006;13:206–219.
  9. Eskenazi B; Chevrier J; Rauch SA; Kogut K; Harley KG; Johnson C; Trujillo C; Sjodin A; Bradman A. In utero and childhood polybrominated diphenyl ether (PBDE) exposures and neurodevelopment in the CHAMACOS study. Environ Health Perspect 2013;121:257–262.
  10. Eskenazi B; Rosas LG; Marks AR; Bradman A; Harley K; Holland N; Johnson C; Fenster L; Barr DB. Pesticide toxicity and the developing brain. Basic Clin Pharmacol Toxicol 2008;102:228–236.
  11. European Food Safety Authority (EFSA). Application of systematic review methodology to food and feed safety assessments to support decision making. EFSA Journal 2010;8:1–90.
  12. Frunza O; Inkpen D; Matwin S; Klement W; O'Blenis P. Exploiting the systematic review protocol for classification of medical abstracts. Artificial Intelligence in Medicine 2011;51:17–25.
  13. Golder S; McIntosh HM; Duffy S; Glanville J. Developing efficient search strategies to identify reports of adverse effects in medline and embase. Health Information & Libraries Journal 2006;23:3–12.
  14. Institute of Medicine (IOM); Eden J; Levit L; Berg A; Morton S. Finding what works in health care: standards for systematic reviews. Washington, DC: Institute of Medicine; 2011.
  15. Jonnalagadda S; Petitti D. A new iterative method to reduce workload in systematic review process. Int J Comput Biol Drug Des 2013;6:5–17.
  16. Nadeau KC; Li Z; Farzan S; Koestler D; Robbins D; Fei DL; Malipatlolla M; Maecker H; Enelow R; Korrick S; Karagas MR. In utero arsenic exposure and fetal immune repertoire in a US pregnancy cohort. Clin Immunol 2014;155:188–197.
  17. National Academy of Sciences (NAS). Progress toward transforming the Integrated Risk Information System (IRIS) program: a 2018 evaluation. Washington, DC: The National Academies Press; 2018.
  18. National Research Council (NRC). Critical aspects of EPA's IRIS assessment of inorganic arsenic - interim report. Washington, DC: National Research Council; 2013.
  19. National Research Council (NRC). Review of EPA's Integrated Risk Information System (IRIS) process. Washington, DC: National Research Council; 2014.
  20. Nye MD; Fry RC; Hoyo C; Murphy SK. Investigating epigenetic effects of prenatal exposure to toxic metals in newborns: challenges and benefits. Med Epigenet 2014;2:53–59.
  21. O'Mara-Eves A; Thomas J; McNaught J; Miwa M; Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev 2015;4:5.
  22. Perera FP; Li Z; Whyatt R; Hoepner L; Wang S; Camann D; Rauh V. Prenatal airborne polycyclic aromatic hydrocarbon exposure and child IQ at age 5 years. Pediatrics 2009;124:e195–202.
  23. Perera FP; Tang D; Wang S; Vishnevetsky J; Zhang B; Diaz D; Camann D; Rauh V. Prenatal polycyclic aromatic hydrocarbon (PAH) exposure and child behavior at age 6–7 years. Environ Health Perspect 2012;120:921–926.
  24. Recio-Vega R; Gonzalez-Cortes T; Olivas-Calderon E; Lantz RC; Gandolfi AJ; Gonzalez-De Alba C. In utero and early childhood exposure to arsenic decreases lung function in children. J Appl Toxicol 2015;35:358–366.
  25. Rooney AA; Boyles AL; Wolfe MS; Bucher JR; Thayer KA. Systematic review and evidence integration for literature-based environmental health science assessments. Environmental Health Perspectives 2014;122:711–718.
  26. Ryan L. Combining data from multiple sources, with applications to environmental risk assessment. Stat Med 2008;27:698–710.
  27. Shekelle PG; Shetty K; Newberry S; Maglione M; Motala A. Machine learning versus standard techniques for updating searches for systematic reviews: a diagnostic accuracy study. Annals of Internal Medicine 2017;167:213–215.
  28. Shemilt I; Simon A; Hollands GJ; Marteau TM; Ogilvie D; O'Mara-Eves A; Kelly MP; Thomas J. Pinpointing needles in giant haystacks: use of text mining to reduce impractical screening workload in extremely large scoping reviews. Research Synthesis Methods 2014;5:31–49.
  29. Steinmaus C; Ferreccio C; Acevedo J; Yuan Y; Liaw J; Duran V; Cuevas S; Garcia J; Meza R; Valdes R; Valdes G; Benitez H; VanderLinde V; Villagra V; Cantor KP; Moore LE; Perez SG; Steinmaus S; Smith AH. Increased lung and bladder cancer incidence in adults after in utero and early-life arsenic exposure. Cancer Epidemiol Biomarkers Prev 2014;23:1529–1538.
  30. US EPA. Concepts, methods and data sources for cumulative health risk assessment of multiple chemicals, exposures and effects: a resource document. Washington, DC: US Environmental Protection Agency; 2007.
  31. US EPA. Materials submitted to the National Research Council Part I: status of implementation of recommendations. Washington, DC: US Environmental Protection Agency; 2013.
  32. Varghese A; Cawley M; Hong T. Supervised clustering for automated document classification and prioritization: a case study using toxicological abstracts. Environment Systems and Decisions 2018;38:398–414.
  33. Vested A; Ramlau-Hansen CH; Olsen SF; Bonde JP; Stovring H; Kristensen SL; Halldorsson TI; Rantakokko P; Kiviranta H; Ernst EH; Toft G. In utero exposure to persistent organochlorine pollutants and reproductive health in the human male. Reproduction 2014;148:635–646.
  34. Wallace BC; Trikalinos TA; Lau J; Brodley C; Schmid CH. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics 2010;11:55.
  35. Wasserman GA; Liu X; Loiacono NJ; Kline J; Factor-Litvak P; van Geen A; Mey JL; Levy D; Abramson R; Schwartz A; Graziano JH. A cross-sectional study of well water arsenic and child IQ in Maine schoolchildren. Environ Health 2014;13:23.
  36. Woodruff TJ; Sutton P. An evidence-based medicine methodology to bridge the gap between clinical and environmental health sciences. Health Aff (Millwood) 2011;30:931–937.
  37. Yorifuji T; Kato T; Kado Y; Tokinobu A; Yamakawa M; Tsuda T; Sanada S. Intrauterine exposure to methylmercury and neurocognitive functions: Minamata disease. Arch Environ Occup Health 2015;70:297–302.
  38. Zeilmaker MJ; Hoekstra J; van Eijkeren JC; de Jong N; Hart A; Kennedy M; Owen H; Gunnlaugsdottir H. Fish consumption during child bearing age: a quantitative risk-benefit analysis on neurodevelopment. Food Chem Toxicol 2013;54:30–34.
