Author manuscript; available in PMC: 2020 Sep 1.
Published in final edited form as: Curr Epidemiol Rep. 2019 Jul 15;6(3):291–299. doi: 10.1007/s40471-019-00205-5

Data Science in Environmental Health Research

Christine Choirat 1, Danielle Braun 2,3, Marianthi-Anna Kioumourtzoglou 4,*
PMCID: PMC6853613  NIHMSID: NIHMS1534718  PMID: 31723546

Abstract

Purpose of Review

Data science is an exploding trans-disciplinary field that aims to harness the power of data to gain information or insights on researcher-defined topics of interest. In this paper we review how data science can help advance environmental health research.

Recent Findings

We discuss the concepts of computationally scalable handling of Big Data and the design of efficient research data platforms, and how data science can provide solutions for methodological challenges in environmental health research, such as high-dimensional outcomes and exposures, and prediction models. Finally, we discuss tools for reproducible research.

Summary

In this paper we present opportunities to improve environmental research capabilities by embracing data science, and the pitfalls that environmental health researchers should avoid when employing data scientific approaches. Throughout the paper, we emphasize the need for environmental health researchers to collaborate more closely with biostatisticians and data scientists to ensure robust and interpretable results.

Keywords: Data Science, Big Data, Environmental Health Research, Reproducibility, Environmental Mixtures, High-Dimensional, Research Data Platforms

Introduction

Data science is an exploding trans-disciplinary field that aims to harness the power of data to gain information or insights on researcher-defined topics of interest. Although multiple definitions for data science (and what the field entails) exist, the main theme shared across all is the dimensionality of the data (also known as “Big Data”). In this paper, we describe the role of data science in environmental health research, opportunities to greatly improve our research capabilities by embracing data science, and the pitfalls that environmental health researchers should avoid when employing data scientific approaches. We strongly emphasize throughout the paper that as useful and as appealing as such approaches might appear, for the most part they were developed in other fields, e.g., in machine learning, artificial intelligence, and computer science. The goals of these fields can be quite different from the overarching aim of environmental health research, namely to understand the biological pathways of toxicity of environmental factors, and to inform maximally efficient interventions and regulations to promote physical and mental well-being, protect the public and prevent disease. Blindly borrowing and applying data scientific methods, thus, might not always be appropriate. It is crucial that environmental health researchers first develop a well-framed research question and subsequently choose the appropriate method(s) for analysis. Although there is a lot of interest in applying these methods for hypothesis generation, we argue that even then careful consideration and choice of analytic approaches are important to encourage reproducibility and efficient use of time and resources. Overall, when using data scientific methods, careful adaptation and potential extensions to better fit the specific research question should be prioritized. This can be facilitated by (a) close collaboration of the environmental health researchers with biostatisticians and data scientists when developing the analysis plan; and (b) appropriate training for environmental health researchers.

Blei and Smyth [1••] in a recent paper discuss the importance of incorporating three distinct and critical components in data scientific research: statistical, computational, and human. The statistical component refers to the development and use of robust methods that can accommodate complex, structured and high-dimensional data with an emphasis on systematic characterization of uncertainty and causal inference. The computational component refers to scalable and efficient algorithmic implementations of the statistical methods and solutions to balance statistical accuracy even when computational resources are limited [2]. The third component—and arguably the most important for environmental health research—is the human perspective. This component incorporates expert knowledge in the field, study design, data collection, definition of the research question and subsequent choice of the appropriate statistical methods to address the question, and finally interpretation and communication of the results. Each of these components—statistical, computational, human—is crucial to achieve the aims of environmental health research. Nonetheless, it is the complementary and integrated use of all three that will allow for a holistic incorporation of data science in environmental health research, and consequently allow for research that is efficient, high-dimensional, incorporates complex data structures, and that is ultimately interpretable, highly impactful and actionable.

In this paper, we discuss the main topics in which data scientific practices can help elevate environmental health research, and discuss pitfalls to be avoided. First, we present tools for handling Big Data, both in terms of computational scalability, as well as how to design efficient research platforms. Second, we discuss methodological challenges in environmental health research—that is becoming increasingly high-dimensional—for which adaptation of data scientific methods would be especially beneficial. Next, we highlight the importance of reproducible and replicable research and the necessary tools to achieve this. Finally, we make some recommendations on the training of future generations of environmental health researchers for conscientious adaptation of data scientific methods in our field. This paper is not a systematic literature review of a topic in environmental health (e.g., the review of air pollution effects on gynecological outcomes [3]) nor a review of a group of methods to address a specific research question (e.g., a review on methods to assess exposures to environmental mixtures in health studies [4•]). Rather, we aim to review the benefits of adapting the practices of one field, Data Science, into a different one, Environmental Health, providing examples of data scientific practices throughout and a glossary of terms in Table 1.

Table 1.

Glossary of Terms

Term: Definition
Central processing unit (CPU): The central processor of the computer in which instructions are executed
Cloud computing: Data centers available to many users over the Internet providing on-demand availability of data storage and computing power
Cloud-service provider: A company offering infrastructure, network services, and applications in the cloud
Codebase: The complete source code required to build a specific software package, software or application
Continuous integration: The practice of integrating (automatically building, testing, and validating) the code of multiple developers into a shared repository frequently, i.e., multiple times a day
Distributed system: A system of autonomous components that are located across different computers of a network
File system: The system controlling data storage, retrieval, and file organization
Findability: A property of a specific dataset enabling data and meta-data to be easily findable for both humans and computers
Graphics processing unit (GPU): A programmable processor that performs rapid memory manipulations with a high degree of parallelism, making it more efficient than a CPU for highly parallel workloads
Interoperability: Property of a specific dataset enabling data to work in conjunction with applications and workflows for storage, processing and analysis
Jupyter Notebook: Open-source web application that enables the creation and sharing of documents that contain code, equations, visualizations and text
Logical drive: An area of computer storage with a single file system that operates independently with its own parameters and functions
Meta-data: Data providing descriptive, structural, statistical, and administrative information about a dataset
Parallel computing: The use of computing resources enabling simultaneous calculations or execution of processes
Reusability: Well documented description of data and meta-data to enable reuse and replication
Software stack: A set of software components required to create a complete platform, e.g. an application
Virtualization: The process of creating virtual (instead of actual) versions of storage devices, applications, servers and networks

Big Data

Computational Scalability

Datasets used in environmental health are becoming increasingly large. For example, air pollution prediction at fine spatio-temporal resolution requires hundreds of terabytes of inputs. Mobile health components are also increasingly added to large-scale studies, and hundreds of gigabytes of global positioning system (GPS), step count, and heart rate data are routinely collected for the study participants. The emerging fields of exposomic and metabolomic research [5, 6] require massive amounts of data about drugs, diet, infectious agents, and pollutants to which an individual is exposed over time.

A recent example of a health study using a massive dataset is that of Di et al. [7]; the authors leveraged information on more than 60 million Medicare enrollees over 13 years (>460 million person-years of follow-up) to investigate the association between long-term exposure to air pollution and mortality. In one of their sub-analyses, the authors reported that fitting a mixed-effects Cox model was computationally infeasible. They instead divided the full dataset randomly into 50 groups, repeated the analysis in each group separately, and subsequently pooled group-specific estimates using a fixed-effects meta-analysis. This example highlights that some of the existing computational and statistical approaches cannot accommodate the increasing size of the datasets used in environmental health studies.
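The divide-and-pool strategy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code of Di et al. [7]: the per-group "model" is a toy sample mean standing in for the Cox fit, the data are simulated, and the names `split_randomly`, `mean_and_se` and `fixed_effects_pool` are hypothetical. Only the pooling step, inverse-variance weighting of group-specific estimates, corresponds to the fixed-effects meta-analysis mentioned in the text.

```python
import math
import random


def split_randomly(records, n_groups, seed=0):
    """Shuffle the records and deal them into n_groups roughly equal groups."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


def mean_and_se(values):
    """Toy stand-in for a per-group model fit: sample mean and its standard error."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def fixed_effects_pool(estimates, std_errors):
    """Inverse-variance (fixed-effects) pooling of group-specific estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Simulated data (assumption): 50,000 draws around a true effect of 0.07.
data_rng = random.Random(42)
data = [data_rng.gauss(0.07, 1.0) for _ in range(50_000)]

groups = split_randomly(data, n_groups=50)
fits = [mean_and_se(group) for group in groups]
pooled, pooled_se = fixed_effects_pool([f[0] for f in fits], [f[1] for f in fits])
print(f"pooled estimate = {pooled:.4f}, standard error = {pooled_se:.4f}")
```

Because each group is fitted independently, the per-group step is also a natural candidate for parallel execution across cores or machines.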

Big Data raise two types of challenges. First, conventional file systems (such as those of personal computers) have hard limits in terms of maximum size of logical drive, number of files per drive, and single file size. For instance, a file system can fall short because a single file is too large or because there are too many files on the same drive. Second, massive data availability opens new research questions that demand more sophisticated statistical and computational methods. Therefore, algorithms have to be fine-tuned or even re-implemented to be run in a parallel manner on a distributed file system, i.e., on data that are possibly stored across a network of machines and storage drives accessed as if they were on the local computer of the user.

New hardware and software developments are fundamental assets in addressing these challenges. The combination of cheaper storage and central processing unit (CPU) compute resources, graphics processing units (GPUs), and open-source and user-friendly software stacks, such as R interfaces to parallel computing libraries or to Spark on the Apache Hadoop Distributed File System (HDFS) [8], makes Big Data accessible to environmental health researchers.

In their publications, researchers do not always report the details of the hardware and software used for their analyses—other than just the name and version of the software used, e.g. “R statistical software version 3.6.0”—making it difficult to assess which approaches are currently being implemented to handle Big Data in Environmental Health. Some studies of air pollution modeling, nonetheless, have described in detail their use of massively parallel machines and their benefits [9, 10], and have reported greatly reduced computational time [11].

Research Platforms

When analyzing Big Data, it is a formidable asset to have access to an infrastructure that is sometimes called a Research Data Platform (RDP). Ideally, an RDP provides single-point access to the data even if data storage is actually distributed. However, many environmental health RDPs host individual-level health information, i.e. Protected Health Information. As a consequence, they have to be compliant with the appropriate regulations [12], e.g. the Health Insurance Portability and Accountability Act (HIPAA) in the US and the General Data Protection Regulation (GDPR) in the EU. Therefore, large-scale RDPs are best built and maintained by a multidisciplinary team of experts in data science, research computing, compliance, and information technology.

Recently, Patel et al. [15••] merged 255 separate files, integrating data from four National Health and Nutrition Examination Surveys (NHANES) on >41,000 individuals and more than a thousand variables, including information on demographics, physical exam and laboratory results and answers to questionnaires, accompanied by a data dictionary. Although NHANES data have been publicly available, interested researchers needed to go through multiple steps to obtain a final dataset ready for analysis. This powerful new RDP, therefore, will greatly facilitate environmental health research providing all information in a single-point access coupled with detailed data description. Another example of a successful RDP is that of the urban exposome used in multiple studies in Europe (e.g. [13, 14]). Researchers from multiple groups compiled a comprehensive list of urban exposures experienced by approximately 30,000 women participating in six birth cohorts. So far, this RDP has enabled the investigation of how the urban exposome during pregnancy varies by socio-economic determinants [13] and its association with birth weight [14], and will continue supporting further research using these data.

For scalability, and in some cases for cost-efficiency over more traditional high-performance computing solutions, it is sometimes easier to leverage cloud-computing resources for hosting RDPs. Over the past few years, several major cloud-service providers, most notably AWS (Amazon Web Services) and GCP (Google Cloud Platform) became HIPAA-compliant, and can be configured to be GDPR-compliant as well, with the potential of standardizing collaboration procedures for US-EU research projects [16]. This can be especially advantageous for environmental health research, allowing researchers to pull information across multiple cohorts and continents.

Although reporting the use of services like AWS for analysis or data storage in a paper is neither required nor standard practice, use of these cloud services is becoming increasingly prevalent. For example, a recent paper by Madhyastha et al. [17] describes some of the best practices for using AWS to execute neuroimaging analyses in the cloud. In addition, the National Institute of Allergy and Infectious Diseases has developed a cloud-based microbiome data analysis platform with standardized pipelines, called Nephele, that is built on the AWS cloud and has a simple web interface for “transforming raw data into biological insights” [18].

Methodological Challenges

Prediction Models for Exposure Assessment

For certain exposures and study designs, exposure assessment at the personal level might not be feasible, especially for long periods of time. In such cases, and whenever possible, environmental health researchers have developed prediction models to facilitate exposure assessment. Air pollution is one such example, on which we will focus in this section, but it is not the only one (e.g., [19–21]). Exposure prediction is one field in environmental health research in which direct implementation of data scientific tools is most straightforward, since many methods have been developed to maximize predictive accuracy.

Many air pollution prediction models have been developed to date to assess exposure to air pollutants even in locations where monitoring stations are sparse. These models tend to fuse information across multiple data sources, including, but not limited to, concentrations of pollutants measured at monitoring stations, meteorologic factors, land use variables, emission sources, remote sensing, and outputs from chemical transport models. The first spatio-temporal air pollution prediction models that were developed for exposure assessment relied mostly on regression analyses, including geographically weighted and generalized additive models [22–25]. More recently developed air pollution prediction models have embraced machine learning methods to improve their predictive accuracy. Methods that have been used include random forests [26] and neural networks [27]. Combining information from multiple prediction models leads to more accurate predictions [28, 29]; ensemble learning methods, thus, have also been used to predict exposures to air pollutants [30, 31]. Finally, a recent method developed by Hong et al. [32] fuses satellite images at two different zoom levels with a land use regression model in a deep convolutional neural network for predicting very finely resolved ultrafine particle concentrations.

Although use of machine learning methods has greatly improved the predictive accuracy of exposure assessment models, to date, almost none of the developed models provide information on the fully characterized spatio-temporal uncertainty associated with the predictions, with only few exceptions [33]. Failure to characterize the prediction-related uncertainty and propagate it into the health effect estimates may result in invalid inferences, and potentially spurious findings [34, 35]. Only very recently, new ensemble learning methods have emerged that use adaptive weights, i.e., weighing each input by its spatio-temporal predictive accuracy, and comprehensively characterizing the spatio-temporal uncertainty related with each prediction point in space and time [36, 37].
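To make the idea of accuracy-based weighting concrete, the sketch below combines predictions from several models using weights inversely proportional to each model's mean squared error on held-out monitor data. This is an illustrative simplification and not the method of [36, 37]: it computes one set of weights for a single validation set, does not characterize uncertainty, and the model names, numbers and function names are hypothetical. A spatio-temporally adaptive version would, under the same logic, recompute the weights separately for each region or time window.

```python
def inverse_mse_weights(observed, predictions_by_model):
    """Return normalized weights proportional to 1 / MSE for each model.

    observed: list of held-out measurements.
    predictions_by_model: dict mapping model name -> list of predictions
        aligned with `observed`.
    """
    inverse_errors = {}
    for name, preds in predictions_by_model.items():
        mse = sum((p - o) ** 2 for p, o in zip(preds, observed)) / len(observed)
        # Guard against division by zero for a model that fits perfectly.
        inverse_errors[name] = 1.0 / max(mse, 1e-12)
    total = sum(inverse_errors.values())
    return {name: value / total for name, value in inverse_errors.items()}


def ensemble_predict(weights, new_predictions):
    """Weighted average of the models' predictions at a new location/time."""
    return sum(weights[name] * new_predictions[name] for name in weights)


# Hypothetical validation data for one region (values are made up).
observed = [10.0, 12.0, 14.0]
validation_predictions = {
    "random_forest": [11.0, 13.0, 15.0],   # off by 1 everywhere -> MSE 1
    "neural_network": [12.0, 14.0, 16.0],  # off by 2 everywhere -> MSE 4
}

weights = inverse_mse_weights(observed, validation_predictions)
prediction = ensemble_predict(weights, {"random_forest": 20.0, "neural_network": 25.0})
print(weights, prediction)
```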

Data scientific and machine learning approaches have greatly helped improve air pollution exposure assessment, and in general the development of exposure models with excellent predictive accuracy. Nonetheless, these successes require close collaboration of environmental engineers, epidemiologists, biostatisticians and computer scientists, to achieve the development of sound fundamental methods for exposure prediction and subsequent use in health studies.

The Curse of Dimensionality

The phrase “curse of dimensionality” refers to situations in which, because the space is very high-dimensional, (almost all) data regions necessarily remain sparse, providing limited support for the results. This phenomenon has important implications for assessing exposure to environmental mixtures and adequately adjusting for confounding in these high-dimensional spaces. In this section, we slightly simplify this term to refer to the investigation of associations with high-dimensional outcomes, exposures, and confounders in environmental health research.
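A small calculation illustrates why sparsity is unavoidable. Suppose, purely as an assumption for illustration, that each variable is discretized into 10 bins, so that d variables define 10^d cells. With n observations, at most n cells can contain any data, so the occupied fraction is bounded by n / 10^d. The function name and the chosen sample size below are hypothetical.

```python
def max_occupied_fraction(n_obs, bins_per_dim, n_dims):
    """Upper bound on the fraction of grid cells that can contain data."""
    n_cells = bins_per_dim ** n_dims
    return min(1.0, n_obs / n_cells)


# One million observations, 10 bins per variable, increasing dimension.
for n_dims in (1, 3, 6, 10, 20):
    fraction = max_occupied_fraction(1_000_000, 10, n_dims)
    print(f"{n_dims:>2} dimensions: at most {fraction:.2e} of cells can contain data")
```

Even a very large study therefore leaves almost every region of a moderately high-dimensional exposure or confounder space without observations.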

As the availability of environmental health variables increases, statistical models used for the data analysis have the potential of becoming increasingly complex. While many investigators focus on modeling one outcome at a time, there may be interest in modeling the exposure effect on multiple (and even a high-dimensional number of) outcomes simultaneously. For example, Bobb et al. [38] evaluated the effects of heat waves on risk of hospitalization, examining the effects for 214 different disease outcomes independently and applying a Bonferroni-Holm method to adjust the p-values for the multiple comparisons. Others have combined outcomes by defining pre-determined categories [39]. However, developing more sophisticated approaches to model associations simultaneously, and grouping and identifying outcomes that have similar effects, may lead to more powerful analysis. Such analyses could also help elucidate common biological pathways through which an environmental exposure can induce harm across multiple organs or systems. For example, Bayesian hierarchical models naturally allow for the estimation of exposure effects on multiple outcomes with an a priori defined hierarchical structure [40, 41]. One drawback of Bayesian analyses, nonetheless, is their high computational cost. Fast, approximate two-stage Bayesian approaches based on stochastic variational inference [42, 43] can be used to scale to the increasingly large datasets used in environmental health research.
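For readers unfamiliar with the Bonferroni-Holm adjustment mentioned above, the following is a minimal sketch of the standard Holm step-down procedure. It is not taken from Bobb et al. [38]; the function name and the example p-values are hypothetical. The smallest of m p-values is multiplied by m, the next by m - 1, and so on, with the adjusted values forced to be non-decreasing and capped at 1.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity: an adjusted p-value is never smaller than
        # the adjusted value of any smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Three hypothetical outcome-specific p-values.
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(adjusted)
```

An outcome is then declared significant at level 0.05 if its adjusted p-value is at most 0.05.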

In addition, as data from multiple sources become more readily available, the number of both exposures and confounders to be included in the model increases as well. While many analyses focus on one, or few exposures (for example, Di et al. [7] estimate the effects of PM2.5 and ozone exposure on mortality in Medicare), extending the current framework to account for multiple exposures is necessary. Confounding adjustment in the setting of multiple outcomes and exposures requires careful thought, as potential confounders can be associated with different outcomes and exposures. This complicated structure needs to be accounted for in the model. In the context of high-dimensional confounders, procedures for variable selection could be used [44–48]. Efficient and computationally scalable methods for variable selection, such as lasso and elastic net, can be adopted from other fields [49, 50]; however, expert knowledge on the outcome and exposure studied is necessary, as most available methods tend to select strong risk factors of the outcome. Environmental health researchers would need to ensure that the variables selected are associated with both the exposure and outcome (and are not just risk factors of the outcome, especially when the measure of association of interest is not collapsible [51, 52]), and that they are not in the causal pathway between exposure and outcome, i.e., that they are not mediators. There could also be scenarios in which the number of potential confounders is larger than the number of observations. Various causal inference approaches have been proposed to handle these settings [53–57]; for instance, Antonelli et al. [58] developed a doubly robust matching estimator for confounding adjustment when the number of potential confounders is large relative to the number of observations. 
Overall, as research questions evolve to consider more variables, careful thought should be given to the choice of statistical models to ensure that appropriate models are chosen. In addition, advanced methods for confounding adjustment for the most part cannot adjust for unmeasured confounding; sensitivity analyses, thus, should be conducted to assess the robustness of the results to the assumption of no unmeasured confounding [59, 60].
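As a reminder of what the lasso and elastic net mentioned above optimize, their standard penalized least-squares form for a continuous outcome can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original text: y_i is the outcome, x_i the vector of candidate covariates, n the number of observations, \lambda \ge 0 the penalty strength and \alpha \in [0, 1] the mixing parameter.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^{2}
\;+\; \lambda\Bigl[\alpha\,\lVert\beta\rVert_{1}
\;+\; \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_{2}^{2}\Bigr]
```

Setting \alpha = 1 gives the lasso and \alpha = 0 gives ridge regression. Because the criterion rewards only the prediction of the outcome, covariates are retained according to how strongly they predict y, which is consistent with the caution above that such methods tend to select strong risk factors of the outcome rather than confounders.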

Exposure to Mixtures in Health Analyses

Most epidemiological studies have traditionally only assessed exposure to one pollutant at a time, or adjusted for co-pollutant confounding. This, however, does not represent reality, as we are exposed to numerous chemicals simultaneously. For this reason, the US Environmental Protection Agency, the National Institute of Environmental Health Sciences, and the National Research Council all have recognized the need to assess exposure to mixtures. When doing so, the method of choice should depend on the research question [4•], i.e., does the researchers’ interest lie in identifying (a) specific patterns of exposure; (b) the toxic agents in the mixture; (c) potential synergism across mixture members; or (d) characterizing the overall effect of the mixture (or combinations of the above)? Traditional regression methods fail as the dimensionality of the mixture and the correlations among mixture members increase. Environmental epidemiologists, thus, have turned to data scientific and machine learning methods that have been developed for and can better deal with high-dimensional and complex data structures.

There have been many reviews discussing the different methods that have been either adopted from other fields or developed specifically for assessing exposure to mixtures [4•, 61–64]; we encourage the readers to read these reviews. Here, we would like to emphasize the careful incorporation of machine learning and data scientific methods with which environmental epidemiologists can be best positioned to assess exposure to complex mixtures in health studies, and specifically the importance of prioritizing the use of robust methods that yield interpretable results. As Gibson et al. [4•] concluded, results are only useful and can be actionable if they are robust, reproducible and interpretable. Therefore, it is important to not merely adopt methods from other fields, whose goal is to improve prediction accuracy, such as machine learning, but adapt and extend these methods, in collaboration with biostatisticians and data scientists, to best address the research question(s) of interest.

Reproducibility and Replicability

The terms replicability and reproducibility have definitions that may vary across disciplines. Here, we say that a study is replicated when new data are collected and analyzed independently by a new set of investigators, and that a study is reproduced when the same data are re-analyzed, independently by a new set of investigators [65••]. An example of a successfully reproduced study is the reanalysis of the Harvard Six Cities Study; a team of independent researchers was able to reproduce the findings of the original study and reach the same conclusion, i.e., that long-term air pollution exposure is related to mortality [66]. Peng [67] points out that reproducibility is a minimum standard for scientific claims to be assessed, and that there is a whole spectrum of reproducibility: from none (where a publication does not include data, code, or even instructions) to full (where a publication is linked to executable code and data).

Reproducibility is articulated around the pillars of data, code and compute environments. The FAIR data principles [68••] set standards for findability, accessibility, interoperability, and reusability. These principles structure best practices in data sharing and allow for citation and proper attribution. Academic journals are increasingly requiring authors to deposit their data in repositories that abide by the FAIR principles. Social coding platforms such as GitHub offer a set of collaborative tools based on the git software to provide version control for code combined with fine-grained access control policies, and promote transparency. For example, the authors of a recent study to assess population exposures to power plant emissions across the US, using advanced parallel computing, made all code publicly available on GitHub [69].

Reproducibility of compute environments has gained a lot of interest lately [70, 71]. Statistical software systems such as R include thousands of packages that are contributed by the community and updated asynchronously, which may break an existing codebase and make rolling back to the computing environment used for a specific set of analyses challenging. Other communities, in particular in the information technology (IT) world where software is deployed in production and updates should not break the system, recently created solutions that rely on virtualization. One such example is Docker, which is a game-changer, as it allows for the creation of images on top of operating systems. These images make it possible to permanently encapsulate a compute environment, such as a specific version of R and all the packages that were used for a given publication. They can even include the corresponding code and data, and their respective version histories. Docker images can be archived, shared and reused. Several platforms provide Docker-based solutions that users can simply run in the browser with RStudio or Jupyter Notebook interfaces, e.g. Code Ocean [72], oriented towards academic publishing, or binder [73], Renku [74] and The Whole Tale [75] that aims to provide a holistic experience across the life-cycle of a Big Data project. In fact, recently some Nature journals have started a trial partnership with Code Ocean “to enable authors to share fully-functional and executable code accompanying their articles and to facilitate peer review of code by the reviewers” [76]. Best practices from software engineering such as continuous integration are easy to incorporate and allow for performing real-time quality checks of the data science workflows.
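A much lighter-weight complement to a full Docker image is simply to record the software environment alongside the results. The sketch below is an assumption-laden illustration in Python rather than R (the language the paragraph uses as its example), and the function name `environment_manifest` is hypothetical. It collects the interpreter version, the platform and the versions of all installed packages into a dictionary that can be saved as JSON next to the analysis output; it documents an environment but, unlike an image, does not recreate it.

```python
import json
import platform
from importlib import metadata


def environment_manifest():
    """Collect interpreter, platform and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            # Skip installations whose metadata cannot be read.
            continue
        if name:
            packages[name] = version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


manifest = environment_manifest()
# Serialize so the record can be archived with the code and results.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json[:200])
```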

Another challenge is the difficulty of embracing reproducible open science with data that cannot be shared, such as patient-level cohort data or health insurance claims. Recent developments in privacy-preserving techniques can be leveraged, e.g. through cryptographic methods of encryption or anonymization methods such as differential privacy [77]. However, even without these methods, there are situations when data sharing is not possible; still, there is a lot to be learned from just code sharing, and the FAIR principles do not require data to be open if suitable meta-data are provided.
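As one concrete instance of the differential privacy idea mentioned above, the sketch below applies the Laplace mechanism to a count query. It is a generic textbook illustration under stated assumptions and is not drawn from [77]: the function names, the example count and the privacy parameter are hypothetical. For a count, adding or removing one individual changes the answer by at most 1 (the sensitivity), and adding Laplace noise with scale sensitivity / epsilon yields an epsilon-differentially private release.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
# Hypothetical example: number of hospital admissions in one area.
released = private_count(true_count=100, epsilon=1.0, rng=rng)
print(f"released count: {released:.2f}")
```

Smaller values of epsilon give stronger privacy guarantees at the cost of noisier released counts.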

By making code and other replication material available, scientists voluntarily make their work more easily amenable to criticism. Re-analyses, nonetheless, should be performed in good faith and can be a powerful tool to indirectly encourage more careful and robust analyses.

Training

As the environmental health field evolves, future environmental health researchers will need to be well-versed in data science. To date, training in environmental health focuses mainly on exposure assessment, epidemiology, and (for the most part) “traditional” statistical methods. However, as detailed above, future environmental health researchers will need to also excel in novel data scientific methods, computationally efficient coding, and use of tools that ensure reproducibility. We, therefore, strongly recommend that environmental health students take advanced biostatistics courses as well as data science and machine learning courses. It may not be feasible to have a machine learning class in an Environmental Health Department; however, it is crucial to have a course or a seminar series to reinforce critical thinking for environmental health students in the application of methods from other fields. This course should train students to prioritize the goals of environmental health research by first forming the research question they are interested in and subsequently identifying the most appropriate method to use to address this question. Online teaching platforms, such as Coursera or edX, are already being used to promote data science in Public Health (e.g. [78, 79]), and can facilitate the conscientious training of environmental health students in sound use of data scientific methods and practices.

Conclusions

Use of data scientific approaches in the field of environmental health sciences provides a unique opportunity to analyze high-dimensional and complex data structures, and accommodate the increasingly larger datasets that are becoming more common in environmental health. Data science can bring to environmental health science advanced methods that are computationally scalable and encourage reproducible research. Data science should be incorporated in the training of future environmental health scientists. However, it should never be prioritized over expert knowledge. Environmental health researchers should first formulate a well-defined research question and based on that select the most appropriate method, in collaboration with biostatisticians and data scientists, that yields robust and interpretable results.

Acknowledgments

This work was supported by NIEHS P30 ES009089 and R01 ES028805. This work was partially supported by HEI grant 4953-RFA14–3/16–4; the Health Effects Institute (HEI) is an organization jointly funded by the United States Environmental Protection Agency (EPA) (Assistance Award No. CR-83467701) and certain motor vehicle and engine manufacturers. The contents of this article do not necessarily reflect the views of HEI, or its sponsors, nor do they necessarily reflect the views and policies of the EPA or motor vehicle and engine manufacturers.

Conflict of Interest

The authors declare no conflicts of interest.

Footnotes

Compliance with Ethical Standards

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.

References

Papers of particular interest, published recently, have been highlighted as:

•• Of major importance

• Of importance

  • [1].Blei David M and Smyth Padhraic. Science and data science. Proceedings of the National Academy of Sciences, 114(33):8689–8692, 2017. •• This paper discusses data science from the statistical, computational and human perspective and why scientists should care about data science.
  • [2].Jordan Michael I et al. On statistics, computation and scalability. Bernoulli, 19(4):1378–1390, 2013.
  • [3].Mahalingaiah Shruthi, Lane Kevin J, Kim Chanmin, Cheng J Jojo, and Hart Jaime E. Impacts of air pollution on gynecologic disease: Infertility, menstrual irregularity, uterine fibroids, and endometriosis: a systematic review and commentary. Current Epidemiology Reports, 5(3):197–204, 2018.
  • [4].Gibson EA, Goldsmith JA, and Kioumourtzoglou M-A. Complex mixtures, complex analyses: An emphasis on interpretable results. Current Environmental Health Reports, 6(2):53–61, 2019. • This paper discusses methods to address exposure to environmental mixtures in health studies, one of the areas where environmental health research is already embracing data science analytic approaches, and discusses advantages and pitfalls for the specific application in mixtures analyses.
  • [5].Manrai Arjun K., Cui Yuxia, Bushel Pierre R., Hall Molly, Karakitsios Spyros, Mattingly Carolyn J., et al. Informatics and data analytics to support exposome-based discovery for public health. Annual Review of Public Health, 38(1):279–294, 2017.
  • [6].Lankadurai Brian P., Nagato Edward G., and Simpson Myrna J. Environmental metabolomics: an emerging approach to study organism responses to environmental stressors. Environmental Reviews, 21(3):180–205, 2013.
  • [7].Di Qian, Wang Yan, Zanobetti Antonella, Wang Yun, Koutrakis Petros, Choirat Christine, Dominici Francesca, and Schwartz Joel D. Air pollution and mortality in the Medicare population. New England Journal of Medicine, 376(26):2513–2522, 2017.
  • [8].Luraschi Javier, Kuo Kevin, Ushey Kevin, Allaire JJ, and The Apache Software Foundation. sparklyr: R Interface to Apache Spark, 2019. https://CRAN.R-project.org/package=sparklyr. R package version 1.0.0.
  • [9].Owczarz Wojciech and Zlatev Zahari. Parallel matrix computations in air pollution modelling. Parallel Computing, 28(2):355–368, 2002.
  • [10].Brown John, Waśniewski Jerzy, and Zlatev Zahari. Running air pollution models on massively parallel machines. Parallel Computing, 21(6):971–991, 1995.
  • [11].Molnar Ferenc Jr, Szakaly Tamas, Meszaros Robert, and Lagzi Istvan. Air pollution modelling using a graphics processing unit with CUDA. Computer Physics Communications, 181(1):105–112, 2010.
  • [12].Flaumenhaft Yakov and Ben-Assuli Ofir. Personal health records, global policy and regulation review. Health Policy, 122(8):815–826, August 2018. ISSN 0168-8510.
  • [13].Robinson Oliver, Tamayo Ibon, De Castro Montserrat, Valentin Antonia, Giorgis-Allemand Lise, Krog Norun Hjertager, Aasvang Gunn Marit, Ambros Albert, Ballester Ferran, Bird Pippa, et al. The urban exposome during pregnancy and its socioeconomic determinants. Environmental health perspectives, 126(7), 2018.
  • [14].Nieuwenhuijsen Mark J, Agier Lydiane, Basagaña Xavier, Urquiza Jose, Tamayo-Uria Ibon, Giorgis-Allemand Lise, Robinson Oliver, Siroux Valérie, Maitre Léa, de Castro Montserrat, et al. Influence of the urban exposome on birth weight. Environmental health perspectives, 127(4), 2019.
  • [15].Patel Chirag J, Pho Nam, McDuffie Michael, Easton-Marks Jeremy, Kothari Cartik, Kohane Isaac S, and Avillach Paul. A database of human exposomes and phenomes from the US National Health and Nutrition Examination Survey. Scientific data, 3:160096, 2016. •• This paper presents the successful integration of multiple publicly available datasets into a unified Research Data Platform.
  • [16].Raisaro Jean Louis, Troncoso-Pastoriza Juan, Misbach Mickael, Sousa Joao Sa, Pradervand Sylvain, Missiaglia Edoardo, Michielin Olivier, Ford Bryan, and Hubaux Jean-Pierre. MedCo: Enabling Secure and Privacy-Preserving Exploration of Distributed Clinical and Genomic Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1, 2018. ISSN 1545-5963. doi: 10.1109/TCBB.2018.2854776. https://ieeexplore.ieee.org/document/8410926/.
  • [17].Madhyastha Tara M, Koh Natalie, Day Trevor KM, Hernández-Fernández Moises, Kelley Austin, Peterson Daniel J, Rajan Sabreena, Woelfer Karl A, Wolf Jonathan, and Grabowski Thomas J. Running neuroimaging applications on amazon web services: How, when, and at what cost? Frontiers in Neuroinformatics, 11:63, 2017.
  • [18].Weber Nick, Liou David, Dommer Jennifer, MacMenamin Philip, Quiñones Mariam, Misner Ian, Oler Andrew J, Wan Joe, Kim Lewis, McCarthy Meghan Coakley, et al. Nephele: a cloud platform for simplified, standardized and reproducible microbiome data analysis. Bioinformatics, 34(8):1411–1413, 2017.
  • [19].Frei Patrizia, Mohler Evelyn, Bürgi Alfred, Fröhlich Jürg, Neubauer Georg, Braun-Fahrländer Charlotte, Röösli Martin, et al. A prediction model for personal radio frequency electromagnetic field exposure. Science of the total environment, 408(1):102–108, 2009.
  • [20].Boeije Geert, Vanrolleghem Peter, and Matthies Michael. A geo-referenced aquatic exposure prediction methodology for down-the drain chemicals. Water science and technology, 36(5):251–258, 1997.
  • [21].Kloog Itai, Nordio Francesco, Coull Brent A, and Schwartz Joel. Predicting spatiotemporal mean air temperature using MODIS satellite surface temperature measurements across the northeastern USA. Remote Sensing of Environment, 150:132–139, 2014.
  • [22].Kloog Itai, Chudnovsky Alexandra A, Just Allan C, Nordio Francesco, Koutrakis Petros, Coull Brent A, Lyapustin Alexei, Wang Yujie, and Schwartz Joel. A new hybrid spatio-temporal model for estimating daily multi-year PM2.5 concentrations across northeastern USA using high resolution aerosol optical depth data. Atmospheric Environment, 95:581–590, 2014.
  • [23].Van Donkelaar Aaron, Martin Randall V, Spurr Robert JD, and Burnett Richard T. High resolution satellite-derived PM2.5 from optimal estimation and geographically weighted regression over north America. Environmental science & technology, 49(17):10482–10491, 2015.
  • [24].Al-Hamdan Mohammad Z, Crosson William L, Limaye Ashutosh S, Rickman Douglas L, Quattrochi Dale A, Estes Maurice G Jr, Qualters Judith R, Sinclair Amber H, Tolsma Dennis D, Adeniyi Kafayat A, et al. Methods for characterizing fine particulate matter using ground observations and remotely sensed data: potential use for environmental public health surveillance. Journal of the Air & Waste Management Association, 59(7):865–881, 2009.
  • [25].Yanosky Jeff D, Paciorek Christopher J, Laden Francine, Hart Jaime E, Puett Robin C, Liao Duanping, and Suh Helen H. Spatio-temporal modeling of particulate air pollution in the conterminous United States using geographic and meteorological predictors. Environmental Health, 13(1):63, 2014.
  • [26].Bi Jianzhao, Belle Jessica H, Wang Yujie, Lyapustin Alexei I, Wildani Avani, and Liu Yang. Impacts of snow and cloud covers on satellite-derived PM2.5 levels. Remote Sensing of Environment, 221:665–674, 2019.
  • [27].Di Qian, Kloog Itai, Koutrakis Petros, Lyapustin Alexei, Wang Yujie, and Schwartz Joel. Assessing PM2.5 exposures with high spatiotemporal resolution across the continental United States. Environmental science & technology, 50(9):4712–4721, 2016.
  • [28].Chipman Hugh A, George Edward I, and McCulloch Robert E. Bayesian ensemble learning. In Advances in neural information processing systems, pages 265–272, 2007.
  • [29].Hoeting Jennifer A, Madigan David, Raftery Adrian E, and Volinsky Chris T. Bayesian model averaging: a tutorial. Statistical science, pages 382–401, 1999.
  • [30].Li Lianfa, Zhang Jiehao, Qiu Wenyang, Wang Jinfeng, and Fang Ying. An ensemble spatiotemporal model for predicting PM2.5 concentrations. International journal of environmental research and public health, 14(5):549, 2017.
  • [31].Shaddick Gavin, Thomas Matthew L, Green Amelia, Brauer Michael, van Donkelaar Aaron, Burnett Rick, Chang Howard H, Cohen Aaron, Van Dingenen Rita, Dora Carlos, et al. Data integration model for air quality: a hierarchical approach to the global estimation of exposures to ambient air pollution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 67(1):231–253, 2018.
  • [32].Hong KY, Pinheiro PO, Minet L, Hatzopoulou M, and Weichenthal S. Extending the spatial scale of land use regression models for ambient ultrafine particles using satellite images and deep convolutional neural networks. Environmental research, 176:108513–108513, 2019.
  • [33].Lee Duncan, Mukhopadhyay Sabyasachi, Rushworth Alastair, and Sahu Sujit K. A rigorous statistical framework for spatio-temporal pollution prediction and estimation of its long-term impact on health. Biostatistics, 18(2):370–385, 2016.
  • [34].Carroll Raymond J, Ruppert David, Crainiceanu Ciprian M, and Stefanski Leonard A. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC, 2006.
  • [35].Sheppard Lianne, Burnett Richard T, Szpiro Adam A, Kim Sun-Young, Jerrett Michael, Pope C Arden, and Brunekreef Bert. Confounding and exposure measurement error in air pollution epidemiology. Air Quality, Atmosphere & Health, 5(2):203–216, 2012.
  • [36].Liu J, Paisley J, Kioumourtzoglou M-A, and Coull BA. Adaptive and calibrated ensemble learning with dependent tail-free process. BNP @ NeurIPS, 2018.
  • [37].Liu Jeremiah Zhe, Paisley John, Kioumourtzoglou Marianthi-Anna, and Coull Brent A. Adaptive ensemble learning of spatiotemporal processes with calibrated predictive uncertainty: A bayesian nonparametric approach. arXiv:1904.00521 [stat.ME], 2019.
  • [38].Bobb Jennifer F, Obermeyer Ziad, Wang Yun, and Dominici Francesca. Cause-specific risk of hospital admission related to extreme heat in older adults. Jama, 312(24):2659–2667, 2014.
  • [39].Krall Jenna R, Chang Howard H, Waller Lance A, Mulholland James A, Winquist Andrea, Talbott Evelyn O, Rager Judith R, Tolbert Paige E, and Sarnat Stefanie Ebelt. A multicity study of air pollution and cardiorespiratory emergency department visits: Comparing approaches for combining estimates across cities. Environment international, 120:312–320, 2018.
  • [40].Gelman Andrew, Stern Hal S, Carlin John B, Dunson David B, Vehtari Aki, and Rubin Donald B. Bayesian data analysis. Chapman and Hall/CRC, 2013.
  • [41].Gelman Andrew and Hill Jennifer. Data analysis using regression and multilevel/hierarchical models. Cambridge university press, 2006.
  • [42].Blei David M, Kucukelbir Alp, and McAuliffe Jon D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [43].Hoffman Matthew D, Blei David M, Wang Chong, and Paisley John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [44].van der Laan Mark J and Gruber Susan. Collaborative double robust targeted maximum likelihood estimation. The international journal of biostatistics, 6(1), 2010.
  • [45].De Luna Xavier, Waernbaum Ingeborg, and Richardson Thomas S. Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, 98(4):861–875, 2011.
  • [46].Vansteelandt Stijn, Bekaert Maarten, and Claeskens Gerda. On model selection and model misspecification in causal inference. Statistical methods in medical research, 21(1):7–30, 2012.
  • [47].Wang Chi, Parmigiani Giovanni, and Dominici Francesca. Bayesian effect estimation accounting for adjustment uncertainty. Biometrics, 68(3):661–671, 2012.
  • [48].Zigler Corwin Matthew and Dominici Francesca. Uncertainty in propensity score estimation: Bayesian methods for variable selection and model-averaged causal effects. Journal of the American Statistical Association, 109(505):95–107, 2014.
  • [49].Hastie Trevor, Tibshirani Robert, and Friedman JH. The elements of statistical learning: data mining, inference, and prediction, 2009, Springer Series in Statistics.
  • [50].James Gareth, Witten Daniela, Hastie Trevor, and Tibshirani Robert. An introduction to statistical learning, New York, Springer, 2013.
  • [51].Greenland Sander, Robins James M, Pearl Judea, et al. Confounding and collapsibility in causal inference. Statistical science, 14(1):29–46, 1999.
  • [52].Hernán Miguel A, Clayton David, and Keiding Niels. The Simpson’s paradox unraveled. International journal of epidemiology, 40(3):780–785, 2011.
  • [53].Antonelli Joseph, Parmigiani Giovanni, Dominici Francesca, et al. High-dimensional confounding adjustment using continuous spike and slab priors. Bayesian Analysis, 2018.
  • [54].Belloni Alexandre, Chernozhukov Victor, and Hansen Christian. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
  • [55].Ertefaie Ashkan, Asgharian Masoud, and Stephens David A. Variable selection in causal inference using a simultaneous penalization method. Journal of Causal Inference, 6(1), 2018.
  • [56].Farrell Max H. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.
  • [57].Wilson Ander and Reich Brian J. Confounder selection via penalized credible regions. Biometrics, 70(4):852–861, 2014.
  • [58].Antonelli Joseph, Cefalu Matthew, Palmer Nathan, and Agniel Denis. Doubly robust matching estimators for high dimensional confounding adjustment. Biometrics, 74(4):1171–1179, 2018.
  • [59].VanderWeele Tyler J and Ding Peng. Sensitivity analysis in observational research: introducing the e-value. Annals of internal medicine, 167(4):268–274, 2017.
  • [60].Haneuse Sebastien, VanderWeele Tyler J, and Arterburn David. Using the e-value to assess the potential effect of unmeasured confounding in observational studies. JAMA, 321(6):602–603, 2019.
  • [61].Hamra Ghassan B and Buckley Jessie P. Environmental exposure mixtures: Questions and methods to address them. Current Epidemiology Reports, 5(2):160–165, 2018.
  • [62].Stafoggia Massimo, Breitner Susanne, Hampel Regina, and Basagaña Xavier. Statistical approaches to address multi-pollutant mixtures and multiple exposures: the state of the science. Current environmental health reports, 4(4):481–490, 2017.
  • [63].Huang Hongtai, Wang Aolin, Morello-Frosch Rachel, Lam Juleen, Sirota Marina, Padula Amy, and Woodruff Tracey J. Cumulative risk and impact modeling on environmental chemical and social stressors. Current environmental health reports, 5(1):88–99, 2018.
  • [64].Bellavia Andrea, James-Todd Tamarra, and Williams Paige L. Approaches for incorporating environmental mixtures as mediators in mediation analysis. Environment international, 123:368–374, 2019.
  • [65].National Academies of Sciences, Engineering and Medicine. Reproducibility and Replicability in Science. The National Academies Press, Washington, DC, 2019. ISBN 978-0-309-48613-2. doi: 10.17226/25303. https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science. •• This report defines the terms “reproducibility” and “replicability” for intended use across all fields of science.
  • [66].Krewski Daniel, Burnett RT, Goldberg M, Hoover K, Siemiatycki J, Abrahamowicz M, and White W. Reanalysis of the Harvard Six Cities Study, Part I: Validation and replication. Inhalation Toxicology, 17(7–8):335–342, 2005. ISSN 08958378. doi: 10.1080/08958370590929402.
  • [67].Peng Roger D. Reproducible research in computational science. Science, 334(6060):1226–1227, 2011.
  • [68].Wilkinson Mark D., Dumontier Michel, Aalbersberg IJsbrand Jan, Appleton Gabrielle, Axton Myles, Baak Arie, et al. Comment: The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3:1–9, 2016. ISSN 20524463. doi: 10.1038/sdata.2016.18. •• This paper presents four principles to improve infrastructure supporting the reuse of scholarly data.
  • [69].Henneman Lucas RF, Choirat Christine, Ivey Cesunica, Cummiskey Kevin, and Zigler Corwin M. Characterizing population exposure to coal emissions sources in the United States using the Hyads model. Atmospheric Environment, 203:271–280, 2019.
  • [70].Perkel Jeffrey M. A toolkit for data transparency. Nature, 560(7719):513–515, 2018. ISSN 0028-0836. doi: 10.1038/d41586-018-05990-5. URL http://www.nature.com/articles/d41586-018-05990-5.
  • [71].Beaulieu-Jones Brett K and Greene Casey S. Reproducibility of computational workflows is automated using continuous analysis. Nature biotechnology, 35(4):342–346, 2017. ISSN 1546-1696.
  • [72].Code Ocean — Discover & Run Scientific Code. URL https://codeocean.com/.
  • [73].Binder (beta). URL https://mybinder.org/.
  • [74].Renku. URL https://renkulab.io/.
  • [75].Brinckman Adam, Chard Kyle, Gaffney Niall, Hategan Mihael, Jones Matthew B., Kowalik Kacper, Kulasekaran Sivakumar, Ludäscher Bertram, Mecum Bryce D., Nabrzyski Jarek, Stodden Victoria, Taylor Ian J., Turk Matthew J., and Turner Kandace. Computing environments for reproducibility: Capturing the Whole Tale. Future Generation Computer Systems, 94:854–867, 2019. ISSN 0167739X. doi: 10.1016/j.future.2017.12.029.
  • [76].Pastrana Erika and Swaminathan Sowmya. Nature research journals trial new tools to enhance code peer review and publication. http://blogs.nature.com/ofschemesandmemes/2018/08/01/nature-research-journals-trial-new-tools-to-enhance-code-peer-review-and-publication, August 2018.
  • [77].Dwork Cynthia. Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II, ICALP’06, pages 1–12, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-35907-9, 978-3-540-35907-4.
  • [78].edX. Courses taught by Rafael Irizarry. https://www.edx.org/bio/rafael-irizarry.
  • [79].Coursera. Courses taught by Jeff Leek. https://www.coursera.org/instructor/~694443.
