Author manuscript; available in PMC: 2021 Sep 1.
Published in final edited form as: Curr Environ Health Rep. 2020 Sep;7(3):170–184. doi: 10.1007/s40572-020-00282-5

Machine Learning within Studies of Early-life Environmental Exposures and Child Health: Review of the current literature and discussion of next-steps

Sabine Oskar a, Jeanette A Stingone a
PMCID: PMC7483339  NIHMSID: NIHMS1606573  PMID: 32578067

Abstract

Purpose of Review

The goal of this article is to review the use of machine learning (ML) within studies of environmental exposures and children’s health, identify common themes across studies and provide recommendations to advance their use in research and practice.

Recent Findings

We identified 42 articles reporting upon the use of ML within studies of environmental exposures and children’s health between 2017 and 2019. Common themes among the articles were analysis of mixture data, exposure prediction, disease prediction and forecasting, analysis of complex data and causal inference.

Summary

With increasing complexity of environmental health data, we anticipate greater use of ML to address challenges that cannot be handled by traditional analytics. In order for these methods to beneficially impact public health, the ML techniques we use need to be appropriate for our study questions, rigorously evaluated and reported in a way that can be critically assessed by the scientific community.

Keywords: Machine learning, Child health, Environmental mixtures, Environmental Health, Data science, Prenatal Environment

Introduction

The last decade has seen growing interest in using machine learning (ML) techniques for health research. A simple PubMed search of “machine learning” AND “health” shows that the number of published articles grew from 61 in 2010 to 2,712 in 2019. ML, a branch of artificial intelligence, has been defined as “a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform decision-making under uncertainty” (1). While many ML methods have been around for decades, the more recent availability of computational resources and big health data has ignited renewed enthusiasm for their use in a range of substantive fields within medicine and public health. There has been a growing number of commentaries and reviews on the use of ML within the health sciences (2–4), including areas within environmental health (5).

ML methods have typically been used for research questions related to prediction and classification (6). In contrast, environmental health has more commonly focused on causal or explanatory modelling, addressing research questions that seek to estimate the magnitude and precision of health effects related to environmental contaminants. Emerging technologies, such as electronic health records, -omics platforms, social media and wearable sensors, have led to a greater availability of complex environmental health data (7). Additionally, interest in more holistic exposure paradigms, such as the exposome (8), particularly during pregnancy and early life, has spurred coordinated research initiatives that seek to capitalize on big environmental health data, such as the Child’s Health and Exposure Analysis Resource, the Environmental influences on Child Health Outcomes Program, and HELIX (the Early Life Exposome Project) (9, 10). These initiatives, as well as other pregnancy and pediatric consortia, are implementing novel technologies and enhanced data-sharing to generate large and complex repositories of environmental data (11). As a result, many within the field have stressed the need for analytic methods that can address the unique challenges presented by high-dimensional environmental health data (7).

Despite this growing interest and the availability of big environmental health data, it is unclear whether greater use of ML methods will result in advances in early-life environmental health research and practice. A recent systematic review in the clinical literature suggested that prediction models resulting from ML were on average no better than models developed using traditional logistic regression (12). That review highlights a growing concern that ML methods may not result in the expected improvements in research and health. Although the use of ML is not as common within perinatal and pediatric environmental health, there is still a need to assess the current state of the literature to evaluate how ML is applied and what, if any, adaptations to ML methods are needed to achieve our goals of improving research and, subsequently, public health.

The goal of this article is to review how researchers have used ML within studies of environmental exposures and children’s health. This review identifies common themes across studies and summarizes the strengths and limitations of implementing these methods. We also identify gaps in the literature and provide recommendations to move the field forward. As the goal is to discuss the use of ML within environmental health studies, more focus has been placed on the application of the methods than on the substantive findings of individual studies.

Methods of Literature Review

We conducted a systematic search in PubMed of studies, published between January 2017 and October 2019, that applied ML methods to study environmental exposures and children’s health. For the purpose of this review, we defined an environmental exposure as an exogenous chemical, pollutant(s) or ambient physical factor that comes into direct contact with the pregnant mother, fetus, or child. We focused on exposures and outcomes occurring during the prenatal to adolescent period (in utero to 18 years). ML methods were defined as techniques that fit models algorithmically by adapting to patterns in data (13). For the purposes of this review, we did not include principal component analysis or stepwise regression, unless coupled with other algorithms. There is limited agreement about whether these two approaches are considered ML algorithms, although they are discussed in primary texts of statistical learning (14). However, we sought to focus this review on techniques that are being newly applied within environmental health research. Given their long history of use, we excluded these two approaches from this review.

Published studies were identified using the following medical subject headings and keywords in combination: (“artificial neural networks” OR algorithm* OR “artificial Intelligence” OR “artificial learning” OR “bayesian learning” OR “bayesian network” OR BKMR OR “cluster analysis” OR “decision tree*” OR “deep learning*” OR “elastic net” OR forecasting OR “gradient boosting” OR GBM OR “knowledge representation” OR “K-means” OR LASSO OR “machine intelligence” OR “machine learning*” OR “neural network*” OR “outcome prediction” OR “probabilistic model*” OR mixture* OR “probabilistic networks” OR prediction OR “random forest” OR “regression tree” OR regularization OR “ridge regression” OR segmentation OR “semi-supervised learning” OR “statistical learning” OR “supervised learning” OR “unsupervised learning” OR XGBoost); AND (adolesc*; boys; child*; early life; girls; fetal; infant; prenatal; “maternal exposure*”); AND environment*. The search was limited to articles in English, and excluded animal studies, review articles, and letters to the editor. Studies of exposure characterization nested within child health studies were included. We excluded studies of social-structural environments (e.g. socioeconomic disadvantage, psychological stress or conditions, etc.) and diet. Studies of metabolomics and the microbiome were omitted if researchers did not investigate their relationship with a specific environmental exposure. A single reviewer performed the first phase of screening using the information available in the title and abstract. After the initial screening, two authors independently conducted a full text review for final inclusion. The flow diagram depicting the identification of the literature and selection of studies is shown in Fig. 1.

Fig. 1.


Literature flow diagram of search methods and resulting studies that applied machine-learning methods to study environmental exposures and children’s health

Use of Machine Learning Described in Current Literature

Characteristics of identified studies

A total of 42 articles were included and are summarized in Table 1. We grouped the studies according to common themes including analysis of mixture data (n=22), exposure prediction (n=6), disease prediction, forecasting, and decision support (n=7), analysis of other complex data (n=5), and improvements to causal inference (n=2). More than half of the publications were based on prospective data, with 60% of studies examining exposures occurring during the prenatal period. The majority of studies (71%) assessed exposure to air pollution, the weather, or built environment (n=18) or exposure to heavy metals (n=12). The health outcomes investigated varied widely across studies with most studies examining postnatal outcomes (n=13), respiratory illness including lung function and asthma (n=8), or neurodevelopmental or cognitive effects (n=9). The most common methods used were tree-based methods, such as classification and regression trees (CaRT) or random forest (RF) (n=13), and Bayesian kernel machine regression (BKMR) (n=9). Six publications introduced new or modified ML approaches for analyzing chemical mixtures data (15–17), disease prediction (18), disease forecasting (19), or causal inference for mixtures (20), while the remainder reported on the application of existing algorithms and approaches.

Table 1.

Studies using machine learning in early-life environmental exposures and child health by theme

Author, Year Population Study design Environmental exposure Child Health Outcome ML method Contribution/Findings
Analysis of Chemical Mixtures, n= 22
Agier, 2019 (38) 1,033 mother - child pairs (6–12 years) Prospective cohort Pre-and postnatal exposures across 17 domains (e.g. air pollution, traffic, metals, phthalates, phenols, pesticides, organochlorines) Lung function DSA None of the exposures were consistently selected by the DSA algorithm across multiple runs.
Berg, 2017 (85) 391 mother - infant pairs Prospective cohort Prenatal exposure to 19 persistent organic pollutants Thyroid homeostasis measured by TSH and TH Hierarchical clustering Identified several persistent organic pollutants that were significantly associated with TSH and TH.
Chen, 2019 (25) 726 adolescents (14–17 years) Prospective cohort Prenatal and childhood exposure to inorganic arsenic and metals Blood pressure BKMR Identified associations between current and early childhood exposure to inorganic arsenic with higher blood pressure. No evidence of interaction.
Cilluffo, 2018 (31) 219 children (8–10 years) Cross-sectional Urban-related environmental exposures (“green”, “grey” and air pollution) Respiratory/allergic conditions and general symptoms Poisson ridge regression Identified specific exposures related to greenness, greyness and air pollution that were associated with respiratory/ allergic and general symptoms.
Coker, 2018 (26) 708 mother - child pairs (1–2 years) Prospective cohort Prenatal exposure to 6 compounds of DDT/E and pyrethroids BMI and body composition at 1 and 2 years BKMR Identified individual pollutants associated with body composition and body weight in girls and boys. Evidence of interaction in boys.
Deyssenroth, 2018 (27) 237 mother - infant pairs Prospective cohort Prenatal exposure to 16 trace metals SGA BKMR Identified a multi-metal index. Associations of specific metals remained and predominated even after accounting for the presence of co-pollutant correlations. No evidence of interaction.
Grant, 2018 (32) 27,538 children (2–17 years) Cross-sectional Neighborhood environment (including green space, park density) BMI Compared 4 spatial scale (SS) selection algorithms: SS LASSO, SS forward stepwise regression, SS incremental forward stage-wise regression and SS least angle regression The SS algorithms selected covariates at different spatial scales, producing better goodness-of-fit in comparison to traditional statistical models.
Heggeseth, 2019 (40) 335 mother-child pairs (2–14 years) Prospective cohort Prenatal exposure to 11 phthalates BMI trajectories Regression trees and random forest Uncovered nonlinear associations between specific prenatal phthalate concentrations and BMI level in children. Findings confirmed across a variety of statistical methods.
Hou, 2019 (33) Mother - infant pairs; 246 cases and 406 controls Nested case-control Prenatal exposure to 22 metals Low-BW vs. normal-BW Elastic-net regression Identified 15 metals that were significantly associated with the increased risk of low-BW.
Iszatt, 2019 (34) 267 mother- infant pairs Cross-sectional Exposure to 28 chemicals of PCBs, PBDEs, PFASs, pesticides Gut bacteria composition and function Elastic-net regression Identified specific exposures that influence infant gut microbial composition and function. No evidence of interaction.
Kupsco, 2019 (28) 548 mother-child pairs (4–6 years) Prospective cohort Prenatal exposure to 11 metals Childhood cardiometabolic risk BKMR Low essential metals during pregnancy were associated with increased cardiometabolic risk factors in childhood. No evidence of interaction.
Lenters, 2019 (35) 1199 mother - child pairs (4–13 years) Prospective cohort Early-life exposure to 27 persistent OCPs ADHD Elastic-net regression Subset of variables identified linear and nonlinear exposure-response relationships with ADHD. No evidence of interaction.
Liu, 2018 (16) 665 mother - child pairs (6–24 months) Prospective cohort Prenatal exposure to 9 heavy metals Neurodevelopmental trajectories assessed from 6 to 24 months Developed Bayesian varying coefficient kernel machine regression Developed method for examining health outcome trajectories within a mixtures model. Applied method and identified specific metals that were associated with neurodevelopmental trajectories within a metals mixture. Observed evidence of interaction within mixture.
Liu, 2018 (17) 84 mother - child pairs (0–3 months) Prospective cohort Pre-and postnatal exposure to 6 heavy metals Neurodevelopment (visual motor abilities) Developed lagged kernel machine regression (LKMR) LKMR method identified time windows of susceptibility to exposures of complex mixtures, while accounting for nonlinear and non-additive effects of the mixture at any given exposure window. LKMR detected interaction effects that were not captured when using BKMR or joint kernel BKMR.
Liu, 2018 (15) 391 mother - infant pairs Prospective cohort Pre-and postnatal exposure to 9 heavy metals BW Developed mean field variational approximation method for Bayesian inference procedure for LKMR (MFVB-LKMR) MFVB-LKMR showed computational efficiency and reasonable accuracy as compared with the corresponding MCMC estimation method. Found an association between second trimester metal levels and birth weight. Evidence of interaction.
Ni, 2019 (29) Mother - infant pairs; 89 cases and 129 controls Case-control Prenatal exposure to 16 PAHs Orofacial clefts BKMR No evidence that high prenatal exposure to PAHs is associated with an increased risk for orofacial clefts. No evidence of interaction.
Philippat, 2019 (36) 473 mother - son infant pairs Prospective cohort Prenatal exposure to 20 phthalates and phenols Placental weight, BW and placental - to-BW ratio Elastic-net regression Identified 4 biomarkers associated with placental weight that is consistent with previous literature. Evidence of interaction in placental weight model.
Stingone, 2017 (41) 6,900 children (4–5 years) Retrospective cohort Infant exposure to 104 air toxics Math standardized test scores Regression Trees Identified air pollutant profiles associated with lower math test scores. Evidence of interaction.
Serrano-Lomelin, 2019 (86) 333,247 mother - infant pairs Retrospective cohort Prenatal exposure to industrial air pollutants ABO including preterm birth, SGA, and low-BW at term Developed pruning approach for the spatial data mining algorithm to identify and prioritize candidate hypotheses in complex setting Method allowed for extraction of candidate hypotheses linking ABO with mixtures originating from hundreds of chemicals. Identified potential mixtures of industrial pollutants spatially related to the occurrence of ABO.
Valeri, 2017 (30) 825 mother - child pairs (20–40 months) Prospective cohort Prenatal exposure to 3 metals Neurodevelopmental outcomes BKMR Found an association between metal mixture and cognitive score and nonlinear effects. Evidence of interaction.
Warembourg, 2019 (39) 1,277 mother - child pairs (6–11 years) Prospective cohort Pre-and postnatal exposure to hundreds of factors (e.g. air pollution, built environment, traffic, disinfection products, metals, phthalates, phenols, pesticides) Blood pressure DSA Method allowed for simultaneous evaluation of the possible health effects from exposure to hundreds of environmental factors during early-life. Several environmental exposures were associated with an increase or decrease in blood pressure.
Woods, 2017 (37) 272 mother- infant pairs Prospective cohort Prenatal exposure to 53 chemicals of phthalates, BPA, PFAS, PCBs, PBDEs, OPPs, heavy metals, OCPs BW Bayesian hierarchical linear models. Sensitivity analysis with LASSO and elastic-net regression LASSO selected 7 chemicals; elastic-net regression selected the same 7 and an additional 6 chemicals. LASSO and elastic-net regression coefficients were larger in size but with less precise 95% confidence intervals compared to Bayesian hierarchical linear models.
Exposure Prediction, n= 6
Boland, 2017 (49) Mother-infant pairs. Fetal loss: 14,922 cases and 33,043 unaffected. Congenital anomalies: 5,658 cases and 31,240 unaffected Retrospective cohort Prenatal exposure to category C pharmacological drugs Fetal loss and congenital anomalies Random forest Classified category C drugs of unknown fetal effect into harmful or safe categories based on fetal outcomes by those who used drug, drug’s potential to affect gene expression and other drug characteristics
Brokamp, 2017 (50) Sites from CCAAPS participants Cross-sectional Childhood exposure to elemental components of PM and total PM Childhood allergy and respiratory health Random forest Land use random forest (LURF) used to predict PM exposure based on land use predictors. LURF utilized a more diverse and greater number of predictors than standard land use step-wise regression (LUR). LURF had higher accuracy in predicting specific elemental components and total PM compared to standard LUR.
Brokamp, 2018 (51) Sites from CCAAPS participants Cross-sectional Childhood exposure to PM Childhood allergy and respiratory health Random forest Aerosol optical depth (AOD), weather, atmospheric, and land-use data used to train a random forest (RF) to predict daily PM. Addressed potential issues with missing AOD data from previous studies. Spatiotemporal RF models with missing AOD data performed the same as models with AOD data.
Ghaedrahmat, 2019 (52) Adolescents living in Ahvaz City Prospective cohort Childhood exposure to air pollution Respiratory health Artificial neural network model Predictive features and models differed between the cold and warm seasons when trying to predict the FENO biomarker in children.
Kloog, 2018 (53) 56,141 mother - infants Retrospective cohort Prenatal exposure to ambient temperature and emissions of greenhouse gases Low-BW and SGA Hybrid spatiotemporally resolved prediction model incorporates multiangle implementation of atmospheric correction algorithm Applied complex exposure assessment pipeline within study of birth outcomes.
Tognola, 2019 (54) 977 children (0 – 14 years) Cross-sectional Childhood exposure to electric networks Characterization indoor extremely low frequency (ELF) magnetic field (MF) exposure K-means cluster analysis K-means cluster analysis revealed significant and recurrent patterns in personal exposure to ELF MF.
Disease Prediction, Forecasting and Decision Support, n= 7
Deng, 2019 (56) 4,548 children (5–7 years) Prospective cohort Childhood in-home exposures, as well as exposure to traffic and air pollution Bronchitis symptoms Gradient boosting tree models Identified key risk factors for prediction of bronchitis symptoms.
Hassan, 2018 (57) Male autistic children; 58 cases and 32 controls (3–12 years) Case-control Childhood exposure to 9 biomarkers including metals, vitamin E, serotonin, dopamine, etc. ASD onset and development Hierarchical clustering Combining biomarkers into profiles improved the accuracy of ASD prediction but failed to distinguish between participants with severe versus mild or moderate ASD.
Hosseini, 2017 (60) 1 child with mild asthma Prospective cohort Childhood exposure to PM Self-management of pediatric asthma (Real-time asthma attack risk assessment) Random forest Determination of asthma exacerbation risk level based on physiological and environmental sensor data analyzed with random forest.
Luo, 2017 (18) 33,831 mother- infant pairs with 78 cases of CHD Cross-sectional Prenatal exposure to pesticides and chemical fertilizers CHD Developed and validated weighted support vector machine (WSVM) and weighted random forest (WRF) algorithms Compared the prediction performance of 3 classifiers (logit, WSVM, WRF) when data is unbalanced and cross-sectional. All 3 models were precise enough to identify groups with higher prevalence of CHD, but WSVM was substantially better than the logit and WRF.
Rao, 2017 (59) Children (4–12 years) Cross-sectional Childhood exposure to NO2 based on land use predictors Annual incidence rates of asthma exacerbation Random Forest Land use random forest (LURF) used to predict NO2 based on ~200 land use and land cover (LULC) variables. LURF model also used to investigate the association of NO2 with individual LULC categories and evaluate LULC modifications for enhanced respiratory health in the presence of noisy and missing data. Found that NO2 associated with roadways and tree-canopied areas may be affecting annual incidence rates of asthma exacerbation and that increasing tree canopy may reduce incidence rates.
Sewe, 2017 (19) 8,476 children (<5 years) with confirmed malaria admissions Retrospective cohort Childhood exposure to weather conditions (e.g., rainfall, average temperature and evapotranspiration) Hospital malaria admissions Developed 2 different generalized additive models, one using a boosting algorithm to optimize model fit and the other without boosting Forecasted monthly pediatric malaria admissions at a district hospital in Western Kenya. Model structures involving generalized additive models with a boosting algorithm provided the best forecasts at all lead times.
Zhong, 2018 (58) Data from 60 state hospitals, 6 mother and child care centers, and 619 community rehabilitation centers Retrospective cohort Childhood exposure to weather and environmental conditions (e.g., air quality, temperature) HFMD outbreaks XGBoost tree models Built XGBoost tree model based on historical HFMD rates, air-quality factors, and temperature factors. Found that addition of air-quality factors to the historical HFMD rate and temperature data may improve forecasting of HFMD.
Analysis of Other Complex Data, n=5
Eguchi, 2017 (61) 93 mother-fetus pairs Prospective cohort Prenatal exposure to PCBs Metabolic pathways of adverse effects Random Forest RF model with the metabolome profile predicted PCB exposure levels for pregnant women and fetuses.
Fernandez, 2018 (62) 384 children (7–15 years) Case-control Gene and childhood exposure to air pollution (ambient B[a]pyrene concentration) interaction Allergic asthma K-means clustering and regression trees Reduced phenotypic heterogeneity of asthma severity, and identified SNPs associated with phenotype subgroups. Showed that SNP clustering may help to partly explain heterogeneity in children’s asthma susceptibility in relation to ambient B[a] pyrene concentration.
Huang, 2018 (63) 705 mother - adolescent pairs (17 years) Prospective cohort Prenatal exposure to MeHg Examined 7 neurodevelopmental outcomes Regression trees and semiparametric additive models No evidence of an association between adverse neurodevelopmental outcomes and prenatal MeHg overall. Subgroups of population identified from regression trees found suggestive relationship between prenatal MeHg exposure >10 ppm and adverse neurodevelopment in some individuals.
Lalonde, 2018 (64) 675 mother- child pairs (8–10 years) Prospective cohort Prenatal exposure to MeHg Examined 20 neurodevelopmental outcomes Non-parametric Bayesian model, specifically Dirichlet process mixture models Model grouped the 20 testing outcomes into 7 domains rather than 4 domains based on previous methods.
Ren, 2018 (65) 39,053 mother - infant pairs Prospective cohort Prenatal exposure to PM, temperature, and humidity CHD Random forest and gradient boosting Evidence that maternal exposure to PM10 was associated with increased risk of CHD with both models. Identified nonlinear and interpretable dose-response association between maternal exposure to PM10 and the risk of CHDs with RF.
Improvement of causal inference, n = 2
Herrera, 2017 (66) 288 children (6 - ≥12 years) Cross-sectional Childhood exposure to gold and copper mines Respiratory disease (asthma and/or rhinoconjunctivitis) Targeted maximum likelihood estimation with the super learner algorithm Quantified the causal attributable risk of living close to the mines on asthma or rhinoconjunctivitis risk. Based on a causal attributable risk model, a hypothetical intervention to increase distance between residences and mines may reduce the prevalence of respiratory disease.
Oulhote, 2019 (20) 449 mother-child pairs (5–7 years) Prospective cohort Prenatal and childhood exposure to Hg, PCBs, PFASs Neurodevelopmental outcomes SuperLearner with G-computation Developed flexible model to detect independent and joint effects and to estimate valid causal effect estimates. Results corroborate previous findings from individual chemical analyses. Joint effect of the mixtures of chemicals was stronger, but no potential for synergistic effects was observed.

Abbreviations: ABO, adverse birth outcomes; ADHD, attention deficit hyperactivity disorder; AOD, aerosol optical depth; ASD, autism spectrum disorder; BPA, bisphenol-A; BKMR, Bayesian kernel machine regression; BMI, body mass index; BW, birth weight; CCAAPS, Cincinnati Childhood Allergy and Air Pollution Study; CHD, congenital heart defects; DDT/E, dichlorodiphenyltrichloroethane (DDT)/dichlorodiphenyldichloroethylene (DDE); DSA, deletion-substitution-addition; FENO, fractional exhaled nitric oxide; HFMD, hand-foot-and-mouth disease; Hg, mercury; LASSO, least absolute shrinkage and selection operator; LKMR, lagged kernel machine regression; MCMC, Markov Chain Monte Carlo; MeHg, methylmercury; ML, machine learning; NO2, nitrogen dioxide; OCPs, organochlorine pesticides; OPPs, organophosphate pesticides; PAHs, polycyclic aromatic hydrocarbons; PBDE, polybrominated diphenyl ethers; PCB, polychlorinated biphenyls; PFAS, perfluoroalkyl substances; PM, particulate matter; RF, random forest; SGA, small for gestational age; SS, spatial scales; Ta, ambient temperatures; TH, thyroid hormones; TSH, thyroid-stimulating hormone; XGBoost, Extreme Gradient Boosting.

Theme 1: Analysis of environmental mixtures

Environmental mixtures have the potential to produce greater adverse health outcomes than each single exposure would alone (21, 22). Previous epidemiologic studies have primarily focused on single-chemical analyses, in part due to the statistical challenges of analyzing mixtures (23, 24). Thus, it is not surprising that more than half (52%) of identified studies employed ML methods for analyzing chemical mixture data. Among the studies included in this review, the most frequently used methods for analyzing chemical mixture data were BKMR or modified BKMR methods (n=9) (15–17, 25–30), penalized regression (n=7) (31–37), the deletion/substitution/addition (DSA) algorithm (n=2) (38, 39), and tree-based methods (n=2) (40, 41). The algorithms used often vary based on the dimensionality of the exposure data.

Briefly, BKMR is a semi-parametric hierarchical variable selection method that can assess individual components within a mixture while accounting for the correlated structure of a mixture (42). Because the method can quantify the overall effect of a mixture and estimate the joint effects between exposures within a mixture, studies have applied BKMR to investigate and document interactions between mixture components (16, 17, 26, 30, 42, 43). Three publications reported the development and application of modified BKMR methods to analyze chemical mixtures while also accounting for time-varying effects (15–17). BKMR has been used in settings with limited numbers of exposures. Within the studies identified in this review, the maximum number of exposures examined was 16 (15–17).
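For orientation, the BKMR model is commonly written in a form like the one below. This is a sketch of the general kernel machine regression formulation, not a reproduction of any one study's specification; the symbols $Y_i$, $z_{im}$, $x_i$, $\beta$ and $r_m$ are notation introduced here for illustration.

```latex
% Outcome model: flexible mixture function h plus linear covariate adjustment
Y_i = h(z_{i1}, \dots, z_{iM}) + x_i^{\top}\beta + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)

% Gaussian kernel used to represent h, with one weight r_m per mixture component
K(z_i, z_j) = \exp\Big\{ -\sum_{m=1}^{M} r_m \,(z_{im} - z_{jm})^2 \Big\}
```

Here $h$ is an unspecified exposure-response surface over the $M$ mixture components and $x_i$ holds the covariates. Under this formulation, variable selection is typically placed on the weights $r_m$: a component whose weight is zero drops out of $h$, which is how individual components can be assessed while the rest of the mixture is accounted for.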

Other algorithms have been used in settings with higher exposure dimensionality in order to identify the subset of features that distinguish individuals with the outcome of interest. Seven studies used penalized regression (e.g., ridge regression, least absolute shrinkage and selection operator (LASSO), and elastic net regression (ENR)) (31–37), and all but one examined more than 20 exposures simultaneously. LASSO and ENR are typically used to enhance prediction accuracy when analyzing a large number of features because these methods are robust to multicollinearity, but the final coefficients can be difficult to interpret (44). As a result, many studies fit a penalized regression model for feature selection and subsequently run traditional regression models based on the selected subset of variables (45). For example, Lenters et al. applied a two-step approach of ENR and a second logistic model to identify features and then estimate effects between early-life exposure to 27 persistent organic pollutants measured in maternal breast milk and attention deficit hyperactivity disorder in childhood (35).
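The two-step idea (penalized selection, then an unpenalized refit on the selected features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any reviewed study: it uses a LASSO with a linear outcome rather than ENR with a logistic model, the data are a synthetic orthogonal design, and all names (`hadamard_features`, `lasso_cd`, the penalty `lam=0.5`) are hypothetical.

```python
def hadamard_features(n=8, cols=(1, 2, 3, 4)):
    """Toy design: mutually orthogonal +/-1 columns (Sylvester Hadamard matrix)."""
    return [[(-1) ** bin(i & j).count("1") for j in cols] for i in range(n)]


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                zj += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (zj / n)
    return beta


X = hadamard_features()
# Hypothetical outcome: two strong effects, one weak effect, one null feature.
y = [2.0 * row[0] - 1.5 * row[2] + 0.1 * row[3] for row in X]

# Step 1: the penalty zeroes out weak and null features and shrinks the rest.
beta_lasso = lasso_cd(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta_lasso) if b != 0.0]

# Step 2: refit without a penalty (lam=0 reduces to least squares) on the selected columns.
X_sel = [[row[j] for j in selected] for row in X]
beta_refit = lasso_cd(X_sel, y, lam=0.0)
```

In this toy design the penalized coefficients are shrunk toward zero, while the refit recovers the unshrunken effects for the selected features. That contrast is the usual motivation for the second step, although inference after selection carries its own caveats.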

Two publications from the HELIX project applied the DSA algorithm to assess the associations between a broad range of lifestyle and environmental chemical exposures and lung function (38) and blood pressure (39) during childhood. DSA is a data-adaptive algorithm that was initially introduced in genomics research to explore interactions in high-dimensional data by iteratively constructing generalized linear models and updating the “best” model through deletion, substitution and addition of features and/or polynomials of the features (46). These two studies had the largest number of exposure features (≥ 210 exposures) among the studies that analyzed mixture data. However, only two-way interactions were explored and the DSA algorithm did not identify any exposures consistently within the study of lung function (38).

Tree-based methods were used both for feature selection and to explore higher-order interactions. These methods use a nonparametric binary partitioning algorithm that recursively splits the data into more homogeneous subgroups to distinguish individuals with the outcome of interest (47). Since a single tree can be highly unstable, most studies, including the two we identified in this review, utilize ensemble methods that aggregate findings across individual trees (48). Variable importance measures from a random forest, which are metrics that identify the most predictive features for a specific outcome, were used for feature selection within a study of prenatal exposure to phthalates and BMI trajectories (40). An exploration of higher-order interactions was performed in a study of multiple air toxics and children’s school readiness. In that study, trees were generated and then decomposed into their branches and analyzed to identify the exposure profiles, or collections of air toxics, that were able to distinguish children with lower scores (41).
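The recursive binary partitioning behind a regression tree can be sketched in a few lines. This is an illustrative toy, assuming a squared-error splitting criterion and made-up data; it is a single tree, not the ensembles used in the reviewed studies, and the names `best_split` and `grow` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Search every feature and midpoint threshold for the lowest total impurity."""
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, thr, cost)
    return best  # None when no feature has more than one distinct value


def grow(X, y, depth=0, max_depth=2):
    """Recursively split until the node is pure or max_depth is reached."""
    split = best_split(X, y) if depth < max_depth and sse(y) > 0 else None
    if split is None:
        return sum(y) / len(y)  # leaf: mean outcome in the subgroup
    j, thr, _ = split
    left = [(r, v) for r, v in zip(X, y) if r[j] <= thr]
    right = [(r, v) for r, v in zip(X, y) if r[j] > thr]
    return {
        "feature": j,
        "threshold": thr,
        "left": grow([r for r, _ in left], [v for _, v in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [v for _, v in right], depth + 1, max_depth),
    }


# Hypothetical data: feature 0 is uninformative, feature 1 drives the outcome.
X = [[0, 1], [1, 2], [0, 3], [1, 4], [0, 5], [1, 6]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = grow(X, y)
```

Each root-to-leaf path in `tree` corresponds to a subgroup defined by a sequence of thresholds, which is the sense in which a tree's branches can be read as exposure profiles.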

Theme 2: Exposure Prediction and Characterization

Six publications implemented ML techniques for exposure prediction or characterization, with half of these studies predicting air pollutants or biomarkers of air pollutants (49–54). RF was used in two of these studies for the prediction of daily particulate matter (PM) concentrations in a setting of missing aerosol optical depth data (51), and prediction of elemental components of PM from land use variables (50). A third study applied an existing algorithm to provide aerosol optical depth in high resolution for the prediction of PM2.5 (53), while the fourth used artificial neural networks or “deep learning” to predict biomarkers of ozone exposure in children (52). A cross-sectional study employed k-means cluster analysis, an unsupervised iterative algorithm that partitions the data into distinct non-overlapping subgroups (55), for characterization of patterns in indoor low frequency magnetic field exposure (54). Lastly, one study aimed to infer new knowledge about the safety of category C drugs of unknown fetal effect using RF and reported adverse outcomes (49).
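
The iterative logic of k-means can be illustrated as follows. This is a bare-bones sketch and not the implementation used in the reviewed study: the starting centroids are simply the first k observations (real applications typically use random or k-means++ initialization with several restarts), and the two-dimensional exposure summaries are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its members."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    assignments = None
    for _ in range(n_iter):
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical exposure summaries (e.g., two standardized measures per child).
points = [(0.1, 0.2), (2.0, 2.1), (0.2, 0.1), (2.1, 2.0), (0.15, 0.15), (1.9, 2.2)]

assignments, centroids = kmeans(points, k=2)
print("cluster assignments:", assignments)
print("cluster centroids:", centroids)
```

Because the result depends on the starting centroids and on the chosen number of clusters, both would normally be reported as part of the analytic pipeline.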

Theme 3: Disease Prediction, Forecasting and Decision Support

Three studies used ML methods for explicit outcome prediction, applying tree-based methods and/or support vector machines (18, 56) or applying hierarchical clustering to multiple features prior to using them in a prediction model (57). Luo et al introduced and validated two ML algorithms, weighted support vector machine (WSVM) and weighted random forest (WRF) (18). WSVM and WRF were developed for handling highly unbalanced datasets, a common problem in environmental health research. In a comparison of statistical methods, WSVM was shown to have substantially better prediction of congenital heart defects using prenatal exposure to environmental contaminants as features (18). Three additional studies used ML methods for disease forecasting, two for infectious diseases and one for asthma (19, 58, 59). The study forecasting malaria using weather and environmental data reported on the development of an approach that incorporated a boosting algorithm to optimize model fit and compared it to another approach without boosting (19). Approaches were compared in a testing dataset on a number of error-based metrics, with the boosted algorithm showing lower forecasting error. Only one study used ML for decision support within disease management. This report of one individual used RF to predict risk of asthma exacerbation based on imputed physiological and environmental data measured via smartwatch sensors, which in turn could be visualized by the patient (60).
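
One common way to account for unbalanced outcomes is to weight observations inversely to the frequency of their class, so that errors on the rare class count for more. The sketch below illustrates that general idea only; it is not the WSVM or WRF algorithm of Luo et al, and the weighting formula, function names and toy counts are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (number_of_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Share of total class weight carried by misclassified observations."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical cohort: 95 unaffected children and 5 cases.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that labels every child as unaffected.
y_pred = [0] * 100

weights = balanced_class_weights(y_true)
unweighted = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted = weighted_error(y_true, y_pred, weights)

print("class weights:", weights)
print("unweighted error:", unweighted)
print("weighted error:", weighted)
```

Under equal weighting the trivial classifier looks accurate even though it misses every case; once the rare class is up-weighted, the same classifier is penalized heavily.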

Theme 4: Analysis of Other Complex Data

Five studies used ML methods to analyze complex data, separate from exposure mixtures (61–65). ML methods were applied to the complexity of metabolomics (61) and genetic data (62), while other studies used ML to identify non-linear relationships among multiple covariates (63), exposures (65) or outcomes (64). Most of these studies utilized tree-based methods, either alone (61) or in combination with other methods (62, 63, 65), as tree-based models easily accommodate non-linearities and interactions. The remaining study used the Dirichlet process mixture model implemented through a Bayesian Markov chain Monte Carlo framework to cluster 20 neurocognitive outcome measures into domains based on similarities in prenatal methylmercury exposure and relevant covariates (64). This method, which does not place restrictions on the number or size of clusters and assumes nonlinear associations in each cluster and non-homogeneous associations between clusters, allowed researchers to group neurocognitive outcomes into seven distinct domains as opposed to the original four domains identified via traditional statistical methods (64).

Theme 5: Causal inference

Two studies integrated ML techniques, specifically the SuperLearner algorithm, into causal inference methods (20, 66). The SuperLearner algorithm is a flexible, data-adaptive prediction method that uses cross-validation to estimate a weighted combination of predictions from many candidate learners (67). Herrera et al applied targeted maximum likelihood estimation jointly with the SuperLearner algorithm to quantify the causal attributable risk of living close to gold and copper mines on asthma and rhinoconjunctivitis in children (66). Based on the estimated causal attributable risk, Herrera et al were able to estimate the reduction of respiratory disease risk based on a hypothetical intervention of increased distance between residences and mines (66). In contrast, Oulhote et al combined SuperLearner with G-computation to yield causal effect estimates of the independent and joint effects of prenatal and childhood exposure to multiple environmental contaminants from seafood consumption with neurodevelopmental outcomes in children (20). The results from Oulhote et al’s study corroborated previous findings from individual chemical analyses. In both examples, the application of the SuperLearner algorithm generates an initial prediction model for the outcome using the exposures of interest and any included covariates. The model and its predictions are then used within the respective causal inference frameworks.
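
The way an outcome prediction model feeds into G-computation can be sketched with a deliberately simple example. Here the outcome model is a table of stratum-specific means standing in for the SuperLearner fit; the binary exposure A, binary confounder L, outcome Y and all counts are hypothetical, and the sketch assumes no unmeasured confounding and that every confounder stratum contains both exposed and unexposed individuals.

```python
from collections import defaultdict


def fit_stratum_means(data):
    """Outcome model: mean of Y within each (exposure, confounder) stratum.
    A data-adaptive learner would replace this step in practice."""
    sums, counts = defaultdict(float), defaultdict(int)
    for a, l, y in data:
        sums[(a, l)] += y
        counts[(a, l)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def g_computation(data, model):
    """Predict every individual's outcome with exposure set to 1 and to 0,
    average each set of predictions, and take the difference."""
    n = len(data)
    mean_if_exposed = sum(model[(1, l)] for _, l, _ in data) / n
    mean_if_unexposed = sum(model[(0, l)] for _, l, _ in data) / n
    return mean_if_exposed - mean_if_unexposed


def make_group(a, l, n, n_events):
    """Build n records (A, L, Y) of which n_events have Y = 1."""
    return [(a, l, 1)] * n_events + [(a, l, 0)] * (n - n_events)


# Hypothetical cohort in which L raises both exposure and outcome risk.
data = (
    make_group(a=0, l=0, n=30, n_events=3)
    + make_group(a=1, l=0, n=10, n_events=3)
    + make_group(a=0, l=1, n=10, n_events=5)
    + make_group(a=1, l=1, n=30, n_events=21)
)

exposed = [y for a, _, y in data if a == 1]
unexposed = [y for a, _, y in data if a == 0]
crude_difference = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)

model = fit_stratum_means(data)
standardized_difference = g_computation(data, model)

print("crude risk difference:", crude_difference)
print("G-computation risk difference:", standardized_difference)
```

In this toy cohort the crude comparison is confounded by L, while the standardized contrast recovers the risk difference that holds within each stratum.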

Discussion

The majority of studies incorporating ML methods focused on the analysis of environmental mixtures and multiple exposures. The statistical analysis of mixtures and multiple exposures has emerged as a primary challenge within environmental epidemiology, and a number of papers have compared different methods (45, 68). The challenge of analyzing complex environmental exposures is intensified within the emerging paradigm of the exposome and the corresponding research questions that focus on identifying the effects of collective exposure across multiple domains (9). The majority of ML approaches have been optimized for tasks related to prediction, not identifying causal components (6). However, many of the studies we reviewed used typical predictive modelling pipelines for feature selection. This approach may be problematic because it assumes that features that are highly predictive are also causally explanatory. This assumption often does not hold, and researchers must consider these limitations when using ML approaches for studies interested in causality and explanation (6).

In addition to the analysis of mixtures, there is great opportunity and interest for the use of ML for the refinement of exposure and outcome assessment. Studies within environmental epidemiology have already used ML to augment existing exposure assessment approaches such as land-use regression (50) and spatiotemporal air pollution modelling (51). The potential for algorithms to integrate and leverage large amounts of data to refine phenotypic definitions of complex outcomes such as asthma has been highlighted in earlier reports (69), but such applications were not observed in our review. These types of approaches can reduce measurement error, ensure more homogeneous exposure and outcome definitions, and subsequently improve the quality of our research. Similarly, investigators from the broader field of environmental health have begun to use ML for data generation and integration. From incorporating image analysis into exposure assessment (70) to integration of multiple features for health-related forecasts related to climate change (71), we anticipate greater use of these approaches from other fields to address questions within children’s environmental health.

There has been limited use of ML for predictive tasks embedded within observational research such as the calculation of propensity scores and missing data imputation. Other fields have more fully embraced these implementations. For example, the use of ML to estimate propensity scores or inverse probability of treatment weights has been compared to traditional methods using simulated data and then implemented within study data to improve exchangeability (72, 73). Similarly, one recent paper utilized ML integrated with causal inference methods to estimate causal effects of a hypothesized intervention related to pediatric environmental health (66). There has been increased discussion around analytic approaches to integrate ML with causal inference methods (74), although few papers show that these integrated approaches work as expected within the context of environmental health. Additional research, using both simulated data and actual study data, is needed to determine if ML has the ability to advance our efforts to estimate causal effects from observational data.

While we have focused primarily on the opportunities related to ML, we must also consider the potential problems that can arise when using data-driven methods for environmental health research. Particularly as ML is often used with data mined from administrative and clinical data sources not collected as part of standard research procedures, we must be cognizant of the role that bias, error and selection may play when interpreting our findings. This is especially true within the growing field of decision support and personalized risk assessment. As evidenced by recent work on the racism of an algorithm deployed in health care (75), models and the features that are used to construct those predictive models must be critically evaluated not just in terms of accuracy, but in terms of fairness. Environmental injustice has deep historical roots that impact many of our public health and clinical data sources (76). These issues must be taken into account when building and interpreting data-driven models of environmental health.

Recommendations for Future Research

Based on this synthesis of the literature, we make four recommendations to advance the use of ML in perinatal and pediatric environmental health research. These recommendations are offered to encourage critical thinking around the use of these data-driven methods, acknowledging their potential within the context of the goals of environmental health research.

Recommendation 1: Explicitly state the goal of the analysis

As described within this review, environmental health researchers are often using ML to address challenges related to complex environmental mixtures with a goal of understanding etiology. However, many of the algorithms being used have been developed with the goal of prediction (6). We would encourage researchers to be explicit about their goal for the use of ML, their use of analytic terminology, their rationale for algorithm selection and how the analytic pipeline used is consistent with the goal of their research. To improve clarity, researchers should avoid vague terms such as “predictors” when the goal is not to develop a predictive model with applicability to a new dataset. Clearly defining research goals both when first designing the study and subsequently reporting results of ML-based analyses will also allow reviewers and readers to determine if the analytic pipeline used is aligned with the study goals. For example, implementation of ML used in conjunction with causal inference methods will likely be reported on and evaluated differently than ML used for feature selection in high-dimensional data. Being clear about our research goals will enable us to better evaluate the use of ML within our field, both in terms of its appropriateness and performance in different settings.

Recommendation 2: Expand the use of simulation studies

As in other methods-focused work, we encourage researchers to utilize simulations to evaluate how ML methods will perform in contexts that are common within environmental health research. This is especially relevant when the ML methods are being used in contexts different from their traditional development and when constructing new ensembles by combining different algorithms into a single analytic pipeline. Only a few of the studies we reviewed utilized simulations prior to implementing ML within their study data. Even though many of these methods have been used before, they often have not been used for the explicit goals under study. Utilizing simulated data where truth is known would provide us with some confidence that these techniques are applicable to our specific questions (77). For example, Lampa et al illustrated the appropriateness of boosted regression trees to assess interactions in complex mixtures using simulated data (78). The authors subsequently applied this method to examine interactions between multiple environmental contaminants and metabolic syndrome within a cohort of Swedish adults, referencing the earlier study to support their use of boosted trees (79). In addition, the use of simulations can help us identify the contexts where certain ML methods may not perform as expected, for example with very highly correlated exposures or smaller sample sizes.
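
A simulation of this kind can be very small. The sketch below is a hypothetical illustration, not a reanalysis of any cited study: it generates data in which only the first of two exposures truly affects the outcome, applies a naive selection rule (choose the exposure most strongly correlated with the outcome), and counts how often the true exposure is recovered when the two exposures are uncorrelated versus highly correlated. The sample size, correlation values, effect size, seed and selection rule are all assumptions.

```python
import random


def pearson(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sum((a - mean_u) ** 2 for a in u) ** 0.5
    sd_v = sum((b - mean_v) ** 2 for b in v) ** 0.5
    return cov / (sd_u * sd_v)


def recovery_rate(rho, n, n_sims, rng):
    """Fraction of simulated datasets in which the truly causal exposure
    (x1) has the larger absolute correlation with the outcome."""
    hits = 0
    for _ in range(n_sims):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has no effect on y but is correlated with x1 at level rho.
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        y = [a + rng.gauss(0, 1) for a in x1]  # truth: only x1 matters
        if abs(pearson(x1, y)) > abs(pearson(x2, y)):
            hits += 1
    return hits / n_sims


rng = random.Random(2020)  # fixed seed so the simulation is reproducible
rate_uncorrelated = recovery_rate(rho=0.0, n=30, n_sims=200, rng=rng)
rate_correlated = recovery_rate(rho=0.95, n=30, n_sims=200, rng=rng)

print("recovery rate, uncorrelated exposures:", rate_uncorrelated)
print("recovery rate, highly correlated exposures:", rate_correlated)
```

Because the data-generating mechanism is known, the drop in recovery under high correlation can be attributed to the selection rule rather than to the data, which is the kind of evidence that cannot be obtained from study data alone.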

Recommendation 3: Rigorously validate and evaluate models using multiple criteria appropriate for the study goal

To improve both the rigor and reproducibility of data-driven analyses, researchers should clearly describe the full analytic pipeline used to generate results and final model performance metrics. The majority of the studies included in our review utilized some form of validation or evaluation to assess model performance. However, often only one criterion, typically related to predictive accuracy, was assessed and reported. For predictive models, there are a number of additional criteria, such as precision and recall, that should be used to assess the multiple aspects of model performance (80). Model performance can also be sensitive to the choice of values for hyperparameters, the user-defined elements that influence the behavior of a given algorithm. For example, lambda in a regularized regression is the hyperparameter that controls the shrinkage penalty. A full description of an analytic pipeline would include a description of whether resampling, such as cross-validation or bootstrapping, was used to compare model performance with different hyperparameter values, the values considered for each hyperparameter and, when an algorithm possesses more than one hyperparameter, whether they were assessed individually or in combination. Few studies we reviewed thoroughly reported a complete pipeline, limiting the reader’s ability to effectively assess the analytic approach.
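
The following sketch shows why a single accuracy figure can be misleading when the outcome is uncommon. The counts are hypothetical and the function name is our own; precision is the share of predicted cases that are true cases, and recall is the share of true cases that were identified.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical test set: 10 cases and 90 non-cases.
y_true = [1] * 10 + [0] * 90
# The model identifies 4 of the 10 cases and raises 2 false alarms.
y_pred = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 88

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

Here overall accuracy is high even though the model misses more than half of the cases, which only the recall value makes visible.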

We anticipate that data-driven risk prediction and aids to disease management in clinical and public health settings will be of growing interest, including the use of biomarkers of exposure and –omic signatures as predictors for clinical outcomes (81). For these studies, we also recommend including an assessment of calibration to ensure the generated risk predictions are accurate when applied in clinical and public health settings (80, 82). Similarly, the systematic construction and evaluation of forecasting models need to be reported within studies to promote transparency and rigor before promoting their use for policy and public health practice. This is especially relevant to the pressing need for research that seeks to characterize and forecast potential health effects of climate change. It is important to note that validation and evaluation pipelines will vary based on the goals of the study. For example, studies using ML for propensity score estimation do not assess area under the curve when comparing performance among algorithms. Rather, they compare the balance in covariate distribution, as that is the desired outcome when using propensity scores (72). In that same vein, researchers need to carefully consider how the validation and evaluation pursued within specific analyses is consistent with the research question the study intends to address.
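
A basic calibration check compares the average predicted risk with the observed outcome frequency within groups of similar predictions. The sketch below is a minimal illustration with hypothetical predictions and an assumed equal-size binning scheme; in practice more bins, calibration plots or formal calibration measures would be used.

```python
def calibration_table(probs, outcomes, n_bins):
    """Sort observations by predicted risk, cut them into n_bins groups of
    (nearly) equal size, and return (mean predicted, observed rate, n) per group."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    table = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed_rate = sum(o for _, o in chunk) / len(chunk)
        table.append((mean_predicted, observed_rate, len(chunk)))
    return table


# Hypothetical predicted risks and observed binary outcomes.
probs = [0.1] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

table = calibration_table(probs, outcomes, n_bins=2)
for mean_predicted, observed_rate, size in table:
    print("predicted:", mean_predicted, "observed:", observed_rate, "n:", size)
```

In this toy example the model ranks individuals correctly but underestimates risk in the low-risk group and overestimates it in the high-risk group, a pattern that discrimination measures such as area under the curve would not reveal.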

Recommendation 4: Expand the use of ML for descriptive and predictive tasks within observational studies

We identified a few studies that used ML to improve or augment exposure models, by imputing unknown information or developing prediction models to assign exposure in areas without measurements (50, 51, 53). These types of implementations are consistent with traditional use of ML as they are questions of prediction that are embedded in observational research studies. We encourage these types of studies and hope to see them expand to include greater use of imaging and other technologies that are facilitated through the use of ML techniques, as has been done in the broader field of environmental health (70). The use of unsupervised techniques is also an area that has been pursued in other fields that we think could benefit perinatal and pediatric research as we seek to describe the complex patterns of exposure within populations and within different windows of susceptibility (83). As discussed above, research suggests that ML methods can often outperform traditional models for tasks such as propensity score construction (72). Additionally, studies from other areas of perinatal and pediatric research can illustrate how the use of ML can allow us to creatively address challenges within our field. For example, a study from Naimi et al used ML to address the epidemiologic challenge of missing fetal weights when studying small for gestational age, implementing ML to address what is essentially a missing-data prediction problem (84). Automating data harmonization across studies, entity resolution across data sources, and conducting in silico toxicology experiments of emerging chemicals are other challenges within our field that could be amenable to the use of ML. We encourage researchers to pursue and report on these novel implementations.

Conclusions

The majority of the studies used ML to address the challenge of environmental mixtures. Other implementations have ranged from improving exposure assessment to the analysis of complex -omics data. As the amount of complex environmental data continues to grow, we anticipate greater use of ML to address challenges that cannot be easily handled with traditional analytic methods. In order to conduct high-quality research, we need to ensure that the ML techniques we use are appropriate for our study questions and are implemented and reported in a way that can be critically assessed by the scientific community. Additionally, we recommend increasing the use of simulation studies and rigorously optimizing and evaluating our methods and subsequent models within diverse populations to ensure fairness and accuracy across subpopulations. Finally, we encourage researchers to look beyond environmental mixtures for research questions that can benefit from the use of ML and consider the predictive and descriptive tasks within our etiologic studies that may benefit from the increased use of ML in environmental health research.

Footnotes

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

Conflict of Interest Disclosure

Dr. Stingone reports funding from the NIEHS during the conduct of the study (ES027022). Dr. Oskar declares no conflicts.

Human and Animal Rights

All reported studies/experiments with human or animal subjects performed by the authors have been previously published and complied with all applicable ethical standards (including the Helsinki declaration and its amendments, institutional/national research committee standards, and international/national/institutional guidelines).

References

  • 1. Murphy KP. Machine learning: a probabilistic perspective. MIT Press; 2012.
  • 2.** Bi Q, Goodman KE, Kaminsky J, Lessler J. What Is Machine Learning: a Primer for the Epidemiologist. Am J Epidemiol. 2019. A clear and comprehensive description of multiple machine learning algorithms from an epidemiologic perspective that also describes some of the opportunities and challenges of the wider adoption of machine learning methods.
  • 3. Shameer K, Johnson KW, Glicksberg BS, Dudley JT, Sengupta PP. Machine learning in cardiovascular medicine: are we there yet? Heart. 2018;104(14):1156–64.
  • 4. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60.
  • 5. Bellinger C, Mohomed Jabbar MS, Zaiane O, Osornio-Vargas A. A systematic review of data mining and machine learning for air pollution epidemiology. BMC Public Health. 2017;17(1):907.
  • 6. Shmueli G. To explain or to predict? Statistical Science. 2010;25(3):289–310.
  • 7. Manrai AK, Cui Y, Bushel PR, Hall M, Karakitsios S, Mattingly CJ, et al. Informatics and Data Analytics to Support Exposome-Based Discovery for Public Health. Annu Rev Public Health. 2017;38:279–94.
  • 8. Wild CP. The exposome: from concept to utility. Int J Epidemiol. 2012;41(1):24–32.
  • 9. Stingone JA, Buck Louis GM, Nakayama SF, Vermeulen RC, Kwok RK, Cui Y, et al. Toward Greater Implementation of the Exposome Research Paradigm within Environmental Epidemiology. Annu Rev Public Health. 2017;38:315–27.
  • 10. Jacobson LP, Lau B, Catellier D, Parker CB. An Environmental influences on Child Health Outcomes viewpoint of data analysis centers for collaborative study designs. Curr Opin Pediatr. 2018;30(2):269–75.
  • 11. Stingone JA, Mervish N, Kovatch P, McGuinness DL, Gennings C, Teitelbaum SL. Big and disparate data: considerations for pediatric consortia. Curr Opin Pediatr. 2017;29(2):231–9.
  • 12. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology. 2019;110:12–22.
  • 13. Mooney SJ, Pejaver V. Big Data in Public Health: Terminology, Machine Learning, and Privacy. Annu Rev Public Health. 2018;39:95–112.
  • 14. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. Springer; 2013.
  • 15. Liu SH, Bobb JF, Claus Henn B, Schnaas L, Tellez-Rojo MM, Gennings C, et al. Modeling the health effects of time-varying complex environmental mixtures: Mean field variational Bayes for lagged kernel machine regression. Environmetrics. 2018;29(4).
  • 16. Liu SH, Bobb JF, Claus Henn B, Gennings C, Schnaas L, Tellez-Rojo M, et al. Bayesian varying coefficient kernel machine regression to assess neurodevelopmental trajectories associated with exposure to complex mixtures. Stat Med. 2018;37(30):4680–94.
  • 17.* Liu SH, Bobb JF, Lee KH, Gennings C, Claus Henn B, Bellinger D, et al. Lagged kernel machine regression for identifying time windows of susceptibility to exposures of complex mixtures. Biostatistics. 2018;19(3):325–41. Reports on the development of an approach that seeks to account for both mixture effects and windows of susceptibility.
  • 18. Luo Y, Li Z, Guo H, Cao H, Song C, Guo X, et al. Predicting congenital heart defects: A comparison of three data mining methods. PLoS One. 2017;12(5):e0177811.
  • 19. Sewe MO, Tozan Y, Ahlm C, Rocklöv J. Using remote sensing environmental data to forecast malaria incidence at a rural district hospital in Western Kenya. Sci Rep. 2017;7(1):2589.
  • 20.* Oulhote Y, Coull B, Bind MA, Debes F, Nielsen F, Tamayo I, et al. Joint and independent neurotoxic effects of early life exposures to a chemical mixture: A multi-pollutant approach combining ensemble learning and g-computation. Environ Epidemiol. 2019;3(5). Example of integrating machine learning and causal inference approaches to analyze environmental mixtures.
  • 21. Silva E, Rajapakse N, Kortenkamp A. Something from “nothing”--eight weak estrogenic chemicals combined at concentrations below NOECs produce significant mixture effects. Environ Sci Technol. 2002;36(8):1751–6.
  • 22. Rajapakse N, Silva E, Kortenkamp A. Combining xenoestrogens at levels below individual no-observed-effect concentrations dramatically enhances steroid hormone action. Environ Health Perspect. 2002;110(9):917–21.
  • 23. Sun Z, Tao Y, Li S, Ferguson KK, Meeker JD, Park SK, et al. Statistical strategies for constructing health risk models with multiple pollutants and their interactions: possible choices and comparisons. Environ Health. 2013;12(1):85.
  • 24. Braun JM, Gennings C, Hauser R, Webster TF. What Can Epidemiological Studies Tell Us about the Impact of Chemical Mixtures on Human Health? Environ Health Perspect. 2016;124(1):A6–9.
  • 25. Chen Y, Wu F, Liu X, Parvez F, LoIacono NJ, Gibson EA, et al. Early life and adolescent arsenic exposure from drinking water and blood pressure in adolescence. Environ Res. 2019;178:108681.
  • 26. Coker E, Chevrier J, Rauch S, Bradman A, Obida M, Crause M, et al. Association between prenatal exposure to multiple insecticides and child body weight and body composition in the VHEMBE South African birth cohort. Environ Int. 2018;113:122–32.
  • 27. Deyssenroth MA, Gennings C, Liu SH, Peng S, Hao K, Lambertini L, et al. Intrauterine multi-metal exposure is associated with reduced fetal growth through modulation of the placental gene network. Environ Int. 2018;120:373–81.
  • 28. Kupsco A, Kioumourtzoglou MA, Just AC, Amarasiriwardena C, Estrada-Gutierrez G, Cantoral A, et al. Prenatal Metal Concentrations and Childhood Cardiometabolic Risk Using Bayesian Kernel Machine Regression to Assess Mixture and Interaction Effects. Epidemiology. 2019;30(2):263–73.
  • 29. Ni W, Yang W, Jin L, Liu J, Li Z, Wang B, et al. Levels of polycyclic aromatic hydrocarbons in umbilical cord and risk of orofacial clefts. Sci Total Environ. 2019;678:123–32.
  • 30. Valeri L, Mazumdar MM, Bobb JF, Claus Henn B, Rodrigues E, Sharif OIA, et al. The Joint Effect of Prenatal Exposure to Metal Mixtures on Neurodevelopmental Outcomes at 20–40 Months of Age: Evidence from Rural Bangladesh. Environ Health Perspect. 2017;125(6):067015.
  • 31. Cilluffo G, Ferrante G, Fasola S, Montalbano L, Malizia V, Piscini A, et al. Associations of greenness, greyness and air pollution exposure with children’s health: a cross-sectional study in Southern Italy. Environ Health. 2018;17(1):86.
  • 32. Grant LP, Gennings C, Wickham EP, Chapman D, Sun S, Wheeler DC. Modeling Pediatric Body Mass Index and Neighborhood Environment at Different Spatial Scales. Int J Environ Res Public Health. 2018;15(3).
  • 33. Hou Q, Huang L, Ge X, Yang A, Luo X, Huang S, et al. Associations between multiple serum metal exposures and low birth weight infants in Chinese pregnant women: A nested case-control study. Chemosphere. 2019;231:225–32.
  • 34. Iszatt N, Janssen S, Lenters V, Dahl C, Stigum H, Knight R, et al. Environmental toxicants in breast milk of Norwegian mothers and gut bacteria composition and metabolites in their infants at 1 month. Microbiome. 2019;7(1):34.
  • 35.* Lenters V, Iszatt N, Forns J, Čechová E, Kočan A, Legler J, et al. Early-life exposure to persistent organic pollutants (OCPs, PBDEs, PCBs, PFASs) and attention-deficit/hyperactivity disorder: A multi-pollutant analysis of a Norwegian birth cohort. Environ Int. 2019;125:33–42. Describes a two-stage approach that couples feature selection with more traditional etiologic modelling.
  • 36. Philippat C, Heude B, Botton J, Alfaidy N, Calafat AM, Slama R, et al. Prenatal Exposure to Select Phthalates and Phenols and Associations with Fetal and Placental Weight among Male Births in the EDEN Cohort (France). Environ Health Perspect. 2019;127(1):17002.
  • 37. Woods MM, Lanphear BP, Braun JM, McCandless LC. Gestational exposure to endocrine disrupting chemicals in relation to infant birth weight: a Bayesian analysis of the HOME Study. Environ Health. 2017;16(1):115.
  • 38. Agier L, Basagaña X, Maitre L, Granum B, Bird PK, Casas M, et al. Early-life exposome and lung function in children in Europe: an analysis of data from the longitudinal, population-based HELIX cohort. Lancet Planet Health. 2019;3(2):e81–e92.
  • 39. Warembourg C, Maitre L, Tamayo-Uria I, Fossati S, Roumeliotaki T, Aasvang GM, et al. Early-Life Environmental Exposures and Blood Pressure in Children. J Am Coll Cardiol. 2019;74(10):1317–28.
  • 40. Heggeseth BC, Holland N, Eskenazi B, Kogut K, Harley KG. Heterogeneity in childhood body mass trajectories in relation to prenatal phthalate exposure. Environ Res. 2019;175:22–33.
  • 41. Stingone JA, Pandey OP, Claudio L, Pandey G. Using machine learning to identify air pollution exposure profiles associated with early cognitive skills among U.S. children. Environ Pollut. 2017;230:730–40.
  • 42. Bobb JF, Valeri L, Claus Henn B, Christiani DC, Wright RO, Mazumdar M, et al. Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics. 2015;16(3):493–508.
  • 43. Coull BA, Bobb JF, Wellenius GA, Kioumourtzoglou MA, Mittleman MA, Koutrakis P, et al. Part 1. Statistical Learning Methods for the Effects of Multiple Air Pollution Constituents. Res Rep Health Eff Inst. 2015(183 Pt 1–2):5–50.
  • 44. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological. 1996;58(1):267–88.
  • 45. Gibson EA, Goldsmith J, Kioumourtzoglou MA. Complex Mixtures, Complex Analyses: an Emphasis on Interpretable Results. Curr Environ Health Rep. 2019;6(2):53–61.
  • 46. Sinisi SE, van der Laan MJ. Deletion/substitution/addition algorithm in learning with applications in genomics. Stat Appl Genet Mol Biol. 2004;3:Article18.
  • 47. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. New York: Chapman & Hall; 1984.
  • 48. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi: 10.1023/A:1010933404324.
  • 49.Boland MR, Polubriaginof F, Tatonetti NP. Development of A Machine Learning Algorithm to Classify Drugs Of Unknown Fetal Effect. Sci Rep. 2017;7(1):12839. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Brokamp C, Jandarov R, Rao MB, LeMasters G, Ryan P. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches. Atmos Environ (1994). 2017;151:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.*.Brokamp C, Jandarov R, Hossain M, Ryan P. Predicting Daily Urban Fine Particulate Matter Concentrations Using a Random Forest Model. Environ Sci Technol. 2018;52(7):4173–9. [DOI] [PubMed] [Google Scholar]; An example of machine learning used to improve exposure assessment of air pollutants
  • 52.Ghaedrahmat Z, Vosoughi M, Tahmasebi Birgani Y, Neisi A, Goudarzi G, Takdastan A. Prediction of Ozone in the respiratory system of children using the artificial neural network model and with selection of input based on gamma test, Ahvaz, Iran. Environ Sci Pollut Res Int. 2019;26(11):10941–50. [DOI] [PubMed] [Google Scholar]
  • 53.Kloog I, Novack L, Erez O, Just AC, Raz R. Associations between ambient air temperature, low birth weight and small for gestational age in term neonates in southern Israel. Environ Health. 2018;17(1):76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Tognola G, Bonato M, Chiaramello E, Fiocchi S, Magne I, Souques M, et al. Use of Machine Learning in the Analysis of Indoor ELF MF Exposure in Children. Int J Environ Res Public Health. 2019;16(7). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Jain A. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010;31(8):651–66. [Google Scholar]
  • 56.Deng H, Urman R, Gilliland FD, Eckel SP. Understanding the importance of key risk factors in predicting chronic bronchitic symptoms using a machine learning approach. BMC Med Res Methodol. 2019;19(1):70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Hassan WM, Al-Ayadhi L, Bjørklund G, Alabdali A, Chirumbolo S, El-Ansary A. The Use of Multi-parametric Biomarker Profiles May Increase the Accuracy of ASD Prediction. J Mol Neurosci. 2018;66(1):85–101. [DOI] [PubMed] [Google Scholar]
  • 58.Zhong R, Wu Y, Cai Y, Wang R, Zheng J, Lin D, et al. Forecasting hand, foot, and mouth disease in Shenzhen based on daily level clinical data and multiple environmental factors. Biosci Trends. 2018;12(5):450–5. [DOI] [PubMed] [Google Scholar]
  • 59.Rao M, George LA, Shandas V, Rosenstiel TN. Assessing the Potential of Land Use Modification to Mitigate Ambient NO2 and Its Consequences for Respiratory Health. Int J Environ Res Public Health. 2017;14(7). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Hosseini A, Buonocore CM, Hashemzadeh S, Hojaiji H, Kalantarian H, Sideris C, et al. Feasibility of a Secure Wireless Sensing Smartwatch Application for the Self-Management of Pediatric Asthma. Sensors (Basel). 2017;17(8). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Eguchi A, Sakurai K, Watanabe M, Mori C. Exploration of potential biomarkers and related biological pathways for PCB exposure in maternal and cord serum: A pilot birth cohort study in Chiba, Japan. Environ Int. 2017;102:157–64. [DOI] [PubMed] [Google Scholar]
  • 62.Fernández D, Sram RJ, Dostal M, Pastorkova A, Gmuender H, Choi H. Modeling Unobserved Heterogeneity in Susceptibility to Ambient Benzo[a]pyrene Concentration among Children with Allergic Asthma Using an Unsupervised Learning Algorithm. Int J Environ Res Public Health. 2018;15(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Huang LS, Cory-Slechta DA, Cox C, Thurston SW, Shamlaye CF, Watson GE, et al. Analysis of Nonlinear Associations between Prenatal Methylmercury Exposure from Fish Consumption and Neurodevelopmental Outcomes in the Seychelles Main Cohort at 17 Years. Stoch Environ Res Risk Assess. 2018;32(4):893–904. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Lalonde A, Love T. Using the Seychelles child development study to cluster multiple outcomes into domains to improve estimation of the overall effect of mercury on neurodevelopment. Math Appl. 2018;7(1):53–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Ren Z, Zhu J, Gao Y, Yin Q, Hu M, Dai L, et al. Maternal exposure to ambient PM10 during pregnancy increases the risk of congenital heart defects: Evidence from machine learning models. Sci Total Environ. 2018;630:1–10. [DOI] [PubMed] [Google Scholar]
  • 66.*.Herrera R, Berger U, von Ehrenstein OS, Díaz I, Huber S, Moraga Muñoz D, et al. Estimating the Causal Impact of Proximity to Gold and Copper Mines on Respiratory Diseases in Chilean Children: An Application of Targeted Maximum Likelihood Estimation. Int J Environ Res Public Health. 2017;15(1). [DOI] [PMC free article] [PubMed] [Google Scholar]; Example of targeted learning, integration of machine learning and causal inference, within the context of children’s environmental health
  • 67.van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article25. [DOI] [PubMed] [Google Scholar]
  • 68.Hamra GB, Buckley JP. Environmental exposure mixtures: questions and methods to address them. Curr Epidemiol Rep. 2018;5(2):160–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Williams-DeVane CR, Reif DM, Hubal EC, Bushel PR, Hudgens EE, Gallagher JE, et al. Decision tree-based method for integrating gene expression, demographic, and clinical data to determine disease endotypes. BMC Syst Biol. 2013;7:119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.**.Weichenthal S, Hatzopoulou M, Brauer M. A picture tells a thousand…exposures: Opportunities and challenges of deep learning image analyses in exposure science and environmental epidemiology. Environ Int. 2019;122:3–10. [DOI] [PMC free article] [PubMed] [Google Scholar]; Summarizes how improvements in imaging technology will advance exposure assessment.
  • 71.Chalghaf B, Chemkhi J, Mayala B, Harrabi M, Benie GB, Michael E, et al. Ecological niche modeling predicting the potential distribution of Leishmania vectors in the Mediterranean basin: impact of climate change. Parasit Vectors. 2018;11(1):461. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Bentley R, Baker E, Simons K, Simpson JA, Blakely T. The impact of social housing on mental health: longitudinal analyses using marginal structural models and machine learning-generated weights. Int J Epidemiol. 2018;47(5):1414–22. [DOI] [PubMed] [Google Scholar]
  • 74.Blakely T, Lynch J, Simons K, Bentley R, Rose S. Reflection on modern methods: when worlds collide-prediction, machine learning and causal inference. Int J Epidemiol. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. [DOI] [PubMed] [Google Scholar]
  • 76.Vera LA, Walker D, Murphy M, Mansfield B, Siad LM, Ogden J. When Data Justice and Environmental Justice Meet: Formulating a Response to Extractive Logic through Environmental Data Justice. Inf Commun Soc. 2019;22(7):1012–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Hallgren KA. Conducting Simulation Studies in the R Programming Environment. Tutor Quant Methods Psychol. 2013;9(2):43–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.*.Lampa E, Lind L, Lind PM, Bornefalk-Hermansson A. The identification of complex interactions in epidemiology and toxicology: a simulation study of boosted regression trees. Environ Health. 2014;13:57. [DOI] [PMC free article] [PubMed] [Google Scholar]; Example of using simulation analyses to support use of machine learning methods for environmental health applications.
  • 79.Lind L, Salihovic S, Lampa E, Lind PM. Mixture effects of 30 environmental contaminants on incident metabolic syndrome-A prospective study. Environ Int. 2017;107:8–15. [DOI] [PubMed] [Google Scholar]
  • 80.Platt RW, Grandi SM. Machine learning for the prediction of postpartum complications is promising, but needs rigorous evaluation. BJOG. 2019;126(6):710. [DOI] [PubMed] [Google Scholar]
  • 81.Holland N. Future of environmental research in the age of epigenomics and exposomics. Rev Environ Health. 2017;32(1–2):45–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW; Topic Group 'Evaluating diagnostic tests and prediction models' of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.White AJ, Keller JP, Zhao S, Carroll R, Kaufman JD, Sandler DP. Air Pollution, Clustering of Particulate Matter Components, and Breast Cancer in the Sister Study: A U.S.-Wide Cohort. Environ Health Perspect. 2019;127(10):107002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Naimi AI, Platt RW, Larkin JC. Machine Learning for Fetal Growth Prediction. Epidemiology. 2018;29(2):290–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Berg V, Nøst TH, Pettersen RD, Hansen S, Veyhe AS, Jorde R, et al. Persistent Organic Pollutants and the Association with Maternal and Infant Thyroid Homeostasis: A Multipollutant Assessment. Environ Health Perspect. 2017;125(1):127–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Serrano-Lomelin J, Nielsen CC, Jabbar MSM, Wine O, Bellinger C, Villeneuve PJ, et al. Interdisciplinary-driven hypotheses on spatial associations of mixtures of industrial air pollutants with adverse birth outcomes. Environ Int. 2019;131:104972. [DOI] [PubMed] [Google Scholar]
