Reducing the complexity of high-dimensional environmental data: an analytical framework using LASSO with considerations of confounding for statistical inference

Seth Frndak; Guan Yu; Youssef Oulhote; Elena I Queirolo; Gabriel Barg; Marie Vahter; Nelly Mañay; Fabiana Peregalli; James R Olson; Zia Ahmed; Katarzyna Kordas

doi:10.1016/j.ijheh.2023.114116

Author manuscript; available in PMC: 2024 Apr 1.

Published in final edited form as: Int J Hyg Environ Health. 2023 Feb 16;249:114116. doi: 10.1016/j.ijheh.2023.114116


PMCID: PMC10977870 NIHMSID: NIHMS1970975 PMID: 36805184

Abstract

Purpose:

Frameworks for selecting exposures in high-dimensional environmental datasets, while considering confounding, are lacking. We present a two-step approach for exposure selection with subsequent confounder adjustment for statistical inference.

Methods:

We measured cognitive ability in 338 children using the Woodcock-Muñoz General Intellectual Ability (GIA) score, and potential predictive features across several environmental domains. Initially, 111 variables theoretically associated with GIA score were introduced into a Least Absolute Shrinkage and Selection Operator (LASSO) in a 50% feature selection subsample. Effect estimates for selected features were subsequently modeled in linear regressions in a 50% inference (hold out) subsample, initially adjusting for sex and age and subsequently adjusting for covariates selected via directed acyclic graphs (DAGs). All models were adjusted for clustering by school.

Results:

Of the 15 LASSO selected variables, eleven were not associated with GIA score following our inference modeling approach. Four variables were associated with GIA scores: serum ferritin adjusted for inflammation (negatively), mother’s IQ (positively), father’s education (positively), and hours per day the child works on homework (negatively). The association for serum ferritin was not in the expected direction.

Conclusions:

Our two-step approach moves high-dimensional feature selection a step further by incorporating DAG-based confounder adjustment for statistical inference.

Keywords: Machine Learning, LASSO, High-Dimensional Data, Environmental Epidemiology, Child Health, Statistical Inference

BACKGROUND

Several conceptual models, like the “total” environment(1, 2) and the exposome(3), integrate different measures of the environment (built, social, natural) to recognize the multiple layers of influence on health. To measure the “total” environment, scientists collect a large array of variables resulting in high-dimensionality, wherein the number of variables exceeds the number of subjects(4). The wealth of data often collected in environmental epidemiology results in analytical challenges. Environmental epidemiologists need standardized statistical frameworks for analyzing high-dimensional data(5).

Traditional approaches test separate hypotheses for each environmental factor, resulting in inflated type I errors(6, 7). Another ill-advised approach is to add all variables to a single model. This results in wide confidence intervals, unreliable estimates of association(8), and spurious inferences(9). Interpreting all coefficients from such a mutually adjusted model as effect estimates is often referred to as the Table 2 fallacy(10). Alternatively, studies of numerous, simultaneous, chemical exposures typically use either a mixture approach or dimension reduction(11–13). However, these methods do not address how to analyze across questionnaire-, geographic- or personal monitor-based exposure measures. Furthermore, the reduction of all variables into subsets of indices or clusters may not be possible, or the created index may be difficult to interpret(14). These considerations merit investigation of alternative or complementary methods.

We propose the use of a machine learning (ML) method, Least Absolute Shrinkage and Selection Operator (LASSO), to select a set of predictive features (variables), thereby reducing the data to a manageable feature set. Once a feature set has been selected, however, confounding may persist between the selected features and outcome of interest. Prosperi and colleagues (2020) provide a thorough discussion of the limitations of ML techniques, especially when confounding is not considered after ML feature selection(15). In epidemiology, DAG-based confounder adjustment is considered the gold standard(16). DAGs help to minimize over-adjustment, prevent adjusting for causal mediators (variables along the causal pathway), and conceptualize potential biases(17). In short, a DAG-based confounder selection approach helps provide the least confounded effect estimate, making it an invaluable tool for statistical inference.

Therefore, we present a two-step process for statistical inference in high-dimensional environmental datasets: (i) LASSO feature selection on a 50% feature selection subsample, (ii) post-selection inference testing in a separate 50% subsample using DAG-based confounder adjustment. We demonstrate this technique by using an interpretable, normally distributed IQ-like measure of ~7-year-old children and their families from Montevideo, Uruguay and 111 features representing chemical, dietary, and psychosocial environments.

METHODS

Sample Recruitment

The original sample consisted of 357 children (~7 years of age) and their caregivers (typically mothers) recruited from eleven primary schools in Montevideo, Uruguay, in neighborhoods with suspected toxic metal exposure. Participants without the outcome of interest (n=18) were excluded, alongside one participant with a high number of missing values, resulting in a final sample of 338 participants. Parents provided consent and children assented to study procedures. The study protocol was approved by the Pennsylvania State University, the University at Buffalo, the University of the Republic of Uruguay, and the Catholic University of Uruguay. Additional detail is provided elsewhere(18).

Variables Collected

Primary Outcome

Our outcome was the Woodcock-Muñoz general intellectual ability (GIA) score, standardized by child sex and age, and validated for Spanish-speaking populations(19). Trained psychologists administered the test battery at the child’s school.

Caregiver Questionnaire, Home Visit, and 24-Hour Dietary Recall

Caregivers completed questionnaires at the child’s school. Example topics included: child demographics, health history, parental characteristics (i.e., demographics and smoking), and pregnancy information of the study child. Based on household characteristics (i.e. removing shoes before entering, dusting and hours windows open) an environmental exposure index was created(20). Possessions of wealth (i.e. TV, car, computer) were reduced into a single factor, after principal component analysis(21). Caregivers reported the child’s diet aided by trained nutritionists using two 24-hour dietary recalls administered 2 weeks apart. Dietary nutrients were extracted(22–25), and individual foods reduced into processed and nutrient dense food patterns(21, 26). A social worker visited the home and administered the Home Observation for the Measurement of the Environment Inventory (HOME)(27). During the visit, household water samples were analyzed for metal content (iron, lead, manganese, cadmium, and arsenic)(28).

Clinical Measures

During a separate school visit, height and weight were taken by a trained nurse or nutritionist and anthropometric z-scores were calculated (height, weight, and BMI)(29). Visual functioning was measured by placing the child 3 meters from an eye chart, covering one eye at a time. The eye chart contained 12 progressively smaller lines of letters from top to bottom. For each eye, the child read across and down the chart, reading each line until unable to correctly identify the letters. Each eye was scored independently from 0 to 10; 10 indicating that the child correctly read up to line 10. Blood and hair samples were collected at the child’s school by a trained nurse. Blood hemoglobin was measured at the school(22). Serum samples were analyzed for C-reactive protein, zinc, folate, vitamin B12, and ferritin following published procedures(22–25). Blood lead was analyzed at the University of the Republic of Uruguay(24), and hair metals at Pennsylvania State University(30). First morning urine samples were collected at home in previously supplied cups rinsed with 10% HNO₃ and deionized water. Urinary metals (including urinary arsenic and cadmium) were analyzed at Karolinska Institutet, Stockholm(22, 24). Analyses of urinary pesticide metabolites (including urinary creatinine) were completed at the University at Buffalo Analytical Toxicological Laboratory (Director, Dr. James Olson). Details of the laboratory methods are provided in the Online Supplemental Material (https://osf.io/spfte/).

Neighborhood Variables

Census segment characteristics from the Municipality of Montevideo Geographic Services (MMGS) were assigned based on home location, and a neighborhood disadvantage factor was developed(31). The MMGS maintains locations of informal settlements and parks. Distances to the nearest informal settlement and greenspace were calculated using the gDistance function in R(32). A total of 163 Montevideo “ferias” (outdoor markets and food outlets) were geocoded. Using the GISTools R package, we calculated the number of ferias within a 1-kilometer radius of the child’s home(33). Normalized difference vegetation index (NDVI) was calculated using a cloud-free, 3-meter resolution, 4-band satellite image from the Planet Imagery archive(34) (October 15th, 2018). NDVI was assigned using bilinear interpolation in ArcGIS, based on the location of the child’s home.
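The two neighborhood computations described above (counting ferias within 1 kilometer of a home and computing NDVI) can be illustrated with a minimal Python sketch. It is not the R/ArcGIS workflow used in the study: the coordinates, the function names, and the use of haversine great-circle distance are assumptions made for illustration. NDVI uses its standard definition, (NIR − Red)/(NIR + Red).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def count_within_radius(home, points, radius_km=1.0):
    """Number of points lying within radius_km of the home location."""
    return sum(
        1 for lat, lon in points
        if haversine_km(home[0], home[1], lat, lon) <= radius_km
    )

def ndvi(nir, red):
    """Normalized difference vegetation index from near-infrared and red reflectance."""
    if nir + red == 0:
        return math.nan  # undefined when both bands are zero
    return (nir - red) / (nir + red)

# Hypothetical coordinates (not study data): one home and three markets.
home = (-34.9000, -56.1600)
markets = [(-34.9050, -56.1600), (-34.9000, -56.1500), (-34.9200, -56.1600)]

n_markets = count_within_radius(home, markets, radius_km=1.0)
example_ndvi = ndvi(nir=0.5, red=0.1)
print(n_markets, round(example_ndvi, 3))
```

A projected coordinate system, as typically used in GIS software, would give planar distances instead; for a 1-kilometer buffer the difference from the great-circle distance is negligible.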

Indices and Factors

Administered questions were grouped into child sleep problems, caregivers’ thoughts, and attitudes toward child rearing at home and school, and child’s preparation for school. Online Supplemental Material (https://osf.io/spfte/) explains the creation of these 4 factor variables.

Variable Descriptions

Our final variable set contained 111 variables theoretically associated with child IQ. Supplemental Table 1 contains the source (i.e., questionnaire or laboratory report), number of variables used to create each index, and citations (where appropriate/available) of study variables (https://osf.io/spfte/).

Missing Data

We imputed up to the total analytical sample (n=338), where 329 children had at least one missing data point. We used an unsupervised random forest model to impute missing data with the missForest package in R. missForest outperforms many other imputation techniques(35, 36). Specifications are in Online Supplemental Material (https://osf.io/spfte/).

Least Absolute Shrinkage and Selection Operator (LASSO)

LASSO is appropriate for high-dimensional data and is useful for selecting predictive features. We performed 5-fold cross validated (CV) LASSO in STATA to obtain the optimal value of the tuning parameter λ on a 50% subsample of our data. We further adjusted for our recruitment methods by clustering by school using STATA’s vce(cluster) statement. The LASSO was repeated 100 times with 100 different random seeds for CV fold selection, resulting in 100 final models. This method is often used to increase model robustness(37). For each model, we used the selected value of λ corresponding to the model with the minimum CV error. We recorded the number of times each variable was selected by each LASSO model. Variables consistently selected by LASSO (>50 times) were advanced to the confounder-adjustment stage.
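The repeated cross-validation and selection-frequency steps described above can be sketched as follows. This is an illustrative pure-Python sketch, not the STATA analysis: the coordinate-descent solver, the λ grid, the simulated data, and the use of 10 repeats (rather than 100) are assumptions, and clustering by school is omitted.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    return z - g if z > g else z + g if z < -g else 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - a - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    xm = [sum(r[j] for r in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[r[j] - xm[j] for j in range(p)] for r in X]  # centered features
    resid = [v - ym for v in y]                          # residual at beta = 0
    beta = [0.0] * p
    col_sq = [sum(r[j] ** 2 for r in Xc) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    intercept = ym - sum(b * m for b, m in zip(beta, xm))
    return intercept, beta

def cv_error(X, y, lam, folds):
    """Mean squared prediction error of the LASSO across the given folds."""
    sq_err, count = 0.0, 0
    for test_idx in folds:
        held = set(test_idx)
        train = [i for i in range(len(y)) if i not in held]
        a, b = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in test_idx:
            pred = a + sum(bj * xj for bj, xj in zip(b, X[i]))
            sq_err += (y[i] - pred) ** 2
            count += 1
    return sq_err / count

def selection_frequency(X, y, lambdas, n_repeats=100, k=5):
    """Count how often each feature is nonzero across repeated k-fold CV LASSO fits."""
    n, p = len(y), len(X[0])
    freq = [0] * p
    for seed in range(n_repeats):  # a different fold assignment per repeat
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        best = min(lambdas, key=lambda lam: cv_error(X, y, lam, folds))
        _, beta = lasso_fit(X, y, best)  # refit with the minimum-CV-error lambda
        for j in range(p):
            freq[j] += beta[j] != 0.0
    return freq

# Simulated data (not study data): features 0 and 1 drive the outcome.
rng = random.Random(0)
n, p = 80, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 1) for r in X]

# 50% feature selection subsample; the other half is held out for inference.
order = list(range(n))
rng.shuffle(order)
select_idx, infer_idx = order[: n // 2], order[n // 2:]
X_sel, y_sel = [X[i] for i in select_idx], [y[i] for i in select_idx]

n_repeats = 10
freq = selection_frequency(X_sel, y_sel, [0.05, 0.1, 0.2, 0.4, 0.8], n_repeats=n_repeats)
stable = [j for j, c in enumerate(freq) if c > n_repeats / 2]  # selected in >50% of repeats
print(freq, stable)
```

Only the features in `stable` would be carried forward to the hold-out subsample (`infer_idx`), mirroring the ">50 times out of 100" rule described above.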

DAG-Based Confounder Adjustment for LASSO selected features

To prevent biased estimates during post-selection inference (38) we used the other 50% inference (hold out) subsample for confounding adjustment. First, we estimated partially adjusted associations between the LASSO selected features and GIA score using separate generalized linear models (GLMs) for each LASSO selected feature adjusted for age, sex, and survey clustering by school. DAGs were subsequently created only for features that maintained an association with GIA score in the partially adjusted models. DAGs were created using expert knowledge to determine a minimal confounding adjustment set in dagitty(39). DAG-based models were then estimated for each LASSO selected feature using the DAG identified confounder adjustment set including age, sex, and clustering by school. All continuous ML-selected variables were standardized so the effect estimate represented the change in GIA score per one standard deviation (SD) change in LASSO selected feature.
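The per-SD interpretation of the effect estimates can be illustrated with a small sketch. It uses an unadjusted simple regression on made-up numbers, so it omits the covariates and school clustering of the actual models; the variable names and values are hypothetical. It shows that standardizing a continuous feature rescales its slope by the feature's standard deviation.

```python
import statistics

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(x):
    """Z-score a variable using the sample mean and sample SD."""
    m, s = statistics.fmean(x), statistics.stdev(x)
    return [(v - m) / s for v in x]

# Hypothetical values (not study data).
exposure = [12.0, 20.0, 35.0, 41.0, 58.0, 63.0]
outcome = [101.0, 99.0, 97.0, 98.0, 94.0, 92.0]

raw_slope = ols_slope(exposure, outcome)                  # change per 1 raw unit
per_sd_slope = ols_slope(standardize(exposure), outcome)  # change per 1 SD
print(raw_slope, per_sd_slope)
```

Here `per_sd_slope` equals `raw_slope` multiplied by the sample SD of the exposure, which is what allows estimates for features measured in different units to be compared on a common scale.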

RESULTS

Study children were ~7 years of age (mean months=81, SD=6.5, n=338) with slightly more males (55%, n=338). The average number of possessions of wealth was 3 (SD=1.2, n=300), average years of education for the mother was 9 (SD=2.7, n=324) and average age of the mother was 33 years (SD=6.4, n=302).

Fifteen variables were selected (Figure 1): Parent’s familial situation (married vs other), whether the child has ever been hospitalized (Y/N), serum ferritin level (ng/mL), child preparation for school factor, caregiver attitude towards discipline, mother’s IQ, parent reported child’s weight at birth (g), frequency reading to the child, the nutrient dense dietary pattern factor, father’s education – secondary (Y/N), hours per day the child works on homework, urinary λCA, urinary cobalt (μg/L), urinary uranium (μg/L), and HOME inventory score. Examples of non-selected variables included: blood lead concentration (μg/dL), number of books owned by the child, and parental smoking status.

LASSO selected features were added to separate GLMs, each accounting for sex, age, and clustering by school. Eight variables were not associated with GIA score in these partially adjusted models: Parent’s familial situation (married vs other), whether the child has ever been hospitalized (Y/N), child preparation for school factor, caregiver attitude towards discipline, parent reported child’s weight at birth (g), the nutrient dense dietary pattern factor, urinary λCA, and urinary cobalt (μg/L).

Table 1 shows which confounders were considered and selected for each LASSO selected feature in the DAG-based confounder selection. After DAG-based confounder adjustment, three additional variables were no longer associated with GIA score: urinary uranium (μg/L), frequency reading to the child, and HOME inventory score. Four variables were associated with GIA score after DAG-based confounder adjustment: serum ferritin adjusted for inflammation (ng/mL) (β=−3.10, CI=(−5.16, −1.04)), mother’s IQ (β=4.43, CI=(1.27, 7.59)), father’s education – secondary (β=4.10, CI=(1.29, 6.91)), and hours per day the child works on homework (β=−1.08, CI=(−1.81, −0.36)). Figure 2 and Online Supplemental Table 2 (https://osf.io/spfte/) present estimates of association of the LASSO selected features with GIA scores, in partially (Panel A) and DAG-based adjusted models (Panel B).
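As a rough plausibility check on the estimates reported above, the standard error and test statistic can be approximately recovered from each coefficient and its interval. The sketch below assumes that the intervals are 95% confidence intervals based on a normal approximation, which the text does not state explicitly; the function name is hypothetical.

```python
Z_95 = 1.959964  # two-sided 95% critical value of the standard normal

def se_and_z(beta, lower, upper, z_crit=Z_95):
    """Approximate SE and z statistic from an estimate and a symmetric CI."""
    se = (upper - lower) / (2 * z_crit)
    return se, beta / se

# (beta, lower, upper) as reported for the DAG-based adjusted models.
estimates = {
    "serum ferritin": (-3.10, -5.16, -1.04),
    "mother's IQ": (4.43, 1.27, 7.59),
    "father's education": (4.10, 1.29, 6.91),
    "homework hours": (-1.08, -1.81, -0.36),
}

results = {}
for name, (beta, lower, upper) in estimates.items():
    se, z = se_and_z(beta, lower, upper)
    midpoint = (lower + upper) / 2  # should match beta up to rounding
    results[name] = (se, z, midpoint)
    print(f"{name}: SE={se:.2f}, z={z:.2f}, CI midpoint={midpoint:.3f}")
```

Under these assumptions each interval is centered on its reported coefficient up to rounding, and all four absolute z statistics exceed 1.96, consistent with intervals that exclude zero.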

Table 1.

Variables considered and selected as confounders using DAGs (rows) across LASSO selected features (columns)

		Exposure Variables
		Serum Ferritin (ng/mL)	Mother’s IQ	Father’s Education (Secondary/Other)	Urinary Uranium (μg/L)	Frequency Reading to Child	HOME Inventory Score	Hours per day Child Works on Homework
Confounders Considered	Maternal Education in Years		✓	✓	✓	✓	✓	✓
	Mother’s IQ		---	✓
	HOME Score	✓			✓		---	✓
	BMI Z-Score
	Serum Zinc (μg/dL)
	Hemoglobin (g/dL)
	Lead (Pb) (μg/dL)	✓			✓
	Fruits and Vegetables Factor	✓
	Dietary Protein (mg/day)
	Dietary Iron (mg/day)				✓
	Dietary B12 (mg/day)
	Dietary Folate (mg/day)
	Urinary Arsenic Metabolites (μg/L)				✓
	Household Crowding (Persons/Rooms)	✓					✓
	Caregiver Support of Child Autonomy			✓		✓	✓	✓
	Neighborhood Disadvantage Factor
	Total Population in Census Segment
	Children <5 in Household (Yes/No)					✓
	Child Sleep Problems Factor							✓
	Times Child Spanked (Past Week)
	Child Preparation for School
	Any Developmental Disability (Yes/No)		✓			✓	✓	✓
	Caregiver Occ. Exposure to Metals		✓
	Covariates
	Age	✓	✓	✓	✓	✓	✓	✓
	Sex	✓	✓	✓	✓	✓	✓	✓

Grey indicates a variable considered as a confounder in the DAG for the selected feature.

--- Indicates same variable as selected feature.

✓ Indicates a selected confounder based on minimal adjustment set.

Figure 2. Estimates of association of LASSO selected features and general cognitive ability in children using inference subsample

Note. All continuous LASSO selected features standardized. All models adjusted for age, sex, and clustering by school.

DISCUSSION

The “total” environment requires collection of data across multiple contexts, often resulting in high dimensionality. We presented an analytical framework for analyzing high-dimensional data that integrates both ML and causal epidemiology, recognizing that some variables may be predictive but not causally associated with the outcome or vice versa. LASSO selected 15 out of 111 potential features. Eight LASSO selected features were not associated with GIA score in the 50% inference subsample after partial adjustment. After DAG-based adjustment for confounders, three additional variables were not associated with GIA score. The final four variables associated with GIA score were serum ferritin adjusted for inflammation (ng/mL), mother’s IQ, father’s education – secondary, and hours per day the child works on homework. Our methodology can be applied to reduce high-dimensional data, account for post-selection bias, and adjust for confounding. DAG creation and interpretation of findings from this approach require expert knowledge. To that end, we first discuss our methodology and subsequently use expert knowledge to consider plausibility and potential mechanisms between LASSO selected features and cognitive ability.

Caveats of Our Statistical Framework

Because LASSO feature selection does not adjust for confounding, we suggest a two-step process of LASSO feature selection followed by DAG-based confounding adjustment. We note, however, that our framework might not be appropriate for all scenarios. The specific outcome of interest, size of dataset, and research goals may warrant other approaches. For example, expert knowledge could further exclude variables before LASSO feature selection. Biomarkers that do not accurately reflect the internal dose of an environmental factor might also be excluded. For example, as urinary cadmium is the gold-standard for long-term cadmium exposure(24, 40, 41), and reported relationships between hair cadmium and cognitive performance in children are mixed(42), we might have removed hair cadmium. However, when previous scientific knowledge does not exist, ML feature selection might be useful.

Hold-Out Subsample and DAGs for confounder selection

Eight features were not associated with GIA score in the 50% inference (hold out) subsample after partial adjustment. When data-driven approaches such as LASSO are used to select features, statistical inferences from that same dataset may be incorrect due to overfitting (43). As bootstrapping approaches cannot correct for this issue (44), data splitting is recommended for post-selection inference. Using the same dataset for feature selection and inference in environmental epidemiology may inflate type I error.

Similarly, three additional features were no longer associated with GIA score after DAG-based confounder adjustment. While these features may have predictive power and were frequently selected by the LASSO model, confounding was not considered during feature selection. DAGs provide a way to consider relationships between potential confounders, exposure, and outcome, and to select a minimal adjustment set for statistical inference. Neglecting adjustment for potential confounders after LASSO feature selection could have resulted in incorrect statistical inferences. Our two-step method is crucial for statistical inference in high-dimensional data.
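The rationale for data splitting can be illustrated with a small simulation. In the sketch below, all features are pure noise, so every "significant" association is a false positive. For simplicity, feature selection picks the single feature with the largest absolute correlation with the outcome rather than running a LASSO, and p-values use the Fisher z approximation; the sample sizes and names are assumptions, not the study design.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def p_value_r(r, n):
    """Approximate two-sided p-value for r using the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_sims=300, n=100, p=20, alpha=0.05, seed=1):
    """False-positive rates when testing on the selection half vs. the hold-out half."""
    rng = random.Random(seed)
    half = n // 2
    same_hits, holdout_hits = 0, 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]
        cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]  # noise features
        # Step 1: select the most correlated feature using the first half only.
        rs = [pearson_r(c[:half], y[:half]) for c in cols]
        j = max(range(p), key=lambda k: abs(rs[k]))
        # Naive inference: test the selected feature on the same half.
        same_hits += p_value_r(rs[j], half) < alpha
        # Split-sample inference: test it on the untouched second half.
        r_hold = pearson_r(cols[j][half:], y[half:])
        holdout_hits += p_value_r(r_hold, n - half) < alpha
    return same_hits / n_sims, holdout_hits / n_sims

same_rate, holdout_rate = simulate()
print(same_rate, holdout_rate)
```

In this setup the same-sample false-positive rate should be far above the nominal 5%, while the hold-out rate stays close to it, which is the behavior that motivates a separate inference subsample.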

Examining Features Associated with GIA score after DAG-based Confounder Adjustment

Four features were associated with child GIA score after DAG-based confounder adjustment: serum ferritin, mother’s IQ, father’s education, and hours per day the child works on homework. First, serum ferritin adjusted for inflammation was inversely associated with GIA score. This is contrary to many published studies on iron deficiency (measured by serum ferritin) and cognitive achievement in children (45). Additional examination of this study sample may be warranted to better understand this finding. Second, maternal IQ was positively associated with child IQ. This finding is highly consistent with prior literature, as maternal IQ is a well-established, positive predictor of child cognitive ability, with IQ heritability between 0.4–0.8 (1.0 indicates all variation is genetically explained) (46). Third, the father’s education was positively associated with child cognitive ability. When pairwise correlations within a group of variables are high, the LASSO tends to arbitrarily select only one variable from the group. This may be an instance of arbitrary selection of father’s education over mother’s education by the LASSO. In our sample, children of fathers with at least secondary education had mothers with approximately two more years of education on average (M=9.63, SD=2.70) than children of fathers without secondary education (M=7.86, SD=2.03) (t(336) = −5.84, p<0.001). Lastly, time spent working on homework was inversely associated with child cognitive ability. This could be due to reverse causality between time spent on homework and GIA score, reflected by difficulty in school. In the literature, the homework problems checklist (HPC) (47), which includes length of time to complete assignments, is negatively correlated with GPA among students with ADHD (48). While LASSO selected this feature and DAG confounder adjustment supported the statistical association, expert knowledge helps assess the nature of the relationship.

Variables Not Selected by LASSO or Not Associated with Cognitive Ability

Many variables were not selected despite prior evidence of associations with children’s cognitive performance. Noting the purpose of LASSO feature selection – maximizing prediction – can help interpret these findings. Non-selected variables may be causally associated with the outcome of interest, but not strongly predictive. Measurement error or selection forces could affect selection. For example, child’s birthweight (reported by the mother) may be subject to recall error. Dietary variables may only be predictive if the sample includes children with dietary deficiencies. Specification of a variable in the model may further affect selection. Previously, we demonstrated a non-linear relationship between blood lead and cognitive performance profiles(49). As we did not test non-linearity, this may have affected selection. Lastly, we chose the largest λ within one standard error of the minimum CV error to decrease the number of selected variables. Using the λ at the minimum CV error instead might have selected additional variables.
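The two choices of λ mentioned in this paper (the λ at the minimum CV error and the largest λ within one standard error of that minimum) can be contrasted with a short sketch. The function name and the CV error values below are hypothetical and are not taken from the study.

```python
def choose_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, largest lambda within one SE of it)."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    lam_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lam_1se

# Hypothetical CV results for a grid of five penalties.
lambdas = [0.01, 0.05, 0.1, 0.5, 1.0]
cv_mean = [1.30, 1.10, 1.00, 1.04, 1.40]  # mean CV error per lambda
cv_se = [0.06, 0.05, 0.05, 0.05, 0.07]    # standard error of the CV error

lam_min, lam_1se = choose_lambda(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Because a larger λ penalizes coefficients more heavily, the one-standard-error choice is never smaller than the minimum-error choice and generally retains the same number of variables or fewer.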

Strengths and Limitations of Our Methodology

Our methodology is not “one size fits all”. Specific situations may require additional data cleaning, index creation or changes in ML algorithm specification. Targeted learning, for example, may be a promising approach for causal inference in high-dimensional data(50). The approach outlined in this paper, however, may be a useful guide to identify important predictors and adjust for confounding using a traditional DAG-based framework. The integration of feature selection with causal inference thinking is an important topic, and our methodology provides one framework for such integration.

Our approach has some limitations. While LASSO can handle high multicollinearity, a variable may be left out in favor of another collinear variable. In our example, the selection of father’s education over mother’s education may have been such an instance. Our approach is also time-consuming. The computer processing power needed to run 100 CV models is not insignificant. Furthermore, time and expert knowledge are required to construct each DAG. Also, because our data were cross-sectional, we cannot establish temporality, as illustrated by the variable “time spent on homework”. Lastly, ML is agnostic to measurement error, bias, and interpretation of biomarkers. Further consideration of the relevant exposure window, dose, duration, and frequency might aid in understanding why certain variables are left un-selected by ML.

Conclusion

Our two-step process using LASSO feature selection and consideration of confounding allowed us to estimate non-confounded relationships between numerous variables representing different environmental domains and a well-characterized developmental outcome. Our methodology can be adapted to other high-dimensional environmental datasets.

Funding:

The results reported herein correspond to specific aims of grant 1R01ES023423 to investigator Katarzyna Kordas from the National Institute of Environmental Health Sciences. This work was also supported by grants R21ES019949 and 1R21ES16523 from the National Institute of Environmental Health Sciences (PI: Kordas). Support was also provided by the Community of Excellence in Global Health Equity (CGHE) at the University at Buffalo, State University of New York (Recipient: Frndak).

Footnotes

No competing interests to declare.

Data and Source Code: Data and source code used in these analyses may be obtained by request, made via email to kkordas@buffalo.edu

References:

1.Tulve N, Ruiz J, Lichtveld K, Darney S, Quackenboss J. Development of a conceptual framework depicting a child’s total (built, natural, social) environment in order to optimize health and well-being. J Environ Health Sci. 2016;2(2):1–8. [Google Scholar]
2.Ruiz JDC, Quackenboss JJ, Tulve NS. Contributions of a child’s built, natural, and social environments to their general cognitive ability: A systematic scoping review. Plos one. 2016;11(2):e0147741. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Kordas K, Young SL, Golding J. Measuring the Lifetime Environment in LMICs: Perspectives from Epidemiology, Environmental Health, and Anthropology. Transforming Global Health: Springer; 2020. p. 19–34. [Google Scholar]
4.Bühlmann P, Van De Geer S. Statistics for high-dimensional data: methods, theory and applications: Springer Science & Business Media; 2011. [Google Scholar]
5.Stingone JA, Buck Louis GM, Nakayama SF, Vermeulen RC, Kwok RK, Cui Y, et al. Toward greater implementation of the exposome research paradigm within environmental epidemiology. Annual review of public health. 2017;38:315–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological). 1995;57(1):289–300. [Google Scholar]
7.Patel CJ. Exposome-Wide Association Studies: A Data-Driven Approach for Searching for Exposures Associated with Phenotype. Unraveling the Exposome: Springer; 2019. p. 315–36. [Google Scholar]
8.Alin A Multicollinearity. Wiley Interdisciplinary Reviews: Computational Statistics. 2010;2(3):370–4. [Google Scholar]
9.Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic medicine. 2004;66(3):411–21. [DOI] [PubMed] [Google Scholar]
10.Westreich D, Greenland S. The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients. American Journal of Epidemiology. 2013;177(4):292–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97(458):611–31. [Google Scholar]
12.Santos S, Maitre L, Warembourg C, Agier L, Richiardi L, Basagaña X, et al. Applying the exposome concept in birth cohort research: a review of statistical approaches. European Journal of Epidemiology. 2020;35(3):193–204. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Hendryx M, Luo J. Latent class analysis to model multiple chemical exposures among children. Environmental Research. 2018;160:115–20. [DOI] [PubMed] [Google Scholar]
14.Weisskopf MG, Seals RM, Webster TF. Bias Amplification in Epidemiologic Analysis of Exposure to Mixtures. Environ Health Perspect. 2018;126(4):047003. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Prosperi M, Guo Y, Sperrin M, Koopman JS, Min JS, He X, et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nature Machine Intelligence. 2020;2(7):369–75. [Google Scholar]
16.Greenland S, Pearl J. Causal diagrams. 2007.
17.Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999:37–48. [PubMed] [Google Scholar]
18.Roy A, Queirolo E, Peregalli F, Mañay N, Martínez G, Kordas K. Association of blood lead levels with urinary F2–8α isoprostane and 8-hydroxy-2-deoxy-guanosine concentrations in first-grade Uruguayan children. Environmental research. 2015;140:127–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Schrank FA, McGrew KS, Ruef ML, Alvarado CG. Batería III Woodcock-Muñoz^™. Assessment Service Bulletin. 2005(1). [Google Scholar]
20.Roy A, Quierolo E, Peregalli F, Mañay N, Kordas K. Associations between dietary micronutrient intake and blood lead level in Uruguayan children. The FASEB Journal. 2011;25(S1):32.6–.6. [Google Scholar]
21.Kordas K, Burganowski R, Roy A, Peregalli F, Baccino V, Barcia E, et al. Nutritional status and diet as predictors of children’s lead concentrations in blood and urine. Environment international. 2018;111:43–51.
22.Kordas K, Queirolo EI, Mañay N, Peregalli F, Hsiao PY, Lu Y, et al. Low-level arsenic exposure: Nutritional and dietary predictors in first-grade Uruguayan children. Environmental research. 2016;147:16–23.
23.Kordas K, Roy A, Vahter M, Ravenscroft J, Mañay N, Peregalli F, et al. Multiple-metal exposure, diet, and oxidative stress in Uruguayan school children. Environmental research. 2018;166:507–15.
24.Burganowski R, Vahter M, Queirolo EI, Peregalli F, Baccino V, Barcia E, et al. A cross-sectional study of urinary cadmium concentrations in relation to dietary intakes in Uruguayan school children. Science of The Total Environment. 2019;658:1239–48.
25.Desai G, Millen AE, Vahter M, Queirolo EI, Peregalli F, Mañay N, et al. Associations of dietary intakes and serum levels of folate and vitamin B-12 with methylation of inorganic arsenic in Uruguayan children: Comparison of findings and implications for future research. Environmental Research. 2020;189:109935.
26.Desai G, Vahter M, Queirolo EI, Peregalli F, Mañay N, Millen AE, et al. Vitamin B-6 Intake Is Modestly Associated with Arsenic Methylation in Uruguayan Children with Low-Level Arsenic Exposure. The Journal of nutrition. 2020;150(5):1223–9.
27.Bradley RH, Caldwell BM, Corwyn RF. The Child Care HOME Inventories: Assessing the quality of family child care homes. Early Childhood Research Quarterly. 2003;18(3):294–309.
28.Ravenscroft J, Roy A, Queirolo EI, Mañay N, Martínez G, Peregalli F, et al. Drinking water lead, iron and zinc concentrations as predictors of blood lead levels and urinary lead excretion in school children from Montevideo, Uruguay. Chemosphere. 2018;212:694–704.
29.Donangelo CM, Kerr BT, Queirolo EI, Vahter M, Peregalli F, Mañay N, et al. Lead exposure and indices of height and weight in Uruguayan urban school children, considering co-exposure to cadmium and arsenic, sex, iron status and dairy intake. Environ Res. 2021;195:110799.
30.Rink SM, Ardoino G, Queirolo EI, Cicariello D, Mañay N, Kordas K. Associations between hair manganese levels and cognitive, language, and motor development in preschool children from Montevideo, Uruguay. Archives of environmental & occupational health. 2014;69(1):46–54.
31.Frndak S, Gallo Y, Queirolo EI, Barg G, Mañay N, Kordas K. A mixed methods study examining neighborhood disadvantage and childhood behavior problems in Montevideo, Uruguay. International Journal of Hygiene and Environmental Health. 2021;235:113753.
32.van Etten J. R package gdistance: distances and routes on geographical grids. Journal of Statistical Software. 2017;76(1):1–21.
33.Brunsdon C, Chen H, Brunsdon MC. Package ‘GISTools’. Comprehensive R Archive Network. 2015.
34.Marta S. Planet Imagery Product Specifications. Planet Labs: San Francisco, CA, USA. 2018:91.
35.Stekhoven DJ. Using the missForest package. R package. 2011:1–11.
36.Stekhoven DJ, Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 2011;28(1):112–8.
37.Manduchi E, Fu W, Romano JD, Ruberto S, Moore JH. Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC bioinformatics. 2020;21(1):1–13.
38.Kuchibhotla AK, Kolassa JE, Kuffner TA. Post-Selection Inference. Annual Review of Statistics and Its Application. 2022;9(1):505–27.
39.Textor J, Hardt J, Knüppel S. DAGitty: a graphical tool for analyzing causal diagrams. Epidemiology. 2011;22(5):745.
40.Adams SV, Newcomb PA. Cadmium blood and urine concentrations as measures of exposure: NHANES 1999–2010. Journal of Exposure Science & Environmental Epidemiology. 2014;24(2):163–70.
41.Tchounwou PB, Yedjou CG, Patlolla AK, Sutton DJ. Heavy metal toxicity and the environment. Exp Suppl. 2012;101:133–64.
42.Sanders AP, Claus Henn B, Wright RO. Perinatal and Childhood Exposure to Cadmium, Manganese, and Metal Mixtures and Effects on Cognition and Behavior: A Review of Recent Literature. Current Environmental Health Reports. 2015;2(3):284–94.
43.Rasines DG, Young GA. Splitting strategies for post-selection inference. arXiv preprint arXiv:2102.02159. 2021.
44.Hong L, Kuffner TA, Martin R. On overfitting and post-selection uncertainty assessments. Biometrika. 2018;105(1):221–4.
45.Pivina L, Semenova Y, Doşa MD, Dauletyarova M, Bjørklund G. Iron Deficiency, Cognitive Functions, and Neurobehavioral Disorders in Children. Journal of Molecular Neuroscience. 2019;68(1):1–10.
46.Nisbett RE, Aronson J, Blair C, Dickens W, Flynn J, Halpern DF, et al. Intelligence: new findings and theoretical developments. American psychologist. 2012;67(2):130.
47.Anesko KM, Schoiock G, Ramirez R, Levine FM. The homework problem checklist: Assessing children’s homework difficulties. Behavioral Assessment. 1987.
48.Langberg JM, Epstein JN, Girio-Herrera E, Becker SP, Vaughn AJ, Altaye M. Materials Organization, Planning, and Homework Completion in Middle-School Students with ADHD: Impact on Academic Performance. School Mental Health. 2011;3(2):93–101.
49.Frndak S, Barg G, Canfield RL, Queirolo EI, Mañay N, Kordas K. Latent subgroups of cognitive performance in lead- and manganese-exposed Uruguayan children: Examining behavioral signatures. Neurotoxicology. 2019;73:188–98.
50.Van der Laan MJ, Rose S. Targeted learning in data science: causal inference for complex longitudinal studies: Springer; 2018.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental online materials

NIHMS1970975-supplement-Supplemental_online_materials.docx (62.5 KB, docx)

