Abstract
Latent class analysis (LCA), although minimally applied to the statistical analysis of mixtures, may serve as a useful tool for identifying individuals with shared real-life profiles of chemical exposures. Knowledge of these groupings and their risk of adverse outcomes has the potential to inform targeted public health prevention strategies. This example applies LCA to identify clusters of pregnant women from a case-control study within the LIFECODES birth cohort with shared exposure patterns across a panel of urinary phthalate metabolites and parabens, and to evaluate the association between cluster membership and urinary oxidative stress biomarkers. LCA identified individuals with: “low exposure,” “low phthalates, high parabens,” “high phthalates, low parabens,” and “high exposure.” Class membership was associated with several demographic characteristics. Compared to “low exposure,” women classified as having “high exposure” had elevated urinary concentrations of the oxidative stress biomarkers 8-hydroxydeoxyguanosine (19% higher, 95% confidence interval [CI]=7%, 32%) and 8-isoprostane (31% higher, 95% CI=5%, 64%). However, contrast examinations indicated that associations between oxidative stress biomarkers and “high exposure” were not statistically different from those with “high phthalates, low parabens,” suggesting a minimal effect of higher paraben exposure in the presence of high phthalates. The presented example offers verification through application to an additional data set as well as a comparison to another unsupervised clustering approach, k-means clustering. LCA may be more easily implemented, more consistent, and more able to provide interpretable output.
Keywords: latent class models, mixtures methods, phthalates, phenols, oxidative stress
1. Introduction
Human-made chemical exposures, xenobiotics, in isolation and in combination have the potential to impact health (1). However, estimating the health impacts of chemical exposure mixtures is complicated by the typically highly correlated nature of chemicals that exist together in an environment of interest, e.g., the body, atmosphere, etc. Several statistical approaches have gained ground in recent years as a way to understand various questions pertaining to how mixtures impact human health (1). The selection of the best statistical method is inextricable from the research question. Some specific aims of mixtures research include: detection of important chemical-outcome relationships within a mixture; estimation of cumulative effects; identification of joint effects, such as interactions between chemicals; and exposure profile characterization (2). Our primary question here is: can we identify groups of individuals based on similar exposures to chemicals? We posit that answering this question may have strong public health relevance for three reasons. First, it enables identification of exposure combinations within individuals that occur in reality rather than in theory. Second, it allows for estimation of elevated risk among a group of individuals that may be related to a number of factors, including combinations of chemical exposures as well as certain vulnerabilities within that group. Third, because we are understanding the risk of a known group rather than an unknown individual, this information could be used to develop targeted, impactful interventions.
Latent class analysis (LCA) identifies groups of individuals with shared, real-world exposure profiles (3), but has been applied minimally in environmental epidemiology and never in studies of endocrine disruptors in pregnant women (4). The primary aim in this study was to apply LCA to identify latent classes (i.e., clusters) of pregnant women who share similar patterns of phthalate and phenol exposure biomarkers. We tested the reproducibility of latent class groupings using data on pregnant women from the National Health and Nutrition Examination Survey (NHANES) from the same time frame as the current study. Additionally, we compared the results from LCA with those from another clustering approach, k-means clustering (5, 6), which has been used more commonly as a chemical mixtures approach and offers a similar interpretation through the clustering of individuals but allows for continuous rather than categorical exposures. As a secondary aim, we examined the association between membership in the latent classes and two urinary oxidative stress biomarkers, which we previously found to be associated with these chemicals in single-pollutant analyses (7, 8).
2. Methods
2.1. Study population
The LIFECODES birth cohort is an on-going prospective cohort study that was first designed to identify risk factors for preeclampsia (9). Women who are planning to deliver at Brigham and Women’s Hospital in Boston, Massachusetts, USA are enrolled and consented at <15 weeks gestation then participate in four study visits at median 10, 18, 26, and 35 weeks of gestation. At the first visit, participants complete a questionnaire providing demographic information, tobacco and alcohol use, and information on medical history. At each study visit, participants provide urine as well as blood samples. The present analysis utilizes a subset of individuals from this study population who delivered between 2006 and 2008 and were part of a case-control study designed to examine the relationship between environmental phthalate exposure and preterm birth (10). In that study, we selected 130 cases of preterm birth (delivery <37 weeks of gestation) from participants who delivered within that time frame as well as 352 random (i.e., unmatched) controls. To be included in the present analysis, participants needed at least one available measure for all chemicals of interest; as such, 4% (n=15) of controls were removed and 5% (n=7) of cases were removed. This resulted in a final sample size of 460 (n=123 cases of preterm birth, n=337 unmatched controls). Unless otherwise specified, analyses were inverse probability weighted using case-control sampling fractions so that the results are representative of the overall LIFECODES study population. Self-reported maternal age, pre-pregnancy body mass index (BMI), race/ethnicity, and education were available as covariates.
2.2. Exposure biomarkers
All urinary phthalate metabolites and phenols were analyzed in participants from the case-control study by NSF International (Ann Arbor, MI, USA) using methods described in detail elsewhere (10, 11). The exposures for this analysis were selected based on their availability and consistent measurement above the limit of detection (>50%). We included the following urinary phthalate metabolites: di-2-ethylhexyl phthalate (DEHP), mono(3-carboxypropyl) phthalate (MCPP), mono-benzyl phthalate (MBzP), mono-n-butyl phthalate (MBP), mono-isobutyl phthalate (MiBP), and mono-ethyl phthalate (MEP). Urinary phenols included: 2,4- and 2,5-dichlorophenols (2,4-DCP and 2,5-DCP), triclosan (TCS), benzophenone-3 (BP3), bisphenol-A (BPA), butyl paraben (BPB), ethyl paraben (EPB), methyl paraben (MPB), propyl paraben (PPB). Bisphenol-S and triclocarban were available but not included due to low detection rates (46.6% and 15.4%, respectively). For the present analyses, all chemicals were specific-gravity corrected and the exposures were the geometric means of available measurements across visits 1, 2, and 3 in pregnancy (median 10, 18, and 26 weeks of gestation). In other words, for a participant with only one measurement, the average measure would be equal to that measurement; for a participant with two measurements, the average measure would be equal to the geometric mean of those two measures, etc. There was equal representation of cases as well as controls at each of these three study visits (12).
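The averaging rule described above can be illustrated with a short Python sketch. This is not the study's code: the function name `geometric_mean_available` and the example concentrations are hypothetical, missing visits are represented as `None`, and all concentrations are assumed to be positive and already specific-gravity corrected.

```python
import math

def geometric_mean_available(measurements):
    """Geometric mean of the non-missing visit measurements.

    `measurements` holds one concentration per visit; a missing visit is
    None. Returns None when no visit has a measurement.
    """
    observed = [x for x in measurements if x is not None]
    if not observed:
        return None
    # exp(mean(log x)) is the geometric mean
    return math.exp(sum(math.log(x) for x in observed) / len(observed))

# Hypothetical concentrations at visits 1-3
one_visit = geometric_mean_available([12.0, None, None])   # equals the single value
two_visits = geometric_mean_available([4.0, None, 9.0])    # sqrt(4 * 9) = 6
```

With a single measurement the function returns that measurement (up to floating-point error), matching the rule stated in the text.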
2.3. Oxidative stress biomarkers
Oxidative stress is considered a mediator between exposures and an array of health end points, including adverse birth outcomes. The urinary oxidative stress biomarkers 8-hydroxydeoxyguanosine (8-OHdG) and 8-isoprostane were measured in all available urine samples using enzyme immunoassay by Cayman Chemical (Ann Arbor, MI, USA) (8). As with the exposure biomarkers, we used the geometric means of specific-gravity corrected oxidative stress marker measurements across visits 1-3.
2.4. Statistical analysis
To assess correlation between urinary phthalate metabolites and phenols, we first created a Pearson correlation heat map of associations between subject-specific geometric averages using the package “corrplot” in R (13).
2.4.1. Latent class analysis
To address our primary aim, we applied LCA using the R package “poLCA” (14). Accompanying code for this example is available at GitHub repository “LCAmix” from user “carrollrm.” We used an unsupervised LCA approach, meaning that the outcome of interest was not considered when making class assignments, because the primary aim was to identify meaningful exposure profiles and not to estimate exposure-outcome associations (15). Further, the classes were defined using only controls to minimize potential bias associated with the case-control study design.
We dichotomized each pregnancy-specific average exposure measure into below versus above the median, using the median measure among the controls only, to indicate “low” versus “high exposure” for each of the M chemicals described in Section 2.2. In this method, we prespecify K classes for describing the exposure profile of the n individuals such that $\pi_k$ describes the proportion of individuals in class k (k = 1,…,K). Each class represents a different exposure profile based on m = 1,2,…,M chemicals, and the proportion of individuals with high (or low) exposure to chemical m in class k is $p_{mk}$ (or $1 - p_{mk}$). The representation of each chemical in the K latent categories can be summarized in a contingency table of size M × K. From these definitions, it follows that $n\pi_k p_{mk}$ represents the resulting number of individuals in class k with relatively high exposure to chemical m.
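As a minimal illustration of the dichotomization step, the Python sketch below computes the cutoff from controls only and applies it to everyone. The function name and example values are hypothetical, and the handling of values exactly equal to the median (counted as “low” here) is an assumption, since the text does not specify it.

```python
from statistics import median

def dichotomize_at_control_median(values, is_control):
    """Return (cutoff, indicators); 1 = above the control median, 0 = otherwise."""
    cutoff = median([v for v, c in zip(values, is_control) if c])
    return cutoff, [1 if v > cutoff else 0 for v in values]

# Hypothetical averaged concentrations; the last two individuals are cases
values = [1.0, 2.0, 3.0, 4.0, 10.0, 0.5]
is_control = [True, True, True, True, False, False]
cutoff, high = dichotomize_at_control_median(values, is_control)

# Expected number of individuals in class k with high exposure to chemical m,
# using hypothetical values n = 337, pi_k = 0.25, p_mk = 0.8
expected_high = 337 * 0.25 * 0.8
```

Because the cutoff comes from controls only, cases can fall on either side of it without influencing where the cutoff lies.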
The class assignments were estimated by maximizing a multinomial log likelihood (see supplemental material) using an expectation-maximization (EM) algorithm. We determined the appropriate number of classes by performing the LCA for K = 2 to 10 classes and comparing several goodness-of-fit statistics, including Akaike information criterion (AIC) and Bayesian information criterion (BIC), which are considered the most appropriate for basic LCA models because of simplicity and the use of penalty terms for the number of parameters (16, 17). However, we also considered interpretability as an important deciding factor in determining the ideal number of classes.
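To make the estimation step concrete, the following is a minimal Python sketch of an EM algorithm for a latent class model with binary indicators, together with AIC and BIC. It is an illustration under stated assumptions, not the poLCA implementation used in the study: the function names, the toy data, the random initialization, the fixed number of iterations, and the clipping constants are all hypothetical choices.

```python
import math
import random

def class_joint(row, pi, p):
    """pi[k] * P(row | class k) for every class k."""
    return [
        pi_k * math.prod(q if x else 1.0 - q for q, x in zip(p_k, row))
        for pi_k, p_k in zip(pi, p)
    ]

def fit_lca(data, n_classes, n_iter=200, seed=0):
    """EM for binary-indicator LCA; returns (pi, p, loglik)."""
    rng = random.Random(seed)
    n, n_items = len(data), len(data[0])
    pi = [1.0 / n_classes] * n_classes
    p = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
         for _ in range(n_classes)]
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities
        post = []
        for row in data:
            joint = class_joint(row, pi, p)
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: update class proportions and item probabilities
        for k in range(n_classes):
            weight = max(sum(r[k] for r in post), 1e-12)  # guard empty class
            pi[k] = weight / n
            for m in range(n_items):
                high = sum(r[k] * row[m] for r, row in zip(post, data))
                # clip away from 0 and 1 so the likelihood stays finite
                p[k][m] = min(max(high / weight, 1e-6), 1.0 - 1e-6)
    loglik = sum(math.log(sum(class_joint(row, pi, p))) for row in data)
    return pi, p, loglik

def information_criteria(loglik, n_classes, n_items, n):
    """AIC and BIC; parameters are (K - 1) proportions plus K * M probabilities."""
    n_params = (n_classes - 1) + n_classes * n_items
    return 2 * n_params - 2 * loglik, math.log(n) * n_params - 2 * loglik

# Hypothetical data: 50 individuals, 4 dichotomized chemicals
data = [[1, 1, 1, 0]] * 20 + [[0, 0, 0, 1]] * 20 + [[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5
results = {}
for K in (1, 2, 3):
    pi, p, loglik = fit_lca(data, K)
    results[K] = information_criteria(loglik, K, len(data[0]), len(data))
```

In practice, EM can converge to local maxima, so implementations are typically rerun from several starting values and the best solution retained; the sketch uses a single start for brevity.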
Once the best choice for the number of classes, K, was determined from the controls-only data, a mean overall exposure was calculated across all M chemicals via $\bar{p}_k = \frac{1}{M}\sum_{m=1}^{M} p_{mk}$ for each of the K classes. This value represents the mean proportion of individuals with greater than the median biomarker concentrations across all chemicals for latent class k. Classes were ordered from lowest to highest mean overall exposure value for presentation. Class membership was assigned to each participant based on their highest class membership probability. Posterior class membership probabilities (i.e., those obtained after completion of the EM algorithm), which indicate individual-level certainty of class assignment, were also calculated (15).
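The ordering and assignment steps can be sketched as follows in Python. The parameter values are hypothetical (two classes, three chemicals), and the helper names are introduced only for illustration; the posterior is obtained with Bayes' rule from the class proportions and the class-specific probabilities of high exposure.

```python
import math

def mean_overall_exposure(p_k):
    """Average probability of high exposure across the M chemicals of one class."""
    return sum(p_k) / len(p_k)

def posterior_probabilities(row, pi, p):
    """P(class k | observed 0/1 exposure pattern) by Bayes' rule."""
    joint = [
        pi_k * math.prod(q if x else 1.0 - q for q, x in zip(p_k, row))
        for pi_k, p_k in zip(pi, p)
    ]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical estimates: class proportions and P(high | class) per chemical
pi = [0.4, 0.6]
p = [[0.8, 0.9, 0.7],   # mostly high exposure
     [0.2, 0.1, 0.3]]   # mostly low exposure

# Order classes from lowest to highest mean overall exposure
order = sorted(range(len(p)), key=lambda k: mean_overall_exposure(p[k]))

# Posterior probabilities and modal class for one exposure pattern
post = posterior_probabilities([1, 1, 0], pi, p)
assigned = max(range(len(post)), key=post.__getitem__)
```

Here the individual with high exposure to the first two chemicals is assigned to the first class, and `post[assigned]` is the certainty of that assignment.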
Finally, we generated class assignments for the cases in the LIFECODES data set with the K class LCA model for further examination. As a first step, we examined demographic characteristics of individuals within each latent class and tested for differences across groups using chi-squared tests of independence.
2.4.2. Latent class associations with oxidative stress
To address our secondary aim, we examined the association between latent class membership and urinary oxidative stress biomarkers using linear regression models with each of the log-transformed urinary oxidative stress biomarkers as the outcome in separate models and a variable for latent class assignment as the predictor. We first used the latent class with the lowest mean overall exposure as the reference group, and subsequently changed the reference category in order to calculate all contrasts. All covariates examined in relation to latent classes are included as covariates in the models in order to replicate results from previously published single-pollutant models.
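Because the outcomes are log-transformed, regression coefficients are reported as percent changes, and changing the reference group amounts to taking ratios on the original scale. The Python sketch below illustrates both conversions, assuming natural-log transformation (the base is not stated in the text); the function names are hypothetical, and only point estimates are handled, since confidence intervals for a new contrast also require the covariance between coefficients.

```python
import math

def percent_change(beta):
    """Percent change in the outcome implied by a coefficient on the log scale."""
    return 100.0 * (math.exp(beta) - 1.0)

def contrast_percent(pct_a, pct_b):
    """Percent difference of group A relative to group B, given each group's
    percent change relative to a shared reference group."""
    return 100.0 * ((1.0 + pct_a / 100.0) / (1.0 + pct_b / 100.0) - 1.0)

# A coefficient of log(1.19) corresponds to a 19% higher outcome
pct = percent_change(math.log(1.19))

# Table 2 point estimates for 8-OHdG relative to "low exposure":
# 19% ("high exposure") and 16% ("high phthalates, low parabens")
high_vs_phthalates = contrast_percent(19.0, 16.0)
```

The resulting contrast is about 2.6%, in line with the roughly 2% reported in Table 2 when “high phthalates, low parabens” is the reference; the small gap is expected because the inputs here are rounded.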
2.4.3. Reproducibility of latent classes
We were additionally interested in whether the latent classes identified in the LIFECODES study were present in other populations, after accounting for differences in demographic characteristics (i.e., whether the exposure patterns are consistent across populations). To assess this, we created covariate-adjusted LCA models based on: 1) controls from the LIFECODES population; and 2) pregnant women from NHANES (2005-2006 and 2007-2008 cycles). We then assigned class membership to each individual from NHANES based on each model and compared chemical exposure distributions and class membership. We restricted the NHANES dataset to the 134 pregnant women who had urinary concentrations of phthalate metabolites and phenols available and used creatinine-corrected concentrations for analysis. Urine collection methods and laboratory analysis for NHANES are described in detail elsewhere (18, 19).
Covariate-adjusted LCA was necessary because of the strong associations between demographic characteristics and class membership, and because NHANES data collection entails a complex survey design involving oversampling of underrepresented populations for precise estimates within those groups. Covariates were selected based on those characteristics that were accounted for in the NHANES sampling strategy (20) and were also available in the LIFECODES dataset: age; education level (to reflect socioeconomic status); and race/ethnicity. For both LCA models, the number of latent classes is forced to be the same as in the non-covariate-adjusted LIFECODES LCA model.
2.4.4. Comparison of latent class analysis with k-means clustering
Because k-means clustering has been used more commonly as a chemical mixtures approach, we wanted to compare the classes generated through this method, both in terms of number as well as interpretability. K-means clustering is an alternative unsupervised clustering approach that has been implemented to address mixtures questions using air pollution and exposure biomarker data (5, 6). Whereas LCA utilizes categorical exposures or predictors and maximizes the multinomial likelihood, k-means clustering uses continuous exposures and minimizes the sum of squared Euclidean distances across possible clusterings, given K (21, 22). We applied the classic k-means algorithm of MacQueen (21). K-means clustering proceeds by randomly selecting K points (called centroids) within the range of the exposures (e.g., a centroid in the LIFECODES data would have values for each of the M chemicals) and assigning each observation to the cluster whose centroid is closest to it. Centroid locations are recalculated based on the cluster mean, and this process is iterated until convergence. The procedure was repeated from many (300) random sets of starting locations. This analysis was performed using functions from the “flexclust” package in R (23).
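The assign-then-update loop described above can be sketched in Python as follows. This is a batch (Lloyd-style) illustration rather than the flexclust implementation used in the study, and it makes two simplifying assumptions that are not from the text: starting centroids are drawn from the observed points rather than from the range of the exposures, and the solution with the smallest within-cluster sum of squares across starts is kept. All names and data are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, n_starts=300, n_iter=100, seed=0):
    """Batch k-means with random restarts; returns (labels, centroids, wss)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centroids = [list(c) for c in rng.sample(points, k)]
        labels = None
        for _ in range(n_iter):
            # Assignment step: nearest centroid for every point
            new_labels = [
                min(range(k), key=lambda j: squared_distance(pt, centroids[j]))
                for pt in points
            ]
            if new_labels == labels:
                break  # assignments stopped changing: converged
            labels = new_labels
            # Update step: move each centroid to the mean of its members
            for j in range(k):
                members = [pt for pt, lab in zip(points, labels) if lab == j]
                if members:  # keep the old centroid if a cluster empties
                    centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        wss = sum(squared_distance(pt, centroids[lab])
                  for pt, lab in zip(points, labels))
        if best is None or wss < best[2]:
            best = (labels, centroids, wss)
    return best

# Hypothetical two-chemical data with two well-separated groups
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids, wss = k_means(points, k=2, n_starts=10)
```

Unlike LCA, the inputs here are the continuous concentrations themselves, so the scale of each chemical affects the distances; standardizing or log-transforming before clustering is a common choice, though the text does not state what was done.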
In addition to comparing the number and interpretability of the classes using these two approaches, we also examined the differences in the class membership associations with oxidative stress biomarkers when the number of classes was forced to be the same as what we observed in LCA. We then proceeded with regression analysis of the covariate-adjusted association between the k-means clusters and urinary oxidative stress biomarkers using an approach identical to that described in the LCA analysis.
2.4.5. Code availability
Accompanying code for the LCA methods is available at GitHub repository “LCAmix” from user “carrollrm.” This is available as an R markdown file to lead viewers through a simple example of performing these methods.
3. Results
Distributions of the phenols and urinary phthalate metabolites considered here were similar to what has been observed in the general US population (9, 10). Figure 1 displays a heat map of the Pearson correlations between chemical averages. Correlations among the urinary phthalate metabolites and phenols analyzed ranged from weak to high (range: −0.09 to 0.98), with the highest correlations being between: 2,4-DCP and 2,5-DCP at 0.98; MPB and PPB at 0.63; and MBP and MCPP at 0.54. Correlations within phthalate metabolites and within phenols were greater in magnitude than across the two chemical classes.
Figure 1.
Pearson correlation heat map for average urinary phthalate metabolite and phenol concentrations measured during pregnancy in LIFECODES.
The abbreviations are defined as follows: DEHP - di-2-ethylhexyl phthalate, MCPP - mono(3-carboxypropyl) phthalate, MBzP - mono-benzyl phthalate, MBP - mono-n-butyl phthalate, MiBP - mono-isobutyl phthalate, MEP - mono-ethyl phthalate, 2,4-DCP - 2,4-dichlorophenol, 2,5-DCP - 2,5-dichlorophenol, TCS - triclosan, BP3 - benzophenone-3, BPA - bisphenol-A, BPB - butyl paraben, EPB - ethyl paraben, MPB - methyl paraben, PPB - propyl paraben.
3.1. Latent class analysis
To determine the number of latent classes within our dataset, we compared goodness-of-fit measures for models with 2-10 latent classes (Supplemental Table 1). Results varied based on the criteria used for selecting the optimal number of classes. We selected our final model with four latent classes because that choice of K produced the lowest BIC and because this number of classes was more interpretable than the 10 latent classes identified as best based on AIC. Classes were ordered from low to high mean overall exposure and all $p_{mk}$ proportion estimates are shown in Figure 2. Each bar displays the proportion of individuals with high (above median, dark gray) or low (below median, light gray) exposure biomarker concentrations for each chemical within each latent class.
Figure 2.
Proportion of term birth individuals with high (above median, dark gray) or low (below median, light gray) exposure biomarker concentrations within each of four latent classes identified by the unadjusted 4 class model.
The abbreviations are defined as follows: DEHP - di-2-ethylhexyl phthalate, MCPP - mono(3-carboxypropyl) phthalate, MBzP - mono-benzyl phthalate, MBP - mono-n-butyl phthalate, MiBP - mono-isobutyl phthalate, MEP - mono-ethyl phthalate, 2,4-DCP - 2,4-dichlorophenol, 2,5-DCP - 2,5-dichlorophenol, TCS - triclosan, BP3 - benzophenone-3, BPA - bisphenol-A, BPB - butyl paraben, EPB - ethyl paraben, MPB - methyl paraben, PPB - propyl paraben.
For the n=337 controls, the first latent class (n=72, “low exposure” group) contained mostly individuals who had lower than median biomarker concentrations across all chemicals (mean of 26% with “high exposure” across all biomarkers). The second latent class (n=93, “low phthalates, high parabens”) was made up of individuals with lower than median concentrations of phthalate metabolites, BPA, and dichlorophenols, but higher than median concentrations of phenols found commonly in personal care products including parabens and BP3 (mean of 48% with “high exposure” across all biomarkers). The third latent class (n=79, “high phthalates, low parabens”) comprised individuals with greater than median concentrations of all phthalates, dichlorophenols, and BPA, but below median concentrations of parabens and BP3 (mean of 51% with “high exposure” across all biomarkers). Note that TCS levels were similar in the “low phthalates, high parabens” group and the “high phthalates, low parabens” group, where approximately half of the women had greater than the median urinary concentrations in each class. The fourth and final latent class (n=93, “high exposure”) contained individuals with higher than the median biomarker concentrations for most of the phthalate metabolites and phenols measured, with the exception of BP3. This group had the highest mean percentage of individuals with “high exposure” across all biomarkers (71%). Posterior class membership probabilities were high (median 95%, interquartile range: 81%, 99%), indicating that women had a high probability of belonging to the class to which they were assigned (histogram of probabilities shown in Supplemental Figure 1).
Next, we assigned class membership to all cases as well as controls in the LIFECODES population and assessed the associations between latent class assignment and the covariates of interest for all women in the LIFECODES data set. Table 1 displays distributions and chi square test results for differences in latent class assignment by each level of demographic characteristic. The “low exposure” group contained more women who were White (70% vs. 59% overall) and who had a college degree or higher (47% vs. 40% overall). In contrast, the “high exposure” group had a higher percentage of Black and Other race participants (34% and 36%, respectively, vs. 16% and 25% overall) and more participants in the lower age groups. The “low phthalates, high parabens” group appeared to include older individuals as well as individuals with higher education and lower BMI. Individuals in the “high phthalates, low parabens” group were more likely to be classified as Other race when comparing to the overall study population. Chi squared tests indicated significant differences in all characteristics across classes.
Table 1:
Demographic characteristics overall and within latent classes for pregnant women (n=460) in the LIFECODES birth cohort.
| Covariate^a | Overall population (n=460) | Low exposure (n=97) | Low phthalates, high parabens (n=134) | High phthalates, low parabens (n=104) | High exposure (n=125) |
|---|---|---|---|---|---|
| Race | | | | | |
| White | 271 (59%) | 68 (70%) | 112 (84%) | 54 (52%) | 37 (30%) |
| Black | 73 (16%) | 10 (10%) | 10 (7%) | 10 (10%) | 43 (34%) |
| Other | 116 (25%) | 19 (20%) | 12 (9%) | 40 (38%) | 45 (36%) |
| Education | | | | | |
| ≤ HS | 67 (14%) | 7 (7%) | 5 (4%) | 19 (18%) | 36 (29%) |
| Technical school | 77 (17%) | 11 (11%) | 14 (10%) | 17 (16%) | 35 (28%) |
| Junior college/some college | 133 (29%) | 33 (34%) | 50 (37%) | 20 (19%) | 30 (24%) |
| ≥ College graduate | 183 (40%) | 46 (47%) | 65 (49%) | 48 (46%) | 24 (19%) |
| Age (years) | | | | | |
| <25 | 51 (11%) | 7 (7%) | 2 (2%) | 14 (14%) | 28 (22%) |
| 25-29 | 93 (20%) | 17 (18%) | 26 (19%) | 19 (18%) | 31 (25%) |
| 30-34 | 182 (40%) | 40 (41%) | 54 (40%) | 40 (38%) | 48 (39%) |
| 35+ | 134 (29%) | 33 (34%) | 52 (39%) | 31 (30%) | 18 (14%) |
| Body Mass Index (kg/m2) | | | | | |
| <25 | 255 (55%) | 51 (53%) | 93 (69%) | 60 (58%) | 51 (41%) |
| 25-30 | 120 (26%) | 26 (27%) | 32 (24%) | 23 (22%) | 39 (31%) |
| >30 | 85 (19%) | 20 (21%) | 9 (7%) | 21 (20%) | 35 (28%) |
^a Chi-squared tests indicate significant associations between latent class membership and each of the characteristics examined.
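For readers who want to see the mechanics of a chi-squared test of independence, the Python sketch below computes the Pearson statistic and its degrees of freedom from the education counts in Table 1. It uses the unweighted counts as printed and a hypothetical helper name, so it illustrates the calculation rather than reproducing the study's exact test; the 5% critical value for 9 degrees of freedom (16.919) is a standard tabulated value.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Education (rows) by latent class (columns), counts from Table 1
education_by_class = [
    [7, 5, 19, 36],    # <= HS
    [11, 14, 17, 35],  # Technical school
    [33, 50, 20, 30],  # Junior college/some college
    [46, 65, 48, 24],  # >= College graduate
]
statistic, df = chi_squared_independence(education_by_class)
exceeds_critical = statistic > 16.919  # 5% critical value for df = 9
```

A statistic above the critical value indicates that the distribution of education differs across latent classes at the 5% level.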
3.2. Latent class associations with oxidative stress
Compared to those in the “low exposure” group, individuals in the “high phthalates, low parabens” group and those in the “high exposure” group had elevated urinary concentrations of both 8-isoprostane and 8-OHdG (Table 2). However, individuals in the “low phthalates, high parabens” group did not have statistically different levels of either biomarker. When the reference group was changed to examine additional contrasts, we observed that there was only a modest difference in oxidative stress biomarker concentrations in the “high exposure” group compared to the “high phthalates, low parabens” group, suggesting minimal influence of the parabens on the associations in the presence of high phthalates.
Table 2.
Percent change (95% confidence interval) in log-transformed urinary oxidative stress biomarker averages in association with latent class assignment (weighted regression model).
| Latent class assignment^a | 8-OHdG | 8-isoprostane |
|---|---|---|
| Low exposure | Ref | Ref |
| Low phthalates, high parabens | 3 (−6, 14) | 8 (−12, 34) |
| High phthalates, low parabens | 16 (5, 28) | 21 (−3, 51) |
| High exposure | 19 (7, 31) | 31 (5, 64) |
| Low exposure | −3 (−12, 6) | −14 (−22, −5) |
| Low phthalates, high parabens | Ref | Ref |
| High phthalates, low parabens | 12 (2, 24) | −11 (−19, −2) |
| High exposure | 15 (4, 27) | 2 (−7, 13) |
| Low exposure | −14 (−22, −5) | −17 (−34, 3) |
| Low phthalates, high parabens | −11 (−19, −2) | −10 (−28, 11) |
| High phthalates, low parabens | Ref | Ref |
| High exposure | 2 (−7, 13) | 8 (−13, 34) |
^a The reference groups are varied for additional contrast comparison. Models adjusted for maternal age, race, education, and pre-pregnancy BMI.
3.3. Reproducibility of latent classes
Among pregnant women from the NHANES population, the Pearson correlation matrix for chemicals analyzed in this study illustrated similarities to what we observed in the LIFECODES study sample (Supplemental Figure 2). In this population we also observed the highest correlation between 2,4-DCP and 2,5-DCP at 0.97, and the correlation between MPB and PPB also remained the second highest at 0.71.
Latent class assignments for NHANES participants based on covariate-adjusted LCA models from LIFECODES and NHANES data showed somewhat similar patterns (Supplemental Figures 3 and 4, respectively), and posterior class membership probabilities for both were very high (median 99%, interquartile range 95%-100%). Latent class 1 was designated as “low exposure” in the LIFECODES model, although the most similar class created from NHANES data was not very low (i.e., had a higher mean proportion of participants with above median exposure across all biomarkers, 43% [NHANES] vs. 26% [LIFECODES]). This class was the least consistent between the two models. Latent classes 2 and 3 exhibited “low phthalates, high parabens” and “high phthalates, low parabens” patterns, respectively, consistently across the two models. However, the “high phthalates, low parabens” class did show some differences, with lower measures for MBP, MEP, and the dichlorophenols in individuals classified with the NHANES LCA model compared to the LIFECODES LCA model. Latent class 4 was clearly “high exposure” in classes from both models (mean of 67% and 70% of participants with above median exposure across all biomarkers for NHANES and LIFECODES, respectively). The cross tabulation comparing class membership based on the two different LCA models showed moderate agreement, with 59% (n=9+24+22+24=79) of women assigned to the same corresponding classes (Supplemental Table 2). Cohen’s Kappa, which quantifies the agreement between two classification schemes beyond that expected by chance (κ = 0.0 for chance-level agreement and κ = 1.0 for perfect agreement), was 0.46.
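Cohen's Kappa compares the observed proportion of agreement with the proportion expected by chance from the marginal totals. The Python sketch below shows the calculation on a small hypothetical 2 × 2 cross tabulation (the full cross tabulation from Supplemental Table 2 is not reproduced here), alongside the observed agreement implied by the diagonal counts quoted in the text.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square cross tabulation of two classifications."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Chance agreement: product of marginal proportions, summed over classes
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical cross tabulation: rows = model A classes, columns = model B classes
toy_table = [[20, 5],
             [10, 15]]
kappa = cohens_kappa(toy_table)

# Observed agreement from the diagonal counts reported in the text
observed_agreement = (9 + 24 + 22 + 24) / 134
```

In the toy table, 70% of individuals agree but 50% agreement is expected by chance, giving κ = 0.4; this illustrates why 59% raw agreement in the study corresponds to a more modest kappa.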
3.4. Comparison of LCA with k-means clustering
In the LIFECODES data, the adjusted Rand index was highest for K = 3, indicating that k-means clustering was most stable at a lower number of clusters. The distributions of exposures in these clusters could roughly be categorized as “low”, “moderate”, and “high” for all exposures, which is less interpretable than the classes identified by LCA (Supplemental Table 3).
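The adjusted Rand index measures how similar two partitions of the same individuals are, corrected for chance, and does not depend on how clusters are numbered. A minimal Python sketch of the index itself is given below; how the partitions being compared were generated in the study (for example, by resampling) is not described in the text, so the example partitions are purely hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items.

    Undefined (division by zero) when both partitions put all items in a
    single cluster or each item in its own cluster.
    """
    n = len(labels_a)
    # Pairs of items placed together in both partitions
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with different cluster numbers: perfect agreement
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that split every pair differently: below-chance agreement
different = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Values near 1 indicate that repeated clusterings recover essentially the same groups, which is the sense in which a higher index reflects a more stable choice of K.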
For comparison of associations between class membership and oxidative stress biomarkers with LCA, we took K = 4 to mirror the same number of clusters, so that we could identify if additional information harnessed from the continuous exposure measures in k-means clustering improved the ability to detect effects. K-means clusters yielded groupings with similar interpretation to those from the LCA analysis (Supplemental Table 4). Thus, we also refer to the k-means clusters as “low exposure”, “low phthalates, high parabens”, “high phthalates, low parabens”, and “high exposure” in order to facilitate interpretation. While the k-means clusters demonstrated similar characteristics to those generated from LCA with respect to levels of exposures, there were notable discrepancies in how individuals were grouped by each approach. For example, the “low exposure” group from k-means clustering comprised a nearly equal split between the “low exposure” and “low phthalates, high parabens” LCA classes (Supplemental Table 5). Cohen’s Kappa was estimated as κ = 0.43, suggesting that the two classification schemes yield moderate agreement (24).
Patterns of adjusted log-linear regression model coefficients for the two urinary oxidative stress biomarkers were similar for the 4 k-means clusters and the LCA classes, suggesting the two groupings identify similar patterns as predictors of our chosen urinary oxidative stress biomarkers. Point estimates for the percent change in 8-OHdG for a change in exposure group were similar between k-means clusters and LCA. However, with k-means clustering the estimates of the adjusted percent increase in 8-isoprostane were approximately twice as large as those based on LCA. Contrasted with the “low exposure” group, the percent increase (95% CI) in 8-isoprostane was 26 (3, 53), 52 (23, 89), and 63 (26, 111) for the “low phthalates, high parabens”, “high phthalates, low parabens”, and “high exposure” k-means clusters, respectively (Table 3), compared with 8 (−12, 34), 21 (−3, 51), and 31 (5, 64) for the corresponding LCA classes (Table 2). This latter result suggests that the k-means clusters were more highly associated with 8-isoprostane.
Table 3.
Percent change (95% confidence interval) in log-transformed urinary oxidative stress biomarker averages in association with k-means cluster assignment (weighted regression model).
| K-means clusters^a | 8-OHdG | 8-isoprostane |
|---|---|---|
| Low exposure | Ref | Ref |
| Low phthalates, high parabens | 1 (−7, 11) | 26 (3, 53) |
| High phthalates, low parabens | 6 (−4, 17) | 52 (23, 89) |
| High exposure | 19 (6, 34) | 63 (26, 111) |
| Low exposure | −1 (−10, 8) | −5 (−14, 5) |
| Low phthalates, high parabens | Ref | Ref |
| High phthalates, low parabens | 4 (−5, 14) | −4 (−13, 6) |
| High exposure | 17 (5, 31) | 13 (1, 26) |
| Low exposure | −5 (−14, 5) | −34 (−47, −18) |
| Low phthalates, high parabens | −4 (−13, 6) | −17 (−33, 2) |
| High phthalates, low parabens | Ref | Ref |
| High exposure | 13 (1, 26) | 7 (−16, 36) |
^a The reference groups are varied for additional contrast comparison. Models adjusted for maternal age, race, education, and pre-pregnancy BMI.
Finally, a likelihood ratio test was performed to contrast regression models with and without latent class cluster or k-means cluster (i.e., does exposure cluster improve model fit over a model that includes only covariates). For the latent class analysis, cluster variables improved model fit in the regression for 8-OHdG but not 8-isoprostane (AIC: 371.3 [unadjusted] vs. 362.1 [adjusted] and 1085.9 [unadjusted] vs. 1086.9 [adjusted], respectively; likelihood ratio chi-squared p-values: <0.01 and 0.17, respectively), whereas fit improved for both models with the k-means analysis (AIC: 371.3 [unadjusted] vs. 368.2 [adjusted] and 1085.9 [unadjusted] vs. 1076.0 [adjusted], respectively; likelihood ratio chi-squared p-values: 0.03 and <0.01, respectively). Thus, LCA produced clusters that appeared to be more relevant to 8-OHdG while k-means clustering appeared to be more relevant to 8-isoprostane.
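The link between the reported AIC values and the likelihood ratio p-values can be checked with a short calculation. The Python sketch below assumes that adding the four-level cluster variable contributes three extra parameters (three indicator terms, one reference group) and that the AIC values follow the usual definition, AIC = 2k − 2 log L; both are assumptions rather than details stated in the text, and the function names are hypothetical. The upper-tail probability uses the closed form of the chi-squared distribution with 3 degrees of freedom.

```python
import math

def lr_statistic_from_aic(aic_reduced, aic_full, extra_params):
    """Likelihood ratio statistic implied by two AIC values.

    Since AIC = 2k - 2 log L, it follows that
    2 * (log L_full - log L_reduced) = (aic_reduced - aic_full) + 2 * extra_params.
    """
    return (aic_reduced - aic_full) + 2 * extra_params

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# AIC values reported in the text (covariates only vs. covariates + clusters)
p_lca_ohdg = chi2_sf_3df(lr_statistic_from_aic(371.3, 362.1, 3))
p_lca_iso = chi2_sf_3df(lr_statistic_from_aic(1085.9, 1086.9, 3))
p_km_ohdg = chi2_sf_3df(lr_statistic_from_aic(371.3, 368.2, 3))
p_km_iso = chi2_sf_3df(lr_statistic_from_aic(1085.9, 1076.0, 3))
```

Under these assumptions the implied p-values fall below 0.01 for the LCA 8-OHdG model and the k-means 8-isoprostane model, and are approximately 0.17 and 0.03 for the other two, which is consistent with the values reported above.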
4. Discussion
To add to the conversation on how to disentangle and summarize the effects of environmental chemical exposures on human health, we presented an example of the application of latent class analysis to identify groups of individuals with similar exposure profiles. We identified four groups of pregnant women from within the LIFECODES cohort with distinct patterns of exposure to phthalates and phenols in pregnancy. Individuals in these classes had somewhat different demographic characteristics, and individuals in the “high phthalates, low parabens” as well as the “high exposure” groups had significantly elevated urinary concentrations of the oxidative stress biomarkers 8-OHdG and 8-isoprostane compared to individuals in the “low exposure” group. We additionally observed similarities in latent classes in the LIFECODES study population compared to pregnant women from NHANES during the same time period. Finally, in a comparison of LCA to k-means clustering, we found that k-means selected a smaller number of clusters with less informative groupings, i.e., “low exposure”, “moderate exposure”, and “high exposure”, which provided little interpretable information about exposure-outcome associations. However, when k-means was forced to select the same number of clusters chosen by LCA, cluster membership was more strongly associated with 8-isoprostane.
The substantive interpretation of these findings is that high exposure to phthalates, but not high exposure to phenols, is associated with higher oxidative stress levels in pregnant women. Among pregnant women with high exposure to both phthalates and phenols, oxidative stress levels were not higher than among pregnant women with high exposure to phthalates alone. While we found no evidence for interaction, we acknowledge that the absence of such an effect cannot be tested statistically in LCA. However, had we observed evidence for such an interaction, e.g., if pregnant women with high exposure to both phthalates and phenols had much higher oxidative stress levels than pregnant women with high exposure to just one class of chemicals, formal interaction testing between compounds in the overall population could have been a subsequent analytic step.
These results were consistent with what has previously been observed in this study population using single-pollutant models and adaptive elastic net, where the overall conclusions were that the associations between phthalate exposure and oxidative stress appeared to outweigh any associations observed among phenols (8, 25). It should be noted that these associations are not directly comparable since these are markedly different statistical approaches. However, in the study of mixtures, it could be valuable to examine the same question using different statistical methods to confirm findings.
Our secondary analysis examining reproducibility of latent class assignment using LCA models from the LIFECODES and NHANES datasets showed that there was some overlap in latent class assignment for nearly half of the individuals in NHANES. Differences in demographic characteristics, behavioral patterns, method for urine dilution adjustment (i.e., creatinine vs. specific gravity), and exposure sources across place and time may explain the lack of reproducibility in class membership. It is unlikely that differences in individual exposure distributions between the two populations explain these differences, since the distributions were relatively similar. We posit that, even if they are not perfectly reproducible, the classes identified in a unique study population are important because they are accurate characterizations of distributions within that population and may extend to other similar populations to provide information on potentially biologically or behaviorally relevant combinations of exposures that could then be explored further. Moreover, within the population examined, identification of demographic characteristics or behavioral patterns associated with exposure could inform targeted and impactful exposure prevention strategies.
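Applying a latent class model estimated in one population to individuals from another amounts to computing posterior class membership probabilities from the fitted parameters. For dichotomous indicators, the standard LCA posterior is proportional to the class prevalence multiplied by the class-specific probability of each observed indicator value. The following is a minimal sketch of that calculation. The class labels echo those used in this paper, but every parameter value, the number of indicators, and the function name are hypothetical and are not estimates from LIFECODES or NHANES.

```python
def posterior_class_probs(x, prevalence, item_probs):
    """Return P(class | x) for one individual's 0/1 indicator vector x.

    prevalence[c]    : estimated share of the population in class c
    item_probs[c][j] : P(indicator j == 1 | class c), e.g. P(above median)
    """
    joint = []
    for pi_c, probs_c in zip(prevalence, item_probs):
        likelihood = pi_c
        for x_j, p_cj in zip(x, probs_c):
            # Indicators are assumed independent within a class.
            likelihood *= p_cj if x_j == 1 else (1.0 - p_cj)
        joint.append(likelihood)
    total = sum(joint)
    return [value / total for value in joint]


# HYPOTHETICAL parameters: two phthalate indicators, then two paraben indicators.
class_names = [
    "low exposure",
    "low phthalates, high parabens",
    "high phthalates, low parabens",
    "high exposure",
]
prevalence = [0.4, 0.2, 0.2, 0.2]
item_probs = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.9],
]

# A new individual above the median for both phthalates, below for both parabens.
new_individual = [1, 1, 0, 0]
posterior = posterior_class_probs(new_individual, prevalence, item_probs)
modal_class = max(range(len(posterior)), key=posterior.__getitem__)
print(class_names[modal_class], [round(p, 3) for p in posterior])
```

Assigning each individual to the modal class discards the uncertainty in the posterior, so two models with modestly different item-response probabilities can assign the same person to different classes. This is one mechanical reason why class assignment may not reproduce perfectly across populations.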
This study is one of the first to examine latent class models for chemical mixtures analysis. Previously, Hendryx and Luo used a similar approach to identify latent classes of children with shared exposure profiles in NHANES data (26) and identified individuals with low, medium, and high levels of exposure to various chemicals (including metals, pesticides, phthalates, phenols, and polycyclic aromatic hydrocarbons). Other clustering methods have also been applied in this context, including classification and regression trees (CART), Bayesian profile regression (BPR), and k-means clustering (27–29). CART and BPR are supervised methods which attempt to optimize prediction of an outcome. Using a latent class-based supervised approach was also a possibility; however, our primary objective was to identify groups of individuals with unique exposure patterns and to subsequently identify associations between those groups and oxidative stress. Clustering based on the outcome using a supervised approach could be useful for prediction but less informative for public health with respect to identifying exposure profiles that could correspond to behavioral patterns.
K-means clustering allows for use of continuous rather than categorical variables, with output (class assignment by exposure profiles) that is similar to that of LCA. In our secondary analysis comparing k-means clusters to latent classes, we observed that k-means selected fewer clusters, which generally reflected “low,” “moderate,” and “high exposure” individuals. These groupings provided little information that could be used for understanding exposure-response relationships or interaction, or that could subsequently be used for behavioral interventions. Furthermore, other studies using k-means clustering, especially with chemical exposure biomarkers, although notably not in air pollution studies, similarly appeared to identify only 2–3 groups (e.g., “low,” “high” or “low,” “moderate,” “high”) (5, 30–32). While it is not a given that LCA will identify more groups, or groups that are as easily interpretable, every time, this approach should at the very least be considered alongside k-means.
Additionally, we observed better than random agreement between k-means clusters and LCA classes, though the two were disparate enough to suggest that the approaches utilized somewhat different characteristics of the data to create groupings. While we were able to identify some similarities between the k-means clusters and the LCA classes (indeed, we gave the groups similar labels across the two schemes), the k-means groups were less obviously distinguishable from each other based on chemical class groupings than were the LCA classes. Overall, the k-means approach was superior for predicting levels of 8-isoprostane when the number of clusters was forced to the number identified by LCA; however, used alone, k-means appeared to favor fewer classes that did not align with chemical classifications.
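Agreement between two grouping schemes, beyond what chance alone would produce, is commonly summarized with Cohen's kappa (24). The sketch below shows the calculation for two label vectors, assuming that cluster labels have already been matched across schemes. The label vectors are hypothetical toy data and not study results, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two categorical assignments of the same individuals."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label vectors must be non-empty and of equal length")
    n = len(labels_a)
    # Observed proportion of individuals given the same label by both schemes.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both schemes place everyone in one identical group
    return (observed - expected) / (1.0 - expected)


# HYPOTHETICAL toy assignments for eight individuals.
lca_labels = ["low", "low", "high", "high", "mixed", "mixed", "low", "high"]
kmeans_labels = ["low", "low", "high", "low", "mixed", "high", "low", "high"]

kappa = cohens_kappa(lca_labels, kmeans_labels)
print(f"kappa = {kappa:.2f}")
```

A kappa of 0 corresponds to chance-level agreement and 1 to perfect agreement, so intermediate values fit the description of agreement that is better than random but far from complete.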
Results from LCA cannot be directly compared to other chemical mixtures analyses that address different research questions. However, a major strength of LCA, and of clustering approaches generally, is that comparisons are made between groups of individuals with real-world exposure profiles. Other mixtures analyses that aim to quantify cumulative or joint effects of chemical mixtures are sometimes interpreted in the same way when, in reality, the statistical method employed does not support that interpretation. For example, Snowden et al. demonstrated this problem for approaches that estimate single-chemical effects of some exposures while holding others constant (such as multiple regression or the “univariate response function” results from Bayesian Kernel Machine Regression), in which the difference in outcome per interquartile range (IQR) change in one exposure is estimated while the others are held constant (33). While LCA does not allow for disentangling independent effects of individual exposures or estimating the extent of interaction between exposures, if clusters correspond to distinct combinations of exposures (as in our analysis), then LCA allows qualitative analysis of interactions on a class-level (rather than exposure-level) basis.
LCA may also be an advantageous method for addressing chemical mixtures from a public health perspective, if the classes are indeed organized in a meaningful way. The results from our study identify specific population subsets that may be disproportionally exposed to chemicals associated with elevated oxidative stress levels in pregnancy. For example, the “high exposure” group was detected as having elevated levels of oxidative stress, and, from the covariate assessment, we know that the “high exposure” group was largely made up of younger, non-white women with higher BMI and less education. In the context of pregnancy, identifying population subgroups or behavioral patterns that place individuals at higher risk of exposure could be useful for development of targeted counseling on strategies to reduce exposure. Furthermore, in the area of research on consumer product chemicals, identifying sources to avoid may have a greater public health impact than identifying one chemical from within the mixture that is the most powerful toxicant. The latter could lead to regrettable substitution, and also to underestimation of the impact of behavioral interventions. Future work could examine behavioral patterns (e.g., cosmetic and personal care product use) in relation to latent classes derived from exposure biomarker concentrations to identify points for intervention. This broader approach may be more powerful for developing interventions than examination of single pollutant associations with behaviors and demographics.
LCA is not without limitations. A longitudinal approach to LCA, known as latent transition analysis, would have enabled use of our repeated exposure biomarker concentrations and perhaps more precise groupings. However, this extension could complicate the interpretation of class assignments as well as subsequent behavioral modification recommendations. Another limitation of LCA is that it could result in identified classes that do not provide useful biological comparisons (e.g., no “low” or “high” exposure classes) or in classes that are not well-differentiated which will make interpretation of findings difficult. LCA requires the categorization of exposures, which may result in a loss of information for continuous variables and prohibits traditional testing of non-linear exposure-outcome associations or identification of specific doses associated with an effect. We considered alternatives to categorizing exposures as above vs. below median (e.g., categorizing the chemicals in quartiles or using continuous measures in latent profile analysis, data not shown); however, this further complicated exposure representation and severely limited interpretability of the results, which we saw as the primary advantage of using this method. Finally, because LCA is an unsupervised approach and does not incorporate information about the outcome in the estimation of classes, it is possible that subgroups at highest risk of elevated oxidative stress levels may not be identified due to a low prevalence. However, from a public health perspective, we felt the outcomes identified as associated with class membership, when determined from exposure profiles created without consideration of the outcome, may be the most meaningful.
In conclusion, LCA may be a useful mixtures method for identifying groups of individuals with shared exposure profiles. Comparison of adverse outcomes between groups can inform our understanding of the cumulative impact of chemicals, and public health interventions for prevention may be facilitated by identification of demographic and behavioral characteristics of groups. Future work should examine replicability of LCA over time and identify exposure sources that are most strongly correlated with group assignment.
Acknowledgements
This research was supported by the Intramural Research Program of the National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (Z1AES103321). Additional funding was provided by NIEHS (R01ES018872 and R01ES029531).
Footnotes
Conflict of interest
The authors declare no conflicts of interest.
References
- 1. Taylor KW, Joubert BR, Braun JM, Dilworth C, Gennings C, Hauser R, et al. Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology: Lessons from an Innovative Workshop. Environ Health Perspect. 2016;124(12):A227–A9.
- 2. Braun JM, Gennings C, Hauser R, Webster TF. What Can Epidemiological Studies Tell Us about the Impact of Chemical Mixtures on Human Health? Environ Health Perspect. 2016;124(1):A6–9.
- 3. Agresti A. Other Mixture Models for Categorical Data. In: Balding DJ, Bloomfield P, Cressie NAC, Fisher NI, Johnstone IM, Kadane JB, et al., editors. Categorical Data Analysis. Hoboken, NJ: Wiley; 2002. p. 538–75.
- 4. Lazarevic N, Barnett AG, Sly PD, Knibbs LD. Statistical Methodology in Studies of Prenatal Exposure to Mixtures of Endocrine-Disrupting Chemicals: A Review of Existing Approaches and New Alternatives. Environ Health Perspect. 2019;127(2):026001.
- 5. Kalloo G, Wellenius GA, McCandless L, Calafat AM, Sjodin A, Karagas M, et al. Profiles and Predictors of Environmental Chemical Mixture Exposure among Pregnant Women: The Health Outcomes and Measures of the Environment Study. Environ Sci Technol. 2018;52(17):10104–13.
- 6. Zanobetti A, Austin E, Coull BA, Schwartz J, Koutrakis P. Health effects of multi-pollutant profiles. Environ Int. 2014;71:13–9.
- 7. Ferguson KK, Cantonwine DE, McElrath TF, Mukherjee B, Meeker JD. Repeated measures analysis of associations between urinary bisphenol-A concentrations and biomarkers of inflammation and oxidative stress in pregnancy. Reprod Toxicol. 2016;66:93–8.
- 8. Ferguson KK, McElrath TF, Chen YH, Mukherjee B, Meeker JD. Urinary phthalate metabolites and biomarkers of oxidative stress in pregnant women: a repeated measures analysis. Environ Health Perspect. 2015;123(3):210–6.
- 9. McElrath TF, Lim KH, Pare E, Rich-Edwards J, Pucci D, Troisi R, et al. Longitudinal evaluation of predictive value for preeclampsia of circulating angiogenic factors through pregnancy. Am J Obstet Gynecol. 2012;207(5):407 e1–7.
- 10. Ferguson KK, McElrath TF, Meeker JD. Environmental phthalate exposure and preterm birth. JAMA Pediatr. 2014;168(1):61–7.
- 11. Ferguson KK, Meeker JD, Cantonwine DE, Mukherjee B, Pace GG, Weller D, et al. Environmental phenol associations with ultrasound and delivery measures of fetal growth. Environ Int. 2018;112:243–50.
- 12. Ferguson KK, McElrath TF, Ko YA, Mukherjee B, Meeker JD. Variability in urinary phthalate metabolite levels across pregnancy and sensitive windows of exposure for the risk of preterm birth. Environ Int. 2014;70:118–24.
- 13. Wei T, Simko V. R package “corrplot”: Visualization of a Correlation Matrix. Version 0.84; 2017.
- 14. Linzer DA, Lewis JB. poLCA: An R Package for Polytomous Variable Latent Class Analysis. J Stat Softw. 2011;42(10):1–29.
- 15. McCutcheon AL. Latent class analysis. Thousand Oaks, California: Sage Publications; 1987.
- 16. Lin TH, Dayton CM. Model Selection Information Criteria for Non-Nested Latent Class Models. Journal of Educational and Behavioral Statistics. 2016;22(3):249–64.
- 17. Forster MR. Key Concepts in Model Selection: Performance and Generalizability. J Math Psychol. 2000;44(1):205–31.
- 18. Calafat AM, Ye X, Wong LY, Bishop AM, Needham LL. Urinary concentrations of four parabens in the U.S. population: NHANES 2005-2006. Environ Health Perspect. 2010;118(5):679–85.
- 19. Silva MJ, Barr DB, Reidy JA, Malek NA, Hodge CC, Caudill SP, et al. Urinary levels of seven phthalate metabolites in the U.S. population from the National Health and Nutrition Examination Survey (NHANES) 1999-2000. Environ Health Perspect. 2004;112(3):331–8.
- 20. Curtin LR, Mohadjer LK, Dohrmann SM. National Health and Nutrition Examination Survey: Sample design, 2007-2010. Vital Health Stat [Internet]. 2013;2(160). Available from: https://www.cdc.gov/nchs/data/series/sr_02/sr02_160.pdf.
- 21. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1967; Oakland, CA, USA.
- 22. Brusco MJ, Shireman E, Steinley D. A comparison of latent class, K-means, and K-median methods for clustering dichotomous data. Psychol Methods. 2017;22(3):563–80.
- 23. Leisch F. A toolbox for k-centroids cluster analysis. Comput Stat Data Anal. 2006;51(2):526–44.
- 24. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
- 25. Ferguson KK, Zhao L, Boss J, Mukherjee B, McElrath TF, Meeker JD. Urinary concentrations of Parabens, Triclosan, and other phenols in association with biomarkers of oxidative stress in pregnancy. Submitted 2019.
- 26. Hendryx M, Luo J. Latent class analysis to model multiple chemical exposures among children. Environ Res. 2018;160:115–20.
- 27. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Boca Raton, FL: Chapman and Hall/CRC; 1984.
- 28. Papathomas M, Molitor J, Richardson S, Riboli E, Vineis P. Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in nonsmokers. Environ Health Perspect. 2011;119(1):84–91.
- 29. Stafoggia M, Breitner S, Hampel R, Basagana X. Statistical Approaches to Address Multi-Pollutant Mixtures and Multiple Exposures: the State of the Science. Curr Environ Health Rep. 2017;4(4):481–90.
- 30. Zhao S, Yu Y, Yin D, He J, Liu N, Qu J, et al. Annual and diurnal variations of gaseous and particulate pollutants in 31 provincial capital cities based on in situ air quality monitoring data from China National Environmental Monitoring Center. Environ Int. 2016;86:92–106.
- 31. White AJ, Keller JP, Zhao S, Kaufman JD, Sandler DP. Air Pollution, Clustering of Particulate Matter Components and Breast Cancer. Cancer Epidemiology Biomarkers & Prevention. 2019;28(3):624.2–5.
- 32. Wang X, Mukherjee B, Batterman S, Harlow SD, Park SK. Urinary metals and metal mixtures in midlife women: The Study of Women’s Health Across the Nation (SWAN). International journal of hygiene and environmental health. 2019;222(5):778–89.
- 33. Snowden JM, Reid CE, Tager IB. Framing air pollution epidemiology in terms of population interventions, with applications to multipollutant modeling. Epidemiology. 2015;26(2):271–9.