Abstract
Chronic exposure to groundwater contaminated with geogenic arsenic (As) poses a significant threat to human health worldwide, especially for those living on floodplains in South and Southeast (S–SE) Asia. In the alluvial and deltaic aquifers of S–SE Asia, aqueous As concentrations vary sharply over small spatial scales (10–100 m), making it challenging to identify where As contamination is present and mitigate exposure. Improved mechanistic understanding of the factors that control groundwater As levels is essential to develop models that accurately predict spatially variable groundwater As concentrations. Here we demonstrate that surface flooding duration and interannual frequency are master variables that integrate key hydrologic and biogeochemical processes that affect groundwater As levels in S–SE Asia. A machine-learning model based on high-resolution, satellite-derived, long-term measures of surface flooding duration and frequency effectively predicts heterogeneous groundwater As concentrations at fine spatial scales in Cambodia, Vietnam, and Bangladesh. Our approach can be reliably applied to identify locations of safe and unsafe groundwater sources with sufficient accuracy for making management decisions by solely using remotely sensed information. This work is important to evaluate levels of As exposure, impacts to public health, and to shed light on the underlying hydrogeochemical processes that drive As mobilization into groundwater.
Keywords: arsenic, groundwater, geospatial analysis, environmental predictive modeling, flooding
INTRODUCTION
Groundwater contaminated with geogenic arsenic (As) is consumed by over 200 million people worldwide and more than 100 million people living on alluvial and deltaic floodplains in South and Southeast (S–SE) Asia.1,2 Chronic exposure to toxic levels of As is associated with a wide range of illnesses including skin disorders, heart disease, cancers, diabetes, and cognitive impairment in children.3–6 Furthermore, irrigation with As-contaminated groundwater diminishes crop yields (e.g., rice) and introduces As to the food supply,7,8 thereby threatening food security and exacerbating this global health crisis.
Key to addressing widespread As exposure is a clear delineation of its spatial distribution. Arsenic concentrations in aquifers of S–SE Asia often span several orders of magnitude (~1 to >1000 μg L−1) over very small spatial scales (10–100 m).9–14 Measurements of groundwater As are preferred to characterize exposure levels directly, but are exceedingly difficult to make for the millions of new household wells installed annually. Statistical models based on geospatial environmental parameters of Earth’s surface are thus frequently used to identify large-scale patterns of likely As contamination in groundwater; however, we presently lack the ability to accurately predict groundwater As concentrations for individual locations across S–SE Asia.
Prior models based on geospatial environmental variables have commonly predicted areal-averaged probabilities of As exceeding 10 or 50 μg L−1 at scales (≥1 km) that are much larger than those at which sharp contrasts in As concentration are observed.1,15–21 While these models provide useful insight into levels of As exposure and risk, they (a) often leverage numerous predictor variables from interpolated or discrete data sets (e.g., on climate, soil parameters, and geology) that are based on sparsely collected field measurements; (b) inadequately capture the underlying mechanisms responsible for the observed heterogeneity in groundwater As; and (c) typically do not predict groundwater As concentrations for individual wells or locations. This is particularly true in S–SE Asia, where As contamination is extensive but the geospatial data required for prior models is limited and/or is only available at a coarse spatial resolution that is inconsistent with fine-scale variability in groundwater As.1,15,22–25
Improving model predictions of As contamination requires the inclusion of relevant geospatial environmental parameters with a clear connection to the underlying mechanisms governing groundwater As heterogeneity. Arsenic mobilization in groundwater primarily occurs through the microbial reduction of sediments containing reactive organic matter (OM) and As-bearing iron (Fe) oxides in water-saturated and anaerobic environments.9,26–29 Variability in groundwater As is attributed to a wide range of hydrologic and geochemical conditions within and between aquifers9,30,31 that relate to their mineralogical, geologic, and geomorphic settings10,26,32–34 as well as hydrologic connectivity to surface water.35–37 Scientific understanding of the drivers of significant fine-scale heterogeneity in groundwater As remains incomplete and thus limits our ability to predict As concentrations for individual wells or households.2 For instance, linking individual wells to specific groundwater flow paths and surface recharge zones has proved challenging and limits the inclusion of these hydrologic details in large-scale models.9,38–40
Our goal is to develop a parsimonious model that can predict groundwater As concentrations at a fine-scale (e.g., 30 m) that permits effective assessment of household levels of risk. We aim to develop a model that includes solely continuous (i.e., nondiscrete) variables, which is key for making accurate predictions across a heterogeneous landscape. Furthermore, we aim to develop a model that is based on improved mechanistic understanding, that is accessible and can be widely applied, and that is reliable without the need to collect on-the-ground field data. We focus on including predictor variables that both capture the relevant processes governing As mobilization and heterogeneity, and which can be acquired solely from high-resolution, satellite-derived or remotely sensed, geospatial information.
A number of studies have suggested that surface flooding is mechanistically linked to the onset of water-saturated, chemically reducing, conditions and to the availability of reactive constituents in recharge zones.41–46 With both widespread flooding and ubiquitous surface water features found across many of the As-affected basins, we focus on elucidating how variability in flooding extent explains the observed heterogeneity in groundwater As concentrations in S–SE Asia. In low-lying regions of S–SE Asia, seasonal monsoon rains and river spillover cause extensive flooding from June through October.47 Overbank floodwaters deposit massive amounts of fluvial sediments across the floodplain, providing reactive OM and As-bearing Fe-oxides to these areas.10,26,29 Following the saturation of buried sediments, microbial decomposition of OM depletes dissolved oxygen, leading to the reductive dissolution of As-bearing Fe-oxides in recently recharged groundwater.26 With net groundwater flow moving from the seasonally inundated floodplain (recharge area) toward the major river networks (discharge area),48,49 mobilized As will travel along these flow paths.37,50 What follows is that the integrated duration and interannual frequency of flooding, along with relevant geomorphic conditions, are key controls on the degree of soil/sediment saturation and redox gradients in the near surface environment, which in turn regulates As mobilization and As levels in shallow groundwater aquifers over time. While it is well recognized that groundwater chemistry is governed by a range of surface and subsurface processes, including the interaction with surface water, sediment age, and sediment composition,9,26 we hypothesize that groundwater redox conditions that drive As contamination and heterogeneity within and between shallow aquifers of S–SE Asia are strongly dependent on the behavior of flooding (i.e., frequency and duration) where groundwater recharge occurs (Figure 1). 
Many of the hydrogeochemical processes that drive groundwater As levels covary with the duration and interannual frequency of flooding and thus can be captured by its delineation across the landscape.26
Figure 1.

Conceptual model of the hypothesized relationship between surface flooding and key biogeochemical conditions linked to As in shallow groundwater. (a) Meandering river and floodplain system showing several distinct flooding regimes and their connection to local geomorphology. (b) Illustration of flooding regimes classified as a function of the integrated duration and interannual frequency of flooding (expressed as a %) between 1984–2019. (c) Cross-sectional diagrams of the surface flooding regimes and their anticipated groundwater As levels, linking flooding behavior and geomorphic environment to the hydrogeochemical redox conditions that regulate As mobilization and its toxicity in shallow groundwater aquifers. The colors of the flooding regimes are based on inferred stable Fe mineralogy. To summarize, the duration and interannual frequency of flooding strongly influence redox conditions in groundwater recharge zones. Areas that flood infrequently or have frequent but very short duration flooding are likely to have more oxic conditions in the soil and shallow aquifer zones and thus are less susceptible to As mobilization via reductive dissolution of As-bearing Fe-oxides. Areas with more frequent and longer duration flooding favor reducing conditions due to the saturated surface environment, which limits oxygen supply and enhances the delivery of reactive, organic carbon (OC) rich, sediments with floodwaters. Arsenic concentrations in areas experiencing a very high degree of flooding vary widely, likely due to the balance between flushing, mobilization, and sequestration, all of which may be favored in these environments.
Here we compare measured groundwater compositions in shallow (<75 m) aquifers with high-resolution (30 m-scale), satellite-derived, surface water occurrence and recurrence data that were integrated over a 35-year time frame (1984–2019).51 Surface water occurrence is defined as the overall fractional duration of surface water in a location over time, capturing both its intra- and interannual variability.51 Surface water recurrence is a measure of the interannual variability of water occurrence and describes the frequency with which water returns from one year to the next (expressed as a %).51 We modified these two variables by removing the portion of map area covered by permanent surface water (e.g., rivers and lakes) and refer to their modified versions herein as flooding duration and frequency, respectively. These variables reflect flooded areas, which include where seasonal river overbank flow occurs, seasonally ponded landscapes, and seasonal wetlands. Locally averaged measures of flooding duration and frequency were calculated within recharge zones for >350 000 wells across Cambodia, Vietnam, and Bangladesh where groundwater composition was measured. Given that in recharge areas, shallow groundwater is sourced from local recharge as opposed to longer regional flow paths,52,53 we compute the flooding metrics within a 1.5 km distance of the well (see Supporting Information (SI) for further justification). This was done to define a map surface area that best contains the true recharge area of shallow groundwater, yet is small enough to allow for local-scale variation. Furthermore, this approach does not require detailed hydrologic information, such as groundwater flow direction, which is impractical to obtain for the many wells included in this study.
We evaluated the role that flooding along with a parsimonious set of geomorphic and geographic variables (well depth, distance to the nearest river, width of the nearest river, and fractional river or lake/pond coverage in recharge zones) play in controlling groundwater As concentrations and spatial variability. We then developed a high-resolution model that predicts continuous groundwater As concentrations and probabilities of As exceeding 10, 50, and 100 μg L−1 for individual wells using solely remotely sensed predictor variables. We first focused our model development on Cambodia, where aquifers are historically minimally perturbed by groundwater pumping and thus where remotely sensed data collected from Earth’s surface is likely reflective of the subsurface processes that influence shallow groundwater recharge.54 We then validated our approach in more heavily perturbed aquifers of Vietnam and Bangladesh.
MATERIALS AND METHODS
The materials and methods are described herein with sufficient detail to enable readers to follow the logic of the procedure and results of the study. Additional details can be found in the SI.
Arsenic and Other Geochemical Data.
Dissolved As concentrations in groundwater were gathered for Cambodia, Southern Vietnam, Northern Vietnam, and Bangladesh from the Global Database of Arsenic (As) Geochemistry (GIAs Database).55,56 The majority of groundwater As data were measured from field-test kits, which provide semiquantitative estimates of As concentration. Only groundwater well depths between 5 and 75 m were considered for analysis, resulting in 41 342 (Cambodia), 23 791 (Southern Vietnam), 172 390 (Northern Vietnam), and 48 988 (Bangladesh) sampling sites containing As data. This depth range was chosen because shallow wells are recharged from more local sources, and thus are amenable to study using surface water flooding located adjacent to the well.57,58 Furthermore, the vast majority of wells in these regions are at depths <75 m, with the exception of Southern Vietnam (SI Table S1). Measurements of dissolved organic carbon (DOC), iron (Fe), manganese (Mn), ammonium (NH4), redox potential (pE), pH, phosphate (PO4), and sulfate (SO4) in groundwater were also acquired, adding 193 to ~8600 geochemical data points per variable.
Geospatial Environmental Variables.
Values for all surface hydrologic and geomorphic/geographic variables were gathered for each well location from external data sources and using geospatial tools in R computing software.
Surface water occurrence and recurrence raster maps (30 m-pixel resolution) were obtained from the JRC Global Surface Water data set v1.2, which is based on bimonthly Landsat satellite imagery from 1984 to 2019.51 Surface water occurrence and recurrence raster maps were modified prior to analysis to differentiate surface water derived from flooding from permanent surface water. Their modified versions are referred to herein as flooding duration and frequency, respectively. This modification involved masking the occurrence and recurrence raster maps by the pixel area of permanent water, which was acquired from the JRC water transitions map product. Although integrating specific hydrologic measurements of the exact location, flow paths, and length scale of recharge from the surface would be desirable, this information is not available for most wells, including all of those in this study. Thus, we chose to define a circular map surface area that contains the likely recharge zone of shallow groundwater and is reasonably representative of local flooding, yet that is not dependent on detailed measurements of the subsurface hydrology. A 1.5 km-radius circular polygon buffer was created around each sampling site to calculate the average of all flooding occurrence (duration) and recurrence (frequency) values within a representative local recharge zone.
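The buffer-averaging step described above can be illustrated with a minimal Python sketch (the study itself used geospatial tools in R). The function name buffer_mean, the nested-list raster layout, and the choice to exclude masked permanent-water pixels from the average (rather than setting them to zero) are assumptions made for illustration and are not taken from the study's code.

```python
import math

def buffer_mean(raster, permanent, row, col, radius_m=1500.0, pixel_m=30.0):
    """Mean raster value within radius_m of pixel (row, col).

    raster:    2D list of flooding values in percent (hypothetical layout).
    permanent: 2D list of booleans, True where a pixel is permanent water.
    Permanent-water pixels are skipped (assumed masking behavior).
    Returns None when no valid pixel falls inside the buffer.
    """
    reach = int(radius_m // pixel_m)  # buffer radius in whole pixels
    total, count = 0.0, 0
    for r in range(max(0, row - reach), min(len(raster), row + reach + 1)):
        for c in range(max(0, col - reach), min(len(raster[0]), col + reach + 1)):
            # Distance between pixel centers, in meters
            dist = math.hypot(r - row, c - col) * pixel_m
            if dist <= radius_m and not permanent[r][c]:
                total += raster[r][c]
                count += 1
    return total / count if count else None
```

Applying the same function to the masked occurrence and recurrence rasters would give one flooding duration value and one flooding frequency value per well.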
Distance to the nearest river was estimated using the global river width from Landsat (GRWL) river mask (version 1.01).59 Raster pixels were first transformed into spatial coordinates (points). The nearest point on the river edge was then identified for each groundwater well and the distance was subsequently estimated. The width of the nearest river was estimated using the Landsat GRWL centerline vector product (version 1.01).59 The point-nodes of every river segment were transformed into spatial points. The nearest river centerline node was identified for each well, and the river width value was gathered from the GRWL data set. Given the permanent water layer in the JRC Global Surface Water data set does not distinguish types of water bodies, we chose to use the Terra ASTER Global Water Bodies Database (ASTWBD) v1 data product60 to estimate the percentage of permanent river or lake/pond coverage within the circular mapped recharge zone around each well.
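A nearest-point search of this kind can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used in the study: coordinates are taken to be (latitude, longitude) pairs in decimal degrees, distance is approximated with the haversine formula on a spherical Earth, and the names haversine_m and nearest_river_point are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_river_point(well, river_points):
    """Return (distance_m, index) of the river point closest to the well.

    well:         (lat, lon) of the groundwater well.
    river_points: list of (lat, lon) points from a river mask or centerline.
    """
    return min(
        (haversine_m(well[0], well[1], p[0], p[1]), i)
        for i, p in enumerate(river_points)
    )
```

When the points are centerline nodes, the returned index can be used to look up the width recorded for that node, which mirrors the river-width step described above. A brute-force search is shown for clarity; a spatial index would be preferable for hundreds of thousands of wells.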
Linking Flooding Regimes to Aquifer Geochemistry.
Groundwater geochemistry was compared with surface water flooding regimes defined by ranges of flooding duration and frequency values (conceptualized in Figure 1). A frequent but short-duration flooding event category was first defined (Regime “B” in Figure 1), where all data with flooding frequency values above a boundary line (frequency = 3.5 × duration + 20%) were included. For flooding frequency values below this boundary, geochemical data were then organized into the following regimes: dry or seldom flooding (<5% duration), moderate frequency and duration flooding (≥5% and <15% duration), high frequency and duration flooding (≥15% and <30% duration), and very high frequency and duration flooding (≥30% duration). The distributions of the geochemical measurements grouped by each flooding regime are shown in Figure 2c and SI Figure S2.
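The regime definitions above translate directly into a small decision rule. The Python sketch below is illustrative only: the function name and the text labels are hypothetical, and a frequency exactly on the boundary line is assumed to fall below it.

```python
def flooding_regime(duration_pct, frequency_pct):
    """Assign a flooding regime from duration and frequency (both in percent)."""
    # Frequent but short-duration flooding: above the boundary line
    # frequency = 3.5 * duration + 20%
    if frequency_pct > 3.5 * duration_pct + 20.0:
        return "frequent, short duration"
    # Below the boundary, regimes are set by duration alone
    if duration_pct < 5.0:
        return "dry or seldom"
    if duration_pct < 15.0:
        return "moderate"
    if duration_pct < 30.0:
        return "high"
    return "very high"
```

For example, a recharge zone with 2% duration and 40% frequency lies above the boundary (3.5 × 2 + 20 = 27) and is classed as frequent but short-duration flooding, whereas 10% duration with the same frequency lies below it (3.5 × 10 + 20 = 55) and is classed as moderate.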
Figure 2.

Relationship between flooding behavior, groundwater As levels, and key geochemical variables linked to As mobility. (a) Map of the Upper Mekong River delta of Cambodia, showing flooding duration (%) overlaid by locations of groundwater sampling and their observed As concentration (μg L−1). (b) Heat-map showing the median of As concentrations calculated across 1% flooding frequency (%) and duration intervals. (c) Box-plots showing the distribution of As (μg L−1), dissolved organic carbon (DOC; mM), and ammonium (NH4; mM) concentrations in groundwater grouped by their corresponding flooding regime (illustrated in Figure 1). The lower and upper hinges correspond to the first and third quartiles. The center line corresponds to the median. Whiskers extending from the hinge to the highest or lowest value are within 1.5× the interquartile range of the hinge. Numbers above box-plots denote sample size. Additional geochemical variables are shown in SI Figure S2.
Implementing Random Forest to Predict Groundwater Arsenic.
Random Forest (RF) modeling was implemented in R61 to predict continuous concentrations of As in groundwater from the following predictor variables: flooding duration and frequency, distance to the nearest river, width of the nearest river, well depth, and fractional river or lake/pond coverage in the recharge zone. SI Table S2 provides the summary statistics for the RF model, the error metrics of the predictions, and the impurity variable importance rankings for Cambodia, Southern Vietnam, Northern Vietnam, and Bangladesh.
Predictions of As concentrations from the “test” data set were validated against their respective observed As data. Given most of the observed As data were measured from field-test kits (i.e., are discrete semiquantitative data), modeled As results were binned according to appropriate ranges of observed As. Arsenic predictions from the “test” data set were evaluated with respect to an anticipated range of observed As measurement error (25% of observed ±5 μg L−1) (Figure 3a). Histograms of modeled As data grouped by ranges of their observed data were created (SI Figures S4a and S7). Model performance was also evaluated by determining the percentage of modeled As data that were correctly predicted above or below/equal to the following As thresholds: 10, 50, and 100 μg L−1 (SI Table S4). We generated a prediction success ranking by comparing ranges of modeled and observed As concentration (Figure 4). Model predictions were assigned a rank of “very good”, “good”, “fair”, “poor”, or “very poor” depending on where the predicted As value ±5 μg L−1 fell with respect to several ranges of observed As concentration (Figure 4b). In the case that a predicted value overlapped with two success ranking categories, the higher-ranking category was chosen (e.g., “very good” was chosen over “good”). These comparisons were made to focus on the basic health question of how well the RF model predicts safe or unsafe groundwater levels of As for drinking and irrigation. Model areal performance was also evaluated by taking the geometric mean of the observed and predicted As data across 1 sq-km grids (Figure 3b), which is a common spatial resolution of other prior models. This evaluation focuses on how well the RF model predicts continuous groundwater As concentration at the neighborhood-level scale.
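The areal evaluation described above can be sketched as a grid aggregation in Python. The sketch assumes well coordinates are already projected to meters and that all concentrations are strictly positive (a geometric mean is undefined otherwise); the function name and input layout are hypothetical.

```python
import math
from collections import defaultdict

def grid_geometric_means(points, cell_m=1000.0):
    """Geometric mean of values within each square grid cell.

    points: iterable of (x_m, y_m, value), projected coordinates in meters,
            with value > 0 (assumed).
    Returns {(col, row): geometric mean} keyed by integer cell index.
    """
    logs = defaultdict(list)
    for x, y, value in points:
        cell = (math.floor(x / cell_m), math.floor(y / cell_m))
        logs[cell].append(math.log(value))
    # The geometric mean is the exponential of the mean log value
    return {cell: math.exp(sum(v) / len(v)) for cell, v in logs.items()}
```

Running the function once on observed and once on modeled concentrations gives paired cell values that can be compared at the 1 sq-km scale.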
Figure 3.

Model performance for predicting groundwater As concentrations and heterogeneity. (a) Box-plots of modeled groundwater As concentrations (μg L−1) for several ranges of observed As in Cambodia using the “test” data set. The lower and upper hinges correspond to the first and third quartiles. The center line corresponds to the median. Whiskers extend from the 10th to the 90th percentile of modeled As data. The red shading illustrates an anticipated error bound. (b) Correlation between modeled versus observed groundwater As concentrations averaged over 1 sq-km areas. (c) Semivariance diagrams for observed and modeled groundwater As data. Lines are model-fitted to the experimental data. Both “test” and “training” data sets from Cambodia were included in (b) and (c). Sill values (×10⁴) are 2.07 and 1.31 for the observed and modeled As lines, respectively.
Figure 4.

Success rankings of predicted groundwater As concentrations (μg L−1). (a) Bar-graphs showing the percentage of modeled groundwater As values that fall within each respective success ranking, organized by several ranges of their observed As concentration. Values above the bars are the percentage of predictions within each success ranking. (b) Legend for the bar-graphs illustrating the prediction success ranking criteria. Pairings of modeled and observed As ranges that represent a missed opportunity (MO) or a public health threat (PHT) are indicated in bold. Pairings left blank represent neither a missed opportunity nor a public health threat.
An RF model was also used to make continuous predictions of the probability that groundwater As exceeds levels relevant to public health standards: 10, 50, and 100 μg L−1. Arsenic data were classified into a binary scale where, for each threshold, As values ≤10, 50, or 100 μg L−1 were assigned a zero and As values >10, 50, or 100 μg L−1 were assigned a one. We used the same predictor variables to build three individual RF models to predict probabilities of As exceedance with respect to the thresholds of 10, 50, and 100 μg L−1 for Cambodia, Southern Vietnam, Northern Vietnam, and Bangladesh. SI Table S4 provides the summary metrics of each RF model by country and their respective performance. RF model areal performance was also evaluated by taking the geometric mean of the modeled and observed As probabilities >10, 50, and 100 μg L−1 over 1 sq-km areas and then comparing these to the percentage of modeled As values that are >10, 50, and 100 μg L−1 in each respective 1 sq-km area (SI Figure S5). This evaluation was made to focus on how well the RF model predicts probabilities that As exceeds 10, 50, and 100 μg L−1 at the scale of individual neighborhoods.
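The binary encoding used as the response variable for the three exceedance models can be written compactly. The Python sketch below is illustrative; the names THRESHOLDS_UG_L and exceedance_labels are hypothetical.

```python
THRESHOLDS_UG_L = (10.0, 50.0, 100.0)  # thresholds named in the text

def exceedance_labels(as_ug_l):
    """Return {threshold: label} for one As measurement in ug/L.

    A label of 1 means the concentration is strictly greater than the
    threshold; values less than or equal to the threshold get 0.
    """
    return {t: int(as_ug_l > t) for t in THRESHOLDS_UG_L}
```

A well measured at exactly 10 μg L−1 is therefore labeled zero for all three thresholds, while a well at 75 μg L−1 is labeled one for the 10 and 50 μg L−1 thresholds and zero for 100 μg L−1.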
Evaluating the Geochemistry of Random Forest Model Misclassifications.
Modeled As data were organized by their performance evaluation (correctly predicted, under predicted, or over predicted) with respect to an anticipated range of observed As measurement error (25% of observed ±5 μg L−1) (red error bound in Figure 3a). The geochemistry of each category was then examined to assess additional factors that help explain why in some cases observed As concentrations were improperly modeled (SI Figure S6). The following geochemical variables were chosen because of their relevance in regulating aquifer redox and As mobilization from sediments: arsenic (As), dissolved organic carbon (DOC), ammonium (NH4), total iron (Fe), manganese (Mn), sulfate (SO4), phosphate (PO4), pH, and redox potential (pE).
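The three performance categories can be expressed as a simple comparison against the error bound. In the Python sketch below, the bound is assumed to be the observed value plus or minus (25% of the observed value + 5 μg L−1); this reading of "25% of observed ±5 μg L−1" is an interpretation, and the function name is hypothetical.

```python
def evaluate_prediction(observed, modeled, rel=0.25, abs_ug_l=5.0):
    """Classify a modeled As value against the observed value (both ug/L).

    Tolerance is rel * observed + abs_ug_l (assumed form of the error bound).
    Values on the edge of the bound count as correctly predicted.
    """
    tol = rel * observed + abs_ug_l
    if modeled < observed - tol:
        return "under predicted"
    if modeled > observed + tol:
        return "over predicted"
    return "correctly predicted"
```

Under this assumption, an observed value of 100 μg L−1 has a tolerance of 30 μg L−1, so modeled values from 70 to 130 μg L−1 count as correctly predicted.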
Scaling Predictions of Continuous Arsenic Concentration.
To demonstrate the application of our combined remote-sensing and RF modeling approach, an extensive data set of predictor variables was acquired at the geometric center of every 30 m × 30 m pixel for a region of the Upper Bassac River, located in the Mekong River delta of Cambodia. Groundwater As was modeled at a well depth of 30 m, although any depth between 5 and 75 m can be evaluated. This resulted in ~435 000 site-specific, continuous, modeled concentrations of groundwater As. A groundwater As prediction map was then created from the modeled As concentration data at a 30 m-pixel resolution, overlaid with observations of groundwater As concentrations from aquifer depths between 20 and 40 m for comparison (SI Figure S4b).
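Generating the prediction sites amounts to enumerating pixel centers over the mapped region. The Python sketch below assumes a north-up grid in projected meters with its origin at the lower-left corner; the function name and arguments are hypothetical.

```python
def pixel_centers(x_min, y_min, n_cols, n_rows, pixel_m=30.0):
    """Yield the (x, y) center of every pixel, row by row.

    x_min, y_min: lower-left corner of the grid in projected meters (assumed).
    """
    for row in range(n_rows):
        for col in range(n_cols):
            yield (x_min + (col + 0.5) * pixel_m, y_min + (row + 0.5) * pixel_m)
```

Each center would then be assigned the predictor variables for its location, together with a fixed well depth (30 m in the example above), before being passed to the model.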
RESULTS AND DISCUSSION
Surface Flooding as an Integrative Parameter of Groundwater Composition.
The duration and interannual frequency of surface flooding in groundwater recharge environments are directly linked to As concentrations in aquifers across Cambodia (Figure 2a,b; SI Figure S1). Low groundwater As (<10 μg L−1) with relatively little spatial variability is found consistently in drier areas that fall outside the active floodplain or those that flood frequently from one year to the next, but do so only for a short period of time. Conversely, groundwater As is highly variable within regions that experience longer-lasting flooding (i.e., areas adjacent to large rivers or lakes), where the frequency and duration of flooding, and the interaction of relevant geomorphic features (discussed below), become central to explaining As heterogeneity. Flood-prone recharge environments are associated with thresholds of ~20% and ~10% flooding frequency and duration, respectively, above which nearly all elevated levels of groundwater As (i.e., >10 μg L−1) occur (Figure 2). Interestingly, recharge environments that experience high interannual flooding frequencies (~20–70%) but for shorter durations each year (<10–15%) contain low groundwater As (<10 μg L−1), indicating that sustained surface flooding is required to produce higher As in associated shallow aquifers (Figure 2). Contrary to what might be expected, both low and high levels of groundwater As (~1 to >50 μg L−1) coincide with recharge environments that experience notably high values of flooding frequency (>70%) and duration (>30%). These very high flooding frequency and duration values are uncommon, and point to a scenario where overbank floodwaters combine with significantly ponded landscapes that may contain different underlying hydrogeochemical conditions depending on their proximity to rivers.
Overall these results provide evidence that much of the observed small-scale variability in shallow groundwater As concentrations can be explained by the integrated duration and interannual frequency of seasonal flooding within their contributing recharge zones.
Geochemical compositions associated with contrasting flooding regimes indicate that flooding behavior regulates and is reflective of the key hydrogeochemical processes that control groundwater redox status and As concentrations. More extensive flooding is associated with higher levels of dissolved As, Fe, organic carbon (DOC), ammonium (NH4), and lower sulfate (SO4) compared to dry/seldom flooded areas or those that flood frequently but for only a short period of time each year (Figure 2c; SI Figure S2). These results suggest that the duration and frequency of flooding modulate the water-saturation state of a given recharge area, which influences As concentrations in the underlying aquifer through the decomposition of OM and the level of anaerobic-reducing conditions that control As mobility from Fe-oxides. Low groundwater As, DOC, NH4, manganese (Mn), and high reduction potential (pE) were observed in recharge zones that flood frequently but for a shorter duration each year (Regime B) (Figure 2c; SI Figure S2). This is expected if these areas drain extensively following flooding and thus do not develop and/or maintain the strongly reducing conditions favorable for As mobilization and transport into the aquifer.28,62 Despite containing both high and low As levels, groundwater associated with very high flooding duration and frequency values had low DOC, NH4, SO4, Mn, and PO4. These instances may represent environments where flooding has maintained reducing conditions for much longer periods of time that lower groundwater As by depleting DOC, flushing sediment-bound As, and/or trapping As in insoluble sulfide phases.
The interaction between flooding and relevant geomorphic features is an important consideration of the conditions that control groundwater As contamination and heterogeneity (SI Figure S3). Sharp contrasts in groundwater As were observed between areas adjacent to a river that flood extensively and contain high As (<3 km distance; typically 25 to >750 μg L−1) compared to those further from the influence of a river that seldom flood and preclude high As (>5 km distance, <10 μg L−1) (SI Figure S3a,f). Groundwater As tends to be higher nearby larger rivers (3–5 km distance; >400 m wide; ≥25 μg L−1) than smaller rivers (3–5 km distance; <400 m wide; <10 μg L−1) (SI Figure S3d,e). However, groundwater nearby larger rivers typically contains low As when the corresponding recharge environment lacks sufficient surface flooding (Group A in SI Figure S3e). Given that groundwater As concentrations are the product of the rate of mobilization and the residence time, groundwater As is typically low (<10 μg L−1) at very shallow aquifer depths (5–15 m) (SI Figure S3c). This is likely due to the generally short residence times of groundwater in this zone irrespective of flooding conditions. Flooded areas are not always river influenced, and in some cases, lower than expected As (<25 μg L−1) prevails when recharge areas overlap considerably with lakes or ponds (>30% coverage) (SI Figure S3b).
Arsenic Prediction Model.
A Random Forest (RF) model informed by our mechanistic framework effectively predicts groundwater As concentrations as well as probabilities of As exceeding 10, 50, and 100 μg L−1 for individual wells in aquifers across Cambodia (Figure 3a; SI Figure S4; Tables S2 and S3). The RF model predicts groundwater As concentrations relevant to human health particularly well (Figure 4; SI Table S4). Specifically, 95% of wells containing As >10 μg L−1 are correctly predicted, and 85–90% of wells containing As > or ≤50 and 100 μg L−1 are correctly classified (SI Table S4). Thus, our RF model largely avoids predicting that wells with elevated As fall below standards for safe drinking water and irrigation, an error that would otherwise constitute a public health threat. The model is also effective at predicting low-As wells, although the percentage of As concentrations correctly predicted ≤10 μg L−1 is lower (53%) (SI Table S4). It is important to note here that a significant portion of these low-As wells are only slightly overestimated (typically predicted at 10–20 μg L−1) (SI Table S4 and Figure 4). Occasionally for wells that have observed (i.e., measured) As below the standards for safe drinking water, our RF model predicts an As concentration above the standards for safe drinking water. In these cases, the RF model results would suggest that the well is unsafe when it is in actuality safe with respect to its level of As. This misclassification merely represents a missed opportunity and it does not pose any public health threat. Similarly, the RF model predicted probabilities of As exceedance very well: the positive predictive values (PPV; defined by a 0.5 threshold) for As >10, 50, and 100 μg L−1 were 83%, 80%, and 77%, and the negative predictive values (NPV) were 86%, 93%, and 94%, respectively (SI Table S3). Prediction performance increased markedly when averaging measured and modeled As data across 1 sq-km grids (Figure 3b; SI Figure S5a–c), which is the typical spatial resolution of other prior models.
Thus, if preferred, upscaling from predictions of individual wells enables a significant improvement in our ability to evaluate groundwater As levels and risks of As exposure at the neighborhood scale. Our RF model also faithfully captures the spatial variability found in the observed data at scales from 400 m to >10 km (Figure 3c; SI Figure S5d), highlighting that our predictor variables incorporate the relevant environmental controls that explain As heterogeneity at scales relevant to public health and to developing a fundamental understanding of the mechanisms governing As mobilization.
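The two evaluation steps described above, classifying exceedance with a 0.5 probability threshold and averaging values across 1 km2 grid cells, can be illustrated with a minimal sketch. This is not the code used in this study: the function names, the treatment of a probability exactly equal to the threshold as "unsafe", the square-cell binning on projected coordinates, and all well values below are assumptions made only for illustration.

```python
from collections import defaultdict
from math import floor


def ppv_npv(observed, prob_exceed, limit=10.0, threshold=0.5):
    """Return (PPV, NPV) for wells classified as exceeding `limit` (ug/L).

    Assumption: a well is predicted unsafe when its modeled exceedance
    probability is at or above `threshold`.
    """
    tp = fp = tn = fn = 0
    for obs, p in zip(observed, prob_exceed):
        predicted_unsafe = p >= threshold
        truly_unsafe = obs > limit
        if predicted_unsafe and truly_unsafe:
            tp += 1
        elif predicted_unsafe:
            fp += 1  # safe well flagged unsafe: a missed opportunity
        elif truly_unsafe:
            fn += 1  # unsafe well flagged safe: the health-relevant error
        else:
            tn += 1
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv


def grid_average(x_m, y_m, values, cell_m=1000.0):
    """Average `values` within square cells of side `cell_m` (meters)."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y, v in zip(x_m, y_m, values):
        key = (floor(x / cell_m), floor(y / cell_m))
        sums[key][0] += v
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# Hypothetical wells: projected coordinates (m), measured As (ug/L),
# and modeled probability of exceeding 10 ug/L.
x = [120.0, 850.0, 1400.0, 1900.0]
y = [300.0, 640.0, 200.0, 950.0]
observed_as = [4.0, 60.0, 150.0, 8.0]
p_exceed_10 = [0.2, 0.9, 0.8, 0.7]

ppv, npv = ppv_npv(observed_as, p_exceed_10, limit=10.0)
cell_means = grid_average(x, y, observed_as)
```

In this toy example the fourth well is a false positive (measured at 8 μg L−1 but assigned a 0.7 exceedance probability), which lowers the PPV without affecting the NPV. Applying `grid_average` to both measured and modeled values gives the cell-level pairs that a grid-scale comparison would use.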
It is important to recognize that our RF model could improve with the inclusion of predictor variables that capture a more complete understanding of the surface/subsurface hydrology and other important geochemical considerations. Hydrologically, the recharge zone and flow path for specific wells are difficult to constrain in areas of low topographic relief (i.e., floodplains) and are likely to vary over time.9,38–40 Geochemically, the composition and reactivity of OM and Fe mineralogy are important factors in regulating aquifer redox processes.9,27,63 The inclusion of relevant geomorphic variables partially overcomes this limitation by providing information on unique hydrologic and geomorphic environments that correlate differently with groundwater As. Hydrologic and biogeochemical gradients are clearly difficult to describe across scales,9 but are invariably treated as similar for adjacent wells at a given aquifer depth under our approach, which generalizes the recharge environment around the well. Including detailed geochemical or flow path data could help resolve differences between adjacent wells, but would undesirably make our model more dependent on data that can be exceedingly difficult to obtain in some parts of the world and over large areas. We demonstrate that accurate predictions are achievable from a parsimonious model based on a small group of parameters that integrate the key hydrogeochemical processes controlling fine-scale variability in groundwater As.
Notably, the geochemical composition of groundwater from wells with inaccurate As predictions may provide evidence of aquifer transience or atypical characteristics that reflect shifting groundwater As levels (SI Figure S6). When our RF model overpredicts As concentrations, groundwater often shows clear evidence of early-stage sediment reduction (i.e., elevated Mn, NH4, DOC, PO4, and SO4) (SI Figure S6), where insufficient time has passed for conditions to become adequately reducing for significant As mobilization, a scenario that may arise in active deposition zones with recently deposited sediments. Thus, these overpredictions may indicate areas where As concentrations are poised to increase in the near future. Anthropogenic perturbations (e.g., groundwater pumping) that cause rapid shifts in hydrology and redox status64 may also explain incongruences between predicted and observed groundwater As values.
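As a rough illustration of how such misfits could be screened, the sketch below flags wells that are predicted above a standard but measured at or below it, then compares a redox-sensitive solute (here Mn) between the flagged wells and the rest. It is a hypothetical example, not the analysis behind SI Figure S6: the function names, the 10 μg L−1 limit, the choice of a median ratio as the comparison, and all values are assumptions.

```python
from statistics import median


def flag_overpredictions(observed, predicted, limit=10.0):
    """Indices of wells predicted above `limit` but measured at or below it."""
    return [
        i
        for i, (obs, pred) in enumerate(zip(observed, predicted))
        if pred > limit and obs <= limit
    ]


def median_ratio(values, flagged_idx):
    """Median of `values` in flagged wells divided by the median in the rest."""
    flagged = set(flagged_idx)
    inside = [v for i, v in enumerate(values) if i in flagged]
    outside = [v for i, v in enumerate(values) if i not in flagged]
    return median(inside) / median(outside)


# Hypothetical wells: measured As, modeled As (ug/L), and dissolved Mn (mg/L).
observed_as = [5.0, 8.0, 120.0, 3.0, 60.0]
predicted_as = [25.0, 9.0, 110.0, 40.0, 70.0]
mn = [1.2, 0.3, 0.5, 1.5, 0.4]

overpredicted = flag_overpredictions(observed_as, predicted_as)
mn_ratio = median_ratio(mn, overpredicted)
```

A ratio well above 1 for Mn, NH4, DOC, PO4, or SO4 in the flagged wells would be consistent with the early-stage reduction signature described above, although a real analysis would need many more wells and a formal statistical test.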
Validating Arsenic Predictions in S–SE Asia.
To validate our approach in other aquifers of S–SE Asia, we first applied our RF model trained on data from Cambodia to the Mekong River delta of southern Vietnam, the Red River delta of northern Vietnam, and the Ganges-Brahmaputra delta in Bangladesh. Our model trained on Cambodia predicts groundwater As concentrations reasonably well in similar S–SE Asia aquifers. That said, predictions improve considerably when the RF model is refined with region-specific training data (SI Tables S2–S4; Figure S7). The improved performance suggests that tuning our model with region-specific data incorporates subtle differences in surface flooding behavior and geomorphic conditions between aquifer systems in S–SE Asia that are important for more accurately predicting groundwater As levels for individual wells. These results highlight that the duration and frequency of flooding are key variables, both in terms of their model importance and their linkage to other highly relevant geomorphic variables, which govern the fundamental processes that control groundwater As levels across basins in S–SE Asia.
Implications.
The integrated duration and interannual frequency of surface flooding are master variables that regulate the water-saturated, anaerobic, and reducing conditions responsible for As mobilization, and explain much of the observed spatial variability in As concentrations in S–SE Asia aquifers. These and other relevant remotely sensed geomorphic variables can be paired with machine learning models to accurately predict As concentrations and heterogeneity at sub-km scales. The resulting RF model predictions accurately estimate As concentrations and probabilities of As exceedance at small spatial scales, are mechanistically verifiable, and perform at a level comparable to measurements made with qualitative field kits for individual households. Model misfits likely reflect a combination of imprecise recharge location and transience, and they are useful for identifying areas susceptible to future As contamination. Generalizing the areas of aquifer recharge proves effective in this study; however, model performance could be improved with additional remotely acquired information on the sources of recharge and the nature of groundwater flow paths.
Increasing freshwater scarcity under climate change will raise the demand for groundwater used for drinking and irrigation, which could increase levels of As exposure in many regions of the world. It has thus far proved challenging to use models based on geospatial environmental parameters to accurately predict As concentrations for individual wells or households over large geographic areas, which is essential to address this staggering global public health crisis. The application of our model is robust and provides an unparalleled opportunity to predict As concentrations, identify safe groundwater sources, and assess effective remediation or mitigation objectives in alluvial and deltaic aquifers of S–SE Asia. This approach uses solely remotely sensed hydrologic variables that are directly linked to the presence of surface water and other key geochemical/geomorphic features responsible for As contamination in groundwater. Thus, our model parameters can be obtained easily for any new location without the need to collect field data. We anticipate that our mechanistic and modeling approach can be reasonably applied to similar aquifer systems worldwide to evaluate both the public health risks of excessive As exposure and the fundamental environmental processes regulating As mobilization and toxicity in groundwater. This information is urgently needed to identify and advise appropriate management decisions that reduce As exposure in severely impacted regions that already have few available options for clean freshwater.
ACKNOWLEDGMENTS
We are grateful to the late Dr. Micky Sampson and Andrew Shantz, without whom the extensive dataset in Cambodia would not have been created. The database and part of this work were completed as part of the Characterizing Global Variability in Groundwater Arsenic Working Group supported by the John Wesley Powell Center for Analysis and Synthesis, funded by the U.S. Geological Survey. Groundwater data from Drs. Alexander van Geen and Laura Erban were particularly useful in this work. This work was supported by National Science Foundation (NSF) grants EAR-1521356 and ICER-1414131, National Institute of Environmental Health Sciences grants P42 ES010349 and 2T32 ES007322, and the John Wesley Powell Center for Analysis and Synthesis of USGS.
Footnotes
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.est.1c05955.
Additional text, figures, and tables related to this paper (PDF). The data used and generated in this paper are openly available at FigShare (10.6084/m9.figshare.17167316).
Complete contact information is available at: https://pubs.acs.org/10.1021/acs.est.1c05955
The authors declare no competing financial interest.
Contributor Information
Craig T. Connolly, Lamont-Doherty Earth Observatory, Columbia University, Palisades, New York 10964, United States; Department of Environmental Health Sciences, Columbia University, New York, New York 10032, United States; Data Science Institute, Columbia University, New York, New York 10027, United States
Mason O. Stahl, Department of Geology, Union College, Schenectady, New York 12308, United States
Beck A. DeYoung, Department of Geology, Union College, Schenectady, New York 12308, United States
Benjamin C. Bostick, Lamont-Doherty Earth Observatory, Columbia University, Palisades, New York 10964, United States
REFERENCES
- (1) Podgorski J; Berg M Global threat of arsenic in groundwater. Science 2020, 368, 845–850.
- (2) Ravenscroft P; Brammer H; Richards K Arsenic Pollution: A Global Synthesis, RGS-IBG Book Series; Wiley-Blackwell: Chichester, UK, 2009.
- (3) Argos M; et al. Arsenic exposure from drinking water, and all-cause and chronic-disease mortalities in Bangladesh (HEALS): a prospective cohort study. Lancet 2010, 376, 252–258.
- (4) Smith AH; et al. Increased mortality from lung cancer and bronchiectasis in young adults after exposure to arsenic in utero and in early childhood. Environ. Health Perspect 2006, 114, 1293–1296.
- (5) Thayer KA; Heindel JJ; Bucher JR; Gallo MA Role of Environmental Chemicals in Diabetes and Obesity: A National Toxicology Program Workshop Review. Environ. Health Perspect 2012, 120, 779–789.
- (6) Wasserman GA; et al. Water arsenic exposure and intellectual function in 6-year-old children in Araihazar, Bangladesh. Environ. Health Perspect 2007, 112, 285–289.
- (7) Huhmann BL; et al. Field Study of Rice Yield Diminished by Soil Arsenic in Bangladesh. Environ. Sci. Technol 2017, 51, 11553–11560.
- (8) Azizur Rahman M; Hasegawa H; Mahfuzur Rahman M; Mazid Miah MA; Tasmin A Arsenic accumulation in rice (Oryza sativa L.): Human exposure through food chain. Ecotoxicol. Environ. Saf 2008, 69, 317–324.
- (9) Fendorf S; Michael H; van Geen A Spatial and temporal variations of groundwater arsenic in South and Southeast Asia. Science 2010, 328, 1123–1127.
- (10) Stahl M; et al. River bank geomorphology controls groundwater arsenic concentrations in aquifers adjacent to the Red River, Hanoi Vietnam. Water Resour. Res 2016, 52, 1–20.
- (11) Eiche E; et al. Geochemical processes underlying a sharp contrast in groundwater arsenic concentrations in a village on the Red River delta, Vietnam. Appl. Geochem 2008, 23, 3143–3154.
- (12) van Geen A Spatial variability of arsenic in 6000 tube wells in a 25 km2 area of Bangladesh. Water Resour. Res 2003, 39, 1140.
- (13) Mukherjee A; Bhattacharya P; Savage K; Foster A; Bundschuh J Distribution of geogenic arsenic in hydrologic systems: Controls and challenges. J. Contam. Hydrol 2008, 99, 1–7.
- (14) Pi K; et al. Vertical variability of arsenic concentrations under the control of iron-sulfur-arsenic interactions in reducing aquifer systems. J. Hydrol 2018, 561, 200–210.
- (15) Ayotte JD; Medalie L; Qi SL; Backer LC; Nolan BT Estimating the High-Arsenic Domestic-Well Population in the Conterminous United States. Environ. Sci. Technol 2017, 51, 12443.
- (16) Winkel L; Berg M; Amini M; Hug SJ; Annette Johnson C Predicting groundwater arsenic contamination in Southeast Asia from surface parameters. Nat. Geosci 2008, 1, 536–542.
- (17) Mukherjee A; et al. Occurrence, predictors and hazards of elevated groundwater arsenic across India through field observations and regional-scale AI-based modeling. Sci. Total Environ 2021, 759, 143511.
- (18) Rodriguez-Lado L; et al. Groundwater Arsenic Contamination Throughout China. Science 2013, 341, 866–868.
- (19) Podgorski JE; et al. Extensive arsenic contamination in high-pH unconfined aquifers in the Indus Valley. Sci. Adv 2017, 3, e1700935.
- (20) Podgorski JE; Wu R; Chakravorty B; Polya DA Groundwater Arsenic Distribution in India by Machine Learning Geospatial Modeling. Int. J. Environ. Res. Public Health 2020, 17, 7119.
- (21) Tan Z; Yang Q; Zheng Y Machine Learning Models of Groundwater Arsenic Spatial Distribution in Bangladesh: Influence of Holocene Sediment Depositional History. Environ. Sci. Technol 2020, 54, 9454–9463.
- (22) Ayotte JD; et al. Modeling the probability of arsenic in groundwater in New England as a tool for exposure assessment. Environ. Sci. Technol 2006, 40, 3578–3585.
- (23) Amini M; et al. Statistical Modeling of Global Geogenic Arsenic Contamination in Groundwater. Environ. Sci. Technol 2008, 42, 3669–3675.
- (24) Lado LR; et al. Modelling arsenic hazard in Cambodia: A geospatial approach using ancillary data. Appl. Geochem 2008, 23, 3010–3018.
- (25) Hossain MM; Piantanakulchai M Groundwater arsenic contamination risk prediction using GIS and classification tree method. Eng. Geol 2013, 156, 37–45.
- (26) Smedley PL; Kinniburgh DG A review of the source, behavior and distribution of arsenic in natural waters. Appl. Geochem 2002, 17, 517–568.
- (27) Harvey CF; et al. Arsenic mobility and groundwater extraction in Bangladesh. Science 2002, 298, 1602–1606.
- (28) Stuckey JW; et al. Arsenic release metabolically limited to permanently water-saturated soils in Mekong Delta. Nat. Geosci 2016, 9, 70.
- (29) Wallis I; et al. The river-groundwater interface as a hotspot for arsenic release. Nat. Geosci 2020, 13, 288–295.
- (30) Postma D; et al. Arsenic in groundwater of the Red River floodplain, Vietnam: controlling geochemical processes and reactive transport modeling. Geochim. Cosmochim. Acta 2007, 71, 5054–5071.
- (31) Rowland HAL; Polya DA; Lloyd JR; Pancost RD Characterization of organic matter in a shallow, reducing, arsenic-rich aquifer, West Bengal. Org. Geochem 2006, 37, 1101–1114.
- (32) Postma D; et al. Groundwater arsenic concentrations in Vietnam controlled by sediment age. Nat. Geosci 2012, 5, 656–661.
- (33) McArthur JM; et al. How paleosols influence groundwater flow and arsenic pollution: A model from the Bengal Basin and its worldwide implication. Water Resour. Res 2008, 44, W11411.
- (34) Papacostas NC; et al. Geomorphic controls on groundwater arsenic distribution in the Mekong River Delta, Cambodia. Geology 2008, 36, 891–894.
- (35) Stahl MO; Badruzzaman ABM; Hasan Tarek M; Harvey CF Geochemical transformations beneath man-made ponds: Implications for arsenic mobilization in South Asian aquifers. Geochim. Cosmochim. Acta 2020, 288, 262–281.
- (36) Nghiem AA; et al. Quantifying Riverine Recharge Impacts on Redox Conditions and Arsenic Release in Groundwater Aquifers Along the Red River, Vietnam. Water Resour. Res 2019, 55, 6712–6728.
- (37) Polizzotto ML; Kocar BD; Benner SG; Sampson M; Fendorf S Near-surface wetland sediments as a source of arsenic release to ground water in Asia. Nature 2008, 454, 505–509.
- (38) Harvey CF; et al. Groundwater dynamics and arsenic contamination in Bangladesh. Chem. Geol 2006, 228, 112.
- (39) Stute M; et al. Hydrologic control of As concentrations in Bangladesh groundwater. Water Resour. Res 2007, 43, W09417.
- (40) Michael HA; Voss CI Evaluation of the sustainability of deep groundwater as an arsenic-safe resource in the Bengal Basin. Proc. Natl. Acad. Sci. U. S. A 2008, 105, 8531.
- (41) Johnston SG; et al. Arsenic Mobilization in a Seawater Inundated Acid Sulfate Soil. Environ. Sci. Technol 2010, 44, 1968–1973.
- (42) Weber FA; et al. Temperature dependence and coupling of iron and arsenic reduction and release during flooding of a contaminated soil. Environ. Sci. Technol 2010, 44, 116–122.
- (43) Parsons CT; et al. The impact of oscillating redox conditions: Arsenic immobilization in contaminated calcareous floodplain soils. Environ. Pollut 2013, 178, 254–263.
- (44) Foster AL; et al. In-situ identification of arsenic species in soil and aquifer sediment from Ramrail, Brahmanbaria, Bangladesh. Presentation at AGU Fall meeting, December 15–19, 2000, San Francisco, California.
- (45) Burton ED; et al. Arsenic mobility during flooding of contaminated soil: the effect of microbial sulfate reduction. Environ. Sci. Technol 2014, 48, 13660–13667.
- (46) Kinniburgh DG; Smedley PL, Eds. Arsenic Contamination of Ground Water in Bangladesh, Final Report (BGS Technical Report WC/00/19); British Geological Survey: Keyworth, UK, 2001; Vol. 2.
- (47) Brammer H Floods in Bangladesh. I. Geographical background to the 1987 and 1988 floods. Geogr. J 1990, 156, 12–22.
- (48) Winter TC The concept of hydrologic landscapes. J. Am. Water Resour. Assoc 2001, 37 (2), 335–349.
- (49) Kazama S; Hagiwara T; Ranjan P; Sawamoto M Evaluation of groundwater resources in wide inundation areas of the Mekong River basin. J. Hydrol 2007, 340 (3–4), 233–243.
- (50) Jakobsen R; Kazmierczak J; So HU; Postma D Spatial Variability of Groundwater Arsenic Concentration as Controlled by Hydrogeology: Conceptual Analysis Using 2-D Reactive Transport Modeling. Water Resour. Res 2018, 54, 10254–10269.
- (51) Pekel JF; Cottam A; Gorelick N; Belward AS High-resolution mapping of global surface water and its long-term changes. Nature 2016, 540, 418–437.
- (52) Toth J A theoretical analysis of groundwater flow in small drainage basins. J. Geophys. Res 1963, 68 (16), 4795.
- (53) Alley WM; Winter TC; Harvey JW; Franke OL Ground Water and Surface Water: A Single Resource; USGS Publications, 1998; Vol. 79.
- (54) Tweed S; et al. Seasonal influences on groundwater arsenic concentrations in the irrigated region of the Cambodia Mekong Delta. Sci. Total Environ 2020, 728, 138598.
- (55) DeYoung B; Stahl MO; Connolly CT; Bostick BC Characterizing global variability in groundwater arsenic. Scientific Data 2021, in preparation.
- (56) Characterizing Global Variability in Groundwater Arsenic; USGS John Wesley Powell Center for Analysis and Synthesis, 2020.
- (57) GAMACTT (Version 1.0): Groundwater Age Mixtures and Contaminant Trends Tool.
- (58) Appelo CAJ; Postma D Geochemistry, Groundwater and Pollution, 2nd ed.; Balkema Publ., 2005.
- (59) Allen GH; Pavelsky TM Global extent of rivers and streams. Science 2018, 361, 585–588.
- (60) NASA/METI/AIST/Japan Spacesystems, and U.S./Japan ASTER Science Team. ASTER Global Water Bodies Database V001 [Data set]. NASA EOSDIS Land Processes DAAC, 2019. Accessed 2020/10/30 from DOI: 10.5067/ASTER/ASTWBD.001.
- (61) Boehmke B; Greenwell BM Hands-on Machine Learning with R, 1st ed.; CRC Press: Boca Raton, 2019.
- (62) Roberts LC; et al. Arsenic dynamics in porewater of an intermittently irrigated paddy field in Bangladesh. Environ. Sci. Technol 2011, 45, 971–976.
- (63) Nghiem AA; et al. Aquifer-Scale Observations of Iron Redox Transformations in Arsenic-Impacted Environments to Predict Future Contamination. Environ. Sci. Technol. Lett 2020, 7, 916–922.
- (64) Stopelli E; et al. Spatial and temporal evolution of groundwater arsenic contamination in the Red River delta, Vietnam: Interplay of mobilization and retardation processes. Sci. Total Environ 2020, 717, 137143.