Abstract
Beach water testing for fecal indicator bacteria (FIB) is a key element of public health protection for beachgoers. Because the process can be expensive and time-consuming, many beaches are infrequently monitored, putting the health of the public at risk. Machine learning (ML) models using large sets of FIB, weather, and other types of environmental data have been applied to predict FIB levels at beaches. If ML models developed using data from frequently monitored beaches in one location could be effectively applied to another location (referred to as “generalization”), public health protections could be easily extended to those infrequently monitored beaches. We found that source to target generalization augmented by transfer learning (TL) can predict FIB threshold exceedance with a specificity of 0.70 to 0.81 and sensitivity ranging from 0.28 to 0.76, depending on the beaches and TL methods. This degree of specificity and the high end of the sensitivity range are comparable to the performance of regression and ML models developed by using data from a given beach and applied to that same beach. With the addition of TL, we observed statistically significant improvements in model performance over source to target generalization, with increases of 28.3% in WF1 scores and 5.4% in AUC. Future research into optimizing the selection of data-rich source beaches for developing models that can be applied to a given target beach may further improve transfer learning.
Keywords: machine learning, artificial intelligence, generalization, transfer learning, fecal indicator bacteria, beach monitoring, environmental microbiology


Introduction
Observational studies have established that in many settings levels of fecal indicator bacteria (FIB), such as Escherichia coli and Enterococci spp. (ENT), predict the occurrence of gastrointestinal illness following swimming in surface waters. For that reason, in the United States and Europe, a centerpiece of public health protections for swimmers is the monitoring of FIB levels in beach water followed by the prompt communication of those results to the public. Collecting water samples and analyzing those samples for FIB requires personnel, laboratories, and equipment, and culturing the bacteria takes at least 18 h, during which time water quality frequently changes substantially. Quantitative polymerase chain reaction (qPCR) methods can generate results within 4 h, but the per-sample cost (excluding equipment) was estimated in 2017 at USD $173.
Because of those challenges and limitations, predictive models, generally linear or logistic regression, have been used to estimate FIB levels or the likelihood of exceeding an FIB threshold value, respectively, at beaches (“nowcasting” or “swimcasting”). A review of studies that evaluated predictive models of FIB at freshwater beaches reported that models generally used the following predictors of FIB: rainfall (occurrence or amount during an antecedent time window), wind (speed and/or direction), solar irradiance, air and/or water temperature, wave height, and turbidity.
More recently, machine learning (ML) methods have been used to predict FIB. The above-noted review of predictive models of FIB in freshwater recreational locations identified four studies that compared regression methods to newer approaches such as random forest (RF) and artificial neural networks (ANN) to predict an FIB level or the exceedance of an FIB threshold value. All four studies found that the newer ML methods were better predictors of FIB than were regression models. Those studies used a variety of ML methods, including RF, gradient boosting (XGBoost), ANN, and Bayesian networks. ML model development involves splitting a large data set that contains FIB information as well as predictor variables, typically referred to as “features” in ML terminology, into training and testing data sets (common ML terms and their analogous terms in public health research are available in Table SI1, and additional terminology can be found in Zhu et al.). The training is optimized to produce a final model for a given data set that achieves the most accurate predictions of observed values. Prior studies of ML have used such approaches to develop location-specific models.
ML model development requires a relatively large set of observations of the FIB and predictor variables (such as weather variables). However, large data sets of historical FIB measurements are not available for beaches that are rarely monitored. ML models of FIB could be developed (trained) using data from frequently monitored “data-rich” source beaches, and then directly applied to an infrequently monitored data-poor target beach, without any additional fine-tuning, a process known as source to target generalization. Transfer learning (TL) refers to a class of ML techniques used to improve a source model by fine-tuning it based on some target data. TL may include techniques that adjust the distributions of data from a source location to make them more similar to that of a target location, or they may modify the optimizer’s loss function to incorporate information from the target distributions.
Zhu et al. reviewed 148 highly cited articles that used several types of ML supervision approaches, specifically supervised learning, unsupervised learning, semisupervised learning, and reinforcement learning, with a focus on model generalization. That review reflects the fact that TL approaches have not been used in water microbiology research, despite the widespread utilization of such learning strategies in other application domains. The present study brings TL into water quality research, applying it to the challenge of providing timely water quality information by transferring knowledge from data-rich source beaches to data-poor target beaches.
We recently presented a conference paper that described the performance of ML models that predict log10 ENT concentrations at Chicago and San Diego beaches, using relative root mean squared error (rRMSE) to evaluate model performance. Although those initial models did not include preprocessing or an evaluation of approaches for handling missing data, we demonstrated that transfer of knowledge occurred when models trained on San Diego data were applied to data from Chicago beaches and vice versa. However, to apply ML models for beach monitoring purposes, it would be important to characterize models that predict the exceedance of a threshold value, such as a Beach Action Value (BAV), using metrics recommended by the US EPA for evaluating new approaches to developing site-specific water quality criteria, such as sensitivity and specificity. The present study also compares model performance to that of logistic regression, a widely used approach to predicting the exceedance of recreational water quality thresholds.
This manuscript utilizes the Environmental Machine Learning, Baseline Reporting, and Comprehensive Evaluation (EMBRACE) checklist. The aims of this research are 1) to build on our modeling of ENT values by predicting the exceedance of specified threshold values using models developed using data from one set of beaches and applying those models to another set of beaches; 2) to optimize those models in terms of source to target generalization and transfer learning ability; and 3) to evaluate the performance of those classification models using metrics such as sensitivity, specificity, predictive value, and area under the curve (AUC) of receiver operating characteristics (ROC) analyses.
Materials and Methods
Data Collection
We utilized FIB and environmental data for beaches in two US cities: Chicago, Illinois, and San Diego, California (see Table SI2). Both data sets contain daily measurements of ENT. Data were obtained for 19 freshwater beaches in Chicago from 2016 to 2019 and 14 San Diego area marine beaches from 2014 to 2021. Linkage of FIB data to weather, tide, wave, and solar irradiance data was based on the date and hour of FIB sample collection. Variables used in a prior study of ML for predicting ENT levels at California beaches (although without source to target generalization and without TL) were included in our models if the data were available for both the Chicago and San Diego data sets. The total prior-day solar direct normal irradiance (DNI) data were linked to ENT values by location, date, and time. For calculating offshore and alongshore wind, beach angles were defined by the angle of the beach facing the water, with a beach facing north being 0 degrees, as was done in a prior ML study of California beaches. Wind speed (m/s), wind direction (degrees), air temperature (°C), wave height (m), and water temperature (°C) from the hour before FIB sampling were also linked to the ENT values. Total precipitation (mm) in the 72 h (3 days) and 168 h (7 days) preceding FIB sampling was also linked to ENT values.
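To illustrate the wind decomposition, the sketch below assumes one plausible convention; the function, its sign conventions, and the toy values are ours, not taken from the cited studies.

```python
import numpy as np

def wind_components(speed_ms, wind_from_deg, beach_angle_deg):
    """Split wind into onshore and alongshore components relative to a beach.

    Hypothetical helper: beach_angle_deg is the compass direction the beach
    faces (0 degrees = facing north), and wind_from_deg is the meteorological
    direction the wind blows FROM. Positive onshore means wind blowing from
    over the water onto the beach.
    """
    rel = np.deg2rad(wind_from_deg - beach_angle_deg)
    onshore = speed_ms * np.cos(rel)     # positive: wind from the facing (water) side
    alongshore = speed_ms * np.sin(rel)  # positive: wind along the shoreline
    return onshore, alongshore

# A 5 m/s wind from due north at a north-facing beach is entirely onshore.
print(wind_components(5.0, 0.0, 0.0))  # (5.0, 0.0)
```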
Chicago Data Set
Beach monitoring in Chicago involves water sample collection between 6 and 8 AM for same-day measurement of ENT levels using EPA Method 1609.1. For every beach listed in Table SI4, water was sampled and analyzed 7 days a week during the beach season, which begins in late May and ends in early September (approximately 104 days), as described previously. Results were available by 1:00 PM and used for water quality advisories at beaches and on the Chicago Park District’s websites and social media outlets. Hourly precipitation, wind, and temperature data were downloaded from the Midwest Regional Climate Center for the Midway Airport weather station located in the city of Chicago. Wind direction and wind speed were converted to the speed of wind perpendicular to the beach angle, as determined by the Chicago Park District. Wave and tide data were obtained from the National Oceanographic and Atmospheric Administration (NOAA) National Data Buoy Center for the nearest buoys (no. 45198, Ohio Street, and Calumet Harbor). DNI data for each of the three beach groups (south, central, and north) were obtained from the National Solar Radiation Database by using the coordinates of the beach closest to the center of the group.
San Diego Data Set
Beaches included in the San Diego data set were those within 25 km of San Diego Bay. Because FIB levels at those beaches infrequently exceeded the 2012 EPA Recreational Water Quality Criteria, we further restricted the data set to beaches with a 90th percentile ENT value of 100 colony forming units (CFU)/100 mL or greater. Additionally, we excluded from the data set beaches with fewer than 50 days of beach monitoring over the 8-year period, resulting in a data set of 14 San Diego beaches (see Supporting Information Table SI4).
FIB levels for San Diego beaches were obtained from the EPA BEACON database. Like the Chicago data, ENT was the FIB organism monitored; unlike the Chicago data, the San Diego ENT levels were measured using culture rather than qPCR methods. Many San Diego area beaches contain more than one sampling location, often more than 1 km apart; FIB values from sampling locations were analyzed individually rather than averaged up to the beach level. For San Diego beach samples, the reported collection time (typically 10 AM) was used. In 2014, sample collection time data were generally missing, and the most frequent hour of sample collection in the other years (10 AM) was applied to all San Diego FIB samples from that year. Water samples collected at atypical times (before 6 AM or after 2 PM) were excluded (n = 293). Beach angles, used to calculate inshore and offshore wind speed, were manually calculated by viewing the geocode of the monitoring location on a NOAA map containing shoreline data. Solar irradiance (DNI) data were obtained for each beach from the NREL National Solar Radiation Database. Wave height and water temperature data were retrieved from the National Data Buoy Center. If data were not available from the buoy nearest a beach, data from the next nearest buoy were used. Wind speed, wind direction, and air temperature data were collected from the weather station at San Diego International Airport, available from NOAA’s Global Hourly Integrated Surface Database. Tide data were obtained from two nearby tide stations via the NOAA Tides and Currents CO-OPS API.
Data Preprocessing
Data Distributions
The distributions of FIB and predictor variables differ in the Chicago and San Diego data sets (Figure SI1). The variations observed in these distributions underscore the importance of employing TL methods when transferring information from one location/data set to another. Table SI3 presents descriptive information about ENT and predictor variables, including data missingness, in the two data sets before any preprocessing. Of the 8,729 beach-days of ENT observations at San Diego beaches, the median (25th, 75th percentile) value was 10 (10, 60) CFU/100 mL, with 33.6% exceeding the 30 CFU/100 mL threshold. Of the 6,304 beach-days of ENT observations at Chicago beaches, the median (25th, 75th percentile) value was 121.6 (41.5, 332.3) CCE/100 mL, with 26.9% exceeding the 320 CCE/100 mL threshold.
Train/Test Data Splitting
Given the temporal nature of the data, we used the time information to split the data into training and testing subsets, an approach referred to by Zhu et al. as block splitting. Given the mismatch in years between the Chicago and San Diego data sets, the desire to have similarly sized training subsets for Chicago and San Diego, and the need for realistic testing on future years not seen in the training subsets, we used data from years before 2019 as training data and data from 2019 as testing data (a minimal splitting sketch follows Table 1). Some Chicago samples were excluded from the analysis because they had more than 50% missing features. Given the capability of the models and the small feature set size, we used the entire feature set (12 features), recognizing that doing so could reduce model performance. Table 1 provides statistical information about the train/test benchmark subsets after preprocessing.
Table 1. Train/Test Splits and Data Statistics for ENT Levels for the Data Sets Used in This Study.

| data set | train/test splits | # train/test data | threshold | % of negative samples (train) | % of negative samples (test) |
|---|---|---|---|---|---|
| Chicago | 2016–18/2019 | 4404/1900 | 320 CCE/100 mL (qPCR) | 68.143 | 79.157 |
| San Diego | 2014–18/2019 | 4339/1210 | 30 CFU/100 mL (culture) | 71.307 | 60.661 |
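As an illustration of the year-based block split, the sketch below filters a toy frame by year; the column names stand in for our linked FIB/weather data.

```python
import pandas as pd

# Toy stand-in for the linked FIB/weather table.
df = pd.DataFrame({
    "sample_date": pd.to_datetime(["2016-06-01", "2018-07-15", "2019-06-20"]),
    "ent": [45.0, 410.0, 120.0],
})

train = df[df["sample_date"].dt.year <= 2018]  # 2014-2018 (San Diego) / 2016-2018 (Chicago)
test = df[df["sample_date"].dt.year == 2019]   # held-out future season
```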
Data Imputation
Because some variables were frequently missing in the Chicago data set, we compared approaches for missing data imputation. Those approaches included replacing missing values with the mean value of the variable, incorporating mean values with added noise, and removing observations containing missing data. Ultimately, to impute missing values, we employed the noisy-average method, filling in each missing value with a random number sampled from a normal distribution parametrized by the mean and standard deviation of the corresponding feature. This imputation method was applied to both the training and testing sets using training set statistics to prevent data leakage.
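A minimal sketch of the noisy-average imputation, assuming 2-D NumPy arrays with NaNs marking missing values:

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_mean_impute(train_X, test_X):
    """Fill missing values with draws from N(mean, std) of each feature.

    Means and standard deviations are computed on the training set only and
    reused for the test set, so no test-set information leaks into training.
    """
    mu = np.nanmean(train_X, axis=0)
    sigma = np.nanstd(train_X, axis=0)
    for X in (train_X, test_X):
        for j in range(X.shape[1]):
            miss = np.isnan(X[:, j])
            X[miss, j] = rng.normal(mu[j], sigma[j], size=miss.sum())
    return train_X, test_X
```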
Feature Scaling
As part of data preprocessing, all numerical features were scaled between 0 and 1 using the min-max normalization formula below, which rescales each feature x to a fixed range:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

The minimum and maximum values were computed from the training set and applied to both the training and testing sets to prevent data leakage.
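For example, with scikit-learn's MinMaxScaler fitted on the training data only (toy arrays shown):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]])  # toy feature matrix
X_test = np.array([[2.5, 500.0]])

scaler = MinMaxScaler()  # implements x' = (x - x_min) / (x_max - x_min)
X_train_scaled = scaler.fit_transform(X_train)  # min/max estimated on training data only
X_test_scaled = scaler.transform(X_test)        # test values may fall outside [0, 1]
```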
Data Imbalance
Upon generating the labels for the classification task, it became apparent that the data sets exhibit an imbalance in class distribution. This outcome was expected, as neither Chicago nor San Diego beaches typically approach their respective BAVs. Table 1 shows the percentage of samples in the negative class for the exceedance threshold used. To address the class imbalance, we tried undersampling by randomly subsetting the majority class to create a balanced training set. Additionally, we experimented with oversampling the minority class using the Adaptive Synthetic (AdaSyn) and Synthetic Minority Oversampling Technique (SMOTE) methods, and we compared those results against models trained on the original data without under- or oversampling. After conducting these experiments, we observed significant underperformance with the sampling techniques; therefore, we proceeded without any data augmentation.
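As a sketch of these resampling experiments, the snippet below uses the imbalanced-learn package (our assumed implementation; the Methods name only the AdaSyn and SMOTE techniques) on toy data with roughly the class balance of our ENT labels:

```python
import numpy as np
from imblearn.over_sampling import ADASYN, SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))                 # toy feature matrix (12 features)
y = (rng.random(1000) > 0.7).astype(int)        # ~30% positive (exceedance) labels

for sampler in (RandomUnderSampler(random_state=0),
                SMOTE(random_state=0),
                ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)   # applied to the training set only
    print(type(sampler).__name__, np.bincount(y_res))
```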
Machine Learning Models
The outcome variable in all models was the exceedance (vs nonexceedance) of an ENT threshold value. The US EPA defined Beach Action Values (BAVs) for ENT in the 2012 Recreational Water Quality Criteria. Those BAVs differ depending on whether ENT is measured using culture or qPCR methods. To limit excess cases of illness to 32 per 1,000 swimmers, the BAV is 60 ENT CFU/100 mL if ENT is measured by culture methods and 640 calibrator cell equivalents (CCE)/100 mL if measured by qPCR. Because exceedance of that risk level is relatively uncommon, we modeled exceedance using 50% of those values: 30 CFU/100 mL if measured by culture (for San Diego beaches) and 320 CCE/100 mL if measured by qPCR (for Chicago beaches). Thus, the models predict a more health-protective (conservative) level of disease risk than would the use of established BAVs.
Figure 1 summarizes the overall approach to modeling. Individual years of study data were assigned to testing and training sets, and then baseline RF and LR models were developed. The performance of those models was evaluated by running the model developed on the training data subset against the test data subset. Next, simple source to target generalization was done, with models developed using source beaches applied to data from target beaches (source beach ≠ target beach). For example, models developed using Chicago data were used to predict FIB at San Diego beaches and vice versa. After characterizing model performance for simple source to target generalization, the approach was repeated using several different TL methods. Comparisons of performance included positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, and receiver operating characteristic (ROC) analysis, specifically the area under the curve (AUC).
Figure 1. Flow diagram showcasing the key stages of the algorithms. AUC: area under the curve; TL: transfer learning; LR: logistic regression; RF: random forest; PPV: positive predictive value; NPV: negative predictive value.
The modeling was implemented in a Python environment (Python version 3.11). The scikit-learn Python library (sklearn version 0.0.5) was used for the RF and LR implementations, and the Adapt library (version 0.4.2) was used for all TL implementations.
Supervised ML Algorithms: Logistic Regression
The LR model and the training of the LR are described in the SI-Logistic Regression.
Supervised ML Algorithms: Random Forest
The RF algorithm is an ensemble learning method that combines multiple decision trees to improve predictive performance and reduce overfitting, which is a common problem in machine learning. Overfitting refers to a situation in which the model fits the training data too closely and fails to generalize to new, unseen data. During training, each decision tree is built using a random subset of the training data and a random subset of features at each split. This approach promotes diversity among the trees and helps prevent the ensemble from overfitting.
For classification tasks, RF produces a continuous output by combining the probability estimates of the individual trees. Specifically, each tree provides a probability estimate of the classes derived from the proportion of training samples in its terminal leaf. The final output is typically calculated by averaging these probabilities across all trees, yielding an overall predicted probability for the positive class.
The model’s outcome is a probability that can be interpreted similarly to logistic regression’s output, indicating the model’s confidence in the positive class. A classification threshold can be applied to convert the continuous probability into a binary decision, allowing for the assessment of the evaluation metrics explained in the Evaluating Methods Performance section.
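For illustration, with scikit-learn (toy data; the 0.5 threshold shown is just one point on the ROC curve):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data standing in for the beach feature set (12 features).
X, y = make_classification(n_samples=500, n_features=12, weights=[0.7], random_state=0)
rf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X, y)

proba = rf.predict_proba(X)[:, 1]    # per-tree probabilities averaged across the forest
threshold = 0.5                      # decision threshold; varied for the ROC analysis
y_pred = (proba >= threshold).astype(int)
```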
Supervised and Unsupervised TL Algorithms
We used balanced weighting (BWT) and feature augmentation (FA) for supervised TL, and correlation alignment (CORAL) and subspace alignment (SA) for unsupervised TL. The assumption in supervised TL is that a few labeled target samples are provided (in this study, the label is the ENT threshold exceedance), while in unsupervised TL, only unlabeled target data (weather, solar radiation, wave, and tide data) are provided. Unsupervised TL algorithms use observations of predictor variables from the target data, as well as source training data. To simulate unlabeled data, we removed the labels (ENT values) from the target data and applied the unsupervised TL algorithms, using the entire training sample from the target data with the CORAL and SA alignment methods. Details and equations of the two supervised TL algorithms and the two unsupervised TL algorithms have been described in ref , where they are referred to as domain adaptation (DA) algorithms. The BWT and CORAL algorithms include adaptation regularization hyperparameters designed to control the model’s inclination toward the target data. In BWT, the parameter γ, ranging between 0 and 1, dictates the importance attributed to the labeled target data: a ratio of 1 signifies that the estimator is fitted solely on target data, while a ratio of 0.5 corresponds to balanced training. CORAL incorporates a regularization hyperparameter λ; the larger the value of λ, the less adaptation is performed.
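As a minimal sketch of how these methods are invoked, assuming the Adapt 0.4 API (the FA, BalancedWeighting, and CORAL classes with their yt, ratio, and lambda_ parameters) and using toy placeholder arrays in place of our beach data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from adapt.feature_based import FA, CORAL            # supervised FA; unsupervised CORAL
from adapt.instance_based import BalancedWeighting   # supervised BWT

rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(500, 12)), rng.integers(0, 2, 500)         # labeled source data
Xt_few, yt_few = rng.normal(1, 1, (50, 12)), rng.integers(0, 2, 50)  # small labeled target sample
Xt = rng.normal(1, 1, (500, 12))                                     # unlabeled target features

base = RandomForestClassifier(n_estimators=400, random_state=0)

# Supervised TL: FA augments source and target features into a shared space.
fa = FA(base, Xt=Xt_few, yt=yt_few)
fa.fit(Xs, ys)

# Supervised TL: BWT weights source vs target samples via the ratio (gamma) parameter.
bwt = BalancedWeighting(base, ratio=0.8, Xt=Xt_few, yt=yt_few)
bwt.fit(Xs, ys)

# Unsupervised TL: CORAL aligns source feature covariance with the target's.
coral = CORAL(base, Xt=Xt, lambda_=1e-3)
coral.fit(Xs, ys)

print(fa.predict(Xt[:3]), coral.predict(Xt[:3]))
```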
Model Optimization and Evaluation
Hyperparameter Fine-Tuning
As part of model optimization, several hyperparameters were fine-tuned. Unlike model parameters, which are learned during training, hyperparameters are preset configurations of an ML algorithm that influence the model’s behavior and performance. Here, setting hyperparameters refers broadly to either choosing among several methods (e.g., imputation methods) or selecting a value for tunable algorithm parameters (e.g., the number of estimators/trees in the RF approach, or the adaptation regularizers γ and λ, which define the trade-off between target data and source data in the BWT and CORAL approaches, respectively). In addition, we tested the effectiveness of the data sampling method in improving the results. Both the LR and RF algorithms also have a set of hyperparameters, which were fine-tuned for each data set.
To perform hyperparameter fine-tuning, we split the training data into 4 folds, trained four models by leaving a different fold out for each model, and performed validation on the left-out fold. All of the hyperparameter configurations explored as part of model optimization, together with the selected values for each hyperparameter, are listed in Table 2.
Table 2. Explored Configurations and Hyperparameters, and Selected Values.

Preprocessing configurations:

| issue | approaches | selected method |
|---|---|---|
| Missing data | Imputation (mean, mean + noise); removing observations | Mean + noise |
| Data imbalance | Over-/undersampling (AdaSyn, SMOTE); no sampling | No sampling |

Hyperparameters (tuned separately for each data set and for generalization/TL models):

| algorithm | hyperparameter | hyperparameter space |
|---|---|---|
| Logistic regression | Solver | “newton-cg”, “lbfgs”, “liblinear” |
| Logistic regression | C (regularization term) | 0.01, 0.1, 1, 10 |
| Logistic regression | Maximum iterations | 100, 500, 1000, 1500 |
| Random forest | # Estimators | 400–2000 (step size: 25) |
| Random forest | Maximum features | “sqrt”, “log2”, None |
| Random forest | Maximum depth | 10–130 (step size: 15) |
| Random forest | Minimum samples split | 2, 5, 10, 15, 20 |
| Random forest | Minimum samples leaf | 1–10 (step size: 1) |
| Random forest | Bootstrap | True, False |
| Balanced weighting | Adaptation regularizer (γ) | 0–1 (step size: 0.1) |
| Correlation alignment | Adaptation regularizer (λ) | 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10 |
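To illustrate the leave-one-fold-out validation used for tuning, the sketch below scores candidate RF configurations from the search space in Table 2 on toy data; the use of weighted F1 as the selection metric here is our illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=400, n_features=12, weights=[0.7], random_state=0)

def cv_wf1(params, seed=0):
    """Mean weighted F1 across 4 leave-one-fold-out models (illustrative)."""
    scores = []
    for tr, va in KFold(n_splits=4, shuffle=True, random_state=seed).split(X):
        model = RandomForestClassifier(random_state=seed, **params).fit(X[tr], y[tr])
        scores.append(f1_score(y[va], model.predict(X[va]), average="weighted"))
    return np.mean(scores)

# Compare two candidate configurations from the explored search space.
print(cv_wf1({"n_estimators": 400, "max_depth": 10}),
      cv_wf1({"n_estimators": 800, "max_depth": 40}))
```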
Evaluating Methods Performance
After training, the models produce a continuous output in the range [0, 1], representing the predicted probability of an instance belonging to the positive class. In the case of LR, this probability is generated by applying the logistic (sigmoid) function to a linear combination of the input features.
For RF, the probability is typically computed as the fraction of decision trees in the ensemble that vote for the positive class. A threshold value can be applied to the continuous model output to convert it into a binary prediction (e.g., exceedance vs nonexceedance). By systematically varying this threshold, one can assess changes in sensitivity, specificity, and other performance metrics such as the area under the receiver operating characteristic (ROC) curve, denoted AUC.
To assess the performance of our models, we utilized ROC curves and AUC values. The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across various thresholds. The AUC quantifies the overall performance of the model as the area under the ROC curve. In addition, we utilized the per-class F1 score and the weighted F1 score (WF1-score) for the classification task. The per-class F1-score integrates sensitivity with positive predictive value (PPV) for the positive class, and specificity with negative predictive value (NPV) for the negative class. We define the F1-scores for the two classes as

$$\mathrm{PF1} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{sensitivity}}{\mathrm{PPV} + \mathrm{sensitivity}}, \qquad \mathrm{NF1} = \frac{2 \cdot \mathrm{NPV} \cdot \mathrm{specificity}}{\mathrm{NPV} + \mathrm{specificity}}$$
To make the F1-score more general and take into account the number of samples in each class in our testing set, we also report the weighted F1-score, the weighted sum of the negative and positive F1-scores, defined as

$$\mathrm{WF1} = \sum_{i=1}^{C} w_i \,\mathrm{F1}_i$$
where C is the number of classes, w_i is the weight for class i, and F1_i is the F1-score for class i. The weights are proportional to the number of samples in each class, ensuring that the evaluation metric considers each class’s contribution to the overall performance. The WF1-score provides a balanced assessment that accounts for imbalances in the class distribution within the test set.
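For reference, these metrics can be computed with scikit-learn; the labels and probabilities below are toy values.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])            # imbalanced toy labels
proba = np.array([.1, .2, .2, .3, .4, .6, .3, .7, .4, .8])   # predicted P(exceedance)
y_pred = (proba >= 0.5).astype(int)

nf1, pf1 = f1_score(y_true, y_pred, average=None)   # per-class F1: negative, positive
wf1 = f1_score(y_true, y_pred, average="weighted")  # WF1 = sum_i w_i * F1_i
auc = roc_auc_score(y_true, proba)                  # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, proba)     # sensitivity vs 1 - specificity trade-off
```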
Model Robustness and Statistical Significance Testing
In the conference proceeding, we compared RF to XGBoost, TabNet, and fine-tuned Llama-3 and found that RF performed best on the regression task. Here we added logistic regression, which has been widely used as a predictive model of FIB exceedance. The data set was split into training and testing subsets based on temporal splitting blocks (i.e., years) to prevent train/test leakage. However, with a fixed train/test split, we cannot evaluate the results for statistical significance or check the robustness of the models. To account for this, we used variations of the training data to check how much model performance varies with changes in the training data. Specifically, as mentioned in the context of hyperparameter fine-tuning, we split the training data into 4 folds, trained four models (leaving out a different validation fold for each model), and evaluated each model on the test set (which was fixed to the year 2019 for both the San Diego and Chicago data sets). The final performance for one split is the average over the four folds. To ensure robust evaluation, we repeated this process 15 times with 15 different random seeds and computed means and standard deviations over the 15 iterations.
To determine whether apparent differences in model performance were likely due to chance alone, we used the means and standard deviations obtained from the 15 iterations described above. We used t tests with p-values corrected for the number of comparisons being made to compare the performance of the source to target generalization and TL models. This approach allowed us to assess the consistency and robustness of our models across different partitions of the data. We report the mean values as the results, along with their corresponding standard deviations, providing a comprehensive picture of the performance and variability of our models.
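A hedged sketch of this comparison, with toy WF1 scores standing in for the 15-iteration results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gen_wf1 = 0.53 + rng.normal(0, 0.01, 15)  # generalization WF1 over 15 seeded iterations (toy)
tl_wf1 = 0.66 + rng.normal(0, 0.01, 15)   # TL WF1 over the same 15 iterations (toy)

t_stat, p_value = stats.ttest_ind(tl_wf1, gen_wf1)
alpha_corrected = 0.05 / 2                # corrected threshold, matching p < 0.025 used here
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, significant: {p_value < alpha_corrected}")
```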
Model Interpretability - Feature Importance
To assess feature importance, we employed both Random Forest and Logistic Regression models. In Random Forests, feature importance is derived from the concept of the mean decrease in impurity (MDI). Impurity, in this context, refers to how mixed the data are at a given node in the decision tree, that is, how uncertain the classification is. Common impurity measures, such as Gini impurity or entropy, quantify the level of disorder in class distributions. A feature is considered important if it frequently contributes to splitting the data into purer subsets, thus reducing impurity across the ensemble of trees. We calculated the average and standard deviation of the impurity reductions contributed by each feature across all trees in the forest, providing a robust estimate of their relative importance.
For Logistic Regression, feature importance is inferred from the magnitude of each feature’s coefficient. These coefficients represent the strength and direction of the relationship between each feature and the predicted outcome. Larger absolute coefficient values indicate a greater impact on the model’s decision boundary. By comparing the normalized coefficients across features, we identified those with the strongest influence on the model’s predictions. For both RF and LR, feature importance was evaluated on training data only.
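As an illustrative sketch of extracting both importance measures with scikit-learn (toy data; features are min-max scaled so the LR coefficient magnitudes are comparable, as in our preprocessing):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X = MinMaxScaler().fit_transform(X)   # comparable feature scales

rf = RandomForestClassifier(random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

rf_mdi = rf.feature_importances_      # mean decrease in impurity across all trees
lr_coef = np.abs(lr.coef_[0])         # coefficient magnitudes
lr_coef = lr_coef / lr_coef.sum()     # normalized for comparison across features
```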
We should note that feature importance, as described above, refers to the features that the model makes use of for its predictions. However, correlations between features that carry very similar information may lead the model to rely on one of the correlated features while neglecting the other. To identify such correlations, and features that may be important although not used by the model, we computed correlation matrices between the features included in the models. Furthermore, to study the potential for model simplification through feature selection, we computed the information gain between every feature and the class variable. The features with the smallest impurity reduction (i.e., smallest information gain) for RF and with the smallest absolute coefficient values for LR were identified as candidates for removal.
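A sketch of the correlation and information gain computations on toy data, using pandas and scikit-learn's mutual information estimator as a stand-in for information gain:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(12)])

corr = df.corr()                                # feature-feature correlation matrix
mi = mutual_info_classif(X, y, random_state=0)  # information gain vs the class variable
drop_candidates = mi.argsort()[:3]              # three least informative features
print(drop_candidates, mi[drop_candidates])
```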
Data Leakage Management
As recommended by Zhu et al., it is important to manage data leakage, and with this goal in mind, we took the necessary steps to prevent it. Our ENT evaluations are based on real values that were used for monitoring water quality at Chicago and San Diego beaches, which minimizes bias in data collection. We fitted the scalers used for feature scaling and the missing data imputation approach on the training data only (thus the unseen testing data had no influence on the values used). Feature importance was also assessed on training data only. We used temporal block splitting to divide the data into training and testing subsets and ensured that only years later than the training years appear in the testing subsets of both the Chicago and San Diego data sets.
Results and Discussion
Base Model Performance and Source to Target Generalization
The performance of models trained on data from one set of beaches and applied to predict exceedance of ENT thresholds at the same set of beaches (source = target) is shown in Table 3. Predicting ENT exceedance was most accurate (based on both WF1 scores and AUC) for models trained on San Diego data and applied to San Diego beaches (San Diego/source: San Diego/target). Predictions of ENT exceedance were less accurate for Chicago/source: Chicago/target, especially for the LR model. The ROC curves presented in Figures SI3 and SI4 visually demonstrate the far greater performance of the San Diego/source: San Diego/target models compared to the other baseline simple source to target generalization models. The AUC values reported in the table are the mean of 15 model runs (the statistical significance test runs described in the previous section), while the curves in the figures are from the individual run with the AUC closest to the mean of the 15 runs.
Table 3. Classification Results Measured by WF1-Score and AUC for Training on Source Data and Testing on Source and Target Data, Respectively.

| model | source | WF1-score (target: San Diego) | AUC (target: San Diego) | WF1-score (target: Chicago) | AUC (target: Chicago) |
|---|---|---|---|---|---|
| RF | Chicago | 0.535 ± 0.025 | 0.567 ± 0.027 | 0.545 ± 0.013 | 0.607 ± 0.015 |
| RF | San Diego | 0.691 ± 0.002 | 0.733 ± 0.003 | 0.408 ± 0.059 | 0.482 ± 0.013 |
| LR | Chicago | 0.482 ± 0.017 | 0.602 ± 0.017 | 0.181 ± 0.008 | 0.693 ± 0.004 |
| LR | San Diego | 0.680 ± 0.0011 | 0.728 ± 0.000 | 0.164 ± 0.023 | 0.448 ± 0.005 |

For both WF1-score and AUC, higher values indicate better model performance. Values are reported as mean ± standard deviation over multiple runs; a lower standard deviation indicates more stable results.
The AUC performance of the Chicago/source: San Diego/target generalization is only slightly worse than that of the base model Chicago/source: Chicago/target. However, the San Diego/source: Chicago/target generalization models performed poorly and, based on AUC analyses, were no better than chance alone (i.e., AUC = 0.5) at predicting ENT exceedance. The RF base models produced better predictions of ENT exceedance in basic generalization models than did the LR models (Table 3).
The performance of the basic source to target generalization models is also presented in Table SI6, in contrast with the performance of the TL models. The TL method with the highest performance and improvement over the source to target generalization models is marked with an asterisk (*). Additionally, bold font marks instances in which TL significantly improved the results over the generalization models. The t test p-value threshold was set to 0.025 for comparisons between the generalization models and each of the TL methods. Based on the t tests, FA and BWT always resulted in an increase in performance, SA resulted in statistically significant improvement in three out of four comparisons, and CORAL never improved performance.
Transfer Learning
Table SI6 summarizes the performance of the simple generalization models in comparison to the models that used TL. For both San Diego/source: Chicago/target and Chicago/source: San Diego/target models, and for both RF and LR base models, model performance increased significantly (p < 0.025, bold font) with the implementation of TL. This was true for all comparisons involving the two supervised TL methods (FA and BWT). One of the unsupervised TL methods, SA, increased model performance for most scenarios, while the other unsupervised TL method, CORAL, did not significantly improve model performance. Model performance differed substantially in terms of predicting the positive class vs the negative class; Table SI7 shows the NF1 and PF1 results for this experiment. The variations in methods of measuring ENT across the two data sets, and the resulting differences in distributions, make the results of unsupervised TL noisy.
Both data sets suffer from class imbalance, which may result in the models performing poorly on the minority class compared to the majority class. While Table SI6 summarizes the overall model performance in terms of WF1 and AUC, Table SI7 presents the F1-scores separately for the positive class and negative class. The much better performance of all models for predicting the negative class is likely due in part to class imbalance in the original data sets.
Based on the WF1-scores reported in Table SI6, when TL is added to RF for transfer from Chicago to San Diego, both FA and BWT surpass the performance of the generalization models, SA exhibits comparable performance, and CORAL exhibits a drop in performance relative to the generalization models. Similarly, with the LR algorithm for transfer from Chicago to San Diego, the FA, BWT, and SA methods outperform the generalization models. For the supervised TL methods (FA and BWT), the per-class F1-scores (Table SI7) show increases for both positive and negative classes. Analyzing all methodologies for transfer from Chicago to San Diego, it becomes evident that, based on WF1-scores and AUC values, the combination of RF + FA demonstrates the highest performance, followed by RF + BWT and LR + FA in second and third place, with WF1-scores of 0.680, 0.669, and 0.655, respectively; the corresponding AUC values are 0.713, 0.709, and 0.693. When using SA as the TL method with the RF algorithm, transfer from San Diego to Chicago demonstrates a notable improvement over the direct source to target generalization models. For San Diego to Chicago transfer, BWT and SA increased the WF1-scores over the generalization models with both RF and LR; however, only the negative class improved significantly (Table SI7). Finally, RF + FA was the best model for transfer from San Diego to Chicago, with a WF1-score of 0.642, an AUC of 0.610, and per-class F1-scores of 0.728 and 0.315.
The second and third rows of Figures SI4 and SI5 depict the TL ROC curves and the corresponding AUC values. While the first rows of both figures are identical, Figure SI4 focuses on supervised TL, while Figure SI5 showcases unsupervised TL. In Figure SI4, we observe a general improvement from TL when transferring from Chicago to San Diego, while TL shows minimal improvements when transferring from San Diego to Chicago (as seen by comparing the red and purple lines in the top right plot and the plots below in Figure SI4). In contrast, in Figure SI5, we observe that unsupervised TL demonstrates less improvement in the ROC curves. In combination, the two figures show that supervised TL methods outperform unsupervised TL methods.
Table SI7 shows that most of the TL methods lead to improvements in the per-class F1 scores. It is evident that the positive class, being the minority class, poses greater challenges and exhibits lower F1-scores. Overall, RF demonstrates a better balance between the two classes compared to LR.
Table 4 shows that model performance varies by base model (RF vs LR), target beach, and performance metric. Consistent with the AUC-based information presented in Table SI6, the use of TL significantly improved the performance of the source to target generalization models. Specificity was generally higher than sensitivity, in some cases by severalfold, with several exceptions for both LR and RF when San Diego was used as a target (specifically, for base models and FA models, but not for the generalization models). Given the substantial differences in the FIB distributions at a large freshwater lake and at San Diego ocean beaches (see Supporting Information Table SI4), as well as differences in the FIB measurement methods (qPCR and culture), high levels of source to target generalization were not expected. Nevertheless, the use of TL improved model specificity, sensitivity, and AUC compared to the generalization models. Model specificity was generally good; all models had specificity above 0.60, and most were above 0.70. Sensitivity, however, was generally poorer and more variable; only four of the 12 models had sensitivity above 0.70. The USGS described goals for FIB predictive modeling as a sensitivity of 0.50 and a specificity of 0.85. None of the models met those criteria, but four of the models listed in Table 4 had a specificity of at least 0.68 and a sensitivity of at least 0.72. In addition to the two base (source = target) models, the two other models that met those criteria were the Chicago/source: San Diego/target TL models that utilized FA (for both RF and LR); the models that used only source to target generalization did not. Because models with low sensitivity fail to warn the public about FIB threshold exceedances, from a public health standpoint, sensitivity is more important than specificity. Sensitivity was below 0.3, and AUC was approximately 0.6 (only modestly better than chance), for the San Diego/source: Chicago/target models even with TL. This makes clear that improvements in data types, data quantities, and/or modeling methods are needed if San Diego/source: Chicago/target models are to reach the performance seen with Chicago/source: San Diego/target models. The class imbalance (more observations of FIB below the threshold than above it) presents a challenge. Additionally, we pooled data from 14 San Diego and 19 Chicago beaches. Our models did not use information about the physical characteristics of beaches (such as the presence, type, and source of fecal pollutants; the beach slope; whether the beach is freshwater or marine; whether it is embayed; the estimated numbers of bathers). The use of such information in the TL process may identify characteristics that inform the selection of the optimal source beach for the transfer of learning to a given target beach.
Table 4. Comparison of LR and RF Model Performance Using the Preferred TL Methods for Each Model and Source:Target Combination.
| Source: Target | specificity | sensitivity | neg. pred. value | pos. pred. value | AUC |
|---|---|---|---|---|---|
| RF Models | |||||
| San Diego: San Diego | 0.705 | 0.742 | 0.906 | 0.416 | 0.733 |
| Chicago: San Diego Gen. (no TL) | 0.620 | 0.463 | 0.854 | 0.191 | 0.567 |
| Chicago: San Diego TL (FA) | 0.701 | 0.729 | 0.896 | 0.409 | 0.713 |
| Chicago: Chicago | 0.856 | 0.256 | 0.454 | 0.710 | 0.607 |
| San Diego: Chicago Gen. (no TL) | 0.807 | 0.208 | 0.320 | 0.689 | 0.482 |
| San Diego: Chicago TL (FA) | 0.825 | 0.291 | 0.689 | 0.432 | 0.610 |
| LR Models | |||||
| San Diego: San Diego | 0.695 | 0.780 | 0.933 | 0.369 | 0.728 |
| Chicago: San Diego Gen. (no TL) | 0.611 | 0.591 | 0.982 | 0.036 | 0.602 |
| Chicago: San Diego TL (FA) | 0.683 | 0.760 | 0.923 | 0.337 | 0.693 |
| Chicago: Chicago | 0.966 | 0.220 | 0.072 | 0.990 | 0.693 |
| San Diego: Chicago Gen. (no TL) | 0.770 | 0.207 | 0.065 | 0.929 | 0.448 |
| San Diego: Chicago TL (BWT) | 0.836 | 0.258 | 0.612 | 0.506 | 0.605 |
To the best of our knowledge, this is the first study to transfer knowledge from ML models developed using data from one set of beaches to predict the exceedance of FIB thresholds at another set of beaches. While model performance metrics, particularly sensitivity, would need to improve before this approach could be applied for public notification purposes, the model performance compares favorably to prior studies of ML for predicting FIB threshold exceedance. Table 5 summarizes the performance of models that used LR, ANN, RF, XGBoost, and other machine learning models to predict the exceedance of FIB thresholds, including the present study. Substantial variability in data types, data quantities, and ML experimental methods is found among the studies cited in the table. Among the sources of variability are the FIB type (Enterococci spp., E. coli, fecal coliforms), the specific FIB threshold exceedance value used, the study setting (marine coastal, freshwater coastal, rivers, reservoirs), the number of seasons (years) of data, and the number of beaches studied. Nevertheless, those studies demonstrate that model sensitivity was generally better in models of FIB at rivers or a reservoir than at coastal waters. The AUC, sensitivity, and specificity demonstrated in the present study, using source to target generalization and TL, compare favorably to the performance metrics summarized in Table 5 from studies at surface waters other than rivers, even though those studies did not involve TL from one setting to another (they focused on the supervised source = target scenario).
Table 5. Performance of Prior LR and RF Models by Comparison with Our Models.

| study | beach location | LR sens | LR spec | LR AUC | best ML sens | best ML spec | best ML AUC | best ML model |
|---|---|---|---|---|---|---|---|---|
| Mas and Ahlfeld | Boston reservoir | | | | 46–62 | 23–34 | | ANN |
| Motamarri and Boccelli | Charles Riv. MA | | | | 68 | 92 | | ANN |
| Thoe et al. | Hong Kong*** | | | | 50.2 | 85.7 | | ANN |
| Jones et al. | Great Lake | | | | 27–65 | 82–96 | | RF |
| Thoe et al. | California | | | | 28–30 | 85–99 | | ANN |
| Thoe et al. | California | | | | 30 | 99 | | ANN |
| Brooks et al. | Great Lake | 56 | 75 | 0.58–0.68 | 51 | 78–80 | 0.75–0.76 | GBoost |
| Mälzer et al. | Ruhr River | 91–100 | 40–61 | | 89–100 | 58–83 | | ANN |
| Avila et al. | N. Zealand, river | | | | 71 | >85 | | RF |
| Zhang et al. | Great Lake | | | | 0–100 | 99–100 | | Ensemble |
| García-Alba et al. | Spain | | | | 41.7 | 97 | | ANN |
| Xu et al. | New Zealand | | | | 76.5 | 89.30 | | Multiple* |
| Guo and Lee | Hong Kong** | | | | 53–81 | 76–90 | | Ensemble |
| Tselemponis et al. | Greece | | | | 90 | 90 | | Decision Forest |
| Searcy and Boehm | California | 40 | 69 | 0.57 | 13 | 85 | 0.6 | RF |
| Present Study | | | | | | | | |
| San Diego: San Diego | | 78.0 | 69.5 | 72.8 | 74.2 | 70.5 | 73.3 | RF |
| Chicago: San Diego - Generalization | | 59.1 | 61.1 | 60.2 | 46.3 | 62.0 | 56.7 | RF |
| Chicago: San Diego - TL | | 76.0 | 68.3 | 69.3 | 72.9 | 70.1 | 71.3 | RF |
| Chicago: Chicago | | 22.0 | 96.6 | 69.3 | 25.6 | 85.6 | 60.7 | RF |
| San Diego: Chicago - Generalization | | 20.7 | 77.0 | 44.8 | 20.8 | 80.7 | 48.2 | RF |
| San Diego: Chicago - TL | | 25.8 | 83.6 | 60.5 | 29.1 | 82.5 | 61.0 | RF |
While our experimental results show that TL improves performance over source to target generalization, we should note one important limitation from a TL perspective: our training data sets for both Chicago/source and San Diego/source are relatively small and not very diverse, as they are based on data from nearby beaches at one main location each (Chicago and San Diego, respectively). Obtaining larger sets of environmental data pertaining to data-rich beaches (those with an extensive record of historical FIB data) should improve source model performance, as should obtaining satellite data, pollutant source information, beach slope, and other characteristics of beaches.
The findings of this research have several implications for beach monitoring. While beaches in urban areas and tourist destinations are monitored regularly for FIB, many beaches are monitored infrequently or not at all. We demonstrated that ML models built with weather and FIB data from one set of beaches can predict FIB exceedance at a very different set of beaches using weather data (but not FIB data) from the target beaches. The accuracy of prediction was comparable to (if not better than) that of models developed using data from a beach to predict FIB exceedance at that same beach. This was true despite the fact that the source and target beaches used in this study were quite different: one set comprised freshwater beaches in a temperate climate at which FIB were measured using a molecular (qPCR) method, while the other comprised marine beaches in a Mediterranean climate at which FIB were measured using culture methods. We note that the choice of source and target beaches had a major impact on model performance, and much remains to be learned about the determinants of this so that TL can be improved. We grouped Chicago beaches into three regions based on location; weather conditions were nearly identical among beaches within a group. As a result, training on one beach and testing on another beach in the same group would have been comparable to testing and training on the same beach. Greater variability in predictor and outcome variables among beaches may have allowed each beach to be treated as an independent set of observations. We do not expect that the base ML models and augmentation methods for class imbalance that performed best here will be optimal in all settings. However, this research supports the continued evaluation of information from data-rich beaches to develop real-time models of FIB exceedance at beaches for which weather data are available but that are “data poor” regarding historical FIB levels. We note, however, that the application of our approach to other settings would be limited to target beaches for which environmental data are readily available, which is generally the case for weather, tide, wave, and solar irradiance data. Though we calculated beach angles manually for San Diego beaches using public data, the use of geospatial data in raster formats could automate that process.
Feature Importance
The results for feature importance are summarized in Figure SI2. Important features are identified based on the training data set of the models as those features that the models (RF, LR) make most use of for their predictions. For both LR and RF models trained on the Chicago data set, wind and tide features were consistently highly predictive. While solar radiation was also highly predictive in the RF model, that was not the case for the LR algorithm. There are no meaningful similarities in feature importance between LR and RF for the San Diego data set, except water temperature, which appears highly predictive for both; LR has high coefficients for rain-related features, while RF has high MDI for the “Day of year” and wave height features. To allow for the identification of other important features that the models may neglect, we computed correlation matrices for the features in the Chicago and San Diego data sets, respectively. The correlation matrices, shown in Figure SI3, present somewhat similar patterns for the Chicago and San Diego data sets. For example, in both data sets, the rain features (Any Rain (3 Day), Any Rain (7 Day), Cumulative Rain (3 Day), and Cumulative Rain (7 Day)) have high positive correlations with each other. Similarly, the Tide and Tide dichotomous features have a high positive correlation. In contrast, the Water Temperature feature has relatively high negative correlations with the rain features and with Wave Height. Such correlations can help identify important features neglected by the models. For example, knowing that Tide is one of the most important features for the RF models, the correlation matrices suggest that the Tide dichotomous feature is also important, although the RF models do not make much use of it in their predictions.
Hydrologic, beach characteristics, land use, stormwater management, and weather conditions are quite different in San Diego and Chicago. Those local characteristics, as well as their interaction with Enterococci measurements (culture in San Diego and qPCR in Chicago), may contribute to the observed differences in feature importance. This is consistent with prior modeling studies of FIB at freshwater beaches. Despite these differences, as well as the use of qPCR vs culture methods of ENT measurement, the use of TL produced several sensitive and specific models of FIB exceedance. It is unknown what degree of dissimilarity of the physical characteristics of beaches, pollutant sources, and environmental metrics would reduce the accuracy of models developed at one set of beaches and applied to another.
To explore the possibility of simplifying the models and improving their interpretability, we performed feature selection by computing the information gain between each feature and the class variable for both the Chicago and San Diego data sets and removing the features with the lowest information gain. The information gain scores for the two data sets are shown in the columns on the right in Figure SI3. We removed the three features with the lowest information gain scores from each data set and ran the base models with the remaining features. The results of the experiments with this subset of features are shown in Table SI5. As can be seen in the table, the feature selection results are very similar to, and in some cases slightly worse than, the results with all features. Given this finding, and the fact that the number of features in our data sets is relatively small (only 12), we kept all features in our main analysis.
Data and Code Sharing
The data sets and code used in this study are available at https://figshare.com/s/d1b9b11d79bfdad7432a.
Supplementary Material
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.est.5c02835.
Additional details of models, terminology, data, and model performance, including tables and figures (PDF)
Acknowledgments
This research was sponsored by the Department of the Navy, Office of Naval Research under ONR award number N00014-21-1-2286. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research. Dr. Doina Caragea’s contributions to this research were supported in part by the Cognitive and Neurobiological Approaches to Plasticity (CNAP) Center of Biomedical Research Excellence (COBRE) of the National Institutes of Health under grant number P20GM113109. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The research team notes that the Co-PI of this research was Professor Isabel F. Cruz, PhD, UIC Department of Computer Science, who sadly passed away early in the project period. This research would not have been possible without her brilliance and her enthusiasm for cross-disciplinary research. Many thanks to Kara Sorensen, Robert D. George, and Patrick C. Sims of NIWC Pacific. We also recognize the contributions of Charlie Catlett at the beginning of this project.
The authors declare no competing financial interest.
References
- Wiedenmann A., Krüger P., Dietz K., López-Pila J. M., Szewzyk R., Botzenhart K.. A randomized controlled trial assessing infectious disease risks from bathing in fresh recreational waters in relation to the concentration of Escherichia coli, intestinal enterococci, Clostridium perfringens, and somatic coliphages. Environ. Health Perspect. 2006;114:228–236. doi: 10.1289/ehp.8115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wade, T. J. ; Sams, E. A. ; Beach, M. J. ; Collier, S. A. ; Dufour, A. P. . The incidence and health burden of earaches attributable to recreational swimming in natural waters: a prospective cohort study. Environ. Health 2013, 12.67 10.1186/1476-069X-12-67 [DOI] [PMC free article] [PubMed] [Google Scholar]
- US Environmental Protection Agency , Office of Water, Recreational Water Quality Criteria 2012, Document 820-F-12-058, 2012, http://water.epa.gov/scitech/swguidance/standards/criteria/health/recreation/upload/RWQC2012.pdf.
- EUR-Lex - 02006L0007-20140101 - EN - EUR-Lex. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02006L0007-20140101. Accessed: July 7, 2024.
- Boehm A. B.. Enterococci concentrations in diverse coastal environments exhibit extreme variability. Environ. Sci. Technol. 2007;41:8227–8232. doi: 10.1021/es071807v. [DOI] [PubMed] [Google Scholar]
- Dorevitch S., Shrestha A., DeFlorio-Barker S., Breitenbach C., Heimler I.. Monitoring urban beaches with qPCR vs. culture measures of fecal indicator bacteria: Implications for public notification. Environmental Health. 2017;16:45. doi: 10.1186/s12940-017-0256-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shrestha A., Dorevitch S.. Slow adoption of rapid testing: Beach monitoring and notification using qPCR. J. Microbiol. Methods. 2020;174:105947. doi: 10.1016/j.mimet.2020.105947. [DOI] [PubMed] [Google Scholar]
- Heasley C., Sanchez J. J., Tustin J., Young I.. Systematic review of predictive models of microbial water quality at freshwater recreational beaches. PLoS One. 2021;16:e0256785. doi: 10.1371/journal.pone.0256785. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Avila R., Horn B., Moriarty E., Hodson R., Moltchanova E.. Evaluating statistical model performance in water quality prediction. J. Environ. Manage. 2018;206:910–919. doi: 10.1016/j.jenvman.2017.11.049. [DOI] [PubMed] [Google Scholar]
- Mälzer H.-J., aus der Beek T., Müller S., Gebhardt J.. Comparison of different model approaches for a hygiene early warning system at the lower Ruhr River, Germany. Int.Hyg, 2016 J. Hyg. Environ. Health. 2016;219:671–680. doi: 10.1016/j.ijheh.2015.06.005. [DOI] [PubMed] [Google Scholar]
- Wang L., Zhu Z., Sassoubre L., Yu G., Liao C., Hu Q., Wang Y. Improving the robustness of beach water quality modeling using an ensemble machine learning approach. Sci. Total Environ. 2021;765:142760. doi: 10.1016/j.scitotenv.2020.142760.
- Brooks W., Corsi S., Fienen M., Carvin R. Predicting recreational water quality advisories: A comparison of statistical methods. Environ. Model. Softw. 2016;76:81–94. doi: 10.1016/j.envsoft.2015.10.012.
- Zhu J.-J., Yang M., Ren Z. J. Machine learning in environmental research: common pitfalls and best practices. Environ. Sci. Technol. 2023;57:17671–17689. doi: 10.1021/acs.est.3c00026.
- Searcy R. T., Boehm A. B. A day at the beach: Enabling coastal water quality prediction with high-frequency sampling and data-driven models. Environ. Sci. Technol. 2021;55:1908–1918. doi: 10.1021/acs.est.0c06742.
- Searcy R. T., Boehm A. B. Know before you go: Data-driven beach water quality forecasting. Environ. Sci. Technol. 2023;57:17930–17939. doi: 10.1021/acs.est.2c05972.
- Li L., Qiao J., Yu G., Wang L., Li H.-Y., Liao C., Zhu Z. Interpretable tree-based ensemble model for predicting beach water quality. Water Res. 2022;211:118078. doi: 10.1016/j.watres.2022.118078.
- Pan S. J., Yang Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010;22:1345–1359. doi: 10.1109/TKDE.2009.191.
- Elahi A., Shumway D., Kowalcyk M., Shrestha A., Gautam N., Caragea D., Caragea C., Dorevitch S. Predicting surface water bacteria levels using transfer learning and domain adaptation. In 2024 IEEE International Conference on Big Data (BigData); IEEE, 2024; pp 1–10.
- US Environmental Protection Agency, Office of Water. Alternative Recreational Criteria Technical Support Materials for Alternative Indicators and Methods; EPA-820-R-14-011; December 2014.
- Zhu J.-J., Boehm A. B., Ren Z. J. Environmental machine learning, baseline reporting, and comprehensive evaluation: The EMBRACE checklist. Environ. Sci. Technol. 2024;58:19909–19912. doi: 10.1021/acs.est.4c09611.
- US Environmental Protection Agency, Office of Water. Method 1609.1: Enterococci in Water by TaqMan Quantitative Polymerase Chain Reaction (qPCR) with Internal Amplification Control (IAC) Assay; EPA-820-R-15-099; 2015 (accessed July 17, 2025).
- US Environmental Protection Agency. Ambient Water Quality Tools. https://www.epa.gov/waterdata/ambient-water-quality-tools (accessed July 16, 2025).
- US National Oceanic and Atmospheric Administration. Western US Shoreline Data (WMS-Compatible Shapefile). https://geodesy.noaa.gov/ (accessed August 3, 2025).
- US National Oceanic and Atmospheric Administration. CO-OPS API for Data Retrieval. https://api.tidesandcurrents.noaa.gov/api/prod/ (accessed July 16, 2025).
- He H., Bai Y., Garcia E. A., Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); IEEE, 2008.
- Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002;16:321–357. doi: 10.1613/jair.953.
- Elahi A. Machine Learning and Large Language Model Based Regressors to Predict Surface Water Bacteria Level Using Transfer Learning and Domain Adaptation. https://github.com/aliielahi/ONR-WQ (accessed April 23, 2025).
- Francy D. S., Brady A. M. G., Carvin R. B., Corsi S. R., Fuller L. M., Harrison J. H., Hayhurst B. A., Lant J., Nevers M. B., Terrio P. J., Zimmerman T. M. Developing and Implementing the Use of Predictive Models for Estimating Water Quality at Great Lakes Beaches; USGS Scientific Investigations Report 2013-5166; USGS, 2013; pp 1–51.
- Mas D. M. L., Ahlfeld D. P. Comparing artificial neural networks and regression models for predicting faecal coliform concentrations. Hydrol. Sci. J. 2007;52:713–731. doi: 10.1623/hysj.52.4.713.
- Motamarri S., Boccelli D. L. Development of a neural-based forecasting tool to classify recreational water quality using fecal indicator organisms. Water Res. 2012;46:4508–4520. doi: 10.1016/j.watres.2012.05.023.
- Thoe W., Wong S. H. C., Choi K. W., Lee J. H. W. Daily prediction of marine beach water quality in Hong Kong. J. Hydroenviron. Res. 2012;6:164–180. doi: 10.1016/j.jher.2012.05.003.
- Jones R. M., Liu L., Dorevitch S. Hydrometeorological variables predict fecal indicator bacteria densities in freshwater: data-driven methods for variable selection. Environ. Monit. Assess. 2013;185:2355–2366. doi: 10.1007/s10661-012-2716-8.
- Thoe W., Gold M., Griesbach A., Grimmer M., Taggart M. L., Boehm A. B. Predicting water quality at Santa Monica Beach: Evaluation of five different models for public notification of unsafe swimming conditions. Water Res. 2014;67:105–117. doi: 10.1016/j.watres.2014.09.001.
- Thoe W., Gold M., Griesbach A., Grimmer M., Taggart M. L., Boehm A. B. Sunny with a chance of gastroenteritis: Predicting swimmer risk at California beaches. Environ. Sci. Technol. 2015;49:423–431. doi: 10.1021/es504701j.
- Zhang J., Qiu H., Li X., Niu J., Nevers M. B., Hu X., Phanikumar M. S. Real-time nowcasting of microbiological water quality at recreational beaches: A wavelet and artificial neural network-based hybrid modeling approach. Environ. Sci. Technol. 2018;52:8446–8455. doi: 10.1021/acs.est.8b01022.
- García-Alba J., Bárcena J. F., Ugarteburu C., García A. Artificial neural networks as emulators of process-based models to analyse bathing water quality in estuaries. Water Res. 2019;150:283–295. doi: 10.1016/j.watres.2018.11.063.
- Xu T., Coco G., Neale M. A predictive model of recreational water quality based on adaptive synthetic sampling algorithms and machine learning. Water Res. 2020;177:115788. doi: 10.1016/j.watres.2020.115788.
- Guo J., Lee J. H. W. Development of predictive models for “very poor” beach water quality gradings using class-imbalance learning. Environ. Sci. Technol. 2021;55:14990–15000. doi: 10.1021/acs.est.1c03350.
- Tselemponis A., Stefanis C., Giorgi E., Kalmpourtzi A., Olmpasalis I., Tselemponis A., Adam M., Kontogiorgis C., Dokas I. M., Bezirtzoglou E., Constantinidis T. C. Coastal water quality modelling using E. coli, meteorological parameters and machine learning algorithms. Int. J. Environ. Res. Public Health. 2023;20:6216. doi: 10.3390/ijerph20136216.