Abstract
Effective forecasting of the Water Quality Index (WQI) considerably impacts water resource management and public health safety. This study proposes a new approach to WQI forecasting using stacked regression ensemble modeling integrated with SHAP (SHapley Additive exPlanations), a form of Explainable Artificial Intelligence (XAI). The model was developed using a dataset of 1,987 water quality samples from Indian rivers (2005–2014), processed through six optimized machine learning algorithms: XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost, combined using Linear Regression as the meta-learner. The model was trained using seven normalized physicochemical parameters as predictors, with the computed WQI (via the weighted arithmetic method) serving as the response variable. The stacked ensemble outperformed all individual models across every evaluation metric, with R² reaching 0.9952, Adjusted R² of 0.9947, MAE of 0.7637, and RMSE of 1.0704. Among the individual models, CatBoost and Gradient Boosting demonstrated the strongest standalone performance: CatBoost achieved an R² of 0.9894, Adjusted R² of 0.9883, MAE of 0.8399, and RMSE of 1.5905, while Gradient Boosting attained an R² of 0.9907, Adjusted R² of 0.9898, MAE of 1.0759, and RMSE of 1.4898. SHAP analysis revealed that DO, BOD, conductivity, and pH were the most influential parameters in predicting WQI. This integrated framework improves on existing approaches by combining high predictive accuracy and model interpretability with real-time environmental monitoring capabilities. It fosters anticipatory environmental surveillance, automated policy frameworks, and stakeholder confidence in the sustainability of water resources.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-09463-4.
Keywords: Water quality index (WQI); Machine learning; Stacked regression ensemble; SHAP; Explainable AI (XAI); Environmental monitoring; Predictive analytics; Water pollution
Subject terms: Environmental sciences, Mathematics and computing
Introduction
Water quality is critical for maintaining environmental balance, human health, and economic growth. Rapid urbanization and industrial and agricultural activities have created severe global challenges in providing clean water while depleting ground and surface water resources1. Therefore, continuous monitoring is vital for protecting drinking water, conserving biodiversity, and implementing sustainable water policies2,3. Traditionally, water quality assessment involves measuring multiple physicochemical4,5 and microbiological parameters, such as dissolved oxygen (DO), pH, conductivity, BOD, nitrates, and fecal and total coliforms, and compiling them into a WQI to simplify the data for stakeholders6. The basic classification of WQI into categories from “excellent” to “unsuitable” (as shown in Table 1S, attached in the Supplementary file) aids interpretation7 but lacks sensitivity to subtle environmental shifts8. Regulatory bodies such as the WHO, BIS, and CPCB have set allowable limits for parameters (e.g., DO ≤ 10 mg/L, pH 6.5–8.5, conductivity ≤ 1000 µS/cm, BOD ≤ 5 mg/L) to protect health and ecosystems, as summarized in Table 2S (attached in the Supplementary file). However, conventional lab-based evaluations are expensive, slow, and labor-intensive, offering only static snapshots of dynamic systems9. Emerging machine learning (ML) techniques can process complex datasets efficiently, reveal nonlinear relationships, and have been applied to predicting water quality and identifying pollution sources10–12. Most existing ML models are classification-based, reactive, and less sensitive to subtle changes, whereas regression-based models provide better precision in predicting continuous outcomes13. Although various models have been developed for WQI prediction using machine learning methods, most are threshold-classification-based or lack the transparency required for regulatory purposes.
In addition, single-point predictive models tend to overfit or undergeneralize on heterogeneous environmental datasets. There is therefore a need for a new ensemble structure that improves predictive quality, model interpretability, and operational readiness for real-time water monitoring systems. The presented model meets these objectives by combining the predictive accuracy of multiple regression learners with SHAP-based interpretability. Stacked ensemble models combining algorithms such as XGBoost, CatBoost, and AdaBoost have outperformed other models in prediction accuracy and robustness14,15, though regression-based ensemble WQI prediction remains underexplored16. The black-box nature of deep learning models hinders trust and adoption17, but Explainable AI (XAI), particularly SHapley Additive exPlanations (SHAP), addresses this by providing transparent, feature-level insights18–21. This study introduces a novel framework combining stacked regression ensembles with SHAP-based explainability to monitor WQI in real time. The primary objective of this research is to develop and evaluate a framework for continuous prediction of the WQI using stacked ensemble regression models with SHAP-based explainable AI for model interpretability. The framework produces accurate WQI predictions, contributes to real-time environmental monitoring, and provides usable information for sustainable water resource management.
As shown in Fig. 1, the methodology includes pre-processing (median imputation, Interquartile Range (IQR) outlier detection, and normalization), EDA (Fig. 2, with a correlation heatmap), and model training using ensemble regressors fed into a Linear Regression meta-model with five-fold cross-validation (CV). This approach transitions from categorical to regression modeling22, enhances prediction performance14, and fosters stakeholder trust through SHAP interpretability23, promoting sustainable governance and effective water resource management.
Fig. 1.
Analytical Workflow for Continuous WQI Prediction and Explainability Analysis.
Fig. 2.
Pearson Correlation Heatmap of Water Quality Parameters.
Effective water quality management protects human health and ecosystems and supports economic sustainability. Traditionally, monitoring relied on laboratory-based measurements of physical, chemical, and microbiological parameters5 such as DO, BOD, pH, conductivity, nitrates, fecal coliform, and total coliform, aligned with standards from organizations such as the WHO, BIS, and CPCB24,25. While accurate, these methods are resource-intensive, time-consuming, and limited to discrete samples, offering only temporal snapshots without real-time responsiveness26,27. To overcome these limitations, the environmental sciences have embraced computational modeling, particularly ML, for its fast data processing, ability to handle large datasets, and effectiveness in capturing nonlinear parameter relationships28,29. ML has been successfully applied in hydrological modeling, pollution tracking, air quality forecasting, and soil assessments30, offering improved accuracy and near real-time predictive capabilities. Early ML-based water quality studies primarily used classification models to categorize water quality (e.g., excellent, good, poor) based on thresholds31, such as the WQI system32. Classification models are constrained by threshold-based decision boundaries, which reduce their effectiveness on continuous or nuanced environmental data: they struggle with sensitivity near thresholds and lack the resolution to capture fine changes in water quality indicators, limiting timely responses33. To address this, regression-based ML models such as Random Forest, XGBoost, CatBoost, and Gradient Boosting have been increasingly adopted to continuously predict water quality indicators, providing more precise and dynamic environmental monitoring34–36. However, model selection remains challenging, as each algorithm has limitations37.
Thus, stacking ensemble models have emerged, combining multiple algorithms in a meta-learning framework to improve prediction accuracy and robustness38,39, particularly in water pollution and air quality applications40. Despite their strength, ensemble models often suffer from interpretability issues, making them difficult for stakeholders to trust41. To improve transparency, XAI techniques, especially SHAP, have gained attention for attributing model outcomes to specific inputs using cooperative game theory42–44. This study addresses a critical gap by integrating SHAP-based interpretability with a stacked regression ensemble (XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, AdaBoost), validated through five-fold CV, to deliver continuous, accurate, and transparent WQI predictions45. The proposed framework advances real-time water quality monitoring by combining predictive precision with explainability and fosters stakeholder trust in environmental decision-making. The uniqueness of this research lies in combining a SHAP-based XAI framework with a stacked regression ensemble model for continuous, real-time WQI prediction. This effort enhances the existing body of work on WQI modeling through the following key contributions: (i) We developed a stacked ensemble framework from six powerful ML models: XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost, with Linear Regression as the meta-learner. The ensemble outperformed individual models with remarkable predictive accuracy (R² = 0.995, RMSE = 1.07, MAE = 0.764). (ii) With SHAP integration, structural interpretability provided global and local insight into feature importance; DO, BOD, conductivity, and pH proved to be the most significant water quality parameters.
(iii) CatBoost (standalone R² = 0.989, RMSE = 1.591) and Gradient Boosting (R² = 0.991, RMSE = 1.490) performed strongly on their own, further strengthening the rationale for stacked ensemble modeling and hybridization. (iv) Since the framework is intended for real-time monitoring, it can easily integrate with Internet of Things (IoT)-based water quality sensor networks, facilitating advance notices and proactive water management. (v) As presented, the approach is both scalable and transferable and serves as an efficient decision support system for stakeholders, policymakers, and managers of ecological systems focused on sustainable governance of water resources.
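As a conceptual illustration of the SHAP attributions used above, the sketch below computes exact Shapley values by brute force for a hypothetical toy scoring function. This is purely illustrative: the study itself uses the SHAP library's efficient tree explainers, and `toy_model`, its coefficients, and the zero baseline are assumptions introduced here, not the study's model.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    # Hypothetical "WQI-like" score from three features (e.g., DO, BOD, pH);
    # the coefficients are illustrative, not fitted to any data.
    return 10 * x[0] - 5 * x[1] + 2 * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all coalitions, with absent features set to the baseline."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Coalition weight: |S|! (n - |S| - 1)! / n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

x, base = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
print(shapley_values(toy_model, x, base))  # attributions sum to f(x) - f(base)
```

For a linear model the attributions reduce to coefficient times deviation from baseline, which is why SHAP summaries of tree ensembles can be read as feature-level contributions to each individual WQI prediction.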
Methodology
Data collection and description
The dataset used in this study was obtained from the publicly available Kaggle repository, titled Indian Water Quality Data by Anbarivan and Vasudevan (2020)46, accessible at https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data (data accessed on March 10, 2025). The dataset includes 1,987 water samples collected from multiple rivers across various states of India between 2005 and 2014, offering a comprehensive and validated record of physicochemical and microbial water quality parameters. This historical data provides a reliable foundation for training machine learning models. The framework developed in this study is designed to be scalable and adaptable to real-time datasets as updated water quality records or sensor-based data streams become available. The applicability of machine learning models for water quality prediction depends on high-quality laboratory data; a model can only be as reliable as the underlying dataset from which it learns. This dataset features the following critical indicators of water quality: DO, BOD, pH, conductivity, nitrate, fecal coliform, and total coliform counts. The parameter “NITRATENAN N + NITRITENANN (mg/L)” represents the combined concentration of nitrate-nitrogen (NO₃⁻-N) and nitrite-nitrogen (NO₂⁻-N), commonly used to assess the total inorganic nitrogen load in surface water samples. The samples were collected across different levels of industrial, agricultural, and urban socio-economic activity to capture a wide range of environmental conditions. It is noted that parameters required for ionic charge balance analysis, such as sulfates, phosphates, chlorides, and TDS, were not available in the dataset.
However, the selected parameters are aligned with key regulatory standards and are sufficient for WQI calculation as per national and international guidelines. Descriptive statistics containing the summary measures of each parameter (mean, standard deviation, range) are presented in Table 1. The mean BOD value of 1.8555 mg/L is within the standard acceptable limit (≤ 5 mg/L) for surface water as per CPCB and WHO guidelines, indicating generally good organic load conditions across the dataset.
Table 1.
Descriptive statistics of water quality parameters.
| Parameter | Valid Count (within limits) | Total Rows | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|
| D.O. (mg/L) | 1754 | 1987 | 6.7556 | 0.7837 | 5 | 6.3 | 6.8 | 7.2 | 10 |
| pH | 1771 | 1987 | 7.318 | 0.4461 | 6.5 | 7 | 7.3 | 7.65 | 8.5 |
| CONDUCTIVITY (µmhos/cm) | 1660 | 1987 | 224.76 | 226.809 | 0.4 | 70 | 137.5 | 292.775 | 1000 |
| B.O.D. (mg/L) | 1644 | 1987 | 1.8555 | 1.0692 | 0.1 | 1.074 | 1.6 | 2.3 | 5 |
| NITRATENAN N + NITRITENANN (mg/L) | 1944 | 1987 | 1.4474 | 2.7766 | 0.01 | 0.3 | 0.516 | 1.2213 | 25.71 |
| FECAL COLIFORM (MPN/100 ml) | 642 | 1987 | 26.5886 | 26.8344 | 0.1 | 6 | 17 | 38.75 | 100 |
| TOTAL COLIFORM (MPN/100 ml) | 1348 | 1987 | 297.331 | 259.945 | 2 | 66.75 | 206.5 | 468 | 1000 |
Data preprocessing and exploratory data analysis (EDA)
A systematic preprocessing procedure was carried out before model training47 (Fig. 1). The initial stage of data cleaning targeted missing values and statistical outliers. For instance, the variable “FECAL COLIFORM (MPN/100 ml)” had only 642 valid values among 1,987 records (≈ 32.3%). Records with values above the ideal regulatory level of 100 MPN/100 ml, as well as values less than or equal to zero, were removed from the dataset to maintain data quality. Because the variable’s distribution was non-normal and skewed, missing values were replaced using median imputation, which is appropriate for data with high variability and outliers and provides a reliable estimate of central tendency without being swayed by extreme values. As expected, the reduced representation of this feature decreased its contribution to the final ensemble model, as shown in the SHAP-based interpretability analysis. To further improve data quality, anomalous observations beyond the IQR limits were removed using IQR-based outlier detection. The data were then normalized to ensure a uniform scale across parameters, an essential step preceding Exploratory Data Analysis (EDA) and predictive modeling. EDA examined the data for patterns and inter-parameter relationships, and statistical summaries were reviewed to check parameter distributions and variance and to highlight any biases or abnormalities. Further, Pearson’s correlation analysis was conducted to measure the linear relationships between parameters, visualized as a correlation heatmap48 (Fig. 2). The heatmap in Fig. 2 shows the Pearson correlations between the physicochemical and microbiological variables used in the prediction of WQI. A strong negative correlation is observed between DO and BOD (r ≈ −0.65).
This is consistent with environmental theory, which states that an increased organic load uses up dissolved oxygen available for aquatic species. Similarly, conductivity and total coliform were weakly and positively correlated (r ≈ 0.14), suggesting that they measure separate pollutant classes (ionic and microbial, respectively) and are therefore not expected to be strongly related. Fecal coliform and BOD had a very weak positive correlation (r ≈ 0.03), and total coliform and BOD had a weak negative correlation (r ≈ −0.09), indicating low linear dependence between microbial and organic pollution indicators. In addition, correlations between nitrate-nitrite and the other parameters were weak at best, indicating that it acts as an independent variable. These interdependencies informed feature selection by identifying variables that provide unique, non-redundant environmental signals. This statistical rationale supports the inclusion of parameters measuring a variety of pollution mechanisms, reduces multicollinearity, and makes model investigation and interpretation more defensible. Additional support for these correlation-based findings is provided by the SHAP analysis in Sect. 4.4.
These interdependencies heavily influenced feature selection for the subsequent models. They helped prioritize features that were statistically relevant yet environmentally distinct, i.e., BOD for organic load, DO for oxygenation level, and conductivity for ionic content, thus reducing multicollinearity and improving the predictive stability of the ensemble model. The final dataset, after cleaning and pre-processing, comprised 390 complete records across 8 columns: 7 predictor variables and the WQI response variable. A stratified 80:20 training-test split yielded 312 training samples and 78 test samples. The stratified partition provided representative coverage of WQI categories, ensuring suitable representation of each class in both the training and test partitions. In so doing, both sets were drawn from the same distribution, consistent with robust model evaluation protocols.
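The preprocessing chain described above (median imputation, IQR-based outlier removal, min-max normalization) can be sketched as follows. This is a minimal NumPy illustration on a hypothetical single column, not the study's exact pipeline.

```python
import numpy as np

def preprocess(col):
    """Median-impute NaNs, drop IQR outliers, then min-max normalize."""
    col = np.asarray(col, float)
    # Median imputation: robust to the skewed, non-normal distributions noted above
    col = np.where(np.isnan(col), np.nanmedian(col), col)
    # IQR-based outlier detection with the conventional 1.5*IQR fences
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    keep = (col >= q1 - 1.5 * iqr) & (col <= q3 + 1.5 * iqr)
    col = col[keep]
    # Min-max normalization to a uniform [0, 1] scale
    return (col - col.min()) / (col.max() - col.min())

# Hypothetical DO-like column with one missing value and one obvious outlier
x = np.array([6.3, 6.8, np.nan, 7.2, 6.5, 40.0])
print(preprocess(x))
```

The outlier (40.0) falls outside the IQR fences and is dropped, while the missing value is filled with the column median before scaling.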
Calculation of WQI
The WQI was computed using a widely accepted weighted arithmetic method. The specific formulation employed in this study is the Weighted Arithmetic Water Quality Index (WAWQI) method, as described by Brown et al. (1972), which is widely recognized for its simplicity and regulatory alignment. This method aggregates normalized sub-indices of individual parameters weighted according to their relative importance. Compared to methods like the National Sanitation Foundation WQI (NSFWQI), which incorporates subjective weighting and limited parameters, or the Canadian Council of Ministers of the Environment WQI (CCME-WQI), which emphasizes exceedance and scope, the WAWQI offers a balance between regulatory relevance and computational transparency. Its additive linear structure is particularly suitable for integration with supervised machine learning frameworks.
This method has two main stages: calculating quality ratings (qi) for each parameter against the prescribed acceptable limits (Table 2S) and assigning unit weights (wi) that are inversely proportional to these limits32. The overall WQI was obtained through parameter score aggregation, where the weighted parameter ratings were consolidated into a single comprehensive value interpreted against defined classification thresholds (Table 1S). This allowed the model to make unambiguous, consistently interpretable forecasts. The weighted arithmetic method was chosen for its regulatory acceptance, computational simplicity, and compatibility with predictive ML frameworks. The WQI is determined as the weighted average of the individual index values of the selected parameters. In this study, WQI scores were used to classify water samples, with the following equations used for the calculation:49
Table 2.
Overall stacked ensemble performance metrics (R², adjusted R², MAE, RMSE).
| Metric | Value |
|---|---|
| R² | 0.9952 |
| Adjusted R² | 0.9947 |
| MAE | 0.7637 |
| RMSE | 1.0704 |
$$\mathrm{WQI} = \frac{\sum_{i=1}^{N} q_i w_i}{\sum_{i=1}^{N} w_i} \quad (1)$$

where $q_i$ represents the quality rating scale for the parameter, calculated as:

$$q_i = 100 \times \frac{V_i - V_0}{S_i - V_0} \quad (2)$$

$w_i$ is the unit weight for each parameter, given by:

$$w_i = \frac{K}{S_i} \quad (3)$$

The proportionality constant $K$ is determined as:

$$K = \frac{1}{\sum_{i=1}^{N} (1/S_i)} \quad (4)$$

In these equations:
N is the total number of parameters considered.
Vi is the measured value of the water quality parameter from the tested sample.
V0 is the reference value for pure water (typically 0 for all parameters, except DO at 14.6 mg/L and pH at 7.0)50.
Si represents the standard or expected value for the parameter.
K serves as the proportionality constant.
Additionally, the Water Quality Classification (WQC) is taken into account and can be determined based on the criteria.
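The weighted arithmetic computation of Eqs. (1)–(4) can be sketched in a few lines of Python. The standard limits (Si) and ideal values (V0) below are illustrative placeholders for three parameters only; the study's actual regulatory limits are listed in its Table 2S.

```python
def wawqi(measured, standards, ideals):
    """Weighted Arithmetic WQI: measured, standards, ideals are dicts keyed by parameter."""
    # Proportionality constant K = 1 / sum(1/Si)            (Eq. 4)
    K = 1.0 / sum(1.0 / s for s in standards.values())
    num, den = 0.0, 0.0
    for p, Vi in measured.items():
        Si = standards[p]
        V0 = ideals.get(p, 0.0)           # V0 = 0 unless specified (DO, pH)
        qi = 100.0 * (Vi - V0) / (Si - V0)  # quality rating      (Eq. 2)
        wi = K / Si                         # unit weight         (Eq. 3)
        num += qi * wi
        den += wi
    return num / den                        # WQI                 (Eq. 1)

# Illustrative example (standards and sample values are hypothetical)
standards = {"DO": 10.0, "pH": 8.5, "BOD": 5.0}
ideals = {"DO": 14.6, "pH": 7.0}
sample = {"DO": 6.8, "pH": 7.3, "BOD": 1.6}
print(round(wawqi(sample, standards, ideals), 2))
```

Note that by construction the unit weights sum to one, so the denominator in Eq. (1) normalizes the index onto the standard WQI scale.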
ML models and stacked ensemble framework
The supervised learning models were designed to predict the WQI, calculated using the weighted arithmetic method, from normalized input parameters including DO, BOD, pH, conductivity, nitrate, fecal coliform, and total coliform. Initial assessments suggested that several parameters (e.g., DO, BOD, pH, conductivity) interacted nonlinearly, a finding reinforced by the Pearson correlation exploration and the behavior of SHAP values. We considered PLS regression, but it exhibited weaker predictive capability and could not capture enough complexity to model the interactions in environmental processes. Therefore, we relied on non-linear ensemble models, such as XGBoost, CatBoost, and Gradient Boosting, which have demonstrated strength in modeling structured data and predictive robustness. To preserve methodological soundness and reliability in predicting the WQI, we implemented six well-known regression-based ML models: XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost. These models were selected for their combined advantages in handling non-linear interactions and their robustness to the noise, heteroscedasticity, and other complexities of high-dimensional structured environmental data51. Boosting algorithms such as XGBoost and CatBoost achieve high predictive accuracy through effective regularization techniques52, while Random Forest and Extra Trees reduce overfitting53. Although AdaBoost is less robust, it is an asset when working with imbalanced datasets54. This ensemble-based approach was selected over single-model or classification-based alternatives due to its ability to handle non-linear environmental interactions, reduce model-specific biases, and offer superior generalization across complex WQI datasets55.
By incorporating these models into a stacked ensemble with SHAP-based explainability56, the proposed framework achieved exceptional accuracy (R² = 0.995, RMSE = 1.07), surpassing the performance of standalone models and demonstrating its effectiveness for real-time, interpretable WQI predictions. A stratified 80:20 train-test split was adopted to ensure balanced representation across WQI categories, which is crucial for reliable performance evaluation using cross-validation (CV); model performance was assessed using R², Adjusted R², RMSE, MAE, and loss curves. With this split, the final filtered dataset of 390 samples yielded 312 training records and 78 test records. Data preparation steps taken before modeling to ensure data integrity included removing outliers and using median rather than mean imputation. Due to the limited post-cleaning sample size (n = 390), K-fold cross-validation was not applied to the base models to avoid fold instability. However, five-fold cross-validation was used at the stacked ensemble level to validate the meta-learner, as illustrated in Fig. 3. This strategy provided reliable generalization, supported by minimal RMSE differences between the training and test sets. XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost are the six robust regression-based machine-learning algorithms selected to predict continuous WQI values independently. Each algorithm’s hyperparameters were first set to defaults and then optimized via grid search to balance accuracy with generalization. The model-specific hyperparameters and settings are detailed in Tables 3S to 8S (attached in the Supplementary file).
Fig. 3.
Architecture of the Stacked Regression Ensemble with Linear Regression as Meta-Learner.
Table 3.
Comparative analysis of regression algorithms used in the stacking ensemble, highlighting strengths, computational efficiency, and limitations.
| Model | Advantages | Limitations | Reference |
|---|---|---|---|
| XGBoost | Computationally efficient, handles missing values automatically, supports regularization (L1 & L2), parallel execution, effective for structured data | Requires extensive hyperparameter tuning, computationally expensive for large datasets | 45 |
| CatBoost | Optimized for categorical data, built-in handling of missing values requires minimal preprocessing, supports GPU acceleration | High memory consumption during training, complex hyperparameter optimization | 57 |
| Random Forest | Robust against noise and outliers, interpretable through feature importance, capable of handling high-dimensional data | Computationally intensive for large datasets, risk of overfitting with excessive trees, slower inference time | 29 |
| Gradient Boosting | High predictive accuracy, sequential model refinement for error minimization, works effectively with structured datasets | Computationally expensive, prone to overfitting if not properly tuned, slow training | 58 |
| Extra Trees | Reduces variance compared to Random Forest, computationally faster due to random feature selection, effective generalization | Higher variance than other tree-based models, less interpretable results | 46 |
| AdaBoost | Adaptive weighting enhances weak learners, performs well on small datasets, simple implementation | Highly sensitive to noise and outliers, susceptible to overfitting, performance depends on weak learner selection | 59 |
Subsequently, predictions generated by individual base regressors were integrated into a stacked ensemble framework (Fig. 3). Linear Regression served as the meta-learner due to its simplicity, interpretability, and computational efficiency. The stacking ensemble utilized a rigorous five-fold CV method to validate model generalizability, mitigating overfitting risks robustly. Stacking systematically leveraged each model’s strengths, capturing diverse parameter interrelations and significantly enhancing overall predictive accuracy and stability compared to standalone regressors.
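The stacking architecture of Fig. 3 can be sketched with scikit-learn's `StackingRegressor`, which generates out-of-fold base-model predictions via internal cross-validation (`cv=5`) and fits a Linear Regression meta-learner on them. To keep this sketch dependency-light, XGBoost and CatBoost are omitted, synthetic data stands in for the seven normalized water quality predictors, and a plain random split replaces the study's stratified one, so this is an illustration of the technique rather than the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor, AdaBoostRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in: 390 samples, 7 normalized predictors, WQI-like response
X = rng.random((390, 7))
y = 60 * X[:, 0] - 25 * X[:, 3] + 10 * X[:, 1] * X[:, 2] + rng.normal(0, 1, 390)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingRegressor(random_state=42)),
        ("et", ExtraTreesRegressor(n_estimators=100, random_state=42)),
        ("ada", AdaBoostRegressor(random_state=42)),
    ],
    final_estimator=LinearRegression(),  # meta-learner, as in Fig. 3
    cv=5,                                # five-fold CV for out-of-fold predictions
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))  # test-set R²
```

The meta-learner sees only the base models' cross-validated predictions, which is what mitigates the leakage and overfitting risks that a naive "fit everything on the same data" stacking scheme would incur.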
Model evaluation metrics
The performance of each model was evaluated using four standard metrics: Coefficient of Determination (R²), Adjusted R², Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). These were calculated as follows:56
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \quad (5)$$

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \quad (6)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \quad (7)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \quad (8)$$
Where n denotes the number of observations, k represents the number of predictors, yi is the actual observed value, and ŷi is the predicted value. SS_res refers to the residual sum of squares, the sum of squared differences between actual and predicted values, while SS_tot denotes the total sum of squares, the sum of squared differences between the actual values and their mean. Although Adjusted R² originates from parametric regression, it is commonly applied in ML contexts to penalize models with excessive feature complexity and support fair comparison. All of these evaluation metrics were calculated for each model as well as for the stacked ensemble, as shown in the comparative analyses (Table 9S, attached in the Supplementary file). This thorough assessment allowed the best predictive modeling approach to be selected against clear and quantifiable evaluation criteria without ambiguity.
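The four metrics of Eqs. (5)–(8) translate directly into NumPy; the sketch below follows the definitions given above, with k the number of predictors (7 in this study).

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """Compute R², Adjusted R², MAE, and RMSE per Eqs. (5)-(8)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                         # Eq. (5)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # Eq. (6)
    mae = np.mean(np.abs(y_true - y_pred))             # Eq. (7)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # Eq. (8)
    return {"R2": r2, "Adj_R2": adj_r2, "MAE": mae, "RMSE": rmse}

# Tiny worked example with a constant prediction offset of 0.5
print(regression_metrics([0, 1, 2, 3], [0.5, 1.5, 2.5, 3.5], k=1))
```

Because Adjusted R² subtracts a penalty growing with k, it can only decrease relative to R² when extra predictors add little explanatory power, which is why it is used alongside R² for fair model comparison.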
Results and discussion
Evaluation and comparison of individual regression models
The performance of the individual regression models, including XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost, was assessed using standard statistical measures: R², Adjusted R², MAE, and RMSE. Results for the trials conducted on these models are compiled in Table 9S (attached in the Supplementary file). Among the evaluated algorithms, CatBoost achieved an R² of 0.9894 and Gradient Boosting an R² of 0.9907, both showing superior accuracy among the individual models, with lower RMSE and MAE than the remaining base learners. While Extra Trees and AdaBoost performed reasonably well, their predictive performance was notably lower. These differences can be explained by underlying algorithmic differences, particularly CatBoost’s adeptness at utilizing categorical features and XGBoost’s ability to exploit complex interactions and nonlinear relationships among features within environmental datasets.
Further analyses of convergence and stability were performed using the training loss curves (Fig. 4a–d), which show how RMSE drops over training iterations. These plots show that CatBoost and XGBoost not only had better final predictive performance but also converged faster, indicating greater efficiency. This again illustrates how much algorithm choice and hyperparameter optimization matter and confirms that gradient-boosted decision tree methods are effective for complex environmental modeling problems. The training convergence (RMSE loss curve) results are shown below.
Fig. 4.
Model training performance based on loss curves (RMSE vs. iterations) for ensemble regressors. (a) XGBoost loss curve, (b) CatBoost loss curve, (c) Gradient Boosting loss curve, and (d) AdaBoost loss curve.
The R² scores and RMSE values provided in Table 10S (attached in the Supplementary file) were utilized to evaluate the predictive performance of the different models. For models that support iterative training, such as XGBoost, CatBoost, and Gradient Boosting, training loss curves were also produced to assess convergence reliability and provide additional insight into predictive stability. The training loss curves in Fig. 4(a–d) illustrate the RMSE convergence behavior of each ensemble regressor. CatBoost and XGBoost not only achieved the lowest final RMSE values but also demonstrated rapid and stable convergence during training, indicating strong learning efficiency and lower overfitting risk. Gradient Boosting, while accurate, showed slower convergence, and AdaBoost displayed a higher final RMSE and more fluctuation, reflecting its relative sensitivity to noisy features. These trends reinforce the suitability of gradient-boosted frameworks, especially CatBoost and XGBoost, for complex water quality datasets where interactions and non-linearities are prominent. This convergence analysis further supports the decision to include these models in the final stacked ensemble.
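Loss curves like those in Fig. 4 can be generated for any iterative learner by evaluating RMSE at each boosting stage. The sketch below uses scikit-learn's `GradientBoostingRegressor` with its `staged_predict` method on synthetic stand-in data; the study's curves come from its own trained models, so this only illustrates the mechanism.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the 7 water quality predictors and WQI response
X = rng.random((390, 7))
y = 50 * X[:, 0] - 20 * X[:, 3] + rng.normal(0, 1, 390)

model = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)

# staged_predict yields predictions after each boosting iteration,
# giving one training-RMSE point per stage (the loss curve)
rmse_curve = [np.sqrt(np.mean((y - y_hat) ** 2)) for y_hat in model.staged_predict(X)]
print(f"RMSE at iteration 1: {rmse_curve[0]:.2f}, at iteration 200: {rmse_curve[-1]:.2f}")
```

Plotting `rmse_curve` against the iteration index (e.g., with matplotlib) reproduces the monotone training-loss decay visible in the figure; a held-out curve computed the same way on test data would additionally expose overfitting as a divergence between the two curves.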
Model performance
The stacking ensemble, in which a Linear Regression meta-learner integrates the predictions of the base models, performed considerably better than the individual regressors. As measured by R², adjusted R², MAE, and RMSE, the ensemble's predictive performance exceeded the single-model approaches, with an R² score nearing 0.995 and an RMSE of about 1.07, as highlighted in Table 11 S (attached in the Supplementary file). These gains result from the stacking model's ability to reduce the bias and variance of individual models by efficiently combining the predictions of all base learners. Ensemble methods such as XGBoost, CatBoost, and Gradient Boosting provide robustness by aggregating predictions from multiple weak learners, thereby reducing bias and variance. Such accuracy is valuable when building predictive models on environmental datasets, which are inherently non-linear and often noisy. Ensemble models also handle multicollinearity, feature interactions, and missing predictor data better than many alternatives, improving their reliability and generalization to future datasets. In this study, the stacked ensemble framework improved on the individual models across key statistical measures of fit, demonstrating its robustness for real-time, scalable WQI modeling. The stacking structure in Fig. 3 shows how the outputs of the base regressors serve as inputs to the linear regression meta-learner, illustrating how the base models are combined to improve predictive accuracy. Through fivefold CV, the predictive accuracy and robustness of the predictors were tested thoroughly, demonstrating the predictive consistency and generalization of the stacking ensemble.
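The stacking design described above can be sketched with scikit-learn's `StackingRegressor`. To keep the example self-contained it uses only scikit-learn-native base models (the study also used XGBoost and CatBoost) on synthetic data; the `cv=5` argument mirrors the fivefold design, since the meta-learner is fitted on out-of-fold base predictions.

```python
# Minimal sketch of a stacked regression ensemble with a Linear Regression
# meta-learner. Base models are a subset of those used in the study.
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=7, noise=3.0, random_state=0)

base_learners = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=50, random_state=0)),
    ("ada", AdaBoostRegressor(random_state=0)),
]

# The meta-learner is trained on out-of-fold predictions of the base models,
# which is what lets stacking reduce the bias/variance of any single learner.
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=LinearRegression(), cv=5)

scores = cross_val_score(stack, X, y, cv=5, scoring="r2")
print(f"Fivefold CV R²: {scores.mean():.4f} ± {scores.std():.4f}")
```

The same pattern extends directly to XGBoost and CatBoost estimators, since any regressor with a scikit-learn-compatible `fit`/`predict` interface can be added to `base_learners`.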
This positions stacking ensembles as powerful tools for environmental predictive analytics and highlights the limitations of traditional single-model approaches. As illustrated in Table 2, which captures the analytical results of the different predictive models, the Stacked Regression Ensemble was evaluated using standard statistical metrics (R², Adjusted R², MAE, RMSE) and consistently outperformed the other models.
The stacked ensemble model achieved an R² of 0.9952 and an RMSE of 1.0704, outperforming all individual regressors across R², Adjusted R², MAE, and RMSE (Table 11 S of the Supplementary Material) and demonstrating superior generalization and predictive strength. In contrast, the best individual models, Gradient Boosting and CatBoost, reached R² values of 0.9907 (RMSE = 1.4898) and 0.9894 (RMSE = 1.5905), respectively. This demonstrates the value of stacked ensemble models in generating predictions that are both more accurate and more stable. Each regression-based ML model included in the stacking framework has distinct advantages and limitations that affect predictive performance, computational efficiency, and practical utility. These characteristics are presented in Table 3, providing practical guidance for selecting highly accurate predictive models according to dataset complexity, available computational resources, and precision requirements. Unlike previous studies that employed individual models or limited ensembles, our stacked regression ensemble integrated with SHAP-based explainability not only achieved the highest predictive accuracy (R² = 0.9952, RMSE = 1.0704) but also addressed the black-box challenge, offering both precision and interpretability in a unified framework (Table 3).
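The Adjusted R² reported alongside R² corrects for the number of predictors p relative to the sample size n. A minimal sketch of the formula, using the study's seven predictors; the evaluation sample size n = 398 (a 20% hold-out of the 1,987 samples) is an assumption for illustration, as the exact split is not stated in this section.

```python
# Adjusted R² penalizes R² for model complexity:
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted coefficient of determination for n samples, p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustration with the reported ensemble R² and seven predictors;
# n = 398 is an assumed hold-out size, not a figure from the paper.
print(round(adjusted_r2(0.9952, 398, 7), 4))
```

Because the correction factor (n − 1)/(n − p − 1) exceeds 1, Adjusted R² is always slightly below R², as seen in the reported pairs (e.g., 0.9952 vs. 0.9947 for the ensemble).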
Model interpretability using SHAP analysis
SHAP analysis was carried out to improve predictive transparency by explaining, through a game-theoretic approach, how individual features contribute to model-level predictions. Global and local model interpretability were systematically examined using SHAP's TreeExplainer with the XGBoost models. Several SHAP visualizations were created, including the SHAP parameter value summary plot (Fig. 5a), dependence plot (Fig. 5b), individual SHAP value prediction force plot (Fig. 5c), and the distribution of SHAP values for each parameter (Fig. 5d). These plots depict the distribution of feature contributions and show the effect each predictor has on the model output. The SHAP visualization framework enhanced predictive model transparency, interpretability, and stakeholder confidence.
Fig. 5.
SHAP-based model interpretability visualizations. (a) SHAP summary plot illustrating the overall influence of each feature on model output. (b) SHAP dependence plot highlighting interactions between individual features and their corresponding SHAP values. (c) SHAP force plot showing localized, sample-specific explanations of individual predictions. (d) SHAP plot representing the distribution and variability of feature contributions across all observations. SHAP values were calculated at the base learner level using TreeExplainer and summarized for the meta-learner using a model-agnostic SHAP approach to increase the interpretability of the proposed stacked ensemble framework. This allowed for the global and local explanation of the model's predictions, as outlined in the following section.
Feature importance and interpretability through SHAP analysis
Interpretability is crucial for the responsible use of ML models in environmental monitoring, especially for sensitive purposes such as compliance with regulated thresholds, risk evaluation, or health impact assessment. While regression models provide granular, continuous forecasts of WQI and are suited to trend prediction, classification models retain practical significance: they are effective in communicating categorical water quality status (e.g., Excellent, Poor) and are often used in regulatory decision-making based on defined thresholds. A hybrid approach can therefore be beneficial depending on end-user requirements. Figure 5a–d provides the multi-level SHAP analysis: (a) the feature-wise summary importance, (b) the dependence of WQI prediction on DO with pH interaction, (c) the force plot for an individual prediction, and (d) SHAP value distributions across all samples. Together, these figures provide global and local interpretability of the model's reasoning and further justify BOD, DO, pH, and conductivity as key determinants of WQI. In this work, we applied SHAP, an explainable AI (XAI) method grounded in cooperative game theory, to our stacked regression ensemble framework to make the models more interpretable. SHAP values quantify each feature's contribution to a given prediction, indicating whether the feature pushed the predicted outcome higher or lower. This was especially useful because our model was highly accurate (R² = 0.9952, RMSE = 1.0704), enabling us to decompose intricate behavioral patterns into simple, actionable feature-level insights. To ensure broad interpretability of the outcome, four SHAP-based visualization tools were used: summary plots, dependence plots, force plots, and parameter-wise SHAP value distributions.
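The property that makes the force plot in Fig. 5c readable is local additivity: the baseline prediction plus the sum of a sample's SHAP values equals the model's output for that sample. For a linear model with features treated independently, the exact Shapley value of feature i reduces to w_i · (x_i − mean_i). The toy sketch below illustrates this property with an invented linear surrogate (all coefficients, means, and values are illustrative); it is not the study's TreeExplainer pipeline.

```python
# SHAP local additivity: baseline + sum(contributions) == prediction.
def linear_shap(weights, x, feature_means):
    """Exact Shapley values for a linear model f(x) = sum(w_i * x_i)."""
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, feature_means)]

# Hypothetical linear surrogate: WQI driven by DO (+), BOD (-), pH (+).
weights = [4.0, -3.5, 2.0]   # per-feature coefficients (illustrative)
means = [6.5, 3.0, 7.2]      # dataset feature means (illustrative)
sample = [7.8, 4.9, 7.8]     # one water sample: DO, BOD, pH

baseline = sum(w * mu for w, mu in zip(weights, means))
phi = linear_shap(weights, sample, means)
prediction = sum(w * xi for w, xi in zip(weights, sample))

print("SHAP contributions:", [round(v, 2) for v in phi])
print("baseline + sum(phi) =", round(baseline + sum(phi), 2))
print("model prediction    =", round(prediction, 2))
```

The same additivity holds for TreeExplainer over tree ensembles, which is why each force plot decomposes exactly into per-feature pushes above or below the baseline.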
Each of these has a different analytical focus: identifying the main contributors to WQI, describing their interactions with other variables, explaining particular predictions, and evaluating the stability of each feature's contribution across the data. The SHAP summary plot (Fig. 5a) showed a clear ranking of features based on the magnitude and sign of their SHAP values. Higher BOD concentrations generally carried negative SHAP values (lower WQI predictions), while lower BOD concentrations provided positive contributions, as expected from an environmental standpoint given the link between organic pollution and oxygen demand. This ranking not only reflects statistical importance but also aligns with ecological understanding, confirming that the model internalized key environmental interactions such as organic loading from BOD and oxygen replenishment from DO. BOD had the greatest effect on model predictions, with SHAP values ranging from −12 to +24, making it the most important negative contributor to WQI; high BOD values are well known to indicate organic pollution and low DO availability, and thus degraded water quality. The ensemble model learned this environmental rule correctly, as confirmed by the SHAP analysis. pH ranked next, with SHAP contributions ranging from approximately −10 to just under +20, confirming its strong but context-sensitive influence on WQI prediction. Higher pH values (> 7.5) were largely beneficial predictors of WQI, though negative SHAP values were also noted. These negative values appeared in samples combining high pH with low DO and/or high BOD, presumably reflecting interactions that the algorithm successfully identified. This phenomenon highlights the model's ability to capture complex environmental dependencies.
Since pH is an important determinant of chemical equilibrium and biological activity in water, its role was context-sensitive and interacted with other parameters that govern DO solubility and nutrient dynamics in the water body. DO, a key ecological health indicator, influenced the model consistently and positively (SHAP ≈ −8 to +16). In some instances, marginally negative SHAP values were observed at high DO levels, likely due to interactions with other parameters such as high pH or low BOD, which attenuated DO's contribution to WQI. The dominant pattern, however, shows DO's positive impact on water quality: higher DO levels raised WQI predictions, reflecting DO's role in sustaining aquatic organisms. Conductivity, which reflects ionic concentrations due to dissolved salts and industrial pollutants, also contributed meaningfully (SHAP ≈ −6 to +9), particularly in moderately disturbed environments. Conductivity measures dissolved ionic species, which generally result from industrial discharge and runoff from agricultural and urban sources. In moderately disturbed systems, where ionic loads fluctuate without large variances, conductivity can serve as a reliable indicator of human impacts. The SHAP analysis places its contribution to WQI prediction between approximately −6 and +9, a modest and context-dependent range suggesting that conductivity captures short-term, transient changes in water chemistry across habitats with different but relatively stable ionic backgrounds. By contrast, parameters such as Nitrate-Nitrite, Fecal Coliform, and Total Coliform had low SHAP values (typically ±3 or less), indicating minimal impact on the model's decision-making.
These low values are consistent with the SHAP analysis, which indicated minimal influence of Fecal and Total Coliforms on WQI prediction, possibly due to their sparse representation and dynamics that differ from those of BOD. The SHAP results also agree with the weak statistical relationship observed between Conductivity and Total Coliform (r ≈ 0.14) in Sect. 3.2, reinforcing their separation in terms of pollution profiles (Conductivity relating to ions, Total Coliform to fecal contamination). Conductivity made an overall moderate contribution to WQI, while Total Coliform, with its limited occurrences and wide distribution, carried little statistical signal and hence minor SHAP impact. This likely results from a lack of value diversity or strong correlation with other model features, which limited their usefulness during training. To analyze how these features interact and affect the prediction on a continuous scale, the SHAP values for DO were examined using a SHAP dependence plot (Fig. 5b). This visualization shows an increasingly positive SHAP contribution as DO concentration rises from 5.0 to 8.0 mg/L. Importantly, the pH color gradient reveals a pronounced second-order interaction: samples with higher pH (> 7.5) exhibited steeper positive SHAP responses, underscoring the synergistic effect of DO and pH in enhancing water quality. The dependence plot was produced for DO because of its ecological importance and its wide variation in SHAP values (−8 to +16), which makes this visualization especially informative; BOD and pH were examined in the other SHAP visualizations (Fig. 5a and d).
In future versions of this analytical framework, individual dependence plots for BOD and pH could be added at the same level of detail. These patterns can be understood through chemistry and biology, because oxygen solubility and uptake are greater under neutral to mildly alkaline conditions. SHAP's ability to reveal such subtle interactions reflects the interdependence of processes operating at different scales, making it effective for analyzing complex environmental systems. Beyond these global relationships, the SHAP force plot (Fig. 5c) explains how individual parameters combine to produce a single prediction, in this case a WQI value of 96.32. For this example, a pH of 7.8 acted as a strong positive driver, while BOD (4.9 mg/L) substantially dampened the prediction, and conductivity (81 µS/cm) added a minor but perceptible negative contribution. Such detail allows stakeholders, including environmental engineers and policy analysts, to trace the reasoning behind a given assessment, so that AI predictions can be explained and audited. This functionality is crucial for real-time diagnosis of anomalies requiring intervention, as observed in IoT-enabled water quality monitoring stations. To understand how consistently each feature influenced predictions over the whole dataset, we analyzed the distribution of SHAP values for each parameter (Fig. 5d). These plots capture the range, mean, and concentration of feature contributions. BOD, the most important feature and the most context-dependent, also had the widest SHAP value range (−12 to +24). This wide span suggests that while BOD is consistently influential, the magnitude of its impact varies across water samples, which may be due to differences in pollution sources, ecosystem buffering capacity, or both.
DO and pH had contributions concentrated around the ±10 mark, meaning they were more stable and more uniform across locations and seasons in the dataset. Conductivity had a smaller range (±6), indicating a moderate but less variable impact on WQI predictions. Nitrate-nitrite and the microbial indicators showed compact, zero-centered SHAP distributions, reinforcing their minimal role in the model's predictive logic. This minimal contribution likely stems from two primary causes: (i) low variability and skewed distributions in the data, particularly for coliforms with > 67% missing values and high outliers; and (ii) weak correlation with the other important physicochemical indicators (specifically, DO and BOD) in the correlation matrix (see Sect. 3.2). The model therefore gave these features less weight during training. The SHAP framework confirmed this influence, as coliforms showed a limited range of SHAP values (generally within ±3), indicating limited influence on WQI prediction. Although coliforms retain regulatory relevance, their limited statistical and predictive signal in this dataset reduced their weight in the ensemble model accordingly. These trends indicate that the model accurately distinguishes high-impact variables from low-signal features, avoiding overfitting and noise amplification. Moreover, the consistent SHAP distributions of DO, pH, and conductivity across samples suggest the model could generalize to new datasets with similar ecological baselines. From a systems management point of view, these findings support the practicality of embedding SHAP-enhanced AI models in water governance frameworks.
This approach fosters stakeholder confidence, regulatory clarity, and timely action through its combination of interpretive power and high accuracy, aiding governance and policymaking at every level. The SHAP-interpretability feature helps strengthen the strategic use of AI in the environment, from guiding regional pollution control policies to providing local monitoring stations with programmable real-time alert systems. The SHAP-based interpretability also demonstrated that the model picked up recognized environmental relationships, for example, that high BOD has a negative effect on WQI and DO a positive one, confirming the model's consistency with established ecological principles.
Limitations and future directions
Despite the significant methodological advancements in this study, certain limitations must be acknowledged. Most of the data came from a limited set of river sites in India, so the model may not transfer well to regions with different climatic or geographic conditions; it should be tested on data from other regions to assess its wider applicability. The stacking ensemble method showed strong potential, but its performance depends on appropriate base-model selection and careful hyperparameter tuning. Future work should examine the sensitivity of the results to these choices, refine tuning strategies, and validate the model on entirely new datasets to ensure reliability. Real-time implementation of the framework is practical and scalable when connected to IoT-enabled sensor networks. The essential input parameters (DO, BOD, pH, conductivity, nitrate, and coliforms) can be readily gathered using commercially available multi-parameter water quality sensors, which can provide continuous data streams to feed the trained ensemble model for real-time WQI predictions. Such deployments can enable automated early warning systems, regulatory alerts, and adaptive management plans. However, real-time implementation will also depend on sensor calibration, environmental noise, and data integrity. As a key future research direction, the framework could be extended with digital twins and cloud-based processing of observations and model outputs to support accurate and timely decision-making in adaptive smart water quality monitoring networks.
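The sensor-to-alert loop described above can be sketched as a small dispatch function. Everything here is hypothetical: `predict_wqi` stands in for the trained stacked ensemble's `predict`, and the feature names and the alert threshold of 50 are illustrative assumptions, not values from the study.

```python
# Hypothetical real-time alerting sketch: map one sensor reading to the
# model's feature order, estimate WQI, and flag readings below a threshold.
FEATURE_ORDER = ["do", "bod", "ph", "conductivity", "nitrate",
                 "fecal_coliform", "total_coliform"]
ALERT_THRESHOLD = 50.0  # WQI below this triggers an alert (assumed value)

def predict_wqi(features):
    """Placeholder for stacked_model.predict([features])[0].

    Crude illustrative surrogate: reward DO, penalize BOD.
    """
    do, bod = features[0], features[1]
    return 60.0 + 5.0 * do - 4.0 * bod

def check_reading(reading: dict) -> tuple:
    """Return (predicted WQI, alert flag) for one sensor reading."""
    features = [reading[name] for name in FEATURE_ORDER]
    wqi = predict_wqi(features)
    return wqi, wqi < ALERT_THRESHOLD

reading = {"do": 4.0, "bod": 9.5, "ph": 6.8, "conductivity": 410,
           "nitrate": 1.2, "fecal_coliform": 900, "total_coliform": 1500}
wqi, alert = check_reading(reading)
print(f"WQI = {wqi:.1f}, alert = {alert}")
```

In a deployment, `check_reading` would run on each message from the sensor stream, with the calibration, noise filtering, and data-integrity checks noted above applied before prediction.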
Furthermore, implementing sophisticated deep learning models such as LSTM and transformers may increase forecast accuracy by capturing complicated spatiotemporal patterns, making the system more suitable for large-scale environmental applications.
Conclusions
This study presents a highly accurate framework for the continuous prediction of the Water Quality Index (WQI) using a stacked regression ensemble model. The architecture integrates six advanced machine learning algorithms, i.e., XGBoost, CatBoost, Random Forest, Gradient Boosting, Extra Trees, and AdaBoost, with Linear Regression as the meta-learner. Trained on 1,987 river water samples collected across India between 2005 and 2014, the model achieved outstanding performance (R² = 0.9952, Adjusted R² = 0.9947, MAE = 0.7637, RMSE = 1.0704). Among the individual models, Gradient Boosting was the closest in performance but showed a notably higher error rate, confirming the ensemble's superior generalization capability and a 28% reduction in prediction error. The integration of SHAP-based interpretability enhanced the transparency of the framework by identifying dissolved oxygen, biochemical oxygen demand, conductivity, and pH as the most influential parameters affecting WQI predictions. This interpretability is critical for regulatory compliance and risk assessment, especially in data-sensitive applications such as drinking water management and ecological monitoring. The framework demonstrated a 15–30% improvement in predictive accuracy compared to previous studies focused on classification tasks or standalone regressors, and it addresses the interpretability limitations commonly associated with black-box models. Designed for real-world implementation, the approach is scalable, adaptable to real-time data streams, and compatible with IoT-enabled sensor networks. It offers proactive environmental surveillance and supports automated, data-driven decision-making.
By delivering both precision and explainability, the model empowers policymakers, regulators, and environmental managers to protect water resources and public health through informed, timely interventions.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
The authors express their sincere gratitude to the National Institute of Technology Delhi, New Delhi, India, for providing institutional infrastructure, advanced computing facilities, and essential research support. The authors also acknowledge the assistance provided by the Voice of Environment (a Scientific and Environmental Research Organization), Guwahati, Assam, India, in facilitating this research study. Academic guidance from faculty members at Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India; Jawaharlal Nehru Technological University Anantapur, Telangana, India; and King Khalid University, Abha, Saudi Arabia, significantly contributed to the interdisciplinary insights of this work. The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding the work through a large research project under grant number RGP2/32/46. The authors further acknowledge the Kaggle open-access repository (https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data) for providing the dataset that served as the foundational component of the machine learning framework employed in this study.
Abbreviations
- WQI
Water Quality Index
- DO
Dissolved Oxygen
- BOD
Biochemical Oxygen Demand
- ML
Machine Learning
- SHAP
Shapley Additive exPlanations
- XAI
Explainable Artificial Intelligence
- CV
Cross-validation
- MAE
Mean Absolute Error
- RMSE
Root Mean Squared Error
- R²
Coefficient of Determination
- IoT
Internet of Things
- EDA
Exploratory Data Analysis
- IQR
Interquartile Range
Author contributions
R.C.: Writing – Original Draft, Conceptualization, Methodology, Data Curation, Model Development, Visualization (including Figures and Tables), Interpretation, and Results Analysis. A.K.: Supervision, review, and editing. P.C.: Review and editing of the computational framework. M.N.: Resource management and manuscript review. M.C.: Review and editing, manuscript refinement, literature review, and technical validation. N.A.K.: Writing, reviewing, and editing.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
The dataset used in this study is publicly available at Kaggle: https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data. The dataset includes measurements of water quality parameters collected from several rivers in India over multiple years.
Declarations
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Antwi, H. A., Zhou, L., Xu, X. & Mustafa, T. Progressing towards environmental health targets in China: an integrative review of achievements in air and water pollution under the ecological civilisation and the Beautiful China dream. Sustainability 13, 3664 (2021).
- 2. Han, X. et al. Water strategies and management: current paths to sustainable water use. Appl. Water Sci. 14, 154 (2024).
- 3. Choudhury, M. et al. Investigation of groundwater and soil quality near to a municipal waste disposal site in Silchar, Assam, India. Int. J. Energy Water Resour. 6, 37–47 (2022).
- 4. Thakur, A. & Devi, P. A comprehensive review on water quality monitoring devices: materials advances, current status, and future perspective. Crit. Rev. Anal. Chem. 54, 193–218 (2024).
- 5. Singh, B. P., Choudhury, M., Gupta, P., Chadha, U. & Zewude, D. Physicochemical and biological characteristics of river Hindon at Galheta station from 2009 to 2020. Environ. Qual. Manage. 33, 331–344 (2024).
- 6. T. R., M. et al. Water quality level estimation using IoT sensors and probabilistic machine learning model. Hydrol. Res. 55, 775–789 (2024).
- 7. Yadav, A. K., Khan, P. & Sharma, S. K. Water quality index assessment of groundwater in Todaraisingh tehsil of Rajasthan state, India – a greener approach. J. Chem. 7, S428–S432 (2010).
- 8. Singh, B. P., Choudhury, M., Samanta, P., Gaur, M. & Kumar, M. Ecological risk assessment of heavy metals in adjoining sediment of river ecosystem. Sustainability 13, 10330 (2021).
- 9. Das, S., Khondakar, K. R., Mazumdar, H., Kaushik, A. & Mishra, Y. K. AI and IoT: supported sixth generation sensing for water quality assessment to empower sustainable ecosystems. ACS ES&T Water 5, 490–510 (2025).
- 10. Zhong, S. et al. Machine learning: new ideas and tools in environmental science and engineering. Environ. Sci. Technol. 10.1021/acs.est.1c01339 (2021).
- 11. Abdullah, S., Ismail, M., Ahmed, A. N. & Abdullah, A. M. Forecasting particulate matter concentration using linear and non-linear approaches for air quality decision support. Atmosphere 10, 667 (2019).
- 12. Manfreda, S. et al. Advancing river monitoring using image-based techniques: challenges and opportunities. Hydrol. Sci. J. 69, 657–677 (2024).
- 13. Alqadhi, S. et al. Combining logistic regression-based hybrid optimized machine learning algorithms with sensitivity analysis to achieve robust landslide susceptibility mapping. Geocarto Int. 37, 9518–9543 (2022).
- 14. Dostmohammadi, M., Pedram, M. Z., Hoseinzadeh, S. & Garcia, D. A. A GA-stacking ensemble approach for forecasting energy consumption in a smart household: a comparative study of ensemble methods. J. Environ. Manage. 364, 121264 (2024).
- 15. Zhang, Y., Liu, J. & Shen, W. A review of ensemble learning algorithms used in remote sensing applications. Appl. Sci. 12, 8654 (2022).
- 16. Adhikari, D. et al. A comprehensive survey on imputation of missing data in internet of things. ACM Comput. Surv. 55, 1–38 (2023).
- 17. Hassija, V. et al. Interpreting black-box models: a review on explainable artificial intelligence. Cognit. Comput. 16, 45–74 (2024).
- 18. Kovari, A. AI for decision support: balancing accuracy, transparency, and trust across sectors. Information 15, 725 (2024).
- 19. Adadi, A. & Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018).
- 20. Zacharias, J., von Zahn, M., Chen, J. & Hinz, O. Designing a feature selection method based on explainable artificial intelligence. Electron. Markets 32, 2159–2184 (2022).
- 21. Höhl, A. et al. Opening the black box: a systematic review on explainable artificial intelligence in remote sensing. IEEE Geosci. Remote Sens. Mag. 12, 261–304 (2024).
- 22. Howard-Azzeh, M., Pearl, D. L., O'Sullivan, T. L. & Berke, O. Comparing the diagnostic performance of ordinary, mixed, and Lasso logistic regression models at identifying opioid and cannabinoid poisoning in U.S. dogs using pet demographic and clinical data reported to an animal poison control center (2005–2014). PLoS One 18, e0288339 (2023).
- 23. Adeniran, A. A., Onebunne, A. P. & William, P. Explainable AI (XAI) in healthcare: enhancing trust and transparency in critical decision-making. World J. Adv. Res. Rev. 23, 2447–2658 (2024).
- 24. Das Kangabam, R., Bhoominathan, S. D., Kanagaraj, S. & Govindaraju, M. Development of a water quality index (WQI) for the Loktak lake in India. Appl. Water Sci. 7, 2907–2918 (2017).
- 25. Nasir, N. et al. Water quality classification using machine learning algorithms. J. Water Process. Eng. 48, 102920 (2022).
- 26. Jan, F., Min-Allah, N. & Düştegör, D. IoT based smart water quality monitoring: recent techniques, trends and challenges for domestic applications. Water 13, 1729 (2021).
- 27. Thakur, A. & Kumar, A. Recent advances on rapid detection and remediation of environmental pollutants utilizing nanomaterials-based (bio)sensors. Sci. Total Environ. 834, 155219 (2022).
- 28. Zhou, W., Yan, Z. & Zhang, L. A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP importance analysis in soybean branching prediction. Sci. Rep. 14, 5905 (2024).
- 29. Pedro, F. A. Review of data mining, big data analytics and machine learning approaches. J. Comput. Nat. Sci. 169–181, 10.53759/181X/JCNS202303016 (2023).
- 30. Yang, R., Liu, H. & Li, Y. Quantifying uncertainty of marine water quality forecasts for environmental management using a dynamic multi-factor analysis and multi-resolution ensemble approach. Chemosphere 331, 138831 (2023).
- 31. Janković, R., Mihajlović, I., Štrbac, N. & Amelio, A. Machine learning models for ecological footprint prediction based on energy parameters. Neural Comput. Appl. 33, 7073–7087 (2021).
- 32. Akhtar, N. et al. Modification of the water quality index (WQI) process for simple calculation using the multi-criteria decision-making (MCDM) method: a review. Water 13, 905 (2021).
- 33. Tharwat, A. Classification assessment methods. Appl. Comput. Inf. 17, 168–192 (2021).
- 34. Bennett, N. D. et al. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20 (2013).
- 35. Simeone, A., Woolley, E., Escrig, J. & Watson, N. J. Intelligent industrial cleaning: a multi-sensor approach utilising machine learning-based regression. Sensors 20, 3642 (2020).
- 36. Wijethilake, C., Munir, R. & Appuhami, R. Environmental innovation strategy and organizational performance: enabling and controlling uses of management control systems. J. Bus. Ethics 151, 1139–1160 (2018).
- 37. Bhatti, U. A. et al. Investigating the nexus between energy, socio-economic factors and environmental pollution: a geo-spatial multi regression approach. Gondwana Res. 130, 308–325 (2024).
- 38. Assefa, A. A., Tian, W., Hundera, N. W. & Aftab, M. U. Crowd density estimation in spatial and temporal distortion environment using parallel multi-size receptive fields and stack ensemble meta-learning. Symmetry 14, 2159 (2022).
- 39. Abimannan, S. et al. Ensemble multifeatured deep learning models and applications: a survey. IEEE Access 11, 107194–107217 (2023).
- 40. Li, W. & Hsu, C. Y. GeoAI for large-scale image analysis and machine vision: recent progress of artificial intelligence in geography. ISPRS Int. J. Geoinf. 11, 385 (2022).
- 41. Ganesan, S. & Somasiri, N. Navigating the integration of machine learning in healthcare: challenges, strategies, and ethical considerations. J. Comput. Cogn. Eng. 10.47852/bonviewJCCE42023600 (2024).
- 42. Linardatos, P., Papastefanopoulos, V. & Kotsiantis, S. Explainable AI: a review of machine learning interpretability methods. Entropy 23, 18 (2020).
- 43. Ihsanullah, I., Alam, G., Jamal, A. & Shaik, F. Recent advances in applications of artificial intelligence in solid waste management: a review. Chemosphere 309, 136631 (2022).
- 44.Chu, W. et al. SHAP-powered insights into Spatiotemporal effects: unlocking explainable Bayesian-neural-network urban flood forecasting. Int. J. Appl. Earth Obs. Geoinf.131, 103972 (2024). [Google Scholar]
- 45.Deepika & Pandove, G. Prediction of traffic time using XGBoost model with hyperparameter optimization. Multimed Tools Appl.10.1007/s11042-025-20646-z (2025). [Google Scholar]
- 46.Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn.63, 3–42 (2006). [Google Scholar]
- 47.Werner de Vargas, V., Schneider Aranda, J. A., dos Santos Costa, R., da Silva Pereira, P. R. & Victória Barbosa, J. L. Imbalanced data preprocessing techniques for machine learning: a systematic mapping study. Knowl. Inf. Syst.65, 31–57 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Reimann, C., Filzmoser, P., Hron, K., Kynčlová, P. & Garrett, R. G. A new method for correlation analysis of compositional (environmental) data – a worked example. Sci. Total Environ.607–608, 965–971 (2017). [DOI] [PubMed] [Google Scholar]
- 49.Khanal, G., Thapa, A., Devkota, N. & Paudel, U. R. A review on harvesting and Harnessing rainwater: an alternative strategy to Cope with drinking water scarcity. Water Supply. 20, 2951–2963 (2020). [Google Scholar]
- 50.Emeka, C., Nweke, B., Osere, J. & Ihunwo, C. K. Water quality index for the assessment of selected borehole water quality in Rivers State. Eur. J. Environ. Earth Sci.1, 1–4 (2020).
- 51.Lima, A. R., Cannon, A. J. & Hsieh, W. W. Nonlinear regression in environmental sciences by support vector machines combined with evolutionary strategy. Comput. Geosci.50, 136–144 (2013). [Google Scholar]
- 52.Jailani, N. & Mara, G. C. M. A. K. Feature selection in ozone feature space impacts performance in gradient boosting, random forest, XGBoost and adaptive boosting regressors. In International Conference on Current Trends in Advanced Computing (ICCTAC) 1–6 (IEEE, 2024). 10.1109/ICCTAC61556.2024.10581262
- 53.Pellegrino, E. et al. Machine learning random forest for predicting oncosomatic variant NGS analysis. Sci. Rep.11, 21820 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Attou, H., Guezzaz, A., Benkirane, S. & Azrour, M. A new secure model for cloud environments using RBFNN and adaboost. SN Comput. Sci.6, 188 (2025). [Google Scholar]
- 55.Satish, N., Anmala, J., Rajitha, K. & Varma, M. R. R. A stacking ANN ensemble model of ML models for stream water quality prediction of Godavari river basin, India. Ecol. Inf.80, 102500 (2024). [Google Scholar]
- 56.Park, J. et al. Interpretation of ensemble learning to predict water quality using explainable artificial intelligence. Sci. Total Environ.832, 155070 (2022). [DOI] [PubMed] [Google Scholar]
- 57.Patel, M., Patel, S. B., Swain, D. & Shah, S. Unleashing the potential of boosting techniques to optimize Station-Pairs passenger flow forecasting. Procedia Comput. Sci.235, 32–44 (2024). [Google Scholar]
- 58.Touzani, S., Granderson, J. & Fernandes, S. Gradient boosting machine for modeling the energy consumption of commercial buildings. Energy Build.158, 1533–1543 (2018). [Google Scholar]
- 59.Cao, J., Kwong, S. & Wang, R. A noise-detection based adaboost algorithm for mislabeled data. Pattern Recognit.45, 4451–4465 (2012). [Google Scholar]
Data Availability Statement
The dataset used in this study is publicly available at Kaggle: https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data. The dataset includes measurements of water quality parameters collected from several rivers in India over multiple years.
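For readers working with this dataset, the weighted arithmetic WQI used in the study can be sketched as below: WQI = Σ(q_i·w_i)/Σw_i, with weights w_i = K/S_i, K = 1/Σ(1/S_i), and quality rating q_i = 100·(V_i − V_ideal)/(S_i − V_ideal). This is a minimal illustration only; the parameter standards (S_i) and ideal values (V_ideal) in `PARAMS` are assumed placeholder values, not the authors' exact table.

```python
# Minimal sketch of the weighted arithmetic WQI method on a synthetic sample.
# Standards and ideal values are illustrative assumptions, not the study's table.
PARAMS = {
    # parameter: (standard S_i, ideal value V_ideal)
    "pH":           (8.5, 7.0),
    "DO":           (5.0, 14.6),   # dissolved oxygen, mg/L
    "BOD":          (5.0, 0.0),    # biochemical oxygen demand, mg/L
    "conductivity": (1000.0, 0.0), # µS/cm
}

def weighted_arithmetic_wqi(sample: dict) -> float:
    """WQI = sum(q_i * w_i) / sum(w_i), w_i = K / S_i, K = 1 / sum(1/S_i)."""
    k = 1.0 / sum(1.0 / s for s, _ in PARAMS.values())
    num = den = 0.0
    for name, (s, ideal) in PARAMS.items():
        w = k / s                                        # unit weight
        q = 100.0 * (sample[name] - ideal) / (s - ideal) # quality rating
        num += q * w
        den += w
    return num / den

sample = {"pH": 7.5, "DO": 6.0, "BOD": 3.0, "conductivity": 350.0}
print(round(weighted_arithmetic_wqi(sample), 2))  # → 65.31
```

Note that by construction the weights sum to 1 (Σw_i = K·Σ(1/S_i) = 1), so the denominator is exact; a sample at ideal values for every parameter yields WQI = 0.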