Scientific Reports. 2026 Feb 5;16:7341. doi: 10.1038/s41598-026-38969-8

Comparative assessment of machine learning models for daily streamflow prediction in a subtropical monsoon watershed

Zhi Zhang 1, Yusha Xiao 2, Runting Chen 3, Kaihao Long 3, Haojun Deng 1, Zhuangpeng Zheng 1, Jiwu Liao 1
PMCID: PMC12923708  PMID: 41644994

Abstract

Accurate streamflow prediction is critical for flood warning and water resources management in subtropical monsoon watersheds, yet optimal model selection remains challenging. This study compared seven machine learning models: Linear Regression (LR), Gradient Boosting Regressor (GBR), Artificial Neural Network (ANN), Random Forest (RF), Extra Trees Regressor (ETR), XGBoost (XGB), and Long Short-Term Memory (LSTM), for daily streamflow prediction in the Boluo Watershed, South China. Results demonstrated that LSTM achieved superior performance with NSE and KGE of 0.95, followed by ANN and LR. High-flow evaluation revealed that LSTM maintained robust performance under extreme conditions, achieving NSE of 0.86, 0.80, and 0.45 for flows exceeding the 90th, 95th, and 99th percentiles respectively. For flood peaks, LSTM showed the smallest underestimation of 7 to 20%, compared to 30 to 50% for tree-based models. Feature importance analysis revealed upstream flow from Lingxia Station as the dominant predictor (importance of 0.373 for XGB), reflecting watershed memory effects whereby streamflow is predominantly controlled by antecedent hydrological conditions. Residual analysis identified pronounced heteroscedasticity with increasing prediction errors under high-flow conditions. These findings demonstrate that temporal memory mechanisms provide substantial advantages for streamflow prediction under extreme conditions, offering guidance for model selection in operational flood forecasting systems.

Keywords: Streamflow prediction, Machine learning models, Feature importance analysis, Subtropical monsoon watershed, Boluo watershed

Subject terms: Climate sciences, Environmental sciences, Hydrology

Introduction

Flooding poses the greatest threat to human life among natural hazards, causing thousands of deaths, displacing millions of people, and resulting in economic losses exceeding $651 billion during the first two decades of the 21st century1. Accurate streamflow prediction is fundamental to water resources management, flood early warning, reservoir operation, and ecological flow protection2,3. Climate change and intensifying human activities, particularly reservoir operations that alter natural flow regimes, have increased the frequency and magnitude of high-flow hydrological events, underscoring the critical need for reliable forecasting tools to support disaster mitigation and sustainable water allocation4,5. This challenge is particularly acute in subtropical monsoon watersheds, which are characterized by high-intensity rainfall events, pronounced wet-dry seasonality, and complex streamflow generation mechanisms that complicate prediction efforts.

Traditional streamflow forecasting approaches broadly fall into two categories: process-based hydrological models and data-driven machine learning methods. Process-based models simulate rainfall-streamflow processes through explicit representation of physical hydrological mechanisms, providing mechanistic understanding of watershed behavior6,7. However, these models typically require extensive input data that are often unavailable in data-scarce regions3. Moreover, process-based models frequently struggle to capture the highly nonlinear and nonstationary patterns inherent in hydrological time series, particularly in watersheds influenced by complex human interventions such as reservoir operations8.

In contrast, machine learning approaches have demonstrated remarkable capability in extracting complex patterns from high-dimensional, nonlinear data without requiring explicit physical parameterization9. Recent advances in machine learning have spurred numerous studies reporting superior performance compared to traditional approaches10,11. The application of machine learning in streamflow prediction has evolved through distinct methodological approaches, from simple statistical models to sophisticated ensemble and neural network architectures. Artificial Neural Networks (ANN) marked a significant advancement by enabling the capture of nonlinear hydrological relationships through multi-layer perceptron architectures with nonlinear activation functions12. Studies have demonstrated ANN’s effectiveness in streamflow modeling, with their layered structure allowing hierarchical feature learning that captures intricate patterns potentially overlooked by simpler models13,14.

More recently, recurrent neural network architectures, particularly Long Short-Term Memory (LSTM) networks, have emerged as powerful tools for hydrological time series modeling. LSTM incorporates explicit memory mechanisms through gating structures that selectively retain or discard information across time steps, enabling effective capture of long-term temporal dependencies inherent in watershed processes15. Several studies have demonstrated LSTM’s superiority over conventional machine learning approaches for streamflow prediction, with the architecture’s ability to model sequential dependencies proving particularly advantageous for capturing antecedent moisture effects and flow recession dynamics16,17. Comparative evaluations have shown that LSTM can outperform both process-based models and traditional machine learning algorithms, especially in watersheds exhibiting complex storage-release behavior18.

Ensemble tree-based methods represent a parallel advancement, offering distinct advantages through aggregation strategies19. Vilaseca et al. (2023) demonstrated that Random Forest (RF) effectively identified influential rainfall-streamflow variables for daily streamflow simulation, achieving superior performance through proper handling of feature interactions20. Recent studies have shown that Extra Trees Regressor (ETR), along with RF and gradient boosting variants, can achieve high accuracy in monthly streamflow time series modeling, with ensemble methods accounting for nearly 80% of best-performing models in comparative evaluations21. Gradient boosting algorithms such as Gradient Boosting Regressor (GBR) represent another major advancement in ensemble learning22,23. Studies have shown that XGBoost (XGB) is particularly effective for streamflow forecasting, generally outperforming Support Vector Machines with cluster analysis-based modular approaches further improving accuracy in capturing complicated hydrological patterns14,24.

Despite the proliferation of machine learning applications in hydrology, significant research gaps remain that warrant systematic investigation. Comprehensive multi-algorithm comparisons within single watersheds remain scarce, with most studies focusing on one or two model types, limiting understanding of model-specific strengths and preventing informed selection decisions25. Moreover, the relative performance of simple linear models versus complex ensemble or neural network approaches remains incompletely characterized, particularly in subtropical monsoon regions where rainfall-streamflow relationships may exhibit distinct characteristics26. Equally important, systematic evaluation of model performance during high-flow events is often overlooked, with most studies emphasizing overall metrics rather than conditional performance across flow regimes27. Additionally, feature importance analysis to elucidate hydrological process controls remains underutilized despite its value for model interpretability and physical understanding25.

These gaps are pronounced for subtropical monsoon watersheds characterized by high-intensity rainfall, pronounced seasonality, and reservoir regulation. To address these gaps, this study systematically evaluates seven machine learning models representing different algorithmic families: Linear Regression (LR), GBR, ANN, RF, ETR, XGB, and LSTM, for daily streamflow prediction in the Boluo Watershed. The specific objectives are to: (1) compare predictive performance across simple linear, ensemble tree-based, and neural network approaches; (2) evaluate model capabilities in capturing high-flow conditions through conditional performance metrics at the 90th, 95th, and 99th percentiles, along with detailed analysis of peak magnitude, timing, and hydrograph morphology; (3) diagnose model behavior through residual analysis to identify systematic biases and heteroscedasticity patterns; (4) quantify feature importance across different model types to elucidate key hydrological drivers and reconcile algorithm-dependent importance rankings, thereby guiding monitoring network optimization and data collection priorities. This comprehensive evaluation provides practical guidance for model selection in operational forecasting systems while advancing methodological understanding of data-driven hydrological modeling in reservoir-influenced subtropical monsoon watersheds.

Materials and methods

Study area and dataset

This study focuses on the Boluo watershed, a sub-watershed of the Dongjiang River, which is a primary tributary of the Pearl River watershed. Originating in Jiangxi Province, the Dongjiang River flows southwest through Guangdong Province, serving as one of the most critical water sources for the Guangdong-Hong Kong-Macao Greater Bay Area. The Dongjiang River watershed encompasses approximately 35,340 km², with reservoir-controlled areas comprising 33.2% of the total watershed area. The Boluo watershed, serving as the outlet watershed of the Dongjiang River system, covers a drainage area of 3941.3 km² with a shape ratio of 1.06, indicating a relatively compact watershed configuration. The watershed is significantly influenced by upstream flow from the Lingxia station and the Baipenzhu Reservoir, the latter being an artificially regulated reservoir capable of flood control and water storage operations (Fig. 1). The watershed is characterized by a subtropical monsoon climate with a mean annual temperature of approximately 21 °C. Mean annual precipitation ranges from 1500 to 2400 mm, with approximately 80% concentrated during the wet season from April to September. Frontal and typhoon rainfall events are predominant in this region, generating rapid streamflow responses and high flood peaks.

Fig. 1 Location and topographic characteristics of the Boluo watershed in the Dongjiang River watershed, southern China.

The modeling framework employs meteorological forcing data and upstream flow measurements to predict daily streamflow at the Boluo outlet. Precipitation data were obtained from a network of 14 rain gauge stations strategically distributed across diverse topographical zones (mountainous, plain, and valley regions) within the watershed, effectively capturing spatial precipitation heterogeneity. Additionally, watershed-averaged rainfall and evapotranspiration were calculated from national meteorological control stations. Streamflow data comprised daily measurements from three hydrological stations (Boluo, Lingxia, and Baipenzhu), providing comprehensive coverage of the watershed’s hydrological dynamics.

During data preprocessing, missing daily observations (< 3% missing rate) were interpolated using linear interpolation methods. For continuous gaps exceeding five days, missing values were replaced with climatological means from corresponding periods (same month and hydrological year type). Outlier detection was performed using the 3σ criterion to identify anomalous data. Prior to model training, all input features were normalized using MinMaxScaler to transform values into the range [0, 1]. This normalization is critical for LSTM, ANN and LR models, as these algorithms are sensitive to feature scales and may produce suboptimal results when input variables have vastly different magnitudes. For gradient-based optimization methods, normalized inputs ensure faster convergence and more stable training. To prevent data leakage, the scaling parameters were computed exclusively from the train period and subsequently applied to the test period using the same transformation.
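The leakage-free scaling step described above can be sketched as follows; this is a minimal NumPy illustration of the principle (fit on the train period only, then reuse those parameters), not the authors' code, and the toy values are hypothetical:

```python
import numpy as np

def fit_minmax(train):
    """Compute MinMax scaling parameters from the train period only."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(x, lo, hi):
    """Apply the train-period transform; test-period values may
    legitimately fall outside [0, 1] (e.g. an unprecedented flood)."""
    return (x - lo) / (hi - lo)

# toy daily flows: scaling parameters come from the train split alone
train = np.array([[10.0], [50.0], [30.0]])
test = np.array([[60.0], [20.0]])
lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)
```

Note that a test-period flood larger than any training flow scales to a value above 1, which is the intended behavior; refitting the scaler on the test period would constitute data leakage.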

Given varying record lengths across observation stations, the study period was divided into train period (1985–2004, 70%) and test period (2005–2013, 30%). This temporal division ensures adequate representation of different hydrological regimes (wet, normal, and dry years) in both datasets, enhancing model generalization capability. The train period was used for model calibration and hyperparameter optimization through cross-validation, while the test period provided independent evaluation of model performance.

Models

Seven machine learning models representing different algorithmic families were employed for streamflow prediction: LR, GBR, ANN, RF, ETR, XGB, and LSTM. These models encompass linear methods, ensemble tree-based approaches, and neural networks, enabling comprehensive comparison of data-driven modeling strategies.

LR

LR represents the simplest approach, establishing linear relationships between input features and target streamflow28. The model assumes that the target variable can be expressed as a weighted sum of input features.

$$\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j \quad (1)$$

where $\hat{y}$ is the predicted streamflow, $\beta_0$ is the intercept, $\beta_j$ represents the regression coefficient for feature $x_j$, and $p$ is the total number of input features. The coefficients are estimated by minimizing the sum of squared residuals between observed and predicted values. Hyperparameter tuning examined fit intercept [True, False] and positive constraint [True, False] to ensure physically plausible non-negative streamflow predictions, and regularization strategies including None (ordinary least squares), Ridge (L2 regularization with alpha values of 0.1, 1.0, 10.0), and Lasso (L1 regularization with alpha values of 0.1, 1.0, 10.0) to prevent overfitting in high-dimensional feature spaces. The optimal LR configuration (fit intercept=True, positive constraint=False, regularization=Ridge, alpha = 1.0) was selected. Despite its simplicity, LR provides a benchmark for assessing whether more complex models offer substantial performance improvements and serves as an interpretable baseline for understanding feature-target relationships.
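As a sketch of the estimation step, the ridge variant of Eq. (1) has the closed-form solution β = (XᵀX + αI)⁻¹Xᵀy. The NumPy code below is illustrative, not the study's implementation; the intercept is folded into `beta0` and left out of the penalty for brevity:

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge estimate: beta = (X'X + alpha*I)^-1 X'y.
    alpha = 0 recovers ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def lr_predict(X, beta, beta0=0.0):
    """Eq. (1): intercept plus weighted sum of input features."""
    return beta0 + X @ beta

# exactly linear toy data: y = 2 * x, so OLS recovers beta = 2
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
beta_ols = ridge_fit(X, y, alpha=0.0)
```

With alpha > 0 the coefficients shrink toward zero, which is the overfitting control the study tunes over the grid (0.1, 1.0, 10.0).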

GBR

GBR employs sequential ensemble learning, where decision trees are added iteratively to correct residuals from previous iterations29. The model builds trees in a stage-wise manner, with each subsequent tree fitted to the negative gradient of the loss function.

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x) \quad (2)$$

where $F_m(x)$ is the ensemble prediction at iteration m, $F_{m-1}(x)$ is the prediction from previous iterations, $\nu$ is the learning rate, and $h_m(x)$ is the newly added decision tree. Hyperparameter tuning explored learning rates (0.01, 0.1, 0.2), maximum tree depths (3, 5, 7), and numbers of estimators (50, 100, 200). The optimal GBR configuration (learning rate = 0.1, max depth = 5, n estimators = 200) was selected. This boosting strategy enables effective capture of nonlinear patterns while maintaining reasonable computational efficiency. GBR is particularly effective for structured tabular data common in hydrological applications.
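The stage-wise update of Eq. (2) can be illustrated with a deliberately trivial weak learner that simply predicts the mean residual; this is a sketch of the recurrence only, since a real GBR fits a regression tree at each stage:

```python
import numpy as np

def boost(y, nu=0.1, n_stages=200):
    """Stage-wise boosting, Eq. (2): F_m = F_{m-1} + nu * h_m, where the
    'weak learner' h_m here predicts the mean of the current residuals."""
    F = np.zeros_like(y)
    for _ in range(n_stages):
        residual = y - F                       # negative gradient for squared loss
        h = np.full_like(y, residual.mean())   # trivial weak learner
        F = F + nu * h                         # shrunken stage-wise update
    return F

y = np.array([1.0, 2.0, 3.0])
F = boost(y)
```

With this constant learner the ensemble converges geometrically (factor 1 − ν per stage) to the target mean, showing how the learning rate trades convergence speed against overfitting risk.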

ANN

ANN utilizes multi-layer perceptron architecture with nonlinear activation functions, enabling approximation of complex nonlinear mappings between inputs and outputs12. The network consists of an input layer receiving hydrological and meteorological variables, one or more hidden layers for feature transformation, and an output layer producing streamflow predictions. For each neuron in the network, the output is computed as follows.

graphic file with name d33e481.gif 3

where Inline graphic is the neuron output, f is the nonlinear activation function, Inline graphic represents the weight for input Inline graphic, n is the number of inputs, and b is the bias term. Architecture search explored hidden layer configurations [(64,), (128,), (128, 64)], activation functions (ReLU, tanh), optimization solvers (Adam, SGD), and learning rate strategies (constant, adaptive). Training was conducted using mini-batch gradient descent with a batch size of 32. To prevent overfitting, early stopping was implemented with a patience of 10 iterations. The L2 regularization parameter alpha was tuned from [0.0001, 0.001, 0.01]. The maximum number of training iterations was set to 200. The optimal ANN configuration (hidden layer sizes=(128, 64), activation=relu, solver=adam, learning rate=adaptive, alpha = 0.0001, batch size = 32, early stopping=True, n iter no change = 10, max iter = 200) was selected. The network’s layered structure allows hierarchical feature learning, potentially capturing intricate rainfall-streamflow relationships that simpler models may overlook.
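Applying Eq. (3) layer by layer gives the network's forward pass. The sketch below uses hypothetical weights, not the tuned (128, 64) network, and keeps the final layer linear as is conventional for regression:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, f=relu):
    """Eq. (3): y = f(sum_i w_i * x_i + b) for a single neuron."""
    return f(np.dot(w, x) + b)

def mlp_forward(x, layers, f=relu):
    """Stacked dense layers; `layers` is a list of (W, b) pairs.
    Hidden layers apply the activation; the output layer stays linear."""
    h = x
    for W, b in layers[:-1]:
        h = f(W @ h + b)
    W, b = layers[-1]
    return W @ h + b
```

A usage example: a two-layer network with identity hidden weights and a summing output layer maps the input [1, 2] to 3.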

RF

RF constructs an ensemble of decision trees trained on bootstrapped samples with random feature selection at each split30. The final prediction is obtained by averaging the predictions from all individual trees.

$$\hat{y} = \frac{1}{T}\sum_{t=1}^{T} h_t(x) \quad (4)$$

where $\hat{y}$ is the ensemble prediction, $T$ is the total number of trees, and $h_t(x)$ is the prediction from the tth decision tree. Each tree is trained on a bootstrap sample of the original dataset, and at each node split, only a random subset of features is considered. Hyperparameter tuning examined ensemble sizes (50, 100, 200 trees), maximum depths (None, 10, 20), and minimum samples for splits and leaves (2, 5, 10). The optimal RF configuration (n estimators = 200, max depth = 20, min samples split = 5, min samples leaf = 2) was selected. This bagging approach reduces overfitting and improves generalization through averaging predictions across multiple trees. RF naturally handles feature interactions and provides built-in feature importance measures.

ETR

ETR extends the random forest concept by introducing additional randomness in tree construction where the notation follows that of RF31. Unlike RF, ETR uses the entire training sample for each tree and selects split thresholds randomly rather than searching for optimal splits. The key difference lies in the tree construction process, where both training samples and split thresholds are randomized. This enhanced randomization can further reduce overfitting while potentially increasing computational efficiency through simplified split selection. Hyperparameter tuning examined ensemble sizes (50, 100, 200 trees), maximum depths (None, 10, 20), and minimum samples for splits and leaves (2, 5, 10). The optimal ETR configuration (n estimators = 200, max depth = 20, min samples split = 5, min samples leaf = 2) was selected.

XGB

XGB is a gradient boosting framework that has been widely adopted for its combination of computational efficiency and predictive accuracy in training and prediction. As a tree ensemble method, XGB constructs a collection of distinct decision trees, each of which assigns a score to a sample based on its features. During training the model builds multiple trees, each contributing a score at its leaf nodes, and the predicted value is obtained by summing the scores across all trees32.

$$\mathrm{Obj} = \sum_{k=1}^{K} L\left(y_k, \hat{y}_k\right) + \sum_{i=1}^{N} \Omega(f_i) \quad (5)$$
$$\Omega(f) = \gamma V + \frac{1}{2}\lambda \lVert \omega \rVert^2 \quad (6)$$

where $\mathrm{Obj}$ is the objective function, K is the total number of samples, $y_k$ and $\hat{y}_k$ represent the observed and predicted values for sample k, N is the number of trees in the ensemble, and $f_i$ denotes the ith tree. The XGB objective function has two components: $L$ denotes the loss function that measures the difference between the observed and predicted outcomes, while $\Omega$ represents the regularization term that applies a complexity penalty. The regularization parameters γ and λ control the magnitude of this penalty, V refers to the count of leaves in a CART model, and ω represents the vector of leaf scores. This regularization component plays a crucial role in controlling model complexity and preventing overfitting33. Hyperparameter tuning explored learning rates (0.01, 0.1, 0.2), maximum tree depths (3, 5, 7), numbers of estimators (50, 100, 200), and subsample ratios (0.8, 1.0). The optimal XGB configuration (learning rate = 0.1, max depth = 5, n estimators = 200, subsample = 0.8) was selected to maintain high-flow event sensitivity while preventing overfitting.
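The two-part objective of Eqs. (5)-(6) can be written out directly. The snippet below assumes a squared-error loss for illustration; the leaf scores and penalty parameters are hypothetical toy values, not xgboost internals:

```python
import numpy as np

def xgb_objective(y, y_hat, leaf_scores, gamma=1.0, lam=1.0):
    """Eqs. (5)-(6): loss term plus gamma*V + 0.5*lambda*||omega||^2,
    where V is the leaf count and omega the vector of leaf scores."""
    loss = np.sum((y - y_hat) ** 2)            # L: squared-error loss
    omega = np.asarray(leaf_scores)
    V = len(omega)                             # number of leaves
    penalty = gamma * V + 0.5 * lam * np.dot(omega, omega)
    return loss + penalty
```

Even a tree that fits the data perfectly pays a penalty that grows with its leaf count and leaf-score magnitudes, which is how the γ and λ terms discourage overly complex trees.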

LSTM

LSTM networks represent a specialized recurrent neural network architecture designed to capture long-term temporal dependencies in sequential data15. Unlike feedforward neural networks, LSTM incorporates explicit memory mechanisms through gating structures that selectively retain or discard information across time steps, making it particularly suitable for hydrological time series where antecedent conditions significantly influence current streamflow. At each time step t, the LSTM cell updates its hidden state $h_t$ and cell state $c_t$ as follows:

$$f_t = \sigma\left(W_f[h_{t-1}, x_t] + b_f\right) \quad (7)$$
$$i_t = \sigma\left(W_i[h_{t-1}, x_t] + b_i\right) \quad (8)$$
$$\tilde{c}_t = \tanh\left(W_c[h_{t-1}, x_t] + b_c\right) \quad (9)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (10)$$
$$o_t = \sigma\left(W_o[h_{t-1}, x_t] + b_o\right) \quad (11)$$
$$h_t = o_t \odot \tanh(c_t) \quad (12)$$

where $x_t$ is the input vector at time step t; $h_t$ is the hidden state at time step t; $c_t$ is the cell state at time step t; $f_t$, $i_t$, $o_t$ are the forget gate, input gate, and output gate activations, respectively; $\tilde{c}_t$ is the candidate cell state; $W_f$, $W_i$, $W_c$, $W_o$ are the weight matrices for the gates and cell state; $b_f$, $b_i$, $b_c$, $b_o$ are the bias terms; σ denotes the sigmoid function and ⊙ element-wise multiplication.

The network architecture consisted of a dual-layer stacked LSTM with hidden sizes of 128 and 64 units respectively, followed by a fully connected output layer. Input sequences were constructed using a sliding window of 7 lag days, resulting in input tensors of shape (samples, 7, 19) where 19 represents the number of features per time step. Hyperparameter tuning explored hidden layer configurations [(64, 32), (128, 64), (256, 128)], dropout rates (0.1, 0.2, 0.3), learning rates (0.001, 0.01), and batch sizes (32, 64, 128). The model was trained using the Adam optimizer to directly optimize hydrological efficiency. Early stopping with patience of 30 epochs was implemented to prevent overfitting, with maximum training epochs set to 200. The optimal LSTM configuration (hidden sizes=(128, 64), dropout = 0.2, learning rate = 0.001, batch size = 32) was selected.
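A single cell update implementing Eqs. (7)-(12) can be sketched in NumPy; this illustrates the recurrence only, not the dual-layer PyTorch model trained in this study, and the dictionary-based parameter layout is a simplifying assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update, Eqs. (7)-(12). W and b map gate names
    ('f', 'i', 'c', 'o') to weight matrices over [h_{t-1}, x_t] and biases."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate, Eq. (7)
    i_t = sigmoid(W['i'] @ z + b['i'])        # input gate, Eq. (8)
    c_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate cell state, Eq. (9)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update, Eq. (10)
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate, Eq. (11)
    h_t = o_t * np.tanh(c_t)                  # hidden state, Eq. (12)
    return h_t, c_t
```

The additive cell-state update in Eq. (10) is what lets information such as antecedent wetness persist across many time steps without vanishing, which the surrounding text identifies as the source of LSTM's advantage.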

Experimental setup

To ensure fair comparison across all models, a unified training framework was established for model development and evaluation. The loss function for all models was defined as the Root Mean Square Error (RMSE) between observed and predicted streamflow values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_{obs,i} - Q_{pred,i}\right)^2} \quad (13)$$

where $Q_{obs,i}$ and $Q_{pred,i}$ denote observed and predicted streamflow at time i, n is the number of observations, and $\bar{Q}_{obs}$ represents mean observed flow. Hyperparameter optimization was conducted using GridSearchCV, which performs exhaustive search over specified parameter grids. The search ranges for hyperparameters were determined based on recommended values from the scikit-learn, torch and xgboost library documentation, and established practices in hydrological machine learning literature14,34. To prevent data leakage and ensure proper temporal ordering in time series prediction, TimeSeriesSplit cross-validation with 5 folds was employed instead of standard k-fold cross-validation. This approach ensures that training data always precedes validation data chronologically, maintaining the temporal integrity of hydrological predictions. In each fold, the model was trained on historical data and validated on subsequent time periods, mimicking real-world forecasting scenarios. Model selection was based on the negative RMSE score averaged across all cross-validation folds, ensuring that the optimal hyperparameter configuration for each model was identified through consistent evaluation criteria. All model implementations were conducted using Python (version 3.12) with scikit-learn library (version 1.7.2) for LR, GBR, ANN, RF, and ETR models, xgboost library (version 3.0.5) for XGB model, and torch (version 2.8.0) for LSTM model.
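The forward-chaining cross-validation can be sketched as follows; this is a simplified reimplementation of the TimeSeriesSplit idea for illustration (sklearn's own fold sizing differs slightly when n is not evenly divisible):

```python
def time_series_splits(n, n_splits=5):
    """Yield (train, validation) index lists in which training data
    always precedes the validation block chronologically."""
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = list(range(0, k * fold))
        val_idx = list(range(k * fold, (k + 1) * fold))
        yield train_idx, val_idx
```

Unlike shuffled k-fold, every fold validates strictly on later data than it trains on, so a model can never "see the future" during hyperparameter selection.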

Model performance was assessed using four complementary metrics (including RMSE) providing comprehensive evaluation of predictive accuracy and hydrological fidelity:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Q_{obs,i} - Q_{pred,i}\right| \quad (14)$$
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(Q_{obs,i} - Q_{pred,i}\right)^2}{\sum_{i=1}^{n}\left(Q_{obs,i} - \bar{Q}_{obs}\right)^2} \quad (15)$$
$$\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + \left(\frac{\sigma_{pred}}{\sigma_{obs}} - 1\right)^2 + \left(\frac{\mu_{pred}}{\mu_{obs}} - 1\right)^2} \quad (16)$$

where r is the correlation coefficient between observed and predicted flows, σ denotes standard deviation, and µ represents mean values.

Mean Absolute Error (MAE) quantifies absolute prediction errors in original units (m³/s). Nash–Sutcliffe Efficiency (NSE) assesses model skill relative to a naive mean predictor, with values approaching 1 indicating excellent performance. Kling–Gupta Efficiency (KGE) provides a decomposed evaluation considering correlation, bias, and variability ratio, offering more balanced assessment than NSE alone. Together, these metrics enable comprehensive characterization of model strengths and weaknesses across different flow regimes and hydrological conditions.
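As defined in Eqs. (14)-(16), the three metrics reduce to a few lines each; a sketch assuming `obs` and `pred` are 1-D NumPy arrays:

```python
import numpy as np

def mae(obs, pred):
    """Eq. (14): mean absolute error in original units (m^3/s)."""
    return np.mean(np.abs(obs - pred))

def nse(obs, pred):
    """Eq. (15): skill relative to the naive mean-flow predictor."""
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, pred):
    """Eq. (16): combines correlation r, variability ratio, and bias ratio."""
    r = np.corrcoef(obs, pred)[0, 1]
    alpha = pred.std() / obs.std()
    beta = pred.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A prediction with perfect timing and variability but a constant positive bias scores KGE < 1 while keeping r = 1, illustrating why KGE exposes bias that correlation alone hides.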

Feature importance analysis

To interpret model predictions and identify key hydrological drivers, feature importance analysis was conducted using multiple approaches. For LR, standardized regression coefficients were used to quantify variable contributions. For XGB, built-in feature importance measures based on information gain or impurity reduction were employed. For ANN and LSTM, SHapley Additive exPlanations (SHAP) was applied to interpret the black-box predictions. SHAP is a unified approach to explain individual predictions based on game-theoretic Shapley values35. The SHAP value for each feature represents its contribution to the difference between the actual prediction and the average prediction across all samples. For a given prediction, the SHAP value of feature j is calculated as follows.

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}\left[f_{S \cup \{j\}}\left(x_{S \cup \{j\}}\right) - f_S\left(x_S\right)\right] \quad (17)$$

where F is the set of all input features, S is a subset of features excluding feature j, |S| and |F| represent the number of features in sets S and F respectively, $f_S(x_S)$ is the model prediction using only features in subset S, and $f_{S \cup \{j\}}(x_{S \cup \{j\}})$ is the prediction when feature j is added to subset S. The equation computes a weighted average of the marginal contributions of feature j across all possible feature combinations, ensuring fair attribution of prediction contributions among all features.

SHAP values satisfy three desirable properties including local accuracy (the sum of all feature SHAP values equals the difference between prediction and base value), missingness (features with no impact have zero SHAP values), and consistency (if a feature contributes more in one model, its SHAP value should not decrease). This method enables both global interpretation through aggregated SHAP values across all samples and local interpretation for individual predictions, which is particularly valuable for understanding model behavior during high-flow flood events.
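For a model with few features, Eq. (17) can be evaluated exactly by enumerating subsets. The code below is illustrative, using a hypothetical two-feature linear model; "absent" features are fixed at a baseline value, one common convention for defining $f_S$:

```python
from itertools import combinations
from math import factorial

def shap_value(f, x, baseline, j, n_features):
    """Exact Shapley value of feature j, Eq. (17), by enumerating every
    subset S of the remaining features."""
    others = [k for k in range(n_features) if k != j]
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                      / factorial(n_features))
            # marginal contribution of adding feature j to subset S
            phi += weight * (f(x, set(S) | {j}, baseline) - f(x, set(S), baseline))
    return phi

def linear_model(x, S, baseline):
    """f(x_S): evaluate y = 2*x1 + 3*x2 with features outside S held
    at their baseline values (a hypothetical toy model)."""
    coeffs = [2.0, 3.0]
    vals = [x[k] if k in S else baseline[k] for k in range(len(x))]
    return sum(c * v for c, v in zip(coeffs, vals))
```

For this linear model with a zero baseline, each feature's Shapley value equals its coefficient times its value, and the values sum to the prediction minus the base value, demonstrating the local accuracy property. Practical SHAP libraries approximate this sum rather than enumerating all 2^|F| subsets.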

Results

Overall model performance comparison

Four evaluation metrics (RMSE, MAE, NSE, and KGE) were employed to comprehensively evaluate model performance. Figure 2 presents results for all models on the test period (2005–2013). All seven models demonstrated satisfactory predictive capability with NSE exceeding 0.88 and KGE surpassing 0.83, indicating strong applicability of machine learning approaches in the Boluo Watershed.

Fig. 2 Comparison of predictive performance across seven machine learning models for daily streamflow prediction in the Boluo watershed.

LSTM achieved superior performance across all metrics, yielding the lowest RMSE (173 m³/s) and MAE (90 m³/s) while attaining NSE of 0.95 and KGE of 0.95. The explicit memory mechanism of LSTM architecture effectively captures the temporal dependencies inherent in streamflow processes. ANN ranked second with RMSE of 189 m³/s, MAE of 95 m³/s, NSE of 0.94, and KGE of 0.95, demonstrating strong capability in modeling nonlinear rainfall-streamflow relationships. Notably, LR exhibited unexpected competitiveness, ranking third with NSE of 0.94 and KGE of 0.93. Although LR’s RMSE (196 m³/s) and MAE (98 m³/s) were higher than LSTM’s, the differences remained moderate at 13.3% and 8.9%, respectively. The competitive performance of LR warrants particular attention. Rainfall-streamflow transformation is inherently nonlinear, involving threshold-dependent processes such as infiltration excess and saturation dynamics; thus linear models would be expected to substantially underperform nonlinear architectures. This unexpected finding can be attributed to the inclusion of antecedent flow variables that effectively linearize the prediction problem, the high streamflow autocorrelation at daily time scales that allows persistence-based predictions to capture substantial variance, and the relatively limited training data (20 years) that may constrain complex models from fully exploiting their representational capacity.

GBR and XGB exhibited intermediate performance levels, achieving NSE of 0.91 and 0.90 respectively. Tree-based models generate piecewise constant predictions that may inadequately approximate smooth streamflow variations, and the residual-fitting strategy prioritizes error reduction on frequent medium-flow samples, limiting high-flow performance. RF and ETR demonstrated relatively weaker performance with NSE of 0.89 and RMSE of 251 and 258 m³/s respectively. Potential explanations include oversmoothing effects inherent in ensemble averaging, where predictions tend to regress toward the mean of training samples rather than capturing extreme magnitudes. Additionally, the squared error loss function used during training assigns equal weight to all samples regardless of flow magnitude, causing the model to prioritize accuracy on the more frequent medium-flow observations at the expense of rare extreme events.

Examining the error metrics, RMSE ranged from 173 to 258 m³/s (relative difference of 49.1%), whereas MAE varied from 90 to 104 m³/s (relative difference of 15.6%). The substantially larger inter-model variability in RMSE compared to MAE indicates significant divergence in model capabilities for handling large errors, particularly flood peaks. Given RMSE’s higher sensitivity to outliers, its greater variability suggests that certain models (e.g., ETR and RF) may exhibit considerable bias in flood peak prediction. Regarding efficiency coefficients, NSE ranged from 0.89 to 0.95, and KGE from 0.83 to 0.95, both demonstrating consistently high performance levels. Notably, LSTM achieved the highest NSE (0.95) and exhibited balanced performance with minimal difference between KGE (0.95) and NSE (0.95), indicating superior performance across correlation, bias, and variability dimensions. In contrast, GBR and XGB displayed noticeably lower KGE than NSE (differences of 0.05 and 0.04, respectively), suggesting deficiencies in preserving statistical properties of hydrological time series despite satisfactory overall fit.

Synthesizing all four evaluation metrics, model performance ranked as follows: LSTM > ANN > LR > GBR > XGB > RF ≈ ETR. The top three models (LSTM, ANN, and LR) significantly outperformed others, with LSTM demonstrating that architectures incorporating explicit temporal memory mechanisms provide meaningful advantages for daily streamflow prediction in this watershed.

Scatter plot analysis and prediction accuracy

Figure 3 presents scatter plots comparing observed and predicted streamflow for all seven models. The 1:1 reference line represents perfect prediction; closer proximity of scatter points to this line indicates higher predictive accuracy.

Fig. 3.

Fig. 3

Scatter plots of observed versus predicted daily streamflow for seven machine learning models.

The LSTM model (Fig. 3g) exhibited the tightest scatter distribution among all models. Data points clustered closely along the 1:1 line with R² of 0.95, NSE of 0.95, and the lowest RMSE of 173.4 m³/s. Even in the high-flow region (> 5000 m³/s), predicted points maintained close adherence to the 1:1 line with minimal deviation. The explicit memory mechanism of the LSTM architecture effectively captures temporal dependencies in streamflow processes, enabling superior representation of antecedent hydrological conditions. The distribution confirms the robustness of LSTM for streamflow prediction.

The ANN model (Fig. 3c) demonstrated similarly excellent scatter characteristics with R² of 0.94, NSE of 0.94, and RMSE of 188.8 m³/s. Scatter points formed a narrow band along the 1:1 line across all flow ranges, with only slight underestimation in the high-flow region. The LR model (Fig. 3a) achieved equivalent R² and NSE values of 0.94, with RMSE of 196.5 m³/s. The scatter distribution exhibited a subtle fan-shaped pattern, with prediction error variance gradually expanding as streamflow magnitude increased. This represents typical heteroscedastic behavior of linear models confronting nonstationary hydrological data. However, the heteroscedasticity remained relatively mild and did not substantially compromise overall performance. Within the medium-to-low flow range (< 4000 m³/s), LR’s predictive accuracy was virtually indistinguishable from ANN, supporting the inference that the Boluo Watershed rainfall-streamflow relationship contains substantial linear components.

The GBR model (Fig. 3b) displayed noticeably increased scatter dispersion with R² and NSE of 0.91 and RMSE of 229.2 m³/s. The XGB model (Fig. 3f) presented similar characteristics with R² of 0.90 and RMSE of 237.9 m³/s, though with more uniform dispersion across flow ranges due to regularization mechanisms. The RF and ETR models (Fig. 3d and e) exhibited the most dispersed scatter distributions, both achieving R² of 0.89. Their RMSE values of 251.1 and 258.0 m³/s exceeded the optimal LSTM model by 45% and 49%, respectively. Critically, predictions in the high-flow region predominantly fell below the 1:1 line, indicating systematic underestimation of flood peaks.

All models demonstrated greater dispersion in high-flow regions, reflecting the inherent forecasting difficulty associated with extreme events. From a practical perspective, LSTM and ANN are the optimal choices for flood warning scenarios, while LR offers comparable accuracy with superior interpretability for operational applications requiring model transparency.

High-flow conditional performance evaluation

To quantitatively assess model performance under extreme hydrological conditions, conditional evaluation metrics were computed for flows exceeding the 90th, 95th, and 99th percentiles, corresponding to thresholds of approximately 1824 m³/s, 2568 m³/s, and 4521 m³/s respectively (Table 1).
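The conditional metrics can be reproduced by masking the series at the chosen percentile before scoring; a hedged sketch on synthetic, flow-like data follows (the thresholds differ from the watershed values quoted above).

```python
import numpy as np

def conditional_rmse(obs, sim, pct):
    """RMSE restricted to days whose observed flow exceeds the pct-th percentile."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    threshold = np.percentile(obs, pct)
    mask = obs > threshold   # keep only high-flow days
    return threshold, np.sqrt(np.mean((obs[mask] - sim[mask]) ** 2))

rng = np.random.default_rng(0)
obs = rng.gamma(shape=1.5, scale=400.0, size=3300)  # skewed, flow-like series
sim = obs + rng.normal(0, 50, obs.size)             # noisy predictions

thr90, rmse90 = conditional_rmse(obs, sim, 90)
thr99, rmse99 = conditional_rmse(obs, sim, 99)
```

Conditional NSE and KGE are computed the same way by applying the mask before the metric, which is how the P90/P95/P99 columns of Table 1 arise.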

Table 1.

Conditional performance metrics for different flow thresholds. P90, P95, and P99 represent flows exceeding the 90th, 95th, and 99th percentiles of the observed streamflow distribution. Bold values indicate the best-performing model for each metric under each flow condition.

Threshold  Metric  LR      GBR     ANN    RF      ETR     XGB     LSTM
P90        NSE     0.82    0.73    0.84   0.67    0.65    0.70    0.86
           KGE     0.84    0.70    0.89   0.66    0.63    0.68    0.92
           RMSE    543.2   666.8   518.1  739.1   762.6   696.2   473.9
           MAE     359.6   416.8   360.4  459.4   471.8   433.3   345.7
P95        NSE     0.71    0.54    0.75   0.43    0.39    0.49    0.80
           KGE     0.80    0.61    0.84   0.56    0.53    0.58    0.89
           RMSE    712.3   890.5   658.1  990.3   1025.2  938.9   592.4
           MAE     511.6   606.8   485.9  681.9   704.9   649.4   442.7
P99        NSE     −0.12   −1.59   0.19   −2.24   −2.54   −1.90   0.45
           KGE     0.73    0.51    0.76   0.37    0.36    0.50    0.67
           RMSE    1103.8  1677.7  937.1  1876.3  1962.1  1776.9  773.0
           MAE     891.1   1484.7  700.0  1665.9  1773.7  1585.1  629.4

Under the 90th percentile condition (330 samples), all models exhibited performance degradation compared to overall metrics, yet maintained acceptable predictive capability. LSTM achieved the best performance with NSE of 0.86 and RMSE of 473.9 m³/s, followed by ANN (NSE = 0.84, RMSE = 518.1 m³/s) and LR (NSE = 0.82, RMSE = 543.2 m³/s). Tree-based ensemble models showed notably weaker performance, with RF and ETR achieving NSE values of only 0.67 and 0.65 respectively.

Performance degradation became more pronounced under the 95th percentile condition (165 samples). LSTM maintained the highest NSE of 0.80 with RMSE of 592.4 m³/s, demonstrating robust high-flow prediction capability. ANN ranked second with NSE of 0.75, while LR achieved NSE of 0.71. Tree-based models exhibited substantial deterioration, with RF and ETR NSE values declining to 0.43 and 0.39 respectively, indicating limited capability for extreme flow prediction.

Under the most extreme 99th percentile condition (33 samples), model performance diverged dramatically. Only LSTM and ANN maintained positive NSE values of 0.45 and 0.19 respectively. LR exhibited a marginally negative NSE (−0.12), while all tree-based models showed severely negative NSE values ranging from −1.59 (GBR) to −2.54 (ETR), indicating predictions worse than the mean-value baseline. LSTM demonstrated remarkable robustness with RMSE of 773.0 m³/s, substantially outperforming all other models.

As hydrological events become increasingly extreme, prediction difficulty escalates substantially across all model architectures. Nevertheless, LSTM consistently maintained superior performance across all flow regimes, with explicit temporal memory mechanisms providing particular advantages under extreme conditions where conventional machine learning models exhibit catastrophic performance degradation.

Peak flow event prediction performance

Accurate flood peak prediction represents the most critical task in streamflow forecasting, directly influencing flood early warning, reservoir operations, and disaster mitigation decisions. To evaluate model performance during high-flow events, the three largest flood episodes from the test period were analyzed, with observed peaks of 7760 m³/s in June 2005, 7670 m³/s in July 2006, and 7620 m³/s in August 2013 (Fig. 4).

Fig. 4.

Fig. 4

Model performance during three major flood events with hydrograph and flow distribution analysis. The figure shows predictions for (a) June 2005, (b) July 2006, and (c) August 2013 flood events. Left panels display observed streamflow (black line with dots) versus model predictions (colored lines), while right panels show probability density distributions. Red text indicates observed peak streamflow values.

Prediction accuracy of peak magnitude

Most models exhibited systematic underestimation of flood peaks, though the degree varied substantially. LSTM consistently achieved the best peak predictions across all three events. For Event (a), with an observed peak of 7760 m³/s, LSTM predicted 7217 m³/s (7.0% underestimation), followed by LR at 7082 m³/s (8.7%) and ANN at 6647 m³/s (14.3%). Tree-based models showed severe underestimation: GBR predicted 4735 m³/s (39.0% underestimation), while RF, XGB, and ETR predictions ranged between 4350 and 4420 m³/s, with underestimation exceeding 43%.

Event (b), with an observed peak of 7670 m³/s, revealed interesting patterns. LSTM predicted 8241 m³/s, a slight overestimation of 7.4% and the only instance of overestimation among all model-event combinations. ANN achieved second-best performance at 6995 m³/s (8.8% underestimation). LR performance declined substantially, predicting 5931 m³/s (22.7% underestimation), comparable to GBR and XGB at approximately 22%. RF and ETR showed the largest errors, at 32.6% and 31.4% underestimation respectively.

For Event (c), with an observed peak of 7620 m³/s, all models exhibited notable underestimation. LSTM performed best at 6119 m³/s (19.7% underestimation), followed by ANN at 5900 m³/s (22.6%) and LR at 5846 m³/s (23.3%). ETR showed the poorest performance at 3620 m³/s (52.5% underestimation).

LSTM's superior peak prediction capability stems from its explicit memory mechanism, which enables effective retention of antecedent high-flow conditions and better representation of nonlinear amplification processes during flood events. Peak underestimation across most models arises from three factors: training sample imbalance, as high-flow events constitute a small fraction of the dataset; regression smoothing effects, which pull predictions toward training data means; and input information limitations, as critical factors such as soil saturation remain absent.
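The peak-error percentages above are simple relative differences; a minimal check using the Event (a) values reported in the text:

```python
obs_peak = 7760.0    # observed peak for Event (a), m³/s
lstm_peak = 7217.0   # LSTM prediction reported in the text, m³/s

underestimation_pct = (obs_peak - lstm_peak) / obs_peak * 100.0
print(round(underestimation_pct, 1))   # 7.0, matching the 7.0% in the text
```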

Prediction accuracy of peak occurrence timing

All models demonstrated generally accurate peak timing prediction, with most predicted peaks occurring within one day of the observed peaks. In Event (a), LSTM, LR, and ANN predicted peaks on June 24, one day after the observed peak on June 23, while tree-based models achieved exact temporal synchronization. In Event (b), all seven models demonstrated perfect timing accuracy, with predicted peaks coinciding exactly with the observed peak on July 17. Event (c) showed similar accuracy, with all models except ETR correctly identifying the peak date of August 19.

The robustness of peak timing prediction across all architectures indicates successful capture of rainfall-streamflow lag relationships. From a flood warning perspective, accurate peak timing can prove even more critical than magnitude, as advance warning provides invaluable lead time for disaster preparedness.
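Peak timing error reduces to comparing argmax positions of the observed and predicted hydrographs; a minimal sketch on synthetic daily series (not the study's events):

```python
import numpy as np

# Synthetic 7-day hydrographs: observed peak at index 3, predicted at index 4
obs = np.array([500., 900., 2400., 7760., 5100., 2900., 1500.])
sim = np.array([480., 850., 2100., 5200., 6100., 2700., 1400.])

lag_days = int(np.argmax(sim)) - int(np.argmax(obs))  # +1 → predicted one day late
```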

Morphological matching of flood hydrographs

Beyond peak magnitude and timing, overall flood hydrograph morphological matching proves equally important. The density distribution plots in Fig. 4 (right panels) reveal significant differences in flow distribution characteristics among model predictions. LSTM, ANN, and LR density curves closely resembled observed patterns, exhibiting comparable distribution frequencies across all flow intervals, indicating these models effectively preserve statistical characteristics throughout flood events. The LSTM density curve demonstrated the closest alignment with observed distributions, particularly in the high-flow regions where other models showed deficiencies. Conversely, RF and ETR density curves skewed markedly toward low-flow regions, with severe deficiencies in high-flow interval frequencies, consistent with systematic underestimation identified in scatter plot analysis.

For rising limb performance, LSTM demonstrated superior tracking capability, with curve slopes nearly parallel to observed values. Recession limb prediction proved universally more challenging, with predicted curves declining faster than observed values. LSTM maintained closer adherence during the gradual decline phase, owing to its ability to retain information from preceding time steps and thereby better represent slow-release processes from watershed storage.

Residual analysis

Residual analysis serves as a critical diagnostic tool for assessing model predictive quality and identifying systematic bias. An ideal model should generate residual sequences with zero mean, constant variance, and symmetric distribution. Figure 5 presents residual distribution characteristics, relationships between residuals and observed streamflow, and standardized residual statistics for all models.

Fig. 5.

Fig. 5

Residual analysis for the seven machine learning models. (a) Residual distributions with kernel density estimates. The vertical dashed line represents zero residual. (b) Residuals versus observed streamflow revealing heteroscedasticity. The horizontal dashed line indicates zero residual. (c) Normalized residual statistics heatmap showing standardized mean, standard deviation, skewness, and kurtosis. Red/blue colors represent positive/negative standardized values.

Regarding residual distribution patterns (Fig. 5a), all models exhibited approximately symmetric distributions centered around zero, indicating absence of significant systematic bias. However, important inter-model differences existed. LSTM exhibited the sharpest peak with narrowest distribution width, with most residuals concentrated within ± 150 m³/s, reflecting highly focused prediction errors. ANN displayed similar characteristics with residuals primarily within ± 200 m³/s. Conversely, RF and ETR residual distributions were notably flat and dispersed, with distribution tails extending beyond ± 1000 m³/s, suggesting higher frequencies of large prediction errors.

The residual-observed streamflow scatter plot (Fig. 5b) revealed pronounced heteroscedasticity across all models. Residuals exhibited distinct funnel-shaped dispersion patterns: in low-flow regions (< 2000 m³/s), residuals clustered tightly around the zero line; as streamflow increased, residual dispersion progressively expanded, with individual residuals exceeding ± 3000 m³/s in high-flow regions. This heteroscedasticity reflects inherent multiplicative uncertainty of streamflow processes. LSTM and ANN scatter clouds remained relatively compact, maintaining residuals primarily within ± 1000 m³/s even in high-flow regions. RF and ETR exhibited numerous outliers with residuals exceeding − 2000 m³/s. Notably, large residuals in high-flow regions were predominantly negative, confirming systematic underestimation of flood peaks.
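The funnel-shaped heteroscedasticity can be diagnosed by binning residuals by flow magnitude and comparing their spread; a minimal sketch with synthetic, multiplicative errors (not the study's residuals):

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.uniform(100, 8000, 3000)
residuals = rng.normal(0, 0.1 * obs)   # multiplicative (flow-dependent) error

bins = np.array([0, 2000, 4000, 8000])
idx = np.digitize(obs, bins)           # 1 = low flow, 3 = high flow
spread = [residuals[idx == k].std() for k in (1, 2, 3)]
# spread increases across bins: the funnel shape described in the text
```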

Standardized residual statistics (Fig. 5c) provided quantitative diagnostic information. LSTM achieved the lowest standardized mean of -1.27 and the smallest standard deviation of -1.42 among all models, indicating minimal systematic bias and highest error concentration. ANN exhibited standardized mean of 1.87 with standard deviation of -0.91, reflecting consistent underprediction bias particularly during high-flow events, aligning with flood peak underestimation identified in Sect. 3.4. ETR and RF displayed standardized standard deviations of 1.17 and 0.96 respectively, significantly exceeding other models and reflecting elevated predictive uncertainty. All models exhibited negative skewness indicating left-skewed distributions with tendency toward underestimation of high-flow events. LSTM showed highest standardized skewness of 1.47, suggesting its distribution most closely approximated symmetry. Kurtosis analysis indicated all models exhibited leptokurtic distributions with heavy tails. LSTM achieved lowest standardized kurtosis of -1.40, indicating most concentrated error distribution with fewer extreme outliers.
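The standardization behind a heatmap such as Fig. 5c can be sketched as follows: each shape statistic is computed per model and then z-scored across models so values are comparable on one color scale. The residual series here are synthetic stand-ins, not the study's data.

```python
import numpy as np

def shape_stats(r):
    """Mean, standard deviation, skewness, and excess kurtosis of residuals."""
    m, s = r.mean(), r.std()
    skew = np.mean((r - m) ** 3) / s ** 3
    kurt = np.mean((r - m) ** 4) / s ** 4 - 3.0
    return [m, s, skew, kurt]

rng = np.random.default_rng(7)
residuals = {                        # hypothetical per-model residual series
    "LSTM": rng.normal(0, 150, 2000),
    "RF": rng.normal(-30, 300, 2000),
}
raw = np.array([shape_stats(r) for r in residuals.values()])
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # z-score each statistic across models
```

After standardization, a low z-scored standard deviation (as reported for LSTM) marks the model with the most concentrated errors relative to its peers.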

From a practical perspective, heteroscedasticity significantly impacts uncertainty quantification. Flow-dependent uncertainty models employing quantile regression or weighted loss functions are recommended for operational applications. LSTM performed optimally across all diagnostic metrics, with the most concentrated error distribution, minimal systematic bias, and superior high-flow prediction stability.

Discussion

Feature importance and hydrological interpretation

Feature importance analysis reveals key hydrological variables controlling streamflow processes in the Boluo Watershed. Figure 6 presents evaluation results from four representative models (LR, XGB, ANN, and LSTM), where different algorithms quantify input variable contributions based on their inherent mechanisms. Recent studies demonstrate that Shapley values, permutation importance, and gradient-based methods constitute the most reliable feature importance assessment tools for hydrological applications36. For LR and XGB, standardized coefficients and built-in feature importance metrics were employed respectively, providing global importance rankings. For ANN and LSTM, SHAP analysis was applied, as neural networks lack inherent interpretability and SHAP provides both global importance rankings and local, instance-level explanations essential for understanding nonlinear model behavior during specific events such as flood peaks37.
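Among the attribution methods named above, permutation importance has a compact scikit-learn implementation; the sketch below uses synthetic predictors whose names merely echo the stations, and a gradient boosting model, so it is illustrative rather than the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n = 1500
lingxia_flow = rng.gamma(2.0, 300.0, n)   # dominant upstream signal (synthetic)
basin_rainfall = rng.gamma(1.5, 20.0, n)  # weaker rainfall signal (synthetic)
X = np.column_stack([lingxia_flow, basin_rainfall])
y = 1.1 * lingxia_flow + 3.0 * basin_rainfall + rng.normal(0, 30, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
# result.importances_mean ranks lingxia_flow well above basin_rainfall
```

Shuffling a feature column and measuring the score drop gives a model-agnostic global ranking, complementary to the local SHAP explanations used for the neural networks.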

Fig. 6.

Fig. 6

Feature importance analysis for (a) LR, (b) XGB, and (c) ANN. (d) SHAP value contributions for the maximum flood event, where the dashed line shows the base value; green bars indicate positive contributions and red bars indicate negative contributions.

The four models exhibited distinct importance rankings due to fundamental algorithmic differences. For LR (Fig. 6a), basin evaporation exhibited the highest standardized importance (0.264), followed by Dapeibu rainfall (0.160) and Danshui rainfall (0.158). However, this evaporation dominance likely reflects multicollinearity among predictors rather than true physical control. Basin evaporation correlates strongly with temperature and antecedent moisture conditions, and LR coefficients cannot disentangle these interdependencies, potentially inflating evaporation importance while suppressing correlated flow variables. This result highlights a fundamental limitation of linear models for interpreting complex hydrological drivers. In contrast, XGB (Fig. 6b) identified Lingxia flow station as absolutely dominant (0.373), nearly triple the second-ranked feature (Dapeibu rainfall, 0.112). Tree-based models capture nonlinear dependencies through hierarchical split decisions, allowing upstream flow to absorb predictive contributions that LR distributes across correlated variables. Research confirms that in watersheds influenced by human activities such as reservoir operations, antecedent flow feature importance increases significantly8. ANN SHAP analysis (Fig. 6c) corroborated the XGB finding, with Lingxia flow likewise dominating predictions (543.11), approximately 20 times higher than the second-ranked Baipenzhu flow (26.30). LSTM SHAP analysis (Fig. 6d) similarly identified Lingxia flow as dominant (281.32), followed by Basin rainfall (25.08) and Baipenzhu flow (22.15).

For the flood peak event shown in Fig. 6e and f (observed 7760 m³/s), instance-level SHAP contributions reveal model decision dynamics during extreme events. In ANN (predicted 6647 m³/s), T-1 day Lingxia flow contributed approximately 5241 m³/s positive forcing, while T-2 day Lingxia flow showed − 1937 m³/s negative contribution, indicating complex nonlinear temporal interactions within the neural network. In LSTM (predicted 7217 m³/s), T-1 day Lingxia flow contributed 4483 m³/s, with additional positive contributions from T-3 day Lingxia flow (253 m³/s) and T-1 day Basin rainfall (139 m³/s). The LSTM predictions more closely approached observed values, and its SHAP contributions demonstrated smoother temporal patterns without the strong negative feedback observed in ANN, demonstrating instance-level explanation capability that reveals neural network decision dynamics during extreme events.

From a physical mechanism perspective, the expected importance hierarchy for daily streamflow prediction should follow antecedent flow > rainfall > evaporation38. This memory effect indicates that current streamflow integrates antecedent precipitation, soil moisture, and groundwater contributions over preceding days to weeks, with antecedent flow comprehensively reflecting watershed storage conditions that determine rainfall-to-streamflow conversion efficiency6. XGB, ANN, and LSTM results conform to this expectation, with upstream flow dominating predictions. However, LR results diverge notably, ranking evaporation above both rainfall and upstream flow, reflecting statistical associations confounded by multicollinearity rather than causal hydrological relationships. Given that all three nonlinear models achieved superior predictive accuracy and produced physically consistent importance rankings, Lingxia Station flow is concluded to represent the most critical predictor for Boluo Watershed streamflow forecasting.

Regarding spatial patterns, both LR and nonlinear models indicate that midstream and downstream inputs contribute more than far upstream stations. XGB, ANN, and LSTM consistently identified Lingxia flow station, located at a critical midstream control section, as dominant. This downstream-weighted pattern reflects hydrological routing effects, where downstream signals more directly represent outlet conditions while upstream contributions become attenuated through channel storage and transmission losses. The relatively low importance of rainfall variables across all models reflects temporal scale effects. At daily prediction scales, previous-day flow already integrates multi-day cumulative rainfall, rendering single-day rainfall incremental information limited. Multiple studies demonstrate that at monthly or seasonal scales, meteorological factor importance rises significantly, whereas at daily scales, antecedent flow typically becomes the strongest predictor39. Additionally, individual rain stations inadequately represent watershed-wide precipitation distribution, whereas streamflow naturally incorporates all upstream inputs as an integrated response14.

In summary, feature importance analysis across four models consistently identifies Lingxia Station flow as the most critical predictor for Boluo Watershed streamflow forecasting. Improving prediction accuracy depends less on adding rainfall stations than on better characterizing watershed hydrological memory, such as introducing soil moisture observations or adopting architectures with explicit memory mechanisms. The LSTM model demonstrates particular advantage in this regard, exhibiting more balanced feature importance distribution and smoother temporal SHAP patterns compared to ANN, suggesting that explicit memory architectures provide both superior predictive accuracy and more physically coherent feature importance patterns for hydrological applications.

Sensitivity analysis of input feature configurations

To evaluate the relative contribution of different input feature groups and demonstrate model robustness across varying input configurations, sensitivity experiments were conducted by systematically excluding specific feature categories from LSTM model inputs. Four experimental scenarios were examined: (1) full features including all inputs, (2) excluding upstream flow stations (Lingxia and Baipenzhu stations), (3) rainfall-only inputs excluding flow and meteorological variables, and (4) excluding all rainfall station inputs. This experimental design directly addresses the physical interpretability of model predictions and provides insights into the hydrological information content of different input variables.

The sensitivity analysis results revealed striking performance differences across input configurations (Fig. 7). The full-feature LSTM achieved optimal performance with NSE of 0.95, KGE of 0.95, and RMSE of 173.44 m³/s. Excluding upstream flow stations caused dramatic performance degradation, with NSE declining to 0.62 (34.7% reduction), KGE declining to 0.55 (42.2% reduction), and RMSE increasing to 475.83 m³/s (174.4% increase). The rainfall-only configuration exhibited the poorest performance among all scenarios, achieving NSE of only 0.52 and RMSE of 533.41 m³/s. In contrast, excluding rainfall inputs caused relatively minor performance reduction, with NSE declining marginally from 0.95 to 0.92 (3.0% reduction) and RMSE increasing from 173.44 to 217.03 m³/s (25.1% increase). This finding reflects the strong watershed memory effect whereby previous-day streamflow integrates cumulative information from antecedent precipitation, soil moisture conditions, and groundwater contributions over preceding days to weeks40. The poor performance of rainfall-only inputs (NSE = 0.52) demonstrates that rainfall variables alone cannot adequately capture the complex rainfall-streamflow transformation processes governing watershed response. The relatively modest impact of rainfall exclusion suggests that at daily prediction scales, upstream flow stations already incorporate rainfall information through natural hydrological integration processes, rendering direct rainfall inputs partially redundant.
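The exclusion experiment can be sketched as a loop over feature subsets scored with NSE; here a gradient boosting model stands in for the paper's LSTM, and all data are synthetic, so only the qualitative pattern carries over.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def nse(obs, sim):
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(5)
n = 2000
features = {
    "upstream": rng.gamma(2.0, 300.0, n),   # dominant antecedent-flow signal
    "rainfall": rng.gamma(1.5, 20.0, n),    # weaker rainfall signal
}
y = (1.2 * features["upstream"] + 4.0 * features["rainfall"]
     + rng.normal(0, 40, n))

scores = {}
for name in ("full", "no_upstream", "no_rainfall"):
    keep = [k for k in features if not name.endswith(k)]  # drop the named group
    X = np.column_stack([features[k] for k in keep])
    model = GradientBoostingRegressor(random_state=0).fit(X[:1500], y[:1500])
    scores[name] = nse(y[1500:], model.predict(X[1500:]))
# scores["no_upstream"] collapses while scores["no_rainfall"] barely drops
```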

Fig. 7.

Fig. 7

LSTM prediction performance across four input feature configurations (Full Features, Without Upstream Flow, Rainfall Only, and Without Rainfall): (a) NSE, (b) KGE, (c) RMSE, and (d) MAE.

For operational forecasting systems where upstream flow station data may be unavailable or unreliable, substantial accuracy reduction should be anticipated. Conversely, the robustness demonstrated upon rainfall exclusion suggests that streamflow prediction systems can maintain reasonable accuracy even with incomplete meteorological observations, provided reliable upstream flow measurements are available. This finding supports the conclusion from feature importance analysis that improving prediction accuracy depends primarily on characterizing watershed hydrological memory through antecedent flow information rather than expanding rainfall monitoring networks.

Limitations and future directions

Despite achieving promising predictive results, this study exhibits several limitations requiring future improvement.

First, models demonstrate significant deficiencies in high-flow event prediction and physical consistency. As shown in Sect. 3.4, most models underestimated peak magnitudes across the three largest flood events, potentially causing severe consequences for flood early warning and control operations. This primarily stems from training sample imbalance, as high-flow events constitute a minimal dataset fraction, causing model optimization to prioritize medium and low-flow fitting accuracy41. Additionally, models failed to capture dual-peak floods and secondary peak phenomena, indicating limitations in responding to complex rainfall sequences. Although LSTM achieved highest predictive accuracy, its black-box nature restricts mechanistic understanding. Future research should employ weighted loss functions, quantile regression, or specialized high-flow value modeling methods to enhance high-flow prediction capability42–45. Furthermore, Physics-Informed Machine Learning (PIML), which embeds physical constraints into model architecture, offers promising directions46. Maharjan et al. (2025) compared process-based SWAT, data-driven LSTM, and physics-informed LSTM models for streamflow prediction in a snow-fed catchment, demonstrating that the physics-informed approach integrating melt index and precipitation-phase constraints provided the most robust performance with minimized bias47. Although the subtropical monsoon climate of the Boluo Watershed differs from snow-dominated systems, incorporating analogous physical constraints related to soil moisture dynamics, evapotranspiration processes, and reservoir operation rules represents a valuable avenue for improving physical rationality while maintaining predictive accuracy.

Second, feature engineering and uncertainty quantification limitations constrain model performance and decision-support value. Section 4.1 revealed upstream flow dominance with relatively low rainfall variable importance, potentially reflecting inadequate input feature design. Current models employ only daily total rainfall, neglecting critical characteristics including intensity, duration, and temporal distribution, while excluding variables reflecting antecedent watershed conditions such as soil moisture and groundwater levels. Research demonstrates that antecedent soil moisture decisively influences streamflow response lag time and peak magnitude40; its absence may impair high-flow prediction accuracy. Furthermore, current predictions provide only point estimates without prediction intervals or probability distributions. Section 3.5 revealed that prediction errors increase substantially with flow magnitude; this flow-dependent uncertainty requires explicit representation. Future research should explore multi-source data fusion including remote sensing soil moisture and reanalysis datasets to construct comprehensive input feature systems, while Bayesian deep learning, quantile regression, or ensemble methods should be explored to provide reliable confidence intervals for risk-informed decision support8.
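One of the recommended techniques, quantile regression, yields flow-dependent prediction intervals directly; a minimal sketch with scikit-learn's pinball-loss gradient boosting on synthetic heteroscedastic data (not the study's configuration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(9)
X = rng.uniform(0, 8000, (2000, 1))
y = X.ravel() + rng.normal(0, 0.15 * X.ravel() + 1.0)  # error grows with flow

# Fit the 5th and 95th conditional quantiles to bracket a 90% interval
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05,
                                  random_state=0).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                  random_state=0).fit(X, y)

x_new = np.array([[1000.0], [7000.0]])
width = upper.predict(x_new) - lower.predict(x_new)
# the interval is much wider at 7000 m³/s than at 1000 m³/s
```

Because each quantile model minimizes its own pinball loss, the interval width adapts to the flow-dependent error variance, which point estimates cannot convey.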

Third, spatial and temporal generalizability and operational deployment require enhancement. This study evaluated performance only at the single outlet of the Boluo Watershed. Model behavior in upstream sub-watersheds, adjacent watersheds, or different climatic regions remains unclear. Transfer learning research demonstrates that source watershed-trained models can significantly improve target watershed predictive accuracy48. Although models exhibited excellent offline evaluation performance, real-time forecasting capability remains untested. Developing lightweight architectures, optimizing inference speed, and constructing automated data processing workflows constitute essential steps for translating research outcomes into operational systems49. Future research should integrate watershed infrastructure operational data and climate change scenarios to evaluate model applicability under non-stationary conditions.

Conclusions

This study systematically evaluated seven machine learning models for daily streamflow prediction in the Boluo Watershed, a subtropical monsoon region in southern China.

LSTM achieved superior accuracy (NSE = 0.95, KGE = 0.95, RMSE = 173 m³/s), demonstrating that explicit temporal memory mechanisms provide meaningful advantages for capturing watershed hydrological dynamics. ANN ranked second (NSE = 0.94), while Linear Regression demonstrated unexpected competitiveness (NSE = 0.94), suggesting substantial linear components attributable to strong flow autocorrelation at daily time scales. Ensemble tree models showed intermediate performance (NSE = 0.89 to 0.91). High-flow conditional evaluation revealed striking performance divergence under extreme conditions. At the 99th percentile, only LSTM and ANN maintained positive NSE values of 0.45 and 0.19 respectively, while tree-based models exhibited severely negative NSE values ranging from −1.59 to −2.54. All models systematically underestimated flood peaks, with LSTM achieving the smallest underestimation of 7 to 20% compared to 30 to 50% for tree-based models.

Feature importance analysis revealed that upstream Lingxia station dominated predictions across nonlinear models, reflecting strong watershed memory effects. Sensitivity analysis confirmed the hierarchical importance of input variables, with upstream flow exclusion causing dramatic performance degradation (NSE declining from 0.95 to 0.62) while rainfall exclusion caused only minor reduction (NSE declining to 0.92), indicating that improving prediction accuracy depends primarily on characterizing watershed hydrological memory rather than expanding rainfall monitoring networks.

This study advances hydrological prediction methodology by demonstrating that explicit temporal memory mechanisms provide substantial advantages for streamflow prediction, feature importance rankings are algorithm-dependent, and simpler models can achieve comparable accuracy to complex architectures in appropriate contexts. Integration with physical process understanding and uncertainty quantification remains essential for reliable operational applications.

Author contributions

Z.Z. and Y.X. designed the study and developed the methodology. Y.X. collected and processed the data, implemented the machine learning models, and performed the analyses. R.C. and K.L. contributed to model optimization and validation. Y.X. and Z.Z. wrote the main manuscript text. H.D. prepared the figures and tables. Z.Z., J.L., and Z.Z. provided critical review and revision of the manuscript. All authors reviewed and approved the final manuscript.

Funding

This research was funded by the 2025 High-Level Talent Cultivation Program of Zhaoqing University (Grant No. gcc202512).

Data availability

The data that support the findings of this study are available from the Guangdong Dongjiang River Basin Management Bureau but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the Guangdong Dongjiang River Basin Management Bureau.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



