Scientific Reports. 2025 Oct 30;15:38031. doi: 10.1038/s41598-025-21926-2

Improving stage-discharge relationship modeling accuracy using a hybrid ViT-CNN framework

Hajar Feizi 1,4, Mohammad Taghi Sattari 1,2,4, Adam Milewski 3
PMCID: PMC12575754  PMID: 41168348

Abstract

Predicting river flow is one of the key issues in hydrological modeling, which is particularly important in applications such as managing and controlling floods. Water resource engineers use historical observational data of river flow to establish a relationship between discharge and water level, referred to as the stage-discharge relationship or rating curve (RC). In this study, deep learning methods, including the Vision Transformer (ViT) and Convolutional Neural Network (CNN), were used to model the stage-discharge relationship in the Nahand River. The results from these models were compared with a novel hybrid method known as ViT-CNN. To optimize the input of these models, the Vector AutoRegression (VAR) method was used, in which a one-time-step delay for discharge and stage was selected as the model inputs. This selection of inputs was based on time-series analysis, which enables the models to simulate the complexity of the flow as accurately as possible. The results showed that among the evaluated methods, the ViT-CNN hybrid method achieved the best performance in predicting flow discharge, with evaluation criteria of CC = 0.983, NSE = 0.962, RMSE = 0.178, and MAE = 0.071. The results of this study demonstrate the utility of deep learning to further enhance the predictability of stage-discharge relationships in rivers worldwide.

Keywords: Convolutional neural network, Deep learning, Stage-discharge, Time-series, Vector autoregression, Vision transformer

Subject terms: Environmental sciences, Hydrology

Introduction

Accurate forecasting of river flow is essential for efficient planning and operation of water resource systems. This issue has always been one of the main concerns of water researchers; therefore, many efforts have been made to develop and improve different techniques for predicting river flow1. In rivers with dynamic and complex behavior, direct measurement of flow rate is often difficult or impossible. In such a situation, one of the most common methods of estimating flow rate in natural streams is the use of flow rating curves, which are obtained by fitting the observed stage-discharge data to polynomial regression functions2. However, polynomial equations that define the stage-discharge relationship are often unable to accurately predict peak flow values. Additionally, stage-discharge observations are typically conducted manually and during the day, while flood peaks often occur at night. This discrepancy increases the uncertainty in discharge estimation3. Recently, artificial intelligence techniques have been considered to more accurately predict the stage-discharge relationship at different time scales. Artificial intelligence methods are among the data mining tools that use different techniques to extract hidden patterns and rules in the data. The use of these methods significantly reduces the costs of field measurements4. These methods can better model complex nonlinear relationships between input and output variables and, as a result, provide more accurate predictions3. So far, artificial intelligence methods based on machine learning (ML) have been used by many researchers to estimate the stage-discharge relationship. Al-Aboodi et al.5 applied support vector machines (SVM), radial basis function neural networks (RBF), and decision tree forest (DTF) methods to find the best relationship between discharge and stage in a river in Iraq. Various combinations of time-step delays in discharge and stage were considered as model inputs.
The results indicated that the performance of the SVM model was slightly better than the other two models in estimating the stage-discharge relationship in the study area. Xu et al.6 investigated water storage changes in Poyang Lake, the largest freshwater lake in China, using multi-mission satellite data and hydrological models. The results indicated that combining GRACE data with global models may not be sufficient for accurately estimating water storage changes. This study highlights the importance of integrated approaches in monitoring dynamic lakes. Hasan et al.7 examined changes in total water storage (TWS) in the Nile River Basin during the twentieth and twenty-first centuries using GRACE/GRACE-FO data and modeling. The results indicated that between 2002 and 2020, the basin transitioned to wetter conditions, while projections for 2021 to 2050 show a 10% to 30% decrease in water storage. This study highlights the importance of water resource planning, especially during flood seasons. Shukla et al.8 modeled the stage-discharge relationship using 12 years of daily data. For this purpose, they employed artificial neural network (ANN), adaptive neuro-fuzzy inference system (ANFIS), and wavelet artificial neural network (WANN) methods. The results showed that the performance of the ANFIS model for predicting discharge in the study area is better than the models based on ANN and WANN. Aquil and Ishak9 compared various machine learning models for forecasting reservoir water levels. The results showed that the VARMAX model had the highest accuracy in simulating the seasonal components of the data, while autoregressive integrated moving average (ARIMA) performed poorly in the presence of these components. This study emphasizes the importance of selecting the appropriate model for accurate reservoir water level forecasting. 
Umar et al.10 modeled the stage-discharge relationship using the generalized reduced gradient (GRG) optimization technique and traditional regression methods. The results showed that the GRG method had higher accuracy in predicting discharge at eight Jhelum River gauging stations and exhibited a high correlation with observed discharge, especially during flood conditions. Maghrebi & Vatanchi11 employed machine learning models, including multiple linear regression (MLR), for river flow estimation. The results showed that linear models such as MLR and ANFIS performed better under steady flow conditions, while nonlinear models like SVR_RBF were more suitable for unstable flows. Idowu et al.12 introduced a new method called range-dependent multivariate adaptive regression splines–genetic algorithm (RD-MARS-GA) for monthly streamflow prediction and compared its results with standard models such as Gaussian process regression and the M5 model tree. The results showed that the proposed model significantly improved the accuracy of river flow prediction compared to the other models. Agaj et al.13 used monthly time series data over a 10-year period to predict the water level of the Morava e Binçës River. In this study, ARIMA and exponential smoothing (ETS) models were employed, with 9 years of data allocated for training and 1 year for testing. The results showed that both models were able to predict the river’s water level with high accuracy. Sharma et al.14 used three different methods including ANN, SVM, and ANFIS in a study to model the stage-discharge relationship in a Himalayan River. The results of this study showed that the ANFIS method, based on various evaluation criteria, has a better performance than other methods and was able to achieve higher accuracy in discharge prediction. Kisi et al.15 used machine learning methods, including multilayer neural networks (MLNN), RBNN, and ANFIS to simulate daily average discharge data in Turkey. 
Discharge and stage delays over three-time steps were considered as input data based on various scenarios. Among the used methods, the MLNN method showed better performance than the others in estimating discharge.

Recently, deep learning techniques have attracted a lot of attention as a more advanced subset of machine learning and have outperformed other machine learning techniques in many fields. These approaches involve employing artificial neural network architecture, which includes numerous processing layers16. In fact, deep learning methods are representational learning techniques that are achieved by using multiple layers and combining simple but non-linear modules. Unlike traditional machine learning methods, which require feature engineering, deep learning has the capability to automatically learn relevant features from raw data. This ability contributes to higher accuracy and better model performance in complex problems. Due to the ability of this technology to discover complex structures in data, it has been widely used in many fields17. Despite the significant advances of deep learning in various fields, this technology still faces significant challenges. The need for a large amount of data, limitations in the generalization of learning, complexity, and lack of transparency are some of the most important challenges of these methods18. Due to the complexity of the relationships between different variables in hydrology, recently deep learning techniques have been increasingly used in hydrology-related studies19. Among deep learning techniques, convolutional neural networks (CNN) and vision transformers (ViT) are notable. These two techniques are often used for image datasets, but due to their capacity to model complex relationships and discover nonlinear patterns in the data, they are also used in hydrological problems.

Van et al.20 used the CNN method with two-layer convolution to model rainfall-runoff and compared the results with the LSTM method and traditional methods. The evaluation of the results showed that the CNN method has a better performance than other methods in modeling rainfall-runoff in the study area. In addition, the CNN method is able to learn the dependence between time series without using long time series. Shu et al.21 used the CNN method to predict the monthly flow of a reservoir in China. The results showed that CNN performs better than other methods such as ANN and extreme learning machine (ELM) in all statistical criteria. The authors concluded that CNN can automatically extract important features from multiple inputs, which is a distinct advantage compared to other AI models. Zhang et al.22 used a hybrid of CNN and long short-term memory (CNN-LSTM) method to forecast the downstream water level of a reservoir. The results indicated that the downstream water levels predicted by the proposed model were very close to the actual water levels. Xu et al.23 developed a CNN network to predict discharge in the Zhexi Reservoir. The results showed that the CNN method, with a correlation coefficient exceeding 0.9, demonstrated high reliability for predicting the non-periodic flow of the reservoir. Jahanbakht et al.24 used the ViT method to predict sediment, with the aim of enhancing water quality management programs, and then compared the results with actual sediment data. The results showed that the proposed model could accurately predict sediment levels. Taccari et al.25 used the combination of U-Net and ViT methods for the numerical modeling of groundwater and compared its results with the Fourier neural operator (FNO) method. The results showed that the combined U-Net + ViT method is more accurate and efficient than the FNO method, especially in scenarios with sparse data.
Zeng et al.26 used time–frequency spectrograms as a visual representation of time series data to predict time series, including temperature, and used the ViT method to model the data. The results of this method were compared with traditional methods such as ARIMA. The results showed that the use of spectrograms with the ViT method provided significantly better accuracy and performance in time-series modeling. Zhen and Bărbulescu27 used three different hybrid methods, including CNN-LSTM, sparrow search algorithm with backpropagation neural networks (SSA-BP), and particle swarm optimization with extreme learning machines (PSO-ELM), to predict monthly river water discharge in Romania and compared the models' performance based on various statistical metrics. The evaluation of the models showed that the CNN-LSTM method has higher computational efficiency and prediction accuracy than the other methods. Suresh et al.28 used dynamic programming (DP) and ViT models for groundwater quality analysis. The results showed a 5–10% improvement in evaluation accuracy, confirming the high efficiency and accuracy of the proposed method for groundwater quality assessment.

Achite et al.29 employed various deep learning methods for monthly streamflow time series forecasting and compared them with the CNN–recurrent neural network (RNN) hybrid approach. The results indicated that among the methods used, the CNN-RNN algorithm demonstrated the best performance. Shekar et al.30 employed six different artificial intelligence models for rainfall-runoff modeling and compared their results with the CNN-RNN hybrid approach. The findings revealed that among the methods used, the CNN-RNN algorithm exhibited the best performance in both training and testing phases, demonstrating high effectiveness in rainfall-runoff analysis. Ougahi and Rowan31 predicted snowmelt runoff using various machine learning and deep learning techniques and compared their results with the CNN-LSTM hybrid approach. The findings indicated that integrating these methods significantly enhanced prediction accuracy and proved to be more reliable than other approaches. Sadeghi et al.32 utilized CNN and ViT methods for groundwater potential mapping. The results indicated that both models demonstrate high capability in groundwater modeling, with ViT showing a relative advantage.

A review of the literature indicates that CNN and ViT methods have not yet been used to estimate the stage-discharge relationship. However, CNN-based models have been increasingly applied in various hydrological subfields, such as flood forecasting20, sediment transport prediction24, streamflow prediction29, and groundwater modeling25,32. Therefore, the purpose of this study is to implement and evaluate these two methods for estimating the stage-discharge relationship in the Nahand River. Additionally, the results of these methods will be compared with the proposed hybrid ViT-CNN approach. To optimize the number of time-series delays as inputs for these models, the VAR method was used with two criteria, AIC and SIC. Using this method causes the model to provide the highest performance with the least complexity. These approaches have proven effective in previous studies, such as Ozcicek and McMillin33, where the use of AIC and SIC in VAR models demonstrated improved predictive accuracy with reduced model complexity.

Materials and methods

Study area and data

In this study, daily data from the Nahand River for the period 2007 to 2021 were used to estimate the stage-discharge relationship. Daily stage and discharge data were obtained from the East Azerbaijan Regional Water Authority. Missing daily values for each month were imputed using the long-term mean of the same month computed over 2007–2021. The Nahand River watershed has an area of 169 square kilometers and is located in East Azerbaijan Province, Iran (Fig. 1). East Azerbaijan Province is classified as a mountainous and semi-arid region, characterized by cold winters and warm summers. According to long-term climatological data, the average annual temperature is approximately 12 °C, while the mean annual precipitation is around 288 mm (Kazemi34). The Nahand River is one of the main branches of the Ajichay River and, because of its role in supplying drinking water to Tabriz, a clay dam has been constructed on it (Mahdavi Asl, 2020).

Fig. 1. Location of the study area.

Table 1 presents the statistical characteristics of the data, and Fig. 2 shows the time series of the discharge and stage data. For modeling purposes, the CNN, ViT, and hybrid ViT-CNN methods were employed. Seventy percent of the data was used for training and 30% for testing. The model inputs include time-series delays of discharge and stage, with the number of delays determined using the Vector Autoregression method. All models were implemented in the Python programming environment using the NumPy, scikit-learn, Matplotlib, and TensorFlow libraries on the Google Colab platform.
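The input construction and data split described above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function names `make_lagged_dataset` and `chronological_split` and the toy series are hypothetical, and the split is taken to be chronological (earlier records for training), which is how the paper later describes it.

```python
import numpy as np

def make_lagged_dataset(stage, discharge, n_lags=1):
    """Build inputs [Stage(t), Stage(t-1..t-n), Discharge(t-1..t-n)] and target Discharge(t)."""
    stage = np.asarray(stage, dtype=float)
    discharge = np.asarray(discharge, dtype=float)
    t = np.arange(n_lags, len(stage))            # time steps that have a full lag history
    columns = [stage[t]]                         # contemporaneous stage
    columns += [stage[t - k] for k in range(1, n_lags + 1)]
    columns += [discharge[t - k] for k in range(1, n_lags + 1)]
    return np.column_stack(columns), discharge[t]

def chronological_split(X, y, train_fraction=0.7):
    """Keep time order: the first 70% of samples train the model, the rest test it."""
    n_train = int(round(train_fraction * len(X)))
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Toy series (hypothetical values, not the Nahand River record)
stage = np.array([0.02, 0.03, 0.05, 0.08, 0.06, 0.04, 0.03, 0.05, 0.07, 0.09, 0.06])
discharge = np.array([0.00, 0.01, 0.04, 0.20, 0.10, 0.03, 0.01, 0.05, 0.12, 0.30, 0.09])

X, y = make_lagged_dataset(stage, discharge, n_lags=1)
X_train, X_test, y_train, y_test = chronological_split(X, y)
```

With `n_lags=1`, each row of `X` holds Stage (t), Stage (t-1), and Discharge (t-1), which corresponds to the three-variable input set evaluated later in the paper.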

Table 1.

Statistical characteristics of daily stage and discharge.

Variables Min Max Mean Range Median Standard Deviation (σ)
Stage (m) 0.020 0.290 0.082 0.270 0.060 0.058
Discharge (m³/s) 0.000 4.410 0.063 4.410 0.243 0.907

Fig. 2. Time series plot of stage and discharge datasets.

Vector autoregression method for determining time series lags

Selecting an appropriate number of lags for time-series analysis is very important. Choosing too many lags can lead to reduced model efficiency and excessive complexity, while selecting fewer lags than necessary may result in model underfitting. There are numerous methods for determining the optimal number of lags, one of which is the Vector Autoregression (VAR) method. This method is considered one of the most widely used time-series models. Among the criteria used to select the number of lags in VAR models are the Akaike Information Criterion (AIC) and the Schwarz Information Criterion (SIC)35.

$$\mathrm{AIC}(P) = \ln\left|\hat{\Sigma}(P)\right| + \frac{2PN^{2}}{T} \tag{1}$$

$$\mathrm{SIC}(P) = \ln\left|\hat{\Sigma}(P)\right| + \frac{PN^{2}\ln T}{T} \tag{2}$$

In Eqs. (1) and (2), T is the effective sample size and $\hat{\Sigma}$ is the quasi-maximum likelihood estimate of the innovation covariance matrix Σ. N is the number of variables and P is the maximal lag. Prior to implementing the VAR model, the stationarity of the time series must be evaluated using the Augmented Dickey–Fuller (ADF) test. In cases where non-stationarity is detected, differencing is applied to render the series stationary; otherwise, no differencing is required36.
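To make the lag-selection procedure concrete, the following NumPy sketch fits a VAR(p) by ordinary least squares for each candidate lag and evaluates both criteria. It is a minimal illustration, not the authors' code: the function name `var_information_criteria` is hypothetical, the penalty terms follow the common textbook form (log-determinant of the residual covariance plus a penalty proportional to p·N²), and the exact expressions used in the paper may differ. The ADF stationarity check is not included.

```python
import numpy as np

def var_information_criteria(data, max_lag):
    """Return {p: (AIC, SIC)} for VAR(p), p = 1..max_lag, fitted by OLS.

    data: array of shape (n_obs, n_vars). All orders are fitted on the same
    effective sample (the first max_lag rows are dropped) so the criteria
    are comparable across p.
    """
    data = np.asarray(data, dtype=float)
    n_obs, n_vars = data.shape
    T = n_obs - max_lag                               # effective sample size
    Y = data[max_lag:]
    results = {}
    for p in range(1, max_lag + 1):
        lagged = [data[max_lag - k : n_obs - k] for k in range(1, p + 1)]
        X = np.hstack([np.ones((T, 1))] + lagged)     # intercept + p lags of every variable
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ coef
        sigma_hat = resid.T @ resid / T               # ML-type estimate of innovation covariance
        _, logdet = np.linalg.slogdet(sigma_hat)
        n_params = p * n_vars ** 2                    # lag coefficients only (assumed penalty form)
        aic = logdet + 2.0 * n_params / T
        sic = logdet + np.log(T) * n_params / T
        results[p] = (aic, sic)
    return results

# Toy check on a simulated two-variable VAR(1) process (hypothetical coefficients)
rng = np.random.default_rng(0)
A = np.array([[0.6, 0.2], [0.1, 0.5]])
series = np.zeros((2000, 2))
for t in range(1, len(series)):
    series[t] = series[t - 1] @ A.T + rng.normal(scale=0.1, size=2)

criteria = var_information_criteria(series, max_lag=5)
best_aic = min(criteria, key=lambda p: criteria[p][0])
best_sic = min(criteria, key=lambda p: criteria[p][1])
```

The lag with the smallest criterion value is selected. SIC penalizes extra lags more heavily than AIC, so when the two agree, as reported for this study, the choice is less sensitive to the criterion used.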

Convolutional neural network

The field of machine learning has grown remarkably in complexity with the advent of artificial neural networks (ANNs). One form of ANN architecture is the convolutional neural network, which is built using multiple layers. These layers, including the input layer, convolution layer, pooling layer, and fully connected (dense) layer, are sequentially stacked to form a complete convolutional network structure37. Each layer processes a three-dimensional input and transforms it into a three-dimensional output using a differentiable function. The general steps of a CNN can be described as follows:

  1. The input layer contains the raw input values.

  2. The convolution layer is responsible for extracting features from the input data.

  3. The pooling layer reduces the input dimensions and primarily helps the convolutional network produce a score vector.

  4. The flatten layer is placed before the fully connected layer and converts multi-dimensional data into a one-dimensional vector, which is essential to prepare the data for processing by fully connected layers.

  5. The fully connected (FC) layer calculates the final scores of the network.

  6. To reduce overfitting in CNN models, the dropout operator is used, which should be placed between FC layers during the training process to enable the network to adapt to different architectures and sets of neurons38.

The proposed CNN architecture is shown in Fig. 3.
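The six steps listed above can be traced in a small NumPy forward pass. This is a sketch under assumptions, not the architecture of Fig. 3: the layer sizes (kernel width 2, 4 filters, 8 dense units), the random weights, and the names `conv1d_relu`, `max_pool1d`, and `cnn_forward` are all hypothetical. Only the ReLU activation and the dropout rate of 0.3 are taken from the paper's hyperparameter table.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution followed by ReLU. x: (length, in_ch); kernels: (k, in_ch, out_ch)."""
    k = kernels.shape[0]
    windows = [x[i:i + k] for i in range(x.shape[0] - k + 1)]
    out = np.stack([np.tensordot(w, kernels, axes=([0, 1], [0, 1])) for w in windows])
    return np.maximum(out + bias, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the length axis."""
    n = x.shape[0] // size
    return x[: n * size].reshape(n, size, x.shape[1]).max(axis=1)

def cnn_forward(sample, params, dropout_rate=0.3, training=False, rng=None):
    x = np.asarray(sample, dtype=float).reshape(-1, 1)        # 1. input: (3 features, 1 channel)
    x = conv1d_relu(x, params["conv_w"], params["conv_b"])    # 2. convolution -> (2, 4)
    x = max_pool1d(x, size=2)                                 # 3. pooling -> (1, 4)
    x = x.reshape(-1)                                         # 4. flatten -> (4,)
    x = np.maximum(x @ params["fc_w"] + params["fc_b"], 0.0)  # 5. fully connected layer
    if training:                                              # 6. dropout, active only in training
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(x.shape) >= dropout_rate
        x = x * keep / (1.0 - dropout_rate)                   # inverted dropout keeps the expected scale
    return float(x @ params["out_w"] + params["out_b"])       # single regression output: Discharge(t)

rng = np.random.default_rng(0)
params = {
    "conv_w": rng.normal(scale=0.5, size=(2, 1, 4)),
    "conv_b": np.zeros(4),
    "fc_w": rng.normal(scale=0.5, size=(4, 8)),
    "fc_b": np.zeros(8),
    "out_w": rng.normal(scale=0.5, size=8),
    "out_b": 0.0,
}
sample = np.array([0.08, 0.06, 0.20])   # Stage(t), Stage(t-1), Discharge(t-1), toy values
estimate = cnn_forward(sample, params)
```

In practice the weights would be learned by minimizing the mean squared error with an optimizer such as Adam; the sketch only shows how the data shape changes from layer to layer.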

Fig. 3. The architecture of the proposed convolutional neural network.

Vision transformer

Transformers, originally applied in the field of natural language processing (NLP), are a type of deep neural network mainly based on the self-attention mechanism and have recently been used for various applications such as text, image, and time series (Zeynali et al., 2023). Transformer-based models have similar or even better performance than other types of networks, such as CNN and RNN39. One type of transformer is the Vision transformer, which was first introduced in 2020, and its general steps are as follows40:

  1. The transformer receives a one-dimensional sequence of embedding tokens as input.

  2. The received input is divided into smaller patches.

  3. Linear mapping is applied to the patches, projecting them to dimension D. The output of this mapping is known as patch embeddings.

  4. Position embeddings are incorporated into the patch embeddings to retain the sequence information of the patches.

  5. The sequence of embedding vectors is then provided as input to the encoder, which comprises multi-headed self-attention layers and MLP blocks.

  6. Layer normalization is applied before each block and residual connections are placed after each block.

  7. Several transformer blocks are sequentially applied to the tokens to extract features from the patches.

The proposed structure of the ViT network is shown in Fig. 4, which includes two transformer layers.
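The steps above can be sketched in NumPy for a single three-value input. This is an illustrative sketch only: the embedding dimension (8), the number of heads (2), the MLP width (16), the random weights, and all function names are assumptions, and layer normalization is shown without its learnable scale and shift. The three patches, the ReLU activation, and the two transformer layers follow the paper's description.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token to zero mean and unit variance (no learnable scale/shift here)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(tokens, p, n_heads):
    n_tokens, d_model = tokens.shape
    d_head = d_model // n_heads

    def split(x):  # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return x.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ p["w_q"]), split(tokens @ p["w_k"]), split(tokens @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))   # every patch attends to every patch
    merged = (weights @ v).transpose(1, 0, 2).reshape(n_tokens, d_model)
    return merged @ p["w_o"], weights

def encoder_block(tokens, p, n_heads):
    # Step 6: layer norm before each sub-block, residual connection after it.
    attn, weights = multi_head_self_attention(layer_norm(tokens), p, n_heads)
    tokens = tokens + attn
    hidden = np.maximum(layer_norm(tokens) @ p["w_1"], 0.0)          # MLP with ReLU
    return tokens + hidden @ p["w_2"], weights

def init_block(rng, d_model, d_mlp):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {"w_q": w(d_model, d_model), "w_k": w(d_model, d_model), "w_v": w(d_model, d_model),
            "w_o": w(d_model, d_model), "w_1": w(d_model, d_mlp), "w_2": w(d_mlp, d_model)}

rng = np.random.default_rng(0)
d_model, n_heads, n_patches = 8, 2, 3
sample = np.array([0.08, 0.06, 0.20])                 # Stage(t), Stage(t-1), Discharge(t-1), toy values
patches = sample.reshape(n_patches, 1)                # step 2: three patches of one value each
w_embed = rng.normal(scale=0.1, size=(1, d_model))
pos_embed = rng.normal(scale=0.1, size=(n_patches, d_model))
tokens = patches @ w_embed + pos_embed                # steps 3-4: patch + position embeddings

blocks = [init_block(rng, d_model, d_mlp=16) for _ in range(2)]   # step 7: two transformer layers
for block in blocks:
    tokens, attn_weights = encoder_block(tokens, block, n_heads)  # step 5

features = tokens.mean(axis=0)                        # pooled representation for a regression head
```

Each row of `attn_weights` sums to one and shows how strongly one patch draws on the others, which is the mechanism that lets the model relate stage and lagged discharge within a single sample.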

Fig. 4. The architecture of the proposed vision transformer.

The proposed ViT-CNN method

Instead of using raw patches, the input sequence can be composed of feature maps extracted by a CNN. In this hybrid model, the patch embedding projection mapping is applied to the patches extracted from the CNN feature map. In this hybrid architecture, features extracted by the CNN are provided as input to the ViT. The combination of CNN and ViT enables the model to effectively capture local features through the CNN, and global features and long-term relationships through the ViT40. Therefore, the combination of ViT and CNN helps ViT to exploit the information extracted by CNN and perform better based on these features. The proposed structure for the ViT-CNN hybrid presented in this study is shown in Fig. 5, which includes two CNN layers and one ViT layer.
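The data flow of such a hybrid can be sketched as follows: two convolution layers produce a feature map, each position of that map is treated as a patch token, and one self-attention layer mixes the tokens. The sketch rests on assumptions throughout: the filter counts (4 and 8), kernel width 3 with zero padding, single-head attention, random weights, and the names `conv1d_same` and `self_attention` are hypothetical and do not reproduce the configuration in Fig. 5.

```python
import numpy as np

def conv1d_same(x, kernels):
    """1D convolution with zero padding and ReLU. x: (length, in_ch); kernels: (k, in_ch, out_ch), k odd."""
    k = kernels.shape[0]
    padded = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([
        np.tensordot(padded[i:i + k], kernels, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0])
    ])
    return np.maximum(out, 0.0)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d_model = 8
sample = np.array([0.08, 0.06, 0.20])                   # Stage(t), Stage(t-1), Discharge(t-1), toy values

# CNN part: two convolution layers extract local features
x = sample.reshape(3, 1)
x = conv1d_same(x, rng.normal(scale=0.5, size=(3, 1, 4)))
feature_map = conv1d_same(x, rng.normal(scale=0.5, size=(3, 4, 8)))   # shape (3, 8)

# ViT part: positions of the feature map act as patches
w_embed = rng.normal(scale=0.3, size=(8, d_model))
pos_embed = rng.normal(scale=0.1, size=(3, d_model))
tokens = feature_map @ w_embed + pos_embed              # patch embedding + position embedding
w_q, w_k, w_v = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))
tokens = tokens + self_attention(tokens, w_q, w_k, w_v) # one attention layer with residual connection

w_out = rng.normal(scale=0.3, size=d_model)
estimate = float(tokens.mean(axis=0) @ w_out)           # regression head for Discharge(t)
```

The design point the sketch is meant to show is the hand-off: the attention layer never sees the raw inputs, only the convolutional features, so local patterns are extracted first and related to each other afterwards.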

Fig. 5. The architecture of the proposed ViT-CNN hybrid model.

Table 2 presents the tuning parameters of all three models used in this study: the CNN, ViT, and ViT-CNN hybrid models. Accurately tuning these parameters plays a key role in the performance and final accuracy of each model and significantly impacts the optimization of the deep learning process. The hyperparameters were tuned through trial and error to ensure optimal convergence. In hyperparameter tuning, only the most impactful options were explored to keep the search space compact; the optimizer (Adam), activation (ReLU), and loss (MSE) were held fixed. The learning rate was swept in two stages, from 1e-2 to 1e-6 and from 5e-3 to 5e-6, yielding the final values reported in Table 2. Batch size was tested at 16, 32, and 64, and dropout at 0.1, 0.2, and 0.3. For ViT-based models, the number of patches was set to 3. Epochs were explored at 300, 500, and 700, with 500 selected as the best trade-off between convergence and runtime. These ranges were chosen based on practical modeling experience and an understanding of network behavior. Figure 6 shows the training and testing loss curves for the three models, illustrating their convergence during training.

Table 2.

Hyperparameters of the Proposed Models.

Network Epoch Dropout Batch Size Number of Patches Learning Rate Loss Function Activation Function Optimizer
CNN 500 0.3 32 - 1e-6 Mean Squared Error ReLU Adam
ViT 500 0.3 32 3 1e-5 Mean Squared Error ReLU Adam
ViT-CNN 500 0.3 64 3 1e-4 Mean Squared Error ReLU Adam

Fig. 6. Training and Testing Loss Curves for a) CNN, b) ViT, and c) ViT-CNN Models.

Evaluation criteria

To evaluate the models, the criteria of correlation coefficient (CC), Nash–Sutcliffe efficiency (NSE), mean absolute error (MAE), and root mean square error (RMSE) were used, which are presented in Eq. 3 to Eq. 6 respectively. Both MAE and RMSE are non-negative (≥ 0) and share the units of the target variable; CC lies in [−1, 1], and NSE lies in (-∞, 1].

$$\mathrm{CC} = \frac{\sum_{i=1}^{n}\left(Q_{o,i}-\bar{Q}_{o}\right)\left(Q_{c,i}-\bar{Q}_{c}\right)}{\sqrt{\sum_{i=1}^{n}\left(Q_{o,i}-\bar{Q}_{o}\right)^{2}\,\sum_{i=1}^{n}\left(Q_{c,i}-\bar{Q}_{c}\right)^{2}}} \tag{3}$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(Q_{o,i}-Q_{c,i}\right)^{2}}{\sum_{i=1}^{n}\left(Q_{o,i}-\bar{Q}_{o}\right)^{2}} \tag{4}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Q_{o,i}-Q_{c,i}\right| \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_{o,i}-Q_{c,i}\right)^{2}} \tag{6}$$

where $Q_{o,i}$ is the observed data and $\bar{Q}_{o}$ is its mean, $Q_{c,i}$ is the calculated data and $\bar{Q}_{c}$ is its mean, and n is the number of observations. The closer CC is to 1 and the closer MAE and RMSE are to 0, the better the performance of the model. For the Nash–Sutcliffe efficiency (NSE), simulations with NSE ≥ 0.90 are rated very good, those with 0.60 ≤ NSE < 0.90 are acceptable, and those with NSE < 0.60 are unacceptable (Chiew et al., 1993). To evaluate the results, in addition to numerical criteria, graphical evaluation criteria such as scatter plots, Taylor diagrams, and violin plots were used. In a Taylor diagram, the data's position is analyzed based on RMSE, correlation coefficient, and the standard deviation of the time series41. Violin plots, a combination of kernel density plots and box plots, can display both the statistics and data density42.
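The four criteria follow their standard definitions and can be computed in a few lines of NumPy. The function names and the toy values below are illustrative assumptions, not data from the study.

```python
import numpy as np

def correlation_coefficient(observed, simulated):
    """Pearson correlation between observed and calculated values (CC)."""
    o = np.asarray(observed, dtype=float)
    s = np.asarray(simulated, dtype=float)
    do, ds = o - o.mean(), s - s.mean()
    return float(np.sum(do * ds) / np.sqrt(np.sum(do ** 2) * np.sum(ds ** 2)))

def nash_sutcliffe(observed, simulated):
    """NSE = 1 - (squared error of the model) / (squared deviation of observations from their mean)."""
    o = np.asarray(observed, dtype=float)
    s = np.asarray(simulated, dtype=float)
    return float(1.0 - np.sum((o - s) ** 2) / np.sum((o - o.mean()) ** 2))

def mean_absolute_error(observed, simulated):
    o = np.asarray(observed, dtype=float)
    s = np.asarray(simulated, dtype=float)
    return float(np.mean(np.abs(o - s)))

def root_mean_square_error(observed, simulated):
    o = np.asarray(observed, dtype=float)
    s = np.asarray(simulated, dtype=float)
    return float(np.sqrt(np.mean((o - s) ** 2)))

# Toy example (hypothetical values)
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 1.9, 3.2, 3.8]
scores = {
    "CC": correlation_coefficient(observed, simulated),
    "NSE": nash_sutcliffe(observed, simulated),
    "MAE": mean_absolute_error(observed, simulated),
    "RMSE": root_mean_square_error(observed, simulated),
}
```

NSE compares the model against the simplest possible benchmark, the mean of the observations: a value of 0 means the model is no better than that mean, and negative values mean it is worse.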

Results and discussion

Optimal number of lags using the VAR method

Stationarity was assessed using the Augmented Dickey–Fuller (ADF) test. For both discharge (ADF = − 8.85, p < 0.001) and stage (ADF = − 7.26, p < 0.001), the test statistic was smaller (more negative) than the critical values at the 1%, 5%, and 10% significance levels, and the null hypothesis of a unit root was rejected. Therefore, both series are stationary and no differencing was required. Using the SIC and AIC, the optimal lags for the discharge and stage time series were determined. The results are presented in Table 3, indicating the best lags for each variable across different models. Lower values of these criteria indicate the optimal number of lags.

Table 3.

VAR model output for determining optimal number of lags.

Lag Order AIC Value SIC Value Optimal (AIC) Optimal (SIC)
1 −364.162 −364.092 TRUE TRUE
2 −362.825 −362.694 FALSE FALSE
3 −360.087 −359.895 FALSE FALSE
4 −358.433 −358.18 FALSE FALSE
5 −357.534 −357.22 FALSE FALSE
6 −358.82 −358.444 FALSE FALSE
7 −358.319 −357.883 FALSE FALSE
8 −357.059 −356.562 FALSE FALSE
9 −357.388 −356.829 FALSE FALSE
10 −356.849 −356.229 FALSE FALSE

According to the results obtained for the estimation of discharge, Discharge (t), the optimal lag under both information criteria of the VAR model, AIC and SIC, is equal to 1. The agreement between the two criteria indicates high reliability in selecting this lag order for modeling. At the daily time step, the hydrologic response exhibits short memory; accordingly, the most informative predictors for estimating Discharge (t) lie in the immediate history, Discharge (t-1) and Stage (t-1), and the contemporaneous stage, Stage (t). Table 4 shows the correlation of the selected inputs with the discharge output: Discharge (t-1) has the highest correlation with Discharge (t), while Stage (t-1) has the lowest. Therefore, selecting a lag order of p = 1 is justified at this temporal resolution. Selecting the optimal number of lags not only reduces model complexity but also decreases error and enhances the predictive capability of the model.

Table 4.

Correlation Between Input and Output Variables.

Variables Stage (t) Stage (t-1) Discharge (t-1)
Discharge (t) 0.702 0.676 0.954

Evaluation of CNN, ViT, and ViT-CNN model results

Table 5 shows the performance of three different models including CNN, ViT and ViT-CNN hybrid model in estimating Discharge (t). The models were evaluated using CC, NSE, MAE and RMSE criteria on training and testing data. For each model, three scenarios with different inputs were implemented and evaluated.

Table 5.

Performance of CNN, ViT, and Hybrid ViT-CNN Models for Discharge Estimation.

Method | Inputs | CC (Train) | CC (Test) | NSE (Train) | NSE (Test) | RMSE (Train) | RMSE (Test) | MAE (Train) | MAE (Test)
CNN | Scenario 1 | 0.892 | 0.891 | 0.855 | 0.852 | 0.476 | 0.495 | 0.240 | 0.240
CNN | Scenario 2 | 0.884 | 0.882 | 0.847 | 0.841 | 0.478 | 0.482 | 0.244 | 0.246
CNN | Scenario 3: Stage (t), Stage (t-1), Discharge (t-1) | 0.958 | 0.953 | 0.911 | 0.912 | 0.265 | 0.272 | 0.090 | 0.090
ViT | Scenario 1 | 0.932 | 0.938 | 0.891 | 0.894 | 0.429 | 0.431 | 0.226 | 0.225
ViT | Scenario 2 | 0.915 | 0.917 | 0.874 | 0.877 | 0.448 | 0.448 | 0.231 | 0.232
ViT | Scenario 3: Stage (t), Stage (t-1), Discharge (t-1) | 0.979 | 0.971 | 0.945 | 0.950 | 0.224 | 0.231 | 0.080 | 0.086
ViT-CNN | Scenario 1 | 0.943 | 0.939 | 0.902 | 0.895 | 0.396 | 0.417 | 0.267 | 0.272
ViT-CNN | Scenario 2 | 0.942 | 0.931 | 0.901 | 0.892 | 0.398 | 0.422 | 0.270 | 0.279
ViT-CNN | Scenario 3: Stage (t), Stage (t-1), Discharge (t-1) | 0.991 | 0.983 | 0.977 | 0.962 | 0.132 | 0.178 | 0.066 | 0.071

As shown in Table 5, for several model and input combinations (most visibly the ViT rows), the test-set metrics are slightly better than the training-set metrics. This can be explained by the temporal split of the dataset: the earlier part used for training contains greater variability and more extreme peaks, whereas the later part used for testing is comparatively calmer and more regular (Fig. 2). Consequently, accuracy during the test period can match or exceed that of the training period. All three models demonstrated satisfactory performance, with high accuracy and low error, ranking in the order ViT-CNN, ViT, and CNN under the three-variable scenario. Additionally, the results in Table 5 indicate that considering all three variables, Stage (t), Stage (t-1), and Discharge (t-1), simultaneously improves model accuracy. The test-phase results also show that the proposed ViT-CNN hybrid method increases the correlation coefficient by 0.012 compared to the ViT model and by 0.030 compared to the CNN model in the third scenario (which includes the three variables Stage (t), Stage (t-1), and Discharge (t-1)). Although this increase may not seem particularly significant from a practical perspective, it is important from a numerical and modeling standpoint, reflecting the highly favorable performance of the proposed ViT-CNN hybrid model. In addition, to interpret the ViT-CNN results relative to the natural scale of the data (Table 1), normalized metrics were computed: nRMSE = 0.040 (RMSE/range), RSR = 0.196 (RMSE/σ_obs), and MAE/σ = 0.078 (MAE/σ_obs). These values indicate that the RMSE amounts to only 19.6% of the natural variability (standard deviation) and 4% of the total range of discharge, while R² ≈ 0.96 shows that approximately 96% of the discharge variance is reproduced.
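The normalized metrics quoted above can be reproduced directly from the reported values: the ViT-CNN test RMSE and MAE from Table 5 and the discharge range and standard deviation from Table 1. The variable names in this short check are illustrative.

```python
# Reported ViT-CNN test metrics (Table 5) and discharge statistics (Table 1)
rmse, mae = 0.178, 0.071
discharge_range, discharge_std = 4.410, 0.907

nrmse = rmse / discharge_range        # RMSE as a fraction of the observed range
rsr = rmse / discharge_std            # RMSE-to-standard-deviation ratio
mae_over_std = mae / discharge_std    # MAE as a fraction of the standard deviation

print(f"{nrmse:.3f} {rsr:.3f} {mae_over_std:.3f}")   # prints: 0.040 0.196 0.078
```

Normalizing by the range and by the standard deviation answers two different questions: how large the error is relative to the most extreme values on record, and how large it is relative to the typical day-to-day spread.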

In Fig. 7, the scatter plot and the time-series plot of observed discharge and estimated discharge against time are shown.

Fig. 7. Comparison of Observed and Estimated Discharge for the Optimal Input Scenario Using: a) CNN, b) ViT, and c) Hybrid ViT-CNN Models.

According to Fig. 7, there is a very high level of agreement between the observed and estimated data for each of the three models used. However, the results of the proposed ViT-CNN hybrid model are more satisfactory than those of the other two models. It should be noted that the CNN model captures peak flows more accurately in the time-series plots due to the ability of its convolutional layers to track abrupt surges, whereas the scatter and correlation plots, which reflect performance over the entire record, favor ViT-CNN. Accordingly, ViT-CNN, owing to its self-attention mechanism, reduces residual variance and bias across the more frequent low-to-mid flows and yields higher overall accuracy. As a result, in the proposed ViT-CNN hybrid model, the consistency and compactness between the observed and estimated data are higher. Additionally, based on the time-series plot in Fig. 7, it is evident that at relatively higher discharge levels the models produced slightly lower estimated values, because such events are rare and the models had few opportunities to learn from them. If higher discharge events were more frequent, the models would have better learning opportunities, leading to improved performance in estimating high discharge values. Figure 8 presents the Taylor diagram and violin plots for the test section in the best-case scenario.

Fig. 8.

Comparison of models using a Taylor diagram and violin plots for the optimal input scenario.

In Fig. 8-a, the black triangle represents the observed data, and the colored points represent the models used in this study. The ViT-CNN model, marked in brown, lies closest to the black triangle corresponding to the observational data. This indicates the good performance of the proposed ViT-CNN hybrid model, as its standard deviation is closer to that of the observations and its correlation with them is higher than for the other models.
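A Taylor diagram places each model according to its standard deviation and its correlation with the observations; the distance from the model point to the reference point then equals the centered root-mean-square difference (CRMSD), because CRMSD² = σ_obs² + σ_sim² − 2·σ_obs·σ_sim·R. The sketch below computes these statistics; the function name `taylor_statistics` and the sample values are hypothetical.

```python
import math
from statistics import fmean, pstdev


def taylor_statistics(obs, sim):
    """Standard deviations, Pearson correlation, and centered RMS difference."""
    mean_obs, mean_sim = fmean(obs), fmean(sim)
    std_obs, std_sim = pstdev(obs), pstdev(sim)
    covariance = fmean([(o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)])
    corr = covariance / (std_obs * std_sim)
    # RMS difference after removing each series' own mean.
    crmsd = math.sqrt(
        fmean([((s - mean_sim) - (o - mean_obs)) ** 2 for o, s in zip(obs, sim)])
    )
    return {"std_obs": std_obs, "std_sim": std_sim, "corr": corr, "crmsd": crmsd}


# Hypothetical values for illustration only.
stats = taylor_statistics([1.0, 2.0, 3.0, 4.0, 5.0], [1.2, 1.9, 3.3, 3.7, 5.1])
law_of_cosines = (
    stats["std_obs"] ** 2
    + stats["std_sim"] ** 2
    - 2 * stats["std_obs"] * stats["std_sim"] * stats["corr"]
)
print(stats, math.isclose(stats["crmsd"] ** 2, law_of_cosines))
```

A model whose point lies close to the reference point therefore has both a similar spread and a high correlation, which is the reading applied to ViT-CNN in the text.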

In the violin plots in Fig. 8-b, the small white dot, located within the black bars in the center of each plot, represents the median of the data. The thick black line indicates the interquartile range (Q1–Q3), while the thin black line shows the data points outside this range. The wider sections of the violin plot near the median indicate a higher density of data in that area, while the narrower sections reflect a lower data density in those regions. Analysis of the violins indicates that ViT-CNN aligns most closely with the observed distribution near the center: its median nearly coincides with the observed median, and the interquartile range is only slightly narrower, yielding a pronounced central density. By contrast, CNN exhibits a heavier upper tail (wider upper body and longer upper whisker), consistent with its superior reproduction of peak flows, while ViT shows a similar central tendency but a somewhat broader spread.
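The median and interquartile range that the violin plots display can also be compared as numbers. This is a minimal sketch with hypothetical names and values; plotting libraries may use a slightly different quartile convention than the "inclusive" method assumed here.

```python
from statistics import median, quantiles


def violin_summary(values):
    """Median, quartiles, and interquartile range of one distribution."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": median(values), "q1": q1, "q3": q3, "iqr": q3 - q1}


# Hypothetical observed and estimated discharges for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
estimated = [1.5, 2.2, 3.1, 4.0, 5.0, 5.9, 6.8, 7.6, 8.4]

obs_summary = violin_summary(observed)
est_summary = violin_summary(estimated)
print("median shift:", est_summary["median"] - obs_summary["median"])
print("IQR ratio:", est_summary["iqr"] / obs_summary["iqr"])
```

An IQR ratio slightly below 1 with a median shift near 0 would match the description of ViT-CNN as having a median that nearly coincides with the observed one and a slightly narrower interquartile range.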

In this study, a single-parameter, stage-based model was used to maintain simplicity, low cost, and high generalizability, demonstrating that competitive accuracy can be achieved using only routinely available gauging-station data. Overall, based on the numerical and graphical evaluation criteria presented, the proposed ViT-CNN hybrid model was more accurate than the standalone CNN and ViT models and predicted discharge at the Nahand hydrometric station with high accuracy. In other words, combining the CNN and ViT models improved prediction accuracy. These findings align with the results of Al-Juboori43 and Li et al.44, who found that hybrid models can enhance the accuracy of hydrological predictions. Beyond hydrology, the effectiveness of hybrid ViT-CNN models has been demonstrated in other contexts. For example, a study on the detection of elbow bone fractures45 showed that combining CNN and ViT can significantly improve feature extraction and the accuracy of medical diagnosis. Similarly, Kim et al.46 systematically reviewed 34 papers published between 2020 and 2024 on hybrid ViT-CNN architectures. The review found that CNNs excel at extracting local features through convolution filters, while ViTs are highly effective at capturing long-range dependencies through self-attention. Integrating the two architectures mitigates the limitations of each, leading to more comprehensive and effective solutions.

This study faced two main limitations. The first is the use of only one meteorological station and the lack of long-term data; the second is that the modeling was conducted in a single, specific climate. As a result, the findings may not apply to other climatic conditions and cannot be generalized across climates and regions. From a practical perspective, the proposed models might not appeal to water sector managers, because implementing and fine-tuning them requires a high level of expertise in computer science. Additionally, the improvement in prediction accuracy may not be large enough to justify the use of these complex models in practice, unless future developments make them available through user-friendly deep learning software, thereby enhancing their practical applicability. Future work could extend the developed models to incorporate a broader range of environmental variables, such as rainfall patterns, land use changes, and climate variations, which may enhance the robustness of stage-discharge estimations under diverse conditions. Future research could also apply these deep learning models to different river systems and climatic regions to assess their generalizability and adaptability. Finally, hyperparameters in this study were selected through trial and error based on model convergence and visual inspection of loss curves. More systematic tuning methods, such as Bayesian optimization or grid search with cross-validation, may lead to better model performance and higher reproducibility, and future studies are encouraged to explore them.
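As an illustration of the systematic tuning mentioned above, the sketch below runs a grid search with forward-chaining (time-ordered) cross-validation, so that each validation block comes after the data used for training. Everything here is an assumption made for illustration: a small k-nearest-neighbour regressor on stage stands in for the deep learning models, and the function names, the grid of `k` values, and the synthetic data are hypothetical.

```python
from statistics import fmean


def knn_predict(train_x, train_y, x, k):
    """Stand-in model: average discharge of the k nearest training stages."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return fmean([train_y[i] for i in nearest])


def forward_chaining_splits(n, n_folds):
    """Yield (train, validation) index ranges; validation always follows training."""
    fold = n // (n_folds + 1)
    for i in range(1, n_folds + 1):
        yield range(0, i * fold), range(i * fold, (i + 1) * fold)


def grid_search(x, y, k_grid, n_folds=3):
    """Return the k with the lowest mean validation MSE, plus all scores."""
    scores = {}
    for k in k_grid:
        fold_mse = []
        for train, valid in forward_chaining_splits(len(x), n_folds):
            train_x = [x[i] for i in train]
            train_y = [y[i] for i in train]
            squared = [(knn_predict(train_x, train_y, x[i], k) - y[i]) ** 2 for i in valid]
            fold_mse.append(fmean(squared))
        scores[k] = fmean(fold_mse)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic stage (m) and discharge series for illustration only.
stage = [0.5, 0.8, 0.6, 1.1, 0.9, 0.7, 1.3, 1.0, 0.6, 0.8, 1.2, 0.9]
discharge = [2.0 * h ** 1.5 for h in stage]
best_k, scores = grid_search(stage, discharge, k_grid=[1, 2, 3])
print(best_k, scores)
```

The same loop structure applies to any hyperparameter grid; only the stand-in model and the scoring function would change.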

Conclusion

Modeling the stage-discharge relationship in rivers is important for water resource management and ecosystem conservation. The present study therefore estimated the daily stage-discharge relationship in the Nahand River, Iran, using deep learning methods: Convolutional Neural Networks (CNN), Vision Transformers (ViT), and a ViT-CNN hybrid model. The results showed that deep learning methods perform very well in estimating the stage-discharge relationship. Among the methods used, the ViT-CNN hybrid model, which leverages the features of both CNN and ViT networks, performed best, achieving a higher correlation coefficient (CC = 0.983) than the other models. Based on these results, hybrid models appear to improve estimation accuracy for complex problems such as estimating the stage-discharge relationship.

As noted in the discussion, future work could extend the developed models with a broader range of environmental variables, such as rainfall patterns, land use changes, and climate variations, and apply them to other river systems and climatic regions to assess their generalizability and adaptability. Comparison with the existing literature suggests that, while previous studies have reported promising results with hybrid models, the current findings reinforce the growing consensus that advanced algorithms such as CNN and ViT can significantly improve hydrological predictions. Furthermore, integrating user-friendly software interfaces into these models will be crucial for making them accessible and practical for water resource managers, potentially transforming how hydrological data are analyzed and used in real-world applications. Engaging with stakeholders in water management can foster collaborative efforts to refine model usage, ensuring that technological advances align with the operational needs and decisions of the water sector.

Abbreviations

RC: Rating Curve
ViT: Vision Transformer
CNN: Convolutional Neural Network
ViT-CNN: Hybrid of Vision Transformer and Convolutional Neural Network
SVM: Support Vector Machine
RBF: Radial Basis Function
DTF: Decision Tree Forest
ARIMA: Autoregressive Integrated Moving Average
VAR: Vector Autoregression
AIC: Akaike Information Criterion
SIC: Schwarz Information Criterion
CC: Correlation Coefficient
MAE: Mean Absolute Error
ANN: Artificial Neural Network
ANFIS: Adaptive Neuro Fuzzy Inference Systems
LSTM: Long Short Term Memory
WANN: Wavelet Artificial Neural Network
MLNN: Multilayer Neural Networks
RBNN: Radial Basis Neural Networks
FNO: Fourier Neural Operator
DP: Dynamic Programming
NLP: Natural Language Processing
RNN: Recurrent Neural Network
ELM: Extreme Learning Machine
NSE: Nash–Sutcliffe Efficiency
RMSE: Root Mean Square Error

Author contributions

H.F. and M.S. conducted the study, carried out the research and formal analysis, and wrote the main manuscript. M.S. was responsible for project design, project management, and editing the manuscript. A.M.M. helped conduct the formal analysis and reviewed the manuscript.

Data availability

The datasets analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Mohammad Taghi Sattari, Email: mtsattar@gmail.com, Email: mtsattar@tabrizu.ac.ir.

Adam Milewski, Email: milewski@uga.edu.

References

  • 1. Chaplot, B. & Birbal, P. Development of stage-discharge rating curve using ANN. Int. J. Hydrol. Sci. Technol. 14(1), 75–95 (2022).
  • 2. Kumar, M. et al. Estimation of daily stage–discharge relationship by using data-driven techniques of a perennial river, India. Sustainability 12(19), 7877 (2020).
  • 3. Vishwakarma, D. K., Kuriqi, A., Abed, S. A., Kishore, G., Al-Ansari, N., Pandey, K. & Jewel, A. Forecasting of stage-discharge in a non-perennial river using machine learning with gamma test. Heliyon 9(5) (2023).
  • 4. Sattari, M. T. et al. Surface water quality classification using data mining approaches: Irrigation along the Aladag River. Irrig. Drain. 70(5), 1227–1246 (2021).
  • 5. Al-Aboodi, A., Ibrahim, H. & Al-Rekabi, W. S. Stage-discharge relationship modeling using data mining techniques in an arid region. Int. J. Appl. Eng. Res. 13, 326–336 (2018).
  • 6. Xu, Y. et al. Assessing water storage changes of Lake Poyang from multi-mission satellite data and hydrological models. J. Hydrol. 590, 125229 (2020).
  • 7. Hasan, E., Tarhule, A. & Kirstetter, P. E. Twentieth and twenty-first century water storage changes in the Nile river basin from GRACE/GRACE-FO and modeling. Remote Sens. 13(5), 953 (2021).
  • 8. Shukla, R., Kumar, P., Vishwakarma, D. K., Ali, R., Kumar, R. & Kuriqi, A. Modeling of stage-discharge using back propagation ANN-, ANFIS-, and WANN-based computing techniques. Theoretical and Applied Climatology, 1–23 (2021).
  • 9. Aquil, M. A. I. & Ishak, W. H. W. Comparison of Machine Learning Models in Forecasting Reservoir Water Level. J. Adv. Res. Appl. Sci. Eng. Technol. 31(3), 137–144 (2023).
  • 10. Umar, S., Lone, M. A., Goel, N. K. & Zakwan, M. Modelling of stage-discharge relationship using optimisation techniques for Jhelum River in Kashmir Valley, NW Himalayas. Int. J. Hydrol. Sci. Technol. 15(2), 140–153 (2023).
  • 11. Maghrebi, M. F. & Vatanchi, S. M. Comparison of different machine learning methods in river streamflow estimation using isovel contours and hydraulic variables. International Journal of River Basin Management, 1–18 (2024).
  • 12. Idowu, M., Kulls, C. & Kisi, O. A new method for monthly streamflow prediction using multi-source data: range-dependent multivariate adaptive regression splines–genetic algorithm. Hydrol. Sci. J. 69(13), 1860–1880 (2024).
  • 13. Agaj, T., Budka, A., Janicka, E. & Bytyqi, V. Using ARIMA and ETS models for forecasting water level changes for sustainable environmental management. Sci. Rep. 14(1), 22444 (2024).
  • 14. Sharma, A., Bansal, P., Chandel, A. & Shankar, V. Modelling stage–discharge relationship of Himalayan river using ANN, SVM and ANFIS. Sustain. Water Resour. Manage. 10(2), 88 (2024).
  • 15. Kisi, O. et al. Enhancing river flow predictions: Comparative analysis of machine learning approaches in modeling stage-discharge relationship. Results Eng. 22, 102017 (2024).
  • 16. Feizi, H., Sattari, M. T., Mosaferi, M. & Apaydin, H. An image-based deep learning model for water turbidity estimation in laboratory conditions. Int. J. Environ. Sci. Technol. 20(1), 149–160 (2023).
  • 17. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521(7553), 436–444 (2015).
  • 18. Marcus, G. Deep Learning: A Critical Appraisal. arXiv preprint arXiv:1801.00631 (2018).
  • 19. Apaydin, H., Feizi, H., Akcakoca, F. & Sattari, M. T. Daily Streamflow Modelling in the Nalli River Using Recurrent Neural Networks. In International Conference “New Technologies, Development and Applications” 813–822 (Cham: Springer International Publishing, 2022).
  • 20. Van, S. P. et al. Deep learning convolutional neural network in rainfall–runoff modelling. J. Hydroinf. 22(3), 541–561 (2020).
  • 21. Shu, X. et al. Monthly streamflow forecasting using convolutional neural network. Water Resour. Manage. 35, 5089–5104 (2021).
  • 22. Zhang, Z. et al. Downstream water level prediction of reservoir based on convolutional neural network and long short-term memory network. J. Water Resour. Plan. Manage. 147(9), 04021060 (2021).
  • 23. Xu, Y. et al. Improved convolutional neural network and its application in non-periodical runoff prediction. Water Resour. Manage. 36(15), 6149–6168 (2022).
  • 24. Jahanbakht, M., Xiang, W. & Azghadi, M. R. Sediment prediction in the great barrier reef using vision transformer with finite element analysis. Neural Netw. 152, 311–321 (2022).
  • 25. Taccari, M. L., Ovadia, O., Wang, H., Kahana, A., Chen, X. & Jimack, P. K. Understanding the Efficacy of U-Net & Vision Transformer for Groundwater Numerical Modelling. arXiv preprint arXiv:2307.04010 (2023).
  • 26. Zeng, Z., Kaur, R., Siddagangappa, S., Balch, T. & Veloso, M. From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting. In Proceedings of the Fourth ACM International Conference on AI in Finance, 82–90 (2023).
  • 27. Zhen, L. & Bărbulescu, A. Comparative Analysis of Convolutional Neural Network-Long Short-Term Memory, Sparrow Search Algorithm-Backpropagation Neural Network, and Particle Swarm Optimization-Extreme Learning Machine Models for the Water Discharge of the Buzău River, Romania. Water 16(2), 289 (2024).
  • 28. Suresh, A., Bolla, D. R., Baby Kalpana, Y. & Shareef, M. Analysing the impact on groundwater quality using dynamic programming and vision transformer. Groundw. Sustain. Dev. 25, 101159 (2024).
  • 29. Achite, M. et al. Advanced Soft Computing Techniques for Monthly Streamflow Prediction in Seasonal Rivers. Atmosphere 16(1), 106 (2025).
  • 30. Shekar, P. R., Mathew, A. & Sharma, K. V. A hybrid CNN–RNN model for rainfall–runoff modeling in the Potteruvagu watershed of India. CLEAN–Soil, Air, Water, 2300341 (2025).
  • 31. Ougahi, J. H. & Rowan, J. S. Enhanced streamflow forecasting using hybrid modelling integrating glacio-hydrological outputs, deep learning and wavelet transformation. Sci. Rep. 15(1), 2762 (2025).
  • 32. Sadeghi, B., Alesheikh, A. A., Jafari, A. & Rezaie, F. Performance evaluation of convolutional neural network and vision transformer models for groundwater potential mapping. J. Hydrol. 654, 132840 (2025).
  • 33. Ozcicek, O. & Douglas Mcmillin, W. Lag length selection in vector autoregressive models: symmetric and asymmetric lags. Appl. Econ. 31(4), 517–524 (1999).
  • 34. Garajeh, M. K. & Feizizadeh, B. A comparative approach of data-driven split-window algorithms and MODIS products for land surface temperature retrieval. Appl. Geom. 13(4), 715–733 (2021).
  • 35. Ivanov, V. & Kilian, L. A practitioner’s guide to lag-order selection for vector autoregressions (Vol. 2685). London: Centre for Economic Policy Research (2001).
  • 36. Gianfreda, A., Maranzano, P., Parisio, L. & Pelagatti, M. Testing for integration and cointegration when time series are observed with noise. Econ. Model. 125, 106352 (2023).
  • 37. Taye, M. M. Theoretical understanding of convolutional neural network: Concepts, architectures, applications, future directions. Computation 11(3), 52 (2023).
  • 38. Yang, Y. et al. A study on water quality prediction by a hybrid CNN-LSTM model with attention mechanism. Environ. Sci. Pollut. Res. 28(39), 55129–55139 (2021).
  • 39. Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y. & Tao, D. A survey on visual transformer. arXiv preprint arXiv:2012.12556 (2020).
  • 40. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 41. Pincus, R., Batstone, C. P., Hofmann, R. J. P., Taylor, K. E. & Gleckler, P. J. Evaluating the present‐day simulation of clouds, precipitation, and radiation in climate models. Journal of Geophysical Research: Atmospheres 113 (2008).
  • 42. Eini, N., Bateni, S. M., Jun, C., Heggy, E. & Band, S. S. Estimation and interpretation of equilibrium scour depth around circular bridge piers by using optimized XGBoost and SHAP. Eng. Appl. Comput. Fluid Mech. 17(1), 2244558 (2023).
  • 43. Al-Juboori, A. M. A hybrid model to predict monthly streamflow using neighboring rivers annual flows. Water Resour. Manage. 35(2), 729–743 (2021).
  • 44. Li, P., Zhang, J. & Krebs, P. Prediction of flow based on a CNN-LSTM combined deep learning approach. Water 14(6), 993 (2022).
  • 45. Khan, M. I., Amin, J., Shehzad, M. A. & Iqbal, S. An innovative hybrid model for elbow bone fracture detection: integrating VIT and CNN. Contemp. J. Soc. Sci. Rev. 3(1), 1–21 (2025).
  • 46. Kim, J. W., Khan, A. U. & Banerjee, I. Systematic review of hybrid vision transformer architectures for radiological image analysis. Journal of Imaging Informatics in Medicine, 1–15 (2025).
  • 47. Zeynali, M., Seyedarabi, H. & Afrouzian, R. Classification of EEG signals using Transformer based deep learning and ensemble models. Biomed. Signal Process. Control 86, 105130 (2023).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
