Abstract
Accurate, long-term soil moisture data are vital for drought monitoring and agricultural planning in monsoon-dependent and irrigation-intensive regions such as India. Satellite missions like SMAP and SMOS have improved global monitoring but remain limited by short records, shallow sensing depths, and reduced accuracy under dense vegetation and irrigation. To address these gaps, we reconstructed a 0.05° daily root-zone soil moisture (RZSM, 100 cm) dataset for India covering 1981–2024. The dataset was developed using a hybrid approach that combines simulations from the calibrated H08 land surface model with SMAP RZSM through Random Forest regression. The model was trained against SMAP observations for 2016–2024 using H08-derived soil moisture and evapotranspiration, precipitation, and temperature as predictors. Cross-validation demonstrates strong agreement with SMAP, achieving R² and NSE values above 0.90 and an RMSE of less than 0.03 m³/m³ across most regions. Comparison with available in-situ measurements yields an RMSE of 0.04 m³/m³ and a correlation coefficient of 0.94. Independent validation with Solar-Induced Chlorophyll Fluorescence further confirms consistency with vegetation activity during drought years (2002, 2009). This high-resolution, long-term dataset provides a robust resource for analysing drought variability, calibrating hydrological models, and assessing agricultural risks in India.
Subject terms: Hydrology
Background & Summary
Soil moisture is a critical component of the terrestrial water cycle, regulating land-atmosphere interactions, influencing crop yields, and modulating hydrological extremes such as floods and droughts1,2. Accurate and continuous estimates of soil moisture are essential for agricultural planning, water resource management, and early warning systems, particularly in regions like South Asia where both monsoon variability and groundwater dependence are high3–5. In recent years, satellite missions like the Soil Moisture Active Passive (SMAP) and Soil Moisture and Ocean Salinity (SMOS) have significantly advanced global soil moisture monitoring, offering surface soil moisture observations at spatial resolutions of 9–36 km6,7. However, these products have limited temporal continuity (post-2015), which limits their application for multi-decadal hydrological assessments. Additionally, satellite-derived surface soil moisture retrievals can be unreliable in regions with dense vegetation cover, snow cover, complex terrain, or intensive irrigation8,9. To address these limitations, the SMAP mission introduced a data-assimilated Root Zone Soil Moisture (RZSM) product through the SMAP Level 4 suite, which integrates satellite observations with the NASA Catchment land surface model to provide estimates up to 1 meter depth10. While the SMAP L4 RZSM product enhances depth sensitivity and temporal continuity, it remains constrained to the satellite era (April 2015 onward)11,12.
Previous efforts to reconstruct long-term RZSM have largely relied on global reanalysis and land data assimilation products such as ERA5, GLDAS, and MERRA-2, which provide long-term coverage but are generally available at coarse spatial resolutions13–15. Several regional studies have used alternative approaches, including exponential filtering combined with machine learning to estimate RZSM from dense in-situ networks, such as over China using more than 2,000 stations16, as well as data assimilation frameworks that integrate satellite-derived surface soil moisture and vegetation information to generate global RZSM products17. A few studies have also utilised drone-based geophysical techniques for high-resolution field-scale RZSM mapping for precision irrigation planning18; however, these approaches are limited to small spatial domains and are computationally expensive. While machine learning approaches offer strong potential for capturing nonlinear relationships between RZSM and its drivers, they often suffer from limited physical interpretability and transferability, and they depend on large training datasets1. In India, most satellite-based soil moisture studies have focused on surface soil moisture or short-term, site-scale retrievals using optical or thermal imagery19,20, and long-term gridded high-resolution RZSM datasets remain limited.
Physically based land surface models (LSMs) provide temporally continuous estimates of soil moisture and associated fluxes21. However, their accuracy depends on input data quality, model structure, and parameter calibration22. Many global LSMs operate at coarse spatial resolutions (~0.25° or coarser), which inadequately capture the heterogeneity of land-climate-soil interactions across the Indian subcontinent. For instance, Shah and Mishra23 utilized multiple LSMs at 0.25° resolution to reconstruct historical droughts in India, highlighting significant uncertainties in soil moisture simulations due to coarse spatial resolution and model structural differences. Moreover, most LSMs do not explicitly simulate irrigation practices, leading to persistent biases in heavily cultivated basins such as the Ganges and Indus. Kragh et al.24 pointed out that ignoring irrigation in hydrological models can produce erroneous evapotranspiration estimates, which highlights the importance of representing irrigation processes to improve model accuracy. To address these limitations, recent studies have explored hybrid frameworks that combine land surface modelling with machine learning (ML) to reconstruct long-term soil moisture by fusing satellite observations, model outputs, and meteorological variables. ML approaches such as Random Forests and deep neural networks are particularly well-suited to capturing nonlinear interactions and spatial variability in soil moisture patterns, and have shown strong potential in regional-scale applications when trained with physically consistent LSM outputs and high-quality reference data25,26.
We developed a gridded dataset of daily root-zone soil moisture (RZSM) for India at 0.05° resolution from 1981 to 2024, generated using a hybrid modelling framework (Fig. 1). We first simulate long-term hydrological variables, including runoff, evapotranspiration, and soil moisture, using the calibrated H08 land surface model27,28. These physically consistent outputs are then integrated with meteorological predictors and satellite-based SMAP observations using a Random Forest model to produce a spatially detailed, temporally consistent, and satellite-informed reconstruction of RZSM across 44 years (1981–2024). Beyond its resolution and temporal extent, the high-resolution soil moisture dataset captures heterogeneity across agro-climatic zones, hydro-meteorological extremes, and irrigation-intensified regions, addressing critical gaps in satellite and model-only products. It enables retrospective analysis of droughts, anomalies, and agricultural soil water stress from 1981 onwards, providing continuity for climate adaptation research.
Fig. 1.
Flowchart of the overall methodological framework used in the study to reconstruct root-zone soil moisture.
Methods
High-resolution (0.05°) daily gridded precipitation and air temperature dataset
We utilized high-resolution precipitation and temperature (maximum and minimum) datasets developed by Chuphal et al.29 based on observations from the India Meteorological Department (IMD), satellite/gauge-derived CHIRPS data, and ERA5-Land reanalysis products. Despite their extensive spatial and temporal coverage, these datasets inherently include biases arising from limited ground observations, uncertainties in satellite retrieval methods, and reanalysis processes3. To reduce these biases, Chuphal et al.29 applied the distribution (Quantile-Quantile) mapping bias correction method30,31, bringing the datasets into line with observations. This method effectively corrects biases in both mean and extreme climatic conditions and has been previously validated across Indian river basins32.
For precipitation, CHIRPS satellite data (0.05° resolution) were integrated with observational data from IMD, available at 0.25° resolution. CHIRPS data were initially aggregated from 0.05° to 0.25° resolution for bias correction against IMD reference data for the period 1981–2024. Scaling factors derived at the coarser resolution (0.25°) were subsequently reapplied to the original CHIRPS dataset at 0.05°, yielding a coherent high-resolution, bias-corrected precipitation dataset. We assume that the scaling factors estimated at 0.25° are representative of all underlying 0.05° grid cells within each corresponding 0.25° grid. The corrected CHIRPS dataset was then employed as a reference for bias correction of IMD precipitation data regridded to 0.05°, resulting in a comprehensive long-term precipitation dataset. Similarly, for temperature, ERA5-Land daily maximum and minimum temperature data (0.1° resolution) were initially regridded to match the 0.25° IMD observational temperature data resolution. Bias correction was performed against IMD data using the Quantile-Quantile mapping method. Scaling factors derived from the 0.25° resolution bias correction were then reapplied to the ERA5-Land data downscaled to 0.05° from 0.1° using the elevation-based SYMAP algorithm. This resulted in a precise, high-resolution daily temperature dataset at 0.05°. The stepwise methodology for both precipitation and temperature is illustrated in Supplementary Figure S1. Detailed information about the method can be found in previous studies3,29.
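The aggregate, correct, and reapply sequence described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of Chuphal et al.: the data are synthetic, the function `quantile_map` and all variable names are hypothetical, the mapping is a plain empirical quantile mapping, and the scaling factor is taken to be a per-day multiplicative ratio of corrected to raw coarse values.

```python
import numpy as np

def quantile_map(model, obs, n_quantiles=101):
    """Empirical quantile mapping: send each model value to the observed
    value that sits at the same quantile of its own distribution."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    model_q = np.quantile(model, q)
    obs_q = np.quantile(obs, q)
    return np.interp(model, model_q, obs_q)

rng = np.random.default_rng(0)
n_days, n_fine = 3650, 25  # 25 fine cells (5 x 5) inside one coarse cell

# Synthetic stand-ins (assumptions): a fine-resolution product and a coarse reference.
fine_raw = rng.gamma(shape=2.0, scale=4.0, size=(n_days, n_fine))
obs_coarse = rng.gamma(shape=2.0, scale=5.0, size=n_days)

# 1) Aggregate the fine product to the coarse grid.
coarse_raw = fine_raw.mean(axis=1)

# 2) Bias-correct at the coarse resolution against the reference.
coarse_corrected = quantile_map(coarse_raw, obs_coarse)

# 3) Derive a per-day scaling factor (left at 1 where the raw value is zero).
factor = np.divide(coarse_corrected, coarse_raw,
                   out=np.ones_like(coarse_corrected), where=coarse_raw > 0)

# 4) Reapply the coarse factor to every fine cell inside the coarse cell.
fine_corrected = fine_raw * factor[:, None]
```

Because a single factor is shared by all fine cells, the corrected fine field averages back to the corrected coarse value while keeping the sub-grid spatial pattern of the original product, which is the representativeness assumption stated in the text. Real precipitation series contain many zeros, which would need extra handling that this sketch omits.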
Hydrological modelling using the H08 and CaMa-flood models
To simulate land-surface hydrology across the Indian subcontinent at high resolution, we employed the land surface module of the H08 global hydrological model27,28. H08 is a physically based, spatially distributed model developed for global and regional water resource assessments. It comprises six core modules: land surface hydrology, crop growth, reservoir operation, environmental flow requirements, water abstraction, and river routing. In this study, we exclusively used the land surface module of H08 at 0.05° spatial resolution to simulate total runoff, evapotranspiration, and soil moisture across 18 major river basins of India.
H08 uses a set of spatially derived empirical factors (Table S1) based on topography, soil texture, geological permeability, and permafrost condition to control surface and subsurface flow processes27, unlike many land surface models that explicitly incorporate Digital Elevation Models (DEMs) and land use/land cover (LULC) for runoff routing or partitioning. More details of all the static input parameters are given in Supplementary Tables S1 and S2. We carried out the H08 model simulations for the 1981–2024 period using the high-resolution daily meteorological forcings at 0.05° spatial resolution.
The H08-simulated total runoff at 0.05° resolution was routed using CaMa-Flood (at 0.05° spatial resolution), a hydrodynamic floodplain routing model capable of simulating water surface elevation, water discharge, and floodplain inundation over complex topography33. The CaMa-Flood model has been previously integrated with the H08 model for hydrological applications in the Indian subcontinent34,35. We calibrated the integrated H08-CaMa-Flood framework across 18 major river basins using long-term daily streamflow observations. The gap-filled streamflow over the Indian mainland was obtained from Solanki & Mishra36 and Magotra et al.37. We also assessed the performance of the H08 model against SMAP daily soil moisture for the period 2016–2024 and MODIS 8-day net evapotranspiration (MOD16A2)38 for the period 2001–2022. Four physically meaningful parameters, namely soil depth (SD), bulk transfer coefficient (CD), percolation resistance (γ), and groundwater storage time constant (τ), were adjusted for each basin through manual sensitivity analysis. We evaluated the model performance using Pearson’s correlation (r), root mean square error (RMSE), Nash-Sutcliffe Efficiency (NSE), coefficient of determination (R²), and Kling-Gupta efficiency (KGE). The choice of calibration parameters and their sensitivity ranges is based on earlier basin-scale H08 applications in India35,39 and is further detailed in Supplementary Tables S3 and S4.
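For reference, the evaluation metrics named above can be written compactly. The sketch below is an illustration rather than the authors' code: the function names are hypothetical, R² is taken as the squared Pearson correlation, and KGE follows the widely used 2009 formulation based on correlation, variability ratio, and bias ratio, which is an assumption because the text does not state which variant was used.

```python
import numpy as np

def pearson_r(sim, obs):
    """Pearson correlation between simulated and observed series."""
    return float(np.corrcoef(sim, obs)[0, 1])

def rmse(sim, obs):
    """Root mean square error, in the units of the input series."""
    return float(np.sqrt(np.mean((sim - obs) ** 2)))

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    return float(1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2))

def kge(sim, obs):
    """Kling-Gupta efficiency (2009 form, assumed): combines correlation r,
    variability ratio alpha, and bias ratio beta."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

# Toy series: the simulation is perfectly correlated with the observations
# but damped and biased, so r is 1 while NSE and KGE are penalised.
obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
sim = 0.8 * obs + 0.5
scores = {"r": pearson_r(sim, obs), "r2": pearson_r(sim, obs) ** 2,
          "rmse": rmse(sim, obs), "nse": nse(sim, obs), "kge": kge(sim, obs)}
```

The toy case shows why several metrics are reported together: correlation alone cannot detect errors in magnitude or variability.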
The land surface scheme in H08 calculates hydrological fluxes using a single-layer soil moisture bucket model40, wherein saturation-excess surface runoff is produced when soil water exceeds the defined field capacity, while subsurface runoff (baseflow) follows an exponential recession model based on a leaky-bucket formulation27,41. Evapotranspiration is simulated using a Penman-type bulk aerodynamic formulation, modulated by a nonlinear soil moisture stress function (β), which scales potential evapotranspiration (PET) to account for dry surface conditions27,42. Soil moisture evolution follows a mass-balance approach that integrates precipitation, snowmelt, runoff components, and evapotranspiration, while soil temperature is estimated using a force–restore method, incorporating ground heat capacity and snow insulation dynamics43,44. Following calibration, final simulations were conducted using the optimized parameters, yielding grid-wise outputs of daily soil moisture and actual evapotranspiration. These outputs served as physically consistent predictors in the machine learning-based reconstruction of long-term SMAP-like soil moisture.
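To make the bucket concept concrete, the following sketch steps a single-layer soil moisture store through one year of daily forcing. It is a simplified illustration and not the H08 code: the functional forms chosen for the stress factor and for subsurface runoff, the order of operations within a time step, the parameter values, and the synthetic forcing are all assumptions, and snow and soil temperature are ignored.

```python
import numpy as np

def run_bucket(precip, pet, capacity=150.0, tau=100.0, gamma=2.0, w0=75.0):
    """Daily single-layer bucket (units: mm and mm/day).

    capacity : storage at field capacity
    tau      : time constant of subsurface drainage (days)
    gamma    : shape parameter of subsurface drainage
    w0       : initial storage
    """
    n = len(precip)
    storage, et = np.empty(n), np.empty(n)
    q_surface, q_subsurface = np.empty(n), np.empty(n)
    w = w0
    for t in range(n):
        # Stress factor: ET falls below PET as the bucket dries (assumed form).
        beta = min(w / (0.75 * capacity), 1.0)
        et[t] = min(beta * pet[t], w)
        w -= et[t]
        # Leaky-bucket drainage that grows nonlinearly with wetness (assumed form).
        q_subsurface[t] = min((capacity / tau) * (w / capacity) ** gamma, w)
        w -= q_subsurface[t]
        # Add rainfall; anything above capacity leaves as saturation-excess runoff.
        w += precip[t]
        q_surface[t] = max(w - capacity, 0.0)
        w -= q_surface[t]
        storage[t] = w
    return storage, et, q_surface, q_subsurface

rng = np.random.default_rng(0)
n_days = 365
wet_day = rng.random(n_days) < 0.3
precip = np.where(wet_day, rng.gamma(2.0, 8.0, n_days), 0.0)  # synthetic rainfall
pet = np.full(n_days, 4.0)                                     # constant PET

storage, et, q_surface, q_subsurface = run_bucket(precip, pet)
```

Every flux is removed from or added to the same store, so the water balance closes by construction: initial storage plus precipitation minus evapotranspiration and both runoff components equals the final storage.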
Machine learning-based reconstruction of root-zone soil moisture
To reconstruct a long-term, high-resolution RZSM dataset for India from 1981 to 2024, we developed a machine learning (ML) framework that fuses satellite observations from the Soil Moisture Active Passive (SMAP) mission with hydrologically consistent outputs from the calibrated H08 model. This hybrid approach leverages the temporal continuity and physical consistency of model-based simulation with the observational accuracy of satellite data to extend SMAP-like soil moisture estimates for the entire period. We utilised the SMAP Level-4 (L4) product, which provides root-zone soil moisture estimates for depths up to 100 cm, ensuring consistency with the target soil layer of the reconstructed product.
To ensure spatial consistency across inputs and target variables, the SMAP L4 root-zone soil moisture data (originally at ~0.08° resolution) were regridded to 0.05° to use in the machine learning model for the period 2016–2024. Daily SMAP RZSM was used to represent the reference soil moisture for model learning. To predict SMAP-like soil moisture, we used four predictor variables: (i) H08-simulated root-zone soil moisture, (ii) H08-derived actual evapotranspiration (ET), (iii) 7-day accumulated precipitation, and (iv) daily air temperature. All the predictors were prepared at a spatial resolution of 0.05°. Vegetation indices such as Solar-Induced Chlorophyll Fluorescence (SIF) were deliberately excluded from the predictors to avoid circular reasoning in the later validation stages, which rely on SIF-based comparison. SIF serves as a robust satellite-based proxy for photosynthetic activity and land surface hydrological conditions45,46. Recent studies have demonstrated strong relationships between SIF, soil moisture variations, and vegetation responses, supporting its application for validating large-scale soil moisture datasets47. We used SIF data to independently evaluate the reconstructed soil moisture product for studying droughts. This ensures statistical independence between the learning phase and the evaluation of ecological consistency48.
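As an illustration of how the four predictors could be assembled for one grid cell, the sketch below builds a feature matrix from synthetic daily series. The helper `trailing_sum` and the variable names are hypothetical, and the 7-day window is assumed to include the current day; the text does not specify the window alignment.

```python
import numpy as np

def trailing_sum(x, window=7):
    """Sum of the current value and the previous (window - 1) values.
    The first (window - 1) entries are NaN because the window is incomplete."""
    x = np.asarray(x, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(x)))
    out = np.full(x.size, np.nan)
    out[window - 1:] = csum[window:] - csum[:-window]
    return out

rng = np.random.default_rng(0)
n_days = 30

# Synthetic daily series for a single grid cell (assumptions).
h08_sm = rng.uniform(0.05, 0.35, n_days)  # H08 soil moisture, m3/m3
h08_et = rng.uniform(0.0, 5.0, n_days)    # H08 actual ET, mm/day
precip = rng.gamma(2.0, 3.0, n_days)      # precipitation, mm/day
temp = rng.uniform(15.0, 35.0, n_days)    # air temperature, deg C

precip_7d = trailing_sum(precip, window=7)

# Feature matrix with one row per day and columns [SM, ET, P7, T].
X = np.column_stack([h08_sm, h08_et, precip_7d, temp])
valid = ~np.isnan(X).any(axis=1)  # drop the days without a full 7-day window
X = X[valid]
```

The target vector (SMAP RZSM in the paper) would be subset with the same `valid` mask so that predictors and target stay aligned in time.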
Among several machine learning algorithms available, the Random Forest Regressor (RFR) was selected for its superior performance and robustness to noise, as demonstrated in other soil moisture downscaling and data fusion studies49–52. The model was trained independently for each grid to preserve spatial heterogeneity and local hydrological dynamics, which is crucial in climatically diverse regions of India. Performance evaluation was conducted using three standard metrics: the coefficient of determination (R²), Nash-Sutcliffe Efficiency (NSE), and Root Mean Squared Error (RMSE). Following this cross-validation, the full historical record of predictor variables (1981–2024) was used as input to reconstruct the daily RZSM dataset at 0.05° resolution.
To quantify the added value of adopting a nonlinear ensemble approach (RFR), we also implemented a baseline linear regression model, widely used in hydrology for its simplicity and interpretability. We conducted a spatial transferability test in the Cauvery basin (see Supplementary Text 1) to evaluate the Random Forest model’s generalisability across correlated grids. Furthermore, beyond grid-specific modelling, we also explored a pooled approach by clustering India into homogeneous regions based on soil properties (field capacity and wilting point)53, topography (DEM, aspect, slope)54, land use and land cover (LULC)55, and climate (precipitation, temperature, ET). However, given the large number of grids (>1,15,000) and high-dimensional grid-specific features, the clustering algorithm produced only two clusters (as indicated by silhouette scores) (Figure S5). Moreover, training a pooled model with daily data from such a large number of grids was computationally infeasible, which limited the practicality of this approach.
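The silhouette-based choice of cluster count mentioned above can be illustrated as follows. The clustering algorithm is not named in the text, so k-means is used here purely as an assumption, and the feature table is synthetic: two well-separated groups of grid cells with five standardised attributes each.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic grid-cell attributes (assumption): 300 cells x 5 features,
# drawn from two clearly separated groups.
features = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 5)),
    rng.normal(6.0, 1.0, size=(150, 5)),
])

# Standardise so that no attribute dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# Score candidate cluster counts by their mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
```

When the attribute space contains only a couple of dominant groupings, the silhouette score peaks at a small cluster count, which gives too few regions to be useful for pooled modelling.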
ML model architecture and implementation
For the reconstruction of root-zone soil moisture (RZSM) at each 0.05° grid cell, we used the Random Forest Regressor (RFR) from the scikit-learn library in Python56. Random Forest is a non-parametric ensemble learning technique that constructs multiple decision trees during training and combines their outputs to make robust predictions. This method is well-suited for high-dimensional and nonlinear datasets and is known for its resilience to overfitting when appropriately regularized57,58.
We trained an independent Random Forest model at each grid cell to capture spatial variability across India’s diverse landscapes. This grid-wise modelling strategy, widely used in previous soil moisture downscaling and fusion studies49–52, helps preserve local soil-climate-vegetation interactions and avoids biases from spatial averaging. The four hydro-meteorological predictors were selected based on their established relevance to soil moisture dynamics in both physically based and data-driven hydrology59. We observed notable improvements in model performance when additional predictors were incrementally included (see technical validation section). We evaluated the performance of the RF model using a leave-k-out approach, which has been robust in demonstrating ML model skills60. For each grid, we randomly selected 75% of the daily data from 2016–2024 for model training and used the remaining 25% to cross-validate the model. We repeated this random split 50 times (referred to here as bootstraps) and estimated the mean RMSE, R², and NSE across the 50 bootstraps. To predict soil moisture over the full historical period from 1981 to 2024, we then trained the RF model on the entire 2016–2024 period to maximise the availability of high-quality SMAP data for training.
To ensure model efficiency and scalability when training a separate model for each of the more than 1,15,000 grids of 0.05° in our study area, the number of decision trees was fixed at 100, a value that provides a reliable balance between model accuracy and computational cost. Prior studies61 have shown that increasing the number of trees beyond this typically provides minimal gains while significantly increasing training time, particularly in well-structured geospatial datasets. All other hyperparameters were retained at their default values provided by scikit-learn. The tree depth was left unlimited, allowing trees to grow until each leaf is pure or contains fewer than the minimum number of samples required to split, a strategy that works well when ensemble averaging mitigates overfitting57. The RF models achieved robust cross-validation skills and consistently outperformed the baseline linear regression model (see technical validation section), underscoring the effectiveness of the chosen predictors.
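The grid-wise training and evaluation procedure can be sketched for a single grid cell with scikit-learn. This is an illustration under stated assumptions, not the authors' script: the data are synthetic, the helper `repeated_split_skill` is hypothetical, R² is computed as the squared Pearson correlation, and NSE is computed with `r2_score`, whose formula is identical to the Nash-Sutcliffe Efficiency. Only five repeats are run to keep the example quick, whereas the text describes 50.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import ShuffleSplit

def repeated_split_skill(make_model, X, y, n_repeats=50, test_size=0.25, seed=0):
    """Mean RMSE, R2 and NSE over repeated random train/test splits."""
    splitter = ShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    rmse, r2, nse = [], [], []
    for train_idx, test_idx in splitter.split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
        r2.append(np.corrcoef(y[test_idx], pred)[0, 1] ** 2)
        nse.append(r2_score(y[test_idx], pred))
    return {"rmse": float(np.mean(rmse)), "r2": float(np.mean(r2)),
            "nse": float(np.mean(nse))}

# Synthetic single-cell data (assumption): four predictors and a target that
# responds to the first predictor in a saturating, nonlinear way.
rng = np.random.default_rng(0)
n_days = 600
X = rng.random((n_days, 4))
y = (0.10 + 0.30 / (1.0 + np.exp(-20.0 * (X[:, 0] - 0.5)))
     + 0.03 * X[:, 1] + rng.normal(0.0, 0.005, n_days))

def make_rf():
    # 100 trees, all other hyperparameters left at scikit-learn defaults.
    return RandomForestRegressor(n_estimators=100, random_state=0)

rf_skill = repeated_split_skill(make_rf, X, y, n_repeats=5)
lin_skill = repeated_split_skill(LinearRegression, X, y, n_repeats=5)

# After evaluation, refit on all available samples for the reconstruction step.
final_model = make_rf().fit(X, y)
reconstructed = final_model.predict(X[:10])
```

On this synthetic target the Random Forest has a clearly lower RMSE than the linear baseline because the response is nonlinear; the size of that gap on real data depends on how nonlinear the local relationship is.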
Data Records
The reconstructed high-resolution soil moisture dataset (100 cm) for India for the period 1981–2024 has been made available through the Zenodo repository62. The gridded data is available at a daily temporal scale over the Indian region and is provided in text and NetCDF formats. A comprehensive README file describing the data structure, file formats, and usage guidelines to facilitate user access and application is added to the Zenodo repository.
Technical Validation
Hydrological model calibration and validation
We evaluated the hydrological model’s performance during both calibration and validation periods against daily observed streamflow using three statistical metrics, R², NSE, and KGE across 195 gauge stations (Fig. 2). We observed a good performance of the model with R², NSE, and KGE greater than 0.55 for most gauge stations during both calibration and validation (Table S5). The median R², NSE, and KGE during calibration were 0.61, 0.54, and 0.58, respectively. Similarly, the median R², NSE, and KGE during validation were 0.61, 0.50, and 0.55, respectively. The relatively weaker performance at a few stations can be partly attributed to model parametrization and the quality of input data23,63. Due to the unavailability of observed streamflow data for transboundary river basins (Indus and Brahmaputra), the selected stations for model evaluation are fewer in these river basins. Overall, the model performance was satisfactory in simulating daily streamflow across the major Indian river basins. In addition, the model performed well in simulating ET over the Indian mainland (Fig. 3). The model’s performance in simulating ET was particularly robust in the core monsoon region of central India. Overall, the model evaluation outcomes based on streamflow, ET, and soil moisture (see next section) suggest that the model is satisfactory and its hydrological outputs can be used as inputs in the machine learning framework to reconstruct long-term soil moisture.
Fig. 2.
Calibration and evaluation performance (R², NSE, and KGE) of H08-CaMa-Flood simulated streamflow against daily streamflow observations at 195 stations across the study region. Grey and light blue lines indicate major basin boundaries and stream networks of the Indian subcontinent, respectively. The thick black outline denotes the national boundary of India.
Fig. 3.
Spatial distribution of (a) R² and (b) KGE between H08-simulated evapotranspiration and the MODIS-derived evapotranspiration over the study region for the period 2001–2022. The median represents the median value of all the grids.
Comparison of SMAP and H08 simulated soil moisture
To assess how well the H08 land surface model represents RZSM, we performed a spatial and temporal comparison with SMAP soil moisture over India for the period 2016–2024. We estimated the correlation between SMAP and H08 RZSM between 2016 and 2024 at each grid cell across India to evaluate temporal consistency (Fig. 4a). The H08 RZSM showed high correlations (r > 0.90) for most parts of India, particularly in central, western, and Indo-Gangetic Plains regions, demonstrating that the H08 model effectively captures the seasonality and interannual variability in SMAP soil moisture. In contrast, lower correlations in the northern high-altitude Himalayan region and parts of southern and northeastern India likely reflect known limitations of satellite retrievals in areas with dense forests, mountainous terrain, or snow cover6,8, as well as possible model biases in H08. The area-averaged standardised (Z-score) RZSM over India from H08 is significantly correlated (R² = 0.96, p-value < 0.01) with SMAP soil moisture (Fig. 4b), highlighting consistency in temporal variability rather than absolute magnitude. However, the magnitude of average RZSM across India from SMAP and H08 differs significantly (Fig. 4c,d). SMAP captures strong regional contrasts, showing wetter conditions in the northeast, along the Himalayan foothills, and on the west coast, while drier conditions appear in the western and southern parts of the country. In contrast, although the H08 model captures the daily variability in RZSM, it substantially underestimates the magnitude of RZSM across the region (Fig. 4d).
Fig. 4.
(a) Spatial distribution of Pearson’s correlation between root-zone soil moisture (RZSM) from SMAP and H08 during 2016–2024. (b) Comparison of area-averaged standardised (Z-score) series of daily RZSM from SMAP and H08 over India. (c,d) Spatial distribution of mean (absolute) RZSM over the Indian region for the period 2016–2024 based on SMAP and H08. (e) Spatial distribution of differences in RZSM between SMAP and H08 (SMAP–H08) over India.
The H08 model underestimates SMAP soil moisture, with deviations exceeding 0.3 m³/m³ in high-rainfall zones of India (Fig. 4e), particularly in central, northeastern, and parts of northern India. This suggests that H08 does not fully capture the magnitude of soil moisture and that its output requires some degree of correction before absolute values are used. However, the model can be applied for large-scale drought monitoring and hydrological analysis, particularly when anomalies or trends are of greater interest than absolute values.
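The distinction between agreement in variability and agreement in magnitude can be shown with a short numerical sketch. The two series below are synthetic stand-ins, not SMAP or H08 data: the hypothetical `model` series is a damped, dry-biased copy of the `reference` series with a little noise added.

```python
import numpy as np

def zscore(x):
    """Standardise a series to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
t = np.arange(365)

# Synthetic reference series with a seasonal cycle (m3/m3).
reference = 0.30 + 0.08 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.005, t.size)
# Synthetic model series: same timing, smaller amplitude, lower mean.
model = 0.4 * reference - 0.02 + rng.normal(0, 0.003, t.size)

mean_bias = reference.mean() - model.mean()
r_raw = np.corrcoef(reference, model)[0, 1]
r_standardised = np.corrcoef(zscore(reference), zscore(model))[0, 1]
r_squared = r_standardised ** 2
```

Correlation is unchanged by standardisation, so a high R² between Z-score series says nothing about the mean offset, which is why a model can track anomalies well and still need a magnitude correction.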
Predictor evaluation and validation of ML-based soil moisture estimates
To generate reliable and spatially consistent estimates of RZSM, grid-wise RF models were trained using SMAP RZSM as the target variable and four key predictors: H08-simulated soil moisture, evapotranspiration (ET), 7-day cumulative precipitation, and temperature64,65. The distribution of relationships between each predictor and SMAP RZSM for each grid across India is shown in Figure S2. H08-simulated soil moisture shows a strong linear relationship with SMAP (median r = 0.90), reflecting its effectiveness in capturing large-scale soil moisture patterns, despite biases in magnitude often found in global hydrological models28. Evapotranspiration also shows a robust positive relationship (median r = 0.75), highlighting its close coupling with soil water availability during vegetation-active periods66. Precipitation over the previous 7 days showed moderate correlation (median r = 0.50) with SMAP soil moisture, consistent with expectations due to the lag between rainfall and soil moisture infiltration or redistribution59. Temperature, on the other hand, displays a weak negative correlation with SMAP soil moisture (median r = −0.05), suggesting that while it contributes indirectly through its effect on atmospheric demand and evapotranspiration (land-atmospheric feedback)67, it does not directly govern daily soil moisture variability.
The RF model demonstrated satisfactory cross-validation skills across most of the region (Fig. 5). The RMSE across India ranged between 0 and 0.05 m³/m³, marking a substantial reduction from the magnitude bias of RZSM in the raw H08 estimates, which exceeded 0.20 m³/m³ over large areas (Fig. 5a). Higher RMSE values were observed in the Western Ghats, the northeastern region, and the Himalayan foothills, while more than 80% of the grids showed RMSE less than 0.03 m³/m³ (Fig. 5d). Similarly, the spatial distribution of R² and NSE confirmed that the RF model performed well across most of the region (Fig. 5b,c), with more than 75% of the grids showing R² and NSE values above 0.70 (Fig. 5e,f). The northern high-altitude Himalayas, however, exhibited relatively weaker performance, likely due to snow cover, freeze–thaw processes, and retrieval uncertainties in SMAP under such conditions6, as well as poorer RZSM estimates from H08 in this region. The mean ML-predicted RZSM for 2016–2024 compares well with the mean RZSM from SMAP (Fig. 6a,b). The reconstructed RZSM showed minimal bias (<0.03 m³/m³) relative to SMAP RZSM for most regions across India (Fig. 6c), compared with differences of more than 0.3 m³/m³ between SMAP and raw H08 RZSM (Fig. 4e). The temporal variation of area-averaged ML-predicted RZSM shows excellent agreement (R² = 0.99) with SMAP (Fig. 6d), underscoring the model’s capacity to learn complex nonlinear interactions and seasonal dynamics inherent in the hydrological cycle.
Fig. 5.
(a–c) Cross-validation performance of the RZSM reconstruction model, showing mean skill scores (RMSE, R², NSE) across 50 bootstraps. (d–f) Spatial distribution of grids (percentage area) categorized by RMSE, R², and NSE skill classes.
Fig. 6.
(a,b) Spatial distribution of mean RZSM over the Indian region for the period 2016–2024 based on SMAP and ML-predicted soil moisture. (c) Spatial distribution of differences in RZSM between SMAP and ML-predicted soil moisture over India. (d) Comparison of the area-averaged series of daily RZSM from SMAP and ML-predicted over India.
We found significant improvement in cross-validation skills after incremental inclusion of key predictors (Fig. 7). For example, the median RMSE was 0.025 m³/m³ when using only H08-derived soil moisture, but decreased to 0.020 m³/m³ when ET, precipitation, and temperature were added (Fig. 7a). Likewise, R² and NSE improved from ~0.82 with only H08-derived soil moisture to ~0.90 with additional predictors (Fig. 7b,c). Overall, the inclusion of H08-derived ET, 7-day accumulated precipitation, and daily air temperature alongside H08-derived soil moisture consistently enhanced performance, as reflected by increases in NSE and R² and reductions in RMSE. Furthermore, comparison of the RF model with a baseline linear regression model (Figure S3) highlighted the added value of the non-linear RF approach. Across all grids, the RF model reduced RMSE and improved R² and NSE, with particularly large gains in the northern Himalayas, where R² and NSE improvements exceeded 0.2, and RMSE decreased by 0.01 m³/m³. Substantial gains (>0.2) in R² were also evident over peninsular and northeastern India. Overall, the strong performance demonstrates that the selected predictors, particularly physically consistent variables such as H08-derived soil moisture and evapotranspiration, are sufficient to accurately reproduce SMAP-like soil moisture estimates using the RF model.
Fig. 7.
Cross-validation skill of the RZSM reconstruction model (RMSE, R², NSE) under incremental inclusion of predictors. The combination SM_ET_PCP_T denotes the use of H08-simulated soil moisture (SM) and evapotranspiration (ET), precipitation, and temperature as predictors in the random forest model. The outliers are not represented in the box plots.
Next, we compared the reconstructed RZSM with in-situ soil moisture observations at the IIT Kanpur site, obtained from the International Soil Moisture Network (ISMN). The in-situ measurements are available as point observations at multiple depths (10, 25, 50, and 80 cm). Validation against the depth-averaged in-situ soil moisture shows good temporal agreement, with a correlation coefficient (r) of 0.94, and reasonable agreement in absolute magnitude, with an RMSE of less than 0.04 m³/m³ (Fig. 8). The observed series exhibits a larger dynamic range than the gridded product, which likely reflects the scale mismatch between point-scale measurements and the spatially averaged 0–100 cm modelled soil layer at 0.05° resolution.
Fig. 8.
Comparison of predicted soil moisture (SM) with in-situ soil moisture observations at the IIT Kanpur site from the International Soil Moisture Network (ISMN). Panel (a) shows the comparison of absolute soil moisture time series, while Panel (b) shows the comparison of standardised (Z-score) soil moisture time series. The observed series of soil moisture represents the average value of soil moisture for the point measurements at depths of 10, 25, 50, and 80 cm.
Further, to evaluate how well the machine learning-estimated RZSM matches observed patterns under both dry and wet conditions, we first identified a dry (July 2019) and a wet (March 2020) month based on the SMAP 1-month standardised soil moisture (SSI) time series (Fig. 9g). For these selected months, we compared the SSI patterns from SMAP, ML-predicted, and H08 (Fig. 9a–f). The ML-predicted SSI closely matched the spatial distribution and intensity of dryness and wetness observed in SMAP, particularly across central and southern India, while the H08-derived SSI showed notable deviations in the Indo-Gangetic and Western Ghats regions. This spatial agreement was further supported by the temporal correlation over 2016–2024, where ML-predicted RZSM outperformed H08 in capturing SMAP dynamics (R² = 0.92 and 0.84, respectively; Fig. 9g). These results demonstrate the effectiveness of our ML framework in reproducing satellite-consistent soil moisture anomalies both spatially and temporally across India.
Fig. 9.
Spatial and temporal comparison of 1-month standardised soil moisture (SSI) from SMAP, ML-based predictions (ML-predicted), and H08 model over India (2016–2024). Subplots (a–c) show SSI during a dry month (July 2019), (d–f) during a wet month (March 2020), and (g) presents the average SSI time series for the entire area.
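As a rough illustration of the SSI comparison, the sketch below standardises monthly soil moisture separately for each calendar month and then scores two candidate series against a reference. The exact SSI formulation is not stated in this section, so the calendar-month Z-score used here is an assumption. The synthetic series, the function names `standardised_index` and `r_squared`, and the use of squared Pearson correlation for R² are likewise illustrative choices.

```python
import numpy as np

def standardised_index(monthly, n_years):
    """Z-score per calendar month; `monthly` has 12 * n_years values starting in January."""
    by_month = monthly.reshape(n_years, 12)
    mean = by_month.mean(axis=0)
    std = by_month.std(axis=0)
    return ((by_month - mean) / std).ravel()

def r_squared(a, b):
    """Squared Pearson correlation between two series (assumed definition of R2)."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

# Synthetic area-averaged monthly series for nine years (stand-in for 2016-2024).
rng = np.random.default_rng(1)
n_years = 9
seasonal = np.tile(0.25 + 0.08 * np.sin(2 * np.pi * np.arange(12) / 12), n_years)
anomaly = rng.normal(0, 0.03, 12 * n_years)

reference = seasonal + anomaly                                    # plays the role of SMAP
ml_like = reference + rng.normal(0, 0.005, 12 * n_years)          # small residual error
model_like = seasonal + anomaly + rng.normal(0, 0.02, 12 * n_years)  # larger error

ssi_ref = standardised_index(reference, n_years)
ssi_ml = standardised_index(ml_like, n_years)
ssi_model = standardised_index(model_like, n_years)

r2_ml = r_squared(ssi_ml, ssi_ref)
r2_model = r_squared(ssi_model, ssi_ref)
```

Standardising by calendar month removes the seasonal cycle, so the scores measure how well each series tracks departures from normal rather than the shared seasonality.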
To further validate the reconstructed RZSM, we analyzed its capacity to capture historical drought events by comparing the standardized anomalies of the ML-predicted RZSM and SIF (Fig. 10). Annual anomalies were derived for each dataset using a 23-year climatology (2001–2023) and were computed for the water year (June–May), which better captures the soil moisture and vegetation memory effects relevant to Indian cropping seasons23. We selected two major drought years, 2002 and 2009, due to their well-documented impacts on agriculture and hydrology23,29,68. Both years exhibit substantial negative soil moisture anomalies across large parts of India. The widespread 2002 drought severely affected western, east-central, and peninsular India, which is effectively reflected in both soil moisture and SIF anomalies (Fig. 10a,b). Similarly, in 2009, negative anomalies were observed across most regions except peninsular India and selected areas in northwestern and central India (Fig. 10c,d). The spatial congruence between RZSM and SIF anomalies across both drought events highlights the skill of the reconstructed soil moisture dataset in representing soil-vegetation dynamics and regional drought severity. These results underscore its potential applicability in drought monitoring, agricultural stress assessment, and hydrological model validation, which is particularly critical in regions with sparse in-situ soil moisture observations.
Fig. 10.
Spatial comparison of ML-predicted soil moisture anomalies and SIF anomalies during the drought years 2002 and 2009 across the Indian region. The standardised anomalies were estimated with respect to the 2001–2023 period mean and standard deviation.
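The water-year anomaly calculation can be sketched for a single grid cell as follows. This is a minimal example under stated assumptions: the monthly series is synthetic, each June–May water year is labelled by its starting calendar year, and the helper names `water_year_labels`, `annual_means`, and `standardised_anomaly` are hypothetical.

```python
import numpy as np

def water_year_labels(years, months):
    """Label each month by the calendar year in which its June-May water year starts."""
    return np.where(months >= 6, years, years - 1)

def annual_means(values, labels):
    """Mean of `values` within each water year."""
    unique = np.unique(labels)
    means = np.array([values[labels == wy].mean() for wy in unique])
    return unique, means

def standardised_anomaly(annual):
    """Anomaly relative to the climatological mean, in standard-deviation units."""
    return (annual - annual.mean()) / annual.std()

# Monthly time axis from June 2001 to May 2024 (23 complete water years).
idx = np.arange(23 * 12)
months = (5 + idx) % 12 + 1
years = 2001 + (5 + idx) // 12
labels = water_year_labels(years, months)

# Synthetic root-zone soil moisture with a seasonal cycle and two imposed dry years.
rng = np.random.default_rng(2)
rzsm = 0.25 + 0.08 * np.sin(2 * np.pi * (months - 4) / 12) + rng.normal(0, 0.01, idx.size)
rzsm[labels == 2002] -= 0.05
rzsm[labels == 2009] -= 0.05

water_years, annual = annual_means(rzsm, labels)
anomalies = standardised_anomaly(annual)
anom_2002 = float(anomalies[water_years == 2002][0])
anom_2009 = float(anomalies[water_years == 2009][0])
```

Applying the same three steps to SIF at every grid cell would give a second anomaly map that can be compared with the soil moisture map for a given drought year.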
Usage Notes
We provide a gridded, SMAP-consistent daily root-zone soil moisture (100 cm) dataset over India at 0.05° spatial resolution for the period 1981–2024. The reconstruction model performed robustly during cross-validation, demonstrating its reliability for soil moisture reconstruction. The reconstructed soil moisture was independently validated with SIF data and showed good performance for most regions across India. Further, the RF model was tested for spatial transferability over a small basin (Cauvery) and was found to satisfactorily reduce the bias in RZSM magnitude from the H08 model. The reconstructed high-resolution RZSM dataset holds substantial potential for monitoring droughts, analyzing soil-vegetation interactions, and supporting hydro-agricultural decision-making across India. The high spatial resolution enables the detection of fine-scale variability in soil moisture, which is critical for localized decision-making in water resources and agricultural planning. The long temporal span also allows trend and variability analysis over multiple decades, making the dataset suitable for climate change impact studies.
Despite strong agreement with SMAP and independent validation using vegetation activity (SIF) and in-situ soil moisture, caution is advised in snow-covered and cold-climate regions where satellite retrievals and model accuracy remain limited. Future work could integrate snow and vegetation indices as predictors to improve accuracy in complex terrains. Additionally, we did not explicitly represent irrigation practices within the hydrological modelling framework. Although SMAP soil moisture products implicitly capture irrigation effects by reflecting actual land surface conditions69, they do not fully represent temporal variability in irrigation intensity and management practices across historical periods due to their limited temporal record70. This may introduce uncertainty in soil moisture estimates over intensively irrigated regions, particularly when interpreting past hydroclimatic variability. Such uncertainties could be reduced in future work by integrating irrigation datasets or dynamic irrigation schemes into hydrological models71. Further efforts to refine the H08 model parameterization would also contribute to more reliable estimates. Overall, the dataset provides a valuable resource for researchers and policymakers working on drought risk reduction, sustainable agriculture, and climate resilience in India, provided that its limitations in certain regions and contexts are duly considered.
Supplementary information
Development of Gridded Root-Zone Soil Moisture Product for India, 1981-2024
Acknowledgements
We acknowledge all the data-providing agencies for freely providing the datasets used in the study. All datasets can be downloaded from the respective sources upon registration. The work is supported by funding from the Major Research & Development Program (MRDP) by the Department of Science and Technology, India (Grant MRDP4356).
Author contributions
V.M. designed the study. D.S.C. and A. performed the analysis and wrote the first draft, A.P.K. and G.V. contributed to the analysis, and V.M. finalised the manuscript.
Data availability
The gridded (0.05°) root-zone soil moisture data for India for 1981–2024 can be accessed from the Zenodo repository (https://doi.org/10.5281/zenodo.17014507). The repository also includes a README file that describes the data structure and file formats.
Code availability
The source codes for the H08 hydrological model and the CaMa-Flood routing model are publicly accessible at http://h08.nies.go.jp and https://hydro.iis.u-tokyo.ac.jp/~yamadai/cama-flood/, respectively. The Random Forest Regressor (RFR) used in this study is implemented using the scikit-learn library, an open-source machine learning toolkit available in Python.
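For orientation, a minimal scikit-learn sketch of a regression of this kind is given below. It is not the authors' configuration. The predictors mirror those named in the abstract (H08 soil moisture, evapotranspiration, precipitation, temperature), but the data are synthetic and the target relationship, hyperparameters, and simple 80/20 hold-out split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic predictors for one hypothetical grid cell (assumed ranges).
rng = np.random.default_rng(42)
n = 2000
h08_sm = rng.uniform(0.10, 0.40, n)   # model soil moisture, m3/m3
et = rng.uniform(0.0, 6.0, n)         # evapotranspiration, mm/day
precip = rng.gamma(0.5, 8.0, n)       # precipitation, mm/day
temp = rng.uniform(10.0, 40.0, n)     # air temperature, deg C
X = np.column_stack([h08_sm, et, precip, temp])

# Synthetic "satellite" target: a biased, noisy function of the model state.
target = 0.03 + 0.8 * h08_sm + 0.002 * et + rng.normal(0, 0.005, n)

# Hold out the last 20% of samples for evaluation.
split = int(0.8 * n)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], target[:split])

pred = model.predict(X[split:])
r2 = r2_score(target[split:], pred)
rmse = float(np.sqrt(mean_squared_error(target[split:], pred)))
```

Once fitted on the satellite-era overlap, a model of this kind can be applied to predictors from earlier years, which is the idea behind extending the record back in time.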
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Dipesh Singh Chuphal, Abhishek.
Supplementary information
The online version contains supplementary material available at 10.1038/s41597-026-06940-x.
References
- 1. Brocca, L., Ciabatta, L., Massari, C., Camici, S. & Tarpanelli, A. Soil Moisture for Hydrological Applications: Open Questions and New Opportunities. Water 9, 140 (2017).
- 2. Seneviratne, S. I. et al. Investigating soil moisture–climate interactions in a changing climate: A review. Earth Sci Rev 99, 125–161 (2010).
- 3. Aadhar, S. & Mishra, V. Data Descriptor: High-resolution near real-time drought monitoring in South Asia. Sci Data 4, 1–14 (2017).
- 4. Jain, M. et al. Groundwater depletion will reduce cropping intensity in India. Sci Adv 7 (2021).
- 5. Vangala, G. & Chandrasekar, A. Analysis of soil moisture estimates from global and regional datasets over the Indian region. Journal of Earth System Science 131, 63 (2022).
- 6. Entekhabi, D. et al. The soil moisture active passive (SMAP) mission. Proceedings of the IEEE 98, 704–716 (2010).
- 7. Kerr, Y. H. et al. The SMOS soil moisture retrieval algorithm. IEEE Transactions on Geoscience and Remote Sensing 50, 1384–1403 (2012).
- 8. Chan, S. K. et al. Assessment of the SMAP Passive Soil Moisture Product. IEEE Transactions on Geoscience and Remote Sensing 54 (2016).
- 9. Kumar, S. V. et al. Evaluating the utility of satellite soil moisture retrievals over irrigated areas and the ability of land data assimilation methods to correct for unmodeled processes. Hydrol Earth Syst Sci 19, 4463–4478 (2015).
- 10. Reichle, R. H. et al. Assessment of the SMAP Level-4 Surface and Root-Zone Soil Moisture Product Using In Situ Measurements. J Hydrometeorol 18, 2621–2645 (2017).
- 11. Reichle, R. H. et al. Version 4 of the SMAP Level-4 Soil Moisture Algorithm and Data Product. J Adv Model Earth Syst 11, 3106–3130 (2019).
- 12. Felfelani, F., Pokhrel, Y., Guan, K. & Lawrence, D. M. Utilizing SMAP Soil Moisture Data to Constrain Irrigation in the Community Land Model. Geophys Res Lett 45, 12,892–12,902 (2018).
- 13. Martens, B. et al. GLEAM v3: Satellite-based land evaporation and root-zone soil moisture. Geosci Model Dev 10, 1903–1925 (2017).
- 14. Luo, X. et al. Spatio-temporal changes in global root zone soil moisture from 1981 to 2017. J Hydrol (Amst) 626, 130297 (2023).
- 15. Hersbach, H. et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society 146, 1999–2049 (2020).
- 16. Tian, J. et al. Predicting root zone soil moisture using observations at 2121 sites across China. Science of The Total Environment 847, 157425 (2022).
- 17. Xu, Y., Calvet, J. C. & Bonan, B. The joint assimilation of satellite observed LAI and soil moisture for the global root zone soil moisture production and its impact on land surface and ecosystem variables. Agric For Meteorol 360, 110299 (2025).
- 18. Wu, K. et al. Automated drone-borne GPR mapping of root-zone soil moisture for precision irrigation. Remote Sens Environ 333, 115110 (2026).
- 19. Kasim, A. A. et al. Remote sensing of root zone soil moisture: A review of methods and products. J Hydrol (Amst) 656, 133002 (2025).
- 20. Srivastava, S. & Dhanapriya, M. Remote Sensing Based Soil Moisture Estimation Using In-Situ Probes in Varanasi District, India. International Journal of Environment and Climate Change 15, 389–403 (2025).
- 21. Pitman, A. J. The evolution of, and revolution in, land surface schemes designed for climate models. International Journal of Climatology 23, 479–510 (2003).
- 22. He, Q., Lu, H. & Yang, K. Soil Moisture Memory of Land Surface Models Utilized in Major Reanalyses Differ Significantly From SMAP Observation. Earths Future 11, e2022EF003215 (2023).
- 23. Mishra, V. et al. Reconstruction of droughts in India using multiple land-surface models (1951–2015). Hydrol Earth Syst Sci 22, 2269–2284 (2018).
- 24. Kragh, S. J., Fensholt, R., Stisen, S. & Koch, J. The precision of satellite-based net irrigation quantification in the Indus and Ganges basins. Hydrol Earth Syst Sci 27, 2463–2478 (2023).
- 25. Han, Q. et al. Ensemble of optimised machine learning algorithms for predicting surface soil moisture content at a global scale. Geosci Model Dev 16, 5825–5845 (2023).
- 26. Batchu, V., Nearing, G. & Gulshan, V. A Machine Learning Data Fusion Model for Soil Moisture Retrieval. https://arxiv.org/pdf/2206.09649 (2022).
- 27. Hanasaki, N. et al. An integrated model for the assessment of global water resources - Part 1: Model description and input meteorological forcing. Hydrol Earth Syst Sci 12, 1007–1025 (2008).
- 28. Hanasaki, N., Yoshikawa, S., Pokhrel, Y. & Kanae, S. A global hydrological simulation to specify the sources of water used by humans. Hydrol Earth Syst Sci 22, 789–817 (2018).
- 29. Chuphal, D. S., Kushwaha, A. P., Aadhar, S. & Mishra, V. Drought Atlas of India, 1901–2020. Sci Data 11, 1–12 (2024).
- 30. Zhang, X. & Tang, Q. Combining satellite precipitation and long-term ground observations for hydrological monitoring in China. Journal of Geophysical Research: Atmospheres 120, 6426–6443 (2015).
- 31. Teutschbein, C. & Seibert, J. Is bias correction of regional climate model (RCM) simulations possible for non-stationary conditions? Hydrol Earth Syst Sci 17, 5061–5077 (2013).
- 32. Chuphal, D. S. & Mishra, V. Increased hydropower but with an elevated risk of reservoir operations in India under the warming climate. iScience 26 (2023).
- 33. Yamazaki, D., Kanae, S., Kim, H. & Oki, T. A physically based description of floodplain inundation dynamics in a global river routing model. Water Resour Res 47, 4501 (2011).
- 34. Vegad, U., Pokhrel, Y. & Mishra, V. Flood risk assessment for Indian sub-continental river basins. Hydrol Earth Syst Sci 28, 1107–1126 (2024).
- 35. Chuphal, D. S. & Mishra, V. Hydrological model-based streamflow reconstruction for Indian sub-continental river basins, 1951–2021. Sci Data 10, 1–11 (2023).
- 36. Solanki, H. & Mishra, V. Machine learning based gap filling of streamflow and water level observations in India, 1961–2021. ESS Open Archive https://doi.org/10.22541/essoar.175103952.29794951/v1 (2025).
- 37. Magotra, B., Saharia, M. & Dhanya, C. T. Improved streamflow simulations in hydrologically diverse basins using physically-informed deep learning models. Hydrological Sciences Journal 70, 775–788 (2025).
- 38. Mu, Q., Zhao, M. & Running, S. W. MODIS global terrestrial evapotranspiration (ET) product (NASA MOD16A2/A3). Algorithm Theoretical Basis Document, Collection 5, 381–394 (2013).
- 39. Grogan, D. S. et al. Natural and anthropogenic drivers of the lost groundwater from the Ganga River basin. Environmental Research Letters 16, 114009 (2021).
- 40. Manabe, S. Climate and the ocean circulation: I. The atmospheric circulation and the hydrology of the Earth’s surface. Mon Weather Rev 97, 739–774 (1969).
- 41. Gerten, D., Schaphoff, S., Haberlandt, U., Lucht, W. & Sitch, S. Terrestrial vegetation and water balance: hydrological evaluation of a dynamic global vegetation model. J Hydrol (Amst) 286, 249–270 (2004).
- 42. Robock, A., Vinnikov, K. Y., Schlosser, C. A., Speranskaya, N. A. & Xue, Y. Use of midlatitude soil moisture and meteorological observations to validate soil moisture simulations with biosphere and bucket models. J Clim 8, 15–35 (1995).
- 43. Deardorff, J. W. Efficient prediction of ground surface temperature and moisture, with inclusion of a layer of vegetation. J Geophys Res Oceans 83, 1889–1903 (1978).
- 44. Bhumralkar, C. M. Numerical Experiments on the Computation of Ground Surface Temperature in an Atmospheric General Circulation Model. J Appl Meteorol Climatol 14, 1246–1258 (1975).
- 45. Li, X. & Xiao, J. A Global, 0.05-Degree Product of Solar-Induced Chlorophyll Fluorescence Derived from OCO-2, MODIS, and Reanalysis Data. Remote Sensing 11, 517 (2019).
- 46. Sun, Y. et al. OCO-2 advances photosynthesis observation from space via solar-induced chlorophyll fluorescence. Science 358 (2017).
- 47. Gao, J. & O’Neill, B. C. Mapping global urban land for the 21st century with data-driven simulations and Shared Socioeconomic Pathways. Nature Communications 11, 1–12 (2020).
- 48. Nearing, G. S. et al. What Role Does Hydrological Science Play in the Age of Machine Learning? Water Resour Res 57, e2020WR028091 (2021).
- 49. Zhao, W., Sánchez, N., Lu, H. & Li, A. A spatial downscaling approach for the SMAP passive surface soil moisture product using random forest regression. J Hydrol (Amst) 563, 1009–1024 (2018).
- 50. Zhang, H. et al. Downscaling of AMSR-E Soil Moisture over North China Using Random Forest Regression. ISPRS International Journal of Geo-Information 11, 101 (2022).
- 51. Carranza, C., Nolet, C., Pezij, M. & van der Ploeg, M. Root zone soil moisture estimation with Random Forest. J Hydrol (Amst) 593, 125840 (2021).
- 52. Adab, H., Morbidelli, R., Saltalippi, C., Moradian, M. & Ghalhari, G. A. F. Machine Learning to Estimate Surface Soil Moisture from Remote Sensing Data. Water 12, 3223 (2020).
- 53. Simons, G., Koster, R. & Droogers, P. HiHydroSoil v2.0: high resolution soil maps of global hydraulic properties. FutureWater. Available from https://www.futurewater.eu/projects/hihydrosoil (2020).
- 54. Lehner, B., Verdin, K. & Jarvis, A. New global hydrography derived from spaceborne elevation data. Eos, Transactions American Geophysical Union 89, 93–94 (2008).
- 55. Sulla-Menashe, D. & Friedl, M. A. User guide to collection 6 MODIS land cover (MCD12Q1 and MCD12C1) product. USGS: Reston, VA, USA 1, 18 (2018).
- 56. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
- 57. Breiman, L. Random forests. Mach Learn 45, 5–32 (2001).
- 58. Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M. & Gräler, B. Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ 6, e5518 (2018).
- 59. Chen, P. Y., Chen, C. C., Kang, C., Liu, J. W. & Li, Y. H. Soil water content prediction across seasons using random forest based on precipitation-related data. Comput Electron Agric 230, 109802 (2025).
- 60. Xu, J., Wu, Z., Wang, C. & Jia, X. Machine unlearning: Solutions and challenges. IEEE Trans Emerg Top Comput Intell 8, 2150–2168 (2024).
- 61. Probst, P., Wright, M. N. & Boulesteix, A. L. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip Rev Data Min Knowl Discov 9, e1301 (2019).
- 62. Chuphal, D. S. Development of High-Resolution Soil Moisture Product for India, 1981–2024. Zenodo https://doi.org/10.5281/zenodo.17014507 (2025).
- 63. Kushwaha, A. P. et al. Multimodel assessment of water budget in Indian sub-continental river basins. J Hydrol (Amst) 603, 126977 (2021).
- 64. Han, Q. et al. Global long term daily 1 km surface soil moisture dataset with physics informed machine learning. Sci Data 10, 1–12 (2023).
- 65. Shokati, H. et al. Random Forest-Based Soil Moisture Estimation Using Sentinel-2, Landsat-8/9, and UAV-Based Hyperspectral Data. Remote Sens (Basel) 16, 1962 (2024).
- 66. Goswami, M. M. et al. Understanding the soil water dynamics during excess and deficit rainfall conditions over the Core monsoon zone of India. https://arxiv.org/pdf/2308.15196 (2023).
- 67. Ganeshi, N. G. et al. Assessing the impact of soil moisture-temperature coupling on temperature extremes over the Indian region. NPJ Clim Atmos Sci 6 (2022).
- 68. Malik, I. & Mishra, V. Sub-seasonal to seasonal (S2S) prediction of dry and wet extremes for climate adaptation in India. Clim Serv 34, 100457 (2024).
- 69. Lawston, P. M., Santanello, J. A. & Kumar, S. V. Irrigation Signals Detected From SMAP Soil Moisture Retrievals. Geophys Res Lett 44, 11,860–11,867 (2017).
- 70. Ozdogan, M., Rodell, M., Beaudoing, H. K. & Toll, D. L. Simulating the Effects of Irrigation over the United States in a Land Surface Model Based on Satellite-Derived Agricultural Data. J Hydrometeorol 11, 171–184 (2010).
- 71. Leng, G. et al. Modeling the effects of irrigation on land surface fluxes and states over the conterminous United States: Sensitivity to input data and model parameters. Journal of Geophysical Research: Atmospheres 118, 9789–9803 (2013).