Scientific Data. 2025 Dec 15;13:122. doi: 10.1038/s41597-025-06436-0

A global long-term daily multilayer soil moisture dataset derived from machine learning

Zeyang Wei 1,2, Lifei Wei 1,2, Ting Wang 3, Qikai Lu 1,2, Shuang Tian 1,2, Fei Zhang 4, Yanfei Zhong 5
PMCID: PMC12858798  PMID: 41397989

Abstract

Soil moisture is a critical component of the Earth’s energy and water cycles. However, most existing products focus solely on surface layers, and continuous, high‐resolution datasets for deep soil horizons remain scarce. To address this gap, we generated a global, daily, seamless multilayer soil moisture dataset (SWSM) for the period 2002–2021 by leveraging a machine learning approach (XGBoost). The SWSM dataset provides estimates at a 0.05° spatial resolution for three depth horizons: 0–10 cm, 10–30 cm, and 30–60 cm. Rigorous validation against in situ observations demonstrated the dataset’s high accuracy, with Pearson correlation coefficients exceeding 0.90 and root mean square errors below 0.05 across all depths. A feature importance assessment verified the dataset’s physical consistency, revealing depth-dependent patterns aligned with established hydrological understanding. The SWSM dataset, with its long-term temporal coverage, fine spatial resolution, and multi-layer structure, is a valuable resource for applications in hydrologic modeling, agricultural water management, and climate change studies.

Subject terms: Hydrology, Ecology

Background & Summary

Soil moisture is a critical state variable governing land-atmosphere interactions, ecohydrological processes, and agricultural productivity1,2. It modulates surface energy partitioning, vegetation dynamics, and groundwater recharge while serving as a key indicator for drought monitoring and climate change adaptation3. Despite its importance, accurately capturing the spatiotemporal variability of soil moisture across multiple depths remains challenging. Surface-layer moisture (0–5 cm) directly influences evapotranspiration and short-term weather patterns, whereas root-zone moisture (5–30 cm) and deeper layers (30–60 cm) regulate vegetation water uptake and long-term hydrological cycles4. Current observational networks and remote sensing products, however, fail to resolve these vertical and horizontal heterogeneities at scales relevant to ecosystem and agricultural management5.

Satellite-derived soil moisture products, such as those from SMAP and AMSR2, face three persistent limitations: (1) coarse spatial resolution (>25 km), which obscures fine-scale variability in heterogeneous landscapes6; (2) restricted vertical sensitivity, with most algorithms limited to surface-layer estimates (0–5 cm) and inadequate for deeper soil horizons7; and (3) spatiotemporal discontinuities due to cloud contamination (optical sensors) or vegetation/roughness interference (microwave sensors)8. While land surface models (e.g., ERA5-Land) provide global coverage, their reliance on simplified parameterizations introduces uncertainties in regions with complex terrain or dynamic vegetation9.

The spatial and temporal dynamics of soil moisture are driven by synergistic interactions among meteorological forcing (e.g., precipitation, radiation), soil properties (e.g., texture, hydraulic conductivity), vegetation activity (e.g., leaf area index, LAI), and topography (e.g., elevation, slope)10. Surface moisture responds rapidly to rainfall and solar radiation, while deeper layers reflect slower processes influenced by soil porosity and root water uptake11. Traditional remote sensing approaches struggle to disentangle these multifactorial controls, particularly in data-sparse regions12.

Recent advances in machine learning (ML) offer transformative potential for soil moisture estimation by integrating multi-source data13. Different ML approaches exhibit complementary strengths. Deep sequence models such as Long Short-Term Memory (LSTM) networks excel at capturing temporal dependencies, but are often data- and computation-intensive14. Convolutional or physics-informed hybrid networks can leverage spatial context, yet require careful design and large training datasets15. Tree-based ensemble methods, such as Extreme Gradient Boosting (XGBoost), efficiently integrate heterogeneous input features and achieve high predictive accuracy while remaining computationally scalable16,17. Notable examples include Sung et al., who employed LSTM to generate the global multi-layer dataset SoMo.ml (0.25°)18 and its European counterpart SoMo.ml-EU (0.1°)19, and the GLASS-SM and GSSM products20, which provide global surface soil moisture at 1 km resolution. While these pioneering efforts address important challenges, they also highlight trade-offs among spatial resolution, temporal continuity, and vertical coverage21.

To address these limitations, we developed an XGBoost-based multilayer soil moisture inversion framework that produces the Seamless Worldwide Soil Moisture (SWSM) dataset, a global, gap-free daily product at 0.05° resolution for the period 2002–2021 that covers three distinct layers (0–10 cm, 10–30 cm, and 30–60 cm). Specifically, the framework integrates ERA5-Land reanalysis products along with multiple environmental variables, with the International Soil Moisture Network (ISMN) serving as the reference for training. In addition, interpretable ML models provide novel insights into the controlling factors of soil moisture variability at different depths. The key contribution of SWSM lies in its simultaneous provision of long-term spatiotemporal coverage (2002–2021, global), fine spatial resolution (0.05°), and multilayer soil moisture estimates, making it an essential resource for hydrological simulations, agricultural drought monitoring, and climate change assessments.

Methods

Environment factors

As shown in Table 1, 11 input datasets were selected as predictors for this study, and in situ soil moisture observations were used as training targets. ERA5-Land soil moisture served as the primary predictor, and the remaining variables served as auxiliary predictors: precipitation, surface net solar radiation (SSR)22,23, land surface temperature (LST)24, leaf area index (LAI)25, land use data26, digital elevation model27, soil texture28, and depth to bedrock (DTB)28.

Table 1.

Details of the data used in data generation.

Category | Dataset | Spatial resolution | Temporal resolution
Reanalysis product | ERA5_Land SM23, ERA5_Land TP23, ERA5_Land SSR23 | 0.1° × 0.1° | hourly
LST | GLASS LST24 (glass.bnu.edu.cn/introduction/lst.html) | 0.05° × 0.05° | daily
Vegetation | GLASS LAI25 (https://glass.bnu.edu.cn/introduction/LAI.html) | 0.05° × 0.05° | 8 days
LULC | MCD12C126 | 0.05° | yearly
Altitude | GMTED201027 | 250 m | n/a
Soil texture | SoilGrids28 (https://soilgrids.org/) | 250 m | n/a

Precipitation serves as the primary external source of soil moisture, governing not only surface replenishment but also influencing deeper moisture storage via infiltration29. ERA5-Land precipitation data quantitatively capture short-term moisture inputs, elucidating the impacts of precipitation events on both surface and subsurface soil moisture and providing critical insight into infiltration processes30. Concurrently, surface net solar radiation, as a key indicator of the energy balance, elevates surface temperatures to promote evaporation while also modulating moisture retention31. Specifically, following a precipitation event, intense SSR leads to a rapid increase in surface temperature, thereby accelerating evaporation, whereas lower SSR conditions tend to favor the retention of moisture within the soil. Moreover, the spatial and temporal variability of SSR can affect local microclimates, indirectly regulating the post-precipitation distribution of moisture in soils32.

LST is another pivotal indicator of the surface energy balance, reflecting the complex interplay between soil thermal inertia and moisture status33. Moist soils, with their higher specific heat capacity and thermal inertia, can dampen temperature fluctuations, whereas arid soils, lacking sufficient moisture, allow a greater proportion of solar radiation to be converted into heat, resulting in pronounced temperature increases34. The spatial and temporal variations in LST thus provide an intuitive remote sensing signal for soil moisture—smaller diurnal temperature ranges generally correspond to higher soil water content, while larger fluctuations often signal moisture deficits35. In this way, LST data robustly capture the dynamic evolution of soil moisture, offering essential support for soil moisture modeling.

Vegetation influences soil moisture not only by directly regulating water loss through transpiration but also by affecting the infiltration, distribution, and release of moisture via its physical structure and physiological processes36. In densely vegetated areas, canopy interception reduces direct soil exposure to intense solar radiation, thereby limiting evaporation from bare soil, while an extensive root system enhances vertical moisture transport and storage37,38. Leaf area index, a crucial metric of canopy cover, not only reflects vegetation density but also indirectly indicates the capability of vegetation to intercept water and regulate transpiration39. Regions with high LAI typically exhibit stronger canopy interception and lower evaporation rates from bare soil, providing a distinct advantage for soil moisture retrieval40. Experimental results have demonstrated that incorporating LAI data into soil moisture remote sensing models significantly enhances their performance, particularly under varying vegetation conditions41.

Elevation also plays a critical role in soil moisture retrieval, as increasing altitude is typically accompanied by reductions in temperature and alterations in precipitation distribution, both of which directly or indirectly influence soil moisture. Low-elevation areas may exhibit higher soil moisture due to concentrated precipitation, while high-elevation regions may display different moisture distribution patterns as a result of reduced evaporation under cooler conditions42. Digital elevation data therefore carry information not only on these temperature and precipitation gradients but also on the regional climate differences associated with elevation, and this implicit relationship may help the model estimate soil moisture quantitatively43.

Soil texture fundamentally determines water-holding capacity; areas with a high clay content, due to their complex pore structure, can retain more moisture, whereas sandy soils, with their high permeability and lower water retention, behave differently. Depth to bedrock defines the vertical boundary of the soil profile, where shallow bedrock limits root water uptake and subsequently affects soil moisture dynamics44. In this study, we employed global soil texture databases along with DTB raster data to enhance the model’s simulation of the complex water environment in the subsurface45,46.

Different land cover types—such as agricultural fields, forests, and grasslands—modulate the water cycle through variations in vegetation transpiration, canopy interception, and surface roughness47. For example, irrigated croplands may mask natural moisture variability due to supplemental watering, while forested areas often exhibit unique moisture temporal patterns as a result of deep-root systems and litter retention48. Land use/land cover data were incorporated into the model via categorical encoding to capture the nonlinear impacts of land use on moisture redistribution49.

In Situ Site Data

The International Soil Moisture Network (ISMN, https://ismn.earth/en/), maintained by numerous research organizations and scientists worldwide, is a global platform for sharing in situ soil moisture data50. It integrates measurements from 2,842 stations across 71 sub-networks (spanning 1952–2024), covering major climate zones and soil types worldwide. The dataset includes soil moisture, temperature, and precipitation parameters measured from 0 to 2 meters, and employs a three-tier quality control system. Following the established ISMN quality control guidelines50,51, only measurements with the ‘G’ (Good) quality flag were selected for model training.

Soil moisture targets were defined for the depth intervals of 0–10, 10–30, and 30–60 cm. Since these intervals do not exactly align with the ERA5-Land layers (0–7, 7–28, and 28–100 cm), we applied a depth-weighted interpolation approach to harmonize the datasets. Specifically, ERA5-Land soil moisture values were vertically interpolated by weighting the contributions of overlapping depth segments. For in-situ data selection, measurements closest to the central depth of each layer (i.e., 5 cm, 20 cm, and 35 cm for the 0–10 cm, 10–30 cm, and 30–60 cm layers, respectively) were used as representative values. However, when the central depth data were unavailable, the arithmetic mean of all available measurements within the corresponding layer was used instead. Although this approach represents a simplification of the continuous vertical distribution of soil moisture, it provides a practical and relatively reliable way to approximate layer-averaged soil moisture. Since ISMN data are recorded hourly—posing challenges for synchronizing with satellite and reanalysis datasets—the daily average of the site data was computed. For satellite grid cells containing multiple stations, the measurements were averaged, and only stations with at least 60 valid yearly measurements during the study period (2002–2021) were retained. The sites utilized in this study are illustrated in Fig. 1.
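The depth weighting described above can be sketched as follows. This is a minimal illustration, assuming that each target-layer value is the thickness-weighted mean of the ERA5-Land layers it overlaps. The function names overlap and depth_weighted_mean and the sample values are hypothetical and are not taken from the authors' code.

```python
# ERA5-Land layer bounds and target layer bounds, in cm (from the text).
ERA5_LAYERS = [(0, 7), (7, 28), (28, 100)]
TARGET_LAYERS = [(0, 10), (10, 30), (30, 60)]


def overlap(a, b):
    """Thickness (cm) shared by two depth intervals a and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def depth_weighted_mean(era5_values, target):
    """Average ERA5-Land layer values, weighted by overlap with `target`."""
    weights = [overlap(layer, target) for layer in ERA5_LAYERS]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, era5_values)) / total


# Hypothetical volumetric soil moisture (m3/m3) for the three ERA5-Land layers.
era5_sm = [0.20, 0.30, 0.40]
harmonized = [depth_weighted_mean(era5_sm, t) for t in TARGET_LAYERS]
```

With these bounds, the 0–10 cm layer takes 70% of its weight from the 0–7 cm ERA5-Land layer and 30% from the 7–28 cm layer, whereas the 30–60 cm layer lies entirely within the 28–100 cm layer.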

Fig. 1.

Distribution of in situ data points.

Moreover, because the ISMN data are collected from heterogeneous sensors with varying adjustment protocols, systematic biases may arise among different stations, which can affect the consistency of the training dataset. To mitigate this, we adopted a normalization approach following Sungmin and Qingliang Li18,52. Specifically, over the entire study period, each in-situ soil moisture record was rescaled using the long-term mean and standard deviation of the corresponding ERA5-Land grid cell. This procedure harmonized the in-situ datasets across stations, while preserving their original daily temporal variability. The resulting standardized dataset was then used as the prediction target in training the machine learning models.
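One plausible reading of this rescaling step is sketched below. It assumes that each station series is first standardized with its own mean and standard deviation and then mapped onto the long-term mean and standard deviation of the ERA5-Land grid cell; the exact formula is not given in the text, and the function name rescale_to_reference and all sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def rescale_to_reference(insitu, ref_mean, ref_std):
    """Map an in situ series onto the reference mean and standard deviation.

    Assumed form: z-score the series with its own statistics, then
    rescale with the reference (ERA5-Land grid cell) statistics.
    """
    mu, sigma = mean(insitu), pstdev(insitu)
    if sigma == 0:
        # A constant series carries no variability; return the reference mean.
        return [ref_mean for _ in insitu]
    return [ref_mean + ref_std * (x - mu) / sigma for x in insitu]


# Hypothetical daily in situ values and ERA5-Land long-term statistics.
station = [0.10, 0.14, 0.18, 0.22, 0.26]
rescaled = rescale_to_reference(station, ref_mean=0.30, ref_std=0.05)
```

Because the mapping is linear with a positive slope, the ordering and relative timing of wet and dry days in the station record are unchanged; only the level and spread are adjusted.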

Model training and global data generation

XGBoost (eXtreme Gradient Boosting) is a state-of-the-art machine learning algorithm renowned for its high performance in regression and classification tasks17. In our study, it is employed to model soil moisture dynamics by effectively integrating various data sources, including satellite observations, reanalysis data, and in situ measurements. XGBoost constructs an ensemble of decision trees in a sequential manner, where each new tree aims to correct the errors of the previous ones. This iterative process, known as gradient boosting, enables the algorithm to capture complex nonlinear relationships between the input features and soil moisture53.

The workflow for generating global data is illustrated in Fig. 2. Specifically, after screening and adjusting the ISMN site data, all environmental variable features are aligned to form the training dataset, which is then input into the XGBoost model. Data spanning the entire study period were randomly divided into training and testing subsets at a ratio of 9:1. Recognizing that hyperparameters are critical to model performance, we performed hyperparameter selection using grid search within a 10-fold cross-validation framework on the training set. In this process, the dataset was randomly partitioned into 10 groups, with one group (10% of the data) serving as the validation set and the remaining 90% used for model fitting in each iteration. This procedure was repeated 10 times, ensuring that each subset was used for validation exactly once; the average score across all folds was then computed to provide a robust assessment of model performance. The optimal hyperparameters for each soil layer were determined as follows: Layer 1: n_estimators = 900, learning_rate = 0.1, max_depth = 8; Layer 2: n_estimators = 1000, learning_rate = 0.1, max_depth = 8; Layer 3: n_estimators = 1000, learning_rate = 0.1, max_depth = 8.
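The data partitioning described above can be illustrated with a short, library-free sketch: a random 9:1 train/test split followed by a 10-fold partition of the training set. The helper names and the sample size of 1000 are hypothetical. In a real grid search, a model would be fitted on each fit subset for every candidate hyperparameter combination and scored on the matching validation subset, with scores averaged over the folds.

```python
import random


def train_test_split_indices(n, test_fraction=0.1, seed=0):
    """Randomly split range(n) into training and testing index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=10, seed=0):
    """Yield (fit, validation) index lists; each sample validates once."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, val


# Hypothetical sample count; the real training set size is not stated here.
train_idx, test_idx = train_test_split_indices(1000)
splits = list(k_fold_indices(train_idx, k=10))
```

The held-out test indices never enter the cross-validation loop, so the final test score remains independent of hyperparameter selection.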

Fig. 2.

Overall flow of the study.

Finally, all variables were resampled to a resolution of 0.05°, and the final calibrated XGBoost model was employed to predict a 20-year daily seamless global soil moisture data product. Model validation metrics were chosen based on their widespread use in soil moisture retrieval studies, including the correlation coefficient (R), root mean square error (RMSE), unbiased RMSE (ubRMSE), and bias54.
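For reference, the four validation metrics can be computed as in the sketch below. It uses the standard definitions, with ubRMSE obtained from the identity ubRMSE² = RMSE² − bias²; the function name validation_metrics and the sample values are hypothetical.

```python
import math


def validation_metrics(pred, obs):
    """Return Pearson R, RMSE, ubRMSE, and bias for paired samples."""
    n = len(pred)
    mp, mo = sum(pred) / n, sum(obs) / n
    bias = mp - mo
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / n)
    # ubRMSE removes the mean offset: ubRMSE^2 = RMSE^2 - bias^2.
    ubrmse = math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    r = cov / math.sqrt(var_p * var_o)
    return {"R": r, "RMSE": rmse, "ubRMSE": ubrmse, "Bias": bias}


# Hypothetical predictions with a constant +0.02 offset from observations.
obs = [0.10, 0.20, 0.30, 0.40]
pred = [0.12, 0.22, 0.32, 0.42]
m = validation_metrics(pred, obs)
```

In this example the error is a pure offset, so the RMSE equals the bias and the ubRMSE is zero. Conversely, whenever the bias is zero, ubRMSE and RMSE coincide.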

Data Records

The SWSM dataset can be accessed at Zenodo55,56. It is organized as annual ZIP archives, each containing one year of global daily soil moisture at 0.05° resolution, subdivided into four seasonal files (Q1–Q4). Each seasonal file is stored in NetCDF-4 format and named according to the pattern SWSM_Layer1_<YYYY>_Q<q>.nc for streamlined batch processing. Within each file are three variables (L1, L2, and L3) representing soil moisture in the 0–10 cm, 10–30 cm, and 30–60 cm layers, respectively, enabling multi-layer analyses. Raw GeoTIFF pixel values have been scaled by integer division by 100, with the value 255 reserved as a missing data flag.
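The naming pattern and value encoding can be handled along the lines of the sketch below. It rests on one explicit assumption that the text does not confirm: that stored integers equal volumetric soil moisture multiplied by 100, so that dividing by 100 recovers m³/m³. The helper names seasonal_filename and decode are hypothetical; users should check the variable attributes in the files before relying on this conversion.

```python
MISSING = 255  # missing-data flag stated in the text


def seasonal_filename(year, quarter):
    """Build a file name following the pattern given in the text."""
    return f"SWSM_Layer1_{year}_Q{quarter}.nc"


def decode(stored):
    """Convert a stored integer to m3/m3, or None for the missing flag.

    Assumption: stored values are soil moisture multiplied by 100.
    """
    return None if stored == MISSING else stored / 100.0


names = [seasonal_filename(2010, q) for q in range(1, 5)]
values = [decode(v) for v in (0, 23, 47, 255)]
```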

Technical Validation

Model validation

The distribution of model predictions versus in situ measurements for the three soil layers is illustrated in Fig. 3. The model consistently demonstrated high predictive performance in all three soil layers (0–10 cm, 10–30 cm, and 30–60 cm) as shown in the scatterplot and probability density distribution (Fig. 3), which are summarized quantitatively in Table 2. The correlation coefficient (R) increased slightly with depth, from 0.905 in the topsoil layer (layer 1) to 0.919 in layers 2 and 3, indicating strong linear agreement between predicted and in situ soil moisture values at all depths. Correspondingly, the root mean square error (RMSE) decreased from 0.047 m³/m³ in Layer 1 to 0.045 m³/m³ in Layer 3, further demonstrating the robustness of the model in deeper soil layers. The unbiased root-mean-square error (ubRMSE) was equal to the root-mean-square error at all depths, and the bias remained practically zero in all three soil layers, indicating that the model neither systematically overestimated nor underestimated soil moisture. The density scatter plot shows a high concentration of predictions along the 1:1 reference line, and the color gradient confirms that most of the predictions are very close to the true observations. The similarity between the predicted and observed soil moisture distributions is further confirmed by the kernel density estimates in the right panel of Fig. 3. A slight bias can be observed in the tails, especially in layer 1, where the model tends to slightly underestimate higher soil moisture values. Nevertheless, the overall shape and central tendency of the soil moisture distribution at all depths are well maintained.

Fig. 3.

Scatter plots and probability density plots of predicted versus measured soil moisture for the three layers.

Table 2.

Statistical performance indicators (R, RMSE, ubRMSE, and Bias) of the soil moisture predictions for the three layers.

Data R RMSE ubRMSE Bias
Layer1 0.905 0.047 0.047 0.000
Layer2 0.919 0.046 0.046 0.000
Layer3 0.919 0.045 0.045 0.000

A comprehensive comparison of the SWSM, SoMo.ml, GLDAS, and GLEAM datasets was performed based on the median BIAS, R, and RMSE, as shown in Fig. 4. Overall, both SWSM and SoMo.ml, as machine learning-based datasets, outperformed GLDAS57 and GLEAM58 across the three depths and all metrics, indicating their ability to capture local soil moisture dynamics more accurately and consistently. In the 0–10 cm layer, SWSM shows the smaller median BIAS and the narrower error distribution. In the 10–30 cm layer, SWSM generally maintains low bias and high R for Fine and Medium textures, while SoMo.ml presents slightly higher median R in several Coarse cases. In the 30–60 cm layer (Layer 3), SWSM’s median BIAS is closer to zero than that of SoMo.ml, and SWSM attains lower RMSE in both Coarse and Fine textures. Conversely, SoMo.ml still achieves competitive RMSE and higher R in some deeper/coarse subsets.

Fig. 4.

Comparison of SWSM, SoMo.ml, GLDAS, and GLEAM against in-situ ISMN soil moisture across three depths (0–10 cm, 10–30 cm, 30–60 cm) and soil textures. Columns denote soil depths (Layer 1–3), and rows show evaluation metrics (BIAS, R, RMSE) for Coarse, Fine, and Medium conditions.

Temporal and spatial verification

Sites were randomly selected across the five major climate zones, and the temporal dynamics of four datasets (SWSM, GLEAM, GLDAS and SoMo.ml) were evaluated at three soil depths for the period 2014–2019 (Fig. 5). Overall, SWSM shows robust temporal reproduction across climates and depths: at most humid tropical and temperate sites SWSM accurately captures seasonal amplitudes and precipitation-driven short-term responses, with good phase alignment to in-situ observations and generally lower RMSE. At arid sites, SWSM exhibits slightly lower Pearson R in the surface and middle layers compared with some other products, but its smaller RMSE indicates better numerical consistency with in-situ values. Notably, in polar regions SWSM maintains low RMSE and high correlation (R > 0.8) in the surface and middle layers, although correlation decreases in the deepest layer — likely a consequence of polar-specific seasonal freeze–thaw cycles and the presence of permafrost59. In addition, SWSM’s multilayer time series typically respond well to precipitation variability, reflecting sensitivity to precipitation-driven short-term events and regular seasonal cycles. In summary, SWSM attains high correlation and low error at most sites and is effective at capturing both short-term precipitation responses and seasonal variability, though further improvements are desirable for deep freeze–thaw/permafrost conditions.

Fig. 5.

Comparison of in situ observations with GLDAS, GLEAM, and SWSM soil moisture time series for the five major climate zones in 2014–2019 (light blue bars are precipitation).

We conducted a spatial comparison of three soil-moisture datasets against global in-situ observations; the resulting distribution of R is shown in Fig. 6. Overall, SWSM achieves high agreement in temperate regions such as Europe and North America. In contrast, correlations drop markedly in arid and semi-arid zones—Central Asia, the interior of North America, Australia’s interior, and the margins of the Sahara—where GLEAM and GLDAS also exhibit greater dispersion. Moreover, all three datasets show low correlations in mountainous areas with complex topography, pronounced freeze-thaw cycles, or seasonal snow cover. Finally, correlation strength declines progressively for all datasets as sensing depth increases.

Fig. 6.

Spatial distribution of Pearson’s correlation coefficients of the three soil moisture datasets, SWSM, GLEAM, and GLDAS, globally at three depths of the soil layers with respect to the in-situ observations; (a) Layer1, (b) Layer2, (c) Layer3.

SHAP value validation

To evaluate the physical plausibility and internal consistency of the generated soil moisture dataset, we employed a feature importance analysis based on SHAP values. The goal was to verify whether the relationships embedded within the dataset align with established hydrological principles. The analysis reveals a clear, depth-dependent hierarchy of feature contributions, which supports the physical realism of the relationships represented in our dataset (Fig. 7). For the 0–10 cm layer, the primary influence of the ERA5-Land soil moisture data indicates that our dataset captures the dominant near-surface dynamics. Topography and soil texture (silt, clay) also play significant roles, confirming that our dataset represents these critical regulating factors. At 10–30 cm depth, ERA5-Land SM remains important, while soil physical properties (e.g., sand content) and subsurface indicators begin to gain influence. The dataset reflects this shift in dominant controls. For the 30–60 cm layer, the dataset retains a strong signature of ERA5-Land SM, suggesting that it preserves valuable hydrological information. Furthermore, the influence of subsurface features such as sand content and depth to bedrock (DTB), which are strongly associated with variations in deep groundwater, increases. Overall, the declining influence of surface-related predictors (LST, LAI) with depth and the rising weight of texture and bedrock depth provide coherent, physically plausible depth-wise trends. Together, these SHAP-derived patterns support the dataset’s structural validity and its suitability for process-based soil moisture studies.
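To make the notion of a SHAP value concrete, the sketch below computes exact Shapley values for a toy three-feature model by enumerating all feature coalitions. It is illustrative only: the text does not specify which SHAP implementation was used, brute-force enumeration is feasible only for a handful of features, and the model toy_model, its coefficients, and the input and baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def toy_model(features):
    """Hypothetical linear model of ERA5 SM, sand fraction, LST anomaly."""
    era5_sm, sand, lst = features
    return 0.8 * era5_sm - 0.1 * sand - 0.05 * lst


x = [0.30, 0.60, 1.0]
baseline = [0.20, 0.40, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction at x and the prediction at the baseline, and averaging their absolute values over many samples gives the kind of per-feature importance ranking discussed above.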

Fig. 7.

Comprehensive assessment of the contribution of soil moisture inversion features in different soil layers (0–10 cm, 10–30 cm, 30–60 cm) based on SHAP analysis. The bars on the left show the average magnitude of each feature's effect on the model output; the SHAP plots on the right show, sample by sample, the direction (positive or negative) in which each feature drives the prediction.

To further validate the necessity of each input data category in our compilation process, we conducted an ablation analysis by sequentially excluding each category from the model (Fig. 8). The results provide independent and quantitative evidence supporting the dataset’s construction: removing ERA5-Land SM or static soil properties caused the largest performance decline across all depths, underscoring their fundamental contribution. In contrast, the impact of surface remote-sensing indicators (LAI, LST) was strong only in the top layer (0–10 cm) and diminished sharply at greater depths, consistent with their physical role. This combined analysis not only validates the inclusion of core data sources but also provides a quantitative measure of their utility for modeling soil moisture across soil profiles.

Fig. 8.

Relative importance of feature categories for three soil layers (0–10 cm, 10–30 cm, 30–60 cm), where static = static soil properties and ts = time-series features.

Comparison with existing products

We also compared the spatial patterns of SWSM with those of other products. Figure 9 (left panels) shows the spatial distribution of the annual mean SWSM soil moisture at 0–10 cm, 10–30 cm, and 30–60 cm depths during the study period. Moisture values in all three layers were concentrated in the range of 0.0–0.5 m³/m³, with equatorial and tropical regions (e.g., Amazon Basin, Congo Basin, and Southeast Asia) showing the highest values (>0.3 m³/m³), and arid zones (e.g., Sahara Desert, Middle East, and Central Asia) showing the lowest values (<0.1 m³/m³). With increasing depth, the spatial pattern of soil moisture remains stable, but the magnitude changes: soil moisture in the 0–10 cm layer is slightly lower overall than in the 10–30 cm layer, and the 30–60 cm layer maintains higher values in humid zones. The right panels compare the latitude-averaged curves of the SWSM, GLEAM, SoMo.ml, and GLDAS datasets; the spread among datasets is largest in the 0–10 cm layer and gradually decreases in the 10–30 cm and 30–60 cm layers. All datasets show a common pattern: peak soil moisture in the equatorial region (0–10°N/S), a marked decrease in the subtropics (20–30°N/S), and a rebound at mid to high latitudes (>40°N/S). SWSM shows good agreement with GLEAM in the subtropical regions (15–50°N/S), and follows a variation pattern consistent with SoMo.ml and GLDAS. However, the datasets differ in magnitude. Specifically, SoMo.ml and GLDAS tend to be wetter overall, while SWSM appears drier near the equator, reflecting differences among the underlying models in these regions. These differences may be partly attributable to the scarcity of in-situ observations in low-latitude regions. In addition, dense vegetation canopies and complex climatic conditions in tropical rainforest areas may further challenge the accuracy of both remote sensing retrievals and ground measurements. By contrast, at high latitudes (>60°N/S), all datasets exhibit larger discrepancies and distinct variation patterns, mainly due to complex surface properties and freeze–thaw processes of soil moisture under low-temperature conditions.

Fig. 9.

Global average distribution of soil moisture in the three layers of SWSM (0–10 cm, 10–30 cm, 30–60 cm) during the study period, and comparative analysis with the annual latitudinal means of GLEAM, GLDAS, and SoMo.ml.

The temporal consistency and numerical deviation between the GLDAS NOAH and SWSM soil moisture products were also examined for each soil layer. Specifically, for each of the three soil depths, the correlation coefficient (R), the root mean square error (RMSE), and the corresponding statistical significance were computed from the time series of the two products at each pixel, using more than 200 days of data; the resulting spatial distributions are shown in Fig. 10. Except for the Sahara Desert, high latitudes, and some localized regions, most regions exhibited high and significant R values in the topsoil (Layer 1), along with small RMSE values, indicating that the two products agree in both trend and magnitude when capturing surface moisture dynamics. In the deeper layers (Layer 2 and Layer 3), the correlation remains high in humid zones. In arid zones, the RMSE stays small even where R is low or negative, because soil moisture itself is very low. In contrast, high latitudes and freeze–thaw transition zones tend to show both low R and large RMSE, owing to the irregular frequency of product prediction and the drastic changes in soil moisture, reflecting the large errors of the two products in these zones. This reduction in temporal consistency has also been reported in studies of the same type.

Fig. 10.

Spatial distribution of correlation, statistical significance, and root mean square error between GLDAS NOAH and SWSM soil moisture products at different soil horizons.

Uncertainty analysis

In addition to conventional error metrics (RMSE, ubRMSE, and Bias), we further quantified the prediction uncertainty. As shown in Fig. 11, the Prediction Interval Coverage Probabilities (PICP) for the three soil layers are 93.8%, 94.4%, and 94.5%. The average interval width (MeanWidth) across these layers is approximately 0.18. These results indicate that the model achieves high coverage while maintaining a reasonably narrow interval, thus providing a reliable representation of predictive uncertainty. As illustrated by the scatter plots and interval sequences, most observed values fall within the predicted intervals, confirming the reliability and practical value of the estimated uncertainty.
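The two uncertainty measures reported here can be computed as in the following sketch, which uses the usual definitions: PICP is the fraction of observations that fall inside their prediction interval, and the mean width is the average of upper minus lower bound. The function name picp_and_mean_width and the sample values are hypothetical; how the intervals themselves were derived is not described in the text.

```python
def picp_and_mean_width(obs, lower, upper):
    """Return interval coverage (fraction) and mean interval width."""
    n = len(obs)
    covered = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_width


# Hypothetical observations and prediction-interval bounds (m3/m3).
obs = [0.12, 0.25, 0.31, 0.40, 0.18]
lower = [0.05, 0.15, 0.20, 0.41, 0.10]
upper = [0.20, 0.35, 0.40, 0.55, 0.30]
picp, width = picp_and_mean_width(obs, lower, upper)
```

The two quantities trade off against each other: widening every interval raises coverage, so a high PICP is informative only when reported together with the interval width.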

Fig. 11.

Observed versus predicted soil moisture for three soil layers (top) and predicted-interval sequences (bottom). The top row shows observations plotted against median predictions with associated prediction intervals; the bottom row displays sample-wise predicted medians and their prediction intervals alongside observed values.

Usage Notes

The SWSM dataset integrates multi-source information using an interpretable XGBoost framework to provide a seamless, long-term, multi-depth global soil moisture dataset at high spatial resolution. It demonstrates high accuracy and consistency compared with other products, offering strong support for hydrological and climate-related applications. Feature attribution analysis further confirms its physical realism: surface layers are dominated by ERA5-Land SM and surface indicators, while deeper layers increasingly reflect soil texture and subsurface properties, reinforcing the structural validity of the dataset.

However, this dataset is constrained by in situ surface data, and SWSM data quality is uncertain in areas with low in situ site coverage. Deep soil data quality in some areas is affected by soil structure and the complexity of deep soil moisture transport, resulting in slightly lower performance compared to surface soils. Furthermore, SWSM is still subject to the physical limitations of freeze-thaw soil moisture fluctuations, which may introduce some errors. Ensuring data quality in permafrost environments in polar regions remains a challenge for future research. Despite these limitations, the dataset provides a valuable resource to complement existing soil moisture products and to advance research in environmental monitoring and climate change.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (42271392), the Opening Foundation of Xi’an Key Laboratory of Territorial Spatial Information (3001023545016), and the Open Fund of the Key Laboratory of Natural Resources Monitoring and Supervision in Southern Hilly Region, Ministry of Natural Resources (NRMSSHR2022Y02, NRMSSHR2023Y03). We also gratefully acknowledge the QA4SM platform for providing critical soil moisture validation services; their efforts have greatly facilitated and enhanced the quality of our soil moisture research.

Author contributions

Z.W. and L.W. conceived the overall experiment; Z.W. conducted the entire experiment and wrote the manuscript; T.W. provided computational resources; Q.L. optimized the experimental procedure; S.T. assisted in data cleaning; F.Z. reviewed the manuscript; and Y.Z. guided the methodology and made revisions to the figures and manuscript. All authors have read and approved the final manuscript.

Data availability

The SWSM dataset generated in this study is openly available at Zenodo. Two repositories are provided: 10.5281/zenodo.15262116 and 10.5281/zenodo.15250534.

Code availability

The custom script used to read the NetCDF files in this study is publicly hosted on GitHub at the repository address: https://github.com/weizeyang1997/SWSM.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Dorigo, W. et al. ESA CCI soil moisture for improved earth system understanding: state-of-the art and future directions. Remote Sens. Environ. 203, 185–215 (2017).
  • 2. Yuan, Q., Xu, H., Li, T., Shen, H. & Zhang, L. Estimating surface soil moisture from satellite observations using a generalized regression neural network trained on sparse ground-based measurements in the continental U.S. Journal of Hydrology 580, 124351 (2020).
  • 3. Dong, J., Akbar, R., Feldman, A. F., Gianotti, D. S. & Entekhabi, D. Land surfaces at the tipping‐point for water and energy balance coupling, 10.1029/2022WR032472.
  • 4. Zohaib, M., Kim, H. & Choi, M. Evaluating the patterns of spatiotemporal trends of root zone soil moisture in major climate regions in east Asia, 10.1002/2016JD026379.
  • 5. Shellito, P. J. et al. Assessing the impact of soil layer depth specification on the observability of modeled soil moisture and brightness temperature, 10.1175/JHM-D-19-0280.1 (2020).
  • 6. Song, P. et al. A 1 km daily surface soil moisture dataset of enhanced coverage under all-weather conditions over china in 2003–2019. Earth Syst. Sci. Data 14, 2613–2637 (2022).
  • 7. Zhang, N., Quiring, S. M. & Ford, T. W. Blending noah, SMOS, and in situ soil moisture using multiple weighting and sampling schemes, 10.1175/JHM-D-20-0119.1 (2021).
  • 8. Chen, Y., Feng, X. & Fu, B. An improved global remote-sensing-based surface soil moisture (RSSSM) dataset covering 2003–2018. Earth Syst. Sci. Data 13, 1–31 (2021).
  • 9. Fisher, R. A. & Koven, C. D. Perspectives on the future of land surface models and the challenges of representing complex terrestrial systems. JAMES 12, e2018MS001453 (2020).
  • 10. Tai, S.-L. et al. A 1 km soil moisture dataset over eastern CONUS generated by assimilating SMAP data into the noah-MP land surface model. Earth Syst. Sci. Data 17, 4587–4611 (2025).
  • 11. Feldman, A. F. et al. Remotely sensed soil moisture can capture dynamics relevant to plant water uptake, 10.1029/2022WR033814.
  • 12. Feldman, A. F. et al. Soil moisture profiles of ecosystem water use revealed with ECOSTRESS, 10.1029/2024GL108326.
  • 13. Liu, J., Rahmani, F., Lawson, K. & Shen, C. A multiscale deep learning model for soil moisture integrating satellite and In situ data, 10.1029/2021GL096847.
  • 14. Zhao, H., Montzka, C., Vereecken, H. & Franssen, H.-J. H. A comparative analysis of remote sensing soil moisture datasets fusion methods: novel LSTM approach versus widely used triple collocation technique. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 16659–16671 (2024).
  • 15. Hu, J., Deng, C., Zhang, Q. & Pang, A. Physics-informed neural networks enhanced by data augmentation: a novel framework for robust soil moisture estimation using multi-source data fusion. J. Hydrol. 663, 134320 (2025).
  • 16. Chen, L. et al. Using remote sensing and machine learning to generate 100-cm soil moisture at 30-m resolution for the black soil region of China: implication for agricultural water management. Agric. Water Manage. 309, 109353 (2025).
  • 17. Zhang, Y. et al. Generation of global 1 km daily soil moisture product from 2000 to 2020 using ensemble learning. Earth Syst. Sci. Data 15, 2055–2079 (2023).
  • 18. O, S. & Orth, R. Global soil moisture data derived through machine learning trained with in-situ measurements. Sci. Data 8, 170 (2021).
  • 19. O, S., Orth, R., Weber, U. & Park, S. K. High-resolution european daily soil moisture derived with machine learning (2003–2020), 10.48550/arXiv.2205.10753 (2022).
  • 20. Han, Q. et al. Global long term daily 1 km surface soil moisture dataset with physics informed machine learning. Sci. Data 10, 101 (2023).
  • 21. Padarian, J., McBratney, A. B. & Minasny, B. Game theory interpretation of digital soil mapping convolutional neural networks. Soil 6, 389–397 (2020).
  • 22. Muñoz-Sabater, J. et al. ERA5-land: a state-of-the-art global reanalysis dataset for land applications. Earth Syst. Sci. Data 13, 4349–4383 (2021).
  • 23. Hersbach, H. & Bell, B. ERA5 hourly time-series data on single levels from 1940 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS), 10.24381/cds.e2161bac (2025).
  • 24. Zhou, J., Liang, S., Cheng, J., Wang, Y. & Ma, J. The GLASS land surface temperature product. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 12, 493–507 (2019).
  • 25. Ma, H. & Liang, S. Development of the GLASS 250-m leaf area index product (version 6) from MODIS data using the bidirectional LSTM deep learning model. Remote Sens. Environ. 273, 112985 (2022).
  • 26. Friedl, M. & Sulla-Menashe, D. MODIS/terra+aqua land cover type yearly L3 global 0.05Deg CMG V061. NASA Land Processes Distributed Active Archive Center, 10.5067/MODIS/MCD12C1.061 (2022).
  • 27. Danielson, J. J. & Gesch, D. B. Global Multi-Resolution Terrain Elevation Data 2010 (GMTED2010). Open-File Report https://pubs.usgs.gov/publication/ofr20111073, 10.3133/ofr20111073 (2011).
  • 28. Hengl, T. et al. SoilGrids250m: global gridded soil information based on machine learning. PLOS One 12, e0169748 (2017).
  • 29. Lehmann, P., Berli, M., Koonce, J. E. & Or, D. Surface evaporation in arid regions: insights from lysimeter decadal record and global application of a surface evaporation capacitor (SEC) model, 10.1029/2019GL083932.
  • 30. Beck, H. E. et al. Evaluation of 18 satellite- and model-based soil moisture products using in situ measurements from 826 sensors. Hydrol. Earth Syst. Sci. 25, 17–40 (2021).
  • 31. Zhang, L. et al. Environmental factors driving evapotranspiration over a grassland in a transitional climate zone in China, 10.1002/met.2066.
  • 32. Yang, J., Li, Z., Zhai, P., Zhao, Y. & Gao, X. The influence of soil moisture and solar altitude on surface spectral albedo in arid area. Environ. Res. Lett. 15, 35010 (2020).
  • 33. Hu, Y. et al. A physical method for downscaling land surface temperatures using surface energy balance theory. Remote Sens. Environ. 286, 113421 (2023).
  • 34. Matsushima, D. Thermal inertia-based method for estimating soil moisture. in Soil Moisture, 10.5772/intechopen.80252 (IntechOpen, 2018).
  • 35. Zhang, J., Wang, W.-C. & Wu, L. Land‐atmosphere coupling and diurnal temperature range over the contiguous united states, 10.1029/2009GL037505.
  • 36. Lagos, L. O. et al. Surface energy balance model of transpiration from variable canopy cover and evaporation from residue-covered or bare soil systems: model evaluation. Irrig. Sci. 31, 135–150 (2013).
  • 37. Alves, I. & do Rosário Cameira, M. Evapotranspiration estimation performance of root zone water quality model: evaluation and improvement. Agric. Water Manage. 57, 61–73 (2002).
  • 38. Cisneros Vaca, C., van der Tol, C. & Ghimire, C. P. The influence of long-term changes in canopy structure on rainfall interception loss: a case study in speulderbos, the Netherlands. Hydrol. Earth Syst. Sci. 22, 3701–3719 (2018).
  • 39. Hoek van Dijke, A. J. et al. Examining the link between vegetation leaf area and land–atmosphere exchange of water, energy, and carbon fluxes using FLUXNET data. Biogeosciences 17, 4443–4457 (2020).
  • 40. Liu, Z. et al. Modeling the response of daily evapotranspiration and its components of a larch plantation to the variation of weather, soil moisture, and canopy leaf area index. J. Geophys. Res.: Atmos. 123, 7354–7374 (2018).
  • 41. Chen, M., Willgoose, G. R. & Saco, P. M. Investigating the impact of leaf area index temporal variability on soil moisture predictions using remote sensing vegetation data. J. Hydrol. 522, 274–284 (2015).
  • 42. Wang, Y., Yang, J., Chen, Y., Wang, A. & De Maeyer, P. The spatiotemporal response of soil moisture to precipitation and temperature changes in an arid region, china. Remote Sens. 10, 468 (2018).
  • 43. Fan, L. et al. Mapping soil moisture at a high resolution over mountainous regions by integrating In situ measurements, topography data, and MODIS land surface temperatures. Remote Sens. 11, 656 (2019).
  • 44. Lapides, D. A. et al. Inclusion of bedrock vadose zone in dynamic global vegetation models is key for simulating vegetation structure and function. Biogeosciences 21, 1801–1826 (2024).
  • 45. Dai, Y. et al. A global high-resolution data set of soil hydraulic and thermal properties for land surface modeling. JAMES 11, 2996–3023 (2019).
  • 46. Shangguan, W., Hengl, T., Mendes de Jesus, J., Yuan, H. & Dai, Y. Mapping the global depth to bedrock for land surface modeling. JAMES 9, 65–88 (2017).
  • 47. Chen, L. & Dirmeyer, P. A. Impacts of land-use/land-cover change on afternoon precipitation over north america, 10.1175/JCLI-D-16-0589.1 (2017).
  • 48. Floriancic, M. G. et al. Potential for significant precipitation cycling by forest‐floor litter and deadwood, 10.1002/eco.2493.
  • 49. Li, Y. et al. Spatiotemporal impacts of land use land cover changes on hydrology from the mechanism perspective using SWAT model with time-varying parameters. Hydrol. Res. 50, 244–261 (2018).
  • 50. Dorigo, W. et al. The international soil moisture network: serving earth system science for over a decade. Hydrol. Earth Syst. Sci. 25, 5749–5804 (2021).
  • 51. Dorigo, W. A. et al. Global automated quality control of In situ soil moisture data from the international soil moisture network. Vadose Zone J. 12, vzj2012.97 (2013).
  • 52. Li, Q. et al. A 1 km daily soil moisture dataset over china using in situ measurement and machine learning. Earth Syst. Sci. Data 14, 5267–5286 (2022).
  • 53. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794, 10.1145/2939672.2939785 (2016).
  • 54. Entekhabi, D., Reichle, R. H., Koster, R. D. & Crow, W. T. Performance Metrics for Soil Moisture Retrievals and Application Requirements. J. Hydrometeorol. 11, 832–840 (2010).
  • 55. Wei, Z. High resolution daily multilayer soil moisture dataset 2002 to 2013 derived from integrated multi-source data fusion. Zenodo 10.5281/zenodo.15250534 (2025).
  • 56. Wei, Z. High resolution daily multilayer soil moisture dataset 2014 to 2021 derived from integrated multi-source data fusion. Zenodo 10.5281/zenodo.15262116 (2025).
  • 57. Beaudoing, H., Rodell, M. & NASA/GSFC/HSL. GLDAS noah land surface model L4 3 hourly 0.25 x 0.25 degree, version 2.1. NASA Goddard Earth Sciences Data and Information Services Center, 10.5067/E7TYRXPJKWOQ (2020).
  • 58. Miralles, D. G. et al. GLEAM4: global land evaporation and soil moisture dataset at 0.1° resolution from 1980 to near present. Sci. Data 12, 416 (2025).
  • 59. Preimesberger, W., Stradiotti, P. & Dorigo, W. ESA CCI soil moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates. Earth Syst. Sci. Data 17, 4305–4329 (2025).


Articles from Scientific Data are provided here courtesy of Nature Publishing Group
