Scientific Data. 2026 Jan 21;13:230. doi: 10.1038/s41597-026-06546-3

A spatially rich, temporally coherent soil spectral dataset for soil organic carbon estimation

Jeehwan Bae 1, Inhye Seo 2, Junge Hyun 1, Yelim Park 2, Minseop Jeong 2, Jaewoo Kim 2, Seoyeon Kim 2, Sunyoung Joo 2, Youngseo Shin 2, Yonghui Jung 2, Seunghee Seo 2, Heesoo Kim 2, Chaehee Ahn 2, Juneyoung Pyung 3, Minjoon Cha 3, Byeonggil Choi 4, Wheemoon Kim 5, Hansu Kim 5, Gayoung Yoo 1,2
PMCID: PMC12901301  PMID: 41565678

Abstract

Accurate estimation of soil organic carbon (SOC) is crucial for climate mitigation and sustainable land management. Near-infrared (NIR) spectroscopy provides a rapid, cost-effective approach for SOC assessment, but its predictive performance depends on calibration datasets with adequate spatiotemporal coverage. Here, we present the Gyeonggi Soil Spectral Library (G-SSL), comprising NIR spectra (1,400–2,500 nm) from 1,500 topsoil samples (0–15 cm) collected systematically across Gyeonggi Province, South Korea, in 2024. Sampling spans 11 representative land cover types, including deciduous, coniferous, and mixed forests; paddy and upland fields; orchards; greenhouses; urban parks; artificial grasslands; riparian zones; and bare lands. To develop an accurate NIR-based SOC prediction model, SOC measurements from 712 samples were used to calibrate partial least squares regression (PLSR) models, which showed robust performance in a 70:30 train–test split (R² = 0.95, RMSE = 0.39%, RPD = 4.54). The G-SSL provides a spatially robust, high-resolution resource for digital SOC mapping and establishes a methodological benchmark for developing region-specific spectral libraries in other heterogeneous landscapes.

Subject terms: Carbon cycle, Ecosystem ecology, Biogeochemistry

Background & Summary

Soil organic carbon (SOC) plays a central role in regulating the terrestrial carbon cycle1, sustaining soil health2, and mitigating climate change3. However, accurate quantification of SOC remains challenging due to pronounced spatial variability driven by heterogeneous land cover types4, complex topography5, and anthropogenic disturbances6. The accuracy and utility of most global SOC databases are constrained by their reliance on coarse-resolution sampling or legacy data aggregated over long timescales7,8. Although SOC generally changes more slowly over time than across space, rapidly evolving land use practices today highlight the need for shorter-term SOC monitoring9. Therefore, establishing SOC baselines with both high spatial and temporal resolutions is indispensable for developing scientifically robust and effective SOC management strategies8,10. Innovative alternatives to conventional laboratory analysis are increasingly needed to address the challenges associated with developing high-resolution SOC databases11,12. In this context, soil spectroscopy, particularly near-infrared (NIR) spectroscopy, has emerged as a promising approach for SOC estimation13,14. Over the past three decades, NIR spectroscopy has proven effective for rapid, cost-effective, and non-destructive SOC estimation14–16, and has been widely applied in digital SOC mapping from regional to national scales17–19. The accuracy of NIR-based spectral libraries largely depends on the quality of the soil samples collected12,20. If the samples are representative of the diverse land cover types in the target region and are collected within a temporally consistent timeframe, the resulting spectral library can serve as a reliable training dataset for robust SOC estimation. To address the need for high-resolution SOC datasets, we developed the Gyeonggi Soil Spectral Library (G-SSL), a curated dataset of NIR reflectance spectra (1,400–2,500 nm).
It comprises spectra from 1,500 topsoil samples (0–15 cm) collected during a single growing season (April–November 2024) across Gyeonggi Province, South Korea. The region, encompassing 10,200 km² around Seoul, has experienced rapid urbanization21, resulting in a highly heterogeneous land cover, ranging from dense urban zones to preserved national parks. Samples were systematically collected across 11 representative land cover types: deciduous, coniferous, and mixed forests; paddy fields; upland fields; orchards; greenhouses; riparian zones; urban parks; artificial grasslands; and bare lands. These representative land cover types are included in the primary classification standards currently used in Gyeonggi Province’s land management framework, which is evaluated every five years. Within each land cover type, sampling was strategically designed to capture diverse vegetation, elevations (2–1,411 m), and spatial variability, ensuring high landscape representativeness. All spectral measurements were conducted using standardized laboratory protocols with a single NIR spectrometer, ensuring methodological consistency. Conventional SOC analyses were performed on a subset of 712 samples, covering > 40% of each land cover type, to train a partial least squares regression (PLSR) model. Using 10-fold cross-validation, the optimal number of latent components was 10, with calibration performance of R² = 0.96, RMSE = 0.37%, RPD = 4.90, and RPIQ = 5.94. Independent validation using a 70:30 split further confirmed the model’s generalizability (R² = 0.95, RMSE = 0.39%, RPD = 4.54, RPIQ = 5.64). The resulting dataset provides a spatially representative and methodologically coherent resource for modeling SOC in complex land cover mosaics and offers a practical benchmark for developing region-specific spectral libraries and high-resolution SOC monitoring frameworks in similarly heterogeneous landscapes.

Methods

In this section, the methods used during the field sampling, laboratory analysis, and spectral measurements are presented in detail.

Selection of soil sampling sites

The sampling sites were strategically selected to comprehensively represent the land cover types of Gyeonggi Province, South Korea. The sites were stratified according to the area proportions of 11 land cover types, excluding impervious surfaces (Fig. 1). Additionally, sites were carefully chosen to avoid biases, ensuring comprehensive coverage across elevation gradients (2–1,411 m) and geographic locations, resulting in a total of 1,500 sites. The number of sites by land cover type is as follows: deciduous forests (n = 564), coniferous forests (n = 205), mixed forests (n = 352), paddy fields (n = 80), upland fields (n = 82), orchards (n = 15), greenhouses (n = 30), urban parks (n = 28), artificial grasslands (n = 5), riparian zones (n = 36), and bare lands (n = 103). All sampling locations were established on natural soil surfaces (i.e., not on artificial substrates or potting media). For cultivated land covers (paddy and upland fields, orchards, and greenhouses), management practices (e.g., tillage or fertilizer application) were recorded during fieldwork; however, these attributes were not used to stratify site selection and are not distinguished as variables in the dataset released for this study. The geographic distribution of these sampling sites is illustrated in Fig. 2.
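
The per-class site counts above can be checked against the stated total with a few lines of Python. This is only an illustrative sketch: the dictionary keys are labels chosen here for readability and are not necessarily the land cover strings used in the released dataset.

```python
# Site counts per land cover type, as reported in the text above.
# The key names are illustrative labels, not the dataset's own strings.
site_counts = {
    "deciduous forest": 564,
    "coniferous forest": 205,
    "mixed forest": 352,
    "paddy field": 80,
    "upland field": 82,
    "orchard": 15,
    "greenhouse": 30,
    "urban park": 28,
    "artificial grassland": 5,
    "riparian zone": 36,
    "bare land": 103,
}

total_sites = sum(site_counts.values())

# Share of the survey allocated to each land cover type, in percent.
site_share = {name: 100.0 * n / total_sites for name, n in site_counts.items()}

# Combined share of the three forest classes.
forest_share = sum(
    site_share[name]
    for name in ("deciduous forest", "coniferous forest", "mixed forest")
)

for name, n in sorted(site_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} n = {n:4d}  ({site_share[name]:5.1f}%)")
print(f"total = {total_sites}, forest share = {forest_share:.1f}%")
```

The 11 counts sum to the 1,500 sites reported, and the three forest classes together account for roughly three quarters of them, which is worth keeping in mind when a model trained on the library is evaluated on the smaller classes.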

Fig. 1.

Representative aerial imagery of the 11 major land cover types included in the soil survey across Gyeonggi Province, South Korea. The dominant land cover types—deciduous forest, coniferous forest, mixed forest, paddy field, upland field, orchard, greenhouse, urban park, artificial grassland, riparian zone, and bare land—represent the region’s land use heterogeneity.

Fig. 2.

Spatial distribution of soil sampling sites in Gyeonggi Province, South Korea. (a) Location of Gyeonggi Province within South Korea. (b) Distribution of 1,500 topsoil sampling sites established during the soil survey in 2024.

Collection of soil samples and preparation for analysis

Soil sampling was conducted across the 1,500 selected sites during a single growing season, from April to November 2024, to ensure temporal consistency. At each sampling site, five soil samples were systematically collected within a 10 × 10 m quadrat (Fig. 3). Three of these samples were composited for SOC analysis, while the remaining two samples were immediately frozen for subsequent biological analyses. Prior to soil sampling, litter and vegetation residues were carefully removed from the soil surface. Topsoil samples (0–15 cm) were then collected using a core sampler (5.08 cm diameter; 2″ × 12″ Soil Core Sampler Complete, AMS, American Falls, ID, USA).

Fig. 3.

Soil sampling scheme. (a) Standard sampling protocol using a 10 × 10 m quadrat, with five sampling points (center and four corners). (b) Rectangular site adaptation in constrained field settings, maintaining five-point sampling. Field photographs depict actual sampling conditions across different land cover types.

Composited soil samples were air-dried at room temperature (20–25 °C) for three weeks and subsequently sieved through a 2 mm standard testing sieve (Chung Gye Sang Gong, Seoul, South Korea) to remove coarse fragments.

Acquisition of Near-infrared (NIR) spectral data

NIR reflectance spectra for all 1,500 soil samples were measured using a benchtop NIR spectrometer (Spectra Star XT; Unity Scientific, Westborough, MA, USA) equipped with UCal calibration software. Prior to each measurement session, the halogen light source within the contact probe was preheated for at least 2 hours to ensure thermal stability. Approximately 30 g of each soil sample was uniformly packed into a ring-shaped sample holder (50 mm diameter) and scanned across the 1,400–2,500 nm spectral range at 1 nm intervals. Twelve replicate scans were acquired by rotating the sample holder at 30° intervals to complete a full 360° rotation, and the 12 scans were averaged to even out illumination and viewing geometry effects22. White and dark reference calibrations were performed before and after each measurement session using certified panels. To reduce baseline variability and enhance spectral feature extraction, Savitzky–Golay filtering (second-order polynomial) was applied to the raw NIR spectra23. The NIR spectral reflectance of the 1,500 topsoil samples is illustrated in Fig. 4.
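
The averaging and smoothing steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' processing code: the text specifies a second-order polynomial but not the filter window, so the `half_width` parameter is an assumption, and the helper names and the synthetic spectrum are hypothetical. The weights use the standard closed form for quadratic Savitzky–Golay smoothing on a centred window.

```python
def average_scans(scans):
    """Average replicate scans of one sample, wavelength by wavelength."""
    n = len(scans)
    return [sum(values) / n for values in zip(*scans)]


def savgol_quadratic_weights(half_width):
    """Savitzky-Golay smoothing weights for a second-order polynomial.

    Closed form for a centred window of 2 * half_width + 1 points.
    """
    m = half_width
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom
            for i in range(-m, m + 1)]


def savgol_smooth(spectrum, half_width=5):
    """Smooth the interior of a spectrum; edge points are left unchanged.

    half_width is an assumed value: the window size is not given in the text.
    """
    weights = savgol_quadratic_weights(half_width)
    m = half_width
    smoothed = list(spectrum)
    for k in range(m, len(spectrum) - m):
        window = spectrum[k - m:k + m + 1]
        smoothed[k] = sum(w * v for w, v in zip(weights, window))
    return smoothed


# Twelve holder positions at 30 degree steps cover one full rotation.
rotation_angles = [30 * k for k in range(12)]

# Synthetic demo: three replicate scans of a smooth quadratic "spectrum"
# that differ only by a constant offset (hypothetical values).
base = [0.30 + 1e-4 * (k - 10) ** 2 for k in range(21)]
scans = [[v + offset for v in base] for offset in (-0.01, 0.0, 0.01)]
mean_spectrum = average_scans(scans)
smoothed = savgol_smooth(mean_spectrum, half_width=2)
```

Because a quadratic Savitzky–Golay filter reproduces polynomials of low degree exactly, the smoothed demo spectrum matches the noise-free curve; on real spectra the choice of window trades noise suppression against the loss of narrow absorption features.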

Fig. 4.

The Near-Infrared (NIR) spectral reflectance. Pre-processed NIR spectral reflectance curves from 1,500 topsoil samples following technical validation and smoothing procedures.

Measurement of soil organic carbon (SOC)

From the 1,500 collected samples, a stratified subset (n = 712) was chosen so that, within each land cover type, at least 40% of the spectrally analyzed samples were also analyzed for SOC. All samples were air-dried, sieved to <2 mm, and finely milled (MM400; Retsch GmbH, Haan, Germany). For SOC analysis, each ground sample was divided into two aliquots. Total carbon (TC) was determined on the untreated aliquot by high-temperature dry combustion at 900 °C using an elemental analyzer (Flash EA 1112; Thermo Fisher Scientific). The second aliquot was acid-pretreated to remove inorganic carbon by adding 10 mL of 5.7 M HCl to 2.5 g of soil, followed by oven-drying at 105 °C for 24 h; carbon measured thereafter by dry combustion was interpreted as SOC. Inorganic carbon was then inferred as the portion of TC not accounted for by SOC (data not shown).
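
As a rough consistency check on the subset design, the sketch below computes the smallest number of SOC-analyzed samples per land cover type that satisfies the 40% rule and compares the total with the 712 samples reported. It also shows the by-difference calculation of inorganic carbon. The function name and the example carbon values are hypothetical, and the actual per-class allocation used by the authors is not given in the text.

```python
# Site counts per land cover type, as reported in the Methods.
site_counts = {
    "deciduous forest": 564, "coniferous forest": 205, "mixed forest": 352,
    "paddy field": 80, "upland field": 82, "orchard": 15, "greenhouse": 30,
    "urban park": 28, "artificial grassland": 5, "riparian zone": 36,
    "bare land": 103,
}


def min_subset_size(n_sites):
    """Smallest integer count that is at least 40% of n_sites.

    Integer arithmetic (ceil of 2n/5) avoids floating-point rounding issues.
    """
    return (2 * n_sites + 4) // 5


min_soc_samples = {name: min_subset_size(n) for name, n in site_counts.items()}
min_total = sum(min_soc_samples.values())

REPORTED_SOC_SAMPLES = 712  # size of the SOC-analyzed subset in the text


def inorganic_carbon(total_carbon, soc):
    """Inorganic carbon (%) inferred as total carbon minus SOC."""
    return total_carbon - soc


# Hypothetical example values (percent by mass), for illustration only.
example_ic = inorganic_carbon(total_carbon=2.10, soc=1.85)

print(f"minimum subset size under the 40% rule: {min_total}")
print(f"reported subset size: {REPORTED_SOC_SAMPLES}")
```

Under these assumptions the 40% rule requires at least 603 samples, so the reported subset of 712 leaves room for heavier sampling in some classes.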

Development of a spectral-based SOC prediction model

A partial least squares regression (PLSR) model was developed to predict SOC using pre-processed spectral data and corresponding measured SOC (n = 712). PLSR is widely used in soil spectroscopy due to its robustness against multicollinearity and effectiveness in handling high-dimensional spectral datasets17. All NIR spectral data were mean-centered before the PLSR24. This follows common practice in PLSR to stabilize the covariance structure and prevent the first latent component from capturing only the intercept. All wavelengths were kept on their native reflectance scale. The model was tuned on the calibration set using 10-fold cross-validation, and the optimal number of latent components was 10, selected by minimizing the cross-validated RMSE. Calibration performance was R² = 0.96 and RMSE = 0.37% SOC, with RPD = 4.90 and RPIQ = 5.94 (Fig. 5). To evaluate generalization, we performed an independent validation with a 70:30 (train:test) stratified split across the SOC range; the model achieved R² = 0.95, RMSE = 0.39%, RPD = 4.54, and RPIQ = 5.64 on the test set (Fig. 5).
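
The modeling procedure described above (mean-centering without scaling, PLSR with a single response, and selection of the number of latent components by cross-validated RMSE) can be sketched in plain Python as follows. This is a simplified illustration rather than the authors' implementation, which is available in the repository cited under Code availability. The function names, the interleaved fold assignment, and the toy data are all assumptions made for the example.

```python
import math


def fit_pls1(X, y, n_components):
    """Fit a single-response PLS model by successive deflation."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Mean-centre only; predictors stay on their native scale.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    components = []
    for _ in range(n_components):
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # no covariance left to explain
            break
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        for i in range(n):  # deflate X and y by the fitted component
            for j in range(p):
                Xc[i][j] -= t[i] * load[j]
            yc[i] -= q * t[i]
        components.append((w, load, q))
    return x_mean, y_mean, components


def predict_pls1(model, X):
    """Predict by projecting each centred row onto successive components."""
    x_mean, y_mean, components = model
    preds = []
    for row in X:
        x = [v - m for v, m in zip(row, x_mean)]
        yhat = y_mean
        for w, load, q in components:
            t = sum(a * b for a, b in zip(x, w))
            yhat += q * t
            x = [a - t * b for a, b in zip(x, load)]
        preds.append(yhat)
    return preds


def cv_rmse(X, y, n_components, k=10):
    """k-fold cross-validated RMSE with interleaved fold assignment."""
    n, sq_err = len(X), 0.0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        if not test_idx:
            continue
        model = fit_pls1([X[i] for i in train_idx],
                         [y[i] for i in train_idx], n_components)
        preds = predict_pls1(model, [X[i] for i in test_idx])
        sq_err += sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx))
    return math.sqrt(sq_err / n)


# Toy data (hypothetical): the response is an exact linear function of X.
X_demo = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [3.0, 5.0, 2.0],
          [4.0, 3.0, 5.0], [5.0, 8.0, 3.0], [6.0, 4.0, 7.0]]
y_demo = [2 * a - b + 0.5 * c + 1 for a, b, c in X_demo]

cv_by_components = {a: cv_rmse(X_demo, y_demo, a, k=3) for a in (1, 2, 3)}
best_components = min(cv_by_components, key=cv_by_components.get)
```

On real spectra, the same loop would be run over a wider range of candidate component counts with k = 10, and the count with the lowest cross-validated RMSE retained.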

Fig. 5.

Performance of the NIR-based PLSR model for predicting soil organic carbon (SOC, %). (a) Calibration shown as 10-fold cross-validated out-of-fold predictions. (b) Independent validation using a stratified 70:30 (train:test) split across the SOC range. The red dashed line is the 1:1 reference; each point is an individual sample.
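
For readers who wish to reproduce the performance statistics shown in Fig. 5, the sketch below computes R², RMSE, RPD, and RPIQ from paired observed and predicted values. RPD and RPIQ are taken here in their usual sense (standard deviation and interquartile range of the observed values, each divided by the RMSE). The text does not state which variance or quantile convention was used, so the sample standard deviation and the inclusive quantile method are assumptions, as are the example values.

```python
import math
import statistics


def regression_metrics(observed, predicted):
    """Return R2, RMSE, RPD and RPIQ for paired observations and predictions."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    # Assumed conventions: sample SD (n - 1) and inclusive quartiles.
    sd = statistics.stdev(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": rmse,
        "RPD": sd / rmse,
        "RPIQ": (q3 - q1) / rmse,
    }


# Hypothetical SOC values (%), for illustration only.
observed_soc = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_soc = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = regression_metrics(observed_soc, predicted_soc)

for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

Because RPD and RPIQ scale the error by the spread of the observed values, they depend on the SOC range of the evaluation set and should be compared across studies with that in mind.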

Data Records

The dataset25 is provided as a single Microsoft Excel file containing geographic information, measured SOC (mSOC, %), and pre-processed NIR reflectance spectra spanning wavelengths of 1,400–2,500 nm. The dataset includes data from 1,500 topsoil samples (0–15 cm) collected in Gyeonggi Province, South Korea, during 2024.

SOC sheet

This sheet contains measured SOC (mSOC, %) determined using an elemental analyzer (n = 712). Additionally, it includes sample-level metadata such as sampling number (1–1500), region, latitude and longitude, land cover type, altitude (m), and the year and month of soil sampling.

NIR sheet

This sheet provides raw NIR and Savitzky–Golay smoothed NIR reflectance spectra. Each row represents one sample, with spectral reflectance values at 1 nm intervals from 1,400 to 2,500 nm (totaling 1,101 spectral bands), following the sample number column.
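
The band count follows directly from the wavelength range and sampling interval, as the short sketch below shows. The row layout assumed here (a sample number followed by one block of 1,101 reflectance values) is a simplification for illustration: the sheet contains both raw and smoothed spectra, and how they are arranged should be checked in the file itself. The helper names are hypothetical.

```python
# Wavelength axis: 1 nm steps from 1,400 to 2,500 nm, both ends included.
wavelengths_nm = list(range(1400, 2501))


def band_index(wavelength_nm):
    """Offset of a wavelength within one sample's spectrum."""
    if not 1400 <= wavelength_nm <= 2500:
        raise ValueError("wavelength outside the 1,400-2,500 nm range")
    return wavelength_nm - 1400


def split_row(row):
    """Split a row into its sample number and reflectance values.

    Assumes one spectrum per row, directly after the sample number column.
    """
    sample_number, spectrum = row[0], list(row[1:])
    if len(spectrum) != len(wavelengths_nm):
        raise ValueError("unexpected number of spectral bands")
    return sample_number, spectrum


# Synthetic row (hypothetical values) to exercise the helpers.
demo_row = [17] + [0.25] * len(wavelengths_nm)
demo_number, demo_spectrum = split_row(demo_row)
reflectance_at_1900 = demo_spectrum[band_index(1900)]
```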

Technical Validation

Multiple quality control procedures were applied throughout SOC analysis and NIR spectral data acquisition. Elemental analysis for SOC quantification was periodically validated using certified reference materials (SOC = 1.5%, Elemental Microanalysis, Devon, UK), maintaining analytical precision within ±0.1% SOC. The NIR spectrometer (Spectra Star XT) was periodically validated for photometric precision, average absolute photometric difference, and wavelength precision, ensuring high analytical reliability and accuracy.

Usage Notes

The Gyeonggi Soil Spectral Library (G-SSL) is intended for use in the development and benchmarking of PLSR models for SOC prediction. Users may also utilize the dataset for digital SOC mapping, remote sensing calibration, and comparative regional soil analyses. While the full dataset comprises 1,500 topsoil samples, SOC content is available for a subset of 712 samples, which should be considered when constructing or validating predictive models. All spectral data have been pre-processed consistently and are provided in a ready-to-use format25.
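
Since only part of the library carries measured SOC, a typical first step is to separate labelled from unlabelled samples and then split the labelled ones for validation. The sketch below shows one possible way to do this, stratifying a 70:30 split over the SOC range with equal-count bins. The record structure, the `mSOC` key, the number of bins, and the synthetic values are assumptions for illustration; the stratification procedure used in the paper may differ in detail.

```python
import random


def labelled_subset(records):
    """Keep only the samples that have a measured SOC value."""
    return [r for r in records if r.get("mSOC") is not None]


def stratified_split(records, test_fraction=0.30, n_bins=10, seed=0):
    """Split records into train and test sets, stratified over SOC.

    Records are sorted by SOC and cut into n_bins equal-count strata;
    test_fraction of each stratum is drawn at random for the test set.
    """
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r["mSOC"])
    bin_size = len(ordered) / n_bins
    train, test = [], []
    for b in range(n_bins):
        stratum = ordered[round(b * bin_size):round((b + 1) * bin_size)]
        rng.shuffle(stratum)
        n_test = round(test_fraction * len(stratum))
        test.extend(stratum[:n_test])
        train.extend(stratum[n_test:])
    return train, test


# Synthetic records (hypothetical): every third sample lacks measured SOC.
demo_records = [
    {"sample": k, "mSOC": None if k % 3 == 0 else 0.5 + 0.05 * k}
    for k in range(1, 151)
]
labelled = labelled_subset(demo_records)
train_set, test_set = stratified_split(labelled)

print(len(labelled), len(train_set), len(test_set))
```

The unlabelled spectra remain useful as prediction targets or for unsupervised analyses, but they should not enter model calibration or validation statistics.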

Acknowledgements

This work was conducted using the database and analytical outcomes made available by the Gyeonggi Climate Platform. This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry & Energy (MOTIE) of the Republic of Korea (20224000000260).

Author contributions

Jeehwan Bae: project conception, soil core collection, data acquisition, validation, data analysis, writing. Inhye Seo: data acquisition, validation, data analysis. Junge Hyun: data acquisition, validation, data analysis. Yelim Park: soil core collection, data acquisition, validation, data analysis. Minseop Jeong: soil core collection, data acquisition. Jaewoo Kim: soil core collection, data acquisition. Seoyeon Kim: soil core collection, data acquisition. Sunyoung Joo: soil core collection. Youngseo Shin: soil core collection, data acquisition. Yonghui Jung: soil core collection, data acquisition. Seunghee Seo: soil core collection, data acquisition. Heesoo Kim: soil core collection, data acquisition. Chaehee Ahn: soil core collection, data acquisition. Juneyoung Pyung: soil core collection. Minjoon Cha: soil core collection. Byeonggil Choi: soil core collection, data acquisition. Wheemoon Kim: project conception and coordination. Hansu Kim: project conception and coordination. Gayoung Yoo: project coordination and funding, project conception, data acquisition, writing. The manuscript was written by Jeehwan Bae and Gayoung Yoo with contributions and approval from all authors.

Data availability

All datasets generated and analyzed in this study are publicly available on Figshare: 10.6084/m9.figshare.29380574.

Code availability

To support reproducibility, all custom analysis code used in this study is publicly available on GitHub: https://github.com/baecology/GSSL-spectral-processing

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Friedlingstein, P. et al. Global carbon budget. Earth System Science Data 14, 4811–4900, 10.5194/essd-14-4811-2022 (2022).
  • 2. Hyun, J., Kim, Y. J., Kim, A., Plante, A. F. & Yoo, G. Ecosystem services-based soil quality index tailored to the metropolitan environment for soil assessment and management. Science of the Total Environment 820, 153301, 10.1016/j.scitotenv.2022.153301 (2022).
  • 3. Lal, R. Soil carbon sequestration to mitigate climate change. Geoderma 123, 1–22, 10.1016/j.geoderma.2004.01.032 (2004).
  • 4. Bae, J. & Ryu, Y. Land use and land cover changes explain spatial and temporal variations of the soil organic carbon stocks in a constructed urban park. Landscape and Urban Planning 136, 57–67, 10.1016/j.landurbplan.2014.11.015 (2015).
  • 5. Patton, N. R., Lohse, K. A., Seyfried, M. S., Godsey, S. E. & Parsons, S. B. Topographic controls of soil organic carbon on soil-mantled landscapes. Scientific Reports 9, 6390, 10.1038/s41598-019-42556-5 (2019).
  • 6. Bae, J. & Ryu, Y. High soil organic carbon stocks under impervious surfaces contributed by urban deep cultural layers. Landscape and Urban Planning 204, 103953, 10.1016/j.landurbplan.2020.103953 (2020).
  • 7. Goidts, E., Van Wesemael, B. & Crucifix, M. Magnitude and sources of uncertainties in soil organic carbon (SOC) stock assessments at various scales. European Journal of Soil Science 60, 723–739, 10.1111/j.1365-2389.2009.01157.x (2009).
  • 8. Bokati, L. et al. Temporal adjustment approach for high-resolution continental scale modeling of soil organic carbon. Scientific Reports 15, 6483, 10.1038/s41598-025-89503-1 (2025).
  • 9. Bae, J. & Ryu, Y. The magnitude and causes of edge effects on soil organic carbon stocks within and across urban to rural forest patches. Landscape and Urban Planning 215, 104223, 10.1016/j.landurbplan.2021.104223 (2021).
  • 10. Viscarra Rossel, R. A., Webster, R., Bui, E. N. & Baldock, J. A. Baseline map of organic carbon in Australian soil to support national carbon accounting and monitoring under climate change. Global Change Biology 20, 2953–2970, 10.1111/gcb.12569 (2014).
  • 11. Szatmári, G. et al. Gridded, temporally referenced spatial information on soil organic carbon for Hungary. Scientific Data 11, 1–13, 10.1038/s41597-024-04158-3 (2024).
  • 12. Lucà, F., Conforti, M., Castrignanò, A., Matteucci, G. & Buttafuoco, G. Effect of calibration set size on prediction at local scale of soil carbon by Vis-NIR spectroscopy. Geoderma 288, 175–183, 10.1016/j.geoderma.2016.11.015 (2017).
  • 13. Mészáros, J. et al. Vis-NIR soil spectral library of the Hungarian Soil Degradation Observation System. Scientific Data 12, 363, 10.1038/s41597-025-04667-9 (2025).
  • 14. Bellon-Maurel, V. & McBratney, A. Near-infrared (NIR) and mid-infrared (MIR) spectroscopic techniques for assessing the amount of carbon stock in soils – Critical review and research perspectives. Soil Biology and Biochemistry 43, 1398–1410, 10.1016/j.soilbio.2011.02.019 (2011).
  • 15. Peng, Y. et al. Spectroscopic solutions for generating new global soil information. The Innovation 6, 10.1016/j.xinn.2025.100839 (2025).
  • 16. Chang, C.-W. et al. Near‐infrared reflectance spectroscopy–principal components regression analyses of soil properties. Soil Science Society of America Journal 65(2), 480–490, 10.2136/sssaj2001.652480x (2001).
  • 17. Wijewardane, N. K., Ge, Y., Wills, S. & Loecke, T. Prediction of soil carbon in the conterminous United States: visible and near infrared reflectance spectroscopy analysis of the rapid carbon assessment project. Soil Science Society of America Journal 80, 973–982, 10.2136/sssaj2016.02.0052 (2016).
  • 18. Ma, Y. et al. A soil spectral library of New Zealand. Geoderma Regional 35, e00726, 10.1016/j.geodrs.2023.e00726 (2023).
  • 19. Shi, Z. et al. Development of a national VNIR soil-spectral library for soil classification and prediction of organic matter concentrations. Science China Earth Sciences 57, 1671–1680, 10.1007/s11430-013-4808-x (2014).
  • 20. Orgiazzi, A., Ballabio, C., Panagos, P., Jones, A. & Fernández‐Ugalde, O. LUCAS Soil, the largest expandable soil dataset for Europe: a review. European Journal of Soil Science 69, 140–153, 10.1111/ejss.12499 (2018).
  • 21. Kim, J.-H., Kwon, O.-S. & Ra, J.-H. Urban type classification and characteristic analysis through time-series environmental changes for land use management for 31 satellite cities around Seoul, South Korea. Land 10, 799, 10.3390/land10080799 (2021).
  • 22. Knox, N. M. et al. Modelling soil carbon fractions with visible near-infrared (VNIR) and mid-infrared (MIR) spectroscopy. Geoderma 239, 229–239, 10.1016/j.geoderma.2014.10.019 (2015).
  • 23. Savitzky, A. & Golay, M. J. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry 36, 1627–1639, 10.1021/ac60214a047 (1964).
  • 24. Calderón, F. J. et al. Quantification of soil permanganate oxidizable C (POXC) using infrared spectroscopy. Soil Science Society of America Journal 81(2), 277–288, 10.2136/sssaj2016.07.0216 (2017).
  • 25. Bae, J. Gyeonggi Soil Spectral Library (G-SSL). figshare 10.6084/m9.figshare.29380574 (2025).

Articles from Scientific Data are provided here courtesy of Nature Publishing Group
