Explainable machine learning integrating biochemical and metabolomic biomarkers with conventional clinical factors improves chronic kidney disease prediction and risk stratification

Jing Ma; Ruiyan Liu; Xin Feng; Xing Li; Jielin Huang; Lu Zhang; Jian Gao; Guifang Hu; Xiru Zhang

doi:10.1186/s12882-026-04781-9

2026 Jan 29;27:137.

PMCID: PMC12924283 PMID: 41612264

Abstract

Background

Chronic kidney disease (CKD) is a leading cause of morbidity and mortality worldwide, yet existing risk models have limited ability to identify individuals at high long-term risk. Whether integrating circulating biochemical and metabolomic biomarkers can improve CKD prediction and risk stratification remains unclear.

Methods

We included 233,589 UK Biobank participants without CKD at baseline. Biomarkers were screened using multiple feature selection strategies. Predictive performance and effect sizes were evaluated using Cox proportional hazards models. CatBoost and SHAP were applied to identify key predictors, derive interpretable binary thresholds, and construct a simplified biomarker risk score (BRS). Relative and absolute CKD risks were assessed across tertiles of the BRS. Model discrimination and calibration were evaluated in an England development cohort and a geographically independent validation cohort from Scotland and Wales.

Results

A combined biochemical–metabolomic signature (BioMet) showed good discrimination for incident CKD and CKD-related mortality and consistently outperformed conventional risk models in both cohorts. Key risk-elevating biomarkers included cystatin C, HbA1c, CRP, and urea, whereas higher eGFR, M-VLDL-CE, histidine, and IGF-1 were inversely associated with CKD risk. A SHAP-derived Top10 BRS (Top10BRS) effectively stratified individuals into distinct risk groups. Compared with the lowest tertile, participants in the highest tertile had a substantially higher risk of incident CKD (HR: 3.73) and CKD-related mortality (HR: 10.40). Discrimination improved after adding Top10BRS to conventional models, while calibration and prediction error remained stable. Similar patterns were observed in the validation cohort.

Conclusion

Integrating biochemical and metabolomic biomarkers with conventional clinical predictors improves long-term prediction and risk stratification for CKD. An interpretable SHAP-derived BRS enables robust identification of individuals at elevated risk and may support earlier risk assessment and personalized prevention strategies for CKD.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12882-026-04781-9.

Keywords: Chronic kidney disease, Machine learning, Biomarker risk score, Risk stratification, Binary threshold, SHAP, Metabolomics

Introduction

Chronic kidney disease (CKD) is a major global health challenge, affecting over 800 million individuals worldwide [1–3]. Because the onset of CKD is insidious and early stages are often asymptomatic, fewer than 5% of affected individuals are aware of their condition until advanced disease or complications occur [4]. Progression to end-stage renal disease frequently necessitates dialysis or transplantation, leading to substantial healthcare costs and reduced quality of life and survival [5].

Beyond renal consequences, CKD is significantly associated with elevated risks of stroke [6, 7], cardiovascular disease, and premature mortality [8–10]. Early detection, precise risk prediction, and effective risk stratification are therefore essential. Identification of modifiable biomarkers may enable targeted interventions, including pharmacological treatment, dietary modification, and lifestyle changes [11]. However, most existing prediction models rely primarily on traditional risk factors, such as age, sex, diabetes, and hypertension, as well as routine biomarkers including creatinine, estimated glomerular filtration rate (eGFR), blood urea nitrogen, and microalbumin [12–14]. The contribution of modifiable biomarkers to CKD development and mortality remains insufficiently understood.

Advances in high-throughput metabolomics have enabled standardized quantification of a wide range of plasma metabolites [15, 16]. Emerging evidence suggests that metabolic dysregulation, particularly involving inflammatory, lipid, and amino acid pathways, plays a critical role in CKD pathogenesis and progression [17, 18]. Several CKD and diabetic kidney disease (DKD) prediction models have incorporated metabolic biomarkers. However, most studies are limited by small sample sizes, restricted diabetic populations, or relatively short follow-up periods [19, 20]. The relative importance of individual metabolites remains unclear, and clinically actionable thresholds have not been well established. It is also uncertain whether a small panel of key biomarkers can effectively stratify CKD risk.

In this study, we leveraged data from the UK Biobank to develop prediction models for incident CKD and CKD-related mortality. The models integrated traditional risk factors, routine biochemical markers, and high-dimensional metabolomic data. Using CatBoost and SHAP analysis, we identified key predictive biomarkers and derived SHAP-based binary thresholds. Based on these thresholds, we constructed a clinically interpretable biomarker risk score (BRS). This score enables stratification of both absolute and relative CKD risk and supports intuitive long-term CKD risk assessment.

Methods

Study design and population

The UK Biobank is a large prospective cohort that enrolled over 500,000 adults aged 40–73 years from 22 assessment centers across the United Kingdom between 2006 and 2010. At baseline, participants completed standardized questionnaires and interviews developed by the UK Biobank [21], underwent physical measurements, and provided biological samples. Follow-up data were obtained through linkage to national health records, as previously described [15, 16].

Of 502,131 participants, we excluded individuals with > 20% missing blood biochemistry data (n = 51,908), missing serum creatinine (n = 21,337), or > 20% missing covariates (n = 656). We further excluded participants without metabolomic measurements (n = 188,804) and those with baseline CKD (n = 5,837). Baseline CKD was defined using self-reported physician-diagnosed CKD and prior hospital inpatient ICD-9/ICD-10 codes. A single baseline estimated glomerular filtration rate (eGFR) < 60 mL/min/1.73 m² was used only as an additional safeguard to exclude probable pre-existing CKD and was not used to define incident CKD. The final analytical sample included 233,589 participants (Figure S1).
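The exclusion flow above can be checked with simple arithmetic. The following is a minimal sketch that uses only the counts stated in the paragraph; the variable names are illustrative and not taken from the study code.

```python
# Sequential exclusions, with counts copied from the paragraph above.
enrolled = 502_131
exclusions = [
    ("> 20% missing blood biochemistry", 51_908),
    ("missing serum creatinine", 21_337),
    ("> 20% missing covariates", 656),
    ("no metabolomic measurements", 188_804),
    ("baseline CKD", 5_837),
]

remaining = enrolled
for reason, n in exclusions:
    remaining -= n
    # Print the running sample size after each exclusion step.
    print(f"{reason:<35} -{n:>7,} -> {remaining:>7,}")

final_n = remaining  # should match the reported analytical sample
```

The running total ends at the reported analytical sample of 233,589 participants.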

For prediction modeling, participants from England were used as the development cohort. Participants from Scotland and Wales were reserved as a geographically distinct validation cohort to assess model generalizability.

Conventional predictors and biomarkers

Three conventional predictor sets were evaluated. The first included age and sex only (AgeSex). The second predictor set (Set) was based on the CKD Prognosis Consortium 5-year risk equation, which was developed using data from 34 international cohorts [14]. The third was a comprehensive predictor set (PANEL) that incorporated demographic characteristics, lifestyle factors, anthropometric measures, medical history, and medication use. Definitions of the three conventional predictor sets are shown in Fig. 1B.

Fig. 1 — Study overview and biomarker sets. (A). Study workflow; (B). Conventional predictor sets. All Bioche included 26 biochemical markers. Non-overlap Met comprised 170 NMR metabolites after excluding those overlapping with biochemical markers. Feature selection for Non-overlap Met was conducted using multiple strategies. These included ranking by the area under the receiver operating characteristic curve (AUC), joint mutual information maximization (JMIM), permutation feature importance, and correlation filtering with Spearman correlation coefficients < 0.95. BioMet denotes a selected biochemical–metabolomic signature. It comprised 26 routine biochemical biomarkers and 46 NMR-based metabolites. These metabolites were retained after correlation filtering using Spearman correlation coefficients < 0.95

In addition to conventional predictors, we evaluated a biomarker-based panel comprising 26 routine biochemical markers and 170 nuclear magnetic resonance (NMR)–based metabolites. These biomarkers covered inflammatory, metabolic, lipid, and lipoprotein pathways [15, 16]. Full lists of clinical biochemistry biomarkers and NMR metabolites are provided in Tables S1 and S2.

CKD outcome definition

The outcomes were incident CKD and CKD-related mortality. Incident CKD was defined as a new diagnosis of hypertensive nephropathy, CKD at any stage, end-stage kidney disease (ESKD), or chronic renal failure. Repeated measurements of eGFR or albuminuria are not systematically available in the UK Biobank. Therefore, incident CKD was ascertained using International Classification of Diseases, Ninth and Tenth Revision (ICD-9 and ICD-10) codes from hospital admission records [22, 23]. CKD-related mortality was defined as death in which CKD was recorded as the underlying cause of death in national death registries, based on ICD-9 and ICD-10 codes. Detailed code definitions are provided in Table S3.

Hospitalization records were censored on October 31, 2022, for England, August 31, 2022, for Scotland, and May 31, 2022, for Wales. Mortality follow-up was complete through November 30, 2022. Follow-up time was calculated from baseline assessment to the first occurrence of the CKD outcome, death, or the end of follow-up, whichever occurred first.

Statistical analysis

Data preprocessing and baseline characteristics

For the 170 original metabolites, values exceeding four interquartile ranges (IQRs) from the median were winsorized [16]. Metabolites with less than 5% missingness were imputed using random forest–based regression. Missing values for conventional risk factors were imputed using multiple imputation by chained equations with random forest algorithms [24]. Five imputed datasets were generated, and one was randomly selected for analysis. Baseline characteristics were summarized using medians (IQRs) for continuous variables and counts (%) for categorical variables.
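As an illustration of the winsorization rule, the sketch below clamps values lying more than four IQRs from the median to the nearest boundary. The clamping target and the quartile method are assumptions, since the text does not specify them, and the function name `winsorize_iqr` is hypothetical.

```python
import statistics


def winsorize_iqr(values, k=4.0):
    """Clamp values lying more than k IQRs from the median.

    Assumption: outliers are replaced by the boundary value
    (median +/- k * IQR); the paper does not state the replacement rule.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = median - k * iqr, median + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Toy metabolite concentrations with one extreme value.
raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
cleaned = winsorize_iqr(raw)
```

In this toy example only the extreme value is changed; all values within four IQRs of the median pass through unmodified.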

Biomarker selection and evaluation of predictive performance

Initial biomarker screening was performed using multiple feature selection strategies. These strategies included ranking by the area under the receiver operating characteristic curve (AUC), joint mutual information maximization (JMIM), permutation feature importance, and correlation filtering. Several candidate biomarker sets were constructed and evaluated (Fig. 1A and Figure S2).

Each biomarker set was evaluated using Cox proportional hazards models. Predictive performance was quantified using Harrell’s concordance index (C-index), with 95% confidence intervals (CIs) estimated from 1,000 bootstrap resamples [25]. Among all candidate sets, the combination of all biochemical markers and correlation-filtered metabolites with Spearman correlation < 0.95 (All Bioche + Cor0.95) demonstrated the most favorable balance between predictive performance and model complexity [26]. This set was therefore selected as the primary biomarker panel for subsequent analyses and was termed BioMet, comprising 26 biochemical markers and 46 metabolites.
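For readers unfamiliar with the metric, Harrell's C-index and a percentile bootstrap interval can be sketched as follows. This is a simplified, quadratic-time illustration on toy data, not the study's R implementation; pairs with tied event times are ignored, and the function names are hypothetical.

```python
import random


def harrell_c(time, event, risk):
    """Share of comparable pairs in which the earlier failure has the higher risk."""
    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is anchored on a subject with an observed event
        for j in range(len(time)):
            if time[j] > time[i]:  # subject j outlasted subject i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


def bootstrap_ci(time, event, risk, n_boot=1000, seed=0):
    """Percentile 95% interval from resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(time)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(harrell_c([time[k] for k in idx],
                                   [event[k] for k in idx],
                                   [risk[k] for k in idx]))
        except ZeroDivisionError:
            continue  # resample had no comparable pairs
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats)) - 1]


# Toy data: follow-up time, event indicator, and model risk score.
time = [2, 4, 6, 8, 10]
event = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.5, 0.6, 0.1]
c = harrell_c(time, event, risk)
lo, hi = bootstrap_ci(time, event, risk)
```

In the toy data, six of the eight comparable pairs are ordered correctly, giving a C-index of 0.75.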

Incremental discrimination was assessed by comparing C-index values before and after adding the BioMet signature to each conventional predictor model. Calibration was evaluated using the calibration slope. Analyses were conducted separately in the England development cohort and the Scotland and Wales validation cohort [15]. All performance metrics were reported with 95% CIs derived from 1,000 bootstrap resamples.

CatBoost classification and SHAP-based feature importance

Selected biomarkers were used as input features in CatBoost classification models to identify key predictors of incident CKD and CKD-related mortality. To reduce confounding related to blood creatinine, eGFR was used instead of creatinine in CatBoost models [27, 28]. CatBoost is a gradient boosting algorithm that performs well with high-dimensional and heterogeneous data and has strong resistance to overfitting [29, 30]. It was applied as a supervised classification model to identify important biomarkers and to derive interpretable SHAP-based binary thresholds, rather than to model time-to-event outcomes directly. Time-to-event effect sizes and predictive discrimination were evaluated using Cox proportional hazards models.

CatBoost models were trained in the England development cohort using repeated five-fold cross-validation, with class balancing implemented through undersampling within the training folds. Model performance was evaluated in the held-out fold during cross-validation. Generalizability was then assessed in the geographically independent Scotland and Wales validation cohort using the original class distribution. SHAP values were used to rank feature importance based on mean absolute SHAP values and to derive data-driven thresholds according to their contributions to predicted risk. Feature effects and directions were further examined in the validation cohort to assess consistency across populations. SHAP summary and dependence plots were used to visualize feature importance and biomarker effect patterns [31].
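The key design point in this procedure is that class balancing is applied only to the training portion of each split, so that held-out folds keep the original class distribution. A minimal sketch of that idea is shown below; it uses a plain (unstratified) five-fold split on toy labels, assumes the outcome is the minority class, and does not reproduce the study's CatBoost pipeline. The function name `undersampled_folds` is hypothetical.

```python
import random


def undersampled_folds(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; only the training part is balanced."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]  # held-out fold keeps the original class mix
        train_idx = [i for f, fold in enumerate(folds) if f != k for i in fold]
        cases = [i for i in train_idx if labels[i] == 1]
        controls = [i for i in train_idx if labels[i] == 0]
        # Randomly drop controls until both classes have the same size
        # (assumes cases are the minority class).
        controls = rng.sample(controls, min(len(cases), len(controls)))
        yield cases + controls, test_idx


# Toy outcome vector: 10 cases among 100 participants.
labels = [1] * 10 + [0] * 90
splits = list(undersampled_folds(labels))
```

Repeating the procedure with different seeds would correspond to repeated cross-validation.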

Key predictive biomarkers for assessing CKD risk

Key predictive biomarkers were defined as features that consistently ranked among the top predictors across CKD outcomes based on CatBoost and SHAP analyses. Associations of these biomarkers with incident CKD and CKD-related mortality were evaluated using multivariable Cox proportional hazards models adjusted for age, sex, and body mass index (BMI). Hazard ratios (HRs) were estimated per standard deviation (SD) increase.

SHAP-based feature binarization

SHAP dependence plots were used to derive approximate binary thresholds for each of the top 20 biomarkers. These plots illustrate how biomarker values influence predicted risk. Biomarker values with SHAP values ≥ 0 were considered risk-elevating, whereas those with SHAP values < 0 were considered lower risk. Each biomarker was dichotomized using a SHAP value threshold of 0. Associations between the binarized top 20 biomarkers and CKD incidence or mortality were assessed using multivariable Cox proportional hazards models.
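One way to locate the biomarker value at which the SHAP contribution changes sign is sketched below. The exact procedure used to read thresholds from the dependence plots is not described in the text, so the binning and linear interpolation here are assumptions, and `shap_zero_crossing` is a hypothetical helper operating on precomputed SHAP values.

```python
def shap_zero_crossing(values, shap_values, n_bins=10):
    """Approximate the biomarker value where the mean SHAP value crosses zero.

    Assumption: the dependence pattern is smoothed with equal-count bins,
    and the first sign change between adjacent bins is interpolated.
    Returns None if no sign change is found.
    """
    pairs = sorted(zip(values, shap_values))
    size = max(1, len(pairs) // n_bins)
    bins = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    centers = [(sum(v for v, _ in b) / len(b), sum(s for _, s in b) / len(b))
               for b in bins]
    for (x0, s0), (x1, s1) in zip(centers, centers[1:]):
        if s0 == 0:
            return x0
        if s0 * s1 < 0:
            # Linear interpolation between the two bin centers.
            return x0 - s0 * (x1 - x0) / (s1 - s0)
    return None


# Synthetic example: SHAP values rise linearly and cross zero at 5.0.
biomarker = [i / 10 for i in range(100)]
shap_vals = [0.1 * (v - 5.0) for v in biomarker]
threshold = shap_zero_crossing(biomarker, shap_vals)
```

Values at or above the returned threshold would then be labeled risk-elevating for a biomarker whose SHAP contribution increases with its level.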

Construction and stratification of the biomarker risk score

A biomarker risk score (BRS) was calculated for each participant based on SHAP-derived binary thresholds [32]. For the primary score, the top 10 biomarkers identified by SHAP importance were used (Top10BRS). Each biomarker was dichotomized according to its SHAP-derived threshold, and the BRS was defined as the sum of binary risk indicators across the selected biomarkers. For biomarkers positively associated with the outcome, a value of 1 was assigned when the biomarker level was greater than or equal to its threshold, and 0 otherwise. For biomarkers negatively associated with the outcome, a value of 1 was assigned when the biomarker level was below the threshold, and 0 otherwise. Higher BRS values indicate a greater cumulative burden of adverse biomarker profiles.
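The scoring rule described above can be sketched in a few lines. The thresholds and biomarker keys in this example are hypothetical placeholders chosen for illustration only; they are not the SHAP-derived thresholds from the study.

```python
def biomarker_risk_score(profile, rules):
    """Sum of binary risk indicators across the selected biomarkers."""
    score = 0
    for name, (threshold, direction) in rules.items():
        value = profile[name]
        if direction == "positive":
            # Risk-elevating biomarker: 1 point at or above the threshold.
            score += int(value >= threshold)
        else:
            # Inversely associated biomarker: 1 point below the threshold.
            score += int(value < threshold)
    return score


# Hypothetical thresholds, NOT the values derived in the study.
rules = {
    "cystatin_c": (1.0, "positive"),
    "hba1c": (40.0, "positive"),
    "egfr": (80.0, "negative"),
    "igf1": (18.0, "negative"),
}
participant = {"cystatin_c": 1.1, "hba1c": 38.0, "egfr": 75.0, "igf1": 20.0}
score = biomarker_risk_score(participant, rules)
```

In this example the participant scores one point for cystatin C and one for eGFR, giving a score of 2 out of a possible 4.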

According to BRS tertiles, participants were categorized into low-, moderate-, and high-risk groups. Absolute risk in each group was assessed using incidence proportion (%) and incidence density (ID). Associations with incident CKD and CKD-related mortality were evaluated using multivariable Cox proportional hazards models, with the low-risk group as the reference. Models were adjusted for demographic characteristics, lifestyle factors, and baseline health conditions.
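The tertile grouping and the two absolute-risk measures can be illustrated as follows. This is a minimal sketch on made-up numbers; how ties at the tertile cut points were handled in the study is not stated, so the rule used here (values at a cut point fall into the lower group) is an assumption.

```python
import statistics


def tertile_groups(scores):
    """Label each score as low, moderate, or high by tertile cut points."""
    t1, t2 = statistics.quantiles(scores, n=3, method="inclusive")
    return ["low" if s <= t1 else "moderate" if s <= t2 else "high"
            for s in scores]


def incidence_summary(events, n_participants, person_years):
    """Incidence proportion (%) and incidence density (per 1,000 person-years)."""
    proportion_pct = 100.0 * events / n_participants
    density_per_1000py = 1000.0 * events / person_years
    return proportion_pct, density_per_1000py


# Made-up example values, for illustration only.
groups = tertile_groups([0, 1, 2, 3, 4, 5, 6, 7, 8])
proportion, density = incidence_summary(events=50, n_participants=1000,
                                        person_years=12_500)
```

With 50 events among 1,000 participants followed for 12,500 person-years, the incidence proportion is 5.0% and the incidence density is 4.0 per 1,000 person-years.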

To assess robustness, an alternative score based on the top 20 biomarkers (Top20BRS) was constructed using the same SHAP-derived binary thresholds. Weighted versions of the biomarker risk scores were also derived using Cox regression coefficients (Top10wBRS and Top20wBRS). The ability of these alternative scores to stratify relative and absolute risk was evaluated in both the development and validation cohorts. The unweighted Top10BRS was selected as the primary score because of its simplicity and stability.

Incremental predictive performance after adding the SHAP-derived biomarker risk scores to conventional risk factor models was evaluated using Cox proportional hazards models. Discrimination was assessed using C-index and the change in C-index relative to the conventional model. Calibration was evaluated using the calibration slope, and overall prediction error was assessed using the 10-year Brier score. All performance metrics were reported with 95% CIs estimated from 1,000 bootstrap resamples.

All statistical analyses were conducted in R software version 4.3.1. Multiple comparisons were corrected using the Benjamini–Hochberg procedure to control the false discovery rate (FDR). All tests were 2-sided. A P value or FDR-adjusted P value < 0.05 was considered statistically significant.
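For reference, the Benjamini–Hochberg adjustment can be written compactly. The sketch below is a generic Python illustration of the standard procedure, not the study's R code, and the P values are made up.

```python
def benjamini_hochberg(p_values):
    """Return FDR-adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P values for four hypothetical biomarker tests.
p_raw = [0.01, 0.04, 0.03, 0.20]
p_adj = benjamini_hochberg(p_raw)
significant = [p < 0.05 for p in p_adj]
```

Here three raw P values fall below 0.05, but only the smallest remains below 0.05 after adjustment.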

Results

Study population characteristics

The study included 233,589 participants who were free of CKD at baseline. The median age was 58.0 years (IQR, 50.0–63.0), and 46.4% of participants were male. Most participants were of White ancestry (94.9%). During a median follow-up of 13.67 years (IQR, 12.82–14.36), 10,371 participants developed incident CKD, and 507 died from CKD-related causes. Participants who developed CKD or experienced CKD-related mortality generally had lower educational attainment and household income. They had higher baseline levels of body mass index (BMI), C-reactive protein (CRP), glycated hemoglobin (HbA1c), triglycerides (TG), and systolic blood pressure (SBP). The prevalence of hypertension, type 2 diabetes, and cardiovascular disease was also higher in these groups (Table 1). The distributions of key demographic characteristics were broadly similar between included and excluded participants (Table S4).

Table 1.

Baseline characteristics of study participants by incident CKD and CKD mortality

Characteristics	All participants (N = 233,589)	Incident CKD (N = 10,371)	CKD mortality (N = 507)
Median (IQR) age, years	58.0 [50.0, 63.0]	63.0 [59.0, 66.0]	65.0 [61.0, 68.0]
Sex
Male	108,379 (46.4)	5,521 (53.2)	341 (67.3)
Female	125,210 (53.6)	4,850 (46.8)	166 (32.7)
Ethnicity
White	221,590 (94.9)	9,778 (94.3)	463 (91.3)
Others	11,999 (5.1)	593 (5.7)	44 (8.7)
Education
Lower qualification	158,657 (67.9)	8,229 (79.3)	420 (82.8)
Higher qualification	74,932 (32.1)	2,142 (20.7)	87 (17.2)
Average household income (£)
< 18 000	62,096 (26.6)	4,376 (42.2)	247 (48.7)
18 000–30 999	59,904 (25.6)	2,968 (28.6)	152 (30.0)
31 000–51 999	56,110 (24.0)	1,835 (17.7)	76 (15.0)
52 000–100 000	43,917 (18.8)	993 (9.6)	29 (5.7)
> 100 000	11,562 (4.9)	199 (1.9)	3 (0.6)
BMI (kg/m²)
< 18.5	1,159 (0.5)	21 (0.2)	2 (0.4)
18.5–24.9	75,535 (32.3)	1,867 (18.0)	74 (14.6)
25–29.9	99,586 (42.6)	4,316 (41.6)	186 (36.7)
≥ 30	57,309 (24.5)	4,167 (40.2)	245 (48.3)
Median (IQR) MET minutes per week for moderate activity, minutes	720.0 [240.0, 2880.0]	720.0 [240.0, 3600.0]	720.0 [160.0, 5040.0]
Regular physical activity
Low	35,721 (15.3)	1,896 (18.3)	115 (22.7)
Moderate	76,279 (32.7)	3,155 (30.4)	139 (27.4)
High	121,589 (52.1)	5,320 (51.3)	253 (49.9)
Smoking status
Never	127,820 (54.7)	4,787 (46.2)	190 (37.5)
Former	80,861 (34.6)	4,338 (41.8)	237 (46.7)
Current	24,908 (10.7)	1,246 (12.0)	80 (15.8)
Alcohol consumption (times/week)
Never	70,533 (30.2)	4,121 (39.7)	229 (45.2)
1–2	61,158 (26.2)	2,599 (25.1)	112 (22.1)
3–4	54,592 (23.4)	1,861 (17.9)	80 (15.8)
≥ 5	47,306 (20.3)	1,790 (17.3)	86 (17.0)
Vegetable consumption (servings/day)
< 2.0	15,237 (6.5)	840 (8.1)	41 (8.1)
≥ 2.0–3.9	66,816 (28.6)	2,971 (28.6)	160 (31.6)
≥ 4.0–5.9	78,035 (33.4)	3,334 (32.1)	150 (29.6)
≥ 6.0	73,501 (31.5)	3,226 (31.1)	156 (30.8)
Fruit consumption (servings/day)
< 2.0	77,693 (33.3)	3,597 (34.7)	180 (35.5)
≥ 2.0–2.9	59,076 (25.3)	2,557 (24.7)	109 (21.5)
≥ 3.0–3.9	45,135 (19.3)	1,938 (18.7)	92 (18.1)
≥ 4.0	51,685 (22.1)	2,279 (22.0)	126 (24.9)
Median (IQR) CRP, mg/L	1.3 [0.7, 2.8]	2.0 [1.0, 4.0]	2.56 [1.2, 5.3]
Median (IQR) HbA1c, mmol/mol	35.1 [32.4, 37.8]	37.0 [34.0, 41.1]	39.6 [35.3, 48.9]
Median (IQR) TC, mmol/L	5.7 [4.9, 6.4]	5.3 [4.5, 6.2]	4.8 [4.0, 5.8]
Median (IQR) HDL-C, mmol/L	1.4 [1.2, 1.7]	1.3 [1.1, 1.5]	1.2 [1.0, 1.4]
Median (IQR) LDL-C, mmol/L	3.5 [3.0, 4.1]	3.3 [2.6, 4.0]	2.9 [2.3, 3.7]
Median (IQR) TG, mmol/L	1.5 [1.0, 2.1]	1.7 [1.2, 2.4]	1.8 [1.3, 2.5]
Median (IQR) SBP, mmHg	137.0 [125.0, 150.0]	142.0 [130.0, 156.0]	144.0 [132.0, 158.0]
Median (IQR) DBP, mmHg	82.0 [76.0, 89.0]	83.0 [76.0, 90.0]	81.0 [74.0, 89.0]
Drug use
Cholesterol lowering medication	40,248 (17.2)	4,194 (40.4)	274 (54.0)
Anti-hypertensive drug	47,952 (20.5)	4,929 (47.5)	320 (63.1)
Insulin	2,898 (1.2)	460 (4.4)	50 (9.9)
Disease history
T2DM	13,375 (5.7)	2,020 (19.5)	199 (39.3)
HTN	117,147 (50.2)	7,661 (73.9)	409 (80.7)
CVD	36,223 (15.5)	3,924 (37.8)	272 (53.6)

Continuous variables are presented as median [IQR]; categorical variables are presented as numbers (%)

Abbreviations: CKD, chronic kidney disease; BMI, body mass index; IQR, interquartile range; MET, metabolic equivalent of task; SBP, systolic blood pressure; DBP, diastolic blood pressure; CVD, cardiovascular disease; HTN, hypertension; T2DM, type 2 diabetes mellitus

Superior discriminative performance of combined biomarkers in chronic kidney disease risk stratification

Various feature selection methods were applied to identify informative biomarkers, and model performance was evaluated using C-index. For incident CKD, the combination of all biochemical markers and correlation-filtered metabolites (All Bioche + Cor0.95) achieved a C-index of 0.797 (95% CI, 0.793–0.802), which was comparable to the model including all biochemical and non-overlapping metabolites (C-index, 0.800; 95% CI, 0.797–0.805). For CKD-related mortality, models combining all biochemical markers with metabolites selected using different feature selection strategies showed similar discrimination, with C-index values ranging from 0.898 to 0.899 (Table S5 and Figure S3). Based on discrimination performance and model complexity, the All Bioche + Cor0.95 biomarker set was selected for subsequent analyses and was termed BioMet. Calibration slopes were close to 1 across models, indicating good agreement between predicted and observed risks after inclusion of the BioMet signature (Table S6).

Discrimination performance of BioMet was evaluated in the England development cohort and the geographically independent Scotland and Wales validation cohort (Fig. 2 and Table S7). For incident CKD, conventional risk factor models showed moderate discrimination. The BioMet signature alone achieved C-index values of 0.795 (95% CI, 0.792–0.800) in the development cohort and 0.843 (95% CI, 0.834–0.867) in the validation cohort, which were higher than those of the corresponding Set and AgeSex models in both cohorts. The addition of BioMet to conventional PANEL models yielded C-index values of 0.827 (95% CI, 0.824–0.831) in the development cohort and 0.866 (95% CI, 0.860–0.892) in the validation cohort.

Fig. 2 — (A). Harrell’s C-index for predicting incident CKD in the development cohort from England. (B). Harrell’s C-index for predicting CKD-related mortality in the development cohort from England. (C). Harrell’s C-index for predicting incident CKD in the geographically distinct validation cohort from Scotland and Wales. (D). Harrell’s C-index for predicting CKD-related mortality in the geographically distinct validation cohort from Scotland and Wales. Discrimination performance was assessed using Harrell’s C-index with 95% CIs estimated from 1,000 bootstrap resamples. The Set model included age, sex, BMI, eGFR, smoking status, baseline CVD, T2DM, and HTN. The PANEL model included age, sex, BMI, eGFR, race, household income, education level, smoking status, alcohol intake, physical activity, fruit and vegetable consumption, sleep duration, waist and hip circumference, SBP and DBP, baseline CVD, T2DM, HTN, and use of blood pressure–lowering, insulin, and cholesterol-lowering medications. BioMet denotes a selected biochemical–metabolomic signature comprising 26 routine biochemical biomarkers and 46 NMR-based metabolites after correlation filtering

For CKD-related mortality, the BioMet signature alone achieved C-index values of 0.899 (95% CI, 0.890–0.915) in the development cohort and 0.941 (95% CI, 0.926–0.974) in the validation cohort, again outperforming the Set and AgeSex models in both cohorts. When BioMet was added to conventional PANEL risk factor models, discrimination further improved, with C-index values of 0.931 (95% CI, 0.924–0.943) in the development cohort and 0.970 (95% CI, 0.968–0.994) in the validation cohort.

CatBoost classifier and SHAP analysis reveal key factors in predicting incident CKD and CKD-related mortality

SHAP importance plots were used to identify key predictors of incident CKD and CKD-related mortality in the England development cohort (Figs. 3A and 4A). For incident CKD, eGFR ranked as the most important predictor, whereas it ranked second for CKD-related mortality. Lower eGFR values were associated with higher predicted CKD risk. Cystatin C was the strongest predictor of CKD-related mortality and ranked second for incident CKD, with higher levels corresponding to increased CKD risk. HbA1c consistently ranked among the top predictors for both outcomes, with higher levels linked to greater risk.

Fig. 3 — SHAP analysis identifying key predictors of incident CKD in the England development cohort. (A). SHAP summary plot showing the top 50% features ranked by mean absolute SHAP value for incident CKD prediction. (B–U). SHAP dependence plots for the top 20 predictors, illustrating the relationship between individual biomarker values and their contributions to the predicted risk of incident CKD

Fig. 4 — SHAP analysis identifying key predictors of CKD-related mortality in the England development cohort. (A). SHAP summary plot showing the top 50% features ranked by mean absolute SHAP value for CKD-related mortality prediction. (B–U). SHAP dependence plots for the top 20 predictors, illustrating the relationship between individual biomarker values and their contributions to the predicted risk of CKD-related mortality

SHAP dependence plots further illustrated the relationships between individual biomarker values and predicted risk (Figs. 3B–U and 4B–U). Higher levels of cystatin C, HbA1c, gamma-glutamyl transferase (GGT), urea, CRP, and GlycA were associated with increased predicted risk of both incident CKD and CKD-related mortality. In contrast, lower levels of eGFR, insulin-like growth factor 1 (IGF-1), and albumin were associated with higher predicted risk, as reflected by positive SHAP values at lower ranges of these biomarkers. Similar SHAP importance rankings and dependence patterns were observed in the geographically independent Scotland and Wales validation cohort (Figures S4 and S5).

Cox proportional hazards model identifies key biomarkers associated with new-onset CKD and CKD mortality

Multivariable Cox proportional hazards models were used to examine associations between biomarkers consistently identified within the top 50 SHAP-ranked features across outcomes and risks of incident CKD and CKD-related mortality (Fig. 5A). Each 1-SD increase in cystatin C was associated with an HR of 1.60 (95% CI: 1.58–1.61) for incident CKD and 1.71 (95% CI: 1.64–1.78) for CKD-related mortality. Each 1-SD increase in GlycA was associated with an HR of 1.20 (95% CI: 1.18–1.23) for incident CKD and 1.42 (95% CI: 1.30–1.55) for CKD-related mortality. CRP, urea, HbA1c, and glucose–lactate were also consistently associated with higher risk across outcomes. In contrast, each 1-SD increase in eGFR was associated with a substantially lower risk of incident CKD (HR = 0.47, 95% CI: 0.46–0.48) and CKD-related mortality (HR = 0.53, 95% CI: 0.48–0.58). Albumin and M-VLDL-CE also showed inverse associations with both CKD outcomes.

Fig. 5 — Associations of key biomarkers with incident CKD and CKD-related mortality in the England development cohort. (A). HRs with 95% CIs for the associations of per–SD increases in the top SHAP-identified biomarkers with incident CKD and CKD-related mortality. (B). HRs with 95% CIs for the associations between elevated levels of the top 20 biomarkers and incident CKD. (C). HRs with 95% CIs for the associations between elevated levels of the top 20 biomarkers and CKD-related mortality. Cox proportional hazards models were adjusted for age, sex, and BMI. Abbreviations: eGFR, estimated glomerular filtration rate; HbA1c, glycated hemoglobin; IGF-1, insulin-like growth factor 1; ALT, alanine aminotransferase; GGT, gamma-glutamyltransferase; M-VLDL-CE, medium very-low-density lipoprotein cholesteryl ester; ALP, alkaline phosphatase; CRP, C-reactive protein; LA, linoleic acid; TC, total cholesterol

Elevated biomarkers and their association with incident CKD and CKD mortality

To facilitate risk stratification, the top 20 biomarkers were further binarized using SHAP-derived thresholds. Associations were assessed by comparing biomarker values at or above versus below each threshold for risks of incident CKD and CKD-related mortality (Fig. 5B and C). Elevated levels of cystatin C, HbA1c, urea, CRP, GlycA, GGT, glucose–lactate, and alkaline phosphatase (ALP) were consistently associated with increased risks of both incident CKD and CKD-related mortality. Among these biomarkers, cystatin C showed the strongest risk elevation. Individuals above the SHAP-derived threshold had a higher risk of incident CKD (HR = 3.17; 95% CI, 3.03–3.31) and CKD-related mortality (HR = 4.45; 95% CI, 3.61–5.49).

In contrast, higher levels of eGFR, albumin, IGF-1, histidine, and cholesteryl esters in medium VLDL (M-VLDL-CE) were associated with lower risks of incident CKD and CKD-related mortality. The inverse associations were most pronounced for eGFR and M-VLDL-CE. eGFR values above the SHAP-derived threshold were associated with a 70% lower risk of incident CKD (HR = 0.30; 95% CI: 0.29–0.32) and a 63% lower risk of CKD-related mortality (HR = 0.37; 95% CI: 0.31–0.46). Similar association patterns were observed in the Scotland and Wales validation cohort (Figure S6).

Enhanced CKD risk stratification using the biomarker risk score

Using the SHAP-derived Top10BRS, participants were stratified into low-, moderate-, and high-risk groups based on score tertiles. For both incident CKD and CKD-related mortality, absolute risk increased progressively across risk categories. For incident CKD, the incidence proportion increased from 1.33% in the low-risk group to 3.54% in the moderate-risk group and 10.70% in the high-risk group. A similar gradient was observed for incidence density (ID), with corresponding values of 0.99, 2.68, and 8.44 per 1,000 person-years across the three groups. In multivariable-adjusted Cox models, the moderate-risk group had a higher risk of incident CKD than the low-risk group, whereas the high-risk group exhibited a more than threefold increase in risk. For CKD-related mortality, absolute event rates were low but increased consistently across risk categories. Compared with the low-risk group, both the moderate- and high-risk groups had substantially higher mortality risk, with the strongest association observed in the highest tertile (Fig. 6A).

Consistent risk stratification patterns were observed in the geographically independent Scotland and Wales validation cohort (Fig. 6B). Sensitivity analyses using alternative score constructions, including Top20BRS, Cox coefficient–weighted Top10wBRS, and Top20wBRS, yielded similar gradients in both absolute and relative risks (Figures S7–S9).
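To make the distinction between unweighted and Cox coefficient–weighted scores concrete, the Python sketch below contrasts the two constructions on two markers. It is an illustration under stated assumptions rather than the scoring rule used in the study: the unweighted score is assumed to count markers on their higher-risk side of the threshold, the weighted score is assumed to sum |ln HR| over those markers, the thresholds are hypothetical placeholders, and all function names are invented. The two hazard ratios are the incident-CKD estimates reported above for cystatin C and eGFR.

```python
import math

# name: (threshold, hazard ratio for values at or above the threshold).
# Thresholds are hypothetical placeholders, not the SHAP-derived cut-points.
MARKERS = {
    "cystatin_c": (1.0, 3.17),
    "egfr": (90.0, 0.30),
}


def adverse_flags(measurements):
    """Return 1 for each marker on its higher-risk side of the threshold, else 0."""
    flags = {}
    for name, (threshold, hr) in MARKERS.items():
        at_or_above = measurements[name] >= threshold
        # HR > 1: high values are adverse; HR < 1: low values are adverse.
        flags[name] = int(at_or_above if hr > 1 else not at_or_above)
    return flags


def unweighted_score(measurements):
    """Count of adverse markers (assumed form of an unweighted score)."""
    return sum(adverse_flags(measurements).values())


def weighted_score(measurements):
    """Sum of |ln HR| over adverse markers (assumed form of a weighted score)."""
    flags = adverse_flags(measurements)
    return sum(flags[name] * abs(math.log(hr)) for name, (_, hr) in MARKERS.items())


high_risk = {"cystatin_c": 1.2, "egfr": 75.0}
low_risk = {"cystatin_c": 0.8, "egfr": 100.0}
print(unweighted_score(high_risk), round(weighted_score(high_risk), 3))
print(unweighted_score(low_risk), round(weighted_score(low_risk), 3))
```

When the weights are of similar magnitude, as they are for these two markers, the weighted and unweighted scores rank individuals almost identically, which is one plausible reason the simpler score can perform comparably.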

Adding the SHAP-derived Top10BRS to traditional risk models increased discrimination for incident CKD and CKD-related mortality in both the development and validation cohorts. Calibration slopes and 10-year Brier scores were similar before and after inclusion of Top10BRS (Table S8).
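Discrimination here refers to Harrell's C-index: among comparable pairs of participants, the fraction in which the one who experienced the event earlier also had the higher predicted risk. The Python sketch below is a minimal, quadratic-time version for right-censored data that ignores tied event times. The follow-up times, event indicators, and both sets of predicted risks are invented toy numbers, and the function name `harrell_c_index` is hypothetical. Calibration slope and the Brier score are not covered.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i has the shorter follow-up time
    and experienced the event. The pair is concordant when subject i also has
    the higher predicted risk; ties in predicted risk count as 0.5.
    Pairs with identical follow-up times are skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data for illustration only.
times = [2.0, 4.0, 6.0, 8.0, 10.0]   # years of follow-up
events = [1, 1, 0, 1, 0]             # 1 = event observed, 0 = censored
risk_conventional = [0.9, 0.4, 0.5, 0.3, 0.1]
risk_with_score = [0.9, 0.7, 0.5, 0.3, 0.1]

print(harrell_c_index(times, events, risk_conventional))  # 0.875
print(harrell_c_index(times, events, risk_with_score))    # 1.0
```

In the toy example, the second set of predictions corrects one misordered pair, which raises the C-index from 7/8 to 8/8.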

Discussion

In this large-scale prospective study of UK Biobank participants, we evaluated whether circulating biomarkers improve prediction of CKD. The combined biochemical–metabolomic signature (BioMet) showed good discrimination in both the development cohort and the geographically independent validation cohort. When added to conventional prediction models, BioMet improved discrimination for both incident CKD and CKD-related mortality. Calibration performance remained satisfactory after biomarker integration, supporting the reliability of risk estimation. Using explainable machine learning and SHAP analysis, we identified a parsimonious set of informative biomarkers and, from ten of them, constructed a simplified biomarker risk score (Top10BRS). This score separated participants into risk groups that differed clearly in both absolute and relative risk of incident CKD and CKD-related mortality. These findings were consistent in the development cohort and the geographically independent validation cohort. The slightly higher C-index observed in the Scotland and Wales validation cohort is plausible and may reflect differences in case-mix and risk heterogeneity, with calibration remaining acceptable.

The identified biomarkers showed robust predictive performance and effective risk stratification, and they are biologically plausible. Rather than acting as isolated predictors, they clustered into several biologically coherent pathways relevant to CKD pathophysiology: impaired renal filtration, metabolic dysregulation, systemic inflammation, and nutritional or anabolic status. Together, these pathways provide a unified framework linking circulating metabolic signals to renal injury.

Markers of renal filtration formed the core of the predictive signature. Cystatin C emerged as one of the strongest risk-elevating biomarkers for both incident CKD and CKD-related mortality. Compared with creatinine, cystatin C is less influenced by muscle mass and more sensitive to early declines in glomerular filtration [33–35]. In parallel, higher eGFR showed a strong inverse association with risk, further supporting the biological validity of the model [35, 36].

A second group of biomarkers reflected metabolic dysregulation and glycemic stress. Elevated HbA1c and glucose–lactate were consistently associated with higher CKD risk. These associations are consistent with prior evidence implicating chronic hyperglycemia, mitochondrial stress, and altered energy metabolism in kidney injury. Diabetes is a major comorbidity of CKD, and diabetic kidney disease accounts for nearly half of all ESKD cases [37]. Urea further captured impaired nitrogen handling and systemic metabolic burden, linking altered metabolism with reduced renal clearance capacity [38, 39].

Inflammation-related biomarkers also contributed substantially to risk prediction. Higher levels of CRP and GGT were associated with increased risks of CKD onset and mortality. These markers reflect low-grade systemic inflammation and oxidative stress, which have been implicated in endothelial dysfunction, tubular injury, and fibrotic processes in the kidney [36, 40–42].

In contrast, several biomarkers showed protective associations, highlighting the potential relevance of nutritional and anabolic pathways in CKD. Higher circulating levels of IGF-1 were consistently associated with lower risks of both incident CKD and CKD-related mortality. IGF-1 is a key regulator of cellular growth, repair, and metabolic homeostasis and has been linked to renal hemodynamics, partly through interactions with the renin–angiotensin system, which may contribute to its observed association with CKD outcomes [43, 44]. Similarly, histidine and albumin likely reflect preserved nutritional status and antioxidant capacity, factors that have been associated with resilience to chronic metabolic and inflammatory stress [45]. The inverse association of M-VLDL-CE suggests that specific lipid fractions may capture aspects of metabolic resilience rather than traditional atherogenic risk in the context of CKD. Taken together, these findings suggest that early CKD risk may reflect the combined effects of impaired filtration, metabolic stress, inflammation, and reduced anabolic reserve. The biomarker signature and risk score capture these joint influences more effectively than individual markers or conventional risk factors alone.

This study supports the integration of high-dimensional biomarkers into CKD risk prediction with clear clinical relevance. By combining explainable machine learning with traditional survival analysis, we improved predictive performance and enabled individualized risk stratification beyond conventional risk factors. SHAP analysis facilitated the derivation of data-driven and interpretable binary thresholds for selected biomarkers [24, 32]. This approach enhances model transparency and interpretability in a clinical context. Based on ten informative biomarkers, we developed a simplified biomarker risk score (Top10BRS) that performed comparably to weighted alternatives, supporting its simplicity and robustness. The Top10BRS relies on a limited set of biologically relevant biomarkers with consistent directions of effect, making it practical for risk stratification, clinical interpretation, and communication of risk. This score enables stratification of both absolute and relative CKD risk years before overt disease onset, which may support earlier identification of high-risk individuals, closer surveillance, and targeted preventive strategies in clinical practice. Although some metabolomic markers are not yet routinely measured, many of the most influential predictors are already available in standard clinical testing, and broader adoption may become feasible as analytical platforms evolve. Together, this framework highlights the potential for a more proactive and precision-oriented approach to CKD risk assessment. External validation and recalibration will be required before clinical implementation.
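As an illustration of how a binary threshold can be read off SHAP values, the Python sketch below picks the feature value at which SHAP contributions switch sign. This is one common heuristic and is an assumption here; the procedure used in the study may differ. The feature and SHAP values are invented toy numbers, and the function name `shap_sign_threshold` is hypothetical.

```python
def shap_sign_threshold(values, shap_values):
    """Find the cut-point that best separates negative from positive SHAP values.

    Returns (threshold, direction). direction = +1 means values at or above the
    threshold push predicted risk up; direction = -1 means they push it down.
    """
    pairs = sorted(zip(values, shap_values))
    best = None  # (number of agreeing observations, cut-point, direction)
    for k in range(1, len(pairs)):
        lower, upper = pairs[k - 1][0], pairs[k][0]
        if lower == upper:
            continue  # no cut-point between identical feature values
        cut = (lower + upper) / 2.0
        # Observations consistent with "at or above the cut <=> positive SHAP".
        agree_up = sum((x >= cut) == (s > 0) for x, s in pairs)
        for direction, agree in ((1, agree_up), (-1, len(pairs) - agree_up)):
            if best is None or agree > best[0]:
                best = (agree, cut, direction)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2]


# Toy data: a risk-elevating marker and a risk-lowering marker.
marker_up = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
shap_up = [-0.3, -0.2, -0.1, 0.05, 0.2, 0.4]
marker_down = [60.0, 70.0, 80.0, 90.0, 100.0]
shap_down = [0.4, 0.2, 0.1, -0.1, -0.3]

print(shap_sign_threshold(marker_up, shap_up))
print(shap_sign_threshold(marker_down, shap_down))
```

Because such a cut-point depends on the training sample and the fitted model, it would be expected to shift across populations and assay platforms.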

Strengths and limitations

This study has several strengths. These include the large population-based design, long follow-up duration, comprehensive biomarker profiling, and validation in a geographically independent cohort. The combined use of explainable machine learning and Cox modeling enhances both predictive performance and interpretability.

Several limitations should be acknowledged. First, CKD outcomes were ascertained using administrative codes rather than repeated measurements of eGFR or albuminuria, which may have led to some degree of outcome misclassification, particularly for early or subclinical disease. Second, the NMR metabolomics platform primarily captures lipid-related and selected metabolic measures and does not comprehensively represent all metabolic pathways relevant to CKD pathophysiology. Third, SHAP-derived thresholds were data-driven and should be interpreted as approximate rather than definitive clinical cut-points. These thresholds may require recalibration across different populations, assay platforms, or clinical settings. Fourth, although we performed geographic validation using participants from Scotland and Wales, all analyses were conducted within the UK Biobank. Therefore, this study lacks external validation in independent cohorts outside the UK Biobank, which may limit generalizability. In addition, the predominance of participants of European ancestry may constrain the applicability of our findings to other ethnic groups. Finally, as with all observational studies, residual confounding cannot be fully excluded, and the identified associations should not be interpreted as causal. Future research should validate these biomarkers in diverse populations, evaluate their clinical utility, and explore underlying mechanisms in longitudinal and experimental studies.

Conclusions

In summary, integrating plasma biochemical and metabolomic biomarkers with conventional clinical predictors improves long-term prediction and risk stratification for incident CKD and CKD-related mortality. Using an explainable machine learning framework, we identified biologically coherent biomarkers and derived interpretable, data-driven thresholds. We then constructed a simplified biomarker risk score with robust performance in both development and validation cohorts. These findings support the potential of multi-biomarker, interpretable prediction models to advance early identification and personalized prevention of CKD.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (3 MB, PDF)

Acknowledgements

This study was conducted using data from the UK Biobank resource under application number 242473. We are grateful to all the participants and professionals contributing to the UK Biobank.

Abbreviations

BMI: Body mass index
BRS: Biomarker risk score
C-index: Harrell’s concordance index
CI: Confidence interval
CKD: Chronic kidney disease
CRP: C-reactive protein
CVD: Cardiovascular disease
eGFR: Estimated glomerular filtration rate
ESKD: End-stage kidney disease
FDR: False discovery rate
HbA1c: Glycated hemoglobin
ICD: International Classification of Diseases
ID: Incidence density
IQR: Interquartile range
JMIM: Joint mutual information maximization
NMR: Nuclear magnetic resonance
ROC: Receiver operating characteristic
SBP: Systolic blood pressure
SD: Standard deviation
T2DM: Type 2 diabetes mellitus
TG: Triglycerides

Author contributions

Study concept and design: J. Ma and R.Y. Liu, who contributed equally to this work. Study supervision: X.R. Zhang and G.F. Hu. Acquisition, analysis, or interpretation of data: X.R. Zhang, R.Y. Liu, and X. Feng. Contributed to discussion of the results: J. Ma, X.R. Zhang, X. Feng, G.F. Hu, R.Y. Liu, L. Zhang, and X. Li. Drafting of the manuscript: R.Y. Liu and X.R. Zhang. Critical revision of the manuscript for important intellectual content: All authors. Final approval of the article: All authors. Statistical expertise: X.R. Zhang, X. Feng, J.L. Huang, and J. Gao. Funding acquisition: X.R. Zhang and X. Feng. Technical, material, or administrative support: G.F. Hu and X. Feng.

Funding

This work was financially supported by the National Natural Science Foundation of China to X.R. Zhang (82304211), and X. Feng (82201427 and 82571691). The funders had no role in the study design or implementation; data collection, management, analysis, or interpretation; manuscript preparation, review, or approval; or the decision to submit the manuscript for publication.

Data availability

The data used in this study are available from the UK Biobank, with restrictions, and were used under license. UK Biobank data are available to all researchers for health-related research and studies in the public interest, and access can be requested through their established protocol (https://www.ukbiobank.ac.uk/use-our-data/).

Declarations

Ethics approval and consent to participate

This research complied with the principles of the Declaration of Helsinki. Written informed consent was obtained from all participants prior to their involvement, and ethical approval was granted by the North West Multi-Center Research Ethics Committee (reference: 11/NW/0382). Any additional ethical approval was adjudged unnecessary for the present study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jing Ma and Ruiyan Liu contributed equally to this work.

Contributor Information

Guifang Hu, Email: hgf@smu.edu.cn.

Xiru Zhang, Email: zxr19921218@smu.edu.cn.

References

1. Zhang L, Wang F, Wang L, Wang W, Liu B, Liu J, et al. Prevalence of chronic kidney disease in China: A cross-sectional survey. Lancet Lond Engl. 2012;379:815–22. 10.1016/S0140-6736(12)60033-6.
2. Kovesdy CP. Epidemiology of chronic kidney disease: an update 2022. Kidney Int Suppl. 2022;12:7–11. 10.1016/j.kisu.2021.11.003.
3. Jager KJ, Kovesdy C, Langham R, Rosenberg M, Jha V, Zoccali C. A single number for advocacy and communication-worldwide more than 850 million individuals have kidney diseases. Kidney Int. 2019;96:1048–50. 10.1016/j.kint.2019.07.012.
4. Chen TK, Knicely DH, Grams ME. Chronic kidney disease diagnosis and management: A review. JAMA. 2019;322:1294–304. 10.1001/jama.2019.14745.
5. Radhakrishnan J, Remuzzi G, Saran R, Williams DE, Rios-Burrows N, Powe N, et al. Taming the chronic kidney disease epidemic: A global view of surveillance efforts. Kidney Int. 2014;86:246–50. 10.1038/ki.2014.190.
6. Vanent KN, Leasure AC, Acosta JN, Kuohn LR, Woo D, Murthy SB, et al. Association of chronic kidney disease with risk of intracerebral hemorrhage. JAMA Neurol. 2022;79:911–8. 10.1001/jamaneurol.2022.2299.
7. Lee M, Saver JL, Chang K-H, Liao H-W, Chang S-C, Ovbiagele B. Low glomerular filtration rate and risk of stroke: Meta-analysis. BMJ. 2010;341:c4249. 10.1136/bmj.c4249.
8. Choi Y, Jacobs DR, Shroff GR, Kramer H, Chang AR, Duprez DA. Progression of chronic kidney disease risk categories and risk of cardiovascular disease and total mortality: coronary artery risk development in young adults cohort. J Am Heart Assoc. 2022;11:e026685. 10.1161/JAHA.122.026685.
9. Go AS, Chertow GM, Fan D, McCulloch CE, Hsu C. Chronic kidney disease and the risks of death, cardiovascular events, and hospitalization. N Engl J Med. 2004;351:1296–305. 10.1056/NEJMoa041031.
10. Meisinger C, Döring A, Löwel H, KORA Study Group. Chronic kidney disease and risk of incident myocardial infarction and all-cause and cardiovascular disease mortality in middle-aged men and women from the general population. Eur Heart J. 2006;27:1245–50. 10.1093/eurheartj/ehi880.
11. Shlipak MG, Tummalapalli SL, Boulware LE, Grams ME, Ix JH, Jha V, et al. The case for early identification and intervention of chronic kidney disease: conclusions from a kidney disease: improving global outcomes (KDIGO) controversies conference. Kidney Int. 2021;99:34–47. 10.1016/j.kint.2020.10.012.
12. O’Sullivan ED, Hughes J, Ferenbach DA. Renal aging: causes and consequences. J Am Soc Nephrol JASN. 2017;28:407–20. 10.1681/ASN.2015121308.
13. Carrero JJ, Hecking M, Chesnaye NC, Jager KJ. Sex and gender disparities in the epidemiology and outcomes of chronic kidney disease. Nat Rev Nephrol. 2018;14:151–64. 10.1038/nrneph.2017.181.
14. Nelson RG, Grams ME, Ballew SH, Sang Y, Azizi F, Chadban SJ, et al. Development of risk prediction equations for incident chronic kidney disease. JAMA. 2019;322:2104–14. 10.1001/jama.2019.17379.
15. Buergel T, Steinfeldt J, Ruyoga G, Pietzner M, Bizzarri D, Vojinovic D, et al. Metabolomic profiles predict individual multidisease outcomes. Nat Med. 2022;28:2309–20. 10.1038/s41591-022-01980-3.
16. Julkunen H, Cichońska A, Tiainen M, Koskela H, Nybo K, Mäkelä V, et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat Commun. 2023;14:604. 10.1038/s41467-023-36231-7.
17. Chen D-Q, Cao G, Chen H, Argyopoulos CP, Yu H, Su W, et al. Identification of serum metabolites associating with chronic kidney disease progression and anti-fibrotic effect of 5-methoxytryptophan. Nat Commun. 2019;10:1476. 10.1038/s41467-019-09329-0.
18. Zhao Y-Y, Cheng X-L, Wei F, Bai X, Tan X-J, Lin R-C, et al. Intrarenal metabolomic investigation of chronic kidney disease and its TGF-β1 mechanism in induced-adenine rats using UPLC Q-TOF/HSMS/MS(E). J Proteome Res. 2013;12:692–703. 10.1021/pr3007792.
19. Sabanayagam C, He F, Nusinovici S, Li J, Lim C, Tan G, et al. Prediction of diabetic kidney disease risk using machine learning models: A population-based cohort study of Asian adults. eLife. 2023;12:e81878. 10.7554/eLife.81878.
20. Huang J, Huth C, Covic M, Troll M, Adam J, Zukunft S, et al. Machine learning approaches reveal metabolic signatures of incident chronic kidney disease in individuals with prediabetes and type 2 diabetes. Diabetes. 2020;69:2756–65. 10.2337/db20-0586.
21. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. 10.1371/journal.pmed.1001779.
22. Honigberg MC, Zekavat SM, Pirruccello JP, Natarajan P, Vaduganathan M. Cardiovascular and kidney outcomes across the glycemic spectrum: insights from the UK Biobank. J Am Coll Cardiol. 2021;78:453–64. 10.1016/j.jacc.2021.05.004.
23. Zhang X, Liu Y-M, Lei F, Huang X, Liu W, Sun T, et al. Association between questionnaire-based and accelerometer-based physical activity and the incidence of chronic kidney disease using data from UK Biobank: A prospective cohort study. EClinicalMedicine. 2023;66:102323. 10.1016/j.eclinm.2023.102323.
24. Li L, Prato CG, Wang Y. Ranking contributors to traffic crashes on mountainous freeways from an incomplete dataset: A sequential approach of multivariate imputation by chained equations and random forest classifier. Accid Anal Prev. 2020;146:105744. 10.1016/j.aap.2020.105744.
25. Kulesa A, Krzywinski M, Blainey P, Altman N. Sampling distributions and the bootstrap. Nat Methods. 2015;12:477–8. 10.1038/nmeth.3414.
26. Bommert A, Welchowski T, Schmid M, Rahnenführer J. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform. 2022;23:bbab354. 10.1093/bib/bbab354.
27. Levey AS, Titan SM, Powe NR, Coresh J, Inker LA. Kidney disease, race, and GFR estimation. Clin J Am Soc Nephrol CJASN. 2020;15:1203–12. 10.2215/CJN.12791019.
28. Levey AS, Stevens LA, Schmid CH, Zhang YL, Castro AF, Feldman HI, et al. A new equation to estimate glomerular filtration rate. Ann Intern Med. 2009;150:604–12. 10.7326/0003-4819-150-9-200905050-00006.
29. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2018. pp. 6639–49.
30. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. 2018. http://arxiv.org/abs/1810.11363. Accessed 22 Feb 2025. 10.48550/arXiv.1810.11363.
31. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2:56–67. 10.1038/s42256-019-0138-9.
32. Gou W, Ling C-W, He Y, Jiang Z, Fu Y, Xu F, et al. Interpretable machine learning framework reveals robust gut microbiome features associated with type 2 diabetes. Diabetes Care. 2021;44:358–66. 10.2337/dc20-1536.
33. Stevens PE, Levin A, Kidney Disease: Improving Global Outcomes Chronic Kidney Disease Guideline Development Work Group Members. Evaluation and management of chronic kidney disease: synopsis of the kidney disease: improving global outcomes 2012 clinical practice guideline. Ann Intern Med. 2013;158:825–30. 10.7326/0003-4819-158-11-201306040-00007.
34. Peralta CA, Shlipak MG, Judd S, Cushman M, McClellan W, Zakai NA, et al. Detection of chronic kidney disease with creatinine, cystatin C, and urine albumin-to-creatinine ratio and association with progression to end-stage renal disease and mortality. JAMA. 2011;305:1545–52. 10.1001/jama.2011.468.
35. Menon V, Shlipak MG, Wang X, Coresh J, Greene T, Stevens L, et al. Cystatin C as a risk factor for outcomes in chronic kidney disease. Ann Intern Med. 2007;147:19–27. 10.7326/0003-4819-147-1-200707030-00004.
36. Tonelli M, Sacks F, Pfeffer M, Jhangri GS, Curhan G, Cholesterol and Recurrent Events (CARE) Trial Investigators. Biomarkers of inflammation and progression of chronic kidney disease. Kidney Int. 2005;68:237–45. 10.1111/j.1523-1755.2005.00398.x.
37. Tuttle KR, Bakris GL, Bilous RW, Chiang JL, de Boer IH, Goldstein-Fuchs J, et al. Diabetic kidney disease: A report from an ADA consensus conference. Diabetes Care. 2014;37:2864–83. 10.2337/dc14-1296.
38. Weiner DE, Tighiouart H, Elsayed EF, Griffith JL, Salem DN, Levey AS. Uric acid and incident kidney disease in the community. J Am Soc Nephrol JASN. 2008;19:1204–11. 10.1681/ASN.2007101075.
39. Dalbeth N, Merriman TR, Stamp LK. Gout. Lancet. 2016;388:2039–52. 10.1016/S0140-6736(16)00346-9.
40. Krane V, Wanner C. Statins, inflammation and kidney disease. Nat Rev Nephrol. 2011;7:385–97. 10.1038/nrneph.2011.62.
41. Ridker PM, Tuttle KR, Perkovic V, Libby P, MacFadyen JG. Inflammation drives residual risk in chronic kidney disease: A CANTOS substudy. Eur Heart J. 2022;43:4832–44. 10.1093/eurheartj/ehac444.
42. Kadatane SP, Satariano M, Massey M, Mongan K, Raina R. The role of inflammation in CKD. Cells. 2023;12:1581. 10.3390/cells12121581.
43. Bach LA, Hale LJ. Insulin-like growth factors and kidney disease. Am J Kidney Dis Off J Natl Kidney Found. 2015;65:327–36. 10.1053/j.ajkd.2014.05.024.
44. Jia T, Gama Axelsson T, Heimbürger O, Bárány P, Lindholm B, Stenvinkel P, et al. IGF-1 and survival in ESRD. Clin J Am Soc Nephrol CJASN. 2014;9:120–7. 10.2215/CJN.02470213.
45. Tangri N, Stevens LA, Griffith J, Tighiouart H, Djurdjev O, Naimark D, et al. A predictive model for progression of chronic kidney disease to kidney failure. JAMA. 2011;305:1553–9. 10.1001/jama.2011.451.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 1^{(3MB, pdf)}

Data Availability Statement

[CR1] 1.Zhang L, Wang F, Wang L, Wang W, Liu B, Liu J, et al. Prevalence of chronic kidney disease in china: A cross-sectional survey. Lancet Lond Engl. 2012;379:815–22. 10.1016/S0140-6736(12)60033-6. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Kovesdy CP. Epidemiology of chronic kidney disease: an update 2022. Kidney Int Suppl. 2022;12:7–11. 10.1016/j.kisu.2021.11.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Jager KJ, Kovesdy C, Langham R, Rosenberg M, Jha V, Zoccali C. A single number for advocacy and communication-worldwide more than 850 million individuals have kidney diseases. Kidney Int. 2019;96:1048–50. 10.1016/j.kint.2019.07.012. [DOI] [PubMed] [Google Scholar]

[CR4] 4.Chen TK, Knicely DH, Grams ME. Chronic kidney disease diagnosis and management: A review. JAMA. 2019;322:1294–304. 10.1001/jama.2019.14745. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Radhakrishnan J, Remuzzi G, Saran R, Williams DE, Rios-Burrows N, Powe N, et al. Taming the chronic kidney disease epidemic: A global view of surveillance efforts. Kidney Int. 2014;86:246–50. 10.1038/ki.2014.190. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Vanent KN, Leasure AC, Acosta JN, Kuohn LR, Woo D, Murthy SB, et al. Association of chronic kidney disease with risk of intracerebral hemorrhage. JAMA Neurol. 2022;79:911–8. 10.1001/jamaneurol.2022.2299. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Lee M, Saver JL, Chang K-H, Liao H-W, Chang S-C, Ovbiagele B. Low glomerular filtration rate and risk of stroke: Meta-analysis. BMJ. 2010;341:c4249. 10.1136/bmj.c4249. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Choi Y, Jacobs DR, Shroff GR, Kramer H, Chang AR, Duprez DA. Progression of chronic kidney disease risk categories and risk of cardiovascular disease and total mortality: coronary artery risk development in young adults cohort. J Am Heart Assoc. 2022;11:e026685. 10.1161/JAHA.122.026685. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Go AS, Chertow GM, Fan D, McCulloch CE, Hsu C. Chronic kidney disease and the risks of death, cardiovascular events, and hospitalization. N Engl J Med. 2004;351:1296–305. 10.1056/NEJMoa041031. [DOI] [PubMed] [Google Scholar]

[CR10] 10.Meisinger C, Döring A, Löwel H, KORA Study Group. Chronic kidney disease and risk of incident myocardial infarction and all-cause and cardiovascular disease mortality in middle-aged men and women from the general population. Eur Heart J. 2006;27:1245–50. 10.1093/eurheartj/ehi880. [DOI] [PubMed] [Google Scholar]

[CR11] 11.Shlipak MG, Tummalapalli SL, Boulware LE, Grams ME, Ix JH, Jha V, et al. The case for early identification and intervention of chronic kidney disease: conclusions from a kidney disease: improving global outcomes (KDIGO) controversies conference. Kidney Int. 2021;99:34–47. 10.1016/j.kint.2020.10.012. [DOI] [PubMed]

[CR12] 12.O’Sullivan ED, Hughes J, Ferenbach DA. Renal aging: causes and consequences. J Am Soc Nephrol JASN. 2017;28:407–20. 10.1681/ASN.2015121308. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Carrero JJ, Hecking M, Chesnaye NC, Jager KJ. Sex and gender disparities in the epidemiology and outcomes of chronic kidney disease. Nat Rev Nephrol. 2018;14:151–64. 10.1038/nrneph.2017.181. [DOI] [PubMed] [Google Scholar]

[CR14] 14.Nelson RG, Grams ME, Ballew SH, Sang Y, Azizi F, Chadban SJ, et al. Development of risk prediction equations for incident chronic kidney disease. JAMA. 2019;322:2104–14. 10.1001/jama.2019.17379. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Buergel T, Steinfeldt J, Ruyoga G, Pietzner M, Bizzarri D, Vojinovic D, et al. Metabolomic profiles predict individual multidisease outcomes. Nat Med. 2022;28:2309–20. 10.1038/s41591-022-01980-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Julkunen H, Cichońska A, Tiainen M, Koskela H, Nybo K, Mäkelä V, et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK biobank. Nat Commun. 2023;14:604. 10.1038/s41467-023-36231-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Chen D-Q, Cao G, Chen H, Argyopoulos CP, Yu H, Su W, et al. Identification of serum metabolites associating with chronic kidney disease progression and anti-fibrotic effect of 5-methoxytryptophan. Nat Commun. 2019;10:1476. 10.1038/s41467-019-09329-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Zhao Y-Y, Cheng X-L, Wei F, Bai X, Tan X-J, Lin R-C, et al. Intrarenal metabolomic investigation of chronic kidney disease and its TGF-β1 mechanism in induced-adenine rats using UPLC Q-TOF/HSMS/MS(E). J Proteome Res. 2013;12:692–703. 10.1021/pr3007792. [DOI] [PubMed] [Google Scholar]

[CR19] 19.Sabanayagam C, He F, Nusinovici S, Li J, Lim C, Tan G, et al. Prediction of diabetic kidney disease risk using machine learning models: A population-based cohort study of Asian adults. eLife. 2023;12:e81878. 10.7554/eLife.81878. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Huang J, Huth C, Covic M, Troll M, Adam J, Zukunft S, et al. Machine learning approaches reveal metabolic signatures of incident chronic kidney disease in individuals with prediabetes and type 2 diabetes. Diabetes. 2020;69:2756–65. 10.2337/db20-0586. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Honigberg MC, Zekavat SM, Pirruccello JP, Natarajan P, Vaduganathan M. Cardiovascular and kidney outcomes across the glycemic spectrum: insights from the UK biobank. J Am Coll Cardiol. 2021;78:453–64. 10.1016/j.jacc.2021.05.004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Zhang X, Liu Y-M, Lei F, Huang X, Liu W, Sun T, et al. Association between questionnaire-based and accelerometer-based physical activity and the incidence of chronic kidney disease using data from UK biobank: A prospective cohort study. EClinicalMedicine. 2023;66:102323. 10.1016/j.eclinm.2023.102323. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Li L, Prato CG, Wang Y. Ranking contributors to traffic crashes on mountainous freeways from an incomplete dataset: A sequential approach of multivariate imputation by chained equations and random forest classifier. Accid Anal Prev. 2020;146:105744. 10.1016/j.aap.2020.105744. [DOI] [PubMed] [Google Scholar]

[CR25] 25.Kulesa A, Krzywinski M, Blainey P, Altman N. Sampling distributions and the bootstrap. Nat Methods. 2015;12:477–8. 10.1038/nmeth.3414. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR26] 26.Bommert A, Welchowski T, Schmid M, Rahnenführer J. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform. 2022;23:bbab354. 10.1093/bib/bbab354. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR27] 27.Levey AS, Titan SM, Powe NR, Coresh J, Inker LA. Kidney disease, race, and GFR Estimation. Clin J Am Soc Nephrol CJASN. 2020;15:1203–12. 10.2215/CJN.12791019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Levey AS, Stevens LA, Schmid CH, Zhang YL, Castro AF, Feldman HI, et al. A new equation to estimate glomerular filtration rate. Ann Intern Med. 2009;150:604–12. 10.7326/0003-4819-150-9-200905050-00006. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2018. pp. 6639–49.

[CR30] 30. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. 2018. http://arxiv.org/abs/1810.11363. Accessed 22 Feb 2025. 10.48550/arXiv.1810.11363

[CR31] 31. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2:56–67. 10.1038/s42256-019-0138-9.

[CR32] 32. Gou W, Ling C-W, He Y, Jiang Z, Fu Y, Xu F, et al. Interpretable machine learning framework reveals robust gut microbiome features associated with type 2 diabetes. Diabetes Care. 2021;44:358–66. 10.2337/dc20-1536.

[CR33] 33. Stevens PE, Levin A, Kidney Disease: Improving Global Outcomes Chronic Kidney Disease Guideline Development Work Group Members. Evaluation and management of chronic kidney disease: synopsis of the Kidney Disease: Improving Global Outcomes 2012 clinical practice guideline. Ann Intern Med. 2013;158:825–30. 10.7326/0003-4819-158-11-201306040-00007.

[CR34] 34. Peralta CA, Shlipak MG, Judd S, Cushman M, McClellan W, Zakai NA, et al. Detection of chronic kidney disease with creatinine, cystatin C, and urine albumin-to-creatinine ratio and association with progression to end-stage renal disease and mortality. JAMA. 2011;305:1545–52. 10.1001/jama.2011.468.

[CR35] 35. Menon V, Shlipak MG, Wang X, Coresh J, Greene T, Stevens L, et al. Cystatin C as a risk factor for outcomes in chronic kidney disease. Ann Intern Med. 2007;147:19–27. 10.7326/0003-4819-147-1-200707030-00004.

[CR36] 36. Tonelli M, Sacks F, Pfeffer M, Jhangri GS, Curhan G, Cholesterol and Recurrent Events (CARE) Trial Investigators. Biomarkers of inflammation and progression of chronic kidney disease. Kidney Int. 2005;68:237–45. 10.1111/j.1523-1755.2005.00398.x.

[CR37] 37. Tuttle KR, Bakris GL, Bilous RW, Chiang JL, de Boer IH, Goldstein-Fuchs J, et al. Diabetic kidney disease: A report from an ADA consensus conference. Diabetes Care. 2014;37:2864–83. 10.2337/dc14-1296.

[CR38] 38. Weiner DE, Tighiouart H, Elsayed EF, Griffith JL, Salem DN, Levey AS. Uric acid and incident kidney disease in the community. J Am Soc Nephrol JASN. 2008;19:1204–11. 10.1681/ASN.2007101075.

[CR39] 39. Dalbeth N, Merriman TR, Stamp LK. Gout. Lancet. 2016;388:2039–52. 10.1016/S0140-6736(16)00346-9.

[CR40] 40. Krane V, Wanner C. Statins, inflammation and kidney disease. Nat Rev Nephrol. 2011;7:385–97. 10.1038/nrneph.2011.62.

[CR41] 41. Ridker PM, Tuttle KR, Perkovic V, Libby P, MacFadyen JG. Inflammation drives residual risk in chronic kidney disease: A CANTOS substudy. Eur Heart J. 2022;43:4832–44. 10.1093/eurheartj/ehac444.

[CR42] 42. Kadatane SP, Satariano M, Massey M, Mongan K, Raina R. The role of inflammation in CKD. Cells. 2023;12:1581. 10.3390/cells12121581.

[CR43] 43. Bach LA, Hale LJ. Insulin-like growth factors and kidney disease. Am J Kidney Dis Off J Natl Kidney Found. 2015;65:327–36. 10.1053/j.ajkd.2014.05.024.

[CR44] 44. Jia T, Gama Axelsson T, Heimbürger O, Bárány P, Lindholm B, Stenvinkel P, et al. IGF-1 and survival in ESRD. Clin J Am Soc Nephrol CJASN. 2014;9:120–7. 10.2215/CJN.02470213.

[CR45] 45. Tangri N, Stevens LA, Griffith J, Tighiouart H, Djurdjev O, Naimark D, et al. A predictive model for progression of chronic kidney disease to kidney failure. JAMA. 2011;305:1553–9. 10.1001/jama.2011.451.
