Journal of Gastrointestinal Oncology. 2026 Feb 12;17(1):12. doi: 10.21037/jgo-2025-611

ColoLDB: a machine learning-based predictive model for colorectal cancer using routine laboratory parameters

Xing Zhang 1, Xuedong Tong 1, Jiangtao Mou 1, Jun Liu 2, Chenxi Zhang 3, Hongyan Han 1,*, Kun Deng 1,*
PMCID: PMC12972017  PMID: 41816568

Abstract

Background

Colorectal cancer (CRC) is one of the most common cancers worldwide and poses a serious threat to public health. Current CRC screening and diagnosis primarily depend on colonoscopy, an invasive procedure that often misses early-stage tumors, contributing to delayed diagnoses. The aim of this study is to develop a simpler, more accessible screening method to assist clinicians in the early identification and diagnosis of CRC and its precancerous lesions.

Methods

Using the patient’s hospitalization number as the unique identifier, invalid age records were excluded, non-numerical laboratory test results were removed, and only the first diagnostic test result for each parameter per patient (i.e., the initial test value at first diagnosis) was retained. The study distinguished between the CRC experimental group and the control group. The study collected laboratory test data from each participant, including tumor markers, biochemical parameters, immunological indicators, complete blood count, coagulation tests, and routine urinalysis. We selected light gradient boosting machine (LightGBM), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) to construct the models. Finally, the SHapley Additive explanations (SHAP) algorithm was employed to interpret the models.

Results

After analyzing the four selected models, the intersection of the top-ranked features across all models was identified, ultimately screening eight laboratory parameters to construct the diagnostic colorectal laboratory digital biomarker (ColoLDB) model: specific gravity (SG), carbohydrate antigen 19-9 (CA19-9), carcinoembryonic antigen (CEA), age, albumin (ALB), cytokeratin 19 fragment (CYFRA21-1), high-density lipoprotein cholesterol (HDL-C) and carbohydrate antigen 72-4 (CA72-4). In the test set, the RF machine learning model demonstrated optimal performance in identifying CRC, achieving an area under the curve (AUC) of 0.863 (95% confidence interval: 0.792–0.922), an accuracy of 0.900, a sensitivity of 0.225, a specificity of 0.997, a positive predictive value (PPV) of 0.917, and a negative predictive value (NPV) of 0.900. When the specificity was set at 0.903, the ColoLDB model’s sensitivity reached 0.694. In comparison, a diagnostic model combining CEA and CA19-9 yielded an AUC of 0.688, a sensitivity of 0.429 and a specificity of 0.947. The RF diagnostic ColoLDB model exhibited superior diagnostic efficacy compared to the combined CEA and CA19-9 diagnosis model.

Conclusions

Our research findings indicate that eight laboratory test indicators may be related to the risk of developing CRC. Our RF diagnostic ColoLDB model is an innovative and practical tool that effectively predicts the occurrence of CRC, enhancing the diagnostic efficiency for this disease. This method holds promise as a valuable tool for diagnosing CRC.

Keywords: Colorectal cancer (CRC), laboratory parameters, machine learning, colorectal laboratory digital biomarker model (ColoLDB model)


Highlight box.

Key findings

• We have developed predictive models for colorectal cancer (CRC). The diagnostic colorectal laboratory digital biomarker (ColoLDB) model was developed using eight parameters and the following four machine learning algorithms: light gradient boosting machine, logistic regression, random forest (RF), and extreme gradient boosting. The results demonstrated favorable diagnostic performance across these models, with the RF diagnostic ColoLDB model exhibiting superior predictive capability.

What is known, and what is new?

• Previous studies have shown that when integrating 12 parameters, the RF algorithm demonstrates strong diagnostic performance for CRC.

• Our research shows that the RF algorithm achieved comparable diagnostic performance using only eight parameters.

What is the implication, and what should change now?

• The machine learning algorithm has the potential to enhance clinical assessment of CRC risk. This RF model can more accurately identify individuals with the disease.

Introduction

Colorectal cancer (CRC) is the third most common malignancy and the second leading cause of cancer-related mortality in the world, with an estimated number of 1.8 million new cases and about 881,000 deaths worldwide in 2018. Between 2012 and 2018, the age-standardized rate of CRC incidence increased from 17.2 to 19.7 per 100,000 and the mortality rate increased from 8.3 to 8.9 per 100,000. While the incidence and mortality rate of CRC are increasing worldwide, their trends vary between different regions and countries (1-3). By 2030, global new cases of CRC are projected to increase by 60%, exceeding 2.2 million, with deaths reaching 1.1 million (4). The adenoma-carcinoma sequence represents the primary developmental pathway for most sporadic CRC. Patients with advanced adenomas face a significantly elevated risk of malignant transformation into CRC (5,6). Unfortunately, due to the insidious nature of early-stage CRC symptoms, over 50% of patients are diagnosed at intermediate or advanced stages, with a five-year survival rate of only 20% (7). Early diagnosis, however, enables timely and optimal treatment, improving the five-year survival rate to 90% (8-10).

Colonoscopy has been established as the gold standard for CRC screening, offering high sensitivity and specificity. However, its invasive nature, time-consuming process, requirement for bowel preparation, need for experienced endoscopists, and associated high costs limit its widespread use in routine CRC screening among high-risk populations (11). Additionally, carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA19-9), the two most commonly used biomarkers for CRC diagnosis, have limited utility due to their low sensitivity (40% to 60%). Machine learning provides a framework for developing predictive models and uncovering latent groupings within data. Rather than relying on subjective human judgment, it offers a systematic, computational approach to pattern recognition. This methodology is uniquely valuable for analyzing datasets that are either too large, too complex, or contain too many features for practical human evaluation. Moreover, it allows for the creation of automated analytical processes, ensuring scalable and unbiased data analysis (12-14). Non-invasive monitoring of patients with adenomatous polyps is crucial for the early diagnosis and prevention of CRC. Therefore, the development of non-invasive methods for the early diagnosis of CRC is an urgent and critical goal. The primary objective of this study is to develop and validate a robust machine learning model that can assist clinicians in the early identification and diagnosis of CRC and its precancerous lesions, ultimately providing an effective tool to help reduce the disease burden of CRC in China. We present this article in accordance with the TRIPOD reporting checklist (available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-611/rc).

Methods

Data collection and processing

All raw data came from patients hospitalized at The Third Affiliated Hospital of Chongqing Medical University between 2022 and 2024. By combining the “colonoscopy findings” and “colonoscopy conclusions” from colonoscopies, the “pathological findings” and “pathological conclusions” from pathological examinations, and the discharge diagnoses, we identified patients with CRC and those with benign colorectal diseases. For these patients, we extracted from the hospital’s information system the first routine blood, coagulation, biochemical, immunological, and routine urine test results obtained after admission and before any treatment. Ultimately, data from 226 patients diagnosed with CRC and 1,721 patients with benign colorectal diseases were used to construct the colorectal laboratory digital biomarker (ColoLDB) model (Figure 1).

Any feature with more than 50% missing values across all cases was considered unreliable and was removed from the dataset entirely; 47 parameters were retained from the initial pool of laboratory indicators solely because they met this data completeness criterion. The dataset was then randomly split into a training set (80%) and a test set (20%). All subsequent preprocessing steps, including median imputation and feature selection, were conducted within the training loop. Median imputation was chosen because it is robust to outliers and suitable for non-normally distributed data: each missing value was replaced with the median of the non-missing values of the same feature. To avoid information leakage, these medians were learned exclusively from the training set and were then used to fill missing values in both the training set and the independent test set.

The 20% test set was completely held out and was not used for any part of model training, hyperparameter tuning, or validation. Hyperparameters were tuned by 5-fold cross-validation and random search exclusively on the 80% training set, so that the model architecture and its optimal hyperparameters were selected without any “information leakage” from the final test set.
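The completeness filter and the training-set-only median imputation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the helper names, the use of `None` as the missing-value marker, the toy values, and the feature "X" are all assumptions made for the example.

```python
from statistics import median

MISSING = None  # marker used here for a missing laboratory value (assumption)


def keep_complete_features(rows, max_missing=0.5):
    """Keep features whose missing fraction across ALL cases is at most max_missing."""
    n = len(rows)
    return [f for f in rows[0]
            if sum(r[f] is MISSING for r in rows) / n <= max_missing]


def fit_medians(train_rows, features):
    """Learn one median per feature from the non-missing TRAINING values only."""
    return {f: median(r[f] for r in train_rows if r[f] is not MISSING)
            for f in features}


def impute(rows, medians):
    """Fill missing values with the supplied (training-set) medians."""
    return [{f: (m if r[f] is MISSING else r[f]) for f, m in medians.items()}
            for r in rows]


# Toy records; the values and the feature "X" are invented for illustration.
train = [
    {"ALB": 42.0, "CEA": None, "X": None},
    {"ALB": 38.0, "CEA": 2.0, "X": None},
    {"ALB": None, "CEA": 4.0, "X": None},
    {"ALB": 40.0, "CEA": 3.0, "X": 1.0},
]
test = [{"ALB": None, "CEA": None, "X": 5.0}]

kept = keep_complete_features(train + test)  # "X" is missing in 3 of 5 cases
medians = fit_medians(train, kept)           # the test set is never consulted
train_imputed = impute(train, medians)
test_imputed = impute(test, medians)
```

The key design point is that `fit_medians` receives only the training rows, so no statistic computed from the held-out cases can reach the model.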

Figure 1.

The flow chart of this study. ALB, albumin; AUC, area under the curve; CA19-9, carbohydrate antigen 19-9; CA72-4, carbohydrate antigen 72-4; CEA, carcinoembryonic antigen; CRC, colorectal cancer; CYFRA21-1, cytokeratin 19 fragment; HDL-C, high-density lipoprotein cholesterol; LightGBM, light gradient boosting machine; NPV, negative predictive value; PPV, positive predictive value; SG, specific gravity; SHAP, SHapley Additive explanations; XGBoost, extreme gradient boosting.

Inclusion and exclusion criteria

The inclusion criteria for the study are as follows: (I) age ≥18 years; (II) patients with a clear clinical or pathological histological diagnosis of CRC, or patients with benign diseases (including chronic enteritis, ulcerative colitis, adenomatous polyps, hyperplastic polyps, and inflammatory polyps) confirmed by colonoscopy or pathological diagnosis; (III) patients who have not undergone any radiotherapy, chemotherapy, or surgical treatment prior to blood collection; (IV) the inclusion of each retrospective parameter should ensure that the same testing reagents, platforms, methods, etc., are used for all populations.

The exclusion criteria for the study are as follows: (I) patients with malignant tumors in other organ tissues (except for those with colon cancer metastasis) or patients with recurrent colon cancer; (II) patients with concomitant systemic diseases, such as diabetes and endocrine disorders; (III) patients with concomitant autoimmune diseases; (IV) pregnant or breastfeeding individuals; (V) patients with psychiatric disorders.

Machine learning methods

Logistic regression (LR) is a statistical method used to solve classification problems, particularly suited for binary classification. It predicts the probability that a sample belongs to a certain category by mapping the results of linear regression to the [0,1] interval using the Sigmoid function to represent probability. Prior to training the LR model, all continuous variables were standardized (Z-score normalization) to have a mean of 0 and a standard deviation of 1. This step was performed to ensure that features with larger ranges did not disproportionately influence the model coefficients. Random forest (RF) is an ensemble learning algorithm that builds multiple decision trees to perform classification or regression predictions. The core idea of this algorithm is “collective intelligence”, meaning that the combination of multiple models often has better generalization ability and accuracy than a single model. Light gradient boosting machine (LightGBM) is an efficient machine learning algorithm based on the gradient boosting framework, specifically designed to handle large-scale data and high-dimensional features. LightGBM iteratively trains multiple weak learners and combines them into a strong learner. In each iteration, the new model corrects the errors of the previous model, gradually improving overall performance. Extreme gradient boosting (XGBoost) is an efficient machine learning algorithm that falls under the gradient boosting framework. XGBoost improves and extends traditional gradient boosting methods by optimizing computational speed and model performance. In summary, these algorithms each have their advantages in feature optimization. To compare the performance of different machine learning methods, we selected LightGBM, LR, RF and XGBoost to build the ColoLDB model. To identify the most impactful and interactive predictors, we employed a model-based feature selection strategy using SHapley Additive exPlanations (SHAP).
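The Z-score standardization and the Sigmoid mapping used by LR can be illustrated with a short sketch. It is a simplified single-feature example under stated assumptions: the ages, the intercept, and the coefficient are invented, and the population standard deviation is assumed for scaling because the article does not specify the variant.

```python
import math
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Return the mean and (population) standard deviation of the training values."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sd):
    """Z-score: subtract the training mean and divide by the training SD."""
    return [(v - mu) / sd for v in values]


def sigmoid(t):
    """Map a real-valued linear predictor to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(z_row, intercept, coefs):
    """Logistic regression probability for one standardized sample."""
    return sigmoid(intercept + sum(b * z for b, z in zip(coefs, z_row)))


# Hypothetical training ages (years) and one new patient.
ages_train = [50.0, 60.0, 70.0, 80.0]
mu, sd = fit_scaler(ages_train)
z_train = standardize(ages_train, mu, sd)
z_new = standardize([71.0], mu, sd)

# Hypothetical fitted parameters, chosen only to make the example run.
p = predict_proba(z_new, intercept=-2.0, coefs=[0.8])
```

After scaling, a coefficient describes the change in log-odds per one standard deviation of the feature, which is why features with large raw ranges no longer dominate the fit.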

Statistical analysis

Continuous variables were presented as medians with interquartile ranges or means with standard deviations, depending on their distribution. For features with less than 50% missing data, median imputation was employed to ensure robustness against outliers and non-normally distributed data. To identify biomarkers with significant differences between the CRC and control groups, appropriate statistical tests (such as the Mann-Whitney U test for continuous variables or Chi-square test for categorical variables) were performed, and 47 biomarkers were ultimately selected. The dataset was randomly partitioned into a training set (80%) and an independent test set (20%). To optimize the machine learning models (LightGBM, LR, RF, and XGBoost), 5-fold cross-validation and random search were utilized exclusively on the training set for hyperparameter tuning, preventing information leakage from the test set. Feature importance and model interpretability were assessed using the SHAP algorithm. The diagnostic performance of the models was evaluated using receiver operating characteristic (ROC) curve analysis. Key metrics included the area under the curve (AUC), sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV). All statistical analyses and model construction were performed using Python version 3.10. A P value <0.05 was considered statistically significant.
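For readers who want to see how an AUC and an accompanying 95% confidence interval might be obtained, the sketch below computes the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control (ties counted as one half), and attaches a percentile bootstrap interval. The bootstrap is an assumption: the article does not state how its intervals were derived. The labels and scores are invented.

```python
import random


def auc(labels, scores):
    """AUC = P(case score > control score) + 0.5 * P(tie)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]  # resample cases with replacement
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:                     # skip resamples with one class only
            continue
        stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lo = stats[round((alpha / 2) * (n_boot - 1))]
    hi = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Hypothetical labels (1 = CRC, 0 = control) and predicted probabilities.
labels = [0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.10, 0.20, 0.30, 0.40, 0.35, 0.80, 0.90, 0.05]

point = auc(labels, scores)
lo, hi = bootstrap_ci(labels, scores)
```

The pairwise definition is quadratic in the sample size, which is acceptable for a test set of a few hundred patients; a rank-based formula would be preferable for much larger cohorts.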

Ethical consideration

This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study has been approved by the Ethics Committee of The Third Affiliated Hospital of Chongqing Medical University [IRB#2025(23)]. Due to the retrospective nature of this study, individual informed consent was waived.

Results

Screening of subjects

A total of 2,837 patients undergoing colonoscopy were screened. After applying inclusion and exclusion criteria, 1,947 eligible patients were enrolled for analysis. Based on pathological biopsy results, 226 were diagnosed with CRC, and 1,721 had benign colorectal lesions (control group). Among 371 laboratory parameters initially evaluated, 324 were excluded due to insufficient data availability (>50% missing values). Consequently, 47 parameters with adequate data completeness were retained for the machine learning analysis. The characteristics of these 47 parameters are presented in Table 1.

Table 1. Characteristics of each parameter.

Baseline characteristic Control group (N=1,721) Experimental group (N=226) Missing value percentage (%)
Age (years) 56.0 (50.0, 65.0) 71.0 (61.0, 77.0) 0.00
Gender (M/F) 1,022/699 146/80 0.00
AST/ALT 1.1 (0.9, 1.2) 1.2 (1.0, 1.6) 23.93
Neu# (×10⁹/L) 3.56 (2.89, 4.29) 3.96 (3.1, 5.36) 9.14
Neu% 64.2 (58.3, 68.9) 68.7 (62.1, 77.0) 9.14
PTR 1.0 (0.9, 1.0) 1.0 (0.97, 1.03) 3.34
PTA (%) 107.0 (100.0, 115.0) 102.0 (92.0, 110.0) 3.34
Mon# (×10⁹/L) 0.33 (0.26, 0.4) 0.37 (0.29, 0.49) 9.14
Mon% 5.8 (5.0, 6.7) 6.3 (5.1, 7.6) 9.14
INR 0.96 (0.92, 1.0) 0.99 (0.95, 1.05) 3.34
P-LCR (%) 29.1 (24.5, 35.2) 26.8 (21.9, 32.5) 9.30
SG 1.02 (1.01, 1.02) 1.02 (1.02, 1.03) 29.99
UREA (mmol/L) 4.96 (4.28, 5.68) 5.21 (4.15, 6.66) 16.28
UA (μmol/L) 333.0 (291.0, 386.0) 320.0 (249.25, 381.75) 16.28
MCV (fL) 91.4 (89.3, 93.8) 90.05 (86.0, 93.57) 9.14
MPV (fL) 10.4 (9.8, 11.3) 10.1 (9.4, 11.0) 9.30
MCH (pg) 30.8 (29.9, 31.8) 30.1 (28.12, 31.48) 9.14
MCHC (g/L) 336.0 (331.0, 342.0) 331.0 (323.0, 339.0) 9.14
CHO (mmol/L) 4.79 (4.69, 5.0) 4.78 (3.89, 4.96) 39.09
TP (g/L) 68.8 (66.9, 71.6) 65.6 (61.0, 69.9) 17.15
Lym# (×10⁹/L) 1.45 (1.21, 1.78) 1.23 (0.93, 1.64) 9.14
Lym% 26.7 (22.7, 32.2) 20.7 (15.27, 26.8) 9.14
FT3 (pmol/L) 5.07 (4.98, 5.27) 5.07 (4.8, 5.07) 43.50
TG (mmol/L) 1.48 (1.37, 1.63) 1.48 (1.09, 1.69) 39.09
CEA (ng/mL) 1.94 (1.45, 2.22) 3.34 (1.94, 8.46) 27.79
A/G 1.58 (1.5, 1.7) 1.47 (1.24, 1.64) 23.93
WBC (×10⁹/L) 5.61 (4.75, 6.57) 5.8 (4.85, 7.84) 9.14
ALB (g/L) 42.1 (40.9, 43.9) 38.7 (35.1, 41.5) 17.10
ALP (U/L) 74.0 (64.0, 84.0) 78.5 (64.0, 97.0) 17.15
HCO3 (mmol/L) 25.9 (24.3, 27.6) 24.9 (23.0, 27.28) 6.73
CA19-9 (u/mL) 9.14 (7.9, 9.15) 12.24 (9.14, 27.71) 40.83
CA72-4 (u/mL) 2.34 (1.97, 2.49) 2.67 (1.77, 7.8) 41.19
RDW-CV (%) 13.0 (12.6, 13.3) 13.5 (12.9, 14.68) 9.14
RDW-SD 45.2 (43.6, 46.6) 46.05 (44.1, 48.78) 9.14
HCT (%) 41.9 (39.7, 44.8) 39.6 (34.0, 43.0) 8.78
RBC (×10¹²/L) 4.59 (4.35, 4.9) 4.35 (3.92, 4.63) 9.14
FIB-C (g/L) 3.09 (2.7, 3.49) 3.55 (3.0, 4.41) 3.39
CYFRA21-1 (ng/mL) 1.92 (1.7, 1.92) 2.55 (1.92, 4.06) 41.04
ADA (U/L) 9.0 (8.1, 9.7) 9.0 (8.1, 11.65) 31.33
GLU (mmol/L) 5.05 (4.91, 5.10) 5.38 (4.85, 6.60) 39.19
PDW (%) 16.2 (16.0, 16.4) 16.1 (15.9, 16.3) 9.30
PLT (×10⁹/L) 209.0 (176.0, 239.0) 214.5 (178.0, 279.75) 9.14
HGB (g/L) 141.0 (132.0, 152.0) 129.5 (110.25, 141.0) 9.14
IBIL (μmol/L) 9.3 (8.0, 11.8) 7.65 (5.45, 10.0) 23.93
HDL-C (mmol/L) 1.14 (1.13, 1.19) 1.13 (0.94, 1.15) 43.61
Neu#/Lym# 2.46 (1.81, 3.03) 3.35 (2.3, 4.94) 9.14
PLT#/Lym# 144.14 (112.43, 170.59) 171.34 (127.09, 268.05) 9.14

Data are presented as number or median (Q1, Q3). A/G, albumin/globulin ratio; ADA, adenosine deaminase; ALB, albumin; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; CA19-9, carbohydrate antigen 19-9; CA72-4, carbohydrate antigen 72-4; CEA, carcinoembryonic antigen; CHO, cholesterol; CYFRA21-1, cytokeratin 19 fragment; FIB-C, fibrinogen concentration; FT3, free triiodothyronine; GLU, glucose; HCO3, bicarbonate; HCT, hematocrit; HDL-C, high-density lipoprotein cholesterol; HGB, hemoglobin; IBIL, indirect bilirubin; INR, international normalized ratio; Lym#, lymphocyte count; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; Mon#, monocyte count; MPV, mean platelet volume; Neu#, neutrophil count; P-LCR, platelet large cell ratio; PDW, platelet distribution width; PLT, platelet count; PTA, prothrombin activity; PTR, prothrombin time ratio; RBC, red blood cell count; RDW-CV, red cell distribution width-coefficient of variation; RDW-SD, red cell distribution width-standard deviation; SG, specific gravity; TG, triglycerides; TP, total protein; UA, uric acid; UREA, urea; WBC, white blood cell count.

Selection of laboratory parameters

Machine learning algorithms were used to identify the top 20 feature variables in each model: LR (Figure 2A), RF (Figure 2B), XGBoost (Figure 2C), and LightGBM (Figure 2D). Ultimately, eight variables were selected based on the intersection of the SHAP-derived top 20 features across models to construct a diagnostic model: specific gravity (SG), carbohydrate antigen 19-9 (CA19-9), CEA, age, albumin (ALB), cytokeratin 19 fragment (CYFRA21-1), high-density lipoprotein cholesterol (HDL-C), and carbohydrate antigen 72-4 (CA72-4).
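The intersection step can be sketched in a few lines. The example below is illustrative only: it uses five features and a top-3 cut-off instead of 47 features and a top-20 cut-off, and every importance score is invented rather than taken from the fitted models.

```python
def top_k(importance, k):
    """Names of the k features with the largest importance (e.g., mean |SHAP|)."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def shared_features(rankings):
    """Features present in every model's top-k list, in the order of the first list."""
    common = set(rankings[0]).intersection(*rankings[1:])
    return [f for f in rankings[0] if f in common]


# Hypothetical importance scores for four models (values invented).
lr = {"ALB": 0.30, "Age": 0.25, "CEA": 0.20, "HGB": 0.10, "SG": 0.05}
rf = {"Age": 0.28, "ALB": 0.26, "SG": 0.15, "CEA": 0.14, "HGB": 0.02}
xgb = {"CEA": 0.30, "ALB": 0.22, "Age": 0.20, "SG": 0.10, "HGB": 0.08}
lgbm = {"ALB": 0.27, "HGB": 0.21, "Age": 0.20, "CEA": 0.12, "SG": 0.03}

rankings = [top_k(m, k=3) for m in (lr, rf, xgb, lgbm)]
selected = shared_features(rankings)
```

Requiring agreement across all four algorithms is a conservative choice: a feature that ranks highly in only one model is dropped, which favors a small, stable panel over maximal performance of any single model.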

Figure 2.

The ranking chart of the top 20 influential parameters for the following four machine-learning models: (A) LR; (B) RF; (C) XGBoost and (D) LightGBM. A/G, albumin/globulin ratio; ALB, albumin; ALP, alkaline phosphatase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; CA19-9, carbohydrate antigen 19-9; CA72-4, carbohydrate antigen 72-4; CEA, carcinoembryonic antigen; CHO, cholesterol; CYFRA21-1, cytokeratin 19 fragment; FIB-C, fibrinogen concentration; FT3, free triiodothyronine; GLU, glucose; HCO3, bicarbonate; HCT, hematocrit; HDL-C, high-density lipoprotein cholesterol; HGB, hemoglobin; INR, international normalized ratio; LightGBM, light gradient boosting machine; LR, logistic regression; Lym#, lymphocyte count; MCH, mean corpuscular hemoglobin; MCV, mean corpuscular volume; Mon#, monocyte count; PLT, platelet count; PTA, prothrombin activity; PTR, prothrombin time ratio; RBC, red blood cell count; RDW-CV, red cell distribution width-coefficient of variation; RDW-SD, red cell distribution width-standard deviation; RF, random forest; SG, specific gravity; TG, triglycerides; TP, total protein; UA, uric acid; UREA, urea; XGBoost, extreme gradient boosting.

Model performance

ROC curves for the initial screening of 47 laboratory parameters in the training (Figure 3A) and test sets (Figure 3B) were plotted to compare the baseline diagnostic potential of the four algorithms. At this initial screening stage (47 parameters), the LightGBM model achieved the highest AUC of 0.912 in the test set.

Figure 3.

ROC curves for the training and test sets of the four machine-learning models during the initial screening phase using 47 laboratory parameters. (A) ROC curves for the training set; (B) ROC curves for the test set. At this stage (47 parameters), the LightGBM model demonstrated the highest AUC (0.912). However, the RF model was subsequently selected for the final ColoLDB tool after feature selection reduced the inputs to the top eight biomarkers. AUC, area under the curve; LightGBM, light gradient boosting machine; RF, random forest; ROC, receiver operating characteristic; XGBoost, extreme gradient boosting.

However, for clinical utility, a parsimonious model is preferred. After performing the intersection analysis of the top-ranked features, eight parameters (SG, CA19-9, CEA, Age, ALB, CYFRA21-1, HDL-C, and CA72-4) were selected to construct the final diagnostic ColoLDB models. Table 2 summarizes the performance metrics of the algorithms using only these eight selected parameters. In this final configuration, the RF model demonstrated superior overall prediction stability and was selected as the final ColoLDB model, achieving an AUC of 0.863.

Table 2. Diagnostic efficacy of the four machine-learning models for the training and test sets.

Dataset Model AUC Accuracy Sensitivity Specificity PPV NPV F1 score
Training LR 0.876 (0.796–0.917) 0.912 (0.872–0.931) 0.322 (0.196–0.453) 0.988 (0.976–0.997) 0.770 (0.611–0.963) 0.919 (0.874–0.934) 0.454 (0.308–0.595)
Training RF 0.937 (0.871–0.966) 0.928 (0.880–0.936) 0.384 (0.231–0.500) 0.998 (0.980–1.000) 0.958 (0.684–1.000) 0.927 (0.879–0.940) 0.548 (0.342–0.644)
Training XGBoost 0.953 (0.893–0.970) 0.930 (0.882–0.939) 0.401 (0.236–0.500) 0.998 (0.985–1.000) 0.960 (0.762–1.000) 0.929 (0.879–0.941) 0.566 (0.378–0.647)
Training LightGBM 0.927 (0.863–0.952) 0.926 (0.874–0.931) 0.418 (0.233–0.511) 0.991 (0.971–0.997) 0.851 (0.611–0.950) 0.930 (0.881–0.937) 0.561 (0.361–0.627)
Test LR 0.848 (0.786–0.905) 0.903 (0.872–0.931) 0.265 (0.150–0.391) 0.994 (0.985–1.000) 0.867 (0.667–1.000) 0.904 (0.871–0.933) 0.406 (0.245–0.546)
Test RF 0.863 (0.792–0.922) 0.900 (0.872–0.928) 0.225 (0.114–0.350) 0.997 (0.991–1.000) 0.917 (0.727–1.000) 0.900 (0.867–0.927) 0.361 (0.208–0.522)
Test XGBoost 0.860 (0.793–0.921) 0.905 (0.874–0.933) 0.286 (0.164–0.419) 0.994 (0.985–1.000) 0.875 (0.687–1.000) 0.906 (0.876–0.936) 0.431 (0.259–0.590)
Test LightGBM 0.861 (0.790–0.922) 0.908 (0.877–0.936) 0.367 (0.238–0.500) 0.985 (0.973–0.997) 0.783 (0.600–0.926) 0.916 (0.887–0.942) 0.500 (0.345–0.622)

Data are presented as mean values with 95% CI. AUC, area under the curve; CI, confidence interval; LightGBM, light gradient boosting machine; LR, logistic regression; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; XGBoost, extreme gradient boosting.

At the default threshold (0.5), the RF model achieved high specificity (0.997) but exhibited low sensitivity (0.225). Based on clinical considerations, we adjusted the threshold using the Youden index and achieved a balanced accuracy of 0.799. At this optimized level, the model demonstrated a sensitivity of 0.694 and a specificity of 0.903. This balance is critical for large-scale screening tools, as it ensures the effectiveness of early detection while keeping false positives within clinically acceptable limits.
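Threshold selection by the Youden index can be sketched as below: each observed score is tried as a cut-off, and the one that maximizes J = sensitivity + specificity − 1 is kept. The labels and scores are invented, and the article does not state on which data split the threshold was chosen, so the sketch is agnostic on that point.

```python
def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when 'score >= threshold' is called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(labels, scores):
    """Return (threshold, J, sensitivity, specificity) maximizing the Youden index."""
    best = None
    for t in sorted(set(scores)):
        se, sp = sens_spec(labels, scores, t)
        j = se + sp - 1
        if best is None or j > best[1]:  # keep the lowest threshold on ties
            best = (t, j, se, sp)
    return best


# Hypothetical predicted probabilities: six controls (0) and four cases (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.25, 0.35, 0.40, 0.70]

default_se, default_sp = sens_spec(labels, scores, 0.5)
t_best, j_best, se_best, sp_best = youden_threshold(labels, scores)
```

In this toy example the default cut-off of 0.5 detects only one of four cases, whereas the Youden-optimal cut-off trades some specificity for markedly higher sensitivity, the same qualitative pattern reported for the RF model.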

Feature importance in the model

The ranking of the eight laboratory parameters by importance in the RF diagnostic ColoLDB model is illustrated in Figure 4.

Figure 4.

RF model SHAP importance graph. ALB, albumin; CA19-9, carbohydrate antigen 19-9; CA72-4, carbohydrate antigen 72-4; CEA, carcinoembryonic antigen; CYFRA21-1, cytokeratin 19 fragment; HDL-C, high-density lipoprotein cholesterol; RF, random forest; SG, specific gravity; SHAP, SHapley Additive explanations.

This figure illustrates the feature importance ranking of the eight clinical indicators identified by the RF model as most relevant for the early diagnosis of CRC. The importance values were derived using SHAP analysis, where a larger absolute value indicates a greater contribution of the feature to the model’s final prediction.

• ALB: the model identified this as the most important indicator, strongly suggesting that the patient’s nutritional status plays a key role in the disease process.

• Age: ranked as the second most important factor by the model, this validates the well-established clinical fact that cancer risk increases significantly with age.

• CYFRA21-1: as the third most significant indicator in the model, this is clinically reasonable, as CYFRA21-1 is known for its high sensitivity and specificity for cancers such as lung cancer.

• CEA and CA19-9: the model attributes a moderate and similar level of contribution to these two broad-spectrum tumor markers, which are associated with various gastrointestinal malignancies.

• HDL-C: the model indicates this is an important negatively correlated indicator. This aligns with the biological mechanism, as HDL-C is often considered “good cholesterol”, and its decreased level is often associated with an increased risk of various cancers.

• SG: the model shows that this indicator contributes to some extent, potentially reflecting the patient’s hydration status or renal function, which relates to their overall health condition.

• CA72-4: this marker demonstrated the relatively lowest contribution in the current model. This does not imply it is useless, but rather that within the specific dataset and feature combination analyzed, other indicators provided more predictive information.

In summary, the model not only confirms the importance of classic clinical factors such as age and CYFRA21-1 but also reveals the critical role of nutritional status (ALB) and metabolic factors (HDL-C) in prediction. This suggests that in clinical practice, integrating traditional tumor markers with routine indicators that reflect the patient’s systemic status may provide more comprehensive diagnostic or prognostic information.
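To make the meaning of a SHAP importance value concrete, the sketch below computes exact Shapley values for a single sample by enumerating every feature coalition. It is a didactic example and not the procedure used for the RF model: the three-feature toy function, its coefficients, and the all-zero baseline are assumptions, and features outside a coalition are simply set to their baseline values. Libraries such as shap compute these quantities far more efficiently for tree ensembles.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample; cost grows as 2**n in the feature count."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the sample's value, others the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(v):
    """Hypothetical score with one interaction term (coefficients invented)."""
    return 0.5 * v[0] - 0.2 * v[1] + 0.1 * v[0] * v[2]


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
importance = [abs(p) for p in phi]  # a global ranking would average |phi| over samples
```

The contributions sum to the difference between the prediction for the sample and the prediction at the baseline, and the interaction term is shared equally between the two features involved. A negative value, as for the second toy feature, corresponds to a feature that lowers the predicted risk, analogous to the negative direction described for HDL-C.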

Comparison between machine learning models and tumor markers

Among the four models, RF showed the best performance. Its diagnostic efficacy was further compared with standalone CEA and the CEA + CA19-9 combination. The RF model surpassed traditional tumor markers in AUC, demonstrating that integrating machine learning with multiple biomarkers significantly enhances CRC diagnosis (Table 3).

Table 3. Diagnostic efficacy of the RF (8 biomarkers) machine-learning model and tumor markers.

Variables AUC Sensitivity Specificity Youden Index
CEA 0.668 0.376 0.959 0.335
CEA + CA19-9 0.688 0.429 0.947 0.376
RF (8 biomarkers) 0.863 0.694 0.903 0.597

AUC, area under the curve; CA19-9, carbohydrate antigen 19-9; CEA, carcinoembryonic antigen; RF, random forest.
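The Youden index column of Table 3 and the balanced accuracy reported for the RF model follow directly from the sensitivity and specificity values. The short check below uses only numbers stated in the article; the function and variable names are illustrative.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity, equivalently (J + 1) / 2."""
    return (sensitivity + specificity) / 2


# (sensitivity, specificity, reported Youden index) as listed in Table 3.
table3 = {
    "CEA": (0.376, 0.959, 0.335),
    "CEA + CA19-9": (0.429, 0.947, 0.376),
    "RF (8 biomarkers)": (0.694, 0.903, 0.597),
}

recomputed = {name: round(youden_index(se, sp), 3)
              for name, (se, sp, _) in table3.items()}

rf_se, rf_sp, _ = table3["RF (8 biomarkers)"]
rf_balanced_accuracy = balanced_accuracy(rf_se, rf_sp)  # reported as 0.799
```

Because balanced accuracy weights cases and controls equally, it is less affected than plain accuracy by the roughly 1:8 ratio of CRC to control patients in this cohort.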

Web application

The final CRC diagnostic ColoLDB model is presented as a web application (http://drdenglab.com/ColoLDB/).

Discussion

In this study, we developed a diagnostic ColoLDB model using data from patients diagnosed with CRC and benign colorectal diseases. We extracted the first routine blood, coagulation, biochemical, immunological, and routine urine test parameters of these patients before any treatment after admission, which were sourced from the hospital’s information system. We selected LightGBM, LR, RF, and XGBoost to build the ColoLDB model. After excluding indicators with more than 50% missing values from the initial 371 test indicators, 78 indicators remained, of which 45 were identified as important and 33 as unimportant. We used these 45 indicators along with gender and age to construct a CRC risk prediction model to assist in diagnosis. The RF model achieved higher AUC values than the other three machine learning models in this study.

Integrating machine learning algorithms to construct predictive models for assisting in disease diagnosis is currently an advanced approach in the field of medical science. For instance, Alcazer et al. (15) employed an XGBoost-based machine learning model to assist in the diagnosis of acute leukemia. Wu et al. (16) developed an osteoporosis risk prediction model in 2023, where their XGBoost model demonstrated the best performance. Chen et al. (17) employed machine learning algorithms to analyze metabolomics data and constructed two models for gastric cancer diagnosis and prognosis prediction.

Eight parameters associated with CRC were identified in this study: age, CA19-9, CEA, CA72-4, CYFRA21-1, SG, ALB, and HDL-C. CA19-9 is a cell surface glycoprotein complex (18) and an emerging tumor marker for CRC (19,20). When CA19-9 levels are used in conjunction with CEA levels, the sensitivity of tumor detection is improved (21). Studies have shown that elevated preoperative serum CA19-9 levels may be a useful marker for identifying lymph node-negative CRC patients (22,23). Tumor markers such as CEA and CA72-4 may improve the diagnostic efficiency for CRC (24,25). CYFRA21-1 can be used for the differential diagnosis and prognosis not only of lung, laryngeal, and esophageal cancers but also of CRC (26-29). Mallafré-Muro et al. identified 244 compounds in urine samples associated with CRC, finding that up to 11 compounds may be specific biomarkers for CRC (30). One study indicates that ALB can serve as a prognostic factor for CRC (31). Obesity appears to increase the risk and mortality of CRC, with tumors tending to develop around fat-rich tissues (32). Chen et al. explained the significant role of oncogenes and signaling pathways in the reprogramming of lipid metabolism, linking lipids to cancer metastasis and recurrence (33,34). Li et al. also reported a logistic regression algorithm based on CEA, lipoprotein(a) [Lp(a)], and HDL markers that can distinguish CRC patients (35). Our predictors are therefore supported by related studies, which strengthens confidence in the reliability of our model.

This study employed machine learning and deep learning algorithms to construct diagnostic models. The RF model achieved an area under the ROC curve (AUC) of 0.863, higher than that of CEA alone (0.668) and that of the combined CEA and CA19-9 indicator (0.688), suggesting an improved capability for early CRC screening.
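To make the AUC comparison concrete, the following Python sketch shows one common way such figures can be computed. It is illustrative only and is not the authors' code: the function names, labels, and scores are hypothetical toy values, not study data. The AUC is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties counted as one half), and a second helper reads off the best sensitivity available at a fixed minimum specificity, the kind of operating point reported for the ColoLDB model.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_at_specificity(labels, scores, min_specificity):
    """Best sensitivity over thresholds whose specificity >= min_specificity.

    A sample is predicted positive when its score is >= the threshold.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        specificity = sum(n < t for n in neg) / len(neg)
        sensitivity = sum(p >= t for p in pos) / len(pos)
        if specificity >= min_specificity:
            best = max(best, sensitivity)
    return best


# Hypothetical toy data (NOT from the study): 6 controls, 4 cases.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9]

auc = roc_auc(labels, scores)
sens = sensitivity_at_specificity(labels, scores, min_specificity=0.8)
print(f"AUC = {auc:.4f}, sensitivity at specificity >= 0.8: {sens:.2f}")
```

Because the AUC summarizes ranking quality over all thresholds, two models with different AUCs can still be compared at a single clinically chosen operating point, which is why both quantities are usually reported together.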

This study has several limitations. First, owing to the enrollment criteria, the analysis was conducted only within a specific patient population, and the potential class imbalance among eligible participants may limit the generalizability of the results to all CRC patients. Second, all patients were from a single center, The Third Affiliated Hospital of Chongqing Medical University, between 2022 and 2024, which may limit the generalizability of the conclusions. Lastly, the study population was relatively small; given the impact of variables such as region on CRC, larger datasets with more participants are needed to assess the influence of these variables on CRC.
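The effect of class imbalance on the reported metrics can be illustrated with a short worked example. The Python sketch below is an assumption-laden illustration, not the study's analysis: the function name and the confusion-matrix counts are hypothetical and were chosen only to show how accuracy and NPV can remain high when controls greatly outnumber cases, even if sensitivity is low.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts (NOT study data): 100 cases and 900 controls.
metrics = diagnostic_metrics(tp=25, fn=75, fp=3, tn=897)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy exceeds 0.9 although only a quarter of the cases are detected, because the majority class dominates the numerator. This is one reason to read accuracy alongside sensitivity and the AUC when classes are imbalanced.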

This study constructed machine learning and deep learning models that demonstrated superior diagnostic performance in early CRC screening compared to traditional tumor markers (such as CEA and the combination of CEA and CA19-9), providing clinicians with a more sensitive and reliable non-invasive auxiliary diagnostic tool. By integrating predictors that can be easily obtained from blood or urine samples, the ColoLDB model significantly reduces the invasiveness and operational barriers of screening, making large-scale population-based initial screening and regular monitoring feasible. This advancement contributes to promoting the widespread adoption and routine implementation of CRC screening.

Although the model shows promising application potential, its clinical translation requires further exploration. First, validation of the model’s generalizability and stability in multicenter, large-scale prospective cohorts is necessary. Second, integrating the model with multimodal data such as imaging and pathology could help build a more comprehensive diagnostic system. Finally, developing interpretable versions of the model and integrating them into existing clinical workflows will be key to realizing their practical application. Additionally, exploring the model’s extended value in risk assessment, treatment efficacy prediction, and prognosis evaluation represents an important direction for future research.

Conclusions

This study established multiple machine learning models for CRC prediction based on laboratory parameters. Because RF showed the best predictive performance, it was selected as the preferred model for predicting CRC. We used SG, CA19-9, CEA, age, ALB, CA72-4, CYFRA21-1, and HDL-C as general indicators to differentiate CRC from benign colorectal diseases. The machine learning models showed strong predictive power and can assist clinicians in providing earlier diagnoses for CRC patients.

Supplementary

The article’s supplementary files as

jgo-17-01-12-rc.pdf (180.8KB, pdf)
DOI: 10.21037/jgo-2025-611
jgo-17-01-12-coif.pdf (507.3KB, pdf)
DOI: 10.21037/jgo-2025-611

Acknowledgments

The authors would like to extend their gratitude to all the patients who participated in this study and their families. Thanks are also due to colleagues from the clinical laboratory and information department for their technical support with electronic medical records and data retrieval. Special appreciation goes to the MyLab+i-Research consulting team of Roche for sharing their work, which played an indispensable role in the data analysis of our study, and for their help in improving the model.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study has been approved by the Ethics Committee of The Third Affiliated Hospital of Chongqing Medical University [IRB#2025(23)]. Due to the retrospective nature of this study, individual informed consent was waived.

Footnotes

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-611/rc

Funding: This study was supported by grants from the National Natural Science Foundation of China (No. 82372357), the National Natural Science Foundation of China General Program (No. 82172365 to K.D.), and the Chongqing Natural Science Foundation General Program (No. CSTB2022NSCQ-MSX1062 to K.D.).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-611/coif). C.Z. is an employee of Roche Diagnostics Ltd. The other authors have no conflicts of interest to declare.

Data Sharing Statement

Available at https://jgo.amegroups.com/article/view/10.21037/jgo-2025-611/dss

jgo-17-01-12-dss.pdf (136.6KB, pdf)
DOI: 10.21037/jgo-2025-611

References

  • 1. Baidoun F, Elshiwy K, Elkeraie Y, et al. Colorectal Cancer Epidemiology: Recent Trends and Impact on Outcomes. Curr Drug Targets 2021;22:998-1009. doi: 10.2174/1389450121999201117115717
  • 2. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2020. CA Cancer J Clin 2020;70:7-30. doi: 10.3322/caac.21590
  • 3. Siegel RL, Torre LA, Soerjomataram I, et al. Global patterns and trends in colorectal cancer incidence in young adults. Gut 2019;68:2179-85. doi: 10.1136/gutjnl-2019-319511
  • 4. Arnold M, Sierra MS, Laversanne M, et al. Global patterns and trends in colorectal cancer incidence and mortality. Gut 2017;66:683-91. doi: 10.1136/gutjnl-2015-310912
  • 5. Corley DA, Jensen CD, Marks AR, et al. Adenoma detection rate and risk of colorectal cancer and death. N Engl J Med 2014;370:1298-306. doi: 10.1056/NEJMoa1309086
  • 6. Cottet V, Jooste V, Fournel I, et al. Long-term risk of colorectal cancer after adenoma removal: a population-based cohort study. Gut 2012;61:1180-6. doi: 10.1136/gutjnl-2011-300295
  • 7. Xun D, Li X, Huang L, et al. Machine learning-based analysis identifies a 13-gene prognostic signature to improve the clinical outcomes of colorectal cancer. J Gastrointest Oncol 2024;15:2100-16. doi: 10.21037/jgo-24-325
  • 8. Yin H, Xie J, Xing S, et al. Machine learning-based analysis identifies and validates serum exosomal proteomic signatures for the diagnosis of colorectal cancer. Cell Rep Med 2024;5:101689. doi: 10.1016/j.xcrm.2024.101689
  • 9. Bray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018;68:394-424. doi: 10.3322/caac.21492
  • 10. Biller LH, Schrag D. Diagnosis and Treatment of Metastatic Colorectal Cancer: A Review. JAMA 2021;325:669-85. doi: 10.1001/jama.2021.0106
  • 11. Wieszczy P, Kaminski MF, Franczyk R, et al. Colorectal Cancer Incidence and Mortality After Removal of Adenomas During Screening Colonoscopies. Gastroenterology 2020;158:875-883.e5. doi: 10.1053/j.gastro.2019.09.011
  • 12. Zygulska AL, Pierzchalski P. Novel Diagnostic Biomarkers in Colorectal Cancer. Int J Mol Sci 2022;23:852. doi: 10.3390/ijms23020852
  • 13. Dawson MA, Kouzarides T. Cancer epigenetics: from mechanism to therapy. Cell 2012;150:12-27. doi: 10.1016/j.cell.2012.06.013
  • 14. Sun Y, Zhang X, Hang D, et al. Integrative plasma and fecal metabolomics identify functional metabolites in adenoma-colorectal cancer progression and as early diagnostic biomarkers. Cancer Cell 2024;42:1386-1400.e8. doi: 10.1016/j.ccell.2024.07.005
  • 15. Alcazer V, Le Meur G, Roccon M, et al. Evaluation of a machine-learning model based on laboratory parameters for the prediction of acute leukaemia subtypes: a multicentre model development and validation study in France. Lancet Digit Health 2024;6:e323-33. doi: 10.1016/S2589-7500(24)00044-X
  • 16. Wu X, Zhai F, Chang A, et al. Application of machine learning algorithms to predict osteoporosis in postmenopausal women with type 2 diabetes mellitus. J Endocrinol Invest 2023;46:2535-46. doi: 10.1007/s40618-023-02109-0
  • 17. Chen Y, Wang B, Zhao Y, et al. Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer. Nat Commun 2024;15:1657. doi: 10.1038/s41467-024-46043-y
  • 18. Nakisa A, Sempere LF, Chen X, et al. Tumor-Associated Carbohydrate Antigen 19-9 (CA 19-9), a Promising Target for Antibody-Based Detection, Diagnosis, and Immunotherapy of Cancer. ChemMedChem 2024;19:e202400491. doi: 10.1002/cmdc.202400491
  • 19. Xu M, Chen G, Huang Y, et al. A simple SERS sensor based on antibody-modified Fe3O4@Au MNPs for the detection of CA19-9 in CRC patients. Anal Methods 2024;17:84-91. doi: 10.1039/D4AY01382D
  • 20. Duffy MJ, Lamerz R, Haglund C, et al. Tumor markers in colorectal cancer, gastric cancer and gastrointestinal stromal cancers: European group on tumor markers 2014 guidelines update. Int J Cancer 2014;134:2513-22. doi: 10.1002/ijc.28384
  • 21. Lee JO, Kim M, Lee JH, et al. Carbohydrate antigen 19-9 plus carcinoembryonic antigen for prognosis in colorectal cancer: An observational study. Colorectal Dis 2023;25:272-81. doi: 10.1111/codi.16372
  • 22. Halilovic E, Rasic I, Sofic A, et al. The Importance of Determining Preoperative Serum Concentration of Carbohydrate Antigen 19-9 and Carcinoembryonic Antigen in Assessing the Progression of Colorectal Cancer. Med Arch 2020;74:346-9. doi: 10.5455/medarh.2020.74.346-349
  • 23. Vodička J, Fichtl J, Šebek J, et al. Outcomes and Prognostic Factors Following Surgical Treatment of Pulmonary Metastases from Colorectal Carcinoma. Anticancer Res 2020;40:7045-51. doi: 10.21873/anticanres.14731
  • 24. Zhu HQ, Wang DY, Xu LS, et al. Diagnostic value of an enhanced MRI combined with serum CEA, CA19-9, CA125 and CA72-4 in the liver metastasis of colorectal cancer. World J Surg Oncol 2022;20:401. doi: 10.1186/s12957-022-02874-x
  • 25. Bu F, Cao S, Deng X, et al. Evaluation of C-reactive protein and fibrinogen in comparison to CEA and CA72-4 as diagnostic biomarkers for colorectal cancer. Heliyon 2023;9:e16092. doi: 10.1016/j.heliyon.2023.e16092
  • 26. Yuan J, Sun Y, Wang K, et al. Development and validation of reassigned CEA, CYFRA21-1 and NSE-based models for lung cancer diagnosis and prognosis prediction. BMC Cancer 2022;22:686. doi: 10.1186/s12885-022-09728-5
  • 27. Ji M, Zhang LJ. Expression levels of SCCA and CYFRA 21-1 in serum of patients with laryngeal squamous cell carcinoma and their correlation with tumorigenesis and progression. Clin Transl Oncol 2021;23:289-95. doi: 10.1007/s12094-020-02417-4
  • 28. Li S, Wei W, Feng Z, et al. Role of Serum CYFRA 21-1 in Diagnosis and Prognostic in Colorectal Liver Metastases. Cancer Manag Res 2023;15:601-14. doi: 10.2147/CMAR.S410477
  • 29. Wu B, Zhou H, Hu L, et al. Involvement of PKCα activation in TF/VIIa/PAR2-induced proliferation, migration, and survival of colon cancer cell SW620. Tumour Biol 2013;34:837-46. doi: 10.1007/s13277-012-0614-x
  • 30. Mallafré-Muro C, Llambrich M, Cumeras R, et al. Comprehensive Volatilome and Metabolome Signatures of Colorectal Cancer in Urine: A Systematic Review and Meta-Analysis. Cancers (Basel) 2021;13:2534. doi: 10.3390/cancers13112534
  • 31. Xu M, Liu Y, Xue T, et al. Prognostic Implication of Preoperative Serum Albumin to Carcinoembryonic Antigen Ratio in Colorectal Cancer Patients. Technol Cancer Res Treat 2022;21:15330338221078645. doi: 10.1177/15330338221078645
  • 32. Park J, Morley TS, Kim M, et al. Obesity and cancer--mechanisms underlying tumour progression and recurrence. Nat Rev Endocrinol 2014;10:455-65. doi: 10.1038/nrendo.2014.94
  • 33. Chen D, Zhou X, Yan P, et al. Lipid metabolism reprogramming in colorectal cancer. J Cell Biochem 2023;124:3-16. doi: 10.1002/jcb.30347
  • 34. Tirinato L, Liberale C, Di Franco S, et al. Lipid droplets: a new player in colorectal cancer stem cells unveiled by spectroscopic imaging. Stem Cells 2015;33:35-44. doi: 10.1002/stem.1837
  • 35. Li H, Lin J, Xiao Y, et al. Colorectal Cancer Detected by Machine Learning Models Using Conventional Laboratory Test Data. Technol Cancer Res Treat 2021;20:15330338211058352. doi: 10.1177/15330338211058352

Articles from Journal of Gastrointestinal Oncology are provided here courtesy of AME Publications