Abstract
Objective: To explore the correlation between blood routine indicators (BRI) and sepsis using machine learning algorithms (MLAs) and to evaluate their application in early sepsis prediction and prognosis assessment. Methods: A total of 4,558 blood routine data (BRD) samples were collected from 149 sepsis patients and 186 patients with common infections (CI). A binary logistic regression model (BLRM) was constructed to predict sepsis based on BRI. Additionally, six MLAs were applied: support vector machines, neural networks, Bayesian classifiers, k-nearest neighbors, decision trees, and random forest classification models (RFCM). The performance of these seven predictive models was evaluated. Results: The RFCM demonstrated the best predictive performance among the MLAs, with an accuracy of 86.97%, precision of 87.02%, recall of 86.97%, and F1 score of 0.87. These metrics were substantially higher than those of the BLRM (accuracy: 68.77%, precision: 71.45%, recall: 69.47%, F1 score: 0.70). In the random forest model, red blood cell distribution width (RDW) was identified as the most important feature, with RDW-coefficient of variation contributing 6.98% and RDW-standard deviation contributing 5.32%. Conclusion: Combining BRI with MLAs has considerable potential in predicting sepsis. The RFCM showed the highest predictive value, and RDW may play a crucial role in sepsis prediction.
Keywords: Sepsis, blood routine, machine learning algorithms, prediction
Introduction
Sepsis is a syndrome characterized by a dysregulated immune response to infection, leading to life-threatening organ dysfunction. It is the most common critical illness and has a high mortality rate [1]. As a leading cause of admission to the Intensive Care Unit (ICU), sepsis is associated with immune and endocrine system disturbances, as well as metabolic abnormalities. It remains one of the leading causes of death among ICU patients and imposes a significant economic burden on healthcare systems. Research into sepsis has become increasingly comprehensive, focusing on its pathogenesis, early diagnosis, and clinical treatment, and new definitions and guidelines for its management have been established. Despite these advances, mortality remains high, with reported rates among ICU sepsis patients ranging from 23.7% to 64.5% [2-5]. Early recognition of sepsis or septic shock is critical for improving patient outcomes because it allows timely fluid resuscitation, appropriate antibiotic therapy, infection source control, and, if necessary, vasopressor use [6-8]. However, despite updated definitions of sepsis and septic shock and the use of early warning scores such as the quick Sequential Organ Failure Assessment (qSOFA) and the Sequential Organ Failure Assessment (SOFA), timely recognition remains a major challenge in clinical practice [9,10].
Artificial intelligence (AI) is an emerging field that integrates various disciplines, including computer science, control theory, information theory, physiology, psychology, linguistics, medicine, and philosophy [11]. AI techniques encompass machine learning (ML), knowledge acquisition, knowledge processing, and automated reasoning. With advancements in technology, AI has enabled groundbreaking research in healthcare, particularly in critical care. AI can facilitate earlier and more accurate predictions in disease risk assessment, deterioration warnings, and mortality prediction. At the core of AI, ML focuses on learning from large datasets. Through computational methods, ML creates learning models that improve with experience and can then be applied to real problems. ML has been widely used in disease prediction and other medical fields. For instance, Garcia-Gallo et al. [12] developed an ML-based mortality prediction model for severe sepsis patients using the open-source MIMIC-III database. Their model outperformed traditional scoring systems such as SOFA and the Simplified Acute Physiology Score II (SAPS II). Thorsen-Meyer et al. [13] applied ML algorithms to time-series ICU data to predict 90-day mortality in real time, improving ICU patient prognosis. While more AI models are being developed to predict other diseases [14,15], research on predicting sepsis using AI remains limited.
Complete blood count (CBC) is a routine test performed on all hospitalized patients. Key indicators, such as white blood cell count, neutrophil ratio, absolute neutrophil count, and red cell distribution width (RDW), are associated with the body’s inflammatory response. This study collected CBC data from sepsis patients and non-sepsis infected patients to analyze the correlation between CBC indicators and sepsis using machine learning algorithms. The goal was to assess the predictive and prognostic value of these indicators, offering new methods for early sepsis prediction and prognosis evaluation.
The innovation of this study lies in its comprehensive use of a wide range of CBC indicators, the systematic comparison of multiple machine learning algorithms to identify the best performer, and the in-depth exploration of the importance weights of specific blood routine data within the most effective model. These aspects provide a new, detailed understanding compared to previous works on AI for sepsis prediction.
Materials and methods
Case selection
A retrospective cohort study design was used to select patients with confirmed infections admitted to the Intensive Care Unit (ICU) of Putian College Affiliated Hospital from January 2019 to December 2022.
Inclusion criteria: Patients aged over 18 years, diagnosed with infection based on symptoms, signs, laboratory tests, and imaging. The specifics for different infections are as follows:
Pulmonary infection [16]: Community-acquired: Cough, fever, expectoration; lung infiltrates on imaging; elevated white blood cells or pathogens in sputum.
Hospital-acquired: New infiltrates on imaging ≥48 hours after admission, plus fever, cough, purulent sputum.
Ventilator-associated: New or progressive infiltrates during ventilation, purulent secretions, laboratory evidence of infection.
Intracranial infection [17]: Headache, fever, altered mental status, neurological deficits; abnormal cerebrospinal fluid (increased white blood cells, elevated protein, decreased glucose); signs of CNS inflammation on imaging.
Urinary tract infection [18]: Dysuria, frequency, urgency, possible hematuria; positive urine culture, elevated urine white blood cells.
Abdominal infection [19]: Abdominal pain, tenderness, distension, fever, elevated white blood cells; imaging showing abscesses, fluid collections, or organ inflammation.
Biliary tract infection [20]: Right upper quadrant pain, jaundice, fever, chills; abnormal liver function, elevated white blood cells, possible positive bile culture; biliary tract inflammation on imaging.
Bloodstream infection [21]: Positive blood culture (excluding contaminants), with consistent symptoms such as fever and hypotension.
Skin and soft tissue infection [22]: Local redness, swelling, pain, warmth, possible purulent discharge; elevated white blood cells, possible pathogen isolation.
Surgical site infection [23]: Pain, swelling, redness, discharge, or fever at the surgical site post-surgery; purulent drainage, pathogen isolation, or other signs of inflammation.
Exclusion criteria: Patients with a history of hematopoietic stem cell transplantation or solid organ transplantation, patients with immune system diseases currently using glucocorticoids and/or immunosuppressants, patients with hematological diseases, patients with an ICU stay of <24 hours, and pregnant or lactating patients.
Diagnostic criteria for sepsis: Conforming to the Sepsis-3 diagnostic criteria [24].
Treatment
All patients were treated according to the guidelines of the Surviving Sepsis Campaign (SSC) [25].
Ethics
This study strictly adhered to ethical standards for medical clinical research and was approved by the Ethics Committee of Putian College Affiliated Hospital (Approval No.: 202035). Due to the retrospective and observational design, informed consent was waived. After removing personal information, the data were analyzed anonymously.
Study groups and collection of clinical observation indicators
(1) Grouping: Based on the occurrence of sepsis, patients were divided into the common infection group and the sepsis group. In the sepsis group, patients were further classified into a survival group and a death group based on their 28-day survival status.
(2) Collection of Clinical Data: General information: Gender, age, clinical diagnosis, type of disease, site of infection, pathogenic bacteria, and comorbidities such as diabetes, liver cirrhosis, and chronic diseases of the cardiovascular, respiratory, renal, and immune systems. Complete blood count data were collected for all patients during hospitalization.
Statistical analysis
The following R packages were used for machine learning methods: caret, ipred, ranger, arm, nnet, and gbm. All models were subjected to 10-fold cross-validation. Hyperparameters were optimized using grid search as follows.
For the random forest (RF) model, the number of trees and the mtry parameter were adjusted.
For the neural network (NNET) model, the size and decay parameters were adjusted.
For the gradient boosting machine (GBM) model, n.trees, interaction.depth, and shrinkage were adjusted.
Finally, variable importance was ranked using the “varImpPlot” function in R (this function is provided by the “randomForest” package; the “caret” package offers the analogous “varImp”). Patients were randomly split into training and test sets, with 80% allocated for training and 20% for testing. Based on the selected predictors, six machine learning models were constructed: support vector machine (SVM), k-nearest neighbors (KNN), random forest, Bayesian model, gradient boosting decision tree (GBDT), and neural network. The models were evaluated and compared using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and the area under the ROC curve (AUC).
Selection of independent variables
All CBC variables were selected as independent variables, including:
White blood cell count (WBC), red blood cell count (RBC), hemoglobin (HGB), hematocrit (HCT), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), RDW coefficient of variation (RDW-CV), RDW standard deviation (RDW-SD), platelet count (PLT), plateletcrit (PCT), mean platelet volume (MPV), platelet distribution width (PDW), large platelet count (P-LCC), large platelet ratio (P-LCR), lymphocyte percentage (LY%), absolute lymphocyte count (LY#), monocyte percentage (MO%), absolute monocyte count (MO#), neutrophil percentage (NE%), absolute neutrophil count (NE#), eosinophil percentage (EO%), absolute eosinophil count (EO#), basophil percentage (BA%), absolute basophil count (BA#), immature granulocyte percentage (IMG%), and absolute immature granulocyte count (IMG#).
Handling of missing values
Variables with a missing value rate above 15% were excluded. For variables with a missing value rate <2%, the missing values were replaced by the mean value of the respective variable. For variables with a missing value rate between 2% and 15%, multiple imputation was performed.
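The three-tier rule above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R workflow; the function names are hypothetical, missing values are represented as `None`, and rates of exactly 2% or 15% are assumed to fall into the multiple-imputation tier.

```python
from statistics import mean


def choose_strategy(values):
    """Pick a handling strategy from the share of missing (None) entries."""
    rate = sum(v is None for v in values) / len(values)
    if rate > 0.15:
        return "exclude"              # variable is dropped
    if rate < 0.02:
        return "mean"                 # fill with the variable's mean
    return "multiple_imputation"      # 2% to 15%: handled by an imputation model


def fill_with_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]
```

Multiple imputation itself needs a dedicated statistical package, so only the dispatch logic and the mean-replacement step are shown.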
Handling of outliers
Outliers were identified using the interquartile range (IQR), defined as the difference between the upper and lower quartiles in a box plot. Values above the upper quartile + 1.5 × IQR or below the lower quartile - 1.5 × IQR were considered outliers and were treated as missing values.
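A minimal sketch of this box-plot rule, written in Python for illustration rather than in the R environment used in the study. The function name `mask_outliers` is hypothetical, and the quartile convention (`method="inclusive"`, linear interpolation) is an assumption, since the text does not state which quantile definition was used.

```python
from statistics import quantiles


def mask_outliers(values, k=1.5):
    """Return a copy of values with points outside the IQR fences set to None."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Values outside [low, high] are treated as missing, as described above.
    return [v if low <= v <= high else None for v in values]
```

The masked entries can then be passed to whatever missing-value procedure applies to the variable.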
Model building
The following R packages were used for machine learning: caret, ipred, ranger, arm, nnet, and gbm. The samples were randomly divided into training and testing sets in a 7:3 ratio. All models employed 10-fold cross-validation. Hyperparameters were adjusted using grid search. For the RF model, the number of trees and the mtry parameter were optimized. For the NNET model, the size and decay parameters were optimized. For the GBM model, the n.trees, interaction.depth, and shrinkage parameters were adjusted. Finally, variable importance was ranked using the “varImpPlot” function in R (this function is provided by the “randomForest” package; the “caret” package offers the analogous “varImp”).
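The combination of 10-fold cross-validation and grid search can be sketched in a language-neutral way. The Python below is only an illustration of the procedure, not the caret implementation; the helper names `k_fold_indices` and `grid` and the example grid values are hypothetical, because the searched values are not reported.

```python
import itertools
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def grid(param_values):
    """Yield every combination of the candidate hyperparameter values."""
    names = list(param_values)
    for combo in itertools.product(*(param_values[n] for n in names)):
        yield dict(zip(names, combo))


# Hypothetical random forest grid: number of trees and mtry.
rf_grid = list(grid({"num_trees": [100, 500], "mtry": [3, 5, 9]}))
```

Each candidate in `rf_grid` would be fitted on the training part of every fold and scored on the held-out part, and the candidate with the best mean score would be retained.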
Results
Traditional methods of analysis
The following variables were included as predictors in the binary logistic regression analysis for sepsis (0, 1): WBC, RBC, HGB, HCT, MCV, MCH, MCHC, RDW-CV, RDW-SD, PLT, PCT, MPV, PDW, P-LCC, P-LCR, LY%, LY#, MO%, MO#, NE%, NE#, EO%, EO#, BA%, BA#, IMG%, and IMG#. A summary of the data is presented in Table 1. Of the 4,558 samples collected, 302 (6.63%) had missing data and were excluded, leaving 4,256 valid samples (93.37%) for analysis. Among the valid samples, the sepsis variable took the value 0 in 1,976 samples (46.43%) and the value 1 in 2,280 samples (53.57%).
Table 1.
Basic summary of the logit regression analysis
| Name | Option | Frequency | Percent |
|---|---|---|---|
| Sepsis (0, 1) | 0 | 1976 | 46.43% |
| | 1 | 2280 | 53.57% |
| | Total | 4256 | 100.0% |
| Summary | Valid | 4256 | 93.37% |
| | Missing | 302 | 6.63% |
| | Total | 4558 | 100.0% |
The variables were analyzed using binary logistic regression, and the results are presented in Table 2. According to the data, MCH, RDW-SD, PLT, PDW, and P-LCR showed a significant positive association with sepsis (P < 0.1). In contrast, MCV, RDW-CV, MPV, and P-LCC showed a significant negative association with sepsis (P < 0.1). The variables WBC, RBC, HGB, HCT, MCHC, PCT, LY%, LY#, MO%, MO#, NE%, NE#, EO%, EO#, BA%, BA#, IMG%, and IMG# did not show a significant effect on sepsis (0, 1).
Table 2.
Summary of logit regression analysis results
| Variable | Regression coefficient | Standard error | z-value | Wald χ2 | P-value | OR | 95% CI of OR |
|---|---|---|---|---|---|---|---|
| WBC | 9.061 | 6.555 | 1.382 | 1.911 | 0.167 | 8608.578 | 0.023~3267373644.147 |
| RBC | -0.225 | 0.389 | -0.578 | 0.334 | 0.563 | 0.799 | 0.373~1.712 |
| HGB | -0.035 | 0.042 | -0.822 | 0.675 | 0.411 | 0.966 | 0.889~1.049 |
| HCT | 0.077 | 0.147 | 0.524 | 0.275 | 0.600 | 1.080 | 0.810~1.440 |
| MCV | -0.294 | 0.102 | -2.878 | 8.285 | 0.004 | 0.745 | 0.610~0.910 |
| MCH | 0.615 | 0.292 | 2.107 | 4.438 | 0.035 | 1.849 | 1.044~3.277 |
| MCHC | -0.027 | 0.027 | -0.989 | 0.979 | 0.322 | 0.973 | 0.922~1.027 |
| RDW-CV | -0.400 | 0.078 | -5.121 | 26.228 | 0.000 | 0.670 | 0.575~0.781 |
| RDW-SD | 0.200 | 0.028 | 7.224 | 52.185 | 0.000 | 1.221 | 1.157~1.289 |
| PLT | 0.005 | 0.001 | 5.402 | 29.186 | 0.000 | 1.005 | 1.003~1.007 |
| PCT | 0.014 | 0.190 | 0.074 | 0.005 | 0.941 | 1.014 | 0.699~1.472 |
| MPV | -1.343 | 0.167 | -8.059 | 64.942 | 0.000 | 0.261 | 0.188~0.362 |
| PDW | 0.427 | 0.068 | 6.270 | 39.308 | 0.000 | 1.533 | 1.341~1.752 |
| P-LCC | -0.020 | 0.004 | -5.572 | 31.046 | 0.000 | 0.980 | 0.974~0.987 |
| P-LCR | 0.252 | 0.025 | 10.179 | 103.613 | 0.000 | 1.287 | 1.226~1.351 |
| LY% | -0.034 | 0.269 | -0.126 | 0.016 | 0.900 | 0.967 | 0.570~1.638 |
| LY# | -9.255 | 6.553 | -1.412 | 1.995 | 0.158 | 0.000 | 0.000~36.173 |
| MO% | -0.114 | 0.270 | -0.423 | 0.179 | 0.672 | 0.892 | 0.526~1.514 |
| MO# | -9.063 | 6.557 | -1.382 | 1.910 | 0.167 | 0.000 | 0.000~44.208 |
| NE% | -0.087 | 0.269 | -0.322 | 0.104 | 0.748 | 0.917 | 0.541~1.554 |
| NE# | -8.982 | 6.555 | -1.370 | 1.878 | 0.171 | 0.000 | 0.000~47.727 |
| EO% | 0.001 | 0.273 | 0.004 | 0.000 | 0.997 | 1.001 | 0.586~1.710 |
| EO# | -10.107 | 6.565 | -1.540 | 2.370 | 0.124 | 0.000 | 0.000~15.803 |
| BA% | -0.054 | 0.381 | -0.142 | 0.020 | 0.887 | 0.947 | 0.449~1.999 |
| BA# | -7.605 | 6.949 | -1.094 | 1.198 | 0.274 | 0.000 | 0.000~409.475 |
| IMG% | 0.036 | 0.027 | 1.368 | 1.870 | 0.171 | 1.037 | 0.984~1.092 |
| IMG# | -0.129 | 0.174 | -0.740 | 0.548 | 0.459 | 0.879 | 0.626~1.236 |
| Intercept | 22.914 | 28.597 | 0.801 | 0.642 | 0.423 | 8945333229.477 | 0.000~1.964e+34 |
Note: WBC, White Blood Cell Count; RBC, Red Blood Cell Count; HGB, Hemoglobin; HCT, Hematocrit; MCV, Mean Corpuscular Volume; MCH, Mean Corpuscular Hemoglobin; MCHC, Mean Corpuscular Hemoglobin Concentration; RDW-CV, Red Cell Distribution Width-Coefficient of Variation; RDW-SD, Red Cell Distribution Width-Standard Deviation; PLT, Platelet Count; PCT, Plateletcrit; MPV, Mean Platelet Volume; PDW, Platelet Distribution Width; P-LCC, Large Platelet Count; P-LCR, Large Platelet Ratio; LY%, Lymphocyte Percentage; LY#, Absolute Lymphocyte Count; MO%, Monocyte Percentage; MO#, Absolute Monocyte Count; NE%, Neutrophil Percentage; NE#, Absolute Neutrophil Count; EO%, Eosinophil Percentage; EO#, Absolute Eosinophil Count; BA%, Basophil Percentage; BA#, Absolute Basophil Count; IMG%, Immature Granulocyte Percentage; IMG#, Absolute Immature Granulocyte Count.
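The OR and 95% CI columns of Table 2 are consistent with the usual Wald construction, in which the odds ratio is the exponential of the regression coefficient and the interval is obtained from the coefficient plus or minus 1.96 standard errors. The table does not state this explicitly, so the Python sketch below is an assumption about how the columns relate, with a hypothetical function name; it reproduces the MCV row from the rounded coefficient and standard error.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Return (odds ratio, CI lower, CI upper, Wald chi-square) for one coefficient."""
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    wald_chi2 = (coef / se) ** 2  # square of the z-value
    return odds_ratio, lower, upper, wald_chi2


# MCV row of Table 2: coefficient -0.294, standard error 0.102.
mcv_or, mcv_low, mcv_high, mcv_wald = wald_summary(-0.294, 0.102)
```

Small differences from the tabulated Wald χ2 are expected because the inputs here are already rounded to three decimals.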
A logistic regression model was developed using the following equation: ln(p/(1 - p)) = 3.968 - 0.155*MCV + 0.213*MCH - 0.345*RDW-CV + 0.193*RDW-SD + 0.004*PLT - 1.273*MPV + 0.365*PDW - 0.019*P-LCC + 0.244*P-LCR. Here, p denotes the probability that the sepsis variable equals 1, and 1 - p denotes the probability that it equals 0.
The likelihood ratio test in Table 3 yielded a P-value of less than 0.05, indicating that the final model fits significantly better than the intercept-only model.
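To show how the equation is applied, the sketch below converts a set of CBC values into a predicted probability by inverting the logit. The coefficients are those of the equation above; the function name and the dictionary layout are hypothetical, and Python is used purely for illustration.

```python
import math

INTERCEPT = 3.968
COEFS = {
    "MCV": -0.155, "MCH": 0.213, "RDW-CV": -0.345, "RDW-SD": 0.193,
    "PLT": 0.004, "MPV": -1.273, "PDW": 0.365, "P-LCC": -0.019, "P-LCR": 0.244,
}


def sepsis_probability(cbc):
    """Return p = 1 / (1 + exp(-logit)) for a dict of the nine predictors."""
    logit = INTERCEPT + sum(COEFS[name] * cbc[name] for name in COEFS)
    return 1.0 / (1.0 + math.exp(-logit))
```

Because the MPV coefficient is negative, raising MPV while holding the other predictors fixed lowers the predicted probability, and the opposite holds for predictors with positive coefficients such as P-LCR.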
Table 3.
Likelihood ratio test of the binary logit regression model
| Model | -2 log-likelihood | Chi-square value | df | p | AIC | BIC |
|---|---|---|---|---|---|---|
| Intercept only | 5878.336 | | | | | |
| Final model | 5288.515 | 5288.515 | 9 | 0.000 | 5288.515 | 5288.515 |
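For reference, the quantities in a table of this kind are conventionally derived from the two -2 log-likelihood values. The Python sketch below applies the standard definitions (likelihood ratio chi-square as the difference in -2 log-likelihood, AIC = -2LL + 2k, BIC = -2LL + k ln n). It assumes k = 10 estimated parameters (nine predictors plus the intercept) and n = 4,256 samples; neither assumption is stated alongside the table, and the function name is hypothetical.

```python
import math


def likelihood_ratio_summary(neg2ll_null, neg2ll_model, n_params, n_samples):
    """Return (LR chi-square, AIC, BIC) under the standard definitions."""
    lr_chi2 = neg2ll_null - neg2ll_model
    aic = neg2ll_model + 2 * n_params
    bic = neg2ll_model + n_params * math.log(n_samples)
    return lr_chi2, aic, bic


lr_chi2, aic, bic = likelihood_ratio_summary(
    5878.336, 5288.515, n_params=10, n_samples=4256
)
```

The chi-square statistic is then compared against a chi-square distribution with 9 degrees of freedom, the number of predictors added to the intercept-only model.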
The Hosmer-Lemeshow (HL) test was conducted on the model, as shown in Table 4. The P-value was less than 0.05, indicating a significant discrepancy between predicted and observed outcomes. The model therefore did not pass the HL test, and its goodness of fit was poor.
Table 4.
Hosmer-Lemeshow goodness-of-fit test
| χ2 | df | P-value |
|---|---|---|
| 52.874 | 8 | 0.000 |
Construction of the machine learning models
After missing and abnormal values were handled, the following independent variables were included: WBC, RBC, HGB, HCT, MCV, MCH, MCHC, RDW-CV, RDW-SD, PLT, PCT, MPV, PDW, P-LCC, P-LCR, LY%, LY#, MO%, MO#, NE%, NE#, EO%, EO#, BA%, BA#, IMG%, and IMG#. The dependent variable was sepsis (0, 1). A total of 4,256 samples were analyzed (Table 5).
Table 5.
Summary of basic information
| Name | Option | Frequency | Percentage |
|---|---|---|---|
| Sepsis (0, 1) | 0 | 1976 | 46.43% |
| | 1 | 2280 | 53.57% |
| | Total | 4256 | 100.00% |
| Summary | Valid | 4256 | 93.37% |
| | Missing | 302 | 6.63% |
| | Total | 4558 | 100.00% |
Performance evaluation
The training set ratio was set to 0.8, after which each model was assessed. The results, shown in Table 6, indicated that the random forest model achieved an accuracy of 86.97%, an overall precision of 87.02%, an overall recall of 86.97%, and an F1-score of 0.87. These performance metrics (accuracy, precision, recall, and F1-score) were the highest among the six machine learning models tested.
Table 6.
Comparison of the performance of the predictive models
| Model type | Accuracy | Precision (overall) | Recall (overall) | F1-score |
|---|---|---|---|---|
| Support vector model | 58.33% | 76.53% | 58.33% | 0.47 |
| Neural Network model | 53.76% | 28.90% | 53.76% | 0.38 |
| Bayesian classification model | 59.39% | 70.65% | 59.39% | 0.55 |
| The KNN classification model | 74.41% | 74.67% | 74.41% | 0.74 |
| Decision tree classification model | 72.07% | 72.05% | 72.07% | 0.72 |
| Random forest classification model | 86.97% | 87.02% | 86.97% | 0.87 |
| Logistic regression model | 68.77% | 71.45% | 69.47% | 0.70 |
Confusion matrix of machine learning algorithm test set results
The confusion matrix is a tabular representation used to evaluate the accuracy of a machine learning algorithm’s predictions. It displays the outcomes of the algorithm’s predictions in a structured format, allowing for a comprehensive analysis of its performance. Please refer to Figure 1.
Figure 1.

Confusion matrices of the test set results for the machine learning models. A: Random forest classification model; B: Bayesian classification model; C: Decision tree classification model; D: KNN classification model; E: Support vector model; F: Neural network model.
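The overall precision, recall, and F1-score in Table 6 can be obtained from a confusion matrix by averaging per-class values. The text does not state which averaging was used; the Python sketch below assumes support-weighted averaging, under which overall recall equals accuracy, as is the case for each of the six machine learning models in Table 6. The function name and the example matrix are hypothetical.

```python
def weighted_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.

    Returns (accuracy, weighted precision, weighted recall, weighted F1).
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[c][c] for c in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = cm[c][c]
        support = sum(cm[c])                               # true members of class c
        predicted = sum(cm[r][c] for r in range(n_classes))  # predicted as class c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical 2 x 2 matrix (rows: true class 0/1, columns: predicted 0/1).
acc, prec, rec, f1 = weighted_metrics([[40, 10], [5, 45]])
```

With this averaging, a classifier that assigns every sample to one class still scores a recall equal to its accuracy, but its weighted precision drops sharply.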
ROC curves of predictive models
ROC curves were generated for each individual predictive model. The results and corresponding graphical representations are shown in Tables 7, 8 and Figure 2.
Table 7.
Comparison of ROC curves of predictive models
| Model type | AUC | Standard error | P | 95% CI |
|---|---|---|---|---|
| Support vector model | 0.910 | 0.005 | 0.000 | 0.483~0.517 |
| Neural Network model | 0.500 | 0.009 | 1.000 | 0.483~0.517 |
| Bayesian classification model | 0.600 | 0.009 | 0.000 | 0.583~0.617 |
| The KNN classification model | 0.823 | 0.007 | 0.000 | 0.968~0.979 |
| Decision tree classification model | 0.943 | 0.004 | 0.000 | 0.583~0.617 |
| Random forest classification model | 0.973 | 0.003 | 0.000 | 0.968~0.979 |
| Logistic regression model | 0.754 | 0.007 | 0.000 | 0.740~0.769 |
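A common way to obtain a 95% confidence interval for an AUC is the normal approximation, AUC ± 1.96 × standard error. Whether Table 7 was produced this way is not stated, so the Python sketch below is an assumption with a hypothetical function name; it is applied to the random forest and logistic regression rows using the rounded values in the table.

```python
def auc_confidence_interval(auc, se, z_crit=1.96):
    """Normal-approximation CI for an AUC, clipped to the range [0, 1]."""
    lower = max(0.0, auc - z_crit * se)
    upper = min(1.0, auc + z_crit * se)
    return lower, upper


rf_low, rf_high = auc_confidence_interval(0.973, 0.003)
lr_low, lr_high = auc_confidence_interval(0.754, 0.007)
```

Results computed from rounded inputs can differ from tabulated limits in the third decimal place.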
Table 8.
Optimal ROC cut-off values for each predictive model
| Model Type | AUC | Optimal threshold | sensitivity | specificity | Cut-off |
|---|---|---|---|---|---|
| Support vector model | 0.910 | 0.819 | 1.0 | 0.819 | 0.000 |
| Neural Network model | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
| Bayesian classification model | 0.600 | 0.200 | 0.275 | 0.925 | 0.000 |
| The KNN classification model | 0.823 | 0.645 | 0.810 | 0.836 | 0.000 |
| Decision tree classification model | 0.943 | 0.887 | 0.948 | 0.939 | 0.000 |
| Random forest classification model | 0.973 | 0.947 | 0.973 | 0.974 | 0.000 |
| Logistic regression model | 0.754 | 0.389 | 0.654 | 0.735 | 0.527 |
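The third column of Table 8 is numerically consistent with Youden's index, J = sensitivity + specificity - 1, evaluated at the best operating point of each model. This reading is an inference from the tabulated numbers rather than something stated in the text. The Python sketch below computes J and selects the best operating point from a list of candidates; the function names and the candidate list are hypothetical.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic for one operating point on the ROC curve."""
    return sensitivity + specificity - 1.0


def best_operating_point(points):
    """points: iterable of (cutoff, sensitivity, specificity); returns the max-J entry."""
    return max(points, key=lambda p: youden_index(p[1], p[2]))


# Sensitivity and specificity taken from Table 8.
j_rf = youden_index(0.973, 0.974)
j_lr = youden_index(0.654, 0.735)

# Hypothetical candidate cut-offs for illustration only.
chosen = best_operating_point([(0.3, 0.90, 0.50), (0.5, 0.80, 0.80), (0.7, 0.50, 0.95)])
```

Maximizing J weights sensitivity and specificity equally, which may not match clinical priorities when missed sepsis cases are costlier than false alarms.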
Figure 2.
ROC curves of predictive models. A: KNN model; B: Bayesian model; C: Decision tree model; D: Neural network model; E: Random forest model; F: Support vector model; G: Logistic regression model.
Importance of routine blood indicators in the random forest model
Table 9 shows the contribution of each routine blood indicator to the random forest model. RDW-CV contributed 6.98%, RDW-SD 5.32%, MCV 5.22%, RBC 5.21%, HCT 5.08%, PLT 5.02%, PCT 4.59%, MCH 4.53%, HGB 4.02%, NE# 3.73%, P-LCR 3.56%, WBC 3.52%, LY# 3.48%, MCHC 3.45%, PDW 3.30%, and P-LCC 3.27%. The remaining 11 indicators (MO%, MO#, NE%, LY%, MPV, EO%, EO#, BA%, BA#, IMG%, and IMG#) each made a smaller contribution. For further details, refer to Table 9 and Figure 3.
Table 9.
Weight values of routine blood indicators in the random forest model
| Item | Weight value |
|---|---|
| WBC | 0.035 |
| RBC | 0.052 |
| HGB | 0.040 |
| HCT | 0.051 |
| MCV | 0.052 |
| MCH | 0.045 |
| MCHC | 0.035 |
| RDW-CV | 0.070 |
| RDW-SD | 0.053 |
| PLT | 0.050 |
| PCT | 0.046 |
| MPV | 0.028 |
| PDW | 0.033 |
| P-LCC | 0.033 |
| P-LCR | 0.036 |
| LY% | 0.028 |
| LY# | 0.035 |
| MO% | 0.032 |
| MO# | 0.031 |
| NE% | 0.032 |
| NE# | 0.037 |
| EO% | 0.023 |
| EO# | 0.030 |
| BA% | 0.014 |
| BA# | 0.027 |
| IMG% | 0.027 |
| IMG# | 0.025 |
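The ranking implied by Table 9 can be reproduced by sorting the weight values. The Python sketch below uses the tabulated weights as given; the variable names are hypothetical, and the code only orders and sums the published values rather than recomputing importance from a fitted model.

```python
WEIGHTS = {
    "WBC": 0.035, "RBC": 0.052, "HGB": 0.040, "HCT": 0.051, "MCV": 0.052,
    "MCH": 0.045, "MCHC": 0.035, "RDW-CV": 0.070, "RDW-SD": 0.053,
    "PLT": 0.050, "PCT": 0.046, "MPV": 0.028, "PDW": 0.033, "P-LCC": 0.033,
    "P-LCR": 0.036, "LY%": 0.028, "LY#": 0.035, "MO%": 0.032, "MO#": 0.031,
    "NE%": 0.032, "NE#": 0.037, "EO%": 0.023, "EO#": 0.030, "BA%": 0.014,
    "BA#": 0.027, "IMG%": 0.027, "IMG#": 0.025,
}

# Indicators ordered from most to least important.
ranked = sorted(WEIGHTS.items(), key=lambda item: item[1], reverse=True)

# Combined share of the two RDW indicators and the total of all weights.
rdw_share = WEIGHTS["RDW-CV"] + WEIGHTS["RDW-SD"]
total_weight = sum(WEIGHTS.values())
```

The weights sum to one, so each value can be read directly as a share of total importance.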
Figure 3.

Weight values of routine blood indicators in the random forest model.
Discussion
Sepsis is a life-threatening disease that progresses rapidly, leading to multi-organ dysfunction, disseminated intravascular coagulation, and potentially death [26]. Early detection and appropriate treatment are crucial for improving patient prognosis, with studies showing that timely treatment significantly reduces mortality [27]. Various clinical indicators of inflammation (such as procalcitonin and C-reactive protein) and clinical scores (e.g., qSOFA and SOFA) are commonly used to assess the disease.
Routine blood tests provide valuable information on red blood cells, white blood cells, and platelets, which are sensitive to many pathologic changes in the body. This sensitivity makes them effective in detecting the onset and progression of disease. Increasing evidence supports the use of routine blood tests as informative biomarkers for a variety of diseases, including cancer, inflammatory disorders, viral infections, and cardiovascular diseases [28]. In a prospective study, Margolis [29] demonstrated that elevated white blood cell (WBC) counts are associated with an increased risk of various types of breast cancer. Additionally, the predictive value of routine blood tests, including RDW and other markers, has been shown to be more effective than diagnosis based on conventional cell counts.
In this study, we first applied traditional statistical methods to develop a binary logistic regression model for predicting sepsis using routine blood markers. We then employed machine learning algorithms, including support vector machines (SVM), neural networks, Naive Bayes classifiers, K-Nearest Neighbors (KNN), decision trees, and random forest classifiers. The performance of these models was evaluated, and the results showed that the random forest model outperformed the other machine learning algorithms in terms of accuracy, precision, recall, and F1-score, and clearly surpassed the traditional statistical method.
RDW is a measure that reflects the heterogeneity of peripheral red blood cell volume. A lower RDW indicates uniformity in red blood cell size, while a higher RDW suggests greater variability in cell size. RDW is useful in the early diagnosis and monitoring of iron-deficiency anemia. The RDW/MCV ratio is also used in the morphologic classification of anemia. Studies have shown that RDW is a valuable predictor for tumor diseases [30], active inflammatory bowel disease [31], and cardiovascular diseases [32].
In clinical studies, various indicators and clinical scores have been used for the early identification of sepsis, including systemic inflammatory response syndrome (SIRS), sequential organ failure assessment (SOFA), modified early warning score (MEWS), and national early warning score (NEWS). Although these tools aid in diagnosis and treatment, their diagnostic accuracy varies significantly, and most have limited predictive value. Over the years, numerous studies have explored the application of AI in sepsis management. In 2016, Calvert et al. [33] used the MIMIC-II database to analyze data from 1,394 critically ill patients and developed the Insight model based on 9 indicators, achieving an AUC of 0.83 (95% CI: 0.80-0.86) for predicting sepsis 3 hours in advance. Similarly, Mao et al. [34] developed an advanced model based on 6 vital signs, showing an AUC of 0.85 (95% CI: 0.79-0.91) for predicting severe sepsis 4 hours before its onset. However, these studies had limitations, such as inadequate use of electronic health record (EHR) data and ambiguous definitions of sepsis onset. Faisal et al. [35] created an LR model to predict sepsis risk after emergency admission, achieving AUCs of 0.78 and 0.79 on different test sets, but the model lacked real-time predictive capability despite using more data. In 2018, Nemati et al. [36] developed the AISE model, based on specific data and methods, which had AUCs of 0.83-0.85 for predicting sepsis 4-12 hours in advance. This model also provided interpretable results, though limited access to EHR data remained a challenge for many AI researchers. Additionally, randomized controlled trials (RCTs) and prospective studies have shown that some AI warning models can improve patient outcomes. However, previous studies often suffered from low data collection frequencies, leading to prediction delays. 
More recent research indicates that physiological data from bedside monitors can be used to extract refined features for faster feedback on disease progression. Studies on heart rate variability and other factors in various patient groups have demonstrated their predictive value for sepsis, showing that combining multiple features improved prediction accuracy. Initially, many studies did not fully utilize multi-channel signals, focusing mainly on numerical vital signs. However, recent efforts have provided valuable references for non-contact video motion analysis in sepsis warning systems.
This study has several limitations. First, it is a single-center retrospective study with some missing data. To minimize bias in the results, missing values were imputed using statistical software. Second, blood-related data are subject to interference from various factors, and some measurements may fluctuate significantly during treatment. For example, patients with heparin-induced thrombocytopenia may experience platelet decreases after heparin administration, while hemodialysis can damage red blood cells, causing further data fluctuations. These patients were not excluded, which may have influenced the study results.
Conclusion
This study provides important findings with significant clinical implications. The integration of routine blood data and machine learning algorithms, especially the high-performing random forest classification model, offers a novel and effective approach for early sepsis prediction. Clinically, this enables healthcare providers to use routine blood test results to quickly identify patients at high risk for sepsis, facilitating timely initiation of preventive and therapeutic measures.
The identification of RDW as a key measure within the model underscores its potential as an important biomarker for sepsis. This discovery suggests that clinicians should consider incorporating RDW analysis into routine clinical evaluations to enhance diagnostic accuracy and risk stratification.
From a clinical treatment and prognostic perspective, the early prediction facilitated by this study allows for prompt evidence-based management, including timely fluid resuscitation, appropriate antibiotic selection, and effective control of infection sources. These interventions are likely to prevent sepsis from progressing to more severe stages, thereby reducing mortality rates and improving overall patient outcomes. In summary, the results of this study provide a practical predictive model and valuable insight that may optimize clinical decision-making in sepsis management, ultimately improving patient outcomes.
Acknowledgements
This study was supported by: 1. Fujian Provincial Health Technology Project (No. 2020QNA085); 2. National Natural Science Foundation of China (No. 62276146); 3. The Natural Science Foundation of Fujian Province of China (No. 2023J011707); 4. The Natural Science Foundation of Fujian Province of China (No. 2023J011721).
Disclosure of conflict of interest
None.
References
- 1. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Bauer M. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA. 2016;315:801–810. doi: 10.1001/jama.2016.0287.
- 2. Hernandez G, Ospina-Tascón GA, Damiani LP, Estenssoro E, Hurtado J, Friedman G, Castro R, Alegría L, Teboul JL, Cecconi M, Ferri G, Jibaja M, Pairumani R, Fernández P, Barahona D, Granda-Luna V, Cavalcanti AB, Bakker J; The ANDROMEDA SHOCK Investigators and the Latin America Intensive Care Network (LIVEN). Effect of a resuscitation strategy targeting peripheral perfusion status vs serum lactate levels on 28-day mortality among patients with septic shock: the ANDROMEDA-SHOCK Randomized Clinical Trial. JAMA. 2019;321:654–664. doi: 10.1001/jama.2019.0071.
- 3. Pittard MG, Huang SJ, McLean AS, Orde SR. Association of positive fluid balance and mortality in sepsis and septic shock in an Australian cohort. Anaesth Intensive Care. 2017;45:737–743. doi: 10.1177/0310057X1704500614.
- 4. Schwindenhammer V, Shiraishi A, Yamakawa K, Ogura H, Fujishima S. oXiris(R) use in septic shock: experience of two French centres. Blood Purif. 2019;47(Suppl 3):1–7. doi: 10.1159/000499510.
- 5. Rhodes A, Evans LE, Alhazzani W, Levy MM, Ferrer R. Surviving sepsis campaign: international guidelines for management of sepsis and septic shock: 2016. Crit Care Med. 2017;45:486–552. doi: 10.1097/CCM.0000000000002255.
- 6. Liu B, Ding X, Yang J. Effect of early goal directed therapy in the treatment of severe sepsis and/or septic shock. Curr Med Res Opin. 2016;32:1773–1782. doi: 10.1080/03007995.2016.1206872.
- 7. Sherwin R, Winters ME, Vilke GM, Wardi G. Does early and appropriate antibiotic administration improve mortality in emergency department patients with severe sepsis or septic shock? J Emerg Med. 2017;53:588–595. doi: 10.1016/j.jemermed.2016.12.009.
- 8. Bai X, Yu W, Ji W, Lin Z, Duan K. Early versus delayed administration of norepinephrine in patients with septic shock. Crit Care. 2014;18:532. doi: 10.1186/s13054-014-0532-y.
- 9. Kleinpell RM, Schorr CA, Balk RA. The new sepsis definitions: implications for critical care practitioners. Am J Crit Care. 2016;25:457–464. doi: 10.4037/ajcc2016574.
- 10. Lu Y, Zhang H, Teng F, Xia WJ, Sun GX, Wen AQ. Early goal-directed therapy in severe sepsis and septic shock: a meta-analysis and trial sequential analysis of randomized controlled trials. J Intensive Care Med. 2018;33:296–309. doi: 10.1177/0885066616671710.
- 11. Kong XY, Wang RZ. Artificial intelligence and its application in medical field. J Med Inform. 2016;37:1–5.
- 12. García-Gallo JE, Fonseca-Ruiz NJ, Celi LA, Duitama-Muñoz JF. A machine learning-based model for 1-year mortality prediction in patients admitted to an Intensive Care Unit with a diagnosis of sepsis. Med Intensiva. 2020;44:160–70. doi: 10.1016/j.medin.2018.07.016.
- 13. Thorsen-Meyer HC, Nielsen AB, Nielsen AP, Kaas-Hansen BS, Schierbeck J. Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. Lancet Digit Health. 2020;2:e179–91. doi: 10.1016/S2589-7500(20)30018-2.
- 14. Giannini HM, Ginestra JC, Chivers C, Draugelis M, Schweickert WD. A machine learning algorithm to predict severe sepsis and septic shock: development, implementation, and impact on clinical practice. Crit Care Med. 2019;47:1485–92. doi: 10.1097/CCM.0000000000003891.
- 15. Giacobbe DR, Signori A, Del Puente F, Mora S, Briano F. Early detection of sepsis with machine learning techniques: a brief clinical perspective. Front Med. 2021;8:617486. doi: 10.3389/fmed.2021.617486.
- 16. Torres A, Cilloniz C, Niederman MS. Pneumonia. Nat Rev Dis Primers. 2021;7:25. doi: 10.1038/s41572-021-00259-0.
- 17. Li S, Nguyen IP, Urbanczyk K. Common infectious diseases of the central nervous system-clinical features and imaging characteristics. Quant Imaging Med Surg. 2020;10:2227–2259. doi: 10.21037/qims-20-886.
- 18. Wilkie ME, Almond MK, Marsh FP. Diagnosis and management of urinary tract infection in adults. BMJ. 1992;305:1137. doi: 10.1136/bmj.305.6862.1137.
- 19. Ramos J. Abdominal pain: the differential diagnosis, classic histories, and diagnosis. Physician Assistant Clinics. 2023;8:33–48.
- 20. Melzer M, Toner R, Lacey S, Bettany E, Rait G. Biliary tract infection and bacteraemia: presentation, structural abnormalities, causative organisms and clinical outcomes. Postgrad Med J. 2007;83:773–6. doi: 10.1136/pgmj.2007.064683.
- 21. Mermel LA, Allon M, Bouza E, Craven DE, Flynn P, O’Grady NP, Raad II, Rijnders BJ, Warren DK. Clinical practice guidelines for the diagnosis and management of intravascular catheter-related infection: 2009 update by the Infectious Diseases Society of America. Clin Infect Dis. 2009;49:1–45. doi: 10.1086/599376.
- 22. Ki V, Rotstein C. Bacterial skin and soft tissue infections in adults: a review of their epidemiology, pathogenesis, diagnosis, treatment and site of care. Can J Infect Dis Med Microbiol. 2008;19:173–84. doi: 10.1155/2008/846453.
- 23. Reichman DE, Greenberg JA. Reducing surgical site infections: a review. Rev Obstet Gynecol. 2009;2:212–21.
- 24. Ding XF, Yang ZY, Xu ZT, Li LF, Yuan B, Guo LN, Wang LX, Zhu X, Sun TW. Early goal-directed and lactate-guided therapy in adult patients with severe sepsis and septic shock: a meta-analysis of randomized controlled trials. J Transl Med. 2018;16:331. doi: 10.1186/s12967-018-1700-7.
- 25. Rhodes A, Evans LE, Alhazzani W, Levy MM, Ferrer R. Surviving sepsis campaign: international guidelines for management of sepsis and septic shock: 2016. Intensive Care Med. 2017;43:304–377. doi: 10.1007/s00134-017-4683-6.
- 26. Kim H, Kim Y, Lee HK, Yeo CD. Comparison of the delta neutrophil index with procalcitonin and C-reactive protein in sepsis. Clin Lab. 2014;60:2015–2021. doi: 10.7754/clin.lab.2014.140528.
- 27. Finfer S. The surviving sepsis campaign: robust evaluation and high-quality primary research is still needed. Intensive Care Med. 2010;36:187–189. doi: 10.1007/s00134-009-1737-4.
- 28. Gandini S, Ferrucci PF, Botteri E, Draugelis M. Prognostic significance of hematological profiles in melanoma patients. Int J Cancer. 2016;139:1618–1625. doi: 10.1002/ijc.30215.
- 29. Margolis KL, Rodabough RJ, Thomson CA. Prospective study of leukocyte count as a predictor of incident breast, colorectal, endometrial, and lung cancer and mortality in postmenopausal women. Arch Intern Med. 2016;167:1837–1844. doi: 10.1001/archinte.167.17.1837.
- 30. Montagnana M, Danese E. Red cell distribution width and cancer. Ann Transl Med. 2016;4:399. doi: 10.21037/atm.2016.10.50.
- 31. Song CS, Park DI, Yoon MY. Association between red cell distribution width and disease activity in patients with inflammatory bowel disease. Dig Dis Sci. 2012;57:1033–1038. doi: 10.1007/s10620-011-1978-2.
- 32. Tonelli M, Sacks F, Arnold M. Relation between red blood cell distribution width and cardiovascular event rate in people with coronary disease. Circulation. 2008;117:163–168. doi: 10.1161/CIRCULATIONAHA.107.727545.
- 33. Calvert JS, Price DA, Chettipally UK. A computational approach to early sepsis detection. Comput Biol Med. 2016;74:69–73. doi: 10.1016/j.compbiomed.2016.05.003.
- 34. Mao Q, Jay M, Hoffman JL. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open. 2018;8:e017833. doi: 10.1136/bmjopen-2017-017833.
- 35. Faisal M, Scally A, Richardson D. Development and external validation of an automated computer-aided risk score for predicting sepsis in emergency medical admissions using the patient’s first electronically recorded vital signs and blood test results. Crit Care Med. 2018;46:612–618. doi: 10.1097/CCM.0000000000002967.
- 36. Nemati S, Holder A, Razmi F. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med. 2018;46:547–553. doi: 10.1097/CCM.0000000000002936.

