Scientific Reports. 2026 Jan 17;16:2813. doi: 10.1038/s41598-025-32717-0

Machine learning-based prediction of progression from idiopathic cytopenia of undetermined significance to myeloid malignancies

Hyunkyung Park 1,#, Ji-Ye Han 2,#, Han-Seung Park 1, Yunsuk Choi 1, Jung-Hee Lee 1, Je-Hwan Lee 1, Kyoo-Hyung Lee 3, Young-Hak Kim 4, Tae Joon Jun 5,✉,#, Eun-Ji Choi 1,✉,#
PMCID: PMC12823607  PMID: 41547901

Abstract

Although the risk of progression varies, a subset of patients with idiopathic cytopenia of undetermined significance (ICUS) eventually develop myeloid malignancies. Early identification of high-risk patients is crucial for timely intervention and optimized clinical management. This study aimed to develop a machine learning-based model to predict the progression of ICUS to myeloid malignancies. We retrospectively analyzed data from 1274 patients who underwent bone marrow examination at Asan Medical Center, Seoul, South Korea, between January 2000 and December 2021 and met the diagnostic criteria for ICUS. Among these patients, 36 (2.82%) progressed to myeloid malignancies. We developed a predictive model using the extreme gradient boosting algorithm, incorporating clinical, laboratory, and cytogenetic features. The model achieved an area under the receiver operating characteristic curve of 0.780, with enhanced performance after integrating PubMedBERT to extract insights from unstructured text data from bone marrow examination reports. Additionally, we applied SHapley Additive exPlanations to generate individualized risk scores, estimate progression probabilities, and visualize key predictive features, enabling personalized risk assessment. In conclusion, we developed a machine learning-based model predicting ICUS progression to myeloid malignancies. This model could serve as a valuable tool for personalized risk stratification and tailored patient monitoring in clinical practice.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-32717-0.

Keywords: Machine learning, Idiopathic cytopenia of undetermined significance, Myeloid malignancies, Progression

Subject terms: Cancer, Computational biology and bioinformatics, Diseases, Medical research, Oncology, Risk factors

Introduction

Idiopathic cytopenia of undetermined significance (ICUS) is a condition characterized by peripheral blood cytopenia persisting for at least 4 months without an identifiable cause and without meeting the criteria for myeloid neoplasms1. Although the exact prevalence of ICUS remains unknown, a study using data from a Centers for Disease Control and Prevention survey reported that cytopenia was observed in 2.0% of 7,962 individuals from the general population, with 0.9% classified as unexplained cytopenia2. Clinical implications of ICUS include the risk of a subset of patients progressing to myeloid malignancies, including myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML)3. Beyond ICUS, recent technological advances in detecting genetic aberrations in blood or bone marrow (BM) samples have led to the definition of related clinical entities based on the existence of cytopenia and/or genetic mutations. Clonal hematopoiesis of indeterminate potential (CHIP) refers to the finding of a somatic mutation in myeloid neoplasm driver genes (with VAF ≥ 2%) in the absence of cytopenia4. Clonal cytopenia of undetermined significance (CCUS) is defined by the coexistence of unexplained cytopenia (i.e., ICUS) with evidence of clonal hematopoiesis. A less frequently used entity is idiopathic dysplasia of unknown significance (IDUS), which is characterized by the presence of BM dysplasia without significant cytopenia.

Although most cases with CHIP, ICUS, IDUS, and CCUS do not progress to myeloid neoplasms, certain clinical and genetic features are associated with increased risk of progression. Reported risk factors include age, presence of cytopenia, number of mutations, and high-risk mutations, although these associations vary by patient cohort and study design5–7. In a recent study, patients with CCUS or CHIP harboring high-risk features had a 52.2% cumulative incidence of myeloid neoplasms over 10 years, whereas those in the low-risk group had only a 0.7% progression rate5. Another real-world study found that high-risk patients with CCUS had a 2-year cumulative incidence of myeloid malignancies of 37.2%, compared with 6.4% in the low-risk group7. Additionally, multiparameter risk stratification models for AML, MDS, and myeloproliferative neoplasms (MPN) have been developed using general population databases8. These findings highlight the clinical importance of identifying patients with cytopenia, or even asymptomatic individuals, who are at high risk of progression to myeloid malignancies.

Most predictive models for myeloid neoplasm progression rely on somatic mutation analysis via targeted sequencing. However, this approach is generally impractical in routine clinical settings due to the high costs, particularly for patients without a confirmed hematologic diagnosis. Furthermore, the lack of a standardized treatment approach for CCUS or CHIP limits the clinical utility of targeted sequencing for risk assessment. In this context, we hypothesized that predicting disease progression in patients with ICUS, who do not require confirmation of somatic mutations, would be more feasible in real-world settings than in CCUS or CHIP. This study aimed to develop a machine learning-based model to predict ICUS progression to myeloid malignancies using selected clinical and laboratory variables, thereby minimizing bias and improving predictive performance. Additionally, we employed the PubMed Bidirectional Encoder Representations from Transformers (PubMedBERT) model for text embedding to enhance the interpretation of BM examination reports9.

Methods

Patient population

The patient population was defined using electronic medical records (EMR). Among 26,393 patients who underwent BM examination between January 2000 and December 2021 at Asan Medical Center, 6,962 patients with cytopenia observed between 7 days before and 7 days after the BM examination date were identified. Cytopenia was defined according to the criteria outlined in the WHO 2022 classification: hemoglobin < 13 g/dL in males or < 12 g/dL in females, neutrophil count < 1.9 × 10⁹/L, and/or platelet count < 150 × 10⁹/L (ref. 4). From these cases, ICUS was defined as persistent cytopenia lasting at least 4 months, with no underlying disease or condition that could account for the cytopenia. Conditions leading to exclusion included a history of cytotoxic chemotherapy or radiation therapy, blood or BM disorders, autoimmune diseases, solid organ transplantation, active infection by atypical pathogens, and use of immunosuppressants or chemotherapeutic agents. Secondary conditions were excluded based on ICD-10 codes and medical records. Ultimately, 1,274 patients were selected, and their electronic data were used to develop the predictive model. The primary outcome was defined as the progression to myeloid malignancies, including MDS, AML, MPN, and chronic myelomonocytic leukemia.
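The cytopenia thresholds quoted above can be expressed as a simple rule. The following is a minimal sketch, not the authors' screening code: the function name, argument names, and the treatment of missing values as "not below threshold" are assumptions made for illustration, and the 4-month persistence and exclusion criteria are not modeled.

```python
def has_cytopenia(sex, hemoglobin_g_dl=None, neutrophils_1e9_l=None, platelets_1e9_l=None):
    """Return True if any available value is below the thresholds quoted in the text.

    Assumption: any sex other than "male" uses the female hemoglobin cutoff,
    and a missing measurement (None) never triggers the rule.
    """
    hb_cutoff = 13.0 if sex == "male" else 12.0  # g/dL
    checks = [
        hemoglobin_g_dl is not None and hemoglobin_g_dl < hb_cutoff,
        neutrophils_1e9_l is not None and neutrophils_1e9_l < 1.9,   # x 10^9/L
        platelets_1e9_l is not None and platelets_1e9_l < 150.0,     # x 10^9/L
    ]
    return any(checks)
```

For example, a hemoglobin of 12.5 g/dL with otherwise normal counts satisfies the rule for a male patient but not for a female patient.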

Data processing (ABLE)

Structured feature extraction

An overview of the study design is provided in Fig. 1. The initial dataset comprised 100 structured features, encompassing diagnoses, medications, physical attributes, laboratory results, and medical history. Data collection was based on the date of the initial BM examination. To ensure that patient characteristics were accurately represented, each category was refined through an analysis of the entire patient population, with selection based on frequency distribution. Clinical relevance guided the final selection, with domain experts identifying key variables for analysis.

Fig. 1. Overview of the study.

Variables used for model training included medication prescriptions, laboratory data, physical measurements, and medical history. Feature selection was performed by identifying the most frequently recorded variables across the entire patient cohort and retaining only the top-ranking items within each category. The variables used in the machine learning model are presented in the Supplementary Table 1. Categorical variables, primarily binary indicators derived from diagnosis codes or demographics, were encoded as binary values without an unknown category. For continuous features, the average missing rate prior to imputation was 39%, and missing values were imputed using the mean of the observed values.
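Mean imputation of a continuous feature, as described above, can be sketched as follows. This is an illustrative helper rather than the study's pipeline: the function names are hypothetical, missing values are represented as None, and whether the mean was computed on the training portion only is not stated in the text.

```python
def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing observed: leave the column unchanged rather than invent a value.
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

A common design choice is to estimate the mean on the training data and reuse it for the test data, so that no test-set information enters model fitting.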

Unstructured feature extraction: structured features derived from bone marrow examination and cytogenetic reports

Both BM examination reports and cytogenetic analyses, recorded as unstructured free-text data in the EMR, contained both formatted sections and narrative descriptions. To incorporate these data into the predictive model, two distinct methods were applied. First, BM examination reports were processed to extract numerical percentage values for each differential cell type, converting them into structured numerical features. Second, key cytogenetic analysis details, including total chromosome count, sex chromosome composition, and descriptions of chromosomal abnormalities, were extracted and transformed into structured categorical features. Domain experts determined the most clinically relevant features for inclusion in the analysis. Through this process, 101 features derived from cytogenetic analyses and 40 features from BM examination reports were incorporated into the text dataset. The structured patient characteristics were then integrated into the base dataset for model training, contributing to enhanced predictive performance.
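The two extraction steps above can be illustrated with a small parsing sketch. The report layout ("Cell type: value%") and the ISCN-style karyotype string are assumed formats chosen for illustration; the actual EMR templates are not described in the text, and the function names are hypothetical.

```python
import re

# Assumed layout: "<cell type>: <number>%" or "<cell type> = <number> %".
PERCENT_PATTERN = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*[:=]\s*(\d+(?:\.\d+)?)\s*%")


def extract_differential(report_text):
    """Map each cell type found in the text to its percentage value."""
    return {name.strip().lower(): float(value)
            for name, value in PERCENT_PATTERN.findall(report_text)}


def parse_karyotype(karyotype):
    """Split the first clone of an ISCN-style string into simple categorical fields."""
    first_clone = karyotype.split("/")[0].strip()
    body = re.sub(r"\[\d+\]$", "", first_clone)  # drop the trailing cell count, e.g. [20]
    fields = body.split(",")
    return {
        "chromosome_count": int(fields[0]),
        "sex_chromosomes": fields[1],
        "abnormalities": fields[2:],
    }
```

This sketch reads only the first clone and assumes a single integer chromosome count; composite or range notations would need additional handling.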

Unstructured feature extraction: clinical text embeddings with PubMedBERT

To extract clinically meaningful insights from unstructured text, PubMedBERT was employed for contextualized embeddings. As a bidirectional Transformer-based model pre-trained on PubMed data, PubMedBERT offers advantages over standard BERT by enabling precise extraction of complex biomedical information. For this study, fine-tuning of PubMedBERT was conducted to optimize its alignment with the dataset, allowing for the conversion of unstructured patient characteristics into high-dimensional vector representations. This transformation enabled the integration of textual data from cytogenetic analysis results and BM examination reports into machine learning models. By converting key clinical report information into numerical embeddings, semantic and contextual relationships within the data were captured. These embedded features were then incorporated into the embedded dataset, improving predictive performance.

Model development and evaluation

The dataset was partitioned into training and test sets in an 8:2 ratio to ensure a balanced approach to model training and evaluation. To mitigate potential overfitting due to data imbalance and enhance generalizability, stratified K-fold cross-validation with three folds was applied10.
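Stratification keeps the proportion of progressed patients similar in every fold, which matters when only 36 of 1,274 patients are positive. The sketch below is a generic pure-Python illustration, not the study's implementation; the function name and seed are hypothetical, and the whole cohort's class counts are used only as an example because the text does not state which portion of the data the folds were drawn from.

```python
import random


def stratified_folds(labels, n_folds=3, seed=0):
    """Assign sample indices to folds so each class is spread evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Illustration with the cohort's class counts: 36 progressed, 1,238 not progressed.
labels = [1] * 36 + [0] * 1238
folds = stratified_folds(labels, n_folds=3)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

With these counts, each of the three folds receives 12 progressed patients.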

Four machine learning algorithms (Extreme Gradient Boosting [XGBoost], Random Forest, Logistic Regression, and Support Vector Machine [SVM]) were evaluated for their ability to predict disease progression in patients with ICUS. Each model was trained on the same dataset, which was randomly partitioned into training and test sets. Model performance was assessed using a set of complementary metrics, including the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and the Matthews correlation coefficient (MCC) to account for class imbalance. For confusion matrix-based metrics, the decision threshold was optimized using Youden’s J index, which selects the threshold that maximizes sensitivity + specificity − 1.
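Threshold selection by Youden's J index and the MCC can both be computed directly from confusion-matrix counts. The sketch below is a minimal illustration on toy data, not the study's evaluation code; the function names and the toy labels and scores are assumptions, and both classes must be present for sensitivity and specificity to be defined.

```python
import math


def confusion(y_true, scores, threshold):
    """Counts (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp, fp, tn, fn


def youden_threshold(y_true, scores):
    """Return the candidate threshold maximizing J = sensitivity + specificity - 1."""
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion(y_true, scores, threshold)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example (hypothetical values): four non-progressors and two progressors.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
threshold, j = youden_threshold(y_true, scores)
toy_mcc = mcc(*confusion(y_true, scores, threshold))
```

Because MCC uses all four confusion-matrix cells, it stays low when a model achieves high accuracy simply by predicting the majority class, which is why it complements accuracy in an imbalanced cohort.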

Across these evaluation metrics, XGBoost achieved the highest AUROC of 0.762, AUPRC of 0.141, accuracy of 0.788, and MCC of 0.232 on the base dataset, outperforming the random forest, logistic regression, and SVM models. Based on these results, XGBoost was selected as the reference model for subsequent variable selection and for comparison between the base and embedded datasets. In the embedded dataset, text-derived features were incorporated into the XGBoost model for further performance evaluation.

Explainable individualized prediction for progression

To provide individualized predictions, Shapley values were employed to quantify how each feature contributes to a patient’s predicted probability of disease progression. The Shapley value, derived from game theory, was calculated as the average change observed with the inclusion or exclusion of a single feature across all possible feature combinations11. Given a prediction model f(x), the Shapley values were computed using the following equation:

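$$\phi_i(f, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]$$

where N is the set of all features, n = |N| represents the total number of features, f_x(S) denotes the model output for patient x when only the features in S are known, and the summation extends over all subsets S of N that do not contain feature i. A unified framework called SHapley Additive exPlanations (SHAP) was applied to enhance interpretability by using Shapley values12. Through SHAP integration, patient-specific predictions were generated with interpretable feature-level contributions.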
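For a handful of features, the Shapley value can be computed exactly by enumerating every subset S that excludes feature i, which makes the weighting in the definition concrete. The sketch below uses a hypothetical three-feature value function with made-up coefficients; in practice SHAP uses efficient approximations or tree-specific algorithms rather than this exponential enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all subsets that exclude each feature."""
    n = n_features
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            # Weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(subset):
    """Hypothetical model output given the known features (illustrative numbers only)."""
    value = 0.05                      # output with no features known
    if 0 in subset:
        value += 0.10                 # main effect of feature 0
    if 1 in subset:
        value += 0.04                 # main effect of feature 1
    if 0 in subset and 1 in subset:
        value += 0.06                 # interaction, split equally by the Shapley value
    return value                      # feature 2 has no effect


phi = shapley_values(toy_value, 3)
```

The contributions sum to the difference between the output with all features known and the output with none, and a feature that never changes the output receives a value of zero.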

Experimental setup

The fine-tuning model for text embedding was implemented using Python 3.8 and the TensorFlow deep learning framework (version 2.6.0), based on the BERT-base architecture. Optimization of the pretrained language model was performed using the Adam optimizer, with fine-tuning conducted for up to 50 epochs with early stopping to prevent overfitting. The learning rate was set to 10⁻⁵, and a batch size of 8 was chosen based on dataset size and GPU memory capacity (GeForce RTX 3090). To ensure computational efficiency, the maximum token length was limited to 512, and the vocabulary size was restricted to 32,500.

Other statistical analysis

Categorical variables were analyzed using the Chi-squared test. The Shapiro–Wilk test was applied to assess normality in continuous variables. Variables that did not meet the normality assumption were analyzed using the Mann–Whitney U-test, whereas normally distributed variables were analyzed using the t-test. All statistical analyses were conducted using R version 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria). In all analyses, two-tailed P-values less than 0.05 were considered statistically significant.

Results

Patient characteristics

The baseline characteristics of patients at the time of BM examination are summarized in Table 1. The median age of all enrolled patients was 58 years, with 54.1% being male. Compared with the no progression group, patients who experienced progression had significantly lower white blood cell counts (3.7 vs. 5.7 × 10³/µL, P < 0.0001), absolute neutrophil counts (1.6 vs. 3.0 × 10³/µL, P < 0.0001), and hemoglobin levels (10.3 vs. 11.3 g/dL, P < 0.0001). They also exhibited higher MCV (99.9 vs. 92.1 fL, P < 0.0001) and BM blast percentage (1.2 vs. 0.8%, P = 0.019). During a median follow-up of 18.3 months (16.1 months in the progressed group and 18.5 months in the non-progressed group), 36 (2.82%) out of 1,274 patients progressed to myeloid malignancies, with a median time to progression of 484 days from ICUS diagnosis. Of these 36 patients, 26 progressed to MDS, 8 to AML, 1 to myelofibrosis, and 1 to chronic myelomonocytic leukemia. The laboratory and BM findings of the progressed patients are summarized in Supplementary Table 2. Of the 34 patients with available cytogenetic results, 29 had a normal karyotype at ICUS diagnosis, 3 had partial loss of the Y chromosome, and one each exhibited partial gain of 1q and X/9 translocation, respectively.

Table 1. Patient characteristics.

| Characteristic | Total (n = 1,274) | Progression (n = 36) | No progression (n = 1,238) | P-value |
|---|---|---|---|---|
| Follow-up duration, months, median (range) | 18.3 (3.3–65.0) | 16.1 (7.1–40.5) | 18.5 (3.2–65.2) | 0.753 |
| Age at ICUS diagnosis, years, median (range) | 58.0 (44.0–69.0) | 59.5 (40.8–66.3) | 58.0 (44.0–69.0) | 0.512 |
| Age at censoring, years, median (range) | 62.0 (49.0–72.0) | 60.1 (43.0–68.3) | 62.0 (49.0–72.0) | 0.345 |
| Sex, n (%) | | | | 0.124 |
| Male | 689 (54.1) | 24 (66.7) | 665 (53.7) | |
| Female | 585 (45.9) | 12 (33.3) | 573 (46.3) | |
| WBC, × 10³/µL, median (range) | 5.7 (3.7–8.6) | 3.7 (3.1–5.5) | 5.7 (3.8–8.7) | < 0.001 |
| ANC, × 10³/µL, median (range) | 2.9 (1.6–4.9) | 1.6 (0.8–2.8) | 3.0 (1.6–4.9) | < 0.001 |
| ANC < 1,800/µL, n (%) | 300 (23.6) | 20 (55.6) | 280 (22.6) | < 0.001 |
| ANC ≥ 1,800/µL, n (%) | 816 (64.1) | 16 (44.4) | 800 (64.6) | |
| Monocytes, × 10³/µL, median (range) | 0.41 (0.28–0.62) | 0.33 (0.18–0.45) | 0.41 (0.28–0.62) | 0.085 |
| Hemoglobin, g/dL, median (range) | 11.3 (9.4–12.8) | 10.3 (8.9–11.9) | 11.3 (9.4–12.9) | 0.005 |
| Hemoglobin < 13 (male) or 12 (female), n (%) | 892 (70.0) | 33 (91.7) | 859 (69.4) | 0.004 |
| Hemoglobin ≥ 13 (male) or 12 (female), n (%) | 382 (30.0) | 3 (8.3) | 379 (30.6) | |
| MCV, fL, median (range) | 92.2 (88.0–97.3) | 99.9 (91.3–106.8) | 92.1 (87.9–97.0) | < 0.001 |
| Platelet, × 10³/µL, median (range) | 122.5 (53.0–241.0) | 86.0 (51.9–205.0) | 126.0 (53.3–245.0) | 0.370 |
| Platelet < 150, n (%) | 720 (56.5) | 21 (58.3) | 699 (56.5) | 0.823 |
| Platelet ≥ 150, n (%) | 554 (43.5) | 15 (41.7) | 539 (43.5) | |
| BM cellularity, %, median (range) | 35.0 (25.0–50.0) | 40.0 (25.0–55.0) | 35.0 (25.0–45.0) | 0.186 |
| BM blast, %, median (range) | 0.8 (0.4–1.57) | 1.2 (0.8–2.0) | 0.8 (0.4–1.4) | 0.019 |

ANC, absolute neutrophil count; BM, bone marrow; ICUS, idiopathic cytopenia of undetermined significance; MCV, mean corpuscular volume; WBC, white blood cell.

Final performance of the prediction model

To characterize the final prediction model, we first compared the performance of different classifiers on the base dataset and subsequently assessed the impact of incorporating text-derived features into the XGBoost model. On the base dataset, logistic regression achieved an AUROC of 0.622, an AUPRC of 0.061, and an accuracy of 0.745, whereas the XGBoost achieved an AUROC of 0.762, an AUPRC of 0.141, and an accuracy of 0.788, representing relative improvements of approximately 22.5% in AUROC, 131.1% in AUPRC, and 5.7% in accuracy compared with logistic regression.

When XGBoost was trained on the embedded dataset, which incorporated text-derived features, model performance further improved, yielding an AUROC of 0.780, an AUPRC of 0.175, and an accuracy of 0.815. These values correspond to relative increases of approximately 2.4% in AUROC, 24.1% in AUPRC, and 3.4% in accuracy compared with the base XGBoost model. A complete set of performance metrics for all models is provided in Table 2, with the corresponding receiver operating characteristic curves shown in Fig. 2.
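The relative improvements quoted above follow from a one-line formula, (new − baseline) / baseline × 100. The sketch below recomputes them from the rounded values reported in the text; the function name is hypothetical, and small last-digit differences from the published percentages can arise because the inputs are themselves rounded.

```python
def relative_improvement(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Logistic regression (base) -> XGBoost (base), values as reported in the text.
auroc_gain_base = relative_improvement(0.622, 0.762)
auprc_gain_base = relative_improvement(0.061, 0.141)

# XGBoost (base) -> XGBoost (embedded), values as reported in the text.
auroc_gain_embedded = relative_improvement(0.762, 0.780)
auprc_gain_embedded = relative_improvement(0.141, 0.175)
accuracy_gain_embedded = relative_improvement(0.788, 0.815)
```

Relative gains look large for AUPRC partly because its baseline is small; the absolute AUPRC values remain low, which is expected when only about 3% of patients are positive.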

Table 2. Final performance of the prediction model.

| Metric | Random Forest (base dataset) | Logistic regression (base dataset) | SVM (base dataset) | XGBoost (base dataset) | XGBoost (embedded dataset) |
|---|---|---|---|---|---|
| AUROC | 0.655 | 0.622 | 0.644 | 0.762 | 0.780 |
| AUPRC | 0.089 | 0.061 | 0.055 | 0.141 | 0.175 |
| Sensitivity | 0.611 | 0.583 | 0.806 | 0.778 | 0.722 |
| Specificity | 0.776 | 0.751 | 0.504 | 0.788 | 0.818 |
| PPV | 0.073 | 0.071 | 0.047 | 0.105 | 0.114 |
| NPV | 0.986 | 0.985 | 0.989 | 0.992 | 0.990 |
| Accuracy | 0.771 | 0.745 | 0.513 | 0.788 | 0.815 |
| MCC | 0.151 | 0.134 | 0.106 | 0.232 | 0.234 |

AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; SVM, support vector machine; XGBoost, extreme gradient boosting.

Fig. 2. Receiver operating characteristic curves comparing model performances across datasets.

Given the substantial class imbalance, MCC was also evaluated. The MCC increased from 0.134 for logistic regression on the base dataset to 0.232 for the base XGBoost model and 0.234 for the embedded XGBoost model, indicating that the performance gains were preserved even after accounting for class imbalance.

Explainability and application of the prediction model

To evaluate the impact of individual features on model predictions, SHAP analysis was performed, as illustrated in Fig. 3. Each dot in the plot represents data from a specific patient, with the dot color indicating feature values: red for higher values and blue for lower values. The SHAP value, displayed on the x-axis, quantifies the contribution of each feature to the model’s output. Positive SHAP values push the predicted risk of progression upward, whereas negative values push it downward. Key hematological indicators, such as red cell distribution width (RDW) and band neutrophils, were consistently identified as important predictors in both the structured-data model (Fig. 3A) and the embedding-enhanced model (Fig. 3B). Additionally, embedding-derived features (e.g., embed_723 and embed_304) were highlighted in Fig. 3B as significant, demonstrating the contribution of unstructured text data to model performance.

Fig. 3. SHAP summary plots for feature impact on model predictions using (A) the structured-data model and (B) the embedding-enhanced model.

Figure 4 presents SHAP waterfall plots, which were generated using data from four patients in the validation set to illustrate the key features influencing the predicted outcomes. Features associated with an increased risk are shown in red, while those associated with a reduced risk are in blue. The numerical values on each bar represent the specific contribution of each feature to the model’s prediction, with bolded values indicating the predicted probability of disease progression for each patient. Predicted probabilities of 0.25 and 0.32 were assigned to Patients 4A and 4B, respectively, both of whom progressed to myeloid malignancies. In contrast, probabilities of 0.07 and 0.08 were assigned to Patients 4C and 4D, respectively, who did not progress. Figure 4E depicts the distribution of SHAP scores across the entire dataset, showing that higher SHAP scores were associated with patients who progressed to myeloid malignancies, while lower SHAP scores were observed in those who did not progress. A cutoff value of 0.12 was identified. These findings suggest that the SHAP values can quantitatively assess individual disease progression risk in this cohort, supporting their potential clinical utility.
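A waterfall plot relies on the additive property of SHAP: a base value plus the per-feature contributions equals the model output for that patient. The sketch below illustrates this under two assumptions that the text does not confirm: contributions are taken to be in log-odds units (a common default for gradient-boosted classifiers) and the 0.12 cutoff is taken to apply on the same scale as the reported probabilities. The feature names and contribution values are hypothetical.

```python
import math


def waterfall_probability(base_log_odds, contributions):
    """Sum base value and feature contributions (log-odds), then map to a probability."""
    total = base_log_odds + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-total))


def flag_high_risk(probabilities, cutoff=0.12):
    """Mark patients whose score is at or above the cutoff (assumed scale, see lead-in)."""
    return [p >= cutoff for p in probabilities]


# Hypothetical patient: contributions are illustrative, not taken from Fig. 4.
example = waterfall_probability(
    base_log_odds=-3.0,
    contributions={"RDW": 1.2, "band_neutrophils": 0.7, "platelets": -0.1},
)

# Probabilities reported in the text for Patients 4A, 4B, 4C, and 4D.
flags = flag_high_risk([0.25, 0.32, 0.07, 0.08])
```

Under these assumptions, the two patients who progressed fall above the cutoff and the two who did not fall below it, which matches the outcomes described in the text.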

Fig. 4. Patient-specific SHAP values illustrating feature contributions to predictions. SHAP waterfall plots for patients who progressed (A,B) and those who did not (C,D), showing key feature contributions. (E) Distribution of SHAP scores across the dataset.

Discussion

In the present study, a machine learning-based prediction algorithm was developed using the XGBoost model combined with the PubMedBERT process for text embedding to integrate unstructured data from BM examination reports and cytogenetic analyses. This approach was implemented to enhance risk prediction for ICUS progression to myeloid malignancies. On the base dataset, which included only structured variables, logistic regression achieved an AUROC of 0.622 and an AUPRC of 0.061, whereas the XGBoost model achieved an AUROC of 0.762 and an AUPRC of 0.141. When text-derived embeddings from BM examination and cytogenetic reports were incorporated into the XGBoost model in the embedded dataset, performance further improved, with an AUROC of 0.780 and an AUPRC of 0.175. The MCC, which is sensitive to class imbalance, also increased from 0.134 for logistic regression to 0.232 for the base XGBoost model and 0.234 for the embedded XGBoost model, indicating that the improvement in discrimination was maintained even when evaluated with an imbalance-aware metric. However, direct comparisons should be interpreted with caution due to variations in patient cohorts and study designs. In a study predicting the risk of myeloid malignancies from CHIP/CCUS using molecular and laboratory data from a large U.K. Biobank cohort, an AUC of 0.788 was demonstrated using a prediction model based on recursive partitioning5. Additionally, in research conducted by Z Xie et al., a Clonal Cytopenia Risk Scoring system, developed using the Cox proportional hazards method, achieved a c-index of 0.64 for predicting leukemia-free survival in patients with CCUS7.

Notably, incorporation of the PubMedBERT-based text embedding process into the XGBoost algorithm resulted in a modest increase in the AUC of our prediction model, from 0.762 to 0.780. Given that our study population included patients with ICUS (encompassing both non-clonal ICUS and CCUS), the overall progression rate was relatively low compared to cohorts limited to CCUS. This may have limited the magnitude of improvement in predictive performance. To address this limitation and further enhance the model’s accuracy, we plan to establish a multicenter registry that includes a more diverse patient population. Additionally, the inclusion of embedded text data was reflected in the feature importance analysis, which was evaluated using SHAP-based scores. These findings suggest that the integration of textual data into conventional machine-learning algorithms enhances model performance.

In the studied cohort, progression to myeloid malignancies was observed in 2.82% of cases undergoing BM examination due to cytopenia. While limited reports exist on the progression rates from ICUS or CCUS to myeloid malignancies, this finding aligns with previous research. According to LD Weeks et al., progression to myeloid malignancies was reported in 269 out of 11,337 patients (2.37%) with CHIP/CCUS over a median follow-up of 11.7 years5. In a recent prospective cohort study of patients with cytopenia followed for a median of 4.54 years, disease progression was observed in 0.6% of patients with non-clonal ICUS and 16.5% of patients with CCUS6. Similarly, a study of 357 patients with CCUS reported a 13% progression rate to myeloid malignancies over a follow-up period of 27 months7. The median age of the studied cohort was 58 years, consistent with previously reported data for patients with non-clonal ICUS, ranging from 52 to 57 years (refs. 13–15). In addition, the median age at the time of progression to myeloid malignancies was 61 years. These findings also align with prior studies showing that patients with CCUS or CHIP tend to be older, with reported median ages ranging from 62 to 76 years (refs. 5–7, 13, 16).

Among the 36 patients who progressed to myeloid malignancies, cytogenetic results were available for 34. Of these, 5 patients harbored cytogenetic abnormalities: 3 with partial loss of the Y chromosome and 2 with abnormalities of uncertain significance. These patients cannot be classified as CCUS based solely on cytogenetic findings in the absence of somatic mutation data. Moreover, partial loss of the Y chromosome is often considered an age-related change and has been reported in up to 30% of older males, suggesting that such abnormalities may have limited clinical relevance. Nevertheless, a prior study has shown that ≥ 75% loss of the Y chromosome is associated with a significantly increased risk of developing myeloid malignancies17. Taken together, these findings underscore that patients with cytopenias and cytogenetic abnormalities, including those presumed to be age-related or of uncertain significance, should undergo careful monitoring.

In our previous study, in which we developed a model for predicting long-term survival after allogeneic hematopoietic cell transplantation in patients with hematologic malignancies, we demonstrated the potential to identify more suitable transplantation-related factors using SHAP scores18. Similarly, in the present study, we developed a SHAP-based score to predict the risk of progression from ICUS to myeloid malignancies. This score could assist in clinical decision-making by identifying patients at higher risk who may benefit from more frequent monitoring to enable earlier detection of disease progression. Although standardized clinical approaches for patients with ICUS remain lacking, the proposed model offers guidance that could inform clinical practice, potentially leading to earlier intervention and improved patient outcomes.

In our SHAP analysis, RDW and band neutrophils emerged as the most important predictors of disease progression in this cohort. RDW, a measure of red cell volume heterogeneity, has previously been reported as a predictive marker for the diagnosis of MDS in patients with cytopenia19. This aligns with findings from a large population-based study demonstrating that individuals with MDS have significantly higher RDW compared to the control group20. In addition, Jaiswal et al. showed that elevated RDW was associated with older age and the presence of hematologic malignancy-related gene mutations at the time of CHIP detection, further supporting its relevance as an early hematologic indicator21. Band neutrophils, representing immature neutrophil forms, may increase in response to conditions such as acute inflammation and infection. Although no established link exists between bandemia and the risk of progression to myeloid malignancies, elevated band counts could reflect an early inflammatory or stress-related hematopoietic state preceding overt disease progression.

The findings of our study highlight that integrating cytogenetic data and unstructured text from BM examination reports into the predictive model enhances its performance and reliability. Notably, the incorporation of unstructured text data through the PubMedBERT method resulted in an improvement in the AUC from 0.762 to 0.780. This result underscores the value of leveraging the rich clinical information contained in free-text data to complement the limitations of structured variables, thereby strengthening predictive capabilities. Furthermore, SHAP analysis provides a visual representation of the contribution of textual data to model predictions, enhancing interpretability. By representing each patient’s predicted likelihood of disease progression through SHAP values, the clinical applicability of our model is reinforced.

Our study has several limitations. First, the study was conducted on a relatively small number of patients at a single center. Although we used an 80:20 train-test split with stratified three-fold cross-validation, external validation in a larger cohort is necessary to further confirm model performance. Additionally, genetic information, such as targeted sequencing, was not included in the analysis. However, given that next-generation sequencing for all patients undergoing BM examinations due to cytopenia may not be feasible, the developed prediction model could offer greater practicality for real-world clinical applications.

In conclusion, we developed a machine learning-based algorithm to predict the progression of ICUS to myeloid malignancies. By incorporating clinical variables with PubMedBERT-based text embeddings, we enhanced predictive performance, highlighting the added value of unstructured data in model optimization. Furthermore, this prediction model was proposed as a potential clinical tool for identifying high-risk patients who may benefit from closer monitoring and timely intervention.

Author contributions

Hyunkyung Park, Tae Joon Jun, and Eun-Ji Choi designed the study. Ji-Ye Han and Tae Joon Jun generated ML models and performed analysis. Hyunkyung Park, Ji-Ye Han, and Eun-Ji Choi wrote the manuscript. Han-Seung Park, Yunsuk Choi, Jung-Hee Lee, Je-Hwan Lee, Kyoo-Hyung Lee, and Young-Hak Kim provided data and edited the manuscript.

Funding

This work was supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR21C0198).

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval

Approval for the study protocols was granted by the Institutional Review Board (IRB) of Asan Medical Center (No. 2022-0280) in accordance with the Declaration of Helsinki (2008). Data from the Asan Biomedical Research Environment (ABLE), a de-identified EMR database maintained by Asan Medical Center, were used for this study. Because the ABLE database comprises anonymized patient information, the study was exempt from the requirement for informed consent. All experiments were conducted in compliance with relevant guidelines and regulations.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Hyunkyung Park and Ji-Ye Han contributed equally as co-first authors.

Tae Joon Jun and Eun-Ji Choi contributed equally as co-corresponding authors.

Contributor Information

Tae Joon Jun, Email: taejoon@amc.seoul.kr.

Eun-Ji Choi, Email: imeunjeee@gmail.com.

References
