PLOS One. 2025 Mar 12;20(3):e0315408. doi: 10.1371/journal.pone.0315408

Research and analysis of differential gene expression in CD34+ hematopoietic stem cells in myelodysplastic syndromes

Min-xiao Wang 1,2,3, Chang-sheng Liao 2,4, Xue-qin Wei 1,2, Yu-qin Xie 1,2, Peng-fei Han 4,*, Yan-hui Yu 1,3,*
Editor: Francesco Bertolini 5
PMCID: PMC11902259  PMID: 40073065

Abstract

Objective

This study aims to identify and analyze the differentially expressed genes (DEGs) in CD34+ hematopoietic stem cells (HSCs) from patients with myelodysplastic syndromes (MDS) through bioinformatics analysis, with the ultimate goal of uncovering the potential molecular mechanisms underlying the pathogenesis of MDS. The findings are expected to provide novel insights into clinical treatment strategies for MDS.

Methods

We downloaded three datasets (GSE81173, GSE4619, and GSE58831) from the public Gene Expression Omnibus (GEO) database as training sets and selected the GSE19429 dataset as the validation set. To ensure data consistency and comparability, we standardized the training sets and removed batch effects using the ComBat algorithm, thereby integrating them into a unified gene expression dataset. We then conducted differential expression analysis to identify genes with significant changes in expression levels across different disease states. To enhance prediction accuracy, we trained six common predictive models on the filtered differential gene expression dataset. After comprehensive evaluation, we selected three algorithms (Lasso regression, random forest, and support vector machine (SVM)) as our core predictive models. To pinpoint genes closely related to disease characteristics, we applied these three machine learning methods and took the intersection of their results, yielding a more robust list of genes associated with disease features. We then analyzed these key genes in depth in the training set and validated the results independently using the GSE19429 dataset. Furthermore, we performed differential analysis of gene groups, co-expression analysis, and enrichment analysis to explore the roles of these genes in disease initiation and progression. Through these analyses, we aim to provide new insights and foundations for disease diagnosis and treatment. Fig 1 illustrates the data preprocessing and analysis workflow of this study.

Results

Our analysis of differentially expressed genes (DEGs) in CD34+ hematopoietic stem cells (HSCs) from patients with myelodysplastic syndromes (MDS) revealed significant differences in gene expression patterns compared to the control group (individuals without MDS). Specifically, the expression levels of two key genes, IRF4 and ELANE, were notably downregulated in CD34+ HSCs of MDS patients, suggesting that their reduced expression is involved in the pathological process of MDS.

Conclusion

This study sheds light on the potential molecular mechanisms underlying MDS, with a particular focus on the pivotal roles of IRF4 and ELANE as key pathogenic genes. Our findings provide a novel perspective for understanding the complexity of MDS and exploring therapeutic strategies. They may also guide the development of precise and effective treatments, such as targeted interventions directed against these genes.

Introduction

Myelodysplastic syndromes (MDS) represent a highly heterogeneous group of diseases characterized by abnormal bone marrow cellular development [1,2], ineffective hematopoiesis, cytopenia, and a notable risk of transformation into acute myeloid leukemia (AML) [3,4]. With an annual incidence of 2.1 to 12.6 per 100,000 individuals, MDS becomes far more common among individuals over 70 years old, in whom the rate reaches 25 times that of the general population [5,6]. Notably, MDS patients in the Asia-Pacific region account for more than half of the global caseload, and China also faces a persistently high incidence rate [7]. Mortality among MDS patients varies according to disease severity and treatment response, with lower risks for those with mild symptoms and favorable therapeutic outcomes, and higher risks for those with rapidly progressing disease or severe complications [8]. The intricate pathogenesis of MDS involves genetic, epigenetic, immunological, and environmental factors, which collectively contribute to the imbalance of the bone marrow microenvironment and the impairment of hematopoietic stem and progenitor cell (HSPC) function [9,10]. Clinically, patients with severe MDS frequently experience symptoms such as anemia, recurrent infections, and bleeding, which significantly compromise their quality of life and shorten their survival [11]. CD34, a highly specific surface marker, is widely expressed on early hematopoietic stem/progenitor cells and is vital for their identification, isolation, and investigation [12]. Aberrant proliferation, differentiation defects, or apoptotic imbalance in CD34+ cells are considered pivotal in the pathogenesis of MDS, directly correlating with abnormalities in bone marrow hematopoiesis and the diversity of disease phenotypes. Given the central role of CD34+ cells in hematopoiesis, this study specifically focuses on gene expression changes in CD34+ hematopoietic stem cells from MDS patients [13,14,15].

Although the etiology of MDS remains incompletely understood, the rapid development of advanced technologies such as high-throughput sequencing has enabled the identification of an increasing number of MDS-related genetic mutations. These discoveries have opened new avenues for molecular diagnosis, classification, and prognostic assessment of MDS [16,17]. Concurrently, the emergence of immunomodulatory agents, hypomethylating agents, and targeted therapies has offered new therapeutic options and hope for MDS patients. Using bioinformatics approaches, we aim to identify biomarkers and disease-related genes closely associated with the pathological processes of MDS. Against this backdrop, our study endeavors to contribute to precision medicine in MDS by exploring gene expression changes in CD34+ cells. We anticipate that this research will not only provide novel insights into the pathogenesis of MDS but also establish a molecular foundation for early disease diagnosis, the formulation of precision treatment strategies, and the improvement of prognostic assessment systems. The key steps of the research process are summarized as a flowchart in Fig 1.

Fig 1. Flow chart of research design and analysis.


I. Materials and methods

1. Data acquisition and processing

This study primarily used microarray data for gene expression analysis. We obtained gene expression datasets of bone marrow CD34+ hematopoietic stem cells from patients with myelodysplastic syndromes (MDS) and healthy individuals from the publicly accessible Gene Expression Omnibus (GEO) database. To locate the required data, we searched GEO using keywords such as “MDS”, “microarray”, and “human samples”, together with terms describing disease-specific gene expression patterns, and extracted the platform description files and series matrix files. These files contain the core experimental data, including the gene expression levels of the samples. We then processed these files by removing missing values and handling duplicate genes, generating a gene expression matrix file that serves as the core input for our subsequent analyses. We ultimately included four datasets: GSE81173, GSE4619, GSE19429, and GSE58831. GSE81173 comprises 18 samples (12 disease samples and 6 control samples); GSE4619 contains 66 samples (55 disease samples and 11 control samples); GSE19429 includes 200 samples (183 disease samples and 17 control samples); and GSE58831 consists of 176 samples (159 disease samples and 17 control samples). It is noteworthy that GSE81173 originates from a Chinese laboratory, while GSE4619, GSE19429, and GSE58831 all come from the same research institution in the UK.

Detailed age and gender information was not available for all samples, which somewhat limits the depth of our analysis. Because the primary objective of this study is to identify differentially expressed genes in bone marrow CD34+ hematopoietic stem cells in MDS and to explore their roles in disease progression, we nevertheless included these datasets in our analysis.

2. Data preprocessing and identification of differentially expressed genes

To ensure the comparability of data across samples, we first applied the normalizeBetweenArrays method to eliminate systematic biases in the four gene expression matrices derived from CD34+ cells of normal controls and MDS patients. Subsequently, we used a differential expression analysis package to compare gene expression profiles within each of these four standardized matrices. We chose DESeq2, a widely recognized and extensively validated R package designed for differential expression analysis of count data. DESeq2 models gene expression data using a negative binomial distribution and employs the Wald test to assess whether expression levels differ significantly. For this initial assessment, we set a relatively lenient fold-change threshold in order to capture as many potentially differential genes as possible. Specifically, genes were selected if they had an adjusted P-value (adjP) less than 0.05 and an absolute log2-transformed fold change (|log2FC|) greater than or equal to 1. These criteria were intended to identify statistically significant expression differences and to provide candidate genes for subsequent in-depth research.
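As an illustration of the selection criteria only, the two thresholds can be applied to a table of per-gene results as in the following Python sketch. The gene names and values are hypothetical placeholders and the function name select_degs is our own; the analysis itself was carried out in R.

```python
# Hypothetical per-gene results: (gene, log2 fold change, adjusted P-value).
# All values are made up for illustration; they are not results of the study.
results = [
    ("GENE_A", -1.8, 0.0004),
    ("GENE_B", 0.4, 0.0100),   # significant, but fold change too small
    ("GENE_C", 1.2, 0.2000),   # large fold change, but not significant
    ("GENE_D", 1.0, 0.0300),   # exactly on the |log2FC| boundary, so kept
]

ADJ_P_CUTOFF = 0.05    # adjusted P-value must be below this value
LOG2FC_CUTOFF = 1.0    # |log2FC| must be at least this value


def select_degs(rows, adj_p_cutoff=ADJ_P_CUTOFF, log2fc_cutoff=LOG2FC_CUTOFF):
    """Return (gene, direction) pairs for rows that pass both thresholds."""
    selected = []
    for gene, log2fc, adj_p in rows:
        if adj_p < adj_p_cutoff and abs(log2fc) >= log2fc_cutoff:
            direction = "up" if log2fc > 0 else "down"
            selected.append((gene, direction))
    return selected


degs = select_degs(results)
print(degs)
```

Both conditions must hold at once: a gene with a small adjusted P-value but a small fold change is excluded, as is a gene with a large fold change that is not statistically significant.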

3. Batch effect correction and differential analysis

To mitigate the potential impact of batch effects introduced by different experimental platforms, this study employed a dedicated algorithm for batch correction of the data. Specifically, we processed the normalized expression datasets from GSE81173, GSE4619, and GSE58831, which were included in the training set, using the “sva” package in R for batch effect elimination. Following batch correction, we merged the datasets under the same disease status. Subsequently, we utilized the “limma” package to perform a differential expression analysis on the combined, batch-corrected data, with the aim of identifying genes exhibiting significant differences [10]. This approach allowed us to focus on biologically meaningful variations in gene expression, rather than spurious effects arising from experimental artifacts, thereby enhancing the robustness and reliability of our findings.
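To convey the basic idea of a batch adjustment, the following Python sketch shifts the expression values of one gene so that every batch has the same mean. This is a deliberately simplified, location-only adjustment and is not the ComBat procedure from the “sva” package, which additionally adjusts scale, applies empirical Bayes shrinkage across genes, and can protect covariates such as disease status. The function name center_batches and the expression values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift each batch so that its mean equals the overall mean (one gene).

    Simplified illustration only: unlike ComBat, this does not adjust the
    variance, does not borrow information across genes, and would also
    remove real biological differences that are confounded with batch.
    """
    grand_mean = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical expression values of one gene in two batches.
expr = [5.0, 6.0, 7.0, 9.0, 10.0, 11.0]
batch = ["GSE4619"] * 3 + ["GSE58831"] * 3

corrected = center_batches(expr, batch)
print(corrected)
```

After the adjustment, the within-batch differences between samples are unchanged, while the offset between the two batches is removed.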

4. Training and selection of common prediction models

For the screened differential gene expression dataset, this study aims to find the most suitable prediction model for this dataset by training and comparing various machine learning models. To this end, we have adopted the following six common machine learning models: Random Forest (RF), LASSO Regression, Support Vector Machine (SVM), Neural Network (NN), Logistic Regression (LR), and Gradient Boosting Machine (GBM).

4.1 Overview of toolkits and functions.

To complete the model training, evaluation, interpretation, and visualization, this study uses the following R packages and their specific functions:

caret: Used for data preprocessing, model training, and performance evaluation, especially by setting cross-validation through the trainControl function.

DALEX: Provides model interpretation and evaluation functions, calculating performance metrics of different models through the model_performance function.

ggplot2: Used for drawing various graphs to intuitively display data and results.

randomForest: Specifically used for training Random Forest models.

kernlab: Supports training Support Vector Machine (SVM) models.

pROC: Used for calculating and plotting ROC curves, extracting AUC values to quantify the model’s discriminative ability.

glmnet: Supports the training of LASSO Regression models.

nnet: Used for training Neural Network models.

e1071: Provides training functions for Logistic Regression (LR) models.

tidyr: Used for data processing and format conversion, ensuring data meets analysis requirements.

4.2 Model training and evaluation process.

4.2.1 Data Preprocessing: Firstly, use tidyr and other related tools to perform necessary cleaning and format conversion on differential gene expression data, including handling missing values, data normalization, etc., to ensure data quality and analysis accuracy.

4.2.2 Model Training:

Random Forest (RF): Trained using the caret and randomForest packages.

LASSO Regression: Trained using the glmnet package, selecting the best regularization parameter through cross-validation.

Support Vector Machine (SVM): Trained using the kernlab package, adjusting kernel functions and regularization parameters to optimize performance.

Neural Network (NN): Trained using the nnet package, adjusting network structure and hyperparameters to improve prediction accuracy.

Logistic Regression (LR): Trained with caret and regularized using glmnet, with model complexity and overfitting controlled by adjusting alpha and lambda.

Gradient Boosting Machine (GBM): Trained by calling the GBM algorithm through the caret package, adjusting parameters such as the number of trees, depth, and learning rate.

4.2.3 Model Evaluation: Use the DALEX package to interpret and evaluate the performance of each trained model.

Obtain key performance indicators such as accuracy, recall, F1 score, specificity, and AUC values through the model_performance function.

Use the pROC package to calculate and plot ROC curves of each model, intuitively displaying the model’s discriminative ability.

Use the confusionMatrix function of the caret package to further analyze the classification performance of the models.

4.2.4 Summary and Discussion of Results: Finally, summarize all model performance indicators in a data frame, including model name, accuracy, recall, specificity, F1 score, and AUC value. These indicators are then compared to analyze the advantages and disadvantages of the different models and to justify the choice of the final models. The ggplot2 package is also used to display the indicators visually for intuitive comparison of model performance.
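For reference, the threshold-based indicators listed above follow directly from the four cells of a confusion matrix. The Python sketch below computes them from true and predicted labels; the labels are hypothetical, the function name classification_metrics is our own, and the sketch stands in for, rather than reproduces, the DALEX and caret calls used in R.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, recall, specificity, and F1 from label vectors."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical labels: 1 = MDS, 0 = control.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With strongly imbalanced groups, accuracy alone can be misleading, which is one reason recall and specificity are reported alongside it.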

4.3 Model selection discussion.

We selected RF, LASSO, SVM, NN, LR, and GBM as core prediction models, mainly based on their universality and recognition, ability to handle high-dimensional data and nonlinear relationships, and expected performance on our specific dataset. RF and GBM can handle complex nonlinear relationships and high-dimensional data; LASSO achieves feature selection through regularization, suitable for datasets with a large number of features; SVM performs well in handling small sample data and nonlinear classification problems; NN, although requiring a large amount of data, has strong nonlinear modeling capabilities, making it potentially powerful in certain situations; LR is particularly well-suited for binary classification problems, featuring a simple implementation, high computational efficiency, and strong interpretability. Although other models, such as decision trees and naive Bayes, may also excel in certain situations, we have prioritized the selected models for their demonstrated stability and reliability in tackling the intricate non-linear relationships and high-dimensional features characteristic of gene expression data. We believe these models are better suited to meet our research needs.

5. Screening for disease-feature-related genes

In this study, we used three machine learning algorithms: Random Forest (RF), Lasso Regression, and Support Vector Machine (SVM) to screen for genes related to disease features. To ensure the accuracy, reliability, and robustness of the screening results, we decided to take the intersection of genes identified by these three algorithms as the final research subjects.

5.1 Algorithm selection and implementation.

Random Forest (RF): We implemented the Random Forest algorithm using the “randomForest” package in R. This algorithm improves the accuracy of classification or regression by building multiple decision trees and aggregating their prediction results. We ranked the genes based on their importance scores in the Random Forest model and selected genes with higher scores as candidate genes.

Lasso Regression: We used the Lasso Regression algorithm from the “glmnet” package in R to identify genes related to disease features. Lasso Regression introduces an L1 regularization term to achieve feature selection, capable of handling high-dimensional data and reducing overfitting. We adjusted the regularization parameter to select the most predictive genes.

Support Vector Machine (SVM): We implemented the SVM algorithm using the “e1071” package in R. SVM is a supervised learning algorithm based on the principle of structural risk minimization, adept at handling high-dimensional data and complex classification problems. We optimized model performance by adjusting SVM parameters and kernel function types and screened out genes strongly related to disease features.

5.2 Intersection gene screening.

After completing the analysis with the above three algorithms, we adopted an intersection screening strategy for two reasons. First, numerous similar studies have used this method and reported that it improves the accuracy and reliability of screening results. Second, each algorithm has its own advantages and limitations when screening genes, and taking the intersection compensates for the shortcomings of any single method, yielding more robust and reliable disease feature-related genes. Specifically, we retained only the genes identified as related to disease features by all three algorithms as the final research subjects. This method not only enhances the robustness of the screening results but also provides a more accurate and reliable gene set for subsequent disease research and diagnosis.
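The intersection step itself is a plain set operation, as the following Python sketch shows. Apart from IRF4 and ELANE, the gene names are hypothetical placeholders, and the candidate lists are much shorter than the real ones.

```python
# Candidate genes selected by each algorithm (placeholders except IRF4/ELANE).
lasso_genes = {"IRF4", "ELANE", "GENE_A", "GENE_B"}
rf_genes = {"IRF4", "ELANE", "GENE_C"}
svm_genes = {"ELANE", "IRF4", "GENE_A"}

# Keep only the genes selected by all three algorithms.
shared = sorted(lasso_genes & rf_genes & svm_genes)
print(shared)
```

A gene selected by only one or two of the algorithms, such as GENE_A here, is discarded, which makes the final list conservative.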

6. Disease signature gene validation in the training set

We re-introduced the identified disease signature genes, IRF4 and ELANE, into the batch-corrected and consolidated training set of gene expression data. Utilizing the stat_compare_means function from the ggpubr package in R, we rigorously validated the differential expression of these two genes between the control and experimental groups, with a predefined significance threshold of p < 0.001. Following this, we employed the pROC package in R to plot ROC curves, enabling a quantitative assessment of the potential value of these genes as disease biomarkers within the training set.
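The AUC of a single-gene classifier can be read as the probability that a randomly chosen disease sample scores as more disease-like than a randomly chosen control sample. The Python sketch below computes it by direct pairwise comparison; the expression values are hypothetical, the function name auc_from_scores is our own, and, as an assumption of this sketch, the score is the negated expression value so that lower expression counts as more disease-like.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as P(case score > control score), counting ties as one half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical expression values of one downregulated gene.
mds_expr = [4.1, 4.8, 5.0, 5.6, 6.3]
control_expr = [5.9, 6.5, 7.0]

# Negate expression so that a higher score means "more disease-like".
auc = auc_from_scores([-v for v in mds_expr], [-v for v in control_expr])
print(round(auc, 3))
```

An AUC of 0.5 corresponds to no discrimination and an AUC of 1.0 to perfect separation of the two groups.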

7. Validation of disease signature genes in the validation set

To validate the findings from the training set, we further incorporated the standardized GSE19429 validation set of gene expression data. Utilizing the stat_compare_means function from the ggpubr package in R once again, we re-verified the differential expression of IRF4 and ELANE genes between the control and experimental groups, maintaining the significance threshold at p < 0.001. Following this validation step, we employed the pROC package in R to plot ROC curves, allowing for a second round of quantitative assessment of the potential value of these two genes as disease biomarkers within the validation set.

8. Differential analysis and co-expression analysis of disease signature gene groups

Based on the disease signature genes IRF4 and ELANE, we used R to stratify the batch-corrected and consolidated training set of gene expression data into two distinct groups. These groups were defined by the expression levels of IRF4 and ELANE, with one group representing high expression (samples with elevated levels of both IRF4 and ELANE) and the other representing low expression (samples with relatively lower levels). After grouping, we used the limma package in R to conduct a detailed differential expression analysis of the remaining genes between these two groups. This step aimed to identify genes whose expression differed significantly between the high and low expression groups, potentially revealing additional genes associated with the disease characteristics under investigation. Subsequently, we employed the corrplot package in R to perform a gene co-expression analysis.
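The grouping step can be sketched in Python as follows. The cutoff used to separate high from low expression is not stated here, so the median split for a single gene in this sketch is an assumption, and the sample identifiers, expression values, and function name split_by_median are hypothetical.

```python
from statistics import median


def split_by_median(expression):
    """Split samples into high and low groups at the median of one gene.

    expression maps sample ID -> expression value. Samples strictly above
    the median form the high group; all other samples form the low group.
    """
    cutoff = median(expression.values())
    high = sorted(s for s, v in expression.items() if v > cutoff)
    low = sorted(s for s, v in expression.items() if v <= cutoff)
    return high, low


# Hypothetical IRF4 expression values for six samples.
irf4 = {"S1": 7.2, "S2": 5.1, "S3": 6.4, "S4": 4.9, "S5": 8.0, "S6": 5.8}

high_group, low_group = split_by_median(irf4)
print(high_group, low_group)
```

The two sample lists then define the groups that are compared in the subsequent differential expression analysis.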

9. Enrichment analysis

For the identified disease-specific genes IRF4 and ELANE, along with their closely related co-expressed genes, we utilized the clusterProfiler package in R to conduct Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. Additionally, we employed the GSEA (Gene Set Enrichment Analysis) function within the clusterProfiler package to further analyze gene sets. During the analysis, we set the significance level at a P-value less than 0.05, a criterion used to filter out statistically significant biological processes, cellular components, molecular functions, biological pathways, and disease-related pathways that are enriched in either the high or low expression groups defined by IRF4 and ELANE.
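Over-representation analysis of GO terms and KEGG pathways is commonly framed as a one-sided hypergeometric test: given a background of N genes of which K belong to a term, how likely is it that a selection of n genes contains at least k members of that term? The Python sketch below evaluates this tail probability with hypothetical counts; the function name hypergeom_enrichment_p is our own, and the sketch illustrates the principle rather than the clusterProfiler implementation.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, n_in_term of which belong to the term."""
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 in the term,
# 4 selected genes, 3 of which belong to the term.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(round(p_value, 4))
```

In practice one such P-value is computed per term, so the resulting values are usually corrected for multiple testing before the 0.05 threshold is applied.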

10. Immune-related functional analysis of disease-specific genes

We employed the “GSVA” package in R to perform Single-Sample Gene Set Enrichment Analysis (ssGSEA) on the disease-specific genes. This analysis aimed to assess the enrichment levels of immune-related gene sets, providing insights into the immune-related functionalities associated with the identified disease-specific genes.

11. Immunocell infiltration and immune cell correlation analysis of disease-specific genes

We utilized the CIBERSORT package in R, along with the corresponding CIBERSORT algorithm, to conduct an immunocell infiltration analysis on the batch-corrected and combined gene expression dataset from the training group. To ensure statistical significance in our results, we set the significance level at a P-value less than 0.05 and applied this criterion to filter out immune cell types that exhibited significant differences between the experimental and control groups. Subsequently, to delve deeper into the potential associations between disease-specific genes and immune cell abundances, we employed the cor.test function to perform Pearson correlation analysis. Using the same significance threshold of P-value less than 0.05, we successfully identified immune cell types that demonstrated significant correlations with the expression of disease-specific genes. This analysis provided valuable insights into the interplay between disease-related genetic signatures and the immune system’s cellular components.
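The correlation step can be illustrated with a short Python sketch that computes the Pearson coefficient between a gene's expression and the estimated fraction of one immune cell type, together with the t statistic (n - 2 degrees of freedom) on which the test of a Pearson correlation is based. The values and function names are hypothetical, and the conversion of the t statistic into a P-value is omitted.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical gene expression and immune cell fractions for five samples.
gene_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
cell_fraction = [0.10, 0.20, 0.25, 0.45, 0.50]

r = pearson_r(gene_expr, cell_fraction)
t_stat = correlation_t_statistic(r, len(gene_expr))
print(round(r, 3), round(t_stat, 2))
```

Because one test is run per immune cell type, a nominal threshold of 0.05 applies to each test separately.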

12. Statistical methods, software, and tools

All statistical analyses were conducted within the R environment. A two-sample t-test analyzed gene expression differences between groups. The Benjamini-Hochberg method was applied for multiple testing correction to control the false discovery rate (FDR). R language (version 4.4.1) was utilized for data processing and statistical analysis.
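The Benjamini-Hochberg adjustment can be written in a few lines, as in the following Python sketch with hypothetical P-values. Each P-value is multiplied by m/rank, where m is the number of tests and rank is its position in ascending order, and the results are made monotone from the largest P-value downward. The function name benjamini_hochberg is our own.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending P-values
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of a larger raw P-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
print([round(p, 4) for p in adj_p])
```

Genes whose adjusted P-value falls below the chosen threshold are then declared significant at the corresponding false discovery rate.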

II. Results

1. Data acquisition and processing

After downloading the platform description files and series matrix files for the four datasets, we conducted preliminary organization and merging. Subsequently, we obtained gene expression matrix files for each of the four datasets. Specifically: For the GSE81173 dataset, we generated a gene expression matrix containing 19,461 gene expression data points. For the GSE4619 dataset, we obtained a gene expression matrix with 22,880 gene expression data points. The GSE58831 dataset yielded a gene expression matrix also comprising 22,880 gene expression data points. Lastly, the GSE19429 dataset provided a gene expression matrix with 22,880 gene expression data points.

2. Identification of Differentially Expressed Genes (DEGs)

Through differential expression analysis of the gene expression datasets, we identified 370 DEGs in the GSE81173 dataset, 179 DEGs in the GSE58831 dataset, 46 DEGs in the GSE4619 dataset, and 68 DEGs in the GSE19429 dataset (Fig 2A-2D).

Fig 2. Heatmaps of differentially expressed genes. A. Heatmap of differentially expressed genes in GSE81173. B. Heatmap for GSE58831. C. Differentially expressed genes in GSE4619. D. Heatmap for the validation group GSE19429. In these heatmaps, red indicates upregulated genes and blue represents downregulated genes. “Control” denotes the normal control group, while “Treat” refers to MDS.

3. Batch correction and differential analysis

Following the standardization process, we initially integrated the gene expression datasets from GSE81173, GSE58831, and GSE4619. Subsequently, we eliminated batch effects from the combined dataset. Then, we performed differential expression analysis on the batch-corrected and combined training set gene expression dataset, resulting in the identification of 110 differentially expressed genes along with their expression levels (Fig 3A-3C).

Fig 3. A. Scatter plot of the training set data before batch correction. B. Scatter plot of the training set data after batch correction. Dots (●) represent control group data; triangles (▲) represent experimental group data. Color code: red for the GSE4619 dataset, green for the GSE58831 dataset, blue for the GSE81173 dataset. C. Heatmap of the combined training set data after batch effect removal, where yellow-green represents the normal control group, blue represents the disease group, orange, pink, and green distinguish the different datasets included, red indicates high expression, and blue indicates low expression.

4. Training and selection of commonly used predictive models

We trained six models (RF, LASSO, SVM, NN, LR, and GBM) and systematically evaluated their performance using accuracy, recall, F1 score, specificity, and AUC value. Although all models performed similarly on most metrics, RF, NN, and GBM demonstrated excellent performance across all evaluated aspects. Particularly in terms of AUC value, RF, NN, and GBM all reached or approached 1.000, while SVM was 0.997, LASSO was 0.996, and KNN also performed well with an AUC value of 0.998 (Fig 4A-4E).

Fig 4. A. Bar chart for accuracy evaluation of six models. B. Bar chart for recall evaluation of six models. C. Bar chart for F1 score evaluation of six models. D. Bar chart for specificity evaluation of six models. E. ROC curves showing the classification performance of the six machine learning algorithms in the differential gene analysis task. The AUC value quantitatively reflects the overall classification capability of each model.

However, after an in-depth analysis of model applicability and data characteristics, we made the following choices:

Reason for not choosing NN (Neural Network): Despite its outstanding performance in the evaluation, NN has a high model complexity and typically requires a vast amount of training data to achieve ideal performance. In our study, there were 226 experimental samples and only 34 control samples, a pronounced imbalance between the groups and a relatively small sample size overall, which may be insufficient to fully train the NN model and may limit its generalization ability. Furthermore, the parameter tuning process for NN is cumbersome and requires substantial computational resources, making it potentially not the most efficient choice for this study.

Reason for not choosing LR (Logistic Regression): While LR is a simple model with high computational efficiency and stable performance on small sample data, it may struggle with handling nonlinear relationships and high-dimensional data. Our differential gene expression data may contain complex nonlinear relationships and high feature dimensions, making LR potentially inadequate in capturing underlying patterns in the data.

Reason for not choosing GBM (Gradient Boosting Machine): GBM is a powerful ensemble learning method capable of handling nonlinear relationships and high-dimensional data. However, in our study, due to the large difference in sample size between the experimental and control groups, GBM may have difficulty balancing the importance of different class samples during training, leading to model bias. Additionally, GBM models typically contain numerous parameters, making the tuning process complex and requiring substantial computational resources.

Conversely, the three models of RF (Random Forest), LASSO (Lasso Regression), and SVM (Support Vector Machine) are highly recognized in the field of bioinformatics, with solid theoretical foundations and mature implementation methods [18-20]. They can handle nonlinear relationships, high-dimensional data, and imbalanced sample sizes, with good model interpretability. Given our data characteristics, these three models performed well and were relatively robust. Therefore, after comprehensive consideration, we decided to include the three machine learning algorithms of RF, LASSO, and SVM for further research.

5. Screening of disease-specific genes

We applied Lasso regression, the Random Forest algorithm, and the Support Vector Machine (SVM) algorithm to a gene pool containing 110 differentially expressed genes to screen for candidate disease-specific genes. Lasso regression identified 25 candidate genes from the gene pool, the Random Forest algorithm selected 8 candidate genes, and the SVM algorithm independently selected 8 candidate genes. To achieve higher accuracy and reliability in identifying disease-specific genes, we took the intersection of the gene sets selected by these algorithms. This comparison yielded two disease-specific genes, IRF4 and ELANE. Both genes were selected by all three algorithms and were therefore considered strong candidates for disease-specific genes (Fig 5A-5F).

Fig 5. Screening of variables based on Lasso regression. A. Variation characteristics of the variable coefficients. B. Selection of the optimum value of the parameter λ in the Lasso regression model by cross-validation. C. Bubble chart of disease-related gene importance obtained by the Random Forest algorithm. D. Error rate plot obtained by the Support Vector Machine (SVM) algorithm, where the x-axis represents the size of the feature subset and the y-axis represents the corresponding error rate. E. Accuracy plot obtained by the SVM algorithm, where the x-axis represents the size of the feature subset and the y-axis represents the corresponding accuracy rate. F. Schematic diagram of the disease-related genes commonly identified by the three algorithms (Lasso regression, SVM, and Random Forest (RF)).

6. Validation of disease-specific genes in the training set

Initially, we conducted a validation analysis within the training set, using a p-value < 0.001 as the criterion for statistical significance. The results indicated significant differences in the expression levels of IRF4 and ELANE between the experimental group and the normal control group within the training set gene expression dataset. Specifically, both IRF4 and ELANE were expressed at lower levels in the experimental group than in the control group. Subsequently, based on the training set, we constructed ROC curves for the disease-specific genes IRF4 and ELANE, yielding an AUC value of 0.929 for IRF4 and 0.799 for ELANE (Fig 6A-6B).

Fig 6. A-B. Expression of IRF4 and ELANE in the experimental group compared with the normal control group. Both genes showed significantly reduced expression in the MDS group, highlighting their potential role in the pathogenesis of MDS. C-D. ROC curves for the disease-related genes IRF4 and ELANE, constructed from the merged training dataset after batch effect removal.
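As an illustration of how an AUC value of this kind can be obtained from a single gene, the sketch below computes the AUC as the probability that a randomly chosen disease sample scores higher than a randomly chosen control (the Mann-Whitney formulation). The expression values are invented toy data, and the choice to use negated expression as the disease score is an assumption made because the gene is lower in disease; the study's own ROC implementation is not specified here.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy expression values (invented). 1 = disease sample, 0 = control.
expression = [2.1, 2.4, 3.0, 2.8, 4.0, 4.2, 4.5, 2.9]
is_disease = [1, 1, 1, 0, 0, 0, 0, 1]

# Lower expression should indicate disease, so negate it to get a score.
auc = roc_auc([-x for x in expression], is_disease)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation of the two groups.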

7. Independent validation using an independent dataset

To further validate our findings, we analyzed the independent dataset GSE19429, again applying p < 0.001 as the threshold for statistical significance. The expression levels of IRF4 and ELANE also differed significantly between the experimental group and the normal control group in this dataset, with both genes expressed at lower levels in the experimental group. ROC curves constructed for IRF4 and ELANE on GSE19429 yielded an AUC of 0.938 for IRF4 and 0.791 for ELANE (Fig 7A-7D).

Fig 7. A-B. Expression of IRF4 and ELANE in MDS. Compared with the normal control group, the expression of both genes was significantly reduced in the MDS group (P < 0.001), suggesting that their reduced activity may affect the mechanism or progression of the disease. C-D. ROC curves for the MDS-related genes. The AUC is 0.938 for IRF4 and 0.791 for ELANE.

8. Differential analysis and co-expression analysis of disease-specific gene groups

First, using R, we divided the batch-corrected training set into a high-expression group and a low-expression group based on the expression levels of IRF4 and ELANE. In the IRF4-based grouping, IRF4 expression correlated negatively with DLK1 and MAMDC2 and positively with 15 genes, including LRIG1, P2RY14, DNTT, and CD24. Similarly, in the ELANE-based grouping, WT1 correlated negatively with ELANE expression, while 47 genes, such as CFD, HAL, CLEC12A, and NKG7, correlated positively. We then performed co-expression analysis on the gene sets significantly associated with IRF4 and ELANE using the corrplot package. Seventeen genes, including DNTT, BLNK, and MME, showed significant co-expression relationships with IRF4. Among them, DLK1 and MAMDC2 correlated positively with each other and negatively, to varying degrees, with the remaining 15 genes. For ELANE, 20 genes, including PRTN3, AZU1, and MPO, showed significant co-expression relationships. Except for WT1, these genes correlated positively with each other to varying degrees (Fig 8A-8D).

Fig 8. A-B. Differential expression heatmaps for IRF4 and ELANE, grouped by high and low expression of the disease-associated gene. Red represents positive correlation and blue represents negative correlation. Left panel: IRF4; right panel: ELANE. “HIGH” and “LOW” indicate expression levels of the disease-associated genes. C-D. Correlation matrices illustrating the correlation between disease-related genes and target genes. Blue tiles signify negative correlation, whereas red tiles indicate positive correlation. The intensity of the color reflects the strength of the correlation.
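The high/low grouping can be sketched as follows. The cutoff used in the study is not stated in this section, so the median split below is an assumption, and all sample IDs and expression values are invented for illustration.

```python
from statistics import mean, median


def split_by_median(expression_by_sample):
    """Split sample IDs into HIGH/LOW groups around the median expression.

    Assumption: samples strictly above the median are HIGH, the rest LOW.
    """
    cutoff = median(expression_by_sample.values())
    groups = {"HIGH": [], "LOW": []}
    for sample, value in expression_by_sample.items():
        groups["HIGH" if value > cutoff else "LOW"].append(sample)
    return cutoff, groups


# Invented toy values for the grouping gene and one co-expressed gene.
irf4 = {"S1": 5.2, "S2": 3.1, "S3": 4.8, "S4": 2.9, "S5": 6.0, "S6": 3.5}
other_gene = {"S1": 7.0, "S2": 4.0, "S3": 6.5, "S4": 3.8, "S5": 7.4, "S6": 4.4}

cutoff, groups = split_by_median(irf4)
mean_diff = (mean(other_gene[s] for s in groups["HIGH"])
             - mean(other_gene[s] for s in groups["LOW"]))
print(groups, round(mean_diff, 2))
```

A positive difference in group means is the pattern expected for a gene that rises together with the grouping gene; a formal differential analysis would add a statistical test on top of this comparison.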

9. Enrichment analysis

Utilizing Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), we conducted an enrichment analysis on the identified disease-characteristic-related genes IRF4 and ELANE, along with their co-expressed genes. A significance threshold of P < 0.05 was employed in this analysis.
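Over-representation analyses of this kind are commonly scored with a one-sided hypergeometric test; whether the study's software used exactly this test is an assumption. The sketch below computes the probability of observing at least a given overlap between a gene list and an annotated term, with all counts chosen purely for illustration.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, selected, overlap):
    """One-sided p-value P(X >= overlap) for term over-representation.

    universe:  number of background genes
    annotated: background genes annotated to the term
    selected:  size of the gene list being tested
    overlap:   genes in the list that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative counts only: 18 genes tested, 4 of them in a 200-gene term.
p_value = hypergeom_enrichment_p(universe=20000, annotated=200,
                                 selected=18, overlap=4)
print(p_value < 0.05)
```

In practice the resulting p-values are usually adjusted for the number of terms tested before a threshold such as P < 0.05 is applied.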

9.1. Enrichment analysis of IRF4.

The IRF4 gene was significantly enriched in multiple biological processes (BPs), including the production of molecular mediators of immune response, bacterial defense response, response to molecules of bacterial origin, and regulation of leukocyte cell-cell adhesion. At the cellular component (CC) level, it was primarily enriched in phagocytic vesicle-related components. In terms of molecular function (MF), functions related to cytokine binding were significantly enriched. KEGG pathway analysis revealed significant enrichment of IRF4 in pathways such as hematopoietic cell lineage and primary immunodeficiency (Fig 9A-9B).

Fig 9. A. GO enrichment analysis for IRF4; the x-axis denotes the number of enriched genes and the y-axis the enrichment significance. B. KEGG enrichment analysis for IRF4; the x-axis denotes the number of enriched genes and the y-axis the enrichment significance. C. GO enrichment analysis for ELANE; the x-axis denotes the gene ratio and the y-axis the enrichment significance. D. KEGG enrichment analysis for ELANE; the x-axis denotes the gene ratio and the y-axis the enrichment significance.

9.2 Enrichment analysis of ELANE.

The ELANE gene was primarily enriched in BPs associated with humoral immune response, bacterial defense response, and regulation of chemotaxis. At the CC level, it was mainly enriched in cytoplasmic vesicle lumen and secretory granule lumen-related components. In terms of MF, functions related to endopeptidase activity were significantly enriched. KEGG pathway analysis showed significant enrichment of ELANE in pathways such as neutrophil extracellular trap formation and transcriptional misregulation in cancer (Fig 9C-9D).

9.3 Enrichment analysis of Co-expressed genes.

Using chord diagram visualization, we demonstrated the co-enrichment of co-expressed genes closely associated with IRF4 and ELANE across various GO and KEGG categories. Co-expressed genes strongly linked to IRF4 were primarily enriched in GO categories such as leukocyte tethering or rolling, immune response, bacterial defense response, and B cell differentiation. Co-expressed genes closely related to ELANE were primarily enriched in GO categories including humoral immune response, antibacterial humoral response, and defense response to fungi and other organisms (Fig 10A-10D).

Fig 10. A-B. Chord diagrams of the enrichment of IRF4 and its co-expressed genes across Gene Ontology (GO) categories (left panel) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (right panel). C-D. Chord diagrams of the enrichment of ELANE and its co-expressed genes across GO categories (left panel) and KEGG pathways (right panel). In each diagram, the left side lists gene names, with colors indicating the direction of expression (red for upregulation, blue for downregulation). The multicolored bars on the right represent distinct GO categories and KEGG pathways, and the thickness of the connecting lines reflects the number of co-enriched genes shared between them.

9.4 Gene set enrichment analysis.

In the high-expression group of IRF4, biological processes such as DNA replication, chromosome segregation, DNA-templated DNA replication, mitotic sister chromatid segregation, and regulation of chromosome segregation were significantly enriched, along with KEGG pathways such as cell cycle, DNA replication, oocyte meiosis, primary immunodeficiency, and proteasome. In contrast, the low-expression group showed enrichment in biological processes such as hemostasis, platelet activation, regulation of fluid level, and wound healing, along with KEGG pathways such as arachidonic acid metabolism, complement and coagulation cascades, endocytosis, focal adhesion, and tight junction. In the high-expression group of ELANE, biological processes such as antibacterial humoral response, antimicrobial humoral immune response mediated by antimicrobial peptides, and defense response to Gram-negative bacterium were significantly enriched, along with KEGG pathways such as cell cycle, DNA replication, hematopoietic cell lineage, oocyte meiosis, and systemic lupus erythematosus. In the low-expression group of ELANE, biological processes such as cardiac cell fate determination, endosomal transport, and autophagy were enriched, along with molecular functions such as ubiquitin-like protein ligase activity and transferase activity. Additionally, multiple KEGG pathways were significantly enriched in this group, including epithelial cell signaling in Helicobacter pylori infection, Hedgehog signaling pathway, long-term potentiation, maturity-onset diabetes of the young, and phosphatidylinositol signaling system (Fig 11A-11H).

Fig 11. Enrichment Score (ES) plots for IRF4, ELANE, and their co-expressed genes across Gene Ontology (GO) categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. A-B. IRF4 high-expression group. C-D. IRF4 low-expression group. E-F. ELANE high-expression group. G-H. ELANE low-expression group. In every panel, the horizontal axis represents the sorted genes, the vertical axis indicates the enrichment score, and colored bars distinguish the different GO categories and KEGG pathways. Peaks of the ES line towards the upper left indicate significant enrichment of these categories and pathways within the high-expression group, whereas peaks towards the lower right indicate significant enrichment within the low-expression group.
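The enrichment score shown in such plots is the extreme value of a running sum over the ranked gene list. The following is a minimal sketch of that running-sum statistic in the style of GSEA, assuming a weight exponent of 1; the gene names (`G1` to `G8`), ranking metric, and gene sets are invented, and significance estimation by permutation is omitted.

```python
def enrichment_score(ranked_genes, metric, gene_set, weight=1.0):
    """Running-sum enrichment score in the style of GSEA.

    ranked_genes: genes ordered from most up- to most down-regulated
    metric:       dict mapping gene -> ranking statistic
    gene_set:     set of genes belonging to the term or pathway
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(metric[g]) ** weight
                   for g, h in zip(ranked_genes, hits) if h)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for g, h in zip(ranked_genes, hits):
        if h:
            running += abs(metric[g]) ** weight / hit_norm  # step up on a hit
        else:
            running -= 1.0 / n_miss                         # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


# Invented ranking metric (e.g. a high-vs-low group statistic).
metric = {"G1": 2.0, "G2": 1.5, "G3": 1.0, "G4": 0.5,
          "G5": -0.5, "G6": -1.0, "G7": -1.5, "G8": -2.0}
ranked = sorted(metric, key=metric.get, reverse=True)

es_top = enrichment_score(ranked, metric, {"G1", "G2", "G4"})
es_bottom = enrichment_score(ranked, metric, {"G7", "G8"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gives a positive score (peak towards the upper left), while a set concentrated at the bottom gives a negative score (trough towards the lower right).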

10. Immunological function analysis

The influence of IRF4 and ELANE gene expression levels on immune-related functions was evaluated through single-sample gene set enrichment analysis (ssGSEA). Our analysis revealed significant differences in the scoring distributions across immune-related gene sets with respect to IRF4 gene expression (P ≤ 0.001). Specifically, B cell-associated immune functions were markedly enhanced in the low-IRF4 expression group, whereas cell lytic activity and functions related to plasmacytoid dendritic cells were significantly elevated in the high-IRF4 expression group. Similarly, ELANE gene expression had a significant impact (P < 0.001), with marked increases in mast cell, neutrophil, and plasmacytoid dendritic cell-related immune functions in the high-ELANE expression group (Fig 12A-12B).

Fig 12. A-B. Boxplots of the distribution of immune-related function scores for IRF4 and ELANE in the low-expression and high-expression groups. The horizontal axis corresponds to the immune functions, and the vertical axis indicates the activity score. Blue and red signify the low-expression and high-expression groups, respectively. Significant differences are denoted by *** (p < 0.001), ** (p < 0.01), and * (p < 0.05).

11. Immunocellular infiltration and correlation analysis of disease-associated genes with immune cell types

Based on the batch-corrected training set, we conducted an immune cell infiltration analysis and used box plots to present the differences in enrichment of immune-related functions between the experimental and control groups. The analysis revealed statistically significant differences (P < 0.05) in activated CD4 memory T cells, resting dendritic cells, and activated mast cells between the two groups, suggesting altered abundance or activity of these cell types in the experimental group compared with the control group. Subsequently, Pearson correlation analysis uncovered notable correlations between IRF4 and ELANE and various immune cell types. Specifically, IRF4 correlated positively with CD4 memory T cells and naive B cells and negatively with CD8+ T cells, regulatory T cells, and resting mast cells. Similarly, ELANE correlated positively with monocytes and negatively with naive CD4+ T cells and naive B cells. These findings contribute to a deeper understanding of the interplay between immune cell subsets and these genes in MDS (Fig 13A-13C).

Fig 13. A. Boxplot of the distribution of enrichment scores for disease-related immune functions in the control group and the experimental group. The horizontal axis represents the immune-related functions, while the vertical axis indicates the enrichment score. Green and red distinguish the control group and the experimental group, respectively. Significant differences are marked with *** (p < 0.001), ** (p < 0.01), and * (p < 0.05). B-C. Bubble charts illustrating the correlation of IRF4 and ELANE with various immune cell types. The vertical axis represents immune cell types, while the horizontal axis depicts the Pearson correlation coefficient, with positive values indicating positive correlation and negative values indicating negative correlation. Significant correlations (P < 0.05) are highlighted in red font.
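The Pearson correlation coefficient plotted on the horizontal axis of such bubble charts can be computed as sketched below. The per-sample gene expression values and immune cell fractions are invented toy data, and no significance test is included.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Invented per-sample values: gene expression and two immune cell fractions.
gene_expression = [3.1, 4.0, 4.6, 5.2, 6.1]
cell_fraction_up = [0.05, 0.08, 0.09, 0.12, 0.15]    # rises with expression
cell_fraction_down = [0.30, 0.26, 0.22, 0.21, 0.15]  # falls with expression

r_up = pearson_r(gene_expression, cell_fraction_up)
r_down = pearson_r(gene_expression, cell_fraction_down)
print(round(r_up, 2), round(r_down, 2))
```

The sign of the coefficient gives the direction of the association and its magnitude the strength; it does not by itself establish a regulatory relationship.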

III. Discussion

MDS, a malignant clonal stem/progenitor cell disorder originating from CD34+ cells, primarily affects individuals over 65 years old, with a global incidence of 2 to 12 per 100,000 individuals [21]. As the population ages, this rate is projected to continue rising. To investigate its pathogenesis, we integrated bioinformatics and machine learning approaches, aiming to uncover potential therapeutic targets and strategies for clinical research and treatment of MDS. In this study, we focused on the gene expression profile of bone marrow CD34+ cells in MDS, systematically analyzing four datasets comprising bone marrow samples from MDS patients and healthy controls. Through this screening, we identified two disease-signature genes: IRF4 and ELANE. Both genes were expressed at significantly lower levels in CD34+ cells from MDS patients. Given their roles in immune response and regulation of cellular differentiation, processes closely linked to the initiation and progression of MDS, this reduction may be relevant to the disease. Our findings are consistent with previous research and support the involvement of these genes in the pathogenesis of MDS, offering insights into potential future therapeutic strategies.

Previous independent studies have demonstrated a significant downregulation of IRF4 gene expression in myelodysplastic syndromes (MDS). Specifically, Vasikova A et al. reported reduced expression of IRF4 across distinct genetic subsets of CD34+ cells in both early and advanced MDS patients and provided insights into the role of IRF4 in the pathogenesis of MDS [22]. In addition, multiple studies have reported consistent downregulation of IRF4 expression across the myeloid disease spectrum, encompassing acute myeloid leukemia (AML), chronic myeloid leukemia (CML), and a range of hematopoietic cancer cell lines [23-27]. This agreement across diseases underscores the importance of IRF4 in hematological malignancies and suggests that the downregulation of IRF4 in MDS, as a subset of myeloid disorders, may represent a shared biological feature among these diseases. Animal model studies offer further evidence: in murine models, the absence of IRF4 has been shown to exacerbate the progression of myeloid leukemia [28], supporting the role of IRF4 in blood disorders and suggesting potential avenues for intervention and treatment. Given the close relationship between MDS and these myeloid disorders, these findings have implications for exploring therapeutic strategies for MDS. Collectively, this work deepens our understanding of the role of IRF4 in MDS pathogenesis and motivates the development of therapeutic approaches that target IRF4 dysfunction. The consistency of IRF4 downregulation across myeloid diseases points to its potential as a therapeutic target spanning multiple hematological malignancies.

In examining the pathogenic mechanisms of myelodysplastic syndromes (MDS), we observed that IRF4 plays a significant role across various hematological disorders. Prior studies have described the regulatory pathways of IRF4 in diverse disease contexts and its close association with tumor progression. Notably, Lopez-Girona A et al. showed that immunomodulatory drugs (IMiDs) exert their tumor-suppressive effects by reducing IRF4 activity or expression levels, and that this effect is modulated by the expression level of cereblon (CRBN) [29]. This finding offers clues to the potential mechanisms of IRF4 in MDS. We speculate that in MDS, IRF4 may participate in disease progression by influencing the expression or function of CRBN or other related proteins. IRF4 is a transcription factor that regulates cellular processes such as proliferation, differentiation, and apoptosis [30-33], and abnormalities in these processes are often closely linked to the initiation and progression of MDS. On this basis, we hypothesize that IRF4 may affect MDS progression by modulating these biological processes. Furthermore, the interplay between IRF4 and its downstream targets, including but not limited to CRBN, may constitute a regulatory network that is dysregulated in MDS and contributes to the pathological features of the disease. Exploring this network and identifying key nodes for therapeutic intervention could lead to new strategies for managing MDS.

Secondly, the ELANE gene encodes neutrophil elastase, an enzyme crucial for neutrophil function. Neutrophils play a pivotal role in the human immune system, combating pathogens and clearing necrotic cells through the release of various enzymes, including neutrophil elastase. Previous studies have primarily focused on the relationship between ELANE gene mutations and severe congenital neutropenia (SCN), as well as its progression to myelodysplastic syndromes/acute myeloid leukemia (MDS/AML) [34,35]. These studies have shown that SCN can be caused by mutations in multiple genes, including ELANE, and that ELANE mutations are the most common genetic defect leading to the development of MDS/leukemia from SCN. For instance, Krutein et al. reported that heterozygous mutations in ELANE, which encodes the potent serine protease neutrophil elastase (NE), cause cyclic neutropenia (CyN) and are the most common cause of severe congenital neutropenia (SCN) [36,37]. Apart from mutations, however, our study found a significant reduction in ELANE gene expression in CD34+ cells from MDS patients. This finding provides a new perspective on the pathogenesis of MDS. We speculate that low expression of ELANE may impair the normal functions of neutrophils, including their differentiation and bactericidal capabilities, thereby contributing to the pathogenesis of MDS. Prior research by Nanua S et al. showed, in a transgenic mouse model carrying a targeted Elane mutation (G193X), that ELANE mutations lead to a block in neutrophil differentiation [36]. This finding further underscores the role of the ELANE gene in neutrophil differentiation. Additionally, studies by Cui et al. and Peng B et al. have highlighted the role of neutrophil elastase (ELANE) in killing cancer cells [38,39]. These studies demonstrate that catalytically active neutrophil elastase released by human neutrophils can selectively kill multiple cancer cell types while sparing non-cancerous cells, significantly reducing tumor formation. This emphasizes the potential value of ELANE in anticancer processes.

In conclusion, our study found low expression of the IRF4 and ELANE genes in CD34+ cells from MDS patients and explored their potential roles in the pathogenesis of MDS. These findings offer new insights into MDS research and provide clues for future therapeutic strategies. Future research can investigate the specific mechanisms underlying these gene expression changes and explore ways to modulate their expression to improve treatment outcomes for MDS patients.

Lastly, the primary limitations of our study are as follows: the MDS-related gene networks identified through bioinformatics have yet to be experimentally validated in patients, and the significant heterogeneity of MDS has hindered the adequate identification of hub genes across its various subtypes. While our research has proposed potential biomarkers for the prognosis, diagnosis, and treatment of MDS, further experimental and clinical validation is necessary. In the future, our methodology must be verified in larger patient cohorts to ensure its reliability and effectiveness.

IV. Conclusion

In this study, we successfully identified potential molecular pathways associated with myelodysplastic syndromes (MDS) and screened for potential therapeutic targets. These findings not only validate the existing research foundation but also significantly enhance our understanding of the pathological mechanisms underlying this disease. Furthermore, they present novel avenues for the development of innovative therapeutic strategies, holding the potential to improve treatment outcomes and enhance the quality of life for patients with this disease in the future.

Data Availability

All relevant analysis scripts and steps have been uploaded to a GitHub repository and assigned a permanent access link through the Figshare platform (DOI: 10.6084/m9.figshare.27276612). This repository contains detailed descriptions and code for all key steps, including data preprocessing, gene feature analysis, and result visualization.

Funding Statement

This study was supported by grants from the National Natural Science Foundation of China (82300209), the Natural Science Foundation for Young Scientists of Shanxi Province (20210302124089), and the Heping Hospital Affiliated to Changzhi Medical College (Institute Level Research Fund; grant no. 2020-22).

References

  • 1.Gerke MB, Christodoulou I, Karantanos T. Definitions, biology, and current therapeutic landscape of myelodysplastic/myeloproliferative neoplasms. Cancers (Basel). 2023;15(15):3815. doi: 10.3390/cancers15153815 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Dotson JL, Lebowicz Y. Myelodysplastic syndrome. StatPearls. 2022. [PubMed] [Google Scholar]
  • 3.Ning Y, Zhang Y, Kallen MA, Emadi A, Baer MR. Cytogenetics and molecular genetics of myelodysplastic neoplasms. Best Pract Res Clin Haematol. 2023;36(4):101512. doi: 10.1016/j.beha.2023.101512 [DOI] [PubMed] [Google Scholar]
  • 4.Kwon A, Weinberg OK. Acute myeloid leukemia arising from myelodysplastic syndromes. Clin Lab Med. 2023;43(4):657–67. doi: 10.1016/j.cll.2023.07.001 [DOI] [PubMed] [Google Scholar]
  • 5.Aul C, Giagounidis A, Germing U. Epidemiological features of myelodysplastic syndromes: results from regional cancer surveys and hospital-based statistics. Int J Hematol. 2001;73(4):405–10. doi: 10.1007/BF02994001 [DOI] [PubMed] [Google Scholar]
  • 6.Yan X, Wang L, Jiang L, Luo Y, Lin P, Yang W, et al. Clinical significance of cytogenetic and molecular genetic abnormalities in 634 Chinese patients with myelodysplastic syndromes. Cancer Med. 2021;10(5):1759–71. doi: 10.1002/cam4.3786 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Jiang Y, Eveillard J-R, Couturier M-A, Soubise B, Chen J-M, Gao S, et al. Asian population is more prone to develop high-risk myelodysplastic syndrome, concordantly with their propensity to exhibit high-risk cytogenetic aberrations. Cancers (Basel). 2021;13(3):481. doi: 10.3390/cancers13030481 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Gao L, Yang L, Zhou S, Zhu W, Han Y, Chen S, et al. Allogenic hematopoietic stem cell transplantation outcomes of patients aged ≥ 55 years with acute myeloid leukemia or myelodysplastic syndromes in China: a retrospective study. Stem Cell Res Ther. 2024 Jan 29;15(1):24. doi: 10.1186/s13287-024-03640-4 ; PMCID: PMC10823660. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Yuen LD, Hasserjian RP. Morphologic characteristics of myelodysplastic syndromes. Clin Lab Med. 2023;43(4):577–96. doi: 10.1016/j.cll.2023.06.003 [DOI] [PubMed] [Google Scholar]
  • 10.Ning Y, Zhang Y, Kallen MA, Emadi A, Baer MR. Cytogenetics and molecular genetics of myelodysplastic neoplasms. Best Pract Res Clin Haematol. 2023;36(4):101512. doi: 10.1016/j.beha.2023.101512 [DOI] [PubMed] [Google Scholar]
  • 11.Niscola P, Gianfelici V, Giovannini M, Piccioni D, Mazzone C, de Fabritiis P. Latest insights and therapeutic advances in myelodysplastic neoplasms. Cancers (Basel). 2024;16(8):1563. doi: 10.3390/cancers16081563 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Trinder M, McGinnis E. CD34+ megakaryocytes associate with myelodysplastic syndromes and related cytogenetic abnormalities, but not other hematological disorders. Blood. 2023;142(Supplement 1):6500–6500. doi: 10.1182/blood-2023-189928 [DOI] [Google Scholar]
  • 13.Hasserjian RP, Germing U, Malcovati L. Diagnosis and classification of myelodysplastic syndromes. Blood. 2023;142(26):2247–57. doi: 10.1182/blood.2023020078 [DOI] [PubMed] [Google Scholar]
  • 14.Votavova H, Belickova M. Hypoplastic myelodysplastic syndrome and acquired aplastic anemia: immune‑mediated bone marrow failure syndromes (Review). Int J Oncol. 2022;60(1):7. doi: 10.3892/ijo.2021.5297 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Côme C, Balhuizen A, Bonnet D, Porse BT. Myelodysplastic syndrome patient-derived xenografts: from no options to many. Haematologica. 2020;105(4):864–9. doi: 10.3324/haematol.2019.233320 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Le Y. Screening and identification of key candidate genes and pathways in myelodysplastic syndrome by bioinformatic analysis. PeerJ. 2019;7e8162. doi: 10.7717/peerj.8162 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Tuerxun N, Wang J, Zhao F, Qin Y-T, Wang H, Chen R, et al. Bioinformatics analysis deciphering the transcriptomic signatures associated with signalling pathways and prognosis in the myelodysplastic syndromes. Hematology. 2022;27(1):214–31. doi: 10.1080/16078454.2022.2029256 [DOI] [PubMed] [Google Scholar]
  • 18.Tian Y, Tao K, Li S, Chen X, Wang R, Zhang M, et al. Identification of m6A-related biomarkers in systemic lupus erythematosus: a bioinformation-based analysis. J Inflamm Res. 2024;17507–26. doi: 10.2147/JIR.S439779 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Huang J, Zhou J, Xue X, Dai T, Zhu W, Jiao S, et al. Identification of aging-related genes in diagnosing osteoarthritis via integrating bioinformatics analysis and machine learning. Aging (Albany NY). 2024;16(1):153–68. doi: 10.18632/aging.205357 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Li Y, Yu J, Li R, Zhou H, Chang X. New insights into the role of mitochondrial metabolic dysregulation and immune infiltration in septic cardiomyopathy by integrated bioinformatics analysis and experimental validation. Cell Mol Biol Lett. 2024;29(1):21. doi: 10.1186/s11658-024-00536-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Vasikova A, Belickova M, Budinska E, Cermak J. A distinct expression of various gene subsets in CD34+ cells from patients with early and advanced myelodysplastic syndrome. Leuk Res. 2010;34(12):1566–72. doi: 10.1016/j.leukres.2010.02.021 [DOI] [PubMed] [Google Scholar]
  • 22.Ortmann CA, Burchert A, Hölzle K, Nitsche A, Wittig B, Neubauer A, et al. Down-regulation of interferon regulatory factor 4 gene expression in leukemic cells due to hypermethylation of CpG motifs in the promoter region. Nucleic Acids Res. 2005;33(21):6895–905. doi: 10.1093/nar/gki1001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Schmidt M, Hochhaus A, König-Merediz SA, Brendel C, Proba J, Hoppe GJ, et al. Expression of interferon regulatory factor 4 in chronic myeloid leukemia: correlation with response to interferon alfa therapy. J Clin Oncol. 2000;18(19):3331–8. doi: 10.1200/JCO.2000.18.19.3331 [DOI] [PubMed] [Google Scholar]
  • 24.Schmidt M, Hochhaus A, König-Merediz SA, Brendel C, Proba J, Hoppe GJ, et al. Expression of interferon regulatory factor 4 in chronic myeloid leukemia: correlation with response to interferon alfa therapy. J Clin Oncol. 2000;18(19):3331–8. doi: 10.1200/JCO.2000.18.19.3331 [DOI] [PubMed] [Google Scholar]
  • 25.Jo S-H, Schatz JH, Acquaviva J, Singh H, Ren R. Cooperation between deficiencies of IRF-4 and IRF-8 promotes both myeloid and lymphoid tumorigenesis. Blood. 2010;116(15):2759–67. doi: 10.1182/blood-2009-07-234559 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Ma S, Shukla V, Fang L, Gould KA, Joshi SS, Lu R. Accelerated development of chronic lymphocytic leukemia in New Zealand Black mice expressing a low level of interferon regulatory factor 4. J Biol Chem. 2013;288(37):26430–40. doi: 10.1074/jbc.M113.475913 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Shukla V, Ma S, Hardy RR, Joshi SS, Lu R. A role for IRF4 in the development of CLL. Blood. 2013;122(16):2848–55. doi: 10.1182/blood-2013-03-492769 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Lopez-Girona A, Mendy D, Ito T, Miller K, Gandhi AK, Kang J, et al. Cereblon is a direct protein target for immunomodulatory and antiproliferative activities of lenalidomide and pomalidomide. Leukemia. 2012;26(11):2326–35. doi: 10.1038/leu.2012.119 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Wang J, Clay-Gilmour AI, Karaesmen E, Rizvi A, Zhu Q, Yan L, et al. Genome-wide association analyses identify variants in irf4 associated with acute myeloid leukemia and myelodysplastic syndrome susceptibility. Front Genet. 2021;12:554948. doi: 10.3389/fgene.2021.554948 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Shaffer AL, Emre NCT, Romesser PB, Staudt LM. IRF4: immunity. malignancy! therapy?. Clin Cancer Res. 2009;15(9):2954–61. doi: 10.1158/1078-0432.CCR-08-1845 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Nam S, Lim J-S. Essential role of interferon regulatory factor 4 (IRF4) in immune cell development. Arch Pharm Res. 2016;39(11):1548–55. doi: 10.1007/s12272-016-0854-1 [DOI] [PubMed] [Google Scholar]
  • 32.Lu J, Liang T, Li P, Yin Q. Regulatory effects of IRF4 on immune cells in the tumor microenvironment. Front Immunol. 2023;14:1086803. doi: 10.3389/fimmu.2023.1086803 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Xiao Y, Wang N, Jin X, Liu A, Zhang Z. Clinical relevance of SCN and CyN induced by ELANE mutations: a systematic review. Front Immunol. 2024;15:1349919. doi: 10.3389/fimmu.2024.1349919 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Carlsson G, Fasth A, Berglöf E, Lagerstedt-Robinson K, Nordenskjöld M, Palmblad J, et al. Incidence of severe congenital neutropenia in Sweden and risk of evolution to myelodysplastic syndrome/leukaemia. Br J Haematol. 2012;158(3):363–9. doi: 10.1111/j.1365-2141.2012.09171.x [DOI] [PubMed] [Google Scholar]
  • 35.Kennedy AL, Shimamura A. Genetic predisposition to MDS: clinical features and clonal evolution. Blood. 2019;133(10):1071–85. doi: 10.1182/blood-2018-10-844662 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Krutein M. Molecular genetics of myeloid malignancy predisposition: Insights into pathogenesis and therapeutic translation (Doctoral dissertation). 2019.
  • 37.Nanua S, Murakami M, Xia J, Grenda DS, Woloszynek J, Strand M, et al. Activation of the unfolded protein response is associated with impaired granulopoiesis in transgenic mice expressing mutant Elane. Blood. 2011;117(13):3539–47. doi: 10.1182/blood-2010-10-311704 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Cui C, Chakraborty K, Tang XA, Zhou G, Schoenfelt KQ, Becker KM, et al. Neutrophil elastase selectively kills cancer cells and attenuates tumorigenesis. Cell. 2021;184(12):3163-3177.e21. doi: 10.1016/j.cell.2021.04.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Peng B, Hu J, Fu X. ELANE: an emerging lane to selective anticancer therapy. Signal Transduct Target Ther. 2021;6(1):358. doi: 10.1038/s41392-021-00766-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Francesco Bertolini

9 Sep 2024

PONE-D-24-28210

Research and Analysis of Differential Gene Expression in CD34 Hematopoietic Stem Cells in Myelodysplastic Syndromes

PLOS ONE

Dear Dr. Han,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process by both reviewers, experts in the field. Please resubmit only if you can answer all their concerns.

Please submit your revised manuscript by Oct 24 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Francesco Bertolini, MD, PhD

Academic Editor

PLOS ONE

Journal requirements: When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We note that the grant information you provided in the 'Funding Information' and 'Financial Disclosure' sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the 'Funding Information' section.

4. Thank you for stating the following financial disclosure: [National Natural Science Foundation of China (82300290). The Natural Science Foundation for Young Scientists of Shanxi Province (20210302124089). The present study was supported by a grant from Heping Hospital Affiliated to Changzhi Medical College (Institute Level Research Fund; grant no. 2020-22).]. Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

5. We note that your Data Availability Statement is currently as follows: [All relevant data are within the manuscript and its Supporting Information files.] Please confirm at this time whether or not your submission contains all raw data required to replicate the results of your study. Authors must share the "minimal data set" for their submission. PLOS defines the minimal data set to consist of the data required to replicate all study findings reported in the article, as well as related metadata and methods (https://journals.plos.org/plosone/s/data-availability#loc-minimal-data-set-definition). For example, authors should submit the following data:

  • The values behind the means, standard deviations and other measures reported;

  • The values used to build graphs;

  • The points extracted from images for analysis.

Authors do not need to submit their entire data set if only a portion of the data was used in the reported study. If your submission does not contain these data, please either upload them as Supporting Information files or deposit them to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories. If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. If data are owned by a third party, please indicate how others may request data access.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors,

Thank you for the original research that you conducted, but most of the information on the computational modeling part still needs to be provided.

A) I have major questions about the prediction models part and request that you edit and improve it: include at least 2 more models for performance comparison and explain why you chose the current approach (see Comment 7, ML part).

I have also not seen any discussion of that part.

B) The other comments mostly relate to the shape and quality of the figures (Comments 1-5) and to sharing your scripts in a public repository (see Comment 6), covering the code for the DEG, batch correction, and modeling sections.

Comment 1: Please fit the figures onto as few pages as possible: instead of 32 separate figures, combine them into at most about 10 multi-panel figures. For instance, the authors can always refer to Figure 1A, 1B, 1C, 1D, up to 1F, so that one page can hold 8 figures.

Comment 2: The text in the figures is hard to read because it is blurry. The figures should look like Figures 16 and 17 in terms of readability, colors, and font size. I am aware that some software does not produce a polished look in figures, but the authors can always fix the text using PPT or canvas.

Comment 3: Why are the figures not in the correct numerical order at the end of the paper? For example, the figures start with Figure 13 and are mixed up.

Comment 4: Please have the manuscript proofread by a native English speaker. I see some grammar mistakes and punctuation errors. One common mistake concerns spacing after a period: there must be a space between the "." that ends a sentence and the start of the next sentence.

Comment 5: The captions of Figs 1A-1D should start with capital letters. Please follow the same punctuation rules throughout the text. Also make sure that all figures are mentioned in the text in order. In addition, some figure legends need to be capitalized (such as gse81173). The authors should make sure they use similar heatmap method arguments in R. For example, the heatmap cells look much nicer in Fig 1 than in Fig 4.

Comment 6: This is a promising study, and the findings on specific gene features are very original. Can the authors share their scripts or analysis steps in a public repository (such as GitHub), so that other researchers can repeat and reproduce the results, not just for this manuscript but also for future studies and datasets in the field?

Comment 7: Add a discussion paragraph on the predictive modeling (RF, LASSO, and SVM). Why did you pick those models? Please support your findings by running at least 3 other models; it is always good to have model performance comparisons. For instance, why would a Neural Network not work? Have the authors tried other models? And what about ROC analysis and AUC results, accuracy, error, sensitivity, and specificity? Without those criteria it is hard to judge why the authors picked the models under study.

Comment 8: Please sketch the analysis steps in a flow chart as a figure. It helps readers follow the analysis and improves readability.

Reviewer #2: The authors use published gene expression micro array data sets of CD34 expressing immature cells from the bone marrow of myelodysplastic syndrome patients and healthy controls to perform a differential gene expression meta analysis. The objective is to unravel expression signatures that are involved in the pathology and biology of the disease.

Concerns:

- The order of Figures is incorrect as well as the numbering. In addition many Figures lack resolution and can therefore not be interpreted. As a consequence, the manuscript can not properly be evaluated.

- The methodology is not properly described in accordance with PLOS ONE author guidelines. For example, were raw CEL files used, and which normalization approach (e.g. MAS5) and transformation were performed? The results of the selections are also unclear. For example, when genes or samples with missing data are removed from the data set, it is not stated with what numbers the analysis is continued. A fold change selection of 1 is 'uncommon'.

- Algorithms (L2 regression, RF, SVM) are more or less fit for the data. I would advise not to assess overlap from the different approaches. A proper way to analyze the data seems to split the data, build a classifier using each approach, perform cross validations and evaluate the models. Then use features from the best model and evaluate these in the independent validation set.

- Code is not provided which hampers interpretation and reproducibility of the analyses. Also not in concordance with Plos One policy. https://journals.plos.org/plosone/s/materials-software-and-code-sharing

- Versions of and reference to used software are lacking

- For a meta analysis the number of data sets is somewhat limited, and the added value of the analysis over the original manuscripts is limited.

Minor concerns:

- Volcano plots are informative

- The plots showing the data before and after batch effect correction could use symbols for the two sample types and colors for the data sets. After batch correction one would assume that not the data set, but the sample type would explain most variability in PCA plots.

- Improve punctuation

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean? ). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: Yes:  Emine Guven

Reviewer #2: Yes:  Costa Bachas

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Mar 12;20(3):e0315408. doi: 10.1371/journal.pone.0315408.r003

Author response to Decision Letter 1


12 Nov 2024

Dear Reviewers,

We deeply appreciate your invaluable feedback on our manuscript, titled "Research and Analysis of Differential Gene Expression in CD34 Hematopoietic Stem Cells in Myelodysplastic Syndromes." Recognizing the effort and time you've invested in the review, we sincerely thank you. Your insightful comments have enriched our understanding of the research and provided us with valuable suggestions. We have carefully considered all your recommendations and will address them in the revised manuscript. Here are our point-by-point responses to your main comments:

Reviewer #1:

1: We have meticulously reviewed and integrated logically related figures, employing subfigure annotations (e.g., Figure 1A, 1B) to condense information onto fewer pages. Furthermore, we have utilized PACE, the official processing platform of PLOS ONE (https://pacev2.apexcovantage.com/), to adjust the size and resolution of the figures in compliance with submission requirements, ensuring both clarity and aesthetic appeal. Lastly, we have updated the figure references throughout the manuscript to maintain consistency.

2: In response to the issue of blurred text in the figures you pointed out, we have implemented comprehensive optimization measures:

Firstly, we referenced the styles of Figure 16 and 17 and meticulously adjusted the text clarity, font size, and color of all figures to ensure readability. During this process, we specifically utilized tools such as PPT and Photoshop to address the blurring issues generated by the original software and regenerated all figures in high-quality formats.

Secondly, we confirmed that the uploaded image versions fully comply with the journal's requirements and maintain the highest clarity. While we acknowledge that the images in the manuscript may appear slightly blurred due to software compression when viewed, we assure you that the uploaded image files themselves possess exceptional clarity.

3: In response to the issue with the figure order that you pointed out, we conducted a thorough investigation and implemented comprehensive optimizations. Our analysis suggests that the problem may have arisen from differences in upload speeds for images of varying sizes. To definitively address this issue, we have taken the following measures: re-uploaded each figure individually to ensure its order strictly aligns with the content of the article; carefully reviewed the annotations on each figure to guarantee their accuracy in reflecting the article's content; and regenerated the final version of the paper and verified that all modifications have been correctly implemented.

4: In response to the grammatical and punctuation errors you identified, we have taken the following comprehensive measures to enhance the paper's quality: 1. Engaged a native English-speaking proofreader to thoroughly correct the entire text. 2. Employed a professional editor to review the text, focusing on punctuation and spacing. 3. Utilized online grammar checking tools for accuracy and fluency. 4. Sought feedback from multiple native speakers to ensure compliance with international standards. After completing these corrections, we re-generated and reviewed the paper, confirming all modifications were properly implemented for comprehensive quality improvement.

5: In response to the issues raised, we have made the following modifications:

Chart Titles: We have revised the titles of Figures 1A to 1D to ensure they all comply with English writing conventions, specifically starting with capital letters.

Punctuation Consistency: We have thoroughly checked and corrected the punctuation usage throughout the text to maintain consistency and accuracy.

Figure Sequence: We have verified the citation order of the figures in the text and ensured they follow the sequence of their first appearance, preserving the coherence of the manuscript.

Legend Format: We have standardized the necessary legends (e.g., "GSE81173") to uppercase to meet formatting requirements.

Heatmap Method Consistency: Regarding the apparent differences between Figures 1 and 4, we confirm that both utilize similar parameter settings. The visual discrepancy arises primarily from the difference in sample size: Figure 1 represents a gene set with 18 samples, whereas Figure 4 has 183 samples. We acknowledge the impact of sample size on chart aesthetics and will consider this in future work to further ensure visual uniformity and aesthetics across all figures.

6: We are immensely grateful for your attention and support of our research. In response to your request for sharing scripts and analytical steps, we are pleased to inform you that all relevant analysis scripts and steps have been uploaded to a GitHub repository and have obtained a permanent access link through the Figshare platform (10.6084/m9.figshare.27276612). This repository contains detailed descriptions and codes for all key steps, including data preprocessing, gene feature analysis, and result visualization.

Our aim in making these resources open is to enhance the reproducibility and transparency of research and to provide valuable references for other researchers in the field. We fully understand the importance of data availability and have ensured that all underlying data of this meta-analysis (including raw data points) are fully accessible in accordance with PLOS ONE's data policy. The data used in this study are derived from published journal articles, and all cited articles are listed in detail in the main text and reference list. As these data are publicly available secondary resources, no additional data access permissions or application processes are required. However, to obtain more information or verify the data, we encourage interested readers to consult the corresponding original literature.

We warmly invite interested researchers to access the repository and utilize our methods and scripts to replicate and validate our results. We also hope that these resources will inspire and assist in future research and datasets. Furthermore, we will provide the corresponding links in the revised manuscript. If you encounter any issues while accessing or using these resources, or need further assistance and support, please feel free to contact us. We are more than happy to provide help and look forward to collaborating with you to advance research in this field.

7: Explanation of Model Selection, Supported by Running at Least Three Additional Models: A Comparative Analysis.

A dedicated discussion section has been added to the article, focusing on the predictive modeling process. It delves into the fundamental principles, unique advantages, and broad applicability of Random Forest (RF), LASSO regression (LASSO), and Support Vector Machines (SVM) in bioinformatics data analysis.

Rationale for Model Selection:

We selected RF, LASSO, and SVM as our core predictive models based on the following considerations:

Wide Applicability and Recognition: These three models are highly recognized in the bioinformatics field, with solid theoretical foundations and mature implementation methods.

Data Sample Characteristics: Given the significant sample size difference between the experimental and control groups in our data, and the limited sample size after screening, we chose models that could robustly handle such data.

Identification of Overlapping Genes: Preliminary analysis revealed that, besides the overlapping genes between RF, LASSO, and SVM, models such as Neural Networks (NN) and Gradient Boosting Machines (GBM) had fewer overlapping genes with RF. This further enhanced the reliability and consistency of our screening results.
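The overlap step described above can be sketched as a plain set intersection. This is only an illustration: the gene symbols are placeholders, the lists are invented for the example, and the helper names `consensus_genes` and `pairwise_overlap` are hypothetical rather than taken from the authors' scripts.

```python
# Hypothetical feature lists per model; the gene symbols are placeholders,
# not results from the study.
selected_by_model = {
    "RF": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "LASSO": {"GENE_B", "GENE_C", "GENE_E"},
    "SVM": {"GENE_A", "GENE_B", "GENE_C", "GENE_F"},
}


def consensus_genes(selections):
    """Return the genes chosen by every model, sorted for stable output."""
    gene_sets = list(selections.values())
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


def pairwise_overlap(selections):
    """Count the shared genes for each pair of models."""
    names = sorted(selections)
    return {
        (a, b): len(selections[a] & selections[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


consensus = consensus_genes(selected_by_model)
overlaps = pairwise_overlap(selected_by_model)
print(consensus)
print(overlaps)
```

Reporting the pairwise counts alongside the intersection makes it easier to see whether one model is responsible for most of the shrinkage in the consensus list.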

Performance Comparison with Additional Models:

To fully support our findings and address your request for increased model comparison, we introduced three additional models: Neural Networks (NN), Logistic Regression (LR), and Gradient Boosting Machines (GBM). We conducted a comprehensive performance comparison between these models and RF, LASSO, and SVM using diverse evaluation metrics, including ROC analysis, AUC, accuracy, error rate, sensitivity, and specificity. While NN, LR, and GBM demonstrated certain advantages in sensitivity and F1 score, considering the extreme fractionation of our data, the significant difference between the experimental and control groups, and the limited sample size, we found that these models had applicability issues or biases during training.

Discussion on the Inapplicability of Neural Network and Other Models:

Significant challenges were encountered when attempting to apply the Neural Network (NN) model. Firstly, NN models typically require vast amounts of training data to achieve optimal performance, which was not feasible given our relatively limited sample size after screening. Secondly, the complexity of NN models and the difficulty in parameter tuning made them less ideal for our study. Therefore, we ultimately decided not to include the NN model in our final analysis framework.

Comprehensive Evaluation Metrics and Integrated Judgment:

We have provided detailed key evaluation metrics in the article, including ROC analysis, AUC, accuracy, error rate, sensitivity, and specificity, to allow readers to better understand the basis for our model selection. Based on a comprehensive comparison of these metrics, we found that although the six models exhibited similar performance across various indicators, considering numerous previous related studies and preliminary applications, we determined that RF, LASSO, and SVM demonstrated superior performance for our gene expression dataset.
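For readers who want to see how the metrics named above relate to one another, the following is a minimal sketch using only the Python standard library. The labels and scores are invented toy values (1 = MDS, 0 = control), the 0.5 decision threshold is an assumption, and the function names are hypothetical; the AUC is computed with the rank-based (Mann-Whitney) formulation rather than by tracing the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, error rate, sensitivity and specificity from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, deliberately imbalanced example: 6 cases and 2 controls.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

In this toy case the accuracy is 0.75 while the specificity is only 0.5, which illustrates why accuracy alone can be misleading when the case and control groups differ in size.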

8: We have drawn a flowchart illustrating the analytical steps and incorporated it as Figure 1 in the Methods section. This should enhance readers' comprehension of our analytical procedure.

Reviewer #2

Major Concerns:

1: We have revised the order and numbering of the figures and tables, and increased their resolution to ensure that they clearly present the analysis results.

2: We have supplemented and refined the methodology section, providing a detailed description of the data processing and analysis steps, including the types of files used, normalization methods, transformation processes, and criteria for selecting fold changes, etc. After repeated verification, we have confirmed that during the data file processing, only missing values and duplicated genes were removed, without deleting any samples. We have revised and elaborated on the original methodology section accordingly. Regarding your comment that selecting a fold change of 1 is "unusual", our rationale is as follows: in the initial assessment of differences in gene expression levels, we set a low fold change threshold (such as 1) to capture as many potential statistically significant differentially expressed genes as possible, providing more valuable candidate genes for subsequent in-depth studies (such as multi-machine learning predictions).
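The effect of the threshold can be made concrete with a small sketch. The response does not state whether "1" refers to a raw fold change or to a log2 fold change, so the example below assumes the cutoff is applied to the absolute log2 fold change (where 1 corresponds to a two-fold difference) together with an adjusted p-value cutoff of 0.05. The records, gene symbols, and the helper `select_degs` are all hypothetical, and the expression means are on a linear scale purely for illustration.

```python
import math


def log2_fold_change(mean_case, mean_control):
    """log2 ratio of mean expression (linear scale) in cases versus controls."""
    return math.log2(mean_case / mean_control)


# Hypothetical records: (gene, mean in MDS, mean in controls, adjusted p-value).
records = [
    ("GENE_A", 400.0, 100.0, 0.001),  # four-fold up, significant
    ("GENE_B", 150.0, 100.0, 0.010),  # 1.5-fold up, significant
    ("GENE_C", 50.0, 100.0, 0.020),   # two-fold down, significant
    ("GENE_D", 100.0, 100.0, 0.900),  # unchanged
    ("GENE_E", 300.0, 100.0, 0.200),  # three-fold up, not significant
]


def select_degs(records, min_abs_log2fc, alpha=0.05):
    """Keep genes passing both the |log2FC| and the adjusted p-value cutoffs."""
    selected = []
    for gene, case, control, adj_p in records:
        lfc = log2_fold_change(case, control)
        if abs(lfc) >= min_abs_log2fc and adj_p < alpha:
            selected.append(gene)
    return selected


strict = select_degs(records, min_abs_log2fc=1.0)   # at least two-fold change
lenient = select_degs(records, min_abs_log2fc=0.0)  # significance only
print(strict, lenient)
```

Lowering the effect-size cutoff admits GENE_B in this toy data, which is the trade-off the response describes: more candidate genes for downstream modeling at the cost of including smaller effects.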

3: We have considered the reviewer's suggestion and adjusted our model evaluation methods. We used cross-validation to assess the performance of the models and compared metrics such as AUC values and accuracy across different models. Additionally, we plan to attempt using an independent validation set in future studies to further evaluate the generalization ability of the models.
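A minimal sketch of stratified k-fold cross-validation, written with the standard library only, may help clarify the procedure. It is not the authors' pipeline: a nearest-centroid classifier stands in for RF, LASSO, and SVM, the data are synthetic (30 "case" and 10 "control" samples with two features), and every function name is hypothetical.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def fit_centroids(X, y):
    """'Train' a nearest-centroid model: the mean feature vector of each class."""
    centroids = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k stratified folds."""
    scores = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Synthetic, well-separated data: 30 cases (label 1) and 10 controls (label 0).
rng = random.Random(42)
X = [[rng.gauss(3.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(30)]
X += [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(10)]
y = [1] * 30 + [0] * 10

mean_acc = cross_val_accuracy(X, y, k=5)
print(round(mean_acc, 3))
```

Stratifying the folds keeps some controls in every held-out fold, which matters when the two groups differ in size. Any feature selection would need to be repeated inside each training fold to avoid leaking information from the held-out samples.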

4: We have shared the code in a public repository (GitHub, with a permanent Figshare DOI: 10.6084/m9.figshare.27276612) so that other researchers can replicate and reproduce our results.

5: We have supplemented the methodology section with the software versions used and the corresponding citations.

6: Regarding your comment on the lack of information regarding the software versions used and references cited, we have supplemented the relevant details in the Methodology section. Specifically, we have: 1. Listed all software and their respective versions: This ensures the reproducibility of our results. 2. Added citations for all relevant literature: This supports both our analytical methods and findings. We hope these additions address your concerns and further enhance the quality and transparency of our manuscript.

7: Regarding your comment on the "limited number of datasets in the meta-analysis and the limited added value compared to the original manuscript," we have carefully reflected and respond as follows.

During the data collection phase, we faced numerous challenges. To ensure the comprehensiveness and accuracy of the data, we systematically searched the GEO database and strictly screened according to the criteria of having more than 20 samples and meeting quality standards, ultimately including 4 representative MDS gene expression datasets. These datasets cover important research in the field of MDS, providing us with a reliable analytical foundation.

Although the number of datasets may not meet the standards of some large-scale meta-analyses, in the context of current MDS gene expression research, these datasets are sufficient to support our in-depth analysis. We employed advanced analytical methods and tools to perform detailed mining and comprehensive analysis on these data, aiming to uncover potential gene expression patterns and biomarkers.

In terms of analytical results, we have made some meaningful discoveries. These findings not only validate previous research results but also provide new perspectives and insights. Although these discoveries may not be sufficient to completely change the current understanding and treatment strategies for MDS, they undoubtedly pave new paths for future research and provide valuable reference points.

We sincerely appreciate your valuable feedback and will continue to strive in our future research. We are committed to including more relevant datasets to enhance the statistical power and generalizability of our conclusions. At the same time, we will continuously explore new analytical methods and research perspectives, aiming to achieve more significant breakthroughs and progress in the field of MDS gene expression research.

Secondary Concerns:

1: The volcano plots provide useful information.

Thank you for affirming the utility of the volcano plots in our manuscript. We are glad that you find these plots informative. Volcano plots are indeed a key part of our analysis, as they visually represent significant changes in gene expression levels, providing strong support for our research. If you have any further suggestions or need additional details to enhance the interpretation of these plots, please feel free to let us know. We look forward to further refining our work.
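As a brief illustration of how a volcano plot encodes significance, the following minimal Python sketch labels each gene by fold change and adjusted p-value and computes its plot coordinates. The cutoffs (`fc_cutoff`, `p_cutoff`), the function names, and the example rows are illustrative assumptions, not the thresholds or results used in the manuscript.

```python
import math


def classify_gene(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label one gene for a volcano plot (cutoffs are assumed, not the study's)."""
    if adj_p < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if adj_p < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"  # not significant


def volcano_points(results):
    """Convert (gene, log2_fc, adj_p) rows into (gene, x, y, label) points.

    x is the log2 fold change; y is -log10 of the adjusted p-value,
    so adj_p must be greater than zero.
    """
    points = []
    for gene, log2_fc, adj_p in results:
        y = -math.log10(adj_p)
        points.append((gene, log2_fc, y, classify_gene(log2_fc, adj_p)))
    return points


# Hypothetical rows for illustration only, not data from the study.
example = [("GENE_A", 2.0, 0.001), ("GENE_B", -1.5, 0.01), ("GENE_C", 0.2, 0.5)]
points = volcano_points(example)
```

Each point can then be drawn with any plotting library, with the label deciding the color of the marker.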

2: We have modified the charts to display the batch effects before and after correction, using distinct symbols for the two sample types and different colors for the various datasets.
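The encoding described above (one symbol per sample type, one color per dataset) can be sketched as follows. This is a minimal illustration, not the script used for the figures: the function name, the field names `dataset` and `group`, the group labels "MDS" and "control", and the color and marker names are all assumptions.

```python
from itertools import cycle


def build_style_maps(samples,
                     colors=("tab:blue", "tab:orange", "tab:green", "tab:red"),
                     markers=("o", "^")):
    """Assign one color per dataset and one marker per sample type.

    `samples` is a list of dicts with assumed keys "dataset" and "group".
    Colors and markers are reused cyclically if there are more categories
    than styles.
    """
    datasets = sorted({s["dataset"] for s in samples})
    groups = sorted({s["group"] for s in samples})
    color_of = dict(zip(datasets, cycle(colors)))
    marker_of = dict(zip(groups, cycle(markers)))
    return color_of, marker_of


# Hypothetical sample annotations; the group labels are assumed.
samples = [
    {"id": "s1", "dataset": "GSE81173", "group": "MDS"},
    {"id": "s2", "dataset": "GSE4619", "group": "control"},
    {"id": "s3", "dataset": "GSE58831", "group": "MDS"},
]
color_of, marker_of = build_style_maps(samples)
styles = [(s["id"], color_of[s["dataset"]], marker_of[s["group"]]) for s in samples]
```

Applying the same two maps to the plots made before and after correction keeps the panels directly comparable.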

3: We have carefully proofread and corrected the punctuation in the manuscript to ensure it adheres to English writing norms.

We again thank the reviewer for the valuable comments and look forward to further feedback on our revised manuscript.

Best regards,

Min-xiao Wang

Attachment

Submitted filename: Response to Reviewers.docx

pone.0315408.s001.docx (19.5KB, docx)

Decision Letter 1

Francesco Bertolini

26 Nov 2024

Research and Analysis of Differential Gene Expression in CD34 Hematopoietic Stem Cells in Myelodysplastic Syndromes

PONE-D-24-28210R1

Dear Dr. Han,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Francesco Bertolini, MD, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Emine Guven

**********

Acceptance letter

Francesco Bertolini

PONE-D-24-28210R1

PLOS ONE

Dear Dr. Han,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Francesco Bertolini

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Availability Statement

    All relevant analysis scripts and steps have been uploaded to a GitHub repository and assigned a permanent access link through the Figshare platform (DOI: 10.6084/m9.figshare.27276612). This repository contains detailed descriptions and code for all key steps, including data preprocessing, gene feature analysis, and result visualization.


    Articles from PLOS One are provided here courtesy of PLOS
