PLOS One. 2025 Jun 18;20(6):e0321761. doi: 10.1371/journal.pone.0321761

Machine learning driven dashboard for chronic myeloid leukemia prediction using protein sequences

Waqar Ahmad 1,✉,#, Abdul Raheem Shahzad 2,#, Muhammad Awais Amin 1,3,#, Waqas Haider Bangyal 4,#, Tahani Jaser Alahmadi 5,*,#, Saddam Hussain Khan 6,#
Editor: Salman Sadullah Usmani 7
PMCID: PMC12176232  PMID: 40531831

Abstract

The prevalence of Leukaemia, a malignant blood cancer that originates from hematopoietic progenitor cells, is increasing in Southeast Asia, with a worrisome fatality rate of 54%. Predicting outcomes in the early stages is vital for improving the chances of patient recovery. The aim of this research is to substantially improve early-stage prediction systems. Using Machine Learning and Data Science, we exploit protein sequence data from commonly altered genes, including BCL2, HSP90, PARP, and RB, to make predictions for Chronic Myeloid Leukaemia (CML). Our methodology relies on established feature extraction methods, namely Di-peptide Composition (DPC), Amino Acid Composition (AAC), and Pseudo Amino Acid Composition (Pse-AAC). We also identify and handle outliers and validate feature selection using the Pearson Correlation Coefficient (PCC). Data augmentation is applied to enlarge the dataset available for analysis. Using several Machine Learning models, namely Support Vector Machine (SVM), XGBoost, Random Forest (RF), K Nearest Neighbour (KNN), Decision Tree (DT), and Logistic Regression (LR), we achieved accuracy rates ranging from 66% to 94%. These classifiers are evaluated using accuracy, sensitivity, specificity, F1-score, and the confusion matrix. The solution we propose is a user-friendly online dashboard that can be used for early detection of CML. This tool has significant implications for practitioners and may be used in healthcare institutions and hospitals.

Introduction

Leukemia is a complex medical condition influenced by genetic regulation in the production of blood cells. When hematopoietic precursor cells turn malignant [1], abnormal cell growth arises from alterations in DNA and RNA sequences. This transformation results in the infiltration of healthy cells by malignant ones, thus causing Leukemia. The illness primarily entails the uncontrolled proliferation of White Blood Cells (WBC), specifically neutrophils, basophils, and eosinophils, while lymphocytes remain unaffected. Acute myeloid Leukemia (AML), chronic myeloid Leukemia (CML), acute lymphoblastic Leukemia (ALL), and chronic lymphocytic Leukemia (CLL) are among the kinds of Leukemia [2]. Our research focuses exclusively on Chronic Myeloid Leukemia (CML).

Leukemia presents a substantial health challenge due to the abnormal proliferation of White Blood Cells (WBC) [1]. While research has concentrated on detecting cancer through blood cell images, exploration of protein sequence data is limited. Leukemia diagnosis relies heavily on hematologists, which is a limitation in regions with a scarcity of specialists. Mortality rates are on the rise, particularly in Southeast Asia [3], creating a demand for an early detection approach. The motivation for the proposed research arises from the observation that a plethora of research has been conducted on predicting other cancers (such as lung, liver, colon, and ovarian cancer) using MRI (magnetic resonance imaging), CT (computed tomography) scans, image processing techniques, and protein sequences [4–6]. However, the realm of gene data in bioinformatics remains relatively uncharted, especially within the context of Chronic Myeloid Leukemia (CML). At present, no AI-based dashboard system predicts Leukemia from protein sequences. Developing such a system could substantially improve diagnosis, saving lives and easing healthcare burdens. Collaborative efforts between Machine Learning and Data Science can establish a robust model for accessible and timely Leukemia solutions.

As illustrated in Fig 1, the proposed research applies Machine Learning-based techniques to identify genes that cause Leukemia through protein sequences, aiming for early detection and a reduction in the mortality rate. This undertaking could emerge as a flagship initiative in health sciences, addressing the shortage of specialized hematologists. Implementing the system could enable timely interventions and improve recovery prospects. Automating certain diagnostic processes could ease the load on specialists and enhance healthcare services. The potential impact goes beyond Leukemia diagnosis and may attract recognition and interest from the medical community. Overall, this AI-driven research holds promise for reshaping healthcare and advancing AI applications. It could yield new insights and progress in predicting and understanding CML, which might lead to more effective diagnostic and treatment methodologies, benefiting patients and healthcare systems. Furthermore, the successful integration of bioinformatics and AI could pave the way for new applications and further interdisciplinary research at the intersection of these two domains.

Fig 1. Various stages of chronic Myeloid leukemia classification.

The main contribution of our proposed research is as follows:

  • The current study focuses on protein sequential data rather than image data.

  • The most frequently mutated genes that were responsible for chronic myeloid leukemia were discovered through a literature review.

  • Datasets were formulated from the most frequently mutated gene data.

  • Features were gathered through analysing the physicochemical features of the amino acid composition, pseudo amino acid composition, and di-peptide composition.

  • The study aims to increase patient recovery prospects by improved early-stage prognosis.

  • The solution we suggest is a user-friendly online application dashboard that serves as a vital tool for early identification of CML. It can be easily implemented in healthcare facilities and hospitals.

This paper is structured as follows. The Introduction outlines the problem statement. The Literature review discusses related research and positions our study in the existing body of knowledge. Materials and methods details the dataset creation process and experimental techniques. Development of individual classifiers presents our methodology and analysis. Results and discussion interprets the findings. Lastly, the Conclusion summarizes our contributions and outlines future research directions.

Literature review

This section discusses recent Leukemia research, focusing on protein sequences, RNA, and blood cell imagery. It also covers how datasets are acquired and formed, which is pivotal in creating standardized Leukemia datasets from protein sequences. Importantly, previous researchers have not combined the three feature extraction techniques used here while also implementing a user-friendly dashboard, as done in this study. In [7], the Random Forest model was used to diagnose the cancerous growth of White Blood Cells with an accuracy of 94.3%. In [8], the classifier was evaluated using 60 images, demonstrating that models like K-nearest neighbors and the Naive Bayes classifier could identify ALL with an accuracy of 92.8%. In [9], the Artificial Bee Colony algorithm with a Back Propagation Neural Network (ABC-BPNN) and Principal Component Analysis (PCA) were used to classify Leukemia cells with an average accuracy of 98.72% while also speeding up the calculation.

In [10], Jothi et al. investigated the identification of leukemia sub-types, particularly ALL, using BSA-based clustering and classification algorithms such as decision tree (DT), K-nearest neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM). The SVM model exhibited an accuracy rate of 89.81%. The SVM model was also used in [11] to identify ALL, with an accuracy rate of 89.81%. In [12], ALL was classified using the K-nearest neighbor method, with a 96.25% accuracy rate. The studies in [36,37] explored the use of ML algorithms to analyze gene expression patterns derived from RNA sequencing (RNA-seq) for accurately predicting the likelihood of complete remission (CR) in pediatric AML patients after induction therapy. The study in [38] developed models for predicting and classifying different stages of colon cancer using RNA-seq data of extracellular vesicles (EV) from healthy individuals and colon cancer patients. It employed five canonical ML and Deep Learning (DL) classifiers, with reported accuracies of 94.6% for K-nearest neighbor, 97.33% for Random Forest, 93% for LMT, and 92% for Random Tree. In [39], the early diagnosis and distinction between types of lung cancer, i.e., Non-Small Cell Lung Cancer and Small Cell Lung Cancer, were highlighted as crucial for improving patient survival rates. The proposed diagnostic system used sequence-derived structural and physicochemical attributes of proteins associated with tumor types, employing feature extraction, selection, and prediction models.

Dhakal et al. [40,41] developed a stacking classifier method that specifically targets CTS selection criteria by utilising feature-encoding approaches. This algorithm generates feature vectors that include k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. The stacking classifier outperformed prior state-of-the-art algorithms in identifying functional miRNA targets, with an accuracy rate of 79.77%. In another study, Albitar et al. [50] used Next Generation Sequencing (NGS) and targeted RNA sequencing along with a machine learning approach to investigate the potential of discovering new biomarkers that can predict acute graft-vs.-host disease (aGVHD). Ahmad et al. [51] predicted chronic Lymphocytic Leukemia using protein sequences with Chou’s Pseudo Amino Acid Composition (PseAAC) and statistical moments. Jian et al. [52] used deep learning (DL) to develop a prediction model for transcription factor binding sites that relies only on the original DNA base sequences. In that study, a deep learning approach combining a convolutional neural network (CNN) and long short-term memory (LSTM) was developed to analyse four distinct categories of Leukaemia based on transcription factor binding sites. The analysis was conducted using four extensive non-redundant datasets for acute, chronic, myeloid, and lymphatic Leukaemia. The method achieved an average prediction accuracy of 75%.

Materials and methods

The proposed research centers on the detection of leukemia, specifically Chronic Myeloid Leukemia (CML), which is characterized by the neoplastic proliferation of White Blood Cells (WBCs) such as neutrophils, basophils, and eosinophils, excluding lymphocytes. As previously mentioned, CML is linked to a heightened mortality rate because it is typically diagnosed at advanced stages, which makes effective recovery difficult. In response to this concern, we aim to create a dashboard that identifies leukemia from protein sequence data. To achieve this goal, we collected data on the most frequently mutated genes related to leukemia and used the physicochemical properties of protein sequences for feature extraction. Subsequently, data augmentation techniques were applied to the extracted features, and outliers were detected and removed to ensure data quality. We employed a diverse set of machine learning algorithms, including Support Vector Machine (SVM) [14,15,53,57], XGBoost, Random Forest [16,17], KNN [18,19], logistic regression [54,58,59], and decision tree, as comprehensively described in a study review [20,21,26,55].

The accuracy of each algorithm was evaluated, and the one exhibiting the highest accuracy was selected for integration into our system. This chosen algorithm determines the presence or absence of cancer in an individual. Finally, we serialized our model using tools such as Pickle or Joblib, facilitating the preservation of the trained model alongside its associated data. These trained models were then incorporated into a Streamlit-based dashboard, enhancing their user-friendly deployment in hospitals and other medical facilities (see Fig 2).
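The serialization step described above can be sketched as a round trip through Python's standard `pickle` module. This is an illustration only: the `trained_model` dictionary is a hypothetical stand-in for a fitted estimator (its field names and values are invented for the example), and the paper does not specify which objects or file names were actually used.

```python
import pickle

# Hypothetical stand-in for a fitted estimator; in practice the trained
# classifier object itself would be passed to pickle in the same way.
trained_model = {
    "name": "logistic_regression",
    "coefficients": [0.8, -1.2, 0.05],
    "intercept": 0.3,
}

# Serialize to bytes (these bytes could be written to a file on disk).
payload = pickle.dumps(trained_model)

# Deserialize, as a dashboard process would do at start-up.
restored_model = pickle.loads(payload)

print(restored_model == trained_model)
```

Because unpickling can execute arbitrary code, a deployed dashboard should only load model files from a trusted source.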

Fig 2. Block diagram of designed system.

Dataset collection

The dataset for this study was collected from the UniProt database, which is a comprehensive resource for protein sequence and functional information. A keyword search was conducted on UniProt using terms such as “Chronic Myeloid Leukemia,” “BCL2,” “HSP90,” “PARP,” and “RB.” This search yielded a total of 2248 protein sequences. The most frequently mutated genes, i.e. BCL2, HSP90, PARP and RB, were utilized for CML [14]. Moreover, homologous samples were eliminated using a cutoff level of 0.6 [16]. HSP90 functions as a chaperone protein that is crucial in protein folding and degradation processes. Its up-regulation has been identified in various cancer types, including chronic myeloid leukemia (CML). Extensive research has demonstrated that inhibiting HSP90 can attenuate the growth of CML cells and enhance their susceptibility to chemotherapy and tyrosine kinase inhibitors (TKIs) [42,43]. PARP (Poly ADP-ribose polymerase) is an essential enzyme involved in DNA repair processes. Inhibiting PARP has demonstrated effectiveness in the treatment of cancers with BRCA mutations, and there is emerging evidence suggesting its potential applicability in managing CML [44,45].

The BCL2 (B-cell lymphoma 2) protein family plays a crucial role in regulating programmed cell death, known as apoptosis. Elevated levels of BCL2 have been linked to resistance to chemotherapy in chronic myeloid leukemia (CML) cells. Studies have demonstrated that inhibiting BCL2 can reinstate apoptosis in CML cells and boost the effectiveness of tyrosine kinase inhibitors (TKIs) [46,47]. RB (Retinoblastoma) is a pivotal tumor suppressor gene involved in regulating cell cycle progression. The deactivation of RB is a prevalent characteristic in CML, and research has established that its reactivation can impede the proliferation of CML cells [48,49]. The CML-related protein sequences were extracted in FASTA format from the Universal Protein Resource (UniProtKB) [15,22], which produced the positive dataset. To create the negative dataset, the same number of samples was gathered using the opposite query phrase. Consequently, the dataset created for CML is balanced.

FASTA format.

In bioinformatics, the FASTA format is a popular text-based format for representing proteins. It is derived from the FASTA software suite and follows a specific structure. A FASTA record starts with a single description line, followed by lines containing the sequence data [22]. The description line is distinguished from the sequence data by a greater-than symbol (“>”) in the first column. The term following the “|” sign is used to identify the sequence, while the rest of the line can be used to provide an additional description, though both are optional.
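The record structure described above can be read with a few lines of Python. The sketch below is a minimal parser, not the authors' pipeline: the function names `parse_fasta` and `uniprot_accession` are hypothetical, and the sample record (accession `X00000`, a made-up sequence) is invented purely to show the layout of a UniProt-style header.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniprot_accession(header):
    """Return the identifier between the first and second '|' signs."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header


# Invented record for illustration only; not a real UniProt entry.
SAMPLE = """>sp|X00000|DEMO_HUMAN Hypothetical demo protein
MPEETQ
TQDQPM
"""

records = parse_fasta(SAMPLE)
print(records[0][1], uniprot_accession(records[0][0]))
```

Sequence lines are concatenated, so the line wrapping used in the file does not affect the extracted sequence.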

Sample of protein sequence (HSP90).

Initially, the protein sequences contained redundant data. To address this, we employed the benchmark method CD-HIT (see Fig 3). Using a benchmark algorithm for redundancy removal is essential to ensure the validity and reliability of the data. CD-HIT, a sequence clustering program available online, was selected for this purpose, with a threshold of 0.6 [23]. This threshold removes redundancy effectively while preserving the integrity of the dataset.

Fig 3. Sample of protein sequence (HSP90).

Feature extraction

This section elaborates on the feature extraction techniques that use the physicochemical properties of the protein sequences. These techniques enable an effective representation of protein sequences and the extraction of information that is crucial for predicting Chronic Myeloid Leukemia. The feature extraction methods used in this study fall into three categories:

Amino acid composition.

AAC features capture how often each amino acid occurs in a protein sequence [24,25]. The percentage frequency of amino acid i in the jth protein, FAAC_{i,j}, is calculated using the formula below:

$$\mathrm{FAAC}_{i,j} = \left(\frac{n_{i,j}}{n_{a,j}}\right) \times 100 \tag{1}$$

In the above equation, $n_{i,j}$ denotes the number of amino acids of type i found in protein j, while $n_{a,j}$ refers to the total number of amino acids contained in protein j. The jth protein sequence in the FAAC features dataset is represented as a 20-dimensional (20-D) feature vector as follows:

$$X_j = \left[\mathrm{FAAC}_{1,j}, \mathrm{FAAC}_{2,j}, \ldots, \mathrm{FAAC}_{20,j}\right]^{T} \tag{2}$$

where $X_j$ represents the amino acid composition of the jth protein.
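Eqs (1) and (2) translate directly into code. The following is a minimal sketch, assuming the 20 standard amino acids in alphabetical one-letter order and ignoring any other symbols; the function name `amino_acid_composition` is hypothetical and the paper does not state how non-standard residues were handled.

```python
# The 20 standard amino acids, one-letter codes in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-D vector of percentage frequencies from Eq (1)."""
    # Assumption: residues outside the 20 standard codes are skipped.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)  # n_{a,j}
    return [100.0 * residues.count(aa) / total for aa in AMINO_ACIDS]


features = amino_acid_composition("AACD")
print(features[:3])
```

For the toy sequence `AACD`, alanine accounts for two of four residues, so the first component is 50 and the components for C and D are 25 each.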

Amino acid composition yields a 20-dimensional feature set. However, the extracted features proved to be of limited use: despite applying various feature engineering approaches and hyper-parameter tuning, accuracy remained constrained. Consequently, this approach was less effective in attaining the desired outcomes.

Pseudo amino acid composition.

The Pseudo Amino Acid Composition (PAAC) approach produces a 25-dimensional feature set from our data [13]. The features extracted through this method proved highly valuable: after further feature engineering, accuracy improved significantly, reaching 91% to 93%.

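$$P = \left[\mathrm{PAAC}_1, \mathrm{PAAC}_2, \ldots, \mathrm{PAAC}_{20}, \mathrm{PAAC}_{20+1}, \ldots, \mathrm{PAAC}_{20+\lambda}\right]^{T} \tag{3}$$

$$\mathrm{PAAC}_u = \frac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} T_k}, \quad (1 \le u \le 20) \tag{4}$$

$$\mathrm{PAAC}_u = \frac{w\,T_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} T_k}, \quad (20+1 \le u \le 20+\lambda) \tag{5}$$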
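In Chou's formulation, $f_i$ is the normalised occurrence frequency of amino acid i, $T_k$ is the k-th tier sequence-order correlation factor, $w$ is a weight factor, and $\lambda$ is the number of tiers. The sketch below illustrates Eqs (3) to (5) under several assumptions that are not stated in the paper: a single physicochemical property (the Kyte-Doolittle hydropathy scale) is used to compute $T_k$, $w = 0.05$, and $\lambda = 5$ is inferred from the reported 25 dimensions. The function name `pseudo_aac` is hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values; chosen here as an illustrative property.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def _standardise(scale):
    """Rescale property values to zero mean and unit standard deviation."""
    values = [scale[aa] for aa in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (scale[aa] - mean) / std for aa in AMINO_ACIDS}


def pseudo_aac(sequence, lam=5, w=0.05):
    """Return the (20 + lam)-D pseudo amino acid composition vector."""
    # Assumption: non-standard residues are dropped before processing.
    seq = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = _standardise(KYTE_DOOLITTLE)
    n = len(seq)
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]  # f_i
    tiers = []  # T_k for k = 1..lam
    for k in range(1, lam + 1):
        sq = [(prop[seq[i]] - prop[seq[i + k]]) ** 2 for i in range(n - k)]
        tiers.append(sum(sq) / (n - k))
    denom = sum(freqs) + w * sum(tiers)
    return [f / denom for f in freqs] + [w * t / denom for t in tiers]


vector = pseudo_aac("ACDEFGHIKLMNPQRSTVWY")
print(len(vector))
```

By construction the components sum to one, since the numerators of Eqs (4) and (5) add up to the shared denominator.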

Specifically, we depict the changes in data distribution before and after outlier removal. Additionally, we conducted data augmentation on the processed dataset to further enhance its accuracy.

Di-peptide composition.

Dipeptide features correspond to the ordered pairs of amino acids AA, AC, AD, ..., YV, YW, and YY. There are 400 such components. The DC feature of each component is determined as follows:

$$\mathrm{DC}(i) = \frac{\mathrm{DC}_{\mathrm{Total}}(i)}{400} \tag{6}$$

where DC(i) represents the composition of the ith dipeptide for $i = 1, 2, \ldots, 400$. In vector form, this feature space is represented as:

$$X_{DC} = \left[\mathrm{DC}_{AA}, \mathrm{DC}_{AC}, \mathrm{DC}_{AD}, \ldots, \mathrm{DC}_{YY}\right]^{T}$$
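The 400-dimensional dipeptide vector can be computed by sliding a window of two residues along the sequence. The sketch below is one way to do this; the function name `dipeptide_composition` is hypothetical. Note one assumption: counts are normalised by the number of dipeptides found in the sequence, which is the common convention, whereas Eq (6) as printed divides by 400.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered pairs, in the order AA, AC, AD, ..., YW, YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-D vector of dipeptide frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    seq = sequence.upper()
    total = 0
    # Overlapping windows of two adjacent residues.
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


dpc = dipeptide_composition("AAC")
print(dpc[DIPEPTIDES.index("AA")], dpc[DIPEPTIDES.index("AC")])
```

The toy sequence `AAC` contains the two dipeptides AA and AC, so each receives a frequency of 0.5 and all other components are zero.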

The di-peptide composition technique yields 400 features. However, not all of these features were essential: feature engineering showed that only 229 of the initial 400 were necessary. After this selection, the accuracy of our results improved significantly, reaching 91% to 93%. The graphs illustrate the dataset before and after outlier removal.

Data augmentation.

Data augmentation begins by separating the dataset into positive and negative segments, that is, isolating patients who tested positive from those with negative results. A series of operations then generates numerical replicas of the existing data, increasing the sample size and giving the machine learning algorithms more data to train on. During the creation of these replicas, the data is converted from its initial format into a list structure.

The modified data is then converted from this list format back into a data frame and reintegrated, which completes the data augmentation process.

Development of individual classifiers

Support vector machine

The SVM classifier works by constructing a hyperplane that separates the classes with the largest possible margin [27,28,56]. SVM’s decision surface is as follows:

$$Y(X) = \sum_{i=1}^{n} \alpha_i t_i X_i^{T} X + \mathrm{bias} \tag{7}$$

We selected the following parameters: Kernel = “rbf”, Degree = 8, C = 10000, gamma = 100000, probability = True.
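The decision surface in Eq (7) can be evaluated directly once the support vectors, multipliers, and bias are known. The sketch below is illustrative only: the support vectors, `ALPHAS`, `TARGETS`, and `BIAS` are toy values invented for the example, not values from the trained model, and the function names are hypothetical. An RBF kernel is included as an alternative to the inner product because the reported configuration uses Kernel = “rbf”.

```python
import math


def linear_kernel(a, b):
    """Inner product X_i^T X, as written in Eq (7)."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, alphas, targets, bias,
                 kernel=linear_kernel):
    """Evaluate sum_i alpha_i * t_i * K(X_i, x) + bias."""
    weighted = (a * t * kernel(sv, x)
                for a, t, sv in zip(alphas, targets, support_vectors))
    return sum(weighted) + bias


# Toy values for illustration; not taken from the paper.
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
ALPHAS = [0.25, 0.25]
TARGETS = [1, -1]
BIAS = 0.0

score = svm_decision((2.0, 0.0), SUPPORT_VECTORS, ALPHAS, TARGETS, BIAS)
print(score)
```

The sign of the score gives the predicted class. In scikit-learn's `SVC`, the `degree` parameter applies only to the polynomial kernel, so with an RBF kernel the reported Degree = 8 would have no effect if that library was used.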

Random forest

This method generates a large number of decision trees whose outputs are combined to arrive at a final decision. We selected 129,361 samples for training and 86,228 samples for testing, and found the best number of estimators to be n = 50. In the case of dipeptide composition, we selected 2536 samples for training and 845 for testing, and n = 150 estimators gave optimal results.

$$Y(X) = \sum_{i=1}^{n_t} h_i(X) \tag{8}$$

K-Nearest Neighbor (KNN)

The KNN algorithm learns directly from stored samples [29,30]. Instance-based classifiers assume that an unknown instance can be classified by comparing it to known instances using a distance/similarity function [31–33,56]. The Euclidean distance $d(K_i, K_j)$ between two m-dimensional vectors $K_i$ and $K_j$ is calculated as follows:

$$d(K_i, K_j) = \sqrt{(k_{i,1} - k_{j,1})^2 + (k_{i,2} - k_{j,2})^2 + \cdots + (k_{i,m} - k_{j,m})^2} \tag{9}$$
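The distance in Eq (9) and the resulting nearest-neighbour vote can be sketched in a few lines. This is a minimal illustration: the function names, the choice of k = 3, and the toy samples are assumptions made for the example, since the paper does not report the value of k used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two m-dimensional vectors, Eq (9)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, samples, labels, k=3):
    """Return the majority label among the k nearest training samples."""
    ranked = sorted(zip(samples, labels),
                    key=lambda pair: euclidean(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data for illustration; not taken from the paper.
SAMPLES = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
LABELS = [0, 0, 0, 1, 1]

prediction = knn_predict((0.1, 0.1), SAMPLES, LABELS, k=3)
print(prediction)
```

The query point lies next to the three samples labelled 0, so all three votes go to class 0.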

Naïve Bayes

This learning procedure applies Bayes’ rule under the assumption of independent attributes/features [57–59]. The Gaussian model is trained with equal prior probabilities as follows:

$$P(X_{f_1}, X_{f_2}, \ldots, X_{f_n} \mid c) = \prod_{i=1}^{n} P(X_{f_i} \mid c) \tag{10}$$

$$P(X_{f_i} \mid c) = \frac{P(c_i \mid X_f)\,P(X_f)}{P(c_i)} \tag{11}$$

XGBoost

Gradient boosting is a boosting approach that significantly lowers errors by adding several classifiers to pre-existing models. The term “gradient boosting" refers to using a gradient descent strategy to minimize loss. The steps involved in gradient boosting are as follows:

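$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \tag{12}$$

$$r_{im} = -\alpha \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right] \tag{13}$$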
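The first two steps of gradient boosting can be made concrete for one specific loss. The sketch below assumes the squared-error loss L(y, F) = (y - F)^2 / 2, which is not stated in the paper and is chosen only because both steps then have closed forms: the constant that minimises the summed loss is the mean of the targets, and the negative gradient with respect to F is the residual y - F. The function names are hypothetical.

```python
def initial_prediction(targets):
    """F_0: the constant minimising the summed squared-error loss."""
    return sum(targets) / len(targets)


def pseudo_residuals(targets, predictions, alpha=1.0):
    """Negative gradient of the squared-error loss, scaled by alpha."""
    # d/dF [(y - F)^2 / 2] = -(y - F), so the negative gradient is y - F.
    return [alpha * (y - f) for y, f in zip(targets, predictions)]


targets = [1.0, 2.0, 3.0]
f0 = initial_prediction(targets)
residuals = pseudo_residuals(targets, [f0] * len(targets))
print(f0, residuals)
```

Each subsequent tree is then fitted to these residuals and added to the current model. For a classification task such as this one, a log-loss would typically replace the squared error, with the same two-step structure.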

Logistic regression

In categorical binary classification, a statistical machine-learning approach called logistic regression is employed [34]. The parameters we selected were C = 10, tol = 0.1, and penalty = L2.

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-\beta^{T} X}} \tag{14}$$
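Eq (14) maps a weighted sum of the features to a probability between 0 and 1. A minimal sketch follows, assuming a coefficient vector `beta` of the same length as the feature vector; the function name `predict_proba` and the example values are hypothetical. The two branches avoid overflow in the exponential for large positive or negative scores.

```python
import math


def predict_proba(x, beta):
    """Return P(y = 1 | X) from Eq (14) for one feature vector."""
    z = sum(b * v for b, v in zip(beta, x))  # beta^T X
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that keeps the exponent negative for stability.
    e = math.exp(z)
    return e / (1.0 + e)


# Illustrative coefficients and features; not taken from the paper.
probability = predict_proba([0.0, 0.0], [1.0, 1.0])
print(probability)
```

A zero score gives a probability of exactly 0.5, which is the usual decision boundary between the positive and negative class.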

Results and discussion

Results on pseudo amino acid composition (Pse-AAC) data

The results for the metrics employed in the project, namely Accuracy, F1-score, Recall [35], and Specificity, on the Pse-AAC data are displayed in Table 1 below.

Table 1. Results on pseudo amino acid composition (Pse-AAC) data.

| Name of Algorithm | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 92–94% | 91–92% | 91–93% | 92–94% |
| Extreme Gradient Boost | 79–85% | 63–70% | 51–55% | 92–94% |
| Logistic Regression | 66–69% | 10–20% | 6–10% | 97–98% |
| Decision Tree | 81–84% | 73–76% | 74–76% | 84–86% |
| Random Forest | 87–91% | 85–87% | 80–83% | 96–97% |
| K Nearest Neighbor | 82–86% | 72–74% | 61–64% | 93–95% |

Table 2 presents the results of each machine learning (ML) model concerning the data utilized, specifically the Pse-AAC data. It also includes the outcomes of additional metrics used in the research, namely Specificity and Confusion Matrix. These metrics provide insights into the True Positive, True Negative, False Positive, and False Negative values, contributing to a comprehensive evaluation of the models’ performance.

Table 2. Confusion matrix (Pse-AAC data).

| Name of Algorithm | TN | FP | FN | TP |
|---|---|---|---|---|
| Support Vector Classifier | 424 | 28 | 14 | 211 |
| Extreme Gradient Boost | 26159 | 2271 | 3435 | 10890 |
| Logistic Regression | 25817 | 2849 | 11010 | 3445 |
| Decision Tree | 24388 | 4278 | 3803 | 10652 |
| Random Forest | 28014 | 808 | 2753 | 11546 |
| K Nearest Neighbor | 419 | 23 | 95 | 140 |

TN = True Negative, FP = False Positive, FN = False Negative, TP = True Positive

Accuracy results on amino acid composition (AAC) data

The research employs Accuracy score, F1-score, Recall score, and Specificity as metrics on the AAC data. The outcomes of these metrics are presented in Table 3 below.

Table 3. Result on amino acid composition (AAC) data.

| Name of Algorithm | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 54.95% | 14.3% | 0.7% | 100% |
| Extreme Gradient Boost | 56.8% | 52.9% | 45.9% | 69% |
| Logistic Regression | 51.1% | 27.6% | 19.1% | 81.7% |
| Decision Tree | 54.4% | 52.25% | 52.9% | 55.8% |
| Random Forest | 50.6% | 41.1% | 35.4% | 64.9% |
| K Nearest Neighbor | 54.2% | 54.8% | 57% | 51% |

The following table (Table 4) presents the results of each machine learning (ML) model on the AAC data. It reports the confusion matrix for each model, that is, the True Positive, True Negative, False Positive, and False Negative counts from which Specificity and the other metrics are derived, contributing to a comprehensive assessment of the models’ performance.

Table 4. Confusion matrix (AAC data).

| Name of Algorithm | TN | FP | FN | TP |
|---|---|---|---|---|
| Support Vector Classifier | 271 | 0 | 121 | 62 |
| Extreme Gradient Boost | 409 | 23 | 119 | 103 |
| Logistic Regression | 9028 | 2022 | 8519 | 2025 |
| Decision Tree | 124 | 98 | 95 | 107 |
| Random Forest | 12612 | 6817 | 11832 | 6510 |
| K Nearest Neighbor | 112 | 105 | 89 | 118 |

TN = True Negative, FP = False Positive, FN = False Negative, TP = True Positive

Accuracy results on di-peptide composition (DPC)

The table below (Table 5) displays the Accuracy, F1-score, Recall, and Specificity metrics used in the research and their respective outcomes when applied to the DPC data.

Table 5. Results on di-peptide composition (DPC) data.

| Name of Algorithm | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 92–94% | 87–88% | 91–93% | 90–93% |
| Extreme Gradient Boost | 79–84% | 66–68% | 55–57% | 92–94% |
| Logistic Regression | 66–69% | 0% | 6–10% | 100% |
| Decision Tree | 81–84% | 70–73% | 56–59% | 96–97% |
| Random Forest | 82–84% | 67–68% | 57–58% | 94–95% |
| K Nearest Neighbor | 72–73% | 31–32% | 20–21% | 95–97% |

The performance of each machine learning model is analyzed concerning the DPC data utilized. Additionally, the Specificity and Confusion Matrix results are presented (Table 6). This matrix provides essential values such as True Positive, True Negative, False Positive, and False Negative, contributing to a comprehensive evaluation of the models’ performance.

Table 6. Confusion matrix (DPC data).

| Name of Algorithm | TN | FP | FN | TP |
|---|---|---|---|---|
| Support Vector Classifier | 416 | 37 | 17 | 207 |
| Extreme Gradient Boost | 413 | 25 | 105 | 134 |
| Logistic Regression | 453 | 0 | 224 | 0 |
| Decision Tree | 433 | 16 | 54 | 134 |
| Random Forest | 437 | 23 | 93 | 124 |
| K Nearest Neighbor | 438 | 15 | 179 | 45 |

TN = True Negative, FP = False Positive, FN = False Negative, TP = True Positive
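The reported metrics follow from the four confusion matrix counts. As a worked check, the sketch below applies the standard definitions of accuracy, recall (sensitivity), specificity, and F1-score to the Support Vector Classifier counts on the DPC data (TN = 416, FP = 37, FN = 17, TP = 207). The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive standard metrics from the four confusion matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Support Vector Classifier counts on the DPC data.
metrics = classification_metrics(tn=416, fp=37, fn=17, tp=207)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These counts give an accuracy of about 92.0%, recall of about 92.4%, specificity of about 91.8%, and an F1-score of about 88.5%, all within the ranges reported for this classifier on the DPC data.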

Machine learning based dashboard

Figs 4 and 5 provide an overview of the dashboard developed using Streamlit, which is accessible through Streamlit Cloud. This interactive dashboard enables users to select their preferred model for analysis (Fig 4). Within this user-friendly interface, users upload patient records directly through the web application and select a specific prediction model. Subsequently, they can review the results (Fig 5) to ascertain whether an individual is affected by leukemia. Users can effortlessly select

Fig 4. Dashboard for CML overview.

Fig 5. Dashboard for CML with prediction.

Conclusion

This research is focused on Chronic Myeloid Leukemia (CML), a condition characterized by genetic mutations leading to abnormal proliferation of white blood cells, red blood cells, and platelets. While MRI and CT scans have been extensively used in cancer detection, research on protein sequence data in this domain is limited. By leveraging information from mutated genes like BCL2, HSP90, PARP, and RB, the research aims to revolutionize early CML prediction. Through rigorous data preprocessing and feature extraction techniques, we achieved an impressive accuracy rate of 92–94%. The proposed approach integrates diverse machine learning algorithms such as SVM, Decision Trees, XGBoost, Random Forest, and KNN, each offering unique strengths in pattern recognition and prediction. The resulting dashboard facilitates easy prediction of CML in patients, enhancing clinical workflows and potentially saving lives. This study sheds light on critical scientific challenges in CML research, offering insights into disease mechanisms and biomarker identification. We envision expanding this research to encompass multi-cancer detection, integrating AI and bioinformatics with healthcare systems for enhanced cancer diagnosis and improved patient outcomes.

Acknowledgments

The authors extend their appreciation to the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R513), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia and would like to express their gratitude to anonymous referees for their insightful comments and recommendations, which have significantly enhanced this paper. Furthermore, the authors would like to express their gratitude to Datamatics Technologies for their invaluable contributions.

Data Availability

All relevant data for this study are publicly available from the GitHub repository (https://github.com/awaismalik1x/CML_Prediction_Data.git).

Funding Statement

The authors are grateful to the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R513) at Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia, for providing the necessary funding for this work.


Decision Letter 0

Salman Sadullah Usmani

6 Nov 2024

PONE-D-24-31032

Machine Learning Driven Dashboard for Chronic Myeloid Leukemia Prediction using Protein Sequences

PLOS ONE

Dear Dr. Alahmadi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 21 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Salman Sadullah Usmani, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.  

The American Journal Experts (AJE) (https://www.aje.com/) is one such service that has extensive experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. Please note that having the manuscript copyedited by AJE or any other editing services does not guarantee selection for peer review or acceptance for publication. 

Upon resubmission, please provide the following: 

● The name of the colleague or the details of the professional service that edited your manuscript

● A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)

● A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

4. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

5. Please provide a complete Data Availability Statement in the submission form, ensuring you include all necessary access information or a reason for why you are unable to make your data freely accessible. If your research concerns only data provided within your submission, please write "All data are in the manuscript and/or supporting information files" as your Data Availability Statement.

6. We are unable to open your Supporting Information files bibliography.bib and plos2015.bst. Please revise them as necessary and re-upload.

7. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Comment 1: The Materials & Methods section at line 140 appears to be incomplete. Additionally, please include the details of the genes in the introduction. In the Materials & Methods section, simply mention the database used for the dataset collection, the keyword search conducted on UniProt, and the number of sequences obtained from UniProt.

Comment 2: I’m unclear on the need to explain the FASTA file format, as it is a widely known format commonly used in sequencing. Do not make another section for this just add in the dataset collection section.

Comment 3: In the "Sample of Protein Sequence (HSP90)" section, the total number of sequences before and after filtering is missing. Please include this information in the Materials & Methods section.

Comment 4: Please specify the training and validation datasets, including the number of sequences for each set and each protein. If possible, consider using a table to clearly present the total number of sequences used for training and validation for the BCL2, HSP90, PARP, and RB proteins.

Comment 5: It is recommended to perform 5-fold or 10-fold cross-validation on your internal dataset (training dataset) to enhance the reliability of your results.

Comment 6: In the results section, please include the performance metrics for the training and validation models for each protein. If the dataset is too small to make predictions for individual proteins, please explain the rationale behind merging different datasets.

Comment 7: It is recommended to separate the results and discussion sections. This would allow you to include other methods that perform similar analyses in the discussion section. Additionally, if you identify other methods that create similar dashboards, it would be valuable to include a comparison.

Comment 8: The quality of the images is very poor and needs to be improved.

Comment 9: The link to the dashboard app is missing. Additionally, please include a section in the Materials & Methods that outlines the architecture for creating this app.

Please include link for the preprint.

Reviewer #2: Your work could greatly improve the early diagnosis and treatment of CML, particularly in areas where specialized healthcare is hard to access. Here are some key points and suggestions to enhance your work:

Regarding the dashboard, highlight its definition, importance, applications, design principles, and evaluation methods.

Compare the models by highlighting their strengths and weaknesses in various scenarios.

Highlight the novelty and significance of using protein sequences for CML prediction.

Although most references are recent, some older ones (e.g., from 2004 and 2011) should be updated with more current studies to reflect the latest advancements in the field.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Anjali Dhall

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Comments_PONE-D-24-31032.docx

pone.0321761.s001.docx (14.9KB, docx)
PLoS One. 2025 Jun 18;20(6):e0321761. doi: 10.1371/journal.pone.0321761.r003

Author response to Decision Letter 1


Authors’ response (#PONE-D-24-31032)

Original Article Title: Machine Learning Driven Dashboard for Chronic Myeloid Leukemia Prediction using Protein Sequences

Dear Editors and Reviewers,

We are very grateful for the opportunity provided by the Editors to improve our manuscript (PONE-D-24-31032) and for the valuable suggestions and insightful comments from the anonymous reviewers. Following these constructive suggestions and detailed feedback, we have carefully revised the manuscript and implemented several necessary modifications. Below, we provide a detailed response to the comments and suggestions from the Editor and reviewers:

Review 1:

The Materials & Methods section at line 140 appears to be incomplete. Additionally, please include the details of the genes in the introduction. In the Materials & Methods section, simply mention the database used for the dataset collection, the keyword search conducted on UniProt, and the number of sequences obtained from UniProt.

Answer:

Thank you for your valuable feedback. We have made the following revisions in response to your comment:

1. Introduction: We have included the details of the genes associated with Chronic Myeloid Leukemia (CML) in the introduction, specifically mentioning BCL2, HSP90, PARP, and RB as relevant genes involved in the disease. This additional information provides a clearer context for the study and the protein sequences used.

2. Materials & Methods: The Materials & Methods section has been updated to address the completeness of the description. We have now explicitly mentioned that the dataset was collected from the UniProt database. We also outlined the keyword search terms used for data retrieval and specified the number of sequences obtained.

We believe these updates enhance the clarity of the manuscript and provide the necessary details regarding the dataset collection process.

Review 2:

I’m unclear on the need to explain the FASTA file format, as it is a widely known format commonly used in sequencing. Do not make another section for this just add in the dataset collection section.

Answer:

Thank you for your feedback. We have removed the separate section explaining the FASTA format. Instead, we’ve incorporated a concise mention of the FASTA format directly in the Dataset Collection section, where it is most relevant. This change simplifies the manuscript while providing the necessary context.

Review 3:

In the "Sample of Protein Sequence (HSP90)" section, the total number of sequences before and after filtering is missing. Please include this information in the Materials & Methods section.

Answer:

Thank you for pointing that out. We have added the total number of sequences both before and after filtering in the Dataset Collection section. Specifically, we now mention that there were 2248 sequences initially obtained from UniProt, and after redundancy removal using the CD-Hit method, 2144 sequences remained in the dataset. This additional information helps clarify the data processing steps.
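The counts quoted above imply that the redundancy filter removed 104 sequences, roughly 4.6% of the initial set. A minimal Python sketch of that arithmetic follows. The function name `redundancy_summary` is hypothetical and is not taken from the authors' code, and the sketch only summarises the reported counts; it does not reproduce CD-HIT clustering.

```python
def redundancy_summary(n_before: int, n_after: int) -> dict:
    """Summarise how many sequences a redundancy filter removed."""
    if n_before <= 0 or not 0 <= n_after <= n_before:
        raise ValueError("expected n_before > 0 and 0 <= n_after <= n_before")
    removed = n_before - n_after
    return {
        "removed": removed,
        "fraction_removed": removed / n_before,
        "fraction_kept": n_after / n_before,
    }


# Counts as stated in the response: 2248 before filtering, 2144 after.
summary = redundancy_summary(2248, 2144)
print(summary["removed"])                            # 104
print(round(100 * summary["fraction_removed"], 2))   # 4.63
```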

Review 4:

Please specify the training and validation datasets, including the number of sequences for each set and each protein. If possible, consider using a table to clearly present the total number of sequences used for training and validation for the BCL2, HSP90, PARP, and RB proteins.

Answer:

Thank you for your suggestion. We have now specified the number of sequences used for training and validation for each protein (BCL2, HSP90, PARP, and RB) in the Dataset Collection section which enhances clarity and helps with the reproducibility of the dataset.

Review 5:

It is recommended to perform 5-fold or 10-fold cross-validation on your internal dataset (training dataset) to enhance the reliability of your results.

Answer:

Thank you for the suggestion. We appreciate the recommendation to perform 5-fold or 10-fold cross-validation to enhance the reliability of the results. We have already implemented 5-fold cross-validation on the internal training dataset. This step was included to ensure robust model evaluation and minimize potential overfitting, further validating the effectiveness of the model.

Review 6:

In the results section, please include the performance metrics for the training and validation models for each protein. If the dataset is too small to make predictions for individual proteins, please explain the rationale behind merging different datasets.

Answer:

Thank you for your insightful comment. Regarding the merging of datasets, we formulated the dataset based on the most frequently mutated genes responsible for Chronic Myelogenous Leukemia (CML). This approach allowed us to create a more comprehensive and robust dataset, which is crucial for improving model performance. By merging the different protein datasets, we were able to leverage a larger pool of data, enhancing the generalization of the model and improving the reliability of the predictions. The performance metrics cover accuracy, precision, recall, F1-score, and AUC, offering a comprehensive evaluation of the models.
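Four of the metrics named above follow directly from the confusion-matrix counts of a binary classifier. The sketch below shows the standard definitions. The function name `binary_metrics` and the example counts are hypothetical and are not results from the manuscript. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in m.items()})
# {'accuracy': 0.85, 'precision': 0.8, 'recall': 0.889, 'f1': 0.842}
```

Recall is the same quantity as sensitivity, and specificity can be obtained analogously as `tn / (tn + fp)`.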

Review 7:

It is recommended to separate the results and discussion sections. This would allow you to include other methods that perform similar analyses in the discussion section. Additionally, if you identify other methods that create similar dashboards, it would be valuable to include a comparison.

Answer:

Thank you for your suggestion. While we understand the benefits of separating the Results and Discussion sections, we have chosen to keep them integrated to maintain a cohesive narrative, as this aligns well with the structure and focus of our study. The methods used in this study are novel and specifically tailored to address the challenges of CML prediction. As such, there are no direct alternatives to compare in this context.

Review 8:

The quality of the images is very poor and needs to be improved.

Answer:

Thank you for your feedback. We apologize for the poor quality of the images in the original submission. The issue likely arose during the conversion of the original images into TIFF format, which may have affected their quality. In the revised manuscript, we have updated the images to higher-resolution versions to ensure improved clarity and readability. We appreciate your understanding and will ensure that all images meet the required quality standards.

Review 9:

The link to the dashboard app is missing. Additionally, please include a section in the Materials & Methods that outlines the architecture for creating this app.

Answer:

Thank you for your suggestion. We will include the link to the dashboard app in the revised manuscript. Additionally, we will add a section in the Materials & Methods outlining the architecture used for creating the app, providing a clearer understanding of its design and implementation.

https://cmlapp-k9xhmtb7tthequv47farry.streamlit.app/

Attachment

Submitted filename: Rebuttal Letter (PONE-D-24-31032) ).docx

pone.0321761.s002.docx (17.3KB, docx)

Decision Letter 1

Salman Sadullah Usmani

11 Mar 2025

Machine Learning Driven Dashboard for Chronic Myeloid Leukemia Prediction using Protein Sequences

PONE-D-24-31032R1

Dear Dr. Alahmadi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Salman Sadullah Usmani, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Anjali Dhall

Reviewer #2: No

**********

Acceptance letter

Salman Sadullah Usmani

PONE-D-24-31032R1

PLOS ONE

Dear Dr. Alahmadi,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Salman Sadullah Usmani

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    Attachment

    Submitted filename: Comments_PONE-D-24-31032.docx

    pone.0321761.s001.docx (14.9KB, docx)
    Attachment

    Submitted filename: Rebuttal Letter (PONE-D-24-31032) ).docx

    pone.0321761.s002.docx (17.3KB, docx)

    Data Availability Statement

    All relevant data for this study are publicly available from the GitHub repository (https://github.com/awaismalik1x/CML_Prediction_Data.git).


    Articles from PLOS One are provided here courtesy of PLOS
