Abstract
The prevalence of Leukaemia, a malignant blood cancer that originates from hematopoietic progenitor cells, is increasing in Southeast Asia, with a worrisome fatality rate of 54%. Predicting outcomes in the early stages is vital for improving the chances of patient recovery. The aim of this research is to substantially enhance early-stage prediction systems. Using Machine Learning and Data Science, we exploit protein sequential data from commonly altered genes, including BCL2, HSP90, PARP, and RB, to make predictions for Chronic Myeloid Leukaemia (CML). Our methodology relies on established feature extraction methods, namely Di-peptide Composition (DPC), Amino Acid Composition (AAC), and Pseudo amino acid composition (Pse-AAC). We also address the identification and handling of outliers, as well as the validation of feature selection using the Pearson Correlation Coefficient (PCC). Data augmentation is applied to enlarge the dataset for analysis. Using several Machine Learning models, namely Support Vector Machine (SVM), XGBoost, Random Forest (RF), K Nearest Neighbour (KNN), Decision Tree (DT), and Logistic Regression (LR), we achieved accuracy rates ranging from 66% to 94%. These classifiers are evaluated using performance criteria such as accuracy, sensitivity, specificity, F1-score, and the confusion matrix. The solution we suggest is a user-friendly online application dashboard that can be used for early detection of CML. This tool has significant implications for practitioners and may be used in healthcare institutions and hospitals.
Introduction
Leukemia is a complex medical condition influenced by genetic regulation in the production of blood cells. When hematopoietic precursor cells turn malignant [1], alterations in DNA and RNA sequences give rise to abnormal cell growth. This transformation results in the infiltration of healthy cells by malignant ones, thus causing Leukemia. The illness primarily entails the uncontrolled proliferation of White Blood Cells (WBC), specifically neutrophils, basophils, and eosinophils, while lymphocytes remain unaffected. Acute myeloid Leukemia (AML), chronic myeloid Leukemia (CML), acute lymphoblastic Leukemia (ALL), and chronic lymphocytic Leukemia (CLL) are some of the several kinds of Leukemia [2]. Our research focuses solely on Chronic Myeloid Leukemia (CML).
Leukemia cancer presents a substantial health challenge due to the abnormal proliferation of White Blood Cells (WBC) [1]. While research has concentrated on detecting cancer through blood cell images, exploration of Protein Sequential data is limited. Leukemia diagnosis heavily relies on hematologists, posing limitations in regions with a scarcity of specialists. Mortality rates are on the rise, particularly in South East Asia [3], creating a demand for an early detection approach. The motivation for driving the proposed research arises from the observation that a plethora of research has been conducted on cancer predictions—such as lung cancer, liver cancer, colon cancer, ovarian cancer, etc. utilizing MRI (magnetic resonance imaging), CT (computed tomography) scans, image processing techniques and protein sequences [4–6]. However, the realm of gene data in bioinformatics remains relatively uncharted, especially within the context of Chronic Myeloid Leukemia (CML). At present, no AI-based Dashboard system predicts Leukemia based on protein sequences, but developing such a system could revolutionize the diagnosis, leading to saved lives and eased healthcare burdens. Collaborative efforts between Machine Learning and Data Science can establish a robust model for accessible and timely Leukemia solutions.
As illustrated in Fig 1, the proposed research suggests the utilization of Machine Learning-based techniques to identify genes that cause Leukemia through Protein Sequences, aiming for early detection and a reduction in the mortality rate. This undertaking could emerge as a flagship initiative in health sciences, addressing the shortage of specialized hematologists. Implementation of the system would result in timely interventions and improved recovery prospects. Automating certain diagnostic processes could ease the load on specialists and enhance healthcare services. The potential impact goes beyond Leukemia diagnosis, garnering recognition, and interest from the medical community. Overall, this AI-driven research holds immense promise in reshaping healthcare and propelling the advancement of AI applications. Because of this research, innovative insights, and progress in predicting and comprehending CML could come to fruition. This might lead to more effective diagnostic and treatment methodologies, benefiting patients and healthcare systems. Furthermore, the successful integration of bioinformatics and AI could pave the way for pioneering applications and further interdisciplinary research at the intersection of these two promising domains.
Fig 1. Various stages of chronic Myeloid leukemia classification.
The main contributions of our proposed research are as follows:
- The current study focuses on protein sequential data rather than image data.
- The most frequently mutated genes responsible for chronic myeloid leukemia were identified through a literature review.
- Datasets were formulated from the most frequently mutated gene data.
- Features were gathered by analysing the physicochemical features of the amino acid composition, pseudo amino acid composition, and di-peptide composition.
- The study aims to increase patient recovery prospects through improved early-stage prognosis.
- The solution we suggest is a user-friendly online application dashboard that serves as a tool for early identification of CML. It can be easily implemented in healthcare facilities and hospitals.
This paper is structured as follows. The Introduction outlines the problem statement. The Literature review discusses related research, positioning our study in the existing body of knowledge. Materials and methods details the dataset creation process and experimental techniques. Development of individual classifiers presents our methodology and analysis. Results and discussion interprets the findings. Lastly, the Conclusion summarizes our contributions and outlines future research directions.
Literature review
This section discusses recently conducted Leukemia research, focusing on Protein Sequences, RNA, and blood cell imagery. It also elaborates on acquiring and forming the dataset, which is pivotal in creating standardized Leukemia datasets from protein sequences. Importantly, previous researchers have not combined these three distinct feature extraction techniques while implementing a user-friendly dashboard, as done in this study. In [7], the Random Forest model was utilized to diagnose the cancerous growth of White Blood Cells with an accuracy of 94.3%. In the research by [8], the classifier was evaluated using 60 photos, demonstrating that models like K-nearest neighbors and Naive Bayes Classifier could identify ALL with an accuracy of 92.8%. According to research [9], the Artificial Bee Colony algorithm – Back Propagation Neural Network (ABC-BPNN) scheme and Principal Component Analysis (PCA) were used to classify Leukemia cells with an average accuracy of 98.72% while also speeding up the calculation.
In [10], Jothi et al. investigated the identification of leukemia sub-types, particularly ALL, using BSA-based clustering and advanced classification algorithms such as decision tree (DT), K-nearest neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM). The SVM model exhibited an accuracy rate of 89.81%. The SVM model was also used in [11] to identify ALL, with an accuracy rate of 89.81%. In [12], ALL was classified using the K-nearest neighbor method, with a 96.25% accuracy rate. The studies in [36,37] explored the use of ML algorithms to analyze gene expression patterns derived from RNA sequencing (RNA-seq) for accurately predicting the likelihood of complete remission (CR) in pediatric AML patients after induction therapy. The research in [38] developed models for predicting and classifying different stages of colon cancer using RNA-seq data of extracellular vesicles (EV) from healthy individuals and colon cancer patients. The study employed five canonical ML and Deep Learning (DL) classifiers, achieving accuracies of 94.6% for K-nearest neighbor, 97.33% for Random Forest, 93% for LMT, and 92% for Random Tree. In [39], the early diagnosis and distinction between types of lung cancers, i.e., Non-Small Cell Lung Cancer and Small Cell Lung Cancer, were highlighted as crucial for improving patient survival rates. The proposed diagnostic system utilized sequence-derived structural and physicochemical attributes of proteins associated with tumor types, employing feature extraction, selection, and prediction models.
Dhakal et al. [40,41] developed a stacking classifier method that specifically targets CTS selection criteria by utilising feature-encoding approaches. This algorithm generates feature vectors that include k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. The stacking classifier method demonstrated superior performance compared to prior cutting-edge algorithms in identifying functional miRNA targets, with an accuracy rate of 79.77%. In another study, Albitar et al. [50] used Next Generation Sequencing (NGS) and targeted RNA sequencing along with a machine learning approach to investigate the potential of discovering new biomarkers that can predict acute graft-vs.-host disease (aGVHD). Ahmad et al. [51] predicted chronic Lymphocytic Leukemia using protein sequences with Chou’s Pseudo Amino Acid Composition (PseAAC) and statistical moments. Jian et al. [52] utilised deep learning (DL) to develop a prediction model for transcription factor binding sites using only the original DNA base sequences. In that study, a deep learning approach utilising a convolutional neural network (CNN) and long short-term memory (LSTM) was developed to analyse four distinct categories of Leukaemia based on transcription factor binding sites. The analysis was conducted using four extensive non-redundant datasets for acute, chronic, myeloid, and lymphatic Leukaemia. The method achieved an average prediction accuracy of 75%.
Materials and methods
The proposed research centers on the detection of leukemia, specifically targeting Chronic Myeloid Leukemia (CML), characterized by the neoplastic proliferation of White Blood Cells (WBCs) such as neutrophils, basophils, and eosinophils, while excluding lymphocytes. As previously mentioned, CML is linked to a heightened mortality rate due to its typical diagnosis at advanced stages, posing challenges for effective recovery. In response to this concern, we aim to create a dashboard to identify leukemia utilizing Protein Sequential data. To achieve this goal, we collected data on the most frequently mutated genes related to leukemia, leveraging the physicochemical properties of protein sequences for feature extraction. Subsequently, data augmentation techniques were applied to enhance the extracted features, while outliers were detected and removed to ensure data quality. We employed a diverse set of machine learning algorithms, including Support Vector Machine (SVM) [14,15,53,57], XGBoost, Random Forest [16,17], KNN [18,19], logistic regression [54,58,59], and decision tree, as comprehensively described in a study review [20,21,26,55].
The accuracy of each algorithm was evaluated, and the one exhibiting the highest accuracy was selected for integration into our system. This chosen algorithm determines the presence or absence of cancer in an individual. Finally, we serialized our model using tools such as Pickle or Joblib, facilitating the preservation of the trained model alongside its associated data. These trained models were then incorporated into a Streamlit-based dashboard, enhancing their user-friendly deployment in hospitals and other medical facilities (see Fig 2).
Fig 2. Block diagram of designed system.
Dataset collection
The dataset for this study was collected from the UniProt database, which is a comprehensive resource for protein sequence and functional information. A keyword search was conducted on UniProt using terms such as “Chronic Myeloid Leukemia,” “BCL2,” “HSP90,” “PARP,” and “RB.” This search yielded a total of 2248 protein sequences. The most frequently mutated genes, i.e., BCL2, HSP90, PARP, and RB, were utilized for CML [14]. Moreover, the homologous samples were eliminated by maintaining 0.6 as the cutoff level [16]. HSP90 functions as a chaperone protein, crucial in protein folding and degradation processes. Its up-regulation has been identified in various cancer types, including chronic myeloid leukemia (CML). Extensive research has demonstrated that inhibiting HSP90 can attenuate the growth of CML cells and enhance their susceptibility to chemotherapy and tyrosine kinase inhibitors (TKIs) [42,43]. PARP (Poly ADP-ribose polymerase) is an essential enzyme involved in DNA repair processes. Inhibiting PARP has demonstrated effectiveness in the treatment of cancers with BRCA mutations, and there is emerging evidence suggesting its potential applicability in managing CML [44,45].
The BCL2 (B-cell lymphoma 2) protein family plays a crucial role in regulating programmed cell death, known as apoptosis. Elevated levels of BCL2 have been linked to resistance to chemotherapy in chronic myeloid leukemia (CML) cells. Studies have demonstrated that inhibiting BCL2 can reinstate apoptosis in CML cells and boost the effectiveness of tyrosine kinase inhibitors (TKIs) [46,47]. RB (Retinoblastoma) is a pivotal tumor suppressor gene involved in regulating cell cycle progression. The deactivation of RB is a prevalent characteristic in CML, and research has established that its reactivation can impede the proliferation of CML cells [48,49]. The CML-related protein sequences were extracted in FASTA file format from the UniProt Knowledgebase (UniProtKB) [15,22]. This yielded the positive dataset. To create a negative dataset, the same number of negative samples was gathered using the opposite query phrase. Consequently, the dataset created for CML is balanced.
Fasta format.
In bioinformatics, the FASTA format is a popular text-based format for representing protein sequences. It is derived from the FASTA software suite and follows a specific structure. A FASTA record starts with a single description line, followed by lines containing the sequence data [22]. The description line is distinguished from the sequence data by a greater-than symbol (“>”) in the first column. The term following the “>” sign is used to identify the sequence, while the rest of the line can be used to provide an additional description, though both are optional.
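The record structure described above can be illustrated with a minimal parser sketch. The function names `parse_fasta` and `_make_record`, as well as the two example records, are hypothetical and are not taken from the study's code or dataset; the sketch simply assumes that the identifier is the first whitespace-delimited token after “>”.

```python
def _make_record(header, chunks):
    """Split a header into (identifier, description) and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Hypothetical two-record example (not real UniProt entries).
example = """>seq1 hypothetical example protein
MPEETQTQDQ
PMEEEEVETF
>seq2
MKV
"""
records = parse_fasta(example)
```

Each tuple in `records` can then be passed to a feature extraction routine; the description field is empty when the header carries only an identifier.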
Sample of protein sequence (HSP90).
Initially, the protein sequences contained redundant data. To address this, we employed a benchmark method known as CD-Hit (see Fig 3). It is essential to utilize a benchmark algorithm for redundancy removal to ensure the validity and reliability of the data. CD-Hit, a sequence clustering tool, was selected for this purpose, with a threshold of 0.6 [23]. This threshold value helps in effectively removing redundancy while preserving the integrity of the dataset.
Fig 3. Sample of protein sequence (HSP90).
Feature extraction
This section elaborates on the feature extraction techniques that use the physicochemical properties of the protein sequences. These techniques enable the effective representation of protein sequences and the extraction of meaningful information crucial for predicting Chronic Myeloid Leukemia. The feature extraction methods utilized in this study fall into three categories:
Amino acid composition.
AAC features capture how frequently each amino acid occurs in a protein sequence [24,25]. The percentage frequency of amino acid $i$ in protein $j$, $F_{AAC_{i,j}}$, is calculated using the formula below:
$$F_{AAC_{i,j}} = \frac{n_{i,j}}{n_{a,j}} \times 100 \qquad (1)$$
In the above equation, $n_{i,j}$ denotes the number of amino acids of type $i$ found in protein $j$, while $n_{a,j}$ refers to the total number of amino acids contained in protein $j$. The protein sequence in the AAC feature dataset is represented as a 20-dimensional (20-D) feature vector as follows:
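$$F_{AAC_{j}} = \left[F_{AAC_{1,j}}, F_{AAC_{2,j}}, \ldots, F_{AAC_{20,j}}\right] \qquad (2)$$
where $F_{AAC_{i,j}}$ demonstrates how amino acids of type $i$ are composed in protein $j$.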
The technique of amino acid composition involves extracting features from our data, resulting in a 20-dimensional feature set. However, the problem with this approach lies in the limited usefulness of the features extracted. Despite employing various data science feature engineering approaches and conducting hyper-parameter tuning, accuracy remains constrained. Consequently, this approach proves less efficacious in attaining the desired outcomes.
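The 20-dimensional AAC vector can be sketched as follows. This is a minimal illustration rather than the study's implementation: the function name `amino_acid_composition`, the alphabetical ordering of the 20 standard residues, and the choice to ignore non-standard residue codes are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed ordering)


def amino_acid_composition(sequence):
    """Return the percentage frequency of each standard amino acid (20 values)."""
    sequence = sequence.upper()
    # Assumption: residues outside the 20 standard codes are ignored.
    total = sum(1 for residue in sequence if residue in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * sequence.count(aa) / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence: two A, one C, one D.
features = amino_acid_composition("AACD")
```

Because every residue contributes to exactly one entry, the 20 values sum to 100 for any non-empty sequence.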
Pseudo amino acid composition.
A 25-dimensional feature set is produced using the Pseudo Amino Acid Composition (PAAC) approach to extract features from our data [13]. The remarkable fact is that the features extracted through this method are highly valuable. By further applying data science methods and feature engineering techniques, accuracy significantly improves, reaching an impressive range of 91% to 93%. This achievement represents a remarkable success in our endeavors.
| (3) |
| (4) |
| (5) |
Specifically, we depict the changes in data distribution before and after outlier removal. Additionally, we conducted data augmentation on the processed dataset to further enhance its accuracy.
Di-peptide composition.
The letters AA, AC, AD, YV, YW, and YY denote protein sequences with dipeptide characteristics. There are 400 components in these sequences. The DC feature of each component is determined as follows:
| (6) |
where represents the structure of dipeptide for . In vector form, this feature space is represented as:
The di-peptide composition technique extracts features from our data, resulting in 400 dimensions or four hundred features. However, it became evident that not all these features were essential. By applying data science methods and feature engineering, it is concluded that only 229 features out of the initial 400 were necessary. Surprisingly, after this selection process, the accuracy of our results significantly improved, reaching an impressive 91% to 93%. This outcome marks a great success. The graphs illustrate the impact of outlier removal on the dataset, both before and after the process.
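A minimal sketch of the 400-dimensional dipeptide feature vector is given below. The normalisation is an assumption: each count of overlapping residue pairs is divided by the total number of valid pairs and expressed as a percentage. The names `DIPEPTIDES` and `dipeptide_composition` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, from "AA", "AC", "AD" through "YV", "YW", "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the percentage frequency of each of the 400 dipeptides."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


# Hypothetical toy sequence with the pairs AA, AC, and CA.
features = dipeptide_composition("AACA")
```

A subsequent feature selection step, such as the reduction from 400 to 229 features reported above, would operate on vectors of this shape.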
Data augmentation.
The data augmentation process begins by segregating our dataset into positive and negative segments, that is, isolating patients who have tested positive from those with negative results. Subsequently, a series of operations generates numerical replicas of the existing data, thereby augmenting the sample size. This augmentation supports the machine learning algorithm’s training procedure by increasing the amount of available data. However, it is important to note that the data is transformed during the creation of these numerical duplicates, changing from its initial format into a list structure.
Consequently, the modified data is converted from this list format back into a data frame. The transformed data is then reintegrated, which completes the data augmentation process.
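The steps above can be sketched as follows. The text does not specify how the numerical replicas are generated, so this sketch assumes small Gaussian jitter around each existing feature vector, which is only one possible choice. The function name `augment` and the parameters `copies`, `noise_scale`, and `seed` are hypothetical, and plain lists are used in place of a data frame.

```python
import random


def augment(rows, labels, copies=2, noise_scale=0.01, seed=0):
    """Split rows by class, add jittered replicas of each row, and recombine."""
    rng = random.Random(seed)
    # Step 1: segregate the positive (label 1) and negative (label 0) samples.
    positives = [row for row, label in zip(rows, labels) if label == 1]
    negatives = [row for row, label in zip(rows, labels) if label == 0]

    out_rows, out_labels = [], []
    for label, group in ((1, positives), (0, negatives)):
        for row in group:
            # Keep the original sample.
            out_rows.append(list(row))
            out_labels.append(label)
            # Step 2: create numerical replicas (assumption: Gaussian jitter).
            for _ in range(copies):
                out_rows.append([v + rng.gauss(0.0, noise_scale) for v in row])
                out_labels.append(label)
    # Step 3: the recombined lists could now be converted back into a data frame.
    return out_rows, out_labels


# Hypothetical toy feature vectors.
rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels = [1, 0, 1]
aug_rows, aug_labels = augment(rows, labels, copies=2)
```

With `copies=2`, each original sample is kept and two replicas are added, so the class proportions of the input are preserved.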
Development of individual classifiers
Support vector machine
The SVM classifier separates the classes by constructing the hyperplane with the greatest margin, i.e., the greatest distance to the nearest training points of each class [27,28,56]. SVM’s decision surface is as follows:
| (7) |
We selected the following parameters: kernel = “rbf”, degree = 8, C = 10000, gamma = 100000, and probability = True.
Random forest
This method generates a large number of decision trees whose outputs are combined to arrive at a final decision. We used 129,361 samples for training and 86,228 samples for testing, and the best number of estimators was n = 50. In the case of dipeptide composition, we used 2536 samples for training and 845 for testing, and n = 150 estimators gave optimal results.
| (8) |
K-Nearest Neighbor (KNN)
The KNN algorithm learns from observed samples [29,30]. Instance-based classifiers assume that an unknown instance can be classified by comparing it to known instances using a distance/similarity function [31–33,56]. The Euclidean distance $d(x, y)$ between two m-dimensional vectors $x$ and $y$ is calculated as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \qquad (9)$$
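The Euclidean distance of Eq. (9) and the neighbour-based classification it supports can be sketched as below. This is an illustrative implementation with majority voting among the k nearest samples; the function names, the value k = 3, and the toy data are assumptions and do not reflect the settings used in the study.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Euclidean distance between two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5).
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(train_x, train_y, [4.5, 5.0], k=3)
```

Sorting the full training set is adequate for a small example; for datasets of the size reported in this study, an indexed neighbour search would typically be preferred.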
Naïve Bayes
Bayes rules represent this learning procedure based on the notion of independent attributes/features [57–59]. The Gaussian function to train the model with equal prior probabilities is in the following manner:
| (10) |
| (11) |
XGBoost
Gradient boosting is a boosting approach that significantly lowers errors by adding several classifiers to pre-existing models. The term “gradient boosting" refers to using a gradient descent strategy to minimize loss. The steps involved in gradient boosting are as follows:
| (12) |
| (13) |
Logistic regression
In categorical binary classification, a statistical machine-learning approach called logistic regression is employed [34]. The parameters we selected were C = 10, tol = 0.1, and penalty = L2.
| (14) |
Results and discussion
Results on pseudo amino acid composition (Pse-AAC) data
The results of the metrics employed in the project, namely Accuracy, F1-score, Recall [35], and Specificity, on the Pse-AAC data are displayed in Table 1 below.
Table 1. Results on pseudo amino acid composition (Pse-AAC) data.
| Name of Algorithm | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 92–94% | 91–92% | 91–93% | 92–94% |
| Extreme Gradient Boost | 79–85% | 63–70% | 51–55% | 92–94% |
| Logistic Regression | 66–69% | 10–20% | 6–10% | 97–98% |
| Decision Tree | 81–84% | 73–76% | 74–76% | 84–86% |
| Random Forest | 87–91% | 85–87% | 80–83% | 96–97% |
| K Nearest Neighbor | 82–86% | 72–74% | 61–64% | 93–95% |
Table 2 presents the results of each machine learning (ML) model concerning the data utilized, specifically the Pse-AAC data. It also includes the outcomes of additional metrics used in the research, namely Specificity and Confusion Matrix. These metrics provide insights into the True Positive, True Negative, False Positive, and False Negative values, contributing to a comprehensive evaluation of the models’ performance.
Table 2. Confusion matrix (Pse-AAC data).
| Name of Algorithm | TN | FP | FN | TP |
|---|---|---|---|---|
| Support Vector Classifier | 424 | 28 | 14 | 211 |
| Extreme Gradient Boost | 26159 | 2271 | 3435 | 10890 |
| Logistic Regression | 25817 | 2849 | 11010 | 3445 |
| Decision Tree | 24388 | 4278 | 3803 | 10652 |
| Random Forest | 28014 | 808 | 2753 | 11546 |
| K Nearest Neighbor | 419 | 23 | 95 | 140 |
TN = True Negative, FP = False Positive, FN = False Negative, TP = True Positive
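The relationship between a confusion matrix and the reported metrics can be sketched as follows, using the standard definitions of accuracy, recall (sensitivity), specificity, precision, and F1-score. The example uses the Support Vector Classifier counts from Table 2, taking the bottom-right cell (211) as the true-positive count. The function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tn, fp, fn, tp):
    """Compute standard binary classification metrics from confusion matrix counts."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": _safe_div(tn, tn + fp),
        "precision": precision,
        "f1": _safe_div(2 * precision * recall, precision + recall),
    }


# Support Vector Classifier counts from Table 2 (Pse-AAC data).
svc = classification_metrics(tn=424, fp=28, fn=14, tp=211)
```

The zero-denominator convention matters for degenerate cases such as a classifier that never predicts the positive class, where precision and F1-score would otherwise be undefined.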
Accuracy results on amino acid composition (AAC) data
The research employs Accuracy score, F1-score, Recall score, and Specificity as metrics on the AAC data. The outcomes of these metrics are presented in Table 3 below.
Table 3. Result on amino acid composition (AAC) data.
| Name of Algorithm | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 54.95% | 14.3% | 0.7% | 100% |
| Extreme Gradient Boost | 56.8% | 52.9% | 45.9% | 69% |
| Logistic Regression | 51.1% | 27.6% | 19.1% | 81.7% |
| Decision Tree | 54.4% | 52.25% | 52.9% | 55.8% |
| Random Forest | 50.6% | 41.1% | 35.4% | 64.9% |
| K Nearest Neighbor | 54.2% | 54.8% | 57% | 51% |
The following table (Table 4) presents the results of each machine learning (ML) model concerning the utilized data, namely AAC. Additionally, it showcases the outcomes of other metrics employed in the project, such as the Specificity and Confusion Matrix. These matrices provide essential values, including True Positive, True Negative, False Positive, and False Negative, contributing to a comprehensive assessment of the models’ performance.
Table 4. Confusion matrix (AAC data).
| Name of Algorithm | TN | FP | FN | TP |
|---|---|---|---|---|
| Support Vector Classifier | 271 | 0 | 121 | 62 |
| Extreme Gradient Boost | 409 | 23 | 119 | 103 |
| Logistic Regression | 9028 | 2022 | 8519 | 2025 |
| Decision Tree | 124 | 98 | 95 | 107 |
| Random Forest | 12612 | 6817 | 11832 | 6510 |
| K Nearest Neighbor | 112 | 105 | 89 | 118 |
TN = True Negative, FP = False Positive, FN = False Negative, TP = True Positive
Accuracy results on di-peptide composition (DPC)
The table below (Table 5) displays the Accuracy, F1-score, Recall, and Specificity metrics utilized in the research and their respective outcomes when applied to the DPC data.
Table 5. Results on di-peptide composition (DPC) data.
| Name of Algorithm | Accuracy | F1-Score | Recall | Specificity |
|---|---|---|---|---|
| Support Vector Classifier | 92–94% | 87–88% | 91–93% | 90–93% |
| Extreme Gradient Boost | 79–84% | 66–68% | 55–57% | 92–94% |
| Logistic Regression | 66–69% | 0–0% | 6–10% | 100% |
| Decision Tree | 81–84% | 70–73% | 56–59% | 96–97% |
| Random Forest | 82–84% | 67–68% | 57–58% | 94–95% |
| K Nearest Neighbor | 72–73% | 31–32% | 20–21% | 95–97% |
The performance of each machine learning model is analyzed concerning the DPC data utilized. Additionally, the Specificity and Confusion Matrix results are presented (Table 6). This matrix provides essential values such as True Positive, True Negative, False Positive, and False Negative, contributing to a comprehensive evaluation of the models’ performance.
Table 6. Confusion matrix (DPC data).
| Name of Algorithm | TN | FP | FN | TP |
|---|---|---|---|---|
| Support Vector Classifier | 416 | 37 | 17 | 207 |
| Extreme Gradient Boost | 413 | 25 | 105 | 134 |
| Logistic Regression | 453 | 0 | 224 | 0 |
| Decision Tree | 433 | 16 | 54 | 134 |
| Random Forest | 437 | 23 | 93 | 124 |
| K Nearest Neighbor | 438 | 15 | 179 | 45 |
TN = True Negative, FP = False Positive, FN = False Negative, TP = True Positive
Machine learning based dashboard
Figs 4 and 5 provide an overview of the dashboard developed using Streamlit, which is accessible through Streamlit Cloud. This interactive dashboard enables users to select their preferred model for analysis (Fig 4). Within this user-friendly interface, users upload patient records directly through the web application and select a specific prediction model. Subsequently, users can review the results (Fig 5) to ascertain whether an individual is affected by leukemia. Users can effortlessly select
Fig 4. Dashboard for CML overview.
Fig 5. Dashboard for CML with prediction.
Conclusion
This research is focused on Chronic Myeloid Leukemia (CML), a condition characterized by genetic mutations leading to abnormal proliferation of white blood cells, red blood cells, and platelets. While MRI and CT scans have been extensively used in cancer detection, research on protein sequence data in this domain is limited. By leveraging information from mutated genes like BCL2, HSP90, PARP, and RB, the research aims to revolutionize early CML prediction. Through rigorous data preprocessing and feature extraction techniques, we achieved an impressive accuracy rate of 92–94%. The proposed approach integrates diverse machine learning algorithms such as SVM, Decision Trees, XGBoost, Random Forest, and KNN, each offering unique strengths in pattern recognition and prediction. The resulting dashboard facilitates easy prediction of CML in patients, enhancing clinical workflows and potentially saving lives. This study sheds light on critical scientific challenges in CML research, offering insights into disease mechanisms and biomarker identification. We envision expanding this research to encompass multi-cancer detection, integrating AI and bioinformatics with healthcare systems for enhanced cancer diagnosis and improved patient outcomes.
Acknowledgments
The authors extend their appreciation to the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R513), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia and would like to express their gratitude to anonymous referees for their insightful comments and recommendations, which have significantly enhanced this paper. Furthermore, the authors would like to express their gratitude to Datamatics Technologies for their invaluable contributions.
Data Availability
All relevant data for this study are publicly available from the GitHub repository (https://github.com/awaismalik1x/CML_Prediction_Data.git).
Funding Statement
The authors are grateful to the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R513) at Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia, for providing the necessary funding for this work.
References
- 1.Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2021. CA Cancer J Clin. 2021;71(1):7–33. [DOI] [PubMed] [Google Scholar]
- 2.Bibi N, Sikandar M, Ud Din I, Almogren A, Ali S. IoMT-based automated detection and classification of leukemia using deep learning. J Healthc Eng. 2020;2020:6648574. doi: 10.1155/2020/6648574 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.IARC IAfRoC. Leukaemia Source: Globocan 2020. 2022. Available from: https://gco.iarc.fr/today/data/factsheets/cancers/36-Leukaemia-fact-sheet.pdf] [Google Scholar]
- 4.Munteanu CR, Magalhães AL, Uriarte E, González-Díaz H. Multi-target QPDR classification model for human breast and colon cancer-related proteins using star graph topological indices. J Theor Biol. 2009;257(2):303–11. doi: 10.1016/j.jtbi.2008.11.017
- 5.Ramani RG, Jacob SG. Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models. PLoS One. 2013;8(3):e58772. doi: 10.1371/journal.pone.0058772
- 6.Yang J-Y, Yoshihara K, Tanaka K, Hatae M, Masuzaki H, Itamochi H, et al. Predicting time to ovarian carcinoma recurrence using protein markers. J Clin Invest. 2013;123(9):3740–50. doi: 10.1172/JCI68509
- 7.Mohamed H, Omar R, Saeed N, Essam A, Ayman N, Mohiy T, et al. Automated detection of white blood cells cancer diseases. In: 2018 First International Workshop on Deep and Representation Learning (IWDRL). IEEE. 2018. p. 48–54. doi: 10.1109/iwdrl.2018.8358214
- 8.Kumar S, Mishra S, Asthana P. Automated detection of acute leukemia using k-mean clustering algorithm. In: Advances in Computer and Computational Sciences: Proceedings of ICCCCS 2016, vol. 2; 2018. p. 655–70.
- 9.Sharma R, Kumar R. A novel approach for the classification of leukemia using artificial bee colony optimization technique and back-propagation neural networks. In: Proceedings of 2nd International Conference on Communication, Computing and Networking. NITTTR Chandigarh. 2019. p. 685–94.
- 10.Jothi G, Inbarani HH, Azar AT, Devi KR. Rough set theory with Jaya optimization for acute lymphoblastic leukemia classification. Neural Comput Appl. 2018;31(9):5175–94. doi: 10.1007/s00521-018-3359-7
- 11.Moshavash Z, Danyali H, Helfroush MS. An automatic and robust decision support system for accurate acute leukemia diagnosis from blood microscopic images. J Digit Imaging. 2018;31(5):702–17. doi: 10.1007/s10278-018-0074-y
- 12.Umamaheswari D, Geetha S. A framework for efficient recognition and classification of acute lymphoblastic leukemia with a novel customized-KNN classifier. CIT. 2018:131–40. doi: 10.20532/cit.2018.1004123
- 13.American Society of Clinical Oncology (ASCO). Genes and cancer. 2023.
- 14.Rodríguez D, Bretones G, Quesada V, Villamor N, Arango JR, López-Guillermo A, et al. Mutations in CHD2 cause defective association with active chromatin in chronic lymphocytic leukemia. Blood. 2015;126(2):195–202. doi: 10.1182/blood-2014-10-604959
- 15.Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2004;32(Database issue):D115-9. doi: 10.1093/nar/gkh131
- 16.Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. doi: 10.1093/bioinformatics/bts565
- 17.Feng P, Lin H, Chen W. Identification of antioxidants from sequence information using naive Bayes. Comput Math Methods Med. 2013;2013:1–9.
- 18.Feng P-M, Ding H, Chen W, Lin H. Naïve Bayes classifier with feature selection to identify phage virion proteins. Comput Math Methods Med. 2013;2013:530696. doi: 10.1155/2013/530696
- 19.Jia J, Liu Z, Xiao X, Liu B, Chou K-C. pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J Theor Biol. 2016;394:223–30. doi: 10.1016/j.jtbi.2016.01.020
- 20.Lin W-Z, Fang J-A, Xiao X, Chou K-C. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One. 2011;6(9):e24756. doi: 10.1371/journal.pone.0024756
- 21.Qu K, Han K, Wu S, Wang G, Wei L. Identification of DNA-binding proteins using mixed feature representation methods. Molecules. 2017;22(10):1602. doi: 10.3390/molecules22101602
- 22.Cai Y-D, Chou K-C. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics. 2004;20(7):1151–6. doi: 10.1093/bioinformatics/bth054
- 23.Chou K-C. Impacts of bioinformatics to medicinal chemistry. Med Chem. 2015;11(3):218–34. doi: 10.2174/1573406411666141229162834
- 24.Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43(3):246–55. doi: 10.1002/prot.1035
- 25.Khan YD, Ahmad F, Anwar MW. A neuro-cognitive approach for iris recognition using back propagation. World Appl Sci J. 2012;16(5):678–85.
- 26.Khan YD, Ahmed F, Khan SA. Situation recognition using image moments and recurrent neural networks. Neural Comput Appl. 2013;24(7–8):1519–29. doi: 10.1007/s00521-013-1372-4
- 27.Butt A, Khan S, Jamil H, Rasool N, Khan Y. A prediction model for membrane proteins using moments based features. Biomed Res Int. 2016;2016:1–14.
- 28.Butt AH, Rasool N, Khan YD. A treatise to computational approaches towards prediction of membrane protein and its subtypes. J Membr Biol. 2017;250(1):55–76. doi: 10.1007/s00232-016-9937-7
- 29.Khan YD, Khan SA, Ahmad F, Islam S. Iris recognition using image moments and k-means algorithm. ScientificWorldJournal. 2014;2014:723595. doi: 10.1155/2014/723595
- 30.Sugiyama M. Introduction to statistical machine learning. Morgan Kaufmann. 2015.
- 31.Theodoridis S. Machine learning: a Bayesian and optimization perspective. Academic Press. 2015.
- 32.Vapnik V. The nature of statistical learning theory. Springer. 1999.
- 33.Hart P, Stork D, Duda R. Pattern classification. Hoboken: Wiley. 2000.
- 34.Montesinos López O, Montesinos López A, Crossa J. Multivariate statistical machine learning methods for genomic prediction. Springer Nature. 2022.
- 35.Jiao Y, Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol. 2016;4(4):320–30. doi: 10.1007/s40484-016-0081-2
- 36.Fawcett T. ROC graphs: notes and practical considerations for researchers. Mach Learn. 2004;31(1):1–38.
- 37.Gal O, Auslander N, Fan Y, Meerzaman D. Predicting complete remission of acute myeloid leukemia: machine learning applied to gene expression. Cancer Inform. 2019;18:1176935119835544. doi: 10.1177/1176935119835544
- 38.Bostanci E, Kocak E, Unal M, Guzel MS, Acici K, Asuroglu T. Machine learning analysis of RNA-seq data for diagnostic and prognostic prediction of colon cancer. Sensors (Basel). 2023;23(6):3080. doi: 10.3390/s23063080
- 39.Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, Goliaei B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. Springerplus. 2013;2(1):238. doi: 10.1186/2193-1801-2-238
- 40.Dhakal P, Tayara H, Chong KT. An ensemble of stacking classifiers for improved prediction of miRNA-mRNA interactions. Comput Biol Med. 2023;164:107242. doi: 10.1016/j.compbiomed.2023.107242
- 41.Armya REA, Abdulazeez AM, Sallow AB, Zeebaree DQ. Leukemia diagnosis using machine learning classifiers based on correlation attribute eval feature selection. AJRCoS. 2021:52–65. doi: 10.9734/ajrcos/2021/v9i330225
- 42.Khajapeer KV, Baskaran R. Hsp90 inhibitors for the treatment of chronic myeloid leukemia. Leuk Res Treatment. 2015;2015:757694. doi: 10.1155/2015/757694
- 43.Alves R, Santos D, Jorge J, Gonçalves AC, Catarino S, Girão H. Alvespimycin inhibits heat shock protein 90 and overcomes imatinib resistance in chronic myeloid leukemia cell lines. Molecules. 2023;28(3):1210.
- 44.Ellisen LW. PARP inhibitors in cancer therapy: promise, progress, and puzzles. Cancer Cell. 2011;19(2):165–7. doi: 10.1016/j.ccr.2011.01.047
- 45.Liu Y, Song H, Song H, Feng X, Zhou C, Huo Z. Targeting autophagy potentiates the anti-tumor effect of PARP inhibitor in pediatric chronic myeloid leukemia. AMB Express. 2019;9(1):108. doi: 10.1186/s13568-019-0836-z
- 46.Kaloni D, Diepstraten ST, Strasser A, Kelly GL. BCL-2 protein family: attractive targets for cancer therapy. Apoptosis. 2023;28(1–2):20–38. doi: 10.1007/s10495-022-01780-7
- 47.Ko TK, Chuah CTH, Huang JWJ, Ng K-P, Ong ST. The BCL2 inhibitor ABT-199 significantly enhances imatinib-induced cell death in chronic myeloid leukemia progenitors. Oncotarget. 2014;5(19):9033–8. doi: 10.18632/oncotarget.1925
- 48.Zhou L, Ng DS-C, Yam JC, Chen LJ, Tham CC, Pang CP, et al. Post-translational modifications on the retinoblastoma protein. J Biomed Sci. 2022;29(1):33. doi: 10.1186/s12929-022-00818-x
- 49.Yin D-D, Fan F-Y, Hu X-B, Hou L-H, Zhang X-P, Liu L, et al. Notch signaling inhibits the growth of the human chronic myeloid leukemia cell line K562. Leuk Res. 2009;33(1):109–14. doi: 10.1016/j.leukres.2008.06.023
- 50.Albitar M, Zhang H, Pecora AL, Ip A, Goy AH, Antzoulatos S, et al. Bone marrow-based biomarkers for predicting aGVHD using targeted RNA next generation sequencing and machine learning. Blood. 2021;138(Supplement 1):2892–2892. doi: 10.1182/blood-2021-147583
- 51.Ahmad W, Hameed M, Bilal M, Majid A. ML-pred-cll: Machine learning based prediction of chronic lymphocytic leukemia using protein sequential data. In: 2022 International Conference on Recent Advances in Electrical Engineering & Computer Sciences (RAEE & CS). 2022. p. 1–7.
- 52.He J, Pu X, Li M, Li C, Guo Y. Deep convolutional neural networks for predicting leukemia-related transcription factor binding sites from DNA sequence data. Chemomet Intel Lab Syst. 2020;199:103976. doi: 10.1016/j.chemolab.2020.103976
- 53.Ashraf A, Zhao Q, Bangyal W, Iqbal M. Analysis of brain imaging data for the detection of early age autism spectrum disorder using transfer learning approaches for internet of things. IEEE Trans Consum Electron. 2023. p. 1–10.
- 54.Bangyal WH, Ahmad J, Abbas Q. Recognition of off-line isolated handwritten character using counter propagation network. Int J Eng Technol. 2013;5(2):227–34.
- 55.Ali AM, Mohammed MA. A comprehensive review of artificial intelligence approaches in omics data processing: evaluating progress and challenges. Int J Math Stat Comput Sci. 2024;2:114–67.
- 56.Zahoor MM, Qureshi SA, Bibi S, Khan SH, Khan A, Ghafoor U, et al. A new deep hybrid boosted and ensemble learning-based brain tumor analysis using MRI. Sensors (Basel). 2022;22(7):2726. doi: 10.3390/s22072726
- 57.Amin MA, Chughtai JR, Ahmad W, Bangyal WH, Ul Haq I. Trajectory data mining and trip travel time prediction on specific roads. In: 2024 International Conference on Engineering & Computing Technologies (ICECT); 2024. p. 1–8.
- 58.Bangyal WH, Qasim R, Rehman NU, Ahmad Z, Dar H, Rukhsar L. Detection of fake news text classification on COVID-19 using deep learning approaches. Comput Math Methods Med. 2021;2021:5514220.
- 59.Ali A, Nafees M, Amin M, Rehman I, Tayyab M, Ahmad W. Systematic literature review on swarms of UAVs. Spectrum Eng Sci. 2024;2(4):386–415.





