Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model

Vaibhav Rupapara; Furqan Rustam; Wajdi Aljedaani; Hina Fatima Shahzad; Ernesto Lee; Imran Ashraf

2022 Jan 19;12:1000. doi: 10.1038/s41598-022-04835-6

PMCID: PMC8770560 PMID: 35046459

Abstract

Blood cancer has been a growing concern during the last decade and requires early diagnosis so that proper treatment can begin. The diagnosis process is costly and time-consuming, involving medical experts and several tests. Thus, an automatic diagnosis system that predicts it accurately is of significant importance. Diagnosis of blood cancer using leukemia microarray gene data and machine learning has become an important area of medical research. Despite these research efforts, accuracy and efficiency still need improvement. This study proposes a supervised machine learning approach for blood cancer prediction. The leukemia microarray gene dataset, containing 22,283 genes, is used. ADASYN resampling and Chi-squared (Chi2) feature selection are used to address the imbalanced and high-dimensional nature of the dataset. ADASYN generates synthetic samples to balance the target classes, and Chi2 selects the best features out of 22,283 to train the learning models. For classification, a hybrid logistic vector trees classifier (LVTrees) is proposed, which combines logistic regression, support vector classifier, and extra tree classifier. Besides extensive experiments on the datasets, a performance comparison with state-of-the-art methods is made to determine the significance of the proposed approach. LVTrees outperforms all other models when used with ADASYN and Chi2, reaching 100% accuracy. Further, a statistical T-test is performed to show the efficacy of the proposed approach. Results using k-fold cross-validation support the superiority of the proposed model.

Subject terms: Cancer, Diseases

Introduction

Cancer is the uncontrolled growth of abnormal cells that may spread to different parts of the human body¹. Currently, it is one of the leading causes of death in the world. Study² shows that approximately 10 million cancer deaths and 19.3 million new cases occurred in 2020 alone. Mortality varies by cancer type. For example, in 2020, lung cancer accounted for 18% of cancer deaths and colorectal cancer for 9.4%, while liver, stomach, and breast cancer accounted for 8.3%, 7.7%, and 6.9%, respectively. Blood cancer constitutes nearly 10% of all newly diagnosed cancer cases¹. Early diagnosis and prediction are considered prudent ways to reduce cancer deaths worldwide.

In this regard, this study focuses on the prediction of blood cancer. As noted by the Leukemia and Lymphoma Society³, in the United States (US) alone, 1,290,773 people have blood cancer. The common types of blood cancer include myeloma, leukemia, lymphoma, and myelodysplastic syndromes, among others. Specifically, blood cancers affect the blood cells, bone marrow, lymph nodes, and other parts of the lymphatic system. Research has led to the development of therapies that strengthen the immune system of affected individuals so that it can deal with cancer cells.

Previous studies on blood cancer prediction have used different models and algorithms, yielding varying levels of accuracy and precision. For example, Goutam et al.⁴ utilized support vector machines (SVM) to achieve a precision of 85.74%, specificity of 80%, and sensitivity of 100%. Study⁵ used H2O deep learning and obtained an accuracy of 79.45%. Additionally, Vijayarani and Sudha⁶ applied K-Means, Fuzzy C-Means, and Weighted K-Means, which achieved accuracies of 78%, 75%, and 85%, respectively. Similarly, Xiao et al.⁷ used k-nearest neighbor (KNN), SVM, decision trees (DT), random forest (RF), and gradient boosting decision trees, achieving accuracies of 99.20%, 98.78%, and 98.41% on three datasets. On the other hand, Subhan et al.⁸ leveraged KNN and the Hough transform to obtain an accuracy of 93%. Gal et al.⁹ used KNN, SVM, and RF classifiers, achieving accuracy scores of 84%, 74%, and 81%, respectively. Despite such efforts to improve the performance of machine and deep learning classifiers, the desired accuracy has not yet been reached for blood cancer prediction.

The chief objective of the current study is to propose an approach that can perform blood cancer prediction with high accuracy using microarray gene data. Two important challenges associated with this task are data imbalance and the high dimensionality of the data. To overcome these issues, the current study uses adaptive synthetic (ADASYN) oversampling and Chi-square (Chi2) feature selection. In summary, this study makes the following contributions:

The performance of well-known machine learning algorithms is analyzed on microarray gene data. These algorithms include RF, logistic regression (LR), support vector classifier (SVC), KNN, Naive Bayes (NB), extra tree classifier (ETC), DT, and Adaboost classifier (ADA).
A hybrid model called LVTrees is proposed, which combines LR, SVC, and ETC through majority voting. The influence of ADASYN on data balancing is investigated, while Chi2 is used to select the optimal set of features for classification.
Extensive experiments are conducted to evaluate the efficacy of the proposed approach. In addition, several state-of-the-art methods are compared with the proposed approach. The statistical significance test is also performed to analyze the validity of the proposed approach. Results are further validated using k-fold cross-validation.
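
The majority-voting idea behind LVTrees can be illustrated with a minimal sketch. It assumes each base model (LR, SVC, ETC) has already produced one label per sample; the function name `hard_vote`, the sample labels, and the tie-breaking rule (smallest label in sorted order) are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label predictions by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties are broken by taking the smallest label in sorted order
    (an assumption of this sketch, not something the paper specifies).
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical predictions from LR, SVC, and ETC on four samples.
lr_pred = ["ALL", "MLL", "T-ALL", "ALL"]
svc_pred = ["ALL", "MLL", "ALL", "HYPO"]
etc_pred = ["MLL", "MLL", "T-ALL", "T-ALL"]
final = hard_vote([lr_pred, svc_pred, etc_pred])
print(final)
```

With three voters, a single dissenting model is outvoted; a three-way disagreement only occurs in multi-class problems and falls back to the tie-breaking rule.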

The rest of the paper is organized as follows. The following section discusses the research papers related to the current study. The proposed methodology is described in the section “Materials and methods” while the section “Results and discussions” contains the analysis and discussion of results. In the end, the “Conclusion” section concludes the paper and highlights the direction for future work.

Related work

Owing to the importance of the healthcare domain, several research works in the literature focus on cancer prediction using machine and deep learning approaches. For example, studies^10,11 perform cancer prediction using image-based approaches. Similarly, Goutam et al.⁴ developed an automated system for the diagnosis of leukemia. The framework supports a variety of strategies, such as K-means clustering. The data are obtained from hospitals to examine the performance of the proposed method as a binary classifier. Results show that it obtains 98% accuracy for cancer prediction. Vijayarani and Sudha⁶ focused on the prediction of disease using hemogram blood test data. A new algorithm called weight-based K-means is proposed to diagnose various diseases, e.g., human immunodeficiency virus (HIV) and viral infection. Tests are performed on data from 524 patients, and results show that the proposed algorithm achieves significantly higher accuracy than the Fuzzy C-means and K-means clustering algorithms.

In the same way, a multi-model ensemble is presented in study⁷ for predicting cancer. The authors analyzed gene data gathered from stomach, breast, and lung tissues. The DESeq approach is used to avoid overfitting in classification, which helped identify genes differentially expressed between normal and tumor phenotypes. Moreover, it controlled the dimensionality of the data and improved prediction accuracy while significantly reducing computational time. Study¹² developed an automated method for detecting and classifying acute lymphoblastic leukemia based on a deep convolutional neural network (CNN). To test the performance, comparisons are made with different color models. The results show that the proposed method achieved high accuracy without requiring microscopic image segmentation. The authors of study¹³ presented a diagnostic method to predict the primary stage of cancer. The model integrates hybrid feature selection and preprocessing phases. From a set of 25 features, the proposed model showed the highest accuracy with 14 optimal features. A four-phase process is employed to train on the optimal feature subset. Results show that the classification accuracy can be greatly improved by applying preprocessing and feature selection before classification.

Study¹⁴ proposed classification models to distinguish blood microscopic images of patients affected by leukemia from those free of leukemia. In the first model, features are extracted with a pre-trained CNN named AlexNet and classified with various other classifiers. Tests show that SVM obtained better results than the other classifiers. In the second model, both extraction and classification are done using AlexNet alone, and the results show its superiority over other models across different performance metrics.

A study²⁰, very similar to ours, used the Leukemia_GSE9476 dataset²¹ with a deep learning approach to analyze its diagnostic performance compared to traditional methods. The study used leukemia microarray gene data consisting of 22,283 genes. Normalization tests are used at the preprocessing stage, while a deep neural network (DNN) is used for training and testing. Experimental results indicate that the traditional method achieved an accuracy of 0.63, whereas the deep learning network achieved an accuracy of 0.96. Another study¹⁹ used the Leukemia_GSE28497 dataset²² to study the integration of multiple microarray and ribonucleic acid (RNA)-seq platforms. Four types of leukemia samples are analyzed in the study. Minimum redundancy maximum relevance (mRMR) is used for feature selection. Results show that 96% accuracy can be achieved using a small subset of only ten genes. An analysis of variance (ANOVA) statistical test is performed to verify the performance of the model for multi-class classification.

To improve the leukemia classification process, Abd El-Nasser et al.¹⁵ proposed an enhanced classification algorithm (ECA) using the select most informative genes (SMIG) module and a standardization process. Evaluation results showed that the proposed ECA system achieves 98% accuracy in 0.1 s, including preprocessing and classification. Compared to the methods used in previous studies, the proposed system achieved better results. The authors of study¹⁸ propose an automatic diagnostic method to predict acute myeloid and acute lymphoid leukemia. The study utilizes a CNN model called Acute Leukemias Recognition Network - Residual Without Dropout (Alert Net-RWD) for this purpose. In the Alert Net-RWD model, the Alert Net part consists of five convolutional layers, batch normalization, and max-pooling layers. The residual layer without dropout is followed by the max-pooling layers in the Alert Net-RWD model. Compared to other CNN architectures, Alert Net-RWD uses fewer parameters. Test results show that the proposed model achieves 97.18% accuracy and 97.23% precision. Study¹⁷ proposed an algorithm for the detection of blast cells under specific criteria of image enhancement and processing. It comprises panel selection, K-means clustering for segmentation, and a refinement process. A public database is used for testing, and the results show that the proposed algorithm achieves 97.47% sensitivity and 98.1% specificity. Another dataset collected from local hospitals is also used for experiments, on which the algorithm reached 100% sensitivity, 99.747% specificity, and 99.7617% accuracy. In a similar fashion, an enhanced computer-based method for cancer cell prediction is introduced in study¹⁶. The authors use principal component analysis (PCA)-based features extracted from the nucleus image of these cells. In addition to detecting cancerous cell subtypes, the proposed algorithm can differentiate non-cancerous cell subtypes with improved sensitivity.

Despite the strong results reported in the above-mentioned studies, microarray gene data has not been well studied for blood cancer prediction. Besides, apart from a couple of works, the accuracy reported in the remaining studies is not sufficient for blood cancer prediction. In addition, most works use small datasets, so their results cannot be generalized. To overcome these limitations, this study proposes a hybrid model to achieve higher accuracy for blood cancer prediction. Table 1 summarizes the studies discussed in the related work.

Table 1.

Summary of the systematic analysis studies in related work.

Study	Models	Dataset	Evaluation metrics	Results
¹⁵	Bayes Network learning, Conjunctive Rule, NBTree, VFI, Random Subspace, Naïve Bayes Updateable, and PART	Three datasets containing 7130 genes	Accuracy	97.22% for 500 genes
⁴	Local Directional path	90 high-quality $184 \times 138$ size images obtained from the American Society of Hematology	Sensitivity, Specificity, Precision, F-Measure	Sensitivity: 100%, Specificity: 80%, Precision: 85.74%, F-Measure: 93.4%
⁶	K-Means, Fuzzy C Means, Weighted K Means	Heart dataset from UCI machine learning repository	Cluster accuracy, error rate and execution time	Leukemia, K-Means: 78%, Fuzzy means: 75%, WK-Means: 85%
¹³	Updatable NB, MLP, KNN, SVM	25 variables or features and 82 instances or records	Accuracy	NB 94.76%, MLP 95.24%, SVM 96.20%, KNN 91.43%
¹⁶	Fuzzy c-means clustering, PCA, SVM	21 peripheral blood smear and bone marrow slides of 14 patients with ALL and 7 normal persons, $2592 \times 3872$ pixels in red green blue (RGB) color	Sensitivity, specificity, accuracy, precision and false negative	Sensitivity 98%, Specificity 97%, Accuracy 98%, Precision 98%
¹⁷	Linde–Buzo–Gray, Kekre’s Proportionate Error, K-Means	115 digital images of size $256 \times 256$. 16 datasets with 2415 images, 642 images with size $632 \times 480$ pixels	Sensitivity, specificity, accuracy	Sensitivity 100%, Specificity 99.747%, Accuracy 99.7617%
⁷	KNN, SVM, DT, RF, GBDT	Three RNA-seq data sets	Precision, recall and accuracy	Accuracy LUAD: 98.80 (± 1.79), STAD: 98.78 (± 1.44), BRCA: 98.41 (± 0.41)
¹²	Deep convolutional neural networks	Images from ALL-Image DataBase (IDB)	Sensitivity, specificity, accuracy	Sensitivity 100%, Specificity 98.11%, Accuracy of 99.50%
¹⁴	AlexNet	2,820 images	Precision, Recall, accuracy	100% classification accuracy
¹⁸	Alert Net-RWD	16 datasets with 2,415 images	Accuracy, precision	Accuracy 97.18%, Precision 97.23%
¹⁹	SVM, KNN, NB, and RF	NCBI/GEO public database: 11 series from Microarray and 2 series from RNA-seq	ANOVA statistical test, accuracy, F1	10 Genes F1-score: SVM: 97.13%, KNN: 96.28%, NB: 97.29%, RF: 97.01%
²⁰	DNN deep learning network	36 cases containing 22,283 gene expression of acute myeloid leukemia (AML) microarray	Accuracy	Accuracy: 96.6%

Target	Count	After ADASYN
B-CELL_ALL	74	74
B-CELL_ALL_TCF3-PBX1	22	74
B-CELL_ALL_HYPERDIP	51	64
B-CELL_ALL_HYPO	18	74
B-CELL_ALL_MLL	17	73
B-CELL_ALL_T-ALL	46	74
B-CELL_ALL_ETV6-RUNX1	53	76
Total Samples	281	509
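
The class counts above show ADASYN adding 228 synthetic samples (509 − 281), with the resulting classes close to, but not exactly at, the majority size of 74. A minimal sketch of how ADASYN-style oversampling works is given below. It is a simplified, binary (one minority class versus the rest) illustration; the function name `adasyn_sketch`, its parameters `k` and `beta`, and the toy data are assumptions of this sketch, not the authors' implementation. Per-sample counts are rounded, which is one reason class totals may end up only approximately balanced.

```python
import math
import random


def adasyn_sketch(X, y, minority, k=3, beta=1.0, seed=0):
    """Generate synthetic minority samples in the spirit of ADASYN.

    Minority samples whose neighbourhoods contain more non-minority
    points receive more synthetic samples. Simplified for illustration.
    """
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_min = len(min_idx)
    n_maj = len(y) - n_min
    total_new = int((n_maj - n_min) * beta)  # synthetic samples wanted
    if total_new <= 0 or n_min < 2:
        return []

    def nearest(i, candidates):
        # k nearest neighbours of sample i among the candidate indices
        others = [j for j in candidates if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        return others[:k]

    # r_i: fraction of the k nearest neighbours that are not minority
    r = []
    for i in min_idx:
        nbrs = nearest(i, range(len(X)))
        r.append(sum(y[j] != minority for j in nbrs) / len(nbrs))
    total = sum(r)
    # If no minority sample has non-minority neighbours, spread evenly
    weights = [ri / total for ri in r] if total > 0 else [1 / n_min] * n_min

    synthetic = []
    for i, w in zip(min_idx, weights):
        min_nbrs = nearest(i, min_idx)
        for _ in range(round(w * total_new)):
            j = rng.choice(min_nbrs)
            lam = rng.random()
            # Interpolate between sample i and a minority neighbour
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy 2-D data (hypothetical): 3 minority and 6 majority samples.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0],
     [1.0, 1.1], [0.9, 1.0], [1.0, 0.9], [0.2, 0.1]]
y = ["minor"] * 3 + ["major"] * 6
synthetic = adasyn_sketch(X, y, "minor")
print(len(synthetic))
```

For a multi-class dataset such as the one above, the same procedure would be repeated for each class that is smaller than the largest one.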

Techniques	Training set		Testing set
Techniques	Samples	Features	Samples	Features
Original dataset	238	22,283	43	22,283
After ADASYN	432	22,283	77	22,283
After Chi2	238	400	43	400
After ADASYN+Chi2	432	400	77	400
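
The table above shows Chi2 reducing 22,283 features to 400. As a rough illustration of how a chi-squared score can rank non-negative features against class labels, the sketch below computes, for each feature, the statistic Σ (observed − expected)² / expected over classes, where "observed" is the feature's summed value within a class and "expected" is the class prior times the feature's overall sum. The helper names `chi2_scores` and `select_k_best` and the toy matrix are assumptions of this sketch; the paper does not show its implementation.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature; features must be non-negative."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    class_prob = {c: sum(label == c for label in y) / n_samples for c in classes}
    feature_total = [sum(row[f] for row in X) for f in range(n_features)]
    scores = []
    for f in range(n_features):
        score = 0.0
        for c in classes:
            observed = sum(row[f] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * feature_total[f]
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(X, y, k):
    """Return the indices of the k highest-scoring features and the reduced data."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[f] for f in keep] for row in X]


# Toy expression matrix (hypothetical values): 4 samples, 4 features.
X = [[5.0, 1.0, 3.0, 2.0],
     [6.0, 1.0, 3.0, 2.0],
     [1.0, 1.0, 3.0, 4.0],
     [0.0, 1.0, 3.0, 4.0]]
y = ["A", "A", "B", "B"]
keep, reduced = select_k_best(X, y, k=2)
print(keep, reduced)
```

Features 1 and 2 are constant across classes and score zero, so only features 0 and 3 are kept. In the setting described above, k would be 400.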

Type	1007_s_at	1053_at	.	AFFXTrpnXM_at
BCELL_ALL	7.409521	5.009216	.	2.608381
BCELL_ALL	7.177109	5.415108	.	2.634063

Type	1007_s_at	1053_at	.	AFFXTrpnXM_at
Bone_Marrow_CD34	7.745245	7.811210	.	4.139249
Bone_Marrow_CD34	8.087252	7.240673	.	4.122700

Model	Description
RF	RF is a tree-based ensemble learning model that makes accurate predictions by combining multiple weak learners. It uses bagging to train several decision trees on different bootstrap samples. Each bootstrap sample is drawn from the training data with replacement and has the same size as the training set²⁶
LR	Logistic regression is commonly used for classification problems. It is a regression-based predictive model grounded in probability theory. It is most often applied to binary outcomes, in which one or more variables together determine the result. Using the sigmoid (logistic) function, it relates one or more independent variables to the estimated probability of the outcome²⁷
SVC	Classification aims to divide a dataset into categories based on a set of criteria. SVC is a classification method based on the support vector technique. Its goal is to fit the supplied data and return a “best fit” hyperplane that separates the classes. Once the hyperplane is obtained, new feature vectors can be passed to the classifier to obtain their predicted class. This makes the algorithm well suited to our purposes, though it can be used in a variety of contexts^28,29
KNN	KNN is a simple machine learning model used for regression and classification. A sample is assigned to the class of its nearest neighbors, which are found using a distance measure. In this experiment, KNN gives promising results when k = 5, meaning it looks at the five closest neighbors and chooses a class based on the majority or closest distance³⁰
NB	Based on Bayes' theorem, Naive Bayes is a supervised learning algorithm used for classification problems. Training an NB classifier requires only a limited number of data points and is therefore fast and scalable. It is a probabilistic classifier that predicts the probability that a sample belongs to each class. The NB classifier assumes that each feature is independent of the others, so that each feature contributes independently to the probability of a sample belonging to a given class. It is easy to use, quick to compute, and works well on large, high-dimensional datasets³¹
ETC	The ETC works in a similar way to the random forest, except for how the trees in the forest are built. The ETC uses the original training sample to build each decision tree. At each node, a random sample of k features is drawn, and the best feature to split the data is chosen from it using the Gini index. These random feature samples produce several de-correlated decision trees. Decision tree algorithms work well with both categorical and numerical data³²
DT	A DT is a tree-like structure used to build classification models. Decision trees are commonly used in medical data processing because they are simple and fast to execute. A decision tree has three types of nodes: (1) the root node (the main node, on which the roles of the other nodes depend); (2) interior nodes (which handle the various attributes); and (3) leaf nodes (also called end nodes, the final nodes that represent the result of each test)³³
ADA	ADA is typically used in combination with other algorithms to improve their accuracy. It focuses on boosting weak learners into strong learners. Each AdaBoost tree is built based on the error rate of the previously constructed tree³⁴
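
The ETC description above refers to the Gini index as the criterion for choosing a split. A minimal sketch of that criterion follows: it computes the Gini impurity of a label set and the size-weighted impurity of a candidate threshold split, where lower is better. The function names, expression values, and class labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical expression values of one probe for four samples.
values = [7.4, 7.2, 7.7, 8.1]
labels = ["ALL", "ALL", "CD34", "CD34"]

print(gini(labels))                     # impurity before splitting
print(split_gini(values, labels, 7.5))  # separates the two classes cleanly
print(split_gini(values, labels, 7.3))  # leaves one side mixed
```

A threshold of 7.5 separates the two classes perfectly and gives impurity 0, so a tree builder comparing these two candidates would prefer it.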

Model	Hyperparameters setting	Hyperparameter range
RF	n_estimators = 300, max_depth = 25	n_estimators = 20 to 500, max_depth = 2 to 50
LR	multi_class = “multinomial”, C = 2.0	solver = liblinear, saga, sag, multi_class = “multinomial”, C = 1.0–5.0
SVC	kernel = “linear”, C = 2.0	kernel = linear, sigmoid, poly, C = 1.0–5.0
KNN	n_neighbors = 4	n_neighbors = 2–6
NB	Default setting	–
ETC	n_estimators = 300, max_depth = 25	n_estimators = 20–500, max_depth = 2–50
DT	max_depth = 25	max_depth = 2–50
ADA	n_estimators = 300, learning_rate = 0.2	n_estimators = 20–500, learning_rate = 0.1–0.8
LVTrees	Model (LR, SVC, ETC), Voting = Hard	Voting = Hard and Soft
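
The LVTrees row lists both hard and soft voting in the search range, with hard voting selected. To show how the two can disagree, the sketch below implements soft voting, which averages the class probabilities of the base models and picks the class with the highest mean. The function name `soft_vote` and the probability values are hypothetical; the sketch assumes each base model can output per-class probabilities.

```python
def soft_vote(probas_per_model, classes):
    """Average class probabilities across models and return the argmax label.

    probas_per_model: one list per model; each holds, per sample,
    a list of probabilities aligned with `classes`.
    """
    n_models = len(probas_per_model)
    n_samples = len(probas_per_model[0])
    result = []
    for i in range(n_samples):
        avg = [sum(model[i][c] for model in probas_per_model) / n_models
               for c in range(len(classes))]
        best = max(range(len(classes)), key=avg.__getitem__)
        result.append(classes[best])
    return result


# Hypothetical probabilities for one sample from LR, SVC, and ETC.
classes = ["ALL", "MLL"]
lr = [[0.55, 0.45]]
svc = [[0.60, 0.40]]
etc = [[0.10, 0.90]]

# Hard voting would pick "ALL" here (two of three models favour it),
# while soft voting lets the one confident model outweigh the other two.
soft = soft_vote([lr, svc, etc], classes)
print(soft)
```

Hard voting ignores how confident each model is, which can make it more robust when the probability estimates of the base models are poorly calibrated.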

Model	Accuracy	Precision	Recall	F1 score
LVTrees	0.91	0.95	0.89	0.89
KNN	0.91	0.95	0.88	0.88
ETC	0.88	0.80	0.84	0.82
ADA	0.65	0.78	0.67	0.67
SVC	0.91	0.96	0.88	0.88
RF	0.88	0.81	0.84	0.82
NB	0.86	0.79	0.81	0.79
DT	0.72	0.74	0.72	0.73
LR	0.91	0.95	0.88	0.88
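
In the table above, several models show precision above accuracy and recall below it. A short sketch of how accuracy, precision, recall, and F1 can be computed for a multi-class problem may help in reading such rows. It assumes macro averaging (the unweighted mean of per-class scores); the averaging scheme is not stated in this section, and the labels below are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n_classes = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Hypothetical imbalanced test set: 8 samples of class A, 2 of class B.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one B sample is misclassified as A
m = macro_metrics(y_true, y_pred)
print(m)
```

A single error on the small class lowers macro recall much more than accuracy, which is one way an imbalanced test set can produce the pattern seen in the table.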

Model	Accuracy	Precision	Recall	F1 score
LVTrees	0.99	0.99	0.99	0.99
KNN	0.87	0.91	0.88	0.87
ETC	0.97	0.98	0.98	0.98
ADA	0.75	0.86	0.78	0.77
SVC	0.99	0.99	0.99	0.99
RF	0.99	0.99	0.99	0.99
NB	0.95	0.95	0.95	0.95
DT	0.87	0.87	0.88	0.87
LR	0.99	0.99	0.99	0.99

Model	Accuracy	Precision	Recall	F1 score
LVTrees	1.00	1.00	1.00	1.00
KNN	0.95	0.96	0.92	0.92
ETC	0.97	0.97	0.96	0.97
ADA	0.86	0.88	0.85	0.84
SVC	0.99	0.99	0.98	0.98
RF	0.99	0.99	0.98	0.98
NB	0.92	0.91	0.90	0.91
DT	0.84	0.87	0.81	0.82
LR	0.97	0.97	0.97	0.97

Model	Accuracy	Precision	Recall	F1 Score
LVTrees (Original)	0.91	0.95	0.89	0.89
LVTrees (Chi2 + ADASYN)	0.95	0.93	0.95	0.94
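
The abstract mentions a T-test to support the proposed approach, but its exact setup is not given in this section. One plausible form, sketched below, is a paired t-test on accuracies from matched runs of the two configurations compared above. The per-run accuracies are hypothetical, and the critical value 2.262 is the standard two-tailed 5% threshold for 9 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical accuracies from ten matched runs (not the paper's data).
original = [0.90, 0.88, 0.93, 0.86, 0.91, 0.89, 0.92, 0.87, 0.90, 0.94]
with_chi2_adasyn = [0.97, 0.95, 0.99, 0.94, 0.96, 0.98, 0.97, 0.93, 0.98, 1.00]

t, df = paired_t_statistic(with_chi2_adasyn, original)
T_CRITICAL_DF9 = 2.262  # two-tailed, alpha = 0.05, df = 9
print(t, df, abs(t) > T_CRITICAL_DF9)
```

If |t| exceeds the critical value, the null hypothesis that both configurations have the same mean accuracy is rejected at the 5% level.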

Model	Original data		Chi2 + ADASYN
Model	Accuracy	SD	Accuracy	SD
LVTrees	0.90	$\pm 0.03$	0.97	$\pm 0.03$
KNN	0.79	$\pm 0.05$	0.92	$\pm 0.04$
ETC	0.86	$\pm 0.03$	0.95	$\pm 0.03$
ADA	0.48	$\pm 0.06$	0.57	$\pm 0.10$
SVC	0.89	$\pm 0.04$	0.96	$\pm 0.03$
RF	0.86	$\pm 0.04$	0.96	$\pm 0.03$
NB	0.83	$\pm 0.07$	0.90	$\pm 0.04$
DT	0.70	$\pm 0.06$	0.86	$\pm 0.05$
LR	0.89	$\pm 0.03$	0.95	$\pm 0.03$
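
The accuracy and SD columns above come from k-fold cross-validation. A minimal sketch of the procedure follows: the sample indices are shuffled and partitioned into k folds, each fold serves once as the test set, and the per-fold accuracies are summarized by their mean and standard deviation. The fold scores are hypothetical, stratification by class is omitted for brevity, and whether the reported SD is the sample or population form is not stated, so the sample form is assumed here.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# 281 samples as in the original dataset; k = 10 is an assumption here.
splits = list(k_fold_indices(281, 10))

# Hypothetical per-fold accuracies, standing in for real model results.
fold_scores = [0.97, 0.93, 1.00, 0.96, 1.00, 0.93, 0.97, 1.00, 0.97, 0.97]
mean_acc, sd_acc = summarize(fold_scores)
print(f"{mean_acc:.2f} +/- {sd_acc:.2f}")
```

In a real run, each `(train, test)` pair would be used to fit the model on the training indices and score it on the test indices before summarizing.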

Reference	Year	Model	Data	Accuracy
¹⁹	2019	SVM, KNN, NB, and RF	Microarray gene	KNN: 96.28%, NB: 97.29%, RF: 97.01%
²⁰	2020	DNNs deep learning network	Microarray gene	96.6%
Current study	2021	LVTrees	Microarray gene	100%
