IET Systems Biology. 2016 Oct 1;10(5):168–178. doi: 10.1049/iet-syb.2015.0082

Lung cancer prediction from microarray data by gene expression programming

Hasseeb Azzawi 1, Jingyu Hou 1, Yong Xiang 1, Russul Alanni 1
PMCID: PMC8687242  PMID: 27762231

Abstract

Lung cancer is a leading cause of cancer‐related death worldwide. Early diagnosis has been shown to be greatly helpful for curing the disease effectively. Microarray technology provides a promising approach to exploiting gene profiles for cancer diagnosis. In this study, the authors propose a gene expression programming (GEP)‐based model to predict lung cancer from microarray data. The authors use two gene selection methods to extract the significant lung cancer related genes, and accordingly propose different GEP‐based prediction models. Prediction performance evaluations and comparisons between the authors' GEP models and three representative machine learning methods, support vector machine, multi‐layer perceptron and radial basis function neural network, were conducted thoroughly on real microarray lung cancer datasets. Reliability was assessed by cross‐dataset validation. The experimental results show that the GEP model using fewer feature genes outperformed the other models in terms of accuracy, sensitivity, specificity and area under the receiver operating characteristic curve. It is concluded that the GEP model is a better solution to lung cancer prediction problems.

Inspec keywords: lung, cancer, medical diagnostic computing, patient diagnosis, genetic algorithms, feature selection, learning (artificial intelligence), support vector machines, multilayer perceptrons, radial basis function networks, reliability, sensitivity analysis

Other keywords: lung cancer prediction, cancer‐related death, cancer diagnosis, gene profiles, gene expression programming‐based model, gene selection, GEP‐based prediction models, prediction performance evaluations, representative machine learning methods, support vector machine, multilayer perceptron, radial basis function neural network, real microarray lung cancer datasets, cross‐data set validation, reliability, receiver operating characteristic curve

1 Introduction

Cancer is considered a genetic disorder whose causes and mechanisms are unknown in most cases. Among the cancer types, lung cancer is a major killer around the world, especially in America and East Asia [1–3]. Lung cancer accounts for ∼25% of cancer‐related deaths worldwide, more than the other most prevalent cancers, such as breast, prostate and colorectal cancers, combined [4]. However, the genetic factors of lung cancer are yet to be clearly understood, because lung cancer is a complex genetic disease that develops through the concurrence of many genetic alterations [5]. Lung cancer can be divided into two types: non‐small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). The more common type is NSCLC, which is found in ∼80% of lung cancer patients and can be further sub‐categorised into squamous cell carcinoma, adenocarcinoma (AC) and large cell carcinoma (LCC) [6]. Traditionally, lung cancer diagnosis and prognosis rely on physical analyses of tissues using chest X‐ray, computed tomography scans and magnetic resonance imaging [7, 8]. Unfortunately, these techniques can only detect malignant cells in the late stages of lung cancer, which results in low survival rates (∼16% for NSCLC and 6% for SCLC) [9]. With the advances in molecular biology, especially microarray technology, we can now acquire information about DNA, RNA and proteins to detect the formation of tumours at earlier stages, which would increase the survival rate of cancer patients, especially lung cancer patients. One application of microarray technology is cancer classification analysis, in which microarray data are used to determine whether genes are active, hyperactive or inactive in various tissues, and samples are then classified into two or more groups [10]. There are many studies on the classification of lung cancer [1, 4, 11, 12]. To ensure successful diagnosis and effective treatment of cancer diseases, the classification process has to be accurate and reliable.

Machine learning techniques have been widely used over the last two decades for cancer prediction and prognosis, especially artificial neural networks (ANNs), support vector machines (SVMs) and Bayesian networks [13, 14]. For example, to identify SCLC and NSCLC cells, an SVM method incorporating prior knowledge was applied to lung cancer gene expression databases [15]. ANNs have solved many classification and pattern recognition problems. An ANN uses input variables to train the classifier and produces the output through multiple hidden layers, which apply mathematical operations to connect the neural nodes. ANNs have become a standard method for various classification tasks [16]. However, they have some drawbacks. The structure of an ANN entails many time‐consuming operations, which leads to inefficient performance. Furthermore, the generic layered structure of an ANN, known as a 'black box', makes it almost impossible to look into the working mechanism of the classification process or to understand why an ANN fails. Unlike ANNs, SVMs rarely suffer from over‐fitting, but training is slow on large datasets [13, 17]. In recent years, an innovative evolutionary algorithm called gene expression programming (GEP) was proposed by Ferreira [18, 19]. GEP provides a data visualisation model that can easily be converted into a mathematical model. One study has shown the power of GEP in predicting essential proteins indispensable for cell survival [20]. Another success story of GEP is the prediction of adverse events of radical hysterectomy for cervical cancer patients, with an accuracy of 71.96% [21]. For lung cancer classification, some GEP‐based models have been proposed. For example, Yu et al. [11] proposed an optimal biomarker joint model using a GEP algorithm to classify lung tumours. They also developed a GEP model for the auxiliary diagnosis of SCLC using serum biomarkers [17]. These works show that GEP provides a promising approach to cancer prediction/diagnosis and prognosis because of its good performance in solving complex problems with simple encodings. However, to our knowledge, there is no research on using GEP models to classify lung cancer from microarray data.

In this research, we propose a new GEP‐based classifier to classify lung cancer from microarray data. We use two methods to select feature genes (attributes) from the microarray data: the Relief attribute evaluator and random forest (RF). GEP models are then constructed on the selected feature genes to classify lung cancer. We used three real lung cancer datasets to compare the prediction performance of our GEP‐based classifier with three other representative classifiers, namely SVM, multilayer perceptron (MLP) and radial basis function neural network (RBFNN), in terms of accuracy, sensitivity, specificity and area under the receiver operating characteristic (ROC) curve (AUC). For reliability, k‐fold cross‐validation was used and the results were averaged. The experimental results show that the proposed GEP model using fewer attributes outperforms the existing models in lung cancer prediction.

The paper is structured as follows. Section 2 briefly explains the GEP method. Section 3 presents our GEP‐based approach to constructing a novel classifier and the settings of our experiments. The experimental results of our classifier in predicting lung cancer, as well as the comparison results with other representative classifiers on the same datasets, are presented in Section 4. We discuss the experimental results in Section 5, and Section 6 concludes the paper with some directions for future work.

2 Gene expression programming

GEP is an algorithm that simulates biological evolution to create and evolve computer programs. It was proposed by Ferreira [22] as, in some sense, an extension of genetic programming (GP) [23] that preserves a small number of genetic algorithm (GA) properties [24]. The difference between GP and GEP lies in the chromosome representation: in GEP, chromosomes are not represented as trees but as linear strings of fixed length, a feature inherited from GAs. In GEP, programmes (individuals) are encoded by chromosomes, and a chromosome consists of one or more genes, each structured as a head and a tail. The head consists of functional operators such as (+, −, ×, /) and/or terminal elements such as variables and constants. The tail consists of terminal elements only, and its size is calculated by t = h(n − 1) + 1, where h is the head size and n is the maximum number of parameters required by the functions in the function set [25]. The size of a gene is therefore fully determined by its head size. The phenotype, an expression tree (ET), results from translating the genotype once each gene's representation is given.
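As a quick check of the tail‐size formula against the settings used later (Table 1), a minimal Python sketch, assuming a head of size 10 and binary operators (maximum arity 2):

```python
def tail_size(h, n):
    """Tail size t = h*(n - 1) + 1, as defined above."""
    return h * (n - 1) + 1

h, n = 10, 2            # head size 10; binary operators give maximum arity 2
t = tail_size(h, n)     # 11, the tail size listed in Table 1
gene_size = h + t       # 21, the gene size listed in Table 1
print(t, gene_size)
```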

A chromosome is formed by a linking function (a mathematical function such as addition, or a logical function such as AND) that joins the genes together (i.e. the outcome of joining multiple trees). Individual chromosomes form a population, which evolves to produce new generations by performing genetic operations (mutation, transposition, root transposition, gene transposition and gene recombination), calculating the expression value of every chromosome/individual, judging the fitness of each chromosome based on its expression value, and then selecting the chromosomes with better fitness values as the next generation to continue the evolution. The evolution stops when a predefined termination criterion is satisfied, and the individual/chromosome with the best fitness value is selected as the classifier model. Depending on the problem considered, different fitness functions can be defined. More details of GEP can be found in [26].

3 Classification model and settings

3.1 GEP‐based classifier

In this work, we use GEP to construct a classifier for lung cancer prediction/classification. This involves the following major steps: defining chromosomes using the function and terminal sets; initialising a population of the defined chromosomes; defining a fitness function for evaluating chromosomes; selecting eugenic individuals (chromosomes) from the population; reproducing the next generation of chromosomes via genetic operations on the selected eugenic chromosomes; and checking the termination condition of the evolution. Fig. 1 illustrates the flow chart of building a GEP classifier.

Fig. 1 Flowchart of building a GEP classifier

To build a classifier from GEP for lung cancer prediction, we first build a set of functions and a set of terminals to define chromosomes, which are expressed as non‐linear combinations of functions and terminals. The function set consists of the mathematical operators +, −, ×, ÷, Exp, Sqrt, Log and Logi, while the terminal set consists of the feature genes (attributes) extracted from the microarray dataset and the relevant coefficients. Here Exp(x) is the exponential function, Sqrt(x) is the square root function, Log(x) is the logarithmic function, and Logi(x) is the logistic function Logi(x) = 1/(1 + exp(−x)). We then define the chromosomal structure by setting the head length to 10, the tail length to 11 and the number of genes in each chromosome to 4. For instance, if there are five attributes related to lung cancer (d0, d1, …, d4), a single chromosome could be constructed as follows (each gene consists of a head followed by a tail; the four genes are joined by the linking symbol '+'):

.+.+...+.+.+./..d3.d3.d1.d4.d2.d4.d4.d4.d3.d0.d0++.Sqrt..d3.d0..+.+..+.d0.d1.d0.d2.d4.d0.d4.d4.d2.d0.d0+/..+..d1..d1.+.d4.d4.d2.d3.d0.d0.d4.d3.d0.d1.d3.d2.d2+...d0.Logi../..+.Logi.d1.d3.d3.d1.d1.d0.d4.d1.d2.d4.d3

The chromosome notation used in the above example is called Karva notation, or K‐expression. A K‐expression can be converted into an ET. The conversion starts from the first position in the Karva notation, and expands as a binary tree by adding the elements of the Karva notation into the tree as the nodes one by one. At each tree layer, the elements are added from left to right. This tree expanding process continues layer‐by‐layer until all leaf nodes in the ET are the elements from the terminal set. The corresponding ET of the above chromosome is shown in Fig. 2.
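To make the layer‐by‐layer conversion concrete, the following minimal Python sketch (an illustration, not the authors' implementation) decodes a short, hypothetical K‐expression with head size 2 into an ET and evaluates it; tail elements left over once the tree is complete are simply unused, as in GEP:

```python
import math

ARITY = {'+': 2, '-': 2, '*': 2, '/': 2, 'Sqrt': 1, 'Exp': 1, 'Log': 1, 'Logi': 1}
OPS = {
    '+': lambda a, b: a + b, '-': lambda a, b: a - b,
    '*': lambda a, b: a * b, '/': lambda a, b: a / b,
    'Sqrt': math.sqrt, 'Exp': math.exp, 'Log': math.log,
    'Logi': lambda x: 1.0 / (1.0 + math.exp(-x)),      # Logi(x) as defined above
}

def decode(kexpr):
    """Breadth-first decode of a dot-separated K-expression into a nested list ET."""
    tokens = kexpr.split('.')
    nodes = [[t] for t in tokens]             # each node: [symbol, child, child, ...]
    i = 1                                     # index of the next token to attach
    for node in nodes:
        for _ in range(ARITY.get(node[0], 0)):    # terminals have arity 0
            node.append(nodes[i])
            i += 1
    return nodes[0]                           # leftover tail tokens remain unattached

def evaluate(node, env):
    sym, args = node[0], node[1:]
    if not args:
        return env[sym]                       # terminal: look up its expression value
    return OPS[sym](*(evaluate(a, env) for a in args))

et = decode('+.Sqrt.d0.d1.d2')                # root '+' gets Sqrt and d0; Sqrt gets d1
print(evaluate(et, {'d0': 2.0, 'd1': 4.0, 'd2': 9.0}))   # Sqrt(4.0) + 2.0 = 4.0
```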

Fig. 2 Example of GEP ETs: four sub‐ETs

The second step is to initialise the population by randomly constructing individuals (i.e. chromosomes) from the defined function and terminal sets. The size of the population is set at 30 in our model.

The third step is to define a fitness function for evaluating each individual chromosome. We use the fitness function SSPN, defined as the product of sensitivity (SN), specificity (SP), positive predictive value (PPV) and negative predictive value (NPV); this fitness function has been shown to yield optimal results [11, 23]. The mathematical formula of the fitness function is given by the following equation

$$\mathrm{SSPN}_i = \mathrm{SN}_i \times \mathrm{SP}_i \times \mathrm{PPV}_i \times \mathrm{NPV}_i \quad (1)$$

where $\mathrm{SN}_i$, $\mathrm{SP}_i$, $\mathrm{PPV}_i$ and $\mathrm{NPV}_i$ are calculated by (2), (3), (4) and (5), respectively:

$$\mathrm{SN}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i} \quad (2)$$

$$\mathrm{SP}_i = \frac{\mathrm{TN}_i}{\mathrm{TN}_i + \mathrm{FP}_i} \quad (3)$$

$$\mathrm{PPV}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i} \quad (4)$$

$$\mathrm{NPV}_i = \frac{\mathrm{TN}_i}{\mathrm{TN}_i + \mathrm{FN}_i} \quad (5)$$

where $\mathrm{TN}_i$, $\mathrm{TP}_i$, $\mathrm{FN}_i$ and $\mathrm{FP}_i$ are the numbers of true negatives, true positives, false negatives and false positives of classification/prediction, respectively, for chromosome $i$ (a classifier) over the whole training dataset.

The above four possible outcomes TN, TP, FN and FP are for a single prediction of a binomial classification task with two classes, 'Yes' ('1') and 'No' ('0'). Classifying a training case with a chromosome of an evolved generation takes two steps. The first step extracts the expression values of the training case from the microarray dataset with respect to the selected feature genes (attributes). The second step calculates a classification score by combining these extracted expression values according to the ET of the chromosome. The classification score is then converted into '1' or '0' using a 0/1 rounding threshold (e.g. 0.5): if the score is greater than the threshold, the result of the classification is '1' ('Yes'), otherwise '0' ('No').
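As a minimal illustration (not the authors' code), the thresholding step and the SSPN fitness of (1)–(5) can be computed from a chromosome's scores over the training set as follows; a chromosome with any empty confusion‐matrix margin is assigned fitness 0:

```python
def classify(score, threshold=0.5):
    """0/1 rounding threshold: scores above the threshold mean 'Yes' (1)."""
    return 1 if score > threshold else 0

def sspn(tp, tn, fp, fn):
    """Fitness (1): the product of SN, SP, PPV and NPV from (2)-(5)."""
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        return 0.0                           # degenerate classifier, worst fitness
    return (tp / (tp + fn)) * (tn / (tn + fp)) * (tp / (tp + fp)) * (tn / (tn + fn))

def fitness(scores, labels, threshold=0.5):
    """SSPN of one chromosome over the whole training dataset."""
    preds = [classify(s, threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return sspn(tp, tn, fp, fn)
```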

The fourth step is to rank all individuals (chromosomes) of the generation from the highest to the lowest fitness score and choose the top 20% of the population as the eugenic ones. The fifth step is to perform a set of genetic operations, including mutation, transposition and crossover, on the eugenic individuals to produce the chromosomes of the next generation, which has the same size as the former one. Steps 4 and 5 are repeated until a predefined number of generations is reached; the individual (chromosome) with the highest fitness score in the last generation is then selected as the final classifier. The settings of the GEP method for building a classifier are presented in Table 1, and a sketch of this generational loop follows the table.

Table 1.

GEP parameters for building a lung cancer classifier from microarray datasets

Parameter Setting
number of chromosomes 30
number of genes 4
head size 10
tail size 11
gene size 21
number of generations 700
linking function addition
function set +, −, *, /, Exp, Sqrt, Log, Logi
mutation rate 0.044
IS transposition rate 0.1
RIS transposition rate 0.1
gene transposition rate 0.1
one‐point recombination rate 0.3
two‐point recombination rate 0.3
gene recombination rate 0.1
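The following minimal Python sketch (an illustration, not the authors' implementation) shows the generational loop of steps 4 and 5 with the population size, head/tail sizes and mutation rate of Table 1. For brevity, point mutation stands in for the full operator set, and the `fitness` argument is any callable that scores a chromosome (e.g. the SSPN sketch above applied to its decoded ETs):

```python
import random

FUNCS = ['+', '-', '*', '/', 'Sqrt', 'Exp', 'Log', 'Logi']
TERMS = ['d0', 'd1', 'd2', 'd3', 'd4']
HEAD, TAIL, N_GENES = 10, 11, 4                    # chromosome structure from Table 1

def random_gene():
    head = [random.choice(FUNCS + TERMS) for _ in range(HEAD)]
    tail = [random.choice(TERMS) for _ in range(TAIL)]   # tail holds terminals only
    return head + tail

def mutate(chromosome, rate=0.044):                # mutation rate from Table 1
    new = []
    for gene in chromosome:
        g = list(gene)
        for i in range(len(g)):
            if random.random() < rate:
                pool = FUNCS + TERMS if i < HEAD else TERMS  # keep tail terminal-only
                g[i] = random.choice(pool)
        new.append(g)
    return new

def evolve(fitness, pop_size=30, generations=700):
    pop = [[random_gene() for _ in range(N_GENES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:max(1, pop_size // 5)]        # top 20% as eugenic parents
        pop = elite + [mutate(random.choice(elite))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=fitness)                   # best individual as the classifier
```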

3.2 Microarray datasets

In our experiments, we evaluated our GEP classification model on three public microarray datasets that are freely available and commonly used for lung cancer classification research. These three datasets and their main characteristics (e.g. the number of samples, the number of genes and the class distribution) are shown in Table 2. The gene expression datasets were downloaded from the National Library of Medicine (http://www.ncbi.nlm.nih.gov/pubmed) and the Kent Ridge Bio‐medical Dataset repository (http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html#cs4-software). The details of the datasets are as follows.

Table 2.

Characteristics of microarray datasets

Dataset # Samples # Genes Class 1 Class 2
Michigan 96 7129 86 10
Harvard 181 12 533 31 150
GEO 58 1700 40 18

The Michigan dataset contained 96 samples and 7129 genes. Among these samples, 86 patients had cancer and were labelled 'Tumour'; the remaining ten patients did not suffer from lung cancer and were labelled 'Normal'.

The Harvard dataset (Brigham and Women's Hospital and Harvard Medical School) contained 181 samples and 12,533 genes. Among these samples, 150 patients had cancer of type AC; the remaining 31 patients suffered from malignant pleural mesothelioma.

The GEO dataset (GSE10245) contained 58 samples and more than 1700 genes. Among these samples, 40 patients had cancer of type AC; the remaining 18 patients did not suffer from AC according to both clinical and radiological tests.

3.3 Microarray attributes selection

Attribute (feature) selection is a technique for creating robust learning models by selecting the most relevant attributes from the original set. It is widely used in machine learning to improve model performance. Gene feature selection is essential to cancer classification, as a small number of correctly selected genes can yield accurate classification results [27]. The attributes in lung cancer microarray datasets are the genes informative for classification. We used two ways to extract the feature genes: n‐top ranked gene selection and automatic gene selection.

3.3.1 n‐Top ranked genes selection

In our experiments, we used the attribute evaluator Relief Attribute Eval. Its Relief‐F method evaluates a feature F by repeatedly selecting a random sample and finding its nearest neighbours from the same class and from the other classes. F is scored according to how its values differ between classes: if F has different expression values for samples from different classes, it receives a higher score, and vice versa [28, 29]. This ranking technique is provided by the software Weka [30], a collection of open‐access machine learning algorithms for data mining tasks. We selected the 5, 10 and 20 top‐ranked informative genes (i.e. features) to create three sub‐datasets from each original lung cancer microarray dataset. Each sub‐dataset had the same number of samples (patients) as the original dataset, but far fewer genes. We denote the sub‐datasets with 5, 10 and 20 feature genes as GEP‐5, GEP‐10 and GEP‐20, respectively. The purpose of creating three sub‐datasets for each original dataset was to experimentally determine a proper number of features (genes) for the GEP algorithm to generate a better classifier and obtain better lung cancer prediction results. Reliability was assessed by ten‐fold cross‐validation: the dataset was randomly divided into ten equal subsets; in each run, nine subsets were used as the training dataset to construct the model, and the remaining subset was used as the testing dataset for prediction. The average accuracy over the ten iterations was recorded as the final measurement.
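For illustration, a minimal Relief‐F‐style scorer in Python (a simplified sketch of the idea above, not the Weka implementation; it uses a single nearest hit and miss per sampled case):

```python
import numpy as np

def relieff_scores(X, y, n_samples=100, seed=0):
    """Score each gene: reward differing from the nearest miss (other class)
    and penalise differing from the nearest hit (same class)."""
    rng = np.random.default_rng(seed)
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)  # scale features to [0, 1]
    scores = np.zeros(X.shape[1])
    for _ in range(n_samples):
        i = rng.integers(len(X))
        d = np.abs(X - X[i]).sum(axis=1)     # Manhattan distance to every sample
        d[i] = np.inf                        # exclude the sample itself
        hit = np.where(y == y[i], d, np.inf).argmin()    # nearest same-class sample
        miss = np.where(y != y[i], d, np.inf).argmin()   # nearest other-class sample
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores / n_samples

# e.g. the five genes of a GEP-5 sub-dataset:
# top5 = np.argsort(relieff_scores(X, y))[::-1][:5]
```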

3.3.2 Automatic gene selection

To extract feature genes, we also used another feature selection technique, RF [31], which can automatically determine the number of features. RF is a popular machine learning algorithm that provides accurate, robust results and is easy to apply [32]. It is an ensemble of decision trees in which each node represents a condition on a single feature, and each tree is constructed from a random subset of the training data. We first randomly divided the samples of each dataset into two parts: 80% of all patients (samples) were randomly selected as the training dataset for training the generated classifiers, while the remaining 20% formed a testing dataset used to test the predictive performance of the generated classifiers. To avoid a biased performance estimate, RF was trained on the training datasets only. The GEP classification method with this automatic feature gene selection is denoted 'GEP‐AS'. To ensure the reliability of the selected feature genes, we repeated the experiment ten times and used the average of the results to evaluate the performance of the classification models.
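A minimal sketch of this selection step using scikit‐learn (our illustration; the paper does not specify the software or the exact importance cut‐off used to fix the feature count, so the mean‐importance threshold below is an assumption), with stand‐in data in place of a real microarray matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((96, 500))                  # stand-in expression matrix (samples x genes)
y = rng.integers(0, 2, size=96)            # stand-in class labels

# 80%/20% training/testing split, as in Section 3.3.2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)

# fit RF on the training part only, to avoid a biased performance estimate
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# assumed cut-off: keep genes scoring above the mean importance
keep = np.flatnonzero(rf.feature_importances_ > rf.feature_importances_.mean())
```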

We used the software GeneXproServer 5.0 [22] to run the GEP algorithm with the parameters listed in Table 1.

3.4 Methods for comparison

We selected three representative classification methods to compare with our GEP‐based classifier in terms of classification performance: SVM, MLP and RBFNN. These techniques are used in many classification studies [13, 14, 33]. The GEP algorithm was run using the GeneXproServer 5.0 software [22], while the SVM, MLP and RBFNN methods were simulated with the DTREG software [34].

3.4.1 Support vector machine

SVM is a supervised learning method commonly used for data analysis and pattern recognition [35, 36]. It has become a popular technique for biological application problems and is widely used for cancer microarray data classification and prediction [13, 14]. The main idea of SVM is to find the optimal hyperplane between two classes, i.e. the hyperplane that separates the classes while maximising the margin between them. The parameters of the SVM classifier in our experiment were as follows (an illustrative setup appears after the list):

  • Type of SVM model: C‐SVC.

  • Kernel function: radial basis function (RBF).

  • C search range: 0.1–5000.

  • Gamma search range: 0.001–50.

  • Stopping criteria: 0.001000.

  • Cache size: 256.0.
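A hedged sketch of an equivalent setup in scikit‐learn (our illustration; the paper used DTREG, and the grid resolutions below are assumptions spanning the stated search ranges):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((96, 5)); y = rng.integers(0, 2, size=96)  # stand-in GEP-5 sub-dataset

# C-SVC with an RBF kernel; log-spaced grids over C in [0.1, 5000], gamma in [0.001, 50]
grid = GridSearchCV(
    SVC(kernel='rbf', tol=1e-3, cache_size=256),
    {'C': np.logspace(np.log10(0.1), np.log10(5000), 8),
     'gamma': np.logspace(np.log10(0.001), np.log10(50), 8)},
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_)
```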

3.4.2 Multi‐layer perceptron

MLP is a type of neural network that processes input variables through a sequence of layers [37]. An MLP usually has three layers. The first is the input layer, which receives the features (attributes) of an input microarray sample or pattern. The second is a hidden layer (there can be up to two hidden layers, depending on the design), containing a predefined number of nodes (neurons). Each hidden neuron sums the values of its inputs multiplied by weights; the weighted sum is passed to an activation function whose output is fed to the next layer. The output layer is the last layer; its neurons produce the final classification result of the model. The parameters of the MLP classifier selected in our experiment were as follows (an illustrative setup appears after the list):

  • Number of layers: three (one hidden layer).

  • Hidden layer activation function: logistic.

  • Output layer activation function: logistic.

  • Automatic hidden layer neuron selection: Min = 2, max = 20, step = 1.
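A comparable configuration sketched with scikit‐learn (our illustration; the paper used DTREG): one hidden layer of logistic units, its size searched over 2–20 as listed above. Note that scikit‐learn's MLPClassifier applies a logistic/softmax output automatically for classification, so only the hidden activation is set explicitly:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((96, 5)); y = rng.integers(0, 2, size=96)   # stand-in data

search = GridSearchCV(
    MLPClassifier(activation='logistic', max_iter=2000),   # logistic hidden units
    {'hidden_layer_sizes': [(k,) for k in range(2, 21)]},  # 2..20 hidden neurons
    cv=10,
)
search.fit(X, y)
print(search.best_params_)
```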

3.4.3 Radial basis function neural network

RBFNN has three layers. The first is an input layer. The second is a hidden layer in which the radial basis function serves as the activation function: each input vector in the input layer feeds all the radial basis neurons in the hidden layer, and the number of hidden neurons is determined during training. The third is the output layer [38]. RBFNN is widely used in classification because it is simple to design, performs robustly and tolerates input noise [39]. The parameters of the RBFNN classifier selected in our experiment were as follows (an illustrative construction appears after the list):

  • Number of layers: three (one hidden layer).

  • Hidden layer activation function: Logistic.

  • Output layer activation function: Logistic.

  • Automatic hidden layer neuron selection: Min = 2, max = 20, step = 1.
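scikit‐learn has no ready‐made RBFNN, so the sketch below assembles one from known pieces (our illustration, not DTREG's algorithm): k‐means picks the hidden‐neuron centres, Gaussian radial basis functions form the hidden activations, and a logistic output layer is fitted on top; k plays the role of the hidden‐layer size searched over 2–20 above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((96, 5)); y = rng.integers(0, 2, size=96)   # stand-in data

k = 10                                                     # hidden-layer size (2..20)
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
width = np.mean([np.linalg.norm(a - b) for a in centers for b in centers]) + 1e-12

def rbf_features(X):
    """Gaussian radial basis activation of each hidden neuron."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-dists**2 / (2 * width**2))

clf = LogisticRegression().fit(rbf_features(X), y)         # logistic output layer
print(clf.score(rbf_features(X), y))
```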

4 Experiment results

The experiments were designed in two parts: we first evaluated the classification performance of our GEP‐based classifiers, and then compared it with the performance of the three selected classification methods on the experimental datasets. The performance was evaluated in terms of sensitivity, specificity, accuracy and area under the curve (AUC) [13]. Sensitivity and specificity were defined in Section 3.1. The AUC and the accuracy were used to evaluate the overall performance of a classifier: accuracy measures the proportion of correct predictions, while AUC measures the prediction performance across thresholds. AUC is based on the receiver operating characteristic (ROC) curve, the graphical plot of sensitivity (y‐axis) against 1 − specificity (x‐axis). The performance values were calculated on each sub‐dataset created from every original dataset, i.e. GEP‐5, GEP‐10, GEP‐20 and GEP‐AS, as described in Section 3.3. We applied the GEP algorithm to these sub‐datasets to generate different classifiers and evaluated their performance. For simplicity, and without causing confusion, we also use GEP‐5, GEP‐10, GEP‐20 and GEP‐AS in the following text to denote the classifiers generated from these four sub‐datasets. We then compared the performance of these classifiers with that of the selected comparison methods (SVM, RBFNN and MLP). For reliability, we used ten‐fold cross‐validation to obtain the prediction results of GEP‐5, GEP‐10 and GEP‐20, while for the GEP‐AS model the experiments were repeated ten times and the average performance over the three evaluation datasets (GEO, Michigan and Harvard) was calculated.
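For reference, these four measures can be computed from a classifier's predicted labels and scores as in the following sketch (our illustration using scikit‐learn; the paper does not state how the metrics were computed):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Accuracy, sensitivity, specificity and AUC for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'auc': roc_auc_score(y_true, y_score),   # area under the ROC curve
    }

print(evaluate([1, 0, 1, 1], [1, 0, 0, 1], [0.9, 0.2, 0.4, 0.8]))
```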

For all experimental models the average results of the three datasets were calculated to show the overall performance of these models in terms of accuracy, sensitivity, specificity and AUC.

4.1 GEP models results

Among the GEP classifiers, the highest average accuracy over the three microarray datasets was 98.09% (standard deviation 2.32), achieved by GEP‐5. GEP‐20 achieved an average accuracy of 96.07% (standard deviation 3.42), GEP‐10 achieved 91.63% (standard deviation 1.75), and GEP‐AS achieved 89.18% (standard deviation 3.52), while the GEP model with all attributes (genes) achieved 87.49% (standard deviation 5.89). The average results and standard deviations of the five GEP models in terms of accuracy, sensitivity, specificity and AUC are presented in Figs. 3 and 4. The details of the results are shown in the appendix of this paper.

Fig. 3 Average and standard deviation of the GEP models for all datasets in terms of accuracy, sensitivity and specificity (a: average results; b: standard deviation results)

Fig. 4 Average AUC (ROC) with standard deviation of the GEP models for all datasets (a: average results; b: standard deviation results)

4.2 Comparisons of GEP with representative classifiers

The comparison results of classification performance among the GEP, SVM, RBFNN and MLP classifiers in terms of accuracy, sensitivity, specificity and AUC are shown in Figs. 5–8, respectively. It can be seen from the results that the highest accuracy among all models was 98.09%, achieved by the GEP‐5 model with a standard deviation of 2.32. The highest accuracy with ten attributes was 97.74%, achieved by MLP with a standard deviation of 2.07; the highest accuracy with 20 attributes was 97.37%, again achieved by MLP with a standard deviation of 1.82; and the highest accuracy with all attributes was 91.72%, also achieved by MLP with a standard deviation of 6.16. Finally, the highest accuracy with the AS attributes was 96.29%, achieved by both SVM and MLP with a standard deviation of 1.45.

Fig. 5 Average accuracy with standard deviation of the classifiers for all datasets (a: average results; b: standard deviation results)

Fig. 6 Average sensitivity with standard deviation of the classifiers for all datasets (a: average results; b: standard deviation results)

Fig. 7 Average specificity with standard deviation of the classifiers for all datasets (a: average results; b: standard deviation results)

Fig. 8 Average AUC with standard deviation of the classifiers for all datasets (a: average results; b: standard deviation results)

The detailed classification results on all datasets are shown in the appendix (Tables 3–5).

Table 3.

GEO microarray dataset performance comparison between GEP, SVM, MLP and RBF

Training dataset Testing dataset
Sub dataset Classifier Accuracy TP TN FP FN Sensitivity Specificity AUC ROC Sub dataset Classifier Accuracy TP TN FP FN Sensitivity Specificity AUC ROC
5 Attributes GEP 97.83 30.43 67.39 0 2.17 93.33 100 0.9 5 Attributes GEP 94.83 25.86 68.97 0 5.17 83.33 100 0.887
SVM 94.83 25.86 68.97 0 5.17 83.33 100 0.906 SVM 93.10 25.86 67.24 1.72 5.17 83.33 97.50 0.884
MLP 94.83 25.86 68.97 0 5.17 83.33 100 0.934 MLP 93.10 25.86 67.24 1.72 5.17 83.33 97.50 0.877
RBF 94.83 25.86 68.97 0 5.17 83.33 100 0.975 RBF 93.10 24.14 68.97 0 6.90 77.78 100 0.925
10 Attributes GEP 97.44 28.21 69.23 0 2.56 91.67 100 0.9 10 Attributes GEP 89.47 26.32 63.16 5.26 5.26 83.33 92.31 0.9
SVM 94.83 25.86 68.97 0 5.17 83.33 100 0.904 SVM 91.67 16.67 75 0 8.33 66.67 100 0.9
MLP 94.83 25.86 68.97 0 5.17 83.33 100 0.909 MLP 94.83 25.86 68.97 0 5.17 83.33 100 0.887
RBF 96.55 27.59 68.97 0 3.45 88.89 100 0.983 RBF 89.66 24.14 65.52 3.45 6.90 77.78 95.00 0.850
20 Attributes GEP 97.44 30.77 66.67 0 2.56 92.31 100 0.9 20 Attributes GEP 84.21 21.05 63.16 10.53 5.26 80 85.71 0.9
SVM 96.55 27.59 68.97 0 3.45 88.89 100 1 SVM 89.66 22.41 67.24 1.72 8.62 72.22 97.50 0.931
MLP 96.55 27.59 68.97 0 3.45 88.89 100 0.998 MLP 94.83 25.86 68.97 0 5.17 83.33 100 0.874
RBF 96.55 27.59 68.97 0 3.45 88.89 100 0.997 RBF 89.66 22.41 67.24 1.72 8.62 72.22 97.50 0.881
AS GEP 97.83 23.91 73.91 0 2.17 91.67 100 0.9 AS GEP 96.55 27.59 68.97 0 3.45 88.89 100 0.945
SVM 98.9 78.2 20.9 0 0.9 98 100 0.98 SVM 96.55 27.59 68.97 0 3.45 88.89 100 0.947
MLP 95.65 60 20 1.8 1.8 97.05 91.6 0.9 MLP 96.55 27.59 68.97 0 3.45 88.89 100 0.945
RBF 96.73 60.9 20 0.9 1.8 97.1 95.65 0.9 RBF 89.66 24.14 65.52 3.45 6.90 77.78 95 0.861
All Dataset GEP 100 31.03 68.97 0 0 100 100 1 All Dataset GEP 83.33 33.33 50 0 16.67 66.67 100 0.8
SVM 98.28 29.31 68.97 0 1.72 94.44 100 1 SVM 83.33 50 33.33 0 16.67 75 100 0.7
MLP 100 31.03 68.97 0 0 100 100 1 MLP 83.33 50 33.33 0 16.67 75 100 0.7
RBF 98.28 29.31 68.97 0 1.72 94.44 100 0.988 RBF 79.16 50 29.17 0 20.83 70.58 100 0.6

Table 4.

Michigan microarray dataset performance comparison between GEP, SVM, MLP and RBF

Training dataset Testing dataset
Sub dataset Classifier Accuracy TP TN FP FN Sensitivity Specificity AUC ROC Sub dataset Classifier Accuracy TP TN FP FN Sensitivity Specificity AUC ROC
5 Attributes GEP 100 90.63 9.38 0 0 100 100 1 5 Attributes GEP 100 87.5 12.5 0 0 100 100 1
SVM 100 89.58 10.42 0 0 100 100 1 SVM 100 89.58 10.42 0 0 100 100 1
MLP 100 89.58 10.42 0 0 100 100 1 MLP 100 89.58 10.42 0 0 100 100 1
RBF 100 89.58 10.42 0 0 100 100 1 RBF 97.92 89.58 8.33 2.08 0 100 80 0.997
10 Attributes GEP 98.44 87.5 10.94 0 1.56 98.25 100 0.9 10 Attributes GEP 93.75 90.63 3.13 0 6.25 93.55 100 0.9
SVM 100 89.58 10.42 0 0 100 100 1 SVM 100 89.58 10.42 0 0 100 100 1
MLP 100 89.58 10.42 0 0 100 100 1 MLP 98.96 89.58 9.38 1.04 0 100 90 1
RBF 100 89.58 10.42 0 0 100 100 1 RBF 97.92 89.58 8.33 2.08 0 100 80 1
20 Attributes GEP 98.44 84.38 14.06 0 1.56 98.18 100 0.9 20 Attributes GEP 91.67 25 66.67 0 8.33 75 100 0.9
SVM 100 89.58 10.42 0 0 100 100 1 SVM 98.96 88.54 10.42 0 1.40 98.84 100 0.991
MLP 100 89.58 10.42 0 0 100 100 1 MLP 98.96 88.54 10.42 0 1.04 98.84 100 0.997
RBF 100 89.58 10.42 0 0 100 100 1 RBF 96.88 88.54 8.33 2.08 1.04 98.84 80 0.931
AS GEP 100 89.61 10.39 0 0 100 100 1 AS GEP 100 89.47 10.53 0 0 100 100 1
SVM 100 89.6 10.39 0 0 100 100 1 SVM 97.92 88.54 9.38 1.04 1.04 98.84 90 0.986
MLP 100 89.6 10.39 0 0 100 100 1 MLP 97.92 88.54 9.38 1.04 1.04 98.84 90 0.986
RBF 100 89.6 10.39 0 0 100 100 1 RBF 97.35 89.47 7.89 0 2.63 91.89 100 0.9
All Dataset GEP 96.88 88.54 8.33 2.08 1.04 98.84 80 0.995 All Dataset GEP 95.83 88.54 7.29 3.13 1.04 98.84 70 0.989
SVM 96.88 88.54 8.33 2.08 1.04 98.84 80 0.995 SVM 95.83 88.54 7.29 3.13 1.04 98.84 70 0.989
MLP 100 89.58 10.42 0 0 100 100 1 MLP 97.92 88.54 9.38 1.04 1.04 98.84 90 0.986
RBF 100 89.58 10.42 0 0 100 100 1 RBF 95.83 87.50 8.33 2.08 2.08 97.67 80 0.966

Table 5.

Harvard microarray dataset performance comparison between GEP, SVM, MLP and RBF

Training dataset Testing dataset
Sub dataset Classifier Accuracy TP TN FP FN Sensitivity Specificity AUC ROC Sub dataset Classifier Accuracy TP TN FP FN Sensitivity Specificity AUC ROC
5 Attributes GEP 100 17.13 82.87 0 0 100 100 1 5 Attributes GEP 99.45 16.57 82.87 0 0.55 96.77 100 1
SVM 98.90 16.02 82.87 0 1.10 93.55 100 1 SVM 98.90 16.02 82.87 0 1.10 93.55 100 1
MLP 100 17.13 82.87 0 0 100 100 1 MLP 99.45 16.57 82.87 0 0.55 96.77 100 1
RBF 100 17.13 82.87 0 0 100 100 1 RBF 98.34 16.02 82.32 0.55 1.10 93.55 99.33 0.998
10 Attributes GEP 99.17 15.70 83.47 0 0.83 95 100 0.9 10 Attributes GEP 91.67 13.33 78.33 0 8.33 61.54 100 0.9
SVM 99.45 16.57 82.87 0 0.55 96.77 100 0.999 SVM 99.45 16.57 82.87 0 0.55 96.77 100 0.998
MLP 100 17.13 82.87 0 0 100 100 1 MLP 99.45 16.57 82.87 0 0.55 96.77 100 0.998
RBF 99.45 16.57 82.87 0 0.55 96.77 100 1 RBF 97.24 14.92 82.32 0.55 2.21 87.10 99.33 0.998
20 Attributes GEP 99.17 14.05 85.12 0 0.83 94.44 100 0.9 20 Attributes GEP 91.67 13.33 78.33 0 8.33 61.54 100 0.9
SVM 99.45 16.57 82.87 0 0.55 96.77 100 0.999 SVM 99.45 16.57 82.87 0 0.55 96.77 100 0.998
MLP 98.34 15.47 82.87 0 1.66 90.32 100 0.999 MLP 98.34 15.47 82.87 0 1.66 90.32 100 0.991
RBF 98.90 16.02 82.87 0 1.10 93.55 100 0.979 RBF 97.24 14.92 82.32 0.55 2.21 87.10 99.33 0.920
AS GEP 99.31 11.72 87.59 0 0.69 94.44 100 0.9 AS GEP 91.67 27.78 63.89 0 8.33 76.92 100 0.9
SVM 100 12.41 87.59 0 0 100 100 1 SVM 94.4 30.56 63.89 5.56 0 100 92 0.9
MLP 97.9 10.34 87.59 2.07 0 100 97.69 0.9 MLP 94.4 30.56 63.89 5.56 0 100 92 0.9
RBF 98.25 10.69 87.59 1.72 0 100 98 0.9 RBF 94.4 30.56 63.89 5.56 0 100 92 0.9
All Dataset GEP 100 49.66 50.34 0 0 100 100 1 All Dataset GEP 83.33 33.33 50 0 16.67 66.67 100 0.7
SVM 99.45 16.57 82.87 0 0.55 96.77 100 0.998 SVM 95.03 14.36 80.66 2.21 2.76 83.87 97.33 0.980
MLP 99.45 16.57 82.87 0 0.55 96.77 100 1 MLP 93.92 14.36 79.56 3.31 2.76 83.87 96 0.960
RBF 100 17.13 82.87 0 0 100 100 1 RBF 94.48 13.81 80.66 2.21 3.31 80.65 97.33 0.981

5 Discussion

The experimental results showed that the performance of the GEP model with fewer feature genes (i.e. 5 attributes) was better than that of the models with more attributes (i.e. 10, 20 and all attributes). The lowest average accuracy of the GEP classifiers was 87.49% (standard deviation 5.89), achieved by the classifier with all attributes, while the highest average accuracy was 98.09% (standard deviation 2.32), achieved with 5 attributes. This is also the highest accuracy across all the experimental models (GEP, SVM, MLP and RBFNN); the best accuracy the other three models (SVM, MLP and RBFNN) achieved was 97.74%, which is lower than that of the GEP model.

The results also indicated that the models using the automatically selected feature genes (obtained by the RF method) could provide stable and convergent results. However, their classification performance was not the best across all the experimental models. This might be caused by a limitation of this feature selection method, which may fail to select some of the informative genes needed for more effective classification.

It was observed from the experimental results that the GEP classifier was more efficient, as it achieved the highest accuracy using only five feature genes in the classification model. Of course, these five feature genes have to be properly selected and informative, without missing microarray values.

6 Conclusions

In this paper, an innovative GEP‐based classifier is proposed to classify/predict lung cancer from microarray data. Compared with other representative machine learning‐based classifiers, the proposed classifier achieved higher accuracies in classifying/predicting lung cancer on the commonly used datasets (GEO, Michigan and Harvard). The evaluation results showed that the GEP approach improved lung cancer prediction.

The efficiency and effectiveness of the proposed GEP approach heavily depend on the proper selection of feature genes (attributes) that are informative without missing values for classification/prediction. Our future work is to propose innovative methods to more properly select features (attributes) from integrated lung cancer datasets and multiple sources of lung cancer data to improve the GEP‐based algorithm for lung cancer prediction.

See Tables 3–6.

Table 6.

Standard deviations for all microarray datasets: performance comparison between GEP, SVM, MLP and RBF

Accuracy Sub dataset GEP SVM MLP RBF TP Sub dataset GEP SVM MLP RBF
5 2.318424 3.026916 3.131116 2.377356 5 31.47637 32.60562 32.44997 32.92989
10 1.747532 3.803796 2.072074 3.74383 10 33.79646 34.3937 32.44997 33.23572
20 3.516678 4.504001 1.818467 3.491488 20 4.846417 32.63768 32.27651 33.08101
AS 3.417371 1.448747 1.448747 3.167652 AS 29.12583 28.05828 28.05828 29.40071
All attributes 5.892557 5.713337 6.155541 7.560231 All attributes 26.02624 30.29157 30.29157 30.0854
Sensitivity Sub dataset GEP SVM MLP RBF TN Sub dataset GEP SVM MLP RBF
5 7.218459 6.863266 7.218459 9.333475 5 30.43026 31.13021 31.13021 32.19723
10 13.34954 15.00863 7.218459 9.110438 10 32.47011 32.45769 31.87642 31.67098
20 7.795606 12.09045 6.342192 10.89274 20 6.484208 31.13021 31.39412 31.92407
AS 9.424551 4.986428 4.986428 9.180853 AS 26.433 26.97347 26.97347 26.79111
All attributes 15.16508 9.838267 9.838267 11.18011 All attributes 20.13369 30.37061 29.12816 30.39948
Specificity Sub dataset GEP SVM MLP RBF FP Sub dataset GEP SVM MLP RBF
5 0 1.178511 1.178511 9.274204 5 0 0.810816 0.810816 0.880013
10 3.625101 0 4.714045 8.282497 10 2.479588 0 0.490261 1.184521
20 6.736371 1.178511 0 8.713003 20 4.96389 0.810816 0 0.653146
AS 0 4.320494 4.320494 3.299832 AS 0 2.413517 2.413517 2.291729
All attributes 14.14214 13.5567 4.109609 8.866026 All attributes 1.475496 1.313494 1.382052 1.012555
AUC ROC Sub dataset GEP SVM MLP RBF FN Sub dataset GEP SVM MLP RBF
5 0.053269 0.054683 0.057983 0.034179 5 2.318424 2.223706 2.318424 3.026916
10 0 0.046676 0.052804 0.070244 10 1.279384 3.803796 2.318424 2.876924
20 0 0.03007 0.056622 0.021453 20 1.447212 3.620556 1.818467 3.33189
AS 0.040893 0.03516 0.035122 0.018385 AS 3.417371 1.444999 1.444999 2.843312
All attributes 0.119834 0.134165 0.12913 0.176176 All attributes 7.368053 6.997963 6.997963 8.563656

7 References

  • 1. Engchuan W., and Chan J.H.: ‘Pathway activity transformation for multi‐class classification of lung cancer datasets’, Neurocomputing, 2015, 165, pp. 81–89 (doi: 10.1016/j.neucom.2014.08.096) [DOI] [Google Scholar]
  • 2. Society A.C.: ‘Cancer facts & figures 2011’ (American Cancer Society Inc., 2011), vol. 1 [Google Scholar]
  • 3. Laureen W., and Goh B.C.: ‘An overview of cancer trends in Asia’ (Innovationmagazine.com., 2012) [Google Scholar]
  • 4. Mehta A. Dobersch S., and Romero‐Olmedo A.J. et al.: ‘Epigenetics in lung cancer diagnosis and therapy’, Cancer Metastasis Rev., 2015, 34, pp. 229–241 (doi: 10.1007/s10555-015-9563-3) [DOI] [PubMed] [Google Scholar]
  • 5. Spitz M.R. Wei Q., and Dong Q. et al.: ‘Genetic susceptibility to lung cancer the role of DNA damage and repair’, Cancer Epidemiol. Biomarkers Prev., 2003, 12, pp. 689–698 [PubMed] [Google Scholar]
  • 6. Melissa C.S.: ‘Lung cancer’ (Medicine.net, 2011) [Google Scholar]
  • 7. Mountain C.F.: 'Revisions in the international system for staging lung cancer', Chest, 1997, 111, pp. 1710–1717 (doi: 10.1378/chest.111.6.1710) [DOI] [PubMed] [Google Scholar]
  • 8. Mountain C.F., and Dresler C.M.: 'Regional lymph node classification for lung cancer staging', Chest, 1997, 111, pp. 1718–1723 (doi: 10.1378/chest.111.6.1718) [DOI] [PubMed] [Google Scholar]
  • 9. Tsou J.A. Hagen J.A., and Carpenter C.L. et al.: ‘DNA methylation analysis: a powerful new tool for lung cancer diagnosis’, Oncogene, 2002, 21, pp. 5450–5461 (doi: 10.1038/sj.onc.1205605) [DOI] [PubMed] [Google Scholar]
  • 10. Diaz J.M. Pinon R.C., and Solano G.: ‘Lung cancer classification using genetic algorithm to optimize prediction models’. Fifth Int. Conf. on Information, Intelligence, Systems and Applications, IISA 2014, Chania, 2014, pp. 1–6
  • 11. Yu Z. Chen X.Z., and Cui L.H. et al.: ‘Prediction of lung cancer based on serum biomarkers by gene expression programming methods’, Asian Pac. J. Cancer Prev., 2014, 15, pp. 9367–9373 (doi: 10.7314/APJCP.2014.15.21.9367) [DOI] [PubMed] [Google Scholar]
  • 12. Rusch V.W. Asamura H., and Watanabe H. et al.: ‘The IASLC lung cancer staging project: a proposal for a new international lymph node map in the forthcoming seventh edition of the TNM classification for lung cancer’, J. Thorac. Oncol., 2009, 4, pp. 568–577 (doi: 10.1097/JTO.0b013e3181a0d82e) [DOI] [PubMed] [Google Scholar]
  • 13. Kourou K. Exarchos T.P., and Exarchos K.P. et al.: ‘Machine learning applications in cancer prognosis and prediction’, Comput. Struct. Biotechnol. J., 2015, 13, pp. 8–17 (doi: 10.1016/j.csbj.2014.11.005) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Joseph A.C., and David S.W.: ‘Applications of machine learning in cancer prediction and prognosis’, Cancer Inf., 2006, 2, pp. 59–77 [PMC free article] [PubMed] [Google Scholar]
  • 15. Guan P. Huang D., and He M. et al.: ‘Lung cancer gene expression database analysis incorporating prior knowledge with support vector machine‐based classification method’, J. Exp. Clin. Cancer Res., 2009, 28, pp. 1–7 (doi: 10.1186/1756-9966-28-1) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Ayer T. Alagoz O., and Chhatwal J. et al.: ‘Breast cancer risk estimation with artificial neural networks revisited: discrimination and calibration’. PMC, 2010, vol. 116, pp. 3310–3321 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Yu Z. Lu H., and Si H. et al.: ‘A highly efficient gene expression programming (GEP) model for auxiliary diagnosis of small cell lung cancer’, PLoS ONE, 2015, 10, pp. 1–19 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Ferreira C.: ‘Gene expression programming in problem solving’ 2002.
  • 19. Ferreira C., and Gepsoft U.: ‘What is gene expression programming’, (Candida Ferreira, 2008) [Google Scholar]
  • 20. Zhong J. Wang J., and Peng W. et al.: ‘Prediction of essential proteins based on gene expression programming’, BMC Genomics, 2013, 14, p. S7 (doi: 10.1186/1471-2164-14-S8-S7) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Kusy M. Obrzut B., and Kluska J.: ‘Application of gene expression programming and neural networks to predict adverse events of radical hysterectomy in cervical cancer patients’, Med. Biol. Eng. Comput., 2013, 51, pp. 1357–1365 (doi: 10.1007/s11517-013-1108-8) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Ferreira C.: ‘Gepsoft predictive modeling software’ (Candida Ferreira, 2001), vol. 2015 [Google Scholar]
  • 23. Koza J.R.: ‘Genetic programming: on the programming of computers by means of natural selection’ (MIT Press, Cambridge, MA, 1992) [Google Scholar]
  • 24. Ryan J.: ‘Genetic algorithms in search, optimization and machine learning (Book Review)’, ORSA J. Comput., 1991, 3, p. 176 (doi: 10.1287/ijoc.3.2.176) [DOI] [Google Scholar]
  • 25. Han X.R. Li X.C., and Si H.Z. et al.: ‘QSAR study of the anti‐cancer activity of 38 compounds in different cancer cell lines based on gene expression programming’, Adv. Mater. Res., 2014, pp. 1291–1294 [Google Scholar]
  • 26. Ferreira C.: ‘Gene expression programming: mathematical modeling by an artificial intelligence’ (Springer‐Verlag, Berlin, 2006), vol. 850 [Google Scholar]
  • 27. Natarajan A., and Ravi T.: ‘A survey on gene feature selection using microarray data for cancer classification’. Int. J. Comput. Sci. Commun., 2014, 5, pp. 126–129 [Google Scholar]
  • 28. Yildirim P.: ‘Filter based feature selection methods for prediction of risks in hepatitis disease’, Int. J. Mach. Learn. Comput., 2015, 5, p. 258 (doi: 10.7763/IJMLC.2015.V5.517) [DOI] [Google Scholar]
  • 29. Wang X., and Gotoh O.: ‘A robust gene selection method for microarray‐based cancer classification’, Cancer Inf., 2010, 9, p. 15 (doi: 10.4137/CIN.S3794) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Mark H. Eibe F., and Geoffrey H. et al.: ‘The WEKA data mining software: An update’, SIGKDD Explorations, 2009, 11, (1), pp. 10–18 (doi: 10.1145/1656274.1656278) [DOI] [Google Scholar]
  • 31. Breiman L.: ‘Random forests’, Mach. Learn., 2001, 45, (1), pp. 5–32 (doi: 10.1023/A:1010933404324) [DOI] [Google Scholar]
  • 32. Touw W.G. Bayjanov J.R., and Overmars L. et al.: ‘Data mining in the life sciences with random forest: a walk in the park or lost in the jungle?’, Brief. Bioinf., 2012, 14, p. bbs034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Hosseinzadeh F. Kayvanjoo A.H., and Ebrahimi M.: ‘Prediction of lung tumor types based on protein attributes by machine learning algorithms’, SpringerPlus, 2013, 2, pp. 2–14 (doi: 10.1186/2193-1801-2-238) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Sherrod P.H.: DTREG predictive modelling software. Available at https://www.dtreg.com/, accessed 7 February 2015
  • 35. George G.V.S., and Raj V.C.: ‘Review on feature selection techniques and the impact of SVM for cancer classification using gene expression profile’, Int. J. Comput. Sci. Eng. Surv., 2011, 2, p. 16 (doi: 10.5121/ijcses.2011.2302) [DOI] [Google Scholar]
  • 36. Vanitha C.D.A. Devaraj D., and Venkatesulu M.: ‘Gene expression data classification using support vector machine and mutual information‐based gene selection’, Procedia Comput. Sci., 2015, 47, pp. 13–21 (doi: 10.1016/j.procs.2015.03.178) [DOI] [Google Scholar]
  • 37. McClelland J.L. Rumelhart D.E., and Group P.R.: ‘Parallel distributed processing’, Explor. Microstruct. Cogn., 1986, 2, pp. 1–38 (doi: 10.1016/0749-6036(86)90143-6) [DOI] [Google Scholar]
  • 38. Kubat M.: ‘Neural networks: a comprehensive foundation by Simon Haykin, Macmillan, 1994, ISBN 0–02‐352781‐7’ (Cambridge University Press, 1999) [Google Scholar]
  • 39. Adetiba E., and Olugbara O.O.: ‘Improved classification of lung cancer using radial basis function neural network with affine transforms of voss representation’, PloS One, 2015, 10, p. e0143542 (doi: 10.1371/journal.pone.0143542) [DOI] [PMC free article] [PubMed] [Google Scholar]
