Skip to main content
Computational Intelligence and Neuroscience logoLink to Computational Intelligence and Neuroscience
. 2022 Jul 6;2022:3895590. doi: 10.1155/2022/3895590

Survival Risk Prediction of Esophageal Squamous Cell Carcinoma Based on BES-LSSVM

Yanfeng Wang 1, Wenhao Zhang 1, Junwei Sun 1,, Lidong Wang 2, Xin Song 2, Xueke Zhao 2
PMCID: PMC9279059  PMID: 35845893

Abstract

Esophageal squamous cell carcinoma (ESCC) is one of the highest incidence and mortality cancers in the world. An effective survival prediction model can improve the quality of patients' survival. In this study, ten indicators related to the survival of patients with ESCC are founded using genetic algorithm feature selection. The prognostic index (PI) for ESCC is established using the binary logistic regression. PI is divided into four stages, and each stage can reasonably reflect the survival status of different patients. By plotting the ROC curve, the critical threshold of patients' age could be found, and patients are divided into the high-age groups and the low-age groups. PI and ten survival-related indicators are used as independent variables, based on the bald eagle search (BES) and least-squares support vector machine (LSSVM), and a survival prediction model for patients with ESCC is established. The results show that five-year survival rates of patients are well predicted by the bald eagle search-least-squares support vector machine (BES-LSSVM). BES-LSSVM has higher prediction accuracy than the existing particle swarm optimization-least-squares support vector machine (PSO-LSSVM), grasshopper optimization algorithm-least-squares support vector machine (GOA-LSSVM), differential evolution-least-squares support vector machine (DE-LSSVM), sparrow search algorithm-least-squares support vector machine (SSA-LSSVM), bald eagle search-back propagation neural network (BES-BPNN), and bald eagle search-extreme learning machine (BES-ELM).

1. Introduction

Cancer is one of the leading causes of human death in both developed and developing countries [1]. Esophageal cancer is the sixth leading cancer in the world, including esophageal squamous carcinoma and esophageal adenocarcinoma [2]. More than 90% of esophageal cancers are esophageal squamous cell carcinoma, and most of them are diagnosed in advanced stages [3]. The pathology of esophageal squamous cell carcinoma is complicated, and effective diagnosis and treatment strategies are lacking [4, 5]. In recent years, the incidence of esophageal squamous cell carcinoma has been on the rise, and the mortality rate remains high [6].

At present, with the continuous deepening of human research, the treatment methods and treatment concepts of ESCC have been continuously improved [79]. However, there is still a lack of marker models and prognostic index that can accurately and effectively reflect the prognosis of ESCC patients [10]. Generally, TNM staging is considered to be the best prognostic indicator for ESCC. However, patients with the same TNM stage often have different prognoses [11]. The TNM staging alone cannot accurately determine the patient's risk of death [12]. Therefore, it is important to establish a reasonable prognostic index.

In recent years, with the continuous progress of machine learning technology, more and more intelligent algorithms are proposed and applied in multiple fields [1319]. A hybrid model of genetic algorithm (GA) and least-squares support vector machine (LSSVM) is used by Ahmadi and Chen [20] to predict the relevant experimental permeability reduction ratio due to scale deposition during water injection, and the results confirm the validity of the GA-LSSVM model. LSSVM is used by Ahmadi and Pournik [21] to build a predictive model for determining the chemical flooding efficiency of the oil reservoir, and the results show that the model has good stability and reliability. In [22], a method based on local mean decomposition and improved FA-optimized combined kernel least-squares support vector machine is proposed to predict short-term wind speed. The results show that the proposed LMD-FA-LSSVM model has better prediction performance.

In the medical field, the doctors' diagnosis is effectively aided by the application of many new algorithms. A combined classification and regression approach is proposed by Zhu et al. [23] for early diagnosis of COVID-19 and prediction of time to conversion in patients with severe symptoms. The results show that the accuracy of the proposed method in predicting severe cases reached 76.97% with a correlation coefficient of 0.524. In [24], a method combining extreme learning machine and gain ratio feature selection method is proposed and tested on the Wisconsin Breast Cancer Diagnostic (WBCD) dataset. The experimental results show that the accuracy of the proposed method reaches 0.9868. The genetic algorithm is used by Majid et al. [25] to select the best features and then use an ensemble classifier to predict gastric infections. The results show that the proposed method performs better than existing methods. In addition, random forest [26], extreme learning machines [27], BP neural networks [28, 29], and Elman neural networks [30] have achieved satisfactory results in the prognosis and diagnosis of certain cancers.

Compared with the above studies [24, 25, 27, 28] that mostly use genetic information and image information to predict patient mortality, the proposed work mainly has the following advantages. First, the patients' blood indicators and TNM staging indicators are used to predict the patient's survival status. Second, an effective prognostic index is established, which significantly improved the performance of the prediction model. Third, these machine learning algorithms rarely distinguish between patients of different ages. Due to differences in patient age, it is difficult for a single model to accurately predict the survival risk of all patients. Therefore, the goal of this article was to find a new set of indicators related to the survival of ESCC patients based on the patient's blood indicators and TNM staging information, establish reasonable prognostic indicators, and combine new machine learning techniques to predict the survival rate in patients of different ages.

In this study, seventeen blood indicators, age, and TNM staging information of 360 patients with ESCC are studied. Ten indicators related to patient survival are found through the feature selection method of genetic algorithm. The combination of these ten indicators has a significant correlation with the patient's survival, which is verified by the Cox regression method in the SPSS software. Using the binary logistic regression method, the prognostic index (PI) of patients with ESCC is constructed. The prognostic index (PI) is divided into four stages, and the different survival conditions of patients can be reasonably reflected in each stage. Comparing the PI staging system with the traditional TNM staging system, the results show that the PI staging system has a better AUC value. The ROC curve method is used to determine the critical threshold of patient age, and the patients are divided into the high-age groups and the low-age groups. Then, based on the Kaplan–Meier survival analysis, it is concluded that the low-age group has a better survival rate than the high-age group, which effectively reflects the survival status of different patients. Finally, the bald eagle search algorithm-least-squares support vector machine (BES-LSSVM) survival prediction model is further proposed in this study. The bald eagle search algorithm is used to optimize the parameters of the least-squares support vector machine, which improves the prediction accuracy of the model. The prognostic index (PI) and the above ten related indicators are used as inputs, and the five-year survival rate of the patient is used as output. The prediction accuracy rate of BES-LSSVM is better than the existing PSO-LSSVM, GOA-LSSVM, DE-LSSVM, SSA-LSSVM, BES-BP, and BES-ELM. Therefore, the method for survival prediction of patients with ESCC proposed in this study can accurately predict the survival level of patients.

The purpose of this article was to propose prognostic indicators PI and survival prediction models based on blood indicators and TNM staging information of patients with ESCC. Based on genetic algorithm feature selection, binary logistic regression, ROC curve, Kaplan–Meier survival analysis, Cox regression analysis, and BES-LSSVM, a method for predicting the survival risk of patients with ESCC is proposed. The main contributions of this article can be summarized as follows:

  1. A combination of ten indicators is found based on genetic algorithm feature selection, which is verified to be significantly associated with survival in patients with ESCC.

  2. The prognostic index of patients with ESCC is constructed by the binary logistic regression method, which can reasonably reflect the survival of patients at different stages.

  3. The survival risk levels of patients with ESCC of different ages are gotten based on the ROC method, which can reasonably reflect the survival status of patients of different ages.

  4. The BES-LSSVM is proposed and accurately predicts the five-year survival rate of patients with ESCC.

This work is presented as follows. In Section 2, the original data are analyzed, a combination of multiple indicators that is significantly related to patient survival is found, and prognostic index is constructed. The survival risk of patients of different ages is obtained. In Section 3, the bald eagle search-least-squares support vector machine is proposed, and the five-year survival rate of patients with ESCC is effectively predicted. In Section 4, the conclusions of this article are presented.

2. Feature Selection and Construction of Prognostic Indicators

2.1. Data Introduction

The clinical data of 360 patients with ESCC used in this article are from patients who were treated in the First Affiliated Hospital of Zhengzhou University from January 2007 to December 2018. The clinical information includes seventeen blood indicators, age, and TNM staging information. The seventeen blood indicators are as follows: white blood cell count (WBC), lymphocyte count (LYMPH), globulin (GLOB), prothrombin time (PT), albumin (ALB), red blood cell count (RBC), thrombin time (TT), basophil count (BASO), eosinophil count (EO), international normalized ratio (INR), neutrophil count (NEUT), total protein (TP), monocyte count (MONO), fibrinogen (FIB), hemoglobin concentration (HGB), platelet count (PLT), and activated partial thromboplastin time (APTT). Among all patients, 177 patients survived more than five years and 183 patients survived less than five years, and the data are evenly distributed. The end points are the time of death after treatment and the end of follow-up. The population proportion information of the dataset is shown in Table 1. Information on seventeen blood indicators is shown in Table 2.

Table 1.

Population proportion information of the dataset.

Project Category Number of population Percentage of population (%)
Genders Male 222 61.7
Female 138 38.3

Ages ≤61.5 230 63.9
>61.5 130 36.1

T stages T1 54 15
T2 99 27.5
T3 205 56.9
T4 2 0.1

N stages N0 191 53.1
N1 103 28.6
N2 48 13.3
N3 18 5

TNM stages I 47 13.1
II 156 43.3
III 137 38.1
IV 20 5.6

Table 2.

Basic information about seventeen blood indicators.

Variable Mean Median (range) Variance Standard deviation
WBC 6.633 6.2 (2.5–13.6) 4.427 2.104
LYMPH 1.869 1.9 (0–4) 0.401 0.633
GLOB 29.306 29 (17–45) 27.160 5.212
PT 10.327 10.3 (7–16.6) 2.690 1.640
ALB 42.011 42 (24–56) 27.259 5.212
RBC 4.430 4.45 (2.6–6.04) 0.234 0.483
TT 15.304 15.5 (1.3–21.3) 3.583 1.893
BASO 0.042 0 (0–1) 0.007 0.082
EO 0.137 0.1 (0–3) 0.044 0.209
INR 0.795 0.79 (0.45–1.64) 0.033 0.181
NEUT 4.033 3.7 (0.3–17) 3.491 1.868
TP 71.428 71 (50–92) 53.064 7.285
MONO 0.405 0.4 (0–1.3) 0.069 0.263
FIB 379.431 367.85 (189.5–774.43) 924.038 30.398
HGB 138.311 139 (63–189) 218.705 14.789
PLT 239.781 232.5 (51–576) 52.606 7.253
APTT 36.112 35.25 (15.4–78.5) 60.110 7.753

The unit of WBC, LYMPH, GLOB, ALB, RBC, BASO, EO, NEUT, TP, HGB, and PLT is g/L. The unit of PT, TT, and APTT is second(s). The unit of FIB is mg/L.

2.2. Feature Selection Based on Genetic Algorithm

A genetic algorithm (GA) is a global optimization adaptive probability search algorithm [31]. GA has the characteristics of group search, which makes it easy to jump out of the local optimum [32]. Therefore, it is often selected as the search algorithm with better feature selection. In many studies, GA is used as a wrapper feature selection technique [33]. In this study, 17 blood indicators and TNM staging information of patients with ESCC are used as independent variable, and the five-year survival rate of patients is used as dependent variable. The least-squares support vector machine is used as the classifier of genetic algorithm feature selection to evaluate the subset of features related to the survival rate of patients. The main process of multi-index feature extraction based on genetic algorithm feature selection (GA-FS) is as follows.

  • Step 1: the generation of the initial population

  • A population is randomly generated as the first-generation solution of the problem. 17 blood indicators and TNM staging information of 360 esophageal cancer patients are selected as inputs and normalized to [−1,1] by the mapminmax function. The mapminmax function is calculated by the following equation:
    y=ymaxyminxxminxmaxxmin+ymin, (1)
     where ymax is 1 and ymin is −1.
  • Step 2: coding individuals in the population

  • The chromosome of each individual in the population is coded using a binary coding method, and each binary bit corresponds to each feature in the feature set. The initial characteristics include seventeen blood indicators, T staging, N staging, and TNM staging. In the value of each bit of the binary code, “0” indicates that the feature is not selected, and “1” indicates that the feature is selected. The dataset is divided into training set and test set.

  • Step 3: determine the fitness function

  • The value of the fitness function indicates the pros and cons of the individual or solution. The purpose of genetic algorithm (GA) used for feature selection is to improve the classification accuracy of the least-squares support vector machine (LSSVM) while reducing the number of selected features as much as possible. Therefore, the fitness function is constructed as Fitness=α · R+β · M/N. R is the classification accuracy of the LSSVM classifier. M is the number of selected features. N is the number of all features. α is a scaling parameter, which reflects the proportion of classification accuracy in the fitness function. β is the parameter importance, which reflects the weight of the selected number of features in the fitness function, and α+β=1.

  • Step 4: sort and select

  • The fitness values are calculated and individuals in the population are selected using a roulette wheel algorithm as a selection operator. The greater the fitness (i.e., the higher the classification accuracy and the lower the number of features), the greater the probability that the individual will be selected for the next generation.

  • Step 5: crossover

  • In this study, the crossover operation uses a two-point crossover operator, and the principle of the crossover operator is shown in Figure 1. Two crossover points are randomly set in the individual code string, and then, part of the gene exchange is performed. The crossover probability is generally 0.4 to 0.99, and the crossover probability selected in this study is 0.7.

  • Step 6: mutation

  • Under the condition of meeting the set mutation probability, the individuals in the population are sequentially subjected to random bit mutation. In the genetic algorithm (GA), the value of the mutation probability is generally 0.001 to 0.1, and the mutation probability used in this study is 0.05.

  • Step 7: the fitness value is calculated

  • The selected features are input into the LSSVM, and the fitness value is obtained by the ten-fold cross-validation method. If the current solution is better than the optimal solution, the optimal solution is updated.

  • Step 8: Step 3 is cycled to Step 7.

Figure 1.

Figure 1

Principle of crossover operator.

When the maximum number of iterations is reached, the loop ends. To clearly express the GA-FS process, the framework of GA-FS is shown in Algorithm 1.

Through the feature selection results of genetic algorithm, the index combinations that are more relevant to patient survival can be obtained: T staging, N staging, TNM staging, WBC, EO, RBC, PLT, TP, PT, and INR. At this time, the ten-fold cross-validation classification accuracy of LSSVM reaches the highest, and the value is 83.077 %.

2.3. The Correlation of Indicators Is Verified by Cox Regression Analysis

The Cox regression model is a semiparametric regression model that can analyze the impact of multiple factors on survival [34]. Therefore, it is widely used in the medical field. The “SPSS 22.0” statistical software is used to make the Cox model. The survival time and survival outcome of patients with ESCC are used as dependent variables. The above ten indicators are independent variables. The survival function at the mean of the covariate is shown in Figure 2. The results show that the p value of the overall score of the ten indicators is 0.000131 far less than 0.05. The combination of these ten indicators is significantly related to the survival rate of patients.

Figure 2.

Figure 2

Survival function at the mean of the covariate. The survival years are taken as the time, and the ten indicators obtained from genetic algorithm feature selection are used as covariates.

2.4. Evaluation and Establishment of Prognostic Indicators

This section establishes and evaluates the prognostic index (PI) of patients with ESCC to better classify patients and provide good clinical guidance. In the above section, the ten indicators that are significantly related to the survival of patients are selected through genetic algorithm feature selection, which are T stage, N stage, TNM stage, WBC, EO, RBC, PLT, TP, PT, and INR. The binary logistic regression analysis [35] is used to construct the prognostic index. The patient's survival status is used as the dependent variable, and ten indicators are used as independent variables. The prognostic index of ESCC is constructed by the following equation:

PI=0.481TNM0.809INR. (2)

The receiver operating characteristic (ROC) [36] curve is usually used to select the best diagnostic threshold and divide the indicators into two categories. The ROC curve of PI is shown in Figure 3(a). The AUC value is 0.660, p < 0.001, indicating that PI has a high predictive value for the prognosis of ESCC patients. The comparison of ROC curves between PI and TNM staging systems is shown in Figure 3(b). The comparison results of PI and TNM are shown in Table 3. By analyzing and comparing the ROC curves of PI and TNM, it can be concluded that the predictive effect of the prognostic index PI in this study is better than that of the TNM staging system.

Figure 3.

Figure 3

ROC analysis of PI and TNM. (a) ROC analysis of PI. (b) Comparative analysis of ROC for PI and TNM. The horizontal coordinate is “1-specificity,” and the vertical coordinate is “sensitivity.” The larger the area under the curve, the stronger the significance.

Table 3.

Results of ROC analysis for PI and TNM.

Project Sensitivity Specificity AUC Significance level P
PI 0.796 0.440 0.660 <0.0001
TNM 0.515 0.679 0.639 <0.0001

To better predict the survival status of ESCC patients, the ROC curve is further analyzed to determine the best cutoff value of PI. The PI values of all samples are used as inputs, and the ROC curve is drawn, as shown in Figure 3. The value of the area under the curve is 0.660, which is greater than 0.5, P < 0.001. Obviously, there is a threshold for PI. By calculating the Youden index, PI can be divided into two levels. The Youden index is calculated by the following equation:

Youden index=Sensitivity1Specificity. (3)

The Youden index is calculated as 0.303. The Youden index, AUC value, significance, and other related indicators are shown in Table 4. Then, for samples with PI values higher than 0.303 and samples with PI values lower than 0.303, ROC curves are drawn, as shown in Figure 4. The Youden index, AUC value, significance, and other related indicators are shown in Table 4. It can be seen from Table 4 that the AUC values of the three ROC curves are all greater than 0.5, and the significance P value is less than 0.05.

Table 4.

Results of ROC curve analysis for PI critical threshold.

Project ROC for all PI samples ROC for low PI samples ROC for high PI samples
Area under the ROC curve (AUC) 0.660 0.614 0.576
Standard error 0.030 0.056 0.033
95% confidence interval 0.600 to 0.719 0.505 to 0.723 0.511 to 0.642
Significance level P <0.0001 0.038 0.029
Youden index 0.237 0.284 0.158
Associated criterion 0.303 0.016 0.873
Sensitivity 0.796 0.742 0.309
Specificity 0.440 0.542 0.848

Figure 4.

Figure 4

ROC analysis for dividing PI staging. (a) ROC for high PI samples. (b) ROC for low PI samples. The horizontal coordinate is “1-specificity,” and the vertical coordinate is “sensitivity.” The larger the area under the curve, the stronger the significance.

According to the ROC curve, the three critical thresholds of PI can be obtained in sequence. The three critical thresholds are 0.303, 0.016, and 0.873, respectively. According to the critical threshold, PI is divided into four stages, namely PI-I, PI-II, PI-III, and PI-IV. The four stages of PI are analyzed by the Kaplan–Meier, and the results are shown in Figure 5. According to the Kaplan–Meier analysis [37], PI-I has the best prognostic effect, which is better than PI-II, PI-III, and PI-IV for patients with ESCC.

Figure 5.

Figure 5

Kaplan–Meier survival analysis of PI stages.

2.5. Divide Risk Levels Based on Patient's Age

At present, age is considered by most studies to be an important factor affecting the prognosis of ESCC. The age factor has an important influence on the physiological immunity of the patient, and it is related to the patient's tolerance to different treatment methods. Therefore, differences in age factors will also lead to different prognoses of ESCC patients. It is important to construct different survival prediction models for patients of different ages. The ROC curve is used to determine the best cutoff value of the patient's age. It is plotted with the age of all samples as the variable, named “ROC of the patient's age,” as shown in Figure 6. The area under the curve (AUC) value is 0.618, which is greater than 0.5, and P < 0.001. Obviously, a critical threshold can be found for age, which divides age into two risk levels.

Figure 6.

Figure 6

ROC analysis of age.

After calculating the Youden index, the critical threshold of age is 61.5 years. By calculating critical thresholds, patients are divided into the high- and low-age groups. The Kaplan–Meier survival analysis is performed based on the high- and low-value groups of age, and the results are shown in Figure 7. There is a significant difference between the high-age group and the low-age group (P < 0.05) on survival rate, and the low-age group has a better survival rate than the high-age group.

Figure 7.

Figure 7

Kaplan–Meier survival analysis of age.

3. Survival Prediction Based on LSSVM

3.1. Bald Eagle Search Algorithm-Least-Squares Support Vector Machine

The bald eagle search algorithm (BES) is proposed by Alsattar et al. [38]. It is a meta-heuristic optimization algorithm based on the behavior strategy or social behavior of the bald eagle during hunting. The algorithm has strong global search capabilities and can effectively solve various complex numerical optimization problems. In this study, the bald eagle search algorithm is used to optimize the parameters of the least-squares support vector machine, which improved the prediction accuracy of the least-squares support vector machine. The survival rate of ESCC patients is predicted based on the proposed BES-LSSVM classification prediction model.

The bald eagle search algorithm is mainly divided into three stages, namely select stage, search stage, and swooping stage.

3.1.1. Select Stage

In the select stage, the bald eagles will select the best area (according to the amount of food) within the selected search area and start looking for prey. At this time, the position P of the bald eagle is determined by multiplying the a priori information of the random search by α. The mathematical model of this behavior is constructed as follows:

Pi,new=Pbest+αrPmeanPi. (4)

where α is used to control the position change parameter within the range of (1.5, 2); r is a random number between (0,1). Pbest represents the best position of the bald eagle based on the previous search. Pmean is the average position of the bald eagle after the previous search. Pi represents the position of the ith bald eagle.

3.1.2. Search Stage

In the search stage, the bald eagles fly in different directions in a spiral shape, speeding up the search for prey. Then, the bald eagle will look for the best position in the selected space to swoop and hunt. The position update of the bald eagle during spiral flight adopts the form of polar coordinate equation, as follows:

xi=xrimaxxr,yi=yrimaxyr,xri=ri  sinθi,yri=ri  cosθi,θi=απrand,ri=θi+Rrand, (5)

where a and R are the parameters in the range of (5,10) and (0.5, 2), respectively, which are used to control the spiral regression trajectory. θ(i) and r(i) are the polar angle and polar diameter of the spiral equation, respectively. x(i) and y(i) represent the position of the bald eagle in polar coordinates, and the values are both (−1,1). xr(i) and yr(i) represent the position of the bald eagle in the Cartesian coordinate system. rand is a random number (0,1).

The location of the bald eagle is constructed as follows:

Pi,new=Pi+yiPiPi+1+xiPiPmean. (6)

3.1.3. Swooping Stage

In the swooping stage, the bald eagles quickly swoop from the best position in the search space to their target prey. At the same time, other individuals in the population move to the best position and attack the prey. The state of motion of the bald eagle is described by the polar coordinate equation:

θi=απrand,ri=θi,xri=risinhθi,yri=ricoxhθi,x1i=xrimaxxr,y1i=yrimaxyr. (7)

The formula for updating the position of the bald eagle during swooping is constructed as follows:

δx=x1Pic1Pmean,δy=y1Pic2Pbest,Pi,new=randPbest+δx+δy, (8)

where c1 and c2 increase the exercise intensity of the bald eagle to the optimal point and the center point, and the value range is (1,2).

For LSSVM, the choice of kernel function is a key factor. The RBF kernel function is selected in this study, and the RBF kernel function can be expressed as follows:

Kx,z=expgxz2,g>0, (9)

where g is the parameter coefficient of the kernel function, which affects the performance of LSSVM.

In this study, to improve the classification accuracy of LSSVM, BES is selected to optimize the penalty factor c and the kernel function parameter g of LSSVM. The classification error rate of LSSVM is used as the objective function of BES optimization, and the objective function is fitness function  = 1 − classification error rate. The larger the fitness value, the higher the classification effect of LSSVM.

To clearly express the BES-LSSVM process, the framework of BES-LSSVM is shown in Algorithm 2.

3.2. Survival Prediction of Esophageal Squamous Cell Carcinoma

Ten indicators related to the survival rate of ESCC patients are obtained through the method of genetic algorithm feature selection. These indicators are T stage, N stage, TNM stage, WBC, EO, RBC, PLT, TP, PT, and INR. The prognostic index PI of ESCC patients is obtained by the binary logistic regression. The eleven indicators of patients are used as inputs to the BES-LSSVM model, and the five-year survival rate of the patients is used as the output. Survival prediction models for ESCC patients in the high-age group and the low-age group are established separately. The framework of the overall implementation of the survival prediction model for patients with ESCC is shown in Figure 8. To verify the validity of this model, grasshopper optimization algorithm-least-squares support vector machine (GOA-LSSVM) [39], particle swarm optimization-least-squares support vector machine (PSO-LSSVM) [40], differential evolution-least-squares support vector machine (DE-LSSVM) [41], sparrow search algorithm-least-squares support vector machine (SSA-LSSVM) [42], bald eagle search-back propagation neural network(BES-BPNN), and bald eagle search-extreme learning machine(BES-ELM) are used for comparison.

Figure 8.

Figure 8

Framework of the overall implementation of the survival prediction model for patients with ESCC.

For the parameter setting of the bald eagle search algorithm, the bald eagle population number is set to 20, and the number of iterations is set to 100. For the particle swarm algorithm, both c1 and c2 are set to 1.5. The population size is set to 20, and the number of iterations is set to 100. For the grasshopper optimization algorithm, the population size is set to 20, and the maximum number of iterations is set to 100. For differential evolution algorithm, the scaling factor F is set to 0.5, the crossover probability CR is set to 0.9, and the maximum number of iterations is set to 100. For the sparrow search algorithm, the population size is set to 20, the safety value is set to 0.6, and maximum number of iterations is set to 100. The dataset is divided into ten parts, and the ten-fold cross-validation method is used to verify the performance of the model. Nine samples are used as the training set, and one sample is used as the validation set. The cross-validation is repeated 10 times, and the average of the ten results is obtained. This method enables training and testing with random samples repeatedly, and the results are verified once each time. The effect of boundary patient data on the performance of the least-squares support vector machine is effectively reduced. The evaluation metrics include classification accuracy, sensitivity, specificity, and running time. Among them, sensitivity is a measure of the model's ability to identify positive samples and specificity is a measure of the model's ability to identify negative samples. Sensitivity and specificity are calculated as follows:

Sensitivity=TPTP+FN,Specificity=TNTN+FP, (10)

where true positive (TP) is the number of positive samples correctly identified, true negative (TN) is the number of negative samples correctly identified, false positive (FP) is the number of positive samples incorrectly identified, and false negative (FN) is the number of positive samples incorrectly identified. The prediction results of the LSSVM optimized by the five optimization algorithms, BES-BPNN, and BES-ELM model are shown in Table 5. The optimal LSSVM model parameters under different optimization methods are shown in Table 6.

Table 5.

Comparison of different algorithms for predicting five-year survival of patients with esophageal squamous cell carcinoma.

Algorithm 10-fold cross-validation accuracy (%) Sensitivity (%) Specificity (%) Running time (s)
High-age group BES-LSSVM 86.538 88.032 86.437 1.661
GOA-LSSVM 85.769 86.971 85.101 3.464
DE-LSSVM 85.384 86.626 84.668 8.123
PSO-LSSVM 84.615 85.397 83.537 3.641
SSA-LSSVM 86.154 87.329 85.553 2.875
BES-BPNN 83.902 85.673 83.393 10.615
BES-ELM 83.477 84.419 82.907 6.171

Low-age group BES-LSSVM 86.495 88.327 85.991 1.846
GOA-LSSVM 85.435 87.229 84.915 4.254
DE-LSSVM 85.217 86.802 84.474 9.950
PSO-LSSVM 84.782 86.595 84.245 3.846
SSA-LSSVM 85.843 87.675 85.338 3.412
BES-BPNN 83.479 85.271 82.959 11.743
BES-ELM 83.913 85.706 83.393 7.036

Table 6.

Optimal LSSVM model parameters under different optimization algorithms.

Algorithm High-age group Low-age group
Penalty factor Kernel function parameter Penalty factor Kernel function parameter
BES-LSSVM 77.946 2.090 60.290 2.493
GOA-LSSVM 54.429 1.225 22.895 0.106
DE-LSSVM 66.155 1.044 50.816 0.735
PSO-LSSVM 61.902 1.086 46.111 0.459
SSA-LSSVM 77.217 10.192 75.204 5.991

It can be seen from Table 5 that in the high-age group, the prediction accuracy of BES-LSSVM, GOA-LSSVM, DE-LSSVM, PSO-LSSVM, SSA-LSSVM, BES-BPNN, and BES-ELM is 86.538%, 85.769%, 85.384%, 84.615%, 86.154%, 83.902%, and 83.477%, respectively. In the low-age group, the prediction accuracy of BES-LSSVM, GOA-LSSVM, DE-LSSVM, PSO-LSSVM, SSA-LSSVM, BES-BPNN, and BES-ELM is 86.495%, 85.435%, 85.217%, 84.782%, 85.843%, 83.479%, and 83.913%, respectively. The comparison shows that BES-LSSVM has a high accuracy rate and can accurately predict the five-year survival rate of ESCC patients. In terms of sensitivity and specificity, the proposed BES-LSSVM also outperforms other models. Besides, it can be seen from Table 5 that BES-LSSVM has the fastest running time.

To better demonstrate the effectiveness of the proposed model, the Wisconsin Diagnostic Breast Cancer (WBCD) dataset is used for testing, and the results are shown in Table 7. From the test results, it can be seen that BES-LSSVM has higher prediction accuracy and faster running time than other models. Therefore, the survival status of cancer patients can be effectively predicted by the survival prediction model proposed in this study.

Table 7.

Comparison of the results of different algorithms.

Algorithm 10-fold cross-validation accuracy (%) Sensitivity (%) Specificity (%) Running time (s)
BES-LSSVM 97.01 98.19 95.09 2.793
GOA-LSSVM 96.28 97.75 93.90 7.263
DE-LSSVM 96.10 97.64 93.62 6.824
PSO-LSSVM 96.27 97.75 93.89 5.772
SSA-LSSVM 96.65 97.97 94.54 3.818
BES-BP 95.26 97.11 92.34 12.749
BES-ELM 95.61 97.33 92.88 7.837

4. Conclusions

To accurately and effectively predict the five-year survival rate of patients with ESCC, a survival prediction model based on genetic algorithm feature selection, binary logistic regression, and least-squares support vector machine is proposed in this study. A genetic algorithm and Cox regression are used to determine ten indicators that are significantly related to the survival of patients with ESCC. Based on the binary logistic regression, a prognostic indicator PI with predictive value is constructed. Patients are divided into the high-age groups and the low-age groups by ROC curve analysis. Through the Kaplan–Meier survival analysis, it is concluded that the low-age group has a better survival rate than the high-age group. The bald eagle search algorithm-least-squares support vector machine (BES-LSSVM) is further proposed, which effectively predicts the five-year survival rate of patients with ESCC. The accuracy of BES-LSSVM in predicting the five-year survival of patients with ESCC is better than the existing GOA-LSSVM, PSO-LSSVM, DE-LSSVM, SSA-LSSVM, BES-BPNN, and BES-ELM. This reflects the good practical value of the ESCC survival prediction model proposed in this study in the field of cancer classification prediction.

However, the accuracy of the model may be affected by increase in number of samples and classes. Moreover, sometimes, it is a possibility that during the feature selection process, few important features are discarded. In the future, the combination of swarm intelligence optimization algorithm and the latest deep learning models (such as deep neural network and convolutional neural network) will be used to develop a new survival prediction model for patients with ESCC on a larger and more complex dataset.

Algorithm 1.

Algorithm 1

Framework of GA-FS.

Algorithm 2.

Algorithm 2

Framework of BES.

Acknowledgments

This work was supported in part by the Joint Funds of the National Natural Science Foundation of China, under Grant U1804262, Foundation of Young Key Teachers from University of Henan Province, under Grant 2018GGJS092, Youth Talent Lifting Project of Henan Province, under Grant 2018HYTP016, Henan Province University Science and Technology Innovation Talent Support Plan, under Grant 20HASTIT027, Zhongyuan Thousand Talents Program, under Grant 204200510003, and Open Fund of State Key Laboratory of Esophageal Cancer Prevention and Treatment, under Grants K2020-0010 and K2020-0011.

Data Availability

The data used to support the findings of the study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of financial interests or personal relationships that could have appeared to influence the work reported in this study.

References

  • 1.Arnold M., Rutherford M. J., Bardot A., et al. Progress in cancer survival, mortality, and incidence in seven high-income countries 1995–2014 (icbp survmark-2): a population-based study. The Lancet Oncology . 2019;20(11):1493–1505. doi: 10.1016/S1470-2045(19)30456-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Urakawa S., Makino T., Yamasaki M., et al. Lymph node response to neoadjuvant chemotherapy as an independent prognostic factor in metastatic esophageal cancer. Annals of Surgery . 2021;273(6):1141–1149. doi: 10.1097/SLA.0000000000003445. [DOI] [PubMed] [Google Scholar]
  • 3.Lu Z., Fang Y., Liu C., et al. Early interdisciplinary supportive care in patients with previously untreated metastatic esophagogastric cancer: a phase iii randomized controlled trial. Journal of Clinical Oncology . 2021;39(7):748–756. doi: 10.1200/JCO.20.01254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Wu Y. L., Tsai M. C., Wang W. L. An unusual esophageal ulcerative lesion mimicking esophageal cancer. Gastroenterology . 2022;162(1):e4–e6. doi: 10.1053/j.gastro.2021.05.001. [DOI] [PubMed] [Google Scholar]
  • 5.Kurtom S., Kaplan B. J. Esophagus and gastrointestinal junction tumors. Surgical Clinics . 2020;100(3):507–521. doi: 10.1016/j.suc.2020.02.003. [DOI] [PubMed] [Google Scholar]
  • 6.Perisetti A., Bellamkonda M., Konda M., et al. Tumor-associated antigens and their antibodies in the screening, diagnosis, and monitoring of esophageal cancers. European Journal of Gastroenterology and Hepatology . 2020;32(7):779–788. doi: 10.1097/MEG.0000000000001718. [DOI] [PubMed] [Google Scholar]
  • 7.Metcalfe C., Avery K., Berrisford R., et al. Comparing open and minimally invasive surgical procedures for oesophagectomy in the treatment of cancer: the romio (randomised oesophagectomy: Minimally invasive or open) feasibility study and pilot trial. Health Technology Assessment . 2016;20(48):1–68. doi: 10.3310/hta20480. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Vollenbrock S. E., Voncken F. E., Bartels L. W., Beets-Tan R. G., Bartels-Rutten A. Diffusion-weighted mri with adc mapping for response prediction and assessment of oesophageal cancer: a systematic review. Radiotherapy & Oncology . 2020;142:17–26. doi: 10.1016/j.radonc.2019.07.006. [DOI] [PubMed] [Google Scholar]
  • 9.Sun J., Zang M., Liu P., Wang Y. A secure communication scheme of three-variable chaotic coupling synchronization based on dna chemical reaction networks. IEEE Transactions on Signal Processing . 2022;70:2362–2373. [Google Scholar]
  • 10.Depypere L., De Hertogh G., Moons J., et al. Importance of lymph node response after neoadjuvant chemoradiotherapy for esophageal adenocarcinoma. The Annals of Thoracic Surgery . 2021;112(6):1847–1854. doi: 10.1016/j.athoracsur.2020.09.074. [DOI] [PubMed] [Google Scholar]
  • 11.Liu K., Feng F., Chen Xz, et al. Comparison between gastric and esophageal classification system among adenocarcinomas of esophagogastric junction according to ajcc 8th edition: a retrospective observational study from two high-volume institutions in China. Gastric Cancer . 2019;22(3):506–517. doi: 10.1007/s10120-018-0890-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Mao Y., Fu Z., Zhang Y., et al. A six-microrna risk score model predicts prognosis in esophageal squamous cell carcinoma. Journal of Cellular Physiology . 2019;234(5):6810–6819. doi: 10.1002/jcp.27429. [DOI] [PubMed] [Google Scholar]
  • 13.Ahmadi M., Chen Z. Machine learning-based models for predicting permeability impairment due to scale deposition. Journal of Petroleum Exploration and Production Technology . 2020;10(7):2873–2884. [Google Scholar]
  • 14.Ahmadi M. A. Toward reliable model for prediction drilling fluid density at wellbore conditions: a lssvm model. Neurocomputing . 2016;211:143–149. [Google Scholar]
  • 15.Sun J., Han G., Zeng Z., Wang Y. Memristor-based neural network circuit of full-function pavlov associative memory with time delay and variable learning rate. IEEE Transactions on Cybernetics . 2019;50(7):2935–2945. doi: 10.1109/TCYB.2019.2951520. [DOI] [PubMed] [Google Scholar]
  • 16.Ahmadi M. A., Rozyn J., Lee M., Bahadori A. Estimation of the silica solubility in the superheated steam using lssvm modeling approach. Environmental Progress & Sustainable Energy . 2016;35(2):596–602. [Google Scholar]
  • 17.Tian Z. Modes decomposition forecasting approach for ultra-short-term wind speed. Applied Soft Computing . 2021;105 [Google Scholar]
  • 18.Tian Z. Backtracking search optimization algorithm-based least square support vector machine and its applications. Engineering Applications of Artificial Intelligence . 2020;94 [Google Scholar]
  • 19.Sun J., Han J., Wang Y., Liu P. Memristor-based neural network circuit of emotion congruent memory with mental fatigue and emotion inhibition. IEEE Transactions on Biomedical Circuits and Systems . 2021;15(3):606–616. doi: 10.1109/TBCAS.2021.3090786. [DOI] [PubMed] [Google Scholar]
  • 20.Ahmadi M. A., Chen Z. Comparison of machine learning methods for estimating permeability and porosity of oil reservoirs via petro-physical logs. Petroleum . 2019;5(3):271–284. [Google Scholar]
  • 21.Ahmadi M. A., Pournik M. A predictive model of chemical flooding for enhanced oil recovery purposes: application of least square support vector machine. Petroleum . 2016;2(2):177–182. [Google Scholar]
  • 22.Tian Z. Short-term wind speed prediction based on lmd and improved fa optimized combined kernel function lssvm. Engineering Applications of Artificial Intelligence . 2020;91103573 [Google Scholar]
  • 23.Zhu X., Song B., Shi F., et al. Joint prediction and time estimation of covid-19 developing severe symptoms using chest ct scan. Medical Image Analysis . 2021;67 doi: 10.1016/j.media.2020.101824.101824 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Lahoura V., Singh H., Aggarwal A., et al. Cloud computing-based framework for breast cancer diagnosis using extreme learning machine. Diagnostics . 2021;11(2):p. 241. doi: 10.3390/diagnostics11020241. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Majid A., Khan M. A., Yasmin M., Rehman A., Yousafzai A., Tariq U. Classification of stomach infections: a paradigm of convolutional neural network along with classical features fusion and selection. Microscopy Research and Technique . 2020;83(5):562–576. doi: 10.1002/jemt.23447. [DOI] [PubMed] [Google Scholar]
  • 26.Mogensen U. B., Ishwaran H., Gerds T. A. Evaluating random forests for survival analysis using prediction error curves. Journal of Statistical Software . 2012;50(11):p. 1. doi: 10.18637/jss.v050.i11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Jazayeri N., Sajedi H. Breast cancer diagnosis based on genomic data and extreme learning machine. SN Applied Sciences . 2020;2(1):1–7. [Google Scholar]
  • 28.Pan X., Zhang T., Yang Q., Yang D., Rwigema J. C., Qi X. S. Survival prediction for oral tongue cancer patients via probabilistic genetic algorithm optimized neural network models. British Journal of Radiology . 2020;93(1112) doi: 10.1259/bjr.20190825.20190825 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Sun J., Han J., Liu P., Wang Y. Memristor-based neural network circuit of pavlov associative memory with dual mode switching. AEU-international Journal of Electronics and Communications . 2021;129153552 [Google Scholar]
  • 30.Zheng L., Wang G., Zhang F., Zhao Q., Dai C., Yousefi N. Breast cancer diagnosis based on a new improved elman neural network optimized by meta-heuristics. International Journal of Imaging Systems and Technology . 2020;30(3):513–526. [Google Scholar]
  • 31.Halim Z. An ensemble filter-based heuristic approach for cancerous gene expression classification. Knowledge-Based Systems . 2021;234107560 [Google Scholar]
  • 32.Berrhail F., Belhadef H. Genetic algorithm-based feature selection approach for enhancing the effectiveness of similarity searching in ligand-based virtual screening. Current Bioinformatics . 2020;15(5):431–444. [Google Scholar]
  • 33.Miranda E. N., Barbosa B. H. G., Silva S. H. G., Monti C. A. U., Tng D. Y. P., Gomide L. R. Variable selection for estimating individual tree height using genetic algorithm and random forest. Forest Ecology and Management . 2022;504119828 [Google Scholar]
  • 34.Mohammed M., Mwambi H., Omolo B. Colorectal cancer classification and survival analysis based on an integrated rna and dna molecular signature. Current Bioinformatics . 2021;16(4):583–600. [Google Scholar]
  • 35.Sutradhar R., Barbera L. Comparing an artificial neural network to logistic regression for predicting ed visit risk among patients with cancer: a population-based cohort study. Journal of Pain and Symptom Management . 2020;60(1):1–9. doi: 10.1016/j.jpainsymman.2020.02.010. [DOI] [PubMed] [Google Scholar]
  • 36.Gao J., Hu B., Chen L. A path-based method for identification of protein phenotypic annotations. Current Bioinformatics . 2021;16(9):1214–1222. [Google Scholar]
  • 37.Shao Y., Tao X., Lu R., et al. Hsa_circ_0065149 is an indicator for early gastric cancer screening and prognosis prediction. Pathology and Oncology Research . 2020;26(3):1475–1482. doi: 10.1007/s12253-019-00716-y. [DOI] [PubMed] [Google Scholar]
  • 38.Alsattar H., Zaidan A., Zaidan B. Novel meta-heuristic bald eagle search optimisation algorithm. Artificial Intelligence Review . 2020;53(3):2237–2264. [Google Scholar]
  • 39.Haoran Z., Huiru Z., Sen G. Short-term wind electric power forecasting using a novel multi-stage intelligent algorithm. Sustainability . 2018;10(3):p. 881. [Google Scholar]
  • 40.Yu C., Cao W., Liu Y., Shi K., Ning J. Evaluation of a novel computer dye recipe prediction method based on the pso-lssvm models and single reactive dye database. Chemometrics and Intelligent Laboratory Systems . 2021;218104430 [Google Scholar]
  • 41.Alhajri I. H., Alarifi I. M., Asadi A., Nguyen H. M., Moayedi H. A general model for prediction of baso4 and srso4 solubility in aqueous electrolyte solutions over a wide range of temperatures and pressures. Journal of Molecular Liquids . 2020;299112142 [Google Scholar]
  • 42.Wu H., Wang J. A method for prediction of waterlogging economic losses in a subway station project. Mathematics . 2021;9(12):p. 1421. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data used to support the findings of the study can be obtained from the corresponding author upon request.


Articles from Computational Intelligence and Neuroscience are provided here courtesy of Wiley

RESOURCES