Computational and Mathematical Methods in Medicine. 2021 Sep 10;2021:2213194. doi: 10.1155/2021/2213194

A Hybrid Method to Predict Postoperative Survival of Lung Cancer Using Improved SMOTE and Adaptive SVM

Jiang Shen 1, Jiachao Wu 1, Man Xu 2, Dan Gan 3, Bang An 1, Fusheng Liu 1
PMCID: PMC8449740  PMID: 34545291

Abstract

Predicting postoperative survival of lung cancer patients (LCPs) is an important problem of medical decision-making. However, the imbalanced distribution of patient survival in the dataset increases the difficulty of prediction. Although the synthetic minority oversampling technique (SMOTE) can be used to deal with imbalanced data, it cannot identify data noise. On the other hand, many studies use a support vector machine (SVM) combined with resampling technology to deal with imbalanced data. However, most studies require manual setting of SVM parameters, which makes it difficult to obtain the best performance. In this paper, a hybrid improved SMOTE and adaptive SVM method is proposed for imbalanced data to predict the postoperative survival of LCPs. The proposed method is divided into two stages: in the first stage, the cross-validated committees filter (CVCF) is used to remove noise samples to improve the performance of SMOTE. In the second stage, we propose an adaptive SVM, which uses fuzzy self-tuning particle swarm optimization (FPSO) to optimize the parameters of SVM. Compared with other advanced algorithms, our proposed method obtains the best performance with 95.11% accuracy, 95.10% G-mean, 95.02% F1, and 95.10% area under the curve (AUC) for predicting postoperative survival of LCPs.

1. Introduction

Lung cancer (LC) is the deadliest cancer in the world. More than 85% of lung cancer patients are diagnosed with non-small-cell LC [1]. Surgical resection is the standard and most effective treatment for LC stage I, stage II, and non-small-cell stage IIIA [1]. A major problem of the clinical decision on LC operation is to select candidates for surgery based on the patient's short-term and long-term risks and benefits, where survival time is one of the most important measures. Accurately predicting a patient's survival after surgery can help doctors make better treatment decisions. At the same time, it can help patients better understand their conditions so that they can form realistic psychological expectations and prepare financially.

In recent years, more and more data-driven methods have been used to predict the postoperative survival of LCPs. In terms of statistical methods, Kaplan–Meier curves, multivariable logistic regression, and Cox regression are the three most widely used statistical methods to predict survival or complications for LCPs [2]. However, considering the shortcomings of traditional statistical methods and the incompleteness of medical data, data mining and machine learning techniques have been introduced in recent years. Mangat and Vig [3] proposed an association rule algorithm based on a dynamic particle swarm optimizer, and the classification accuracy is 82.18%. Saber Iraji [4] compared the accuracy of adaptive fuzzy neural networks, extreme learning machines, and neural networks for predicting the 1-year postoperative survival of LCPs. The results show that the extreme learning machine achieves the highest sensitivity (90.05%) and specificity (81.57%). Tomczak et al. [5] used the boosted support vector machine (SVM) algorithm to predict the postoperative survival of LCPs. This algorithm combines the advantages of ensemble learning and cost-sensitive SVM, and the G-mean can reach 65.73%. As can be seen from the previous research, most of these studies ignore the impact of imbalanced data distribution, which may reduce the performance of classifiers.

Class imbalance refers to the phenomenon in which one class of data in a dataset is much larger than the others [6]. Standard machine learning classifiers are effective for balanced data, but they do not perform well on imbalanced data. Specifically, with the progress of medical technology, the number of long-term survivors after surgery for LCPs is much larger than the number of short-term deaths. This leads to higher prediction accuracy for survivors (the majority class) and poorer recognition of deceased patients (the minority class). Therefore, it is necessary to propose a method that performs well for both survivors and deceased patients when predicting the postoperative survival of LCPs.

During the past decades, the imbalanced data classification problem has widely become a matter of concern and has been intensively researched. The existing papers on imbalanced data processing methods have two main research directions: data level and algorithm level [7]. The data-level processing methods create a balanced class distribution by resampling the input data. Algorithm-level processing methods mainly involve two aspects: ensemble learning and cost-sensitive learning. Among these imbalanced data processing methods, the synthetic minority oversampling technique (SMOTE) is one of the most widely used methods, as it is relatively simple and effective [8]. However, it is likely to be unsatisfactory or even counterproductive if SMOTE is used alone, which is because its blind oversampling ignores the distribution of samples, such as the existence of noise [9, 10]. To solve this problem, many approaches are proposed to improve SMOTE. Ramentol et al. [11] combined rough set theory with SMOTE and proposed the SMOTE-RSB algorithm. SMOTE-RSB first uses SMOTE for oversampling and then removes noise and outliers in the dataset based on rough set theory. SSMNFOS [12] is a hybrid method based on stochastic sensitivity measurement (SSM) noise filtering and oversampling, which can improve the robustness of the oversampling method with respect to noise samples. The CURE-SMOTE [13] uses CURE (clustering using representatives) to cluster minority samples for removing noise and outliers and then uses SMOTE to insert artificial synthetic samples between representative samples and central samples to balance the dataset. However, most of these methods need to set the noise threshold through prior parameters, which increases the risk of misidentification of noise. In addition, some researchers consider ensemble filtering methods, which have been proven to be generally more efficient than single filters [14]. In this paper, we propose to use the cross-validated committees filter (CVCF) to detect and remove noise before applying SMOTE and record this method as CVCF-SMOTE. CVCF is an ensemble-based filter, which can reduce the risk of error in the threshold setting of prior parameters [15].

In addition, SVM as one of the most advanced classifiers has not been well used to predict postoperative survival of LC. In the previous research, SVM has been widely used in statistical classification and regression analysis due to its excellent performance [16]. Considering the limitations of SVM on imbalanced data, some studies combine resampling technology and SVM to deal with imbalanced data. D'Addabbo and Maglietta [17] proposed a method combining parallel selective sampling and SVM (PSS-SVM) to process imbalanced big data. Experimental results show that the performance of PSS-SVM is better than that of SVM and RUSBoost classifiers. Huang et al. [18] designed an undersampling technique based on clustering and combined it with optimized SVM to deal with imbalanced data. The classification performance of SVM is improved by the linear combination of SVM based on a mixed kernel. Fan et al. [19] proposed a hybrid technology combining principal component analysis (PCA), SMOTE, and SVM to diagnose chiller fault. Experimental results prove that this hybrid technology can improve the overall performance of chiller fault diagnosis.

However, these studies usually require a manual setting of SVM parameters, which may lead to failure to obtain the best experimental results. The standard SVM has a limitation that its performance depends on the selection of initial parameters. Some studies optimize the parameters of SVM through evolutionary calculations which have achieved good results. In these optimization algorithms, the particle swarm optimization- (PSO-) optimized SVM has been widely used with promising results due to its simplicity and fast convergence [20]. With the development of PSO technology, some improved PSO algorithms are used to optimize SVM. Wei et al. [21] proposed a binary PSO-optimized SVM method for feature selection, which overcomes the problem of premature convergence and obtained high-quality features. A switching delayed particle swarm optimization- (SDPSO-) optimized SVM is proposed to diagnose Alzheimer's disease [22]. Experimental results show that the proposed method outperforms several other variants of SVM and has obtained excellent classification accuracy. However, these methods often require parameter settings for PSO or improved PSO, such as particle size and inertial weight. In general, getting the best settings is complicated and time-consuming. If the PSO parameters are set improperly, it will even reduce the performance of the SVM.

In recent years, many new metaheuristic techniques have been proposed, such as Monarch Butterfly Optimization (MBO) [23], the slime mould algorithm [24], Moth Search (MS) [25], Hunger Games Search (HGS) [26], and the Harris Hawks Optimizer (HHO) [27]. However, most of these methods require users to tune parameters to achieve satisfactory performance. Fuzzy self-tuning PSO (FPSO) is a settings-free adaptive PSO proposed in recent years [28]. The advantage of FPSO is that every particle is adaptively adjusted during the optimization process without any PSO expertise or parameter settings. Moreover, experimental results show that FPSO is better than several previous competitors in both convergence speed and the quality of the solutions found. Based on the above considerations, the FPSO algorithm is exploited to optimize the parameters of SVM, which leads to a novel FPSO-SVM classification algorithm.

Based on the improved SMOTE and FPSO-SVM, we propose a two-stage hybrid method to improve the performance of the postoperative survival prediction of LCPs. In the first stage, CVCF is used to remove noise samples to improve the performance of SMOTE. Then, SMOTE is adopted to handle the imbalanced nature of the dataset. In the second stage, we apply FPSO-SVM to predict the postoperative survival of LCPs. The experimental results show that the proposed hybrid method outperforms other comparative state-of-the-art algorithms. This hybrid method can effectively improve the accuracy of survival prediction after LC surgery and provide reliable medical decision-making support for doctors and patients. Our contributions are summarized as follows:

  1. A novel hybrid method that combines improved SMOTE with adaptive SVM is proposed for predicting postoperative survival of LCPs

  2. We apply CVCF to clean up data noise to improve the performance of SMOTE

  3. FPSO is used to optimize the parameters of SVM and achieve an adaptive SVM

  4. The proposed hybrid method not only achieves higher predictive accuracy than the other compared algorithms for predicting postoperative survival of LCPs but also obtains better G-mean, F1, and area under the curve (AUC)

The rest of this paper is organized as follows: Section 2 presents the materials and methods. The experiment design, performance metrics, and experimental results are described in Section 3. A brief summary is given in Section 4.

2. Materials and Methods

2.1. Data Description

In this paper, the thoracic surgery dataset from Zięba et al. [5] is selected to predict the postoperative survival of LCPs. The data were collected from the Wroclaw Thoracic Surgery Center on patients who underwent lung resection for primary LC from 2007 to 2011. The dataset contains 470 samples with an imbalance ratio of 5.71: 400 patients survived more than one year and 70 patients survived less than one year. Table 1 shows the features of the dataset. These features were selected from 36 preoperative predictors by the information gain method and were used to predict postoperative survival expectancy. Our task is to predict whether a patient's survival time after surgery is greater than one year.

Table 1.

Feature details of the thoracic surgery dataset.

Feature ID Description Type of attribute
1 Size of the original tumor, from OC11 (smallest) to OC14 (largest) Nominal
2 Diagnosis (specific combination of ICD-10 codes for primary and secondary as well as multiple tumors, if any) Nominal
3 Forced vital capacity Numeric
4 Pain (presurgery) Binary
5 Age at surgery Numeric
6 Performance status Nominal
7 Weakness (presurgery) Binary
8 Dyspnoea (presurgery) Binary
9 Cough (presurgery) Binary
10 Haemoptysis (presurgery) Binary
11 Peripheral arterial diseases Binary
12 MI (myocardial infarction) up to 6 months Binary
13 Asthma Binary
14 Volume that has been exhaled at the end of the first second of forced expiration Numeric
15 Smoking Binary
16 Type 2 diabetes mellitus Binary
17 1-year survival period (true value if died) Binary
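As a minimal illustration of the data described above, the following hedged sketch loads the publicly distributed ARFF file and checks the class imbalance. The file name and the target column name ("Risk1Yr") are assumptions about the UCI/KEEL distribution and may need to be adjusted to the local copy.

```python
# Hedged sketch: load the thoracic surgery ARFF file and inspect the imbalance.
# "ThoraricSurgery.arff" and the target column "Risk1Yr" are assumptions about
# the public UCI distribution; adjust them to match the local file.
import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("ThoraricSurgery.arff")
df = pd.DataFrame(data)

# ARFF nominal values arrive as byte strings; decode them to plain strings.
for col in df.select_dtypes(include=[object]).columns:
    df[col] = df[col].str.decode("utf-8")

counts = df["Risk1Yr"].value_counts()
print(counts)                                            # expected: 400 "F" vs 70 "T"
print("imbalance ratio:", counts.max() / counts.min())   # about 5.71
```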

2.2. Data Preprocessing

2.2.1. CVCF for Noise Cleaning

Although SMOTE is one of the most widely used methods for imbalanced data processing, it has some drawbacks in dealing with data noise. A major concern is that SMOTE may exacerbate the presence of noise in the data, as shown in Figure 1. Given the good performance of CVCF, we consider using it to improve SMOTE.

Figure 1.

Using SMOTE alone may indiscriminately aggravate the noise.

The CVCF algorithm is a well-known representative of an ensemble-based noise filter [29]. It induces multiple single classifiers by means of cross-validation. Afterward, samples mislabeled by all classifiers (or most classifiers) will be marked as noise and removed from the dataset. Choosing an appropriate base classifier is a key operation to ensure the excellent performance of CVCF. In this paper, we choose the C4.5 algorithm as the base classifier of CVCF because it has better robustness to noise data and suitability for ensemble learning [30, 31].
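The following is a minimal sketch of the CVCF idea under the assumption that sklearn's DecisionTreeClassifier (a CART implementation) stands in for C4.5: an n-fold committee votes on every sample, and samples misclassified by all (consensus) or most (majority) committee members are removed as noise.

```python
# Sketch of the CVCF filtering step. sklearn's DecisionTreeClassifier (CART)
# is used as an approximation of C4.5; the voting scheme ("consensus" or
# "majority") decides how many committee members must disagree with a label
# before the sample is treated as noise.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cvcf_filter(X, y, n_folds=5, scheme="consensus", random_state=0):
    X, y = np.asarray(X), np.asarray(y)
    mislabeled = np.zeros((len(y), n_folds), dtype=bool)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    for j, (train_idx, _) in enumerate(kf.split(X)):
        clf = DecisionTreeClassifier(random_state=random_state)
        clf.fit(X[train_idx], y[train_idx])
        mislabeled[:, j] = clf.predict(X) != y      # each committee member votes on every sample
    wrong_votes = mislabeled.sum(axis=1)
    if scheme == "consensus":
        noisy = wrong_votes == n_folds              # all members disagree with the label
    else:
        noisy = wrong_votes > n_folds / 2           # a majority disagrees
    keep = ~noisy
    return X[keep], y[keep]
```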

C4.5 is an improved version of the ID3 algorithm [32]. It improves ID3 by handling numeric attributes and missing values and by introducing pruning. In addition, unlike ID3, C4.5 uses the information gain ratio to select split attributes, which can be denoted by

\text{InfoGainRatio}(S, A) = \frac{\text{InfoGain}(S, A)}{\text{SplitInfo}(S, A)}, (1)

where InfoGainRatio(S, A) represents the information gain ratio of attribute A in dataset S. InfoGain(S, A) is the information gain of dataset S after splitting through attribute A and can be denoted by

\text{InfoGain}(S, A) = \text{Info}(S) - \text{Info}(S, A), (2)

where Info(S) is the entropy of dataset S. Info(S, A) is the conditional entropy with respect to attribute A. SplitInfo(S, A) denotes the splitting information of attribute A and is expressed by

\text{SplitInfo}(S, A) = -\sum_{i=1}^{m} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}, (3)

where |S| represents the number of samples of dataset S. |Si| indicates the number of samples of subset i after the original dataset is divided into m subsets according to the attribute value of A.
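A small worked example of equations (1)-(3), assuming a toy binary label vector and one candidate split, may help make the gain-ratio computation concrete.

```python
# Toy computation of the information gain ratio of equations (1)-(3).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain_ratio(labels, groups):
    """labels: class labels of S; groups: label arrays of the subsets induced by attribute A."""
    n = sum(len(g) for g in groups)
    cond_entropy = sum(len(g) / n * entropy(g) for g in groups)          # Info(S, A)
    gain = entropy(labels) - cond_entropy                                # eq. (2)
    split_info = -sum(len(g) / n * np.log2(len(g) / n) for g in groups)  # eq. (3)
    return gain / split_info                                             # eq. (1)

S = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])      # 4 positives, 6 negatives
groups = [S[:4], S[4:]]                           # a split that separates the classes perfectly
print(info_gain_ratio(S, groups))                 # 1.0 for this perfect split
```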

2.2.2. SMOTE to Balance Data

The core idea of SMOTE is to insert artificial samples of similar values into the minority class, thereby improving the imbalanced distribution of classes. More specifically, the sampling ratio is set first, and then the k nearest neighbors of each minority sample are found. Finally, according to equation (4), one of the neighbors is randomly selected to generate a synthetic sample that is put back into the dataset until the sampling number reaches the set ratio. The synthesized new sample is calculated as follows:

X_{\text{new}} = X + \delta \cdot (X_i - X), \quad \delta \in (0, 1), (4)

where X_new represents a new synthetic sample, X is the feature vector of a sample in the minority class, and X_i is the i-th nearest neighbor of sample X. δ is a random number between 0 and 1.
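The interpolation of equation (4) can be sketched directly, as below; in practice, the SMOTE implementation in the imbalanced-learn package can be used instead of this hand-rolled version.

```python
# Sketch of SMOTE's synthetic sample generation (equation (4)): interpolate
# between a minority sample and one of its k nearest minority neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because each sample is its own neighbour
    _, neighbours = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # a minority sample X
        j = rng.choice(neighbours[i][1:])                 # one of its k neighbours X_i
        delta = rng.random()                              # delta drawn from (0, 1)
        synthetic.append(X_min[i] + delta * (X_min[j] - X_min[i]))   # eq. (4)
    return np.vstack(synthetic)
```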

2.3. The Proposed FPSO-Optimized SVM (FPSO-SVM)

2.3.1. SVM

SVM is a supervised learning classifier based on statistical learning theory and structural risk minimization [33]. SVM is not prone to overfitting and can handle high-dimensional data well. The principle of SVM is to map the original data to a high-dimensional space to discover a hyperplane that maximizes the margin determined by the support vectors. Suppose there is a dataset D = {(x1, y1), (x2, y2), ⋯, (xn, yn)}. The optimal hyperplane of dataset D can be expressed as

a^{T} x + b = 0, (5)

where a is the weight vector and b represents the bias.

For linearly nonseparable problems, finding the above-mentioned optimal hyperplane can be formulated as the constrained optimization problem

\min_{a, b} \; \frac{1}{2} a^{T} a + C \sum_{i=1}^{n} \zeta_i, \quad \text{s.t.} \; y_i (a^{T} \cdot x_i + b) \ge 1 - \zeta_i, \; \zeta_i \ge 0, \; i = 1, 2, \ldots, n, (6)

where C is the penalty factor and ζi is the slack variable. By introducing the Lagrange formulation, the constrained objective function satisfies the KKT conditions, and the original problem is transformed into its dual form

\min_{\beta} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \beta_i \beta_j (x_i \cdot x_j) - \sum_{i=1}^{n} \beta_i, \quad \text{s.t.} \; \sum_{i=1}^{n} \beta_i y_i = 0, \; 0 \le \beta_i \le C, \; i, j = 1, 2, \ldots, n, (7)

where β is a Lagrangian multiplier. In practice, a larger value of C penalizes misclassification more heavily, which corresponds to a narrower margin and a greater risk of overfitting; conversely, when the value of C is too small, underfitting is likely.

Finally, the decision function is shown in

f(x) = \text{sgn}\left( \sum_{i=1}^{n} \beta_i^{*} y_i K(x_i, x) + b^{*} \right), (8)

where β_i^* and b^* are the optimal Lagrange multiplier and the optimal bias, respectively, and sgn(·) is the sign function. K(x_i, x_j) is a kernel function. Usually, the radial basis function (RBF) kernel is selected for SVM, which can be expressed as

K(x_i, x_j) = \exp\left( -\gamma \left\| x_i - x_j \right\|^2 \right), (9)

where γ is the kernel parameter. The classification performance of SVM depends heavily on the setting of penalty factor C and kernel parameter γ. Therefore, parameter setting is a key step in applying SVM.
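As a minimal sketch, assuming sklearn's SVC as the SVM implementation, the two parameters tuned later by FPSO (C and γ) can be exposed through a small cross-validated accuracy function, which also serves as the fitness of equation (10) in the next subsection.

```python
# Cross-validated accuracy of an RBF-kernel SVM for a given (C, gamma) pair;
# this is the quantity the swarm optimizer maximizes as its fitness.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_cv_accuracy(X, y, C, gamma, folds=5):
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()
```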

2.3.2. FPSO-SVM Model

In order to make SVM have better classification performance, we use FPSO to optimize the penalty factor C and kernel parameter γ of SVM, called FPSO-SVM. The classification accuracy is taken as the fitness function of FPSO, which is defined as

\text{Fitness} = \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, (10)

where TP, TN, FP, and FN represent four different classification results which are shown in Table 2.

Table 2.

Confusion matrix.

Actual positive Actual negative
Predicted positive TP FP
Predicted negative FN TN

FPSO is a fully adaptive version of PSO, which calculates the inertia weight, learning factor, and velocity independently for each particle based on fuzzy logic. The outstanding advantages of FPSO are that it does not require any prior knowledge about PSO and its optimization performance and convergence speed are better than those of PSO.

In FPSO, the swarm size is first set to N = 10 + 2√M based on the heuristic in [34, 35], where M is the dimension of the optimization problem. In this paper, since two SVM parameters need to be optimized, M = 2 and N = 12 (rounded down). After initializing the particles, we update their positions and velocities. Let x_i^k and v_i^k be the position and velocity of the i-th particle at the k-th iteration, respectively. At the (k + 1)-th iteration, the velocity v_i^{k+1} and position x_i^{k+1} of the i-th particle are defined as

v_i^{k+1} = w_i^k \cdot v_i^k + c_{\text{soc},i}^k \cdot r_1 \cdot (g^k - x_i^k) + c_{\text{cog},i}^k \cdot r_2 \cdot (b_i^k - x_i^k), \quad i = 1, 2, \ldots, 12, (11)
x_i^{k+1} = x_i^k + v_i^{k+1}, (12)

where w_i^k is the inertia weight of particle i at the k-th iteration and c_soc,i^k and c_cog,i^k are the social and cognitive factors of particle i at the k-th iteration, respectively. In FPSO, unlike conventional PSO, the values of w_i^k, c_soc,i^k, and c_cog,i^k are not fixed but are calculated separately for each particle at every iteration. r_1 and r_2 are two random vectors. b_i^k and g^k are the best position found so far by the i-th particle and the best global position in the swarm at the k-th iteration, respectively.
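A per-particle update following equations (11) and (12) might look like the sketch below; the per-particle w, c_soc, and c_cog values are supplied by the fuzzy rules described next, and the velocity magnitude is clamped to [v_min, v_max] per dimension as in equations (13) and (14) just below. The exact clamping convention is an assumption.

```python
# Sketch of one FPSO particle update (equations (11)-(12)). Unlike standard
# PSO, w, c_soc and c_cog are per-particle values produced by the fuzzy rules.
import numpy as np

def update_particle(x, v, pbest, gbest, w, c_soc, c_cog, v_min, v_max, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)        # random vectors r1, r2
    v_new = w * v + c_soc * r1 * (gbest - x) + c_cog * r2 * (pbest - x)   # eq. (11)
    sign = np.where(v_new >= 0, 1.0, -1.0)
    v_new = sign * np.clip(np.abs(v_new), v_min, v_max)      # clamp speed per eqs (13)-(14)
    x_new = x + v_new                                        # eq. (12)
    return x_new, v_new
```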

The maximum velocity (v_max^m) and minimum velocity (v_min^m) of all particles in the m-th dimension are defined as

v_{\max}^m = \eta \cdot (b_{\max}^m - b_{\min}^m), \quad \eta \in (0, 1), (13)
v_{\min}^m = \lambda \cdot (b_{\max}^m - b_{\min}^m), \quad \lambda \in (0, 1), (14)

where b_max^m and b_min^m represent the upper and lower bounds of the m-th dimension of the optimization problem, respectively. η and λ (η > λ) are two coefficients determined by linguistic variables and are used to set v_max^m and v_min^m for each particle.

In order to get the w, csoc, ccog, η, and λ values of each particle in each iteration, two concepts are introduced: the distance between each particle and the global optimal particle and the fitness increment of each particle relative to the previous iteration.

The distance between any two particles in the k-th iteration is expressed as

\delta(x_i^k, x_j^k) = \left\| x_i^k - x_j^k \right\|_2 = \sqrt{\sum_{m=1}^{2} \left( x_{i,m}^k - x_{j,m}^k \right)^2}, \quad i, j = 1, 2, \ldots, 12. (15)

The function ϕ represents the normalized fitness increment of particle i for the previous iteration, which is calculated as

\phi(x_i^{k+1}, x_i^k) = \frac{\delta(x_i^{k+1}, x_i^k)}{\delta_{\max}} \cdot \frac{\min(f(x_i^{k+1}), f_{\text{wor}}) - \min(f(x_i^k), f_{\text{wor}})}{f_{\text{wor}}}, (16)

where δmax is the diagonal length of the rectangle formed by the search space. fwor is the worst fitness value.

The linguistic variable of function δ is defined as Same, Near, and Far, which is used to measure the distance from a particle to the global best particle. The trapezoid membership function of Same is defined as

\mu_{\text{Same}}(\delta) = \begin{cases} 1, & 0 \le \delta < \delta_1, \\ (\delta_2 - \delta)/(\delta_2 - \delta_1), & \delta_1 \le \delta < \delta_2, \\ 0, & \delta_2 \le \delta \le \delta_{\max}. \end{cases} (17)

The triangle membership function of Near is defined as

\mu_{\text{Near}}(\delta) = \begin{cases} 0, & 0 \le \delta < \delta_1, \\ (\delta - \delta_1)/(\delta_2 - \delta_1), & \delta_1 \le \delta < \delta_2, \\ (\delta_3 - \delta)/(\delta_3 - \delta_2), & \delta_2 \le \delta < \delta_3, \\ 0, & \delta_3 \le \delta \le \delta_{\max}. \end{cases} (18)

The trapezoid membership function of Far is defined as

\mu_{\text{Far}}(\delta) = \begin{cases} 0, & 0 \le \delta < \delta_2, \\ (\delta - \delta_2)/(\delta_3 - \delta_2), & \delta_2 \le \delta < \delta_3, \\ 1, & \delta_3 \le \delta \le \delta_{\max}, \end{cases} (19)

where δ1 = 0.2 · δmax, δ2 = 0.4 · δmax, and δ3 = 0.6 · δmax.

The linguistic variable of function ϕ is defined as Better, Same, and Worse, which is used to measure the improvement of a particle's fitness value for the previous iteration. The trapezoid membership function of Better can be obtained by

\mu_{\text{Better}}(\phi) = \begin{cases} 1, & \phi = -1, \\ -\phi, & -1 < \phi < 0, \\ 0, & 0 \le \phi \le 1. \end{cases} (20)

The triangle membership function of Same is expressed as follows:

\mu_{\text{Same}}(\phi) = 1 - |\phi|. (21)

The triangle membership function of Worse is as follows:

\mu_{\text{Worse}}(\phi) = \begin{cases} 0, & -1 \le \phi < 0, \\ \phi, & 0 \le \phi < 1, \\ 1, & \phi = 1. \end{cases} (22)
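A direct transcription of the membership functions in equations (17)-(22) is sketched below; the distance δ is evaluated against the fixed breakpoints δ1, δ2, δ3 derived from δ_max, and φ is assumed to lie in [-1, 1].

```python
# Membership functions for the distance (Same/Near/Far, eqs (17)-(19)) and the
# fitness increment (Better/Same/Worse, eqs (20)-(22)).
def distance_memberships(delta, delta_max):
    d1, d2, d3 = 0.2 * delta_max, 0.4 * delta_max, 0.6 * delta_max
    same = 1.0 if delta < d1 else max(0.0, (d2 - delta) / (d2 - d1))
    if delta < d1 or delta >= d3:
        near = 0.0
    elif delta < d2:
        near = (delta - d1) / (d2 - d1)
    else:
        near = (d3 - delta) / (d3 - d2)
    far = 0.0 if delta < d2 else min(1.0, (delta - d2) / (d3 - d2))
    return same, near, far

def fitness_memberships(phi):
    better = -phi if phi < 0 else 0.0     # eq. (20): only negative increments count as "Better"
    same = 1.0 - abs(phi)                 # eq. (21)
    worse = phi if phi > 0 else 0.0       # eq. (22)
    return better, same, worse
```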

According to the preset fuzzy rules, w, csoc, ccog, η, and λ have three levels including Low, Medium, and High [28]. Table 3 shows the defuzzification values of w, csoc, ccog, η, and λ, which are calculated by the Sugeno inference method [36]. It is defined as follows:

\text{output} = \frac{\sum_{r=1}^{R} \rho_r z_r}{\sum_{r=1}^{R} \rho_r}, \quad r = 1, 2, \ldots, R, (23)

where R represents the number of rules. ρr and zr are the membership degree of the input variable and output value of the r-th rule, respectively.

Table 3.

Defuzzification of w, csoc, ccog, η, and λ.

Output Level
Low Medium High
w 0.3 0.5 1.0
c_soc 1.0 2.0 3.0
c_cog 0.1 1.5 3.0
λ 0.0 0.001 0.01
η 0.1 0.15 0.2
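A zero-order Sugeno defuzzification over the Table 3 levels can be sketched as below; which rules fire, and at which level, comes from the fuzzy rule base of [28] and is passed in here as precomputed firing strengths.

```python
# Sketch of the Sugeno defuzzification of equation (23) using the crisp
# Low/Medium/High values from Table 3.
LEVELS = {
    "w":     {"Low": 0.3, "Medium": 0.5,   "High": 1.0},
    "c_soc": {"Low": 1.0, "Medium": 2.0,   "High": 3.0},
    "c_cog": {"Low": 0.1, "Medium": 1.5,   "High": 3.0},
    "lam":   {"Low": 0.0, "Medium": 0.001, "High": 0.01},
    "eta":   {"Low": 0.1, "Medium": 0.15,  "High": 0.2},
}

def sugeno_output(name, firing):
    """firing: dict mapping 'Low'/'Medium'/'High' to the firing strength rho_r of each rule."""
    num = sum(rho * LEVELS[name][level] for level, rho in firing.items())
    den = sum(firing.values())
    return num / den if den > 0 else LEVELS[name]["Medium"]

# Example: rules firing Medium with strength 0.7 and High with 0.3 give an
# inertia weight of (0.7 * 0.5 + 0.3 * 1.0) / 1.0 = 0.65.
print(sugeno_output("w", {"Medium": 0.7, "High": 0.3}))
```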

Then, update the position of each particle based on the obtained values of w, csoc, ccog, η, and λ. Finally, recalculate the fitness of each particle, that is, the accuracy of the SVM corresponding to each particle. Repeat the above process until the maximum number of iterations is reached, and output the SVM with the optimal parameters.

The time complexity of FPSO-SVM consists of two parts: FPSO and SVM. In FPSO, the velocity and position of each particle are computed in each iteration, so the computational complexity of FPSO is determined by the number of iterations, the swarm size, and the dimensionality of each particle. Thus, FPSO requires O(T·N·m) time, where T is the number of iterations of FPSO, N is the swarm size of FPSO, and m is the dimensionality of the optimization problem. For SVM, the optimal hyperplane is obtained by computing the distances between the support vectors and the decision boundary, so the time complexity of SVM is O(d·n_sv), where d is the input vector dimension and n_sv is the number of support vectors. In FPSO-SVM, the number of SVM evaluations depends on the swarm size and the number of iterations of FPSO. Therefore, the time complexity of FPSO-SVM is O(T·N·m + T·N·d·n_sv).

2.4. Specific Steps of the Proposed Hybrid Method for Predicting Postoperative Survival of LCPs

Based on improved SMOTE and FPSO-SVM, we propose a two-stage hybrid method to improve the performance of the postoperative survival prediction of LCPs. In the first stage, CVCF is used to remove noise samples to improve the performance of SMOTE; then, SMOTE is applied to balance the data. In the second stage, FPSO-SVM is adopted to predict postoperative survival of LCPs. Figure 2 shows the flowchart of the proposed hybrid method. The specific steps of the hybrid method are as follows (a code sketch of the overall pipeline is given after the step list):

  1. Set CVCF to n-fold cross-validation. Then, the original dataset is divided into n subsets

  2. Take a different subset from the n subsets each time as the testing set and the remaining n − 1 subsets as the training set. Therefore, a total of n different C4.5 classifiers are trained. Then, all the trained C4.5 classifiers will vote for each sample in the dataset. In this way, each sample has a real class label and n labels marked by C4.5

  3. For each sample, determine whether all (or most) labels marked with C4.5 are different from the real one. If all (or most) of them are different from the real class label, the sample will be treated as noise and removed from the dataset. On the contrary, the sample is retained. Finally, all the retained samples make up a cleaned dataset

  4. Oversample from the cleaned dataset with SMOTE until the class distribution of the dataset is balanced

  5. After data preprocessing with CVCF-SMOTE, the new dataset is divided into a training set and a testing set

  6. Set the search range for the penalty factor C and kernel parameter γ. Initialize particle swarm

  7. Evaluate the fitness of each particle based on equation (10). Calculate the linguistic values of Inertia, Social, Cognitive, η, and λ according to equations (13)-(22)

  8. Convert the language values of Inertia, Social, Cognitive, η, and λ into numerical values based on equation (23) and Table 3. Update the velocity and position of each particle based on equations (11) and (12)

  9. Determine whether the maximum number of iterations has been reached. If it is reached, the optimized SVM is output. Otherwise, return to steps (7) and (8)

  10. Apply the optimized SVM on the testing set
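The following end-to-end sketch mirrors the steps above, reusing the cvcf_filter and svm_cv_accuracy helpers sketched earlier and imbalanced-learn's SMOTE. For brevity, a plain random search over (C, γ) stands in for FPSO, so this is a simplified stand-in for the paper's optimizer, and it assumes numerically encoded features.

```python
# Simplified pipeline sketch: CVCF noise filtering -> SMOTE -> SVM with
# (C, gamma) chosen by a random-search stand-in for FPSO.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def fit_pipeline(X, y, n_particles=12, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    X_clean, y_clean = cvcf_filter(X, y)                                   # steps 1-3: CVCF noise removal
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_clean, y_clean) # step 4: balance classes
    X_tr, X_te, y_tr, y_te = train_test_split(                             # step 5: train/test split
        X_bal, y_bal, test_size=0.3, stratify=y_bal, random_state=seed)

    best_params, best_fit = None, -np.inf
    for _ in range(n_particles * n_iter):                                  # steps 6-9 (search stand-in)
        C, gamma = rng.uniform(1e-3, 30, size=2)                           # search range [0, 30]
        fit = svm_cv_accuracy(X_tr, y_tr, C, gamma)                        # fitness = accuracy, eq. (10)
        if fit > best_fit:
            best_params, best_fit = (C, gamma), fit

    model = SVC(C=best_params[0], gamma=best_params[1], kernel="rbf").fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)                                  # step 10: apply to the test set
```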

Figure 2.

Flowchart of the proposed hybrid method for predicting postoperative survival of LCPs.

3. Experiments and Results

3.1. Experiment Design

To evaluate our proposed hybrid method, we compare it with several state-of-the-art algorithms including PSO-optimized SVM (PSO-SVM), SVM, k-nearest neighbor (KNN) [37], random forest (RF) [38], gradient boosting decision tree (GBDT) [39], and AdaBoost [40]. In addition, we consider six preprocessing approaches, including CVCF-SMOTE, Borderline-SMOTE (B-SMOTE) [41], Safe-Level-SMOTE (SL-SMOTE) [42], SMOTE-TL [43], SMOTE, and no preprocessing (marked as NONE), to explore the performance of our proposed CVCF-SMOTE method. B-SMOTE, SL-SMOTE, and SMOTE-TL are three representative SMOTE extensions, which can handle imbalanced data with noise. In addition, in order to better evaluate the effectiveness of the proposed hybrid method, we tested its performance on two other imbalanced datasets. The value range of the penalty factor C and kernel parameter γ is set to [0, 30], and the maximum number of iterations is set to 30. All of these algorithms are programmed in the Python programming language, except for CVCF-SMOTE, which is run in the KEEL software [44]. To eliminate randomness, experiments are repeated 10 times and the average performance is reported in this study.

3.2. Performance Metrics

In this section, we introduce the widely used performance metrics selected for imbalanced data classification, including accuracy (defined by equation (10)), G-mean, F1, and AUC. They can be calculated from the confusion matrix in Table 2.

\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}, (24)
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, (25)

where precision = TP/(TP + FP) and recall = TP/(TP + FN). Precision can be regarded as a measure of the exactness of a classifier, while recall can be regarded as a measure of the completeness of a classifier.

AUC is defined as the area under the ROC curve and the coordinate axis. AUC is very suitable for the evaluation of imbalanced data classifiers because it is not sensitive to imbalanced distribution and error classification costs, and it can achieve the balance between true positive and false positive [45].
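Assuming binary labels with 1 as the positive (minority) class, the four metrics can be computed from a confusion matrix as sketched below.

```python
# Sketch of the evaluation metrics: accuracy (eq. (10)), G-mean (eq. (24)),
# F1 (eq. (25)) and AUC from predicted labels and scores.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # assumes binary labels {0, 1}
    g_mean = np.sqrt(tp / (tp + fn) * tn / (tn + fp))           # eq. (24)
    return {
        "accuracy": accuracy_score(y_true, y_pred),             # eq. (10)
        "g_mean": g_mean,
        "f1": f1_score(y_true, y_pred),                         # eq. (25)
        "auc": roc_auc_score(y_true, y_score),                  # area under the ROC curve
    }
```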

3.3. Result and Discussion

Tables 4–7 demonstrate the accuracy, G-mean, F1, and AUC values of different algorithms under different preprocessing methods for predicting postoperative survival of LCPs, respectively. The best experimental results of different preprocessing methods are marked in bold. We can see from Tables 4–7 that the proposed CVCF-SMOTE+FPSO-SVM model obtains the best performance among all methods with 95.11% accuracy, 95.10% G-mean, 95.02% F1, and 95.10% AUC. This shows that our proposed hybrid method can balance the classification accuracy of the minority class and the majority class while ensuring overall accuracy. That is, the proposed CVCF-SMOTE+FPSO-SVM method has a higher recognition rate both for patients who survived longer than 1 year after LC surgery and for those who survived less than 1 year.

Table 4.

Accuracy comparison for different algorithms with different preprocessing methods.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0.8440 0.7149 0.6385 0.7378 0.8679 0.9511
PSO-SVM 0.8440 0.6570 0.6217 0.6776 0.7267 0.8643
SVM 0.8440 0.5294 0.5561 0.4781 0.5493 0.5204
RF 0.8369 0.7149 0.6023 0.7388 0.8430 0.8869
GBDT 0.8156 0.7059 0.5864 0.7025 0.8213 0.9276
KNN 0.8227 0.6561 0.5833 0.6910 0.7905 0.9005
AdaBoost 0.7943 0.6652 0.5615 0.6458 0.7674 0.9095

Table 5.

G-mean comparison for different algorithms with different preprocessing methods.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0 0.6942 0.6148 0.7203 0.8625 0.9510
PSO-SVM 0 0.5832 0.5628 0.6150 0.6567 0.8501
SVM 0 0 0 0.1537 0.1015 0.1659
RF 0 0.7092 0.6017 0.7385 0.8404 0.8868
GBDT 0.2938 0.6901 0.5835 0.7024 0.8154 0.9274
KNN 0 0.6572 0.5819 0.6874 0.7919 0.9000
AdaBoost 0.2059 0.6550 0.5552 0.6464 0.7597 0.9096

Table 6.

F1 comparison for different algorithms with different preprocessing methods.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0 0.6612 0.5549 0.7059 0.8482 0.9502
PSO-SVM 0 0.5089 0.4995 0.5600 0.6022 0.8336
SVM 0 0 0 0.2823 0.0605 0.0536
RF 0 0.6834 0.5713 0.7458 0.8241 0.8889
GBDT 0.1333 0.6524 0.5470 0.7025 0.7950 0.9292
KNN 0 0.6545 0.5473 0.7094 0.7760 0.9035
AdaBoost 0.0645 0.6186 0.5101 0.6425 0.7323 0.9099

Table 7.

AUC comparison for different algorithms with different preprocessing methods.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0.5000 0.7265 0.6268 0.7400 0.8639 0.9510
PSO-SVM 0.5000 0.6426 0.6069 0.6754 0.7094 0.8631
SVM 0.5000 0.5000 0.5000 0.4993 0.5059 0.5138
RF 0.4958 0.7115 0.6038 0.7397 0.8411 0.8873
GBDT 0.5202 0.6993 0.5857 0.7052 0.8171 0.9281
KNN 0.4874 0.6581 0.5842 0.6919 0.7927 0.9010
AdaBoost 0.4891 0.6603 0.5582 0.6483 0.7621 0.9097

In addition, it is easy to see from Tables 5–7 that the G-mean, F1, and AUC performances of the different classifiers on the original dataset without preprocessing are extremely poor. However, it can be found from Table 4 that the classification accuracy of all the classifiers on the original dataset is higher than the accuracy after SMOTE preprocessing. This indicates susceptibility to imbalanced data: although the classifiers perform well on the majority class, they perform very poorly on the minority class. That is to say, these classifiers fail to balance the classification accuracy between LCPs whose postoperative survival time is longer than 1 year and those whose survival time is less than 1 year.

For the performance after preprocessing with SMOTE, we found that the G-mean, F1, and AUC values of most classifiers (except SVM) are higher than those of the original dataset. However, as can be seen from Table 4, the accuracy of all classifiers with SMOTE is lower than that of the original dataset. This shows that although SMOTE can balance precision and recall, it leads to a decrease in accuracy. For the three SMOTE extensions SL-SMOTE, SMOTE-TL, and B-SMOTE, we find that B-SMOTE has the most competitive performance. B-SMOTE+FPSO-SVM obtained the experimental results second only to CVCF-SMOTE+FPSO-SVM.

Figure 3 shows the stacked histograms of accuracy, G-mean, F1, and AUC for different algorithms under different preprocessing methods. It can be seen from Figure 3 that our proposed CVCF-SMOTE+FPSO-SVM has the best performance in predicting postoperative survival of LCPs. The main reasons behind the experimental results are as follows: first, CVCF identifies and removes noise to improve the data quality so that blind oversampling can be reduced when applying SMOTE. Second, FPSO-SVM can search the optimal parameters of SVM adaptively, which improves the classification accuracy of SVM.

Figure 3.

Stacked histograms of accuracy, G-mean, F1, and AUC for different algorithms under different preprocessing methods.

In order to further test the difference between CVCF-SMOTE+FPSO-SVM and the other combination methods, a paired t-test was conducted between CVCF-SMOTE+FPSO-SVM and the best result under each preprocessing method. A p value less than 0.05 is considered statistically significant in the experiment. From Table 8, it can be seen that CVCF-SMOTE+FPSO-SVM achieves significantly better results than the best results under the different preprocessing methods in terms of accuracy, F1, G-mean, and AUC at the prescribed statistical significance level of 5%.
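The paired t-test over the 10 repeated runs can be reproduced with scipy's ttest_rel; the per-run scores below are hypothetical placeholders, not the paper's raw results.

```python
# Sketch of the paired t-test used in Table 8; the two score lists are
# hypothetical per-run accuracies over the 10 repetitions.
from scipy.stats import ttest_rel

acc_proposed = [0.951, 0.949, 0.953, 0.950, 0.952, 0.948, 0.954, 0.951, 0.950, 0.953]
acc_baseline = [0.868, 0.870, 0.865, 0.872, 0.866, 0.869, 0.871, 0.867, 0.870, 0.868]

t_stat, p_value = ttest_rel(acc_proposed, acc_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # p < 0.05 -> significant difference
```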

Table 8.

Paired t-test results of CVCF-SMOTE+FPSO-SVM and the best performance under different preprocessing methods in terms of accuracy, F1, G-mean, and AUC on the thoracic surgery dataset. Each entry is the t statistic with the p value in parentheses. For the CVCF-SMOTE row, the test compares the best result with the second-best result.

Methods Accuracy F1 G-mean AUC
NONE 11.034 (0.000) 25.502 (0.000) 21.102 (0.000) 27.01 (0.000)
SMOTE 14.348 (0.000) 16.01 (0.000) 10.261 (0.000) 12.469 (0.000)
SL-SMOTE 29.947 (0.000) 25.764 (0.000) 30.349 (0.000) 31.255 (0.000)
SMOTE-TL 29.815 (0.000) 30.281 (0.000) 22.248 (0.000) 26.895 (0.000)
B-SMOTE 6.541 (0.000) 5.176 (0.001) 5.297 (0.000) 5.997 (0.000)
CVCF-SMOTE 5.237 (0.001) 4.994 (0.001) 4.67 (0.001) 4.719 (0.001)

We also compare the accuracy of our proposed model with previous studies, as shown in Table 9. We can see from Table 9 that the accuracy of the CVCF-SMOTE+FPSO-SVM model is higher than that of the methods in the previous literature. Finally, we compare the ROC curves of different algorithms under different preprocessing methods, as shown in Figure 4. The greater the AUC value, the better the classifier performance. It can be seen that the AUC of our proposed CVCF-SMOTE+FPSO-SVM is the largest, which means that our proposed model outperforms the other comparison methods for predicting postoperative survival of LCPs.

Table 9.

Comparative results with previous studies based on accuracy.

Authors Methods Accuracy
Mangat and Vig [3] DA-AC 82.18%
Elyan and Gaber [46] RFGA 84.67%
Li et al. [47] STDPNF 85.32%
Muthukumar and Krishnan [48] IFSSs 88%
Saber Iraji [4] ELM (wave kernel) 88.79%
Our work CVCF-SMOTE+FPSO-SVM 95.11%

Figure 4.

ROC curve comparison of different algorithms under different preprocessing methods.

In order to further show that the performance of our proposed FPSO-SVM is superior to that of PSO-SVM, we draw the fitness curves of the two algorithms. Figures 5(a) and 5(b) show the fitness curves of FPSO-SVM and PSO-SVM with CVCF-SMOTE preprocessing. As can be seen from Figures 5(a) and 5(b), compared with PSO-SVM, FPSO-SVM not only reaches a higher fitness value but also converges faster. This shows that our proposed FPSO-SVM algorithm can identify the optimal solution in the search space faster and more accurately than PSO-SVM.

Figure 5.

Fitness curves of FPSO-SVM (a) and PSO-SVM (b) with CVCF-SMOTE.

3.4. Works on Other Datasets

To show the generalization ability of our proposed method, we apply CVCF-SMOTE+FPSO-SVM to two other imbalanced datasets collected from KEEL (https://sci2s.ugr.es/keel/) [44]. Table 10 shows the details of the two selected datasets.

Table 10.

Details of Haberman and appendicitis datasets.

Datasets Case number Attribute number Class distribution
Haberman 306 3 225/81
Appendicitis 106 7 85/21

Tables 11 and 12 show the accuracy and AUC of different algorithms with different preprocessing methods on the Haberman dataset. It can be seen from Tables 11 and 12 that the accuracy and AUC of CVCF-SMOTE+FPSO-SVM are higher than those of all other preprocessing and classifier combinations. As shown in Table 13, the results of the paired t-test also show that CVCF-SMOTE+FPSO-SVM is significantly better than the best experimental results under the different preprocessing methods on the Haberman dataset. For the appendicitis dataset, it can be seen from Tables 14 and 15 that CVCF-SMOTE+FPSO-SVM also obtains the highest accuracy and AUC compared with the other preprocessing method and classifier combinations. As can be seen from Table 16, for the appendicitis dataset, CVCF-SMOTE+FPSO-SVM achieves significantly better results than the best performance under NONE, SMOTE, SL-SMOTE, and B-SMOTE; however, the difference from the best performance under SMOTE-TL is not significant.

Table 11.

Accuracy comparison for different algorithms with different preprocessing methods on the Haberman dataset.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0.7402 0.6890 0.6386 0.7396 0.7795 0.8205
PSO-SVM 0.7098 0.6435 0.6504 0.6538 0.6831 0.7205
SVM 0.7196 0.6291 0.6409 0.6423 0.6772 0.7165
RF 0.6989 0.6795 0.6142 0.7315 0.7559 0.7772
GBDT 0.6837 0.6606 0.6299 0.7252 0.7465 0.7764
KNN 0.7174 0.6630 0.6417 0.7000 0.7449 0.7992
AdaBoost 0.7163 0.6402 0.6331 0.6117 0.6819 0.7559

Table 12.

AUC comparison for different algorithms with different preprocessing methods on the Haberman dataset.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0.5274 0.6813 0.6288 0.7310 0.7748 0.8206
PSO-SVM 0.5012 0.6131 0.6325 0.6669 0.6518 0.7121
SVM 0.5077 0.6096 0.6246 0.6598 0.6566 0.7035
RF 0.5731 0.6815 0.6132 0.7283 0.7588 0.7784
GBDT 0.5492 0.6607 0.6274 0.7226 0.7475 0.7765
KNN 0.5737 0.6649 0.6418 0.6997 0.7433 0.8009
AdaBoost 0.5809 0.6359 0.6293 0.6118 0.6779 0.7549

Table 13.

Paired t-test results of CVCF-SMOTE+FPSO-SVM and the best performance under different preprocessing methods in terms of accuracy and AUC on the Haberman dataset.

Methods Accuracy AUC
NONE 6.603 (0.000) 18.744 (0.000)
SMOTE 6.555 (0.000) 10.315 (0.000)
SL-SMOTE 15.959 (0.000) 15.806 (0.000)
SMOTE-TL 4.506 (0.001) 3.539 (0.006)
B-SMOTE 2.601 (0.029) 2.83 (0.02)
CVCF-SMOTE 4.669 (0.001) 4.392 (0.002)

Table 14.

Accuracy comparison for different algorithms with different preprocessing methods on the appendicitis dataset.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0.8688 0.8792 0.8208 0.9381 0.9167 0.9511
PSO-SVM 0.8625 0.8713 0.7620 0.8104 0.8714 0.9277
SVM 0.8469 0.7979 0.7854 0.8310 0.8813 0.9021
RF 0.8438 0.8438 0.7271 0.8714 0.9083 0.9106
GBDT 0.8188 0.8479 0.7146 0.8690 0.8917 0.9085
KNN 0.8500 0.7708 0.7354 0.8476 0.8708 0.8957
AdaBoost 0.8031 0.8396 0.7458 0.8690 0.8896 0.9106

Table 15.

AUC comparison for different algorithms with different preprocessing methods on the appendicitis dataset.

Algorithms NONE SMOTE SL-SMOTE SMOTE-TL B-SMOTE CVCF-SMOTE
FPSO-SVM 0.6878 0.8807 0.8167 0.9411 0.9135 0.9512
PSO-SVM 0.5893 0.7602 0.7708 0.9311 0.8917 0.9239
SVM 0.6674 0.7966 0.7832 0.8423 0.8788 0.8982
RF 0.6930 0.8475 0.7324 0.8755 0.9064 0.9070
GBDT 0.6460 0.8539 0.7207 0.8713 0.8909 0.9092
KNN 0.6885 0.7736 0.7374 0.8499 0.8676 0.8954
AdaBoost 0.6352 0.8461 0.7492 0.8685 0.8888 0.9102

Table 16.

Paired t-test results of CVCF-SMOTE+FPSO-SVM and the best performance under different preprocessing methods in terms of accuracy and AUC on the appendicitis dataset.

Methods Accuracy AUC
NONE 6.591 (0.000) 15.628 (0.000)
SMOTE 4.562 (0.001) 5.176 (0.001)
B-SMOTE 3.024 (0.014) 3.373 (0.008)
SL-SMOTE 6.227 (0.000) 7.009 (0.000)
SMOTE-TL 1.089 (0.304) 0.785 (0.453)
CVCF-SMOTE 2.764 (0.022) 2.787 (0.21)

From the experimental results, we see that CVCF-SMOTE+FPSO-SVM outperforms the compared algorithms for both the thoracic surgery dataset and the other two imbalanced datasets. On the one hand, it is because CVCF-improved SMOTE is well adapted to different datasets. On the other hand, FPSO-SVM automatically adjusts the optimal parameters according to different datasets, thus improving the generalization ability of the SVM.

3.5. Running Time Analysis

We compare the running time of CVCF-SMOTE+FPSO-SVM with that of the algorithm with the highest accuracy among all the compared methods. For the thoracic surgery, Haberman, and appendicitis datasets, the most accurate compared methods are CVCF-SMOTE+GBDT, CVCF-SMOTE+KNN, and SMOTE-TL+FPSO-SVM, respectively. In addition, in order to compare the running time of FPSO-SVM with that of PSO-SVM, CVCF-SMOTE+PSO-SVM is also included in the comparison. The comparison results are shown in Table 17. It can be seen from Table 17 that the running time of CVCF-SMOTE+FPSO-SVM is shorter than that of CVCF-SMOTE+PSO-SVM on all three datasets. However, the running time of CVCF-SMOTE+FPSO-SVM is longer than that of CVCF-SMOTE+GBDT, CVCF-SMOTE+KNN, and SMOTE-TL+FPSO-SVM for the thoracic surgery, Haberman, and appendicitis datasets, respectively. Considering the higher classification performance of our proposed method, it can still be considered superior to the other algorithms.

Table 17.

Running time (in seconds) of CVCF-SMOTE+FPSO-SVM and state-of-the-art algorithms.

Datasets Algorithms
Thoracic surgery CVCF-SMOTE+GBDT CVCF-SMOTE+PSO-SVM CVCF-SMOTE+FPSO-SVM
31.2 53.6 43.5

Haberman CVCF-SMOTE+KNN CVCF-SMOTE+PSO-SVM CVCF-SMOTE+FPSO-SVM
18.8 27.5 24.5

Appendicitis SMOTE-TL+FPSO-SVM CVCF-SMOTE+PSO-SVM CVCF-SMOTE+FPSO-SVM
13.8 22.2 17.3

4. Conclusion

In this work, we proposed a hybrid improved SMOTE and adaptive SVM method to predict the postoperative survival of LCPs. In our proposed hybrid model, CVCF is adopted to clean the data noise to improve the performance of SMOTE. Then, we use the FPSO-optimized SVM to estimate whether the postoperative survival of LCPs is greater than one year. Experimental results show that our proposed CVCF-SMOTE+FPSO-SVM hybrid method obtains the best accuracy, G-mean, F1, and AUC compared with the other algorithms for postoperative survival prediction of LCPs.

Our proposed hybrid method can provide valuable medical decision-making support for LCPs and doctors. Considering the excellent classification performance for the other two imbalanced datasets, in the future, we will try to apply the proposed method to other problems based on imbalanced data, such as disease diagnosis and financial fraud detection. There are two limitations that need to be pointed out: one is that we only consider the 1-year survival after lung cancer surgery. In future studies, we will try to predict survival at other time points, such as survival 3 or 5 years after lung cancer surgery. The other is that the value range of the parameters of SVM in FPSO-SVM needs to be set manually, which may require some experience or experimental attempts. Designing a setting-free SVM is our future research direction.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (71971123).

Data Availability

The dataset for this study can be obtained from the UCI machine learning database (http://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1.Rotman J. A., Plodkowski A. J., Hayes S. A., et al. Postoperative complications after thoracic surgery for lung cancer. Clinical Imaging. 2015;39(5):735–749. doi: 10.1016/j.clinimag.2015.05.013. [DOI] [PubMed] [Google Scholar]
  • 2.Osuoha C. A., Callahan K. E., Ponce C. P., Pinheiro P. S. Disparities in lung cancer survival and receipt of surgical treatment. Lung Cancer. 2018;122:54–59. doi: 10.1016/j.lungcan.2018.05.022. [DOI] [PubMed] [Google Scholar]
  • 3.Mangat V., Vig R. Novel associative classifier based on dynamic adaptive PSO: application to determining candidates for thoracic surgery. Expert Systems with Applications. 2014;41(18):8234–8244. doi: 10.1016/j.eswa.2014.06.046. [DOI] [Google Scholar]
  • 4.Iraji M. S. Prediction of post-operative survival expectancy in thoracic lung cancer surgery with soft computing. Journal of Applied Biomedicine. 2017;15(2):151–159. doi: 10.1016/j.jab.2016.12.001. [DOI] [PubMed] [Google Scholar]
  • 5.Zięba M., Tomczak J. M., Lubicz M., Świątek J. Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Applied Soft Computing. 2014;14:99–108. doi: 10.1016/j.asoc.2013.07.016. [DOI] [Google Scholar]
  • 6.Haixiang G., Yijing L., Shang J., Mingyun G., Yuanyue H., Bing G. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications. 2017;73:220–239. doi: 10.1016/j.eswa.2016.12.035. [DOI] [Google Scholar]
  • 7.Tsai C.-F., Lin W.-C., Hu Y.-H., Yao G.-T. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences. 2019;477:47–54. doi: 10.1016/j.ins.2018.10.029. [DOI] [Google Scholar]
  • 8.Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357. doi: 10.1613/jair.953. [DOI] [Google Scholar]
  • 9.Sáez J. A., Luengo J., Stefanowski J., Herrera F. SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences. 2015;291:184–203. doi: 10.1016/j.ins.2014.08.051. [DOI] [Google Scholar]
  • 10.Douzas G., Bacao F., Last F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences. 2018;465:1–20. doi: 10.1016/j.ins.2018.06.056. [DOI] [Google Scholar]
  • 11.Ramentol E., Caballero Y., Bello R., Herrera F. SMOTE-RSB: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information Systems. 2011;33(2):245–265. doi: 10.1007/s10115-011-0465-6. [DOI] [Google Scholar]
  • 12.Zhang J., Ng W. W. Stochastic sensitivity measure-based noise filtering and oversampling method for imbalanced classification problems. In 2018 IEEE international conference on systems, man, and cybernetics (SMC); 2018; IEEE. pp. 403–408. [Google Scholar]
  • 13.Ma L., Fan S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics. 2017;18(1):169–169. doi: 10.1186/s12859-017-1578-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Luengo J., Shim S.-O., Alshomrani S., Altalhi A., Herrera F. CNC-NOS: class noise cleaning by ensemble filtering and noise scoring. Knowledge-Based Systems. 2018;140:27–49. doi: 10.1016/j.knosys.2017.10.026. [DOI] [Google Scholar]
  • 15.Afanasyev D. O., Fedorova E. A. On the impact of outlier filtering on the electricity price forecasting accuracy. Applied Energy. 2019;236:196–210. doi: 10.1016/j.apenergy.2018.11.076. [DOI] [Google Scholar]
  • 16.Tao Z., Huiling L., Wenwen W., Xia Y. GA-SVM based feature selection and parameter optimization in hospitalization expense modeling. Applied Soft Computing. 2018;75:323–332. doi: 10.1016/j.asoc.2018.11.001. [DOI] [Google Scholar]
  • 17.D’Addabbo A., Maglietta R. Parallel selective sampling method for imbalanced and large data classification. Pattern Recognition Letters. 2015;62:61–67. doi: 10.1016/j.patrec.2015.05.008. [DOI] [Google Scholar]
  • 18.Huang B., et al. Imbalanced data classification algorithm based on clustering and SVM. Journal of Circuits, Systems and Computers. 2020. [Google Scholar]
  • 19.Fan Y., Cui X., Han H., Lu H. Chiller fault diagnosis with field sensors using the technology of imbalanced data. Applied Thermal Engineering. 2019;159(10):p. 113933. doi: 10.1016/j.applthermaleng.2019.113933. [DOI] [Google Scholar]
  • 20.Moradi P., Gholampour M. A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy. Applied Soft Computing. 2016;43:117–130. doi: 10.1016/j.asoc.2016.01.044. [DOI] [Google Scholar]
  • 21.Wei J., Zhang R., Yu Z., et al. A BPSO-SVM algorithm based on memory renewal and enhanced mutation mechanisms for feature selection. Applied Soft Computing. 2017;58:176–192. doi: 10.1016/j.asoc.2017.04.061. [DOI] [Google Scholar]
  • 22.Zeng N., Qiu H., Wang Z., Liu W., Zhang H., Li Y. A new switching-delayed-PSO-based optimized SVM algorithm for diagnosis of Alzheimer's disease. Neurocomputing. 2018;320:195–202. doi: 10.1016/j.neucom.2018.09.001. [DOI] [Google Scholar]
  • 23.Wang G. G., Deb S., Cui Z. Monarch butterfly optimization. Neural Computing and Applications. 2015;31 doi: 10.1007/s00521-015-1923-y. [DOI] [Google Scholar]
  • 24.Li S., Chen H., Wang M., Heidari A. A., Mirjalili S. Slime mould algorithm: a new method for stochastic optimization. Future Generation Computer Systems. 2020;111:300–323. doi: 10.1016/j.future.2020.03.055. [DOI] [Google Scholar]
  • 25.Wang G.-G. Moth search algorithm: a bio-inspired metaheuristic algorithm for global optimization problems. Memetic Computing. 2018;10(2):151–164. doi: 10.1007/s12293-016-0212-3. [DOI] [Google Scholar]
  • 26.Yang Y., Chen H., Heidari A. A., Gandomi A. H. Hunger games search: visions, conception, implementation, deep analysis, perspectives, and towards performance shifts. Expert Systems with Applications. 2021;177:p. 114864. doi: 10.1016/j.eswa.2021.114864. [DOI] [Google Scholar]
  • 27.Heidari A. A., Mirjalili S., Faris H., Aljarah I., Mafarja M., Chen H. Harris hawks optimization: algorithm and applications. Future Generation Computer Systems. 2019;97:849–872. doi: 10.1016/j.future.2019.02.028. [DOI] [Google Scholar]
  • 28.Nobile M. S., Cazzaniga P., Besozzi D., Colombo R., Mauri G., Pasi G. Fuzzy self-tuning PSO: a settings-free algorithm for global optimization. Swarm and Evolutionary Computation. 2018;39:70–85. doi: 10.1016/j.swevo.2017.09.001. [DOI] [Google Scholar]
  • 29.Verbaeten S., Van Assche A. Ensemble methods for noise elimination in classification problems. In international workshop on multiple classifier systems; 2003; Springer, Berlin, Heidelberg. pp. 317–325. [Google Scholar]
  • 30.Lee S.-J., Xu Z., Li T., Yang Y. A novel bagging C4.5 algorithm based on wrapper feature selection for supporting wise clinical decision making. Journal of Biomedical Informatics. 2017;78:144–155. doi: 10.1016/j.jbi.2017.11.005. [DOI] [PubMed] [Google Scholar]
  • 31.Garcia L. P. F., Lehmann J., de Carvalho A. C. P. L. F., Lorena A. C. New label noise injection methods for the evaluation of noise filters. Knowledge Based Systems. 2019;163:693–704. doi: 10.1016/j.knosys.2018.09.031. [DOI] [Google Scholar]
  • 32.Quinlan J. R. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research. 1996;4(1):77–90. doi: 10.1613/jair.279. [DOI] [Google Scholar]
  • 33.Cortes C., Vapnik V. N. Support-vector networks. Machine Learning. 1995;20(3):273–297. doi: 10.1007/BF00994018. [DOI] [Google Scholar]
  • 34.Hansen N., Ros R., Mauny N., Schoenauer M., Auger A. Impacts of invariance in search: when CMA-ES and PSO face ill-conditioned and non-separable problems. Applied Soft Computing. 2011;11(8):5755–5769. doi: 10.1016/j.asoc.2011.03.001. [DOI] [Google Scholar]
  • 35.Nobile M. S., Pasi G., Cazzaniga P., Besozzi D., Colombo R., Mauri G. Proactive particles in swarm optimization: a self-tuning algorithm based on fuzzy logic. In 2015 IEEE international conference on fuzzy systems (FUZZ-IEEE); 2015; IEEE. pp. 1–8. [Google Scholar]
  • 36.Sugeno M. Industrial Applications of Fuzzy Control. Elsevier Science Inc.; 1985. [Google Scholar]
  • 37.Altman N. S. An introduction to kernel and nearest-neighbor nonparametric regression. American Statistician. 1992;46(3):175–185. doi: 10.1080/00031305.1992.10475879. [DOI] [Google Scholar]
  • 38.Ho T. K. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition; IEEE. 1995. pp. 278–282. [Google Scholar]
  • 39.Friedman J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5) doi: 10.1214/aos/1013203451. [DOI] [Google Scholar]
  • 40.Freund Y. Boosting a weak learning algorithm by majority. Information and Computation. 1995;121(2):256–285. doi: 10.1006/inco.1995.1136. [DOI] [Google Scholar]
  • 41.Han H., Wang W. Y., Mao B. H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In international conference on intelligent computing; 2005; Springer, Berlin, Heidelberg. pp. 878–887. [Google Scholar]
  • 42.Bunkhumpornpat C., Sinapiromsaran K., Lursinsap C. Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia conference on knowledge discovery and data mining; 2009; Springer, Berlin, Heidelberg. pp. 475–482. [Google Scholar]
  • 43.Batista G. E., Prati R. C., Monard M. C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter. 2004;6(1):20–29. doi: 10.1145/1007730.1007735. [DOI] [Google Scholar]
  • 44.Alcala-fdez J. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple Valued Logic & Soft Computing. 2011;17(2-3):255–287. [Google Scholar]
  • 45.Veganzones D., Severin E. An investigation of bankruptcy prediction in imbalanced datasets. Decision Support Systems. 2018;112:111–124. doi: 10.1016/j.dss.2018.06.011. [DOI] [Google Scholar]
  • 46.Elyan E., Gaber M. M. A genetic algorithm approach to optimising random forests applied to class engineered data. Information Sciences. 2017;384:220–234. doi: 10.1016/j.ins.2016.08.007. [DOI] [Google Scholar]
  • 47.Li J., Zhu Q., Wu Q. A self-training method based on density peaks and an extended parameter-free local noise filter for k nearest neighbor. Knowledge-Based Systems. 2019;184:p. 104895. doi: 10.1016/j.knosys.2019.104895. [DOI] [Google Scholar]
  • 48.Muthukumar P., Krishnan G. S. S. A similarity measure of intuitionistic fuzzy soft sets and its application in medical diagnosis. Applied Soft Computing. 2016;41:148–156. doi: 10.1016/j.asoc.2015.12.002. [DOI] [Google Scholar]
