PLOS ONE. 2020 Oct 15;15(10):e0237658. doi: 10.1371/journal.pone.0237658

Modeling and comparing data mining algorithms for prediction of recurrence of breast cancer

Alireza Mosayebi 1,#, Barat Mojaradi 2,*,#, Ali Bonyadi Naeini 1,#, Seyed Hamid Khodadad Hosseini 3,#
Editor: Bryan C Daniels4
PMCID: PMC7561198  PMID: 33057328

Abstract

Breast cancer is the most common invasive cancer and the second leading cause of cancer death in women, and regrettably, its rate is increasing every year. One aspect of all cancers, including breast cancer, is the recurrence of the disease, which has painful consequences for patients. The practical application of data mining in the field of breast cancer can provide physicians with the information and knowledge they need for accurate prediction of breast cancer recurrence and better decision-making. The main objective of this study is to compare different data mining algorithms and select the most accurate model for predicting breast cancer recurrence. This study is cross-sectional; data were gathered from June 2018 to June 2019 from the official statistics of the Ministry of Health and Medical Education and the Iran Cancer Research Center for patients with breast cancer who had been followed for a minimum of 5 years from February 2014 to April 2019, comprising 5471 independent records. After initial pre-processing of the dataset and variables, seven new and conventional data mining algorithms were applied, each representing one kind of data mining approach. Results show that the C5.0 algorithm could be a helpful tool for predicting breast cancer recurrence at the stage of distant recurrence and nonrecurrence, especially in the first to third years. Also, LN involvement rate, Her2 value, tumor size, and free or closed tumor margin were found to be the most important features in our dataset for predicting breast cancer recurrence.

Introduction

Cancer refers to any one of a large number of diseases (over 100) characterized by the development of abnormal cells that divide uncontrollably due to genetic mutation in DNA and can infiltrate and destroy normal body tissue. What all of them share is a deficiency in the mechanisms that regulate the cells' natural growth, proliferation, and death. The cancerous cells can invade nearby tissues and eventually spread to other parts of the body. Cancer is the second-leading cause of death in the world [1, 2].

Breast cancer is the most common cancer among women all over the world, and its incidence is increasing by 2 percent annually [1, 3–5]. Although the average age at onset of breast cancer is 40–50 years, it has also occurred in 25-year-olds, and the age of onset is declining annually. Given the high incidence and prevalence of the disease, its high treatment cost, and its occurrence among young women who are socially and economically productive, it is notable that this disease can be one of the most treatable if it is diagnosed early [6–8]. So, the significance of the prediction and treatment of cancer is more obvious than before [9–11]. The disease occurs when people are exposed to carcinogenic materials through inhaling, eating, drinking, and exposure at the workplace and in the environment. Cancer is a multifactorial disease; besides genetic factors, personal lifestyle, smoking, and diet play important roles in its etiology. In a healthy organism there is always a balance among cell proliferation, cell senescence, and differentiation; this balance is lost during cancer formation [12–14].

On the other hand, 207 million new cases of breast cancer are likely to be diagnosed in 2030, 60% of them in under-developed countries [4, 14, 15]. Moreover, cancer incidence is increasing in these countries.

Breast cancer is one of the most common diseases among Iranian women. Oncological studies show that the prevalence of breast cancer among Iranian women is increasing, and the age of onset of this disease is decreasing with more disease intensity [8, 16]. On the other hand, different factors influence the deterioration of the disease and its overall survival. Because it is complicated to control the causes of breast cancer, early diagnosis along with a suitable treatment is an important strategy for prognosis improvement [11, 17].

In Iran, the rate of cancer development is predicted to rise and to double during the next two decades, owing to population growth, increased life expectancy, the relative growth of the elderly population, the annual increase in cancer, and accelerating changes in the factors that drive cancer prevalence, such as the spread of cancer risk factors [8, 16]. Therefore, this disease and the programs to control it are of utmost importance. International studies suggest that about 30% of women will develop recurrence after the primary treatment for breast cancer, with lower figures for early-stage disease [18, 19].

Also, it is the most common malignancy among Iranian women and the focus of attention. In recent years, the prevalence of the disease has been increasing, and data suggest that the survival rate of patients up to five years and ten years after diagnosis was 88% and 80%, respectively. In reality, not all tumors are cancerous, but they may be benign or malignant [6, 13, 20, 21].

Routine follow-up protocols currently are based on two hypotheses: (i) that most recurrences are diagnosed at an earlier stage through follow up; and (ii) that the previous treatment of recurrences offers a better chance of cure, more prolonged survival, or improvement in the quality of life. However, current data suggest that neither of these hypotheses is correct and that postoperative follow up of patients with breast cancer is costly and time-consuming and does not significantly extend survival [18, 22, 23].

Despite considerable effort to detect recurrent disease early, the evidence suggests that only a minority of recurrences are detected at an asymptomatic stage [5, 24–26].

It has also been found that most patients with recurrence have signs or symptoms as the first indicator of recurrence and that history and physical examination generally provide the first clues to recurrence [27, 28].

Furthermore, most recurrences present at unscheduled appointments and not as a consequence of routine follow up ones [29]. Predicting breast cancer recurrence is one of the most popular measures taken for developing data mining approaches.

Physicians make decisions on clinical scenarios (about diagnosis, investigation, and management) every day of their working lives, based on the balance of probabilities [30].

Although randomized trials of intensive surveillance testing such as more frequent clinical examinations, biannual chest x-rays, and bone scans have shown no mortality benefit, there has been a continued rise in the financial cost and resource utilization devoted to developing more effective follow-up strategies to detect early recurrences. A multidisciplinary approach is therefore needed to help physicians make better decisions in detecting breast cancer recurrence [31].

Data mining methods may help reduce the number of false positives and false-negative results in physicians' decision-making [12, 32]. Accordingly, new approaches, such as knowledge discovery in databases (KDD), including data mining algorithms, have become increasingly popular and a desirable research tool for medical science researchers. Using them, researchers can identify patterns and relationships among a large number of variables, and using the data available in databases has made it feasible to predict the results of a disease [33].

Many studies have addressed the difficulties of predicting patient survival using statistical methods and artificial neural networks; nevertheless, only a few studies have been conducted in the field of cancer recurrence using data mining methods. Hence, this study compares data mining methods for predicting breast cancer recurrence in Iran and suggests a reasonable model from among several proposed models. Given the importance of the subject, the main purpose of the present study is to assist clinicians and decision-makers in the field of breast cancer in predicting recurrence using seven data mining algorithms. First, the dataset used is introduced. Next, all algorithms and their findings are presented separately. Then the results of the algorithms are compared to provide recommendations to physicians and decision-makers. Finally, the best features according to the data mining algorithms are extracted.

Materials and methods

This study is cross-sectional. Data were gathered from June 2018 to June 2019 from the official statistics of the Ministry of Health and Medical Education and the Iran Cancer Research Center for patients with breast cancer who had been followed for a minimum of 5 years from February 2014 to April 2019, comprising 5471 independent records. Because of a limitation in our dataset (we had complete information for only 16 patients whose relapse occurred after 10 years), we analyzed only cases within a period shorter than 5 years.

Patients with early-stage disease (T1, T2, N0, N+) are eligible for inclusion in this study. All patients have a confirmed pathologic type of infiltrating ductal carcinoma. The following patients are excluded: those presenting with distant metastases (stage IV) or T4 tumors (tumors of any size directly invading the chest wall or skin) at the time of presentation, those with other histopathologic findings of breast cancer, those whose records are incomplete, and those who had not undergone lymph node surgery (NX).

Also, required data are collected from patients' records, including the status of ER, PR, HER-2 receptors, patient age, tumor size, lymph node involvement status, tumor grade, etc. The condition of the receptors is evaluated by immunohistochemistry.

In the present study, as a first step, the chi-square test is used to evaluate the relation between each feature and the recurrence of breast cancer, and P < 0.05 is considered significant. Then, to develop predictive models of breast cancer recurrence, Multilayer Perceptron artificial neural network, Bayesian Neural Network, LVQ neural network, KPCA-SVM, Random Forest, and C5.0 are employed using the above-mentioned database.
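As a rough illustration of the screening step described above, the following sketch computes a Pearson chi-square statistic and its p-value for a 2x2 table using only the Python standard library. The function name `chi_square_2x2` and the counts are hypothetical and are not taken from the study's dataset; the closed-form p-value used here holds only for one degree of freedom, so features with more than two levels (most of those in Table 1) would need a general chi-square distribution function.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table[i][j] = number of patients with feature level i and outcome j.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts (assumption, for illustration only).
counts = [[30, 70],   # feature absent:  recurrence, no recurrence
          [55, 45]]   # feature present: recurrence, no recurrence
stat, p = chi_square_2x2(counts)
print(f"chi2 = {stat:.3f}, p = {p:.5f}, significant at 0.05: {p < 0.05}")
```

A feature would be retained as "significant" under the paper's criterion when the returned p-value is below 0.05.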

Data mining methods

The process of extracting previously unknown, correct, and potentially useful information from data is called data mining. Data mining methods can be divided into unsupervised and supervised learning. In this study, seven different data mining algorithms are exploited for the prediction of breast cancer recurrence. The inputs to the models are 17 key variables extracted from the 23 variables shown in Table 1. Fig 1 represents the algorithm used in the current study. According to the depicted algorithm, the data are first grouped into test and train data; the results of the models are then compared, and finally the best features are selected.

Table 1. Variables relating to the recurrence of breast cancer in a dataset.

No Factor Scale Study p-value
1 Age at diagnosis 0 = "<35" [36–38] 0.207
1 = "35–44"
2 = "45–55"
3 = ">55"
2 Menarche age 0- after the age of 12 [36, 39] 0.621
1- at or before the age of 12 (risky)
3 Menopause age 0- before the age of 50 [36, 39–41] 0.934
1- ≥ age of 50 (risky)
2- menopause has not yet occurred
4 History of infertility 0- No [42, 43] 0.201
1- Yes
5 Family history of breast cancer 0- No [35] 0.325
1- Yes
6 Family history of other cancers 0 = "no" 3 = "colon cancer" [44] 0.146
1 = "male breast cancer" 4 = "ovarian cancer"
2 = "prostate cancer" 5 = "uterus cancer"
7 Tumor site 1 = "uoq" 12 = "upper half" [34, 40, 45, 46] 0.230
2 = "uiq" 13 = "lateral half"
3 = "loq" 14 = "uoq and liq"
4 = "liq" 24 = "medial half"
5 = "central(nipple areole)" 30 = "three quadrant”
6 = "axilla" 34 = "lower half"
50 = "diffuse"
8 Tumor side 1- right breast tumor [45–47] 0.514
2- left breast tumor
3- bilateral tumor
9 Tumor size 1 = "<2" [45–47] 0.002
2 = "2–5"
3 = ">5"
4 = "chest wall or skin"
10 LN involvement rate 0 = "no" [34, 47, 48] 0.001
1 = "1–3"
2 = "4–9"
3 = ">9"
12 Tumor site 1 = "local" 12 = "local & axilla" [45–47, 49] 0.270
2 = "axilla" 13 = "local & regional"
3 = "regional" 14 = "local & local (post MRM)”
4 = "Local (post MRM)"
13 Result of biopsy of pathology 1 = "lcis" 7 = "paget's disease" [38, 50, 51] 0.018
2 = "dcis" 8 = "others"
3 = "ilc" 9 = "inflammatory carcinoma"
4 = "ilc" 11 = "sarcoma"
5 = "medullary" 12 = "metastatic (unknown origin)"
6 = "microinvasion" 13 = "lymphoma"
14 Type of surgery 1 = "MRM" [16, 23, 52] 0.001
2 = "Breast preservation"
3 = "bilateral MRM"
4 = "bilateral BCS"
5 = "bilateral MRM & BCS"
15 Tumor grade 1 = "1" [19, 45–47] 0.093
2 = "2"
3 = "3"
16 Free or closed tumor margin 0 = "free (>= 2 cm)" [45–47] 0.001
1 = "closed (<= 2 cm)"
2 = "involve"
17 Estrogen Receptor value 0- Negative [53] 0.405
1- Positive
Positive value increases recurrence risk
18 Progesterone Receptor value 0- Negative [10, 28, 54] 0.181
1- Positive
Positive value increases recurrence risk
19 Her2 value 0- Negative [55] 0.312
1- Positive
A positive value increases recurrence risk
20 Chemotherapy 0 = "no" [56] 0.002
1 = "yes"
19 = "chemo + Herceptin"
21 Type of chemotherapy (Typechemo) 1 = "neoadjuvant" [50, 53, 56] 0.161
2 = "adjuvant"
22 Radiotherapy 0- No radiotherapy [57, 58] 0.000
1- Radiotherapy has been performed
When no radiotherapy has been performed, cancer recurrence is more likely to occur.
23 Hormone therapy 0 = "no" [58, 59] 0.001
1 = "tamoxifen" 4 = "Aromasin (exemestane)"
2 = "raloxifene" 5 = "Megace"
3 = "Femara (letrozole)" 6 = "others"

Fig 1. The exploited algorithm in this study.


The accuracy of the model is the percentage of test samples that are correctly classified. If the model accuracy is acceptable, the model can be used to classify data whose categories are not specified. In this research, we use a nested 5-fold cross-validation approach to train (four folds) and test (one fold) the models. Patients meeting the inclusion criteria are randomly assigned to one of the five outer folds. To ensure that the important feature set is generated from real patients with breast cancer and that the importance of the features is not inflated by duplicating minority cases, we choose the under-sampling approach to build the model. We randomly select 20 sets of controls in each round of cross-validation, matching the number of cases, and generate 80 training datasets, each using one set of controls and all cases. In each training step, we use 5-fold inner cross-validation to tune the models; this is done to prevent over-fitting. Also, to assess the performance of the classification methods, a confusion matrix is used, from which the accuracy of the obtained model is calculated. The following formulas are used to calculate accuracy, sensitivity, and specificity. The results show the numerical values obtained from the calculations performed by the software and its final analysis. Model performance is evaluated according to the following criteria:

  • Specificity. The probability of correctly predicting non-recurrence: true negatives divided by (false positives + true negatives).

  • TP. The number of samples that are correctly identified as positive.

  • TN. The number of samples that are correctly identified as negative.

  • FP. The number of samples that are incorrectly identified as positive.

  • FN. The number of samples that are incorrectly identified as negative.

  • Confusion matrix. The relationship between the actual classes and the predicted classes is called the confusion matrix.
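The outer cross-validation loop with under-sampling described earlier in this section might be sketched as follows. This is only a minimal illustration under stated assumptions: the helper names `k_fold_indices` and `undersample`, the random seed, and the label vector are hypothetical, and the sketch omits the inner 5-fold tuning loop and the repeated draws of 20 control sets that the study uses.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def undersample(labels, train_idx, rng):
    """Keep all cases (label 1) plus an equal-sized random set of controls (label 0)."""
    cases = [i for i in train_idx if labels[i] == 1]
    controls = [i for i in train_idx if labels[i] == 0]
    sampled = rng.sample(controls, min(len(cases), len(controls)))
    return cases + sampled

rng = random.Random(0)
# Hypothetical labels (assumption): 1 = recurrence (case), 0 = no recurrence (control).
labels = [1] * 20 + [0] * 80
folds = k_fold_indices(len(labels), 5, rng)

for test_fold in range(5):
    test_idx = folds[test_fold]
    # The four remaining folds form the training pool.
    train_idx = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    balanced = undersample(labels, train_idx, rng)
    n_cases = sum(labels[i] for i in balanced)
    print(f"fold {test_fold}: train={len(balanced)} (cases={n_cases}), test={len(test_idx)}")
```

Under-sampling the controls, rather than duplicating cases, keeps every training record a real patient, which is the motivation given in the text.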

Accuracy

This is the number of samples that are correctly identified, relative to the total sample.

Accuracy=(TP+TN)/(TP+TN+FP+FN)

Sensitivity

The probability of correctly predicting recurrence: true positives divided by (false negatives + true positives).

Sensitivity=TP/(TP+FN)

Specificity. Measures a test’s ability to correctly generate a negative result for people who don’t have the condition that’s being tested for (also known as the “true negative” rate).

Specificity=TN/(TN+FP)

ROC

The method based on the receiver operating characteristic curve [9, 34, 35] is used to evaluate the performance of different discriminant models. The area under the ROC curve (AUC) can objectively reflect the overall performance of different algorithms.
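One common way to compute the AUC, shown here only as an illustrative sketch, uses its interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the scores are hypothetical; the text does not state how the study's software computed the AUC.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores (assumption, for illustration only).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_auc(y_true, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.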

F-score

In the statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision (p) and the recall (r) of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
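The evaluation criteria defined in this section can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name `classification_metrics` and the counts are hypothetical, and it assumes every denominator is non-zero. The F1 score is written as the harmonic mean 2pr/(p + r).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall, r
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)            # p
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts (assumption, not results from the study).
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```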

Ethics statement

All data were fully anonymized before we accessed them, and the Ministry of Health and Medical Education and the Iran Cancer Research Center gave their consent to the use of the data. All private information of patients was removed from our dataset, and only the required clinical data are used and presented.

Results

In the present study, seven data mining algorithms are compared based on the accuracy, sensitivity, specificity, F-measure, and area under the ROC curve of the models. The final goal is to obtain the model with the highest performance. To this end, insufficient and defective data are identified and eliminated from the database. Variables are eliminated in two cases: when they overlap with other variables, or when most of the 10337 available records lack information on them. Work on the records begins after the variables are deleted according to these two criteria. Records with no data, or with data limited to 2–3 items, are necessarily eliminated. In the remaining records, the few missing values are imputed by expectation-maximization.

After the elimination of records with missing values, 5471 records remained. For the best prediction, data mining classification techniques including Multilayer Perceptron artificial neural network, Bayesian Neural Network, LVQ neural network, KPCA-SVM, C5.0, and Random Forest are applied. Additionally, several parameters are randomly selected and investigated.

In this study, breast cancer is classified into four groups based on IHC profile ER/PR and Her2/neu expression, positive (+), and/or negative (−). The groups are:

ER/PR+, Her2+ = ER+/PR+, Her2+; ER−/PR+, Her2+; ER+/PR−, Her2+

ER/PR+, Her2− = ER+/PR+, Her2−; ER−/PR+, Her2−; ER+/PR−, Her2−

ER/PR−, Her2+ = ER−/PR−, Her2+

ER/PR−, Her2− = ER−/PR−, Her2−
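The four-group rule above amounts to treating a tumor as hormone-receptor positive when either ER or PR is positive. A minimal sketch of that mapping follows; the function name `ihc_group` and the plain ASCII labels are illustrative choices, not part of the study.

```python
def ihc_group(er_positive, pr_positive, her2_positive):
    """Map ER, PR, and Her2 status (booleans) to one of the four IHC groups."""
    # ER/PR+ if at least one hormone receptor is positive, otherwise ER/PR-.
    hormone = "ER/PR+" if (er_positive or pr_positive) else "ER/PR-"
    her2 = "Her2+" if her2_positive else "Her2-"
    return f"{hormone}, {her2}"

print(ihc_group(True, False, True))
print(ihc_group(False, False, False))
```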

The IHC classification correlates well with intrinsic gene expression microarray categorization: ER/PR+, Her2+ with Luminal B; ER/PR+, Her2− with Luminal A; ER/PR−, Her2+, and ER/PR−, Her2− with triple-negative/basal-like tumors. Apart from lending itself to subtype analyses of tumors when fresh tissue is not available, the IHC classification has prognostic and therapeutic implications, is inexpensive and readily available.

According to the studies in [11, 26] and the available data of the Ministry of Health and Medical Education and the Iran Cancer Research Center, patients are classified into the following categories: local recurrence, regional recurrence, distant recurrence, local and distant recurrence, regional and distant recurrence, and no recurrence. The number of patients in each category is 607, 82, 1975, 339, and 65.

The demographic and clinical characteristics of the units under study are shown in Table 2 and Table 3. The mean and standard deviation of the age is 46.86 ± 10.397 years, and the age group, 41–50 years with 42.8%, had the highest frequency. The mean tumor diameter is 3.63 ± 1.86 cm. The rate of demographic and clinical factors in the studied patients is depicted in Table 2.

Table 2.

Demographic and clinical characteristics Percent Frequency
Age (years) <50 67.4 3687
>50 32.6 1784
Tumor size T1* 26.8 1466
T2* 55.1 3015
T3* 18.1 990
Tumor Grade 1 16.7 914
2 58 3173
3 25.4 1390
Lymph node metastasis (number) Negative 36.2 1981
Positive<3 31.1 1701
Positive >3 32.7 1789
Estrogen receptor Negative 31.2 1707
Positive 68.8 3764
Progesterone receptor Negative 36.2 1981
Positive 63.8 3490
HER-2 receptor Negative 47.1 2577
Positive 52.9 2894
Triple-negative 13.7 750

T1 *: A tumor with a diameter of 2 cm or less

T2 *: Tumor diameter between 2 and 5 cm

T3 *: A tumor the largest diameter of which is over 5 cm

Table 3. Frequency distribution of demographic and clinical characteristics of patients.

Receptors status
Patient Description
Non-TN TN HER- HER+ PR- PR+ ER- ER+
Number(Percent) Number(Percent) Number(Percent) Number(Percent)
Year <40 481(8.8) 317(5.8) 673(12.3) 673(12.3) 635(11.6) 164(3) 635(11.6) 711(13)
41–50 2183(39.9) 159(2.9) 832(15.2) 1505(27.5) 553(10.1) 1784(32.6) 514(9.4) 1822(33.3)
>51 1505(27.5) 2937(5.1) 1078(19.7) 711(13) 793(14.5) 996(18.2) 558(10.2) 1231(22.5)
Tumor Size T1* 1269(23.2) 197(3.6) 673(12.3) 793(14.5) 514(9.4) 952(17.4) 399(7.3) 1072(19.6)
T2* 2697(49.3) 317(5.8) 1308(23.9) 1707(31.2) 990(18.1) 2024(37) 914(16.7) 2095(38.3)
T3* 755(13.8) 235(4.3) 596(10.9) 394(7.2) 476(8.7) 514(9.4) 394(7.2) 597(10.9)
Tumor Grade 1 793(14.5) 120(2.2) 558(10.2) 356(6.5) 159(2.9) 755(13.8) 159(2.9) 755(13.8)
2 2736(50) 438(8) 1625(29.7) 1548(28.3) 1149(21) 2025(37) 908(16.6)
3 1187(21.7) 197(3.6) 394(7.2) 990(18.1) 673(12.3) 711(13) 635(11.6) 755(13.8)
Lymph node metastasis Positive 3091(56.5) 399(7.3) 1784(32.6) 1707(31.2) 1149(21) 2342(42.8) 2342(18.8) 2456(44.9)
Negative 1624(29.7) 356(6.5) 793(14.5) 1187(21.7) 832(15.2) 1149(21) 678(12.4) 1308(23.9)

T1 *: A tumor with a diameter of 2 cm or less

T2 *: Tumor diameter between 2 and 5 cm

T3 *: A tumor the largest diameter of which is over 5 cm

Some characteristics of the patients are presented in Table 4 and Table 5. The median time to local recurrence is 16 months and the median time to metastasis is 22 months. Bone is the most common site of metastasis (46.6%), with a preference for the spine, especially the lumbar vertebrae. The median survival time from illness is 36 months. The overall treatment rate of the disease in the fifth year is 45%. This rate is 85% in the first stage of the disease and 5% in the fourth stage.

Table 4. Confusion matrix.

Predicted Values Actual Values
positives negative
positives TP FP
negative FN TN

Table 5. Frequency distribution of ER and PR receptors in terms of HER-2 receptors.

HER2 receptor
status
ER and PR
receptor status
HER- HER+
Number(Percent) Number(Percent)
ER+ 1822(33.3) 1942(35.5)
ER- 755(13.8) 952(17.4)
PR+ 1625(29.7) 1866(34.1)
PR- 952(17.4) 1029(18.8)

Variables relating to the recurrence of breast cancer, obtained from various studies as well as interviews with specialists in the field of breast cancer, are listed in Table 1. Also, the relationship of each factor with the recurrence of breast cancer in our dataset is evaluated using the chi-squared test. Factors with a p-value of less than 0.05 are considered the most related factors.

Based on this table, the most related factors include tumor size, LN involvement rate, the result of biopsy of pathology, type of surgery, free or closed tumor margin, chemotherapy, radiotherapy, and hormone therapy.

After refining the results and discarding the treatment and diagnosis methods, three factors (tumor size, LN involvement rate, and free or closed tumor margin) are considered the most important factors affecting the recurrence of breast cancer in our dataset.

Then, to predict breast cancer recurrence, 6 factors are discarded, including chemotherapy, type of chemotherapy, radiotherapy, hormone therapy, and result of biopsy of pathology, and the other features are used as the inputs of the machine learning algorithms.

Modeling using multilayer perceptron neural network

In the first step, a multilayer perceptron (MLP) neural network is used, which is one of the simplest and most powerful structures for modeling.

The output of the network is defined by (1), in which h and o denote the hidden layer and output layer, respectively, and w denotes the weights of the layers.

$$O_i = \mathrm{sgm}\left(\sum_{m} \mathrm{sgm}\left(\sum_{l} x_l\, w_{lm}^{h}\right) w_{mi}^{o}\right) \tag{1}$$

In this equation, sgm is the sigmoid function, defined as:

$$\mathrm{sgm}(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

The first modeling method is a multilayer perceptron neural network with backpropagation error algorithm for training, which is used for classification (recurrence and non-recurrence of breast cancer), and all of the eight risk factors are fed to the network. Also, different network architectures are evaluated to acquire the best classification performance. Table 6 shows some of the best structures of the multilayer perceptron neural network, and the confusion matrix related to the best architecture (among almost 100 different architectures).
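The forward pass of Eqs (1) and (2) for a single hidden layer might be sketched as below. The weights and inputs are hypothetical, the function name `mlp_forward` is an illustrative choice, and, like Eq (1), the sketch has no bias terms; training by backpropagation is not shown.

```python
import math

def sgm(x):
    """Sigmoid activation of Eq (2)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x, w_hidden, w_output):
    """Forward pass of Eq (1) for one hidden layer.

    w_hidden[l][m]: weight from input l to hidden unit m.
    w_output[m][i]: weight from hidden unit m to output i.
    """
    n_hidden = len(w_hidden[0])
    hidden = [
        sgm(sum(x[l] * w_hidden[l][m] for l in range(len(x))))
        for m in range(n_hidden)
    ]
    n_out = len(w_output[0])
    return [
        sgm(sum(hidden[m] * w_output[m][i] for m in range(n_hidden)))
        for i in range(n_out)
    ]

# Hypothetical inputs and weights (assumption, for illustration only).
x = [1.0, 0.0, 2.0]
w_hidden = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]
w_output = [[1.5], [-1.0]]
print(mlp_forward(x, w_hidden, w_output))
```

With a single sigmoid output, a value above 0.5 could be read as "recurrence" and a value below as "non-recurrence"; that threshold is an assumption, not something stated in the text.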

Table 6. The test of different architectures of MLP neural network.

Number of hidden layers Neural network architecture
2 1-10-10
2 1-10-20
3 1-10-10-10

Modeling using learning vector quantization

Learning Vector Quantization (LVQ) neural network consists of a competitive layer and a linear layer. The competitive layer learns to classify the input vectors, and the linear layer maps the competitive classes on the target category, which is determined by the user.
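The competitive learning step can be illustrated with the basic LVQ1 update rule: the prototype nearest to a training vector is pulled toward it when their classes agree and pushed away otherwise. The text does not say which LVQ variant or learning rate was used, so the rule, the function names, and all values below are assumptions made for illustration.

```python
def nearest_prototype(x, prototypes):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w, _ in prototypes]
    return dists.index(min(dists))

def lvq1_step(x, label, prototypes, lr=0.1):
    """One LVQ1 update on a list of (weight_vector, class_label) prototypes.

    The winning prototype moves toward x if its class matches `label`,
    and away from x otherwise. Returns the index of the winner.
    """
    k = nearest_prototype(x, prototypes)
    w, w_label = prototypes[k]
    sign = 1.0 if w_label == label else -1.0
    prototypes[k] = ([wi + sign * lr * (xi - wi) for xi, wi in zip(x, w)], w_label)
    return k

# Hypothetical prototypes: one per class (0 = no recurrence, 1 = recurrence).
prototypes = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]
winner = lvq1_step([0.2, 0.0], 0, prototypes, lr=0.5)
print(winner, prototypes)
```

After training, a new sample is assigned the class of its nearest prototype, which corresponds to the mapping from competitive classes to target categories described above.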

In this study, 10 LVQ neural networks, with different architectures, are used to find the optimum network architecture for disease Recurrence. Table 7 shows the performance of the various neural networks over the changes in the number of neurons in the competitive layer.

Table 7. The result of the LVQ algorithm for the number of neurons.

The number of neurons 3 4 5 6 7
Number of loop repeats 4 4 4 2 2

Modeling using Bayesian network

The Bayesian network is a decision-making system with a strong ability to model cause-and-effect relationships. Note that a Bayesian neural network does not require precise information or a complete history of an event; it can reach a convincing estimate of the current or future condition of a system even with incomplete and imprecise information. Therefore, in this paper, the Bayesian neural network is exploited as the third method for disease classification. The network output is defined by (3), and the distribution of the weights by (4).

$$y_k = f_{outer}\left(\sum_{j=1}^{m} w_{kj}^{(2)}\, f_{inner}\left(\sum_{i=1}^{d} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right) \tag{3}$$

$$P(W \mid D) = \frac{P(D \mid W)\, P(W)}{P(D)} \tag{4}$$

where $w_{ji}^{(1)}$ and $w_{kj}^{(2)}$ are the weights of the first and second layers, respectively; i, j, and k index the input, hidden, and output units, respectively; and $w_{j0}^{(1)}$ is the bias for the jth hidden unit. m is the number of hidden units and d is the number of input units. The function $f_{outer}(\cdot)$ is linear, and $f_{inner}(\cdot)$ is a hyperbolic tangent function. In the Bayesian method, the weights in Eq (3) are described by a distribution rather than by single values, and the distribution over the network weights is computed by (5).

$$P(W \mid D) = \frac{P(D \mid W)\, P(W)}{P(D)} \tag{5}$$

where P(W) is the prior probability distribution over the weight space, before any data are observed; P(W|D) is the posterior distribution of the weights after training on the data; P(D|W) is the likelihood of the data given the weights; and P(D) is the evidence, which acts as a normalizing constant.

Different structures are tested to find the best number of neurons in the competitive layer. Some of them are shown in Table 8. According to the table, the first four architectures have two hidden layers. Increasing the number of hidden layers from two to three (the last two rows of the table) does not have a significant effect on error performance (only 0.3% improvement).

Table 8. A survey of different neural network architectures considering all input parameters in the recurrence of breast cancer disease.

Neural network architecture Number of hidden layers
4-6-1 2
8-7-1 2
13-6-1 2
13-18-1 2
12-6-4-1 3
18-8-6-1 3

Modeling using KPCA- SVM

To further improve the discrimination accuracy, combining kernel principal component analysis and support vector machine (KPCA-SVM) is proposed in this study. After preprocessing, the dataset is analyzed with the model of KPCA-SVM. The results show that the model can achieve excellent results.

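For KPCA, introducing a kernel function allows the original dataset to be mapped into a high-dimensional space [60–62]. Let the original space be $R^n$, with m samples of dimension n, so that the total sample is $X = [x_1, x_2, \ldots, x_m]$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]$, $i = 1, 2, \ldots, m$. Let H be a high-dimensional Hilbert space; all the samples are mapped from the original space to the high-dimensional space, which is expressed as:

$$x_i \rightarrow \Phi(x_i) \tag{6}$$

The kernel matrix is defined as

$$K = \Phi(X)^T \Phi(X) = [k(x_i, x_j)]_{m \times m}, \tag{7}$$

where the kernel function needs to satisfy the Mercer condition

$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \Phi(x_i)^T \Phi(x_j). \tag{8}$$

After nonlinear mapping, the corresponding covariance matrix is

$$C = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i)\, \Phi(x_i)^T. \tag{9}$$

Denote the eigenvalues and eigenvectors of C as λ and V, respectively:

$$\lambda V = C V, \tag{10}$$

where V lies in the subspace spanned by $\{\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_m)\}$.

Then there exist coefficients $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ such that

$$V = \sum_{j=1}^{m} \alpha_j \Phi(x_j). \tag{11}$$

Taking the inner product of both sides of Eq (10) with $\Phi(x_k)$, for $k = 1, 2, \ldots, m$, gives

$$\lambda \left( \Phi(x_k) \cdot V \right) = \Phi(x_k) \cdot C V. \tag{12}$$

Combining Eqs (9), (11) and (12), we get

$$\lambda \left( \Phi(x_k) \cdot \sum_{j=1}^{m} \alpha_j \Phi(x_j) \right) = \Phi(x_k) \cdot \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i)\, \Phi(x_i)^T \sum_{j=1}^{m} \alpha_j \Phi(x_j). \tag{13}$$

Further, we get

$$\lambda \sum_{j=1}^{m} \alpha_j \left( \Phi(x_k) \cdot \Phi(x_j) \right) = \frac{1}{m} \sum_{j=1}^{m} \alpha_j \sum_{i=1}^{m} \left( \Phi(x_k) \cdot \Phi(x_i) \right) \left( \Phi(x_i)^T \Phi(x_j) \right). \tag{14}$$

Substituting Eqs (7) and (8) into Eq (14) yields

$$m \lambda \alpha = K \alpha \tag{15}$$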

Thereby the eigenvalues and eigenvectors of the K matrix can be obtained.
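The eigenproblem of Eq (15) can be checked numerically on a toy example. The sketch below builds an RBF kernel matrix for a few hypothetical points and finds its leading eigenpair by power iteration, using only the standard library. The kernel choice, `gamma`, the points, and the helper names are assumptions; the text does not state which kernel was used. Like the derivation above, the sketch assumes the mapped data are already centered, so no kernel-centering step is included.

```python
import math

def rbf_kernel_matrix(points, gamma=1.0):
    """Kernel matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]

def matvec(K, v):
    """Matrix-vector product K v."""
    return [sum(k_ij * v_j for k_ij, v_j in zip(row, v)) for row in K]

def top_eigenpair(K, iterations=500):
    """Power iteration for the largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    v = [1.0] * len(K)
    for _ in range(iterations):
        w = matvec(K, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mu = sum(a * b for a, b in zip(v, matvec(K, v)))  # Rayleigh quotient of unit vector v
    return mu, v

# Hypothetical 2-D samples (assumption, for illustration only).
points = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]]
K = rbf_kernel_matrix(points)
mu, alpha = top_eigenpair(K)
m = len(points)
lam = mu / m  # Eq (15): m * lambda * alpha = K * alpha
print(f"largest eigenvalue of K: {mu:.4f}, corresponding lambda: {lam:.4f}")
```

The projection of a sample onto the leading kernel principal component would then be a weighted sum of kernel evaluations against the training points, with weights taken from `alpha`.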

Modeling using support vector machines

Support Vector Machines (SVM) has unique advantages and has been widely used in the field of pattern recognition, especially in dealing with classification problems.

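For the two-category problem, let the given training sample set be $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i \in \{+1, -1\}$ $(i = 1, 2, \ldots, n)$ is the category to which sample $x_i$ belongs. The cost function, subject to its constraints, is constructed as

$$\min\ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} \xi_i, \tag{16}$$

$$\text{s.t.}\quad y_i(\omega^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, n.$$

where ξ is the slack variable introduced to measure the degree of misclassification of the model, C is the penalty constant, which should be selected appropriately, and ω and b are the weight vector and threshold of the classification function, respectively. The Lagrangian function is

$$L(\omega, b, \alpha) = \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \beta_i \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(\omega^T x_i + b) - 1 + \xi_i \right] \tag{17}$$

where $\alpha_i$ and $\beta_i$ are Lagrange multipliers. According to the Karush-Kuhn-Tucker conditions, we have

$$0 \le \alpha_i \le C,\quad i = 1, 2, \ldots, n, \tag{18}$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \tag{19}$$

where the samples with $\alpha_i > 0$ are the support vectors. The discriminant function is

$$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i^{*} y_i K(x, x_i) + b^{*} \right) \tag{20}$$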

where K is the selected kernel function.
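Once the multipliers and threshold are known, evaluating the discriminant function of Eq (20) is a short computation. The following sketch assumes an RBF kernel and uses hypothetical support vectors, multipliers, and threshold chosen so that Eq (19) holds; it does not solve the optimization problem of Eq (16).

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Eq (20): sign of the kernel expansion over the support vectors plus the threshold."""
    score = sum(
        a * y * kernel(x, sv) for a, y, sv in zip(alphas, labels, support_vectors)
    ) + b
    return 1 if score >= 0 else -1

# Hypothetical trained values (assumption); note sum(alpha_i * y_i) = 0 as in Eq (19).
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]
labels = [+1, -1]
b = 0.0

print(svm_decision([0.1, 0.1], support_vectors, alphas, labels, b))
print(svm_decision([1.9, 2.0], support_vectors, alphas, labels, b))
```

Only the support vectors contribute to the sum, which is why samples with a zero multiplier can be dropped after training.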

Fig 2 shows the workflow of the Raman spectral discrimination model of KPCA-SVM.

Fig 2. The workflow of the Raman spectral discrimination model of KPCA-SVM.


Modeling using random forest

Random forest is a combination of learning trees that each tree in the forest is built from a random vector called Q. The Qk vector describes the method of constructing the kth tree. For example, it may specify whether some features are randomly used in the construction of each tree, or whether the training data set is randomly selected. Each tree constructed with h (x, Qk) is represented. This vector is given to all trees to classify the input x,h (x, Qk), k = 1,…, k, and is the final class of which with the most trees to vote on.

Suppose h1(x),…,hK(x) are classifiers, Y is the output vector of the training data, and X is a random vector drawn from the training data. The margin (boundary) function of this set of classifiers is defined as follows:

$$mg(\underline{X},Y) = \frac{1}{K}\sum_{k=1}^{K} I\left(h_k(\underline{X})=Y\right) - \max_{j\ne Y}\left[\frac{1}{K}\sum_{k=1}^{K} I\left(h_k(\underline{X})=j\right)\right] \tag{21}$$

where I(·) is the indicator function. The margin measures by how much the share of the K classifiers that vote for the correct class exceeds the largest share that votes for any other class.

If mg(X, Y) > 0, the set of classifiers classifies the input correctly. If mg(X, Y) < 0, the classification is incorrect.

The generalization error is defined as follows:

$$PE^{+} = P_{\underline{X},Y}\left(mg(\underline{X},Y) < 0\right) \tag{22}$$
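To make Eqs (21) and (22) concrete, the sketch below computes the margin from the votes of a handful of trees and then an empirical counterpart of the generalization error, the fraction of samples with a negative margin. The vote lists and class labels are invented toy values, and the function names `margin` and `empirical_error` are hypothetical.

```python
from collections import Counter

def margin(votes, true_label):
    """Eq (21): share of trees voting for the true class minus the
    largest share received by any other class."""
    n_trees = len(votes)
    counts = Counter(votes)
    correct = counts.get(true_label, 0) / n_trees
    best_other = max(
        (c / n_trees for label, c in counts.items() if label != true_label),
        default=0.0,
    )
    return correct - best_other

def empirical_error(all_votes, true_labels):
    """Empirical analogue of Eq (22): fraction of samples with mg < 0."""
    margins = [margin(v, y) for v, y in zip(all_votes, true_labels)]
    return sum(m < 0 for m in margins) / len(margins)

# Toy votes of five trees for three samples (hypothetical).
all_votes = [
    ["rec", "rec", "no", "rec", "no"],   # true "rec": 3/5 - 2/5 > 0
    ["no", "no", "no", "rec", "no"],     # true "no":  4/5 - 1/5 > 0
    ["rec", "no", "no", "no", "rec"],    # true "rec": 2/5 - 3/5 < 0
]
true_labels = ["rec", "no", "rec"]

margins = [margin(v, y) for v, y in zip(all_votes, true_labels)]
error = empirical_error(all_votes, true_labels)
```

In this toy case the third sample is outvoted, so one of the three margins is negative.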

For a random forest h(x, Qk), k = 1,…,K, it is shown in [57] that as the forest grows and the number of trees K tends to infinity:

$$PE^{+} \to P_{\underline{X},Y}\left(P_{Q}\left(h(\underline{X},Q)=Y\right) - \max_{j\ne Y} P_{Q}\left(h(\underline{X},Q)=j\right) < 0\right),\quad K\to\infty \tag{23}$$

This indicates that the generalization error converges to a limiting value as more trees are added. Therefore, adding trees does not cause the random forest to over-fit the data.

Modeling using decision tree C5.0

A decision tree is a powerful classification method whose popularity has grown with data mining. Decision trees classify samples by sorting them down the tree from the root node to a leaf node. Each internal (non-leaf) node is labeled with a feature and poses a question about the input sample; each possible value of that feature corresponds to one branch leaving the node. Each leaf is labeled with a class. The method is called a decision tree because it represents the decision-making process for classifying an input instance. Decision trees suit problems whose answer can be given as a single class or category, whose training examples are described by attribute-value pairs, and whose target function has discrete output values (for example, each sample is labeled yes or no); they also suit problems that require a disjunctive description. Decision tree algorithms include C5.0, C4.5, and ID3; the C5.0 method was used in this research.

A decision tree approximates discrete-valued functions and is resistant to noise in the input. It is useful for large volumes of data and hence for data mining. The tree can be represented as a set of if-then rules that are easy to understand. It allows conjunctive (AND) and disjunctive (OR) combinations of hypotheses. It is also applicable when training examples are missing some feature values.
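The root-to-leaf classification and the if-then rule representation described above can be illustrated with a very small sketch. The tree below is a made-up toy: its feature names, branch values, and leaf labels are hypothetical and are not derived from the study data or from the C5.0 model. The helper names `classify` and `to_rules` are likewise assumptions.

```python
# Hypothetical toy tree: internal nodes are dicts, leaves are class labels.
toy_tree = {
    "feature": "tumor_margin",
    "branches": {
        "free": "no recurrence",
        "closed": {
            "feature": "ln_involved",
            "branches": {"yes": "recurrence", "no": "no recurrence"},
        },
    },
}

def classify(tree, sample):
    """Walk from the root to a leaf, following the branch that matches
    the sample's value for the feature tested at each node."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node

def to_rules(tree, conditions=()):
    """Flatten the tree into if-then rules, one rule per leaf."""
    if not isinstance(tree, dict):
        lhs = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {lhs} THEN {tree}"]
    rules = []
    for value, child in tree["branches"].items():
        test = f"{tree['feature']} = {value}"
        rules.extend(to_rules(child, conditions + (test,)))
    return rules

prediction = classify(toy_tree, {"tumor_margin": "closed", "ln_involved": "yes"})
rules = to_rules(toy_tree)
```

Each rule is a conjunction (AND) of the tests along one path, and the rule set as a whole is a disjunction (OR) over paths.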

Feature importance analysis

We evaluated feature importance by two methods. First, as shown in Table 1, variables related to the recurrence of breast cancer were extracted with the chi-square test. After refining the results and discarding the treatment and diagnosis methods, three factors (Tumor size, LN involvement rate, and free or closed tumor margin) were considered the most important factors affecting the recurrence of breast cancer in our dataset.

Second, we applied the forward feature selection method with the data mining algorithms to validate the most influential factors affecting the recurrence of breast cancer. Forward selection is an iterative method that starts with no features in the model. In each iteration, the feature that most improves the model is added, until adding a new variable no longer improves the performance of the model.
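The forward selection loop can be sketched as follows. This is a generic greedy implementation under stated assumptions, not the authors' code: `score` stands for any function that trains a model on a feature subset and returns its accuracy, and `toy_score` with its `TOY_GAIN` values is a hypothetical stand-in chosen only to make the example deterministic.

```python
def forward_selection(features, score):
    """Greedy forward selection: start with no features and keep adding
    the single best feature until the score stops improving."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # no candidate improves the model
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical additive gains (not measured values).
TOY_GAIN = {
    "ln_rate": 0.30,
    "her2": 0.10,
    "tumor_size": 0.05,
    "margin": 0.02,
    "grade": 0.0,
}

def toy_score(subset):
    """Stand-in for 'train a model on subset and return its accuracy'."""
    return 0.5 + sum(TOY_GAIN[f] for f in subset)

selected, final_score = forward_selection(list(TOY_GAIN), toy_score)
```

In this toy run the fifth feature adds nothing, so the loop stops with four features, which is the same stopping behavior as described for the procedure above.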

Performance metrics

The performance of the seven methods is evaluated by accuracy, sensitivity, specificity, AUC, PPV, and NPV (Table 9). A graphical comparison of these measures is shown in Fig 3.

Table 9. Performance measure.

| Method | Random Forest | LVQ | Bayesian | C5.0 | MLP | KPCA-SVM | SVM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TP | 1750 | 1640 | 1650 | 2188 | 1477 | 2048 | 1750 |
| TN | 2188 | 2024 | 2008 | 2297 | 1914 | 2250 | 2243 |
| FP | 985 | 931 | 758 | 657 | 657 | 765 | 985 |
| FN | 548 | 876 | 1208 | 329 | 1423 | 408 | 493 |
| Accuracy | 0.719 | 0.669 | 0.650 | 0.819 | 0.619 | 0.785 | 0.729 |
| Sensitivity | 0.761 | 0.651 | 0.577 | 0.869 | 0.509 | 0.833 | 0.780 |
| Specificity | 0.689 | 0.684 | 0.725 | 0.777 | 0.744 | 0.746 | 0.694 |
| Geometric mean of sensitivity and specificity | 0.724 | 0.668 | 0.647 | 0.822 | 0.615 | 0.788 | 0.736 |
| PPV | 0.639 | 0.637 | 0.685 | 0.769 | 0.692 | 0.728 | 0.639 |
| NPV | 0.799 | 0.697 | 0.624 | 0.874 | 0.573 | 0.846 | 0.819 |
| Geometric mean of PPV and NPV | 0.715 | 0.667 | 0.654 | 0.820 | 0.630 | 0.785 | 0.724 |
| F-measure | 0.695 | 0.644 | 0.626 | 0.816 | 0.586 | 0.777 | 0.703 |
| Area under ROC curve | 0.729 | 0.632 | 0.692 | 0.763 | 0.625 | 0.774 | 0.742 |
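The count-based measures in Table 9 follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions, which we assume are the ones applied here because they reproduce the C5.0 column to within rounding; the function name `binary_metrics` is hypothetical, and the AUC is not included because it cannot be computed from the counts alone.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard measures derived from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "gmean_sens_spec": math.sqrt(sensitivity * specificity),
        "gmean_ppv_npv": math.sqrt(ppv * npv),
        "f_measure": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Counts taken from the C5.0 column of Table 9.
c50 = binary_metrics(tp=2188, tn=2297, fp=657, fn=329)
for name, value in c50.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to the other columns of Table 9 to compare the methods on a common footing.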

Fig 3. Performance criteria of the seven classification methods.


As seen in Table 9 and Fig 3, all methods except MLP, Bayesian, and LVQ achieve relatively high sensitivity (more than 70%). The highest accuracy in our experiments is obtained with C5.0 (0.819), followed by KPCA-SVM (0.785). Thus, C5.0 and KPCA-SVM appear to perform better than the other five data mining methods. The results indicate that all seven methods except MLP, Bayesian, and LVQ can predict the recurrence of breast cancer, and that, even with the newer methods added to the comparison, C5.0 and KPCA-SVM give the best predictions.

According to the confusion matrix, the final number of patients with recurrence of breast cancer, as well as the patients with no recurrence of the disease, in comparison with the predictions made by different algorithms, is presented in Table 10.

Table 10. Actual and predicted recurrence in the studied patients.

| Actual Positive Recurrence (n) | Actual Negative Recurrence (n) | Predicted Positive Recurrence (n) | Predicted Negative Recurrence (n) |
| --- | --- | --- | --- |
| 2517 | 2954 | 2845 | 2626 |

Because errors in classification and forecasting models are inevitable, the errors in the system should be recognized and investigated. In this study, the confusion matrix was used for this purpose. A confusion matrix was generated for every classification model; however, only four confusion matrices for high-precision classifiers are shown below:

Table 12 gives the prediction accuracy for each type of recurrence, based on Table 11. The predictive accuracy is highest for distant recurrence (0.87) and no recurrence (0.88). Also, considering that distant recurrence is in step 7 of the TNM rating, it can be concluded that at this stage the present research model supports prediction better than the other data mining algorithms.

Table 11. Cancer recurrence matrix of the C5.0 algorithm.

| Predicted / Actual | Local recurrence | Regional recurrence | Distant recurrence | Local and Distant recurrence | Regional and Distant recurrence | No recurrence |
| --- | --- | --- | --- | --- | --- | --- |
| Local recurrence | 463 | 37 | 6 | 12 | 21 | 68 |
| Regional recurrence | 4 | 54 | 3 | 3 | 8 | 10 |
| Distant recurrence | 3 | 22 | 1724 | 16 | 23 | 187 |
| Local and Distant recurrence | 2 | 18 | 14 | 234 | 19 | 52 |
| Regional and Distant recurrence | 1 | 4 | 2 | 2 | 46 | 10 |
| No recurrence | 64 | 41 | 164 | 40 | 19 | 2626 |

Table 12. Accuracy of the prediction model for each type of recurrence.

| Type of recurrence | Accuracy |
| --- | --- |
| Local recurrence | 0.76 |
| Regional recurrence | 0.66 |
| Distant recurrence | 0.87 |
| Local and Distant recurrence | 0.69 |
| Regional and Distant recurrence | 0.70 |
| No recurrence | 0.88 |
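The per-class values in Table 12 can be reproduced from the counts in Table 11. The sketch below divides each diagonal entry by its row total; this is an assumption about how Table 12 was derived, supported only by the fact that it matches the reported values to within 0.01. The function name `per_class_rate` is hypothetical.

```python
labels = [
    "Local recurrence",
    "Regional recurrence",
    "Distant recurrence",
    "Local and Distant recurrence",
    "Regional and Distant recurrence",
    "No recurrence",
]

# Rows of Table 11, in the order of `labels`.
matrix = [
    [463, 37, 6, 12, 21, 68],
    [4, 54, 3, 3, 8, 10],
    [3, 22, 1724, 16, 23, 187],
    [2, 18, 14, 234, 19, 52],
    [1, 4, 2, 2, 46, 10],
    [64, 41, 164, 40, 19, 2626],
]

def per_class_rate(rows):
    """Diagonal entry divided by the row total, for each class."""
    return [row[i] / sum(row) for i, row in enumerate(rows)]

rates = dict(zip(labels, per_class_rate(matrix)))

# Values reported in Table 12, for comparison.
reported = dict(zip(labels, [0.76, 0.66, 0.87, 0.69, 0.70, 0.88]))
for name in labels:
    print(f"{name}: computed {rates[name]:.3f}, reported {reported[name]:.2f}")
```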

There is also a general increase in the probability of disease-free survival in the first to fifth years after initial treatment in patients with breast cancer.

According to Table 13, the predicted disease-free survival tracks the observed value increasingly closely over the first three years (76 vs 80, 59 vs 61, and 51 vs 52 percent), but after the third year the agreement deteriorates (34 vs 47 percent in the fourth year and 22 vs 45 percent in the fifth). Therefore, it can be stated that the model presented in this study has higher accuracy in the first three years.

Table 13. The total amount of disease-free survival in the first to fifth years after the initial treatment.

| Year | Cumulative cases of the first recurrence (n) | Disease-free survival (%) | Predicted disease-free survival (%) | Standard error of disease-free survival (%) |
| --- | --- | --- | --- | --- |
| First | 720 | 80 | 76 | 2 |
| Second | 1412 | 61 | 59 | 3 |
| Third | 1751 | 52 | 51 | 3 |
| Fourth | 1941 | 47 | 34 | 3 |
| Fifth | 2009 | 45 | 22 | 3 |
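The year-by-year agreement between observed and predicted disease-free survival can be checked with simple arithmetic on Table 13. In the short sketch below the percentages are copied from the table; the variable names are illustrative only.

```python
years = ["first", "second", "third", "fourth", "fifth"]

# Percentages copied from Table 13.
observed = dict(zip(years, [80, 61, 52, 47, 45]))
predicted = dict(zip(years, [76, 59, 51, 34, 22]))

# Gap in percentage points between observed and predicted survival.
gaps = {year: observed[year] - predicted[year] for year in years}

for year in years:
    print(f"{year}: observed {observed[year]}%, "
          f"predicted {predicted[year]}%, gap {gaps[year]} points")
```

The gap narrows through the third year and then widens, which is the pattern behind the statement that the model is more accurate in the first three years.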

Important features for breast cancer metastasis prediction

Table 14 represents the results of applying the Forward Selection algorithm on the Ministry of Health and Medical Education and the Iran Cancer Research Center dataset. The most important features affecting the diagnosis of breast cancer recurrence were then identified.

Table 14. The most important feature with the highest accuracy in the diagnosis of breast cancer.

| Step No | Feature | Accuracy | Sensitivity | Specificity | Time |
| --- | --- | --- | --- | --- | --- |
| 1 | {LN involvement rate} | 95.97 | 90.53 | 98.88 | 53.26 |
| 2 | {LN involvement rate, Her2 value} | 97.80 | 97.89 | 97.75 | 53.18 |
| 3 | {LN involvement rate, Her2 value, Tumor size} | 98.11 | 96.84 | 99.44 | 49.15 |
| 4 | {LN involvement rate, Her2 value, Tumor size, free or closed tumor margin} | 98.24 | 97.89 | 99.44 | 44.78 |
| 5 | {LN involvement rate, Her2 value, Tumor size, free or closed tumor margin, Tumor grade} | 98.24 | 97.89 | 99.44 | 44.78 |

Because the LN involvement rate alone gives the highest accuracy in detecting the recurrence of breast cancer, it is selected first. Each remaining feature is then added to it in turn and the accuracy of the resulting models is measured. In the next step, the set of two features with the highest accuracy is kept, each remaining feature is added to it, and the accuracy of the model is measured again. The same procedure is repeated for the best set of three features and then for the best set of four features.

As observed in Table 14, adding a further feature to the set of four obtained in the previous step does not increase the accuracy of any model, so the tests are not continued. It is concluded that C5.0 with the four features {LN involvement rate, Her2 value, Tumor size, free or closed tumor margin} can help predict breast cancer recurrence on the Ministry of Health and Medical Education and Iran Cancer Research Center dataset better than the other data mining algorithms.

As the results in Table 14 show, in addition to identifying the most important features, the same or better performance was achieved with fewer features than with the full feature set. In machine learning methods, achieving similar or better results with fewer features is important.

The results in Table 14 are consistent with the three factors identified by the chi-square analysis (Tumor size, LN involvement rate, and free or closed tumor margin). In addition to those factors, the machine learning method identifies Her2 value as an important factor for better accuracy. Therefore, it is recommended that all four factors be considered for better prediction of breast cancer recurrence.

Conclusion and discussion

Numerous tests for breast cancer often increase the stress on the patient and her family, reduce control over the disease, and impair her quality of life. Therefore, data mining algorithms could possibly be a helpful tool for physicians in the early diagnosis and prevention of cancer recurrence, before the patient undergoes surgical treatment and incurs high costs. Of course, these methods are not definitive in medicine, but they can be helpful. The main aim of this study is therefore to help predict the recurrence of breast cancer using data mining algorithms, mainly based on neural networks. Disease forecasting tools, together with physicians' experience, can play a helpful role in correct diagnosis and treatment choice, and can increase the confidence of both the physician and the patient in the accuracy of the diagnosis. Moreover, the diagnosis determines the next steps for the patient. If its accuracy is low, the patient's survival is affected, so it is vital that forecasting models have excellent and acceptable accuracy.

A model should be evaluated on several criteria, such as accuracy and sensitivity; the higher these values, the more reliable its forecasts will be.

The current incidence of breast cancer is high, and early detection of primary tumors and the strict use of adjuvant therapy not only result in more prolonged survival but also lengthen the disease-free interval, thus increasing the number of patients requiring follow-up.

Help in predicting the recurrence of breast cancer is therefore essential for many reasons. For instance, for patients who have had only one tumor or one breast entirely removed, a high likelihood of recurrence can be predicted before the cancer spreads to other parts of the body. Predictive models can be helpful and beneficial in this regard. It should be noted, however, that when evaluating medical prediction models, at least two characteristics of the model, including its sensitivity, have to be considered, because considering one of them alone can be misleading.

Besides, special attention should be paid to the number of false negatives. A false negative means that a patient is mistakenly considered healthy, which can have hazardous consequences.

Because of a limitation in our dataset (we had complete information for only 16 patients whose relapse occurred after 10 years), we analyzed only patients whose relapse occurred within 5 years.

Artificial neural networks are modern disease diagnosis methods that have attracted the attention of researchers in recent years. Therefore, seven new and conventional data mining algorithms, including neural networks, were used: MLP, LVQ, Bayesian Neural Network, SVM, KPCA-SVM, C5.0, and Random forest. This study shows the effect of neural network technology on the classification of breast cancer recurrence. Artificial neural networks can be used as a diagnostic method with high sensitivity and specificity for predicting the recurrence of breast cancer tumors, alongside other non-invasive diagnostic methods (such as mammography and radiography). The importance of this result is that it may help spare patients who do not need invasive diagnostic methods (sampling and surgery) from such operations. In addition to reducing costs, faster and more precise prediction of breast cancer recurrence may increase the chance of successful treatment.

Results show that C5.0 and KPCA-SVM perform better than the other methods in terms of accuracy, true negatives, sensitivity, specificity, geometric mean of sensitivity and specificity, PPV, NPV, geometric mean of PPV and NPV, F-measure, and area under the ROC curve.

In terms of area under the ROC curve, the performance of all methods except MLP, Bayesian, and LVQ is appropriate (more than 0.7). In terms of accuracy, C5.0 and KPCA-SVM outperformed all other methods; however, the other methods, except MLP, Bayesian, and LVQ, also achieved high accuracy (more than 0.70).

Sensitivity ranged from a minimum of 0.509 (MLP) to a maximum of 0.869 (C5.0). Because detecting recurrence of breast cancer is the critical prediction in this biomedical application, a method with higher sensitivity is desired; methods such as MLP are therefore inappropriate for predicting recurrence of breast cancer, whereas C5.0 showed the best sensitivity.

Given the nature of the present data, and of data in breast health and cancer research in general, there are several reasons for choosing the decision tree method for prediction. The decision tree can work with both continuous and discrete data (other methods can only work with one type). Unnecessary comparisons are eliminated in this structure, different features are used for different samples, no estimate of the distribution function is needed, and data preparation for a decision tree is unnecessary or straightforward (other methods often require data normalization, deletion of empty values, or the creation of null variables). Decision tree structures are well suited to analyzing big data in a short time and to finding unexpected or unknown relationships, and the method can adapt to inadequate data. In conclusion, suitable machine learning algorithms may facilitate the prediction of breast cancer recurrence. In the current study, we found that the C5.0 algorithm could possibly be a helpful tool for physicians and health care policymakers in predicting breast cancer recurrence at the stage of distant recurrence and nonrecurrence, especially in the first to third years. In addition, LN involvement rate, Her2 value, Tumor size, and free or closed tumor margin were found to be the most important features in our dataset. This may result in more sustainable health for the patients and, consequently, a lower psychological, social, and economic burden on society.

Supporting information

S1 Data

(XLSX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Jurisevic M., et al., The organic ester O,O'-diethyl-(S,S)-ethylenediamine-N,N'-di-2-(3-cyclohexyl)propanoate dihydrochloride attenuates murine breast cancer growth and metastasis. Oncotarget, 2018. 9(46): p. 28195–28212. doi: 10.18632/oncotarget.25610
  • 2. Uen Y., et al., Mining of potential microRNAs with clinical correlation—regulation of syndecan-1 expression by miR-122-5p altered mobility of breast cancer cells and possible correlation with liver injury. Oncotarget, 2018. 9(46): p. 28165–28175. doi: 10.18632/oncotarget.25589
  • 3. Tong C.W.S., et al., Recent Advances in the Treatment of Breast Cancer. Front Oncol, 2018. 8: p. 227. doi: 10.3389/fonc.2018.00227
  • 4. Yang Y. and Hendrix C.C., Cancer-Related Cognitive Impairment in Breast Cancer Patients: Influences of Psychological Variables. Asia Pac J Oncol Nurs, 2018. 5(3): p. 296–306.
  • 5. Yi M. and Hwang E., Pain and Menopause Symptoms of Breast Cancer Patients with Adjuvant Hormonal Therapy in Korea: Secondary Analysis. Asia Pac J Oncol Nurs, 2018. 5(3): p. 262–269. doi: 10.4103/apjon.apjon_45_17
  • 6. Gao S., et al., miR-202 acts as a potential tumor suppressor in breast cancer. Oncol Lett, 2018. 16(1): p. 1155–1162. doi: 10.3892/ol.2018.8726
  • 7. Mishra R., et al., Activating HER3 mutations in breast cancer. Oncotarget, 2018. 9(45): p. 27773–27788. doi: 10.18632/oncotarget.25576
  • 8. Nejati-Azar A. and Alivand M.R., miRNA 196a2(rs11614913) & 146a(rs2910164) polymorphisms & breast cancer risk for women in an Iranian population. Per Med, 2018. 15(4): p. 279–289. doi: 10.2217/pme-2017-0088
  • 9. Fernandes A.W., Wu B., and Turner R.M., Brain metastases in non-small cell lung cancer patients on epidermal growth factor receptor tyrosine kinase inhibitors: symptom and economic burden. J Med Econ, 2017. 20(11): p. 1136–1147. doi: 10.1080/13696998.2017.1361960
  • 10. Petrou P., A systematic review of economic evaluations of tyrosine kinase inhibitors of vascular endothelial growth factor receptors, mammalian target of rapamycin inhibitors and programmed death-1 inhibitors in metastatic renal cell cancer. Expert Rev Pharmacoecon Outcomes Res, 2018. 18(3): p. 255–265. doi: 10.1080/14737167.2018.1439740
  • 11. Rautenberg T., et al., Economic outcomes of sequences which include monoclonal antibodies against vascular endothelial growth factor and/or epidermal growth factor receptor for the treatment of unresectable metastatic colorectal cancer. J Med Econ, 2014. 17(2): p. 99–110. doi: 10.3111/13696998.2013.864973
  • 12. Brett E.A., et al., Breast cancer recurrence after reconstruction: know thine enemy. Oncotarget, 2018. 9(45): p. 27895–27906. doi: 10.18632/oncotarget.25602
  • 13. Fu M.R., et al., Machine learning for detection of lymphedema among breast cancer survivors. Mhealth, 2018. 4: p. 17. doi: 10.21037/mhealth.2018.04.02
  • 14. Maañón J., et al., High serum vascular endothelial growth factor C predicts better relapse-free survival in early clinically node-negative breast cancer. Oncotarget, 2018. 9(46): p. 28131–28140. doi: 10.18632/oncotarget.25577
  • 15. Leone J.P., et al., Treatment Patterns and Survival of Elderly Patients With Breast Cancer Brain Metastases. Am J Clin Oncol, 2019. 42(1): p. 60–66. doi: 10.1097/COC.0000000000000477
  • 16. Yazdani A., et al., Investigation of Prognostic Factors of Survival in Breast Cancer Using a Frailty Model: A Multicenter Study. Breast Cancer (Auckl), 2019. 13: p. 1178223419879112.
  • 17. Kondo M., et al., Economic evaluation of the 70-gene prognosis-signature (MammaPrint®) in hormone receptor-positive, lymph node-negative, human epidermal growth factor receptor type 2-negative early stage breast cancer in Japan. Breast Cancer Res Treat, 2012. 133(2): p. 759–68. doi: 10.1007/s10549-012-1979-7
  • 18. Rosner D. and Lane W.W., Predicting recurrence in axillary-node negative breast cancer patients. Breast Cancer Res Treat, 1993. 25(2): p. 127–39. doi: 10.1007/BF00662138
  • 19. Wapnir I.L., et al., Prognosis after ipsilateral breast tumor recurrence and locoregional recurrences in five National Surgical Adjuvant Breast and Bowel Project node-positive adjuvant breast cancer trials. J Clin Oncol, 2006. 24(13): p. 2028–37. doi: 10.1200/JCO.2005.04.3273
  • 20. Dieci M.V., et al., Patterns of Fertility Preservation and Pregnancy Outcome After Breast Cancer at a Large Comprehensive Cancer Center. J Womens Health (Larchmt), 2019. 28(4): p. 544–550.
  • 21. Xiao Y., et al., Integrin α5 down-regulation by miR-205 suppresses triple negative breast cancer stemness and metastasis by inhibiting the Src/Vav2/Rac1 pathway. Cancer Lett, 2018. 433: p. 199–209. doi: 10.1016/j.canlet.2018.06.037
  • 22. Muss H.B., McNamara M.J., and Connelly R.A., Follow-up after stage II breast cancer: a comparative study of relapsed versus nonrelapsed patients. Am J Clin Oncol, 1988. 11(4): p. 451–5. doi: 10.1097/00000421-198808000-00008
  • 23. Schapira D.V., Breast cancer surveillance—a cost-effective strategy. Breast Cancer Res Treat, 1993. 25(2): p. 107–11. doi: 10.1007/BF00662135
  • 24. Chen Z., et al., Differential expression and function of CAIX and CAXII in breast cancer: A comparison between tumorgraft models and cells. PLoS One, 2018. 13(7): p. e0199476. doi: 10.1371/journal.pone.0199476
  • 25. Pourzand A., et al., Erratum: Associations between Dietary Allium Vegetables and Risk of Breast Cancer: A Hospital-Based Matched Case-Control Study. J Breast Cancer, 2018. 21(2): p. 231. doi: 10.4048/jbc.2018.21.2.231
  • 26. Zapater-Moros A., et al., Probabilistic graphical models relate immune status with response to neoadjuvant chemotherapy in breast cancer. Oncotarget, 2018. 9(45): p. 27586–27594. doi: 10.18632/oncotarget.25496
  • 27. Bray F., et al., Global cancer transitions according to the Human Development Index (2008–2030): a population-based study. Lancet Oncol, 2012. 13(8): p. 790–801. doi: 10.1016/S1470-2045(12)70211-5
  • 28. Loong S., et al., The effectiveness of the routine clinic visit in the follow-up of breast cancer patients: analysis of a defined patient cohort. Clin Oncol (R Coll Radiol), 1998. 10(2): p. 103–6.
  • 29. Churn M. and Kelly V., Outpatient Follow-up After Treatment for Early Breast Cancer: Updated Results After 5 Years. Clinical Oncology, 2001. 13(3): p. 187–194. doi: 10.1053/clon.2001.9251
  • 30. Vaughn A.E., et al., Using a social marketing approach to develop Healthy Me, Healthy We: a nutrition and physical activity intervention in early care and education. Transl Behav Med, 2019. 9(4): p. 669–681. doi: 10.1093/tbm/iby082
  • 31. Schneble E.J., et al., Future directions for the early detection of recurrent breast cancer. J Cancer, 2014. 5(4): p. 291–300. doi: 10.7150/jca.8017
  • 32. Hastings G. and Saren M., The Critical Contribution of Social Marketing: Theory and Application. Marketing Theory, 2003. 3(3): p. 305–322.
  • 33. Karabatak M. and Ince M.C., An expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications, 2009. 36(2, Part 2): p. 3465–3469.
  • 34. Komoike Y., et al., Ipsilateral breast tumor recurrence (IBTR) after breast-conserving treatment for early breast cancer: risk factors and impact on distant metastases. Cancer, 2006. 106(1): p. 35–41. doi: 10.1002/cncr.21551
  • 35. Li D., et al., Interactions of Family History of Breast Cancer with Radiotherapy in Relation to the Risk of Breast Cancer Recurrence. J Breast Cancer, 2017. 20(4): p. 333–339. doi: 10.4048/jbc.2017.20.4.333
  • 36. Menarche, menopause, and breast cancer risk: individual participant meta-analysis, including 118 964 women with breast cancer from 117 epidemiological studies. The Lancet Oncology, 2012. 13(11): p. 1141–1151. doi: 10.1016/S1470-2045(12)70425-4
  • 37. Bai Y., et al., Raman spectroscopy-based biomarker screening by studying the fingerprint characteristics of chronic lymphocytic leukemia and diffuse large B-cell lymphoma. Journal of Pharmaceutical and Biomedical Analysis, 2020. 190: p. 113514. doi: 10.1016/j.jpba.2020.113514
  • 38. Yu W., Kim J.K., and Park T., Estimation of Area Under the ROC Curve under nonignorable verification bias. Stat Sin, 2018. 28(4): p. 2149–2166. doi: 10.5705/ss.202016.0315
  • 39. La Vecchia C., et al., Original article: The role of age at menarche and at menopause on breast cancer risk: Combined evidence from four case-control studies. Annals of Oncology, 1992. 3(8): p. 625–629. doi: 10.1093/oxfordjournals.annonc.a058288
  • 40. Effects of radiotherapy and of differences in the extent of surgery for early breast cancer on local recurrence and 15-year survival: an overview of the randomised trials. The Lancet, 2005. 366(9503): p. 2087–2106.
  • 41. Wittkowski K.M., et al., Complex polymorphisms in endocytosis genes suggest alpha-cyclodextrin as a treatment for breast cancer. PLoS One, 2018. 13(7): p. e0199012. doi: 10.1371/journal.pone.0199012
  • 42. de Pedro M., Otero B., and Martín B., Fertility preservation and breast cancer: a review. Ecancermedicalscience, 2015. 9: p. 503. doi: 10.3332/ecancer.2015.503
  • 43. Lopresti M., Rizack T., and Dizon D.S., Sexuality, fertility and pregnancy following breast cancer treatment. Gland Surg, 2018. 7(4): p. 404–410. doi: 10.21037/gs.2018.01.02
  • 44. Zhou W., et al., Risk of breast cancer and family history of other cancers in first-degree relatives in Chinese women: a case control study. BMC Cancer, 2014. 14: p. 662. doi: 10.1186/1471-2407-14-662
  • 45. Lafourcade A., et al., Factors associated with breast cancer recurrences or mortality and dynamic prediction of death using history of cancer recurrences: the French E3N cohort. BMC Cancer, 2018. 18(1): p. 171. doi: 10.1186/s12885-018-4076-4
  • 46. Neri A., et al., Breast cancer local recurrence: risk factors and prognostic relevance of early time to recurrence. World J Surg, 2007. 31(1): p. 36–45. doi: 10.1007/s00268-006-0097-2
  • 47. Mauguen A., et al., Dynamic prediction of risk of death using history of cancer recurrences in joint frailty models. Stat Med, 2013. 32(30): p. 5366–80. doi: 10.1002/sim.5980
  • 48. Voogd A.C., et al., Differences in risk factors for local and distant recurrence after breast-conserving therapy or mastectomy for stage I and II breast cancer: pooled results of two large European randomized trials. J Clin Oncol, 2001. 19(6): p. 1688–97. doi: 10.1200/JCO.2001.19.6.1688
  • 49. Vinh-Hung V. and Verschraegen C., Breast-conserving surgery with or without radiotherapy: pooled-analysis for risks of ipsilateral breast tumor recurrence and mortality. J Natl Cancer Inst, 2004. 96(2): p. 115–21. doi: 10.1093/jnci/djh013
  • 50. Soerjomataram I., et al., An overview of prognostic factors for long-term survivors of breast cancer. Breast Cancer Res Treat, 2008. 107(3): p. 309–30. doi: 10.1007/s10549-007-9556-1
  • 51. Yu Y., et al., Label-free detection of nasopharyngeal and liver cancer using surface-enhanced Raman spectroscopy and partial lease squares combined with support vector machine. Biomed Opt Express, 2018. 9(12): p. 6053–6066. doi: 10.1364/BOE.9.006053
  • 52. Candelaria R.P., et al., Analysis of stereotactic biopsies performed on suspicious calcifications identified within 24 months after completion of breast conserving surgery and radiation therapy for early breast cancer: Can biopsy be obviated? The American Journal of Surgery, 2018. 215(4): p. 693–698. doi: 10.1016/j.amjsurg.2017.06.032
  • 53. Svensson S., et al., CCL2 and CCL5 Are Novel Therapeutic Targets for Estrogen-Dependent Breast Cancer. Clin Cancer Res, 2015. 21(16): p. 3794–805. doi: 10.1158/1078-0432.CCR-15-0204
  • 54. Lega F., Developing a marketing function in public healthcare systems: a framework for action. Health Policy, 2006. 78(2–3): p. 340–52. doi: 10.1016/j.healthpol.2005.11.013
  • 55. Martin-Castillo B., et al., Basal/HER2 breast carcinomas: integrating molecular taxonomy with cancer stem cell dynamics to predict primary resistance to trastuzumab (Herceptin). Cell Cycle, 2013. 12(2): p. 225–45. doi: 10.4161/cc.23274
  • 56. Lei J., et al., Assessment of variation in immunosuppressive pathway genes reveals TGFBR2 to be associated with prognosis of estrogen receptor-negative breast cancer after chemotherapy. Breast Cancer Res, 2015. 17(1): p. 18.
  • 57. Coles C.E., et al., Partial-breast radiotherapy after breast conservation surgery for patients with early breast cancer (UK IMPORT LOW trial): 5-year results from a multicentre, randomised, controlled, phase 3, non-inferiority trial. The Lancet, 2017. 390(10099): p. 1048–1060.
  • 58. Eccles S.A., et al., Critical research gaps and translational priorities for the successful prevention and treatment of breast cancer. Breast Cancer Res, 2013. 15(5): p. R92. doi: 10.1186/bcr3493
  • 59. te Boekhorst D.S., et al., Periodic follow-up after breast cancer and the effect on survival. Eur J Surg, 2001. 167(7): p. 490–6. doi: 10.1080/110241501316914849
  • 60. Kim I., et al., Erratum: Development of a Nomogram to Predict N2 or N3 Stage in T1-2 Invasive Breast Cancer Patients with No Palpable Lymphadenopathy. J Breast Cancer, 2018. 21(2): p. 232. doi: 10.4048/jbc.2018.21.2.232
  • 61. Rautenberg T., et al., Economic outcomes of sequences which include monoclonal antibodies against vascular endothelial growth factor and/or epidermal growth factor receptor for the treatment of unresectable metastatic colorectal cancer. J Med Econ, 2014. 17(2): p. 99–110. doi: 10.3111/13696998.2013.864973
  • 62. Zapater-Moros A., et al., Probabilistic graphical models relate immune status with response to neoadjuvant chemotherapy in breast cancer. Oncotarget, 2018. 9(45): p. 27586–27594. doi: 10.18632/oncotarget.25496

Decision Letter 0

Bryan C Daniels

7 May 2020

PONE-D-20-07947

Modeling and Comparing Data Mining Algorithms for Prediction of Recurrence of Breast Cancer

PLOS ONE

Dear Dr Mojaradi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

One reviewer argues that more work would be necessary to make the claim that the results are clinically relevant.  This could be addressed by a major reworking of the analysis or by removing or sufficiently qualifying any claims of applicability to clinical or policy decisions.

We would appreciate receiving your revised manuscript by Jun 21 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Bryan C Daniels

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements:

1.    Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.  

Whilst you may use any professional scientific editing service of your choice, PLOS has partnered with both American Journal Experts (AJE) and Editage to provide discounted services to PLOS authors. Both organizations have experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. To take advantage of our partnership with AJE, visit the AJE website (http://learn.aje.com/plos/) for a 15% discount off AJE services. To take advantage of our partnership with Editage, visit the Editage website (www.editage.com) and enter referral code PLOSEDIT for a 15% discount off Editage services.  If the PLOS editorial team finds any language issues in text that either AJE or Editage has edited, the service provider will re-edit the text for free.

Upon resubmission, please provide the following:

●      The name of the colleague or the details of the professional service that edited your manuscript

●      A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)

●      A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

3. In the ethics statement in the manuscript and in the online submission form, please provide additional information about the database used in your retrospective study. Specifically, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have their data used in research, please include this information.

4. Your ethics statement must appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please also ensure that your ethics statement is included in your manuscript, as the ethics section of your online submission will not be published alongside your manuscript.

5. Thank you for stating the following financial disclosure:

"The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

At this time, please address the following queries:

a)    Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

b)    State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c)     If any authors received a salary from any of your funders, please state which authors and which funders.

d)     If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

6. We note you have included tables to which you do not refer in the text of your manuscript. Please ensure that you refer to Tables 1 and 4 in your text; if accepted, production will need this reference to link the reader to the tables.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear authors,

First, I'd like to acknowledge the effort made to improve the manuscript by moving your analysis to a meaningful clinical question.

Again, my main concern is that the interpretation of the clinical scenario is not correct, hence the conclusion "Therefore, machine learning algorithms, in particular, the C5.0, can be of great help to physicians and health care Policy Makers, especially in predicting recurrence of breast cancer." raised from your analysis is not supported.

There are many things regarding the experimental setup that should be amended.

First of all: recurrence in breast cancer is a time-dependent event. This has been previously established without doubt. I think it's mandatory to include time to recurrence in order to predict this recurrence.

Second: it has also been established that, for breast cancer recurrence studies, a minimum follow-up of 5 years is necessary to have a complete picture of the recurrences. This is due to the huge heterogeneity of breast cancer. For example, most TNBCs will relapse in the first three years, while ER/PR+ breast cancer recurrence is spread over the first 10 years. This can bias the analyses because most of the recurrences will be from TNBCs.

Third: I strongly recommend including a clinical advisor in this work. There are some variable interpretations that should be corrected to faithfully reflect the clinical scenario (for example, ER and PR are analyzed independently but not together, as in clinical practice). No data about the number of recurrences in this population are presented. Obviously, treatment is related to relapse. But it should not be included as a variable to predict relapse, because the objective is to be able to predict this relapse prior to treatment in ER/PR+ breast cancer, allowing one to decide whether chemotherapy is needed or not. On the other hand, prediction of relapse in TNBC will follow other clinical objectives (for example, finding chemotherapy-resistant tumors in order to provide these patients with additional treatment options via clinical trials). All these differences in the clinical interpretation should be well defined prior to the analysis, and must condition the analysis itself, to demonstrate the capability of these powerful mathematical tools in the clinical setting.

And finally, I want to raise a question. Is it not possible, with the large number of patients included, to split them into training and test sets? This could help to estimate the overfitting of the methods.

Reviewer #2: Early detection of breast cancer recurrence can provide a potential advantage in the treatment of this disease. There has been much research in the recent past on finding the most critical attributes that play a major role in predicting recurrence of breast cancer. In this research, however, the authors have also interviewed specialists in the field of breast cancer in addition to applying data mining techniques, which increases the authenticity of this work.

Since this is a revised paper, my review below is based on the previous review comments.

Reviewer 1 Comments

Comment 1 and Comment 2:

The authors have resolved this issue. As per the comment of Reviewer 1, the authors have used 7 data mining methods, and the predictions now address the recurrence of breast cancer. The attributes are now described.

Reviewer 2 Comments

Comment 1 and Comment 2:

The authors have resolved this issue. As per the comment of Reviewer 2, the authors have used new prediction algorithms, and F1 and the area under the ROC curve are also used to evaluate performance.

As per the revised paper, the author has modified the paper according to the reviewer comments.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Angelo Gámez-Pozo

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Oct 15;15(10):e0237658. doi: 10.1371/journal.pone.0237658.r003

Author response to Decision Letter 0


10 Jul 2020

Dear Editor-in-Chief,

We wish to thank you, the associate editor, and the reviewers for the comments we received on the attached paper, and also thank you for allowing us to revise the manuscript again which has helped to enhance the quality of the paper. We hereby submit a revised version of the paper with an ID number of PONE-D-20-07947. Significant changes and modifications were highlighted in the revised manuscript, and the detailed responses to reviewers’ comments are listed as follows. We hope that the modified version is acceptable, and we look forward to your kind recommendations.

Best regards,

The Authors

Editor Comments to the Author

Comment 1:

One reviewer argues that more work would be necessary to make the claim that the results are clinically relevant. This could be addressed by a major reworking of the analysis or by removing or sufficiently qualifying any claims of applicability to clinical or policy decisions.

Response and corrections regarding comment 1:

Following your helpful comment, we carefully reviewed the manuscript. To reduce ambiguity in the results and conclusions of our analysis, some sentences were clarified and others were removed. Instead of definite statements, we now use hedged wording such as “possibly” and “could be helpful” when drawing conclusions about the diagnosis phase of the clinical scenario. We state that, among the stages of the clinical scenario, our research could possibly help physicians at the detection stage with good probability, before the patient enters costly and stressful treatment (please see the “Abstract” on page 1, the “Introduction” on page 3, lines 82-88, and the “Conclusion and Discussion” on pages 21-22). In addition, to make the results of the analysis clearer, some sentences were removed and others were moved to the methodology section (please see “Materials and Methods” on page 3, lines 106-111, and “Results” on pages 8-10). More clinical characteristics of the patients were added (please see “Results”, page 8, lines 249-250, and page 9, lines 251-253). More tables and results from our analysis were added to the end of the Results section to qualify the medical output at each recurrence stage (please see “Results”, pages 18-21). The most important features were obtained by two methods: the chi-square test (page 8) and forward feature selection (please see “Feature importance analysis” on page 17 and “Important features for breast cancer metastasis prediction” on pages 20-21).

Review Comments to the Author:

Reviewer 1 Comments:

First, I'd like to acknowledge the effort made to improve the manuscript by moving your analysis to a meaningful clinical question.

Comment 1:

My main concern is that the interpretation of the clinical scenario is not correct, hence the conclusion "Therefore, machine learning algorithms, in particular, the C5.0, can be of great help to physicians and health care Policy Makers, especially in predicting recurrence of breast cancer." raised from your analysis is not supported.

Response and corrections regarding comment 1:

Following your helpful comment, we carefully reviewed the manuscript. To reduce ambiguity in the results and conclusions of our analysis, some sentences were clarified and others were removed. Instead of definite statements, we now use hedged wording such as “possibly” and “could be helpful” when drawing conclusions about the diagnosis phase of the clinical scenario. We state that, among the stages of the clinical scenario, our research could possibly help physicians at the detection stage with good probability, before the patient enters costly and stressful treatment (please see the “Abstract” on page 1, the “Introduction” on page 3, lines 82-88, and the “Conclusion and Discussion” on pages 21-22). In addition, to make the results of the analysis clearer, some sentences were removed and others were moved to the methodology section (please see “Materials and Methods” on page 3, lines 106-111, and “Results” on pages 8-10). More clinical characteristics of the patients were added (please see “Results”, page 8, lines 249-250, and page 9, lines 251-253). More tables and results from our analysis were added to the end of the Results section to qualify the medical output at each recurrence stage (please see “Results”, pages 18-21). The most important features were obtained by two methods: the chi-square test (page 8) and forward feature selection (please see “Feature importance analysis” on page 17 and “Important features for breast cancer metastasis prediction” on pages 20-21).

Comment 2:

There are many things regarding the experimental setup that should be amended.

Recurrence in breast cancer is a time-dependent event. This has been previously established without doubt. I think it's mandatory to include time to recurrence to predict this recurrence.

Response and corrections regarding comment 2:

Based on this helpful and valuable comment, the manuscript has been revised to clarify that the present study uses a cross-sectional design: we applied data mining algorithms to a cross-sectional dataset, which examines the relationship between disease (or other health-related features) and other variables of interest as they exist in a defined population at a single point in time or over a short period (please see “Materials and Methods”, lines 106-111 on page 3).

Comment 3:

It has also been established that, for breast cancer recurrence studies, a minimum follow-up of 5 years is necessary to have a complete picture of the recurrences. This is due to the huge heterogeneity of breast cancer. For example, most TNBCs will relapse in the first three years, while ER/PR+ breast cancer recurrence is spread over the first 10 years. This can bias the analysis because most of the recurrences will be from TNBCs.

Response and corrections regarding comment 3:

Thank you for your valuable comment. As mentioned in our reply to the previous comment, given the nature of the disease and some limitations of our dataset, we used a cross-sectional statistical method so that the machine learning analysis would be more accurate and the conclusions improved. Following your valuable and precise comment, we have revised the manuscript to clarify that the data were gathered from June 2018 to June 2019 from the official statistics of the Ministry of Health and Medical Education and the Iran Cancer Research Center, for patients with breast cancer who had been followed for a minimum of 5 years between February 2014 and April 2019, comprising 5471 independent records. According to some studies of breast cancer recurrence, for example Wangchinda & Ithimakin (2016), breast cancer relapse is generally analyzed in two periods (earlier than 5 years or later than 5 years). However, because of a limitation of our dataset (we had complete information for only 16 patients whose relapse occurred after 10 years), we analyzed only patients whose recurrence occurred within 5 years, and we have added some analysis of these first five years (please see “Materials and Methods”, lines 106-111 on page 3, and Table 12 on page 19 of “Results”).

We think that, at this stage, predicting recurrence of the disease over shorter periods can possibly be valuable, because most relapses occur during the first 5 years after diagnosis. The algorithms could possibly help predict disease recurrence within a shorter period and so help physicians detect and manage recurrence.

However, following your helpful comment, we will certainly try to gather more information for our dataset and, in our future studies, conduct more practical analyses of patients whose breast cancer recurs after a longer time, in order to improve our results and conclusions.

Comment 4:

I strongly recommend including a clinical advisor in this work. There are some variable interpretations that should be corrected to faithfully reflect the clinical scenario (for example, ER and PR are analyzed independently but not together, as in clinical practice). No data about the number of recurrences in this population are presented. Obviously, treatment is related to relapse. But it should not be included as a variable to predict relapse, because the objective is to be able to predict this relapse prior to treatment in ER/PR+ breast cancer, allowing one to decide whether chemotherapy is needed or not. On the other hand, prediction of relapse in TNBC will follow other clinical objectives (for example, finding chemotherapy-resistant tumors in order to provide these patients with additional treatment options via clinical trials). All these differences in the clinical interpretation should be well defined prior to the analysis, and must condition the analysis itself, to demonstrate the capability of these powerful mathematical tools in the clinical setting.

Response and corrections regarding comment 4:

Following your helpful and valuable comment, the manuscript has been reviewed and revised by additional clinical advisors. In the first stage of our research, we concluded from several studies, for example Iqbal & Buch (2016), Lim, Palmieri, & Tilley (2016) and Patani & Martin (2014), and with confirmation from the physicians of the patients in our dataset, that about 80% of breast cancers are “ER-positive”, which means the cancer cells grow in response to the hormone estrogen. About 65% of these are also “PR-positive”: they grow in response to another hormone, progesterone. Hence, these two criteria can be analyzed independently.

However, in this study breast cancer is classified into four groups based on the IHC profile of ER/PR and Her2/neu expression, positive (+) and/or negative (−). The groups are:

ER/PR+, Her2+ = ER+/PR+, Her2+; ER−/PR+, Her2+; ER+/PR−, Her2+

ER/PR+, Her2− = ER+/PR+, Her2−; ER−/PR+, Her2−; ER+/PR−, Her2−

ER/PR−, Her2+ = ER−/PR−, Her2+

ER/PR−, Her2− = ER−/PR−, Her2−

ER and PR are therefore not considered independently. The IHC classification correlates well with intrinsic gene expression microarray categorization: ER/PR+, Her2+ with Luminal B; ER/PR+, Her2− with Luminal A; ER/PR−, Her2+, and ER/PR−, Her2− with triple-negative/basal-like tumors. Apart from lending itself to subtype analyses of tumors when fresh tissue is not available, the IHC classification has prognostic and therapeutic implications and is inexpensive and readily available. In general, we paid attention to both the related and the separate factors, but some explanations were omitted to limit the length of the article. Because of the information limitation in our dataset (we had complete information for only 16 patients whose relapse occurred after 10 years), we analyzed only patients whose recurrence occurred within 5 years. Also, according to most studies, the greatest risk of recurrence is in the 5 years after breast cancer diagnosis. Following your valuable opinion, we have added some explanations to make this clearer (please see lines 212-222 on page 6 of “Results”).
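The four-group IHC classification listed above can be illustrated with a minimal sketch. It assumes each receptor status is available as a boolean; the function name `ihc_group` is hypothetical and is not taken from the manuscript, and the ASCII "-" stands in for the minus sign used in the group labels.

```python
def ihc_group(er_positive: bool, pr_positive: bool, her2_positive: bool) -> str:
    """Map ER, PR and Her2 status to one of the four IHC groups.

    Assumption for illustration: the hormone-receptor arm (ER/PR) is
    positive when ER or PR (or both) is positive, which matches the
    three combinations listed under each ER/PR+ group.
    """
    hormone = "ER/PR+" if (er_positive or pr_positive) else "ER/PR-"
    her2 = "Her2+" if her2_positive else "Her2-"
    return f"{hormone}, {her2}"


# Hypothetical patients, not data from the study.
print(ihc_group(er_positive=False, pr_positive=True, her2_positive=True))
print(ihc_group(er_positive=False, pr_positive=False, her2_positive=False))
```

Combining ER and PR with a logical "or" is what makes the two receptors a single input here rather than two independent ones.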

Your concern about the dependent variable is also valuable, precise, and helpful, and it led us to revise and clarify our explanation of the input features to the machine learning algorithms. In the first stage, based on the chi-square results in Table 5, we extracted the factors most related to breast cancer recurrence. Then, after refining the results and discarding the treatment and diagnosis methods, 3 factors were identified by the chi-square method as the most important factors affecting the recurrence of breast cancer in our dataset (please see pages 8-10 in “Results”). This is one of the two methods we used to extract important features. The other is forward feature selection in machine learning. We used these two methods to validate the results of the best-feature extraction analysis (please see “Feature importance analysis” on page 17 and “Important features for breast cancer metastasis prediction” on pages 20-21).
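As a rough illustration of the chi-square screening step described above, the following sketch computes the Pearson chi-square statistic for a contingency table of a categorical feature against recurrence. The counts are hypothetical and are not taken from Table 5; the function name is also an assumption for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and outcome.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = feature level, columns = (recurrence, no recurrence).
table = [[30, 70],
         [10, 90]]
stat = chi_square_statistic(table)
print(stat)  # 12.5 for these counts
```

For a 2 x 2 table there is one degree of freedom, so a statistic above roughly 3.84 would indicate an association at the 5% level. Ranking features by this statistic is one simple way to shortlist candidates before a model-based method such as forward selection.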

We have also clarified that, for the prediction of breast cancer recurrence, the 6 treatment and diagnosis factors are discarded and the other influencing features are used as the inputs to the machine learning algorithms (please see page 10 in “Results”, lines 263-268).

Furthermore, more clinical characteristics and information about the patients have been added (please see “Results”, page 8, lines 249-250, page 9, lines 251-253, and Tables 10, 11 and 13 on pages 18-19). For validation, we compared the results of the machine learning algorithms with the clinical results in our patient dataset. Tables 10-13 on pages 18-19 present this comparison and analysis (please see “Results”, Tables 10-13 on pages 18-19).

In the present study, we tried to collect data on a large number of patients, with some limitations in the clinical information. Given the conditions and limitations we faced, and after consulting the physicians of these patients, we tried to analyze the dataset to help diagnose cancer recurrence in its early stages and to use machine learning as a mathematical tool in the clinical stages. However, following your helpful comments, we will certainly try to gather more information for our dataset and conduct more practical analyses in our future studies.

Comment 5:

Finally, I want to raise a question. Is it not possible, with the large number of patients included, to split them into training and test sets? This could help to estimate the overfitting of the methods.

Response and corrections regarding comment 5:

Thank you for this valuable question. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of an increased test set error. There are several approaches to avoiding overfitting. In this research we used a nested 5-fold cross-validation approach to train (four folds) and test (one fold) the models. Patients meeting the inclusion criteria are randomly assigned to one of the five outer folds. To ensure that the important feature set was generated from real patients with breast cancer, and that the importance of the features was not inflated by duplicating minority cases, we chose an under-sampling approach to build the model. We randomly selected 20 sets of controls in each round of cross-validation, matching the number of cases, and generated 80 training datasets by using one set of controls and all cases. In each training step, we used 5-fold inner cross-validation to tune the models. This method was used in the study, but to keep the machine learning analysis from becoming too long we had not described it in the manuscript and had reported only the final results. In response to your concern, we have now added these explanations to the manuscript (please see pages 4-5 in “Materials and Methods”).
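The sampling scheme described in this response can be sketched as follows. This is only an index-level illustration under stated assumptions: labels are 1 for cases (recurrence) and 0 for controls, controls outnumber cases, and the function names are hypothetical. It does not fit any model, and it does not attempt to reproduce the count of 80 training datasets, because the response does not spell out how that number is derived.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def balanced_training_sets(train_idx, labels, n_sets, rng):
    """Under-sample controls so each set holds all cases plus as many controls."""
    cases = [i for i in train_idx if labels[i] == 1]
    controls = [i for i in train_idx if labels[i] == 0]
    # Assumes len(controls) >= len(cases); rng.sample raises otherwise.
    return [cases + rng.sample(controls, len(cases)) for _ in range(n_sets)]


def nested_cv_plan(labels, outer_k=5, inner_k=5, n_control_sets=20, seed=0):
    """Build outer test folds, balanced training sets and inner tuning folds."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(len(labels)), outer_k, rng)
    plan = []
    for held_out in range(outer_k):
        test_idx = outer_folds[held_out]
        train_idx = [i for f, fold in enumerate(outer_folds)
                     if f != held_out for i in fold]
        for balanced in balanced_training_sets(train_idx, labels,
                                               n_control_sets, rng):
            # Inner folds are used only for tuning, never for final testing.
            inner_folds = k_fold_indices(balanced, inner_k, rng)
            plan.append({"test": test_idx, "train": balanced,
                         "inner_folds": inner_folds})
    return plan


# Hypothetical labels: 20 cases and 80 controls.
labels = [1] * 20 + [0] * 80
plan = nested_cv_plan(labels)
print(len(plan))  # 5 outer rounds x 20 control sets = 100 entries
```

Because under-sampling is applied only to the training folds, each held-out outer fold keeps its original class balance, which is what makes its error a usable estimate of overfitting.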

Finally, the authors are thankful to the editors and reviewers for the critical comments and constructive suggestions, which helped improve the quality and presentation of the paper significantly.

References:

1. Wangchinda, P., & Ithimakin, S. (2016). Factors that predict recurrence later than 5 years after initial treatment in operable breast cancer. World Journal of Surgical Oncology, 14(1), 223. doi:10.1186/s12957-016-0988-0

2. Iqbal, B., & Buch, A. (2016). Hormone receptor (ER, PR, HER2/neu) status and proliferation index marker (Ki-67) in breast cancers: Their onco-pathological correlation, shortcomings, and future trends. Medical Journal of Dr. D.Y. Patil University, 9(6), 674-679. doi:10.4103/0975-2870.194180

3. Lim, E., Palmieri, C., & Tilley, W. D. (2016). Renewed interest in the progesterone receptor in breast cancer. British Journal of Cancer, 115(8), 909-911. doi:10.1038/bjc.2016.303

4. Patani, N., & Martin, L. A. (2014). Understanding response and resistance to estrogen deprivation in ER-positive breast cancer. Molecular and Cellular Endocrinology, 382(1), 683-694. doi:10.1016/j.mce.2013.09.038

Attachment

Submitted filename: response to reviewers-v22.docx

Decision Letter 1

Bryan C Daniels

31 Jul 2020

Modeling and Comparing Data Mining Algorithms for Prediction of Recurrence of Breast Cancer

PONE-D-20-07947R1

Dear Dr. Mojaradi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Note that there are also grammatical and copyediting issues that will need to be addressed before publication.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Bryan C Daniels

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Bryan C Daniels

27 Aug 2020

PONE-D-20-07947R1

Modeling and Comparing Data Mining Algorithms for Prediction of Recurrence of Breast Cancer

Dear Dr. Mojaradi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Bryan C Daniels

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Data

    (XLSX)

    Attachment

    Submitted filename: Response to Reviewers-v12.docx

    Attachment

    Submitted filename: response to reviewers-v22.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS ONE are provided here courtesy of PLOS
