Disease Markers. 2017 Aug 29;2017:5745724. doi: 10.1155/2017/5745724

Identification of Biomarkers for Predicting Lymph Node Metastasis of Stomach Cancer Using Clinical DNA Methylation Data

Jun Wu 1, Yawen Xiao 2, Chao Xia 1, Fan Yang 3, Hua Li 1, Zhifeng Shao 1, Zongli Lin 4, Xiaodong Zhao 1,*
PMCID: PMC5603126  PMID: 28951630

Abstract

Background

Lymph node (LN) metastasis is an independent risk factor for stomach cancer recurrence, and its presence strongly influences the overall survival of stomach cancer patients. Accurate prediction of LN metastasis can therefore support a more reliable prognosis evaluation for stomach cancer patients. Increasing evidence indicates that aberrant DNA methylation appears before symptoms of the disease become clinically apparent.

Objective

To select key biomarkers for predicting the presence of LN metastasis in stomach cancer from clinical DNA methylation data using a machine learning method.

Methods

To reduce the risk of overfitting in the prediction task, we applied a three-step feature selection method tailored to the properties of DNA methylation data.

Results

The feature selection procedure extracted several cancer-related and lymph node metastasis-related genes, such as TP73, PDX1, FUT8, HOXD1, NMT1, and SEMA3E. The prediction performance was evaluated on a public DNA methylation dataset. The results showed that the three-step feature selection procedure can largely improve the prediction performance, which supports the reliability of the selected biomarkers.

Conclusions

With the selected biomarkers, the prediction method achieves higher accuracy in detecting LN metastasis, and the results indirectly support the reliability of the selected biomarkers.

1. Introduction

According to recent reports of the World Health Organization (WHO), stomach cancer is the fifth most common cancer in the world, and more than 70% of new cases occur in developing countries (mainly in China) [1, 2]. Early-stage stomach cancer, which is defined as stomach cancer limited to the mucosa or submucosa irrespective of the presence or absence of lymph node (LN) metastasis, confers a 5-year survival rate of greater than 90% in many centers [3]. However, even in the early stage, the reported incidence of LN metastasis was 14.1% overall and 4.8 to 23.6% depending on cancer depth [4, 5]. Many researchers have demonstrated that LN metastasis is an independent risk factor for stomach cancer recurrence in patients following curative resection, and that the overall survival of LN metastasis-negative stomach cancer patients is significantly longer than that of LN metastasis-positive patients [6, 7]. Therefore, accurate prediction of the presence of LN metastasis can support a more reliable prognosis evaluation for stomach cancer patients.

Traditionally, LN metastasis is diagnosed mainly by preoperative imaging such as abdominal ultrasonography (US) and computed tomography (CT), but the diagnostic accuracy of these methods is limited. It was reported that the detection rate of lymph nodes around the stomach was 18.7% with CT and 5.0% with US [8]. Endoscopic ultrasonography (EUS) is an effective approach and generally provides a more accurate prediction of the tumor stage than CT does. However, the accuracy of EUS-based prediction for LN metastasis is only slightly higher than that of CT [4].

Recently, increasing evidence has suggested a critical role for DNA methylation in human carcinogenesis [9, 10]. Aberrant DNA methylation is one of the common alterations in carcinogenesis, and it appears before symptoms of the disease become clinically apparent [11–13]. In addition, aberrant DNA methylation can promote the progression of disease [14]. With the development of high-throughput technology, plenty of DNA methylation data are available for cancer prediction and biomarker identification [15–18]. Inspired by these applications, in this study, we used DNA methylation data to categorize the incidence of LN metastasis in stomach cancer through a machine learning method. Because DNA methylation data are high-dimensional and noisy, this categorization still faces several challenges. In contrast to the large number of features (probes), the small number of cancer samples available for training may degrade classification performance and raise the risk of overfitting [19]. It is therefore natural, and perhaps essential, to employ a feature selection step to obtain a feature set that consists only of features contributing positively to the classification, without redundant ones. The key benefits of feature selection are reduced overfitting, improved accuracy, and shorter training time. Beyond that, feature selection in cancer research can help researchers identify key carcinogenic markers, and accurate prediction can provide references for clinical implementation.

Feature selection methods can be divided into three main categories: filter, wrapper, and embedded methods [20–23]. Filter methods use a measure to score feature subsets, whereas wrapper methods use a predictive model to score them. With a wrapper method, different feature sets are generated and an optimization engine, such as a genetic algorithm [24], simulated annealing [25], or particle swarm optimization [26], is used to search for a set of features that best distinguishes the training samples of different classes. Embedded methods are a catch-all group of techniques that perform feature selection as part of the model learning process.

In this study, we grouped the stomach cancer data into three categories (normal, LN metastasis negative, and LN metastasis positive) according to the clinical information. A three-step feature selection method was applied to identify the key genes. To evaluate the reliability of the selected biomarkers, we used the random forest algorithm to predict the categories with and without the three-step feature selection method. The results showed that the prediction accuracy was largely improved with the selected biomarkers, which indirectly supports their reliability.

2. Results

2.1. Feature Selection

Feature selection is commonly used to remove irrelevant and redundant features from the original feature set. The minimum redundancy maximum relevance (mRMR) method, which is based on mutual information, finds a set of features that have the highest relevance to the target class and are also maximally dissimilar to each other. However, mRMR is computationally expensive. In this paper, differential methylation analysis was therefore combined with mRMR to achieve the preliminary feature selection. To further obtain the most informative features for classification, a wrapper feature selection method based on a genetic algorithm was introduced to obtain the final feature set.

2.1.1. Feature Selection with Differential Methylation Region (DMR) Analysis

To preliminarily obtain the probes that are closely related to the phenotype, we applied DMR analysis, which aims to identify probes that are significantly differentially methylated between phenotypes. We compared the methylation status of each probe in the normal samples with that in the cancer samples, and the methylation status of each probe in the LN-negative samples with that in the LN-positive samples. Differentially methylated probes were determined with the Mann–Whitney U test. The densities of the mean difference and of the Benjamini–Hochberg- (BH-) adjusted p value for the two comparisons are shown in Figure 1, from which we can see that the methylation patterns of the LN-negative and LN-positive samples were much more similar to each other than those of the normal and cancer samples. This indicated that the thresholds used for selecting significantly differentially methylated probes must differ between the two comparisons. For the comparison of normal versus cancer, we selected probes with an adjusted p value less than 1E−5 and an absolute mean difference greater than 0.2 as significantly differentially methylated probes. For the comparison of LN negative versus LN positive, the thresholds for the adjusted p value and the absolute mean difference were set to 0.01 and 0.02, respectively. With these criteria, we identified 1077 and 275 significantly differentially methylated probes in the two comparisons, respectively. Only 33 probes were shared by both.
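The thresholding step described above can be illustrated with a minimal sketch. It assumes that the raw Mann–Whitney U test p values and the per-probe mean methylation differences have already been computed elsewhere (for example with `scipy.stats.mannwhitneyu`); the function names `bh_adjust` and `select_dmps` and the toy numbers are hypothetical and are not taken from the authors' code.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_dmps(pvalues, mean_diffs, p_cutoff, diff_cutoff):
    """Indices of probes passing both the adjusted-p and mean-difference cutoffs."""
    adjusted = bh_adjust(pvalues)
    return [
        i
        for i, (q, d) in enumerate(zip(adjusted, mean_diffs))
        if q < p_cutoff and abs(d) > diff_cutoff
    ]


# Toy input: four hypothetical probes (values are made up for illustration).
raw_p = [1e-8, 0.5, 2e-7, 0.03]
diffs = [0.35, 0.01, -0.28, 0.10]

# Cutoffs from the normal versus cancer comparison in the text.
selected = select_dmps(raw_p, diffs, p_cutoff=1e-5, diff_cutoff=0.2)
```

For the LN negative versus LN positive comparison, the same function would be called with `p_cutoff=0.01` and `diff_cutoff=0.02`.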

Figure 1.

The density of the mean difference and BH-adjusted p value of the two comparisons. (a) The density of the mean difference of normal versus cancer comparison and LN negative versus LN positive comparison. (b) The density of the log10 BH-adjusted p value of normal versus cancer comparison and LN negative versus LN positive comparison.

2.1.2. Feature Selection with the mRMR Method

The classic mRMR method was applied to filter the probes selected in the previous step, and the probes were ranked according to their scores. Since there is no explicit threshold, only the top 10% of probes were retained and used as input to the next feature selection step. The results of mRMR filtering are shown in Figure 2, from which we can see that the scores for the LN negative versus LN positive comparison were extremely low. This implied that the LN-negative and LN-positive samples were difficult to distinguish.

Figure 2.

The distribution of mRMR scores with respect to features. The dashed line corresponds to the 10% cutoff used. (a) Normal versus cancer. (b) LN negative versus LN positive.

2.1.3. Feature Selection with Genetic Algorithm

Performing feature selection with a genetic algorithm requires formulating feature selection as an optimization problem and encoding each candidate solution as a binary string. In this paper, the random forest algorithm was used as the classifier within the genetic algorithm, and its receiver operating characteristic (ROC) value was used to measure fitness. The details are given in Materials and Methods. The normal versus cancer classification and the LN negative versus LN positive classification were treated independently.

For the genetic algorithm applied to the normal versus tumor classification, the summary of ROC values in each iteration is shown in Figure 3(a), from which we can see that almost all the solutions gave a high fitness value and that, after 12 iterations, the mean fitness hovered around 0.9999. We collected all the best solutions after each of the 12 iterations and counted how many times each probe had been selected. The distribution of the number of times probes were selected is shown in Figure 3(b), and we selected the top 20 probes as the final features used for classification. According to their genomic locations, the 20 probes were associated with 39 genes, including well-known cancer-related genes such as TP73, PDX1, and FUT8 [27–29].
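The counting step can be sketched as follows. The sketch assumes that each best solution is available as a binary mask over the candidate probes; the function name `top_selected_probes` and the probe identifiers are hypothetical placeholders.

```python
from collections import Counter


def top_selected_probes(best_solutions, probe_ids, k):
    """Count how often each probe is switched on across solutions; return the k most frequent."""
    counts = Counter()
    for mask in best_solutions:
        for probe, bit in zip(probe_ids, mask):
            if bit:
                counts[probe] += 1
    # Sort by descending count, then by probe name so ties resolve deterministically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:k]


# Toy example: three best solutions over four hypothetical probes.
probe_ids = ["probe_a", "probe_b", "probe_c", "probe_d"]
best_solutions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
top2 = top_selected_probes(best_solutions, probe_ids, k=2)
```

With the setting described in the text, `k` would be 20 for the normal versus tumor classification.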

Figure 3.

The results of genetic algorithm-based feature selection with respect to the normal versus tumor classification. (a) The fitness improvement in the process of iteration. (b) The distribution of the number of selected probes.

The results of the genetic algorithm for the LN negative versus LN positive classification are shown in Figure 4(a), from which we can see that even after 100 iterations, the fitness was still not much greater than 0.8. This result also implied that the LN-negative and LN-positive samples were difficult to distinguish. The mean fitness hovered around 0.8 after iteration 20. Similarly, we collected all the best solutions after each 20 iterations, and the distribution of the number of times probes were selected is shown in Figure 4(b). Finally, 12 probes were chosen for the final classification; they were associated with 14 genes, including several lymph node metastasis-related genes such as HOXD1, NMT1, and SEMA3E [30].

Figure 4.

The results of genetic algorithm-based feature selection with respect to the LN negative versus LN positive classification. (a) The fitness improvement in the process of iteration. (b) The distribution of the number of selected probes.

2.2. Classification Performance Evaluation

To illustrate the necessity and effectiveness of the feature selection procedure, we compared the performance of the random forest using the three-step-selected probes with that of the random forest using only the differentially methylated probes. We randomly generated 100 training and testing datasets for evaluation and used the AUROC (area under the ROC curve, also written AUC) as the performance measure. The AUROC of a classifier is the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Simply put, a larger AUROC means higher discriminatory power. The box plots in Figure 5 show the distribution of the AUC values of the predictions for the normal versus tumor and LN negative versus LN positive classifications.
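The probabilistic interpretation of the AUROC can be made concrete with a small sketch that counts correctly ordered (positive, negative) pairs directly. It is an illustration of the definition, not the evaluation code used in the study; the function name `auroc`, the tie-handling convention (ties count as one half), and the toy scores are assumptions of this sketch.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two positives and three negatives with made-up classifier scores.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
value = auroc(scores, labels)  # 5 of the 6 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred samples; rank-based formulas give the same value more efficiently.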

Figure 5.

The distribution of the AUC value with different methods. (a) AUC value with different methods with respect to the normal versus tumor classification. (b) AUC value with different methods with respect to the LN negative versus LN positive classification.

From the plots, we can see that with the three-step feature selection procedure, the classifier performed better on both the normal versus tumor and the LN negative versus LN positive classifications than with the DMR analysis alone. Moreover, both the three-step feature selection and the DMR-only analysis gave good performance (AUC values all greater than 0.99) for the normal versus tumor classification.

3. Materials and Methods

3.1. DNA Methylation Dataset and DMR Analysis

The clinical data and the level 3 DNA methylation data were downloaded from The Cancer Genome Atlas (TCGA) project [31]. Only samples with a clear clinical diagnosis were used in the study. The sample numbers are shown in Table 1.

Table 1.

The sample number for each phenotype.

Phenotype               Number of samples
Normal                  27
Cancer, LN negative     94
Cancer, LN positive     189
Cancer, unclassified    12

To identify differentially methylated probes, for each probe, we ranked the samples and compared only the lower methylation quintile sample to the upper methylation quintile sample between two phenotypes using the Mann–Whitney U test. The BH-adjusted p value and mean methylation difference were used to guide the identification.

3.2. Genetic Algorithm

Genetic algorithms are optimization tools that search for a solution by simulating evolution through random variation and natural selection. For feature selection, the individuals are subsets of candidate features encoded as binary strings, in which each value indicates whether a feature is included in the subset. The parameters used for the genetic algorithm were set as follows [19]:

  1. Population size: 100

  2. Maximum number of generations: 100

  3. Selection method: tournament selection with size = 2

  4. Elitism rate: 10 individuals

  5. Crossover: 2-point crossover with probability 0.6

  6. Mutation: random mutation with probability 0.05

The initial population was created by producing chromosomes with a random 30% of the predictors. The fitness function of every individual was defined as the ROC value of the classification method.
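The parameter list above can be read as the following minimal sketch of a binary genetic algorithm. It is not the authors' implementation: the fitness function is passed in as a callable, the toy fitness below is a hypothetical stand-in for the random forest ROC value, and the mutation probability is interpreted as a per-bit flip probability, which is an assumption of this sketch.

```python
import random


def run_ga(fitness, n_features, pop_size=100, generations=100, elite=10,
           p_cross=0.6, p_mut=0.05, init_frac=0.3, seed=0):
    """Binary genetic algorithm for feature-subset search; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)

    def random_individual():
        # Switch on a random fraction (init_frac) of the features.
        n_on = max(1, round(init_frac * n_features))
        on = set(rng.sample(range(n_features), n_on))
        return [1 if i in on else 0 for i in range(n_features)]

    def tournament(scored):
        # Tournament of size 2: the fitter of two random individuals wins.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        # Elitism: carry the best individuals over unchanged.
        next_pop = [ind[:] for _, ind in scored[:elite]]
        while len(next_pop) < pop_size:
            child_a, child_b = tournament(scored)[:], tournament(scored)[:]
            if n_features >= 3 and rng.random() < p_cross:
                # Two-point crossover: swap the segment between two cut points.
                i, j = sorted(rng.sample(range(1, n_features), 2))
                child_a[i:j], child_b[i:j] = child_b[i:j], child_a[i:j]
            for child in (child_a, child_b):
                # Random mutation: flip each bit with probability p_mut (assumed per bit).
                for pos in range(n_features):
                    if rng.random() < p_mut:
                        child[pos] = 1 - child[pos]
                next_pop.append(child)
        population = next_pop[:pop_size]
    best = max(population, key=fitness)
    return best, fitness(best)


# Hypothetical toy fitness (NOT the random forest ROC value used in the paper):
# reward three "informative" features and penalize subset size.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask, best_fit = run_ga(toy_fitness, n_features=12, pop_size=30,
                             generations=40, elite=4, seed=1)
```

The default arguments mirror the listed parameters (population 100, 100 generations, tournament size 2, 10 elite individuals, crossover probability 0.6, mutation probability 0.05, 30% initial inclusion); the toy call uses smaller values only to keep the example fast. In practice the `fitness` callable would train and evaluate a classifier on the features selected by the mask.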

4. Conclusions

Stomach cancer is the fifth most common cancer in the world, and most new cases occur in developing countries, especially in China. Increasing evidence has demonstrated that LN metastasis is an independent risk factor for stomach cancer recurrence in patients following curative resection, and that the overall survival of LN metastasis-negative stomach cancer patients is significantly longer than that of LN metastasis-positive patients.

Based on the critical role of DNA methylation in human carcinogenesis, in this study, we focused on predicting the LN metastasis status using DNA methylation data. However, considering the inherent disadvantages of DNA methylation data, such as the limited number of samples compared with the large number of probes, we applied a three-step feature selection procedure to extract a small subset of representative features. First, we applied differential methylation analysis to identify the probes that are significantly differentially methylated between phenotypes. Then, the mRMR method was used to remove the redundant features remaining after the first filter step. Finally, a wrapper method based on a genetic algorithm was used to achieve the final feature selection. We obtained 20 probes related to 39 genes as inputs for the normal versus tumor prediction, and 12 probes related to 14 genes as inputs for the LN negative versus LN positive prediction (see Table 2). The genes related to the selected probes are mostly associated with cancer and LN metastasis, such as TP73, PDX1, FUT8, HOXD1, NMT1, and SEMA3E.

Table 2.

Identified biomarkers for each prediction.

Normal versus tumor biomarkers: SLC39A5, C3orf32, TP73, CD1B, PCDHGA4, PCDHGA11, PCDHGA9, PCDHGA1, PCDHGB1, PCDHGB6, PCDHGA12, PCDHGB3, PCDHGB7, PCDHGA6, PCDHGA8, PCDHGA10, PCDHGA5, PCDHGB4, PCDHGA3, PCDHGA2, PCDHGB2, PCDHGA7, PCDHGB5, C20orf197, SLC16A5, FUT8, SLC15A2, C17orf93, PRAC, OCLN, TMEM144, FGF2, PDX1, CCL1, LILRB5, LCE3D, GPR45, LPO, CGB5

LN negative versus LN positive biomarkers: LAT2, TTC13, ARV1, NMT1, DCAKD, GJA1, OR7A17, LOX, KRT19, ZNF655, KRTAP4-4, TAAR5, SEMA3E, HOXD1

To evaluate the effect of the three-step feature selection on prediction performance, we downloaded the DNA methylation data and clinical data from the TCGA project. The AUROC value was used as the performance measure. The experimental results showed that the three-step feature selection can largely improve prediction performance, especially for predicting LN negative versus LN positive. The source code used in this paper can be obtained at https://git.oschina.net/junwu302/codes/m2gonkax18sfhdvl3e0b932.

Acknowledgments

This work was partially funded by the State Key Development Program for Basic Research of China (2013CB967402) and the National Natural Science Foundation of China (31671299, 61603161). The authors would like to thank the reviewers in advance for their comments.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1. Ferlay J., Soerjomataram I., Dikshit R., et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. International Journal of Cancer. 2015;136(5):E359–E386. doi: 10.1002/ijc.29210.
  • 2. Antoni S., Soerjomataram I., Møller B., Bray F., Ferlay J. An assessment of GLOBOCAN methods for deriving national estimates of cancer incidence. Methods. 2016;3:p. 4. doi: 10.2471/BLT.15.164384.
  • 3. Rubin E., Palazzo J. P. Rubin’s Pathology. Clinicopathologic Foundations of Medicine. 4th. Philadelphia: Lippincott Williams & Wilkins; 2005. The gastrointestinal tract; pp. 660–739.
  • 4. Akagi T., Shiraishi N., Kitano S. Lymph node metastasis of gastric cancer. Cancers. 2011;3(2):2141–2159. doi: 10.3390/cancers3022141.
  • 5. Park Y. D., Chung Y. J., Chung H. Y., et al. Factors related to lymph node metastasis and the feasibility of endoscopic mucosal resection for treating poorly differentiated adenocarcinoma of the stomach. Endoscopy. 2008;40(1):7–10. doi: 10.1055/s-2007-966750.
  • 6. Kwee R. M., Kwee T. C. Predicting lymph node status in early gastric cancer. Gastric Cancer. 2008;11(3):134–148. doi: 10.1007/s10120-008-0476-5.
  • 7. Deng J. Y., Liang H. Clinical significance of lymph node metastasis in gastric cancer. World Journal of Gastroenterology. 2014;20(14):3967–3975. doi: 10.3748/wjg.v20.i14.3967.
  • 8. Isozaki H., Okajima K., Nomura E., et al. Preoperative diagnosis and surgical treatment for LN metastasis in gastric cancer (in Japanese). Gan to Kagaku Ryoho. 1996;23:1275–1283.
  • 9. Baylin S. B., Herman J. G. DNA Alterations in Cancer. Natick: Eaton Publishing; 2000. DNA alterations in cancer: genetic and epigenetic alterations; pp. 293–309.
  • 10. Ehrlich M. DNA methylation in cancer: too much, but also too little. Oncogene. 2002;21(35):5400–5413. doi: 10.1038/sj.onc.1205651.
  • 11. Bergman Y., Cedar H. DNA methylation dynamics in health and disease. Nature Structural & Molecular Biology. 2013;20(3):274–281. doi: 10.1038/nsmb.2518.
  • 12. Baylin S. B., Belinsky S. A., Herman J. G. Aberrant methylation of gene promoters in cancer—concepts, misconcepts, and promise. Journal of the National Cancer Institute. 2000;92(18):1460–1461. doi: 10.1093/jnci/92.18.1460.
  • 13. Godfrey K. M., Sheppard A., Gluckman P. D., et al. Epigenetic gene promoter methylation at birth is associated with child’s later adiposity. Diabetes. 2011;60(5):1528–1534. doi: 10.2337/db10-0979.
  • 14. Jones A., Teschendorff A. E., Li Q., et al. Role of DNA methylation and epigenetic silencing of HAND2 in endometrial cancer development. PLoS Medicine. 2013;10(11, article e1001551). doi: 10.1371/journal.pmed.1001551.
  • 15. Palmisano W. A., Divine K. K., Saccomanno G., et al. Predicting lung cancer by detecting aberrant promoter methylation in sputum. Cancer Research. 2000;60(21):5954–5958.
  • 16. Adorján P., Distler J., Lipscher E., et al. Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Research. 2002;30(5, article e21). doi: 10.1016/S0959-8049(01)80570-0.
  • 17. Carmona F. J., Azuara D., Berenguer-Llergo A., et al. DNA methylation biomarkers for noninvasive diagnosis of colorectal cancer. Cancer Prevention Research. 2013;6(7):656–665. doi: 10.1158/1940-6207.CAPR-12-0501.
  • 18. Fukushige S., Horii A. DNA methylation in cancer: a gene silencing mechanism and the clinical potential of its biomarkers. The Tohoku Journal of Experimental Medicine. 2013;229(3):173–185. doi: 10.1620/tjem.229.173.
  • 19. Hijazi H., Chan C. A classification framework applied to cancer gene expression profiles. Journal of Healthcare Engineering. 2013;4(2):255–283. doi: 10.1260/2040-2295.4.2.255.
  • 20. Bi J., Bennett K., Embrechts M. Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research. 2003;3:1229–1243.
  • 21. Kohavi R., John G. H. Wrappers for feature subset selection. Artificial Intelligence. 1997;97(1):273–324. doi: 10.1016/S0004-3702(97)00043-X.
  • 22. Perkins S., Lacker K., Theiler J. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research. 2003;3:1333–1356.
  • 23. Guyon I., Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research. 2003;3:1157–1182.
  • 24. Zhao X. M., Cheung Y. M., Huang D. S. A novel approach to extracting features from motif content and protein composition for protein sequence classification. Neural Networks. 2005;18(8):1019–1028. doi: 10.1016/j.neunet.2005.07.002.
  • 25. Wang H. Q., Huang D. S., Wang B. Optimisation of radial basis function classifiers using simulated annealing algorithm for cancer classification. Electronics Letters. 2005;41(11):630–632. doi: 10.1049/el:20050373.
  • 26. Chen K. H., Wang K. J., Tsai M. L., et al. Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinformatics. 2014;15(1):p. 1. doi: 10.1186/1471-2105-15-49.
  • 27. Stiewe T., Putzer B. M. Role of p73 in malignancy: tumor suppressor or oncogene? Cell Death and Differentiation. 2002;9(3):237–245. doi: 10.1038/sj.cdd.4400995.
  • 28. Ma J., Chen M., Wang J., et al. Pancreatic duodenal homeobox-1 (PDX1) functions as a tumor suppressor in gastric cancer. Carcinogenesis. 2008;29(7):1327–1333. doi: 10.1093/carcin/bgn112.
  • 29. Ito Y., Miyauchi A., Yoshida H., et al. Expression of α1, 6-fucosyltransferase (FUT8) in papillary carcinoma of the thyroid: its linkage to biological aggressiveness and anaplastic transformation. Cancer Letters. 2003;200(2):167–172. doi: 10.1016/S0304-3835(03)00383-5.
  • 30. Bhatlekar S., Fields J. Z., Boman B. M. HOX genes and their role in the development of human cancers. Journal of Molecular Medicine. 2014;92(8):811–823. doi: 10.1007/s00109-014-1181-y.
  • 31. The TCGA Database. http://cancergenome.nih.gov/

Articles from Disease Markers are provided here courtesy of Wiley
