ABSTRACT
This study aimed to establish a gene model that robustly and effectively predicts the prognosis of colon carcinoma patients using bioinformatics. Expression data and clinical information in the GSE39582 Series Matrix were first downloaded from the Gene Expression Omnibus (GEO) database. Next, differentially expressed genes (DEGs) were identified with “edgeR”. Finally, a risk prediction model was established through a series of regression analyses, and its prognostic performance was evaluated through Kaplan-Meier and receiver operating characteristic (ROC) analyses. Gene set enrichment analysis (GSEA) was also performed. In total, 846 DEGs were obtained from the gene expression data in the GSE39582 dataset. A 9-gene signature-based risk prediction model was established via regression analyses, and the model-based risk score was formulated as: Riskscore = (−0.1214) * TNFRSF11A + (−0.2617) * TMEM97 + (−0.1041) * LGR5 + 0.0973 * KLK10 + 0.1655 * HOXB8 + 0.227 * FKBP10 + (−0.1312) * CXCL13 + (−0.1316) * CXCL10 + 0.2593 * CD36. Kaplan-Meier curves showed that colon carcinoma patients in the high-risk group had a lower survival rate. GSEA showed that the high-risk and low-risk groups differed significantly in biological pathways including ECM RECEPTOR INTERACTION. In addition, analysis of the riskscore against the clinical features of patients revealed that the model could effectively predict prognosis across ages (age>65, age<65) and stages (tumor_stage I/II, tumor_stage III/IV, T3&T4, N0&N1, N2&N3, M0). This study provides a robust model for prognosis prediction in colon carcinoma and lays a basis for research into the molecular mechanisms underlying its development.
KEYWORDS: Colon carcinoma, gene signature, prognosis prediction
Introduction
Colon carcinoma is the most common malignant tumor in the digestive system. According to the latest data from CA: A Cancer Journal for Clinicians, colorectal cancer has the third highest incidence (male/female: 9%/8%) and mortality (male/female: 9%/9%) [1]. The early symptoms of colon carcinoma are not obvious and are often confused with those of benign diseases such as intestinal inflammatory response; therefore, most patients are at advanced stages when initially diagnosed [2]. In recent years, many studies have focused on the survival and prognosis of colorectal cancer patients [3,4], but few have addressed those of colon cancer patients specifically [5]. The colon and rectum are both part of the large intestine, but survival and prognosis differ between colon carcinoma and rectal cancer because of differences in anatomical structure and related treatment [6]. Hence, exploring relevant genes and independent prognostic factors, and studying their effect on tumor development and prognosis, would support the implementation of precision medicine and help improve cure rates and patient prognosis.
With the development of gene chips and RNA sequencing techniques, gene expression profiles have been widely used to predict the prognosis of colon carcinoma. For example, high expression of sulfatase-1 (SULF1) in colon carcinoma is associated with poor prognosis [7]; Xu and others analyzed 5 microarray datasets of colon carcinoma samples from Gene Expression Omnibus (GEO) and found 15 gene markers that distinguish patients with different prognoses [8]. However, because of differences in methodologies, experimental platforms and batch effects, the genes screened for prognosis prediction may differ between studies, and a constructed model may perform well only on the samples used to build it rather than on independent datasets. Hence, a prediction model that is applicable across different datasets and clinical settings is urgently needed.
At present, a few genes have been identified as key genes that affect colorectal cancer. For example, UCA1 can influence malignant progression of colorectal cancer by controlling expression of the downstream HIPK3 and TUG1 genes [9]. In addition, the FMNL2 gene can modulate colorectal cancer progression by stimulating cell proliferation and migration [10]. A considerable number of studies have focused on finding valuable prognostic gene markers of colorectal cancer. For example, Nguyen et al. [11] used several colorectal cancer-related GSEA datasets and found 113 prognosis-related genes that can distinguish patients’ survival time. Hua analyzed single nucleotide polymorphisms (SNP) and found that XPG gene polymorphisms could affect susceptibility to colorectal cancer [12]. Moreover, some studies have revealed that specific gene characteristics of tumors are relevant to chemoradiotherapy resistance. For instance, Shahid et al. [13] examined differential gene expression in patients receiving chemotherapy and found that the prognosis of these patients was associated with the expression level of eight genes. In addition, Chen et al. [14] discovered from TCGA data that the IGF-1R/EGFR-PPAR-CASPASE axis can greatly influence the effect of chemotherapy. These results provide abundant valuable information for future cancer research.
In this study, GSE39582 Series Matrix data (including 566 tumor samples and 19 normal samples) were accessed from GEO. Then, multiple candidate multivariate models for prognosis prediction were established through a series of regression analyses, and the optimal multivariate model was selected using the Akaike Information Criterion (AIC). The performance of the model was validated in the training and testing datasets and further validated in an independent TCGA-COAD dataset. Overall, this study aimed to provide a robust gene signature for prognosis prediction in colon carcinoma, to support clinical research on colon carcinoma, and to lay a basis for research into the molecular mechanisms of its development.
Methods
Data acquisition and screening of differentially expressed genes (DEGs)
Gene expression microarray GSE39582 of colon carcinoma (including 566 tumor samples and 19 normal samples) along with clinical data were obtained from GEO, and the data were standardized with the “limma” package. DEGs were screened using the “edgeR” package with normal samples as the control and |logFC| ≥ 1.5 and adj.pvalue < 0.05 as the thresholds.
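The thresholding step can be illustrated with a minimal Python sketch. It is not the “edgeR” workflow itself; it only shows how the two cutoffs act on per-gene statistics that have already been computed. The `GeneStat` record, the gene names and all numeric values are hypothetical, and the sketch assumes that logFC is a log2 fold change (tumor versus normal) and that the p value cutoff is applied as "less than 0.05".

```python
from dataclasses import dataclass

@dataclass
class GeneStat:
    gene: str
    log_fc: float   # assumed log2 fold change, tumor vs normal
    adj_p: float    # multiple-testing-adjusted p value

def screen_degs(stats, lfc_cutoff=1.5, p_cutoff=0.05):
    """Return (upregulated, downregulated) gene lists passing both cutoffs."""
    up, down = [], []
    for s in stats:
        if s.adj_p < p_cutoff and abs(s.log_fc) >= lfc_cutoff:
            (up if s.log_fc > 0 else down).append(s.gene)
    return up, down

# Hypothetical statistics, for illustration only
example = [
    GeneStat("GENE_A", 2.1, 0.001),
    GeneStat("GENE_B", -1.8, 0.020),
    GeneStat("GENE_C", 1.2, 0.001),   # fails the fold-change cutoff
    GeneStat("GENE_D", -2.5, 0.200),  # fails the adjusted p value cutoff
]
up, down = screen_degs(example)
print("up:", up, "down:", down)
```

Applying both cutoffs jointly is what keeps genes with large but statistically unsupported changes (GENE_D) and significant but small changes (GENE_C) out of the DEG list.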
Candidate gene selection
The standardized data were randomly split 7:3 into the training dataset and the testing dataset using the “createDataPartition” function of the “caret” package. Univariate COX regression analysis was performed on the training dataset with R package “survival” to screen genes with p < 0.01 for subsequent analyses. Afterward, R package “glmnet” was applied for LASSO regression analysis to remove genes with strong collinearity and reduce model complexity [15], and the selected genes significantly related to patient prognosis were used to identify the optimal signature genes.
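As an illustration of the 7:3 grouping, the following Python sketch performs a random split within each outcome group, which is the general idea behind stratified partitioning. It is a sketch under stated assumptions rather than a reproduction of “createDataPartition”: the stratification variable (event status), the function name `stratified_split`, the seed and the sample identifiers are all hypothetical, since the variable used for partitioning is not stated here.

```python
import random
from collections import defaultdict

def stratified_split(sample_ids, labels, train_fraction=0.7, seed=0):
    """Randomly split samples into training/testing sets within each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sid, lab in zip(sample_ids, labels):
        by_label[lab].append(sid)
    train, test = [], []
    for lab in sorted(by_label):
        group = by_label[lab][:]
        rng.shuffle(group)
        n_train = round(train_fraction * len(group))
        train.extend(group[:n_train])
        test.extend(group[n_train:])
    return train, test

# Hypothetical cohort: 30 samples with an event, 70 without
ids = [f"S{i:03d}" for i in range(100)]
status = [1] * 30 + [0] * 70
train_ids, test_ids = stratified_split(ids, status)
print(len(train_ids), len(test_ids))
```

Splitting within each group keeps the event proportion similar in the two datasets, which matters when the number of events is limited.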
Construction of risk prediction model
Multivariate COX models were constructed with the genes screened by LASSO regression with R package “survminer”. According to AIC, stepwise regression was conducted to search the optimal model. The risk score based on the model was formulated as below:
$$\text{Riskscore} = \sum_{i=1}^{n} \text{Coef}_i \times x_i$$
In the formula, $\text{Coef}_i$ represents the risk coefficient of each signature gene, $x_i$ represents the relative expression of that gene, and $n$ is the number of signature genes.
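The risk score is a weighted sum of gene expression values, which can be sketched in a few lines of Python. The coefficients are the rounded values reported for the 9-gene model in this paper; the function name `risk_score` and the expression profile are hypothetical and serve only as an illustration.

```python
# Rounded coefficients of the 9-gene model as reported in this paper
COEF = {
    "TNFRSF11A": -0.1214, "TMEM97": -0.2617, "LGR5": -0.1041,
    "KLK10": 0.0973, "HOXB8": 0.1655, "FKBP10": 0.227,
    "CXCL13": -0.1312, "CXCL10": -0.1316, "CD36": 0.2593,
}

def risk_score(expression):
    """Sum of coefficient * relative expression over the signature genes."""
    return sum(coef * expression[gene] for gene, coef in COEF.items())

# Hypothetical relative expression values for one patient
patient = {
    "TNFRSF11A": 6.2, "TMEM97": 8.1, "LGR5": 7.4,
    "KLK10": 5.0, "HOXB8": 6.8, "FKBP10": 7.9,
    "CXCL13": 4.3, "CXCL10": 6.6, "CD36": 5.5,
}
print(round(risk_score(patient), 4))
```

Genes with negative coefficients lower the score as their expression rises, and genes with positive coefficients raise it.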
Validation of the model stability and validity
The optimal 9-gene signature-based COX model was applied to score the risk of samples in the training dataset, testing dataset and TCGA-COAD independent dataset. Patients in each dataset were classified into the high- and low-risk groups according to the median score of all samples in that dataset. Thereafter, the Kaplan-Meier method was used to compare survival between the high- and low-risk groups, and the log-rank test was applied to calculate the p value. Overall survival (OS) curves of patients were then drawn according to the risk level. Package “survivalROC” was used to draw receiver operating characteristic (ROC) curves, and area under curve (AUC) values at 1 year, 3 years, and 5 years were calculated.
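The grouping and survival-curve steps can be sketched in Python as follows. This is a minimal illustration and not the R code used in the study: `split_by_median` and `kaplan_meier` are hypothetical helper names, the follow-up data are invented, and scores equal to the median are assigned to the low-risk group as an assumption. The log-rank test is not included.

```python
from statistics import median

def split_by_median(scores):
    """Label a score 'high' if it exceeds the median, otherwise 'low'."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up times; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

# Hypothetical risk scores and follow-up data (months)
groups = split_by_median([0.2, -0.5, 1.1, 0.0])
curve = kaplan_meier([5, 8, 12, 12, 20, 25], [1, 0, 1, 1, 0, 1])
print(groups)
print(curve)
```

Censored patients leave the risk set without lowering the curve, which is why the estimate only steps down at observed event times.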
Gene set enrichment analysis (GSEA)
To clarify biological pathways related to the 9-gene-based model, GSEA software (https://www.gsea-msigdb.org/gsea/index.jsp) was applied to perform Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Parameters for the enrichment analysis were set as follows: Gene sets database: C2.cp.kegg.V7.2.symbols.gmt; Number of permutations: 1000. Pathways with FDR<0.25 were considered significantly enriched.
Prognostic performance of the model in patients with different clinical subtypes
The risk of patients in the training and the testing datasets was scored based on the 9-gene signature-based model. The chi-square test was used to analyze the association between clinical subtypes (gender, age_diag, tumor stage, AJCC-TNM stage, tumor.location) and risk group (high vs. low). Then, OS curves were drawn to assess the performance of the model in predicting the prognosis of patients with different clinical subtypes. Finally, R package “rms” was employed to establish a nomogram for clinically estimating a patient’s risk.
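For a two-level clinical feature, the chi-square comparison between risk groups reduces to a test on a 2x2 table, sketched below in Python. The function name `chi_square_2x2` is hypothetical, and the use of Yates' continuity correction is an assumption; with that correction, the gender counts reported in Table 3 give a p value close to the tabulated 0.366.

```python
import math

def chi_square_2x2(table, yates=True):
    """Pearson chi-square test for a 2x2 table; returns (statistic, p value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row[i] * col[j] / n
            diff = abs(obs - expected)
            if yates:
                # Continuity correction (assumed to match the analysis default)
                diff = max(0.0, diff - 0.5)
            stat += diff ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Gender counts from Table 3: rows = female/male, columns = low/high risk
stat, p = chi_square_2x2([[92, 80], [95, 102]])
print(round(stat, 3), round(p, 3))
```

Features with more than two levels, such as T stage or tumor stage, need the general r x c form of the test with (r - 1)(c - 1) degrees of freedom, which this sketch does not cover.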
Results
Univariate COX regression analysis and LASSO regression analysis identify candidate survival-related genes
GSE39582 Series Matrix data (including 566 tumor samples and 19 normal samples) were first downloaded from the GEO database. Gene expression data were standardized with the “normalizeBetweenArrays” function of the “limma” package. DEGs were screened using “edgeR”, with the normal samples as the control (|logFC| ≥ 1.5, adj.pvalue < 0.05). Eventually, 846 DEGs were obtained, including 292 upregulated DEGs and 554 downregulated DEGs (Figure 1(a)).
Figure 1.

Screening of candidate genes for the prognostic model
(a) The volcano diagram shows the differential genes in colon carcinoma patients in GSE39582; (b) LASSO coefficient in regression analysis; (c) Tuning parameter (lambda) in LASSO model selected through 10-fold cross validation.
The initially standardized data were randomly split 7:3 into the training dataset and the testing dataset. Univariate COX regression analysis (p < 0.01) was then performed, and 15 genes related to the survival of patients were preliminarily screened in the training dataset (Table 1). To avoid overfitting of the subsequent multivariate COX regression models, LASSO regression was applied to the 15 genes to remove genes with strong collinearity. Eventually, 12 optimal candidate genes were screened as follows: UBD, TNFRSF11A, TMEM97, PIGR, LGR5, KLK10, HOXB8, GABBR1, FKBP10, CXCL13, CXCL10, CD36 (Figure 1(b,c)).
Table 1.
Screening of survival-related genes via univariate COX regression
| ID | HR | HR.95 L | HR.95 H | pvalue |
|---|---|---|---|---|
| TMEM97 | 0.675739 | 0.550013 | 0.830206 | 0.00019 |
| CD36 | 1.307265 | 1.130283 | 1.511959 | 0.000306 |
| TNFRSF11A | 0.816795 | 0.724316 | 0.921081 | 0.000964 |
| PIGR | 0.903402 | 0.849195 | 0.96107 | 0.001292 |
| UBD | 0.842175 | 0.756314 | 0.937785 | 0.001744 |
| GABBR1 | 0.842175 | 0.756314 | 0.937785 | 0.001744 |
| CXCL13 | 0.858961 | 0.78012 | 0.94577 | 0.001968 |
| KLK10 | 1.146712 | 1.049712 | 1.252676 | 0.002399 |
| HOXB8 | 1.202441 | 1.064334 | 1.358468 | 0.003061 |
| BIRC5 | 0.733502 | 0.592677 | 0.907789 | 0.00438 |
| FKBP10 | 1.355012 | 1.099003 | 1.670657 | 0.004461 |
| AURKA | 0.742826 | 0.600911 | 0.918256 | 0.00599 |
| CXCL10 | 0.860215 | 0.770493 | 0.960385 | 0.00738 |
| LGR5 | 0.891805 | 0.819603 | 0.970368 | 0.007854 |
| GZMB | 0.85544 | 0.760034 | 0.962823 | 0.009656 |
Construction of a model for prediction of the prognosis of colon carcinoma
Multivariate COX models were constructed with the genes screened by LASSO regression using “Survival” package, and the optimal model containing 9 genes was eventually identified using stepwise regression based on AIC. The multivariate COX analysis showed that TNFRSF11A, TMEM97, LGR5, CXCL13 and CXCL10 had negative coefficients (HR < 1), meaning that higher expression was associated with lower risk, while KLK10, HOXB8, FKBP10 and CD36 had positive coefficients (HR > 1), meaning that higher expression was associated with higher risk (Figure 2). The riskscore of the optimal 9-gene signature-based risk model was formulated as below: Riskscore = (−0.1214) * TNFRSF11A + (−0.2617) * TMEM97 + (−0.1041) * LGR5 + 0.0973 * KLK10 + 0.1655 * HOXB8 + 0.227 * FKBP10 + (−0.1312) * CXCL13 + (−0.1316) * CXCL10 + 0.2593 * CD36 (Table 2).
Figure 2.

The optimal 9 survival-related genes in multivariate COX regression analysis
Table 2.
The optimal multivariate COX regression model identified based on the AIC
| ID | Coef | HR | HR.95 L | HR.95 H | pvalue |
|---|---|---|---|---|---|
| TNFRSF11A | −0.121389445175388 | 0.885688965124658 | 0.785150170108015 | 0.999101793273088 | 0.0483156910287385 |
| TMEM97 | −0.26170873519764 | 0.769735187827784 | 0.616417141289096 | 0.961187189151184 | 0.0209289615789062 |
| LGR5 | −0.104060462086471 | 0.901170809106597 | 0.823948154444505 | 0.98563098030525 | 0.0228097763436299 |
| KLK10 | 0.0973223345182819 | 1.1022155985013 | 0.99949272151635 | 1.21549582045626 | 0.0512001563910653 |
| HOXB8 | 0.16551493029547 | 1.18000058034526 | 1.03722734056514 | 1.34242640466505 | 0.011887708223002 |
| FKBP10 | 0.227030481375778 | 1.25486811746397 | 1.00946095693161 | 1.55993550955557 | 0.0408770899335053 |
| CXCL13 | −0.131201933501283 | 0.877040652618485 | 0.779880650869619 | 0.98630515513848 | 0.028513243512523 |
| CXCL10 | −0.131644181871515 | 0.876652868573988 | 0.762822646982128 | 1.00746910834309 | 0.0635816602757651 |
| CD36 | 0.259343262573961 | 1.29607862376328 | 1.10819636261264 | 1.51581421456379 | 0.00117172466920451 |
Evaluation of the prognosis prediction model
The risk of patients in the training and the testing datasets was scored based on the 9-gene signature-based model. Then, patients were divided into the high- and low-risk groups according to the median riskscore. Survival of the high- and low-risk groups was compared using the Kaplan-Meier method with the log-rank test. OS curves and recurrence-free survival (RFS) curves of the 394 patients in the training dataset were drawn according to the risk level. The OS rate (Figure 3(c)) and RFS rate (Figure 3(a)) of patients in the high-risk group were significantly lower than those of patients in the low-risk group. Likewise, for the 168 patients in the testing dataset, the OS rate (Figure 3(d)) and RFS rate (Figure 3(b)) of the high-risk patients were significantly lower than those of the low-risk patients. Furthermore, the model was applied to score the risk of the 454 patients in an independent TCGA-COAD dataset and OS curves were drawn. The OS rate of patients in the high-risk group was lower than that of patients in the low-risk group (Figure 3(e)). The “SurvivalROC” package was then applied to draw ROC curves for the training dataset, testing dataset and TCGA-COAD independent dataset, respectively, and AUC values at 1 year, 3 years, and 5 years were calculated. The AUC values at 1 year, 3 years, and 5 years were 0.757, 0.757, and 0.727 in the training dataset (Figure 3(f)); 0.676, 0.622, and 0.629 in the testing dataset (Figure 3(g)); and 0.645, 0.636 and 0.558 in the TCGA independent dataset (Figure 3(h)). These results indicate that the 9-gene model obtained from the training dataset could predict the prognosis of colon carcinoma patients.
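To make the meaning of an AUC at a fixed time point concrete, the following Python sketch computes a simplified time-dependent AUC. It is deliberately not the estimator implemented in the “SurvivalROC” package: patients censored before the time horizon are simply dropped instead of being handled statistically, so the sketch should be read as an intuition aid only. The function name `naive_auc_at` and all data values are hypothetical.

```python
def naive_auc_at(times, events, scores, horizon):
    """Simplified AUC at a time horizon.

    Cases: event observed at or before the horizon.
    Controls: still under follow-up after the horizon.
    Patients censored before the horizon are excluded (a simplification).
    """
    cases = [s for t, e, s in zip(times, events, scores) if e == 1 and t <= horizon]
    controls = [s for t, s in zip(times, scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0      # case correctly ranked above control
            elif c == k:
                wins += 0.5      # ties count as half
    return wins / (len(cases) * len(controls))

# Hypothetical follow-up (months), event flags and risk scores
times = [10, 20, 30, 40, 50, 60]
events = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.8, 0.1, 0.3]
auc_36 = naive_auc_at(times, events, scores, horizon=36)
print(round(auc_36, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of patients who died before the horizon from those who survived beyond it.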
Figure 3.

Analysis of the prognostic performance of the 9-gene model in the training dataset, testing dataset and TCGA-COAD independent dataset
A-B RFS curves of patients in the training dataset (a) and the testing dataset (b) drawn by Kaplan-Meier; C-E OS curves of patients in the high- and low-risk groups in the training dataset (c), testing dataset (d) and TCGA-COAD independent dataset (e) drawn by Kaplan-Meier; F-H ROC curves were plotted to testify the performance of the 9-gene model in predicting the prognosis of colon carcinoma patients in the training dataset (f), testing dataset (g) and TCGA-COAD independent dataset (h), respectively.
Survival and pathway differences in groups based on the risk prediction model
In order to further understand the expression of the 9 model genes in patients, the riskscore distribution and survival status, each sample in the training dataset and testing dataset was scored respectively by the 9-gene model, and the patients were classified into high- and low-risk groups according to the median score. The expression of the 9 genes, the riskscore distribution and survival status of patients in the 2 datasets were analyzed. The statistical results of the training dataset are shown in Figure 4(a,c,e) and the statistical results of the testing dataset are shown in Figure 4(b,d,f). It was revealed that the survival status of the high-risk patients was poorer than that of the low-risk patients. KEGG enrichment analysis was performed in high- and low-risk groups in the training set. It was shown that there were significant differences in pathways, such as ECM RECEPTOR INTERACTION, DNA REPLICATION and CELL CYCLE, in the two groups (Figure 4(g-i)).
Figure 4.

The expression of the 9 model genes, the riskscore distribution and survival status of patients in the training dataset and testing dataset
A, B show the heatmap of the 9 model genes in patients in the training dataset (a) and the testing dataset (b); C, D present the riskscore distribution of patients in the training dataset (c) and the testing dataset (d); E, F display the survival status of patients in the training dataset (e) and the testing dataset (f); G, H, I suggest significant differences in ECM RECEPTOR INTERACTION (g), DNA REPLICATION (h), CELL CYCLE (i) pathways between high-risk group and low-risk group.
The performance of the 9-gene model in predicting the prognosis of patients with different clinical subtypes in the training dataset and the testing dataset
To further assess the performance of the 9-gene model in predicting the prognosis of colon carcinoma patients with different clinical subtypes, the risk of patients was scored by the 9-gene model with the median riskscore as the cutoff (scores higher than the cutoff were classified as high risk and lower scores as low risk). Clinical data of patients in the training and testing datasets were statistically analyzed, and 1 age “N/A”, 17 T stage “N/A” and “Tis”, 20 N stage “N/A” and “N+”, 16 M stage “N/A” and “MX”, and 3 tumor_stage “N/A” were removed. The same method was used to remove samples with missing information from the testing dataset. The chi-square test was performed to determine whether each relevant factor (gender, age_diag, tumor stage, AJCC-TNM stage, tumor.location) differed significantly between the high- and low-risk groups, and the results are shown in Tables 3 and 4.
Table 3.
Statistics of the basic clinical data of patients in the training dataset
| | Low risk (n = 187) | High risk (n = 182) | P-value |
|---|---|---|---|
| Gender | |||
| Female | 92 (49.2%) | 80 (44.0%) | 0.366 |
| Male | 95 (50.8%) | 102 (56.0%) | |
| Event | |||
| Yes | 40 (21.4%) | 84 (46.2%) | <0.001 |
| No | 147 (78.6%) | 98 (53.8%) | |
| Age_diag | |||
| <65 | 73 (39.0%) | 72 (39.6%) | 1 |
| >65 | 114 (61.0%) | 110 (60.4%) | |
| T | |||
| T1 | 4 (2.1%) | 2 (1.1%) | 0.042 |
| T2 | 17 (9.1%) | 10 (5.5%) | |
| T3 | 134 (71.7%) | 118 (64.8%) | |
| T4 | 32 (17.1%) | 52 (28.6%) | |
| N | |||
| N0 | 113 (60.4%) | 93 (51.1%) | 0.195 |
| N1 | 44 (23.5%) | 47 (25.8%) | |
| N2 | 27 (14.4%) | 40 (22.0%) | |
| N3 | 3 (1.6%) | 2 (1.1%) | |
| M | |||
| M0 | 173 (92.5%) | 150 (82.4%) | 0.005 |
| M1 | 14 (7.5%) | 32 (17.6%) | |
| Tumor_stage | |||
| Stage I | 12 (6.4%) | 6 (3.3%) | 0.015 |
| Stage II | 100 (53.5%) | 81 (44.5%) | |
| Stage III | 61 (32.6%) | 64 (35.2%) | |
| Stage IV | 14 (7.5%) | 31 (17.0%) | |
| Tumor.location | |||
| Distal | 112 (59.9%) | 111 (61.0%) | 0.913 |
| Proximal | 75 (40.1%) | 71 (39.0%) | |
Table 4.
Statistics of the basic clinical data of patients in the test dataset
| | Low risk (n = 95) | High risk (n = 64) | P-value |
|---|---|---|---|
| Gender | |||
| Female | 39 (41.1%) | 31 (48.4%) | 0.449 |
| Male | 56 (58.9%) | 33 (51.6%) | |
| Event | |||
| Yes | 21 (22.1%) | 27 (42.2%) | 0.011 |
| No | 74 (77.9%) | 37 (57.8%) | |
| Age_diag | |||
| <65 | 35 (36.8%) | 22 (34.4%) | 0.881 |
| >65 | 60 (63.2%) | 42 (65.6%) | |
| T | |||
| T1 | 4 (4.2%) | 1 (1.6%) | 0.005 |
| T2 | 15 (15.8%) | 1 (1.6%) | |
| T3 | 63 (66.3%) | 44 (68.8%) | |
| T4 | 13 (13.7%) | 18 (28.1%) | |
| N | |||
| N0 | 54 (56.8%) | 33 (51.6%) | 0.482 |
| N1 | 26 (27.4%) | 16 (25.0%) | |
| N2 | 15 (15.8%) | 15 (23.4%) | |
| M | |||
| M0 | 90 (94.7%) | 55 (85.9%) | 0.102 |
| M1 | 5 (5.3%) | 9 (14.1%) | |
| Tumor_stage | |||
| Stage I | 13 (13.7%) | 0 (0%) | 0.006 |
| Stage II | 39 (41.1%) | 30 (46.9%) | |
| Stage III | 38 (40.0%) | 25 (39.1%) | |
| Stage IV | 5 (5.3%) | 9 (14.1%) | |
| Tumor.location | |||
| Distal | 64 (67.4%) | 34 (53.1%) | 0.1 |
| Proximal | 31 (32.6%) | 30 (46.9%) | |
In addition, OS curves of patients were constructed for various clinical subtypes (age>65, age<65, tumor_stage I/II, tumor_stage III/IV, T1&T2, T3&T4, N0&N1, N2&N3, M0, M1), and the results revealed that the 9-gene signature-based model had good predictive performance in patients with different clinical subtypes including age (age>65, age<65) and stage (tumor_stage I/II, tumor_stage III/IV, T3&T4, N0&N1, N2&N3, M0) (Figure 5(a-j)).
Figure 5.

The prognostic performance of the 9-gene model in patients with different clinical subtypes in the training dataset and testing dataset
A-J The differences of OS in the high- and low-risk group in the training dataset and testing dataset (age>65 (a), age <65 (b), tumor stage I/II (c), tumor stage III/IV (d), T1&T2 (e), T3&T4 (f), N0&N1 (g), N2&N3 (h), M0 (i), M1 (j)) shown by Kaplan-Meier; K Nomogram quantitatively predicts the OS of patients.
In order to provide a quantitative method for clinicians to predict patient’s OS, “rms” package was used to construct a nomogram which integrated the 9-gene model and some clinical indexes (including gender, age_diag, tumor_stage, AJCC-TNM stage, tumor.location) (Figure 5(k)). Scores on the upper point line correspond to different clinical subtypes of patients including gender, age, tumor_stage, AJCC-TNM stage and tumor.location (metastatic location of tumors). The score on the total points line is obtained by adding all corresponding scores, and the corresponding survival rates in 1 year, 3 years, and 5 years show the prognosis of patients predicted by the model.
Discussion
Colon carcinoma is the most common malignant tumor in the digestive system, and most patients are at advanced stages when first diagnosed and have lost the opportunity for surgery. Hence, constructing a robust and efficient prediction model is vital for early diagnosis, medication guidance and prognosis prediction in colon carcinoma patients. Prognostic models for colon carcinoma patients have already been investigated. For example, Liang et al. [16] constructed a group of potential prognostic models by identifying DEGs from gene expression profiles of left and right colon carcinoma in TCGA. Another study constructed a prognostic model comprising 100 genes by examining tumor microenvironment-related genes to guide survival prediction and treatment of colon carcinoma patients in stage I–III [17]. Gao et al. [18] comprehensively analyzed expression features of colon carcinoma-related genes in 3 GEO datasets and found that the expression features of 4 genes (VAMP1, P2RX5, CACNB1 and CRY2) may have predictive value for the prognosis of colon carcinoma. Moreover, studies have found that gene expression can affect the efficacy of chemotherapy drugs. For instance, Ning et al. [19] found that the CPSF3 gene is associated with recurrence in non-small cell lung cancer patients after chemotherapy. Oshi et al. [20] reported, based on the TCGA database, that ITPKC can be a predictor of prognosis in patients with triple-negative breast cancer receiving chemotherapy. These results show that bioinformatics studies are of great value in exploring biomarkers for cancer treatment and early diagnosis. Although feature genes related to the prognosis of colon carcinoma have been studied, gene models that can robustly predict the survival of colon carcinoma patients have not been established.
Furthermore, with the wide use of high-throughput sequencing technique, more gene expression datasets of colon carcinoma should be included into new studies.
Here, samples in the GSE39582 microarray (tumor: 566, normal: 19) were divided into 2 groups and used to establish models for analysis. DEGs were first screened with “edgeR”. Afterward, univariate COX regression, LASSO regression and multivariate COX analyses were conducted to screen 9 optimal signature genes: TNFRSF11A, TMEM97, LGR5, KLK10, HOXB8, FKBP10, CXCL13, CXCL10 and CD36. Among them, TNFRSF11A can be a molecular feature of colon carcinoma patients in UICC stage II [21]. TMEM97 (i.e., MAC30) has been shown to be highly expressed in colon carcinoma [22]. Several studies found that LGR5 can be a marker of circulating tumor cells and stem cells of ulcerative colitis-associated colorectal cancers [23,24]. Multiple studies reported that KLK10 is differentially expressed in colorectal cancers and can be a prognostic marker and potential treatment target [25–27]. HOXB8 is highly expressed in colorectal cancer tissue and can promote epithelial-mesenchymal transition, proliferation and metastasis of colorectal cells via activating STAT3 [28]. Studies found that the expression of CXCR5 and its ligand CXCL13 is related to the poor prognosis of advanced colorectal cancer [29], and the CXCL13-CXCR5 axis can promote the growth and invasion of colon carcinoma cells through PI3K/AKT [30]. CXCL10 can be a new serum marker for predicting the liver metastasis and prognosis of colorectal cancer [31]. CD36 can inhibit the occurrence of colorectal cancer by inhibiting β-catenin/c-myc-mediated glycolysis through ubiquitination of GPC4 [32]. Taken together, these studies indicate that the genes screened in our study are related to the prognosis of patients and the occurrence of colon carcinoma.
Based on the 9 signature genes, a prognostic model was established and the riskscore based on the model was formulated as: Riskscore = (−0.1214) * TNFRSF11A + (−0.2617) * TMEM97 + (−0.1041) * LGR5 + 0.0973 * KLK10 + 0.1655 * HOXB8 + 0.227 * FKBP10 + (−0.1312) * CXCL13 + (−0.1316) * CXCL10 + 0.2593 * CD36. Thereafter, survival analysis and ROC curves were used to test the prognostic efficacy of the model. This paper validated the accuracy of the model both by conventional random grouping and with a TCGA dataset, and the prognostic performance of the model was relatively good. The TCGA database is often used in molecular biological research, and many studies have analyzed and mined data from it. For instance, Sherafatian et al. [33] constructed a miRNA-based diagnostic model for lung cancer by mining relevant data in TCGA. This study constructed and validated a model using GSE39582 from the GEO database, followed by further validation in the TCGA database. The model showed good value in predicting prognosis. Compared with verification in a testing set alone, validation with an independent set from another database is more reliable.
We further scored the risk of patients in the training dataset, testing dataset and TCGA-COAD independent dataset. OS analysis showed that survival of the high-risk patients was lower than that of the low-risk patients in all datasets. ROC analysis indicated that the risk model could predict the prognosis of colon carcinoma patients. GSEA in the training set indicated that pathways such as ECM RECEPTOR INTERACTION, DNA REPLICATION, CELL CYCLE and OXIDATIVE PHOSPHORYLATION differed significantly between the high-risk and low-risk groups. The ECM RECEPTOR INTERACTION pathway plays an important part in tumor proliferation and migration. It was reported that differentially expressed genes in breast cancer are enriched in the ECM RECEPTOR INTERACTION pathway and correlate with tumor progression [34,35]. A study suggested that cell proliferation processes such as DNA REPLICATION and CELL CYCLE generate a certain degree of genomic instability, and that this instability is closely related to cancer development [36]. Based on these reports, we posit that differences in the ECM RECEPTOR INTERACTION, DNA REPLICATION and CELL CYCLE pathways are one of the reasons for the prognosis differences between the high- and low-risk groups.
Besides, the relationship between the 9-gene-based risk score and clinical features of patients was further discussed, and it was found that the model could predict the prognosis of patients in different ages (age>65, age<65) and stages (tumor_stage I/II, tumor_stage III/IV, T3&T4, N0&N1, N2&N3, M0). Moreover, a nomogram was constructed to provide a quantitative method for clinicians to predict OS of patients.
Conclusion
In this study, 846 DEGs were obtained from mRNA data of colon carcinoma patients in the GEO database via “edgeR” analysis. A risk prediction model containing 9 genes for patient prognosis was constructed through univariate COX, LASSO and multivariate COX regression analyses. The 9 genes in the model (TNFRSF11A, TMEM97, LGR5, KLK10, HOXB8, FKBP10, CXCL13, CXCL10, CD36) can serve as prognostic factors and help predict the prognosis of colon carcinoma. However, the model should be further evaluated in other datasets, and its validity should be tested in clinical studies. Overall, the results of this study can provide an important reference for future clinical and scientific studies and lay a theoretical basis for improving the survival of patients with colon carcinoma. The above analyses support the performance of the model in predicting the prognosis of colon carcinoma patients. However, this is a purely bioinformatics study, and more cellular and animal experiments are needed to validate the results. We will further test the reliability of the results in the future.
Availability of data and materials
The data used to support the findings of this study are included within the article. The data and materials in the current study are available from the corresponding author on reasonable request.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Authors’ contributions
All authors contributed to data analysis, drafting and revising the article, gave final approval of the version to be published and agreed to be accountable for all aspects of the work.
References
- [1]. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2020. CA Cancer J Clin. 2020;70:7–30.
- [2]. Li C, Zheng H, Jia H, et al. Prognosis of three histological subtypes of colorectal adenocarcinoma: a retrospective analysis of 8005 Chinese patients. Cancer Med. 2019;8:3411–3419.
- [3]. Zheng C, Jiang F, Lin H, et al. Clinical characteristics and prognosis of different primary tumor location in colorectal cancer: a population-based cohort study. Clin Transl Oncol. 2019;21:1524–1531.
- [4]. Park YL, Kim S-H, Park S-Y, et al. Forkheadbox A1 regulates tumor cell growth and predicts prognosis in colorectal cancer. Int J Oncol. 2019;54:2169–2178.
- [5]. Zhou R, Zhang J, Zeng D, et al. Immune cell infiltration as a biomarker for the diagnosis and prognosis of stage I-III colon cancer. Cancer Immunol Immunother. 2019;68:433–442.
- [6]. Feng Z, Shi X, Zhang Q, et al. Analysis of clinicopathological features and prognosis of 1315 cases in colorectal cancer located at different anatomical subsites. Pathol Res Pract. 2019;215:152560.
- [7]. Gong W, Li T. [Bioinformatical analysis of correlation between sulfatase-1 (SULF1) and prognosis of colon cancer and underlying mechanisms]. Xi Bao Yu Fen Zi Mian Yi Xue Za Zhi. 2019;35:1008–1013.
- [8]. Xu G, Zhang M, Zhu H, et al. A 15-gene signature for prediction of colon cancer recurrence and prognosis based on SVM. Gene. 2017;604:33–40.
- [9]. Barbagallo C, Brex D, Caponnetto A, et al. LncRNA UCA1, upregulated in CRC biopsies and downregulated in serum exosomes, controls mRNA expression by RNA-RNA interactions. Mol Ther Nucleic Acids. 2018;12:229–241.
- [10]. Yan Y, Wang Z, Qin B. A novel long noncoding RNA, LINC00483 promotes proliferation and metastasis via modulating of FMNL2 in CRC. Biochem Biophys Res Commun. 2019;509:441–447.
- [11]. Nguyen MN, Choi TG, Nguyen DT, et al. CRC-113 gene expression signature for predicting prognosis in patients with colorectal cancer. Oncotarget. 2015;6:31674–31692.
- [12]. Hua RX, Zhuo Z-J, Zhu J, et al. XPG gene polymorphisms contribute to colorectal cancer susceptibility: a two-stage case-control study. J Cancer. 2016;7:1731–1739.
- [13]. Shahid M, Choi TG, Nguyen MN, et al. An 8-gene signature for prediction of prognosis and chemoresponse in non-small cell lung cancer. Oncotarget. 2016;7:86561–86572.
- [14]. Chen L, Zhu Z, Gao W, et al. Systemic analysis of different colorectal cancer cell lines and TCGA datasets identified IGF-1R/EGFR-PPAR-CASPASE axis as important indicator for radiotherapy sensitivity. Gene. 2017;627:484–490.
- [15]. Vasquez MM, Hu C, Roe DJ, et al. Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: simulation and application. BMC Med Res Methodol. 2016;16:154. DOI: 10.1186/s12874-016-0254-8
- [16]. Liang L, Zeng J-H, Qin X-G, et al. Distinguishable prognostic signatures of left- and right-sided colon cancer: a study based on sequencing data. Cell Physiol Biochem. 2018;48:475–490.
- [17]. Zhou R, Zeng D, Zhang J, et al. A robust panel based on tumour microenvironment genes for prognostic prediction and tailoring therapies in stage I-III colon cancer. EBioMedicine. 2019;42:420–430.
- [18]. Gao P, He M, Zhang C, et al. Integrated analysis of gene expression signatures associated with colon cancer from three datasets. Gene. 2018;654:95–102.
- [19]. Ning Y, Liu W, Guan X, et al. CPSF3 is a promising prognostic biomarker and predicts recurrence of non-small cell lung cancer. Oncol Lett. 2019;18:2835–2844.
- [20]. Oshi M, Newman S, Murthy V, et al. ITPKC as a prognostic and predictive biomarker of neoadjuvant chemotherapy for triple negative breast cancer. Cancers (Basel). 2020;12(10). DOI: 10.3390/cancers12102758
- [21]. Grone J, Lenze D, Jurinovic V, et al. Molecular profiles and clinical outcome of stage UICC II colon cancer patients. Int J Colorectal Dis. 2011;26:847–858.
- [22]. Zhao ZR, Zhang L-J, He X-Q, et al. Significance of mRNA and protein expression of MAC30 in progression of colorectal cancer. Chemotherapy. 2011;57:394–401.
- [23]. Kazama S, Kishikawa J, Tanaka T, et al. Immunohistochemical expression of CD133 and LGR5 in ulcerative colitis-associated colorectal cancer and dysplasia. In Vivo. 2019;33:1279–1284. DOI: 10.21873/invivo.11600
- [24]. Wang W, Wan L, Wu S, et al. Mesenchymal marker and LGR5 expression levels in circulating tumor cells correlate with colorectal cancer prognosis. Cell Oncol (Dordr). 2018;41:495–504.
- [25]. Feng B, Xu W-B, Zheng M-H, et al. Clinical significance of human kallikrein 10 gene expression in colorectal cancer and gastric cancer. J Gastroenterol Hepatol. 2006;21:1596–1603.
- [26]. Talieri M, Li L, Zheng Y, et al. The use of kallikrein-related peptidases as adjuvant prognostic markers in colorectal cancer. Br J Cancer. 2009;100:1659–1665.
- [27]. Alexopoulou DK, Papadopoulos IN, Scorilas A. Clinical significance of kallikrein-related peptidase (KLK10) mRNA expression in colorectal cancer. Clin Biochem. 2013;46:1453–1461.
- [28]. Wang T, Lin F, Sun X, et al. HOXB8 enhances the proliferation and metastasis of colorectal cancer cells by promoting EMT via STAT3 activation. Cancer Cell Int. 2019;19:3.
- [29]. Qi XW, Xia S-H, Yin Y, et al. Expression features of CXCR5 and its ligand, CXCL13 associated with poor prognosis of advanced colorectal cancer. Eur Rev Med Pharmacol Sci. 2014;18:1916–1924.
- [30]. Zhu Z, Zhang X, Guo H, et al. CXCL13-CXCR5 axis promotes the growth and invasion of colon cancer cells via PI3K/AKT pathway. Mol Cell Biochem. 2015;400:287–295.
- [31]. Toiyama Y, Fujikawa H, Kawamura M, et al. Evaluation of CXCL10 as a novel serum marker for predicting liver metastasis and prognosis in colorectal cancer. Int J Oncol. 2012;40:560–566.
- [32]. Fang Y, Shen Z-Y, Zhan Y-Z, et al. CD36 inhibits beta-catenin/c-myc-mediated glycolysis through ubiquitination of GPC4 to repress colorectal tumorigenesis. Nat Commun. 2019;10:3981.
- [33]. Sherafatian M, Arjmand F. Decision tree-based classifiers for lung cancer diagnosis and subtyping using TCGA miRNA expression data. Oncol Lett. 2019;18:2125–2131.
- [34]. Najafi M, Farhood B, Mortezaee K. Extracellular matrix (ECM) stiffness and degradation as cancer drivers. J Cell Biochem. 2019;120:2782–2790.
- [35]. Bao Y, Wang L, Shi L, et al. Transcriptome profiling revealed multiple genes and ECM-receptor interaction pathways that may be associated with breast cancer. Cell Mol Biol Lett. 2019;24:38.
- [36]. Tubbs A, Nussenzweig A. Endogenous DNA damage as a source of genomic instability in cancer. Cell. 2017;168:644–656.
