Abstract
Mitotic cell cycle (MCC) is a critical process in cell growth and division, and dysregulation of MCC genes may contribute to tumorigenesis. In this study, to assess the diagnostic and prognostic value of MCC genes, differentially expressed MCC genes between hepatocellular carcinoma (HCC) and normal tissues were identified and subjected to machine learning methods. SVM-RFE and RF-RFE were employed to select the most informative diagnostic genes. The SVM-RFE model demonstrated high performance in TCGA (AUC = 1.0) and generalizability across GSE77509 (AUC = 0.95) and GSE144269 (AUC = 0.879), outperforming RF-RFE. Permutation testing confirmed that these AUCs were outside the null distribution for all datasets. Nine genes, CDKN3, TRIP13, RACGAP1, FBXO43, EZH2, SPDL1, E2F1, TUBE1 and CDC6, were common to SVM-RFE and RF-RFE and showed robust individual diagnostic performance across datasets (AUCs > 0.81). Univariate Cox regression followed by LASSO Cox regression was used to identify a prognostic gene signature consisting of eight MCC genes, BCAT1, DPF1, CDKN2B, CDKN2C, TUBA3C, IGF1, CDC14B and SMARCA2, that predicted overall survival of HCC patients. The risk score was shown to be an independent prognostic factor for HCC, and its combination with AJCC stage improved prognostic value. Kaplan–Meier analysis showed that a high risk score was associated with poorer survival across clinical subgroups defined by stage, grade, age, and gender. Additionally, the risk score was significantly higher in patients with advanced-stage and high-grade tumors. In conclusion, diagnostic biomarker candidates that classify HCC patients and healthy controls, and a novel prognostic gene signature that predicts overall survival of HCC patients, were identified by using machine learning approaches.
Introduction
Mitotic cell cycle (MCC) is a precisely regulated process, which includes DNA replication, chromosome segregation and cell division [1]. Because of its importance, this process is tightly monitored by surveillance mechanisms, which ensure the accuracy and correct order of the cell cycle. Defects in proper cell cycle progression lead to deregulated cell proliferation, genomic instability and eventually tumorigenesis [2]. Gene expression signatures composed of genes functioning in the cell cycle and DNA damage response have offered promising prognostic markers for HCC [3,4]. The significance and clinical utility of gene expression profile-based assays, such as MammaPrint and Oncotype DX, have been established in breast cancer for predicting cancer outcome and guiding therapeutic decision making [5–7]. This highlights the potential value of identifying diagnostic and prognostic genes for other cancers, such as HCC, to assist clinical practice.
Liver cancer is the third leading cause of cancer death and the sixth most common cancer by number of new cases [8]. The most common type of liver cancer is hepatocellular carcinoma (HCC) [9]. Different risk factors and mechanisms have been associated with HCC, including activation of oncogenic signaling pathways or altered functions of cancer driver genes due to mutations [9]. In addition to genetic variations, gene expression alterations have been identified as an important factor in HCC pathogenesis.
Machine learning (ML) algorithms, including support vector machine recursive feature elimination (SVM-RFE) and random forest with recursive feature elimination (RF-RFE), have been used to predict potential diagnostic cancer biomarkers by using existing transcriptome data [10–12]. SVM-RFE has been shown to be a powerful algorithm for identifying biologically relevant genes during feature selection [13]. RF is another method that has been used for feature selection in cancer diagnosis [12]. Combining RF with RFE improves feature selection, because RFE iteratively removes the least important features to yield a more relevant feature set [14]. Least absolute shrinkage and selection operator (LASSO) is a variable selection and shrinkage method, which has been used to predict the survival of cancer patients and to establish prognostic gene signatures [3,15,16].
Due to the significance of MCC in cancer progression, targeting cell cycle genes that are associated with tumorigenesis and patient outcome may facilitate detection of cancer formation, support therapy strategies that improve patient prognosis and provide new targets for therapeutic applications. Therefore, in this study, MCC genes that were differentially expressed in HCC were used to identify diagnostic biomarker candidates and to construct a prognostic MCC gene signature.
Materials and methods
Data retrieval and analysis
RNA-seq count data of The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA LIHC) tumor and normal samples (primary solid tumor, n = 371, and solid tissue normal, n = 50) and matching clinical data were downloaded by using the TCGAbiolinks R/Bioconductor package [17]. RNA-seq count data of tumor and normal samples from GSE77509 (tumor, n = 20, and normal, n = 20) [18] and GSE144269 (tumor, n = 70, and normal, n = 70) [19] were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/, [20]). Genes with low counts were filtered out and raw counts were TMM normalized by using the edgeR [21] and limma [22] R/Bioconductor packages. GSE14520 [23] raw data from tumor tissue of HCC patients (n = 221) and related clinical data, downloaded from the GEO database [20], were preprocessed and RMA normalized by using the oligo R/Bioconductor package [24]. The MCC gene set, GOBP_MITOTIC_CELL_CYCLE, and the transcript names in this gene set were retrieved from the Molecular Signatures Database (MSigDB v2023.2.Hs, https://www.gsea-msigdb.org/gsea/msigdb/index.jsp, [25,26]).
Identification of differentially expressed MCC genes in TCGA LIHC and GSE77509
Differentially expressed genes (DEGs) between tumor and normal samples of TCGA LIHC and GSE77509 were identified. Genes with adjusted p-value < 0.05 and |log fold change| > 1 were considered significant in each dataset. To obtain the differentially expressed MCC genes, the upregulated and downregulated DEGs from both datasets and the MCC genes from the GOBP_MITOTIC_CELL_CYCLE gene set were compared via a Venn diagram, and the 183 intersection genes were selected for further analysis (S1 Fig).
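The selection rule above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study, which relied on R/Bioconductor packages. The function names, the toy values and the requirement that a gene changes in the same direction in both datasets are assumptions made for illustration.

```python
def significant_genes(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Split a differential expression table into up- and downregulated sets.

    `results` maps gene symbol -> (log fold change, adjusted p-value).
    """
    up, down = set(), set()
    for gene, (lfc, padj) in results.items():
        if padj < padj_cutoff and abs(lfc) > lfc_cutoff:
            (up if lfc > 0 else down).add(gene)
    return up, down


def shared_mcc_degs(results_a, results_b, mcc_genes):
    """Keep MCC genes that are significant in the same direction in both datasets.

    Requiring the same direction is an assumption of this sketch.
    """
    up_a, down_a = significant_genes(results_a)
    up_b, down_b = significant_genes(results_b)
    mcc = set(mcc_genes)
    return up_a & up_b & mcc, down_a & down_b & mcc


# Toy values invented for illustration; they are not results from the study.
dataset_a = {"CDC6": (2.3, 1e-8), "IGF1": (-1.9, 1e-5),
             "ACTB": (0.2, 0.6), "E2F1": (1.8, 1e-6)}
dataset_b = {"CDC6": (1.7, 1e-4), "IGF1": (-1.4, 1e-3),
             "ACTB": (0.1, 0.9), "E2F1": (0.6, 0.01)}
up, down = shared_mcc_degs(dataset_a, dataset_b, ["CDC6", "IGF1", "E2F1"])
print(sorted(up), sorted(down))
```

In this toy input, E2F1 is excluded because its fold change in the second dataset does not pass the |log fold change| > 1 threshold.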
Feature selection using SVM-RFE and RF-RFE
To identify the most informative MCC genes for HCC classification, the intersection genes were subjected to two ML algorithms, SVM-RFE and RF-RFE. TCGA LIHC was used as the training dataset, GSE77509 as the internal validation set and GSE144269 as the external validation set. Prior to application of the ML algorithms, a filtering step was applied to TCGA LIHC to remove highly correlated genes (Pearson r > 0.9). To increase the robustness of the feature selection process, RFE with 10-fold cross-validation was performed for 50 iterations. Genes that were selected in at least 90% of the iterations were subjected to a final round of feature selection using RFE with 10-fold repeated cross-validation (5 repeats) to obtain the most relevant diagnostic genes for training the SVM (linear kernel) and RF models. To evaluate the generalizability of the models and to guard against overfitting, nested cross-validation and permutation testing (n = 100) were performed. Model performance was evaluated with AUC, sensitivity, specificity, accuracy, precision and recall. Analyses were performed by using the e1071, caret [27], ROCR [28], pROC [29] and randomForest [30] R packages.
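Two of the filtering steps described above, the correlation filter and the retention of genes selected in at least 90% of the RFE iterations, can be sketched in Python as follows. This is a simplified illustration rather than the caret-based procedure used in the study: the greedy order-dependent filter, the use of the absolute correlation, and all function and gene names are assumptions.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(expr, cutoff=0.9):
    """Greedy filter: keep a gene only if |r| <= cutoff with every kept gene.

    `expr` maps gene -> expression values across samples. The greedy order
    and the absolute value are simplifications assumed for this sketch.
    """
    kept = []
    for gene in expr:
        if all(abs(pearson(expr[gene], expr[k])) <= cutoff for k in kept):
            kept.append(gene)
    return kept


def stable_features(runs, min_fraction=0.9):
    """Return genes selected in at least `min_fraction` of the RFE runs."""
    counts = Counter(g for run in runs for g in set(run))
    return sorted(g for g, c in counts.items() if c / len(runs) >= min_fraction)


# Toy expression values and selection results, invented for illustration.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.1], "C": [4, 1, 3, 2]}
kept = drop_correlated(expr)

runs = [["X", "Y", "Z"]] * 8 + [["X", "Y"]] + [["X"]]
stable = stable_features(runs)
print(kept, stable)
```

Here gene B is removed because it is almost perfectly correlated with gene A, and gene Z is discarded because it was selected in only 8 of the 10 toy runs.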
Construction and performance of the prognostic MCC gene signature with LASSO
The intersection genes were utilized to construct a prognostic gene signature. TCGA LIHC patients with follow-up time ≥ 30 days and complete information on vital status and follow-up time were included in the survival analysis and the construction of the gene signature. The GSE14520 cohort was used to validate the prognostic MCC gene signature. Quantile normalization was applied to the TCGA dataset and harmonization across datasets was performed as previously described [31,32]. Univariate Cox regression was conducted to evaluate the association between MCC genes and overall survival in TCGA LIHC. Clinical information (including days to death, days to last follow-up and vital status) was used to calculate survival time and censoring status. For each gene, hazard ratios (HR), log hazard ratios (logHR), standard errors, 95% confidence intervals (CI) and p-values were calculated. The proportional hazards (PH) assumption was tested by using Schoenfeld residuals, and genes violating the PH assumption (p ≤ 0.05) were excluded from further analysis. To control for multiple testing, p-values were adjusted by using the Benjamini-Hochberg FDR method. Genes that did not violate the PH assumption (PH p > 0.05) and had FDR < 0.05 were retained for further analysis. To further assess the reliability of the selected genes, pairwise Pearson correlations among the significant genes were examined, and gene pairs with correlation coefficients > 0.9 were considered highly correlated.
The significant genes from the univariate Cox regression analysis (FDR < 0.05 and PH p > 0.05) were used as input for LASSO Cox regression analysis with 10-fold cross-validation. Genes with nonzero LASSO coefficients were used to construct the prognostic model. A risk score was calculated for each patient as follows:
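The gene filter described above combines a Benjamini-Hochberg adjustment with the PH test result. A minimal Python sketch of this step is given below. The p-values are invented, the gene names are placeholders, and adjusting across all tested genes before applying the PH filter is an assumption, since the order of the two steps may differ in the original analysis.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical univariate Cox p-values and Schoenfeld test p-values.
cox_p = {"GENE_A": 0.005, "GENE_B": 0.01, "GENE_C": 0.03, "GENE_D": 0.2}
ph_p = {"GENE_A": 0.40, "GENE_B": 0.02, "GENE_C": 0.30, "GENE_D": 0.90}

genes = list(cox_p)
fdr = dict(zip(genes, benjamini_hochberg([cox_p[g] for g in genes])))

# Retain genes with FDR < 0.05 that do not violate the PH assumption.
retained = [g for g in genes if fdr[g] < 0.05 and ph_p[g] > 0.05]
print(retained)
```

In this toy example GENE_B passes the FDR threshold but is removed because its PH test p-value is ≤ 0.05.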
$$\text{Risk score} = \sum_{i=1}^{n} \text{coef}_i \times \text{Exp}_i \tag{1}$$
where n is the number of genes in the prognostic model, coef_i is the LASSO coefficient of gene i, and Exp_i is the expression level of gene i. Accordingly, patients were divided into low- and high-risk groups based on the median value of the risk score. In addition, the optimal cutoff to divide patients into low- and high-risk groups was calculated by maximally selected rank statistics. Kaplan-Meier survival analysis was performed to compare the overall survival rate between the low- and high-risk groups for both the median and the optimal cutoff. The prognostic model was assessed by receiver-operating characteristic (ROC) curves, and AUC values for 1-, 3- and 5-year survival were reported. ROC curves at the 1-, 3-, and 5-year time points were generated using the timeROC package, which applies the marginal weighting method to account for right-censored survival data when estimating AUC values.
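The risk score is a coefficient-weighted sum of expression values, followed by a split at the cohort median. The following Python sketch illustrates the arithmetic only. The coefficients, gene names and expression values are hypothetical and are not the fitted LASSO coefficients of the signature, and assigning scores equal to the median to the low-risk group is an assumption.

```python
from statistics import median


def risk_score(coefs, expression):
    """Sum of coefficient x expression over the genes in the model."""
    return sum(coefs[g] * expression[g] for g in coefs)


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the cohort median."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Hypothetical coefficients: a positive value marks a risk gene,
# a negative value marks a protective gene.
coefs = {"GENE_A": 0.5, "GENE_B": -0.25}

# Hypothetical normalized expression values per patient.
patients = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 4.0, "GENE_B": 0.0},
    "P3": {"GENE_A": 0.0, "GENE_B": 4.0},
    "P4": {"GENE_A": 6.0, "GENE_B": 0.0},
}

scores = {p: risk_score(coefs, expr) for p, expr in patients.items()}
groups = split_by_median(scores)
print(scores, groups)
```

The same grouping function can be reused with a different threshold, such as the cutoff from maximally selected rank statistics, by replacing the median with that value.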
Univariate and multivariate Cox regression analyses were executed to evaluate the independent predictive value of the signature from clinical parameters, including American Joint Committee on Cancer (AJCC) pathologic stage, grade, age and gender. For each clinical parameter and risk score, HRs, 95% CIs, and p-values were computed and to control for multiple hypothesis testing, FDR adjustment was applied using the Benjamini-Hochberg method. Additionally, the PH assumption was tested for each covariate using Schoenfeld residuals for univariate and multivariate Cox regression. The concordance index (C-index) was calculated for three models: stage alone, risk score alone, and a combined model with both stage and the risk score. The prognostic value of the gene signature and the distribution of the risk score were assessed in subgroups stratified by clinical parameters, including AJCC pathologic stage, grade, age and gender. Analyses were performed by using survival [33], survminer, glmnet [34], timeROC [35] and survivalROC [36] R/Bioconductor packages.
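For readers unfamiliar with the C-index, the following Python sketch shows how Harrell's concordance can be computed from survival times, event indicators and predicted risks. It is a simplified illustration with invented data, not the implementation in the survival package. Pairs with tied survival times are ignored here, which is an assumption of this sketch.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (simplified).

    A pair (i, j) is comparable when patient i had an event and a shorter
    follow-up time than patient j. The pair is concordant when patient i
    also has the higher predicted risk. Tied risks count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Invented follow-up times (months), event flags (1 = death) and risk scores.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.8, 0.3, 0.4]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so comparing the C-index of stage alone, risk score alone and their combination indicates whether the risk score adds discriminative information.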
Results
Identification of the diagnostic biomarker candidates from differentially expressed MCC genes by using SVM-RFE
The GOBP_MITOTIC_CELL_CYCLE gene set consisted of 932 unique MCC genes. Among these 932 MCC genes, 183 genes (147 upregulated and 36 downregulated) were identified as differentially expressed in both TCGA LIHC and GSE77509 (adjusted p-value < 0.05 and |log fold change| > 1).
To identify the most informative MCC genes for distinguishing tumor from normal samples, SVM-RFE using a linear kernel was applied. After filtering highly correlated genes, performing RFE with 50 iterations, and a final RFE step using SVM, 110 genes were selected for model training. The model achieved an AUC of 1.0 in the training set, TCGA LIHC, and nested 5-fold cross-validation confirmed the model’s robustness (mean AUC = 0.996, SD = 0.006; Fig 1A). The model showed consistently high accuracy, F1-score, precision and recall across TCGA, GSE77509 (internal validation set) and GSE144269 (external validation set) (Fig 1B). The AUC values for TCGA LIHC, GSE77509 and GSE144269 (1.0, 0.95 and 0.879, respectively) indicated generalizability across datasets (Fig 1C). Permutation tests (n = 100) further validated the model by demonstrating that the observed AUCs lay far outside the null distribution for all three datasets and that none of the permuted AUCs reached the observed AUCs (permutation test empirical p-value < 0.01, Fig 1D).
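The logic of a label-permutation test on the AUC can be sketched in Python as follows. This is an illustration under simplifying assumptions and not the procedure used in the study: labels are shuffled against fixed classifier scores, whereas the original analysis may have retrained the model on permuted labels, and the toy scores are invented. The empirical p-value uses the common (hits + 1) / (n + 1) form, under which 100 permutations with no permuted AUC reaching the observed AUC give a p-value just below 0.01.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (tumor, normal) pairs ranked correctly.

    Ties between a tumor score and a normal score count as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=100, seed=0):
    """Empirical p-value of the observed AUC under shuffled labels."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Invented decision scores: 10 tumor samples (label 1), 10 normal (label 0).
labels = [1] * 10 + [0] * 10
scores = [0.90 + 0.01 * i for i in range(10)] + [0.10 + 0.01 * i for i in range(10)]
observed_auc, p_value = permutation_p_value(labels, scores)
print(observed_auc, p_value)
```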
Fig 1. The performance of SVM-RFE.
A) Nested 5-fold cross-validation AUC values for the TCGA LIHC. B) The performance of SVM-RFE based on the confusion matrix metrics in TCGA LIHC, GSE77509 and GSE144269. C) ROC curve of three datasets. D) Permutation tests (n = 100) of AUCs for TCGA LIHC, GSE77509 and GSE144269. Real AUCs were shown as red dashed lines.
Identification of the diagnostic biomarker candidates from differentially expressed MCC genes by using RF-RFE
To evaluate the diagnostic value of differentially expressed MCC genes in HCC, another classifier, RF, was employed with RFE. Filtering highly correlated genes, followed by RFE using RF with 50 iterations, resulted in 28 genes. These genes were then subjected to final RFE to obtain 15 genes.
Nested 5-fold cross-validation for TCGA confirmed model consistency with a mean AUC and SD of 0.945 ± 0.049 (Fig 2A). Model performance declined in the GSE77509 and GSE144269 datasets relative to TCGA, as reflected by balanced accuracy, recall and AUC values (Fig 2B and Fig 2C). Similar to SVM-RFE, the discriminative ability of the model was confirmed with permutation testing (n = 100) (permutation test empirical p value <0.01, Fig 2D).
Fig 2. The performance of RF-RFE.
A) Nested 5-fold cross-validation AUC values for the TCGA LIHC. B) Confusion matrix metrics for TCGA LIHC, GSE77509, and GSE144269. C) ROC curve for TCGA LIHC, GSE77509, and GSE144269. D) Permutation testing (n = 100) results of all three datasets. Red dashed lines represented real AUC values.
Selection of the most relevant features for diagnosis of HCC
To select the most relevant and important features for diagnosis of HCC, the top 20 features of SVM-RFE and the 15 features of RF-RFE were investigated. Although their rankings varied, nine genes were common to both classifiers: cyclin dependent kinase inhibitor 3 (CDKN3), thyroid hormone receptor interactor 13 (TRIP13), Rac GTPase activating protein 1 (RACGAP1), F-box protein 43 (FBXO43), enhancer of zeste 2 polycomb repressive complex 2 subunit (EZH2), spindle apparatus coiled-coil protein 1 (SPDL1), E2F transcription factor 1 (E2F1), tubulin epsilon 1 (TUBE1) and cell division cycle 6 (CDC6) (Fig 3A and Fig 3B).
Fig 3. Important features from SVM-RFE and RF-RFE.
A) Top 20 features of SVM-RFE. B) 15 features of RF-RFE. C) ROC curves showing the diagnostic performance of the nine shared genes.
In TCGA, all nine genes achieved AUCs > 0.95, indicating excellent diagnostic separation between tumor and normal samples. All genes maintained good to excellent performance with AUCs > 0.81 in GSE77509 and GSE144269. These results supported the robustness and potential value of these genes as individual diagnostic biomarker candidates for HCC (Fig 3C).
Construction of prognostic MCC gene signature with LASSO
Univariate Cox regression was performed with differentially expressed MCC genes that were represented in GSE14520 dataset to identify the MCC genes associated with overall survival of HCC patients. Accordingly, 14 genes were found to be significantly associated with overall survival based on FDR < 0.05 and PH p > 0.05, ensuring that all genes satisfied the PH assumption. Based on the HR values, four genes with HR < 1 were detected as protective and ten genes with HR > 1 were detected as risk factors (Fig 4A). To evaluate the potential redundancy in gene expression profiles, pairwise Pearson correlations were computed for 14 genes and no gene pairs showed high correlation (r > 0.9), which indicated the absence of multicollinearity for these genes (Fig 4B).
Fig 4. Evaluation of the prognostic performance of MCC gene signature in TCGA LIHC.
A) Forest plot for 14 genes significantly associated with overall survival in TCGA LIHC (FDR < 0.05, PH p > 0.05) B) Pairwise Pearson correlation heatmap of the 14 significant genes. C) Distribution of risk score, blue dashed line and red dashed line represented median and optimal cutoff, respectively. D) Distribution and survival information of the risk score and heatmap of the gene expression of 8 genes. E) Kaplan-Meier analysis of TCGA LIHC patients. F) ROC curve of TCGA LIHC patients. G) Univariate Cox regression for stage, age, gender, grade, and risk score and multivariate Cox regression for stage and risk score.
Subsequently, in order to construct the prognostic model, the LASSO Cox regression was applied on 14 MCC genes from univariate Cox regression, and eight genes were identified with nonzero coefficients (S2 Fig). A risk score was calculated for each patient, and median and optimal cutoff were identified. Patients were stratified into high- and low-risk groups based on the median value of the risk score or an optimal cutoff threshold. Due to the close value of median and optimal cutoff (Fig 4C), and the widespread use of median value, median value was used for further analysis.
Among the eight genes, branched chain amino acid transaminase 1 (BCAT1), double PHD fingers 1 (DPF1), cyclin dependent kinase inhibitor 2B (CDKN2B), cyclin dependent kinase inhibitor 2C (CDKN2C) and tubulin alpha 3c (TUBA3C) had high expression in the high-risk group, while insulin like growth factor 1 (IGF1), cell division cycle 14B (CDC14B) and SWI/SNF related BAF chromatin remodeling complex subunit ATPase 2 (SMARCA2) had high expression in the low-risk group (Fig 4D). Kaplan-Meier analysis showed that the high-risk group had significantly poorer overall survival compared to the low-risk group for both the median value and the optimal cutoff (Fig 4E and S3 Fig). The 1-year, 3-year and 5-year AUC values were 0.772, 0.681 and 0.667, respectively (Fig 4F and S3 Fig).
To evaluate whether the risk score could serve as an independent prognostic factor, univariate and multivariate Cox regression analyses were performed with risk score and clinical parameters: AJCC stage, age, grade, and gender. In univariate Cox regression analysis, stage and the risk score were found to be significantly associated with overall survival (FDR < 0.001), while age, gender, or grade did not show any significant associations (Fig 4G). To further assess the independence of the risk score, a multivariate Cox regression was performed with stage and risk score, which showed the risk score as an independent predictor of poor overall survival (HR = 3.085, 95% CI: 2.034–4.679, FDR < 0.001). Schoenfeld residual testing was performed for univariate and multivariate Cox regressions, which confirmed that all covariates satisfied the PH assumption, except for the risk score (p ≤ 0.05), which indicated prognostic effect may change over time.
The C-index was calculated for stage (stage alone), the risk score (risk score alone), and their combination (risk score + stage). The C-index was 0.614 for stage, 0.691 for the risk score, and 0.719 for their combination. These results suggested that the risk score added prognostic value to standard clinical staging.
Validation of the prognostic MCC gene signature
To validate the prognostic gene signature, the GSE14520 dataset was used (Fig 5A). The high-risk group showed poorer overall survival in Kaplan-Meier analysis for both the median value and the optimal cutoff (Fig 5B and S3 Fig). The performance of the MCC gene signature was evaluated by AUCs at 1, 3 and 5 years, which were 0.68, 0.651 and 0.645, respectively (Fig 5C and S3 Fig). Similarly, risk score and stage emerged as independent prognostic factors based on univariate and multivariate Cox regression results (FDR < 0.001, Fig 5D). Unlike in TCGA LIHC, Schoenfeld residual testing confirmed that all covariates, including the risk score, satisfied the PH assumption in the GSE14520 dataset (p > 0.05).
Fig 5. Evaluation of the prognostic performance of MCC gene signature in GSE14520 dataset.
A) Distribution and survival information of the risk score and heat map of the gene expression of 8 genes. B) Kaplan-Meier analysis of HCC patients. C) ROC curve of HCC patients. D) Forest plots from univariate and multivariate Cox regression.
For the GSE14520 dataset, the C-index was also calculated for stage (stage alone), the risk score (risk score alone), and their combination (risk score + stage). The C-index was 0.624 for stage, 0.633 for the risk score, and 0.703 for their combination, confirming the contribution of the risk score to the prognostic value of stage.
Kaplan-Meier Analysis and distribution of risk score in different clinical parameters
The association between a high risk score and overall survival was evaluated in the TCGA cohort in subgroups defined by clinical parameters: AJCC stage, tumor grade, age, and gender. In each subgroup, the high-risk group consistently showed significantly worse survival than the low-risk group (Fig 6A). These results confirm the prognostic value of the risk score in all subgroups. Higher risk scores were observed in patients with more advanced stage (Stage III/IV) and higher grade (G3/G4) tumors, suggesting biological relevance to HCC aggressiveness (Fig 6B).
Fig 6. Kaplan-Meier analysis and distribution of the risk score in TCGA cohort with different clinical parameters.
A) Kaplan-Meier plots for subgroups with different clinical parameters. B) Boxplots of the risk scores across the clinical subgroups.
Discussion
In this study, to capture the various aspects of the cell cycle with a focus on the MCC process, the GOBP_MITOTIC_CELL_CYCLE gene set, which covers diverse features of MCC such as cell division, cell cycle regulation and chromosome segregation, was selected, and the differentially expressed MCC genes were identified. The diagnostic value of these genes in HCC was shown by two different ML methods, SVM-RFE and RF-RFE, and a prognostic gene signature was generated by LASSO Cox regression.
ML algorithms have facilitated the prediction of cancer progression, metastasis risk and response to treatment [37]. Several data types, including electronic health records, genomic and transcriptomic data, and medical images, can be effectively utilized for diverse purposes in cancer research, such as identifying diagnostic cancer biomarkers or co-deletion mutations, personalized medicine, and early detection [37–40]. The utility of ML algorithms extends to structural biology, for example in predicting protein crystallization by combining feature selection methods with ML algorithms [41], which highlights the broad adaptability of these approaches in health sciences.
ML methods, including LASSO, RF and SVM-RFE, have been used to predict tumor grades and discriminative features using different data sources, such as CT images and gene expression datasets [10,42]. Studies have applied SVM-RFE in combination with RF or other ML methods to gene expression data to identify diagnostic HCC biomarker candidates. Yi et al. combined SVM-RFE and LASSO on ferroptosis-related genes and achieved AUCs ranging from 0.785 to 0.879 for selected features in the training set, but did not report a validation set for ML [43]. Zhou et al. used SVM-RFE in combination with other ML methods to identify important features from Notch signal-related genes, and reported AUC values of 0.908 ± 0.016 and 0.866 ± 0.064 for the training and testing sets, respectively, for the SVM classifier, while AdaBoost achieved the best diagnostic performance (AUC = 0.934) in the testing set [44]. Other studies used DEGs rather than specific biological processes or pathways to identify diagnostic biomarker candidates [10,45]. Gupta et al. used cell line models instead of primary tumors and obtained the top 20 features using SVM and RF; application of RFE resulted in three novel biomarkers with an accuracy of 0.97. Filtration steps that removed transcripts with low expression and high correlation were performed, similar to the current study. Among the different ML methods, SVM and RF had the best performance in finding biomarkers for HCC [10]. A combination of SVM-RFE, RF, LASSO and WGCNA was employed to find diagnostic genes; the AUCs of each gene ranged between 0.877 and 0.961 in the validation set, although the performance of the classifiers was not reported [46]. Similarly, other studies focused on individual features exceeding an AUC threshold, such as 0.7 or 0.85, obtained from the intersection of SVM-RFE and LASSO, and did not evaluate classifier performance [45,47,48].
The current study is distinguished from previous studies by its stringent feature selection steps, including correlation filtering, RFE with 50 iterations to retain genes selected in ≥90% of runs, and a final RFE with 10-fold repeated cross-validation. The inclusion of internal and external validation sets allowed the generalizability of the diagnostic model to be assessed. Permutation tests further confirmed model validity, showing that the observed AUCs were separated from the null distribution across all datasets. The SVM-RFE model with the refined feature set from the MCC gene list achieved high performance in TCGA (AUC = 1.0; nested cross-validation mean AUC = 0.996, SD = 0.006), GSE77509 (AUC = 0.95) and GSE144269 (AUC = 0.879), and the performance of the classifier with MCC genes was higher than that reported in similar HCC studies that used SVM-RFE to identify diagnostic genes from gene expression data. The results of this study showed that SVM-RFE surpassed RF-RFE, which achieved AUC = 0.825 and AUC = 0.714 in the internal and external validation sets, respectively.
The nine genes, TRIP13, RACGAP1, CDKN3, FBXO43, EZH2, SPDL1, TUBE1, CDC6 and E2F1, were common to SVM-RFE and RF-RFE and were considered diagnostic marker candidates for HCC. The high individual AUC values (AUCs > 0.81) of the nine genes in TCGA, GSE77509 and GSE144269 also showed the strong diagnostic power of these genes. While the downregulation of TUBE1 in HCC has been previously reported [49], the current results with SVM-RFE and RF-RFE are the first to suggest the potential utility of TUBE1 as a diagnostic biomarker and to highlight a role for TUBE1 in distinguishing HCC from normal tissue. Gene expression alterations of TRIP13, CDKN3, FBXO43, SPDL1 and E2F1 have been reported in HCC, and these genes were proposed as prognostic markers or therapeutic targets for HCC patients [50–55]. The results of this study support the role of these genes as diagnostic biomarkers, expanding their clinical relevance. Among the nine candidate genes, RACGAP1, CDC6 and EZH2 have been proposed as diagnostic biomarkers for HCC [56–58], consistent with the results of SVM-RFE and RF-RFE.
Prognostic gene signatures for HCC have previously been generated from cell cycle related genes using LASSO Cox regression [3,4,59]. In this study, an 8-gene model was established by LASSO Cox regression from genes specifically related to the MCC process, due to its importance in tumorigenesis. In the TCGA cohort, the model achieved AUCs of 0.772, 0.681, and 0.667 at 1, 3, and 5 years, respectively. In the GSE14520 cohort, the AUCs were 0.68, 0.651, and 0.645, demonstrating robustness across datasets. A cell cycle progression-derived model developed from the Hallmarks of cancer gene sets in MSigDB achieved AUCs of 0.776, 0.697, and 0.619 (for 1, 3, and 5 years, respectively) in TCGA, and 0.779, 0.803, and 0.762 in the LIRI-JP cohort [3]. Six cell cycle related MSigDB gene sets were used to establish a 13-gene prognostic model, which showed high performance with AUC values of 0.835, 0.822, 0.808, 0.821, and 0.826 at 1, 2, 3, 4 and 5 years, respectively [4]. A 6-gene prognostic model derived from a manually curated, literature-based list of 50 cell cycle genes achieved AUCs of 0.737, 0.712, and 0.683 (for 1, 2, and 3 years, respectively) in TCGA, and 0.742, 0.743, and 0.741 in the ICGC cohort [59]. Compared to these studies, the current study demonstrated similar or slightly lower predictive accuracy, while emphasizing biological specificity to the MCC process. It should be noted that time-dependent AUCs in the 0.6–0.7 range have been commonly observed in validation sets for prognostic gene signatures [60,61]. The current study is distinguished from most previous publications with a similar setup by its stringent gene filtering at the univariate Cox regression step, including PH assumption testing and avoidance of multicollinearity, by its consideration of both median and optimal cutoff risk scores for patient stratification, and by its evaluation of clinical relevance through the C-index and integration with AJCC stage.
Staging systems for liver tumors, including AJCC TNM stage and histological grade, are accepted as invaluable tools for predicting prognosis and guiding therapy [62]. The model showed prognostic value independent of clinical parameters, and its combination with AJCC stage improved prognostic discrimination, with C-indices increasing from 0.614 (stage) and 0.691 (risk score) to 0.719 (combined) in TCGA, and from 0.624 (stage) and 0.633 (risk score) to 0.703 (combined) in GSE14520. This improvement in C-index values indicates that the risk score added independent prognostic value to AJCC stage. Kaplan-Meier plots showed that a high risk score was consistently associated with significantly poorer overall survival in all clinical subgroups, including early (Stage I/II) and advanced (Stage III/IV) stages, low (G1/G2) and high (G3/G4) tumor grades, younger and older age groups, and both sexes. These results indicate that the prognostic gene signature might have broad applicability in clinical practice. Moreover, risk scores were significantly higher in patients with advanced-stage and high-grade tumors, supporting a potential biological association with tumor aggressiveness.
This study provided diagnostic HCC biomarker candidates and a novel gene signature, which were established and validated by computational methods. Therefore, the genes identified in this study can be utilized to design novel in vivo studies that evaluate the biological and mechanistic consequences of these gene expression alterations. Despite the strengths of this study, such as the integration of rigorous feature selection steps and multi-dataset validation, it is important to acknowledge its limitations. This study relied on retrospective datasets from public repositories. Although this is common in computational studies, more datasets from prospective patient cohorts should be evaluated. In addition, no experimental validation was performed to confirm the identified diagnostic or prognostic genes. To experimentally validate the diagnostic and prognostic significance of the identified genes, future studies could examine their expression in HCC versus normal tissues using qPCR and immunohistochemistry. In addition, functional assays using siRNA-mediated knockdown of candidate genes in HCC cell lines may help to explain their roles in tumor proliferation and progression. It is also important to note that this study did not incorporate analysis of DNA driver mutations, which may provide additional insights into HCC biology and clinical outcomes. Therefore, future studies that integrate genomic and transcriptomic datasets could provide a more comprehensive understanding of diagnostic and prognostic mechanisms in HCC.
Supporting information
(PDF)
(PDF)
(PDF)
Data Availability
The count data and related clinical information of TCGA LIHC can be accessed through National Cancer Institute (NCI) Genomic Data Commons (GDC) (https://gdc.cancer.gov) via TCGAbiolinks R/Bioconductor package. The count data of GSE77509 and GSE144269 and the raw files, including clinical information, of GSE14520 can be accessed through Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/). The mitotic cell cycle transcripts in the mitotic cell cycle gene set; GOBP_MITOTIC_CELL_CYCLE, can be accessed through Molecular Signatures Database (MSigDB v2023.2.Hs, https://www.gsea-msigdb.org/gsea/msigdb/index.jsp).
Funding Statement
The author(s) received no specific funding for this work.