Abstract
The incidence of hepatocellular carcinoma (HCC) has risen significantly in recent years, while current diagnostic and therapeutic approaches remain suboptimal. This study aimed to identify novel biomarkers and therapeutic targets to improve early detection and treatment outcomes. We conducted a comprehensive analysis of HCC-related gene expression datasets (GSE101685, GSE14520, and TCGA-LIHC). Differentially expressed genes (DEGs) were identified, followed by weighted gene co-expression network analysis (WGCNA) on the training cohort. A total of 313 shared genes were identified by intersecting 691 DEGs with 1653 genes from the “MEturquoise” module. Functional enrichment analyses, including gene ontology and Kyoto Encyclopedia of Genes and Genomes analyses, were performed to explore the biological roles of these genes. Subsequently, 109 combinations of 12 machine learning algorithms were applied to identify HCC-specific feature genes. Gene set enrichment analysis and CIBERSORT were used to explore functional pathways and immune infiltration, respectively. Functional analyses revealed that the shared genes were primarily involved in cell cycle regulation and cell division. The machine learning analysis identified 96 HCC feature genes. Among them, 5 genes (DNAJC12, KBTBD11, SEC24B, PLSCR4, SH3YL1) with no previously reported association with HCC showed significantly lower expression in tumor samples, and their diagnostic value was validated using receiver operating characteristic analysis. Gene set enrichment analysis further showed their association with immune responses, metabolic processes, and cell cycle regulation. Immune infiltration analysis linked DNAJC12, KBTBD11, and SEC24B to the HCC immune microenvironment. Our study identified 5 previously unreported genes as potential diagnostic biomarkers and therapeutic targets for HCC. These findings provide a new perspective for the molecular characterization and clinical management of hepatocellular carcinoma.
Keywords: bioinformatics, hepatocellular carcinoma, immune infiltration, machine learning, weighted gene co-expression network analysis
1. Introduction
The liver, the largest parenchymal organ in the human body, functions as a metabolic epicenter, playing an indispensable role in processes such as digestion, metabolism, and detoxification, while also serving as a critical immunological organ that modulates systemic immune responses and inflammation.[1] According to the 2022 Global Cancer Statistics, liver cancer ranked as the 6th most prevalent malignancy and the 3rd leading cause of cancer-related mortality worldwide, with 865,269 new cases and 757,948 deaths reported that year.[2] Hepatocellular carcinoma (HCC), the most common form of primary hepatic malignancy, accounts for more than 80% of liver cancer cases and represents a significant global health challenge, with China contributing over half of the global burden.[3]
The etiology of HCC is highly heterogeneous and involves multiple risk factors, including genetic predisposition,[4] cirrhosis,[5] chronic viral hepatitis infections, particularly hepatitis B virus (HBV) and hepatitis C virus (HCV),[6] chronic alcohol consumption,[7] nonalcoholic fatty liver disease (NAFLD),[8] and metabolic disorders such as type 2 diabetes mellitus.[9] Additionally, epidemiological studies have demonstrated a positive association between HCC risk and exposure to aflatoxins,[10] tobacco smoking,[11] and environmental air pollutants.[12] The insidious onset of HCC often results in late-stage diagnosis, and despite therapeutic advances in recent years – including surgical resection, liver transplantation, ablation therapies, and transarterial chemoembolization (TACE) for early and intermediate-stage disease, as well as molecular targeted therapies and immune checkpoint inhibitors for advanced-stage HCC – the recurrence rate remains high, and overall prognosis is poor.[13]
Recent advancements in genomics, next-generation and third-generation sequencing technologies, bioinformatics, and machine learning have greatly enhanced the precision of HCC diagnosis and treatment, driving forward the development of precision medicine.[14–16] However, there is still a significant gap in the identification of clinically applicable biomarkers with high sensitivity and specificity for HCC. In this study, bioinformatics and machine learning algorithms were employed to identify a novel set of candidate genes – DnaJ heat shock protein family member C12 (DNAJC12), kelch repeat and BTB domain containing 11 (KBTBD11), phospholipid scramblase 4 (PLSCR4), SEC24 homolog B, COPII coat complex component (SEC24B), and SH3 domain containing YSC84-like 1 (SH3YL1) – which have not been extensively studied in the context of HCC. These findings provide a critical foundation for further elucidating the molecular mechanisms underlying HCC and hold promise for the development of novel diagnostic biomarkers and therapeutic targets.
2. Materials and methods
2.1. Data acquisition
Two HCC-related datasets, GSE101685 and GSE14520, were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/), and the TCGA-LIHC dataset was downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). The GSE101685 dataset includes 8 normal tissue samples and 24 HCC samples. The GSE14520 dataset consists of data from 2 different microarray platforms: data from the GPL3921 platform (designated as GSE14520-1), comprising 220 tumor samples and 220 nontumor samples, and data from the GPL571 platform (designated as GSE14520-2), which contains tumor and nontumor samples from 22 patients. The TCGA-LIHC dataset contains 50 normal samples and 374 HCC samples.
2.2. Batch effect correction
Prior to differential gene expression analysis, we annotated the data using Perl. Data normalization was then performed in R version 4.4.0, followed by batch effect correction using the “ComBat” function from the “sva” package.[17] The corrected GSE101685 and GSE14520-1 datasets were merged and used as the training cohort, while the GSE14520-2 and TCGA-LIHC datasets served as the testing cohort for further analyses. To evaluate the effectiveness of batch effect correction, we used principal component analysis (PCA) to compare the training cohort data before and after batch effect removal and visualized the result.
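The idea behind batch correction can be illustrated with a much simpler adjustment than the one used here. The following Python sketch is not ComBat and does not reproduce the study pipeline: it only re-centers one gene within each batch onto that gene's overall mean (a location-only adjustment). The function name `center_batches` and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_batches(values, batches):
    """Location-only batch adjustment for ONE gene (simplified stand-in, not ComBat).

    values  -- expression values of a single gene, one per sample
    batches -- batch label of each sample (same order as values)
    Each value is shifted so that every batch mean equals the overall mean.
    """
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Toy example: two hypothetical batches separated by a systematic offset of 2 units.
expr = [5.0, 6.0, 7.0, 8.0]
batch = ["A", "A", "B", "B"]
adjusted = center_batches(expr, batch)
```

Unlike this sketch, ComBat also adjusts batch-specific variance and can protect biological covariates such as tumor status; naive centering would erase a real group difference if the groups were unevenly distributed across batches.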
2.3. Identification of differentially expressed genes
Differentially expressed genes (DEGs) in the combined training cohort were identified using the “limma” package in R,[18] with the selection criteria set to |logFC| > 1 and P < .05. The results were visualized using a volcano plot, generated with the “ggplot2” package. To further illustrate the expression patterns of DEGs across different sample groups, a heatmap was constructed using the “pheatmap” package.
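The selection rule above (|logFC| > 1 and P < .05) can be sketched as a small filter. This is a minimal Python illustration, assuming the differential expression table has already been exported as (gene, logFC, P) rows; the function name `classify_degs`, the gene labels, and the numbers are hypothetical and are not results from this study.

```python
def classify_degs(rows, lfc_cutoff=1.0, p_cutoff=0.05):
    """Split genes into up- and downregulated lists using |logFC| > 1 and P < .05.

    rows -- iterable of (gene, logFC, p_value) tuples (assumed input layout).
    """
    up, down = [], []
    for gene, lfc, p in rows:
        if p >= p_cutoff or abs(lfc) <= lfc_cutoff:
            continue  # fails at least one of the two thresholds
        (up if lfc > 0 else down).append(gene)
    return up, down


# Hypothetical values for illustration only.
table = [
    ("GENE_A", 1.8, 0.001),    # passes both thresholds, upregulated
    ("GENE_B", -2.3, 0.0004),  # passes both thresholds, downregulated
    ("GENE_C", 0.6, 0.0001),   # significant but effect too small
    ("GENE_D", -1.5, 0.20),    # large effect but not significant
]
up_genes, down_genes = classify_degs(table)
```

The text does not state whether the P-value is raw or adjusted for multiple testing; the sketch simply applies the cutoff to whichever column is supplied.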
2.4. Weighted gene co-expression network analysis
We performed weighted gene co-expression network analysis (WGCNA), a commonly used tool in systems biology research, on the combined training cohort.[19] First, genes with a standard deviation <0.5 were filtered out, and the “goodSamplesGenes” function was used to select high-quality samples and genes. Next, we used the R package “WGCNA” to construct a gene co-expression network. The “pickSoftThreshold” function was applied to determine the optimal soft-thresholding power. The adjacency matrix was then converted into a topological overlap matrix (TOM), which quantifies the similarity between nodes by evaluating the weighted correlations between pairs of nodes and their relationships with other nodes in the network. Subsequently, genes exhibiting similar expression patterns were clustered into modules using the “dynamic tree cutting” algorithm, with a minimum module size of 60, and the clustered dendrograms were cut at a height of 0.25 to merge similar modules. Pearson correlation was used to assess the association between module genes and clinical traits. The module with the highest correlation coefficient and P < .05 was selected as the key module, yielding the corresponding key module genes.
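The two matrix steps described above, soft-threshold adjacency and topological overlap, can be sketched in a few lines of Python. This is an illustration rather than the WGCNA package itself, and it assumes an unsigned network (the network type is not stated in the text). The soft-thresholding power of 7 is the value reported in the Results; the expression values are toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def adjacency(expr, beta):
    """Unsigned soft-threshold adjacency a_ij = |cor(i, j)| ** beta, diagonal 0."""
    g = len(expr)
    return [[0.0 if i == j else abs(pearson(expr[i], expr[j])) ** beta
             for j in range(g)] for i in range(g)]


def tom(adj):
    """Topological overlap: (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    out = [[1.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(g):
            if i != j:
                # Shared-neighbor term; u = i and u = j add 0 (zero diagonal).
                shared = sum(adj[i][u] * adj[u][j] for u in range(g))
                out[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return out


# Toy data: 3 genes measured in 4 samples.
expr = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.0, 6.0, 8.0],  # gene 1, perfectly correlated with gene 0
    [1.0, 3.0, 2.0, 4.0],  # gene 2, correlation 0.8 with gene 0
]
adj = adjacency(expr, beta=7)
overlap = tom(adj)
```

Raising the correlation to the power 7 shrinks the moderate correlation of 0.8 to about 0.21 while leaving the perfect correlation at 1, which is how soft thresholding emphasizes strong co-expression without a hard cutoff.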
2.5. Enrichment analysis of the shared genes
We visualized the shared genes between DEGs and WGCNA key module genes using the R package “VennDiagram.” Subsequently, the shared genes were subjected to gene ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses with R packages including “clusterProfiler,” “org.Hs.eg.db,” and “enrichplot.” A bar plot was generated to display the top 15 enriched GO terms. The GO analysis encompassed 3 main categories: biological process (BP), cellular component (CC), and molecular function (MF), which collectively describe the roles of genes in biological processes, molecular activities, and cellular structures. The KEGG analysis explored the metabolic and signaling pathways in which these genes might participate.[20–22]
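Over-representation analyses of this kind are commonly based on a hypergeometric test. The Python sketch below shows the set intersection that yields shared genes and the upper-tail hypergeometric P-value for one term. It is a minimal illustration under that assumption, not the clusterProfiler implementation; the function name `ora_p_value`, the gene identifiers, and all counts are hypothetical.

```python
from math import comb


def ora_p_value(overlap, query_size, term_size, background_size):
    """Upper-tail hypergeometric P(X >= overlap) for over-representation.

    overlap         -- genes in both the query list and the annotated term
    query_size      -- genes in the query list (e.g. the shared genes)
    term_size       -- background genes annotated to the term
    background_size -- all genes in the background universe
    """
    total = comb(background_size, query_size)
    upper = min(query_size, term_size)
    tail = sum(
        comb(term_size, x) * comb(background_size - term_size, query_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Shared genes are the set intersection of the two gene lists (toy identifiers).
degs = {"G1", "G2", "G3", "G4"}
module_genes = {"G3", "G4", "G5"}
shared = degs & module_genes

# Hypothetical counts: 3 of 5 query genes hit a 10-gene term in a 100-gene universe.
p = ora_p_value(overlap=3, query_size=5, term_size=10, background_size=100)
```

In practice the resulting P-values are computed for many terms at once and therefore usually adjusted for multiple testing before terms are ranked.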
2.6. Machine learning algorithm and ROC validation
To further identify feature genes for HCC, we used 12 machine learning algorithms to construct predictive models, including Lasso regression, Ridge regression, stepwise generalized linear model (Stepglm), XGBoost, Random Forest (RF), Elastic Net (Enet), partial least squares regression of generalized linear models (plsRglm), generalized boosted regression modeling (GBM), Naive Bayes, linear discriminant analysis (LDA), generalized linear model boosting (glmBoost), and support vector machine (SVM). The performance of each predictive model was evaluated and validated.[23,24] Leave-one-out cross-validation (LOOCV) was applied across 109 algorithmic combinations on the training cohort, and 10-fold cross-validation was performed on the training cohort to assess model accuracy. The receiver operating characteristic (ROC) curve was used to validate the accuracy of the models on the testing cohort. Ultimately, the model with the highest average area under the curve (AUC) across the training and testing cohorts was defined as the optimal model.[25] The ROC curve of the optimal model was visualized using the R package “pROC.” Differential expression analysis was conducted again to identify HCC feature genes by applying a threshold of |logFC| > 1 and P < .05 to the genes from the optimal model and the shared genes. Results were visualized with a volcano plot using the R packages “ggplot2” and “ggrepel.” Additionally, 5 feature genes that have not yet been explored in the context of HCC were selected for further analysis. Boxplots were generated using the “ggpubr” package to illustrate the differential expression of these 5 genes between normal and tumor tissues, and the ROC curves for these feature genes were plotted using the “pROC” package.
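The AUC used to rank models and single genes has a simple rank-based interpretation: it is the probability that a randomly chosen tumor sample receives a higher score than a randomly chosen normal sample. The Python sketch below computes it directly from that definition. It is an illustration, not the pROC implementation, and the expression values are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random tumor sample scores higher than a
    random normal sample (ties count one half); labels: 1 = tumor, 0 = normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical expression of a gene that is lower in tumors; negating the values
# turns "low expression" into a high tumor score.
expression = [2.1, 3.0, 4.2, 5.5, 3.5, 6.1]
is_tumor = [1, 1, 1, 0, 0, 0]
auc = roc_auc([-v for v in expression], is_tumor)
```

For a gene that is downregulated in tumors, using the raw expression as the score would give 1 minus this value, so the direction of the score has to be fixed before AUCs are compared across genes.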
2.7. Protein–protein interaction network analysis
We utilized the GeneMANIA database (https://genemania.org/), which provides a powerful platform for identifying functional associations, including physical interactions, co-expression, co-localization, predicted interactions, genetic interactions, shared protein domains, and pathways, to construct and visualize the protein–protein interaction (PPI) network for the 5 feature genes.
2.8. Gene set enrichment analysis
To further investigate the biological processes associated with the 5 novel feature genes (DNAJC12, KBTBD11, PLSCR4, SEC24B, SH3YL1), we performed KEGG pathway-related gene set enrichment analysis (GSEA).[26] For each gene, HCC samples were divided into high- and low-expression groups based on the median expression level of that gene. The predefined gene set (c2.cp.kegg.v7.4.symbols.gmt), which includes the majority of KEGG pathways, was downloaded from the Molecular Signatures Database (MSigDB). GSEA was performed using the R packages “clusterProfiler,” “org.Hs.eg.db,” “limma,” and “enrichplot,” and the top 4 significantly enriched pathways for both the high- and low-expression groups were visualized. Statistical significance for all enrichment results was set at P < .05.
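The median-based grouping that precedes GSEA can be sketched as follows. This is a minimal Python illustration; the function name `split_by_median`, the sample identifiers, and the values are hypothetical, and the handling of samples that fall exactly on the median is an assumption because the text does not specify it.

```python
from statistics import median


def split_by_median(expression):
    """Return (high, low) sample lists for one gene.

    expression -- dict mapping sample ID to the gene's expression value.
    Samples equal to the median go to the low group (arbitrary tie rule).
    """
    cut = median(expression.values())
    high = [s for s, v in expression.items() if v > cut]
    low = [s for s, v in expression.items() if v <= cut]
    return high, low


# Hypothetical sample identifiers and expression values.
gene_expr = {"S1": 2.0, "S2": 7.5, "S3": 4.1, "S4": 9.3, "S5": 3.3, "S6": 6.0}
high_group, low_group = split_by_median(gene_expr)
```

The two groups would then be compared gene by gene to produce the ranked list that GSEA takes as input.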
2.9. Immune infiltration analysis
Based on the gene expression profiles from the training cohort, we employed the CIBERSORT algorithm to identify the relative composition of immune cells within the tissue.[27] Subsequently, single-sample gene set enrichment analysis (ssGSEA) was utilized to quantify the levels of 22 infiltrating immune cell types in each sample.[28] To ensure robust analysis, immune cell types with a standard deviation of zero were excluded, and correlation analysis was then performed to assess the relationships between the 5 feature genes and the remaining 20 immune cell types, followed by visualization of the results.
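The filtering and correlation step can be sketched as below. This Python example drops cell types whose estimated fraction never varies and correlates the rest with one gene. The correlation method is not named in the text, so Pearson correlation is an assumption here; the function name `correlate_gene_with_cells` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlate_gene_with_cells(gene_expr, cell_fractions):
    """Correlate one gene with every immune cell type whose fraction varies.

    gene_expr      -- expression of the gene per sample
    cell_fractions -- dict: cell type -> estimated fraction per sample
    Cell types with zero standard deviation are skipped because a correlation
    is undefined for a constant vector.
    """
    return {
        cell: pearson(gene_expr, frac)
        for cell, frac in cell_fractions.items()
        if pstdev(frac) > 0
    }


# Hypothetical values for four samples.
gene = [1.0, 2.0, 3.0, 4.0]
cells = {
    "Monocytes": [0.1, 0.2, 0.3, 0.4],
    "Macrophages M0": [0.4, 0.3, 0.2, 0.1],
    "Constant cell type": [0.0, 0.0, 0.0, 0.0],  # zero SD, excluded
}
result = correlate_gene_with_cells(gene, cells)
```

Excluding zero-variance cell types is what reduces the 22 estimated cell types to the 20 that enter the correlation analysis.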
3. Results
3.1. Data processing and differentially expressed gene analysis
The complete study design is outlined in the flowchart in Figure 1. After eliminating batch effects from the GSE101685, GSE14520, and TCGA-LIHC datasets, GSE101685 and GSE14520-1 were merged and defined as the training cohort for subsequent analysis. Differential expression analysis was then performed on the training data using the “limma” package, with the filtering criteria set at |logFC| > 1 and P < .05, resulting in the identification of 691 DEGs. The results of principal component analysis indicated a significant reduction in batch effects in the corrected training cohort (Fig. 2A and B). The volcano plot of the DEGs revealed 455 downregulated genes and 236 upregulated genes (Fig. 2C). The heatmap of the DEGs demonstrated clear differential expression patterns between the control and tumor groups (Fig. 2D), effectively distinguishing the 2 sample sets, with gene clustering showing distinct segregation into upregulated and downregulated categories.
Figure 1.
A flowchart of the study.
Figure 2.
Principal component analysis and differential expression analysis. (A) PCA of 2 original HCC datasets prior to batch effect correction. (B) PCA of integrated HCC dataset after batch effect correction. (C) The volcano plot of the DEGs between HCC and healthy controls. (D) The heatmap of the DEGs between HCC and healthy controls. DEGs = differentially expressed genes, HCC = hepatocellular carcinoma, PCA = principal component analysis.
3.2. WGCNA and enrichment analysis
The soft-thresholding power (β) was set to 7, as this was the first power value at which the scale-free topology index reached 0.9 (Fig. 3A and B). Approximately 7 gene modules were identified through the “dynamic tree cutting” algorithm (Fig. 3C and D). The “MEturquoise” module, consisting of 1653 genes, was the most significantly correlated with the HCC phenotype (r = 0.89, P = 1e−160; Fig. 3E). Furthermore, the correlation between gene significance (GS) and module membership (MM) within this module was exceptionally high at 0.95, with a P-value of <1e−200 (Fig. 3F). Next, we identified the intersection of 691 DEGs and 1653 WGCNA module genes, yielding 313 shared genes for further analysis (Fig. 4A). GO enrichment analysis (Fig. 4C) revealed overrepresentation of BPs including regulation of nuclear chromosome segregation, sister chromatid segregation, chromosome segregation, mitotic sister chromatid separation, and mitotic nuclear division, which are critical for cell cycle regulation. Overrepresented cellular components comprised chromosomal region; chromosome, centromeric region; spindle; condensed chromosome; and CMG complex. Enriched molecular functions included catalytic activity, acting on DNA; single-stranded DNA helicase activity; ATP-dependent activity, acting on DNA; single-stranded DNA binding; and protein kinase regulator activity, indicating their essential roles in cell cycle regulation, DNA replication, and repair. KEGG pathway enrichment analysis (Fig. 4D) further demonstrated that these genes are closely associated with key pathways such as cell cycle, DNA replication, p53 signaling pathway, mineral absorption, human T-cell leukemia virus 1 infection, cellular senescence, and oocyte meiosis.
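The rule used to choose the soft-thresholding power (the first power at which the scale-free topology index reaches 0.9) can be written as a short helper. This is a Python sketch with a hypothetical function name; the fit indices are invented for illustration, and only the pattern of a first crossing at power 7 mirrors the reported result.

```python
def first_power_reaching(fit_by_power, target=0.9):
    """Return the smallest power whose scale-free fit index is >= target, or None."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] >= target:
            return power
    return None


# Hypothetical fit indices per candidate power (not the study's values).
fits = {1: 0.12, 2: 0.35, 3: 0.58, 4: 0.71, 5: 0.80, 6: 0.87, 7: 0.91, 8: 0.93}
beta = first_power_reaching(fits)
```

Choosing the smallest qualifying power keeps the network as connected as possible while still approximating scale-free topology.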
Figure 3.
The results of WGCNA. (A and B) Analysis of the scale-free fit index for various soft-thresholding powers (β). (C) Clustered dendrograms were cut at a height of 0.25 to detect and combine similar modules. (D) Original and combined modules under the clustering tree. (E) Heat map of module–trait relationships; each cell contains the corresponding correlation and P-value. (F) Associations between gene significance and membership in the turquoise module. WGCNA = weighted gene co-expression network analysis.
Figure 4.
GO and KEGG analyses of shared genes, and validation of machine learning models. (A) Venn diagram indicating the 313 genes shared between the DEGs and the WGCNA key module genes. (B) ROC curves of training cohort and testing cohort. (C) Bar plots of GO enrichment analysis results for biological process, cellular component, and molecular function. (D) Bar plots of KEGG pathway enrichment analysis. DEGs = differentially expressed genes, GO = gene ontology, KEGG = Kyoto Encyclopedia of Genes and Genomes, ROC = receiver operating characteristic, WGCNA = weighted gene co-expression network analysis.
3.3. The feature genes selection via machine learning
To identify HCC-specific feature genes, we conducted analyses using 109 different combinations of 12 machine learning algorithms. The results indicated that the optimal model was “Enet [alpha = 0.2]” (Fig. 5A), and the accuracy of this diagnostic model was validated through ROC curve analysis (Fig. 4B). Further differential expression analysis of the genes from the optimal model and the shared genes resulted in the identification of 96 HCC feature genes, including 43 upregulated and 53 downregulated genes (Fig. 5B). A literature review revealed that many of these genes have already been implicated in HCC development. For instance, the mannosidase alpha class 1C member 1 (MAN1C1) gene acts as a tumor suppressor in the development of HCC,[29] while the pituitary tumor-transforming gene 1 (PTTG1) gene is involved in the metabolic reprogramming of asparagine synthetase, promoting HCC progression.[30] The autophagy-related gene small nuclear ribonucleoprotein polypeptide E (SNRPE) is overexpressed in HCC cells and regulates the proliferation and migration of HepG2 cells.[31] Additionally, the cyclin B1 (CCNB1) gene promotes HCC development by mediating DNA replication during the cell cycle,[32] and gamma-aminobutyric acid type A receptor epsilon subunit (GABRE) has been identified as an independent diagnostic biomarker for HCC in the context of liver cirrhosis.[33] In contrast, DNAJC12, KBTBD11, SEC24B, PLSCR4, and SH3YL1 have not been previously reported in HCC research. Differential expression analysis of these 5 genes (Fig. 5D) revealed significant differences in their expression between control and tumor tissues. As shown in Figure 5C, the ROC analysis results indicate strong diagnostic performance for each gene, with AUC values of 0.843 for DNAJC12, 0.911 for KBTBD11, 0.940 for PLSCR4, 0.858 for SEC24B, and 0.768 for SH3YL1. All 5 AUC values exceeded 0.7, supporting the ability of these genes to differentiate normal from tumor samples and underscoring their potential as diagnostic biomarkers for HCC.
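For orientation, the elastic net objective in its common glmnet-style parameterization is given below. It is an assumption that the “alpha = 0.2” in the model label refers to the mixing parameter α of this formulation; the software behind the Enet model is not named in the text. Here ℓ denotes the per-sample log-likelihood, N the number of samples, and λ the overall penalty strength.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{N}\sum_{i=1}^{N} \ell\bigl(y_i,\; \beta_0 + x_i^{\top}\beta\bigr)
    + \lambda\Bigl[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
    + \alpha\,\lVert\beta\rVert_1\Bigr],
  \qquad \alpha = 0.2
```

Under this reading, α = 0.2 places most of the penalty weight on the ridge term, which tends to retain groups of correlated genes, while the smaller lasso component can still shrink some coefficients to exactly zero.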
Figure 5.
Comprehensive analysis of characteristic genes and immune landscape in HCC using machine learning and bioinformatics approaches. (A) 109 machine learning algorithm combinations evaluated via 10-fold cross-validation. (B) The volcano plot of the 96 diagnostic genes identified by machine learning. (C) The gene ROC analysis by using testing cohort for the 5 novel genes. (D) Differential expression analysis of the 5 novel genes between control and tumor groups. (E) The difference in immune cells between the Tumor and Control groups. (F) The percentage of 22 types of immune cells in the Tumor and Control groups. (G) Correlation between 5 novel genes and 20 immune cells. (H) PPI network of the 5 novel genes constructed by GeneMANIA (***, P < .001; **, P < .01; *, P < .05). PPI = protein–protein interaction, ROC = receiver operating characteristic.
3.4. PPI network analysis and GSEA results of the 5 feature genes
The PPI network for the feature genes was constructed using the GeneMANIA database, which identified 20 genes that interact with the feature genes (Fig. 5H). It revealed potential links in their regulatory roles within cellular processes and biological function networks. Additionally, KEGG pathway-related GSEA results (Fig. 6) showed that genes associated with upregulated DNAJC12 expression were predominantly enriched in pathways related to complement and coagulation cascades, fatty acid metabolism, and ribosome biogenesis. In contrast, genes associated with downregulated DNAJC12 expression were primarily involved in pathways such as lysosome and Vibrio cholerae infection. Similarly, genes associated with upregulated KBTBD11 expression were enriched in complement and coagulation cascades, cytochrome P450 drug metabolism, and retinol metabolism, while those associated with downregulated KBTBD11 expression were mainly involved in cell cycle regulation, DNA replication, and Huntington disease. For PLSCR4, upregulated expression correlated with pathways related to asthma, complement and coagulation cascades, and systemic lupus erythematosus, whereas downregulated expression was linked to Alzheimer disease, Huntington disease, and Parkinson disease pathways. Genes associated with upregulated SEC24B expression were enriched in complement and coagulation cascades, cytochrome P450 drug metabolism, and retinol metabolism, while those associated with downregulated SEC24B expression were primarily involved in cell cycle regulation and extracellular matrix–receptor interactions. Lastly, genes linked to upregulated SH3YL1 expression were predominantly enriched in extracellular matrix–receptor interactions and TGF-beta signaling, whereas those linked to downregulated SH3YL1 expression were involved in cell cycle regulation and DNA replication. 
These findings suggest that the expression levels of these 5 feature genes are closely associated with the activity of various biological pathways, particularly in immune response, metabolic processes, and cell cycle regulation.
Figure 6.
KEGG pathway-related GSEA results of the 5 novel feature genes in HCC. (A1) Signalling pathways enriched primarily in genes associated with upregulated DNAJC12 expression. (A2) Signalling pathways enriched primarily in genes associated with the downregulated expression of DNAJC12. (B1) Signalling pathways enriched primarily in genes associated with upregulated KBTBD11 expression. (B2) Signalling pathways enriched primarily in genes associated with the downregulated expression of KBTBD11. (C1) Signalling pathways enriched primarily in genes associated with upregulated PLSCR4 expression. (C2) Signalling pathways enriched primarily in genes associated with the downregulated expression of PLSCR4. (D1) Signalling pathways enriched primarily in genes associated with upregulated SEC24B expression. (D2) Signalling pathways enriched primarily in genes associated with the downregulated expression of SEC24B. (E1) Signalling pathways enriched primarily in genes associated with upregulated SH3YL1 expression. (E2) Signalling pathways enriched primarily in genes associated with the downregulated expression of SH3YL1. GSEA = gene set enrichment analysis, HCC = hepatocellular carcinoma, KEGG = Kyoto Encyclopedia of Genes and Genomes.
3.5. Immune cells infiltration in HCC
The immune infiltration landscape of HCC was revealed by CIBERSORT analysis (Fig. 5F). Notably, the abundance of M0 macrophages was significantly increased in the tumor group, while the abundance of M1 macrophages, which have antitumor effects, was markedly decreased (Fig. 5E). This suggests the presence of a pro-tumor immune evasion mechanism within the HCC tumor microenvironment, which may promote the survival and proliferation of HCC cells. Further correlation analysis (Fig. 5G) showed that DNAJC12 was positively correlated with immature B cells and negatively correlated with memory B cells and resting dendritic cells, indicating that it may contribute to the accumulation of immature B cells while inhibiting the activity of memory B cells and resting dendritic cells, thereby weakening the host’s immune surveillance. KBTBD11 was negatively correlated with immature CD4+ T cells, suggesting that its low expression might lead to an increase in immature CD4+ T cells, thereby impairing T cell-mediated antitumor immune responses. SEC24B was positively correlated with monocytes and negatively correlated with M0 macrophages, indicating that it may influence the immunosuppressive state of the HCC tumor microenvironment by regulating monocyte recruitment and macrophage polarization. PLSCR4 and SH3YL1 showed no significant correlation with immune cells. Previous studies have also indicated that the infiltration of B cells, T cells, and dendritic cells is closely associated with the prognosis of HCC patients.[34,35] These findings suggest that DNAJC12, KBTBD11, and SEC24B are closely related to liver immune infiltration and may play critical roles in shaping the immune microenvironment of HCC.
4. Discussion
HCC is one of the most prevalent malignant tumors worldwide, with a high mortality rate largely due to the lack of effective early diagnostic biomarkers, resulting in its frequent detection at advanced stages.[36] While alpha-fetoprotein (AFP) has been a traditional biomarker for HCC diagnosis, its low specificity and susceptibility to false positives limit its clinical utility.[37] Therefore, identifying novel, reliable biomarkers to improve HCC prevention, early diagnosis, and targeted therapy is of critical importance.
In this study, we integrated multiple datasets and analytical approaches to investigate the potential feature genes and their biological significance in HCC. Differential expression analysis initially identified 691 HCC-related DEGs, and by combining this with WGCNA, we ultimately identified 313 shared genes significantly associated with HCC. GO and KEGG enrichment analyses of these genes revealed their involvement in key BPs, including cell cycle regulation and DNA repair. Based on these findings, we employed machine learning algorithms to construct an HCC diagnostic model, which identified 96 feature genes, including 5 novel genes (DNAJC12, KBTBD11, PLSCR4, SEC24B, and SH3YL1) that had not previously been reported in HCC studies. The expression levels of these 5 genes were significantly lower in tumor tissues compared to normal tissues. Immune responses also play a role in the pathogenesis of HCC.[38] Compared to the control group, the abundance of M0 macrophages significantly increased in the tumor group, while M1 macrophages decreased markedly. M0 macrophages are typically in an inactive state, potentially providing a supportive environment for the tumor, whereas M1 macrophages exhibit pro-inflammatory and antitumor effects.[39] This phenomenon suggests the presence of an immune evasion mechanism in HCC, facilitating the survival and proliferation of tumor cells. A prior study has indicated that immune evasion is a critical factor in the development of malignant tumors.[40] Additionally, ROC analysis further supported the potential diagnostic value of the 5 novel genes, suggesting that they may serve as novel protective genes for HCC.
Further support for the functions and mechanisms of these genes was drawn from a literature review. DNAJC12, a molecular chaperone, has been found to be overexpressed in several cancers, such as lung and gastric cancers,[41,42] where it promotes tumor cell growth and survival by activating the AKT pathway, contributing to chemotherapy resistance in breast cancer.[43] In our research, DNAJC12 expression was found to be significantly reduced in HCC. Immune cell infiltration analysis suggests that its low expression may impair immune function, enabling tumor cells to evade immune surveillance and thus promoting HCC progression. These findings indicate that DNAJC12 may have tumor-suppressive properties in HCC, with low expression potentially linked to tumor progression. Similarly, KBTBD11 showed low expression in HCC in this study, supporting its tumor-suppressive role, as indicated in colorectal cancer studies, and highlighting its potential as a therapeutic target in obesity.[44,45] Our findings suggest that decreased KBTBD11 expression may lead to uncontrolled cell proliferation, thereby contributing to HCC progression, further supporting its tumor-suppressive function. PLSCR4, known as a phospholipid scramblase, has been shown to inhibit non-small cell lung cancer development upon upregulation, though it may also activate the PI3K/AKT pathway, promoting lipoma formation.[46–48] In our study, high PLSCR4 expression may offer a protective role in HCC by regulating immune responses, while low expression may lead to cellular metabolic imbalances or increased stress responses, potentially promoting tumor progression. 
SEC24B, a component of the COPII vesicle complex, is known to be upregulated in neurodegenerative diseases, such as Alzheimer disease and multiple system atrophy, and is involved in ferroptosis regulation in neural degeneration.[49–52] It has also been closely linked to the development of endometrial and bladder cancers.[53,54] Our findings indicate that low SEC24B expression in HCC may accelerate cancer progression by impacting cell cycle regulation, suggesting a potential anti-proliferative role in tumors. SH3YL1 has been associated with diabetic nephropathy and is involved in acute kidney injury through Nox4 regulation.[55–58] Additionally, the SH3YL1-Dock4 complex promotes cancer cell migration by regulating Rac1 activity.[59] Our study reveals that low SH3YL1 expression in HCC may enhance tumor cell proliferation, accelerating HCC progression, whereas its high expression might inhibit cancer cell migration and signaling, thus restraining tumor development.
While these results are promising, it is important to emphasize that our study is based entirely on public transcriptomic data. No in vitro or in vivo validation has been performed. As such, our findings remain exploratory. Further functional experiments and validation in independent clinical cohorts are needed to determine the diagnostic and therapeutic relevance of these genes in HCC.
5. Conclusion
In summary, this observational study is the first to systematically explore the potential tumor-suppressive roles of DNAJC12, KBTBD11, PLSCR4, SEC24B, and SH3YL1 in HCC. By combining literature evidence with our data analysis, our results not only validate these genes’ potential functions in cancer but also provide novel insights for early HCC diagnosis and targeted therapy. Although experimental validation is currently lacking, this study provides a valuable foundation for future research. Further functional and clinical studies are needed to confirm their roles and evaluate their potential as diagnostic biomarkers or therapeutic targets in HCC.
Acknowledgments
We thank the GEO database, the TCGA database, and all the participants of this study.
Author contributions
Conceptualization: Jinyue Ma, Jiaxin Yao, Jiyu Pang, Bo Mu.
Formal analysis: Chunyan Zhao.
Funding acquisition: Chunyan Zhao, Bo Mu.
Investigation: Jiaxin Yao, Rendan Zhang, Yongjie Wen.
Methodology: Min Zhang.
Software: Lu Wen.
Supervision: Bo Mu.
Validation: Jiyu Pang.
Writing – original draft: Jinyue Ma.
Writing – review & editing: Jinyue Ma, Bo Mu.
Abbreviations:
- AUC = area under the curve
- BP = biological process
- CC = cellular component
- CCNB1 = cyclin B1
- DEGs = differentially expressed genes
- DNAJC12 = DnaJ heat shock protein family member C12
- Enet = Elastic Net
- GABRE = gamma-aminobutyric acid type A receptor epsilon subunit
- GBM = generalized boosted regression modeling
- GEO = Gene Expression Omnibus
- glmBoost = generalized linear model boosting
- GO = gene ontology
- GS = gene significance
- GSEA = gene set enrichment analysis
- HBV = hepatitis B virus
- HCC = hepatocellular carcinoma
- HCV = hepatitis C virus
- KBTBD11 = kelch repeat and BTB domain containing 11
- KEGG = Kyoto Encyclopedia of Genes and Genomes
- LDA = linear discriminant analysis
- LOOCV = leave-one-out cross-validation
- MAN1C1 = mannosidase alpha class 1C member 1
- MF = molecular function
- MM = module membership
- MSigDB = molecular signatures database
- NAFLD = nonalcoholic fatty liver disease
- PLSCR4 = phospholipid scramblase 4
- plsRglm = partial least squares regression of generalized linear models
- PPI = protein–protein interaction
- PTTG1 = pituitary tumor-transforming gene 1
- RF = Random Forest
- ROC = receiver operating characteristic
- SEC24B = SEC24 homolog B, COPII coat complex component
- SH3YL1 = SH3 domain containing YSC84-like 1
- SNRPE = small nuclear ribonucleoprotein polypeptide E
- ssGSEA = single-sample gene set enrichment analysis
- Stepglm = stepwise generalized linear model
- SVM = support vector machine
- TACE = transarterial chemoembolization
- TCGA = The Cancer Genome Atlas
- TOM = topological overlap matrix
- WGCNA = weighted gene co-expression network analysis
The content of this article is the sole responsibility of the authors and does not necessarily represent the official views of the funding agencies.
This study was supported by grants from the Youth Fund of the National Natural Science Foundation of China (no. 81101733) and the Nanchong City-University Collaborative Projects (nos. 22SXCXTD0002 and 22SXQT0100).
The data were obtained from publicly accessible databases, and no human subjects were involved; therefore, ethical approval was not applicable.
The authors have no conflicts of interest to disclose.
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
How to cite this article: Ma J, Yao J, Zhang M, Zhao C, Pang J, Wen L, Zhang R, Wen Y, Mu B. Identification and validation of feature genes in hepatocellular carcinoma based on bioinformatics and machine learning: An observational study. Medicine 2025;104:43(e45403).
Contributor Information
Jinyue Ma, Email: majinyue@stu.nsmc.edu.cn.
Jiaxin Yao, Email: 172803404@qq.com.
Min Zhang, Email: 413072501@qq.com.
Chunyan Zhao, Email: 13699665433@163.com.
Jiyu Pang, Email: pjy1075606775@163.com.
Lu Wen, Email: 1256345067@qq.com.
Rendan Zhang, Email: 413072501@qq.com.
Yongjie Wen, Email: 1256345067@qq.com.
References
- [1] Pellicoro A, Ramachandran P, Iredale JP, Fallowfield JA. Liver fibrosis and repair: immune regulation of wound healing in a solid organ. Nat Rev Immunol. 2014;14:181–94.
- [2] Bray F, Laversanne M, Sung H, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74:229–63.
- [3] Chen W, Zheng R, Baade PD, et al. Cancer statistics in China, 2015. CA Cancer J Clin. 2016;66:115–32.
- [4] Taniguchi H. Liver cancer 2.0. Int J Mol Sci. 2023;24:17275.
- [5] Singal AG, Kanwal F, Llovet JM. Global trends in hepatocellular carcinoma epidemiology: implications for screening, prevention and therapy. Nat Rev Clin Oncol. 2023;20:864–84.
- [6] Kalantari L, Ghotbabadi ZR, Gholipour A, et al. A state-of-the-art review on the NRF2 in Hepatitis virus-associated liver cancer. Cell Commun Signal. 2023;21:318.
- [7] Jacob R, Prince DS, Kench C, Liu K. Alcohol and its associated liver carcinogenesis. J Gastroenterol Hepatol. 2023;38:1211–7.
- [8] Pugliese N, Alfarone L, Arcari I, et al. Clinical features and management issues of NAFLD-related HCC: what we know so far. Expert Rev Gastroenterol Hepatol. 2023;17:31–43.
- [9] Wang Y, Wang B, Yan S, et al. Type 2 diabetes and gender differences in liver cancer by considering different confounding factors: a meta-analysis of cohort studies. Ann Epidemiol. 2016;26:764–72.
- [10] Hamid AS, Tesfamariam IG, Zhang Y, Zhang ZG. Aflatoxin B1-induced hepatocellular carcinoma in developing countries: geographical distribution, mechanism of action and prevention. Oncol Lett. 2013;5:1087–92.
- [11] Sealock T, Sharma S. Smoking cessation (Archived). In: StatPearls. StatPearls Publishing; 2025.
- [12] Trichopoulos D, Bamia C, Lagiou P, et al. Hepatocellular carcinoma risk factors and disease burden in a European cohort: a nested case-control study. J Natl Cancer Inst. 2011;103:1686–95.
- [13] Shannon AH, Ruff SM, Pawlik TM. Expert insights on current treatments for hepatocellular carcinoma: clinical and molecular approaches and bottlenecks to progress. J Hepatocell Carcinoma. 2022;9:1247–61.
- [14] Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW. Cancer genome landscapes. Science. 2013;339:1546–58.
- [15] MacEachern SJ, Forkert ND. Machine learning for precision medicine. Genome. 2021;64:416–25.
- [16] Lander ES, Linton LM, Birren B, et al.; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
- [17] Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28:882–3.
- [18] Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
- [19] Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 2008;9:559.
- [20] Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015;43:D1049–56.
- [21] Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16:284–7.
- [22] Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
- [23] Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. Using machine learning approaches for multi-omics data analysis: a review. Biotechnol Adv. 2021;49:107739.
- [24] Liu Z, Liu L, Weng S, et al. Machine learning-based integration develops an immune-derived lncRNA signature for improving outcomes in colorectal cancer. Nat Commun. 2022;13:816.
- [25] Qin H, Abulaiti A, Maimaiti A, et al. Integrated machine learning survival framework develops a prognostic model based on inter-crosstalk definition of mitochondrial function and cell death patterns in a large multicenter cohort for lower-grade glioma. J Transl Med. 2023;21:588.
- [26] Subramanian A, Kuehn H, Gould J, Tamayo P, Mesirov JP. GSEA-P: a desktop application for gene set enrichment analysis. Bioinformatics. 2007;23:3251–3.
- [27] Chen B, Khodadoust MS, Liu CL, Newman AM, Alizadeh AA. Profiling tumor infiltrating immune cells with CIBERSORT. Methods Mol Biol. 2018;1711:243–59.
- [28] Cai J, Ji Z, Wu J, et al. Development and validation of a novel endoplasmic reticulum stress-related lncRNA prognostic signature and candidate drugs in breast cancer. Front Genet. 2022;13:949314.
- [29] Tu HC, Hsiao YC, Yang WY, et al. Up-regulation of golgi alpha-mannosidase IA and down-regulation of golgi alpha-mannosidase IC activates unfolded protein response during hepatocarcinogenesis. Hepatol Commun. 2017;1:230–47.
- [30] Zhou Q, Li L, Sha F, et al. PTTG1 reprograms asparagine metabolism to promote hepatocellular carcinoma progression. Cancer Res. 2023;83:2372–86.
- [31] Wang H, Yang C, Li D, Wang R, Li Y, Lv L. Bioinformatics analysis and experimental validation of a novel autophagy-related signature relevant to immune infiltration for recurrence prediction after curative hepatectomy. Aging (Albany NY). 2023;15:2610–30.
- [32] Rong MH, Li JD, Zhong LY, et al. CCNB1 promotes the development of hepatocellular carcinoma by mediating DNA replication in the cell cycle. Exp Biol Med (Maywood). 2022;247:395–408.
- [33] Yang Y, Deng X, Chen X, et al. Landscape of active enhancers developed de novo in cirrhosis and conserved in hepatocellular carcinoma. Am J Cancer Res. 2020;10:3157–78.
- [34] Garnelo M, Tan A, Her Z, et al. Interaction between tumour-infiltrating B cells and T cells controls the progression of hepatocellular carcinoma. Gut. 2017;66:342–51.
- [35] Cai X-Y, Gao Q, Qiu S-J, et al. Dendritic cell infiltration and prognosis of human hepatocellular carcinoma. J Cancer Res Clin Oncol. 2006;132:293–301.
- [36] Bertoletti A, Kennedy PT, Durantel D. HBV infection and HCC: the “dangerous liaisons.” Gut. 2018;67:787–8.
- [37] Song PP, Xia JF, Inagaki Y, et al. Controversies regarding and perspectives on clinical utility of biomarkers in hepatocellular carcinoma. World J Gastroenterol. 2016;22:262–74.
- [38] Chen C, Wang Z, Ding Y, Qin Y. Tumor microenvironment-mediated immune evasion in hepatocellular carcinoma. Front Immunol. 2023;14:1133308.
- [39] Yang Q, Guo N, Zhou Y, Chen J, Wei Q, Han M. The role of tumor-associated macrophages (TAMs) in tumor progression and relevant advance in targeted therapy. Acta Pharm Sin B. 2020;10:2156–70.
- [40] Jhunjhunwala S, Hammer C, Delamarre L. Antigen presentation in cancer: insights into tumour immunogenicity and immune evasion. Nat Rev Cancer. 2021;21:298–312.
- [41] Li Y, Li M, Jin F, Liu J, Chen M, Yin J. DNAJC12 promotes lung cancer growth by regulating the activation of beta-catenin. Int J Mol Med. 2021;47:105.
- [42] Uno Y, Kanda M, Miwa T, et al. Increased expression of DNAJC12 is associated with aggressive phenotype of gastric cancer. Ann Surg Oncol. 2019;26:836–44.
- [43] Shen M, Cao S, Long X, et al. DNAJC12 causes breast cancer chemotherapy resistance by repressing doxorubicin-induced ferroptosis and apoptosis via activation of AKT. Redox Biol. 2024;70:103035.
- [44] Watanabe K, Yoshida K, Iwamoto S. Kbtbd11 gene expression in adipose tissue increases in response to feeding and affects adipocyte differentiation. J Diabetes Investig. 2019;10:925–32.
- [45] Gong J, Tian J, Lou J, et al. A polymorphic MYC response element in KBTBD11 influences colorectal cancer risk, especially in interaction with an MYC-regulated SNP rs6983267. Ann Oncol. 2018;29:632–9.
- [46] Lv L, Li T, Li X, et al. The lncRNA Plscr4 controls cardiac hypertrophy by regulating miR-214. Mol Ther Nucleic Acids. 2018;10:387–97.
- [47] Li Y, Zhao L, Zhao P, Liu Z. Long non-coding RNA LINC00641 suppresses non-small-cell lung cancer by sponging miR-424-5p to upregulate PLSCR4. Cancer Biomark. 2019;26:79–91.
- [48] Barth LAG, Nebe M, Kalwa H, et al. Phospholipid scramblase 4 (PLSCR4) regulates adipocyte differentiation via PIP3-mediated AKT activation. Int J Mol Sci. 2022;23:9787.
- [49] Wendeler MW, Paccaud JP, Hauri HP. Role of Sec24 isoforms in selective export of membrane proteins from the endoplasmic reticulum. EMBO Rep. 2007;8:258–64.
- [50] Sheelakumari R, Kesavadas C, Varghese T, et al. Assessment of iron deposition in the brain in frontotemporal dementia and its correlation with behavioral traits. AJNR Am J Neuroradiol. 2017;38:1953–8.
- [51] Wang Y, Butros SR, Shuai X, et al. Different iron-deposition patterns of multiple system atrophy with predominant parkinsonism and idiopathetic Parkinson diseases demonstrated by phase-corrected susceptibility-weighted imaging. AJNR Am J Neuroradiol. 2012;33:266–73.
- [52] Ryan SK, Zelic M, Han Y, et al. Microglia ferroptosis is regulated by SEC24B and contributes to neurodegeneration. Nat Neurosci. 2023;26:12–26.
- [53] Shi S, Tang X, Liu H. Disulfidptosis-related lncRNA for the establishment of novel prognostic signature and therapeutic response prediction to endometrial cancer. Reprod Sci. 2024;31:811–22.
- [54] Bai Y, Zhang Q, Liu F, Quan J. A novel cuproptosis-related lncRNA signature predicts the prognosis and immune landscape in bladder cancer. Front Immunol. 2022;13:1027449.
- [55] Choi GS, Min HS, Cha JJ, et al. SH3YL1 protein as a novel biomarker for diabetic nephropathy in type 2 diabetes mellitus. Nutr Metab Cardiovasc Dis. 2021;31:498–505.
- [56] Lee SR, Lee HE, Yoo JY, et al. Nox4-SH3YL1 complex is involved in diabetic nephropathy. iScience. 2024;27:108868.
- [57] Han SY, Han SH, Ghee JY, Cha JJ, Kang YS, Cha DR. SH3YL1 protein predicts renal outcomes in patients with type 2 diabetes. Life (Basel). 2023;13:963.
- [58] Yoo JY, Cha DR, Kim B, et al. LPS-induced acute kidney injury is mediated by Nox4-SH3YL1. Cell Rep. 2020;33:108245.
- [59] Kobayashi M, Harada K, Negishi M, Katoh H. Dock4 forms a complex with SH3YL1 and regulates cancer cell migration. Cell Signal. 2014;26:1082–8.






