Abstract
Background
Prostate cancer remains a significant health burden worldwide, largely because of its genetic heterogeneity and the low specificity of established biomarkers. Precise molecular signatures are needed to improve early detection, refine prognosis, and guide personalized treatment.
Methods
We combined four GEO microarray datasets (GSE3325, GSE6919, GSE55945, and GSE26910; n = 179) and applied the same preprocessing steps to each: background correction, log2 transformation, and quantile normalization. Probe identifiers were mapped to HGNC gene symbols. We then used limma to detect differentially expressed genes (DEGs), with diagnosis and batch as covariates, and retained DEGs with |log2FC| > 1 and BH-FDR ≤ 0.05. Next, we propose a novel Graph-Convolutional Feature Selection (GCFS) framework that ranks genes by combining expression data with network topology. Performance was assessed with Hybrid Random Forest and LightGBM classifiers, and independent validation was performed on the GSE46602 dataset (n = 50).
Results
We identified a promising four-gene signature comprising GFUS, ARHGAP8, NBL1, and ACTB, genes implicated in key cancer pathways such as PI3K–Akt, JAK–STAT, and NF-κB. In the discovery set, the Hybrid model performed best, with an AUC of 0.9612, an accuracy of 95.37%, a sensitivity of 94.02%, and a specificity of 95.80%. The other models also performed well (AUC: C5, 0.9257; AdaBoost, 0.9098; SVM, 0.8926; RF, 0.9519; LightGBM, 0.9578), supporting the reliability of the identified genes. In validation on the GSE46602 dataset, the four-gene panel retained good diagnostic capability: the Hybrid model reached an AUC of 0.90 and an accuracy above 91%. The other models, including SVM and AdaBoost, performed similarly, supporting the generalizability of the biomarker panel.
Conclusions
This study presents a reproducible, network-integrated, machine learning-based biomarker discovery framework for prostate cancer. The identified four-gene panel was consistently predictive in both the discovery and validation datasets, which highlights its potential as a clinically useful diagnostic and prognostic tool. To the best of our knowledge, the application of the novel GCFS coupled with ensemble learning based on RF and LightGBM has not previously been reported in any prostate cancer investigation.
Keywords: Prostate cancer, Hybrid RF–lightGBM, Novel graph-convolutional feature selection, Biomarkers, Molecular signature
Introduction
Prostate cancer (PCa) is the predominant malignancy affecting men between the ages of 45 and 60, particularly in developed countries [1–3]. Each year, approximately 80,000 people worldwide die from prostate cancer, and it remains a leading cause of cancer death in men. Prostate cancer is a genetically heterogeneous disease caused by a combination of environmental and hereditary factors [4]. Family history is the most reliable indicator of genetic predisposition. Men with a family history of PCa have a 60% higher risk of developing the disease than men without such a history [5, 6].
Hereditary prostate cancer accounts for about 6% of cases, and affected men carry point mutations in DNA repair genes such as BRCA1, BRCA2, and ATM [7]. The genes most commonly used as biomarkers for prostate cancer include BRCA, HOX, ATM, RNASEL (HPC1, 1q22), MSR1 (8p), and ELAC2/HPC2 (17p11) [7–10].
The early detection of PCa typically relies on routine screening measures, including prostate-specific antigen (PSA) testing, digital rectal examinations (DRE), and pelvic imaging modalities such as magnetic resonance imaging (MRI) [11, 12]. PSA testing is beneficial but has limited specificity; it frequently produces false positives, leading to unnecessary biopsies and overtreatment, which can cause distress for patients [13, 14]. Elevated PSA levels may also signify non-malignant conditions such as prostatitis or benign prostatic hyperplasia (BPH) [14, 15]. This limitation underscores the need for more specific diagnostic markers to enhance early detection and optimize patient management.
Biomarkers are important in diagnosis, staging, assessment of disease severity, and evaluation of treatment efficacy [16]. New biomarkers are urgently needed to improve staging accuracy and early diagnosis in prostate cancer. Relevant studies have developed the Prostate Health Index, TMPRSS2-ERG fusion gene tests, 4K tests, and PCA3 tests in an effort to improve on the specificity and sensitivity of PSA, thereby avoiding unnecessary biopsies [17].
This study identifies the GFUS, ARHGAP8, NBL1, and ACTB genes as common biomarker candidates for the first time in the literature by analyzing prostate cancer-related gene expression data with both classical bioinformatics methods (DEG, GO, KEGG, PPI) and advanced machine learning algorithms (AdaBoost, Random Forest (RF), Support Vector Machine (SVM), Decision Trees (C5), Light Gradient Boosting Machine (LightGBM), and a consensus-based Hybrid Model). This holistic approach goes beyond genetic profiling alone: it provides a modeling framework whose models, with high predictive accuracy (AUC), can contribute to clinical decision support systems. The innovative aspect of the study is that genes such as GFUS, which have rarely been studied in the prostate cancer literature, stand out on the basis of both biological networks and classification performance.
To the best of our knowledge, this is the first study to apply Graph-Convolutional Feature Selection (GCFS) to prostate cancer gene expression data. Although GCFS has shown promise in integrating network topology into feature prioritization, no prior research has employed this method specifically in the context of prostate cancer, a disease characterized by significant genetic heterogeneity. Therefore, our work not only identifies novel biomarker candidates but also provides a paradigm-shifting application of GCFS in oncogenomics.
The aim of this study is therefore to analyze publicly available datasets to improve the molecular understanding of prostate cancer. The study assesses the applicability of artificial intelligence to the discovery of new genes related to prostate cancer. Using machine learning algorithms, this work aims to improve diagnostic performance and provide new information about the genetic causes of prostate cancer. The findings will also shed light on the extent to which clinicians can rely on AI for early detection of disease, for minimizing or avoiding overtreatment of patients, and for developing individualized treatment plans.
Prostate cancer is shaped by complex, interconnected regulatory systems in which genes rarely act alone; they communicate and interact through organized signaling pathways and protein–protein interactions. The most significant dysregulation patterns appear at the pathway level, and conventional feature-selection methods, which evaluate genes by their individual expression levels, generally miss them. Graph-convolutional feature selection propagates disease-related signals through the molecular network, taking into account both node properties (gene expression profiles) and edge features (interaction strengths and neighborhood structure). This matters for prostate cancer because coordinated disruptions among groups of interacting genes, rather than single-gene mutations alone, alter androgen-regulated pathways, cytoskeletal structures, and glycosylation networks. By placing each gene in its biological context, graph-convolutional methods offer a more accurate and mechanism-oriented route to biomarker discovery that reflects the network-level structure of prostate carcinogenesis.
This study makes the following contributions to prostate cancer research:
The study integrates GCFS with conventional machine-learning algorithms in a unified framework, enabling biologically informed feature selection followed by robust biomarker classification.
The innovation lies not in any individual method but in the combined application of graph-convolutional feature selection with ensemble learning models on harmonized, cross-platform prostate cancer datasets.
This integrated workflow led to the identification of GFUS, ARHGAP8, NBL1, and ACTB as a coherent 4-gene biomarker panel, representing the novel contribution of the study.
Pathway and functional analyses were used to contextualize these genes within key oncogenic processes (e.g., PI3K-Akt, JAK-STAT, NF-κB), ensuring that network-derived biomarkers are interpreted within biologically relevant pathways.
Prognostic relevance was evaluated using Kaplan–Meier survival analysis.
The final biomarker panel was validated using an independent dataset (GSE46602), demonstrating methodological coherence and generalizability rather than novelty of individual components.
Materials and methods
Materials
Four prostate cancer gene expression microarray datasets (GSE3325, GSE6919, GSE55945, and GSE26910) were downloaded from the Gene Expression Omnibus (GEO) for our analysis. They were chosen because they contain gene expression profiles of both cancerous and non-cancerous prostate tissue. An independent dataset (GSE46602) was used to validate our results. We also examined patient variables such as age (65). Because the datasets were generated on different microarray platforms (GPL-570 and GPL-93), which may differ in probe design, labeling method, or scanner, we normalized the raw data to reduce technical variation. First, background correction and log2 transformation (where necessary) were applied to the raw signals. Quantile normalization was then performed on the expression profiles of all samples. To further mitigate batch effects and platform-specific signals, we applied the ComBat algorithm from the ‘sva’ R package. This empirical Bayes approach reduces systematic variation among studies and platforms and rescales the expression values so that biological differences can be compared more reliably. Omitting these preprocessing steps can hide true biological signals and inflate false-positive discoveries. The resulting merged data formed a consistent, standardized expression matrix suitable for the downstream analyses, including differential expression, feature extraction, and biomarker detection. Probe IDs were converted to standard gene symbols. When multiple probes mapped to the same gene, we retained the probe with the highest variance in expression.
Missing values were estimated using a k-nearest neighbors (k = 10) imputation method, and any possible outlier samples were identified based on their standardized Z-scores (|Z| > 3) and then removed, as outlined in earlier publications [13–16]. The datasets were then merged by retaining the intersection of common gene symbols across studies using R’s intersect() function. To account for study and platform effects, the batch factor (dataset/platform) was included as a covariate in the limma design matrix. Differential expression was assessed with limma using |log2FC| > 1 and Benjamini–Hochberg FDR ≤ 0.05. All preprocessing steps (normalization, imputation, outlier removal, and feature selection) were conducted exclusively within the training folds of nested cross-validation and then projected to validation/test folds and the external dataset to prevent information leakage [18–21]. In total, these transcriptome datasets comprised 19, 127, 21, and 12 samples, respectively; a detailed distribution is presented in Table 1. All data are publicly available through the GEO database, and because no new human or animal experiments were conducted, ethics committee approval was not required. Bioinformatics analyses were performed using the R programming language. For genes represented by multiple probes, duplicate probe sets were collapsed by selecting the probe with the highest variance across samples. This prevented redundancy in the graph structure and ensured a single representative expression vector per gene.
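The normalization and probe-collapsing steps described above can be illustrated with a minimal Python sketch. It is not the authors' R pipeline: the toy intensity values, the function names `quantile_normalize` and `collapse_probes`, and the arbitrary handling of tied ranks are assumptions made for illustration only, and background correction, imputation, and ComBat are omitted.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize a probes x samples matrix so that every sample
    (column) ends up with the same empirical distribution.
    Ties are broken arbitrarily by the sort order (a simplification)."""
    order = np.argsort(expr, axis=0)           # per-sample sort order
    reference = np.sort(expr, axis=0).mean(axis=1)  # mean value at each rank
    normalized = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        normalized[order[:, j], j] = reference  # rank r in sample j -> reference[r]
    return normalized

def collapse_probes(expr, gene_symbols):
    """For each gene, keep the probe with the highest variance across samples."""
    variances = expr.var(axis=1)
    best = {}
    for idx, gene in enumerate(gene_symbols):
        if gene not in best or variances[idx] > variances[best[gene]]:
            best[gene] = idx
    genes = sorted(best)
    return genes, expr[[best[g] for g in genes], :]

# Hypothetical raw intensities: 4 probes x 3 samples, two probes map to ACTB.
raw = np.array([[ 32.0, 64.0, 128.0],
                [  8.0, 16.0,   4.0],
                [256.0,  4.0, 256.0],
                [ 16.0,  8.0,  64.0]])
symbols = ["ACTB", "NBL1", "ACTB", "GFUS"]

log_expr = np.log2(raw)                      # log2 transformation
norm_expr = quantile_normalize(log_expr)     # quantile normalization
genes, gene_expr = collapse_probes(norm_expr, symbols)
```

After these steps each gene symbol is represented by a single row, which is the form the graph-based steps later in the Methods require.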
Table 1.
The GEO datasets and sampling information
Methods
Differentially expressed genes (DEGs)
Identifying differentially expressed genes (DEGs) is a key step in studying disease mechanisms and searching for biomarkers. DEGs are genes whose expression differs depending on the presence of certain factors [22]. DEG analysis requires pooling gene expression data generated by different methods, such as microarrays. The raw data are collected first and then passed through a series of preprocessing steps, of which quality filtering, alignment, and normalization are the most important for ensuring the consistency and reproducibility of subsequent analyses. Tools such as Limma are specifically designed for microarray data analysis, and multiple-testing procedures (e.g., p-value adjustment with FDR correction) are applied alongside them to reduce false positives [23]. Finally, genes are classified as up- or down-regulated, which contributes to understanding their involvement in different biological processes. Genes with log2FC > 1 and FDR ≤ 0.05 were classified as up-regulated, while genes with log2FC < −1 and FDR ≤ 0.05 were classified as down-regulated. All p-values are BH-FDR corrected (α = 0.05 unless otherwise indicated); “nominal” results are those with uncorrected p < 0.05 that did not survive FDR correction.
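The thresholding rule above can be sketched in a few lines of Python. This is an illustrative stand-in for the limma output, not the authors' code: the function names and the example p-values and fold changes are hypothetical, and only the Benjamini–Hochberg adjustment and the |log2FC| > 1, FDR ≤ 0.05 cut-offs are taken from the text.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def call_degs(log2fc, pvalues, lfc_cut=1.0, fdr_cut=0.05):
    """Label each gene 'up', 'down', or 'ns' (not significant)."""
    fdr = benjamini_hochberg(pvalues)
    calls = []
    for lfc, q in zip(log2fc, fdr):
        if q <= fdr_cut and lfc > lfc_cut:
            calls.append("up")
        elif q <= fdr_cut and lfc < -lfc_cut:
            calls.append("down")
        else:
            calls.append("ns")
    return calls, fdr

# Hypothetical per-gene statistics.
log2fc = [2.1, -1.5, 0.4, 1.8]
pvalues = [0.001, 0.04, 0.03, 0.5]
calls, fdr = call_degs(log2fc, pvalues)
```

In this toy input the second gene has a nominal p-value below 0.05 and a large fold change, yet its adjusted value exceeds 0.05, so it is reported as not significant; this is the "nominal" case the paragraph describes.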
Functional analysis of DEGs
Gene enrichment analysis was performed using the WebGestalt and KEGG platforms. These platforms use the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases for functional classification. To control false discoveries, p-values were adjusted with the Benjamini–Hochberg method [24], with an FDR threshold of ≤ 0.05 [25, 26].
Protein–protein interaction (PPI) network analysis
Protein–protein interactions (PPI) among the DEGs were queried using STRING (v11.5); edges with a confidence score > 0.4 were retained to balance coverage and reliability for exploratory analysis [27]. This threshold corresponds to medium confidence in STRING’s classification system and is commonly used in the literature because it balances sensitivity and specificity. It allows the inclusion of meaningful interactions while minimizing noise from low-confidence connections.
Kaplan–Meier plotter
To evaluate the impact of the key genes on the survival of prostate cancer patients over time, we performed a survival analysis with the KM Plotter tool [24]. This tool links gene expression data to patient outcomes from the TCGA-PRAD study, making the relationship between expression level and survival straightforward to inspect. Patients were divided into high-expression and low-expression groups at the median expression value. Kaplan–Meier curves were then plotted with time in days on the x-axis and survival probability on the y-axis. In the graphs, orange lines represent patients with higher expression and black lines represent those with lower expression. A log-rank test was then used to determine whether the difference between the two curves was significant, that is, whether high or low expression of these genes is associated with patient survival.
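The median split and the Kaplan–Meier estimate described above can be illustrated with a short sketch. The analysis in the paper was run in the KM Plotter web tool, so the code below is only a hedged illustration of the underlying estimator; the function names and the toy follow-up data are hypothetical, and the log-rank test is not implemented here.

```python
from statistics import median

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: returns [(time, survival_probability)] at each
    distinct time with at least one event.
    events[i] is 1 for an observed event and 0 for a censored observation."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve

def split_by_median(expression):
    """Assign each patient to the 'high' or 'low' expression group."""
    cut = median(expression)
    return ["high" if x > cut else "low" for x in expression]

# Hypothetical follow-up times (days), event indicators, and expression values.
times = [5, 8, 12, 12, 20, 25]
events = [1, 0, 1, 1, 0, 1]
expression = [2.0, 7.5, 3.1, 9.0, 4.4, 8.2]

overall_curve = kaplan_meier(times, events)
groups = split_by_median(expression)
high_curve = kaplan_meier(
    [t for t, g in zip(times, groups) if g == "high"],
    [e for e, g in zip(events, groups) if g == "high"],
)
```

Plotting `high_curve` against the corresponding low-expression curve gives the two step functions that the log-rank test compares.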
Machine learning algorithms
In our research, we used different machine learning approaches to identify prostate cancer samples based on their gene activity patterns. These were AdaBoost [28, 29], Random Forest (RF) [30], Light Gradient Boosting Machine (LightGBM), Support Vector Machine (SVM) [31, 32], and the C5.0 decision tree. We specifically selected these models due to their effectiveness with complex biological data, particularly microarray data [28, 30, 31, 33–38].
Before training the models, we used Graph-Convolutional Feature Selection (GCFS), among other approaches, to identify a subset of genes. We tuned the models with 10-fold cross-validation and evaluated them with standard metrics: accuracy, sensitivity (recall), specificity, F1-score, and area under the ROC curve (AUC). Reporting this set of metrics helps characterize model performance when the class distribution is unbalanced. To quantify the uncertainty of our results, we calculated 95% confidence intervals for the AUCs by bootstrapping.
Bootstrapped intervals indicate how much confidence can be placed in the AUC estimates and give a more robust picture of model performance. In addition, we used DeLong’s test to formally compare the AUC of our integrated GCFS–RF–LightGBM framework with those of simpler approaches (RF, LightGBM, and SVM). The integrated framework showed a significant improvement in AUC, supporting its superior ability to separate the classes and to generalize to new data.
Microarray datasets usually contain many redundant or noisy features that reduce model performance. Tree-based methods such as AdaBoost, RF, and LightGBM are well suited to such data because they tolerate noise while capturing intricate relationships. LightGBM was chosen for its speed and its ability to handle high-dimensional feature spaces, whereas RF served as a baseline because of its stable generalization. SVM was included because it separates complex data well by means of optimized decision boundaries. The C5.0 algorithm provided an interpretable model for comparison, exposing decision rules based on a small number of features.
To improve prediction accuracy, we added an ensemble model that combined the predictions of the RF and LightGBM models through a logistic-regression meta-classifier. This ensemble took advantage of the complementary strengths of the two base models and achieved the highest accuracy and AUC, as shown in the Results section, indicating its potential as a dependable tool for biomarker-based prediction of prostate cancer.
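The bootstrapped 95% confidence interval for the AUC mentioned above can be sketched as follows. This is a minimal percentile-bootstrap illustration, assuming resampling of samples with replacement; the labels, scores, number of resamples, and function names are hypothetical, and DeLong's test is not shown.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # a resample needs both classes to define an AUC
            stats.append(auc(y, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical tumor (1) / normal (0) labels and predicted tumor probabilities.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90, 0.20, 0.65]

point_estimate = auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
```

With so few samples the interval is wide, which is the behavior the confidence interval is meant to expose.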
Hybrid model
Our methodology integrates Random Forest and LightGBM in an ensemble learning framework. Random Forest, itself an ensemble technique, constructs multiple decision trees from bootstrap samples of the data and aggregates their predictions by averaging (for regression) or majority voting (for classification). The aggregated votes form the prediction, which is generally reliable. The main advantage of this method is its ability to handle noisy and unbalanced datasets effectively. LightGBM, in contrast, builds trees sequentially through gradient boosting and grows them leaf-wise. It can therefore discover complex data patterns more quickly, but it may sometimes overfit less important parts of the data. We first train both algorithms separately on the same dataset. We then fit a logistic regression on their predictions to obtain the final result. This two-pronged strategy copes with different data conditions, and our findings hold in both internal testing and testing on new external data (Table 2).
Table 2.
Default parameters of machine learning algorithms in Python
| Algorithm | Default Parameters |
|---|---|
| AdaBoost | n_estimators = 50 (number of weak learners to train); learning_rate = 1.0 (contribution of each weak learner); base_estimator = DecisionTreeClassifier; algorithm = “SAMME” or “SAMME.R” |
| Decision Tree | criterion = “gini” (split quality function; alternative: “entropy”); max_depth = None (unlimited depth by default); min_samples_split = 2; min_samples_leaf = 1; max_features = None; splitter = “best” |
| Random Forest (RF) | ntree = 500 (number of trees); mtry = sqrt(p) (number of variables sampled at each split); Other: min.node.size, importance, etc. |
| Support Vector Machine (SVM) | kernel = “radial” (RBF kernel); cost = 1 (regularization parameter); gamma = 1 / num_features (kernel coefficient) |
| Hybrid Model | Base learners: model_1 = Random Forest (ntree = 500, mtry = sqrt(p)); model_2 = LightGBM (num_leaves = 31, learning_rate = 0.1, n_estimators = 100). Meta learner: model = Logistic Regression or Linear Model. Stacking method: k-fold cross-validation for out-of-fold predictions. Combination strategy: average or weighted average of base models |
The hybrid classifier was implemented as a two-level stacking structure. In the first level, Random Forest and LightGBM were trained separately on the identical training dataset. For every sample, each model produced a probability estimate of belonging to the tumor class. Concatenating these two probabilities gave a 2-dimensional vector of meta-features. In the second level, logistic regression was fitted on these meta-features to find the best linear combination of the two base-model predictions. The output of the logistic regression was the final prediction for each sample. Through this stacking method, the hybrid model exploited the complementary strengths of Random Forest and LightGBM.
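The second level of the stacking structure can be sketched in isolation. The example below assumes the two base models have already produced out-of-fold tumor probabilities (the arrays `p_rf` and `p_lgbm` are hypothetical values, not results from the study) and fits the logistic-regression meta-learner with plain gradient descent in NumPy; the learning rate and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def fit_logistic(meta, y, lr=0.5, n_iter=5000):
    """Fit logistic regression by gradient descent; returns (weights, bias)."""
    w = np.zeros(meta.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(meta @ w + b)))
        grad = p - y                       # gradient of the log-loss w.r.t. the logit
        w -= lr * meta.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(meta, w, b):
    """Tumor probability from the meta-learner."""
    return 1.0 / (1.0 + np.exp(-(meta @ w + b)))

# Hypothetical out-of-fold tumor probabilities from the two base learners.
p_rf = np.array([0.10, 0.35, 0.20, 0.80, 0.65, 0.90, 0.45, 0.70])
p_lgbm = np.array([0.05, 0.55, 0.30, 0.95, 0.40, 0.85, 0.30, 0.60])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])    # 1 = tumor, 0 = normal

meta = np.column_stack([p_rf, p_lgbm])    # one 2-dimensional meta-feature vector per sample
w, b = fit_logistic(meta, y)
final_prob = predict_proba(meta, w, b)
final_label = (final_prob >= 0.5).astype(int)
```

The learned weights `w` show how much the meta-learner relies on each base model. In practice the meta-features would come from out-of-fold predictions, as Table 2 indicates, so that the meta-learner is not fitted on probabilities the base models produced for their own training samples.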
Graph-convolutional feature selection (GCFS) for gene prioritization
The GCFS approach offers a highly innovative technique for selecting genes. This method differs from the previous ones that regarded genes as separate entities. It relies on the relationships among genes and the configuration of biological networks. Traditional techniques typically assess single genes for analysis, whereas GCFS examines the interactions and pathways that the genes participate in. This is what renders the approach highly beneficial for layered datasets, such as gene expression. The primary goal is to create a ranked list of genes based on the individual activities of each gene and their roles within the network.
The first step transforms the gene expression data with a Graph Convolutional Network (GCN). When updating a gene’s attributes, the GCN considers not only that gene’s own expression data but also aggregates information from its neighboring genes. This approach uncovers the structural features of the genetic data. A gene’s own expression level is not enough to characterize it, because the expression of connected genes also carries information about it. Genes are therefore treated as connected components rather than isolated entities, which allows the natural relationships of the biological network to be captured. GCFS applies one graph convolution step to compute network-refined gene embeddings. The propagation rule for this layer is built up in Eqs. (1)–(7).
(A) Graph Construction and Preprocessing.
A gene interaction graph, Eq. (1),
$$G = (V, E) \tag{1}$$
was constructed using STRING protein–protein interaction data. Each node $v_i \in V$ corresponds to a gene, and each edge $(v_i, v_j) \in E$ encodes empirical co-expression or protein-level association.
To preserve each gene’s intrinsic contribution and ensure numerical stability during graph propagation, self-loops were added to all nodes, Eq. (2):
$$\tilde{A} = A + I \tag{2}$$
where $A$ is the adjacency matrix and $I$ is an identity matrix representing self-connections. Self-loops were included to preserve each gene’s intrinsic signal during convolution, preventing loss of node-specific information when aggregating neighbor features.
Node attributes were defined as the standardized expression matrix, Eq. (3):
$$X \in \mathbb{R}^{n \times p} \tag{3}$$
where rows represent the $n$ samples and columns represent the $p$ genes. When multiple probes mapped to the same gene, the probe with the highest variance was retained.
To balance contributions from genes with different connectivity levels, the adjacency matrix was symmetrically normalized, Eq. (4):
$$\hat{A} = \tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2} \tag{4}$$
where $\tilde{A}$ is the adjacency matrix with self-loops and the diagonal degree matrix $\tilde{D}$ is computed as Eq. (5):
$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij} \tag{5}$$
This symmetric degree normalization prevents high-degree hub genes from dominating the propagation process, enabling balanced information flow across genes with heterogeneous connectivity.
Normalization ensures that highly connected genes do not dominate feature propagation, and allows the model to stably aggregate neighborhood information. The adjacency matrix was constructed using STRING v11.5 protein–protein interaction data. For each pair of genes, an undirected edge was added if the interaction confidence score exceeded 0.4. Only interactions among genes present in the merged expression matrix were retained to ensure compatibility between transcriptomic features and the structural graph. The resulting matrix was used as the basis for subsequent GCN-based feature propagation.
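The graph construction and normalization steps of part (A) can be sketched as follows, assuming the standard GCN treatment of self-loops and symmetric degree normalization. The edge list, its confidence scores, and the function name are hypothetical; only the 0.4 confidence threshold and the restriction to genes present in the expression matrix are taken from the text.

```python
import numpy as np

def build_normalized_adjacency(genes, scored_edges, min_score=0.4):
    """Build an undirected adjacency matrix from scored gene pairs, add
    self-loops, and apply symmetric degree normalization."""
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    A = np.zeros((n, n))
    for a, b, score in scored_edges:
        # Keep only confident edges between genes present in the expression matrix.
        if score > min_score and a in index and b in index and a != b:
            A[index[a], index[b]] = 1.0
            A[index[b], index[a]] = 1.0
    A_tilde = A + np.eye(n)                  # self-loops
    degree = A_tilde.sum(axis=1)             # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

genes = ["GFUS", "ARHGAP8", "NBL1", "ACTB"]
# Hypothetical STRING-style edges: (gene_a, gene_b, confidence score).
scored_edges = [
    ("ACTB", "NBL1", 0.72),
    ("ACTB", "GFUS", 0.45),
    ("GFUS", "ARHGAP8", 0.30),   # dropped: below the 0.4 threshold
    ("ACTB", "GENE_X", 0.90),    # dropped: GENE_X is not in the expression matrix
]
A_hat = build_normalized_adjacency(genes, scored_edges)
```

Because every node has a self-loop, no degree is zero and the inverse square root is always defined, which is one practical reason for adding self-loops before normalizing.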
(B) Graph Convolution and Node Embedding.
GCFS employs a single-layer Graph Convolutional Network (GCN) encoder to derive network-refined gene representations. For each gene $i$, the updated feature vector is row $i$ of the embedding matrix computed as Eq. (6):
$$H = \sigma\!\left(\hat{A}\, X^{\top} W\right) \tag{6}$$
where $\hat{A}$ is the symmetrically normalized adjacency matrix with self-loops and:
$X^{\top}$: input gene expression matrix, arranged with one row per gene.
$W$: trainable weight matrix controlling feature transformation.
$\sigma$: element-wise ReLU activation.
$H$: embedding matrix containing refined representations after neighborhood propagation.
This equation expresses that the final feature vector of each gene is updated by gathering information not only from its own expression but also from its neighbors in the network.
In this process, a gene incorporates information from neighboring genes, leading to a more holistic representation. This strategy is particularly powerful in genomic datasets because genes do not function independently; they are parts of an intricate, interconnected system of interactions that facilitates their biological roles.
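Detailed meaning of the convolution step.
For each gene $i$, Eq. (7):
$$h_i = \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i\,\tilde{d}_j}}\; x_j W\right) \tag{7}$$
where $\mathcal{N}(i)$ is the set of direct neighbors of gene $i$, $\tilde{d}_i$ is the degree of gene $i$ including its self-loop, and $x_j$ is the expression profile of gene $j$ across samples.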
This formulation means:
Each gene’s representation is a weighted average of signals from its direct neighbors, scaled by connectivity.
Self-loops ensure the gene’s own expression always influences its final embedding.
ReLU removes noise and amplifies biologically meaningful activations.
Thus, the embedding captures coordinated dysregulation across biological networks, a dynamic that classical differential expression methods cannot capture.
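The convolution step can be made concrete with a small numerical sketch. It assumes the standard single-layer GCN form (normalized adjacency with self-loops, a linear transform, then ReLU) and uses an identity weight matrix so the effect of propagation is easy to follow; in the method the weights are trainable, and the toy graph and expression values below are hypothetical.

```python
import numpy as np

def gcn_layer(A, X_genes, W):
    """One graph-convolution step on a genes x features matrix:
    ReLU(A_hat @ X_genes @ W), with A_hat the symmetrically normalized
    adjacency matrix including self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ X_genes @ W, 0.0)

# Toy graph: genes 0 and 1 interact; gene 2 is isolated.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
# Standardized expression with one row per gene (3 genes x 2 samples).
X_genes = np.array([[ 1.0, -1.0],
                    [ 3.0,  1.0],
                    [-2.0,  2.0]])
W = np.eye(2)   # identity weights for illustration only

H = gcn_layer(A, X_genes, W)
```

Genes 0 and 1 end up with the same embedding because each averages its own profile with its neighbor's, while the isolated gene 2 keeps only its own signal, with negative values clipped by ReLU.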
(C) Feature Scoring and Optimization Objective.
GCFS assigns an importance score to each gene by integrating three complementary components:
Statistical discriminative power (tumor vs. normal).
Topological centrality within the interaction network.
GCN-derived embedding strength.
The feature selection objective is defined as Eq. (8):
$$S_i = \alpha\, T_i + \beta\, E_i + \gamma\, C_i \tag{8}$$
where:
$T_i$: t-statistic or effect-size based contrast of gene $i$.
$E_i$: embedding magnitude capturing network-propagated relevance.
$C_i$: degree or betweenness centrality in STRING graph.
$\alpha, \beta, \gamma$: fixed weights (0.4, 0.4, 0.2, respectively).
A gene is important not only if it changes significantly between groups, but also if it sits at a structurally influential position and retains high embedding activation after graph propagation.
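The weighted combination of the three components can be sketched as below. The weights (0.4, 0.4, 0.2) come from the text; everything else is an assumption made for illustration: the use of the absolute t-statistic, the L2 norm of each embedding row as the embedding magnitude, and min-max scaling to put the three components on a common range before weighting. The source does not state how the components are scaled.

```python
import numpy as np

def min_max(values):
    """Scale a vector to [0, 1]; a constant vector maps to zeros."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span

def gcfs_scores(t_stats, H, centrality, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of statistical contrast, embedding magnitude, and
    network centrality for each gene (scaling choices are assumptions)."""
    w_stat, w_emb, w_cen = weights
    stat = min_max(np.abs(t_stats))              # discriminative power
    emb = min_max(np.linalg.norm(H, axis=1))     # embedding magnitude per gene
    cen = min_max(centrality)                    # e.g. node degree
    return w_stat * stat + w_emb * emb + w_cen * cen

# Hypothetical inputs for three genes.
t_stats = np.array([4.0, -2.0, 0.0])
H = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 5.0]])
degree = np.array([2, 1, 0])

scores = gcfs_scores(t_stats, H, degree)
ranking = np.argsort(-scores)   # gene indices from most to least important
```

In this toy case the third gene has no statistical contrast and no neighbors, yet it outranks the second gene because of its embedding magnitude, which shows how the network-derived term can change the ordering.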
(D) Stability Selection and Permutation-Based Significance.
To minimize spurious feature contributions and mitigate dataset heterogeneity, GCFS was combined with permutation-based stability selection:
Class labels were randomly permuted 1,000 times.
GCFS scores were recomputed for each permutation, generating a null score distribution for every gene.
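A gene was considered stable if Eq. (9):
$$S_i^{\text{obs}} > \max_{b = 1, \dots, 1000} S_i^{(b)} \tag{9}$$
where $S_i^{\text{obs}}$ is the GCFS score of gene $i$ under the true class labels and $S_i^{(b)}$ is its score under the $b$-th label permutation.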
This conservative criterion ensures that selected genes exhibit real biological signal beyond random chance and platform-dependent noise.
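The permutation-based criterion can be sketched as follows. To keep the example short, the absolute difference in group means stands in for the full GCFS score; that substitution, the function names, and the toy data are assumptions for illustration. The 1,000 permutations and the rule that the observed score must exceed the maximum of the null scores follow the text.

```python
import numpy as np

def stable_genes(X, labels, score_fn, n_perm=1000, seed=0):
    """Flag genes whose observed score exceeds every score obtained
    under randomly permuted class labels."""
    rng = np.random.default_rng(seed)
    observed = score_fn(X, labels)
    null_max = np.full(observed.shape, -np.inf)
    for _ in range(n_perm):
        permuted = rng.permutation(labels)
        null_max = np.maximum(null_max, score_fn(X, permuted))
    return observed > null_max, observed, null_max

def mean_difference(X, labels):
    """Stand-in score: absolute tumor-vs-normal difference in mean expression."""
    return np.abs(X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0))

# Toy data: 30 samples (15 normal, 15 tumor) x 2 genes.
labels = np.array([0] * 15 + [1] * 15)
sample_idx = np.arange(30)
gene_signal = labels * 10.0 + (sample_idx % 3) * 0.1   # strongly class-dependent
gene_noise = (sample_idx % 15).astype(float)            # same values in both classes
X = np.column_stack([gene_signal, gene_noise])

is_stable, observed, null_max = stable_genes(X, labels, mean_difference)
```

The criterion is strict: a single permutation that happens to score as high as the observed labels is enough to reject a gene, so it favors genes with large, consistent class differences.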
The critical step in GCFS is determining which genes are truly important. This selection is not solely based on a gene’s expression level; it also considers the gene’s position within the network structure. After the GCN has weighted the gene data according to network relationships, the important genes are evaluated from two angles: first, their individual expression patterns, and second, their positions within the biological network in terms of connectivity and the pathways in which they are involved. The combination of these two perspectives identifies the truly informative genes.
The critical step in GCFS is determining which genes carry true biological relevance rather than spurious statistical signal. This process relies on a sequence of mathematically principled operations that integrate raw expression profiles with the structural properties of the gene interaction network. The foundational GCN formulation (Eqs. 1–7) propagates information across the graph by combining each gene’s expression vector with contributions from its neighboring genes using normalized adjacency weights. The normalization step (Eqs. 4–5) ensures that nodes with high connectivity do not disproportionately dominate the propagation, while the self-loop augmentation (Eq. 2) preserves the gene’s intrinsic signal during convolution. The resulting embedding matrix $H$ therefore represents a biologically contextualized expression signature in which each gene’s updated representation encodes both its individual expression and the coordinated behavior of its neighborhood within the STRING-derived network.
The GCFS scoring framework (Eq. 8) unifies these propagated representations with traditional statistical and network-theoretic criteria. The statistical component quantifies how strongly a gene differentiates tumor and normal samples based on metrics such as t-statistics, p-values, or effect sizes, while the topological component captures its influence within the interaction network through centrality measures like degree or betweenness. The GCN-derived term further incorporates the magnitude of the node embedding, reflecting how prominently a gene participates in network-propagated dysregulation patterns. By integrating these dimensions into a weighted multi-objective relevance function, GCFS prioritizes genes that are not only differentially expressed but also central to the biological circuitry of prostate cancer.
Lastly, the permutation-based stability criterion (Eq. 9) filters out features caused by platform noise or random fluctuations by requiring that selected genes score higher than every score obtained under randomized class labels. Together, these mathematical elements form a unified system in which each equation contributes to a comprehensive, multi-dimensional assessment. This helps ensure that the genes chosen by GCFS are statistically significant, structurally influential in molecular networks, and robust to random variability, yielding biomarkers that are functionally relevant and diagnostically useful in prostate cancer biology. A detailed pseudo-code description of the method is given in Algorithm 1.
Algorithm 1.
Novel Graph-convolutional feature selection (GCFS)
Algorithm 1 provides the pseudo-code representation of the GCFS procedure used in this study.
Although attention-based graph models such as Graph Attention Networks (GAT) represent a methodological advancement beyond classical GCNs, preliminary evaluations showed that these architectures were not well-suited for the characteristics of our datasets. The combination of high-dimensional gene-expression matrices and relatively small sample sizes led attention mechanisms to overfit, resulting in unstable feature rankings and high fold-to-fold variance. In contrast, the simplified single-layer GCN propagation used in GCFS provided more stable and reproducible embeddings. Therefore, GCFS was selected as the primary graph-based feature selection strategy, and attention-based variants were not pursued further in the final workflow.
This study employed a novel hybrid machine learning framework to analyze prostate cancer gene expression profiles. Feature vectors were constructed by extracting the GCFS-selected gene expression values for each sample, forming an input matrix of n samples × p selected genes. Class labels were obtained directly from GEO annotations, assigning each sample to either the tumor or the normal group. The proposed approach integrates Graph-Convolutional Feature Selection (GCFS) with two ensemble algorithms, Random Forest (RF) and Light Gradient Boosting Machine (LightGBM), specifically to refine feature extraction and enhance the detection of discriminatory genes. To our knowledge, this is among the first implementations in which GCFS is embedded into a hybrid RF–LightGBM architecture for prostate cancer classification, enabling both network-informed feature selection and robust ensemble learning.
The analytical pipeline first generated feature-importance profiles from multiple classifiers. The fifteen top-ranked genes from each model were cross-referenced to identify those appearing consistently across algorithms. This consensus-based strategy ensured that gene selection did not depend on a single model and improved reproducibility. To assess the stability of the selected genes, a permutation analysis was conducted in which sample labels were randomly shuffled to form a null distribution. The importance scores of GCFS-derived genes were then compared against this distribution, and genes whose scores exceeded the maximum null-distribution value were designated as stable and statistically meaningful. Through this integrated framework, which combines GCFS, hybrid RF–LightGBM learning, and stability validation, we identified a four-gene signature that is both reliable and biologically relevant for prostate cancer. This ensemble approach draws on the strengths of multiple algorithms and provides consistent and reproducible results.
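The cross-referencing of top-ranked genes across classifiers can be sketched in a few lines. The function name `consensus_genes`, the `min_models` parameter, and the placeholder gene labels are hypothetical; the sketch only illustrates the idea of retaining genes that recur in the top-k lists of several models.

```python
from collections import Counter

def consensus_genes(rankings, top_k=15, min_models=None):
    """Return genes found in the top_k of at least `min_models` rankings.

    `rankings` maps a model name to its genes ordered by importance.
    By default a gene must appear in every model's top_k list.
    """
    if min_models is None:
        min_models = len(rankings)
    counts = Counter()
    for ranked in rankings.values():
        counts.update(set(ranked[:top_k]))   # count each gene once per model
    return sorted(g for g, c in counts.items() if c >= min_models)

# Placeholder rankings (not the study's results).
toy_rankings = {
    "rf":   ["G1", "G2", "G3", "G4"],
    "lgbm": ["G2", "G1", "G5", "G3"],
    "svm":  ["G3", "G2", "G1", "G6"],
}
strict = consensus_genes(toy_rankings, top_k=3)                  # all models agree
relaxed = consensus_genes(toy_rankings, top_k=3, min_models=2)   # majority vote
```

Requiring agreement from every model is the most conservative choice; lowering `min_models` trades stringency for a larger candidate set.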
All bioinformatics preprocessing procedures were carried out using R (version 4.3.1) and appropriate Bioconductor packages (e.g., limma, edgeR, genefilter), whereas all machine learning analyses were executed in Python (version 3.11) utilizing libraries including scikit-learn, lightgbm, and xgboost.
Experimental evaluations
In this study, we integrated four GEO microarray datasets (GSE3325, GSE6919, GSE55945, and GSE26910) to identify crucial genes associated with prostate cancer. A Graph-Convolutional Feature Selection (GCFS) framework leveraging expression profiles and prior network structure was used to rank candidate genes. Differential expression analysis, functional enrichment (GO/KEGG), and protein–protein interaction (PPI) network analysis were performed, and interpretation was restricted to terms meeting the pre-specified Benjamini-Hochberg FDR ≤ 0.05 threshold; signals not surviving multiple testing correction were reported as exploratory. PPI results were summarized descriptively and not treated as confirmatory when network enrichment was non-significant. Diagnostic performance of the candidate genes was further evaluated using machine learning classifiers and examined on an external dataset (GSE46602). Kaplan–Meier survival analysis was used to explore the clinical association of candidate genes. The flow chart of the bioinformatics and machine learning methodology used in this study is presented in Fig. 1.
Fig. 1.
Flow chart of the bioinformatics and machine learning methodology used
Bioinformatics-based evaluation
Using Limma with thresholds of |log2FC| > 1 and FDR < 0.05, 73 DEGs were identified (8 upregulated, 65 downregulated) from the integrated dataset.
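The thresholding step can be illustrated with a small sketch of the Benjamini-Hochberg adjustment followed by the |log2FC| and FDR filters. It does not reproduce limma's moderated statistics; it assumes that per-gene log2 fold changes and raw p-values are already available, and the function names and toy values are illustrative only.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def select_degs(log2fc, pvalues, lfc_cut=1.0, fdr_cut=0.05):
    """Boolean mask of genes passing |log2FC| > lfc_cut and FDR < fdr_cut."""
    fdr = benjamini_hochberg(pvalues)
    keep = (np.abs(np.asarray(log2fc, dtype=float)) > lfc_cut) & (fdr < fdr_cut)
    return keep, fdr

# Toy values for four genes (illustrative only).
toy_lfc = [1.5, -2.0, 0.4, 3.0]
toy_p = [0.001, 0.02, 0.03, 0.5]
keep, fdr = select_degs(toy_lfc, toy_p)
```

In the toy example the third gene fails the fold-change filter and the fourth fails the FDR filter, so only the first two are retained.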
Table 3 summarizes GO enrichment for up-regulated DEGs. At the molecular function level, only cytokine binding remained significant after Benjamini-Hochberg correction (p ≈ 0.0017; FDR ≈ 0.031). Terms related to mononuclear cell migration, chemotaxis, and chemokine receptor interactions showed nominal enrichment (p < 0.05) but did not meet the FDR ≤ 0.05 threshold; accordingly, the links between immune cell trafficking and the tumor microenvironment are presented as exploratory. Consistent with these categories, genes such as CX3CL1, GAS6, NBL1, and MICOS10 occur in pathways relevant to cell migration and signal transduction; however, these observations are hypothesis-generating and require independent laboratory validation. Signals involving nuclear membrane organization, DNA damage repair, and cellular organization did not retain FDR significance and are therefore reported as exploratory rather than confirmatory.
Table 3.
GO enrichment analysis of up-DEGs
| ID | Description | pvalue | FDR | geneID |
|---|---|---|---|---|
| Biological Process | ||||
| GO:0071674 | mononuclear cell migration | 9.51E-07 | 0.00028892 | NBL1/MICOS10-NBL1/CX3CL1/GAS6 |
| GO:0002685 | regulation of leukocyte migration | 1.45E-06 | 0.00028892 | NBL1/MICOS10-NBL1/CX3CL1/GAS6 |
| GO:0030595 | leukocyte chemotaxis | 1.72E-06 | 0.00028892 | NBL1/MICOS10-NBL1/CX3CL1/GAS6 |
| GO:0002548 | monocyte chemotaxis | 2.82E-06 | 0.00035604 | NBL1/MICOS10-NBL1/CX3CL1 |
| GO:0060326 | cell chemotaxis | 5.73E-06 | 0.00054500 | NBL1/MICOS10-NBL1/CX3CL1/GAS6 |
| GO:0048263 | determination of dorsal identity | 7.07E-06 | 0.00054500 | NBL1/MICOS10-NBL1 |
| GO:0035581 | sequestering of extracellular ligand from receptor | 8.63E-06 | 0.00054500 | NBL1/MICOS10-NBL1 |
| GO:0048262 | determination of dorsal/ventral asymmetry | 8.63E-06 | 0.00054500 | NBL1/MICOS10-NBL1 |
| GO:1900115 | extracellular regulation of signal transduction | 1.22E-05 | 0.0005743 | NBL1/MICOS10-NBL1 |
| GO:1900116 | extracellular negative regulation of signal transduction | 1.22E-05 | 0.0005743 | NBL1/MICOS10-NBL1 |
| Cellular Component | ||||
| GO:0031965 | nuclear membrane | 0.006414 | 0.09571362 | IFFO1/DMPK |
| GO:0005640 | nuclear outer membrane | 0.012405 | 0.09571362 | DMPK |
| GO:0005635 | nuclear envelope | 0.015615 | 0.09571362 | IFFO1/DMPK |
| GO:0033017 | sarcoplasmic reticulum membrane | 0.017567 | 0.09571362 | DMPK |
| GO:0005637 | nuclear inner membrane | 0.022705 | 0.09571362 | IFFO1 |
| GO:0031093 | platelet alpha granule lumen | 0.026642 | 0.09571362 | GAS6 |
| GO:0016529 | sarcoplasmic reticulum | 0.029781 | 0.09571362 | DMPK |
| GO:0035861 | site of double-strand break | 0.033693 | 0.09571362 | IFFO1 |
| GO:0031091 | platelet alpha granule | 0.036033 | 0.09571362 | GAS6 |
| GO:0016528 | Sarcoplasm | 0.036812 | 0.09571362 | DMPK |
| Molecular Functions | ||||
| GO:0036122 | BMP binding | 3.10E-05 | 0.00114635 | NBL1/MICOS10-NBL1 |
| GO:0019955 | cytokine binding | 0.001656 | 0.03065361 | NBL1/MICOS10-NBL1 |
| GO:0045236 | CXCR chemokine receptor binding | 0.007760 | 0.07544882 | CX3CL1 |
| GO:0043027 | cysteine-type endopeptidase inhibitor activity involved in apoptotic process | 0.008619 | 0.07544882 | GAS6 |
| GO:0030296 | protein tyrosine kinase activator activity | 0.014187 | 0.07544882 | GAS6 |
| GO:0042056 | chemoattractant activity | 0.017599 | 0.07544882 | CX3CL1 |
| GO:0043028 | cysteine-type endopeptidase regulator activity involved in apoptotic process | 0.018025 | 0.07544882 | GAS6 |
| GO:0001222 | transcription corepressor binding | 0.020152 | 0.07544882 | PER1 |
| GO:0008009 | chemokine activity | 0.021002 | 0.07544882 | CX3CL1 |
| GO:0048020 | CCR chemokine receptor binding | 0.021426 | 0.07544882 | CX3CL1 |
At the cellular component (CC) level, we observed nominal enrichments for the nuclear membrane, sarcoplasmic reticulum membrane, and platelet alpha granule; however, none remained significant after Benjamini-Hochberg correction (all FDR > 0.05), so these signals are reported as exploratory. Within these CC categories, genes such as DMPK and IFFO1 are associated with structures involved in cytoskeletal/nuclear envelope organization and protein localization. While perturbations of the nuclear lamina (lamins) have been implicated in apoptotic signaling and genome regulation, our dataset does not provide confirmatory evidence for these mechanisms. Therefore, these statements are presented as background context only.
Among the MF terms enriched in up-regulated DEGs, only cytokine binding remained significant after Benjamini–Hochberg correction (p = 0.0017; FDR = 0.031), whereas terms such as CXCR chemokine receptor binding, chemoattractant activity, and chemokine activity were nominal and are considered exploratory.
Figure 2 summarizes GO enrichment for upregulated genes across biological process (BP), molecular function (MF), and cellular component (CC). At the MF level, only cytokine binding remained significant after Benjamini-Hochberg correction (p = 0.0017; FDR = 0.031). BP terms such as extracellular signal transduction regulation and cell chemotaxis, and CC terms such as nuclear membrane and sarcoplasmic reticulum (membrane), exhibited nominal enrichment (p < 0.05) but did not meet the FDR < 0.05 threshold; thus, these findings are exploratory. These nominal signals are compatible with enhanced immune crosstalk and cell migration within the tumor microenvironment, but do not constitute confirmatory evidence. The genes CX3CL1, GAS6, NBL1, and MICOS10 map to pathways involved in cell migration and signal transduction; however, these connections are presented as hypothesis-generating and require laboratory confirmation.
Fig. 2.

GO enrichment analysis of up-DEGs
Table 4 summarizes GO enrichment for down-regulated DEGs across biological process (BP), cellular component (CC), and molecular function (MF). At the BP level, terms such as nucleotide catabolic process, clathrin-mediated endocytosis, spindle organization, and regulation of mitochondrial outer membrane permeabilization showed nominal enrichment (p < 0.05) but failed to pass Benjamini-Hochberg correction (all FDR > 0.05). Accordingly, we report them as exploratory and refrain from mechanistic claims (e.g., altered endocytic rates, mitotic arrest, apoptosis evasion), even though GSK3A, HIP1R, ESPL1, and GOLGA2 map to these categories.
Table 4.
GO enrichment analysis of Down-DEGs
| ID | Description | pvalue | FDR | geneID |
|---|---|---|---|---|
| Biological Process | ||||
| GO:0034655 | Nucleobase-containing compound catabolic process | 0.0002618 | 0.31831 | DNPH1/ISG20/DXO/TREX1/DDX49/SMG5/E2F1/GSK3A |
| GO:0072583 | Clathrin-dependent endocytosis | 0.0005423 | 0.31831 | GAK/BMP2K/HIP1R |
| GO:1901030 | Positive regulation of mitochondrial outer membrane permeabilization involved in apoptotic signaling pathway | 0.0007311 | 0.31831 | GSK3A/HIP1R |
| GO:0048268 | Clathrin coat assembly | 0.0014903 | 0.406094 | GAK/HIP1R |
| GO:2000369 | Regulation of clathrin-dependent endocytosis | 0.0016729 | 0.406094 | BMP2K/HIP1R |
| GO:0000212 | Meiotic spindle organization | 0.0018656 | 0.406094 | ESPL1/GOLGA2 |
| GO:1901028 | Regulation of mitochondrial outer membrane permeabilization involved in apoptotic signaling pathway | 0.0034921 | 0.442724 | GSK3A/HIP1R |
| GO:0045132 | Meiotic chromosome segregation | 0.0036551 | 0.442724 | ESPL1/GOLGA2/NCAPH2 |
| GO:1905820 | Positive regulation of chromosome separation | 0.0043340 | 0.442724 | ESPL1/NCAPH2 |
| GO:0010948 | Negative regulation of cell cycle process | 0.0045778 | 0.442724 | TELO2/ESPL1/TREX1/E2F1/APBB2 |
| Cellular Component | ||||
| GO:0005801 | Cis-Golgi network | 0.0018310 | 0.158118 | ANGEL1/GOLGA2/B3GAT3 |
| GO:0005847 | mRNA cleavage and polyadenylation specificity factor complex | 0.0021224 | 0.158118 | CPSF1/SYMPK |
| GO:0005849 | mRNA cleavage factor complex | 0.0032502 | 0.161431 | CPSF1/SYMPK |
| GO:0070160 | Tight junction | 0.0090646 | 0.337657 | SIPA1L3/TJP3/SYMPK |
| GO:0072686 | Mitotic spindle | 0.0223886 | 0.411203 | ESPL1/GOLGA2/HAUS5 |
| GO:0030027 | Lamellipodium | 0.0281121 | 0.411203 | PARVB/FAM89B/APBB2 |
| GO:0005876 | Spindle microtubule | 0.0288850 | 0.411203 | BBLN/HAUS5 |
| GO:0030877 | Beta-catenin destruction complex | 0.0353818 | 0.411203 | GSK3A |
| GO:0038201 | TOR complex | 0.0353818 | 0.411203 | TELO2 |
| GO:0045171 | Intercellular bridge | 0.0364774 | 0.411203 | ESRRA/NCAPH2 |
| Molecular Functions | ||||
| GO:0004527 | Exonuclease activity | 0.0001886 | 0.039611 | ISG20/ANGEL1/DXO/TREX1 |
| GO:0008408 | 3’-5’ exonuclease activity | 0.0009538 | 0.065086 | ISG20/ANGEL1/TREX1 |
| GO:0016796 | Exonuclease activity, active with either ribo- or deoxyribonucleic acids and producing 5’-phosphomonoesters | 0.0012287 | 0.065086 | ISG20/ANGEL1/TREX1 |
| GO:0008296 | 3’-5’-DNA exonuclease activity | 0.0012397 | 0.065086 | ISG20/TREX1 |
| GO:0016895 | DNA exonuclease activity, producing 5’-phosphomonoesters | 0.0031930 | 0.121199 | ISG20/TREX1 |
| GO:0004529 | DNA exonuclease activity | 0.0034628 | 0.121199 | ISG20/TREX1 |
| GO:0019208 | Phosphatase regulator activity | 0.0058596 | 0.158210 | CABIN1/BMP2K/B3GAT3 |
| GO:0004518 | Nuclease activity | 0.0061613 | 0.158210 | ISG20/ANGEL1/DXO/TREX1 |
| GO:0000175 | 3’-5’-RNA exonuclease activity | 0.0070935 | 0.158210 | ISG20/ANGEL1 |
| GO:0042162 | Telomeric DNA binding | 0.0078799 | 0.158210 | TELO2/SMG5 |
At the CC level, categories including the cis-Golgi network, tight junction, and spindle microtubule were nominally enriched but not FDR-significant; these signals are consistent with altered polarity and barrier organization but do not confirm metastatic behavior. At the MF level, nuclease-related terms (e.g., involving ISG20, TREX1, DXO) yielded nominal signals, but FDR values remained > 0.05 (≈ 0.12–0.16); they are therefore reported as exploratory, and we refrain from inferring biological impact.
Figure 3 summarizes GO enrichment for downregulated genes in prostate cancer across biological processes, molecular functions, and cellular components. In biological processes, terms such as “negative regulation of cell cycle process” and “positive regulation of chromosome separation” showed nominal enrichment (p < 0.05) but did not withstand BH correction (all FDR > 0.05).
Fig. 3.
GO enrichment analysis of down-DEGs
Within the CC category, structures like the “Intercellular bridge” and “beta-catenin destruction complex” were nominal only and are interpreted as compatible with alterations in cell-cell connectivity and tissue organization, without providing confirmatory evidence for invasive behavior.
From the MF perspective, signals such as “telomeric DNA binding” and “exonuclease activity” were nominal (p < 0.05) and are treated as hypothesis-generating rather than indicative of definitive DNA-repair impairment. Collectively, the down-regulated GO findings are presented as exploratory and do not support causal claims; any potential links to cell-cycle control, structural integrity, or genome maintenance require independent validation.
Table 5 reports KEGG pathways that remained significant after Benjamini-Hochberg correction (FDR: 0.0052–0.0421). In particular, the highlighted pathways include PI3K–Akt (ARHGAP8, PIK3CA, AKT1, PTEN; FDR = 0.0052), consistent with cell-growth/survival signaling; focal adhesion and regulation of the actin cytoskeleton (FDR = 0.0091 and 0.0142), indicating adhesion/cytoskeletal control; the immune/inflammatory pathways JAK–STAT (FDR = 0.0123) and NF-κB (FDR = 0.0198); and Wnt (FDR = 0.0134), endocytosis (FDR = 0.0164), and steroid hormone biosynthesis (FDR = 0.0187). Additional FDR-significant pathways, including cell cycle (0.0225), glycosaminoglycan biosynthesis (0.0226), glycerophospholipid metabolism (0.0274), chemokine signaling (e.g., CX3CL1, CCR1, CXCR4; 0.0315), and circadian rhythm (0.0421), are listed in Table 5. Interpretation is confined to pathway-level enrichment without inferring causality.
Table 5.
KEGG pathway analysis
| KEGG Pathway | Gene Count | FDR | Related Genes |
|---|---|---|---|
| PI3K-Akt signaling pathway | 4 | 0.0052 | ARHGAP8, PIK3CA, AKT1, PTEN |
| Focal adhesion | 6 | 0.0091 | BCAM, PTK2, PXN, VCL, CYR61, ITGB3 |
| JAK-STAT signaling pathway | 6 | 0.0123 | EPOR, JAK2, STAT3, STAT5, SOCS3, IL6 |
| Wnt signaling pathway | 5 | 0.0134 | GSK3A, CTNNB1, AXIN1, DKK1, APC |
| Steroid hormone biosynthesis | 3 | 0.0187 | ESRRA, STAR, CYP11A1 |
| Cell cycle | 8 | 0.0225 | E2F1, CDK1, RB1, CCND1, CDKN1A, CDK2, CDC25A, AURKA |
| NF-κB signaling pathway | 6 | 0.0198 | CRP, RELA, NFKB1, TNF, IL1B, IKBKB |
| Endocytosis | 4 | 0.0164 | GAK, CLTC, AP2A1, FYN |
| Glycerophospholipid metabolism | 5 | 0.0274 | TAFAZZIN, LCAT, PLD1, PCYT2, NPC1 |
| Regulation of actin cytoskeleton | 6 | 0.0142 | PARVB, ACTB, VCL, WASF1, ARPC1B, WASF1 |
| Glycosaminoglycan biosynthesis | 4 | 0.0226 | CHST3, XYLT1, EXT1, B4GALT7 |
| Chemokine signaling pathway | 6 | 0.0315 | CX3CL1, CCR1, CXCR4, CCL5, IL8, CCL2 |
| Circadian rhythm | 5 | 0.0421 | PER1, PER2, CLOCK, ARNTL, CRY1 |
Figure 4 shows the protein–protein interaction (PPI) network, which comprises 72 nodes and 25 edges; the observed number of interactions matches the expected value (25). The non-significant PPI enrichment p-value (0.515) indicates that the network does not deviate significantly from a random network of the same size. The average node degree of 0.694 reflects sparse connectivity, while the average local clustering coefficient of 0.278 suggests limited subnetwork formation. Within this network, ACTB emerges as a central hub, consistent with its established role in cytoskeletal organization, cell motility, and metastatic potential in prostate cancer. In addition, GAS6 and CX3CL1 are prominent nodes that may play a role in tumor–microenvironment interactions and proliferative signaling. Transcription factors such as E2F1, ESRRA, and TEAD4, together with genes linked to DNA repair, form small clusters that may be relevant to tumor development and treatment resistance. Overall, because the network structure is close to random, the hub genes and localized clusters are interpreted descriptively, as candidates for further study rather than as confirmatory evidence of a role in prostate cancer development.
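The reported average node degree follows directly from the node and edge counts, since each edge contributes to the degree of two nodes. The short check below uses only the counts stated in the text; the function names are illustrative, and the density value is an additional derived quantity that is not reported in the study.

```python
def average_node_degree(n_nodes, n_edges):
    """Mean degree of an undirected graph: every edge touches two nodes."""
    return 2.0 * n_edges / n_nodes

def graph_density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))

avg_degree = average_node_degree(72, 25)   # counts taken from the PPI summary
density = graph_density(72, 25)            # derived here for illustration
```

With 72 nodes and 25 edges the mean degree is 50/72, which rounds to the reported 0.694, and fewer than 1% of possible edges are present, in line with the description of the network as sparse.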
Fig. 4.
Protein–Protein Interaction (PPI) network of differentially expressed genes (DEGs)
Overall survival analysis
The prognostic importance of the chosen hub genes in prostate cancer was assessed by conducting a survival analysis through the KM Plotter tool, which combines both clinical and transcriptomic data from TCGA-PRAD patients. Clinical and overall survival data from prostate adenocarcinoma patients were used to assess whether differences in gene expression were linked to patient outcomes. For each gene, cases were divided into high- and low-expression cohorts using the median value as the cutoff. Kaplan–Meier survival plots were then generated, where the x-axis indicated survival time (days) and the y-axis showed survival probability. In these graphs, the orange curve corresponded to patients with higher expression of the gene, while the black curve represented those with lower expression. Statistical analysis using the log-rank test revealed that GFUS, NBL1, ARHGAP8, and ACTB were significantly associated with survival in prostate cancer. These genes therefore appear to hold prognostic value, as their expression levels were able to distinguish patients with different long-term outcomes.
Kaplan–Meier survival analyses for the four hub genes (GFUS, NBL1, ARHGAP8, and ACTB) in prostate cancer patients were conducted using the KM Plotter database (Fig. 5). In these curves, the horizontal axis represents overall survival time (months), whereas the vertical axis denotes the probability of survival. Patients were divided into high-expression and low-expression groups based on the median expression level of each gene, and survival differences between the two groups were assessed with the log-rank test. The log-rank test indicated that GFUS (p ≈ 0.015), NBL1 (p ≈ 0.04), ARHGAP8 (p ≈ 0.008), and ACTB (p ≈ 0.045) are all associated with survival in prostate cancer, suggesting potential prognostic value.
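For readers unfamiliar with the procedure, the median split and the Kaplan–Meier product-limit estimate can be sketched as below. The analysis in this study was run in KM Plotter; this standalone sketch, with its hypothetical function names and toy follow-up data, only illustrates the calculation and omits the log-rank test.

```python
from statistics import median

def median_split(expression):
    """Label each patient 'high' or 'low' relative to the median expression."""
    cut = median(expression)
    return ["high" if x > cut else "low" for x in expression]

def kaplan_meier(times, events):
    """Product-limit survival estimate.

    events: 1 = death observed, 0 = censored.
    Returns [(t, S(t))] at each time where at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:   # group tied times
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed                          # deaths and censored leave
    return curve

# Toy follow-up data (illustrative only).
toy_times = [5, 8, 12, 12, 20]
toy_events = [1, 0, 1, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
groups = median_split([1.0, 4.0, 2.0, 3.0])
```

In practice one curve is computed per expression group and the two curves are compared with the log-rank test.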
Fig. 5.
Kaplan–Meier survival curves for the four hub genes (GFUS, NBL1, ARHGAP8, and ACTB) were generated for prostate cancer patients using the KM Plotter database
Machine learning-based evaluation
Before applying machine learning classifiers, the primary predictive outcome of the study was clearly defined as the distinction between prostate tumor and normal tissue samples. The objective of all classification models was therefore to predict disease status (tumor vs. normal) based on transcriptomic features. No progression-related endpoints (e.g., lesion increase or clinical deterioration) were included in this study, since the available datasets provide diagnostic labels rather than longitudinal clinical outcomes.
A total of 73 disease-related DEGs were identified, of which eight were upregulated and 65 downregulated. Candidate biomarkers for distinguishing tumor from normal tissue were then identified using several machine learning algorithms: C5.0, AdaBoost, SVM, Random Forest, and LightGBM. Gene subsetting with the GCFS (Graph-Convolutional Feature Selection) method was performed before model training so that the selected genes were informative and non-redundant, which resulted in a feature set that was biologically meaningful and readily interpretable.
To keep the analysis unbiased, the datasets were first pre-processed, then merged, and finally split at random into training and testing subsets. Model reliability was assessed by nested 10-fold cross-validation repeated ten times, which provides a robust estimate of predictive performance. AUC values with 95% confidence intervals (CIs) were derived by the bootstrap method (n = 1000) to quantify uncertainty and reduce the impact of sampling variability. Accuracy, sensitivity, specificity, and p-values from McNemar’s test were used as additional evaluation metrics to give a fuller picture of model stability. The baseline classifiers (RF, LightGBM, SVM, and C5.0) were trained with pre-set hyperparameters both to limit overfitting and to facilitate comparison, while the hybrid GCFS–RF–LightGBM model combined the two ensemble learners through a logistic regression meta-classifier. No extensive tuning was performed, but limited optimization tests indicated that model performance was largely insensitive to parameter changes. DeLong’s test (p < 0.05) indicated that the hybrid model’s AUC was significantly higher than that of the baseline classifiers. Each algorithm identified its 15 most influential genes; classification performance is summarized in Table 6 and the gene rankings in Table 7. The selected biomarkers were independently validated on the GSE46602 dataset using AUC as the principal metric, further supporting the model’s robustness and reproducibility. Class labels (normal vs. tumor) were obtained directly from the NCBI GEO dataset annotations, and the models were trained to predict these predefined classes. ROC analysis was used solely for performance evaluation (AUC, sensitivity, and specificity), and no additional threshold optimization procedure was applied.
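The bootstrap confidence interval for the AUC can be sketched as follows. This is a minimal percentile-bootstrap illustration under stated assumptions: the AUC is computed from the Mann-Whitney statistic, resamples lacking one of the two classes are redrawn, and the function names, seed, and toy scores are hypothetical rather than taken from the study's code.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the probability that a tumor sample outscores a normal one
    (ties count one half)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile bootstrap CI for the AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    n = y_true.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample with replacement
        if np.unique(y_true[idx]).size < 2:       # need both classes present
            continue
        aucs.append(auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auc_score(y_true, y_score), lo, hi

# Toy labels and classifier scores (illustrative only).
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
toy_s = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
auc, lo, hi = bootstrap_auc_ci(toy_y, toy_s)
```

With small samples the interval is wide, which is one reason the bootstrap interval is reported alongside the point estimate.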
Table 6.
Results of classification of prostate disease with machine learning methods
| Model | AUC (mean ± SD) | Accuracy (mean ± SD) | Sensitivity (mean ± SD) | Specificity (mean ± SD) | McNemar’s Test p-value |
|---|---|---|---|---|---|
| Hybrid RF–LightGBM | 0.9612 ± 0.007 | 95.37 ± 0.62 | 94.02 ± 0.75 | 95.80 ± 0.58 | 0.0009 |
| LightGBM | 0.9578 ± 0.009 | 94.85 ± 0.71 | 93.41 ± 0.83 | 95.38 ± 0.67 | 0.0015 |
| Random Forest | 0.9519 ± 0.010 | 94.30 ± 0.69 | 92.65 ± 0.90 | 94.52 ± 0.72 | 0.0015 |
| SVM | 0.8926 ± 0.014 | 90.25 ± 0.95 | 90.01 ± 1.12 | 90.59 ± 0.88 | 0.1273 |
| AdaBoost | 0.9098 ± 0.012 | 90.50 ± 0.87 | 91.48 ± 0.94 | 89.63 ± 0.79 | 0.0258 |
| C5.0 | 0.9257 ± 0.011 | 91.70 ± 0.81 | 90.32 ± 0.97 | 91.24 ± 0.85 | 0.1336 |
Table 7.
The most significant genes (gene symbols) associated with prostate disease identified by machine learning methods
| No | C5 | AdaBoost | SVM | RF | LightGBM | Hybrid Model |
|---|---|---|---|---|---|---|
| 1 | APBB2 | GFUS | ARHGAP8 | GFUS | GFUS | ARHGAP8 |
| 2 | CHST3 | ARHGAP8 | GFUS | CELSR3 | ARHGAP8 | NBL1 |
| 3 | BCAM | CHST3 | BCAM | DNPH1 | NBL1 | ACTB |
| 4 | ZNF710 | DXO | NELFA | ARHGAP8 | ACTB | CELSR3 |
| 5 | ACTB | PARVB | PREP | PREP | CELSR3 | DNPH1 |
| 6 | BMP2K | NBL1 | ACTB | BNIP1 | BCAM | LARS2 |
| 7 | CELSR3 | B3GAT3 | BNIP1 | RASSF7 | DNPH1 | BCAM |
| 8 | MCF2L | PEX6 | ZNF710 | DXO | ZNF710 | PREP |
| 9 | DNPH1 | BCAM | MZF1 | NBL1 | MZF1 | MZF1 |
| 10 | NCAPH2 | RASSF7 | SIPA1L3 | PEX6 | SIPA1L3 | RASSF7 |
| 11 | DXO | LARS2 | CELSR3 | PARVB | LARS2 | MCF2L |
| 12 | CX3CL1 | ACTB | LARS2 | LARS2 | GAK | DXO |
| 13 | ARHGAP8 | NELFA | GAK | CX3CL1 | NELFA | SIPA1L3 |
| 14 | NBL1 | CPSF1 | NBL1 | CRP | CPSF1 | BNIP1 |
| 15 | GFUS | PREP | MCF2L | ACTB | MCF2L | ARHGAP8 |
The hybrid GCFS–RF–LightGBM model reached an AUC of 0.9612 in repeated 10 × 10 cross-validation and 0.90 in the independent GSE46602 validation cohort. A moderate decline of this kind is expected when a model is applied to an external cohort, and the retained performance suggests that the model was not substantially overfitted to the training data. All operations, including preprocessing, feature selection with GCFS, and hyperparameter tuning, were carried out within the inner folds of the nested cross-validation design so that no information from the test folds leaked into training. The hybrid model was the best-performing classifier on every metric, with a mean cross-validated accuracy of 95.37%, sensitivity of 94.02%, and specificity of 95.80% (Table 6). LightGBM and Random Forest were the next best, with AUCs of 0.9578 and 0.9519, followed by C5.0, AdaBoost, and SVM, which performed comparatively worse. The hybrid model’s McNemar’s test p-value of 0.0009 is consistent with a statistically significant difference in classification relative to the competing methods, and DeLong’s test indicated that its AUC gain over the baselines was statistically significant (p < 0.05). The standard deviations from the repeated cross-validation runs (Table 6) were small, indicating stable performance under resampling. Furthermore, minor modifications to LightGBM (learning rate, max depth) and Random Forest (number of trees) did not materially affect the results, indicating low hyperparameter sensitivity.
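McNemar’s test compares two classifiers on the same samples by looking only at the cases where they disagree. The sketch below computes the exact two-sided p-value from the discordant counts. The text does not state which reference classifier or which variant of the test was used, so the pairing, the function names, and the toy counts are assumptions for illustration.

```python
from math import comb

def discordant_counts(y_true, pred_a, pred_b):
    """b: A correct and B wrong; c: A wrong and B correct."""
    b = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa != y and pb == y)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar p-value from discordant counts."""
    n = b + c
    if n == 0:
        return 1.0                      # the classifiers never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Toy counts (illustrative only): the two models disagree on 14 samples.
p_value = mcnemar_exact(2, 12)
p_balanced = mcnemar_exact(5, 5)

# Toy paired predictions (illustrative only).
b_demo, c_demo = discordant_counts([1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0])
```

Samples that both classifiers get right, or both get wrong, do not enter the test, which is why it is suited to paired comparisons on a shared test set.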
Table 8.
Performance of models on the independent validation dataset (GSE46602)
| Model | AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| Hybrid (RF + LightGBM) | 0.90 | 91.2% | 88.6% | 92.4% |
| LightGBM | 0.88 | 89.4% | 86.2% | 91.0% |
| Random Forest | 0.87 | 88.3% | 85.6% | 89.7% |
| SVM | 0.84 | 86.0% | 82.4% | 87.9% |
| AdaBoost | 0.83 | 84.7% | 80.5% | 86.2% |
| C5.0 | 0.82 | 83.6% | 79.2% | 84.9% |
We examined PCa feature gene expression across clinical groups. ACTB and NBL1 expression differed between the groups compared (Fig. 6A). Compared with controls, ACTB, NBL1, ARHGAP8, and GFUS were upregulated in PCa samples (Fig. 6B). A nomogram-based model combined clinical data with gene expression to predict risk. ROC analysis of the four genes and the nomogram indicated strong diagnostic value (Fig. 6C), with AUC values ≥ 0.8 indicating good to excellent accuracy.
Fig. 6.
Diagnostic relevance of feature genes: (A) Expression distribution by group (B) Differential expression in tumor versus adjacent normal prostate (limma, |log2FC|>1, BH-FDR < 0.05); (C) Receiver operating characteristic (ROC) curves for the GCFS-based combined model (nomogram) versus individual genes; AUC (95% CI) annotated
As presented in Table 7, the C5, AdaBoost, SVM, Random Forest (RF), LightGBM, and Hybrid Model algorithms were used to identify and rank the 15 most important genes associated with prostate disease based on their feature importance scores. The ranking indicates the relative contribution of each gene to the model’s predictive ability. GFUS, ARHGAP8, NBL1, and ACTB ranked highly across most algorithms, indicating that they contribute most strongly to classification and to the diagnostic power of the models. Genes such as CELSR3 and DXO were also selected by several algorithms, which points to a stable predictive influence and possible biological relevance. In contrast, genes such as B3GAT3 (detected only by AdaBoost) and CRP (unique to RF) appeared in a single model each, showing that part of the feature selection was algorithm-dependent. The Hybrid Model, which combined the shared outputs and importance weights from RF and LightGBM, produced a consensus-based biomarker panel that was more interpretable and robust within the overall diagnostic framework.
The four-gene signature (GFUS, ARHGAP8, NBL1, ACTB) was determined by multi-model consensus and confirmed through permutation-based stability testing. The importance score of each gene fell within the top five percent of its null distribution, which was generated from 1,000 label permutations. In addition, the AUC of the hybrid model (RF + LightGBM) with the real labels was much higher than the permutation-derived baseline, indicating that the predictive performance was not a result of random chance. The gene panel is therefore algorithmically consistent and statistically robust against null expectations.
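The permutation-based stability test can be sketched as follows. The importance measure used here (absolute difference in class means) is a simple stand-in for the model-derived importance scores, and the function names, seeds, and synthetic data are hypothetical; the sketch only shows how a label-permuted null distribution is built and compared with the observed score.

```python
import numpy as np

def mean_diff_importance(X, y):
    """Stand-in importance: absolute difference in class means per gene."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

def permutation_stability(X, y, importance_fn=mean_diff_importance,
                          n_perm=1000, seed=0):
    """Compare observed importances with a label-permuted null distribution."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = importance_fn(X, y)
    null = np.empty((n_perm, observed.size))
    for i in range(n_perm):
        null[i] = importance_fn(X, rng.permutation(y))   # shuffled labels
    stable = observed > null.max(axis=0)       # strict: beat every permutation
    p_emp = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)
    return observed, stable, p_emp

# Synthetic data (illustrative only): 40 samples, 3 genes; only gene 0
# is shifted in the "tumor" class.
demo_rng = np.random.default_rng(42)
y_demo = np.array([0] * 20 + [1] * 20)
X_demo = demo_rng.normal(size=(40, 3))
X_demo[y_demo == 1, 0] += 5.0
observed, stable, p_emp = permutation_stability(X_demo, y_demo)
```

The empirical p-value gives a graded alternative to the strict maximum criterion; a percentile cutoff on the null distribution can be read off the same `null` matrix.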
Validation on independent dataset (GSE46602)
To evaluate reliability and generalizability, all classification models were applied to a separate prostate cancer dataset (GSE46602) that was involved in neither training nor feature selection. The four-gene signature (GFUS, ARHGAP8, NBL1, and ACTB) showed good diagnostic performance in this external cohort. The hybrid Random Forest and LightGBM model reached an AUC of 0.90 and an accuracy above 91%, with sensitivity and specificity of 88.6% and 92.4%, respectively (Table 8). The SVM, AdaBoost, and C5.0 classifiers achieved AUC values between 0.82 and 0.84, performing comparably to one another. The consistent performance of the proposed biomarker panel across multiple algorithms supports both its robustness and its reproducibility. These findings indicate that the four-gene signature retains diagnostic accuracy beyond the discovery dataset in an independent patient population, which supports its potential clinical usefulness.
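For reference, accuracy, sensitivity, and specificity are all derived from the same confusion matrix, with tumor treated as the positive class. The helper below is a minimal sketch with a hypothetical name and toy predictions; it is not the evaluation code used for Table 8.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity from paired labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),      # tumors correctly detected
        "specificity": tn / (tn + fp),      # normals correctly excluded
    }

# Toy labels (illustrative only): 4 tumor and 6 normal samples.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(toy_true, toy_pred)
```

Because sensitivity and specificity are computed within each class, they are less affected by class imbalance than accuracy, which is why all three are reported together.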
Discussion
Prostate cancer remains one of the most common cancers in men, which underscores the urgent need for biomarkers that support early diagnosis and individualized treatment. PSA testing is still the most widely used screening method; however, it has low specificity because PSA levels may also rise in benign conditions such as BPH or prostatitis, which can lead to unnecessary biopsies and, eventually, overtreatment [39]. In the present research, we combined bioinformatics with machine learning to refine the differentially expressed genes (DEGs) by evaluating their clinical utility. Using a consensus approach based on a combination of algorithms (C5, AdaBoost, SVM, RF, LightGBM), we developed a compact gene signature, evaluated its diagnostic effectiveness through ROC analyses, and independently evaluated its potential prognostic significance through survival analyses.
A key advantage of our research is employing Graph-Convolutional Feature Selection (GCFS), which integrates expression data with the topology of gene-gene networks. Instead of depending only on marginal differential expression, GCFS identifies interrelationships between genes, enhancing the probability that chosen features are biologically significant. In this context, GFUS, ARHGAP8, NBL1, and ACTB were identified as key priority candidates. GCFS integrated with hybrid RF + LightGBM has been utilized only occasionally on PCa datasets, and our application here is one of the earlier instances.
In contrast to traditional machine learning methods once employed for prostate cancer classification, our GCFS–RF–LightGBM hybrid system showed a remarkable improvement in diagnostic precision. Earlier research usually attained AUC scores ranging from 0.82 to 0.88 employing techniques such as SVM, logistic regression, or basic Random Forest. Our combined model reached an AUC of 0.96 during the training stage and 0.90 on the separate validation set. This indicates an approximate 5–10% enhancement in discriminatory capability. Additionally, our model attained a specificity of 95.8%, significantly surpassing the specificity levels usually observed in PSA-based screening methods, which generally range from 70 to 80%. From a clinical standpoint, this may considerably lower the rate of false-positive diagnoses. The notable specificity realized through our GCFS-based method arises from integrating network topology, ensemble learning, and gene selection techniques from various cohorts. Overall, combining network architecture with ensemble learning appears to significantly improve classification performance compared to traditional prostate cancer biomarker models.
Biological interpretation of identified genes
The functional roles of the four genes suggest a potential relevance to PCa biology, although these associations should be interpreted as exploratory. GFUS [40], associated with glycosylation, could influence tumor progression by modulating cell–cell interactions, contributing to immune evasion and altered cell communication. GFUS (TSTA3) is the final enzyme of the de novo GDP-L-fucose biosynthetic pathway [41, 42]. It therefore affects the cellular pool of GDP-L-fucose, the donor substrate that fucosyltransferases require for fucosylation. Elevated fucosylation in cancer cells has been observed to enhance various signaling pathways, including those associated with epidermal growth factor receptors (EGFR) and integrins [43–45], and to augment adhesion, migration/invasion, and immune interactions within the tumor microenvironment. Studies have consistently reported altered expression of de novo pathway genes, including TSTA3/FX, and functional sensitivity of malignant phenotypes to fucosylation perturbation [46, 47]. However, fucosylation biology is context dependent: both upregulation and deficiency of pathway components have been linked to tumor-immune escape or inflammation-associated tumorigenesis in different settings [48, 49].
Consistent with this metabolic/glycan axis, a triple-negative breast cancer (TNBC) study identified a glycolysis-related five-gene prognostic signature that included GFUS and stratified overall survival independently of clinical factors [50]. ARHGAP8, a Rho GTPase regulator, plays a critical role in cytoskeletal rearrangement and cell motility, processes crucial for cancer cell invasion and metastasis. Cross-cancer evidence further implicates ARHGAP8: in colorectal cancer, it appears in risk/signature gene sets linked to progression and survival, and in gastric cancer it is part of a gut microbiota-related nine-gene signature with prognostic discrimination in discovery and validation cohorts [51–53]. NBL1 is involved in the BMP/TGF-β signaling pathway, which suggests a tumor-suppressive function that is disrupted during malignant transformation. In PCa specifically, a miRNA–mRNA network study reported NBL1 among hub genes (matched to core miRNAs such as hsa-miR-106b-5p/-17-5p/-183-5p), with tumor tissues showing lower NBL1 expression and worse prognosis; a nomogram integrating the risk score with Gleason achieved BCR-AUCs of 0.713, 0.732, and 0.753 at 1, 3, and 5 years, respectively [54]. ACTB, a core cytoskeletal protein (housekeeping) gene, showed differential expression in our analysis; however, this may reflect global cytoskeletal remodeling or reference gene/normalization bias rather than a gene-specific oncogenic program. Therefore, we interpret ACTB as a context-dependent marker and recommend validation with alternative reference genes and orthogonal assays (e.g., qPCR/proteomics) [55–57].
Moreover, functional enrichment analyses suggested a potential biological relevance for these biomarkers, indicating enrichment (BH-FDR ≤ 0.05) in pathways such as PI3K–Akt, JAK–STAT, and NF-κB. However, these results remain exploratory and require further validation. These pathways are integral to cancer progression, influencing critical processes such as cell proliferation, survival, and therapy resistance. By linking gene-level findings with pathway-level insights, our study adds biological plausibility to the identified biomarkers and suggests that they may not only function as prognostic markers but also contribute to broader oncogenic networks. This is aligned with cross-cancer evidence that glycolytic reprogramming and fucose metabolism (via GFUS/TSTA3) contribute to invasion, immune evasion, chemoresistance, and risk stratification in TNBC cohorts [50]. In parallel, the recurrent appearance of ARHGAP8 across CRC and microbiome-informed gastric cancer signatures supports the notion that cytoskeletal and microenvironmental programs cooperate with metabolic axes to shape prognosis [41, 42, 58–61].
Prognostic validation and clinical implications
The prognostic value of these genes was further validated through Kaplan-Meier survival analyses, which demonstrated that the differential expression of GFUS, NBL1, ARHGAP8, and ACTB could effectively stratify patients into distinct prognostic groups. These results align with the individual gene functions and suggest that the signature captures coordinated biological themes rather than isolated statistical signals.
From a clinical perspective, the signature demonstrated strong diagnostic performance (hybrid RF–LightGBM, AUC ≈ 0.90) for discriminating tumor from normal tissue and retained performance in the independent validation dataset (GSE46602), indicating reproducibility beyond the discovery data. Given these cross-tumor signals, future work should test whether ARHGAP8 can predict therapy response in PCa (e.g., androgen-axis or chemo-sensitization settings), analogous to observations in rectal cancer. In a separate analysis, we observed early indications of prognostic relationships in survival curves: for example, cases with elevated GFUS and ARHGAP8 appeared to have shorter disease-free survival, while cases with elevated NBL1 appeared to have better outcomes. However, due to limited survival data in PRAD, we consider endpoints such as progression-free interval (PFI) and biochemical relapse rate (BCR) to be more appropriate. Some results did not fully reach statistical significance after multiple testing correction, so we present them as hypothesis-generating findings. Combining our gene-level findings with pathway-level data, we suggest that the biomarkers we identified function not only as independent prognostic factors but also as components of broader oncogenic networks, which is biologically plausible.
To address the multiple comparison problem, we corrected the log-rank p values obtained from Kaplan–Meier analyses using the Benjamini–Hochberg method. After correction, GFUS and ARHGAP8 remained significant (FDR < 0.1), but NBL1 and ACTB, despite showing biologically plausible trends, did not pass the statistical threshold (FDR > 0.05). Therefore, we consider these findings exploratory and hypothesis-generating. A significant limitation of our study was the lack of complete clinical information (e.g., PSA level, Gleason score, and disease stage) in the open-access datasets we used. This prevented us from performing multivariate Cox regression. Future studies should include patient groups with more detailed clinical data and examine whether these genes have prognostic value independent of standard clinical parameters using multivariate models.
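The Benjamini–Hochberg adjustment mentioned above can be written in a few lines. The sketch below applies the standard step-up procedure to four hypothetical log-rank p values; these numbers are placeholders and are not the p values obtained in this study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (FDR) in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes, for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_fdr = benjamini_hochberg(toy_p)
print(toy_fdr)
```

Each adjusted value is then compared with the chosen FDR threshold to decide which genes remain significant after correction.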
Limitations and future research
The research demonstrates that the integration of bioinformatics and machine learning offers an effective means of identifying molecular biomarkers in prostate cancer. Through the application of computational techniques to extensive gene expression datasets, we have identified biologically meaningful patterns that may go unnoticed with conventional laboratory methods. This approach has not only accelerated biomarker discovery but has also produced reproducible results that can drive future precision oncology research.
Nevertheless, the investigation has certain limitations. The research is based on retrospective data extracted from publicly available repositories, and the findings have not yet been subjected to experimental validation. Furthermore, expression microarrays do not allow direct identification of genomic fusions and may under-represent copy-number–driven events at the mRNA level; thus, the TMPRSS2–ERG fusion status could not be assessed, and PTEN deletion might not be evident as considerable transcript-level fluctuations. Another constraint is that no threshold optimization algorithm was applied for class delineation. Because the study integrates gene expression profiles generated from different microarray platforms, cross-platform harmonization is inherently challenging and may introduce technical variability. To minimize this risk, we applied a multi-step normalization strategy including background correction, log2 transformation, quantile normalization, probe-to-gene harmonization, and ComBat batch adjustment. These procedures reduce platform-specific noise but also compress variance, which can lead to a more conservative set of DEGs. Consequently, only signals that remain stable across platforms tend to survive the FDR threshold, reducing false-positive discoveries. For interpretation, this means that statistically significant pathways and genes represent robust, cross-study–consistent signals, whereas nominal findings are treated as exploratory. Overall, while cross-platform integration may attenuate weaker biological effects, the harmonization pipeline increases the reliability and reproducibility of the signals that remain. The models were developed on previously defined tumor and normal labels from the GEO datasets, and the ROC analysis was employed solely for performance evaluation (AUC, sensitivity, specificity).
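Two of the normalization steps named above, log2 transformation and quantile normalization, can be sketched on a toy matrix. This is an illustrative simplification that assumes no tied values within a sample and does not cover background correction, probe-to-gene harmonization, or ComBat; the function name and the input values are hypothetical.

```python
import math


def quantile_normalize(samples):
    """Give every sample the same value distribution.

    `samples` is a list of equal-length lists, one list of expression
    values per sample. Ties within a sample are not treated specially.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    # Mean of the r-th smallest value across samples defines the reference distribution.
    rank_means = [
        sum(sample[rank] for sample in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n_genes), key=lambda gene: sample[gene])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical raw intensities: two samples, three genes.
raw = [[8.0, 64.0, 16.0], [32.0, 4.0, 128.0]]
logged = [[math.log2(value) for value in sample] for sample in raw]
normalized = quantile_normalize(logged)
print(normalized)
```

After this step each sample contains the same set of values, assigned according to the within-sample rank of each gene, which is why quantile normalization compresses between-sample variance.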
In future work, we aim to implement ROC-based threshold optimization techniques such as Youden’s J statistic to determine the ideal decision boundaries and to enhance the explicability of our models even further. Furthermore, the weakness of several enrichment terms after FDR correction is a direct consequence of the analytical stringency applied in this study. The combination of strict DEG thresholds (|log₂FC| > 1 and FDR < 0.05), ComBat batch adjustment, and cross-platform harmonization substantially reduced the number of retained DEGs (n = 73), which inherently limits statistical power in pathway enrichment tests. As a result, only the strongest and most consistent biological signals remained significant after correction, while nominal pathways were interpreted as exploratory rather than conclusive. This reflects a methodological expectation of high-stringency preprocessing rather than a deficiency in the enrichment approach.
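As an illustration of the threshold optimization proposed for future work, Youden's J statistic (sensitivity + specificity − 1) can be evaluated at each candidate cutoff and the cutoff with the largest J selected. The sketch below uses hypothetical classifier scores and is not part of the analyses reported in this study.

```python
def youden_threshold(labels, scores):
    """Return the cutoff maximizing J = sensitivity + specificity - 1.

    A sample is called positive when its score is >= the cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical scores: four normal samples (0) followed by four tumor samples (1).
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.10, 0.20, 0.35, 0.60, 0.40, 0.50, 0.80, 0.90]
cutoff, j_stat = youden_threshold(toy_labels, toy_scores)
print(cutoff, j_stat)
```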
Our computational method, although powerful, still requires experimental verification to confirm that the results have biological and clinical importance. Future research will include molecular testing of prostate tissue and patient serum samples through qPCR, immunohistochemistry, and proteomic analyses to determine whether the discovered biomarkers remain relevant for diagnosis and prognosis in a clinical environment. From a translational viewpoint, the four-gene signature identified in the present study (GFUS, ARHGAP8, NBL1, and ACTB) might help in early diagnosis, patient stratification, and tailored treatment planning for prostate cancer. A combination of such molecular profiles with standard clinical parameters could lead to improved diagnostic specificity, fewer unnecessary biopsies, and better therapeutic decision-making.
We made a qualitative comparison of the established prostate cancer biomarkers (PSA, PCA3, PHI, ExoDx, and Oncotype DX GPS) with our machine learning–selected genes (Supplementary Table S1) in order to give a wider clinical perspective. Moreover, the expression microarrays’ inherent limitations were acknowledged in the light of studying canonical biomarker behavior. Besides ROC-based threshold optimization, we also did a qualitative cross-check of the established biomarkers in relation to our ML-selected genes (Supplementary Table S2), which revealed why markers like PTEN or IGF1/IGF1R did not become statistically significant as DEGs or reach the ML-importance thresholds, even though their pathways were still enriched. This method of cross-referencing brings forward not only the interpretability but also the clinical relevance of our modeling approach.
The external GFUS evidence referenced here is mainly from studies on triple-negative breast cancer (TNBC), whereas the ARHGAP8 evidence comes from colorectal, gastric, and rectal cancers; thus, cross-tumor extrapolations should be interpreted cautiously and validated in prostate cancer-specific cohorts. The miRNA–mRNA nomogram constructed in the cited study [54] used ridge regression and focused on BCR endpoints; therefore, direct comparison against our GCFS-based classifier in matched datasets will be needed to determine the difference in clinical utility.
In conclusion, this research has identified GFUS, NBL1, ARHGAP8, and ACTB as the most promising biomarker candidates for prostate cancer, opening new opportunities in diagnostics and prognostics. The diverse roles of these genes, including glycosylation, regulation of the cytoskeleton, and modulation of signaling, suggest their possible significance in tumor development and spread. Together, these results lay the groundwork for additional translational studies and suggest the potential for developing new treatments for prostate cancer. Furthermore, the use of GCFS as a network-aware feature selection method complements traditional DEG-based pipelines and enhances the biological plausibility of the discovered biomarkers in PCa.
Conclusion
The present study underlines the value of integrating bioinformatics methods with machine learning techniques for the discovery of new prostate cancer biomarkers and therapeutic targets. Advanced computational algorithms were used to identify four important hub genes (GFUS, NBL1, ARHGAP8, and ACTB) that are closely associated with patient prognosis. Biologically significant patterns were extracted from extensive genetic datasets using machine learning models, demonstrating the importance of these models in identifying potential biomarkers.
Our pathway enrichment analysis suggests that the reported genes are involved in pathways important for cancer. Several terms, such as PI3K-Akt, JAK-STAT, and NF-κB, met BH-FDR ≤ 0.05, while others showed only nominal signals. Because the number of FDR-significant terms is limited, the results should be regarded as exploratory and hypothesis-generating rather than as definitive mechanistic evidence. Accordingly, pathway-level interpretations are limited to statistically significant findings, and no causation is implied; validation in prostate cancer-specific cohorts using orthogonal methods and functional assays will be necessary to confirm these associations.
The results of this research not only extend our knowledge of the biology of prostate cancer but also highlight the possible diagnostic and prognostic value of the biomarkers and their potential role in the development of targeted therapies. However, to establish the clinical usefulness of the biomarkers, validation in larger patient cohorts, experimental studies, and clinical trials is required so that their reliability and real-world relevance can be confirmed. Because the current study was based solely on in-silico methods, future research should integrate experimental validation. The code is available at https://github.com/serhatklc/prostate_v1.
Acknowledgements
This research was supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R751), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. This study was additionally supported by the TÜBİTAK Scientist Support Programs Presidency (BİDEB) under the 2211–Domestic Graduate Scholarship Program and is based on the doctoral dissertation conducted at Çanakkale Onsekiz Mart University.
Institutional review board statement
No ethical approval was required for this study, as it used publicly available datasets from the NCBI GEO database and did not involve human or animal subjects.
Author contributions
Data curation, Sabire KILICARSLAN; Formal analysis, Dina Hassan; Funding acquisition, Nagwan Samee; Methodology, Sabire KILICARSLAN; Resources, Sabire KILICARSLAN; Software, Meliha CICEKLIYURT and Serhat KILIÇARSLAN; Supervision, Dina Hassan and Nagwan Samee; Writing – original draft, Sabire KILICARSLAN, Meliha CICEKLIYURT, Serhat KILIÇARSLAN and Nagwan Samee.
Funding
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R751).
Data availability
The datasets analyzed in this study are publicly available at the NCBI GEO database: GSE3325: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3325; GSE6919: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6919; GSE55945: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55945; GSE26910: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26910; GSE46602: https://www.ncbi.nlm.nih.gov/search/all/?term=GSE46602.
Declarations
Ethics approval and consent to participate
This study did not require ethical approval because it used only publicly available transcriptomic datasets obtained from the NCBI Gene Expression Omnibus (GEO) database. No new experiments involving human participants or animals were conducted. The study did not involve human participants, and all analyzed data were derived from publicly available repositories.
Consent for publication
All authors have reviewed and approved the final version of the manuscript and consent to its publication.
Competing interests
The authors declare no competing interests.
Informed consent
The study did not involve human participants, and all data used are anonymized and publicly accessible.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Wang L, Lu B, He M, Wang Y, Wang Z, Du L. Prostate Cancer Incidence and Mortality: Global Status and Temporal Trends in 89 Countries From 2000 to 2019, Front. Public Health, vol. 10, Feb. 2022, 10.3389/fpubh.2022.811044
- 2. Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin. 2023;73(1):17–48. 10.3322/caac.21763.
- 3. Wang Y, et al. Identification of metastasis-related genes for predicting prostate cancer diagnosis, metastasis and immunotherapy drug candidates using machine learning approaches. Biol Direct. June 2024;19(1):50. 10.1186/s13062-024-00494-x.
- 4. Haffner MC, et al. Genomic and phenotypic heterogeneity in prostate cancer. Nat Rev Urol. Feb. 2021;18(2):79–92. 10.1038/s41585-020-00400-w.
- 5. Cooney KA. Inherited predisposition to prostate cancer: from gene discovery to clinical impact. Trans Am Clin Climatol Assoc. 2017;128:14.
- 6. Brandão A, Paulo P, Teixeira MR. Hereditary predisposition to prostate cancer: from genetics to clinical implications. Int J Mol Sci. July 2020;21(14):5036. 10.3390/ijms21145036.
- 7. Brandao A, Paulo P, Teixeira MR. Hereditary predisposition to prostate cancer: from genetics to clinical implications. Int J Mol Sci. 2020;21(14):5036.
- 8. Hatano K, Nonomura N. Genomic profiling of prostate cancer: an updated review. World J Men’s Health. July 2021;40(3):368. 10.5534/wjmh.210072.
- 9. Raghallaigh HN, Bott SR. The Role of Family History and Germline Genetics in Prostate Cancer Disease Profile and Screening, in Urologic Cancers, N. Barber and A. Ali, Eds., Brisbane (AU): Exon Publications, 2022. Accessed: Oct. 24, 2024. [Online]. Available: http://www.ncbi.nlm.nih.gov/books/NBK585972/
- 10. Sekhoacha M, Riet K, Motloung P, Gumenku L, Adegoke A, Mashele S. Prostate cancer review: Genetics, Diagnosis, treatment Options, and alternative approaches. Molecules. Sept. 2022;27(17):5730. 10.3390/molecules27175730.
- 11. Mottet N, et al. EAU-EANM-ESTRO-ESUR-SIOG Guidelines on Prostate Cancer-2020 Update. Part 1: Screening, Diagnosis, and local treatment with curative intent. Eur Urol. Feb. 2021;79(2):243–62. 10.1016/j.eururo.2020.09.042.
- 12. Stamey TA, Yang N, Hay AR, McNeal JE, Freiha FS, Redwine E. Prostate-specific antigen as a serum marker for adenocarcinoma of the prostate, N Engl J Med, vol. 317, no. 15, pp. 909–916, Oct. 1987, 10.1056/NEJM198710083171501
- 13. Duffy MJ. Biomarkers for prostate cancer: prostate-specific antigen and beyond. Clin Chem Lab Med. Feb. 2020;58(3):326–39. 10.1515/cclm-2019-0693.
- 14. David MK, Leslie SW. Prostate-Specific Antigen, in StatPearls, Treasure Island (FL): StatPearls Publishing, 2024. Accessed: Oct. 24, 2024. [Online]. Available: http://www.ncbi.nlm.nih.gov/books/NBK557495/
- 15. Aktan C, et al. Transcriptomic profile of perineural invasion in prostate cancer identifies prognostic gene signatures. Biomedicines. 2025;13(8):1789.
- 16. Kilicarslan S, Hiz-Cicekliyurt MM. Identification of potential biomarkers of papillary thyroid carcinoma, Endocrine, vol. 87, no. 2, pp. 758–771, Oct. 2024, 10.1007/s12020-024-04068-9
- 17. Chen J-Y, et al. Biomarkers for prostate cancer: from diagnosis to treatment. Diagnostics. 2023;13(21):3350. 10.3390/diagnostics13213350.
- 18. Chandran UR, et al. Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process. BMC Cancer. Dec. 2007;7(1):64. 10.1186/1471-2407-7-64.
- 19. Varambally S, et al. Integrative genomic and proteomic analysis of prostate cancer reveals signatures of metastatic progression. Cancer Cell. 2005;8(5):393–406.
- 20. Planche A, et al. Identification of prognostic molecular features in the reactive stroma of human breast and prostate cancer. PLoS ONE. 2011;6(5):e18640.
- 21. Arredouani MS, et al. Identification of the transcription factor single-minded homologue 2 as a potential biomarker and immunotherapy target in prostate cancer. Clin Cancer Res. 2009;15(18):5794–802.
- 22. Smyth GK. Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments, Statistical Applications in Genetics and Molecular Biology, vol. 3, no. 1, pp. 1–25, Jan. 2004, 10.2202/1544-6115.1027
- 23. Bei Y, Hong P. A novel approach to minimize false discovery rate in genome-wide data analysis. BMC Syst Biol. 2013;7(4):S1. 10.1186/1752-0509-7-S4-S1.
- 24. Yu G, Wang L-G, Han Y, He Q-Y. ClusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. May 2012;16(5):284–7. 10.1089/omi.2011.0118.
- 25. Wickham H. Getting Started with ggplot2, in ggplot2, in Use R! Cham: Springer International Publishing, 2016, pp. 11–31. 10.1007/978-3-319-24277-4_2
- 26. Wickham H. ggplot2. WIREs Comput Stats. Mar. 2011;3(2):180–5. 10.1002/wics.147.
- 27. Kanehisa M, et al. KEGG for linking genomes to life and the environment, Nucleic acids research, vol. 36, no. suppl_1, pp. D480–D484, 2007.
- 28. Rojas R. AdaBoost and the super bowl of classifiers a tutorial introduction to adaptive boosting. Freie Univ Berlin Tech Rep. 2009;1(1):1–6.
- 29. Schapire RE. Explaining adaboost. In: Schölkopf B, Luo Z, Vovk V, editors. Empirical inference. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013. pp. 37–52. 10.1007/978-3-642-41136-6_5.
- 30. Rigatti SJ. Random forest. J Insur Med. 2017;47(1):31–9.
- 31. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
- 32. An-na W, Yue Z, Yun-tao H, Yun-lu LI. A novel construction of SVM compound kernel function, in 2010 International conference on logistics systems and intelligent management (ICLSIM), IEEE, 2010, pp. 1462–1465.
- 33. Kilicarslan S, Hiz-Cicekliyurt MM. Identification of potential biomarkers of papillary thyroid carcinoma. Endocrine. Oct. 2024. 10.1007/s12020-024-04068-9.
- 34. Adem K, Kiliçarslan S, Cömert O. Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification. Expert Syst Appl. Jan. 2019;115:557–64. 10.1016/j.eswa.2018.08.050.
- 35. Kiliçarslan S, Dönmez E. Improved multi-layer hybrid adaptive particle swarm optimization based artificial bee colony for optimizing feature selection and classification of microarray data, Multimed Tools Appl, vol. 83, no. 26, pp. 67259–67281, Oct. 2023, 10.1007/s11042-023-17234-4
- 36. Ding X, Liu J, Yang F, Cao J. Random radial basis function kernel-based support vector machine. J Franklin Inst. 2021;358(18):10121–40.
- 37. Pandya R, Pandya J. C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. Int J Comput Appl. 2015;117(16):18–21.
- 38. Suthaharan S. Decision tree learning. Machine learning models and algorithms for big data classification. Integrated Series in Information Systems, vol. 36. Boston, MA: Springer US; 2016. pp. 237–69. 10.1007/978-1-4899-7641-3_10.
- 39. Korpal M, Jaddou N. Downstream Outcomes of Elevated Prostate-Specific Antigen (PSA) Detected Through Routine Screening in Men Aged 55–70: A 10-Year Retrospective Study, Cureus, vol. 17, no. 8, 2025, Accessed: Oct. 22, 2025. [Online]. Available: https://www.cureus.com/articles/396379-downstream-outcomes-of-elevated-prostate-specific-antigen-psa-detected-through-routine-screening-in-men-aged-55-70-a-10-year-retrospective-study.pdf
- 40. Butler W, Huang J. Glycosylation changes in prostate cancer progression. Front Oncol. Dec. 2021;11. 10.3389/fonc.2021.809170.
- 41.Becker DJ, Lowe JB. Fucose: biosynthesis and biological function in mammals, Glycobiology, vol. 13, no. 7, pp. 41R-53R, 2003. [DOI] [PubMed]
- 42.Schneider M, Al-Shareffi E, Haltiwanger RS. Biological functions of fucose in mammals. Glycobiology. 2017;27(7):601–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Miyoshi E, Moriwaki K, Nakagawa T. Biological function of fucosylation in cancer biology. J BioChem. 2008;143(6):725–9. [DOI] [PubMed] [Google Scholar]
- 44.Shan M, Yang D, Dou H, Zhang L. Fucosylation in cancer biology and its clinical applications. Prog Mol Biol Transl Sci. 2019;162:93–119. [DOI] [PubMed] [Google Scholar]
- 45.Taniguchi N, Kizuka Y. Glycans and cancer: role of N-glycans in cancer biomarker, progression and metastasis, and therapeutics. Adv Cancer Res. 2015;126:11–51. [DOI] [PubMed] [Google Scholar]
- 46.Zhou Y, et al. Inhibition of fucosylation by 2-fluorofucose suppresses human liver cancer HepG2 cell proliferation and migration as well as tumor formation. Sci Rep. 2017;7(1):11563. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Matsumoto K et al. Aug., N -Glycan fucosylation of epidermal growth factor receptor modulates receptor activity and sensitivity to epidermal growth factor receptor tyrosine kinase inhibitor, Cancer Science, vol. 99, no. 8, pp. 1611–1617, 2008, 10.1111/j.1349-7006.2008.00847.x [DOI] [PMC free article] [PubMed]
- 48.Wang Y, et al. Fucosylation deficiency in mice leads to colitis and adenocarcinoma. Gastroenterology. 2017;152(1):193–205.
- 49.Moriwaki K, et al. Deficiency of GMDS leads to escape from NK cell-mediated tumor surveillance through modulation of TRAIL signaling. Gastroenterology. 2009;137(1):188–98.
- 50.Zheng J, et al. Identification and validation of a novel glycolysis-related gene signature for predicting the prognosis and therapeutic response in triple-negative breast cancer. Adv Ther. 2023;40(1):310–30. 10.1007/s12325-022-02330-y.
- 51.Hossain MJ, et al. Machine learning and network-based models to identify genetic risk factors to the progression and survival of colorectal cancer. Comput Biol Med. 2021;135:104539.
- 52.Yue Q, Han W, Liu Z-L. Nine-gene prognostic signature related to gut microflora for predicting the survival in gastric cancer patients. Turkish J Gastroenterol. 2024;35(2):102.
- 53.Xi Y-N, et al. Application of ARHGAP8 in predicting the efficacy of neoadjuvant chemotherapy for locally advanced mid-low rectal cancer. Zhongguo Yi Xue Ke Xue Yuan Xue Bao Acta Academiae Medicinae Sinicae. 2024;46(4):528–38.
- 54.Su Q, Dai B, Zhang S. Construction of miRNA-mRNA network and a nomogram model of prognostic analysis for prostate cancer. Translational Cancer Res. 2022;11(8):2562.
- 55.Guo C, Liu S, Wang J, Sun M-Z, Greenaway FT. ACTB in cancer. Clin Chim Acta. 2013;417:39–44.
- 56.Izdebska M, Zielińska W, Hałas-Wiśniewska M, Grzanka A. Involvement of actin and actin-binding proteins in carcinogenesis. Cells. 2020;9(10):2245.
- 57.Gu Y, et al. A pan-cancer analysis of the prognostic and immunological role of β-actin (ACTB) in human cancers. Bioengineered. 2021;12(1):6166–85. 10.1080/21655979.2021.1973220.
- 58.Wang C-C, Chen X, Qu J, Sun Y-Z, Li J-Q. RFSMMA: a new computational model to identify and prioritize potential small molecule–miRNA associations. J Chem Inf Model. 2019;59(4):1668–79. 10.1021/acs.jcim.9b00129.
- 59.Wang C-C, Han C-D, Zhao Q, Chen X. Circular RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2021;22(6).
- 60.Wang C-C, Li T-H, Huang L, Chen X. Prediction of potential miRNA–disease associations based on stacked autoencoder. Brief Bioinform. 2022;23(2):bbac021.
- 61.Wang C-C, Zhu C-C, Chen X. Ensemble of kernel ridge regression-based small molecule–miRNA association prediction in human disease. Brief Bioinform. 2022;23(1).
Data Availability Statement
The datasets analyzed in this study are publicly available at the NCBI GEO database:
- GSE3325: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3325
- GSE6919: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6919
- GSE55945: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55945
- GSE26910: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26910
- GSE46602: https://www.ncbi.nlm.nih.gov/search/all/?term=GSE46602















