Expression graph network framework for biomarker discovery

Yang Liu; Jason Huse; Kasthuri Kannan

doi:10.1093/bib/bbaf559

2025 Oct 27;26(5):bbaf559. doi: 10.1093/bib/bbaf559


PMCID: PMC12554635 PMID: 41139924

Abstract

Biomarker discovery for complex diseases, such as cancer, hinges on uncovering molecular signatures that capture intricate, interconnected relationships within biological data—a challenge that traditional statistical and machine learning methods often fail to meet due to the complexity of high-dimensional gene expression profiles. To overcome this, we introduce the expression graph network framework (EGNF). This cutting-edge graph-based approach integrates graph neural networks with network-based feature engineering to enhance the predictive identification of biomarkers. EGNF constructs biologically informed networks by combining gene expression data and clinical attributes within a graph database, utilizing hierarchical clustering to generate dynamic, patient-specific representations of molecular interactions. Leveraging graph learning techniques, including graph convolutional networks and graph attention networks, our framework identifies statistically significant and biologically relevant gene modules for classification. Validated across three independent datasets consisting of contrasting tumor types and clinical scenarios, EGNF consistently outperforms traditional machine learning models, achieving superior classification accuracy and interpretability. Notably, it delivers perfect separation between normal and tumor samples while excelling in nuanced tasks such as classifying disease progression and predicting treatment outcomes. This scalable, interpretable, and robust framework provides a powerful tool for biomarker discovery, with wide-ranging applications in precision medicine and the elucidation of disease mechanisms across diverse clinical contexts.

Keywords: biomarker, graph neural network, EGNF

Introduction

Classification problems constitute the cornerstone of machine learning, with the paramount objective of categorizing observations into discrete classes based on their attributes. This process is especially critical in healthcare, where precise classification has a direct impact on diagnosis, treatment strategies, and patient outcomes. Machine learning has revolutionized traditional analytical methods, offering innovative approaches to process and interpret complex datasets. The application of these techniques to biomarker discovery has emerged as a particularly promising domain, providing researchers with powerful tools to identify molecular indicators that can predict disease states, progression, and treatment response.

Cancer represents a complex disease characterized by the accumulation of genetic and epigenetic alterations that drive dysregulated gene expression, metabolic reprogramming, and aberrant cellular signaling pathways [1]. The molecular heterogeneity of cancer presents significant challenges for accurate classification and biomarker discovery, as tumors exhibit distinct gene expression profiles that vary both between and within cancer types. This complexity arises from the interconnected nature of biological pathways, where alterations in one molecular component can cascade through multiple regulatory networks, affecting cellular processes ranging from DNA repair mechanisms to immune response modulation. Traditional analytical approaches that treat genes as independent entities often fail to capture these intricate molecular interactions, highlighting the need for sophisticated methodologies that can model the network-based nature of cancer biology.

Isocitrate dehydrogenase-wildtype (IDH-wt) glioblastoma exemplifies these challenges as the most aggressive and heterogeneous primary brain tumor in adults. Recent studies have revealed that IDH-wt glioblastomas exhibit profound molecular diversity with distinct gene expression subtypes that correlate with different clinical outcomes and treatment responses [2, 3]. The tumor microenvironment further contributes to this complexity, with significant intratumoral heterogeneity observed at the single-cell level, where different cellular populations within the same tumor display varied transcriptional programs [4]. This molecular complexity makes accurate subtype classification crucial for clinical decision-making, yet conventional machine learning approaches often struggle to identify robust biomarkers that can reliably distinguish between these molecular subtypes due to their inability to account for the interconnected nature of dysregulated pathways in glioblastoma pathogenesis.

Graph-based learning approaches have garnered significant attention in biomedical research due to their distinctive ability to model intricate relationships between biological entities. Unlike traditional machine learning methods that treat samples as independent observations, graph-based approaches leverage the inherent interconnectedness of biological data, capturing relationships that might otherwise remain obscured. This capability is particularly valuable in biomarker discovery, where understanding the interactions between molecules can provide more profound insights into disease mechanisms than analyzing individual features in isolation.

Cluster-based feature engineering is a powerful technique for enhancing machine learning models by grouping similar data points based on shared attributes, thereby enabling the extraction of meaningful patterns. One widely used method is the k-means clustering algorithm, which clusters data points based on proximity in feature space [5]. K-means clustering has been employed in various feature engineering applications to enhance model performance by identifying relationships between neighboring data points. By analyzing interrelationships within these clusters, models can make more informed predictions, as interactions between neighboring data points often reveal underlying data structures [6]. Unlike dimensionality reduction techniques, cluster-based methods like k-means clustering focus on enhancing model performance by capitalizing on these interrelationships rather than reducing feature count. Hierarchical clustering extends this concept by creating nested clusters at different levels of similarity, which aligns conceptually with the multi-level nature of biological interactions and provides a natural framework for discovering biomarker relationships across different scales of biological organization.

Among various classification techniques, traditional methods like logistic regression remain notable for their simplicity and interpretability in biomarker studies. Despite their widespread use, these methods often fail to capture the complex, non-linear relationships present in biological systems, thereby constraining their utility for comprehensive biomarker discovery. While support vector machines (SVMs), random forest models, and elastic net regression offer more robust approaches to biomarker discovery, especially in high-dimensional settings, these methods primarily operate on tabular data and do not inherently account for the network structure of biological systems. Random forest models, though capable of capturing non-linear interactions through their ensemble of decision trees, still treat features independently and cannot directly incorporate the relational information encoded in biological networks.

Graph neural networks (GNNs) have emerged as a powerful class of models designed to advance biomarker discovery by leveraging graph-structured data prevalent in biological applications. GNNs capture complex relationships between biological entities represented as nodes, along with their interactions defined by edges. Among the most prominent GNN architectures are graph convolutional networks (GCNs) and graph attention networks (GATs). GCNs extend convolutional neural networks to graph data by leveraging the adjacency structure, enabling efficient information propagation among connected features [7]. GATs enhance this capability by introducing attention mechanisms that allow the model to dynamically weigh the importance of different node neighbors [8]. More recent architectures further improve performance through positional and structural encodings that better capture the graph topology [9].

The integration of multi-omics data through graph-based approaches has significantly enhanced the identification of clinically relevant biomarkers. Previous studies have demonstrated superior performance in patient classification and biomarker identification compared to conventional methods. For instance, Wang et al. [10] introduced MOGONET, a framework integrating multi-omics data using GCNs. Similarly, Ramirez et al. [11] applied GCNs to cancer classification, showcasing the potential of graph-based approaches in distinguishing between cancer types based on gene expression data. These studies highlight the power of graph-based methods in capturing the complex interplay between different biological layers.

Existing graph-based approaches for biomarker discovery typically rely on established biological networks, such as protein-protein interaction networks or co-expression networks. The weighted gene co-expression network analysis (WGCNA) developed by Langfelder and Horvath [12] has been widely used to identify modules of co-expressed genes that may serve as potential biomarkers. Building upon this foundation, recent approaches have integrated deep learning architectures with network-based feature extraction to enhance biomarker identification. For instance, Yu et al. [13] proposed iHofman, which combines hierarchical autoencoders with weighted attention mechanisms for circRNA–miRNA interaction prediction, demonstrating how attention-based fusion of sequence and structural features can improve predictive performance. Similarly, advanced GNN approaches have incorporated multi-level attention mechanisms and feature fusion strategies to capture complex biological relationships [14, 15]. These multi-level attention graph neural networks based on co-expression gene modules have shown promise for disease diagnosis and prognosis, demonstrating improved predictive performance and interpretability over traditional network-based methods.

Despite these advances, a critical limitation of existing approaches is that they are not specifically tailored for tissue sample classification in biomarker discovery. Most methods rely on predefined biological networks, which may not accurately reflect the specific relationships relevant to the disease or condition under investigation. Additionally, these approaches often struggle to handle datasets with varying sample sizes, limiting their applicability across different clinical contexts.

This research advances machine learning in biomedical applications through two primary contributions. First, we develop the expression graph network framework (EGNF), a novel methodology integrating network generation with GCNs and GATs for gene expression-based classification. EGNF leverages deep learning to enhance the extraction of complex patterns and relationships from gene expression data, significantly improving classification accuracy. The generated networks capture intricate relationships between samples and features, adaptively configuring to different sample sizes while preserving biological relevance. Our approach uniquely employs hierarchical clustering to identify meaningful biological relationships, providing a natural bridge between conventional cluster-based feature engineering and advanced graph-based learning methods.

Second, we develop a biologically meaningful network-based feature selection method specifically designed for gene expression data. By combining network analysis with conventional statistical techniques, our method identifies gene modules that are both statistically significant and biologically relevant, offering more profound insights into disease progression mechanisms. This approach reduces data complexity while maintaining predictive power, ultimately improving the interpretability of machine learning models.

Together, these contributions offer several advantages over existing methods: (i) they enable more accurate patient/sample stratification by leveraging complex patterns encoded in generated graphs; (ii) they provide insights into biological mechanisms underlying disease states by highlighting important connections between biomarkers; (iii) they facilitate the integration of multi-modal data, capturing relationships spanning different biological domains; and (iv) they demonstrate robust performance across different datasets and disease types, suggesting broad applicability in precision medicine.

Materials and methods

Overview of EGNF

Our methodological framework consists of several sequential analytical stages (Fig. 1). Initially, we performed differential expression analysis on 80% of the data using DESeq2 to identify differentially expressed genes [16]. Using this training data, we constructed a graph network by selecting extreme sample clusters with high or low median expression for each group (unpaired method), or group ratio values (paired method) from one-dimensional hierarchical clustering as nodes and establishing connections between sample clusters of different genes through shared samples. We then conducted graph-based feature selection considering three criteria: node degrees, gene frequency within communities, and inclusion in known biological pathways. The selected features were then used to generate sample clusters via one-dimensional hierarchical clustering, which served as nodes for building the prediction network. In the final stage, we utilized GNNs for sample-specific graph-based predictions, where each sample was represented by a corresponding subgraph structure.

Figure 1. Framework for network-based feature selection. Alt text: Framework showing how gene expression data are converted into graphs, used to generate biomarker features, and evaluated by prediction tasks.

This study used open-source libraries for biomarker discovery, including PyTorch Geometric for GNN model development and Neo4j with its Graph Data Science (GDS) library for network analysis. All algorithm development and validation phases were executed within this computational framework to ensure reproducibility and scalability.

Datasets

This study used three paired gene expression datasets to assess model performance across clinical contexts. The glioma dataset comprised 295 primary and 275 corresponding recurrent tumors from IDH-wt patients, sourced from the Glioma Longitudinal Analysis Consortium and the MD Anderson Cancer Center’s Glioblastoma Moon Shot project, enabling analysis of molecular changes during disease progression. The breast cancer dataset included 111 normal tissue specimens paired with 113 matched tumor samples from The Cancer Genome Atlas Program, allowing for the examination of differences between healthy and malignant tissues. The third dataset contained 69 matched pre- and post-treatment samples from HER2-negative breast cancer patients who underwent neoadjuvant chemotherapy with Bevacizumab (GSE87455), obtained from the GEO repository, facilitating analysis of treatment-induced molecular alterations. All datasets exclusively included patients with complete paired gene expression profiles; unpaired samples were excluded from the analysis.

Network generation

After normalization using dataset-specific approaches, we performed gene-wise hierarchical clustering using a bottom-up agglomerative approach with Euclidean distance and median linkage. For paired datasets, we applied Inline graphic normalization followed by rescaling values to a range of 1 to 2 for each sample class, then used the class2/class1 ratios to perform one-dimensional hierarchical clustering. For unpaired datasets, we employed normalization followed by z-normalization for each sample class and conducted hierarchical clustering separately. This median merge provides robust clustering by reducing sensitivity to outliers and extreme values compared to single or complete linkage, enabling the identification of distinct sample clusters (Fig. 2). From the resulting clusters for each gene, we selected the top 10% most extreme clusters based on absolute z-normalized median expression values to serve as nodes in our gene-gene interaction network, constructed using the Neo4j graph database. Each node represents a gene with associated properties including median expression value, gene identifier, sample count, and levels within trees. Edges connect genes sharing samples between clusters, with edge formation controlled by a minimum common sample threshold (default = 1), and each edge annotated with the number of shared samples and tree levels of connected nodes. Sample-specific subgraphs were extracted by identifying all relevant nodes and edges for each sample, providing structured input features for downstream GNN classification tasks. Prior to GNN model training, all node features and edge features were normalized to ensure consistent scaling and optimal convergence during the learning process.
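The paired-data preprocessing described above can be illustrated with a minimal sketch. It assumes that "rescaling values to a range of 1 to 2" means a min-max rescaling of one gene's values within each sample class; the preceding normalization step is omitted, and the function names are hypothetical. One consequence of the 1-to-2 range is that the class2/class1 ratio can never divide by zero and always lies between 0.5 and 2.

```python
import numpy as np


def rescale_1_to_2(x):
    """Min-max rescale a 1-D array to [1, 2] (assumed form of the rescaling)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant gene carries no contrast; map every value to 1.
        return np.ones_like(x)
    return 1.0 + (x - x.min()) / span


def paired_ratios(class1, class2):
    """class2/class1 ratios for matched samples after per-class rescaling."""
    return rescale_1_to_2(class2) / rescale_1_to_2(class1)


# Hypothetical values for one gene in three matched patients.
ratios = paired_ratios(class1=[2.0, 4.0, 6.0], class2=[10.0, 30.0, 20.0])
```

These ratios would then be the one-dimensional input to the hierarchical clustering step.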

Figure 2. Network generation process. Alt text: Stepwise schematic showing clustering of genes, creation of edges between related nodes, and construction of sample-specific subnetworks for input into graph neural network models.
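The clustering and edge-formation steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it uses SciPy in place of the Neo4j graph database, cuts each gene's tree into a fixed number of flat clusters (`n_clusters`, a hypothetical parameter) instead of using every level of the tree, and the function names and toy values are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def extreme_clusters(values, n_clusters=4, top_fraction=0.10):
    """Cluster one gene's per-sample values in 1-D and keep the most extreme clusters.

    Returns a list of (median_value, sample_index_set), most extreme first.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    # Bottom-up agglomerative clustering with Euclidean distance and median linkage.
    tree = linkage(x, method="median", metric="euclidean")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    clusters = []
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        clusters.append((float(np.median(x[members, 0])), set(members.tolist())))
    # Rank by absolute median value and keep the top fraction (at least one cluster).
    clusters.sort(key=lambda c: abs(c[0]), reverse=True)
    keep = max(1, int(round(top_fraction * len(clusters))))
    return clusters[:keep]


def build_edges(nodes, min_common=1):
    """Connect clusters of different genes that share at least min_common samples.

    nodes maps node id -> (gene, sample_index_set).
    Returns a list of (node_a, node_b, n_shared_samples).
    """
    edges = []
    ids = sorted(nodes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            gene_a, samples_a = nodes[a]
            gene_b, samples_b = nodes[b]
            shared = len(samples_a & samples_b)
            if gene_a != gene_b and shared >= min_common:
                edges.append((a, b, shared))
    return edges


# Toy data: 8 samples, 2 genes (hypothetical z-normalized values).
expr = {
    "GENE_A": [0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.1, 10.2],
    "GENE_B": [0.0, 0.1, 0.2, 0.1, 0.0, -9.0, -9.1, 3.0],
}
nodes = {}
for gene, values in expr.items():
    for k, (median, samples) in enumerate(extreme_clusters(values, n_clusters=3)):
        nodes[f"{gene}_c{k}"] = (gene, samples)
edges = build_edges(nodes, min_common=1)
```

In this toy case the high-expression cluster of GENE_A and the low-expression cluster of GENE_B share two samples, so a single edge annotated with that count is created. A sample-specific subgraph would then consist of the nodes whose sample set contains that sample.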

Graph-based feature selection for identifying biomarkers

We developed network-based feature selection methods for identifying significant biomarkers in paired and unpaired datasets. Both approaches begin with dimension reduction using DESeq2, followed by dataset-specific normalization. We then applied the hierarchical clustering and network construction method described above to build the gene-gene interaction graph.

To reduce the impact of indirect, complex relationships among the same gene, we randomly selected a single sample cluster for each gene per graph sampling iteration and repeated this process 10 000 times. Degree centrality algorithms were applied to enumerate degrees for each gene. Modularity optimization algorithms were utilized to delineate gene communities. For each community, we quantified gene frequency and scrutinized whether genes within communities are incorporated in enriched pathways. Bootstrapping analyses further refined marker selection by comparing each marker’s mean expression against the mean expression of all alternative markers, addressing the zero-inflated distribution of degree, frequency, and pathway scores.
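The graph sampling, degree counting, and community detection steps can be sketched as follows. This is a hedged illustration: it uses NetworkX and its greedy modularity algorithm as a stand-in for the Neo4j GDS procedures used in the study, averages degrees over iterations as one plausible way to aggregate them, and all function names and toy data are hypothetical.

```python
import random

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def sample_gene_graph(clusters_by_gene, rng, min_common=1):
    """Pick one sample cluster per gene at random and link genes sharing samples."""
    genes = sorted(clusters_by_gene)
    pick = {g: rng.choice(clusters_by_gene[g]) for g in genes}
    graph = nx.Graph()
    graph.add_nodes_from(genes)
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if len(pick[a] & pick[b]) >= min_common:
                graph.add_edge(a, b)
    return graph


def mean_gene_degrees(clusters_by_gene, n_iter=100, seed=0):
    """Average each gene's degree over repeated graph sampling iterations."""
    rng = random.Random(seed)
    totals = {g: 0 for g in clusters_by_gene}
    for _ in range(n_iter):
        for gene, degree in sample_gene_graph(clusters_by_gene, rng).degree():
            totals[gene] += degree
    return {g: total / n_iter for g, total in totals.items()}


# Toy input: each gene maps to a list of candidate clusters (sets of sample indices).
clusters_by_gene = {
    "G1": [{0, 1}],
    "G2": [{1, 2}],
    "G3": [{0, 2}],
    "G4": [{7, 8}],
    "G5": [{8, 9}],
}
degrees = mean_gene_degrees(clusters_by_gene, n_iter=10)
one_graph = sample_gene_graph(clusters_by_gene, random.Random(0))
communities = greedy_modularity_communities(one_graph)
```

With real data each gene would have several candidate clusters, so the sampled graph, and therefore the degree and community membership of each gene, would vary between iterations.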

To evaluate the impact of pathway enrichment on predictive performance, we constructed a reference marker catalog devoid of pathway enrichment. A comprehensive scoring system was developed to prioritize markers by ranking adjusted p-values according to various filtration criteria and employing a summed rank for definitive selection.
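A summed-rank scoring of this kind can be sketched as below. The sketch assumes one adjusted p-value per gene and per criterion (for example degree, community frequency, and pathway score), ranks each criterion separately with ties sharing their average rank, and selects the genes with the smallest rank sum. The function name, gene names, and p-values are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def rank_sum_selection(genes, adj_pvalues, n_select):
    """Select genes by the sum of their per-criterion p-value ranks.

    adj_pvalues has shape (n_genes, n_criteria); smaller p-values rank first.
    """
    pvals = np.asarray(adj_pvalues, dtype=float)
    rank_sums = np.zeros(pvals.shape[0])
    for column in pvals.T:
        # Ties receive their average rank.
        rank_sums += rankdata(column, method="average")
    order = np.argsort(rank_sums, kind="stable")
    return [genes[i] for i in order[:n_select]], rank_sums


# Three genes scored on three criteria (made-up adjusted p-values).
selected, rank_sums = rank_sum_selection(
    genes=["A", "B", "C"],
    adj_pvalues=[[0.01, 0.20, 0.03],
                 [0.04, 0.01, 0.02],
                 [0.50, 0.60, 0.70]],
    n_select=2,
)
```

Summing ranks instead of raw p-values keeps a single very small p-value on one criterion from dominating the selection.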

Other feature selection methods

Weighted Gene Correlation Network Analysis (WGCNA), Analysis of Variance (ANOVA), and Minimum Redundancy Maximum Relevance (mRMR) are three complementary feature selection methods used to identify informative genes from high-dimensional gene expression data. WGCNA constructs gene co-expression networks by clustering genes based on their correlation patterns, identifying modules of highly correlated genes, and associating them with phenotypic traits to uncover biologically relevant relationships. ANOVA, a statistical approach, determines whether gene expression levels significantly differ across multiple groups by comparing within-group and between-group variability. Lastly, mRMR selects features that are both highly relevant to the target variable and minimally redundant with each other, ensuring an informative yet non-redundant subset of genes for downstream analysis. These methods collectively enhance the robustness of feature selection by integrating network-based, statistical, and information-theoretic approaches.
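As an illustration of the ANOVA-based selection, the sketch below ranks genes by the p-value of a one-way F-test across sample groups using SciPy. It is a minimal example with made-up values; the function name is hypothetical, and multiple-testing correction and the choice of how many genes to retain are left out.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_rank_genes(expr, labels):
    """Rank genes by one-way ANOVA p-value across sample groups.

    expr has shape (n_samples, n_genes); labels holds one class per sample.
    Returns (gene indices sorted by ascending p-value, p-values).
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    groups = [expr[labels == c] for c in np.unique(labels)]
    pvals = np.array([
        f_oneway(*[g[:, j] for g in groups]).pvalue
        for j in range(expr.shape[1])
    ])
    return np.argsort(pvals), pvals


# Gene 0 separates the two groups; gene 1 has identical group means.
expr = [[1.0, 5.0], [1.1, 4.0], [0.9, 6.0],
        [3.0, 5.5], [3.1, 4.5], [2.9, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
order, pvals = anova_rank_genes(expr, labels)
```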

Graph convolutional network model

GCNs extend the concept of convolutional operations to graph-structured data. A GCN model operates by aggregating feature information from neighboring nodes in a graph to learn node representations. Given a graph $G = (V, E)$, where $V$ represents the set of nodes and $E$ is the set of edges, let $A$ be the adjacency matrix. The propagation rule of a GCN layer can be expressed as:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $H^{(l)}$ is the node feature matrix in layer $l$, $W^{(l)}$ is the learnable weight matrix in layer $l$, and $\sigma$ is a non-linear activation function such as ReLU.
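For concreteness, a single GCN layer with symmetric normalization can be written in a few lines of NumPy. This is a didactic sketch with ReLU as the activation; the models in this study were built with PyTorch Geometric, and the function name and toy matrices here are illustrative.

```python
import numpy as np


def gcn_layer(A, H, W):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-connections
    degrees = A_tilde.sum(axis=1)              # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation


# Two connected nodes with one-hot features and an identity weight matrix.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.eye(2)
W = np.eye(2)
H_next = gcn_layer(A, H, W)
```

In this example every entry of the output equals 0.5: each node averages its own features with those of its single neighbor.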

Graph attention network model

GATs introduce attention mechanisms into GNNs, allowing the model to assign different importance weights to different neighbors. The attention coefficient between node $i$ and neighbor $j$ is computed as:

$$e_{ij}^{(l)} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_j^{(l)}\right]\right)$$

where $h_i^{(l)}$ and $h_j^{(l)}$ are the input feature vectors of nodes $i$ and $j$ for layer $l$, $W^{(l)}$ is a learnable weight matrix, $\mathbf{a}$ is a learnable attention vector, $\|$ denotes concatenation, and LeakyReLU is an activation function. The attention coefficients are normalized using the softmax function across all neighbors $\mathcal{N}(i)$ of node $i$:

$$\alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$$

The node feature update for node $i$ is then computed as a weighted sum of its neighbors’ features:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\, W^{(l)} h_j^{(l)}\right)$$

where $\sigma$ is a non-linear activation function, and $\alpha_{ij}^{(l)}$ are the attention coefficients that weight the contribution of each neighbor’s features. GCNs aggregate neighborhood information uniformly using normalized adjacency matrices, while GATs utilize an attention mechanism to focus on the most relevant neighbors, dynamically weighting each neighbor’s contribution to a node’s feature representation.
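A single-head GAT layer can be sketched in NumPy to make the three steps (scoring, softmax normalization, weighted aggregation) explicit. The sketch makes several assumptions that are not stated in the text: features are stored as rows, so the projection is written `H @ W`; each node attends to itself as well as to its neighbors; and ReLU stands in for the output activation. Names and toy inputs are illustrative.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_layer(A, H, W, a):
    """Single-head GAT layer; returns (attention matrix, updated features)."""
    Z = H @ W                                   # projected features, shape (n, d_out)
    n, d_out = Z.shape
    # a^T [z_i || z_j] splits into a source term for i and a target term for j.
    src = Z @ a[:d_out]
    dst = Z @ a[d_out:]
    e = leaky_relu(src[:, None] + dst[None, :])
    # Restrict attention to neighbors (plus self-loops, an assumption of this sketch).
    mask = (A + np.eye(n)) > 0
    e = np.where(mask, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)        # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha, np.maximum(alpha @ Z, 0.0)    # ReLU used as the activation


# Path graph 0 - 1 - 2 with random features and parameters.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=4)
alpha, H_next = gat_layer(A, H, W, a)
```

Each row of `alpha` sums to one, and entries for non-adjacent node pairs (here nodes 0 and 2) are exactly zero.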

Graph attention network v2 model

Graph attention network v2 (GATv2) improves upon the original GAT by introducing dynamic attention mechanisms that better capture node relationships [9]. Unlike GAT, where the attention coefficients are computed in a static manner, GATv2 provides a more expressive and adaptive attention function by changing the order of operations in the scoring function. This enables the model to assign attention weights dynamically based on node features rather than relying on predefined structures.

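The attention mechanism in GATv2 is defined similarly to GAT but introduces a critical modification. For a node $i$, the attention coefficient between node $i$ and one of its neighbors $j$ is computed as:

$$e_{ij}^{(l)} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\left(W^{(l)}\left[h_i^{(l)} \,\|\, h_j^{(l)}\right]\right)$$

Here the attention vector $\mathbf{a}$ is applied after the LeakyReLU non-linearity rather than before it, with $W^{(l)}$, $\mathbf{a}$, and $h^{(l)}$ playing the same roles as in GAT.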

Beyond this modification, the normalization of attention coefficients and the node feature update follow the same formulation as in GAT. By leveraging this improved attention mechanism, GATv2 provides better adaptability and expressiveness in learning from graph-structured data.
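The practical difference between the two scoring functions can be seen in a small NumPy example. In GAT the score is a monotonic function of a source term plus a target term, so every query node ranks the candidate neighbors in the same order (static attention). In GATv2 the non-linearity sits between the weight matrix and the attention vector, so the ranking can depend on the query node. The parameters below are hand-picked for illustration, and features are stored as rows (so projections are written `X @ W`).

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def pairwise_concat(X):
    """Return a tensor P with P[i, j] = [X[i] || X[j]], shape (n, n, 2d)."""
    n = X.shape[0]
    left = np.repeat(X[:, None, :], n, axis=1)
    right = np.repeat(X[None, :, :], n, axis=0)
    return np.concatenate([left, right], axis=-1)


def gat_scores(H, W, a):
    """GAT: e_ij = LeakyReLU(a^T [W h_i || W h_j])."""
    return leaky_relu(pairwise_concat(H @ W) @ a)


def gatv2_scores(H, W, a):
    """GATv2: e_ij = a^T LeakyReLU(W [h_i || h_j])."""
    return leaky_relu(pairwise_concat(H) @ W) @ a


# Two nodes with scalar features -1 and +1.
H = np.array([[-1.0], [1.0]])
e_gat = gat_scores(H, W=np.array([[1.0]]), a=np.array([1.0, 1.0]))
e_v2 = gatv2_scores(H, W=np.array([[1.0, -1.0], [1.0, -1.0]]), a=np.array([1.0, 1.0]))

top_gat = e_gat.argmax(axis=1)  # same preferred neighbor for every query node
top_v2 = e_v2.argmax(axis=1)    # preferred neighbor depends on the query node
```

With these parameters both GAT rows prefer node 1, whereas under GATv2 node 0 prefers node 0 and node 1 prefers node 1.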

Traditional machine learning models

Traditional machine learning approaches have shown considerable success in various classification tasks. Logistic regression, a fundamental statistical model, provides interpretable results by modeling probability through a logistic function [17]. The elastic net extends conventional regression by incorporating both L1 and L2 regularization, effectively handling multicollinearity and performing feature selection simultaneously [18]. Random Forest, an ensemble learning method, combines multiple decision trees to reduce overfitting and improve generalization by leveraging bagging and random feature selection [19]. SVM excels in finding optimal hyperplanes to separate classes in high-dimensional spaces through kernel transformations [20]. Multilayer Perceptron (MLP), a type of artificial neural network, can capture complex non-linear relationships by utilizing multiple layers of interconnected neurons with activation functions [21]. These models have been widely applied across various domains, demonstrating their versatility and effectiveness in handling different types of data and classification problems.

Parameter tuning process

Model hyperparameters were optimized using five-fold cross-validation on the training dataset. For conventional machine learning models (logistic regression, elastic net, random forest, SVM, and MLP), we employed random search to efficiently explore parameter spaces [22], optimizing regularization parameters for the elastic net, tree parameters for random forest, kernel parameters for SVM, and network architecture for MLP. For GNNs, we utilized Bayesian optimization to navigate their more complex parameter spaces, including network depth, learning rates, and attention mechanisms [23]. Conventional machine learning analyses were implemented using the caret package [24] in R with parallel processing, while GNN computations were performed on an NVIDIA GPU with 45 GB VRAM in an HPC environment. All models were evaluated through ten iterations of training and testing, with the final performance assessed using average accuracy and AUC scores.
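The final evaluation step, averaging accuracy and AUC over repeated train/test iterations, can be sketched as follows. The AUC is computed with the rank-based (Mann-Whitney) formula so that the example needs only NumPy and SciPy. The 0.5 decision threshold, the use of the sample standard deviation, and the function names are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import rankdata


def auc_score(y_true, y_score):
    """Area under the ROC curve via the Mann-Whitney rank formula."""
    y_true = np.asarray(y_true).astype(bool)
    ranks = rankdata(y_score)                   # ties share their average rank
    n_pos = int(y_true.sum())
    n_neg = int((~y_true).sum())
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def summarize_runs(runs, threshold=0.5):
    """Mean and SD of accuracy and AUC over (y_true, y_score) pairs, one per iteration."""
    accs, aucs = [], []
    for y_true, y_score in runs:
        y_true = np.asarray(y_true)
        y_score = np.asarray(y_score, dtype=float)
        accs.append(float(np.mean((y_score >= threshold).astype(int) == y_true)))
        aucs.append(float(auc_score(y_true, y_score)))
    return {
        "accuracy_mean": float(np.mean(accs)),
        "accuracy_sd": float(np.std(accs, ddof=1)),
        "auc_mean": float(np.mean(aucs)),
        "auc_sd": float(np.std(aucs, ddof=1)),
    }


# Two toy iterations (the study used ten); labels and scores are made up.
runs = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1], [0.2, 0.9]),
]
summary = summarize_runs(runs)
```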

Results

GNN-based classification improves performance

The machine learning models, trained on gene expression profiles selected through WGCNA, mRMR, and ANOVA, displayed varied performance across different classification tasks (Fig. 3). They showed strong results in binary classification of normal versus tumor samples in breast cancer, achieving high accuracy and AUC. However, their performance decreased in more complex tasks like distinguishing primary from recurrent tumors and pre- from post-treatment samples.

Figure 3. Performance of machine learning models for different features. Alt text: Bar charts comparing the performance of multiple machine learning models across three datasets. Each model is shown in a different color, and each panel represents a distinct feature set.

We implemented our methodology, which combines network generation with GNNs, and compared it against the traditional machine learning models (Fig. 3), using their accuracy and AUC across 10 iterations as baselines. Table 1 reports the performance of our GNN-based classification for each feature set. Relative to the baselines, accuracy changed by +13.8%, −2.4%, and +2.0% across the three datasets, and AUC changed by +17.1%, −0.7%, and +1.3%, respectively. The variability in the magnitude of improvement is likely due to differences in sample size and in the complexity of the classification task. GNNs faced challenges in achieving superior performance in data-limited contexts.

Table 1.

Performance comparison across datasets and feature selection methods

Dataset	Features	Accuracy (SD)				AUC (SD)
		GCN-gene	GCN	GAT	GATv2	GCN-gene	GCN	GAT	GATv2
Primary-recurrent	WGCNA	0.717 (0.017)	0.748 (0.058)	0.698 (0.063)	0.740 (0.008)	0.780 (0.013)	0.782 (0.085)	0.731 (0.088)	0.771 (0.011)
	ANOVA	0.838 (0.006)	0.847 (0.034)	0.855 (0.017)	0.850 (0.020)	0.901 (0.005)	0.897 (0.032)	0.901 (0.019)	0.900 (0.024)
	mRMR	0.884 (0.010)	0.820 (0.056)	0.866 (0.021)	0.862 (0.020)	0.954 (0.004)	0.892 (0.055)	0.932 (0.018)	0.924 (0.009)
Normal-tumor	WGCNA	0.813 (0.027)	0.963 (0.023)	0.917 (0.030)	0.922 (0.026)	0.806 (0.014)	0.991 (0.008)	0.961 (0.020)	0.967 (0.017)
	ANOVA	0.583 (0.073)	0.943 (0.046)	0.865 (0.132)	0.913 (0.023)	0.399 (0.191)	0.978 (0.032)	0.907 (0.144)	0.950 (0.015)
	mRMR	0.746 (0.143)	0.935 (0.010)	0.928 (0.031)	0.952 (0.009)	0.712 (0.122)	0.976 (0.004)	0.968 (0.009)	0.974 (0.015)
Pre-post treatment	WGCNA	-	0.700 (0.018)	0.696 (0.019)	0.696 (0.019)	-	0.649 (0.007)	0.647 (0.005)	0.645 (0.006)
	ANOVA	-	0.721 (0.058)	0.679 (0.041)	0.700 (0.063)	-	0.733 (0.076)	0.646 (0.052)	0.680 (0.081)
	mRMR	-	0.779 (0.037)	0.675 (0.070)	0.689 (0.034)	-	0.786 (0.036)	0.660 (0.069)	0.671 (0.043)


Bold values indicate the best performance for each dataset.

Computational cost of GNNs

Table 2 documents the computational time for each model across different datasets. The GNN-based approaches required significantly more computational resources compared to traditional machine learning models. However, enhanced classification performance can justify this increased computational cost. Notably, our GNN implementations were run sequentially due to resource limitations; parallel processing could potentially lead to substantial improvements in computational efficiency.

Table 2.

Time cost for different models across tasks

Model	Primary-recurrent	Normal-tumor	Pre-post treatment
Logistic	0.5 s	0.5 s	0.4 s
Elastic net	1.7 s	3.1 s	1.2 s
Random forest	15.6 s	2.6 s	1.6 s
SVM	6.8 s	5.0 s	1.9 s
MLP	13.7 s	7.4 s	3.1 s
GCN-gene	12 days	24 h	-
GCN-all-feature	12 days	24 h	23 h
GAT-all-feature	23 days	36 h	36 h


Graph-based features enhance performance

We utilized graph-based feature profiles to construct graph networks. The GNN predictions leveraging these networks demonstrated superior performance compared to all other feature types (Fig. 4). In traditional machine learning models, graph-based features achieved perfect separation in the normal-tumor classification task and exceeded 90% accuracy and AUC in pre- and post-treatment sample classification. These features also delivered competitive performance for primary-recurrent classification.

Figure 4. Performance of different features including graph-based features. Alt text: Bar charts comparing classification performance of graph-based features against WGCNA, ANOVA, and mRMR features across multiple cancer datasets, showing accuracy and AUC metrics.

The GNN models achieved an accuracy of 0.948 and an AUC of 0.977 in the primary-recurrent dataset, perfect separation in the normal-tumor dataset, and an accuracy of 0.896 with an AUC of 0.926 in the pre-post treatment dataset (Table 3). With our feature selection methodology, accuracy improved by 7.2%, 1.3%, and 16.9% in these respective tasks, while AUC increased by 2.4%, 0.2%, and 17.8%. The exceptional performance of DB_unpaired and DB_unpaired_nopath features in GNN models highlights the effectiveness of our approach in capturing complex biological relationships.

Table 3.

Extended performance comparison with paired/unpaired datasets and feature selection methods

Dataset	Features	Accuracy (SD)				AUC (SD)
		GCN-gene	GCN	GAT	GATv2	GCN-gene	GCN	GAT	GATv2
Primary-recurrent	DB_paired	0.816 (0.023)	0.775 (0.047)	0.733 (0.033)	0.779 (0.014)	0.882 (0.014)	0.833 (0.041)	0.785 (0.042)	0.843 (0.019)
	DB_paired_nopath	0.821 (0.015)	0.812 (0.037)	0.733 (0.025)	0.784 (0.052)	0.896 (0.014)	0.872 (0.035)	0.773 (0.036)	0.837 (0.063)
	DB_unpaired	0.938 (0.048)	0.942 (0.014)	0.915 (0.027)	0.908 (0.016)	0.977 (0.031)	0.977 (0.015)	0.963 (0.015)	0.958 (0.014)
	DB_unpaired_nopath	0.948 (0.016)	0.937 (0.023)	0.910 (0.023)	0.928 (0.019)	0.971 (0.014)	0.974 (0.024)	0.961 (0.013)	0.970 (0.012)
	WGCNA	0.717 (0.017)	0.748 (0.058)	0.698 (0.063)	0.740 (0.008)	0.780 (0.013)	0.782 (0.085)	0.731 (0.088)	0.771 (0.011)
	ANOVA	0.838 (0.006)	0.847 (0.034)	0.855 (0.017)	0.850 (0.020)	0.901 (0.005)	0.897 (0.032)	0.901 (0.019)	0.900 (0.024)
	mRMR	0.884 (0.010)	0.820 (0.056)	0.866 (0.021)	0.862 (0.020)	0.954 (0.004)	0.892 (0.055)	0.932 (0.018)	0.924 (0.009)
Normal-tumor	DB_paired	0.617 (0.104)	0.911 (0.036)	0.889 (0.047)	0.899 (0.041)	0.492 (0.264)	0.944 (0.033)	0.887 (0.031)	0.919 (0.040)
	DB_paired_nopath	0.580 (0.060)	0.902 (0.021)	0.828 (0.033)	0.885 (0.069)	0.457 (0.168)	0.940 (0.007)	0.834 (0.034)	0.903 (0.077)
	DB_unpaired	0.815 (0.033)	0.980 (0.007)	1.000 (0.000)	0.985 (0.034)	0.832 (0.048)	0.995 (0.004)	1.000 (0.000)	0.993 (0.018)
	DB_unpaired_nopath	0.717 (0.093)	0.948 (0.011)	0.835 (0.122)	0.889 (0.019)	0.693 (0.121)	0.960 (0.004)	0.854 (0.132)	0.917 (0.034)
	WGCNA	0.813 (0.027)	0.963 (0.023)	0.917 (0.030)	0.922 (0.026)	0.806 (0.014)	0.991 (0.008)	0.961 (0.020)	0.967 (0.017)
	ANOVA	0.583 (0.073)	0.943 (0.046)	0.865 (0.132)	0.913 (0.023)	0.399 (0.191)	0.978 (0.032)	0.907 (0.144)	0.950 (0.015)
	mRMR	0.746 (0.143)	0.935 (0.010)	0.928 (0.031)	0.952 (0.009)	0.712 (0.122)	0.976 (0.004)	0.968 (0.009)	0.974 (0.015)
Pre-post	DB_paired	-	0.793 (0.028)	0.721 (0.041)	0.711 (0.064)	-	0.833 (0.030)	0.740 (0.071)	0.691 (0.064)
	DB_paired_nopath	-	0.793 (0.073)	0.718 (0.052)	0.696 (0.030)	-	0.789 (0.080)	0.698 (0.057)	0.669 (0.050)
	DB_unpaired	-	0.846 (0.041)	0.850 (0.015)	0.843 (0.066)	-	0.892 (0.034)	0.886 (0.019)	0.858 (0.092)
	DB_unpaired_nopath	-	0.796 (0.024)	0.871 (0.063)	0.896 (0.057)	-	0.839 (0.032)	0.915 (0.053)	0.926 (0.057)
	WGCNA	-	0.700 (0.018)	0.696 (0.019)	0.696 (0.019)	-	0.649 (0.007)	0.647 (0.005)	0.645 (0.006)
	ANOVA	-	0.721 (0.058)	0.679 (0.041)	0.700 (0.063)	-	0.733 (0.076)	0.646 (0.052)	0.680 (0.081)
	mRMR	-	0.779 (0.037)	0.675 (0.070)	0.689 (0.034)	-	0.786 (0.036)	0.660 (0.069)	0.671 (0.043)


Bold values indicate the best performance for each dataset.
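
The best-performing entry per dataset can be recovered directly from the accuracy means in Table 3. The following minimal sketch transcribes those means (SDs omitted) and picks the maximum for each dataset. The variable and function names are illustrative and are not taken from the EGNF code base.

```python
# Accuracy means transcribed from Table 3 (SDs omitted).
# Column order: GCN-gene, GCN, GAT, GATv2; None marks "-" entries.
MODELS = ("GCN-gene", "GCN", "GAT", "GATv2")

ACCURACY = {
    "Primary-recurrent": {
        "DB_paired": (0.816, 0.775, 0.733, 0.779),
        "DB_paired_nopath": (0.821, 0.812, 0.733, 0.784),
        "DB_unpaired": (0.938, 0.942, 0.915, 0.908),
        "DB_unpaired_nopath": (0.948, 0.937, 0.910, 0.928),
        "WGCNA": (0.717, 0.748, 0.698, 0.740),
        "ANOVA": (0.838, 0.847, 0.855, 0.850),
        "mRMR": (0.884, 0.820, 0.866, 0.862),
    },
    "Normal-tumor": {
        "DB_paired": (0.617, 0.911, 0.889, 0.899),
        "DB_paired_nopath": (0.580, 0.902, 0.828, 0.885),
        "DB_unpaired": (0.815, 0.980, 1.000, 0.985),
        "DB_unpaired_nopath": (0.717, 0.948, 0.835, 0.889),
        "WGCNA": (0.813, 0.963, 0.917, 0.922),
        "ANOVA": (0.583, 0.943, 0.865, 0.913),
        "mRMR": (0.746, 0.935, 0.928, 0.952),
    },
    "Pre-post": {
        "DB_paired": (None, 0.793, 0.721, 0.711),
        "DB_paired_nopath": (None, 0.793, 0.718, 0.696),
        "DB_unpaired": (None, 0.846, 0.850, 0.843),
        "DB_unpaired_nopath": (None, 0.796, 0.871, 0.896),
        "WGCNA": (None, 0.700, 0.696, 0.696),
        "ANOVA": (None, 0.721, 0.679, 0.700),
        "mRMR": (None, 0.779, 0.675, 0.689),
    },
}


def best_entry(rows):
    """Return (score, feature_set, model) for the highest score in one dataset."""
    candidates = [
        (score, features, model)
        for features, scores in rows.items()
        for model, score in zip(MODELS, scores)
        if score is not None  # skip models that were not evaluated
    ]
    return max(candidates, key=lambda c: c[0])


best = {dataset: best_entry(rows) for dataset, rows in ACCURACY.items()}
for dataset, (score, features, model) in best.items():
    print(f"{dataset}: {score:.3f} ({features}, {model})")
```

Run as written, this reports DB_unpaired_nopath with GCN-gene for primary-recurrent (0.948), DB_unpaired with GAT for normal-tumor (1.000), and DB_unpaired_nopath with GATv2 for pre-post (0.896), which matches the accuracies quoted in the text.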

Biomarkers selected from unpaired approach

Our approach successfully identified key predictive markers from unpaired feature selection, many of which have literature support for their functional relevance (Supplementary Table S1). In IDH-wt glioma, PGR, COL17A1, and RGS14 showed strong predictive value for recurrence, consistent with their known roles in tumor aggressiveness and therapy resistance [25–27]. For breast cancer normal-tumor differentiation, CXCL14, MSLN, LCN2, and MED1 emerged as significant predictors, supported by evidence of their differential expression and involvement in tumor progression [28–31]. In HER2-negative breast cancer treated with Bevacizumab, MMP2, COL5A2, COL6A1, DLK1, and MDK were identified as predictive of treatment response, aligning with their documented roles in angiogenesis and stromal interactions [32–36]. These findings validate our biomarker discovery method and underscore the biological relevance of the identified markers, enhancing their potential translational value in precision oncology.

Discussion

The EGNF leverages the power of GNNs to advance biomarker identification through gene expression data, offering a robust approach to modeling biological relationships. To the best of our knowledge, this represents the first time that common samples have been used to connect nodes in a biological network instead of relying on traditional logical relationships such as protein-protein interactions or pathway memberships. This novel approach enables the discovery of functional relationships that may not be captured by existing biological databases, potentially revealing new therapeutic targets and diagnostic markers (Table 4). The predictive markers identified by EGNF across the three datasets—IDH-wt glioma, normal-tumor breast cancer, and HER2-negative breast cancer treated with Bevacizumab—exemplify the model’s ability to uncover biologically significant genes within complex expression networks.
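
The idea of linking nodes through common samples, rather than through curated interactions, can be illustrated with a small sketch. It assumes each node is a group of sample IDs (for example, a cluster of samples with similar expression of one gene) and draws an edge whenever two groups share enough samples. The function name, node labels, sample IDs, and the `min_shared` threshold are hypothetical and do not reproduce the EGNF implementation.

```python
from itertools import combinations


def shared_sample_edges(node_samples, min_shared=2):
    """Connect two nodes when they have at least `min_shared` samples in common.

    node_samples: dict mapping a node label to the set of sample IDs it contains.
    Returns a dict mapping (node_a, node_b) to the number of shared samples.
    """
    edges = {}
    # Iterate over every unordered pair of node labels in a stable order.
    for a, b in combinations(sorted(node_samples), 2):
        shared = len(node_samples[a] & node_samples[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Hypothetical toy input: node labels and sample IDs are made up for illustration.
nodes = {
    "GENE_A:high": {"s1", "s2", "s3", "s4"},
    "GENE_B:high": {"s2", "s3", "s4", "s5"},
    "GENE_C:low": {"s1", "s5", "s6"},
}

edges = shared_sample_edges(nodes, min_shared=2)
print(edges)
```

In this toy input only the first two nodes share three samples, so a single edge with weight 3 is produced. No pathway or protein-protein interaction annotation is consulted at any point.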

Table 4.

Featured biomarkers identified by EGNF across cancer datasets with supporting clinical evidence

Cancer type	Biomarker	Biological function	Clinical evidence
IDH-wt glioma	PGR	Progesterone signaling/migration	[25, 37]
	COL17A1	ECM remodeling/invasion	[26]
	RGS14	Immune regulation/prognosis	[27]
Breast cancer (normal-tumor)	CXCL14	Anti-cancer/immune modulation	[29, 40–42]
	MSLN	Immunotherapy target	[30]
	LCN2	Metastasis promotion	[28]
	MED1	Estrogen resistance	[31]
HER2-negative breast cancer (Bevacizumab)	MMP2	Matrix degradation/invasion	[32]
	COL5A2	Tumor microenvironment	[34]
	COL6A1	Prognosis/microenvironment	[34]
	DLK1	NOTCH signaling modulation	[33]
	MDK	Angiogenesis/therapy resistance	[35, 36]


In IDH-wt glioma, our approach identified PGR, COL17A1, and RGS14 as key predictive markers for recurrence, consistent with their known roles in tumor aggressiveness and therapy resistance. These markers reflect pathways associated with recurrence, including progesterone signaling, extracellular matrix remodeling, and G-protein regulation, respectively, which are critical in the biology of aggressive gliomas. Recent clinical studies have further validated these findings: PGRMC1 has been demonstrated as a tumor-promoting factor in glioblastoma, where it modulates tumor progression, immune microenvironment, and therapy response [25, 37]. COL17A1’s role in ECM remodeling and invasion has been supported by integrated analyses identifying it as a potential biomarker of glioblastoma multiforme [26], while RGS14’s involvement in immune regulation and prognosis has been confirmed through machine learning studies unveiling immune-related signatures in multicenter glioma research [27]. Recurrence represents a major clinical challenge in glioma management, with IDH-wt tumors showing particularly aggressive behavior and high recurrence rates, often leading to treatment failure and poor patient outcomes [38, 39]. The identification of molecular signatures associated with recurrence could enable earlier intervention and personalized treatment strategies, potentially improving the dismal prognosis currently associated with these tumors.

For breast cancer normal-tumor differentiation, CXCL14, MSLN, LCN2, and MED1 emerged as significant predictors, supported by evidence of their differential expression and involvement in tumor progression. These markers highlight immune modulation, cell surface alterations, inflammation, and estrogen receptor coactivation, processes central to tumorigenesis. The clinical relevance of these biomarkers has been extensively validated: CXCL14 expression in tumor stroma has been confirmed as an independent survival marker, with fibroblast-derived CXCL14 promoting epithelial-to-mesenchymal transition and metastasis through ACKR2-dependent mechanisms [29, 40, 41]. Importantly, low CXCL14 expression levels have been specifically associated with poor survival rates in triple-negative breast cancer patients [42]. MSLN has been validated as a novel immunotherapy target for triple negative breast cancer [30], while LCN2’s role in promoting breast cancer progression and metastasis has been well-documented [28]. MED1’s involvement in estrogen resistance mechanisms has been demonstrated, where silencing MED1 sensitizes breast cancer cells to pure anti-estrogen fulvestrant both in vitro and in vivo [31].

In the HER2-negative breast cancer cohort treated with Bevacizumab, MMP2, COL5A2, COL6A1, DLK1, and MDK were identified as predictive of treatment response, aligning with their documented roles in angiogenesis and stromal interactions. These markers reflect the regulation of angiogenesis, stromal architecture, and growth factor signaling pathways. Treatment resistance in breast cancer, particularly to targeted therapies like Bevacizumab, remains a significant obstacle in clinical management, with most patients eventually developing resistance mechanisms that lead to disease progression [43]. The clinical validation of these markers is substantial: MMP2 values in tumor tissue have been characterized in basal-like breast cancer patients, supporting its role in matrix degradation and invasion [32]. COL5A2 and COL6A1 have been identified through pan-cancer analyses as significant factors in prognosis and tumor microenvironment modulation [34]. Different DLK1 expression levels have been shown to inversely modulate the oncogenic potential of breast cancer cells through inhibition of NOTCH1 signaling [33]. MDK’s role in angiogenesis and therapy resistance has been confirmed through angiogenesis-related analyses associated with prognosis and tumor immune microenvironment, as well as single-cell RNA sequencing studies identifying molecular biomarkers predicting progression to CDK4/6 inhibition [35, 36]. MMP2 and collagen genes influence extracellular matrix dynamics affecting drug delivery, while MDK provides an alternative angiogenic signaling pathway that may contribute to treatment resistance. These markers represent interconnected biological processes that determine Bevacizumab response, offering insights for patient stratification in anti-angiogenic therapy and enabling clinicians to identify patients who might benefit from alternative or combination approaches before resistance develops.

It is important to note that markers not listed in Table 4 still warrant further investigation, even in the absence of direct supporting clinical studies. The novel network construction approach employed by EGNF may reveal previously uncharacterized gene interactions and functional relationships that have not yet been explored in clinical contexts. These markers could represent emerging therapeutic targets or novel biomarkers that require validation through future experimental and clinical studies.

Limitations

While EGNF demonstrates significant promise, several limitations must be acknowledged. Interpretability remains a critical focus in GNN research, aligning with EGNF’s emphasis on biologically meaningful feature selection. Advances have introduced GNN models with explainability layers that highlight key gene subnetworks driving predictions in cancer prognosis [44, 45]. This approach not only improves classification but also provides clinicians with actionable insights into disease mechanisms. Integrating such interpretability mechanisms into EGNF could enhance its ability to identify significant gene modules, thereby increasing its value for translational research, where biological relevance is paramount.

Scalability is another consideration, as GNNs often require substantial computational resources. Research has demonstrated that optimization techniques, such as pruning redundant graph connections, can reduce computational overhead without sacrificing accuracy in gene expression tasks [46]. GNN model performance could be further enhanced through two key optimizations: (i) implementation of a more comprehensive hyperparameter search strategy, and (ii) adoption of more stringent thresholds for shared sample selection. These potential enhancements suggest that the current performance metrics of GNNs represent a conservative estimate of the methodology’s capabilities. For EGNF, adopting such optimization strategies could make it more feasible for large-scale genomic studies, broadening its practical utility.
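
The effect of a more stringent shared-sample threshold on graph size can be sketched as follows. This is an illustration under assumptions, not the procedure of [46] or of EGNF: edges are assumed to carry a weight equal to the number of shared samples, and `prune_edges`, `edge_counts`, the node labels, and the thresholds are all hypothetical.

```python
def prune_edges(weighted_edges, min_weight):
    """Keep only edges whose weight is at least `min_weight`."""
    return {pair: w for pair, w in weighted_edges.items() if w >= min_weight}


def edge_counts(weighted_edges, thresholds):
    """Count the edges that survive pruning at each candidate threshold."""
    return {t: len(prune_edges(weighted_edges, t)) for t in thresholds}


# Hypothetical weighted edges; each weight stands for a number of shared samples.
weighted_edges = {
    ("n1", "n2"): 5,
    ("n1", "n3"): 2,
    ("n2", "n3"): 1,
    ("n3", "n4"): 4,
    ("n2", "n4"): 3,
}

counts = edge_counts(weighted_edges, thresholds=(1, 3, 5))
print(counts)
```

Sweeping the threshold in this way shows how quickly the edge set shrinks. Whether accuracy is preserved at a given threshold would still have to be checked by retraining the model on the pruned graph.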

Ethical considerations

The deployment of biomarker discovery pipelines like EGNF in translational medicine raises important ethical questions that must be carefully considered. First, the potential for algorithmic bias in biomarker selection could lead to disparities in healthcare outcomes if training datasets are not representative of diverse patient populations. Ensuring equitable representation across different ethnic groups, socioeconomic backgrounds, and geographic regions is crucial for developing universally applicable biomarkers. Second, the clinical implementation of artificial intelligence (AI)-derived biomarkers requires robust validation frameworks to prevent premature adoption of markers that may lack sufficient clinical evidence. The integration of such biomarkers into clinical decision-making must be accompanied by appropriate regulatory oversight and transparency in algorithmic processes to maintain patient trust and safety. Additionally, questions of data ownership, patient consent for AI-driven analysis, and the potential commercialization of biomarker discoveries necessitate clear ethical guidelines and governance frameworks. Finally, the accessibility and affordability of biomarker-based diagnostics and treatments must be considered to prevent exacerbation of existing healthcare inequalities.

Future directions

Recent developments in the field offer valuable insights into how GNN-based methods, such as EGNF, can be contextualized and potentially enhanced. One notable trend is the application of GNNs to multi-modal biological data for improved disease classification and biomarker discovery. Researchers have explored hybrid GNN models that integrate gene expression profiles with protein-protein interaction networks to identify cancer-specific biomarkers, achieving high predictive accuracy by capturing cross-modal dependencies [47]. This suggests that EGNF could benefit from incorporating additional data types, such as epigenetic or proteomic information, to enrich its network representations. The ability to model such complex interactions is a key advantage of GNNs, enabling the detection of subtle patterns that traditional methods might overlook.

Another area of progress is the application of GNNs to rare disease research, where data scarcity presents a significant challenge. Studies have shown that GNN-based frameworks, combined with transfer learning, can effectively identify biomarkers using limited gene expression datasets [48, 49]. This is particularly relevant for EGNF, as its performance in scenarios with small sample sizes, such as distinguishing nuanced disease states, could be bolstered by adopting similar techniques. By pre-training on larger, related datasets and fine-tuning for specific tasks, EGNF may overcome limitations associated with data availability, thereby enhancing its applicability in precision medicine.

In our future work, we will sequence samples for single-cell spatial transcriptomics analysis to provide complementary validation of our findings. This approach will enable biomarker characterization at cellular resolution within the tissue microenvironment, potentially revealing cell-type-specific signatures and spatial expression patterns that could enhance the clinical utility of the identified markers. Furthermore, it will allow us to assess whether the sample-based network connections inferred by EGNF reflect genuine cellular interactions and spatial organization within the tumor. In addition, recent studies have introduced controlled noise injection into the data to better manage false discovery rates, thereby yielding more reliable biomarker candidates [50]. Adopting such techniques could further strengthen the validity of markers identified by EGNF.

Conclusion

Importantly, EGNF demonstrates two significant advantages that extend beyond our specific test cases. First, it offers a powerful approach for biomarker identification, extracting genes with genuine biological relevance rather than statistical artifacts. This capability stems from the method’s integration of network topology with expression data, enabling it to capture complex gene interactions that traditional feature selection methods might miss. Second, EGNF provides superior predictive performance across various classification tasks, as evidenced by our comparative analyses. This dual strength, identifying meaningful biomarkers while enhancing prediction accuracy, makes EGNF particularly valuable in clinical applications where both mechanistic insights and reliable patient stratification are crucial. Moreover, these capabilities suggest that EGNF could be effectively applied to any sample classification task involving gene expression data, regardless of the disease context or research area.

Looking ahead, EGNF is well positioned to benefit from emerging trends in GNN research, including multi-modal integration, transfer learning for small datasets, enhanced interpretability, and computational optimization. Incorporating these advances could further strengthen its robustness, broaden its applicability across disease contexts, and increase its clinical utility. Ultimately, EGNF has the potential to bridge the gap between computational innovation and clinical impact, offering a powerful tool for addressing critical challenges such as glioma recurrence and breast cancer treatment resistance.

Key Points

We introduce a novel graph-based biomarker discovery framework that integrates graph neural networks (GNNs) with network-driven feature engineering, enabling biomarker identification by modeling gene expression as graph-structured data rather than independent features.
The framework demonstrates superior predictive performance compared to traditional machine learning approaches by leveraging GNNs’ ability to capture complex gene-gene interactions and regulatory relationships within biological networks.
Network-based features provide direct biological interpretability, allowing researchers to trace how specific graph structures and connectivity patterns contribute to biomarker significance and revealing insights into underlying disease mechanisms.
The pipeline offers broad cross-disease generalizability, demonstrating robust transferability across diverse datasets and biological contexts for discovering context-specific biomarkers.
This approach represents a paradigm shift from conventional feature selection methods toward network topology analysis, opening new avenues for understanding complex disease mechanisms through graph-structured biological data.

Supplementary Material

Supplementary_Table_1_bbaf559

supplementary_table_1_bbaf559.xlsx (16.5KB, xlsx)

Acknowledgments

The authors express their gratitude for the kind comments and suggestions from Dr Xi Luo, Dr Vahed Maroufy, and Dr Goo Jun at UT Health Houston.

Contributor Information

Yang Liu, Department of Translational Molecular Pathology, University of Texas MD Anderson Cancer Center, 2130 W Holcombe Blvd, Texas 77030, United States; Department of Biostatistics and Data Science, University of Texas Health Science Center at Houston, 1200 Pressler Street, Texas 77030, United States.

Jason Huse, Department of Translational Molecular Pathology, University of Texas MD Anderson Cancer Center, 2130 W Holcombe Blvd, Texas 77030, United States; Department of Pathology, University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Texas 77030, United States.

Kasthuri Kannan, Department of Translational Molecular Pathology, University of Texas MD Anderson Cancer Center, 2130 W Holcombe Blvd, Texas 77030, United States.

Author contributions

Y.L., K.K., and J.H. conceived and designed the study. Y.L. performed the computational analysis and wrote the manuscript. K.K. and J.H. edited the manuscript. All authors reviewed and approved the final manuscript.

Funding

This work was supported by the MD Anderson Moonshot Program.

Data availability

The gene expression datasets used in this study are publicly available from the GLASS Consortium (https://www.synapse.org/Synapse:syn17038081/wiki/585622), TCGA (https://www.cancer.gov/ccg/research/genome-sequencing/tcga), and the GEO under accession number GSE87455 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87455).

Code availability

The source code and a tutorial for this work are available on GitHub at https://github.com/yliu38/EGNF.

References

1. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell 2011;144:646–74.
2. Fares J, Ulasov I, Timashev P. et al. Molecular diversity in isocitrate dehydrogenase-wild-type glioblastoma. Brain Commun 2024;6:fcae108.
3. Verhaak RG, Hoadley KA, Purdom E. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010;17:98–110.
4. Neftel C, Laffy J, Filbin MG. et al. An integrative model of cellular states, plasticity, and genetics for glioblastoma. Cell 2019;178:835–49.
5. MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–97. Berkeley: University of California Press, 1967.
6. Mostafa SM, Amano H. Effect of clustering data in improving machine learning model accuracy. J Theor Appl Inf Technol 2019;97:2973–81.
7. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2017.
8. Veličković P, Cucurull G, Casanova A. et al. Graph attention networks. arXiv preprint arXiv:1710.10903. 2018.
9. Brody S, Alon U, Yahav E. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491. 2021.
10. Wang T, Shao W, Huang Z. et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun 2021;12:3445.
11. Ramirez R, Chiu YC, Hererra A. et al. Classification of cancer types using graph convolutional neural networks. Front Phys 2020;8:203.
12. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.
13. Yu CQ, Jiang C, Wang L. et al. iHofman: a predictive model integrating high-order and low-order features with weighted attention mechanisms for circRNA-miRNA interactions. BMC Biol 2025;23:162.
14. Wang X, Yu C, Li L. et al. A feature extraction method based on noise reduction for circRNA-miRNA interaction prediction combining multi-structure features in the association networks. Brief Bioinform 2024;25:bbad111.
15. Wei M, Wang L, Li Y. et al. BioKG-CMI: a multi-source feature fusion model based on biological knowledge graph for predicting circRNA-miRNA interactions. Sci China Inf Sci 2024;67:189104.
16. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550.
17. Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression, 3rd edn. Hoboken: Wiley, 2013.
18. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 2005;67:301–20.
19. Breiman L. Random forests. Mach Learn 2001;45:5–32.
20. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273–97.
21. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:533–6.
22. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13:281–305.
23. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, Vol. 25, pp. 2951–9, 2012.
24. Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28:1–26.
25. Cortese AC, Smart C, Brown SD. Intracellular progesterone receptor and cSrc protein working together to promote migration and invasion of human glioblastoma cells. Front Endocrinol 2021;12:640298.
26. Yan Y, Li X, Zhang Y. et al. Integrated analysis of ECT2 and COL17A1 as potential biomarkers of glioblastoma multiforme. Biomed Res Int 2022;2022:9453549.
27. Yang S, Wang X, Huan R. et al. Machine learning unveils immune-related signature in multicenter glioma studies. iScience 2024;27:109317.
28. Yang J, Bielenberg DR, Rodig SJ. et al. Lipocalin 2 promotes breast cancer progression. Proc Natl Acad Sci U S A 2009;106:3913–8.
29. Gu XL, Ou ZL, Lin FJ. et al. Expression of CXCL14 and its anticancer role in breast cancer. Breast Cancer Res Treat 2012;135:725–35.
30. Tchou J, Wang LC, Selven B. et al. Mesothelin, a novel immunotherapy target for triple negative breast cancer. Breast Cancer Res Treat 2012;133:799–804.
31. Zhang L, Cui J, Leonard M. et al. Silencing MED1 sensitizes breast cancer cells to pure anti-estrogen fulvestrant in vitro and in vivo. PLoS One 2013;8:e70641.
32. Radenkovic S, Konjevic G, Jurisic V. et al. Values of MMP-2 and MMP-9 in tumor tissue of basal-like breast cancer patients. Cell Biochem Biophys 2014;68:143–52.
33. Nueda ML, Naranjo AI, Baladrón V. et al. Different expression levels of DLK1 inversely modulate the oncogenic potential of human MDA-MB-231 breast cancer cells through inhibition of NOTCH1 signaling. FASEB J 2017;31:3484–96.
34. Li X, Li Z, Gu S. et al. A pan-cancer analysis of collagen VI family on prognosis, tumor microenvironment, and its potential therapeutic effect. BMC Bioinformatics 2022;23:390.
35. Gao S, Wang Y, Xu Y. et al. An angiogenesis-related lncRNA signature is associated with prognosis and tumor immune microenvironment in breast cancer. J Pers Med 2023;13:513.
36. Luo L, Yang P, Mastoraki S. et al. Single-cell RNA sequencing identifies molecular biomarkers predicting late progression to CDK4/6 inhibition in patients with HR+/HER2- metastatic breast cancer. Mol Cancer 2025;24:48.
37. Dumitru CA, Schröder H, Schäfer FTA. et al. Progesterone receptor membrane component 1 (PGRMC1) modulates tumour progression, the immune microenvironment and the response to therapy in glioblastoma. Cells 2023;12:2498.
38. Weller M, van den Bent M, Tonn JC. et al. European Association for Neuro-Oncology (EANO) guideline on the diagnosis and treatment of adult astrocytic and oligodendroglial gliomas. Lancet Oncol 2017;18:e315–29.
39. Molenaar RJ, Maciejewski JP, Wilmink JW. et al. Wild-type and mutated IDH1/2 enzymes and therapy responses. Oncogene 2018;37:1949–60.
40. Sjöberg E, Augsten M, Bergh J. et al. Expression of the chemokine CXCL14 in the tumour stroma is an independent marker of survival in breast cancer. Br J Cancer 2016;114:1117–24.
41. Sjöberg E, Meyrath M, Milde L. et al. A novel ACKR2-dependent role of fibroblast-derived CXCL14 in epithelial-to-mesenchymal transition and metastasis of breast cancer. Clin Cancer Res 2019;25:3702–17.
42. Gibbs C, So JY, Ahad A. et al. CXCL14 attenuates triple-negative breast cancer progression by regulating immune profiles of the tumor microenvironment in a T cell-dependent manner. Int J Mol Sci 2022;23:9314.
43. Bergers G, Hanahan D. Modes of resistance to anti-angiogenic therapy. Nat Rev Cancer 2008;8:592–603.
44. Ying Z, Bourgeois D, You J. et al. GNNExplainer: Generating explanations for graph neural networks. In: Advances in Neural Information Processing Systems, Vol. 32, pp. 9240–51, 2019.
45. Chereda H, Bleckmann A, Menck K. et al. Explaining decisions of graph convolutional neural networks: Patient-specific molecular subnetworks responsible for metastasis prediction in breast cancer. Genome Med 2021;13:42.
46. Chen J, Ma T, Xiao C. FastGCN: fast learning with graph convolutional networks via importance sampling. Int Conf Learn Represent 2020. arXiv:1801.10247.
47. Rhee S, Seo S, Kim S. Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 3527–34, 2018.
48. Dutil F, Cohen JP, Weiss M. et al. Towards gene expression convolutions using gene interaction graphs. Int Conf Mach Learn Workshop Comput Biol 2018. arXiv:1806.06975.
49. Theodoris CV, Xiao L, Chopra A. et al. Transfer learning enables predictions in network biology. Nature 2023;618:616–24.
50. Shah S, Li X, Wu X. et al. Noise injection to control false discoveries in single-cell differential expression. Nat Biotechnol 2023;41:1878–87.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary_Table_1_bbaf559

supplementary_table_1_bbaf559.xlsx^{(16.5KB, xlsx)}

Data Availability Statement

[ref1] 1. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144:646–74. [DOI] [PubMed] [Google Scholar]

[ref2] 2. Fares J, Ulasov I, Timashev P. et al. Molecular diversity in isocitrate dehydrogenase-wild-type glioblastoma. Brain Commun 2024;6:fcae108. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref3] 3. Verhaak RG, Hoadley KA, Purdom E. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010;17:98–110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref4] 4. Neftel C, Laffy J, Filbin MG. et al. An integrative model of cellular states, plasticity, and genetics for glioblastoma. Cell. 2019;178:835–49. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5. MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–97. Berkeley: University of California Press, 1967. [Google Scholar]

[ref6] 6. Mostafa SM, Amano H. Effect of clustering data in improving machine learning model accuracy. J Theor Appl Inf Technol 2019;97:2973–81. [Google Scholar]

[ref7] 7. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks arXiv preprint arXiv:1609.02907. 2017.

[ref8] 8. Veličković P, Cucurull G, Casanova A. et al. Graph attention networks. arXiv preprint arXiv:1710.10903. 2018

[ref9] 9. Brody S, Alon U, Yahav E. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491. 2021.

[ref10] 10. Wang T, Shao W, Huang Z. et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun 2021;12:3445. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11. Ramirez R, Chiu YC, Hererra A. et al. Classification of cancer types using graph convolutional neural networks. Front Phys 2020;8:203. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] 12. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] 13. Yu CQ, Jiang C, Wang L. et al. iHofman: a predictive model integrating high-order and low-order features with weighted attention mechanisms for circRNA-miRNA interactions. BMC Biol 2025;23:162. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14. Wang X, Yu C, Li L. et al. A feature extraction method based on noise reduction for circRNA-miRNA interaction prediction combining multi-structure features in the association networks. Brief Bioinform 2024;25:bbad111. [DOI] [PubMed] [Google Scholar]

[ref15] 15. Wei M, Wang L, Li Y. et al. BioKG-CMI: a multi-source feature fusion model based on biological knowledge graph for predicting circRNA-miRNA interactions. Sci China Inf Sci 2024;67:189104. [Google Scholar]

[ref16] 16. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17. Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression 3rd edn. Hoboken: Wiley, 2013. [Google Scholar]

[ref18] 18. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 2005;67:301–20.

[ref19] 19. Breiman L. Random forests. Mach Learn 2001;45:5–32.

[ref20] 20. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273–97.

[ref21] 21. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:533–6.

[ref22] 22. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13:281–305.

[ref23] 23. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, Vol. 25, pp. 2951–9, 2012.

[ref24] 24. Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28:1–26.

[ref25] 25. Cortese AC, Smart C, Brown SD. Intracellular progesterone receptor and cSrc protein working together to promote migration and invasion of human glioblastoma cells. Front Endocrinol 2021;12:640298.

[ref26] 26. Yan Y, Li X, Zhang Y. et al. Integrated analysis of ECT2 and COL17A1 as potential biomarkers of glioblastoma multiforme. Biomed Res Int 2022;2022:9453549.

[ref27] 27. Yang S, Wang X, Huan R. et al. Machine learning unveils immune-related signature in multicenter glioma studies. iScience 2024;27:109317.

[ref28] 28. Yang J, Bielenberg DR, Rodig SJ. et al. Lipocalin 2 promotes breast cancer progression. Proc Natl Acad Sci U S A 2009;106:3913–8.

[ref29] 29. Gu XL, Ou ZL, Lin FJ. et al. Expression of CXCL14 and its anticancer role in breast cancer. Breast Cancer Res Treat 2012;135:725–35.

[ref30] 30. Tchou J, Wang LC, Selven B. et al. Mesothelin, a novel immunotherapy target for triple negative breast cancer. Breast Cancer Res Treat 2012;133:799–804.

[ref31] 31. Zhang L, Cui J, Leonard M. et al. Silencing MED1 sensitizes breast cancer cells to pure anti-estrogen fulvestrant in vitro and in vivo. PLoS One 2013;8:e70641.

[ref32] 32. Radenkovic S, Konjevic G, Jurisic V. et al. Values of MMP-2 and MMP-9 in tumor tissue of basal-like breast cancer patients. Cell Biochem Biophys 2014;68:143–52.

[ref33] 33. Nueda ML, Naranjo AI, Baladrón V. et al. Different expression levels of DLK1 inversely modulate the oncogenic potential of human MDA-MB-231 breast cancer cells through inhibition of NOTCH1 signaling. FASEB J 2017;31:3484–96.

[ref34] 34. Li X, Li Z, Gu S. et al. A pan-cancer analysis of collagen VI family on prognosis, tumor microenvironment, and its potential therapeutic effect. BMC Bioinformatics 2022;23:390.

[ref35] 35. Gao S, Wang Y, Xu Y. et al. An angiogenesis-related lncRNA signature is associated with prognosis and tumor immune microenvironment in breast cancer. J Pers Med 2023;13:513.

[ref36] 36. Luo L, Yang P, Mastoraki S. et al. Single-cell RNA sequencing identifies molecular biomarkers predicting late progression to CDK4/6 inhibition in patients with HR+/HER2- metastatic breast cancer. Mol Cancer 2025;24:48.

[ref37] 37. Dumitru CA, Schröder H, Schäfer FTA. et al. Progesterone receptor membrane component 1 (PGRMC1) modulates tumour progression, the immune microenvironment and the response to therapy in glioblastoma. Cells 2023;12:2498.

[ref38] 38. Weller M, van den Bent M, Tonn JC. et al. European Association for Neuro-Oncology (EANO) guideline on the diagnosis and treatment of adult astrocytic and oligodendroglial gliomas. Lancet Oncol 2017;18:e315–29.

[ref39] 39. Molenaar RJ, Maciejewski JP, Wilmink JW. et al. Wild-type and mutated IDH1/2 enzymes and therapy responses. Oncogene 2018;37:1949–60.

[ref40] 40. Sjöberg E, Augsten M, Bergh J. et al. Expression of the chemokine CXCL14 in the tumour stroma is an independent marker of survival in breast cancer. Br J Cancer 2016;114:1117–24.

[ref41] 41. Sjöberg E, Meyrath M, Milde L. et al. A novel ACKR2-dependent role of fibroblast-derived CXCL14 in epithelial-to-mesenchymal transition and metastasis of breast cancer. Clin Cancer Res 2019;25:3702–17.

[ref42] 42. Gibbs C, So JY, Ahad A. et al. CXCL14 attenuates triple-negative breast cancer progression by regulating immune profiles of the tumor microenvironment in a T cell-dependent manner. Int J Mol Sci 2022;23:9314.

[ref43] 43. Bergers G, Hanahan D. Modes of resistance to anti-angiogenic therapy. Nat Rev Cancer 2008;8:592–603.

[ref44] 44. Ying Z, Bourgeois D, You J. et al. GNNExplainer: Generating explanations for graph neural networks. In: Advances in Neural Information Processing Systems, Vol. 32, pp. 9240–51, 2019.

[ref45] 45. Chereda H, Bleckmann A, Menck K. et al. Explaining decisions of graph convolutional neural networks: Patient-specific molecular subnetworks responsible for metastasis prediction in breast cancer. Genome Med 2021;13:42.

[ref46] 46. Chen J, Ma T, Xiao C. FastGCN: fast learning with graph convolutional networks via importance sampling. Int Conf Learn Represent 2020. arXiv:1801.10247.

[ref47] 47. Rhee S, Seo S, Kim S. Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 3527–34, 2018.

[ref48] 48. Dutil F, Cohen JP, Weiss M. et al. Towards gene expression convolutions using gene interaction graphs. Int Conf Mach Learn Workshop Comput Biol 2018. arXiv:1806.06975.

[ref49] 49. Theodoris CV, Xiao L, Chopra A. et al. Transfer learning enables predictions in network biology. Nature 2023;618:616–24.

[ref50] 50. Shah S, Li X, Wu X. et al. Noise injection to control false discoveries in single-cell differential expression. Nat Biotechnol 2023;41:1878–87.
