Improved supervised prediction of aging-related genes via weighted dynamic network analysis

Qi Li; Khalique Newaz; Tijana Milenković

doi:10.1186/s12859-021-04439-3

. 2021 Oct 25;22:520. doi: 10.1186/s12859-021-04439-3

Improved supervised prediction of aging-related genes via weighted dynamic network analysis

Qi Li ^1,^#, Khalique Newaz ^1,^#, Tijana Milenković ^1,^✉

PMCID: PMC8543111 PMID: 34696741

Abstract

Background

This study focuses on the task of supervised prediction of aging-related genes from -omics data. Unlike gene expression methods for this task that capture aging-specific information but ignore interactions between genes (i.e., their protein products), or protein–protein interaction (PPI) network methods for this task that account for PPIs but the PPIs are context-unspecific, we recently integrated the two data types into an aging-specific PPI subnetwork, which yielded more accurate aging-related gene predictions. However, a dynamic aging-specific subnetwork did not improve prediction performance compared to a static aging-specific subnetwork, despite the aging process being dynamic. This could be because the dynamic subnetwork was inferred using a naive Induced subgraph approach. Instead, we recently inferred a dynamic aging-specific subnetwork using a methodologically more advanced notion of network propagation (NP), which improved upon Induced dynamic aging-specific subnetwork in a different task, that of unsupervised analyses of the aging process.

Results

Here, we evaluate whether our existing NP-based dynamic subnetwork will improve upon the dynamic as well as static subnetwork constructed by the Induced approach in the considered task of supervised prediction of aging-related genes. The existing NP-based subnetwork is unweighted, i.e., it gives equal importance to each of the aging-specific PPIs. Because accounting for aging-specific edge weights might be important, we additionally propose a weighted NP-based dynamic aging-specific subnetwork. We demonstrate that a predictive machine learning model trained and tested on the weighted subnetwork yields higher accuracy when predicting aging-related genes than predictive models run on the existing unweighted dynamic or static subnetworks, regardless of whether the existing subnetworks were inferred using NP or the Induced approach.

Conclusions

Our proposed weighted dynamic aging-specific subnetwork and its corresponding predictive model could guide with higher confidence than the existing data and models the discovery of novel aging-related gene candidates for future wet lab validation.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-021-04439-3.

Keywords: Biological networks, Dynamic network analysis, Aging, Node classification

Introduction

Motivation and related work

Incidence of many complex diseases, such as diabetes, cancer, osteoarthritis, cardiovascular, and Alzheimer’s disease, increases with age [1]. Even recent and widespread COVID-19 is highly related to aging. Namely, according to the United States Centers for Disease Control and Prevention1, as of March 24, 2021, $\sim 81 %$ of all deaths related to COVID-19 occurred in the age range of 65 years and above.

Understanding the molecular mechanisms behind the aging process, including comprehensive and accurate identification of human genes implicated in aging, is important for studying and treating such aging-related diseases [2, 3]. However, analyzing human aging via wet lab experiments is difficult due to long human life span and ethical constraints [4, 5]. Analyzing human aging computationally can fill this gap. This includes identification (i.e., prediction) of aging-related genes via supervised learning from human -omics data [3, 6, 7], which is the task that we focus on in this paper. By supervised prediction of aging-related genes, we mean that a part of current aging-related knowledge is used when predicting which genes are linked to aging, and the remaining part of that knowledge is used when evaluating the predictions. On the other hand, an orthogonal and thus complementary task, which is outside of the scope of this paper, is that of unsupervised prediction of aging-related genes. By this, we mean that no current aging-related knowledge is used when making predictions, and instead, all of the aging-related knowledge is used only to evaluate the predictions.

Approaches for our task at hand—supervised prediction of aging-related genes—can be categorized as follows: (1) those that use gene expression data alone, (2) those that use protein–protein interaction (PPI) network data alone, and (3) those that combine gene expression data with PPI network data.

Approaches from the first category predict a gene as aging-related if its expression level varies with age [8–12]. While such approaches do capture aging-specific information, they ignore interactions between genes, i.e., their protein products. This is their drawback, because proteins carry out cellular functioning, including the aging process, by interacting with each other [13].

Approaches from the second category predict a gene as aging-related if its position (i.e., node representation/embedding/feature) in the PPI network is “similar enough” to the network positions of known aging-related genes [6, 7, 14, 15]. While these approaches do consider PPIs that carry out cellular functioning, their drawback is that the PPIs are context-unspecific, i.e., the PPIs span different conditions, such as cell types, tissues, diseases, environments, or patients. In the context of our study, this means that the PPIs are not aging-specific, i.e., they span different ages.

Approaches from the third category address the above drawbacks by considering both aging-specific gene expression data and context-unspecific PPI network data. Earlier approaches of this type extracted genes’ features from each of gene expression data and PPI network data individually and then concatenated the features [6, 7]. As such, they integrated the features rather than the data. Consequently, they still considered the context-unspecific PPI network. More recent efforts, including by our group, first integrated the two data types to infer an aging-specific subnetwork of the entire context-unspecific PPI network, and then analyzed the resulting subnetwork [4, 16]. Specifically, these studies used a traditional Induced approach: given gene expression data for different ages and an entire context-unspecific PPI network (which happens to be static), the studies formed a subnetwork snapshot for each age, where each snapshot consisted of all genes that were significantly expressed (i.e., active) at the given age and all PPIs from the entire network that exist between the active genes. That is, they formed a given age-specific snapshot by taking the Induced subgraph on the active genes at that age. Then, they combined snapshots for all ages into a dynamic aging-specific PPI subnetwork. Such a subnetwork, being dynamic, is meant to capture how network positions of genes change with age.

These studies used their inferred dynamic subnetworks in unsupervised aging-related tasks [4, 16]. So, while their inferred subnetworks are relevant for this paper, their unsupervised analyses of the subnetworks fall outside the scope of this paper.

On the other hand, supervised analyses from our other, even more recent work [17, 18] are directly relevant for this paper. In that work, we performed supervised prediction of aging-related genes from three networks: (i) the dynamic aging-specific subnetwork inferred by Faisal and Milenković [4], (ii) a static (still aging-specific) counterpart of this dynamic subnetwork, and (iii) the entire (also static) context-unspecific PPI network from which the above dynamic aging-specific subnetwork was inferred. Note that the static Induced subnetwork from point ii) was constructed by aggregating all nodes and edges over all snapshots of the dynamic Induced subnetwork. Then, first, we examined whether using the dynamic or static aging-specific subnetwork improved the supervised prediction accuracy compared to using the entire (static) context-unspecific PPI network. Indeed, we found this to hold overall (i.e., it held for many although not all of the considered evaluation measures). Second, because the aging process is dynamic, we examined whether using the dynamic aging-specific subnetwork was superior to using the static aging-specific subnetwork. Surprisingly, we found this not to hold overall (i.e., it did not hold for most of the considered evaluation measures).

This unexpected finding could be because the dynamic subnetwork was inferred using the Induced approach, which is quite naive as it considers all PPIs from the context-unspecific network that exist between only the active genes at a given age. However, first, not all PPIs between the active genes might be equally “important”, and the Induced approach has no way of identifying the most important of all such PPIs [19]. Second, the Induced approach fails to consider any inactive genes that might critically connect the active genes in the network [19].

Recently, an alternative, methodologically more advanced notion of network propagation (NP) was used for inference of a static context-specific subnetwork in the context of cancer [20]. NP maps expression levels (i.e., activities) onto the genes in the entire context-unspecific PPI network. Then, NP propagates the activities via random walk or diffusion, to assign condition-specific weights to the nodes (genes, i.e., their protein products) or edges (PPIs) in the entire PPI network. Finally, NP assumes that it is the highest-weighted network regions that are the most relevant for the condition of interest, i.e., such regions form the context-specific subnetwork. Hence, as opposed to the Induced approach, first, NP assigns weights to PPIs that can help identify the most “important” PPIs. Second, NP can consider a non-active gene if, for example, the gene is connected to sufficiently many active genes.

Our group extended two prominent NP approaches, i.e., NetWalk [21] and HotNet2 [22], to allow for the inference of dynamic context-specific subnetworks in the context of aging [19]. Then, we evaluated the NP-based aging-specific subnetworks in an unsupervised learning context, finding that NP-based dynamic aging-specific subnetworks outperform the dynamic aging-specific subnetwork created using the Induced approach [19]. In particular, a dynamic aging-specific subnetwork inferred using the NetWalk method outperformed every other considered dynamic aging-specific subnetwork [19]. This is why in this paper, out of all existing NP-based aging-specific subnetworks, we focus on the NetWalk-based one. For methodological details of NetWalk and HotNet2, please refer to their original publications [21, 22] or to our work on unsupervised study of aging [19]. Note that while the inferred NP-based subnetworks are relevant for this paper, the unsupervised analyses of the networks by Newaz and Milenković [19] fall outside of its scope.

With the above information in mind, in this paper, a key question that we aim to answer is whether using an NP-based dynamic aging-specific subnetwork (either the existing one by NetWalk or a novel one that we propose as a contribution of our study; see below) will outperform every other available network (i.e., the dynamic and static aging-specific subnetworks resulting from the Induced approach, static counterparts of the dynamic NP-based aging-specific subnetworks, and the entire static context-unspecific PPI network). This question has several subquestions, as discussed next.

Our study and contributions

Our first subquestion is whether using an NP-based dynamic aging-specific subnetwork improves the supervised prediction of aging-related genes compared to using any of the existing (dynamic or static) Induced aging-specific subnetworks. To answer this, we consider two NP-based dynamic subnetworks and two Induced subnetworks, as follows.

One of the considered NP-based dynamic aging-specific subnetworks, which we refer to as NetWalk-Dynamic, was inferred in our existing study [19] (see “Methods”). This subnetwork was constructed by using NP (specifically, NetWalk) to propagate expression activities onto the genes in the entire context-unspecific PPI network in order to assign aging-specific weights to the PPIs. However, NetWalk does not have a procedure of extracting a context-specific (i.e., aging-specific) subnetwork from such a weighted PPI network [19]. Therefore, in our previous study [19], we first applied a weight threshold and then treated all PPIs with weights above the threshold as aging-specific. Then, we extracted the aging-specific PPIs (and proteins involved in these PPIs) as the resulting aging-specific subnetwork. Consequently, this subnetwork is unweighted, as the PPIs were assigned equal importance (see [19] for more detail). This approach, which was the best one could do at the time, suffers from two limitations. First, it is hard to choose a meaningful PPI weight threshold, because there is no clear guideline about which threshold is meaningful. So, one needs to empirically evaluate subnetworks constructed at multiple thresholds, which is computationally inefficient. Second, this approach cannot distinguish between different aging-specific PPIs, because it assumes that each such PPI carries the same amount of aging-specific information as the others. To address these limitations, in this study, we aim to infer a novel, i.e., weighted NP-based dynamic aging-specific subnetwork, with hope that accounting for PPI weights, i.e., for aging-specific importance of the PPIs, will perform better in our task of supervised prediction of aging-related genes. We refer to this weighted subnetwork as wNetWalk-Dynamic (see “Methods”). We aim to evaluate whether at least one of NetWalk-Dynamic and wNetWalk-Dynamic is better than the dynamic and static Induced subnetworks from our previous studies that have already been discussed above [17, 18]. We refer to the latter two as Induced-Dynamic and Induced-Static, respectively.

Assuming that NP performs better than the Induced approach, our second subquestion is whether at least one of the two NP-based dynamic aging-specific subnetworks is better than static counterparts of both of the NP-based dynamic subnetworks. We do this because aging is a dynamic process, and so dynamic network analysis of aging should outperform static network analysis of aging. To answer this, for reasons described in “Methods” section, we construct two distinct static counterparts for NetWalk-Dynamic, referred to as NetWalk-Static and NetWalk-Static*, respectively. We also construct a (single) static counterpart for wNetWalk-Dynamic, referred to as wNetWalk-Static*.

Recall that we hypothesize that weighted NP will be better than unweighted NP, which is why we infer wNetWalk-Dynamic in the first place. So, our third subquestion is to test this hypothesis—whether wNetWalk-Dynamic performs better than NetWalk-Dynamic. Note that for as comprehensive evaluation as possible, we also consider the entire context-unspecific network that all aging-specific subnetworks are inferred from, with an expectation that the latter will outperform the entire network. We refer to the entire network as Entire. We summarize the eight considered (sub)networks in Table 1.

Table 1.

Summary of the eight considered (sub)networks for each of two considered entire context-unspecific human PPI networks

Network name	Weighted	Unweighted	Dynamic	Static
Entire		$✓$		$✓$
Induced-Dynamic		$✓$	$✓$
Induced-Static		$✓$		$✓$
NetWalk-Dynamic		$✓$	$✓$
NetWalk-Static		$✓$		$✓$
NetWalk-Static*		$✓$		$✓$
wNetWalk-Dynamic	$✓$		$✓$
wNetWalk-Static*	$✓$			$✓$

Open in a new tab

One of the two entire context-unspecific networks is from HPRD [23], which is quite old, but we use it because many relevant aging-specific subnetworks were already inferred from it in the prior work (see text for details). The other entire context-unspecific network is from BioGRID [24], which we use as the most recent human PPI data (“Data” section). So, when we use HPRD, “Entire” represents the HPRD entire context-unspecific network, and the other seven aging-specific subnetworks are all inferred from HPRD. When we use BioGRID, “Entire” represents the BioGRID entire context-unspecific network, and the other seven aging-specific subnetworks are all inferred from BioGRID. We study HPRD (and its associated aging-specific subnetworks) separately from BioGRID (and its associated aging-specific subnetworks), with the goal of evaluating whether results that we observe for HPRD also hold for the most recent BioGRID data

For each (sub)network, we develop multiple supervised predictive models where each model consists of a (1) feature, (2) feature dimensionality reduction choice, and (3) classifier. (4) Regarding features, given that our (sub)networks are of different types (unweighted vs. weighted, and static vs. dynamic), we need to choose features that fit a given network type. For unweighted dynamic and static (sub)networks, we use existing sophisticated features from our previous studies [17, 18]. For weighted dynamic subnetwork, we consider five existing features [25–27]. However, because the existing weighted features are simple extensions of the notion of unweighted network centrality, we believe that we can do better. So, we propose a new way to extract weighted dynamic features of nodes, with hope that they will outperform the existing weighted dynamic features. Thus, our fourth subquestion is whether using our proposed weighted dynamic features will outperform the existing weighted dynamic features on wNetWalk-Dynamic. In the process, as a control, we propose static counterparts of our new weighted dynamic features to evaluate whether the latter will be superior, as we would expect. (2) Regarding dimensionality reduction choices, for each feature, we consider its full version and its dimensionality-reduced version. (3) Regarding classifiers, we couple each feature (version) with three prominent classifiers. For details, see Additional file 1: Section S1.

For each predictive model, we evaluate its performance via cross-validation [18]: we train on a subset of known aging- and non-aging-related genes and test on the remaining subset of known aging- and non-aging-related genes. To define known aging- and non-aging-related gene labels for classification, we rely on highly confident ground truth data, i.e., GenAge [28]. We evaluate performance of a predictive model in terms of the area under the precision-recall (AUPR) curve, precision, recall, and F-score. Given all predictive models of a (sub)network, we choose the best predictive model that yields the highest AUPR score. Then, we compare the different (sub)networks, each under its selected best model, via all four prediction accuracy measures. For details, see Additional file 1: Section S1.2.

Because cancer is known to be aging-related [1, 29–31], in addition to cross-validation, we use a highly confident cancer-related data set to validate the performance of each (sub)network. Specifically, given a (sub)network, we measure the number of cancer-related genes among its novel gene predictions, i.e., genes predicted as aging-related that are currently not associated with aging.

In summary, the above four subquestions can be seen as testing the following four hypotheses (for each of HPRD and BioGRID): (1) whether using an NP-based aging-specific subnetwork will outperform all Induced aging-specific subnetworks, (2) whether using an NP-based dynamic aging-specific subnetwork will outperform all NP-based static aging-specific subnetworks, (3) whether using the weighted dynamic NP-based aging-specific subnetwork will outperform the unweighted dynamic NP-based aging-specific subnetworks, and (4) given the weighted dynamic subnetwork, whether using our proposed features will outperform existing features. We summarize our study in Fig. 1.

Fig. 1 — Summary of our study. For details on each step of the study, see the text

We find that overall, all of our four hypotheses hold with respect to both cross-validation and cancer-related validation, for both HPRD and BioGRID. Most importantly, our proposed weighted dynamic aging-specific subnetwork (i.e., wNetWalk-Dynamic) along with our proposed weighted dynamic feature performs the best among all considered (sub)networks. This finding justifies the need of using a weighted dynamic aging-specific subnetwork in the task of predicting aging-related genes, which is the key contribution of our study.

Because wNetWalk-Dynamic performs the best, we believe that it has potential to guide future experimental validation of its aging-related gene predictions that are currently not known to be aging-related. As the first attempt to this, we identify all six such genes that are predicted only by wNetWalk-Dynamic (when using HPRD). We manage to validate all of the six genes by manually searching for a link between them and the aging process in PubMed articles. This finding is another contribution of our study.

Results

Recall that we consider two entire context-unspecific networks: HPRD and BioGRID. All relevant aging-specific subnetworks from the prior work were inferred from HPRD. For this reason, due to space constraints, in the main paper, we report detailed results primarily for HPRD, but we do discuss BioGRID results briefly as well. We report detailed results for BioGRID in the supplement. Unless otherwise noted, results reported in the main paper are for HPRD.

Moreover, in the main paper, we primarily report detailed results for when using the GenAge-based definition of aging- and non-aging-related gene labels; we do this in the first six subsections. Human genes in GenAge are sequence orthologs of aging-related genes in model species. Because not all human genes have sequence orthologs in model species [32, 33], using GenAge may miss aspects of the aging process that are unique to the human species [34]. So, we consider another source of aging-related knowledge obtained by studying the human species directly (rather than doing so indirectly from model species), namely the down-regulated aging-related genes from genotype-tissue expression (GTEx) project [8], i.e., GTEx-DAG. We report results when using the GTEx-DAG-based label definition in the seventh subsection titled “Results when using our secondary aging-related gene label definition with respect to GTEx-DAG”. Note that we consider both GenAge- and GTEx-DAG-based label definitions only for HPRD. For BioGRID, we consider only the GenAge-based definition. This is because analyses on BioGRID are computationally expensive given its large size, and we have had to choose one of GenAge and GTEx-DAG. We have chosen the former because (as we will show) for HPRD-based (sub)networks, prediction accuracy scores are overall much higher when using GenAge than when using GTEx-DAG, which might indicate the former to be of higher quality. Also, results when using GenAge are qualitatively similar for HPRD and BioGRID, and we expect them to be similar when using GTEx-DAG as well.

The best predictive model for each (sub)network

In this subsection, we comment on how we select the best predictive model for each (HPRD-based) (sub)network. Recall that a predictive model consists of three components, i.e., a node feature, a feature dimensionality reduction choice, and a classifier. Regarding node features, we consider four types of features, each corresponding to one of the four network types (i.e., each feature can only be extracted from its corresponding network type). We summarize the considered features in Table 2. Regarding dimensionality reduction choices, for each feature, we consider its full version as well as its principal component analysis (PCA) reduced versions. Regarding classifiers, for each feature version, we run three classifiers, i.e., logistic regression (LR), naive Bayes (NB), and support vector machine (SVM) with radial basis function (rbf) (SVM-rbf). For methodological details and reasons of why we use these components, see Additional file 1: Section S1. For summary of predictive models we use for each (sub)network, see Additional file 1: Section S1.2 and Additional file 1: Table S1.

Table 2.

The considered node features and their types, i.e., weighted dynamic, weighted static, unweighted dynamic, and unweighted static

Weighted

Unweighted

Dynamic

Diff-nobin-2

DegC-wt, ClusC-wt, CloseC-wt, BetwC-wt, EigenC-wt

DGDV, GoT, GDC,

ECC, KC, DegC, CentraMV

Static

Static-nobin-2

SGDV, UniNet, 30BPIs

Open in a new tab

Features in bold are (the best versions of) our proposed features, and the rest are the considered existing features. Note that we list only the best version of our proposed weighted dynamic features for simplicity, as we proposed, tested, and compared 30 such features. Also, note that we only test the weighted static counterpart of the best weighted dynamic feature. For the discussion on our proposed weighted dynamic features, see Section “The proposed weighted features”. For the discussion on how we select the best of the existing weighted dynamic features and the best of our proposed weighted dynamic features, see Additional file 1: Section S2 and Additional file 1: Fig. S1

Recall that for each of the eight (sub)networks, we select the best predictive model that yields the highest AUPR score. Briefly, the eight selected models (each corresponding to a (sub)network) encompass five features, two dimensionality reduction choices, and three classifiers (Table 3). While some individual components of a model are somewhat preferred over others, there is no model as a whole that consistently works the best over all types of (sub)networks. So, in summary, there is no clear pattern regarding which component of which model works well on which (sub)network.

Table 3.

The selected best predictive model for each (sub)network with respect to the GenAge-based aging- and non-aging-related gene labels, and when using HPRD

Entire	Induced-Dynamic	Induced-Static
SGDV + None + NB	DGDV + None + NB	SGDV + None + NB
NetWalk-Dynamic	NetWalk-Static	NetWalk-Static*
DGDV + None + NB	30BPIs + None + SVM-rbf	30BPIs + PCA + LR
wNetWalk-Dynamic	wNetWalk-Static*
Diff-nobin-2 + None + LR	Static-nobin-2 + None + LR

Open in a new tab

The model is represented as “X+Y+Z”, where X represents the selected feature, Y represents the selected dimensionality reduction choice (i.e., None or PCA), and Z represents the selected classifier. For dimensionality reduction choices, “None” means the selected feature is in its full version, and “PCA” means the selected feature is in its PCA-reduced version. For the best predictive models when using BioGRID instead of HPRD, see Additional file 1: Table S3

Using NP-based versus induced aging-specific subnetworks

We first comment on our hypothesis (1): whether using an NP-based aging-specific subnetwork will outperform all Induced aging-specific subnetworks. We begin with testing whether the best of the existing (unweighted) dynamic subnetworks from our previous study [19] (i.e., NetWalk-Dynamic) and its two (unweighted) static counterparts (i.e., NetWalk-Static and NetWalk-Static*) outperforms the two Induced subnetworks (i.e., Induced-Dynamic and Induced-Static). However, we find this not to be the case. In the context of cross-validation, NetWalk-Dynamic, which performs the best among the above three unweighted NP-based subnetworks, has a tied performance to Induced-Dynamic in terms of recall and F-score but has a lower performance than Induced-Dynamic in terms of AUPR and precision (Fig. 2). Also, NetWalk-Dynamic has a lower performance than Induced-Static in terms of AUPR, recall, and F-score, but a marginally better performance than Induced-Static in terms of precision. Similarly, in the context of cancer-related validation, NetWalk-Dynamic has a lower performance than both Induced-Dynamic and Induced-Static (Fig. 4).

Fig. 2 — Prediction accuracy in terms of AUPR, precision, recall, and F-score for the eight (sub)networks, each under its best predictive model, when using the primary GenAge-based definition of aging- and non-aging-related gene labels, and when using HPRD. On the x-axis, the number in the parenthesis after the name of each (sub)network corresponds to the number of predictions. Note that we have two groups of results for wNetWalk-Dynamic, i.e., wNetWalk-Dynamic Existing and wNetWalk-Dynamic Proposed. The former corresponds to using the best existing weighted dynamic feature on wNetWalk-Dynamic, and the latter corresponds to using the best proposed weighted dynamic feature on wNetWalk-Dynamic. For results when using BioGRID instead of HPRD, see Additional file 1: Fig. S4

Fig. 4 — Cancer-related validation of the novel aging-related gene predictions for the eight considered (sub)networks, each under its best predictive model, when using the primary GenAge-based definition of aging- and non-aging-related gene labels, and when using HPRD. On the x-axis, the number in the parentheses after the name of each (sub)network is the number of the novel predictions. The precision score of a (sub)network is the percentage of novel predictions that are validated by the cancer-related data. Note again that we have two groups of results for wNetWalk-Dynamic, i.e., wNetWalk-Dynamic Existing and wNetWalk-Dynamic Proposed, which we already explained in the caption of Fig. 2. For results when using BioGRID instead of HPRD, see Additional file 1: Fig. S7

These results match our postulation that unweighted NetWalk-based subnetworks could be suboptimal because all of their PPIs are treated with equal importance (see “Introduction”). This is exactly why we propose to infer a weighted subnetwork using NetWalk in the first place. Indeed, when we look into whether the best of the weighted NP-based subnetworks, i.e., wNetWalk-Dynamic, outperforms all Induced subnetworks, we find this to be the case in terms of most of the considered evaluation measures, as follows.

In the context of cross-validation, wNetWalk-Dynamic, when analyzed via our proposed novel weighted dynamic features, outperforms all Induced subnetworks in terms of AUPR, recall, and F-score. In terms of precision, while wNetWalk-Dynamic is marginally (i.e., not statistically significant) inferior to Induced-Dynamic, it still outperforms Induced-Static. The superiority of wNetWalk-Dynamic is statistically significant when compared to Induced-Dynamic in terms of AUPR, recall, and F-score, and when compared to Induced-Static in terms of AUPR and F-score (Fig. 2) (adjusted p-values $< 0.05$ ).

For all four cross-validation prediction accuracy measures, all (sub)networks perform statistically significantly better than the random approach (adjusted p-values $\leq 0.032$ ). That is, each (sub)network predicts a majority (over 56%) of the genes that are labeled as aging-related (i.e., are known aging-related genes). So, we expect the pairwise overlaps of all (sub)networks’ true positives to be reasonably large. Indeed, we find that overlaps are statistically significantly high for all (sub)network pairs. The largest overlap over all (sub)network pairs is 89.5%, the smallest one is 61.2%, and the average one is 72.9% (Fig. 3). We note that even though our wNetWalk-Dynamic marginally underperforms Induced-Dynamic in terms of precision, wNetWalk-Dynamic not only correctly predicts 63 out of all 65 known aging-related genes that Induced-Dynamic predicts (i.e., 96.9% of them), but it also correctly predicts 38 known aging-related genes that Induced-Dynamic does not predict. Similarly, we examine pairwise overlaps of the (sub)networks’ predicted genes that are currently labeled as non-aging-related, i.e., of their novel aging-related gene predictions. We find that all of the pairwise overlaps are statistically significantly high. The largest overlap over all pairs is 62.2%, the smallest one is 17.9%, and the average one is 33.9% (Additional file 1: Fig. S2).

Fig. 3 — Overlaps (expressed as Jaccard indices) between true positive predictions of the eight considered (sub)networks, each under its best predictive model, when using the primary GenAge-based definition of aging- and non-aging-related gene labels, and when using HPRD. On the x- and y-axes, the numbers in the parentheses after the name of each subnetwork are the number of true positive predictions (i.e., known aging-related genes) and the number of total predictions. Within each table/matrix cell, the number on the first line is the Jaccard index of the overlap between the corresponding pair of subnetworks; the number on the second line is the raw overlap size, i.e., the actual count of true positives that are in the overlap between the two subnetworks; the number on the third line is the adjusted p-value of the overlap with respect to the hypergeometric test. The darker the color of a given cell, the larger the Jaccard index value, i.e., the higher the overlap. Note that we have two groups of results for wNetWalk-Dynamic, i.e., wNetWalk-Dynamic Existing and wNetWalk-Dynamic Proposed, which we already explained in the caption of Fig. 2. For results when using BioGRID instead of HPRD, see Additional file 1: Fig. S5

In the context of cancer-related validation (Fig. 4), we again find that wNetWalk-Dynamic (under our proposed novel weighted dynamic features) performs better than all Induced aging-specific subnetworks. That is, 34% of novel aging-related gene predictions (i.e., predicted genes currently not known to be aging-related) made by wNetWalk-Dynamic are present in the cancer-related data, while the best Induced aging-specific subnetwork (i.e., Induced-Dynamic) only gets 26% of novel predictions validated by the cancer-related data.

All of the above results signify that our hypothesis (1), i.e., that using an NP-based aging-specific subnetwork is better than using all Induced aging-specific subnetworks, holds with respect to almost all evaluation measures, i.e., AUPR, recall, F-score, pairwise prediction overlaps, as well as the cancer-related validation. This result stresses the significance of proposing our wNetWalk-Dynamic.

Using dynamic versus static NP-based aging-specific subnetworks

Here, we evaluate our hypothesis (2): whether using the best of the NP-based dynamic aging-specific subnetwork will outperform all NP-based static aging-specific subnetworks. Because wNetWalk-Dynamic is superior to (unweighted) NetWalk-Dynamic, it is wNetWalk-Dynamic that we compare to the NP-based static subnetworks (i.e., NetWalk-Static, NetWalk-Static*, and wNetWalk-Static*).

In the context of cross-validation, we find that wNetWalk-Dynamic indeed outperforms all three static NP-based subnetworks in terms of all four prediction accuracy measures (Fig. 2). The superiority of wNetWalk-Dynamic is statistically significant when compared to NetWalk-Static* in terms of recall, and when compared to wNetWalk-Static* in terms of AUPR (adjusted p-values $< 0.05$ ), while all other superiorities of wNetWalk-Dynamic are non-significant. When we look into the pairwise prediction overlaps between the dynamic vs. static NP-based subnetworks (Fig. 3), we confirm wNetWalk-Dynamic’s superiority. Namely, it is wNetWalk-Dynamic that not only has the highest precision, but also has the highest number of predictions than any of the other NP-based subnetworks.

In the context of cancer-related validation, we again confirm the superiority of wNetWalk-Dynamic (Fig. 4). That is, 34% of the novel aging-related genes predicted by wNetWalk-Dynamic are present in the cancer-related data. At the same time, the best static NP-based subnetwork (i.e., NetWalk-Static*) only has 28% of its novel predictions present in the cancer-related data.

Therefore, hypothesis (2), i.e., that using an NP-based dynamic aging-specific subnetwork is better than using all NP-based static subnetworks, holds with respect to all evaluation measures. This result further stresses the significance of proposing our weighted dynamic NP-based aging-specific subnetwork.

Using weighted versus unweighted dynamic NP-based aging-specific subnetworks

Here, we evaluate our hypothesis (3): whether using the weighted dynamic NP-based aging-specific subnetwork (i.e., wNetWalk-Dynamic) is better than using the unweighted dynamic NP-based aging-specific subnetwork (i.e., NetWalk-Dynamic). Indeed, we find this to be the case.

In the context of cross-validation, as shown in Fig. 2, wNetWalk-Dynamic outperforms NetWalk-Dynamic in terms of all four prediction accuracy measures. Of the four, the superiority is statistically significantly high (adjusted p-values $< 0.05$ ) in terms of recall and F-score. When we look into the pairwise prediction overlaps between wNetWalk-Dynamic and NetWalk-Dynamic, we find that wNetWalk-Dynamic not only correctly predicts 64 out of all 67 known aging-related genes that NetWalk-Dynamic predicts (i.e., 95.5% of them), but it also correctly predicts 37 known aging-related genes that NetWalk-Dynamic does not predict.

In the cancer-related validation, we again find wNetWalk-Dynamic to be better (Fig. 4). Specifically, among the novel aging-related genes (currently not known to be aging-related) predicted by wNetWalk-Dynamic and NetWalk-Dynamic, 34% and only 14%, respectively, are among the cancer-related genes.

Henceforth, hypothesis (3), i.e., that using wNetWalk-Dynamic is superior to using NetWalk-Dynamic, holds with respect to all evaluation measures. This result again stresses the importance of inferring our weighted NP-based dynamic aging-specific subnetwork.

Using our proposed versus existing weighted dynamic features

Here, we evaluate our hypothesis (4): whether using our proposed weighted dynamic features will be better than using the existing weighted dynamic features. To validate our hypothesis (4), given wNetWalk-Dynamic, it would suffice for our best weighted dynamic feature to outperform the best existing weighted dynamic feature. So, given wNetWalk-Dynamic, we choose the best predictive model based on the considered existing weighted dynamic features and the best predictive model based on our proposed weighted dynamic features, (see “Methods” and Additional file 1: Section S2). Then, we compare the two chosen best models using cross-validation, prediction overlap, and cancer-related validation.

In the context of cross-validation, we find that our best proposed feature outperforms the best existing feature in terms of all four prediction accuracy measures. Of the four measures, the superiority is statistically significant in terms of AUPR, recall, and F-score (i.e., adjusted p-values $< 0.05$ ) (Fig. 2). When we look into the pairwise prediction overlaps between our best proposed feature and the best existing feature, we find that the best proposed feature not only correctly predicts 79 out of all 84 known aging-related genes that the best existing feature predicts (i.e., 94.0% of them), but it also correctly predicts 22 known aging-related genes that the best existing feature does not predict.

In the context of cancer-related validation, our best proposed feature again wins over the best existing feature. Specifically, 34% of the novel aging-related genes predicted by the former and only 19% of the novel aging-related genes predicted by the latter are present in the cancer-related data.

Therefore, our hypothesis (4), i.e., that using the best of our proposed weighted dynamic features is better than using all of the existing weighted dynamic features, holds with respect to all evaluation measures.

Literature validation of novel aging-related gene predictions resulting from the best subnetwork, i.e., wNetWalk-Dynamic

Because wNetWalk-Dynamic (under the best of our proposed features) is the best subnetwork among all eight considered (sub)networks, those genes that are predicted by wNetWalk-Dynamic as aging-related but are currently not known to be linked to aging (i.e., that are novel aging-related gene predictions) are the most confident predictions for future experimental validation. There are 59 such novel predictions. Because it would be extremely time consuming to validate all of the 59 novel predictions, we aim to validate using literature search those novel predictions that are predicted only by wNetWalk-Dynamic but not by any other (sub)network. There are six such predictions, i.e., ATF1, GLI3, JUNB, NRP1, SMC1A, and SPI1. When we manually search for a direct (or indirect) link between these six genes and the aging process (or an aging-related disease) in PubMed articles, we manage to validate all six of them. Namely:

ATF1 was shown to be related to the pathogenesis of the Alzheimer’s disease [35], which is known to be highly related to aging [36].
GLI3 was shown to be linked to tumorigenicity of the colorectal cancer [37], which was shown to be mainly affecting the elderly [38].
JUNB’s expression levels were shown to increase in all immune cells during the aging process [39].
NRP1 was shown to be overexpressed in various human carcinoma cell lines and primary tumors [40], which are the main causes of skin cancer that usually appeared in old people [41].
The mutation of SMC1A was detected in colorectal cancer [42], which was shown to be mainly affecting the elderly [38].
SPI1 was identified as being upregulated in Alzheimer’s disease [43], which is known to be highly related to aging [36].

Results when using our secondary aging-related gene label definition with respect to GTEx-DAG

All results reported so far are for when using the GenAge-based labels to define genes as aging- vs. non-aging-related, and when using HPRD. Here, we report results when using the GTEx-DAG-based labels while still using HPRD. The (sub)networks’ best predictive models when using GTEx-DAG are shown in Additional file 1: Table S2.

In the context of cross-validation, like for the GenAge-based labels, we do see that wNetWalk-Dynamic again performs the best among all tested eight (sub)networks in terms of all prediction accuracy measures. Unlike for the GenAge-based labels, we now also see that (unweighted) NetWalk-Dynamic outperforms both Induced subnetworks. These results further support the need for NP-based weighted dynamic aging-specific subnetworks (Additional file 1: Fig. S3).

However, in our previous studies [18], we already showed that for the two Induced subnetworks, prediction accuracy is not nearly as good under the GTEx-DAG-based gene labels as it is under the GenAge-based gene labels. Here, we confirm that the same holds for the NP-based (weighted and unweighted) subnetworks. Their prediction accuracy scores are not statistically significantly more accurate than the random approach (adjusted p-values $\geq 0.05$ ).

Because of the random-like performance of all (sub)networks with respect to GTEx-DAG, we do not perform analysis of pairwise prediction overlaps. Also, we do not make any novel predictions from the GTEx-DAG-based classifiers. Consequently, the cancer-related validation cannot be performed.

Results for BioGRID-based (sub)networks

All results up to this point are for when using the HPRD entire context-unspecific network to infer all aging-specific subnetworks. In this subsection, we present results when using the BioGRID entire network instead. Note that in this case, we only use the GenAge-based aging- vs. non-aging-related gene labels. That is, we do not use GTEx-DAG-based labels when using BioGRID given their random-like performance when using HPRD (see above). The best predictive models for BioGRID-based (sub)networks (when using GenAge) are shown in Additional file 1: Table S3.

In the context of cross-validation, prediction accuracy scores for the BioGRID-based (sub)networks are statistically significantly higher than those of the random approach (adjusted p-values $< 0.05$ ). Similar to the HPRD-based subnetworks, for the BioGRID-based subnetworks, we see that wNetWalk-Dynamic performs the best among all tested eight (sub)networks in terms of all prediction accuracy measures. More importantly, wNetWalk-Dynamic statistically significantly (adjusted p-values $< 0.05$ ) outperforms most of the (sub)networks, for most of the considered performance measures. The only exceptions are: (1) Entire with respect to F-score and recall; (2) wNetWalk-Static with respect to all four prediction accuracy measures. These results further support the need for NP-based weighted dynamic aging-specific subnetworks (Additional file 1: Fig. S4). When we look into the actual prediction accuracy scores, we find that the scores are below 0.5 for all considered BioGRID-based (sub)networks (Additional file 1: Fig. S4), which overall is lower than for HPRD-based (sub)networks (Fig. 2). When we look into the raw prediction counts of aging-related genes (Additional file 1: Fig. S5 and S6), we find that overall there are more predictions but fewer true positives for BioGRID-based (sub)networks than for HPRD-based (sub)networks, which is why the performance of BioGRID is inferior to that of HPRD.

In the context of cancer-related validation (Additional file 1: Fig. S7), we again find that wNetWalk-Dynamic (under our proposed novel weighted dynamic features) performs better than all other (sub)networks. That is, 27% of novel aging-related gene predictions (i.e., predicted genes currently not known to be aging-related) made by wNetWalk-Dynamic are present in the cancer-related data, while the best unweighted (sub)network (i.e., Induced-Static) only gets 17% of novel predictions validated by the cancer-related data.

All of the above results signify that our four hypotheses hold not just when using HPRD but also when using BioGRID, because in both cases, using wNetWalk-Dynamic under our proposed weighted dynamic features performs the best. This again stresses the significance of proposing our wNetWalk-Dynamic and our weighted dynamic features, regardless of which entire human PPI network data we are using.

Discussion

In this study, we have analyzed aging-specific subnetworks derived from one of two human PPI network data sets, namely HPRD and BioGRID. Some additional observations and implications of our results are as follows.

Recall that a predictive model consists of three individual components, i.e., a feature, a classifier, and a dimensionality reduction choice. We find that there is no clear pattern regarding which component of which model works well on which (sub)network (Table 3 and Additional file 1: Tables S2 and S3). This stresses the need for our comprehensive evaluation of running all of the different models, in order to give each (sub)network the best-case advantage. Also, it emphasizes the importance of a shift in the field of machine learning development of interpretable predictive models, in order to allow for understanding the effect of each model component on the prediction outcome. Furthermore, the superiority of our proposed weighted dynamic aging-specific subnetwork when using our proposed weighted dynamic features with respect to GenAge-based aging-related gene labels over all other (sub)networks in terms of almost all evaluation measures stress out the significance of our work.

The performance is inferior when using GTEx-DAG compared to when using GenAge (on HPRD-based subnetworks). This is surprising, for two reasons. First, genes in GTEx-DAG have been linked to aging directly in human, while those in GenAge have been linked to aging indirectly based on their sequence similarities to aging-related genes in model species. So, GTEx-DAG should better capture aspects of the aging process that are unique to human. Given this, and given that we have aimed to extract human aging-related knowledge from human PPI subnetworks, one would expect GTEx-DAG to perform better. However, it does not. This could be because GenAge may be more accurate than GTEx-DAG. In particular, GenAge has more rigorous selection criteria of aging-related genes than GTEx-DAG (see Section “Data” for more details). Second, the considered (sub)networks have been formed by integrating gene expression data with PPI data. Given that GTEx-DAG is also gene expression-based, while GenAge is sequence-based, one would expect GTEx-DAG to perform better. However, it does not. This could be because GTEx-DAG and GenAge are highly complementary (when considering HPRD, only 17 genes are labeled as aging-related in both), and because our predictive models may be better suited for the aging-related knowledge present in GenAge.

The performance is inferior when using the BioGRID network data compared to when using the HPRD network data (under GenAge-based aging-related label information). This could be because the aging-specific gene expression data used to infer aging-specific subnetworks originates from the same time period as HPRD, while the BioGRID data that we have used is much more recent. Newer aging-specific gene expression data, obtained using more modern technologies such as RNA-seq have become available, such as data from the Genotype-Tissue Expression project (GTEx) [44]. Analyzing BioGRID-based (sub)networks using the newer gene expression data may thus be promising and is the subject of our future work.

Conclusions

In this study, we have proposed a novel NP-based weighted dynamic aging-specific subnetwork, along with novel weighted dynamic node features, which combined yield better prediction accuracy in supervised prediction of aging-related genes. We have used NP in hope of improving upon Induced that has been used traditionally. Indeed, this is what we observe.

Our findings are quantitatively robust to the choice of the entire context-unspecific PPI network used to infer aging-specific subnetworks. However, our findings are not quantitatively robust to the choice of ground truth aging-related data. Namely, our predictive models work better for sequence-based aging-specific expression data from model species than for gene expression-based aging-specific expression data from human. Development of predictive models that will be capable of capturing well human-unique aspects of the aging process (which do not exist in model species) is important.

Some ways to improve upon the current findings, including potentially on their robustness to data choice, could be as follows. First, in the current study, we use traditional machine learning where the user has to define a feature first and then “plug-in” this feature into an off-the-shelf classifier. However, lots of advancements have been made in deep learning on graphs, including graph convolutional networks (GCNs) for dynamic network analysis, in domains such as social or citation networks [45, 46]. So, it might be worth it to adopt or customize these existing GCNs to the computational biology domain and specifically to the task of studying the aging process. Second, the aging-specific gene expression data considered for subnetwork inference in the current study were curated a while ago. So, it might be worth it to adopt newer aging-specific gene expression data for this purpose, which is the subject of our future work.

We have studied human aging from PPI data. Nonetheless, our work can be applied to other types of biological networks, other species, or other biological processes, such as disease progression.

Methods

Data

The two considered entire context-unspecific PPI networks

In this study, we consider two entire context-unspecific PPI networks. The first considered unweighted entire context-unspecific PPI network in this study is obtained from the HPRD database [23]. We take the largest connected component that encompasses 8938 nodes and 35900 edges, and we denote it as Entire. The second considered unweighted entire context-unspecific PPI network in this study is obtained from the BioGRID database (i.e., version 4.4.197 released on April. 25th, 2021) [24]. We consider the largest connected component of human physical interactions from BioGRID, which encompasses 18,928 nodes and 484,146 edges. For each entire context-unspecific PPI network, we consider the following aging-specific subnetworks that are inferred from the given entire PPI network. We use HPRD entire network for illustration. All aging-specific subnetworks with respect to BioGRID are analogous to HPRD.

Seven considered aging-specific subnetworks

Other than Entire, we also consider seven aging-specific subnetworks that are inferred from Entire via two network inference methods, i.e., the Induced approach and the NP-based approach called NetWalk, as follows.

First, we use the dynamic aging-specific subnetwork inferred by Faisal and Milenković [4] using the Induced approach. Specifically, given human gene expression data at 37 ages spanning between 20 and 99 years [10], a dynamic aging-specific subnetwork consisting of 37 subnetwork snapshots, one snapshot per age, was constructed. Each subnetwork snapshot corresponded to the Induced subgraph of those genes that are significantly expressed (i.e., active) at the given age. Because the Induced approach does not map gene activities onto genes or PPIs (i.e., it does not assign weights to genes or PPIs), the 37 Induced subnetwork snapshots are unweighted. We refer to the resulting Induced unweighted dynamic aging-specific subnetwork, as Induced-Dynamic. Moreover, one of our key questions is whether aging-related genes can be predicted more accurately from a dynamic network than from a static network. So, we create an unweighted static counterpart for Induced-Dynamic, as follows. We aggregate (i.e., take the union of) the nodes and edges over all 37 snapshots such that a PPI is kept if it is present in any of the 37 subnetwork snapshots. We denote this static subnetwork as Induced-Static. This allows for a fair comparison between a dynamic subnetwork against its static counterpart, as the two contain the same nodes and edges.

Other than Induced-Dynamic, we also use an NP-based dynamic aging-specific subnetwork inferred in our previous work [19]. Key differences between the Induced approach and NP are discussed in Section “Introduction”. We extended two prominent NP approaches originally proposed for inference of a static cancer-specific subnetwork, i.e., NetWalk and HotNet2, to the problem of inferring a dynamic aging-specific subnetwork [19]. Specifically, given an NP approach, we created a dynamic aging-specific subnetwork with 37 age-specific subnetwork snapshots, as follows. Given Entire and expression levels for genes at a given age, we used the NP approach to assign age-specific weights to every PPI of Entire, such that the higher the weight of a PPI, the more important the PPI is. We did this for each of the 37 ages and obtained 37 age-specific weighted subnetwork snapshots. Finally, to obtain an unweighted dynamic aging-specific subnetwork, we only kept those “highly” weighted PPIs in each of the 37 age-specific weighted subnetwork snapshots, which resulted in the most relevant aging-specific dynamic subnetwork (i.e., the collection of the 37 age-specific subnetwork snapshots). See Newaz and Milenković [19] for details. We evaluated dynamic aging-specific subnetworks produced by NetWalk against those produced by HotNet2, and found the former to be better, i.e., to overlap more with aging-related ground truth data [19]. Consequently, in this paper, we only focus on the dynamic aging-specific subnetworks inferred using NetWalk, see Additional file 1: Section S3 for overview of NetWalk. For this NP approach, we considered two versions, referred to as option 1 and option 2 [19]. The key difference between these two versions lies in the different ways in which they process gene expression levels for each gene type (i.e., active or non-active genes) prior to propagating them through the network. Intuitively, the former assigns actual expression levels to only those genes that are significantly expressed (active) at a given age and “non-informative” (i.e., “dummy”) expression levels to all other genes in the entire network, while the latter assigns actual expression levels to both active and non-active genes. We have shown that the dynamic subnetwork inferred using option 2 (which was referred to as NetWalk* in the original publication [19]) overlapped better with the aging-related ground truth data than the dynamic subnetwork inferred using option 1 [19]. Hence, we use the dynamic subnetwork inferred using NetWalk option 2, which we refer to as NetWalk-Dynamic.

NetWalk-Dynamic was constructed by only keeping “highly” weighted PPIs. This means that a weight “threshold” was defined before the “highly” weighted PPIs were selected. In other words, those PPIs with weights less than a predefined threshold were removed. However, identifying such a threshold can be a time-consuming task. Additionally, we believe that keeping all of the PPIs and distinguishing between them using aging-specific weights is likely to capture more aging-specific information than only keeping a set of most “important” PPIs. That is, we aim to construct a dynamic aging-specific subnetwork such that (1) the aging-specific information can be distinguished via the PPIs’ weights and (2) a predefined weight “threshold” is not needed, as follows.

Given Entire and the gene expression data for 37 ages [10], we first use NetWalk (i.e., option 2) to obtain 37 weighted subnetwork snapshots in the same manner as done by Newaz and Milenković [19] (also described above). However, unlike Newaz and Milenković [19], we do not remove any of the PPIs from any of the 37 weighted subnetwork snapshots. Hence, each of the 37 snapshots has all of the PPIs from Entire. Second, to make the weights of PPIs comparable across different snapshots, we normalize the PPI weights over all 37 snapshots between the values of 0.01 and 1. Additionally, we believe that studying changes in the age-specific weights of PPIs across subsequent ages can capture more aging-specific information than just analyzing “raw” age-specific weights of PPIs. Hence, third, given the 37 snapshots with normalized PPI weights, we create 36 “differential” snapshots, as follows. Given two consecutive snapshots, i.e., i and $i + 1$ , we create a differential snapshot, such that, for a given PPI (e.g., Y), we define its differential weight ( $Y_{i, i + 1}^{wt}$ ) as the percentage change from its weight in snapshot i (i.e., $Y_{i}^{wt}$ ) to its weight in snapshot $i + 1$ (i.e., $Y_{i + 1}^{wt}$ ). Formally, we define the weight of the PPI in a differential snapshot ( $i, i + 1$ ) as $Y_{i, i + 1}^{wt} = \frac{[Y_{i + 1}^{wt} - Y_{i}^{wt}] \times 100}{[Y_{i + 1}^{wt} + Y_{i}^{wt}]}$ . Because there are 36 consecutive snapshots, we create 36 differential weighted subnetworks. We use the collection of 36 differential weighted subnetworks as the weighted NetWalk-Dynamic aging-specific subnetwork, which we refer to as wNetWalk-Dynamic. Note that because the differential edge weights are percentages of weight changes between two consecutive snapshots, the maximum and minimum possible differential edge weights are -100 and 100, respectively.

Hence, we consider two versions of NetWalk-based dynamic aging-specific subnetworks, i.e., the unweighted dynamic aging-specific subnetwork resulting from thresholding (NetWalk-Dynamic), and the weighted dynamic aging-specific network without thresholding (wNetWalk-Dynamic). Additionally, we consider a counterpart of wNetWalk-Dynamic, which we obtain as follows. We follow the same procedure that we do for wNetWalk-Dynamic up to the step where we obtain 37 snapshots with normalized edge weights. We consider this weighted dynamic subnetwork as the “non-differential” counterpart of wNetWalk.

We consider this “non-differential” counterpart of wNetWalk because of the following reasons. wNetWalk-Dynamic is a “differential” weighted subnetwork, where the weight of an edge captures a change between two consecutive snapshots. However, the considered existing weighted dynamic features were originally not defined for a differential weighted dynamic subnetwork. To give the existing weighted dynamic features the best-case advantage, we evaluate them not only on wNetWalk-Dynamic but also on the non-differential counterpart of wNetWalk-Dynamic. Because we find that the existing weighted dynamic features perform better with respect to the non-differential counterpart of wNetWalk-Dynamic than with respect to wNetWalk-Dynamic, we only report results of the existing weighted features based on the non-differential counterpart of wNetWalk-Dynamic.

Similar to Induced-Dynamic, we also need static counterparts for the dynamic NP-based subnetworks. We consider two versions of static subnetworks, i.e., static and static*. Because the first version is meant for unweighted dynamic subnetworks, we create the static counterpart only for NetWalk-Dynamic, which we denote as NetWalk-Static. On the other hand, because NP was originally proposed for static context-specific subnetwork, for an NP-based dynamic subnetwork, we use NP to infer another static subnetwork counterpart, i.e., static*, as follows. We first aggregate gene expression data over all ages (so that a node is considered active if it is active at any one or more of the 37 ages) and then integrate the aggregated gene expression-based activity data with Entire using NetWalk. The resulting static aging-specific subnetwork (without thresholding) is the counterpart of wNetWalk-Dynamic, and is denoted as wNetWalk-Static*. Then, to create the static counterpart of NetWalk-Dynamic, given wNetWalk-Static*, we choose the edge weight threshold that results in a static subnetwork matching the number of edges in NetWalk-Static, and we denote such static subnetwork as NetWalk-Static*. This way, the two versions of the static subnetworks, i.e., NetWalk-Static and NetWalk-Static*, are fairly comparable to each other.

In total, we analyze Entire which is unweighted, two unweighted Induced subnetworks, three unweighted NP-based subnetworks, and two weighted NP-based subnetworks. That is, we consider $1 + 2 + 3 + 2 = 8$ (sub)networks for each of the HPRD and BioGRID entire context-unspecific networks (Table 4).

Table 4.

The sizes of the eight considered HPRD-based (sub)networks

Entire	Induced-Dynamic	Induced-Static	NetWalk-Dynamic
8,938; 35,900	4,696; 15,062	6,371; 22,975	3,749; 9,617
NetWalk-Static	NetWalk-Static*	wNetWalk-Dynamic	wNetWalk-Static*
4,973; 14,509	5,065; 14,509	8,938; 35,900	8,938; 35,900

Open in a new tab

The size of a dynamic subnetwork is the average over its 37 (or 36, depending on whether the considered dynamic subnetwork has 37 or 36 snapshots) subnetwork snapshots. The two numbers delimited by “;” are node and edge counts of the corresponding (sub)network. The sizes of the eight considered BioGRID-based (sub)networks are in Additional file 1: Table S4

Six considered aging-related ground truth data

The purpose of using the following data is described in the next section.

GenAge [28] is a highly trusted source of human aging-related genes, most of which are sequence orthologs of experimentally validated aging-related genes in model organisms. In particular, a gene’s homolog in human is selected to be aging-related if the following criteria are met. (1) The gene’s influence in model organism’s aging process is experimentally validated; (2) There are multiple literature suggesting that the human homolog is implicated in multiple pathologies that are related to the aging process; (3) There is phenotypical information of mutations that the human homolog is involved; (4) The gene has effects on gene’s product(s) in mammals. Of all 307 human genes from GenAge, 185 are present in all eight considered HPRD-based (sub)networks. Of all 307 human genes from GenAge, 227 are present in all eight considered BioGRID-based (sub)networks. Henceforth, we denote the set of 185 HRPD-related or the set of 227 BioGRID-related GenAge genes as GenAge.
GTEx [8] contains 863 genes whose expressions decrease with age (i.e., are down-regulated). Of which 347 are present in all eight considered HPRD-based (sub)networks and 691 are present in all eight considered BioGRID-based (sub)networks. Henceforth, we denote the set of 347 of HPRD-related or the set of 691 BioGRID-related GTEx down-regulated genes as GTEx-DAG.
GTEx [8] contains 710 genes whose expressions increase with age (i.e., are up-regulated). Of which 161 are resent in all eight considered HPRD-based (sub)networks and 391 are resent in all eight considered BioGRID-based (sub)networks. Henceforth, we denote the set of 161 of HPRD-related or the set of 691 BioGRID-related GTEx up-regulated genes as GTEx-UAG.
Lu et al. [9] identified 442 genes whose expressions vary with age, of which 265 are present in all eight considered HPRD-based (sub)networks and 306 are present in all eight considered BioGRID-based (sub)networks.
Berchtold et al. [10] identified 8,277 genes whose expressions vary with age, of which 2,312 are present in all eight considered HPRD-based (sub)networks and 2,987 are present in in all eight considered BioGRID-based (sub)networks.
Simpson et al. [12] identified 2,911 genes whose expressions vary across different stages of Alzheimer’s disease, of which 930 are present in all eight considered HPRD-based (sub)networks and 1,186 are present in all eight considered BioGRID-based (sub)networks.

Two considered definitions of (non-)aging-related gene labels

For supervised classification (discussed in the following sections), we need a set of genes labeled as aging-related (positive class) and a set of genes labeled as non-aging-related (negative class). Because GenAge is considered to be one of the most confident aging-related ground truth data sources, our primary label definition is with respect to GenAge. Human genes in GenAge are sequence orthologs of aging-related genes in model organisms. However, human aging is a complex process that has unique biological aspects compared to the aging process in a model species [34], which might be missed by GenAge. So, to hopefully capture the human-unique aspects, we separately consider a secondary label definition with respect to GTEx [8], which was obtained by studying aging directly in human. Specifically, of the entire GTEx, we focus on GTEx-DAG, because we analyze PPI data, and because genes in GTEx-DAG were shown to be relevant for PPIs, unlike genes in GTEx-UAG [8]. The two label definitions, i.e., based on either GenAge or GTEx-DAG, are as follows.

Primary definition with respect to GenAge We label as aging-related those 185 genes from GenAge that exist in all eight considered HPRD-based (sub)networks. Because we want our non-aging-related gene set to be as confident as possible, we label as non-aging-related those genes that exist in all eight considered (sub)networks but not in any of the six considered aging-related ground truth data; there are 1,485 such genes. We intentionally use the same 185-gene positive class and the same 1,485-gene negative class for classification in all eight HPRD-based (sub)networks. This allows us to fairly compare classification accuracy (i.e., the accuracy of aging-related gene prediction) across all (sub)networks. Similarly, we define GenAge-based aging and non-aging labels with respect to BioGRID, which results in 227 aging-related genes and 6,575 non-aging-related genes.

Secondary definition with respect to GTEx-DAG We label as aging-related those 347 genes from GTEx-DAG that exist in all eight (sub)networks. We label as non-aging-related those 1,485 genes that exist in all eight (sub)networks but not in any of the six considered aging-related ground truth data. Note that we use the GenAge definition separately from the GTEx-DAG definition, i.e., we perform classification for each of the two individually. This is because the two are very different (sequence- vs. expression-based) data that possibly capture different aspects of the aging process. Indeed, of the 185 GenAge-based and 347 GTEx-based aging-related genes, only 17 genes are common to the two gene sets.

Cancer-related genes

Vogelstein et al. [47] identified 138 mutation-related cancer driver genes, i.e., genes whose intragenic mutations contribute to cancer. Sondka et al. [48] identified 723 cancer driver genes by manually curating cancer-related genes from COSMIC to determine the presence of somatic mutation patterns in cancer. We combine these two sets of cancer-related genes, and obtain 728 unique cancer-related genes. Of these 728 genes, 366 are present in each of the eight HPRD-based (sub)networks analyzed in this study. Similarly, 524 are present in each of the eight BioGRID-based (sub)networks analyzed in this study. Henceforth, we denote the 366 HPRD-related or the 524 BioGRID-related cancer genes as cancer-related genes. Of the 366 HPRD-related cancer genes, 49 are aging-related and 87 are non-aging-related according to the primary GenAge-based definition. Of the 524 BioGRID-related cancer genes, 57 are aging-related and 176 are non-aging-related according to the primary GenAge-based definition. Also, among the 366 HPRD-related cancer genes, 15 are aging-related and 87 are non-aging-related according to the secondary GTEx-DAG-based definition.

Predictive models: node features and classifiers

For existing node features and classifiers, please see Additional file 1: Section S1.1.

The proposed weighted features

Different types of network neighborhoods of a node typically capture different characteristics of the node’s network position. Hence, given a weighted network, we propose to extract features of a node using four network neighborhood types. That is, for a given node v, we extract its features using (1) edges connecting v to its direct (one-hop) neighbors, (2) edges among one-hop neighbors of v, (3) edges between one-hop neighbors of v and two-hop neighbors of v, and (4) edges among two-hop neighbors of v. We characterize a neighborhood type of a node by measuring the distribution of weights of the corresponding edges. Specifically, given W different edge weights in the dynamic weighted subnetwork, we count the number of times each of the W edge weights occur in the given neighborhood type. In more detail, given a dynamic weighted subnetwork (i.e., wNetWalk-Dynamic) with N snapshots, we create a feature vector of a node using three different approaches, as follows.

Approach 1: for a given neighborhood type, for a given snapshot, for a given node, we compute counts of each of the W edge weights among its neighbors, resulting in a feature vector of size W. Then, we concatenate this feature vector from each of the N snapshots to get a feature vector of length $W \times N$ . This results in four feature vectors, one corresponding to each of the four neighborhood types. Additionally, we define a feature vector that incorporates all neighborhood types, as follows. For each snapshot, we concatenate the size W feature vectors from each of the four neighborhood types to get a feature vector of size $W \times 4$ . Then, we combine these concatenated feature vectors from each of the N snapshots into a single feature vector of size $W \times 4 \times N$ . So, in total, we create five feature vectors for each node.
Approach 2: for a given neighborhood type, for a given node, we measure the Pearson correlation of weight counts between each pair of N snapshots, resulting in a correlation matrix of size $N \times N$ , where we take its upper triangular matrix as a feature vector of size $N \times (N - 1) / 2$ . We do this for each of the four neighborhood types separately, which results in four feature vectors. Additionally, to incorporate all four neighborhood types in a single feature vector, we concatenate the feature vectors corresponding to the four neighborhood types to obtain another feature vector of size $4 \times N \times (N - 1) / 2$ . So, in total, we create five features for each node.
Approach 3: for a given neighborhood type, for a given snapshot, for a given node, we measure the deviation of the distribution of the edge weights in the neighborhood from the distribution of edge weights in the whole weighted dynamic subnetwork. We use the Cramér-von Mises statistic [49] to measure the deviation between the two distributions, resulting in a “distance number” that quantifies the distance between the two distributions. Then, given N snapshots, we combine the N distance numbers to obtain a single feature vector of size N. We do this for each of the four neighborhood types separately, which results in four feature vectors. Additionally, to incorporate all four neighborhood types in a single feature vector, we concatenate the feature vectors corresponding to the four neighborhood types to obtain another feature vector of size $N \times 4$ . So, in total, we create five feature vectors for each node.

The above procedure results in 15 weighted dynamic feature vectors for a node. In addition to the above way of characterizing the distribution of weights, where we count the occurrences of “raw” edge weights, we characterize the distribution of “binned” weights, where we bin the W edge weights into 200 bins, as follows. We first divide the interval between the smallest possible weight (i.e., $- 100$ ) and the largest possible weight (i.e., 100) in a weighted dynamic subnetwork into equal-sized bins of size of 1, and then we assign all of the weights in the weighted dynamic subnetwork to their corresponding bins. Finally, we count how many edge weights belong to a given bin, to obtain the corresponding binned weight distributions. Finally, we use the same three approaches as above, i.e., approach 1, approach 2, and approach 3, to create 15 additional weighted dynamic feature vectors. Hence, in total, we create 30 weighted dynamic feature vectors, 15 using raw (“nobin”) weights, and 15 using binned (“bin”) weights.

Given aging-related ground truth data, we compare the 30 weighted dynamic features with respect to their ability to predict aging-related genes, in order to select the best weighted dynamic feature. Given the HPRD-based wNetWalk-Dynamic, for the primary definition of aging- and non-aging-related genes using GenAge data, the best weighted dynamic feature vector corresponds to the case when we use distributions of raw weights (i.e., approach 1 above) of the second neighborhood type of a node. We call this feature vector “Diff-nobin-2”. Given the HPRD-based wNetWalk-Dynamic, for the secondary definition of aging- and non-aging-related genes using GTEx-DAG data, the best weighted dynamic feature vector corresponds to the case when we use distributions of binned weights (i.e., approach 1 above) of the first neighborhood type of a node. We call this feature vector “Diff-bin-1”. Given the BioGRID-based wNetWalk-Dynamic, for the primary definition of aging- and non-aging-related genes using GenAge data, the best weighted dynamic feature vector corresponds to the case when we use Pearson correlation of raw weights (i.e., approach 2 above) of the second neighborhood type of a node. We call this feature vector “Diff-nobin-cor-2”.

To fairly compare wNetWalk-Dynamic to its static counterpart (i.e., wNetWalk-Static*), we create the corresponding static version of the best weighted dynamic feature, as follows. Given wNetWalk-Static*, we use the same approach to extract a weighted static feature that we use to create the best selected weighted dynamic feature. We identify the corresponding static features as “Static-nobin-2”, “Static-bin-1” and “Static-nobin-2” that correspond to Diff-nobin-2, Diff-bin-1, and Diff-nobin-cor-2, respectively.

Supplementary Information

12859_2021_4439_MOESM1_ESM.pdf^{(2.9MB, pdf)}

Additional file 1. Supporting methodological details of existing node features, evaluation framework, and additional results for HPRD-based (sub)networks and BioGRID-based (sub)networks.

Acknowledgements

Note that a preliminary version of this paper is available on arXiv [50].

Abbreviations

ADEx2011: Alzheimer’s disease expression 2011—a set of genes identified in 2011 as having their expressions change across different stages of Alzheimer’s disease
AUPR: Area under the precision-recall curve
BetwC-wt: Weighted betweenness centrality
BEx2004: Brain expression 2004—a set of genes identified in 2004 as having their expressions in a brain tissue change with age
BEx2008: Brain expression 2008—a set of genes identified in 2008 as having their expressions in a brain tissue change with age
BioGRID: Biological general repository for interaction datasets
CentraMV: Centrality mean and variation
CloseC-wt: Weighted closeness centrality
ClusC-wt: Weighted clustering coefficient
DegC: Degree centrality
DegC-wt: Weighted degree centrality
DGDV: Dynamic graphlet degree vector
ECC: Eccentricity centrality
EigenC-wt: Weighted eigen vector centrality
FDR: False discovery rate
FN: False negative
FP: False positive
GCN: Graph convolutional networks
GDC: Graphlet degree centrality
GenAge: The ageing gene database
GoT: Graphlet orbit transitions
GTEx-DAG: Genotype-tissue expression project—downregulated aging-related genes
GTEx-UAG: Genotype-tissue expression project—upregulated aging-related genes
HPRD: Human protein reference database
HuRI: Human reference interactome
KC: k-coreness centrality
LR: Logistic regression
mBPIs: Top m most important binary protein interaction attributes
NB: Naive Bayes
NP: Network propagation
PCA: Principal component analysis
PPI: Protein–protein interaction
SGDV: Static graphlet degree vector
SVM: Support vector machine
SVM-rbf: Support vector machine with a non-linear radial basis function kernel
TN: True negative
TP: True positive
tSNE: t-distributed stochastic neighbor embedding
UniNet: Union of 11 network centrality measures

Authors' contributions

Conceptualization: QL, KN and TM. Data collection and process: QL and KN. Methodology and experiments: QL and KN. Results visualization and analysis: QL and KN. Writing: QL, KN and TM. Supervision: TM. Project administration: TM. Funding acquisition: TM. All authors have read and approved the manuscript.

Funding

This work including publication costs is funded by the U.S. National Science Foundation (NSF) CAREER award CCF-1452795. The funding body had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials

All data generated and analyzed during this study are available in the cone repository, https://nd.edu/~cone/WeightedSuperAge. The expression data [10] for the inference of our aging-specific subnetworks is GSE11882.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

https://www.cdc.gov/nchs/nvss/vsrr/COVID19.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Qi Li and Khalique Newaz contributed equally to this work

References

1.Campisi J. Aging, cellular senescence, and cancer. Ann Rev Physiol. 2013;75:685–705. doi: 10.1146/annurev-physiol-030212-183653. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Bolignano D, Mattace-Raso F, Sijbrands EJ, Zoccali C. The aging kidney revisited: a systematic review. Ageing Res Rev. 2014;14:65–80. doi: 10.1016/j.arr.2014.02.003. [DOI] [PubMed] [Google Scholar]
3.Fabris F, De Magalhães JP, Freitas AA. A review of supervised machine learning applied to ageing research. Biogerontology. 2017;18(2):171–188. doi: 10.1007/s10522-017-9683-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Faisal FE, Milenković T. Dynamic networks reveal key players in aging. Bioinformatics. 2014;30(12):1721–1729. doi: 10.1093/bioinformatics/btu089. [DOI] [PubMed] [Google Scholar]
5.Faisal FE, Zhao H, Milenković T. Global network alignment in the context of aging. IEEE/ACM Trans Comput Biol Bioinf. 2014;12(1):40–52. doi: 10.1109/TCBB.2014.2326862. [DOI] [PubMed] [Google Scholar]
6.Freitas AA, Vasieva O, de Magalhães JP. A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related. BMC Genom. 2011;12(1):27. doi: 10.1186/1471-2164-12-27. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Kerepesi C, Daróczy B, Sturm Á, Vellai T, Benczúr A. Prediction and characterization of human ageing-related proteins by using machine learning. Sci Rep. 2018;8(1):4094. doi: 10.1038/s41598-018-22240-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Jia K, Cui C, Gao Y, Zhou Y, Cui Q. An analysis of aging-related genes derived from the genotype-tissue expression project (GTEx) Cell Death Discov. 2018;5(1):26. doi: 10.1038/s41420-018-0093-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Lu T, Pan Y, Kao S-Y, Li C, Kohane I, Chan J, Yankner BA. Gene regulation and DNA damage in the ageing human brain. Nature. 2004;429(6994):883. doi: 10.1038/nature02661. [DOI] [PubMed] [Google Scholar]
10.Berchtold NC, Cribbs DH, Coleman PD, Rogers J, Head E, Kim R, Beach T, Miller C, Troncoso J, Trojanowski JQ, et al. Gene expression changes in the course of normal brain aging are sexually dimorphic. Proc Natl Acad Sci. 2008;105(40):15605–15610. doi: 10.1073/pnas.0806883105. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Holtman IR, Raj DD, Miller JA, Schaafsma W, Yin Z, Brouwer N, Wes PD, Möller T, Orre M, Kamphuis W, et al. Induction of a common microglia gene expression signature by aging and neurodegenerative conditions: a co-expression meta-analysis. Acta Neuropathol Commun. 2015;3(1):31. doi: 10.1186/s40478-015-0203-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Simpson JE, Ince PG, Shaw PJ, Heath PR, Raman R, Garwood CJ, Gelsthorpe C, Baxter L, Forster G, Matthews FE, et al. Microarray analysis of the astrocyte transcriptome in the aging brain: relationship to Alzheimer’s pathology and APOE genotype. Neurobiol Aging. 2011;32(10):1795–807. doi: 10.1016/j.neurobiolaging.2011.04.013. [DOI] [PubMed] [Google Scholar]
13.Kirkwood TB. Understanding the odd science of aging. Cell. 2005;120(4):437–447. doi: 10.1016/j.cell.2005.01.027. [DOI] [PubMed] [Google Scholar]
14.Fang Y, Wang X, Michaelis EK, Fang J. Classifying aging genes into DNA repair or non-DNA repair-related categories. In: International conference on intelligent computing. Springer; 2013. p. 20–29.
15.Fabris F, Freitas AA, Tullet J. An extensive empirical comparison of probabilistic hierarchical classifiers in datasets of ageing-related genes. IEEE/ACM Trans Comput Biol Bioinf. 2016;13(6):1045–1058. doi: 10.1109/TCBB.2015.2505288. [DOI] [PubMed] [Google Scholar]
16.Elhesha R, Sarkar A, Boucher C, Kahveci T. Identification of co-evolving temporal networks. BMC Genom. 2019;20(434):1–16. doi: 10.1186/s12864-019-5719-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Li Q, Milenković T. Supervised prediction of aging-related genes from a context-specific protein interaction subnetwork. In: IEEE international conference on bioinformatics and biomedicine (BIBM); 2019. p. 130–7. [DOI] [PubMed]
18.Li Q, Milenković T. Supervised prediction of aging-related genes from a context-specific protein interaction subnetwork. arXiv preprint arXiv:1908.08135 (2021). (Note: This paper is under review, and it is an extended version of the IEEE BIBM 2019 paper with the same name). [DOI] [PubMed]
19.Newaz K, Milenković T. Inference of a dynamic aging-related biological subnetwork via network propagation. IEEE/ACM Trans Comput Biol Bioinform. 2020. [DOI] [PubMed]
20.Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nat Rev Genet. 2017;18(9):551. doi: 10.1038/nrg.2017.38. [DOI] [PubMed] [Google Scholar]
21.Komurov K, White MA, Ram PT. Use of data-biased random walks on graphs for the retrieval of context-specific networks from genomic data. PLoS Comput Biol. 2010;6(8):1–10. doi: 10.1371/journal.pcbi.1000889. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Leiserson MD, Vandin F, Wu H-T, Dobson JR, Eldridge JV, Thomas JL, Papoutsaki A, Kim Y, Niu B, McLellan M, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet. 2015;47(2):106–114. doi: 10.1038/ng.3168. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Prasad T, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database-2009 update. Nucleic Acids Res. 2009;37(suppl–1):767–772. doi: 10.1093/nar/gkn892. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(suppl-1):535–539. doi: 10.1093/nar/gkj109. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Opsahl T, Agneessens F, Skvoretz J. Node centrality in weighted networks: generalizing degree and shortest paths. Soc Netw. 2010;32(3):245–251. [Google Scholar]
26.Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. Proc Natl Acad Sci. 2004;101(11):3747–3752. doi: 10.1073/pnas.0400087101. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Anderson TW. On the distribution of the two-sample Cramer-von Mises criterion. Ann Math Statist. 1962;33(3):1148–1159. [Google Scholar]
28.Tacutu R, Thornton D, Johnson E, Budovsky A, Barardo D, Craig T, Diana E, Lehmann G, Toren D, Wang J, et al. Human ageing genomic resources: new and updated databases. Nucleic Acids Res. 2017;46(D1):1083–1090. doi: 10.1093/nar/gkx1042. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Fiscon G, Conte F, Licursi V, Nasi S, Paci P. Computational identification of specific genes for glioblastoma stem-like cells identity. Sci Rep. 2018;8(1):1–10. doi: 10.1038/s41598-018-26081-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Panebianco V, Pecoraro M, Fiscon G, Paci P, Farina L, Catalano C. Prostate cancer screening research can benefit from network medicine: an emerging awareness. NPJ Syst Biol Appl. 2020;6(1):1–6. doi: 10.1038/s41540-020-0133-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Grimaldi AM, Conte F, Pane K, Fiscon G, Mirabelli P, Baselice S, Giannatiempo R, Messina F, Franzese M, Salvatore M, et al. The new paradigm of network medicine to analyze breast cancer phenotypes. Int J Mol Sci. 2020;21(18):6690. doi: 10.3390/ijms21186690. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Mazza R, Strozzi F, Caprera A, Ajmone-Marsan P, Williams JL. The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species. BMC Genom. 2009;10(1):604. doi: 10.1186/1471-2164-10-604. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 2009;25(9):404–413. doi: 10.1016/j.tig.2009.07.006. [DOI] [PubMed] [Google Scholar]
34.Danchin A. Bacteria in the ageing gut: did the taming of fire promote a long human lifespan? Environ Microbiol. 2018;20(6):1966–1987. doi: 10.1111/1462-2920.14255. [DOI] [PubMed] [Google Scholar]
35.Ding J, Kong W, Mou X, Wang S. Construction of transcriptional regulatory network of Alzheimer’s disease based on panda algorithm. Interdiscip Sci Comput Life Sci. 2019;11(2):226–36. doi: 10.1007/s12539-018-0297-0. [DOI] [PubMed] [Google Scholar]
36.Goedert M, Spillantini MG. A century of Alzheimer’s disease. Science. 2006;314(5800):777–81. doi: 10.1126/science.1132814. [DOI] [PubMed] [Google Scholar]
37.Iwasaki H, Nakano K, Shinkai K, Kunisawa Y, Hirahashi M, Oda Y, Onishi H, Katano M. Hedgehog Gli3 activator signal augments tumorigenicity of colorectal cancer via upregulation of adherence-related genes. Cancer Sci. 2013;104(3):328–336. doi: 10.1111/cas.12073. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Holt PR, Kozuch P, Mewar S. Colon cancer and the elderly: from screening to treatment in management of gi disease in the elderly. Best Pract Res Clin Gastroenterol. 2009;23(6):889–907. doi: 10.1016/j.bpg.2009.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Zheng Y, Liu X, Le W, Xie L, Li H, Wen W, Wang S, Ma S, Huang Z, Ye J, et al. A human circulating immune cell landscape in aging and COVID-19. Protein Cell. 2020;11(10):740–770. doi: 10.1007/s13238-020-00762-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Beck B, Driessens G, Goossens S, Youssef KK, Kuchnio A, Caauwe A, Sotiropoulou PA, Loges S, Lapouge G, Candi A, et al. A vascular niche and a VEGF-Nrp1 loop regulate the initiation and stemness of skin tumours. Nature. 2011;478(7369):399–403. doi: 10.1038/nature10525. [DOI] [PubMed] [Google Scholar]
41.Garcovich S, Colloca G, Sollena P, Andrea B, Balducci L, Cho WC, Bernabei R, Peris K. Skin cancer epidemics in the elderly as an emerging issue in geriatric oncology. Aging Dis. 2017;8(5):643. doi: 10.14336/AD.2017.0503. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Sarogni P, Palumbo O, Servadio A, Astigiano S, D’Alessio B, Gatti V, Cukrov D, Baldari S, Pallotta MM, Aretini P, et al. Overexpression of the cohesin-core subunit SMC1A contributes to colorectal cancer development. J Exp Clin Cancer Res. 2019;38(1):1–16. [DOI] [PMC free article] [PubMed]
43.Srinivasan K, Friedman BA, Etxeberria A, Huntley MA, van Der Brug MP, Foreman O, Paw JS, Modrusan Z, Beach TG, Serrano GE, et al. Alzheimer’s patient microglia exhibit enhanced aging and unique transcriptional activation. Cell Rep. 2020;31(13):107843. doi: 10.1016/j.celrep.2020.107843. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.GTEx Consortium, et al. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60. [DOI] [PMC free article] [PubMed]
45.Pareja A, Domeniconi G, Chen J, Ma T, Suzumura T, Kanezashi H, Kaler T, Leisersen CE. Evolvegcn: evolving graph convolutional networks for dynamic graphs. In: Thirty-third AAAI conference on artificial intelligence; 2020.
46.Manessi F, Rozza A, Manzo M. Dynamic graph convolutional networks. Pattern Recogn. 2020;97:107000. [Google Scholar]
47.Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–1558. doi: 10.1126/science.1235122. [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Sondka Z, Bamford S, Cole CG, Ward SA, Dunham I, Forbes SA. The COSMIC cancer gene census: describing genetic dysfunction across all human cancers. Nat Rev Cancer. 2018;1:696–705. doi: 10.1038/s41568-018-0060-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Anderson TW. On the distribution of the two-sample Cramer-von Mises criterion. Ann Math Stat. 1962;1148–1159.
50.Li Q, Milenković T. Improving supervised prediction of aging-related genes via dynamic network analysis. arXiv preprint arXiv:2005.03659 (v2) (2020) [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

12859_2021_4439_MOESM1_ESM.pdf^{(2.9MB, pdf)}

Additional file 1. Supporting methodological details of existing node features, evaluation framework, and additional results for HPRD-based (sub)networks and BioGRID-based (sub)networks.

Data Availability Statement

[CR1] 1.Campisi J. Aging, cellular senescence, and cancer. Ann Rev Physiol. 2013;75:685–705. doi: 10.1146/annurev-physiol-030212-183653. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Bolignano D, Mattace-Raso F, Sijbrands EJ, Zoccali C. The aging kidney revisited: a systematic review. Ageing Res Rev. 2014;14:65–80. doi: 10.1016/j.arr.2014.02.003. [DOI] [PubMed] [Google Scholar]

[CR3] 3.Fabris F, De Magalhães JP, Freitas AA. A review of supervised machine learning applied to ageing research. Biogerontology. 2017;18(2):171–188. doi: 10.1007/s10522-017-9683-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Faisal FE, Milenković T. Dynamic networks reveal key players in aging. Bioinformatics. 2014;30(12):1721–1729. doi: 10.1093/bioinformatics/btu089. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Faisal FE, Zhao H, Milenković T. Global network alignment in the context of aging. IEEE/ACM Trans Comput Biol Bioinf. 2014;12(1):40–52. doi: 10.1109/TCBB.2014.2326862. [DOI] [PubMed] [Google Scholar]

[CR6] 6.Freitas AA, Vasieva O, de Magalhães JP. A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related. BMC Genom. 2011;12(1):27. doi: 10.1186/1471-2164-12-27. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Kerepesi C, Daróczy B, Sturm Á, Vellai T, Benczúr A. Prediction and characterization of human ageing-related proteins by using machine learning. Sci Rep. 2018;8(1):4094. doi: 10.1038/s41598-018-22240-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Jia K, Cui C, Gao Y, Zhou Y, Cui Q. An analysis of aging-related genes derived from the genotype-tissue expression project (GTEx) Cell Death Discov. 2018;5(1):26. doi: 10.1038/s41420-018-0093-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Lu T, Pan Y, Kao S-Y, Li C, Kohane I, Chan J, Yankner BA. Gene regulation and DNA damage in the ageing human brain. Nature. 2004;429(6994):883. doi: 10.1038/nature02661. [DOI] [PubMed] [Google Scholar]

[CR10] 10.Berchtold NC, Cribbs DH, Coleman PD, Rogers J, Head E, Kim R, Beach T, Miller C, Troncoso J, Trojanowski JQ, et al. Gene expression changes in the course of normal brain aging are sexually dimorphic. Proc Natl Acad Sci. 2008;105(40):15605–15610. doi: 10.1073/pnas.0806883105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Holtman IR, Raj DD, Miller JA, Schaafsma W, Yin Z, Brouwer N, Wes PD, Möller T, Orre M, Kamphuis W, et al. Induction of a common microglia gene expression signature by aging and neurodegenerative conditions: a co-expression meta-analysis. Acta Neuropathol Commun. 2015;3(1):31. doi: 10.1186/s40478-015-0203-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Simpson JE, Ince PG, Shaw PJ, Heath PR, Raman R, Garwood CJ, Gelsthorpe C, Baxter L, Forster G, Matthews FE, et al. Microarray analysis of the astrocyte transcriptome in the aging brain: relationship to Alzheimer’s pathology and APOE genotype. Neurobiol Aging. 2011;32(10):1795–807. doi: 10.1016/j.neurobiolaging.2011.04.013. [DOI] [PubMed] [Google Scholar]

[CR13] 13.Kirkwood TB. Understanding the odd science of aging. Cell. 2005;120(4):437–447. doi: 10.1016/j.cell.2005.01.027. [DOI] [PubMed] [Google Scholar]

[CR14] 14.Fang Y, Wang X, Michaelis EK, Fang J. Classifying aging genes into DNA repair or non-DNA repair-related categories. In: International conference on intelligent computing. Springer; 2013. p. 20–29.

[CR15] 15.Fabris F, Freitas AA, Tullet J. An extensive empirical comparison of probabilistic hierarchical classifiers in datasets of ageing-related genes. IEEE/ACM Trans Comput Biol Bioinf. 2016;13(6):1045–1058. doi: 10.1109/TCBB.2015.2505288. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Elhesha R, Sarkar A, Boucher C, Kahveci T. Identification of co-evolving temporal networks. BMC Genom. 2019;20(434):1–16. doi: 10.1186/s12864-019-5719-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Li Q, Milenković T. Supervised prediction of aging-related genes from a context-specific protein interaction subnetwork. In: IEEE international conference on bioinformatics and biomedicine (BIBM); 2019. p. 130–7. [DOI] [PubMed]

[CR18] 18.Li Q, Milenković T. Supervised prediction of aging-related genes from a context-specific protein interaction subnetwork. arXiv preprint arXiv:1908.08135 (2021). (Note: This paper is under review, and it is an extended version of the IEEE BIBM 2019 paper with the same name). [DOI] [PubMed]

[CR19] 19.Newaz K, Milenković T. Inference of a dynamic aging-related biological subnetwork via network propagation. IEEE/ACM Trans Comput Biol Bioinform. 2020. [DOI] [PubMed]

[CR20] 20.Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nat Rev Genet. 2017;18(9):551. doi: 10.1038/nrg.2017.38. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Komurov K, White MA, Ram PT. Use of data-biased random walks on graphs for the retrieval of context-specific networks from genomic data. PLoS Comput Biol. 2010;6(8):1–10. doi: 10.1371/journal.pcbi.1000889. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Leiserson MD, Vandin F, Wu H-T, Dobson JR, Eldridge JV, Thomas JL, Papoutsaki A, Kim Y, Niu B, McLellan M, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet. 2015;47(2):106–114. doi: 10.1038/ng.3168. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Prasad T, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database-2009 update. Nucleic Acids Res. 2009;37(suppl–1):767–772. doi: 10.1093/nar/gkn892. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(suppl-1):535–539. doi: 10.1093/nar/gkj109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Opsahl T, Agneessens F, Skvoretz J. Node centrality in weighted networks: generalizing degree and shortest paths. Soc Netw. 2010;32(3):245–251. [Google Scholar]

[CR26] 26.Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. Proc Natl Acad Sci. 2004;101(11):3747–3752. doi: 10.1073/pnas.0400087101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR27] 27.Anderson TW. On the distribution of the two-sample Cramer-von Mises criterion. Ann Math Statist. 1962;33(3):1148–1159. [Google Scholar]

[CR28] 28.Tacutu R, Thornton D, Johnson E, Budovsky A, Barardo D, Craig T, Diana E, Lehmann G, Toren D, Wang J, et al. Human ageing genomic resources: new and updated databases. Nucleic Acids Res. 2017;46(D1):1083–1090. doi: 10.1093/nar/gkx1042. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Fiscon G, Conte F, Licursi V, Nasi S, Paci P. Computational identification of specific genes for glioblastoma stem-like cells identity. Sci Rep. 2018;8(1):1–10. doi: 10.1038/s41598-018-26081-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Panebianco V, Pecoraro M, Fiscon G, Paci P, Farina L, Catalano C. Prostate cancer screening research can benefit from network medicine: an emerging awareness. NPJ Syst Biol Appl. 2020;6(1):1–6. doi: 10.1038/s41540-020-0133-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Grimaldi AM, Conte F, Pane K, Fiscon G, Mirabelli P, Baselice S, Giannatiempo R, Messina F, Franzese M, Salvatore M, et al. The new paradigm of network medicine to analyze breast cancer phenotypes. Int J Mol Sci. 2020;21(18):6690. doi: 10.3390/ijms21186690. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Mazza R, Strozzi F, Caprera A, Ajmone-Marsan P, Williams JL. The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species. BMC Genom. 2009;10(1):604. doi: 10.1186/1471-2164-10-604. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR33] 33.Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 2009;25(9):404–413. doi: 10.1016/j.tig.2009.07.006. [DOI] [PubMed] [Google Scholar]

[CR34] 34.Danchin A. Bacteria in the ageing gut: did the taming of fire promote a long human lifespan? Environ Microbiol. 2018;20(6):1966–1987. doi: 10.1111/1462-2920.14255. [DOI] [PubMed] [Google Scholar]

[CR35] 35.Ding J, Kong W, Mou X, Wang S. Construction of transcriptional regulatory network of Alzheimer’s disease based on panda algorithm. Interdiscip Sci Comput Life Sci. 2019;11(2):226–36. doi: 10.1007/s12539-018-0297-0. [DOI] [PubMed] [Google Scholar]

[CR36] 36.Goedert M, Spillantini MG. A century of Alzheimer’s disease. Science. 2006;314(5800):777–81. doi: 10.1126/science.1132814. [DOI] [PubMed] [Google Scholar]

[CR37] 37.Iwasaki H, Nakano K, Shinkai K, Kunisawa Y, Hirahashi M, Oda Y, Onishi H, Katano M. Hedgehog Gli3 activator signal augments tumorigenicity of colorectal cancer via upregulation of adherence-related genes. Cancer Sci. 2013;104(3):328–336. doi: 10.1111/cas.12073. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Holt PR, Kozuch P, Mewar S. Colon cancer and the elderly: from screening to treatment in management of gi disease in the elderly. Best Pract Res Clin Gastroenterol. 2009;23(6):889–907. doi: 10.1016/j.bpg.2009.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR39] 39.Zheng Y, Liu X, Le W, Xie L, Li H, Wen W, Wang S, Ma S, Huang Z, Ye J, et al. A human circulating immune cell landscape in aging and COVID-19. Protein Cell. 2020;11(10):740–770. doi: 10.1007/s13238-020-00762-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Beck B, Driessens G, Goossens S, Youssef KK, Kuchnio A, Caauwe A, Sotiropoulou PA, Loges S, Lapouge G, Candi A, et al. A vascular niche and a VEGF-Nrp1 loop regulate the initiation and stemness of skin tumours. Nature. 2011;478(7369):399–403. doi: 10.1038/nature10525. [DOI] [PubMed] [Google Scholar]

[CR41] 41.Garcovich S, Colloca G, Sollena P, Andrea B, Balducci L, Cho WC, Bernabei R, Peris K. Skin cancer epidemics in the elderly as an emerging issue in geriatric oncology. Aging Dis. 2017;8(5):643. doi: 10.14336/AD.2017.0503. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.Sarogni P, Palumbo O, Servadio A, Astigiano S, D’Alessio B, Gatti V, Cukrov D, Baldari S, Pallotta MM, Aretini P, et al. Overexpression of the cohesin-core subunit SMC1A contributes to colorectal cancer development. J Exp Clin Cancer Res. 2019;38(1):1–16. [DOI] [PMC free article] [PubMed]

[CR43] 43.Srinivasan K, Friedman BA, Etxeberria A, Huntley MA, van Der Brug MP, Foreman O, Paw JS, Modrusan Z, Beach TG, Serrano GE, et al. Alzheimer’s patient microglia exhibit enhanced aging and unique transcriptional activation. Cell Rep. 2020;31(13):107843. doi: 10.1016/j.celrep.2020.107843. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.GTEx Consortium, et al. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60. [DOI] [PMC free article] [PubMed]

[CR45] 45.Pareja A, Domeniconi G, Chen J, Ma T, Suzumura T, Kanezashi H, Kaler T, Leisersen CE. Evolvegcn: evolving graph convolutional networks for dynamic graphs. In: Thirty-third AAAI conference on artificial intelligence; 2020.

[CR46] 46.Manessi F, Rozza A, Manzo M. Dynamic graph convolutional networks. Pattern Recogn. 2020;97:107000. [Google Scholar]

[CR47] 47.Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–1558. doi: 10.1126/science.1235122. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR48] 48.Sondka Z, Bamford S, Cole CG, Ward SA, Dunham I, Forbes SA. The COSMIC cancer gene census: describing genetic dysfunction across all human cancers. Nat Rev Cancer. 2018;1:696–705. doi: 10.1038/s41568-018-0060-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR49] 49.Anderson TW. On the distribution of the two-sample Cramer-von Mises criterion. Ann Math Stat. 1962;1148–1159.

[CR50] 50.Li Q, Milenković T. Improving supervised prediction of aging-related genes via dynamic network analysis. arXiv preprint arXiv:2005.03659 (v2) (2020) [DOI] [PMC free article] [PubMed]

PERMALINK

Improved supervised prediction of aging-related genes via weighted dynamic network analysis

Qi Li

Khalique Newaz

Tijana Milenković

Abstract

Background

Results

Conclusions

Supplementary Information

Introduction

Motivation and related work

Our study and contributions

Table 1.

Fig. 1.

Results

The best predictive model for each (sub)network

Table 2.

Table 3.

Using NP-based versus induced aging-specific subnetworks

Fig. 2.

Fig. 4.

Fig. 3.

Using dynamic versus static NP-based aging-specific subnetworks

Using weighted versus unweighted dynamic NP-based aging-specific subnetworks

Using our proposed versus existing weighted dynamic features

Literature validation of novel aging-related gene predictions resulting from the best subnetwork, i.e., wNetWalk-Dynamic

Results when using our secondary aging-related gene label definition with respect to GTEx-DAG

Results for BioGRID-based (sub)networks

Discussion

Conclusions

Methods

Data

The two considered entire context-unspecific PPI networks

Seven considered aging-specific subnetworks

Table 4.

Six considered aging-related ground truth data

Two considered definitions of (non-)aging-related gene labels

Cancer-related genes

Predictive models: node features and classifiers

The proposed weighted features

Supplementary Information

Acknowledgements

Abbreviations

Authors' contributions

Funding

Availability of data and materials

Declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Footnotes

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases