Abstract
Motivated by the lack of adequate tools to perform pathway enrichment analysis, this work presents an approach specifically tailored to SomaScan data. Starting from annotated gene sets, we developed a greedy, top-down procedure to iteratively identify strongly intracorrelated SOMAmer modules, termed “SomaModules”, based on 11K SomaScan data. We generated two repositories based on the latest MSigDB and MitoCarta releases, containing more than 40,000 SOMAmer-based gene sets combined. These repositories can be utilized by any unstructured pathway enrichment analysis tool. We validated our results with two case examples: (i) Alzheimer’s disease specific pathways in a 7K SomaScan case–control study, and (ii) mitochondrial pathways using 11K SomaScan data linked to physical performance outcomes. Using gene set enrichment analysis (GSEA), we found that, in both examples, SomaModules had significantly higher enrichment than the original gene set counterparts. These findings were robust and not significantly affected by the choice of enrichment metric or the Kolmogorov enrichment statistic used in the GSEA procedure. We provide users with access to all code, documentation and data needed to reproduce our current repositories, which will also enable them to leverage our framework to analyze SomaModules derived from other sources, including custom, user-generated gene sets.
Keywords: SomaScan, SOMAmers, functional enrichment, pathway analysis, GSEA, SomaModules
Introduction
SomaScan is a highly multiplexed, aptamer-based assay capable of simultaneously measuring thousands of human proteins broadly ranging from femto- to micro-molar concentrations. This technology relies on protein-capture SOMAmer (slow off-rate modified aptamer) reagents, designed for high affinity, slow off-rate, and high specificity to target proteins. These targets extensively cover major molecular functions including receptors, kinases, growth factors, and hormones, and span a diverse collection of secreted, intracellular, and extracellular proteins or domains. In recent years, SomaScan has increasingly been adopted as a powerful tool to discover biomarkers across a wide range of diseases and conditions, as well as to elucidate their biological underpinnings in proteomics and multiomics studies.
Over the past decade, the coverage of the assay has grown steadily in an approximately linear fashion, from about 1000 to 11,000 proteins in the most recent version of the assay. As with other omics technologies capable of measuring thousands of molecular features, an essential part of the downstream analysis is pathway enrichment analysis, an approach that casts gene-level measurements in a broader biological context, thus allowing researchers to interpret their data in terms of a great variety of gene sets that may represent different functions, processes, components, or associations with disease. Since the pioneering efforts proposing over-representation analysis and gene set enrichment analysis (GSEA), more than a hundred different algorithms and variants have been developed to implement pathway enrichment analysis. Regardless of the method, however, pathway enrichment analysis relies on gene set repositories that contain the pathways or gene sets to be used as reference. Arguably the largest of them is the Molecular Signatures Database (MSigDB), which contains nearly 35,000 human gene sets compiled from KEGG, Reactome, Gene Ontology, the Pathway Interaction Database, WikiPathways, and many other publicly available resources. Most of these gene sets, however, have been derived from microarrays and other transcriptomics data. For lack of other alternatives, these gene sets are typically also used in the analysis of proteomics data, a practice that is largely inadequate for a variety of reasons.
On the one hand, regulatory processes occurring after mRNA is made, for example mRNA rate of decay, accumulation of untranslated mRNA in stress granules or efficiency of translation, are known to play a substantial role in controlling steady-state protein abundance. For instance, in the context of the integrated stress response, the presence of AU-rich elements in mRNA contributes significantly to processes that regulate gene expression, RNA decay, and rates of translation, making gene expression a weak predictor of downstream protein abundance. Spatial and temporal variations of mRNAs, as well as the local availability of resources for protein biosynthesis, are also known to strongly influence the relationship between protein levels and those of their coding transcripts. Despite the common practice of using mRNA expression as a proxy for protein abundance, transcript levels are not sufficient to predict protein levels in many scenarios, as manifested by transcript and protein abundances showing weak or even negative correlations for many genes. On the other hand, different omics technologies are known to introduce a variety of potential sources of sampling bias that may confound functional enrichment analysis. Sampling biases may be introduced, for instance, by different experimental platforms having biased representations of the underlying gene ontology structure or by detecting gene products with uneven reliability. For these reasons, we argue that pathway enrichment analysis should be performed against gene sets derived from proteomics data and, more specifically, tailored to each proteomics platform. Indeed, different proteomics platforms show substantial disagreement in protein quantification; studies comparing SomaScan, Olink, mass spectrometry, and immunoassays showed that abundances quantified by different methods show weak or nonsignificant correlations for a sizable proportion of target proteins.
Although SomaScan has consistently shown remarkably low variability (with a median coefficient of variation of about 5%) and high sensitivity (with 98% of the measured human proteins appearing >2-fold brighter in human plasma samples compared to buffer wells), SOMAmer reagents bind cognate proteins in their native folded conformations and may, therefore, be sensitive to proteoform complexity, which in turn may explain the observed discrepancies between SomaScan and other proteomic platforms. In addition, preanalytical variation effects such as sample storage, repeated freeze–thaw cycles and in vitro hemolysis may have a different impact on protein abundances measured by different assays.
In summary, upon noting the lack of SomaScan-specific resources for pathway enrichment analysis, our work represents, to the best of our knowledge, the first endeavor aiming to fill this gap. By using data generated with the newest 11K SomaScan assay from 2542 human plasma samples, as well as publicly available data from the previous 7K assay version, we built and validated an extension of pathway collections that takes into account strongly intracorrelated SOMAmer modules, termed “SomaModules”, derived from existing gene sets. Here, we show that, in case examples chosen to explore and validate this procedure, SomaModules indeed appear to outperform standard, transcriptomic-derived gene sets to characterize pathway enrichment. We provide two public repositories based on the latest MSigDB and MitoCarta releases, containing more than 40,000 SOMAmer-based gene sets combined. This flexible framework allows users to integrate SomaModules with GSEA and other pathway enrichment analysis tools such as IPA, g:Profiler and others.
Experimental Section
Cohort Information
Blood samples were collected from 2542 visits made by 666 participants in the Baltimore Longitudinal Study of Aging (BLSA), a study of normative human aging established in 1958 and managed by the National Institute on Aging (NIA), National Institutes of Health (NIH). The study protocol was conducted in accordance with Declaration of Helsinki principles and was reviewed and approved by the Institutional Review Board of the NIH’s Intramural Research Program. Written informed consent was obtained from all participants. Following sample preprocessing using standard protocols, EDTA plasma vials were stored at −80 °C. In the BLSA, visits are scheduled at regular intervals (every four years for participants younger than 60 years old; every two years for those between ages 60 and 80, and yearly for those older than 80). The number of visits per participant included in this study ranged from 1 to 15 (median = 3, mean = 3.8) and spanned the period 1993–2024, although most of them (97.8%) were from the period 2005–2024. In this study, we analyzed a total of 15 metrics obtained from short and expanded physical performance batteries, which included repeated 6 m walks at usual and fast speeds, narrow (20 cm wide) walks, chair stands, and semi- and full-tandem standing balance tests.
The SomaScan Assay
Plasma proteomic profiles were characterized using the 11K (v5.0) SomaScan assay (SomaLogic, Inc.; Boulder, CO), which consists of 10,776 SOMAmers that target annotated human proteins. Relative protein abundances are expressed in relative fluorescence units (RFU), fully normalized following a standard procedure to account for nuisance effects in hybridization, sample volume and plate biases, and log10-transformed after normalization. Further details on the assay and the data processing pipeline have been described elsewhere. The SomaScan assay was run during 2024 at NIA’s SomaParadise facility (Baltimore, MD), which is certified by the SomaLogic Authorized Site program. The median sample storage time from collection to measurement was 10 years. In a separate study, we showed negligible degradation and storage effects for most SOMAmers for samples stored up to 30 years.
Extraction of SomaModules
SOMAmers are uniquely described by their sequence ID (“SeqId”). Most, but not all, SOMAmers are uniquely mapped to target proteins. Using Entrez gene symbols from SomaScan annotations as target identifiers, a total of 10,776 human protein SOMAmers in the 11K (v5.0) SomaScan assay are mapped to 9608 unique Entrez gene symbols, spanning 10,923 unique SeqId-Entrez gene symbol pairs. Gene sets in MSigDB are organized in collections that can be downloaded from the repository’s website. In this work, we used the latest version available (v2024.1.Hs), released in August 2024. Because gene symbols from SomaScan annotations may not match those used to annotate MSigDB gene sets, we used gene aliases from the HUGO Gene Nomenclature Committee to maximize the pairing between annotations. As a result of this procedure, we generated a SOMAmer-based translation of MSigDB collections, in which gene sets are expressed in terms of unique SOMAmer identifiers (SeqIds).
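The translation step described above can be sketched as follows. This is a minimal Python illustration, not the code distributed with the repository: the function name, the dictionary-based inputs, the placeholder gene symbols, and the SeqId "0000-1" are hypothetical, and the alias table stands in for the HUGO Gene Nomenclature Committee aliases. Only the two STAT3 SeqIds are taken from the text.

```python
from typing import Dict, List, Set


def translate_gene_sets(
    gene_sets: Dict[str, List[str]],
    symbol_to_seqids: Dict[str, Set[str]],
    alias_to_symbol: Dict[str, str],
) -> Dict[str, List[str]]:
    """Express each gene set as a sorted list of unique SeqIds."""
    translated = {}
    for name, genes in gene_sets.items():
        seqids: Set[str] = set()
        for gene in genes:
            # Use the symbol as annotated if the assay knows it;
            # otherwise fall back to an alias lookup (assumed table).
            if gene in symbol_to_seqids:
                symbol = gene
            else:
                symbol = alias_to_symbol.get(gene)
            if symbol is not None:
                seqids |= symbol_to_seqids.get(symbol, set())
        translated[name] = sorted(seqids)
    return translated


# Toy inputs: one gene targeted by two SOMAmers, one gene reachable
# only through an alias, and one gene not measured by the assay.
symbol_to_seqids = {"STAT3": {"10346-5", "10354-57"}, "GENE_B": {"0000-1"}}
alias_to_symbol = {"OLD_B": "GENE_B"}
gene_sets = {"TOY_SET": ["STAT3", "OLD_B", "UNMEASURED"]}
translated = translate_gene_sets(gene_sets, symbol_to_seqids, alias_to_symbol)
```

Genes without a matching SOMAmer are silently dropped, which is why a translated gene set can be smaller than its original version.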
Using the SOMAmer-based translation of MSigDB gene sets as our starting point, we developed a procedure to expand the repository by adding so-called “SomaModules”, which are subsets derived from the original MSigDB gene sets that are tightly coexpressed in SOMAmer space. In order to illustrate the procedure, let us consider the “Cellular Senescence” gene set from the Reactome collection. Whereas this pathway comprises 197 genes, its SOMAmer-based translation spans 138 SeqIds based on the newest 11K SomaScan assay. Using plasma proteomic profiles from 666 BLSA participants, we calculated Pearson’s correlations of all SOMAmer pairs in this pathway (Figure 1a). In cases where multiple visits per participant were available, only the first (earliest) one was used at this stage to prevent spuriously biasing the correlation estimates. Once the full correlation matrix was obtained, we generated a hierarchical clustering dendrogram, which was sequentially cut at different heights to produce from K = 2 to K = 5 clusters; the total number of clusters considered was thus 2 + 3 + 4 + 5 = 14. For each cluster, we calculated its size, n, defined as the number of SOMAmers in the cluster, and its mean correlation, r, defined as the average correlation across all SOMAmer pairs in the cluster. Applying size and intracluster correlation thresholds, we kept only clusters that satisfied n ≥ 10 and r ≥ 0.5. Then, we ranked clusters in decreasing order of size (first criterion) and mean correlation (second criterion). The top cluster was chosen as the first SomaModule. After removing from the correlation matrix all SOMAmers that belonged to that first SomaModule, the process was iterated to (potentially) discover additional, nonoverlapping SomaModules.
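The greedy loop described in this paragraph can be sketched in Python as follows. The text does not specify the dissimilarity measure or the linkage method, so the choice of 1 − r with average linkage is an assumption, as are the function names and the synthetic data; the released code may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def mean_pairwise(corr):
    """Mean correlation over all distinct pairs (upper triangle, no diagonal)."""
    upper = np.triu_indices(corr.shape[0], k=1)
    return float(corr[upper].mean())


def extract_modules(corr, ids, min_size=10, min_r=0.5, k_values=(2, 3, 4, 5)):
    """Greedy, top-down extraction of nonoverlapping modules (min_size >= 2)."""
    corr = np.asarray(corr, dtype=float)
    remaining = np.arange(len(ids))
    modules = []
    while len(remaining) >= min_size:
        sub = corr[np.ix_(remaining, remaining)]
        dist = 1.0 - sub  # assumed dissimilarity: 1 - Pearson r
        np.fill_diagonal(dist, 0.0)
        # Assumed linkage method: average linkage.
        tree = linkage(squareform(dist, checks=False), method="average")
        candidates = []
        for k in k_values:  # cut the dendrogram into (at most) K clusters
            labels = fcluster(tree, t=k, criterion="maxclust")
            for label in np.unique(labels):
                members = np.flatnonzero(labels == label)
                if len(members) < min_size:
                    continue
                r = mean_pairwise(sub[np.ix_(members, members)])
                if r >= min_r:
                    candidates.append((len(members), r, members))
        if not candidates:
            break
        # Rank by size first, then by mean correlation; keep the top cluster.
        _, _, best = max(candidates, key=lambda c: (c[0], c[1]))
        modules.append([ids[i] for i in remaining[best]])
        remaining = np.delete(remaining, best)
    return modules


# Synthetic demo: two independent latent factors drive blocks of 12 and 10
# features; within-block correlations are about 0.8, between-block about 0.
rng = np.random.default_rng(0)
n_samples = 200
latent_a = rng.normal(size=(n_samples, 1))
latent_b = rng.normal(size=(n_samples, 1))
block_a = latent_a + 0.5 * rng.normal(size=(n_samples, 12))
block_b = latent_b + 0.5 * rng.normal(size=(n_samples, 10))
ids = [f"A{i}" for i in range(12)] + [f"B{i}" for i in range(10)]
corr = np.corrcoef(np.hstack([block_a, block_b]), rowvar=False)
modules = extract_modules(corr, ids)
```

Because every cut uses K ≥ 2, a leftover block can only be recovered as a module if it separates from other remaining features; in the demo, the 10-feature block left after the first iteration is split into smaller pieces and therefore discarded.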
Each SomaModule was identified by a collection prefix (e.g., “R” for Reactome), followed by an incremental index that indicates the pathway of origin (e.g., “R.144” for the Cellular Senescence pathway we just described) and an incremental integer suffix that indicates whether we refer to the (parent) original gene set (e.g., “R.144.0”) or a (child) SomaModule (e.g., “R.144.1” for the first SomaModule and “R.144.2” for the second one). Figure 1b shows the SOMAmer network obtained from pairwise correlations; nodes represent the 138 SOMAmers in this gene set and links represent the 5280 significant pairwise correlations (p-value <0.05), where green and red indicate positive and negative correlations, respectively, and where link thickness is proportional to the correlation magnitude. The two SomaModules R.144.1 (of size 45) and R.144.2 (of size 10) are indicated by different node colors. It should be noted that the fundamental building blocks of SomaModules are SOMAmers, uniquely identified via their SeqIds; however, by mapping them to gene identifiers, some node labels are repeated. For instance, SeqId “10346-5” and “10354-57” are two SOMAmers in SomaModule R.144.1 both mapped to STAT3. In this example, we observe that SomaModule R.144.1 includes STAT3, cyclin-dependent kinases (CDK), mitogen-activated protein kinases (MAPK), ribosomal proteins (RPS), and ubiquitins (UB); in turn, SomaModule R.144.2 is composed of various histone complex proteins. Table 1 shows the list of all SomaModule collections derived from MSigDB currently available and their size (i.e., the number of gene sets contained in each collection). The latter, naturally, depends on the size and intracluster correlation thresholds used in the SomaModule extraction procedure. As pointed out by Ramanan et al., a commonly chosen minimum threshold for pathway size is 10 genes, which appears suitable to prevent false positive associations due to large single-gene or single-SNP effects.
The intracluster correlation threshold, in turn, dictates the trade-off between false positives (favored by lower thresholds) and false negatives (favored by higher thresholds). In this work, by imposing the r ≥ 0.5 constraint, we aimed at generating robust SomaModules. Despite this somewhat conservative choice, we obtained a relatively large number of SomaModules. Indeed, across all MSigDB collections, we generated 14,363 (child) SomaModules out of 27,342 (parent) gene sets that passed the n ≥ 10 size constraint after being translated in terms of SOMAmers; in other words, we expanded the original gene sets by slightly more than 50% (14,363/27,342 ≈ 52.5%). As with every other aspect of pathway analysis, however, threshold criteria should be dictated by the context and characteristics of the study, and may differ from the choices outlined in this work. For reference, Table S1 shows the size of MSigDB collections derived from different threshold parameter choices.
Figure 1. Extraction of SomaModules. (a) Pearson’s pairwise correlation matrix of SOMAmers mapped to the Cellular Senescence gene set from Reactome. Data were generated using the 11K SomaScan assay on plasma samples obtained from 666 BLSA participants. (b) SOMAmer correlation network, where green and red links represent significant positive and negative correlations, respectively. The original gene set (R.144.0) comprises 138 SOMAmers; our procedure identified two SomaModules, R.144.1 and R.144.2, consisting of 45 and 10 SOMAmers, respectively.
Table 1. Collections from MSigDB in the SomaModule Repository.
ID | description | MSigDB ID | size (number of gene sets) |
---|---|---|---|
H | hallmarks | h.all | 86 |
POS | positional | c1.all | 304 |
CGP | chem. genet. perturbations | c2.cgp | 4318 |
BC | BioCarta | c2.cp.biocarta | 281 |
K | KEGG | c2.cp.kegg_medicus | 395 |
KL | KEGG legacy | c2.cp.kegg_legacy | 272 |
PID | pathway interaction datab. | c2.cp.pid | 323 |
R | reactome | c2.cp.reactome | 1827 |
WP | WikiPathways | c2.cp.wikipathways | 973 |
MIR | microRNA targets | c3.mir.mirdb | 3666 |
MIRL | microRNA targets legacy | c3.mir.mir_legacy | 365 |
TFT | TF targets | c3.tft.gtrd | 789 |
TFTL | TF targets legacy | c3.tft.tft_legacy | 1146 |
3CA | curated cancer cell atlas | c4.3ca | 184 |
CGN | cancer gene neighborhoods | c4.cgn | 670 |
CM | cancer modules | c4.cm | 561 |
GOBP | GO biological process | c5.go.bp | 6970 |
GOCC | GO cellular component | c5.go.cc | 823 |
GOMF | GO molecular function | c5.go.mf | 1215 |
HPO | human phenotype ontology | c5.hpo | 5337 |
ONC | oncogenic | c6.all | 295 |
IMM | immunologic | c7.immunesigdb | 9363 |
VAX | vaccine response | c7.vax | 365 |
CT | cell type | c8.all | 1177 |
SomaModule Characteristics, Limitations, and Alternative Approaches
The SomaModule framework is a greedy approach to identify, through successive iterations, the largest nonoverlapping, strongly intracorrelated modules. In order to explore the characteristics of these modules, we calculated: (i) all SOMAmer–SOMAmer correlations within each parent gene set, (ii) those within each child SomaModule, and (iii) the cross-correlations between SOMAmers in the parent gene set (not included in the SomaModule) and those in the SomaModule. Then, for each parent–child pair, we calculated: (i) the mean correlation within the parent gene set, (ii) the mean correlation within each child SomaModule, and (iii) the mean cross-correlation between parent and child (excluding SOMAmers present in both parent and child). For the sake of simplicity, here we considered only the first child of each parent gene set, since very few gene sets have more than one child SomaModule. Figure S1 shows mean correlation density distributions across all parent–child pairs from different collections (hallmarks, KEGG, Reactome and WikiPathways). As expected, all child SomaModule distributions (in red) show strong within-module correlations (r ≥ 0.5 by definition). Not surprisingly, the within-parent gene set correlations (in black) are generally much weaker. Interestingly, we observe that the parent–child cross-correlation distributions (in green) are shifted toward negative values; that is, the hierarchical clustering procedure used to extract SomaModules is biased toward SOMAmer clusters that are negatively correlated with the rest of the SOMAmers in the parent gene set. This characteristic may be explained by the fact that SomaModules are preferentially selected to contrast with the background.
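The three summary statistics computed for each parent–child pair can be illustrated with a short Python sketch. The function name and the 4 × 4 toy matrix are hypothetical and serve only to make the definitions concrete.

```python
import numpy as np


def module_correlation_summary(corr, child_idx):
    """Mean within-parent, within-child, and parent-child cross-correlations."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    child = np.asarray(sorted(child_idx))
    # Parent members that are not part of the child module.
    rest = np.setdiff1d(np.arange(n), child)
    parent_pairs = np.triu_indices(n, k=1)
    child_pairs = np.triu_indices(len(child), k=1)
    return {
        "parent": float(corr[parent_pairs].mean()),
        "child": float(corr[np.ix_(child, child)][child_pairs].mean()),
        "cross": float(corr[np.ix_(rest, child)].mean()),
    }


# Toy parent gene set with four members; the first two form the child module.
toy = np.array([
    [1.0, 0.8, -0.2, -0.2],
    [0.8, 1.0, -0.2, -0.2],
    [-0.2, -0.2, 1.0, 0.1],
    [-0.2, -0.2, 0.1, 1.0],
])
summary = module_correlation_summary(toy, child_idx=[0, 1])
```

In this toy case the child module is tightly correlated (0.8), the parent as a whole is nearly uncorrelated on average, and the cross-correlation is negative, mirroring the qualitative pattern described above.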
Naturally, the hierarchical-clustering-based procedure to extract SomaModules described above is just one approach out of many possible alternative formulations. For the sake of illustration, let us consider one possible use of the popular weighted gene coexpression network analysis (WGCNA) in this scenario. More specifically, let us revisit the SOMAmer correlation matrix from Figure 1a and apply WGCNA to find the optimal partition of this parent gene set into submodules. WGCNA is a tool with multiple parameters to adjust. To determine the optimal power (β) to weigh correlations (such that increased power suppresses weaker correlations), WGCNA proposes a grid-search approach, in which a range of β values is explored against a scale-free topology model fit. Figure S2 shows that, in this case, β = 5 was the best fit. WGCNA offers a large variety of parameters to extract clusters. Consistent with our previous assumption, we set “minModuleSize” (the minimum module size for module detection) equal to 10. Because we wanted to discriminate up- vs down-regulated SOMAmers, we discarded the “unsigned” network type, leaving us with “signed” and “signed hybrid” modalities. Additionally, we explored the “deepSplit” parameter (which provides a simplified control over how sensitive module detection should be to module splitting) and three topological overlap measure (TOM) types (“none”, “unsigned”, and “signed”). Results are summarized in Table S2. In order to select the optimal clustering, our criterion was to maximize the mean variance explained by the first principal component (averaged over all clusters). We found that the optimal clustering consists of three clusters of sizes 88 (corresponding to the “noise” or “background” cluster), 40, and 10 SOMAmers, respectively. The latter two are fully included in the previously identified SomaModules “R.144.1” and “R.144.2”, respectively, shown in Figure 1b.
The concordance between SomaModules and the optimal WGCNA clustering is highly significant (Fisher’s exact test p-value = $10^{-42}$).
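The selection criterion used to compare WGCNA parameter settings (mean variance explained by the first principal component, averaged over clusters) can be sketched as follows. This Python fragment does not run WGCNA, which is an R package; it only scores a given partition. Standardizing features before the decomposition (i.e., working with the correlation matrix) is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np


def pc1_fraction(data):
    """Fraction of variance carried by the first principal component.

    `data` is a samples x features array; features are standardized
    implicitly by decomposing their correlation matrix (an assumption).
    """
    corr = np.atleast_2d(np.corrcoef(data, rowvar=False))
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return float(eigenvalues[-1] / eigenvalues.sum())


def mean_pc1_fraction(data, labels):
    """Average the PC1 fraction over all clusters of a partition."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    fractions = [pc1_fraction(data[:, labels == lab]) for lab in np.unique(labels)]
    return float(np.mean(fractions))


# Toy data: cluster "x" holds two perfectly correlated features,
# cluster "y" holds two uncorrelated features.
demo = np.array([
    [1.0, 2.0, 1.0, 1.0],
    [2.0, 4.0, -1.0, 1.0],
    [3.0, 6.0, 1.0, -1.0],
    [4.0, 8.0, -1.0, -1.0],
])
score = mean_pc1_fraction(demo, ["x", "x", "y", "y"])
```

A partition scores close to 1 when each cluster behaves like a single coherent signal, and close to 1/m for a cluster of m mutually uncorrelated features.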
Top-down clustering approaches, such as those exemplified here to find subclusters within previously annotated pathways, naturally preclude the possibility of finding de novo gene sets. This, in fact, could be achieved by implementing bottom-up procedures, similar to those used to build de novo transcriptomic modules in blood. The task of extracting de novo modules presents some challenges, for instance: (i) to define criteria for the optimal number and size of clusters; (ii) to assess module stability and robustness; and (iii) to annotate de novo modules using reliable and biologically meaningful descriptors. Nonetheless, this is a promising direction of research that may complement the top-down approach we followed in this paper.
As stated in the Introduction, the goal of this work is to address potential sources of bias that may confound functional enrichment analysis, accounting for differences between transcriptomics- and proteomics-derived gene sets, as well as capturing SomaScan-specific characteristics such as proteome coverage, sensitivity, and specificity. Other sources of biological bias, caveats, and shortcomings of pathway analysis, however, remain to be carefully assessed on a case-by-case basis. One of them is referred to as biological bias, to account for the fact that cell types have highly specialized omics profiles, which can be used to reliably identify the tissue of origin. On the other hand, although disease-specific traits are typically assessed by contrast to pathway references derived from a healthy cohort, this procedure may mask disease subtypes that would require disease-specific pathway references, such as the COVID-19 Disease Map. In this regard, the Molecular Signatures Database, and therefore also the SomaModule repository derived from it, contain thousands of tissue- and disease-specific gene sets, such as those included in the cell type (CT), curated cancer cell atlas (3CA), cancer gene neighborhoods (CGN), cancer modules (CM), oncogenic (ONC) and human phenotype ontology (HPO) collections (see Table 1). The choice of reference gene set collections is dictated by the context of the scientific question under study; to illustrate this point, we discuss below an application in which the relationship between physical performance and mitochondrial function in aging is investigated using SomaModules derived from a collection of annotated mitochondrial pathways.
Another challenge of pathway enrichment analysis is that of gene set overlap, where some genes participate in multiple gene sets. This phenomenon occurs in the presence of multifunctional genes (i.e., genes that play a role in several biological functions or molecular processes) but is also prevalent in some gene set collections with redundancies or a strong hierarchical structure, such as Reactome and the three lineages of gene ontology (GO). Beyond some early attempts to address this issue, this remains an important topic that deserves further investigation and lies beyond the scope of the present work.
SomaModule Repository and Usage
Closely following the structure of the MSigDB public repository, we generated our first release of SomaModule collections in Gene Matrix Transposed (.gmt) file format and made them publicly available on the Open Science Framework repository. To ensure data provenance, we followed the same naming conventions adopted by MSigDB, adding a suffix that indicates the SomaModule build version. Although we plan to keep the SomaModule repository updated as new MSigDB versions are released, we also provide all the code, documentation and data needed to regenerate the full repository, so that users are free to explore the SomaModule framework to suit their needs and interests (for instance, by using different module selection criteria). Our framework, moreover, is not limited to MSigDB collections. To emphasize this point, we generated a second SomaModule repository from MitoCarta3.0, which contains annotated mitochondrial pathways, and used it to explore the longitudinal association between mitochondrial processes and physical performance metrics (see Results and Discussion section below). Our framework can be similarly leveraged to analyze gene sets from other sources, including custom, user-generated gene sets. Gene set collections in .gmt file format can be seamlessly uploaded into GSEA but, simply consisting of lists of gene identifiers, they can also be used with other nontopology-based pathway analysis tools, e.g. those implementing over-representation analysis and functional class scoring methods. Note that GSEA does not allow gene identifiers to use hyphens (dashes), which are in fact used in SomaScan’s SeqIds; we circumvent this issue by replacing them programmatically with underscores. Therefore, we provide two different versions of our repository build (using either original or modified SeqIds), as well as R scripts showing how to reformat SeqIds when uploading SOMAmer rank lists into GSEA.
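As an illustration of the file layout and of the SeqId reformatting mentioned above, the following Python sketch builds .gmt lines (one gene set per line: name, description, then members, all tab-separated) and replaces hyphens with underscores. It is not one of the R scripts distributed with the repository; the function names and the "NA" description placeholder are assumptions.

```python
from typing import Dict, List


def gsea_safe(seqid: str) -> str:
    """Replace hyphens with underscores so that GSEA accepts the identifier."""
    return seqid.replace("-", "_")


def to_gmt_lines(
    gene_sets: Dict[str, List[str]],
    description: str = "NA",  # placeholder for the second .gmt column
    reformat: bool = True,
) -> List[str]:
    """Build one tab-separated .gmt line per gene set."""
    lines = []
    for name, seqids in gene_sets.items():
        members = [gsea_safe(s) if reformat else s for s in seqids]
        lines.append("\t".join([name, description] + members))
    return lines


# Two SeqIds from SomaModule R.144.1 mentioned in the text (both map to STAT3).
toy = {"R.144.1": ["10346-5", "10354-57"]}
gmt_lines = to_gmt_lines(toy)
original_lines = to_gmt_lines(toy, reformat=False)
```

The same replacement has to be applied to the identifiers in the ranked SOMAmer list; otherwise the gene set members will not match any ranked feature.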
Gene Set Enrichment Analysis (GSEA)
GSEA is a well-established tool to assess pathway enrichment of ranked gene lists. GSEA is a threshold-free method that analyzes all measured genes on the basis of their ranking score, without prior gene filtering. Let $r_k$ be the ranking score associated with the $k$-th gene, $g_k$, with $k = 1, \ldots, n$ an index running through all measured genes. Without loss of generality, we assume genes to be indexed in decreasing order of their ranking scores. Typically, it is assumed that ranking scores span positive and negative values; the absolute value of a ranking score is a measure of effect size and/or statistical significance, whereas the sign indicates the direction of change (see below for further details on SOMAmer rank metrics adopted for the analysis of our case examples).
Let $S$ be a gene set (after removing any genes not included in the universe of measured genes), containing $n_S$ genes. The step variable $x_k$ is defined as

$$
x_k =
\begin{cases}
\dfrac{|r_k|^{p}}{\sum_{g_j \in S} |r_j|^{p}} & \text{if } g_k \in S, \\[2ex]
-\dfrac{1}{n - n_S} & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $p$ is the Kolmogorov enrichment statistic, whose values are $p = 0$ (corresponding to the classic/unweighted GSEA version) and $p = 1$, 1.5, and 2 (corresponding to weighted GSEA versions). This weight parameter was introduced to emphasize the role of genes at the top and bottom of the gene list (i.e., those with larger absolute values of the ranking score in either direction) at the expense of genes in the center of the list (i.e., those with lower absolute values of the ranking score in either direction), although recent assessments using RNA-Seq-based benchmarks suggested that the classic/unweighted approach offered comparable or better sensitivity-vs-specificity trade-offs. The cumulative enrichment of gene set $S$ is defined as the running sum of the step variable $x_k$ from top to bottom, i.e.

$$
E_k = \sum_{j=1}^{k} x_j .
\tag{2}
$$
In this way, starting at $E_0 \equiv 0$, GSEA progressively examines genes from the top to the bottom of the ranked list, increasing the cumulative enrichment if the gene at that position belongs to the pathway and decreasing the running sum otherwise. The step variable is normalized to ensure that $E_n = 0$ at the bottom of the list. Then, the target gene set’s enrichment score (ES) is defined as the maximum departure from zero (in either direction, i.e. positive or negative) along the cumulative enrichment profile; its sign indicates whether the observed enrichment is associated with the top (positive ES) or bottom (negative ES) portion of the ranked gene list. In order to control for pathway size, GSEA also computes a normalized enrichment score (NES). Results described in the main section of this paper used GSEA in Preranked mode with 1000 gene set permutations and the classic/unweighted Kolmogorov enrichment statistic. Supporting Data files report results obtained with all (unweighted or weighted) choices for the Kolmogorov enrichment statistic.
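The running-sum construction described above can be made concrete with a short Python sketch. It follows the standard GSEA normalization (hits weighted by the absolute ranking score raised to the power p, misses penalized uniformly so that the profile returns to zero) and is meant as an illustration only: it is not the gsea-cli implementation, it does not compute the NES or permutation-based significance, and the function name and toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def enrichment_profile(
    ranked_genes: Sequence[str],
    scores: Sequence[float],
    gene_set: Set[str],
    p: float = 0.0,
) -> Tuple[List[float], float]:
    """Return the cumulative enrichment profile and the enrichment score.

    `ranked_genes` and `scores` must already be sorted in decreasing
    order of the ranking score; p = 0 gives the classic/unweighted version.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n = len(ranked_genes)
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Normalization constants for hits and misses.
    hit_norm = sum(abs(r) ** p for r, hit in zip(scores, in_set) if hit)
    if hit_norm == 0:
        raise ValueError("all hits have zero score; weighted steps are undefined")
    miss_step = 1.0 / (n - n_hits)

    profile = []
    running = 0.0
    for r, hit in zip(scores, in_set):
        running += abs(r) ** p / hit_norm if hit else -miss_step
        profile.append(running)
    # ES: maximum departure from zero, keeping its sign.
    es = max(profile, key=abs)
    return profile, es


# Toy ranked list in which both gene set members sit at the top.
genes = ["a", "b", "c", "d", "e"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0]
profile, es = enrichment_profile(genes, scores, {"a", "b"}, p=0.0)
```

In this toy case the profile rises by 1/2 at each of the two hits and falls by 1/3 at each of the three misses, so it peaks at 1 after the second gene and returns to zero at the bottom of the list.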
SOMAmer Rank Metrics
We assume a rank metric of the form

$$
r = -\,\mathrm{sign} \times \log_{10}(p\text{-value}),
\tag{3}
$$
where sign = ±1 indicates the direction of change and p-value is an appropriate measure of the significance of that change. In the Results and Discussion section below, we explore two case examples with different types of study design. For the first one, a case–control study, we fitted SOMAmer expression values against health condition (AD or control), sex, age, and race using functions lmFit, eBayes and topTable from the R package limma v.3.56.1, which provide the AD/control fold change (log FC) and corresponding p-value estimated by empirical Bayes variance stabilization. Volcano plots in Figure S3 show the results obtained from plasma (panel (a)) and CSF (panel (b)) samples. Using sign(log FC) in eq 3, we obtain the rank score. For the second case example, a longitudinal analysis of physical performance, mixed-effects models were run according to the formula: “phys.perf. ∼ (1|subject) + sex + age + log10(RFU)”, where each physical performance outcome was regressed against fixed effects for sex, age, and SOMAmer relative concentration, considering subject identifiers as random effects. The statistical significance of the SOMAmer abundance term was assessed by performing a likelihood ratio test to compare the full model against the null model: “phys.perf. ∼ (1|subject) + sex + age” using the anova function from R package stats v.4.4.0. Mixed-effects models were implemented via the lmer function from R package lme4 v.1.1.35.5 with the argument REML = FALSE to optimize the log-likelihood.
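The two ingredients of the rank score can be sketched in Python as follows. The sketch assumes that the rank metric takes the common signed form −sign × log10(p-value), and it computes the likelihood ratio test p-value for a single added fixed effect directly from two log-likelihoods (chi-square distribution with one degree of freedom). It does not fit the limma or lme4 models themselves; function names and numerical inputs are hypothetical.

```python
import math


def rank_metric(effect: float, p_value: float) -> float:
    """Signed significance score: -sign(effect) * log10(p-value) (assumed form)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must lie in (0, 1]")
    # Assumption: a zero effect is assigned a positive sign.
    sign = 1.0 if effect >= 0 else -1.0
    return -sign * math.log10(p_value)


def lrt_p_value(loglik_full: float, loglik_null: float) -> float:
    """Likelihood ratio test p-value for one extra parameter (1 d.f.)."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_null))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical example: a positive effect with p = 0.001 ranks near the top,
# a negative effect with p = 0.01 ranks near the bottom.
up = rank_metric(1.2, 0.001)
down = rank_metric(-0.5, 0.01)
p_lrt = lrt_p_value(loglik_full=-100.0, loglik_null=-101.920729410347062)
```

Comparing log-likelihoods in this way is only meaningful when both models are fitted by maximum likelihood, which is why the text specifies REML = FALSE.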
Data and Software Availability
Data analysis was performed in R v.4.4.0. Repositories in .gmt file format were imported using R package fgsea v.1.30.0. Clinical data were converted from SAS format using R package haven v.2.5.4. Fitting procedure results were parsed with R package broom v.1.0.6. GSEA runs were generated using the command-line client gsea-cli.sh v.4.3.3. Wilcoxon and t tests were performed using base R package stats v.3.6.2. Plots were created using R packages RColorBrewer v.1.1-3, venn v.1.12, gplots v.3.1.3.1 and calibrate v.1.7.7. Anonymized data sets and custom R source code used in our analyses are available on the Open Science Framework repository at osf.io/wemcs (DOI 10.17605/OSF.IO/WEMCS).
Results and Discussion
We will illustrate the use of SomaModules by means of two case examples: (i) a case–control proteomic analysis of Alzheimer’s disease (AD) using the 7K SomaScan platform to examine AD-specific pathways, and (ii) a longitudinal analysis of physical performance using the 11K SomaScan assay to investigate the enrichment of mitochondrial pathways. These applications will showcase how SomaModules can be broadly used to analyze different clinical traits across different types of study design and different SomaScan versions.
Our first case example is from a cohort of control (n = 18) and AD (n = 18) patients in the Emory Goizueta Alzheimer’s Disease Research Center using the 7K SomaScan assay on plasma and cerebrospinal fluid (CSF) samples. After scanning all pathways in MSigDB, we identified six of them that were AD-specific (based on their names) and had at least one SomaModule. Figure 2 shows Venn diagrams with the number of overlapping SOMAmers in the 7K assay based on the original MSigDB pathways (panel (a)) and the first SomaModules derived from them (panel (b)). Here, we notice that a large number of SOMAmers were specific to individual pathways; in particular, gene set WP.44 from WikiPathways and gene sets CGP.178 and CGP.181 from the chemical and genetic perturbations (CGP) collection contain a large proportion of nonoverlapping SOMAmers, both in their original versions (panel (a)) and in their SomaModule versions (panel (b)). Using plasma samples, Figure 3 shows pathway enrichment quantified by means of GSEA’s enrichment scores (ES) and normalized enrichment scores (NES), the latter of which was designed to account for differences in gene set size. Here, enrichment is defined as protein abundance in controls relative to AD patients; in agreement with an independent study, plasma protein abundance is biased to be generally greater in controls compared with AD patients (Figure S3a). We observe that, for all six AD-specific pathways, SomaModules (red symbols) are more sensitive than the original MSigDB counterparts (blue symbols). For instance, KEGG’s Alzheimer’s disease original gene set (KL.9.0) has NES = 1.6 and p-value ≥0.05, whereas the SomaModule derived from it (KL.9.1) has NES = 4.4 and p-value <$10^{-3}$. Although results shown in Figure 3 correspond to GSEA’s classic/unweighted Kolmogorov–Smirnov enrichment statistic, we explored all weighted options (enrichment statistic parameters p = 1, 1.5, and 2) with similar results (Supporting Data 1).
Paired Wilcoxon and Student’s t tests comparing the absolute value of enrichment scores between original MSigDB gene sets and SomaModules yield statistically significant differences in all cases considered; we thus conclude that SomaModules show significantly higher enrichment than the original MSigDB gene set counterparts. In a similar vein, Figure 4 shows pathway enrichment results obtained from CSF samples. Here, enrichment is defined in the opposite direction, as protein abundance in AD patients relative to controls (Figure S3b). Results for all GSEA weighted options are provided in Supporting Data 2. Although only a few of the selected pathways are significant, we observe that, in cases where they are, the enrichment scores of SomaModules are greater than those of their corresponding original MSigDB counterparts.
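To make the role of the enrichment statistic parameter p concrete, the following is a minimal sketch of the running-sum (Kolmogorov–Smirnov-like) enrichment score as commonly described for GSEA. It is an illustration, not the GSEA software itself: the function name `enrichment_score` and its arguments are hypothetical, and the permutation-based normalization that yields NES and p-values is omitted. Setting `p=0` corresponds to the classic/unweighted statistic; `p = 1, 1.5, 2` correspond to the weighted options.

```python
from typing import Sequence, Set


def enrichment_score(ranked_ids: Sequence[str],
                     scores: Sequence[float],
                     gene_set: Set[str],
                     p: float = 0.0) -> float:
    """Running-sum enrichment score (illustrative sketch).

    ranked_ids : identifiers sorted by decreasing ranking metric
    scores     : ranking metric for each identifier, in the same order
    gene_set   : identifiers belonging to the gene set under test
    p          : weight exponent (0 = classic/unweighted statistic)
    """
    hits = [g in gene_set for g in ranked_ids]
    n_hit = sum(hits)
    n_miss = len(ranked_ids) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked ids")

    # Each hit contributes |score|^p; with p = 0 every hit counts equally.
    weights = [abs(s) ** p if h else 0.0 for s, h in zip(scores, hits)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all hit weights are zero; use p = 0")

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for w, h in zip(weights, hits):
        running += w / total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A positive score indicates that the gene set is concentrated near the top of the ranked list and a negative score near the bottom, which is why the absolute value is used when comparing SomaModules with their original gene sets.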
Figure 2. Overlaps among AD-specific pathways in the 7K SomaScan assay. Each intersection shows the number of overlapping SOMAmers; zeros are not shown. Pathways were obtained from the KEGG legacy (KL), WikiPathways (WP), and chemical and genetic perturbations (CGP) collections included in MSigDB. (a) Original (MSigDB) gene sets. (b) SomaModules derived from the original (MSigDB) gene sets.
Figure 3. GSEA enrichment scores of AD-specific pathways using 7K SomaScan data from AD vs control plasma samples. Two metrics of pathway enrichment are shown: (a) enrichment scores; and (b) normalized enrichment scores (designed to account for differences in gene set size). For each pathway, SomaModules (red symbols) are compared to the original MSigDB counterparts (blue symbols). Pathways are ordered, from top to bottom, in decreasing order of SomaModule enrichment. Pathways shown are enriched in controls relative to AD patients. Pathway names from MSigDB are followed, in parentheses, by the identifiers used in our repository, which are formed by a collection prefix (“WP” for WikiPathways, “KL” for KEGG legacy, and “CGP” for chemical and genetic perturbations) and an integer suffix.
Figure 4. GSEA enrichment scores of AD-specific pathways using 7K SomaScan data from AD vs control CSF samples. Two metrics of pathway enrichment are shown: (a) enrichment scores; and (b) normalized enrichment scores (designed to account for differences in gene set size). For each pathway, SomaModules (red symbols) are compared to the original MSigDB counterparts (blue symbols). Pathways are ordered, from top to bottom, in decreasing order of SomaModule enrichment. Pathways shown are enriched in AD patients relative to controls. Pathway names from MSigDB are followed, in parentheses, by the identifiers used in our repository, which are formed by a collection prefix (“WP” for WikiPathways, “KL” for KEGG legacy, and “CGP” for chemical and genetic perturbations) and an integer suffix.
The second example is a longitudinal analysis of physical performance in the BLSA study. Note that, while SomaModules were built using only the first available visit for each participant (666 samples), for this analysis we used all available visits from those same participants (2542 samples). Using mixed effects models to control for repeat visits, age and sex, we generated rank-ordered lists of 11K SOMAmers associated with 15 different metrics of physical performance. Based on previous findings on the relation between exercise and mitochondrial function in aging, we hypothesized that mitochondrial pathway enrichment would be significantly associated with physical performance outcomes.
From MitoCarta, we identified 20 mitochondrial pathways that had one SomaModule counterpart; utilizing those 40 gene sets, we ran separate GSEA analyses for each physical performance outcome. Figure 5 shows a paired comparison of enrichment scores (panel (a)) and normalized enrichment scores (panel (b)) using the time to walk 6 m as the outcome. We find that, for all the mitochondrial pathways considered, SomaModules (red symbols) are more enriched than their original gene set counterparts from MitoCarta (blue symbols). In fact, while all SomaModules are very significantly enriched (p-value < 10⁻³), we observe that the enrichment significance for many of the original gene sets is weaker (10⁻³ ≤ p-value < 0.05) or nonsignificant altogether (p-value ≥ 0.05). Results obtained similarly for the other physical performance outcomes are provided in Supporting Data 3 for all choices of GSEA enrichment statistics. In order to compare the absolute values of enrichment between original MitoCarta gene sets and SomaModules derived from them, we performed paired Wilcoxon and Student’s t tests for each physical performance outcome. Figure 6 shows the −log10(p-value) obtained from paired Wilcoxon tests as a function of the median of paired differences in the absolute value of enrichment scores. In all cases, absolute enrichment scores of SomaModules were greater than the corresponding ones for the MitoCarta pathways of origin; moreover, we observe that, except for normalized enrichment scores using the “narrow walk ratio” and “expanded physical performance battery score” outcomes, all remaining comparisons yielded statistically significant differences. Figure S4 shows similar results using Student’s t-test for comparison. Altogether, these results show that SomaModules have significantly greater enrichment than the original gene set counterparts.
These findings are robust and not significantly affected by the choices of enrichment metric or the Kolmogorov enrichment statistic used in the GSEA procedure.
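The paired comparison described above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code: the function names are hypothetical, the exact two-sided p-value is obtained by enumerating all sign assignments (practical only for small numbers of pairs; for larger sets a statistics library would be the usual choice), and dropping zero differences is an assumed convention. It returns the two quantities discussed in the text, the median of paired differences in absolute enrichment score and −log10(p-value).

```python
import math
from itertools import product
from statistics import median
from typing import List, Sequence, Tuple


def paired_wilcoxon_exact(x: Sequence[float],
                          y: Sequence[float]) -> Tuple[float, float]:
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W+, p-value). Zero differences are dropped (assumed convention).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |differences|, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)

    # Under the null, each difference is positive or negative with equal odds.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


def compare_abs_enrichment(module_scores: Sequence[float],
                           original_scores: Sequence[float]) -> Tuple[float, float]:
    """Median paired difference in |score| and -log10 of the Wilcoxon p-value."""
    abs_mod = [abs(s) for s in module_scores]
    abs_orig = [abs(s) for s in original_scores]
    med_diff = median(m - o for m, o in zip(abs_mod, abs_orig))
    _, p_value = paired_wilcoxon_exact(abs_mod, abs_orig)
    return med_diff, -math.log10(p_value)
```

With six pathway pairs in which every SomaModule score exceeds its original counterpart in absolute value, the smallest attainable two-sided p-value is 2/64 ≈ 0.031, which illustrates why small pathway panels can reach, but not greatly exceed, conventional significance.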
Figure 5. GSEA enrichment scores of mitochondrial pathways associated with the time to walk 6 m using 11K SomaScan data from 2542 BLSA plasma samples. For each pathway, SomaModules (red symbols) are compared to the original MitoCarta counterparts (blue symbols). Pathways are ordered, from top to bottom, in decreasing order of SomaModule enrichment. (a) Enrichment scores. (b) Normalized enrichment scores (designed to account for differences in gene set size).
Figure 6. Paired Wilcoxon test significance of enrichment score differences between SomaModules and original MitoCarta pathways for different physical performance outcomes. Panels show differences in (a) enrichment scores and (b) normalized enrichment scores. Physical performance outcomes are listed on the right-hand side. The horizontal dashed line corresponds to p-value = 0.05.
Conclusions
Given the lack of adequate tools to perform pathway enrichment analysis on proteomic data, this work presents an approach specifically tailored to the SomaScan assay. Using existing, annotated gene sets as the starting point, our framework implements a top-down procedure to identify strongly correlated SOMAmer modules, termed “SomaModules”, based on 11K SomaScan data recently generated from plasma samples in the BLSA study.
The SomaModule framework is a greedy approach to identify, through successive iterations, the largest nonoverlapping modules whose median correlation of intramodule SOMAmer pairs is above a predefined threshold value. Throughout this paper, we utilize SomaModules that were generated by using a minimum size of 10 SOMAmers and a minimum intramodule median correlation of 0.5. However, we provide well-documented code to allow users to build their own SomaModule repositories using other threshold criteria of their choice, as well as other pathway sources (including their own, customized gene sets). In this work, we generated two repositories: the first one, based on the latest MSigDB release, contains a total of 41,705 SOMAmer-based gene sets ranging in size from 10 to 1697 SOMAmers; the second one, based on the latest MitoCarta release, contains 80 mitochondrial gene sets ranging in size from 10 to 317 SOMAmers. These repositories can be utilized in any nontopology-based pathway enrichment analysis tool, such as over-representation analysis and functional class scoring (FCS) methods. In this work, we demonstrated the use of our repositories with gene set enrichment analysis (GSEA), the pioneering and arguably most popular FCS tool. Because they follow the same formatting used in MSigDB, our repositories can be seamlessly integrated into GSEA.
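As a rough illustration of the kind of greedy, top-down procedure summarized above, the sketch below extracts nonoverlapping modules from one gene set. Only the stopping criteria (minimum size of 10 and minimum median pairwise correlation of 0.5) come from the text; the pruning rule used here, which repeatedly drops the member with the lowest median correlation to the remaining members, is an assumption made for illustration and may differ from the released implementation. All function names and the nested-dictionary correlation layout are hypothetical.

```python
from itertools import combinations
from statistics import median
from typing import Dict, List, Sequence

Corr = Dict[str, Dict[str, float]]  # corr[a][b] = correlation of SOMAmers a, b


def median_pair_corr(members: Sequence[str], corr: Corr) -> float:
    """Median correlation over all intramodule pairs."""
    return median(corr[a][b] for a, b in combinations(members, 2))


def prune_to_module(candidates: Sequence[str], corr: Corr,
                    min_size: int, min_corr: float) -> List[str]:
    """Shrink the candidate set top-down until it meets the threshold.

    Assumed pruning rule: drop the member least correlated with the rest.
    Returns an empty list if no module of at least min_size is found.
    """
    members = list(candidates)
    while len(members) >= max(min_size, 2):
        if median_pair_corr(members, corr) >= min_corr:
            return members
        worst = min(
            members,
            key=lambda m: median(corr[m][o] for o in members if o != m),
        )
        members.remove(worst)
    return []


def greedy_modules(gene_set: Sequence[str], corr: Corr,
                   min_size: int = 10, min_corr: float = 0.5) -> List[List[str]]:
    """Successively extract nonoverlapping modules from one gene set."""
    pool = list(gene_set)
    modules: List[List[str]] = []
    while len(pool) >= max(min_size, 2):
        module = prune_to_module(pool, corr, min_size, min_corr)
        if not module:
            break
        modules.append(module)
        chosen = set(module)
        # Remaining SOMAmers form the pool for the next iteration.
        pool = [s for s in pool if s not in chosen]
    return modules
```

Because each module's members are removed from the pool before the next iteration, the resulting modules cannot overlap, and the first module found for a gene set is the largest one that this pruning path reaches.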
We validated our repositories with two case examples. In the first one, we used 7K SomaScan data from a study of Alzheimer’s disease (AD) patients and non-AD controls to test AD-specific pathways from MSigDB. We found that SomaModules had significantly higher enrichment than the original gene set counterparts. In our second case example, we used 11K SomaScan data from the BLSA to explore the enrichment of mitochondrial pathways from MitoCarta in association with a variety of physical performance outcomes. In agreement with our previous example, we found that SomaModules had greater enrichment than the original gene set counterparts and that those differences were generally significant. These findings were robust, not significantly affected by the choice of enrichment metric or the Kolmogorov enrichment statistic used in the GSEA procedure. Comparisons performed with paired Wilcoxon and Student’s t tests consistently showed that SomaModules had significantly greater enrichment than the original gene set counterparts.
As SomaScan becomes more widely adopted as a state-of-the-art tool for proteomics discovery, we hope that this work will serve as a valuable technical reference and resource for the growing user community. We finally note that similar procedures may be adopted for follow-up work extending annotated pathways (or uncovering novel ones) based on data from other proteomics platforms, such as Olink and targeted mass spectrometry, and even other omics assays beyond proteomics.
Acknowledgments
This research was supported entirely by the Intramural Research Program of the National Institute on Aging, NIH. The authors thank Cassandra Blew and Cassandra Joynes for useful discussions and preliminary analyses of SomaModule characteristics, as well as the BLSA Team and study participants.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.4c01114.
Table S1: number of gene sets in each MSigDB collection, derived using different threshold combinations for minimum gene set size and intracluster correlation. Table S2: summary of results from WGCNA runs using different sets of parameters. Figure S1: mean correlation density distributions for parent–child SomaModule pairs derived from different MSigDB collections. Figure S2: WGCNA grid-search for the optimal soft-thresholding power (β) for network construction. Figure S3: volcano plots showing differentially abundant SOMAmers from an Alzheimer’s disease vs control study using 7K SomaScan. Figure S4: paired Student’s t-test significance of enrichment score differences between SomaModules and original MitoCarta pathways for different physical performance outcomes (PDF)
Supporting Data 1: GSEA enrichment scores of AD-specific pathways using 7K SomaScan data from AD vs control plasma samples (XLSX)
Supporting Data 2: GSEA enrichment scores of AD-specific pathways using 7K SomaScan data from AD vs control CSF samples (XLSX)
Supporting Data 3: GSEA enrichment scores of mitochondrial pathways associated with 15 physical performance metrics using 11K SomaScan data from 2542 BLSA plasma samples (XLSX)
The authors declare the following competing financial interest(s): J.C., K.A.W. and L.F. have given unpaid seminars and/or webinars sponsored or co-sponsored by SomaLogic.
- Dammer E. B., Ping L., Duong D. M., Modeste E. S., Seyfried N. T., Lah J. J., Levey A. I., Johnson E. C. B.. Multi-platform proteomic analysis of Alzheimer’s disease cerebrospinal fluid and plasma reveals network biomarkers associated with proteostasis and the matrisome. Alzheimers Res. Ther. 2022;14:174. doi: 10.1186/s13195-022-01113-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang Y., Zhou X., Ip F. C., Chan P., Chen Y., Lai N. C. H., Cheung K., Lo R. M. N., Tong E. P. S., Wong B. W. Y., Chan A. L. T., Mok V. C. T., Kwok T. C. Y., Mok K. Y., Hardy J., Zetterberg H., Fu A. K. Y., Ip N. Y.. Large-scale plasma proteomic profiling identifies a high-performance biomarker panel for Alzheimer’s disease screening and staging. Alzheimer’s Dement. 2022;18:88–102. doi: 10.1002/alz.12369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Menshikova E. V., Ritov V. B., Fairfull L., Ferrell R. E., Kelley D. E., Goodpaster B. H.. Effects of exercise on mitochondrial content and function in aging human skeletal muscle. J. Gerontol., Ser. A. 2006;61:534–540. doi: 10.1093/gerona/61.6.534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sorriento D., Di Vaia E., Iaccarino G.. Physical exercise: A novel tool to protect mitochondrial health. Front. Physiol. 2021;12:660068. doi: 10.3389/fphys.2021.660068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Memme J. M., Erlich A. T., Phukan G., Hood D. A.. Exercise and mitochondrial health. J. Physiol. 2021;599:803–817. doi: 10.1113/JP278853. [DOI] [PubMed] [Google Scholar]
Data Availability Statement
Data analysis was performed in R v.4.4.0. Repositories in .gmt file format were imported using R package fgsea v.1.30.0. Clinical data were converted from SAS format using R package haven v.2.5.4. Fitting procedure results were parsed with R package broom v.1.0.6. GSEA runs were generated using the command-line client gsea-cli.sh v.4.3.3. Wilcoxon and t tests were performed using base R package stats v.3.6.2. Plots were created using R packages RColorBrewer v.1.1-3, venn v.1.12, gplots v.3.1.3.1 and calibrate v.1.7.7. Anonymized data sets and custom R source code used in our analyses are available on the Open Science Framework repository at osf.io/wemcs (DOI 10.17605/OSF.IO/WEMCS).
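For readers unfamiliar with the .gmt layout used by the repositories, the following is a minimal, language-neutral sketch of how such a file can be parsed. It is not the authors' code: the analyses above imported the files in R with fgsea. The function name `read_gmt` and the example module and SOMAmer identifiers are hypothetical, and the sketch assumes only the standard GMT convention of one gene set per line, with tab-separated fields for set name, description, and members.

```python
import io
from typing import Dict, List, TextIO


def read_gmt(handle: TextIO) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {set_name: [member, ...]}.

    Assumes the standard GMT layout: each line holds one set as
    name <TAB> description <TAB> member1 <TAB> member2 ...
    """
    gene_sets: Dict[str, List[str]] = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            # Skip blank lines and sets with no members.
            continue
        name = fields[0]
        # fields[1] is the free-text description; members start at index 2.
        members = [m for m in fields[2:] if m]
        gene_sets[name] = members
    return gene_sets


# Hypothetical two-module example (identifiers are placeholders, not real data).
example = (
    "MODULE_A\tna\tseq.1\tseq.2\tseq.3\n"
    "MODULE_B\tna\tseq.4\tseq.5\n"
)
modules = read_gmt(io.StringIO(example))
```

To read a downloaded repository file, the same function can be called on an open file handle, for example `with open(path) as fh: modules = read_gmt(fh)`, where `path` points to a local copy of the .gmt file.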