ABSTRACT
The increasing availability and scale of biobanks and “omic” datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using Genome Wide Association Studies (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of “signal” genes with those of “noise” genes. From the estimated genetic component, PathGPS then extracts genetic pathways via principal component and factor analysis, leveraging the low-rank and sparse properties. In addition, we provide a bootstrap aggregating (“bagging”) algorithm to improve stability under data perturbation and hyperparameter tuning. When applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene–trait clusters and suggests multiple new hypotheses for future investigations.
Keywords: GWAS, pathway analysis, structural equation model, summary data
1. INTRODUCTION
Understanding the biological mechanisms by which genetic variation influences phenotypes is one of the primary challenges in human genetics (Lappalainen and MacArthur, 2021). Genome-wide association studies (GWAS) have successfully mapped thousands of genetic loci associated with complex human traits. However, it is extremely time-consuming and inefficient to investigate every identified association and validate the function (Visscher et al., 2017). Moreover, complex traits are usually highly polygenic and are associated with a large number of variants across the genome, each explaining only a small fraction of the genetic variance (Manolio et al., 2009; Shi et al., 2016). These difficulties have hindered the translation of GWAS findings into drug development and clinical applications (Cano-Gamez and Trynka, 2020).
Recently, studies have revealed that many complex traits are associated with the same genomic loci (Pickrell et al., 2016) and identified many pairs of traits with strong genetic correlation (Solovieff et al., 2013; Bulik-Sullivan et al., 2015; Ning et al., 2020). This phenomenon indicates that disease-causing variants may cluster into key pathways that drive several diseases at the same time (Boyle et al., 2017). Motivated by the underlying connection among traits, in this article we aim to use large-scale biobank data containing thousands of phenotypes to aggregate information from correlated complex traits and infer their shared genetic architectures.
Our motivating dataset is the UK Biobank, a rich database of genetic and phenotypic information from hundreds of thousands of participants across the UK. Participants are genotyped to capture genome-wide genetic variation at millions of single nucleotide polymorphisms (SNPs). A wide variety of phenotypes are recorded, including biological measurements, lifestyle indicators, biomarkers in blood, and disease diagnosis. The UK Biobank data provide plenty of opportunities for identifying genetic associations with complex traits.
There are several non-trivial hurdles to recovering shared genetic architectures from GWAS data. First, any trait is a product of genetic and environmental influences. The environmental factors can either produce spurious associations or mask true genetic signals. Unfortunately, environmental factors are not directly observed in most datasets, making it difficult to isolate the genetic contribution from environmental influences. Second, biobank data are usually gathered in multiple batches and are regularly augmented with newly collected data. Therefore, the database includes observations from several cohorts and the summary statistics are derived from multiple populations. Third, the set of measured traits in large biobanks is ever-growing and many traits are repeatedly measured in slightly different ways. It is desirable to develop a statistical method that is insensitive to data perturbation and yields consistent statistical conclusions as the dataset continues to be enriched.
In this paper, we develop a new statistical method—PATHway discovery through Genome–Phenome Summary data (PathGPS)—based on a model that assumes most human traits are regulated by one or several genetic or environmental pathways (Figure 1a). PathGPS can generate clusters of genes and traits associated with the same biological pathways (Figure 1b) and addresses the aforementioned challenges. First, by subtracting the empirical covariance of traits computed using “noise” genes (genetic variants showing no or weak associations with the traits) from that using “signal” genes (genetic variants showing strong associations with some traits), PathGPS disentangles genetic mechanisms from environmental factors. PathGPS then applies principal component analysis (PCA) to the disentangled covariance matrix and provides a low-rank representation of genetic associations. Second, PathGPS can be applied to summary statistics derived from several cohorts as long as the underlying genetic pathways are shared across cohorts. Third, to stabilize PathGPS, we design a novel implementation of the bootstrap aggregation (“bagging”) applied to unsupervised learning (Figure 1c). In particular, by resampling the genes, PathGPS obtains a weighted graph, which estimates the likelihood that a pair of variables (could be genes or traits) appear in the same pathway. Dimension reduction techniques and clustering algorithms are then applied to visualize this graph and produce clusters.
FIGURE 1.
An example of the structural equation model. The example contains four SNPs and four traits. There are two latent genetic mediators: the first is influenced by two SNPs and affects two traits; the second is influenced by two SNPs and affects one trait. There is one latent environmental mediator, which is independent of the SNPs and affects two traits.
PathGPS only requires GWAS summary statistics, which can be more easily accessed compared to individual genetic data. An additional benefit is that the computational complexity of our method does not depend on the sample size of the GWAS, once the summary statistics are already produced. Our proposal is motivated by the literature investigating summary statistics for heritability and latent factor characterization (Finucane et al., 2015; Bulik-Sullivan et al., 2015; Tanigawa et al., 2019). In addition, we draw upon statistical methods with both sparsity and low-rank structures (Kaiser, 1958; Jennrich, 2001; Zou et al., 2006; Witten et al., 2009). PathGPS also builds on the idea of using bootstrap resamples to reduce the variance of statistical learning methods (Breiman, 2001).
The paper is organized as follows. In Section 2, we introduce the statistical model and the PathGPS algorithm. In Section 3, we investigate the performance of PathGPS on simulated and real datasets and discuss findings from the UK Biobank data. We conclude the paper with further discussion in Section 4.
2. METHODS
In Section 2.1, we lay out the model characterizing latent pathways. We discuss column space estimation in Section 2.2.1, a preparation step for the gene–trait clustering in Section 2.2.2. In Section 2.3, we propose to use bootstrap aggregation to boost the stability of the proposed clustering algorithm.
2.1. Structural equation model of latent pathways
We describe latent genetic and environmental pathways connecting SNPs and traits using a linear structural equation model (SEM). We start with an individual-level model of the SNPs and traits, and then derive the summary statistics from the individual-level model.
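Suppose there are $p$ SNPs $X = (X_1, \ldots, X_p)^\top$ and $q$ traits $Y = (Y_1, \ldots, Y_q)^\top$. We assume the traits are influenced by the SNPs through $r$ latent genetic mediators $M = (M_1, \ldots, M_r)^\top$. Meanwhile, we assume the traits are also affected by $s$ unobserved environmental mediators $E = (E_1, \ldots, E_s)^\top$. Mathematically, we adopt the linear SEM,

$$M = U^\top X + \varepsilon_M, \tag{1}$$

$$Y = V M + W E + \varepsilon_Y, \tag{2}$$

where $\varepsilon_M$, $\varepsilon_Y$ denote zero-mean errors in the mediators $M$ and traits $Y$, respectively, and $U \in \mathbb{R}^{p \times r}$, $V \in \mathbb{R}^{q \times r}$, $W \in \mathbb{R}^{q \times s}$ are coefficient matrices. We assume the errors $\varepsilon_M$, $\varepsilon_Y$ are independent of the SNPs as well as the environmental mediators.

Our goal is to discover genetic pathways: SNPs $\to$ genetic mediator (latent) $\to$ traits. Since the genetic mediators are unobserved, we look for clusters of SNPs and traits related to the same underlying genetic pathway. Figure 1 displays an example with two genetic pathways, one in red and one in blue, and we aim to uncover the two corresponding gene–trait clusters. In terms of the SEM, let $U_k$, $V_k$ be the $k$-th columns of $U$, $V$; Equations 1 and 2 are then equivalent to

$$Y = \sum_{k=1}^{r} V_k \left( U_k^\top X + \varepsilon_{M,k} \right) + W E + \varepsilon_Y.$$

The $k$-th genetic pathway refers to the SNPs’ effects on the $k$-th mediator $M_k$, denoted by $U_k$, and the effect of $M_k$ on the traits $Y$, denoted by $V_k$. The $k$-th gene–trait cluster comprises the SNPs and traits with non-zero loadings in $U_k$ and $V_k$, respectively.

Our analysis does not operate on the individual-level data but instead handles the more readily available summary statistics, namely the gene–trait effect (marginal association) estimates. The estimated marginal associations of gene–trait pairs, denoted by $\hat{\beta} \in \mathbb{R}^{p \times q}$, are obtained from running simple linear regressions with one trait as the response and one SNP as the predictor. For pre-processing, we use the PLINK software to select (approximately) independent index SNPs. Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $\mathbf{Y} \in \mathbb{R}^{n \times q}$ collect the SNPs and traits of the $n$ individuals. We normalize the SNP vectors (the columns of $\mathbf{X}$) to zero mean and unit variance, so that the matrix $\mathbf{X}^\top \mathbf{X} / n$ is close to an identity matrix. Under the SEM and the above normalization, the estimated marginal associations of the index SNPs ideally take the form

$$\hat{\beta} = \frac{1}{n} \mathbf{X}^\top \mathbf{Y}. \tag{3}$$

By plugging Equations 1, 2 into Equation 3 and collecting the zero-mean environmental mediators $E$ and the two sources of errors $\varepsilon_M$, $\varepsilon_Y$ in an $n \times q$ matrix $\boldsymbol{\varepsilon}$, whose $i$-th row is the non-genetic component $(W E_i + V \varepsilon_{M,i} + \varepsilon_{Y,i})^\top$ of individual $i$, the estimated marginal association matrix $\hat{\beta}$ satisfies

$$\hat{\beta} \approx U V^\top + \frac{1}{n} \mathbf{X}^\top \boldsymbol{\varepsilon}. \tag{4}$$

In the following, we explain how to use $\hat{\beta}$ to uncover the gene–trait clusters, that is, the non-zero loadings in matching column pairs of $U$ and $V$.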
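To make the link between the individual-level SEM and the summary statistics concrete, the following sketch simulates a toy dataset and computes the marginal association matrix. It is an illustration only: the sizes, the Gaussian draws, and all variable names (`U`, `V`, `W`, `beta_hat`, and so on) are assumptions chosen here, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r, s = 2000, 30, 6, 2, 1   # toy sizes (hypothetical)

# Coefficient matrices of the linear SEM (dense here for simplicity).
U = rng.normal(size=(p, r))   # SNPs -> genetic mediators
V = rng.normal(size=(q, r))   # genetic mediators -> traits
W = rng.normal(size=(q, s))   # environmental mediators -> traits

# One row per individual; SNP columns are standardized to mean 0, variance 1.
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
E = rng.normal(size=(n, s))
eps_M = rng.normal(size=(n, r))
eps_Y = rng.normal(size=(n, q))

M = X @ U + eps_M                # genetic mediators (Equation 1, row form)
Y = M @ V.T + E @ W.T + eps_Y    # traits (Equation 2, row form)

# Marginal associations: one simple regression slope per SNP-trait pair.
beta_hat = X.T @ Y / n
```

Because every SNP column has mean zero and unit variance, each entry of `beta_hat` coincides with the slope of the simple linear regression of one trait on one SNP, which is why the whole matrix can be computed with a single matrix product.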
2.2. Estimation of gene–trait clusters
Two biological phenomena facilitate the learning of gene–trait clusters from the summary statistics $\hat{\beta}$. First, the ubiquity of pleiotropy, where a single mutation may affect multiple traits, is supported by increasing evidence. Correspondingly, in models 1 and 2, the number of strong genetic pathways is expected to be significantly smaller than the number of traits and SNPs collected, that is, $r \ll p$, $r \ll q$, and thus the product matrix $U V^\top$ is low-rank (of rank at most $r$). Second, though the total number of SNPs is colossal, only a small proportion of SNPs are expected to be involved in any given genetic pathway. In statistical terminology, the underlying true coefficient matrix $U$ should consist of sparse columns. In addition, a genetic pathway may only influence a limited number of the traits collected. Therefore, most of the elements in the columns $V_k$ are anticipated to be zero. The low-rank and sparse structures together imply that the traits and SNPs can be grouped into a few clusters (low-rank property), each containing a relatively small number of SNPs and traits (sparse property).
We now describe how PathGPS leverages the low-rank and sparse structures. We start with estimating the low-dimensional column spaces of $U$ and $V$ in Section 2.2.1. In Section 2.2.2, we discuss methods to find $\hat{U}$, $\hat{V}$ with sparse columns in the estimated column spaces $\widehat{\mathrm{col}}(U)$, $\widehat{\mathrm{col}}(V)$, respectively. Finally, we construct a gene–trait cluster for each column pair $\hat{U}_k$, $\hat{V}_k$, corresponding to the $k$-th genetic pathway. The whole procedure is summarized in Algorithm 1.
2.2.1. Column space estimation
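Matrices $U$, $V$ in Equation 4 are not identifiable without further assumptions. In fact, for any invertible matrix $R \in \mathbb{R}^{r \times r}$, define $U' = U R$, $V' = V R^{-\top}$, then $U' V'^\top = U V^\top$. However, the column spaces $\mathrm{col}(U)$ and $\mathrm{col}(V)$ are uniquely defined. Therefore, we start with estimating $\mathrm{col}(U)$ and $\mathrm{col}(V)$.
We use a baseline method to demonstrate the challenge posed by the presence of environmental influences in estimating the column spaces. Provided with the true number of latent mediators $r$, arguably the most straightforward column space estimator, which we call “simple SVD” in the following, consists of two steps:
(1) Compute the singular value decomposition (SVD) of $\hat{\beta}$ and take the top $r$ left and right singular vectors $\hat{U}$, $\hat{V}$.
(2) Let $\widehat{\mathrm{col}}(U) = \mathrm{col}(\hat{U})$, $\widehat{\mathrm{col}}(V) = \mathrm{col}(\hat{V})$.
However, by using Equation 4 and taking expectation over the environmental mediators and the errors $\varepsilon_M$ and $\varepsilon_Y$, the estimated marginal associations satisfy

$$\mathbb{E}\big[\hat{\beta}^\top \hat{\beta}\big] \approx V U^\top U V^\top + \frac{p}{n}\, \Sigma_{\varepsilon}, \qquad \Sigma_{\varepsilon} = W \Sigma_E W^\top + V \Sigma_{\varepsilon_M} V^\top + \Sigma_{\varepsilon_Y}, \tag{5}$$

where $\Sigma_{\varepsilon}$ is the covariance matrix of the non-genetic component $W E + V \varepsilon_M + \varepsilon_Y$ of the traits, and $\Sigma_E$, $\Sigma_{\varepsilon_M}$, $\Sigma_{\varepsilon_Y}$ are the covariance matrices of $E$, $\varepsilon_M$, $\varepsilon_Y$. The decomposition suggests that the column space of $\hat{\beta}^\top \hat{\beta}$, concentrated around its expectation $\mathbb{E}[\hat{\beta}^\top \hat{\beta}]$, is contaminated by the environmental variation $W \Sigma_E W^\top$ and the response error covariance matrix $\Sigma_{\varepsilon_Y}$. The contamination is serious when the ratio $p/n$ is not negligible. As a consequence, the “simple SVD” will mistake environmental influences for genetic components.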
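The origin of the factor $p/n$ can be traced with a short calculation. The sketch below rests on idealizations that are assumptions made here rather than statements from the text: $\mathbf{X}^\top \mathbf{X} / n$ equals the identity exactly, and the rows of the non-genetic component $\boldsymbol{\varepsilon}$ are independent across individuals, zero-mean with covariance $\Sigma_{\varepsilon}$, and independent of the SNP matrix $\mathbf{X}$.

```latex
\begin{align*}
\hat{\beta}^\top \hat{\beta}
  &= V U^\top U V^\top
   + \frac{1}{n}\left( V U^\top \mathbf{X}^\top \boldsymbol{\varepsilon}
   + \boldsymbol{\varepsilon}^\top \mathbf{X} U V^\top \right)
   + \frac{1}{n^2}\, \boldsymbol{\varepsilon}^\top \mathbf{X} \mathbf{X}^\top \boldsymbol{\varepsilon}, \\
\mathbb{E}\left[ \boldsymbol{\varepsilon}^\top \mathbf{X} \mathbf{X}^\top \boldsymbol{\varepsilon} \,\middle|\, \mathbf{X} \right]
  &= \operatorname{tr}\left( \mathbf{X} \mathbf{X}^\top \right) \Sigma_{\varepsilon}
   = \operatorname{tr}\left( \mathbf{X}^\top \mathbf{X} \right) \Sigma_{\varepsilon}
   = n p \, \Sigma_{\varepsilon}, \\
\mathbb{E}\left[ \hat{\beta}^\top \hat{\beta} \,\middle|\, \mathbf{X} \right]
  &= V U^\top U V^\top + \frac{p}{n}\, \Sigma_{\varepsilon}.
\end{align*}
```

The cross terms vanish in expectation because $\boldsymbol{\varepsilon}$ has mean zero, and the trace identity uses the independence of the rows of $\boldsymbol{\varepsilon}$. The non-genetic term therefore grows with the number of SNPs relative to the sample size.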
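To separate the genetic and environmental components in Equation 5, we propose a method using noise SNPs. The idea is to use the estimated marginal associations $\hat{\beta}_0 \in \mathbb{R}^{p_0 \times q}$ of $p_0$ noise SNPs, which are not (or only weakly) associated with the traits, to learn the non-genetic structure and remove it from the estimated marginal associations $\hat{\beta}$ of the $p$ signal SNPs. Explicitly, the marginal associations of the noise SNPs satisfy

$$\mathbb{E}\big[\hat{\beta}_0^\top \hat{\beta}_0\big] \approx \frac{p_0}{n}\, \Sigma_{\varepsilon}, \tag{6}$$

where $\Sigma_{\varepsilon}$ denotes the covariance matrix of the non-genetic component $W E + V \varepsilon_M + \varepsilon_Y$ of the traits. Compared to Equation 5, the expectation of $\hat{\beta}_0^\top \hat{\beta}_0$ of the noise SNPs does not contain any genetic component and is a scalar multiple of the environmental effects. As a direct corollary of Equations 5 and 6,

$$\mathbb{E}\big[\hat{\beta}^\top \hat{\beta}\big] - \frac{p}{p_0}\, \mathbb{E}\big[\hat{\beta}_0^\top \hat{\beta}_0\big] \approx V U^\top U V^\top. \tag{7}$$

Equation 7 demonstrates that the environmental influences can be removed by subtracting a scalar multiple of $\hat{\beta}_0^\top \hat{\beta}_0$ from $\hat{\beta}^\top \hat{\beta}$. Motivated by this cancellation of the non-genetic influences, we introduce the differencing estimator of $\mathrm{col}(V)$:
(1) Compute the truncated eigen-decomposition of $\hat{\beta}^\top \hat{\beta} - (p/p_0)\, \hat{\beta}_0^\top \hat{\beta}_0$ with $r$ components. Denote the matrix of $r$ eigenvectors by $\hat{V}$. Let $\widehat{\mathrm{col}}(V) = \mathrm{col}(\hat{V})$.
(2) As for $\mathrm{col}(U)$, we suggest

$$\widehat{\mathrm{col}}(U) = \mathrm{col}\big(\hat{\beta} \hat{V}\big). \tag{8}$$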
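The two column space estimators can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the scalar multiple is taken to be the ratio of the numbers of signal and noise SNPs, the leading eigenvectors are selected by largest eigenvalue, and a QR factorization is used only to return an orthonormal basis.

```python
import numpy as np

def simple_svd(beta_hat, r):
    """Baseline: top-r left and right singular vectors of beta_hat (p x q)."""
    left, _, right_t = np.linalg.svd(beta_hat, full_matrices=False)
    return left[:, :r], right_t[:r].T

def differencing_estimator(beta_hat, beta0_hat, r):
    """Subtract the rescaled noise-SNP second moment, then eigen-decompose."""
    p, p0 = beta_hat.shape[0], beta0_hat.shape[0]
    diff = beta_hat.T @ beta_hat - (p / p0) * (beta0_hat.T @ beta0_hat)
    eigvals, eigvecs = np.linalg.eigh(diff)          # eigenvalues ascending
    top = np.argsort(eigvals)[::-1][:r]              # indices of r largest
    V_hat = eigvecs[:, top]
    U_hat, _ = np.linalg.qr(beta_hat @ V_hat)        # orthonormal basis
    return U_hat, V_hat

# Noiseless sanity check with hypothetical toy sizes.
rng = np.random.default_rng(0)
p, p0, q, r = 40, 60, 8, 2
U = rng.normal(size=(p, r))
V = rng.normal(size=(q, r))
beta_hat = U @ V.T               # signal associations without noise
beta0_hat = np.zeros((p0, q))    # noise-SNP associations without noise
U_hat, V_hat = differencing_estimator(beta_hat, beta0_hat, r)
```

In this noiseless toy case the estimated column spaces coincide with the true ones. With noise present, the subtraction step is what distinguishes the differencing estimator from `simple_svd`.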
2.2.2. Gene–trait clustering
In this section, we find sparse matrices in the estimated column spaces in Section 2.2.1 and construct gene–trait clusters from the non-zero loadings.
Let $\tilde{U}$, $\tilde{V}$ be two candidate matrices whose columns span $\widehat{\mathrm{col}}(U)$ and $\widehat{\mathrm{col}}(V)$, respectively. We aim to find a transformation matrix $R$ such that the columns of $\tilde{U} R$, $\tilde{V} R^{-\top}$ are sparse. The task aligns with a number of readily available methods from factor analysis with a focus on sparsity. In particular, we adopt two commonly used approaches summarized below.
Varimax (Kaiser, 1958). We start from an orthonormal matrix $\tilde{V}$ and solve for a rotation matrix $R$ to maximize the variances of squared loadings of the columns of $\tilde{V} R$. The resulting $\hat{V} = \tilde{V} R$ tends to have many small loadings and we set those values to zero. Finally, we let $\hat{U} = \hat{\beta} \hat{V}$ (compare Equation 8) and again set small values in $\hat{U}$ to zero.
Promax (Hendrickson and White, 1964). Promax first applies the above Varimax to $\tilde{V}$, and then rotates the orthogonal columns of Varimax to a least squares fit. The approach relaxes the orthonormal restriction on $R$ in Varimax and thus the loadings in $\hat{V}$ are pushed further apart. We form $\hat{U}$ analogously and truncate small values in $\hat{U}$, $\hat{V}$ to zero.
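The Varimax step can be illustrated with the widely used SVD-based iteration. The sketch below is a generic implementation, not code from PathGPS; the function names, the stopping rule, and the threshold `tau` are assumptions made for illustration.

```python
import numpy as np

def varimax(A, max_iter=200, tol=1e-10):
    """Rotate the columns of A (m x k) towards the varimax criterion.

    Returns the rotated loadings A @ R and the orthogonal matrix R.
    """
    m, k = A.shape
    R = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient-like target of the varimax criterion at the current rotation.
        G = A.T @ (L**3 - L * (np.sum(L**2, axis=0) / m))
        left, sing, right_t = np.linalg.svd(G)
        R = left @ right_t                 # orthogonal polar factor of G
        new_objective = sing.sum()
        if objective > 0 and new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return A @ R, R

def hard_threshold(L, tau):
    """Set loadings with absolute value below tau to zero."""
    return np.where(np.abs(L) < tau, 0.0, L)

rng = np.random.default_rng(0)
V_tilde, _ = np.linalg.qr(rng.normal(size=(8, 2)))   # orthonormal start
V_rot, R = varimax(V_tilde)
V_sparse = hard_threshold(V_rot, tau=0.1)
```

Because `R` is orthogonal, the rotation leaves the column space of `V_tilde` unchanged; only the choice of basis within that space, and hence the sparsity pattern after thresholding, is affected.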
Finally, provided with sparse column estimators $\hat{U}$, $\hat{V}$, we loop over the $r$ column pairs $(\hat{U}_k, \hat{V}_k)$ and assign the traits and genes with non-zero loadings to a cluster. Details are summarized in Algorithm 1. In the upcoming section, we will build upon Algorithm 1 and enhance its stability by bootstrap aggregation.
2.3. Bootstrap aggregation of PathGPS
Several issues may undermine the reliability of Algorithm 1. First, in modern biobanks, the set of measured traits is ever-growing. Many traits are repeatedly measured in slightly different ways. It is desirable to obtain stable results if the traits are slightly perturbed. Second, in the data preprocessing procedures, we use an external SNP dataset to select signal and noise index SNPs. We expect to arrive at similar gene–trait clusters if we perturb the index SNP sets, especially the signal SNPs, by a small amount. Third, Algorithm 1 relies on a set of hyperparameters, such as the number of latent mediators $r$. The cluster list $\mathcal{C}$ should be robust to the selection of hyperparameters.
We propose a bootstrap aggregation (bagging) approach to stabilize the pipeline and make the results more replicable by perturbing the entire procedure many times and then aggregating the results. In the following, we discuss the two components of the bagging procedure: SNP bootstrapping (Section 2.3.1) and the aggregation method via a co-appearance graph (Section 2.3.2). The full bagging procedure is summarized in Algorithm 2.
2.3.1. SNP bootstrapping
Motivated by Breiman (2001), we bootstrap the SNPs used by Algorithm 1. In each trial, we resample the same number of signal SNPs with replacement and obtain bootstrapped signal estimated marginal associations $\hat{\beta}^{(b)}$. We then apply Algorithm 1 to $\hat{\beta}^{(b)}$ and $\hat{\beta}_0$ and arrive at a cluster list $\mathcal{C}^{(b)}$. We repeat the resampling $B$ times and obtain a collection of cluster lists $\mathcal{C}^{(1)}, \ldots, \mathcal{C}^{(B)}$. The left panel of Figure 2 describes an example of the bootstrap process.
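The resampling step amounts to drawing rows of the signal association matrix with replacement. A minimal sketch, assuming the matrix has one row per signal SNP; the function name `bootstrap_signal_snps` and the toy input are hypothetical.

```python
import numpy as np

def bootstrap_signal_snps(beta_hat, n_boot, rng):
    """Yield (row indices, resampled matrix) for each bootstrap trial."""
    p = beta_hat.shape[0]
    for _ in range(n_boot):
        idx = rng.integers(0, p, size=p)   # p SNP rows, with replacement
        yield idx, beta_hat[idx]

rng = np.random.default_rng(0)
beta_hat = np.arange(12.0).reshape(6, 2)   # toy 6 SNPs x 2 traits
draws = list(bootstrap_signal_snps(beta_hat, n_boot=3, rng=rng))
```

Each resampled matrix would then be passed, together with the unchanged noise-SNP associations, to the clustering pipeline to produce one cluster list per trial.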
FIGURE 2.
SNP bootstrapping and co-appearance graph. In the left panel, we obtain multiple clusterings based on different sets of bootstrapped SNPs. In the right panel, we display the co-appearance graph obtained based on the four bootstrap samples in the left panel. The weight computation can be found in the supplementary materials.
2.3.2. Co-appearance graph aggregation
Based on the multiple gene–trait cluster lists $\mathcal{C}^{(1)}, \ldots, \mathcal{C}^{(B)}$ generated by the SNP bootstrapping, we propose to aggregate the cluster lists using a co-appearance graph. Consider a graph whose nodes denote SNPs and traits. For two nodes $i$ and $j$, we define the weight $w_{ij}$ for the edge connecting $i$ and $j$ (called the co-appearance frequency in the following)
![]() |
(9) |
where $C$ denotes the gene–trait clusters in the list $\mathcal{C}^{(b)}$ obtained from the $b$-th bootstrap sample. The right panel of Figure 2 displays an example of the co-appearance graph. If two nodes always show up in the same cluster, the pair will have a high co-appearance frequency (Equation 9), and we are more confident about the connection of the pair.
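One simple way to compute co-appearance weights is to count, for every pair of nodes, the fraction of bootstrap cluster lists in which the pair shares at least one cluster. The sketch below adopts that normalization as an assumption; the exact weighting used by PathGPS is described in its supplementary materials and may differ. The function name and node labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_appearance_weights(cluster_lists):
    """Map each unordered node pair to its co-appearance frequency."""
    n_lists = len(cluster_lists)
    counts = Counter()
    for clusters in cluster_lists:
        pairs = set()
        for cluster in clusters:
            pairs.update(combinations(sorted(set(cluster)), 2))
        counts.update(pairs)   # each pair counted at most once per list
    return {pair: c / n_lists for pair, c in counts.items()}

cluster_lists = [
    [{"snp1", "snp2", "traitA"}, {"snp3", "traitB"}],
    [{"snp1", "traitA"}, {"snp2", "snp3", "traitB"}],
]
weights = co_appearance_weights(cluster_lists)
```

In this toy input, `snp1` and `traitA` share a cluster in both lists and receive weight 1.0, while `snp1` and `snp2` co-appear in only one of the two lists and receive weight 0.5. Pairs that never co-appear are simply absent from the dictionary.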
The co-appearance graph is convenient for downstream clustering and visualization. One option is using t-SNE (Hinton and Roweis, 2002) or UMAP (McInnes et al., 2018) to find low-dimensional embeddings. The embeddings can be further used to visualize genes and traits that are closely connected in the co-appearance graph. The representations can also be fed to various clustering methods based on feature vectors like k-means. Alternatively, we can directly use graph clustering methods, such as spectral clustering and label propagation, to obtain gene–trait clusters.
3. RESULTS
In Section 3.1, we generate simulated datasets following SEM Equations 1 and 2. We demonstrate the differencing estimator’s performance in estimating the column space in Section 3.1.1. We showcase that bagging enhances the stability of PathGPS in Section 3.1.2. In Section 3.2, we report the findings from applying PathGPS to the metabolomics data and the UK Biobank data.
3.1. Simulated datasets
3.1.1. Column space estimation
We construct simulation datasets following SEM Equations 1 and 2. We consider $n$ individuals, $p$ signal SNPs, $p_0$ noise SNPs, $q$ traits, $r$ latent genetic mediators, and $s$ latent environmental mediators. The number of latent genetic mediators is significantly smaller than the number of index SNPs and the number of traits. The signal/noise SNPs, latent environmental mediators, and the random errors $\varepsilon_M$, $\varepsilon_Y$ are independent standard Gaussian random variables. As for coefficient matrices, we first generate the elements of the coefficient matrices $U$, $V$ uniformly from a fixed interval, and then randomly set a fixed proportion of the entries to zero to create sparse matrices $U$, $V$. We also generate the elements of the environmental mediators’ coefficient matrix $W$ uniformly from a fixed interval. We adjust the magnitudes of the environmental influence, the errors in the mediators $\varepsilon_M$, and the errors in the traits $\varepsilon_Y$ so that the proportion of the environmental mediators’ variance (environmental factor strength) varies over a grid of values, while the total variance of the non-genetic component stays the same. We compare two column space estimators: the differencing estimator in Algorithm 1 and the “simple SVD” estimator. We measure the performance of column space estimators by the column space distance: let $A_1$ and $A_2$ be two arbitrary matrices of the same dimension, and let $\mathrm{col}(A_1)$, $\mathrm{col}(A_2)$ be the corresponding column spaces; we define the column space distance as
![]() |
(10) |
where $P_{A_1}$, $P_{A_2}$ are the projection operators onto the column spaces of $A_1$, $A_2$.
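A projection-based distance between column spaces is straightforward to compute. The sketch below uses the Frobenius norm of the difference between the two projection matrices, which is one common choice and an assumption here, since the norm is not recoverable from the text. The function names are hypothetical, and the QR step assumes full column rank.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto col(A); assumes full column rank."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

def column_space_distance(A1, A2):
    """Frobenius norm of the difference of the two projectors."""
    return np.linalg.norm(projector(A1) - projector(A2), ord="fro")

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
R = rng.normal(size=(3, 3)) + 3 * np.eye(3)        # a generic invertible matrix
same_space = column_space_distance(A, A @ R)        # same column space
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
orthogonal_lines = column_space_distance(e1, e2)    # two orthogonal lines
```

The distance depends only on the column spaces, so it is unaffected by the non-identifiability of the coefficient matrices: `A` and `A @ R` are at distance zero up to rounding, while the two orthogonal coordinate axes are at distance $\sqrt{2}$.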
In Figure 3, we report the column space distances (Equation 10) of the two methods under different levels of environmental influence. The performance of the “simple SVD” approach deteriorates as the environmental influence increases, because it mistakenly counts the leading environmental factor as a genetic influence. In contrast, the proposed column space estimator is robust to the environmental factors. This is because, regardless of the magnitude of the environmental influence, the environmental factors are nearly entirely captured by the noise-SNP estimate $(p/p_0)\, \hat{\beta}_0^\top \hat{\beta}_0$ and subtracted from $\hat{\beta}^\top \hat{\beta}$. When the environmental factor strength (the proportion of the environmental mediators’ variance in Equation 4) is large enough (Figure 3), the one-standard-deviation intervals of the column space distance based on the proposed estimator fall strictly below those produced by the “simple SVD” approach. In the supplementary materials, we also compare the clustering results based on the proposed estimator against the “simple SVD” and include additional simulations for a more comprehensive analysis. In particular, we demonstrate that our proposed method is robust to the many weak effects commonly associated with polygenic traits.
FIGURE 3.
Column space estimation. We compare the “simple SVD” approach and the differencing estimator in PathGPS. We evaluate the estimation performance by the column space distance in Equation 10. We vary the proportion of the environmental mediators’ variance (environmental factor strength) in Equation 4 and display the average column space distances plus and minus one standard deviation for two simulation configurations (left and right panels). All results are aggregated over 100 trials.
3.1.2. Gene–trait clustering
As in Section 3.1.1, we follow SEM Equations 1 and 2, and consider $n$ individuals, $p$ signal SNPs, $p_0$ noise SNPs, $q$ traits, $r$ latent genetic mediators, and $s$ latent environmental mediators. In the default setting (default), we design sparse coefficient matrices $U$, $V$ to have a total of 16 genes and traits in each cluster. We also require each gene and trait to belong to at most one cluster. In addition to the default setting, we consider two variations: a sparse setting (sparse) with only 12 genes and traits per cluster, and an overlap setting (overlap) where a gene/trait may belong to multiple clusters (multiple membership). Details of the three simulation settings are summarized in Table 1.
TABLE 1.
Summary of simulation settings of gene–trait clusters discovery.
| Setting | Number of genes and traits per cluster | Multiple membership |
|---|---|---|
| default | 16 | No |
| sparse | 12 | No |
| overlap | 16 | Yes |
We compare three versions of PathGPS: (a) the one-shot pipeline (baseline) following Algorithm 1 (no further clustering is required; the clustering method is denoted by “NA”); (b) the clustering without bootstrapping (one-shot) following Algorithm 1; (c) the co-appearance clustering with bootstrapping (bootstrap) following Algorithm 1 with 200 bootstrap resamples. For the one-shot and bootstrap approaches, which cluster a co-appearance graph, we consider two families of clustering methods: (i) first learn low-dimensional embeddings via t-SNE or UMAP, and then cluster the embeddings by k-means; (ii) directly apply graph clustering methods, namely spectral clustering and hierarchical clustering.
We evaluate the performance of the clustering by the minimal clustering error across label permutations. In particular, let $\mathcal{C}$ be the true list of $K$ clusters defined on a set $S$, and let $\hat{\mathcal{C}}$ be an estimate with the same number of clusters; we then define the clustering error
![]() |
(11) |
where $\pi$ denotes a permutation over cluster labels. To test the robustness to the misspecification of hyperparameters, we input a sequence of hyperparameters around the true values. For the number of latent genetic pathways $r$, we provide a range of values $\hat{r}$ around $r$, which includes the correct specification $\hat{r} = r$. For the number of genes and traits in each cluster, we input a range of hyperparameter values, which also includes the correct specification.
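A permutation-minimized clustering error can be computed by brute force when the number of clusters is small. The sketch below uses the misclassification rate for non-overlapping clusters, which is an assumption made here for illustration; the paper's Equation 11 may be defined differently, in particular for the overlap setting. The function name is hypothetical.

```python
from itertools import permutations

def clustering_error(true_clusters, est_clusters):
    """Smallest fraction of mismatched items over all label permutations."""
    k = len(true_clusters)
    if len(est_clusters) != k:
        raise ValueError("both lists must contain the same number of clusters")
    n_items = len(set().union(*true_clusters))
    best = 1.0
    for perm in permutations(range(k)):
        agree = sum(
            len(set(true_clusters[i]) & set(est_clusters[perm[i]]))
            for i in range(k)
        )
        best = min(best, 1.0 - agree / n_items)
    return best

truth = [{"a", "b", "c"}, {"d", "e", "f"}]
estimate = [{"d", "e", "c"}, {"a", "b", "f"}]
err = clustering_error(truth, estimate)
```

Here the best matching pairs the first true cluster with the second estimated cluster, leaving two of the six items misassigned, so `err` equals one third. The loop visits all $K!$ permutations, which is only practical for a handful of clusters.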
In Figure 4, across different simulation settings, the baseline achieves the best accuracy at the true hyperparameters, while the bootstrap approach is in general more robust.
FIGURE 4.
Gene–trait clusters discovery. We compare three variants of PathGPS: the one-shot pipeline (baseline, circles) following Algorithm 1 (no further clustering is required; clustering method denoted by “NA”), the clustering without bootstrapping (one-shot, triangles) following Algorithm 1, and the co-appearance clustering with bootstrapping (bootstrap) following Algorithm 1 with 200 bootstrap resamples. We evaluate the estimation performance by the clustering error (Equation 11). We input a sequence of values for the number of latent genetic pathways (first row) and the number of genes and traits per cluster (second row) around the true values, and plot the average clustering errors aggregated over 100 trials.
3.2. Real datasets
3.2.1. Metabolomics data
We use the genome-metabolome-wide association study in Kettunen et al. (2016) as the main dataset. The summary statistics are derived from 24925 participants and contain 123 metabolites and the SNPs passing quality control. We remove 18 traits with low variance explained according to Supplementary Table 1 of Kettunen et al. (2016). To select approximately independent index SNPs, we apply PLINK to an external dataset (Davis et al., 2017) of 72 metabolites, including a large proportion of the traits in the main dataset. We regard 50 index SNPs with at least one significant marginal association as signal SNPs, and 250 index SNPs with no significant marginal associations as noise SNPs. We extract the marginal associations corresponding to the signal and noise SNPs from the main dataset as the input for PathGPS.
We apply PathGPS to a metabolomics dataset (Section 3.2.1) and the UK Biobank (Section 3.2.2), and discuss gene–trait clusters produced. Preprocessing procedures can be found in the supplementary materials, including the details of choosing “signal” and “noise” genes. The results, including the lists and visualizations of the gene–trait clusters, are summarized in Figure 5.
FIGURE 5.
Applications of PathGPS. Panel A displays the summary of the metabolomics data (a1), the UMAP embeddings of seven gene–trait clusters produced by PathGPS with co-appearance edge weights (a2), and representative traits and mapped genes in each cluster (a3). In (a4), we subsample traits without replacement, and PathGPS (UMAP) produces more consistent cluster memberships than the baseline method (Figure 1b5). Panel B displays the summary of the UK Biobank data (b1), the UMAP visualization (b2), and representative genes and traits of 10 clusters produced by PathGPS (b3). PathGPS (UMAP) again produces more stable clusters (b4). The representative traits and mapped genes in (a3) and (b3) are selected manually.
For the metabolomics dataset, PathGPS produces seven clusters, which roughly correspond to large high-density lipoprotein (HDL), small HDL, low-density lipoprotein (LDL), intermediate-density lipoprotein (IDL), large very-low-density lipoprotein (VLDL), small VLDL, and non-lipoprotein measurements (Figure 5a). Thus, using genetic data only, PathGPS is able to recover the known taxonomy of circulating metabolites. In the supplementary materials, we provide a comparison of the clusters produced by PathGPS and those of the “simple SVD” method. PathGPS confirms several known causal genes, such as PLTP as a regulator of HDL size (Huuskonen et al., 2001) and PCSK9 as a regulator of LDL cholesterol (Maxwell and Breslow, 2004). PathGPS also proposes several biological hypotheses that are not as well established, including RNF111 in relation to HDL (Holmen et al., 2014) and TM4SF5 in relation to lipid measurements (Choi et al., 2021). In fact, few gene–trait pairs suggested by PathGPS directly reach the genome-wide significant level after correcting for multiple testing, but the majority of the gene–trait pairs are at least moderately associated. This demonstrates the ability of PathGPS to associate a group of genes with a group of traits, when any single association is not sufficiently strong.
In each trial, we subsample around half of the traits without replacement, and apply PathGPS to the selected subset of traits. We compare the co-appearance weights (co-appearing probabilities) obtained with (Figure 5 a4 UMAP) and without (Figure 5a4 Baseline) bootstrap aggregation. Close-to-one co-appearance weights (co-appearing probabilities) indicate the associated pairs always fall into the same cluster and close-to-zero values imply the associated pairs always end up in different clusters. We observe the histograms of co-appearance weights (co-appearing probabilities) of PathGPS with bagging have sharper spikes around 0 and 1. The bowl-shaped histograms indicate the bagging procedure increases the stability of PathGPS towards trait inclusion.
3.2.2. UK Biobank data
We use the GWAS summary statistics from the UK Biobank data generated by the Neale Lab. The summary statistics are derived from participants of white-British ancestry and cover the SNPs passing quality control.
For data preprocessing, we first remove traits with missing female or male summary statistics. The female summary statistics are used for SNP and trait selection, and the male summary statistics are used for downstream gene–trait cluster exploration. We select approximately independent index SNPs based on the female summary statistics using the PLINK software (Purcell et al., 2007). To eliminate unreliable estimates of genetic associations, we focus on the 175 traits with at least one significant index SNP. The resulting traits are a combination of lab measurements, disease diagnoses, medication, and a small number of lifestyle habits. Among the index SNPs, we regard the SNPs with at least one significant marginal association test as signal SNPs (1200 in total). We regard the SNPs with no significant marginal association tests as noise SNPs (250 in total). Finally, the estimated marginal associations between the selected traits and signal (noise) SNPs from the male population are used as the input for PathGPS.
PathGPS produces 10 clusters (Figure 5b3), among which 3 are closely related to diseases (venous thromboembolism, cardiovascular diseases, and type 2 diabetes), and the other 7 contain biometric measurements, such as bone mineral density, immune system, fat-free mass, and skin or hair colours. In the UMAP visualization (Figure 5b2), the edges reflect high (top 350) co-appearances between vertex pairs and may offer insights into disease mechanisms. For instance, our analysis finds that the medication simvastatin is closely related to high cholesterol and cardiovascular diseases (CVD), which is not surprising given that it is widely prescribed to reduce CVD risk (Bibbins-Domingo et al., 2016). We also find that atorvastatin, another drug in the statin family, is highly related to bone mineral density (BMD) and associated traits. This finding is consistent with existing evidence that statins increase BMD (Li et al., 2020; Lupattelli et al., 2004). In addition, edges connecting monocytes, neutrophils, and lymphocytes to diabetes and asthma diagnoses have high weights, suggesting connections between the immune system and the two common diseases. In particular, diabetes may be related to the immune system through multiple mechanisms; for example, hyperglycemia in diabetes may cause dysfunction of the immune response (Berbudi et al., 2020). As for asthma, T lymphocytes are critical to its development (Larché et al., 2003). The co-occurrence of diabetes and asthma may be attributed to shared immunological pathways (Torres et al., 2021). In the supplementary materials, we provide a comparison of the clusters produced by PathGPS and those of the “simple SVD” method.
Regarding the genetic architecture, our analysis confirms many existing discoveries, such as the associations between HERC2 and hair colour (Branicki et al., 2011), PELO and red blood cells (Mills et al., 2016), and NME7 and venous thromboembolism (Heit et al., 2012). We also find less well-established biological hypotheses, such as the associations between BCL2 and atrial fibrillation (Li et al., 2018) and between GFI1 and lymphocyte cells (Van der Meer et al., 2010). The UMAP embedding provides further information beyond the cluster membership. For example, the cluster containing smoking, alcohol, and diabetes is adjacent to the cluster containing CVD, indicating a multifaceted health effect of alcohol consumption and tobacco usage.
4. DISCUSSION
In this article, we propose PathGPS—a promising statistical tool to discover genes and traits sharing latent biological pathways. When applied to the UK Biobank and metabolomics data, PathGPS not only confirms many established genetic associations but also generates novel biological hypotheses. By grouping diseases with shared biological pathways, PathGPS can enhance the understanding of comorbidities and contribute to the development of comprehensive clinical practices.
We highlight that PathGPS only requires summary statistics and thus can be readily applied to a number of biobank datasets (Chen et al., 2011; Christensen et al., 2012; Avlund et al., 2014). It is possible that, for certain traits, the underlying genetic pathways differ across sub-populations around the globe. The heterogeneity of genetic pathways can potentially lead to individualized treatments. Therefore, it is of value to compare the output gene–trait clusters of PathGPS applied to various biobank datasets.
Our proposal of using PathGPS with bootstrap aggregation addresses the call for reproducible research. When scientific findings rely on statistical analysis, the statistical results should be stable under “reasonable” data perturbations (Yu, 2013). In particular, biobanks and other databases are regularly augmented by additional measurements and samples, and it would be desirable to obtain consistent conclusions when more data become available.
There are several avenues for future work. First, research shows that the interactions between genes and environment shape human development, and childhood experiences can alter gene expression. It may therefore be useful to extend the current model to include the interaction of genetic and environmental factors. Second, PathGPS outputs groups of traits associated with the same pathway, and it would be of great interest to further investigate the causal mechanism. Third, given that it is difficult to develop rigorous uncertainty quantification and inference for clustering and unsupervised learning tasks, it would be useful to consider how experiments can be designed to validate or disprove the potential pathways generated by PathGPS. Finally, PathGPS deliberately selects index SNPs to be distant from each other to ensure their independence, which results in a limited number of such SNPs. To expand the scope and involve a larger number of SNPs, our method would need to be extended to handle dependent SNPs. This would require us to model the covariance matrix of the index SNPs in order to establish connections between marginal associations and the coefficients derived from a full regression.
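The link between marginal associations and full-regression coefficients mentioned in the last point can be illustrated with a standard linear-model identity. If a trait is a linear function of the SNPs with coefficient vector beta, plus noise uncorrelated with the SNPs, then the population marginal slope for SNP j is (Sigma beta)_j / Sigma_jj, where Sigma is the SNP covariance matrix. The sketch below is a toy illustration of that identity for two correlated SNPs; the function names and numbers are hypothetical and it is not a proposed extension of PathGPS.

```python
from typing import List

Matrix = List[List[float]]


def marginal_from_joint(cov: Matrix, beta: List[float]) -> List[float]:
    """Population marginal slopes: gamma_j = (cov @ beta)_j / cov[j][j]."""
    p = len(beta)
    return [
        sum(cov[j][k] * beta[k] for k in range(p)) / cov[j][j]
        for j in range(p)
    ]


def joint_from_marginal_2x2(cov: Matrix, gamma: List[float]) -> List[float]:
    """Recover joint coefficients for two SNPs by solving cov @ beta = b,
    where b_j = cov[j][j] * gamma_j, using Cramer's rule."""
    b0 = cov[0][0] * gamma[0]
    b1 = cov[1][1] * gamma[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return [
        (b0 * cov[1][1] - cov[0][1] * b1) / det,
        (cov[0][0] * b1 - cov[1][0] * b0) / det,
    ]


# Two standardized SNPs with correlation 0.5; only the first affects the trait.
toy_cov = [[1.0, 0.5], [0.5, 1.0]]
toy_beta = [1.0, 0.0]

toy_gamma = marginal_from_joint(toy_cov, toy_beta)
toy_recovered = joint_from_marginal_2x2(toy_cov, toy_gamma)
```

In this toy case the second SNP has no direct effect, yet its marginal slope is nonzero because it is correlated with the first. Inverting with the covariance matrix recovers the joint coefficients, which is why dependent SNPs would require a model of that matrix.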
Supplementary Material
Web Appendices, Tables, and Figures, referenced in Sections 2 and 3, along with codes, are available with this paper at the Biometrics website on Oxford Academic. The codes used for the analysis in this paper are also available at https://github.com/ZijunGao/PathGPS.
Acknowledgement
We would like to express our gratitude to Professors Robert Tibshirani and Jonathan Taylor for their invaluable suggestions. Additionally, we extend our appreciation to the anonymous reviewers for their constructive guidance and feedback.
Contributor Information
Zijun Gao, Marshall Business School, University of Southern California, Los Angeles CA, 90089, United States.
Qingyuan Zhao, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, CB3 0WB, United Kingdom.
Trevor Hastie, Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, United States.
FUNDING
Q.Z. is supported by the Isaac Newton Trust and EPSRC grant EP/V049968/1. T.H. is partially supported by grants Division of Mathematical Sciences (DMS) 2013736 and Division of Information and Intelligent Systems (IIS) 1837931 from the (US) National Science Foundation and grant 5R01 EB 001988-21 from the (US) National Institutes of Health.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
The data that support the findings in this paper are openly accessible. The UK Biobank data can be accessed at http://www.nealelab.is/uk-biobank, and the metabolomics data sets are provided in Kettunen et al. (2016) and Davis et al. (2017).
References
- Avlund K., Osler M., Mortensen E. L., Christensen U., Bruunsgaard H., Holm-Pedersen P. et al. (2014). Copenhagen aging and midlife biobank (camb): an introduction. Journal of Aging and Health, 26, 5–20.
- Berbudi A., Rahmadika N., Tjahjadi A. I., Ruslami R. (2020). Type 2 diabetes and its impact on the immune system. Current Diabetes Reviews, 16, 442–449.
- Bibbins-Domingo K., Grossman D. C., Curry S. J., Davidson K. W., Epling J. W., García F. A. et al. (2016). Statin use for the primary prevention of cardiovascular disease in adults: us preventive services task force recommendation statement. JAMA, 316, 1997–2007.
- Boyle E. A., Li Y. I., Pritchard J. K. (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell, 169, 1177–1186.
- Branicki W., Liu F., van Duijn K., Draus-Barini J., Pośpiech E., Walsh S. et al. (2011). Model-based prediction of human hair color using DNA variants. Human Genetics, 129, 443–454.
- Breiman L. (2001). Random forests. Machine Learning, 45, 5–32.
- Bulik-Sullivan B., Finucane H. K., Anttila V., Gusev A., Day F. R., Loh P.-R. et al. (2015). An atlas of genetic correlations across human diseases and traits. Nature Genetics, 47, 1236–1241.
- Cano-Gamez E., Trynka G. (2020). From gwas to function: using functional genomics to identify the mechanisms underlying complex diseases. Frontiers in Genetics, 11, 424.
- Chen Z., Chen J., Collins R., Guo Y., Peto R., Wu F. et al. (2011). China kadoorie biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. International Journal of Epidemiology, 40, 1652–1666.
- Choi C., Son Y., Kim J., Cho Y. K., Saha A., Kim M. et al. (2021). Tm4sf5 knockout protects mice from diet-induced obesity partly by regulating autophagy in adipose tissue. Diabetes, 70, 2000–2013.
- Christensen H., Nielsen J. S., Sørensen K. M., Melbye M., Brandslund I. (2012). New national biobank of the Danish center for strategic research on type 2 diabetes (dd2). Clinical Epidemiology, 4, 37–42.
- Davis J. P., Huyghe J. R., Locke A. E., Jackson A. U., Sim X., Stringham H. M. et al. (2017). Common, low-frequency, and rare genetic variants associated with lipoprotein subclasses and triglyceride measures in finnish men from the metsim study. PLoS Genetics, 13, e1007079.
- Finucane H. K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R. et al. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics, 47, 1228.
- Heit J. A., Armasu S. M., Asmann Y. W., Cunningham J. M., Matsumoto M. E., Petterson T. M. et al. (2012). A genome-wide association study of venous thromboembolism identifies risk variants in chromosomes 1q24.2 and 9q. Journal of Thrombosis and Haemostasis, 10, 1521–1531.
- Hendrickson A. E., White P. O. (1964). Promax: a quick method for rotation to oblique simple structure. British Journal of Statistical Psychology, 17, 65–70.
- Hinton G., Roweis S. T. (2002). Stochastic neighbor embedding. NeurIPS Proceedings, 15, 833–840.
- Holmen O. L., Zhang H., Fan Y., Hovelson D. H., Schmidt E. M., Zhou W. et al. (2014). Systematic evaluation of coding variation identifies a candidate causal variant in tm6sf2 influencing total cholesterol and myocardial infarction risk. Nature Genetics, 46, 345–351.
- Huuskonen J., Olkkonen V. M., Jauhiainen M., Ehnholm C. (2001). The impact of phospholipid transfer protein (pltp) on hdl metabolism. Atherosclerosis, 155, 269–281.
- Jennrich R. I. (2001). A simple general procedure for orthogonal rotation. Psychometrika, 66, 289–306.
- Kaiser H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187–200.
- Kettunen J., Demirkan A., Würtz P., Draisma H. H., Haller T., Rawal R. et al. (2016). Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of lpa. Nature Communications, 7, 1–9.
- Lappalainen T., MacArthur D. G. (2021). From variant to function in human disease genetics. Science, 373, 1464–1468.
- Larché M., Robinson D. S., Kay A. B. (2003). The role of t lymphocytes in the pathogenesis of asthma. Journal of Allergy and Clinical Immunology, 111, 450–463.
- Li G. H.-Y., Cheung C.-L., Au P. C.-M., Tan K. C.-B., Wong I. C.-K., Sham P.-C. (2020). Positive effects of low ldl-c and statins on bone mineral density: an integrated epidemiological observation analysis and mendelian randomization study. International Journal of Epidemiology, 49, 1221–1235.
- Li Y., Song B., Xu C. (2018). Effects of guanfu total base on bcl-2 and bax expression and correlation with atrial fibrillation. Hellenic Journal of Cardiology, 59, 274–278.
- Lupattelli G., Scarponi A. M., Vaudo G., Siepi D., Roscini A. R., Gemelli F. et al. (2004). Simvastatin increases bone mineral density in hypercholesterolemic postmenopausal women. Metabolism, 53, 744–748.
- McInnes L., Healy J., Melville J. (2018). Umap: Uniform Manifold Approximation and Projection for Dimension Reduction. Journal of Open Source Software, 29, 861. doi:10.21105/joss.00861
- Manolio T. A., Collins F. S., Cox N. J., Goldstein D. B., Hindorff L. A., Hunter D. J. et al. (2009). Finding the missing heritability of complex diseases. Nature, 461, 747–753.
- Maxwell K. N., Breslow J. L. (2004). Adenoviral-mediated expression of pcsk9 in mice results in a low-density lipoprotein receptor knockout phenotype. Proceedings of the National Academy of Sciences, 101, 7100–7105.
- Mills E. W., Wangen J., Green R., Ingolia N. T. (2016). Dynamic regulation of a ribosome rescue pathway in erythroid cells and platelets. Cell Reports, 17, 1–10.
- Ning Z., Pawitan Y., Shen X. (2020). High-definition likelihood inference of genetic correlations across human complex traits. Nature Genetics, 52, 859–864.
- Pickrell J. K., Berisa T., Liu J. Z., Ségurel L., Tung J. Y., Hinds D. A. (2016). Detection and interpretation of shared genetic influences on 42 human traits. Nature Genetics, 48, 709–717.
- Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A., Bender D. et al. (2007). Plink: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559–575.
- Shi H., Kichaev G., Pasaniuc B. (2016). Contrasting the genetic architecture of 30 complex traits from summary association data. The American Journal of Human Genetics, 99, 139–153.
- Solovieff N., Cotsapas C., Lee P. H., Purcell S. M., Smoller J. W. (2013). Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics, 14, 483–495.
- Tanigawa Y., Li J., Justesen J. M., Horn H., Aguirre M., DeBoever C. et al. (2019). Components of genetic associations across 2,138 phenotypes in the UK biobank highlight adipocyte biology. Nature Communications, 10, 1–14.
- Torres R. M., Souza M. D. S., Coelho A. C. C., de Mello L. M., Souza-Machado C. (2021). Association between asthma and type 2 diabetes mellitus: Mechanisms and impact on asthma control—a literature review. Canadian Respiratory Journal, 2021, 8830439.
- Van der Meer L., Jansen J., Van Der Reijden B. (2010). Gfi1 and gfi1b: key regulators of hematopoiesis. Leukemia, 24, 1834–1843.
- Visscher P. M., Wray N. R., Zhang Q., Sklar P., McCarthy M. I., Brown M. A. et al. (2017). 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101, 5–22.
- Witten D. M., Tibshirani R., Hastie T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10, 515–534.
- Yu B. (2013). Stability. Bernoulli, 19, 1484–1500.
- Zou H., Hastie T., Tibshirani R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265–286.