Biometrics. 2024 Jul 15;80(3):ujae060. doi: 10.1093/biomtc/ujae060

PathGPS: discover shared genetic architecture using GWAS summary data

Zijun Gao, Qingyuan Zhao, Trevor Hastie
PMCID: PMC11247175  PMID: 39005072

ABSTRACT

The increasing availability and scale of biobanks and “omic” datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using Genome Wide Association Studies (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of “signal” genes with those of “noise” genes. From the estimated genetic component, PathGPS then extracts genetic pathways via principal component and factor analysis, leveraging the low-rank and sparse properties. In addition, we provide a bootstrap aggregating (“bagging”) algorithm to improve stability under data perturbation and hyperparameter tuning. When applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene–trait clusters and suggests multiple new hypotheses for future investigations.

Keywords: GWAS, pathway analysis, structural equation model, summary data

1. INTRODUCTION

Understanding the biological mechanisms by which genetic variation influences phenotypes is one of the primary challenges in human genetics (Lappalainen and MacArthur, 2021). Genome-wide association studies (GWAS) have successfully mapped thousands of genetic loci associated with complex human traits. However, it is extremely time-consuming and inefficient to investigate every identified association and validate the function (Visscher et al., 2017). Moreover, complex traits are usually highly polygenic and are associated with a large number of variants across the genome, each explaining only a small fraction of the genetic variance (Manolio et al., 2009; Shi et al., 2016). These difficulties have hindered the translation of GWAS findings into drug development and clinical applications (Cano-Gamez and Trynka, 2020).

Recently, studies have revealed that many complex traits are associated with the same genomic loci (Pickrell et al., 2016) and identified many pairs of traits with strong genetic correlation (Solovieff et al., 2013; Bulik-Sullivan et al., 2015; Ning et al., 2020). This phenomenon indicates that disease-causing variants may cluster into key pathways that drive several diseases at the same time (Boyle et al., 2017). Motivated by this underlying connection among traits, in this article we aim to use large-scale biobank data containing thousands of phenotypes to aggregate information from correlated complex traits and infer their shared genetic architectures.

Our motivating dataset is the UK Biobank, a rich database of genetic and phenotypic information from hundreds of thousands of participants across the UK. Participants are genotyped to capture genome-wide genetic variation at millions of single nucleotide polymorphisms (SNPs). A wide variety of phenotypes are recorded, including biological measurements, lifestyle indicators, biomarkers in blood, and disease diagnosis. The UK Biobank data provide plenty of opportunities for identifying genetic associations with complex traits.

There are several non-trivial hurdles to recovering shared genetic architectures from GWAS data. First, any trait is a product of genetic and environmental influences. Environmental factors can either produce spurious associations or mask true genetic signals. Unfortunately, environmental factors are not directly observed in most datasets, making it difficult to isolate the genetic contribution from environmental influences. Second, biobank data are usually gathered in multiple batches and are regularly augmented with newly collected data. Therefore, the database includes observations from several cohorts and the summary statistics are derived from multiple populations. Third, the set of measured traits in large biobanks is ever-growing and many traits are repeatedly measured in slightly different ways. It is desirable to develop a statistical method that is insensitive to data perturbation and yields consistent statistical conclusions as the dataset continues to be enriched.

In this paper, we develop a new statistical method—PATHway discovery through Genome–Phenome Summary data (PathGPS)—based on a model that assumes most human traits are regulated by one or several genetic or environmental pathways (Figure 1a). PathGPS can generate clusters of genes and traits associated with the same biological pathways (Figure 1b) and addresses the aforementioned challenges. First, by subtracting the empirical covariance of traits computed using “noise” genes (genetic variants showing no or weak associations with the traits) from that using “signal” genes (genetic variants showing strong associations with some traits), PathGPS disentangles genetic mechanisms from environmental factors. PathGPS then applies principal component analysis (PCA) to the disentangled covariance matrix and provides a low-rank representation of genetic associations. Second, PathGPS can be applied to summary statistics derived from several cohorts as long as the underlying genetic pathways are shared across cohorts. Third, to stabilize PathGPS, we design a novel implementation of the bootstrap aggregation (“bagging”) applied to unsupervised learning (Figure 1c). In particular, by resampling the genes, PathGPS obtains a weighted graph, which estimates the likelihood that a pair of variables (could be genes or traits) appear in the same pathway. Dimension reduction techniques and clustering algorithms are then applied to visualize this graph and produce clusters.

FIGURE 1. An example of the structural equation model. The example contains four SNPs and four traits. There are two latent genetic mediators: the genetic mediator Inline graphic is influenced by SNPs Inline graphic, Inline graphic, and affects traits Inline graphic and Inline graphic; the genetic mediator Inline graphic is influenced by SNPs Inline graphic, Inline graphic, and affects trait Inline graphic. There is one latent environmental mediator Inline graphic, which is independent of the SNPs and affects traits Inline graphic, Inline graphic.

PathGPS only requires GWAS summary statistics, which are more easily accessed than individual-level genetic data. An additional benefit is that the computational complexity of our method does not depend on the sample size of the GWAS once the summary statistics have been produced. Our proposal is motivated by the literature investigating summary statistics for heritability and latent factor characterization (Finucane et al., 2015; Bulik-Sullivan et al., 2015; Tanigawa et al., 2019). In addition, we draw upon statistical methods with both sparsity and low-rank structures (Kaiser, 1958; Jennrich, 2001; Zou et al., 2006; Witten et al., 2009). PathGPS also builds on the idea of using bootstrap resamples to reduce the variance of statistical learning methods (Breiman, 2001).

The paper is organized as follows. In Section 2, we introduce the statistical model and the PathGPS algorithm. In Section 3, we investigate the performance of PathGPS in simulated and real datasets and discuss findings using the UK Biobank data. We conclude the paper with further discussion in Section 4.

2. METHODS

In Section 2.1, we lay out the model characterizing latent pathways. We discuss column space estimation in Section 2.2.1, a preparation step for the gene–trait clustering in Section 2.2.2. In Section 2.3, we propose to use bootstrap aggregation to boost the stability of the proposed clustering algorithm.

2.1. Structural equation model of latent pathways

We describe latent genetic and environmental pathways connecting SNPs and traits using a linear structural equation model (SEM). We start with an individual-level model of the SNPs and traits, and then derive the summary statistics from the individual-level model.

Suppose there are Inline graphic SNPs Inline graphic and Inline graphic traits Inline graphic. We assume the traits are influenced by the SNPs through Inline graphic latent genetic mediators Inline graphic. Meanwhile, we assume the traits are also affected by Inline graphic unobserved environmental mediators Inline graphic. Mathematically, we adopt the linear SEM,

[Equation 1]
[Equation 2]

where Inline graphic, Inline graphic denote zero-mean errors in the mediators Inline graphic and traits Inline graphic, respectively, and Inline graphic, Inline graphic, Inline graphic are coefficient matrices. We assume the errors Inline graphic, Inline graphic are zero-mean and independent of the SNPs as well as the environmental mediators.
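The typeset equations did not survive in this copy. One linear SEM consistent with the verbal description above can be written as follows. All symbols here ($X$, $Y$, $M$, $E$, $U$, $V$, $W$, $\epsilon_M$, $\epsilon_Y$, and the dimensions $p$, $q$, $r$, $s$) are assumed notation chosen for illustration and may differ from the authors' own.

```latex
% Assumed notation: X in R^p (SNPs), Y in R^q (traits),
% M in R^r (genetic mediators), E in R^s (environmental mediators),
% U in R^{p x r}, V in R^{q x r}, W in R^{q x s} (coefficient matrices).
\begin{align}
  M &= U^\top X + \epsilon_M, \\
  Y &= V M + W E + \epsilon_Y .
\end{align}
```

Under this reading, substituting the first line into the second gives $Y = V U^\top X + V \epsilon_M + W E + \epsilon_Y$, so the SNP-to-trait effects are carried by the product $U V^\top$ and everything else acts as SNP-independent variation.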

Our goal is to discover genetic pathways: SNPs → genetic mediator (latent) → traits. Since the genetic mediators are unobserved, we look for clusters of SNPs and traits related to the same underlying genetic pathway. Figure 1 displays an example: there are two genetic pathways: Inline graphic in red and Inline graphic in blue, and we aim to uncover the corresponding gene–trait clusters Inline graphic and Inline graphic. In terms of the SEM, let Inline graphic, Inline graphic be the Inline graphic-th column of Inline graphic, Inline graphic, and Equations 1 and 2 are equivalent to

[Equation]

The Inline graphic-th genetic pathway refers to the SNPs’ effects on the Inline graphic-th mediator Inline graphic, denoted by Inline graphic, and the effect of Inline graphic on the traits Inline graphic, denoted by Inline graphic. The Inline graphic-th gene–trait cluster comprises the SNPs and traits with non-zero loadings in Inline graphic and Inline graphic, respectively.

Our analysis does not operate with the individual level data but instead handles the more readily available summary statistics—gene–trait effect (marginal association) estimates. The estimated marginal associations of gene–trait pairs, denoted by Inline graphic, are obtained from running simple linear regressions with one trait as the response and one SNP as the predictor. For pre-processing, we use the PLINK software to select (approximately) independent index SNPs. We normalize the SNP vectors Inline graphic to zero mean and unit variance, and the matrix Inline graphic is close to an identity matrix. Under the SEM model and the above normalization, the estimated marginal associations of the index SNPs ideally take the form

[Equation 3]

By plugging Equations 1, 2 into Equation 3 and collecting zero-mean environmental mediators Inline graphic, two sources of errors Inline graphic, Inline graphic in Inline graphic, the estimated marginal association matrix Inline graphic satisfies

[Equation 4]

In the following, we explain how to use Inline graphic to uncover the gene–trait clusters—non-zero loadings in matching column pairs of Inline graphic and Inline graphic.
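To make the link between the individual-level model and the summary statistics concrete, the following is a minimal sketch, not the authors' code. It assumes Gaussian SNPs (as in the simulations of Section 3.1), coefficient entries drawn uniformly from [-1, 1], and the illustrative names `U`, `V`, `W`, `simulate_sem`, and `marginal_associations`, none of which come from the source. After standardizing each SNP, the slope of a simple linear regression of a trait on a SNP is the sample covariance, so the whole matrix of marginal associations is one matrix product.

```python
import numpy as np

def simulate_sem(n, p, q, r, s, seed=0):
    """Draw individual-level data from a linear SEM (assumed notation)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(-1.0, 1.0, size=(p, r))  # SNP -> genetic mediator effects
    V = rng.uniform(-1.0, 1.0, size=(q, r))  # genetic mediator -> trait effects
    W = rng.uniform(-1.0, 1.0, size=(q, s))  # environmental mediator -> trait effects
    X = rng.standard_normal((n, p))          # independent SNPs, rows are individuals
    E = rng.standard_normal((n, s))          # environmental mediators
    M = X @ U + rng.standard_normal((n, r))  # genetic mediators with error
    Y = M @ V.T + E @ W.T + rng.standard_normal((n, q))  # traits with error
    return X, Y, U, V

def marginal_associations(X, Y):
    """Slopes of all simple regressions of each trait on each standardized SNP."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance per SNP
    Yc = Y - Y.mean(axis=0)                    # centering plays the role of the intercept
    return Xs.T @ Yc / X.shape[0]              # shape (p, q)
```

With a large sample, `marginal_associations(X, Y)` is close to `U @ V.T` in this sketch, which is the low-rank genetic signal that the rest of the method tries to recover.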

2.2. Estimation of gene–trait clusters

Two biological phenomena facilitate the learning of gene–trait clusters from the summary statistics Inline graphic. First, the ubiquity of pleiotropy (a single mutation may affect multiple traits) is supported by increasing evidence. Correspondingly, in Equations 1 and 2, the number of strong genetic pathways is expected to be significantly smaller than the number of traits and SNPs collected, that is, Inline graphic, Inline graphic, and thus the product matrix Inline graphic is low-rank (of rank at most Inline graphic). Second, though the total number of SNPs is colossal, only a small proportion of SNPs are expected to be involved in any given genetic pathway. In statistical terminology, the underlying true coefficient matrix Inline graphic should consist of sparse columns. In addition, a genetic pathway may only influence a limited number of the traits collected. Therefore, most of the elements in columns Inline graphic are anticipated to be zero. The low-rank and sparse structures together imply that the traits and SNPs can be grouped into a few clusters (low-rank property), each containing a relatively small number of SNPs and traits (sparse property).

We now describe how PathGPS leverages the low-rank and sparse structures. We start with estimating the low-dimensional column spaces of Inline graphic and Inline graphic in Section 2.2.1. In Section 2.2.2, we discuss methods to find Inline graphic, Inline graphic with sparse columns in the estimated column spaces Inline graphic, Inline graphic, respectively. Finally, we construct a gene–trait cluster for each column pair Inline graphic, Inline graphic, corresponding to the Inline graphic-th genetic pathway. The whole procedure is summarized in Algorithm 1.

[Algorithm 1]

2.2.1. Column space estimation

Matrices Inline graphic, Inline graphic in Equation 4 are not identifiable without further assumptions. In fact, for any invertible matrix Inline graphic, define Inline graphic, Inline graphic, then Inline graphic. However, the column spaces Inline graphic and Inline graphic are uniquely defined. Therefore, we start with estimating Inline graphic and Inline graphic.

We use a baseline method to demonstrate the challenge posed by the presence of environmental influences in estimating the column spaces. Provided with the true number of latent mediators Inline graphic, arguably the most straightforward column space estimator, which we call “simple SVD” in the following, consists of two steps:

  1. Compute the singular value decomposition (SVD) of Inline graphic and take the top Inline graphic singular vectors Inline graphic, Inline graphic.

  2. Let Inline graphic, Inline graphic.
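The two steps of the “simple SVD” baseline can be sketched as below. This is an illustration under assumptions: the source's exact definition in step 2 is not recoverable here, so the sketch simply takes the top left and right singular vectors as the column space estimates, and the function name `simple_svd` is hypothetical.

```python
import numpy as np

def simple_svd(beta_hat, r):
    """Baseline column space estimates from the top-r singular vectors.

    beta_hat: (p, q) matrix of estimated marginal associations.
    Returns (U_hat, V_hat) with shapes (p, r) and (q, r).
    """
    left, _, right_t = np.linalg.svd(beta_hat, full_matrices=False)
    U_hat = left[:, :r]        # spans the estimated SNP-side column space
    V_hat = right_t[:r, :].T   # spans the estimated trait-side column space
    return U_hat, V_hat
```

When `beta_hat` is exactly a rank-`r` product, the estimated trait-side space coincides with the true one; the text that follows explains why this stops being true once non-genetic variation enters.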

However, by using Equation 4 and taking expectation over errors Inline graphic and Inline graphic, the estimated marginal associations satisfy

[Equation 5]

The decomposition suggests that the column space of Inline graphic, concentrated around its expectation Inline graphic, is contaminated by the environmental variation Inline graphic and the response error covariance matrix Inline graphic. The contamination is serious when the ratio Inline graphic is not ignorable. As a consequence, the “simple SVD” will mistake environmental influences for genetic components.

To separate the genetic and environmental components in Equation 5, we propose a method using noise SNPs. The idea is to use the estimated marginal associations Inline graphic of Inline graphic noise SNPs, which are not (or only weakly) associated with traits, to learn the non-genetic structure and remove it from the estimated marginal associations Inline graphic of signal SNPs. Explicitly, the marginal associations of the noise SNPs satisfy

[Equation 6]

Compared to Equation 5, the expectation of Inline graphic of the noise SNPs does not contain any genetic component and is a scalar multiple of the environmental effects. As a direct corollary of Equations 5 and 6,

[Equation 7]

Equation 7 demonstrates that the environmental influences can be removed by subtracting a scalar multiple of Inline graphic from Inline graphic. Motivated by this cancellation of the non-genetic influences, we introduce the differencing estimator of Inline graphic:

  1. Compute the truncated eigen-decomposition of Inline graphic with Inline graphic components. Denote the matrix of Inline graphic eigenvectors by Inline graphic. Let Inline graphic.

  2. As for Inline graphic, we suggest
    [Equation 8]
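The differencing idea can be sketched as follows. Several details are assumptions rather than the authors' specification: the scalar multiple is taken to be the ratio of the number of signal SNPs to the number of noise SNPs (`p / p0`), the components kept are those with the largest eigenvalues, and the SNP-side estimate is obtained by projecting the signal associations onto the estimated trait-side eigenvectors. The function name `differencing_estimator` is hypothetical.

```python
import numpy as np

def differencing_estimator(beta_signal, beta_noise, r):
    """Column space estimates after subtracting the noise-SNP second moment.

    beta_signal: (p, q) marginal associations of signal SNPs.
    beta_noise:  (p0, q) marginal associations of noise SNPs.
    """
    p, p0 = beta_signal.shape[0], beta_noise.shape[0]
    # Assumed scaling: match the number of rows contributing to each term.
    diff = beta_signal.T @ beta_signal - (p / p0) * (beta_noise.T @ beta_noise)
    eigvals, eigvecs = np.linalg.eigh(diff)   # symmetric input, ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:r]       # indices of the r largest eigenvalues
    V_hat = eigvecs[:, top]                   # (q, r), orthonormal columns
    U_hat = beta_signal @ V_hat               # (p, r), assumed form of the SNP-side estimate
    return U_hat, V_hat
```

In a constructed case where the non-genetic second moment of the signal rows is exactly `p / p0` times that of the noise rows, the subtraction cancels it completely and `V_hat` spans the true trait-side space; with real data the cancellation holds only approximately.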

2.2.2. Gene–trait clustering

In this section, we find sparse matrices in the estimated column spaces in Section 2.2.1 and construct gene–trait clusters from the non-zero loadings.

Let Inline graphic, Inline graphic be two candidate matrices from Inline graphic to Inline graphic, respectively. We aim to find a transformation matrix Inline graphic such that the columns of Inline graphic, Inline graphic are sparse. The task aligns with a number of readily available methods from factor analysis with a focus on sparsity. In particular, we adopt two commonly used approaches summarized below.

  • Varimax (Kaiser, 1958). We start from an orthonormal matrix Inline graphic and solve for a rotation matrix Inline graphic to maximize the variances of squared loadings of Inline graphic’s columns. The resulting Inline graphic tends to have many small loadings and we set those values to zero. Finally, we let Inline graphic and again set small values in Inline graphic to zero.

  • Promax (Hendrickson and White, 1964). Promax first applies the above Varimax to Inline graphic, and then rotates the orthogonal columns of Varimax to a least squares fit. The approach relaxes the orthonormal restriction of Inline graphic in Varimax and thus the loadings in Inline graphic are pushed further apart. We let Inline graphic and truncate small values in Inline graphic, Inline graphic to zero.
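The Varimax step described above can be sketched with the widely used SVD-based iteration. This is an illustration only: the helper names `varimax` and `keep_largest` are hypothetical, and the truncation rule (keep the `m` largest absolute loadings per column) is an assumed stand-in for "set small values to zero". Promax is not implemented here.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal rotation that increases the variance of squared loadings.

    Returns the rotated loadings and the rotation matrix.
    """
    n_rows, n_cols = loadings.shape
    rotation = np.eye(n_cols)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)  # column sums of squares
        grad = loadings.T @ (rotated ** 3 - rotated * col_ss / n_rows)
        left, sing, right_t = np.linalg.svd(grad)
        rotation = left @ right_t            # closest orthogonal matrix to grad
        new_objective = sing.sum()
        if objective > 0 and new_objective < objective * (1 + tol):
            break                            # no meaningful improvement
        objective = new_objective
    return loadings @ rotation, rotation

def keep_largest(matrix, m):
    """Zero out all but the m largest absolute entries in each column."""
    out = np.zeros_like(matrix)
    for k in range(matrix.shape[1]):
        idx = np.argsort(np.abs(matrix[:, k]))[-m:]
        out[idx, k] = matrix[idx, k]
    return out
```

Because the rotation is orthogonal, the rotated matrix spans the same column space as the input, so sparsity is gained without changing the estimated subspace.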

Finally, provided with sparse column estimators Inline graphic, Inline graphic, we loop over Inline graphic column pairs Inline graphic and assign the traits and genes with non-zero loadings into a cluster. Details are summarized in Algorithm 1. In the upcoming section, we will build upon Algorithm 1 and enhance its stability by bootstrap aggregation.
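The final assignment step is simple bookkeeping, sketched below. The function name `clusters_from_loadings` and the SNP and trait labels in the usage note are hypothetical.

```python
import numpy as np

def clusters_from_loadings(U_sparse, V_sparse, snp_names, trait_names):
    """Build one gene-trait cluster per column pair from non-zero loadings."""
    clusters = []
    for k in range(U_sparse.shape[1]):
        snps = {snp_names[i] for i in np.flatnonzero(U_sparse[:, k])}
        traits = {trait_names[j] for j in np.flatnonzero(V_sparse[:, k])}
        clusters.append(snps | traits)  # cluster k mixes SNPs and traits
    return clusters
```

For example, with two columns where the first has non-zero loadings on SNP `rs1` and trait `LDL` and the second on SNP `rs2` and trait `HDL`, the result is a list of two sets, one per pathway. A SNP or trait with non-zero loadings in several columns appears in several clusters.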

2.3. Bootstrap aggregation of PathGPS

Several issues may undermine the reliability of Algorithm 1. First, in modern biobanks, the set of measured traits is ever-growing. Many traits are repeatedly measured in slightly different ways. It is desirable to obtain stable results if the traits are slightly perturbed. Second, in the data preprocessing procedures, we use an external SNP dataset to select signal and noise index SNPs. We expect to arrive at similar gene–trait clusters if we perturb the index SNP sets, especially the signal SNPs, by a small amount. Third, Algorithm 1 relies on a set of hyperparameters, such as the number of latent mediators Inline graphic. The cluster list Inline graphic should be robust to the selection of hyperparameters.

We propose a bootstrap aggregation (bagging) approach to stabilize the pipeline and make the results more replicable by perturbing the entire procedure many times and then aggregating the results. In the following, we discuss the two components of the bagging procedure: SNP bootstrapping (Section 2.3.1) and the aggregation method via a co-appearance graph (Section 2.3.2). The full bagging procedure is summarized in Algorithm 2.

2.3.1. SNP bootstrapping

Motivated by Breiman (2001), we bootstrap the SNPs used by Algorithm 1. In each trial, we resample the same number of signal SNPs with replacement and obtain bootstrapped signal estimated marginal associations Inline graphic. We then apply Algorithm 1 to Inline graphic and Inline graphic and arrive at a cluster list Inline graphic. We repeat the resampling step Inline graphic times and obtain a collection of cluster lists Inline graphic. The left panel of Figure 2 describes an example of the bootstrap process.
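The resampling loop can be sketched as below. The callback `run_pathgps` is a hypothetical stand-in for Algorithm 1: it is assumed to accept the bootstrapped signal associations, the noise associations, and the resampled row indices (so that bootstrapped rows can be mapped back to the original SNPs), and to return a list of clusters.

```python
import numpy as np

def bootstrap_cluster_lists(beta_signal, beta_noise, run_pathgps, n_boot=200, seed=0):
    """Resample signal SNPs with replacement and cluster each resample."""
    rng = np.random.default_rng(seed)
    p = beta_signal.shape[0]
    cluster_lists = []
    for _ in range(n_boot):
        rows = rng.integers(0, p, size=p)  # same number of signal SNPs, with replacement
        cluster_lists.append(run_pathgps(beta_signal[rows], beta_noise, rows))
    return cluster_lists
```

In this sketch only the signal SNPs are resampled and the noise associations are passed through unchanged, which follows the description above; the default of 200 resamples mirrors the number used in the simulations of Section 3.1.2.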

FIGURE 2. SNP bootstrapping and co-appearance graph. In the left panel, we obtain multiple clusterings based on different sets of bootstrapped SNPs. In the right panel, we display the co-appearance graph obtained based on the four bootstrap samples in the left panel. The weight computation can be found in the supplementary materials.

2.3.2. Co-appearance graph aggregation

Based on the multiple gene–trait clusters Inline graphic generated by the SNP bootstrapping, we propose to aggregate the cluster lists using a co-appearance graph. Consider a graph whose nodes denote SNPs and traits. For two nodes Inline graphic and Inline graphic, we define the weight for the edge connecting Inline graphic and Inline graphic (called the co-appearance frequency in the following)

[Equation 9]

where Inline graphic denotes the gene–trait clusters in the list Inline graphic obtained from the Inline graphic-th bootstrap sample. The right panel of Figure 2 displays an example of the co-appearance graph. If two nodes always show up in the same cluster, the pair will have a high co-appearance frequency (Equation 9), and we are more confident about the connection of the pair.
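A minimal sketch of the co-appearance computation follows. The exact weight in Equation 9 is given in the supplementary materials and is not reproduced here; this sketch assumes the simplest version, the fraction of bootstrap cluster lists in which two nodes share at least one cluster. The function name `co_appearance_weights` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_appearance_weights(cluster_lists):
    """Map each unordered node pair to the share of resamples where it co-appears.

    cluster_lists: one list of clusters (sets of node labels) per bootstrap sample.
    Node labels must be mutually comparable so that pairs can be put in sorted order.
    """
    counts = Counter()
    for clusters in cluster_lists:
        pairs = set()  # count a pair once per resample, even if clusters overlap
        for cluster in clusters:
            pairs.update(combinations(sorted(cluster), 2))
        counts.update(pairs)
    n_boot = len(cluster_lists)
    return {pair: count / n_boot for pair, count in counts.items()}
```

Pairs that never share a cluster are simply absent from the returned dictionary, which corresponds to a weight of zero in the graph.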

The co-appearance graph is convenient for downstream clustering and visualization. One option is using t-SNE (Hinton and Roweis, 2002) or UMAP (McInnes et al., 2018) to find low-dimensional embeddings. The embeddings can be further used to visualize genes and traits that are closely connected in the co-appearance graph. The representations can also be fed to various clustering methods based on feature vectors like k-means. Alternatively, we can directly use graph clustering methods, such as spectral clustering and label propagation, to obtain gene–trait clusters.
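As a small illustration of graph clustering on a co-appearance matrix, the sketch below splits the nodes into two groups using the sign of the second eigenvector of the unnormalized graph Laplacian. This is the simplest two-cluster form of spectral clustering, chosen for brevity; it is not the authors' pipeline, which uses t-SNE or UMAP embeddings with k-means, or general graph clustering methods. The function name `spectral_bipartition` is hypothetical.

```python
import numpy as np

def spectral_bipartition(weights):
    """Two-way split of a weighted undirected graph.

    weights: symmetric (n, n) array of non-negative edge weights with zero diagonal.
    Returns an array of 0/1 labels of length n.
    """
    W = np.asarray(weights, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)
```

Which group receives label 0 or 1 is arbitrary, since an eigenvector is only defined up to sign. For more than two clusters one would typically run k-means on several leading eigenvectors.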

[Algorithm 2]

3. RESULTS

In Section 3.1, we generate simulated datasets following SEM Equations 1 and 2. We demonstrate the differencing estimator’s performance in estimating the column space in Section 3.1.1. We showcase that bagging enhances the stability of PathGPS in Section 3.1.2. In Section 3.2, we report the findings from applying PathGPS to the metabolomics data and the UK Biobank data.

3.1. Simulated datasets

3.1.1. Column space estimation

We construct simulation datasets following SEM Equations 1 and 2. We consider Inline graphic individuals, Inline graphic signal SNPs, Inline graphic noise SNPs, Inline graphic traits, Inline graphic latent genetic mediators, and Inline graphic latent environmental mediators. The number of latent genetic mediators is significantly smaller than the number of index SNPs and the number of traits. The signal/noise SNPs, latent environmental mediators, and the random errors Inline graphic, Inline graphic are independent standard Gaussian random variables.

As for coefficient matrices, we first generate elements of the coefficient matrices Inline graphic, Inline graphic uniformly from Inline graphic, and then randomly set Inline graphic of the entries to zero to create sparse matrices Inline graphic, Inline graphic. We also generate elements of the environmental mediators’ coefficient matrix Inline graphic uniformly from Inline graphic. We adjust the magnitudes of the environmental influence, the errors in the mediators Inline graphic, and the errors in the traits Inline graphic so that the proportion of the environmental mediators’ variance Inline graphic (environmental factor strength) varies from Inline graphic to Inline graphic, while the total variance of the non-genetic component stays the same.

We compare two column space estimators: the differencing estimator in Algorithm 1 and the “simple SVD” estimator. We measure the performance of column space estimators by the column space distance. Let Inline graphic and Inline graphic be two arbitrary matrices of dimension Inline graphic, and let Inline graphic, Inline graphic be the corresponding column spaces, respectively. We define the column space distance as

[Equation 10]

where Inline graphic, Inline graphic are the projection operators onto the column spaces of Inline graphic, Inline graphic.
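A distance between column spaces built from projection operators can be sketched as below. The exact norm and any normalization in Equation 10 are not recoverable from this copy, so the Frobenius norm of the difference of the two projectors is used here as an assumption. The function name `column_space_distance` is hypothetical.

```python
import numpy as np

def column_space_distance(A, B):
    """Frobenius norm of the difference between projectors onto col(A) and col(B)."""
    P_A = A @ np.linalg.pinv(A)  # orthogonal projector onto the column space of A
    P_B = B @ np.linalg.pinv(B)  # orthogonal projector onto the column space of B
    return np.linalg.norm(P_A - P_B, ord="fro")
```

The distance depends only on the subspaces, not on the particular basis: multiplying `A` on the right by any invertible matrix leaves it unchanged, which matches the non-identifiability discussed in Section 2.2.1.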

In Figure 3, we report the column space distances (Equation 10) of the two methods under different levels of environmental influence. The performance of the “simple SVD” approach deteriorates as the environmental influence increases. The “simple SVD” approach mistakenly counts the leading environmental factor as a genetic influence. In contrast, the proposed column space estimator is robust to the environmental factors. This is because, regardless of the magnitude of the environmental influence, the environmental factors are nearly entirely captured by their estimator Inline graphic and subtracted from Inline graphic. When the environmental factor strength [the proportion of the environmental mediators’ variance Inline graphic in Equation 4] exceeds Inline graphic, the one standard deviation intervals of the column space distance based on the proposed estimator fall strictly below those produced by the “simple SVD” approach. In the supplementary materials, we compare the clustering results based on the proposed estimator against the “simple SVD” and include additional simulations for a more comprehensive analysis. In particular, we demonstrate that our proposed method is robust in the presence of the many weak effects commonly associated with polygenic traits.

FIGURE 3. Column space estimation. We compare the “simple SVD” approach and the differencing estimator in PathGPS. We evaluate the estimation performance by the column space distance in Equation 10. We vary the proportion of the environmental mediators’ variance Inline graphic (environmental factor strength) in Equation 4 from Inline graphic to Inline graphic, and display the average column space distances plus and minus one standard deviation for Inline graphic in the left panel and Inline graphic in the right panel. All results are aggregated over 100 trials.

3.1.2. Gene–trait clustering

As in Section 3.1.1, we follow SEM Equations 1 and 2, and consider Inline graphic individuals, Inline graphic signal SNPs, Inline graphic noise SNPs, Inline graphic traits, Inline graphic latent genetic mediators, and Inline graphic latent environmental mediators. In the default setting (default), we design sparse coefficient matrices Inline graphic, Inline graphic to have a total of 16 genes and traits in each cluster. We also require each gene and trait to belong to at most one cluster. In addition to the default setting, we consider two variations: a sparse setting (sparse) with only 12 genes and traits per cluster, and an overlap setting (overlap) where a gene or trait may belong to multiple clusters (multiple membership). Details of the three simulation settings are summarized in Table 1.

TABLE 1. Summary of simulation settings of gene–trait clusters discovery.

Setting    Number of genes and traits per cluster    Multiple membership
default    16                                        No
sparse     12                                        No
overlap    16                                        Yes

We compare several versions of the PathGPS: (a) the one-shot pipeline (baseline) following Algorithm 1 (no further clustering is required, clustering method denoted by “NA”); (b) the clustering without bootstrapping (one-shot) following Algorithm 1; (c) the co-appearance clustering with bootstrapping (bootstrap) following Algorithm 1 with 200 bootstrap resamples. As for the clustering methods used by the approach one-shot and bootstrap based on co-appearance graphs, we consider: (a) first learn the low-dimensional embeddings via t-SNE or UMAP, and then cluster the embeddings by k-means; (b) directly apply graph clustering methods: spectral clustering and hierarchical clustering.

We evaluate the performance of the clustering by the minimal clustering error across label permutations, defined below. In particular, let Inline graphic be the true list of Inline graphic clusters defined on set Inline graphic, and Inline graphic be an estimate with the same number of clusters, then we define the clustering error

[Equation 11]

where Inline graphic denotes a permutation over cluster labels. To test the robustness to the misspecification of hyperparameters, we input a sequence of hyperparameters around the true values. For the number of latent genetic pathways Inline graphic, we provide Inline graphic such that Inline graphic, which includes the correct specification Inline graphic. For the number of genes and traits in each cluster Inline graphic, we input hyperparameters Inline graphic, which also includes the correct specification.
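A permutation-minimized clustering error can be sketched as follows. The exact form of Equation 11 is not recoverable from this copy; the sketch assumes the error is the total size of the symmetric differences between matched clusters, divided by the number of items, minimized over all matchings of estimated to true clusters. The function name `clustering_error` is hypothetical.

```python
from itertools import permutations

def clustering_error(true_clusters, est_clusters, n_items):
    """Smallest normalized mismatch over all relabelings of the estimated clusters.

    true_clusters, est_clusters: lists of sets with the same length.
    n_items: size of the set on which the clusters are defined.
    """
    if len(true_clusters) != len(est_clusters):
        raise ValueError("both lists must contain the same number of clusters")
    best = None
    for perm in permutations(range(len(est_clusters))):
        mismatches = sum(
            len(true_clusters[k] ^ est_clusters[j])  # symmetric difference size
            for k, j in enumerate(perm)
        )
        best = mismatches if best is None else min(best, mismatches)
    return best / n_items
```

Enumerating all permutations costs factorial time in the number of clusters, which is acceptable for a handful of clusters; for larger problems an assignment solver would be the usual replacement.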

In Figure 4, across different simulation settings, the baseline achieves the best accuracy at the true hyperparameters, while the bootstrap approach is in general more robust.

FIGURE 4. Gene–trait clusters discovery. We compare three variants of PathGPS: the one-shot pipeline (baseline, circles) following Algorithm 1 (no further clustering is required, clustering method denoted by “NA”), the clustering without bootstrapping (one-shot, triangles) following Algorithm 1, and the co-appearance clustering with bootstrapping (bootstrap) following Algorithm 1 with 200 bootstrap resamples. We evaluate the estimation performance by the clustering error (Equation 11). We input a sequence of Inline graphic (first row) and Inline graphic (second row) around the true values, and plot the average clustering errors aggregated over 100 trials.

3.2. Real datasets

We apply PathGPS to a metabolomics dataset (Section 3.2.1) and the UK Biobank (Section 3.2.2), and discuss the gene–trait clusters produced. Preprocessing procedures can be found in the supplementary materials, including the details of choosing “signal” and “noise” genes. The results, including the lists and visualizations of the gene–trait clusters, are summarized in Figure 5.

3.2.1. Metabolomics data

We use the genome-metabolome-wide association study in Kettunen et al. (2016) as the main dataset. The summary statistics are derived from 24925 participants and contain 123 metabolites and Inline graphic SNPs passing quality control. We remove 18 traits with <Inline graphic variance explained according to Supplementary Table 1 of Kettunen et al. (2016). To select approximately independent index SNPs, we apply PLINK to an external dataset (Davis et al., 2017) of 72 metabolites, including a large proportion of the traits in the main dataset. We regard 50 index SNPs with at least one significant marginal association as signal SNPs, and 250 index SNPs with no significant marginal associations as noise SNPs. We extract the marginal associations corresponding to the signal and noise SNPs from the main dataset as the input for PathGPS.

FIGURE 5. Applications of PathGPS. Panel A displays the summary of the metabolomics data (a1), the UMAP embeddings of seven gene–trait clusters produced by PathGPS with co-appearance edge weights (a2), and representative traits and mapped genes in each cluster (a3). In (a4), we subsample traits without replacement, and PathGPS (UMAP) produces more consistent cluster memberships than the baseline method (Figure 1b5). Panel B displays the summary of the UK Biobank data (b1), the UMAP visualization (b2), and representative genes and traits of 10 clusters produced by PathGPS (b3). PathGPS (UMAP) again produces more stable clusters (b4). The representative traits and mapped genes in (a3) and (b3) are selected manually.

For the metabolomics dataset, PathGPS produces seven clusters, which roughly correspond to large high-density lipoprotein (HDL), small HDL, low-density lipoprotein (LDL), intermediate-density lipoprotein (IDL), large very-low-density lipoprotein (VLDL), small VLDL, and non-lipoprotein measurements (Figure 5a). Thus, using genetic data only, PathGPS is able to recover the known taxonomy of circulating metabolites. In the supplementary materials, we provide a comparison of the clusters produced by PathGPS and those of the “simple SVD” method. PathGPS confirms several known causal genes, such as PLTP as a regulator of HDL size (Huuskonen et al., 2001) and PCSK9 as a regulator of LDL cholesterol (Maxwell and Breslow, 2004). PathGPS also proposes several biological hypotheses that are not as well established, including RNF111 in relation to HDL (Holmen et al., 2014) and TM4SF5 in relation to lipid measurements (Choi et al., 2021). In fact, few gene–trait pairs suggested by PathGPS directly reach the genome-wide significant level after correcting for multiple testing, but the majority of the gene–trait pairs are at least moderately associated. This demonstrates the ability of PathGPS to associate a group of genes with a group of traits, when any single association is not sufficiently strong.

To assess stability with respect to trait inclusion, in each trial we subsample around half of the traits without replacement and apply PathGPS to the selected subset of traits. We compare the co-appearance weights (co-appearing probabilities) obtained with (Figure 5a4, UMAP) and without (Figure 5a4, Baseline) bootstrap aggregation. Close-to-one co-appearance weights indicate that the associated pairs always fall into the same cluster, and close-to-zero values imply that the associated pairs always end up in different clusters. We observe that the histograms of co-appearance weights of PathGPS with bagging have sharper spikes around 0 and 1. These bowl-shaped histograms indicate that the bagging procedure increases the stability of PathGPS with respect to trait inclusion.

3.2.2. UK Biobank data

We use the GWAS summary statistics from the UK Biobank data generated by the Neale Lab. The summary statistics are derived from Inline graphic participants of white-British ancestry and contain Inline graphic passing quality control SNPs.

For data preprocessing, we first remove traits with missing female or male summary statistics. The female summary statistics are used for SNP and trait selection, and the male summary statistics are used for downstream exploration of gene–trait clusters. We select approximately independent index SNPs based on the female summary statistics using the PLINK software (Purcell et al., 2007). To eliminate unreliable estimates of genetic associations, we focus on the 175 traits with at least one significant index SNP at a Inline graphic confidence level. The resulting traits are a combination of lab measurements, disease diagnoses, medication, and a small number of lifestyle habits. Among the index SNPs, we regard the SNPs with at least one significant marginal association test as signal SNPs (1200 in total). We regard the SNPs with no significant marginal association tests as noise SNPs (250 in total). Finally, the estimated marginal associations between the selected traits and signal (noise) SNPs from the male population are used as the input for PathGPS.
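The signal/noise split and the trait filter described above can be sketched as below. This is only an illustration under assumptions: the function names, the toy p-values, and the threshold `alpha` are hypothetical, and the significance level actually used in the paper is not reproduced here. In the paper's workflow, this selection step would be run on the female summary statistics.

```python
def split_index_snps(pvalues, alpha):
    """Split index SNPs into signal SNPs (at least one significant marginal
    test across traits) and noise SNPs (no significant marginal test)."""
    signal, noise = [], []
    for snp, row in pvalues.items():
        if min(row) < alpha:
            signal.append(snp)
        else:
            noise.append(snp)
    return signal, noise


def traits_with_signal(pvalues, traits, alpha):
    """Keep traits that have at least one significant index SNP."""
    return [
        trait
        for j, trait in enumerate(traits)
        if any(row[j] < alpha for row in pvalues.values())
    ]


# Toy inputs (hypothetical): rows are index SNPs, columns follow `traits`.
traits = ["trait_1", "trait_2", "trait_3"]
pvalues = {
    "snp_1": [1e-9, 0.20, 0.70],
    "snp_2": [0.40, 0.30, 0.90],
    "snp_3": [0.05, 1e-8, 0.60],
}
alpha = 1e-6  # illustrative threshold only, not the paper's value

signal, noise = split_index_snps(pvalues, alpha)
kept_traits = traits_with_signal(pvalues, traits, alpha)
```

The marginal associations for the kept traits and the signal (noise) SNPs would then be read from an independent sample, here the male summary statistics, so that selection and estimation do not reuse the same data.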

PathGPS produces 10 clusters (Figure 5b3), among which 3 are closely related to some diseases (venous thromboembolism, cardiovascular diseases, and type 2 diabetes), and the other 7 contain biometric measurements, such as bone mineral density, immune system, fat-free mass, and skin or hair colours. In the UMAP visualization (Figure 5b2), the edges reflect high (top 350) co-appearances between vertex pairs and may offer insights into disease mechanisms. For instance, our analysis finds that the medication simvastatin is closely related to high cholesterol and cardiovascular diseases (CVD), which is not surprising given that it is widely prescribed to reduce CVD risk (Bibbins-Domingo et al., 2016). We also find that atorvastatin, another drug in the statin family, is highly related to bone mineral density (BMD) and associated traits. This finding is consistent with existing evidence that statins increase BMD (Li et al., 2020; Lupattelli et al., 2004). In addition, edges connecting monocytes, neutrophils, and lymphocytes to diabetes and asthma diagnoses have high weights, suggesting connections between the immune system and the two common diseases. In particular, diabetes may be related to the immune system through multiple mechanisms; for example, hyperglycemia in diabetes may cause dysfunction of the immune response (Berbudi et al., 2020). As for asthma, T lymphocytes are critical to its development (Larché et al., 2003). The co-occurrence of diabetes and asthma may be attributed to shared immunological pathways (Torres et al., 2021). In the supplementary materials, we provide a comparison of the clusters produced by PathGPS and those of the “simple SVD” method.
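Selecting the edges with the highest co-appearance weights for display, as in the UMAP visualization above, could look like the following sketch. The function name, vertex names, and weights are hypothetical placeholders rather than values from the analysis. The cut-off `k` is a parameter, and the paper uses the top 350.

```python
import heapq


def top_edges(weights, k):
    """Return the k vertex pairs with the largest co-appearance weights,
    sorted from highest to lowest weight."""
    return heapq.nlargest(k, weights.items(), key=lambda item: item[1])


# Hypothetical co-appearance weights between vertex pairs.
toy_weights = {
    ("trait_a", "trait_b"): 0.95,
    ("trait_a", "trait_c"): 0.10,
    ("trait_c", "trait_d"): 0.80,
    ("trait_b", "trait_d"): 0.35,
}

edges = top_edges(toy_weights, k=2)
```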

Regarding the genetic architecture, our analysis confirms many existing discoveries, such as the association between HERC2 and hair colour (Branicki et al., 2011), PELO and red blood cells (Mills et al., 2016), and NME7 and venous thromboembolism (Heit et al., 2012). We also find less well-established biological hypotheses, such as BCL2 and atrial fibrillation (Li et al., 2018), and GFI1 and lymphocyte cells (Van der Meer et al., 2010). The UMAP embedding provides further information beyond the cluster membership. For example, the cluster containing smoking, alcohol, and diabetes is adjacent to the cluster containing CVD, indicating a multifaceted health effect of alcohol consumption and tobacco usage.

4. DISCUSSION

In this article, we propose PathGPS—a promising statistical tool to discover genes and traits sharing latent biological pathways. When applied to the UK Biobank and metabolomics data, PathGPS not only confirms many established genetic associations but also generates novel biological hypotheses. By grouping diseases with shared biological pathways, PathGPS can enhance the understanding of comorbidities and contribute to the development of comprehensive clinical practices.

We highlight that PathGPS only requires summary statistics and thus can be readily applied to a number of biobank datasets (Chen et al., 2011; Christensen et al., 2012; Avlund et al., 2014). It is possible that, for certain traits, the underlying genetic pathways differ across sub-populations around the globe. The heterogeneity of genetic pathways can potentially lead to individualized treatments. Therefore, it is of value to compare the output gene–trait clusters of PathGPS applied to various biobank datasets.

Our proposal of using PathGPS with bootstrap aggregation addresses the call for reproducible research. When scientific findings rely on statistical analysis, the statistical results should be stable under “reasonable” data perturbations (Yu, 2013). In particular, biobanks and other databases are often regularly augmented by additional measurements and samples, and it would be desirable to obtain consistent conclusions when more data become available.

There are several avenues for future work. First, research shows that the interactions between genes and environment shape human development, and childhood experiences can alter gene expression. It may therefore be useful to extend the current model to include the interaction of genetic and environmental factors. Second, PathGPS outputs groups of traits associated with the same pathway, and it would be of great interest to further investigate the causal mechanism. Third, given that it is difficult to develop rigorous uncertainty quantification and inference for clustering and unsupervised learning tasks, it would be useful to consider how experiments can be designed to validate or disprove the potential pathways generated by PathGPS. Finally, PathGPS deliberately selects index SNPs to be distant from each other to ensure their independence, which results in a limited number of such SNPs. To expand the scope and involve a larger number of SNPs, our method would need to be extended to handle dependent SNPs. This would require us to model the covariance matrix of the index SNPs in order to establish connections between marginal associations and the coefficients derived from a full regression.
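To illustrate the last point, the sketch below works through the standard population-level relation between marginal and joint regression coefficients for two correlated SNPs. It assumes a linear model with noise uncorrelated with the genotypes, and it is not part of PathGPS. The function names and numbers are hypothetical. With SNP covariance matrix `cov` and joint coefficients `b`, the marginal slope for SNP j is `(cov @ b)[j] / cov[j][j]`, so recovering `b` from marginal slopes requires knowing `cov`.

```python
def marginal_from_joint(cov, b):
    """Population marginal slopes: gamma_j = (cov @ b)_j / cov[j][j]."""
    p = len(b)
    return [
        sum(cov[j][k] * b[k] for k in range(p)) / cov[j][j]
        for j in range(p)
    ]


def joint_from_marginal_2x2(cov, gamma):
    """Invert the relation for two SNPs by solving
    cov @ b = diag(cov) * gamma with the explicit 2x2 inverse."""
    rhs = [cov[0][0] * gamma[0], cov[1][1] * gamma[1]]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    b0 = (cov[1][1] * rhs[0] - cov[0][1] * rhs[1]) / det
    b1 = (cov[0][0] * rhs[1] - cov[1][0] * rhs[0]) / det
    return [b0, b1]


# Hypothetical example: two SNPs with correlation 0.6; only the first SNP
# has a nonzero joint effect.
cov = [[1.0, 0.6], [0.6, 1.0]]
b = [0.5, 0.0]

gamma = marginal_from_joint(cov, b)             # second SNP looks associated
recovered = joint_from_marginal_2x2(cov, gamma)  # the correlation is undone
```

In this toy case the second SNP has a nonzero marginal slope purely because it is correlated with the first. With independent index SNPs, `cov` is diagonal and the marginal and joint coefficients coincide, which is why distant index SNPs avoid this step.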

Supplementary Material

ujae060_Supplemental_Files

Web Appendices, Tables, and Figures, referenced in Sections 2 and 3, along with codes, are available with this paper at the Biometrics website on Oxford Academic. The codes used for the analysis in this paper are also available at https://github.com/ZijunGao/PathGPS.

Acknowledgement

We would like to express our gratitude to Professors Robert Tibshirani and Jonathan Taylor for their invaluable suggestions. Additionally, we extend our appreciation to the anonymous reviewers for their constructive guidance and feedback.

Contributor Information

Zijun Gao, Marshall Business School, University of Southern California, Los Angeles CA, 90089, United States.

Qingyuan Zhao, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, CB3 0WB, United Kingdom.

Trevor Hastie, Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, United States.

FUNDING

Q.Z. is supported by the Isaac Newton Trust and EPSRC grant EP/V049968/1. T.H. is partially supported by grants Division of Mathematical Sciences (DMS) 2013736 and Division of Information and Intelligent Systems (IIS) 1837931 from the (US) National Science Foundation and grant 5R01 EB 001988-21 from the (US) National Institutes of Health.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The data that support the findings in this paper are openly accessible. The UK Biobank data can be accessed at http://www.nealelab.is/uk-biobank, and the metabolomics data sets are provided in Kettunen et al. (2016) and Davis et al. (2017).

References

  1. Avlund K., Osler M., Mortensen E. L., Christensen U., Bruunsgaard H., Holm-Pedersen P. et al. (2014). Copenhagen aging and midlife biobank (camb): an introduction. Journal of Aging and Health, 26, 5–20.
  2. Berbudi A., Rahmadika N., Tjahjadi A. I., Ruslami R. (2020). Type 2 diabetes and its impact on the immune system. Current Diabetes Reviews, 16, 442–449.
  3. Bibbins-Domingo K., Grossman D. C., Curry S. J., Davidson K. W., Epling J. W., García F. A. et al. (2016). Statin use for the primary prevention of cardiovascular disease in adults: us preventive services task force recommendation statement. JAMA, 316, 1997–2007.
  4. Boyle E. A., Li Y. I., Pritchard J. K. (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell, 169, 1177–1186.
  5. Branicki W., Liu F., van Duijn K., Draus-Barini J., Pośpiech E., Walsh S. et al. (2011). Model-based prediction of human hair color using DNA variants. Human Genetics, 129, 443–454.
  6. Breiman L. (2001). Random forests. Machine Learning, 45, 5–32.
  7. Bulik-Sullivan B., Finucane H. K., Anttila V., Gusev A., Day F. R., Loh P.-R. et al. (2015). An atlas of genetic correlations across human diseases and traits. Nature Genetics, 47, 1236–1241.
  8. Cano-Gamez E., Trynka G. (2020). From gwas to function: using functional genomics to identify the mechanisms underlying complex diseases. Frontiers in Genetics, 11, 424.
  9. Chen Z., Chen J., Collins R., Guo Y., Peto R., Wu F. et al. (2011). China kadoorie biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. International Journal of Epidemiology, 40, 1652–1666.
  10. Choi C., Son Y., Kim J., Cho Y. K., Saha A., Kim M. et al. (2021). Tm4sf5 knockout protects mice from diet-induced obesity partly by regulating autophagy in adipose tissue. Diabetes, 70, 2000–2013.
  11. Christensen H., Nielsen J. S., Sørensen K. M., Melbye M., Brandslund I. (2012). New national biobank of the Danish center for strategic research on type 2 diabetes (dd2). Clinical Epidemiology, 4, 37–42.
  12. Davis J. P., Huyghe J. R., Locke A. E., Jackson A. U., Sim X., Stringham H. M. et al. (2017). Common, low-frequency, and rare genetic variants associated with lipoprotein subclasses and triglyceride measures in finnish men from the metsim study. PLoS Genetics, 13, e1007079.
  13. Finucane H. K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R. et al. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics, 47, 1228.
  14. Heit J. A., Armasu S. M., Asmann Y. W., Cunningham J. M., Matsumoto M. E., Petterson T. M. et al. (2012). A genome-wide association study of venous thromboembolism identifies risk variants in chromosomes 1q24. 2 and 9q. Journal of Thrombosis and Haemostasis, 10, 1521–1531.
  15. Hendrickson A. E., White P. O. (1964). Promax: a quick method for rotation to oblique simple structure. British Journal of Statistical Psychology, 17, 65–70.
  16. Hinton G., Roweis S. T. (2002). Stochastic neighbor embedding. NeurIPS Proceedings, 15, 833–840.
  17. Holmen O. L., Zhang H., Fan Y., Hovelson D. H., Schmidt E. M., Zhou W. et al. (2014). Systematic evaluation of coding variation identifies a candidate causal variant in tm6sf2 influencing total cholesterol and myocardial infarction risk. Nature Genetics, 46, 345–351.
  18. Huuskonen J., Olkkonen V. M., Jauhiainen M., Ehnholm C. (2001). The impact of phospholipid transfer protein (pltp) on hdl metabolism. Atherosclerosis, 155, 269–281.
  19. Jennrich R. I. (2001). A simple general procedure for orthogonal rotation. Psychometrika, 66, 289–306.
  20. Kaiser H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187–200.
  21. Kettunen J., Demirkan A., Würtz P., Draisma H. H., Haller T., Rawal R. et al. (2016). Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of lpa. Nature Communications, 7, 1–9.
  22. Lappalainen T., MacArthur D. G. (2021). From variant to function in human disease genetics. Science, 373, 1464–1468.
  23. Larché M., Robinson D. S., Kay A. B. (2003). The role of t lymphocytes in the pathogenesis of asthma. Journal of Allergy and Clinical Immunology, 111, 450–463.
  24. Li G. H.-Y., Cheung C.-L., Au P. C.-M., Tan K. C.-B., Wong I. C.-K., Sham P.-C. (2020). Positive effects of low ldl-c and statins on bone mineral density: an integrated epidemiological observation analysis and mendelian randomization study. International Journal of Epidemiology, 49, 1221–1235.
  25. Li Y., Song B., Xu C. (2018). Effects of guanfu total base on bcl-2 and bax expression and correlation with atrial fibrillation. Hellenic Journal of Cardiology, 59, 274–278.
  26. Lupattelli G., Scarponi A. M., Vaudo G., Siepi D., Roscini A. R., Gemelli F. et al. (2004). Simvastatin increases bone mineral density in hypercholesterolemic postmenopausal women. Metabolism, 53, 744–748.
  27. McInnes L., Healy J., Melville J. (2018). Umap: Uniform Manifold Approximation and Projection for Dimension Reduction. Journal of Open Source Software, 29, 861. doi: 10.21105/joss.00861
  28. Manolio T. A., Collins F. S., Cox N. J., Goldstein D. B., Hindorff L. A., Hunter D. J. et al. (2009). Finding the missing heritability of complex diseases. Nature, 461, 747–753.
  29. Maxwell K. N., Breslow J. L. (2004). Adenoviral-mediated expression of pcsk9 in mice results in a low-density lipoprotein receptor knockout phenotype. Proceedings of the National Academy of Sciences, 101, 7100–7105.
  30. Mills E. W., Wangen J., Green R., Ingolia N. T. (2016). Dynamic regulation of a ribosome rescue pathway in erythroid cells and platelets. Cell Reports, 17, 1–10.
  31. Ning Z., Pawitan Y., Shen X. (2020). High-definition likelihood inference of genetic correlations across human complex traits. Nature Genetics, 52, 859–864.
  32. Pickrell J. K., Berisa T., Liu J. Z., Ségurel L., Tung J. Y., Hinds D. A. (2016). Detection and interpretation of shared genetic influences on 42 human traits. Nature Genetics, 48, 709–717.
  33. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A., Bender D. et al. (2007). Plink: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559–575.
  34. Shi H., Kichaev G., Pasaniuc B. (2016). Contrasting the genetic architecture of 30 complex traits from summary association data. The American Journal of Human Genetics, 99, 139–153.
  35. Solovieff N., Cotsapas C., Lee P. H., Purcell S. M., Smoller J. W. (2013). Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics, 14, 483–495.
  36. Tanigawa Y., Li J., Justesen J. M., Horn H., Aguirre M., DeBoever C. et al. (2019). Components of genetic associations across 2,138 phenotypes in the UK biobank highlight adipocyte biology. Nature Communications, 10, 1–14.
  37. Torres R. M., Souza M. D. S., Coelho A. C. C., de Mello L. M., Souza-Machado C. (2021). Association between asthma and type 2 diabetes mellitus: Mechanisms and impact on asthma control—a literature review. Canadian Respiratory Journal, 2021, 8830439.
  38. Van der Meer L., Jansen J., Van Der Reijden B. (2010). Gfi1 and gfi1b: key regulators of hematopoiesis. Leukemia, 24, 1834–1843.
  39. Visscher P. M., Wray N. R., Zhang Q., Sklar P., McCarthy M. I., Brown M. A. et al. (2017). 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101, 5–22.
  40. Witten D. M., Tibshirani R., Hastie T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10, 515–534.
  41. Yu B. (2013). Stability. Bernoulli, 19, 1484–1500.
  42. Zou H., Hastie T., Tibshirani R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265–286.


Articles from Biometrics are provided here courtesy of Oxford University Press
