PLOS Genetics. 2022 Jan 27;18(1):e1009975. doi: 10.1371/journal.pgen.1009975

Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity

Andrew J Grant, Dipender Gill, Paul D W Kirk, Stephen Burgess
Editor: Michael P Epstein
PMCID: PMC8794082  PMID: 35085229

Abstract

Clustering genetic variants based on their associations with different traits can provide insight into their underlying biological mechanisms. Existing clustering approaches typically group variants based on the similarity of their association estimates for various traits. We present a new procedure for clustering variants based on their proportional associations with different traits, which is more reflective of the underlying mechanisms to which they relate. The method is based on a mixture model approach for directional clustering and includes a noise cluster that provides robustness to outliers. The procedure performs well across a range of simulation scenarios. In an applied setting, clustering genetic variants associated with body mass index generates groups reflective of distinct biological pathways. Mendelian randomization analyses support that the clusters vary in their effect on coronary heart disease, including one cluster that represents elevated body mass index with a favourable metabolic profile and reduced coronary heart disease risk. Analysis of the biological pathways underlying this cluster identifies inflammation as potentially explaining differences in the effects of increased body mass index on coronary heart disease.

Author summary

Genome-wide association studies have found many genetic variants that are correlated with traits, particularly complex traits such as body mass index (BMI). However, genetic association data cannot tell us how these variants influence the trait, or whether they influence the trait in the same way. Insight into these questions may be gained by analysing the associations between the variants and other related traits. Variants with similar patterns of associations across a set of traits may be thought to act via similar biological mechanisms. Here we present a new statistical method for grouping genetic variants according to their associations with chosen traits, so that each group represents variants acting on these traits in a distinct way. We apply the method to genetic variants associated with BMI and then study the effects of each of the identified groups of variants on coronary heart disease. We find a group of genetic variants associated with higher BMI and decreased risk of heart disease, which is in contrast to the established overall harmful effect of BMI on heart disease.

Introduction

In recent years, the number of genome-wide association studies (GWAS) has grown enormously [1]. Such studies provide valuable information linking genetic variants across the human genome to a wide range of traits. What often remain less understood are the underlying mechanisms by which the associated genetic variants affect the traits. Insight into these mechanisms may be gained by investigating the pattern of associations with other related traits: genetic variants that share similar association patterns may be thought to act via similar mechanisms [2]. For example, some genetic variants associated with type 2 diabetes are also associated with obesity related traits such as body mass index (BMI), whereas others are instead associated with traits such as triglycerides, suggesting that the variants influence type 2 diabetes risk via different biological mechanisms [3].

A number of techniques have been implemented to cluster genetic variants based on their associations with traits that are believed to be relevant in informing biological pathways. The traits often include separate risk factors or potential mediators of some disease outcome(s) of interest. A common approach is to use hierarchical clustering, which groups observations based on their distance from each other [4–7]. The number of clusters is then chosen heuristically. Other clustering approaches which have been applied to genetic variant-trait association estimates include fuzzy c-means [6] and Bayesian nonnegative matrix factorization [3]. A related approach which aims to determine distinct components of genetic variant-trait associations uses truncated singular value decomposition [8].

A key characteristic of previously implemented approaches is that they cluster based on the Euclidean distance between vectors of the genetic variant-trait association estimates, defined as the length of the line between the association estimates plotted as points on a graph. However, when trying to determine shared biological mechanisms, a more relevant clustering target is the proportional associations of each genetic variant with the set of traits. If two variants influence a set of related traits via a common mechanism, the genetic associations may differ considerably in magnitude due to one variant having a stronger effect than the other. However, their proportional associations across the traits will be similar for both variants. Equivalent to looking at proportional associations is to consider the direction of the association vector from the origin. That is, in order to distinguish between variants which act via different mechanisms, it is the direction of the association vector rather than its location in space which is of most importance. This is illustrated graphically in Fig 1. Relating similar directions of genetic associations to shared biological mechanisms has been discussed by, for example, Yaghootkar et al. [9], Winkler et al. [2] and Udler et al. [3]. We note that implicit in this definition of mechanism is the assumption that the relationships between the genetic associations with one trait and the genetic associations with each of the other traits are linear.

Fig 1. Illustrative figure showing the difference between clustering based on Euclidean distance compared with direction.


Panel (a) plots 90 simulated points representing genetic associations with two traits. Each point was generated from one of three bivariate normal distributions. Panel (b) plots the normalised genetic associations, representing the proportional association of each genetic variant with respect to the two traits. All points sit on the unit circle. The green points represent genetic variants which are positively associated with each trait by similar magnitudes. The orange points represent genetic variants which are positively associated with trait 1 and negatively associated with trait 2, again by similar magnitudes. Methods based on Euclidean distance such as Gaussian mixture models and hierarchical clustering would consider there to be three clusters, distinguishing between the light and dark green points, as shown in Panel (a). Directional clustering approaches would consider there to be two clusters, grouping the green points in the same cluster. This is shown in Panel (b), where the points are clearly grouped in two separate clusters.
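The contrast between the two clustering targets can be made concrete with a small numerical sketch. The association estimates below are hypothetical values chosen for illustration, not data from the study. Two variants with the same proportional associations but different magnitudes are far apart in Euclidean distance, yet point in exactly the same direction.

```python
import math

def euclidean_distance(u, v):
    """Length of the straight line between two association vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalise(v):
    """Scale a vector to unit length, keeping only its direction."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

def cosine_similarity(u, v):
    """Dot product of the normalised vectors: 1 means identical direction."""
    return sum(a * b for a, b in zip(normalise(u), normalise(v)))

# Hypothetical association estimates with (trait 1, trait 2).
weak_variant = [0.02, 0.02]       # small effects in equal proportion
strong_variant = [0.10, 0.10]     # same proportions, five times the magnitude
opposite_variant = [0.02, -0.02]  # same magnitude as weak_variant, other direction

d_same_direction = euclidean_distance(weak_variant, strong_variant)
d_diff_direction = euclidean_distance(weak_variant, opposite_variant)
cos_same_direction = cosine_similarity(weak_variant, strong_variant)
cos_diff_direction = cosine_similarity(weak_variant, opposite_variant)

print(d_same_direction, d_diff_direction)
print(cos_same_direction, cos_diff_direction)
```

In this toy example the Euclidean distance places the weak variant closer to the variant acting in a different direction than to the stronger variant acting in the same direction, whereas the directional comparison groups the weak and strong variants together.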

In this paper we introduce a novel procedure for clustering genetic variants based on their associations with a given set of traits to identify groups with common biological mechanisms. We develop the NAvMix (Noise-Augmented von Mises–Fisher Mixture model) clustering method, which extends a directional clustering approach to include a noise cluster as well as a data-driven method for choosing the number of clusters. The method is shown in a simulation study to perform well in identifying true clusters and to outperform alternative approaches across a range of scenarios. We further apply the procedure to cluster genetic variants associated with body mass index (BMI). We study the downstream effects of the different components of BMI on coronary heart disease (CHD) using Mendelian randomization, which uses genetic variants as instrumental variables to study potential causal effects of a risk factor on an outcome [10, 11]. We identify a BMI increasing cluster of variants associated with a favourable cardiometabolic profile and lower CHD risk. Analysis of the biological pathways which underlie each group of variants suggests that a key difference of this cluster compared with the others is its distinct effect on systemic inflammation. The clustering method demonstrated in this work is thus able to identify distinct pathways underlying complex traits, in turn highlighting specific mechanisms for therapeutic intervention.

Results

Overview of the proposed clustering approach

We use a mixture model approach to clustering, which supposes that each observation is a realisation from one of a fixed number of probability distributions. Since we are interested in clustering based on direction of association, we fit a mixture of von Mises–Fisher (vMF) distributions, each of which is characterised by the mean direction of the observations from the origin and a dispersion parameter. A mixture model of vMF distributions has previously been described by Banerjee et al. [12]. We augment this approach by including a noise cluster, in recognition of the fact that not all observed vectors of genetic variant-trait association estimates are expected to fit well within the set of specified distributions. The noise cluster will contain outliers to the specified model, providing robustness to the identification of clusters. Our method of clustering is thus to fit a Noise-Augmented von Mises–Fisher Mixture model (NAvMix).
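For reference, the standard form of the vMF density and a generic noise-augmented mixture can be written as follows. The notation is an assumption made here for illustration: $x$ is a unit vector in $\mathbb{R}^m$, $\mu$ is a unit mean direction, $\kappa \ge 0$ controls how tightly observations concentrate around $\mu$ (the parameter referred to above as the dispersion parameter), $I_\nu$ is the modified Bessel function of the first kind, and $f_0$ stands for the noise-cluster density, whose exact form is specified in the Methods rather than here.

```latex
\begin{align}
f(x;\mu,\kappa) &= C_m(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right),
&
C_m(\kappa) &= \frac{\kappa^{m/2-1}}{(2\pi)^{m/2}\,I_{m/2-1}(\kappa)}, \\
p(x) &= \pi_0\,f_0(x) + \sum_{k=1}^{K}\pi_k\,f(x;\mu_k,\kappa_k),
&
\sum_{k=0}^{K}\pi_k &= 1 .
\end{align}
```

Under this formulation, the probability that an observation belongs to a given component is that component's weighted density divided by $p(x)$.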

The NAvMix algorithm outputs a probability for each observation belonging to each cluster based on the given data. Each observation can then be assigned according to which cluster it has the highest probability of membership (referred to as hard clustering). The approach also provides the ability for soft clustering, which is where an observation is assigned to any cluster for which it has a probability of membership over a certain level, so that observations may belong to more than one cluster. Although the algorithm requires a fixed number of clusters to be specified, we repeat the procedure for varying numbers of clusters and then choose the final number using the Bayesian Information Criterion (BIC). Full details of the procedure are given in the Methods section.
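The post-processing steps described above can be sketched in a few lines. This is a minimal illustration, not the NAvMix implementation: the membership probabilities, log-likelihoods and parameter counts are hypothetical, the noise cluster is assumed to occupy column 0, and the BIC is written in its standard form where lower values are preferred (the Methods may use a different sign convention).

```python
import math

def hard_assign(probs):
    """Index of the most probable cluster for each observation."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

def soft_assign(probs, threshold):
    """All clusters whose membership probability exceeds `threshold`."""
    return [[k for k, p in enumerate(row) if p > threshold] for row in probs]

def bic(log_likelihood, n_params, n_obs):
    """Standard BIC; with this sign convention, lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical membership probabilities; columns are (noise, cluster 1, cluster 2).
probs = [
    [0.02, 0.90, 0.08],
    [0.05, 0.50, 0.45],
    [0.80, 0.15, 0.05],
]
hard = hard_assign(probs)
soft = soft_assign(probs, threshold=0.4)

# Hypothetical (log-likelihood, number of free parameters) for fits with K = 1, 2, 3.
fits = {1: (-310.0, 3), 2: (-250.0, 6), 3: (-248.0, 9)}
n_obs = 100
best_k = min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_obs))

print(hard, soft, best_k)
```

The second observation illustrates the difference between the two modes: hard clustering places it in cluster 1 only, while soft clustering at a threshold of 0.4 places it in both clusters 1 and 2.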

Let $\hat{\beta}_{j\cdot}$ be the vector of association estimates of genetic variant $j$ with the set of traits under consideration, and let $\hat{\Sigma}_j$ be the covariance matrix of this vector. We assume that the genetic variants are independent of each other (that is, no linkage disequilibrium). We also note that the association estimates do not need to have been taken in the same sample, so we can consider sets of associations between genetic variants and any trait for which corresponding GWAS summary statistics are available. Although it is possible to input the raw association estimates into the algorithm, we propose inputting the standardised association estimates, given by $\hat{\Sigma}_j^{-1/2}\hat{\beta}_{j\cdot}$ for the $j$th variant. The standardisation means that each element of the input vector is independent and has the same standard error. It thus is able to account for correlation between association estimates. Assuming all genetic associations are estimated with the same sample size for a given trait, this will not distort the direction vector. If there are significant differences between sample sizes used to estimate genetic associations for the same trait, and associations with different traits are on similar scales, the unstandardised association estimates may also be used, possibly as a sensitivity analysis. The first step in the algorithm is to transform each input vector to have a magnitude of one. This is done by dividing each vector by its Euclidean distance from the origin. We shall refer to this as normalisation. The normalised vectors represent the proportional association estimates.
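The standardisation and normalisation steps can be sketched for the two-trait case, where the inverse square root of a symmetric positive definite 2×2 matrix has a closed form. The function names and the numerical inputs below are hypothetical and chosen only for illustration; a general implementation for more than two traits would use an eigendecomposition instead.

```python
import math

def inv_sqrt_2x2(cov):
    """Inverse symmetric square root of a positive definite [[a, b], [b, d]].

    Uses sqrt(A) = (A + s I) / t with s = sqrt(det A), t = sqrt(trace A + 2 s),
    then inverts the 2x2 result. Returns (m11, m12, m22).
    """
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    s = math.sqrt(a * d - b * b)
    t = math.sqrt(a + d + 2.0 * s)
    r11, r12, r22 = (a + s) / t, b / t, (d + s) / t
    # det(sqrt(A)) equals s, so the inverse is the adjugate divided by s.
    return r22 / s, -r12 / s, r11 / s

def standardise(beta, cov):
    """Compute cov^{-1/2} beta for a two-trait association vector."""
    m11, m12, m22 = inv_sqrt_2x2(cov)
    return [m11 * beta[0] + m12 * beta[1], m12 * beta[0] + m22 * beta[1]]

def normalise(v):
    """Divide a vector by its Euclidean distance from the origin."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

# Hypothetical estimates for one variant and two traits.
beta_hat = [0.030, 0.010]
cov_hat = [[0.0001, 0.00002],
           [0.00002, 0.0004]]

z = standardise(beta_hat, cov_hat)  # standardised association estimates
x = normalise(z)                    # proportional association estimates
print(z, x)
```

When the off-diagonal covariance is zero, the standardisation reduces to dividing each association estimate by its standard error.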

The diagonal elements of the covariance matrices represent the variances of the genetic variant-trait association estimates. The off-diagonal elements represent the covariances between these estimates. If the genetic associations are estimated in separate samples for each trait, these covariances will be theoretically equal to zero. If the association estimates are taken from the same sample, the covariances will still be approximately zero if the traits are independent. If the traits are correlated, an estimate of this correlation is required to estimate the full covariance matrix in the one sample setting. This is easily computed using individual level data (Methods). If published GWAS summary statistics are being used, this information will not always be available. Nonetheless, the simulation study presented in the following section shows the clustering approach still performs well in the case where traits are truly correlated but the correlation estimates are set to zero.
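One common way to assemble such a covariance matrix from summary statistics is sketched below. It assumes, for illustration only, that the covariance between two association estimates for the same variant is approximated by the trait correlation multiplied by the two standard errors, which corresponds to fully overlapping samples; the Methods give the procedure actually used. The function name and inputs are hypothetical.

```python
def covariance_from_summary(std_errors, trait_corr=None):
    """Build a per-variant covariance matrix from standard errors.

    `trait_corr` is an m x m trait correlation matrix. If it is omitted, the
    off-diagonal entries are set to zero, as when associations come from
    separate samples or the correlation estimates are unavailable.
    """
    m = len(std_errors)
    if trait_corr is None:
        trait_corr = [[1.0 if a == b else 0.0 for b in range(m)] for a in range(m)]
    return [
        [trait_corr[a][b] * std_errors[a] * std_errors[b] for b in range(m)]
        for a in range(m)
    ]

# Hypothetical standard errors for one variant and a trait correlation of 0.3.
std_errors = [0.01, 0.02]
cov_independent = covariance_from_summary(std_errors)
cov_correlated = covariance_from_summary(std_errors, [[1.0, 0.3], [0.3, 1.0]])
print(cov_independent)
print(cov_correlated)
```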

Simulation results

We performed a simulation study in order to evaluate the performance of the proposed method and to compare it with alternative clustering approaches. We chose two methods for comparison. The first was to fit Gaussian mixture models to the standardised association estimates using the mclust algorithm in R [13]. The method was chosen for comparison because it is a model-based approach that is able to estimate the number of clusters by fitting multiple models and choosing between them using a principled model selection criterion. The second approach used for comparison was to fit Gaussian mixture models using the proportional association estimates. This is a case of model misspecification, since the association estimates after normalisation will not follow Gaussian distributions, even if the association estimates themselves do (see, for example, Fig 1). It thus demonstrates the result of applying a method for clustering based on Euclidean distance to proportional associations. Note that other R packages which implement a form of directional clustering were not used for comparison because they either do not allow for estimation of the number of clusters (for example, skmeans [14], which uses the spherical k-means algorithm) or do not incorporate a noise cluster (for example, movMF [15]), and so performance cannot easily be compared.

We simulated data for genetic variants across six scenarios, where the number of traits (denoted by m) was either 2 or 9 and the number of clusters (K) was either 1, 2 or 4. In each scenario, each of 80 genetic variants was associated with one of K latent factors, representing the different clusters. Each trait was a function of these latent factors, 20 additional noise genetic variants, and random variation of which a proportion, determined by the parameter γ, came from a shared unmeasured confounding variable. The γ = 0 case represents uncorrelated traits; however, it also proxies the scenario where the traits may be correlated but measured in separate, non-overlapping samples. Increasing values of γ therefore demonstrate the effect of increased trait correlation and/or sample overlap. We applied NAvMix in two ways. In the first, the off-diagonal entries of the covariance matrices were set to zero. In the second, the estimated trait correlation from individual level data was incorporated into the procedure, so the full estimated covariance matrices were used. In the primary simulation study presented here, the genetic variant-trait associations were estimated in a single sample of 20 000 individuals. S1 Text also presents the results of a simulation study where the sample sizes for each trait differed. Full details of the simulation parameters are given in the Methods section.

We evaluated the performance of each method using four measures: the adjusted Rand index; the silhouette coefficient; the mean number of clusters estimated; and the mean number of observations assigned to the noise cluster. The adjusted Rand index is a similarity measure between the true and estimated cluster memberships, and shows how well each method allocated the observations [16, 17]. The closer to 1, the closer the estimated cluster membership is to the truth. The silhouette for an observation is based on its closeness to other observations within its cluster and its separation from observations outside its cluster [18]. A higher value indicates that the observation fits well within its allocated cluster. We define the distance between two observations as the distance along the surface of the unit sphere after normalising, and we define the silhouette coefficient as the mean silhouette of all observations, with a higher silhouette coefficient indicating better formed clusters. Fig 2 shows boxplots of the adjusted Rand index for each method and scenario. Boxplots of the silhouette coefficients are shown in Fig A in S1 Text. Table 1 shows the mean number of clusters estimated and the mean size of the noise cluster for each method and scenario.
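The two agreement measures can be sketched as follows. This is an illustrative implementation of the standard definitions, using the distance along the unit sphere (the angle between normalised vectors) for the silhouette; it is not the code used in the study, and the labels and points are hypothetical. The silhouette function assumes at least two clusters and that the points have already been normalised.

```python
import math
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same observations."""
    def pairs(count):
        return count * (count - 1) // 2
    n = len(labels_true)
    sum_cells = sum(pairs(c) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(pairs(c) for c in Counter(labels_true).values())
    sum_cols = sum(pairs(c) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / pairs(n)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

def angular_distance(u, v):
    """Distance along the unit sphere between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(angular_distance(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(angular_distance(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in members.items() if other != label
        )
        scores.append(0.0 if max(a, b) == 0.0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical example: four points on the unit circle in two tight groups.
angles = [0.0, 0.1, math.pi / 2, math.pi / 2 + 0.1]
points = [[math.cos(t), math.sin(t)] for t in angles]
labels = [0, 0, 1, 1]

ari_relabelled = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_disagree = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
sil = silhouette_coefficient(points, labels)
print(ari_relabelled, ari_disagree, sil)
```

The adjusted Rand index is invariant to how the clusters are labelled, so swapping the two labels still gives a value of 1.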

Fig 2. Comparison of methods in the simulation study.


Boxplots of the adjusted Rand index for each scenario using NAvMix, NAvMix incorporating trait correlation estimates (cor), mclust, and mclust using proportional associations (pr).

Table 1. Mean number of clusters estimated and mean number of observations allocated to the noise cluster for each simulated scenario using NAvMix, NAvMix incorporating trait correlation estimates (cor), mclust, and mclust using proportional associations (pr).

The true number of variants in the noise cluster is 20.

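| γ | Number of traits (m) | Number of clusters (K) | Clusters estimated: NAvMix | Clusters estimated: NAvMix (cor) | Clusters estimated: mclust | Clusters estimated: mclust (pr) | Noise variants: NAvMix | Noise variants: NAvMix (cor) | Noise variants: mclust | Noise variants: mclust (pr) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1.00 | 1.01 | 1.19 | 7.09 | 19.88 | 19.93 | 18.17 | 8.93 |
| 0 | 2 | 2 | 2.00 | 2.00 | 2.07 | 8.08 | 17.63 | 17.66 | 16.79 | 6.61 |
| 0 | 2 | 4 | 3.66 | 3.67 | 3.45 | 8.35 | 14.09 | 13.95 | 13.53 | 6.55 |
| 0 | 9 | 1 | 1.42 | 1.29 | 3.41 | 1.52 | 23.83 | 24.77 | 19.69 | 25.98 |
| 0 | 9 | 2 | 2.04 | 2.03 | 4.99 | 2.09 | 26.17 | 26.46 | 19.88 | 28.34 |
| 0 | 9 | 4 | 4.17 | 4.11 | 4.19 | 4.09 | 24.93 | 25.68 | 19.34 | 28.39 |
| 0.4 | 2 | 1 | 1.00 | 1.00 | 1.20 | 6.93 | 20.18 | 20.41 | 18.11 | 9.55 |
| 0.4 | 2 | 2 | 2.00 | 2.00 | 2.06 | 8.07 | 17.63 | 17.62 | 16.75 | 6.75 |
| 0.4 | 2 | 4 | 3.66 | 3.61 | 3.47 | 8.32 | 13.10 | 15.71 | 13.81 | 6.54 |
| 0.4 | 9 | 1 | 1.56 | 1.14 | 3.30 | 1.73 | 24.00 | 26.86 | 19.41 | 26.03 |
| 0.4 | 9 | 2 | 2.08 | 2.03 | 4.33 | 2.21 | 26.59 | 27.40 | 19.15 | 28.71 |
| 0.4 | 9 | 4 | 4.18 | 4.02 | 2.88 | 4.09 | 25.65 | 27.39 | 18.37 | 28.93 |
| 0.8 | 2 | 1 | 1.01 | 1.01 | 1.22 | 6.52 | 21.20 | 22.27 | 18.18 | 10.95 |
| 0.8 | 2 | 2 | 2.00 | 2.00 | 2.04 | 8.01 | 17.91 | 17.86 | 16.50 | 6.68 |
| 0.8 | 2 | 4 | 3.79 | 3.33 | 3.38 | 8.13 | 12.12 | 22.60 | 12.70 | 7.80 |
| 0.8 | 9 | 1 | 1.97 | 1.13 | 1.11 | 2.17 | 23.85 | 27.04 | 19.28 | 25.40 |
| 0.8 | 9 | 2 | 3.49 | 2.04 | 1.98 | 4.22 | 24.52 | 27.01 | 18.42 | 25.67 |
| 0.8 | 9 | 4 | 4.44 | 4.00 | 2.34 | 5.60 | 26.90 | 27.12 | 18.68 | 28.07 |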

NAvMix performed very well in terms of allocating the observations to the correct clusters, with a median adjusted Rand index above that of the mclust approaches in nearly all scenarios. It similarly outperformed them with respect to the silhouette coefficient, and selected, on average, a number of clusters closer to the true number. The mclust algorithm tended to overestimate the number of clusters, particularly when there were no truly distinct clusters (that is, in the K = 1 scenarios). The exception was when the traits were highly correlated (with γ = 0.8), where NAvMix tended to select too many clusters. However, incorporating the trait correlation estimates in NAvMix improved the performance in these cases. Note that when K = 4, one of the clusters had only 10 genetic variants. Nonetheless, NAvMix still selected close to 4 clusters, on average, and had higher median adjusted Rand indices and silhouette coefficients than the mclust approaches. Other than in the scenarios with both a higher number of traits (m = 9) and high trait correlation (γ = 0.8), there was little difference in the results between using NAvMix with and without trait correlation estimates. This suggests that, unless there is substantial trait correlation or sample overlap, the procedure is robust to missing these estimates. Incorporating trait correlation becomes more important as the number of traits increases and the number of true clusters decreases. Finally, mclust tended to allocate fewer observations to the noise cluster than NAvMix, particularly in the lower dimensional (m = 2) settings.

We repeated the analysis on the same simulated datasets but where the genetic variants were filtered such that only those which associated with at least one trait at genome-wide significance were included. This greatly improved the performance of NAvMix in the highly correlated trait scenarios (see Figs B and C and Table A in S1 Text). In the simulation scenarios where the sample sizes differed, the results were similar to those of the primary simulation study (see Figs D and E and Table B in S1 Text). In these scenarios, the sample sizes differed by up to a factor of five, suggesting that the procedure is robust to reasonably large differences in sample sizes used for each trait.

Clustering BMI associated genetic variants

We applied our procedure to cluster BMI associated genetic variants identified by the GWAS of Pulit et al. [19]. We considered genetic variants associated with BMI at a p-value < $5 \times 10^{-8}$ and pruned at $r^2 < 0.001$. The clustering was performed in relation to the genetic associations with nine traits: body fat percentage; systolic blood pressure (SBP); triglycerides; high-density lipoprotein cholesterol (HDL); educational attainment; physical activity; lifetime smoking score; waist-to-hip ratio (WHR); and type 2 diabetes. These are lifestyle or cardiometabolic traits which have previously been shown to be related to BMI and which may offer insight into the pathways to downstream effects of BMI such as CHD [20, 21]. The genetic association estimates with these traits were all obtained from publicly available GWAS summary statistics (Methods). We clustered the 539 genetic variants that were available across all datasets. The full list of genetic variants and their allocated cluster, along with their probabilities of membership for each cluster, is given in S1 Table.

Five clusters were identified, with 1 genetic variant allocated to the noise cluster. Fig 3 shows a heat map of the proportional genetic association estimates with each trait by cluster and Fig 4 plots the means of each fitted vMF distribution, representing the proportional associations for an observation at the centre of each cluster. The largest four clusters, labelled Clusters 1–4, contain genetic variants with very similar positive average proportional associations with fat percentage, WHR and type 2 diabetes. Variants in Cluster 3 have close to zero average association with SBP, whereas those in Clusters 1, 2, and 4 have positive average association with SBP. Variants in Cluster 2 have close to zero average association with smoking, whereas those in Clusters 1, 3 and 4 have positive average association with smoking. Variants in Cluster 4 have positive average association with HDL and negative average association with triglycerides, in contrast with those in Clusters 1–3.

Fig 3. Heat map showing the association estimates of the BMI associated genetic variants with each trait by cluster.


The association estimates were first standardised by dividing by their standard errors, then normalised so that the vectors of association estimates for each variant have magnitude one. Thus, the values shown represent the proportional association estimates for each genetic variant on the set of traits. The value in parentheses underneath each cluster label is the number of variants in the respective cluster.

Fig 4. Parallel plot of the mean vector of the fitted von Mises–Fisher distribution for each cluster.


The plotted points represent the standardised proportional association with each trait for an observation at the centre of each cluster.

Cluster 5 contains 20 genetic variants. These variants, on average, are positively associated with HDL and negatively associated with SBP, triglycerides, WHR and type 2 diabetes. These variants also have close to zero average association with smoking, physical activity and education, as well as weaker positive association with fat percentage compared with the other four clusters.

Mendelian randomization estimates of the effect of BMI on CHD

Mendelian randomization using as instruments 94 genetic variants identified by Locke et al. [22] has previously suggested that BMI has a positive causal effect on CHD risk [23]. We applied two-sample Mendelian randomization [24] using as instruments the set of BMI associated genetic variants which were used for clustering, as well as separately using the sets of variants for each cluster in turn (Methods). As well as applying the inverse-variance weighted (MR-IVW) method [25], we also performed as sensitivity analyses the MR-Median method [26], the Contamination Mixture (MR-ConMix) method [27] and the MR-PRESSO method [28]. Each of these methods provides a valid test for the causal null hypothesis under different sets of assumptions (Methods).

Fig 5 shows scatterplots of the genetic association estimates with BMI against their association estimates with CHD risk for each set of instruments considered, as well as the results of the Mendelian randomization analyses. When using the full set of genetic variants as instruments, the results suggest a positive effect of increased BMI on CHD risk, with an estimated odds ratio (OR) from MR-IVW of 1.50 (95% confidence interval of 1.40–1.62) per 1 standard deviation increase in genetically predicted BMI. All sensitivity analyses gave similar estimates. This is in line with the results of Larsson et al. [23]. A similar result was obtained using the largest two clusters, with an estimated OR of 1.83 (1.68–2.00) using Cluster 1 and of 1.54 (1.38–1.72) using Cluster 2. When using the Cluster 3 genetic variants as instruments, the estimate attenuated toward the null, with an estimated OR of 1.22 (0.99–1.50). When using Cluster 4 genetic variants as instruments, there was no evidence that increased BMI is associated with CHD risk, with an estimated OR of 0.94 (0.69–1.29). When using Cluster 5 genetic variants as instruments, the results suggest a decrease in CHD risk from increased BMI, with an estimated OR of 0.34 (0.19–0.64). Note that the MR-Egger intercept test [29] did not show evidence of directional pleiotropy in any of these analyses (see Table C in S1 Text).
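The inverse-variance weighted estimate and its conversion to an odds ratio can be illustrated with a simplified sketch. The version below is a fixed-effect IVW estimator with first-order weights, which is an assumption made for brevity; the analysis reported here may differ (for example by using a random-effects model), as described in the Methods. The summary statistics are hypothetical and constructed so that every variant implies the same ratio.

```python
import math

def mr_ivw(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate from per-variant summary statistics.

    Each variant contributes the ratio beta_outcome / beta_exposure, weighted
    by beta_exposure^2 / se_outcome^2.
    """
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    estimate = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    std_error = math.sqrt(1.0 / sum(weights))
    return estimate, std_error

def odds_ratio_with_ci(log_or, std_error, z=1.96):
    """Convert a log odds ratio and its standard error to an OR with 95% CI."""
    return (
        math.exp(log_or),
        math.exp(log_or - z * std_error),
        math.exp(log_or + z * std_error),
    )

# Hypothetical associations with BMI (SD units) and CHD (log odds ratio).
beta_bmi = [0.02, 0.03, 0.05]
beta_chd = [0.008, 0.012, 0.020]
se_chd = [0.004, 0.005, 0.006]

estimate, std_error = mr_ivw(beta_bmi, beta_chd, se_chd)
or_point, or_lower, or_upper = odds_ratio_with_ci(estimate, std_error)
print(estimate, std_error, or_point, or_lower, or_upper)
```

Because all three hypothetical ratios equal 0.4, the IVW estimate is 0.4 on the log odds scale, corresponding to an odds ratio of about 1.49 per standard deviation increase in genetically predicted BMI.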

Fig 5. Results from the Mendelian randomization analyses of the effect of BMI on CHD.


Scatterplots are of the associations of each genetic variant with BMI (standard deviation units) and the log odds ratio of CHD risk. The slopes of the dotted lines are the MR-IVW estimates for the respective cluster. Forest plots show the estimates and 95% confidence intervals from Mendelian randomization, for all genetic variants and for each cluster. Mendelian randomization estimates represent the change in odds ratio of CHD risk per 1 standard deviation increase in genetically predicted BMI. The dotted lines indicate an odds ratio of 1.

Exploring the biological pathways of clusters of BMI associated variants

We conducted gene set analysis on the BMI associated variants using the Functional Mapping and Annotation Platform [30] in order to examine the biological pathways relating to each cluster. The variants were mapped to genes based on positional and eQTL mappings, which were in turn tested for enrichment in gene sets from various pathway databases (Methods). A number of distinct patterns emerge: Cluster 1 variants are associated with pathways related to cell division and differentiation; Cluster 3 variants with pathways related to cellular signalling; Cluster 4 variants with pathways related to lipid metabolism; and Cluster 5 variants with pathways related to inflammation. Cluster 2 variants were not found to be significantly enriched with any of the tested pathways. The full set of pathways associated with the mapped genes is given in S2 Table.

The role of Cluster 5 variants in inflammation is of particular interest given its proposed relation to favourable adiposity. In order to confirm the role of these variants in inflammation, we conducted a Mendelian randomization analysis to examine the association of genetically predicted BMI, using all variants and each cluster separately, with C-reactive protein (CRP), a measure of systemic inflammation (Methods). The results from the MR-IVW method are shown in Fig 6. When using all variants as instruments, MR-IVW estimated an increase in CRP of 0.44 standard deviations (95% confidence interval of 0.38–0.50) per standard deviation increase in genetically predicted BMI. The results when using Clusters 1–4 as instruments were in line with this. However, there was no evidence that the component of BMI predicted by Cluster 5 variants is associated with CRP (MR-IVW estimate of 0.01, 95% confidence interval of -0.24–0.27). These findings were supported in sensitivity analyses (see Fig F in S1 Text).

Fig 6. Results from the Mendelian randomization analyses of the effect of BMI on CRP.


MR-IVW estimates and 95% confidence intervals of the association of genetically predicted BMI with CRP, for all genetic variants and for each cluster. The estimates represent the change in CRP in standard deviation units per 1 standard deviation increase in genetically predicted BMI. The dotted line indicates no association between genetically predicted levels of CRP and BMI.

To further explore the pathways by which the various clusters affect inflammation, we performed separate Mendelian randomization analyses with the 41 cytokines and growth factors studied by Ahola-Olli et al. [31] and Kalaoja et al. [32] as outcomes (see Table D in S1 Text for the full list of cytokines and growth factors considered). Fig 7 shows the MR-IVW estimates for each cluster and outcome. There was evidence of variation in the effects of BMI predicted by Cluster 5 variants on the cytokines compared with the effects of BMI predicted by the other clusters. For a number of inflammatory traits, such as hepatocyte growth factor (HGF) and TNF-related apoptosis inducing ligand (TRAIL), BMI predicted by Cluster 5 variants showed a weaker association than the other clusters. In some cases, such as for monocyte chemotactic protein-1 (MCP1), the MR-IVW estimates using Cluster 5 variants were in the opposite direction to the other clusters. These results were supported in sensitivity analyses (see S3 Table).

Fig 7. Results from the Mendelian randomization analyses of the effect of BMI on cytokines and growth factors.


MR-IVW estimates (expressed as Z-scores, i.e. estimate divided by its standard error) for the association of genetically predicted BMI with 41 cytokines and growth factors. Values denoted with * have a p-value less than 0.05/41.
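The significance marking described in this caption can be sketched as follows. The two-sided p-value is computed from the Z-score under a normal approximation, which is an assumption made here for illustration, and the estimates passed to the helper function are hypothetical.

```python
import math

N_OUTCOMES = 41                          # cytokines and growth factors tested
BONFERRONI_THRESHOLD = 0.05 / N_OUTCOMES

def two_sided_p_value(z_score):
    """Two-sided p-value for a Z-score under a standard normal reference."""
    return math.erfc(abs(z_score) / math.sqrt(2.0))

def is_starred(estimate, std_error):
    """True if the estimate passes the multiple-testing threshold 0.05/41."""
    z_score = estimate / std_error
    return two_sided_p_value(z_score) < BONFERRONI_THRESHOLD

print(BONFERRONI_THRESHOLD)
print(two_sided_p_value(1.96))
print(is_starred(0.4, 0.1), is_starred(0.2, 0.1))
```

With 41 outcomes the threshold is roughly 0.0012, so an estimate needs a Z-score of a little over 3.2 in absolute value to be marked.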

Discussion

In this paper we have presented a procedure for clustering genetic variants based on their associations with a given set of traits using the NAvMix method. The method uses a directional clustering algorithm to distinguish between genetic variants based on their proportional associations with the traits. Since it is a model-based clustering approach, it has many advantages over current methods that are employed for clustering genetic variants based on trait associations, such as a data-driven method for choosing the number of clusters and the ability to use soft clustering. The inclusion of a noise cluster provides robustness to outliers, offering greater confidence in the identified clusters. A simulation study showed the method performs well in a range of settings, and that it outperformed alternative clustering approaches in assigning observations based on proportional associations. Importantly, the method did not identify false positive clusters in the simulation setting when no true clusters existed in the data, in contrast to the other methods considered.

The application to clustering BMI associated genetic variants identified five clusters, suggesting that genetic predictors of BMI can be broken down into five separate mechanisms based on their associations with the traits considered. Interestingly, variants in Clusters 1 and 2 were similar in their average associations across each of the traits considered with the exception of smoking, where Cluster 2 had close to zero association. One possible explanation for this is that these variants differ according to some addictive behaviour related mechanism. However, no such pathways were identified in the gene set analysis for Cluster 1. This suggests that some other mechanism may be driving this change, although further analysis is required to identify what this may be.

Mendelian randomization analyses provided evidence that the different pathways affecting BMI have different downstream effects on CHD risk. When using as instruments the set of genetic variants in Clusters 1 and 2, the Mendelian randomization estimate of BMI on CHD risk was positive, in line with the established overall effect of increased BMI. When using as instruments the set of variants in Cluster 3, the estimate was still positive but attenuated towards the null. The main difference between this cluster and Clusters 1 and 2 is that the variants do not, on average, associate with increased SBP. Previous evidence suggests that increased SBP is a downstream consequence of increased BMI [33], and SBP has also been shown to have a causal effect on CHD [27]. Our results therefore support that the genetically predicted component of BMI that does not associate with increased SBP has a lower positive effect on CHD risk. However, there is still evidence of a positive causal effect, suggesting there are other mechanisms by which increased BMI may increase CHD risk [34].

When using as instruments the set of genetic variants in Cluster 4, which have average associations with increased HDL and decreased triglycerides, Mendelian randomization suggested there was no association with CHD risk. Furthermore, the Mendelian randomization estimate of the component of BMI predicted by the variants in Cluster 5 was negative. That is, in Cluster 5, we have identified genetic variants related to a BMI increasing pathway that is protective of CHD. Orientating to the BMI-increasing alleles, these genetic variants are associated with a favourable metabolic profile, namely increased HDL and decreased SBP, triglycerides, WHR and type 2 diabetes liability.

By analysing the biological pathways underpinning the different clusters, we found evidence supporting that the heterogeneity between the effects of the different components of BMI on cardiovascular risk may be related to inflammation. Furthermore, our findings identify possible inflammatory pathways related to elevated BMI that represent therapeutic targets for preventing CHD. Specifically, the estimated effects of Cluster 5 variants, in contrast to the BMI increasing variants more generally, are consistent with lower levels of key inflammatory cytokines implicated in CHD pathogenesis, including HGF [35], MCP1 [36] and TRAIL [37]. By ameliorating the increased inflammation attributable to elevated BMI, its detrimental effects on CHD risk may also be mitigated.

A number of studies have previously sought to identify genetic variants associated with metabolically favourable adiposity. Huang et al. [38] conducted pairwise significance tests between adiposity traits and various other cardiometabolic traits to identify genetic variants which, for at least one such pairing, associate with an increase in the adiposity trait and a decrease in the cardiometabolic trait. A similar approach to identifying genetic variants associated with favourable adiposity has also been performed by Yaghootkar et al. [39]. Our approach differs from these in that our clusters are formed without using genetic associations with the risk factor or outcome of interest, in this case BMI and CHD, but rather in relation to the chosen traits. Therefore, any difference between clusters in their associations with CHD risk is a meaningful statistical test, rather than a difference driven by the clustering algorithm.

The proposed approach has some limitations. It uses as input the full covariance matrix of the genetic variant-trait associations. If it is assumed that the traits are uncorrelated or that the genetic variant-trait associations are estimated in separate samples, then these matrices can be easily constructed from the standard errors of the genetic association estimates which are typically available from published GWAS results. In practice, it is unlikely that the entire set of traits will be uncorrelated, since they would typically be related at least via common association with the primary trait of interest. We have shown how the full covariance matrices can be estimated using estimates of the trait correlations, either from individual level data or from a reference dataset. Furthermore, the simulation study suggested that, unless the traits are highly correlated with each other, the method is robust to ignoring the genetic variant-trait association correlations. This also suggests that the approach is robust to some participant overlap in the samples. If the traits are highly correlated, there is significant sample overlap, and individual level data are not available, there exist methods to estimate the correlation between genetic associations using summary level data. One approach is to use the intercept term from cross-trait LD score regression [40]. Another is to estimate the correlation between genetic association estimates using only variants which are assumed to not be associated with the traits [41].

Another limitation is that the results are dependent on the choice of traits used to cluster on. Domain knowledge should be used to select a set of traits which are believed to be informative of potential mechanisms of the genetic variants under consideration. Future research will look to extend the method to include feature selection [42], so that the inclusion of a moderate to large number of traits, many of which may not distinguish between clusters, is possible. It should be noted that adding highly correlated traits does not add much extra information, and may impact the results if correlation estimates are not incorporated. Thus, if there are a number of traits of interest which are highly correlated, it is better to choose just one of them.

In the applied example, the genetic variants used for clustering were chosen on the basis of their association with a primary trait of interest, in this case BMI. This resulted in a fairly large number of variants to cluster, in part because of the very large sample size of the GWAS in which these associations were estimated. Other traits of interest may not have so many independent variants associated with them at genome-wide significance. A low number of variants may make it more difficult to find true clusters if the cluster sizes are small. Nonetheless, there are many traits for which, say, 100 or more variants have been found to associate, and this will only grow as GWAS sample sizes increase. Furthermore, the simulation results showed that our clustering approach is still generally able to detect relatively small clusters, with clusters as small as 10 variants out of 100 in total in some settings. In the case where there are only a very small number of variants associated with the primary trait of interest, we would recommend lowering the threshold for inclusion below genome-wide significance rather than including correlated variants. Genetic variants which are not independent would be expected to associate similarly with the given traits, and so it would not be informative to include these.

In conclusion, we have presented a procedure for clustering genetic variants based on their direction of association with relevant traits, in order to gain insight into their underlying biological mechanisms and pathways. We have demonstrated the utility of clustering genetic variants in this way by applying the method to BMI associated genetic variants and performing Mendelian randomization analyses to infer the differential effects of distinct BMI increasing pathways on CHD risk.

Methods

The von Mises–Fisher distribution

The m-dimensional von Mises–Fisher (vMF) distribution has probability density function

$$f(x \mid \mu, \kappa) = C_m(\kappa)\, e^{\kappa \mu^T x},$$

where ‖x‖ = ‖μ‖ = 1 and $C_m(\kappa)$ is a normalising constant given by

$$C_\nu(x) = \frac{x^{\nu/2-1}}{(2\pi)^{\nu/2}\, I_{\nu/2-1}(x)},$$

where Iν(x) is the modified Bessel function of the first kind and order ν [12, 43]. The mean parameter μ is a unit vector which represents the direction from the origin in m-dimensional space. The concentration parameter κ represents the spread of observations around the mean. When κ = 0, the distribution is the uniform distribution on the (m − 1)-dimensional unit sphere. As κ increases, the distribution becomes increasingly focused around the point on the unit sphere given by μ.
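As an illustration of the density above, the following is a minimal sketch that evaluates the vMF log-density using only the Python standard library. The function names `log_bessel_i` and `vmf_log_density` are hypothetical and are not taken from the authors' software; the Bessel function is computed from its power series in log space, which is an implementation choice made here for self-containment rather than the method used in the paper.

```python
import math


def log_bessel_i(nu, x):
    """Log of the modified Bessel function of the first kind, I_nu(x), for x > 0.

    Uses the series I_nu(x) = sum_k (x/2)^(2k+nu) / (k! * Gamma(k+nu+1)),
    accumulated in log space so that large x does not overflow.
    """
    log_half_x = math.log(x / 2.0)
    log_terms = []
    largest = -math.inf
    for k in range(100000):
        t = (2 * k + nu) * log_half_x - math.lgamma(k + 1) - math.lgamma(k + nu + 1)
        log_terms.append(t)
        if t > largest:
            largest = t
        elif t < largest - 50.0:
            break  # terms rise then fall in k, so the remaining tail is negligible
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def vmf_log_density(x, mu, kappa):
    """Log-density of the m-dimensional vMF distribution at unit vector x (kappa > 0)."""
    m = len(x)
    nu = m / 2.0 - 1.0
    # log C_m(kappa) = (m/2 - 1) log(kappa) - (m/2) log(2 pi) - log I_{m/2-1}(kappa)
    log_c = nu * math.log(kappa) - (m / 2.0) * math.log(2 * math.pi) - log_bessel_i(nu, kappa)
    return log_c + kappa * sum(a * b for a, b in zip(mu, x))
```

Working with the log-density rather than the density itself is a common way to avoid overflow when κ is large. The sketch requires κ > 0, which is compatible with fixing the noise concentration at a small positive value.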

The noise-augmented von Mises–Fisher mixture model

Suppose we have m-dimensional observations {x1, …, xn} where ‖xj‖ = 1 for all j (if the observations are not normalised to have magnitude 1, then this normalisation is the first step in the procedure). Here, xj represents the vector of proportional association estimates for genetic variant j with the m traits. That is, if standardised genetic association estimates are being used, the vector $\hat{\Sigma}_j^{-1/2}\hat{\beta}_{j\cdot}$ is normalised to have magnitude 1. Further suppose that each observation either belongs to one of K clusters, each cluster containing observations from a vMF distribution, or else belongs to none of these clusters and is therefore considered noise. We can represent this with the K + 1 component vMF mixture model given by

$$p(x_j \mid \Theta) = \sum_{k=1}^{K+1} p(x_j, z_j = k \mid \mu_k, \kappa_k) = \sum_{k=1}^{K+1} \pi_k f(x_j \mid \mu_k, \kappa_k)$$

for the jth observation, where:

  • $\Theta = \{\mu_1, \ldots, \mu_K, \kappa_1, \ldots, \kappa_K, \pi_1, \ldots, \pi_{K+1}\}$;

  • $z = \{z_1, \ldots, z_n\}$ denotes cluster membership (that is, $z_j = k$ if $x_j$ belongs to cluster k);

  • $\pi_k$ is the mixing proportion of cluster k, with $\sum_{k=1}^{K+1} \pi_k = 1$;

  • $f(x \mid \mu, \kappa)$ is the density function of the m-dimensional vMF distribution;

  • $\mu_{K+1}$ is the unit vector which is fixed according to the global sample mean direction, given by
    $$\mu_{K+1} = \frac{\sum_{j=1}^{n} x_j}{\left\| \sum_{j=1}^{n} x_j \right\|};$$

  • $\kappa_{K+1}$ is fixed at a number close to zero (for example 0.0001).

In this model, cluster K + 1 is referred to as the noise cluster. With κK+1 close to zero, the density function approximates the uniform distribution on the (m − 1)-dimensional unit sphere, and so observations which do not fit well to the other K clusters will tend to be assigned here. Note that, since the noise cluster is uniformly distributed, the value of μK+1 is arbitrary, and we choose the global sample mean for convenience. A uniform distribution is commonly used for the noise cluster in Gaussian mixture models [44], and our model gives a directional analogue of this approach. Alternative approaches to incorporating a noise component into Gaussian mixture models have also been proposed [45–47]. Although beyond the scope of the present work, different noise distributions for NAvMix could be explored by changing the density of component K + 1.
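The normalisation step and the fixed noise-cluster direction can be sketched as follows. This is a minimal illustration with hypothetical helper names (`normalise`, `mean_direction`), not the authors' code. It also shows why the clustering is said to act on proportional associations: two variants whose association vectors differ only in overall magnitude map to the same point on the unit sphere.

```python
import math


def normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def mean_direction(points):
    """Unit vector in the direction of the sum of the observations.

    In the model above, this is the value at which the noise cluster mean is fixed.
    """
    m = len(points[0])
    total = [sum(p[i] for p in points) for i in range(m)]
    return normalise(total)


# Two illustrative (made-up) association vectors with the same proportions
# but different magnitudes give the same direction.
weak = normalise([0.02, 0.04])
strong = normalise([0.10, 0.20])
```

Here `weak` and `strong` coincide up to floating point error, so a directional method treats the two variants as acting in the same way, whereas a distance-based method applied to the raw estimates could separate them.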

The log-likelihood function is

$$l_K(\Theta) = \sum_{j=1}^{n} \log\left\{ \sum_{k=1}^{K+1} \pi_k f(x_j \mid \mu_k, \kappa_k) \right\}.$$

In order to maximise the likelihood function to obtain estimates of the parameters Θ, we would require knowledge of the latent variables z. Mixture models of this sort are thus fitted using the EM algorithm [48].

The EM algorithm

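Suppose we have an estimate of Θ, denoted by $\hat{\Theta}$. Let $Q(\Theta \mid \hat{\Theta}) = E_{z \mid X, \hat{\Theta}}\, l_K(\Theta)$. Then

$$Q(\Theta \mid \hat{\Theta}) = \sum_{j=1}^{n} \sum_{k=1}^{K+1} \gamma_{jk} \log\left\{ \pi_k f(x_j \mid \mu_k, \kappa_k) \right\},$$

where

$$\gamma_{jk} = \Pr(z_j = k \mid x_j, \hat{\Theta}) = \frac{\pi_k f(x_j \mid \mu_k, \kappa_k)}{\sum_{l=1}^{K+1} \pi_l f(x_j \mid \mu_l, \kappa_l)}, \quad k = 1, \ldots, K+1.$$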

Computing the γjk for a given Θ^ is the E step in the EM algorithm.
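A minimal sketch of the E step is given below. It assumes the component log-densities have already been evaluated, and the function name `e_step` and its argument layout are hypothetical. Each row of responsibilities is obtained by weighting the component densities by their mixing proportions and normalising over all K + 1 components, so that the responsibilities for each observation sum to one.

```python
import math


def e_step(log_dens, pis):
    """Compute responsibilities gamma[j][k].

    log_dens[j][k] is the log-density of observation j under component k
    (the last column being the noise component); pis are the mixing proportions.
    """
    gammas = []
    for row in log_dens:
        log_w = [math.log(p) + ld for p, ld in zip(pis, row)]
        top = max(log_w)  # subtract the maximum before exponentiating, for stability
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        gammas.append([v / total for v in w])
    return gammas
```

Subtracting the largest log-weight before exponentiating leaves the ratios unchanged and guards against underflow when the concentration parameters are large.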

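Given the γjk, we can estimate Θ by maximising $Q(\Theta \mid \hat{\Theta})$. Following Banerjee et al. [12], the parameter estimates are obtained from

$$\hat{\mu}_k = \frac{\sum_{j=1}^{n} \gamma_{jk} x_j}{\left\| \sum_{j=1}^{n} \gamma_{jk} x_j \right\|}, \quad k = 1, \ldots, K,$$

$$\frac{I_{m/2}(\hat{\kappa}_k)}{I_{m/2-1}(\hat{\kappa}_k)} = \frac{\left\| \sum_{j=1}^{n} \gamma_{jk} x_j \right\|}{\sum_{j=1}^{n} \gamma_{jk}}, \quad k = 1, \ldots, K, \tag{1}$$

$$\hat{\pi}_k = \frac{1}{n} \sum_{j=1}^{n} \gamma_{jk}, \quad k = 1, \ldots, K+1.$$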

This is the M step of the EM algorithm. Note that we do not update the noise cluster parameters, $\mu_{K+1}$ and $\kappa_{K+1}$, but we do update the proportion of observations which are assigned to the noise cluster, $\hat{\pi}_{K+1}$. Now, (1) does not give a closed form solution for computing $\hat{\kappa}_k$. However, a number of methods for approximating these solutions have been proposed which allow the concentration parameter estimates to be easily updated. Banerjee et al. [12] proposed the approximation

$$\hat{\kappa}_k = \frac{\bar{r}_k m - \bar{r}_k^3}{1 - \bar{r}_k^2},$$

where

$$\bar{r}_k = \frac{\left\| \sum_{j=1}^{n} \gamma_{jk} x_j \right\|}{\sum_{j=1}^{n} \gamma_{jk}}.$$

Hornik and Grün [15] summarise several other approximation methods and provide software for implementing each of them. Note that, in practice, values of $\bar{r}_k$ very close to 1 can cause numerical problems, since this corresponds to the case where the observations are almost all at the same point and the concentration is thus close to infinity. To get around this, we cap the value that $\hat{\kappa}_k$ can take at 500.
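The M step updates, with the approximation of Banerjee et al. and the cap of 500 described above, can be sketched as follows. The names `banerjee_kappa` and `m_step` are hypothetical, and the sketch assumes every non-noise cluster has non-zero total responsibility and a non-zero weighted sum (otherwise a division by zero would occur, which a fuller implementation would need to handle).

```python
import math

KAPPA_CAP = 500.0  # cap on the concentration estimate, as stated in the text


def banerjee_kappa(r_bar, m, cap=KAPPA_CAP):
    """Approximate concentration estimate from the mean resultant length r_bar."""
    if r_bar >= 1.0:
        return cap  # all weight at a single point; the formula would divide by zero
    return min((r_bar * m - r_bar ** 3) / (1.0 - r_bar ** 2), cap)


def m_step(points, gammas, n_clusters):
    """Update (mus, kappas, pis) from responsibilities.

    points: n unit vectors of length m; gammas: n rows of length n_clusters + 1,
    with the last column belonging to the noise cluster. The noise cluster's
    mean and concentration are not updated, only its mixing proportion.
    """
    n, m = len(points), len(points[0])
    mus, kappas = [], []
    for k in range(n_clusters):
        weighted = [sum(g[k] * p[i] for g, p in zip(gammas, points)) for i in range(m)]
        norm = math.sqrt(sum(a * a for a in weighted))
        weight = sum(g[k] for g in gammas)
        mus.append([a / norm for a in weighted])
        kappas.append(banerjee_kappa(norm / weight, m))
    pis = [sum(g[k] for g in gammas) / n for k in range(n_clusters + 1)]
    return mus, kappas, pis
```

As a worked value, with m = 3 and a mean resultant length of 0.5 the approximation gives (1.5 − 0.125)/(1 − 0.25) ≈ 1.83.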

The EM algorithm can be started at either the E step, given an initial estimate of Θ, or at the M step, given initial values of the γjk. The algorithm is iterated until the absolute value of the difference between successive values of $l_K(\hat{\Theta})$ is less than some predefined convergence threshold. In our simulation study and applied example, we used $10^{-4}$ as the convergence threshold.

Initialisation of the algorithm

In order to initialise the algorithm, we must first set an initial proportion of observations which belong in the noise cluster, which we will denote by $0 < \hat{\pi}_{K+1}^{(0)} < 1$. We then perform the spherical k-means procedure [14], which clusters observations based on similarity of their direction from the origin, analogous to the k-means procedure which clusters observations based on Euclidean distance. We take as initial values, for i = 1, …, n,

$$\gamma_{ik} = \begin{cases} 1 - \hat{\pi}_{K+1}^{(0)}, & \text{if observation } i \text{ is assigned to cluster } k, \\ 0, & \text{otherwise,} \end{cases} \quad k = 1, \ldots, K,$$

$$\gamma_{i(K+1)} = \hat{\pi}_{K+1}^{(0)}.$$

We then begin the EM algorithm at the M step. Note that the spherical k-means procedure relies on an initial random set of cluster means, and thus its results are sensitive to this randomisation. There is a possibility that certain initial values from the procedure will result in the EM algorithm converging to a local, rather than global, maximum. We therefore run the algorithm a number of times in practice, each time beginning with different initial values. We take as final parameter estimates those which result in the EM algorithm converging to the greatest maximum. In our simulation study and applied example, we ran the algorithm with 5 different initialisations.
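The construction of the initial responsibilities from a hard clustering can be sketched as below. The function name is hypothetical, and the labels are assumed to come from some spherical k-means routine (not shown) and to be coded 0, …, K − 1; the noise cluster occupies the last column.

```python
def initial_responsibilities(labels, n_clusters, noise_prop=0.05):
    """Build initial gamma values from hard cluster labels.

    labels[i] in {0, ..., n_clusters - 1} is the cluster assigned to observation i.
    Each row gives weight 1 - noise_prop to the assigned cluster, 0 to the
    other clusters, and noise_prop to the noise cluster.
    """
    gammas = []
    for label in labels:
        row = [0.0] * (n_clusters + 1)
        row[label] = 1.0 - noise_prop
        row[n_clusters] = noise_prop
        gammas.append(row)
    return gammas
```

Each row sums to one by construction, so the result can be passed directly to an M step. Repeating this with labels from several random starts and keeping the fit with the greatest maximised log-likelihood mirrors the multiple-initialisation strategy described above.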

Choosing the number of clusters

In practice, we will not know the number of clusters to fit to the data. The number of clusters can be determined using an information criterion, for example BIC [44, 49]. For successive values of K, we perform the algorithm above and compute

$$\phi_m(K) = -2\, l_K(\hat{\Theta}) + r_m(K) \log(n),$$

where $r_m(K) = (m + 2)K + m$ is the number of parameters estimated. We continue until $\phi_m(K)$ increases for successive iterations. The final number of clusters is then taken to be $\arg\min_K \phi_m(K)$.
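The selection rule can be sketched as follows. The function names are hypothetical, `fit` stands in for a routine that returns the maximised log-likelihood for a given K, and two details are assumptions made for this sketch rather than statements from the text: the search starts at K = 1, and it stops at the first increase in the criterion.

```python
import math


def n_parameters(m, n_clusters):
    """Parameter count r_m(K) = (m + 2)K + m, as given in the text."""
    return (m + 2) * n_clusters + m


def bic(log_lik, m, n_clusters, n_obs):
    """Criterion phi_m(K) = -2 * log-likelihood + r_m(K) * log(n)."""
    return -2.0 * log_lik + n_parameters(m, n_clusters) * math.log(n_obs)


def choose_n_clusters(fit, m, n_obs, max_clusters=20):
    """Increase K until the criterion stops decreasing; return the best (K, value).

    fit(K) must return the maximised log-likelihood of the K-cluster model.
    """
    best_k, best_value = None, math.inf
    for k in range(1, max_clusters + 1):
        value = bic(fit(k), m, k, n_obs)
        if value >= best_value:
            break  # assumption: stop at the first increase
        best_k, best_value = k, value
    return best_k, best_value
```

For example, with m = 9 traits and K = 5 clusters the penalty counts (9 + 2) × 5 + 9 = 64 parameters.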

Assigning cluster membership

Output from the procedure for fitting the mixture model is a set of probabilities for each observation belonging to each cluster (that is, the γik parameters). The simplest approach for assigning cluster membership is to assign each observation to the cluster for which it has the greatest probability of membership (that is, $\hat{z}_i = \arg\max_k \gamma_{ik}$). This is the approach used in both the simulation study and the applied example presented in this paper.

Mixture model approaches to clustering allow for flexibility in the way that cluster membership is assigned. For increased confidence in the clusters, a threshold could be set such that an observation is only assigned to a cluster if the probability of membership is greater than a certain level. Those which do not meet the threshold for any cluster remain unassigned. Finally, soft clustering is possible, whereby observations are assigned to any cluster for which its probability of membership is greater than a certain level. Under the soft clustering approach, an observation may be assigned to more than one cluster.
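The three assignment rules described above can be sketched as follows. The function names and the thresholds used in the example are hypothetical, and clusters are indexed from zero.

```python
def hard_assign(gamma_row):
    """Index of the cluster with the greatest membership probability."""
    return max(range(len(gamma_row)), key=lambda k: gamma_row[k])


def threshold_assign(gamma_row, threshold):
    """Most probable cluster, or None if its probability does not exceed the threshold."""
    k = hard_assign(gamma_row)
    return k if gamma_row[k] > threshold else None


def soft_assign(gamma_row, threshold):
    """All clusters whose membership probability exceeds the threshold."""
    return [k for k, g in enumerate(gamma_row) if g > threshold]


# A made-up observation split fairly evenly between the first two clusters.
probs = [0.45, 0.40, 0.15]
assignments = (hard_assign(probs), threshold_assign(probs, 0.5), soft_assign(probs, 0.3))
```

For this observation, hard assignment picks the first cluster, a 0.5 threshold leaves it unassigned, and soft clustering with a 0.3 threshold places it in both of the first two clusters.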

Genetic variant-trait association covariance matrix

For variant j, the (k, l)th element of $\hat{\Sigma}_j$ is given by

$$\mathrm{se}(\hat{\beta}_{jk})\, \mathrm{se}(\hat{\beta}_{jl})\, \mathrm{cor}(\hat{\beta}_{jk}, \hat{\beta}_{jl}),$$

where $\mathrm{se}(\hat{\beta}_{jk})$ is the standard error of $\hat{\beta}_{jk}$. If the genetic variant-trait associations are estimated in separate, non-overlapping, samples, then $\mathrm{cor}(\hat{\beta}_{jk}, \hat{\beta}_{jl}) = 0$ and $\hat{\Sigma}_j$ can be taken to be the diagonal matrix with kth diagonal entry equal to $\mathrm{se}^2(\hat{\beta}_{jk})$. If the associations are estimated in the same sample, then the off-diagonal entries of $\hat{\Sigma}_j$ will be non-zero. Although the correlation between $\hat{\beta}_{jk}$ and $\hat{\beta}_{jl}$ is not easily estimated, provided the jth genetic variant explains only a small proportion of the variance in the kth and lth traits, then $\mathrm{cor}(\hat{\beta}_{jk}, \hat{\beta}_{jl}) \approx \mathrm{cor}(X_k, X_l)$, where $X_k$ and $X_l$ are the kth and lth traits, respectively [50]. We can therefore compute the (k, l)th entry of $\hat{\Sigma}_j$, $k \neq l$, by

$$\mathrm{se}(\hat{\beta}_{jk})\, \mathrm{se}(\hat{\beta}_{jl})\, \widehat{\mathrm{cor}}(X_k, X_l),$$

where $\widehat{\mathrm{cor}}(X_k, X_l)$ is an estimate of the correlation between $X_k$ and $X_l$. As a result of this, if the traits are assumed to be independent, then the off-diagonal entries of $\hat{\Sigma}_j$ can be approximated by zeros, and the covariance matrix taken to be diagonal as in the separate samples case.
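A minimal sketch of this construction is shown below, with hypothetical function names and made-up inputs. The second helper covers only the diagonal case, where multiplying by the inverse square root of the covariance matrix reduces to dividing each estimate by its standard error; the general case would require a matrix square root, which is not shown.

```python
import math


def association_covariance(se, trait_cor):
    """Covariance matrix for one variant's association estimates.

    se[k] is the standard error of the association with trait k and
    trait_cor[k][l] is the estimated correlation between traits k and l
    (with ones on the diagonal).
    """
    m = len(se)
    return [[se[k] * se[l] * trait_cor[k][l] for l in range(m)] for k in range(m)]


def standardised_direction_diagonal(beta, se):
    """Unit vector of standardised estimates when the covariance is diagonal."""
    z = [b / s for b, s in zip(beta, se)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]
```

With standard errors 0.01 and 0.02 and a trait correlation of 0.3, the off-diagonal entry is 0.01 × 0.02 × 0.3 = 6 × 10⁻⁵; setting the correlation to zero recovers the diagonal matrix of squared standard errors.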

Simulation study

We simulated n = 100 independent genetic variants for N = 20000 individuals, denoted Gij for individual i and genetic variant j, and m traits, denoted Xil for individual i and trait l, from the following model

$$
\begin{aligned}
\mathrm{maf}_j &\sim \mathrm{Uniform}(0.01, 0.5) \\
G_{ij} &\sim \mathrm{Binomial}(2, \mathrm{maf}_j) \\
U_i, \varepsilon_{i1}, \ldots, \varepsilon_{im} &\sim N(0, 1), \text{ independently} \\
L_{ik} &= \sum_{j \in n(k)} \beta_{jk} G_{ij} \\
X_{il} &= \sum_{k=1}^{K} \delta_{kl} L_{ik} + \sum_{j \in n(K+1)} \alpha_j G_{ij} + \gamma U_i + \sqrt{1 - \gamma^2}\, \varepsilon_{il},
\end{aligned}
$$

for i = 1, …, N and l = 1, …, m. The variables L1, …, LK are latent factors which represent K different mechanisms by which the genetic variants act on the observed traits X1, …, Xm, with n(k) indexing the variants which are associated with Lk. The variants indexed by n(K+1) are those in the noise cluster. These variants act directly on the traits and do not associate with any of the latent factors. The common variable Ui induces correlation between the traits, with the amount of correlation determined by γ. The relationship between the genetic variants in the kth cluster and the other variables is illustrated in the directed acyclic graph in Fig G in S1 Text. The number of traits was either m = 2 or 9 and we set γ = 0, 0.4 or 0.8. The first 80 variants were split into 1, 2 or 4 clusters, with the remaining 20 variants considered to be noise. For the K = 2 scenarios, each cluster contained 40 variants. For the K = 4 scenarios, the cluster sizes were 30, 20, 20 and 10.

We generated the βjk values such that most of the genetic variants were weakly associated with the traits, while a relatively small number of them were associated more strongly. For each k, and for each j ∈ n(k), with probability 1 − ϕ, ϕ ∼ Uniform(0.05, 0.2), βjk was generated from the Uniform(0.03, 0.06) distribution (which results in a p-value, on average, below the genome-wide significance level), and with probability ϕ from the N(0.1, 0.02²) distribution. For j ∉ n(k), βjk was set to zero. The αj values were generated from the Uniform(−0.1, 0.1) distribution for j ∈ n(K+1), and set to zero otherwise.

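When m = 2, δkl was set to the (k, l)th element of the matrices

$$
\begin{pmatrix} 1 & 1 \end{pmatrix}, \quad
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
\begin{pmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \\ -1 & -1 \end{pmatrix},
$$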

for the 1, 2 and 4 cluster scenarios, respectively. When m = 9, δkl was set to the (k, l)th element of the matrices

$$
\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix},
$$

$$
\begin{pmatrix}
1 & 1 & 1 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 \\
1 & -1 & -1 & 0.5 & 0.5 & 0.5 & 0.5 & -0.5 & -0.5
\end{pmatrix},
$$

$$
\begin{pmatrix}
1 & 1 & 1 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 & 0.5 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
-1 & -1 & -1 & 0.5 & 0.5 & 0.5 & 0.5 & -0.5 & -0.5 \\
-1 & -1 & -1 & -1 & -1 & 0 & 0 & 0 & 0
\end{pmatrix},
$$

for the 1, 2 and 4 cluster scenarios, respectively. These values determine the direction and relative magnitude of association between the genetic variants in each cluster and the traits. For example, in the m = 2, K = 2 scenario, one cluster contains variants which are positively associated with both traits, whereas the other cluster contains variants that are positively associated with trait 1 and negatively associated with trait 2. The parametrisation of the αj, βjk and δkl parameters is such that the proportion of variance of each trait explained by the genetic variants was approximately 5–10%.
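The data-generating model can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' simulation code: the function names are hypothetical, the mixing probability ϕ is drawn once per cluster (the text does not say whether it is drawn per cluster or per variant), and traits are generated one individual at a time for readability rather than speed.

```python
import math
import random


def simulate_effects(cluster_sizes, n_noise, rng):
    """Return (beta, alpha): beta[j][k] are latent-factor effects, alpha[j] noise effects."""
    n_clusters = len(cluster_sizes)
    n = sum(cluster_sizes) + n_noise
    beta = [[0.0] * n_clusters for _ in range(n)]
    alpha = [0.0] * n
    j = 0
    for k, size in enumerate(cluster_sizes):
        phi = rng.uniform(0.05, 0.2)  # assumption: one phi per cluster
        for _ in range(size):
            if rng.random() < phi:
                beta[j][k] = rng.gauss(0.1, 0.02)     # stronger association
            else:
                beta[j][k] = rng.uniform(0.03, 0.06)  # weaker association
            j += 1
    for j in range(n - n_noise, n):
        alpha[j] = rng.uniform(-0.1, 0.1)  # noise variants act directly on the traits
    return beta, alpha


def simulate_individual(beta, alpha, maf, delta, gamma, rng):
    """Simulate genotypes and the m traits for one individual."""
    n, n_clusters, m = len(beta), len(delta), len(delta[0])
    # Binomial(2, maf) genotype as the sum of two Bernoulli draws
    g = [sum(rng.random() < maf[j] for _ in range(2)) for j in range(n)]
    latent = [sum(beta[j][k] * g[j] for j in range(n)) for k in range(n_clusters)]
    direct = sum(alpha[j] * g[j] for j in range(n))
    u = rng.gauss(0.0, 1.0)  # shared term inducing correlation between traits
    traits = [
        sum(delta[k][l] * latent[k] for k in range(n_clusters))
        + direct + gamma * u + math.sqrt(1.0 - gamma ** 2) * rng.gauss(0.0, 1.0)
        for l in range(m)
    ]
    return g, traits


rng = random.Random(2022)
beta, alpha = simulate_effects([40, 40], 20, rng)
maf = [rng.uniform(0.01, 0.5) for _ in range(100)]
delta = [[1, 1], [1, -1]]  # the m = 2, K = 2 scenario
genotypes, traits = simulate_individual(beta, alpha, maf, delta, 0.4, rng)
```

Repeating `simulate_individual` for each individual and regressing each trait on each variant would then give the association estimates to be clustered.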

The estimated genetic variant-trait associations were computed using simple linear regression of each trait on each genetic variant in turn. The resulting datasets were clustered using NAvMix with an initial proportion of genetic variants in the noise cluster of 0.05, and using mclust with an initial noise cluster of 5 randomly selected genetic variants.

A supplementary simulation study was also performed where the sample size differed for each trait. Each sample size was randomly chosen to be between 10000 and 50000. The results of this supplementary simulation study are presented in S1 Text.

Clustering BMI associated genetic variants

Genetic variant association estimates with BMI were taken from the GWAS of Pulit et al. [19]. Variants with p-value < 5 × 10⁻⁸ were pruned using the TwoSampleMR package in R [51] with r² = 0.001.

Genetic variant association estimates with body fat percentage, SBP, triglycerides and HDL were taken from results from the Neale Lab which are based on the UK Biobank dataset (http://www.nealelab.is/uk-biobank/). Genetic variant associations for educational attainment were taken from the GWAS of Okbay et al. [52]; for physical activity, the GWAS of Doherty et al. [53]; for lifetime smoking score, the GWAS of Wootton et al. [54]; for WHR the GWAS of Pulit et al. [19]; and for type 2 diabetes, the GWAS of Mahajan et al. [6]. Note that for the educational attainment dataset, one BMI associated genetic variant (rs10761785) was replaced with a proxy (rs2163188) with r² = 0.9842 (identified using PhenoScanner [55, 56]). All studies used were performed on samples of individuals of European ancestry or predominantly European ancestry. All genetic variant-trait association estimates were orientated with respect to the alleles such that the associations with BMI were positive. Table E in S1 Text shows the sample sizes for each study as well as the number of the BMI associated genetic variants which associate with each trait at the genome-wide significance level.

Clustering was performed using NAvMix with an initial proportion of genetic variants in the noise cluster of 0.05, and 5 separate initialisations of the algorithm were used. The probability of membership of each genetic variant to each cluster produced by the algorithm is shown in S1 Table.

Mendelian randomization analyses

A genetic variant is a valid instrumental variable for a Mendelian randomization analysis if it is: associated with the risk factor; independent of any confounders of the risk factor-outcome relationship; and has no causal pathway to the outcome other than via the risk factor [57]. Under the two-sample framework, the genetic variant-risk factor and genetic variant-outcome associations are estimated in separate samples [24]. Under the assumption that all variants in the analysis are valid instruments, MR-IVW produces a statistically consistent estimator of the causal effect and a test for the causal null hypothesis [25]. The three methods used for sensitivity analyses were chosen since they each produce a valid estimate of the causal effect of BMI on CHD under different assumptions [58]: MR-Median (a majority of the genetic variants are valid instruments); the Contamination Mixture method (a plurality of the genetic variants are valid instruments); and the MR-PRESSO method (the InSIDE assumption is met). The intercept test from the MR-Egger method was used to test for the presence of unmeasured directional pleiotropy. Analyses were carried out using the MendelianRandomization [59, 60] and MRPRESSO [28] packages.
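For readers unfamiliar with the inverse-variance weighted estimate, the following sketch shows its standard fixed-effect form computed from summary statistics, together with the Z-score (estimate divided by its standard error). It is not the authors' analysis code, which used the R packages named above; the function name is hypothetical, and the package implementations offer further options (such as random-effects models) that are not reproduced here.

```python
import math


def ivw_estimate(bx, by, se_by):
    """Fixed-effect inverse-variance weighted MR estimate from summary statistics.

    bx: variant-risk factor associations; by: variant-outcome associations;
    se_by: standard errors of the variant-outcome associations.
    Returns (estimate, standard error, Z-score).
    """
    # Each variant contributes the ratio by/bx, weighted by bx^2 / se_by^2
    weights = [x * x / (s * s) for x, s in zip(bx, se_by)]
    ratios = [y / x for x, y in zip(bx, by)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    return estimate, se, estimate / se
```

Applying such a function separately to the variants in each cluster gives cluster-specific estimates of the kind compared in the main analysis.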

Genetic variant association estimates with CHD were taken from the CARDIoGRAMplusC4D dataset of Nikpay et al. [61] and accessed using PhenoScanner [55, 56]. Genetic variant associations with CRP were taken from results from the Neale Lab which are based on the UK Biobank dataset (http://www.nealelab.is/uk-biobank/). Genetic variant association estimates with the 41 cytokines and growth factors were taken from the data supporting Ahola-Olli et al. [31] and Kalaoja et al. [32]. Table F in S1 Text gives a list of the BMI associated genetic variants which were not available in each of the outcome datasets and were therefore excluded from the relevant Mendelian randomization analyses.

Gene mapping and gene set analysis

The 539 BMI associated genetic variants were mapped to genes using the SNP2GENE function in FUMA [30]. Summary statistics for each cluster of variants were uploaded separately, and were identified as pre-defined lead SNPs. Both positional and eQTL mapping were performed. For the eQTL mapping, tissue types were selected as all those from the following sources: EQTL catalogue; PsychENCODE; van der Wijst et al. scRNA eQTLs; DICE; eQTLGen; Blood eQTLs; MuTHER; xQTLServer; CommonMind Consortium; BRAINEAC; and GTEx v8. All other default settings were used. Gene set analysis was performed using the GENE2FUNC function. The results presented in S2 Table include all canonical pathways from MSigDB, as well as gene ontology processes, which associate with the mapped genes using hypergeometric tests (with multiple test correction applied per cluster).

Supporting information

S1 Text. Additional simulation results and supplementary information for the simulation study and applied example.

(PDF)

S1 Table. Allocated cluster and probability of membership to each cluster for each BMI associated genetic variant.

(XLSX)

S2 Table. List of canonical pathways and gene ontology processes associated with the mapped genes for each cluster of BMI associated genetic variants.

(XLSX)

S3 Table. Results from Mendelian randomization sensitivity analyses of the effect of BMI on cytokines and growth factors.

Estimates and 95% confidence intervals from MR-Median, the Contamination Mixture method (MR-ConMix) and MR-PRESSO for the association of genetically predicted BMI with 41 cytokines and growth factors.

(XLSX)

Data Availability

All the data used in this paper are publicly available. Summary statistics for genetic associations with traits were downloaded from: https://zenodo.org/record/1251813#.X8drUF7gquP (BMI and WHR); http://www.nealelab.is/uk-biobank/ (body fat percentage, SBP, triglycerides, HDL and CRP); https://www.thessgac.org/data (educational attainment); https://ora.ox.ac.uk/objects/uuid:ff479f44-bf35-48b9-9e67-e690a2937b22 (physical activity); https://data.bris.ac.uk/data/dataset/10i96zb8gm0j81yz0q6ztei23d (lifetime smoking score); http://diagram-consortium.org/downloads.html (T2D); http://www.phenoscanner.medschl.cam.ac.uk/ (CHD); https://data.bris.ac.uk/data/dataset/3g3i5smgghp0s2uvm1doflkx9x (cytokines and growth factors). R code for performing the NAvMix clustering algorithm, and for reproducing the simulation results and applied analysis, can be found at https://github.com/aj-grant/navmix.

Funding Statement

AJG and SB are supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (grant number 204623/Z/16/Z). DG is supported by the British Heart Foundation Research Centre of Excellence (RE/18/4/34215) at Imperial College London and a National Institute for Health Research Clinical Lectureship (CL-2020-16-001) at St. George’s, University of London. PDWK is supported by the UK Medical Research Council (MC_UU_00002/13). This research was funded by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. For the purpose of open access, the author has applied a CC-BY public copyright licence to any Author Accepted Manuscript version arising from this submission. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS discovery: Biology, function, and translation. Am J Hum Genet. 2017;101(1):5–22. doi: 10.1016/j.ajhg.2017.06.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Winkler TW, Günther F, Höllerer S, Zimmermann M, Loos RJ, Kutalik Z, et al. A joint view on genetic variants for adiposity differentiates subtypes with distinct metabolic implications. Nat Commun. 2018;9(1):1946. doi: 10.1038/s41467-018-04124-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Udler MS, Kim J, von Grotthuss M, Bonàs-Guarch S, Cole JB, Chiou J, et al. Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: A soft clustering analysis. PLoS Med. 2018;15(9):1–23. doi: 10.1371/journal.pmed.1002654 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Dimas AS, Lagou V, Barker A, Knowles JW, Mägi R, Hivert MF, et al. Impact of type 2 diabetes susceptibility variants on quantitative glycemic traits reveals mechanistic heterogeneity. Diabetes. 2014;63(6):2158–2171. doi: 10.2337/db13-0949 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Scott RA, Scott LJ, Mägi R, Marullo L, Gaulton KJ, Kaakinen M, et al. An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes. 2017;66(11):2888–2902. doi: 10.2337/db16-1253 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Mahajan A, Wessel J, Willems SM, Zhao W, Robertson NR, Chu AY, et al. Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes. Nat Genet. 2018;50(4):559–571. doi: 10.1038/s41588-018-0084-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Ruth KS, Day FR, Tyrrell J, Thompson DJ, Wood AR, Mahajan A, et al. Using human genetics to understand the disease impacts of testosterone in men and women. Nat Med. 2020;26(2):252–258. doi: 10.1038/s41591-020-0751-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Tanigawa Y, Li J, Justesen JM, Horn H, Aguirre M, DeBoever C, et al. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nat Commun. 2019;10(1):4064. doi: 10.1038/s41467-019-11953-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Yaghootkar H, Scott RA, White CC, Zhang W, Speliotes E, Munroe PB, et al. Genetic evidence for a normal-weight “metabolically obese” phenotype linking insulin resistance, hypertension, coronary artery disease, and type 2 diabetes. Diabetes. 2014;63(12):4369–4377. doi: 10.2337/db14-0318 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003;32(1):1–22. doi: 10.1093/ije/dym289 [DOI] [PubMed] [Google Scholar]
  • 11. Lawlor DA, Harbord RM, Sterne JAC, Timpson N, Davey Smith G. Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Stat Med. 2008;27(8):1133–1163. doi: 10.1002/sim.3034 [DOI] [PubMed] [Google Scholar]
  • 12. Banerjee A, Dhillon IS, Ghosh J, Sra S. Clustering on the unit hypersphere using von Mises-Fisher distributions. J Mach Learn Res. 2005;6(46):1345–1382. [Google Scholar]
  • 13. Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016;8(1):289–317. doi: 10.32614/RJ-2016-021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Dhillon IS, Modha DS. Concept decompositions for large sparse text data using clustering. Mach Learn. 2001;42(1):143–175. doi: 10.1023/A:1007612920971 [DOI] [Google Scholar]
  • 15. Hornik K, Grün B. movMF: An R package for fitting mixtures of von Mises-Fisher distributions. J Stat Softw. 2014;58(10):1–31. doi: 10.18637/jss.v058.i10 [DOI] [Google Scholar]
  • 16. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356 [DOI] [Google Scholar]
  • 17. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218. doi: 10.1007/BF01908075 [DOI] [Google Scholar]
  • 18. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7 [DOI] [Google Scholar]
  • 19. Pulit SL, Stoneman C, Morris AP, Wood AR, Glastonbury CA, Tyrrell J, et al. Meta-analysis of genome-wide association studies for body fat distribution in 694Â 649 individuals of European ancestry. Hum Mol Genet. 2019;28(1):166–174. doi: 10.1093/hmg/ddy327 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Van Gaal LF, Mertens IL, De Block CE. Mechanisms linking obesity with cardiovascular disease. Nature. 2006;444(7121):875–880. doi: 10.1038/nature05487 [DOI] [PubMed] [Google Scholar]
  • 21. Davies NM, Dickson M, Davey Smith G, van den Berg GJ, Windmeijer F. The causal effects of education on health outcomes in the UK Biobank. Nat Hum Behav. 2018;2(2):117–125. doi: 10.1038/s41562-017-0279-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197–206. doi: 10.1038/nature14177 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Larsson SC, Bäck M, Rees JMB, Mason AM, Burgess S. Body mass index and body composition in relation to 14 cardiovascular conditions in UK Biobank: a Mendelian randomization study. Eur Heart J. 2019;41(2):221–226. doi: 10.1093/eurheartj/ehz388 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Burgess S, Scott RA, Timpson NJ, Davey Smith G, Thompson SG, EPIC-InterAct Consortium. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors. Eur J Epidemiol. 2015;30(7):543–552. doi: 10.1007/s10654-015-0011-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Burgess S, Butterworth A, Thompson SG. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol. 2013;37(7):658–665. doi: 10.1002/gepi.21758 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Bowden J, Davey Smith G, Haycock PC, Burgess S. Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet Epidemiol. 2016;40(4):304–314. doi: 10.1002/gepi.21965 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Burgess S, Foley CN, Allara E, Staley JR, Howson JMM. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nat Commun. 2020;11:376. doi: 10.1038/s41467-019-14156-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Verbanck M, Chen CY, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat Genet. 2018;50(5):693–698. doi: 10.1038/s41588-018-0099-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Bowden J, Davey Smith G, Burgess S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol. 2015;44(2):512–525. doi: 10.1093/ije/dyv080 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8(1):1826. doi: 10.1038/s41467-017-01261-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Ahola-Olli AV, Würtz P, Havulinna AS, Aalto K, Pitkänen N, Lehtimäki T, et al. Genome-wide association study identifies 27 loci influencing concentrations of circulating cytokines and growth factors. Am J Hum Genet. 2017;100:40–50. doi: 10.1016/j.ajhg.2016.11.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Kalaoja M, Corbin LJ, Tan VY, Ahola-Olli AV, Havulinna AS, Santalahti K, et al. The role of inflammatory cytokines as intermediates in the pathway from increased adiposity to disease. Obesity. 2021;29(2):428–437. doi: 10.1002/oby.23060 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Marini S, Merino J, Montgomery BE, Malik R, Sudlow CL, Dichgans M, et al. Mendelian randomization study of obesity and cerebrovascular disease. Ann Neurol. 2020;87(4):516–524. doi: 10.1002/ana.25686 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Gill D, Zuber V, Dawson J, Pearson-Stuttard J, Carter AR, Sanderson E, et al. Risk factors mediating the effect of body mass index and waist-to-hip ratio on cardiovascular outcomes: Mendelian randomization analysis. International Journal of Obesity. 2021;45(7):1428–1438. doi: 10.1038/s41366-021-00807-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Morishita R, Aoki M, Yo Y, Ogihara T. Hepatocyte growth factor as cardiovascular hormone: Role of HGF in the pathogenesis of cardiovascular disease. Endocr J. 2002;49(3):273–284. doi: 10.1507/endocrj.49.273 [DOI] [PubMed] [Google Scholar]
  • 36. Georgakis MK, Gill D, Rannikmäe K, Traylor M, Anderson CD, MEGASTROKE consortium of the International Stroke Genetics Consortium (ISGC), et al. Genetically determined levels of circulating cytokines and risk of stroke. Circulation. 2019;139(2):256–268. doi: 10.1161/CIRCULATIONAHA.118.035905 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Bernardi S, Bossi F, Toffoli B, Fabris B. Roles and clinical applications of OPG and TRAIL as biomarkers in cardiovascular disease. BioMed Res Int. 2016;2016:1752854. doi: 10.1155/2016/1752854 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Huang LO, Rauch A, Mazzaferro E, Preuss M, Carobbio S, Bayrak CS, et al. Genome-wide discovery of genetic loci that uncouple excess adiposity from its comorbidities. Nat Metab. 2021;3(2):228–243. doi: 10.1038/s42255-021-00346-2 [DOI] [PubMed] [Google Scholar]
  • 39. Yaghootkar H, Lotta LA, Tyrrell J, Smit RAJ, Jones SE, Donnelly L, et al. Genetic evidence for a link between favorable adiposity and lower risk of type 2 diabetes, hypertension, and heart disease. Diabetes. 2016;65(8):2448–2460. doi: 10.2337/db15-1671 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh PR, et al. An atlas of genetic correlations across human diseases and traits. Nature Genetics. 2015;47(11):1236–1241. doi: 10.1038/ng.3406 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Ray D, Boehnke M. Methods for meta-analysis of multiple traits using GWAS summary statistics. Genetic Epidemiology. 2018;42(2):134–145. doi: 10.1002/gepi.22105 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Law MH, Jain AK, Figueiredo MAT. Feature selection in mixture-based clustering. In: Adv Neural Inf Process Syst. vol. 15; 2003. p. 641–648. [Google Scholar]
  • 43. Mardia KV, Jupp P. Directional statistics. Chichester: John Wiley & Sons; 2000. [Google Scholar]
  • 44. Banfield JD, Raftery AE. Model-Based Gaussian and non-Gaussian clustering. Biometrics. 1993;49(3):803–821. doi: 10.2307/2532201 [DOI] [Google Scholar]
  • 45. Hennig C, Coretto P. The noise component in model-based cluster analysis. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, editors. Data analysis, machine learning and applications. Berlin, Heidelberg: Springer; 2008. p. 127–138. [Google Scholar]
  • 46. Coretto P, Hennig C. Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research. 2017;18:1–39. [Google Scholar]
  • 47. Crook OM, Mulvey CM, Kirk PDW, Lilley KS, Gatto L. A Bayesian mixture modelling approach for spatial proteomics. PLoS Comput Biol. 2018;14(11):1–29. doi: 10.1371/journal.pcbi.1006516 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol. 1977;39(1):1–22. [Google Scholar]
  • 49. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6(2):461–464. doi: 10.1214/aos/1176344136 [DOI] [Google Scholar]
  • 50. Sanderson E, Spiller W, Bowden J. Testing and correcting for weak and pleiotropic instruments in two-sample multivariable Mendelian randomization. Stat Med. 2021;40(25):5434–5452. doi: 10.1002/sim.9133 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Hemani G, Zheng J, Elsworth B, Wade KH, Haberland V, Baird D, et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife. 2018;7:e34408. doi: 10.7554/eLife.34408 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, et al. Genome-wide association study identifies 74 loci associated with educational attainment. Nature. 2016;533(7604):539–542. doi: 10.1038/nature17671 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Doherty A, Smith-Byrne K, Ferreira T, Holmes MV, Holmes C, Pulit SL, et al. GWAS identifies 14 loci for device-measured physical activity and sleep duration. Nat Commun. 2018;9(1):5257. doi: 10.1038/s41467-018-07743-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Wootton RE, Richmond RC, Stuijfzand BG, Lawn RB, Sallis HM, Taylor GMJ, et al. Evidence for causal effects of lifetime smoking on risk for depression and schizophrenia: a Mendelian randomisation study. Psychol Med. 2020;50(14):2435–2443. doi: 10.1017/S0033291719002678 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Staley JR, Blackshaw J, Kamat MA, Ellis S, Surendran P, Sun BB, et al. PhenoScanner: a database of human genotype–phenotype associations. Bioinformatics. 2016;32(20):3207–3209. doi: 10.1093/bioinformatics/btw373 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Kamat MA, Blackshaw JA, Young R, Surendran P, Burgess S, Danesh J, et al. PhenoScanner V2: an expanded tool for searching human genotype–phenotype associations. Bioinformatics. 2019;35:4851–4853. doi: 10.1093/bioinformatics/btz469 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Greenland S. An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000;29(4):722–729. doi: 10.1093/ije/29.4.722 [DOI] [PubMed] [Google Scholar]
  • 58. Slob EAW, Burgess S. A comparison of robust Mendelian randomization methods using summary data. Genet Epidemiol. 2020;44(4):313–329. doi: 10.1002/gepi.22295 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Yavorska OO, Burgess S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. Int J Epidemiol. 2017;46(6):1734–1739. doi: 10.1093/ije/dyx034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Broadbent JR, Foley CN, Grant AJ, Mason AM, Staley JR, Burgess S. MendelianRandomization v0.5.0: updates to an R package for performing Mendelian randomization analyses using summarized data [version 2; peer review: 1 approved, 2 approved with reservations]. Wellcome Open Res. 2020;5(252). doi: 10.12688/wellcomeopenres.16374.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Nikpay M, Goel A, Won HH, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1000 Genomes–based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 2015;47(10):1121–1130. doi: 10.1038/ng.3396 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

David Balding, Michael P Epstein

17 Jul 2021

Dear Dr Grant,

Thank you very much for submitting your Research Article entitled 'Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Michael P. Epstein

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: The manuscript by Grant and colleagues proposes a novel method for clustering genetic variants based on their associations with multiple related traits. The advantage of cross-trait analysis, compared with single-trait analysis, is that it can potentially partition variants underlying a trait of interest into multiple clusters/patterns, each associated with a distinct mechanism. The manuscript proposes a novel approach to variant clustering, based on directions of variant effects on multiple traits. This is an interesting and well-motivated idea and is a timely contribution to this field. Applying this method, NAvMix, the authors report some interesting findings in BMI and related traits.

Major comments

(1) One major issue of cross-trait analysis is overlapping samples between different studies. This is potentially a serious problem, as it would induce correlations of variant effects across traits, a pattern potentially mis-identified as true clusters. The proposed strategy to deal with this seems inadequate. Most often, individual-level data are not available, and according to the authors, NAvMix will simply set the off-diagonal terms in the covariance matrix to 0, i.e. ignore the possible correlations. The justification is that this is not a problem in simulations. However, it's possible that in the simulation setting, the correlations induced by shared samples are too weak compared with the correlation of true effect sizes. As a result, ignoring induced correlations does not change the results much. This may not be the case, however, in the real data.

My suggestion is: better justification using more extensive simulations. In particular, it would be helpful to vary the relative importance of true effect size correlations vs. correlation induced by shared samples. It would be helpful, for example, to vary how much phenotypic variation is explained by genetic variants vs. by the shared underlying factors ($U_i$, in simulations). Related to this point: it would be helpful to better explain some of the parameter settings in simulations. For example, effect sizes are drawn from certain distributions. Are the parameters of these distributions realistic?

Another suggestion: it is possible to estimate correlation of effect sizes between traits, due to shared samples, using cross-trait LD score regression [PMID: 26414676]. That paper uses full GWAS summary statistics to estimate genetic correlation, while accounting for shared samples. It shows that the "residual" correlation due to shared samples is a simple function of phenotypic correlations, the sample sizes of the two studies, and the number of shared samples. In fact, it can directly estimate this residual correlation. It may be possible to directly use these estimates (at an appropriate scale) in NAvMix.

(2) The noise cluster presumably increases the robustness of the method. However, in the real data, only 1 out of >500 BMI variants is assigned to the noise cluster. One possible explanation is that: the mean of the noise cluster is the global mean direction. This may be too similar to the major clusters (clusters 1 and 2 in the BMI results). As a result, not many variants would be assigned to the noise cluster. It seems to make sense to use a broader distribution (uniform at some scale) for the noise cluster.

(3) The simulation procedure largely follows the model of NAvMix. So perhaps unsurprisingly, NAvMix performs well under those simulations. A more realistic, or biologically motivated, simulation would give a better idea of the performance of NAvMix. The authors could explicitly simulate a few "latent factors", which act on the observed phenotypes. Then each variant could either act on these latent factors, or act on some traits directly. I think if each trait acts only on one factor at a time, then this reduces to the variant clustering problem. The advantage of this simulation regime is that one could easily vary some of the parameters and assess whether NAvMix is robust to these changes. For example, is the method robust to direct effects of variants (i.e. variants acting on a trait without affecting any of the factors)?

(4) The main difference between cluster 1 and cluster 2 is that cluster 1 variants are associated with smoking, whereas cluster 2 variants are not. I found this result interesting/puzzling. Given the smoking association, I would imagine that cluster 1 may capture some "behavior" component of BMI. Indeed, it was reported that heritability of obesity is mostly enriched in brain, and it is not hard to imagine that smoking variants act through some kind of addictive behavior. However, the pathway analysis of cluster 1 does not suggest anything related to behavior/brain. One concern is that the results may be driven by sample sharing between the BMI and smoking GWAS, though it's not clear to me if this would lead to different clusters. In any case, it would be helpful to check this; it could be done using the intercept term of cross-trait LDSC, as mentioned above. In fact, it would be good to check the correlations driven by shared samples for all pairs of traits.

Another relevant question is: has the BMI GWAS been adjusted for smoking, and vice versa? Is it possible that the associations of BMI variants with smoking are driven by some kind of collider bias? This could happen when one has a causal model like this: SNP -> smoking <- addictive behavior -> BMI. Adjusting for smoking in a GWAS of BMI may lead to a false association between SNP and BMI.

(5) The authors suggest that cluster 5 variants represent some kind of "favorable adiposity" that reduces inflammatory cytokines and hence the CHD risk. This is an attractive model. However, I would caution against over-interpretation. The data basically say that cluster 5 variants have effects on inflammatory cytokines. It's unclear that this effect is mediated through favorable/protective adiposity. It's possible that these variants have some pleiotropic effects on both BMI and cytokines, without adiposity playing any role in the regulation of these cytokines. To rigorously make this claim, I'd imagine some mediation type of analysis, where one controls for BMI or some measure of favorable adiposity, and shows that the cluster 5 variants would no longer be associated with the CHD risk. I doubt such data are available for this type of analysis though.
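
The sample-overlap check suggested in comments (1) and (4) can be sketched numerically. Under the cross-trait LD score regression model, the intercept is approximately ρ·N_s / sqrt(N_1·N_2), where ρ is the phenotypic correlation among the N_s shared individuals and N_1, N_2 are the two study sample sizes. The snippet below is only an illustration of that formula: the function name `overlap_intercept` and all numbers are hypothetical and are not taken from the manuscript or from any of the GWAS it uses.

```python
import math


def overlap_intercept(rho_pheno, n1, n2, n_shared):
    """Approximate cross-trait LDSC intercept induced by sample overlap.

    rho_pheno: phenotypic correlation among the shared individuals
    n1, n2:    sample sizes of the two GWAS
    n_shared:  number of individuals present in both GWAS
    """
    if n_shared > min(n1, n2):
        raise ValueError("shared samples cannot exceed either study size")
    return rho_pheno * n_shared / math.sqrt(n1 * n2)


# Hypothetical scenarios (illustrative values only).
full_overlap = overlap_intercept(0.2, 300_000, 300_000, 300_000)
partial_overlap = overlap_intercept(0.2, 300_000, 100_000, 50_000)
no_overlap = overlap_intercept(0.2, 300_000, 100_000, 0)

print(f"full overlap:    {full_overlap:.3f}")
print(f"partial overlap: {partial_overlap:.3f}")
print(f"no overlap:      {no_overlap:.3f}")
```

With complete overlap the induced correlation equals the phenotypic correlation; it shrinks as the shared fraction falls, which is why checking it for every pair of traits, as the reviewer suggests, indicates where the zero off-diagonal assumption is most questionable.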

Minor comments

- Variants in cluster 4 have metabolic functions, and are associated with HDL and TG. However, this cluster is not found to have an effect on CHD risk in MR analysis. This is a bit unexpected, given that LDL is a known risk factor of CHD. It would be good to examine the LDL association of cluster 4 variants.

- Typo: line 314, norm of $x_j$ = 1 for all $i$ - I suppose it should be "for all j".

- Typo: denoted $X_{ik}$ for individual i and trait l, in line 406.

- Table S5: it would be helpful to show the fold enrichment (or expected number of overlapping genes). Also, the authors used KEGG and REACTOME. It would be helpful to add GO analysis as well, which is more commonly used for this type of analysis, and may cover more biological processes.

Reviewer #2: The authors have proposed a new procedure NAvMix (and its R implementation) to cluster variants based on their genetic effects on multiple traits. The procedure considers a separate cluster for noise on top of the existing clustering method on the unit hypersphere based on a mixture of von Mises-Fisher distributions. The authors have shown good performance of NAvMix in simulations. The authors have applied NAvMix on genetic variants associated with BMI to unravel distinct biological pathways. However, I have some concerns with the results. The simulation study is somewhat narrow in scope (e.g. their data generating mechanism appears close to the main feature of their procedure that they are leveraging; a wider range of evaluation metrics could have been considered). Takeaways from different real data analyses are not always clear. I have provided detailed comments below.

**Major Comments**

1. “Assuming all genetic associations are estimated with the same sample size for a given trait, this will not distort the direction vector. Otherwise, the direction vector will be weighted toward traits for which the associations are more precisely estimated.” - Is this a limitation of the method? What would be a practical solution to this issue? Are the authors recommending using NAvMix on traits with similar sample size only? When working with summary statistics on different traits, it is often the case that sample sizes differ widely. For example, a recent meta-analysis [PMID 34059833] of glycemic traits included ~90K samples for 2hr glucose whereas >280K samples for fasting glucose.

2. Lines 68-69: Genetic variants for which vectors of association estimates are input to the algorithm are assumed to be independent of each other. Do the authors recommend that only the top hit in a locus be included in the set to be clustered? But then, for most traits, only a handful of loci are significant (not anywhere close to the number 100 assumed in the simulations or the 539 variants in their real data analysis, a huge number that one can get only from very large sample sizes like they have used). Can this method be used for maybe 10-20 independent variants or on traits with modest sample sizes? Do the authors recommend a lower limit to these numbers?

3. The data generating mechanism (DGM) for the simulations seems somewhat favorable for the “directional statistics” concept that the authors are leveraging in their procedure. In my opinion, it will be useful to consider a different and more realistic DGM. For example, instead of using angles between genetic effects to generate different clusters of effects, the authors can consider a structural equation model (as their DGM) reflecting different DAGs capturing the different underlying biological mechanisms among traits (for different DAGs the authors may follow Fig 1 of PMID 29226067).

4. Clustering evaluation metrics: The authors considered the Rand index. Isn’t the adjusted Rand index a better metric to use since it is adjusted for chance? What about additionally using other popular metrics (like normalized mutual information, Silhouette index, etc.) that capture other aspects of clustering that the Rand index cannot? For evaluation, I think considering metrics that do not require a priori knowledge of clusters is also important.

5. Fig 5: I see that different clusters of genetic instruments may or may not give different causal relation results. I don’t see a clear distinction between clusters in terms of their MR results. Can the authors clarify the takeaway from the MR analysis done here?

6. The required input/variables/format for their R package is not clear. What parameter choices do users need to make when using NAvMix package? It will be useful for readers if the authors can provide an outline of how readers can implement their procedure from start to finish.
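
The distinction drawn in comment 4 between the Rand index [16] and its chance-adjusted version [17] can be made concrete with a small worked example. The sketch below computes both from pair counts using only the Python standard library; the function name `rand_indices` and the toy label vectors are hypothetical and are not drawn from the manuscript's simulations.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two partitions."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    pairs = comb(n, 2)
    # Pairs placed together in both partitions, and in each one separately.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Rand index: proportion of pairs on which the partitions agree
    # (together in both, or apart in both).
    rand = (pairs + 2 * together_both - together_a - together_b) / pairs
    # Adjusted Rand index: subtract the agreement expected by chance.
    expected = together_a * together_b / pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return rand, 1.0
    return rand, (together_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
uninformative = [0, 1, 0, 1, 0, 1]
ri, ari = rand_indices(truth, uninformative)
print(f"Rand index = {ri:.3f}, adjusted Rand index = {ari:.3f}")
```

In this toy case the Rand index is 7/15 (about 0.47) even though the second partition carries essentially no information about the first, whereas the adjusted index is slightly below zero, which is the behaviour the reviewer has in mind.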

**Minor Comments**

7. Line 149: Are the authors including SNPs with r^2<0.001? If not, I am not sure why Lines 68-69 say that variants should not be in LD.

8. Lines 332-333: specify the contents of parameter vector $\Theta$.

9. There might be some notational inconsistency here and there. For example, what do $X_j$, $X_k$ stand for in Line 398 and are they related to vector $x_j$ in the previous paragraphs? Is vector $x_j = \hat\Sigma_j^{-1/2} \hat\beta_j$? It is worth clarifying these relations for readers.

10. What happens to the $\pi_k$ estimate if there are only a few members in a cluster (other than the noise cluster)? Will the estimate be unstable? Can unstable estimates for such a cluster lead to lack of convergence?

11. Figs 3-7: It will be helpful to have the number of SNPs in each cluster mentioned in the figures.

12. Figs 6 & 7: Only MR-IVW results are presented. MR-IVW is least likely to be robust to violation of assumptions (it is unlikely that all 539 SNPs are valid instruments). Is only MR-IVW presented because the results for these MR analyses using different methods/assumptions give the same qualitative results?
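
The concern in major comment 1 and the notational question in minor comment 9 can be illustrated together. The sketch below assumes, as a simplification, that each variant's association estimates are divided by their standard errors (a diagonal covariance matrix) and then scaled to unit length; whether this matches the manuscript's exact definition of $x_j$ is precisely what comment 9 asks, so it should be read as an assumption. The function name `direction_vector` and all numbers are hypothetical.

```python
import math


def direction_vector(betas, ses):
    """Standardise association estimates by their SEs, then scale to unit norm."""
    z = [b / s for b, s in zip(betas, ses)]
    norm = math.sqrt(sum(v * v for v in z))
    if norm == 0:
        raise ValueError("all standardised associations are zero")
    return [v / norm for v in z]


# One variant with identical estimated effects on two traits.
betas = [0.03, 0.03]

# Equal precision: the direction sits midway between the two trait axes.
equal = direction_vector(betas, [0.01, 0.01])

# Trait 1 measured in roughly four times as many samples (SE halved):
# the direction tilts toward trait 1 although the estimates are unchanged.
unequal = direction_vector(betas, [0.005, 0.01])

print("equal precision:  ", [round(v, 3) for v in equal])
print("unequal precision:", [round(v, 3) for v in unequal])
```

Under these assumptions the second call gives a vector weighted about two to one toward the more precisely estimated trait, which is the distortion the quoted sentence in comment 1 describes.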

Reviewer #3: Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity

This paper aims to answer the question ‘how do genetic variants influence complex traits?’ and presents a new approach for clustering genetic variants according to the direction of association with complex traits and incorporates a noise model to improve overall clustering results. They perform simulation studies and apply their method to studying variants associated with high BMI and look at the effects on coronary heart disease.

The paper introduces a novel model but needs more extensive benchmarking and robustness tests to convince me of its utility.

Major Concerns:

I’d like to see some justification/evidence that the magnitude of associations to traits is not important, but the direction of association is the key concern. Is there biological rationale for claiming that magnitudes of association do not encode useful information about how variants influence traits?

Benchmarking should be done over a wider range of comparison models and over parameter values (Why was the number of traits chosen to be just 2 or 9?). I’d also like to see answers to questions about limitations of the model (How many data points are required to detect clusters reliably? What is the smallest fraction that the method can feasibly detect? How does performance change as a function of noise?).

Moderate Concerns:

I’d like to see a justification for choosing a uniform noise model. Does incorporating this noise model improve performance over simple pre-processing/data cleaning approaches?

I’d like a brief explanation for why they believe that the other approaches they benchmark achieve lower Rand index on their simulations. Why were these models selected for benchmarking?

Is the assumption of no linkage disequilibrium a fair assumption?

Minor Concerns:

Figure 4 is a bit confusing - can this be better illustrated? The figure makes it appear that several clusters are actually quite similar - are the differences between them significant?

Figures 5 and 6 - What do the dotted lines represent?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Decision Letter 1

David Balding, Michael P Epstein

23 Nov 2021

Dear Dr Grant,

Thank you very much for submitting your Research Article entitled 'Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity' to PLOS Genetics. We are willing to accept your manuscript once you submit a revised version that addresses the minor comments raised by one reviewer. 

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Michael P. Epstein

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Reviewer #1: The authors have done a thorough job in addressing all my comments. I do not have any more concerns.

Reviewer #2: The authors have reasonably addressed my previous concerns. Some minor comments below.

1. Table 1: In the caption or somewhere appropriate, it will be useful to provide the actual number of noise variants in the simulated data to better appreciate the estimated “Number of noise variants” from each method.

2. Lines 153-154: The authors say “This suggests that, unless there is substantial trait correlation or sample overlap, the procedure is robust to missing these estimates.” Since the difference in adjusted Rand index between NAvMix and NAvMix (cor) increases quite a bit with increasing number of traits, decreasing number of clusters, and increasing strength of correlation (similar observation for silhouette index in Fig A), the authors need to include these other conditions in the above statement.

3. Regarding this correlation: the authors discuss limitations of NAvMix, including that “It uses as input the full covariance matrix of the genetic variant-trait associations.” However, I am not sure why this is a limitation. Under reasonable assumptions, the authors approximate Σ_j as in lines 450-451. While individual-level data readily give Corr(X_k, X_l), one can also estimate this correlation easily and quickly using summary statistics only. The field of cross-phenotype association tests using summary statistics uses such an estimate all the time (see, e.g., PMID 29226385). I believe the authors can use such an estimate instead of making the sub-optimal assumption that the off-diagonal elements of Σ_j are 0.

4. I may have missed it, but I did not find details about the sample sizes, the total number of significant variants, and the number of significant variants after pruning for each of the traits considered from Pulit et al, UK Biobank, etc. I think readers will find these details useful in interpreting results from NAvMix. In particular, I think NAvMix requires large sample sizes and highly polygenic traits.

5. With respect to Fig 5, the authors mention “The components of BMI predicted by Clusters 1, 2, and 3 show a similar effect size and direction to the overall result.” Does this mean that Clusters 1, 2 and 3 are not very different and could have been a single cluster?
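
Comment 3 above refers to estimating Corr(X_k, X_l) from summary statistics alone. A minimal sketch of one commonly used approach follows, assuming that genome-wide z-scores are available for both traits and that variants with small |z| for both traits are approximately null, so that the correlation of their z-scores mainly reflects trait correlation and sample overlap. The function names, the threshold of 1.96, and the simulated data are hypothetical illustrations; this is not the authors' procedure, nor necessarily the estimator used in the cited reference.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def null_z_correlation(z_k, z_l, threshold=1.96):
    """Correlate the z-scores of two traits over variants that look null for both.

    `threshold` is an assumed cut-off: only variants with |z| below it
    for both traits are kept.
    """
    pairs = [(a, b) for a, b in zip(z_k, z_l)
             if abs(a) < threshold and abs(b) < threshold]
    if len(pairs) < 3:
        raise ValueError("too few approximately null variants")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def simulate_null_z(n, rho, seed=0):
    """Simulate z-scores for n null variants with correlation rho (illustration only)."""
    rng = random.Random(seed)
    z_k, z_l = [], []
    for _ in range(n):
        e1 = rng.gauss(0, 1)
        e2 = rng.gauss(0, 1)
        z_k.append(e1)
        z_l.append(rho * e1 + math.sqrt(1 - rho ** 2) * e2)
    return z_k, z_l


if __name__ == "__main__":
    z_k, z_l = simulate_null_z(20000, rho=0.5)
    print(null_z_correlation(z_k, z_l))
```

Restricting to variants below the threshold truncates both z-score distributions, which tends to shrink the estimate towards zero, so the choice of threshold is itself an assumption that would need checking in practice.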

Reviewer #3: The revision adequately addressed all our concerns

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Xin He

Reviewer #2: No

Reviewer #3: No

Decision Letter 2

David Balding, Michael P Epstein

1 Dec 2021

Dear Dr Grant,

We are pleased to inform you that your manuscript entitled "Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to "Update My Information" (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Michael P. Epstein

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-21-00715R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

David Balding, Michael P Epstein

17 Dec 2021

PGENETICS-D-21-00715R2

Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity

Dear Dr Grant,

We are pleased to inform you that your manuscript entitled "Noise-augmented directional clustering of genetic association data identifies distinct mechanisms underlying obesity" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


    Supplementary Materials

    S1 Text. Additional simulation results and supplementary information for the simulation study and applied example.

    (PDF)

    S1 Table. Allocated cluster and probability of membership to each cluster for each BMI associated genetic variant.

    (XLSX)

    S2 Table. List of canonical pathways and gene ontology processes associated with the mapped genes for each cluster of BMI associated genetic variants.

    (XLSX)

    S3 Table. Results from Mendelian randomization sensitivity analyses of the effect of BMI on cytokines and growth factors.

    Estimates and 95% confidence intervals from MR-Median, the Contamination Mixture method (MR-ConMix) and MR-PRESSO for the association of genetically predicted BMI with 41 cytokines and growth factors.

    (XLSX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All the data used in this paper are publicly available. Summary statistics for genetic associations with traits were downloaded from: https://zenodo.org/record/1251813#.X8drUF7gquP (BMI and WHR); http://www.nealelab.is/uk-biobank/ (body fat percentage, SBP, triglycerides, HDL and CRP); https://www.thessgac.org/data (educational attainment); https://ora.ox.ac.uk/objects/uuid:ff479f44-bf35-48b9-9e67-e690a2937b22 (physical activity); https://data.bris.ac.uk/data/dataset/10i96zb8gm0j81yz0q6ztei23d (lifetime smoking score); http://diagram-consortium.org/downloads.html (T2D); http://www.phenoscanner.medschl.cam.ac.uk/ (CHD); https://data.bris.ac.uk/data/dataset/3g3i5smgghp0s2uvm1doflkx9x (cytokines and growth factors). R code for performing the NAvMix clustering algorithm, and for reproducing the simulation results and applied analysis, can be found at https://github.com/aj-grant/navmix.


    Articles from PLoS Genetics are provided here courtesy of PLOS
