Scientific Reports. 2023 Feb 28;13:3389. doi: 10.1038/s41598-023-30415-3

A clustering linear combination method for multiple phenotype association studies based on GWAS summary statistics

Meida Wang 1, Xuewei Cao 1, Shuanglin Zhang 1, Qiuying Sha 1
PMCID: PMC9975197  PMID: 36854754

Abstract

There is strong evidence that joint analysis of multiple phenotypes in genome-wide association studies (GWAS) can increase statistical power to detect associations between genetic variants and human complex diseases. We previously developed the Clustering Linear Combination (CLC) method and a computationally efficient CLC (ceCLC) method to test the association between multiple phenotypes and a genetic variant; both perform very well. However, both methods require individual-level genotypes and phenotypes, which are often not easily accessible. In this research, we develop a novel method called sCLC for association studies of multiple phenotypes and a genetic variant based on GWAS summary statistics. We use LD score regression to estimate the correlation matrix among phenotypes. The test statistic of sCLC is constructed from GWAS summary statistics and has an approximate Cauchy distribution. We perform a variety of simulation studies and compare sCLC with other commonly used methods for multiple phenotype association studies using GWAS summary statistics. Simulation results show that sCLC controls Type I error rates well and has the highest power in most scenarios. Moreover, we apply the newly developed method to the UK Biobank GWAS summary statistics from the XIII category with 70 related musculoskeletal system and connective tissue phenotypes. The results demonstrate that sCLC detects the largest number of significant SNPs, and most of these identified SNPs can be matched to genes that have been reported in the GWAS catalog to be associated with those phenotypes. Furthermore, sCLC also identifies some novel signals that were missed by standard GWAS, which provide new insight into the potential genetic factors of the musculoskeletal system and connective tissue phenotypes.

Subject terms: Genetic association study, Genome-wide association studies

Introduction

Over the last decades, genome-wide association studies (GWAS) have been very successful in detecting genetic variants associated with human complex traits or diseases1–3. At the same time, a vast majority of GWAS summary statistics obtained from single-trait tests are publicly available; these contain the estimated marginal effect sizes, the corresponding standard errors, Z scores, or p-values. Raw genotypes and phenotypes are usually not easy to access because of privacy concerns and logistical considerations, which has motivated extensive interest in developing statistical methods based on GWAS summary statistics4–6. In addition, because multiple related phenotypes are often measured as indicators of one specific trait, accounting for the correlation structure among the phenotypes and analyzing them jointly may increase statistical power in association studies7–12.

Recently, many multiple phenotype association tests based on GWAS summary statistics have been proposed. CPASSOC13 contains two separate tests (Hom and Het). Hom is more powerful when the genetic variant has homogeneous effects on the phenotypes, and Het is more powerful when heterogeneous effects are present; however, when the number of traits is large, Het requires Monte-Carlo simulations to calculate its p-value, which is computationally intensive. SSU14,15 is a test statistic based on the sum of squared Z scores, which follows a mixture of chi-squared distributions under the null hypothesis. PCFisher16 combines the p-values of all independent principal components (PCs) using Fisher’s method, which allocates larger weights to PCs with smaller eigenvalues. The classical Wald test16 uses the Z score vector and the inverse of the correlation matrix among phenotypes to construct a quadratic test statistic. The adaptive multi-trait association test (aMAT)17 builds a group of multi-phenotype association tests (MATs), each of which may perform well in a specific scenario, and then integrates the testing results adaptively.

In our previous studies, we developed the Clustering Linear Combination (CLC) method18 and a computationally efficient CLC (ceCLC) method19 to test the association between multiple phenotypes and a genetic variant based on individual-level genotypes and phenotypes. Both methods perform very well compared with other multiple phenotype association tests, especially for phenotypes that have a natural grouping. In this research, we develop a novel approach called CLC based on GWAS summary statistics (sCLC). In sCLC, we use LD score regression20,21 to estimate the correlation matrix among phenotypes. LD score regression, which has been commonly used in recent years, has been shown to control potential confounders such as population stratification, unknown sample overlap, and cryptic relatedness20–22. In our simulation studies, we consider a range of simulation settings and compare sCLC with five other commonly used methods for multiple phenotype association studies using GWAS summary statistics. The simulation results show that sCLC controls the Type I error rate well and has the highest power in most scenarios. We also apply the sCLC method to UK Biobank GWAS summary statistics for 70 related musculoskeletal system and connective tissue phenotypes in the XIII category of UK Biobank. The results show that sCLC identifies the largest number of significant SNPs, and most of these SNPs can be matched to genes that have been reported in the GWAS catalog to be associated with the phenotypes in the XIII category. Furthermore, sCLC also identifies some novel signals that were missed by standard GWAS. The newly identified signals may provide new insight into the potential genetic factors of the musculoskeletal system and connective tissue phenotypes.

Materials and methods

We consider a GWAS with $M$ SNPs and $K$ correlated phenotypes of interest. A single SNP $j$ is considered at a time, and the same procedure is repeated for all SNPs, $j=1,\dots,M$. For SNP $j$, we assume that we have the Z score vector $Z_j=(Z_{1j},Z_{2j},\dots,Z_{Kj})^T$ across the $K$ phenotypes from GWAS summary statistics. If the Z score is not provided, we can compute it as $Z_{kj}=\hat{\beta}_{kj}/\widehat{se}(\hat{\beta}_{kj})$, $k=1,\dots,K$, where $\hat{\beta}_{kj}$ is the estimated effect size of SNP $j$ on phenotype $k$, and $\widehat{se}(\hat{\beta}_{kj})$ is the estimated standard error of $\hat{\beta}_{kj}$. Based on the GWAS summary statistics, we propose the following sCLC method.

Firstly, sCLC uses LD score regression (LDSC)20,21 to estimate the correlation matrix among phenotypes, denoted by $R$. Specifically, for a pair of phenotypes $s$ and $k$, the bivariate LDSC20 regresses the pairwise product of Z scores on the LD scores. The expected value of $Z_{sj}Z_{kj}$ is

$$E(Z_{sj}Z_{kj})=G_g l_j+\rho_{sk},$$

where $G_g$ is related to the genetic covariance between phenotypes $s$ and $k$; $l_j$ is the LD score of SNP $j$, which can be obtained from the reference panel20,21; and $\rho_{sk}$ is the correlation between phenotypes $s$ and $k$. Therefore, the bivariate LDSC20 can be applied to each pair of phenotypes, and the estimated intercepts $\rho_{sk}$ are used to estimate the off-diagonal elements of $R$. When $s=k$, the regression reduces to the univariate LDSC21 for each phenotype, and the estimated intercepts are used to estimate the diagonal elements of $R$. In this procedure, all $M$ SNPs are used to estimate $R$, and the LD scores for the SNPs can be obtained from a reference panel such as the 1000 Genomes Project23. Moreover, LDSC can control potential confounders such as population stratification, unknown sample overlap, and cryptic relatedness20–22.
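
As a rough illustration of how the intercepts of this regression populate $R$, the following Python sketch fits an ordinary (unweighted) least-squares line of the Z score products on the LD scores for every pair of phenotypes. This is a deliberate simplification and an assumption of the sketch: the LDSC procedure cited above uses a weighted regression with its own standard-error estimation, which is not reproduced here. The function names `ldsc_intercept` and `estimate_correlation_matrix` are hypothetical.

```python
import numpy as np

def ldsc_intercept(z_s, z_k, ld_scores):
    """Unweighted least-squares fit of z_s * z_k on the LD scores.

    Returns (slope, intercept); the intercept plays the role of rho_sk.
    Simplification: no regression weights are used.
    """
    y = np.asarray(z_s, dtype=float) * np.asarray(z_k, dtype=float)
    l = np.asarray(ld_scores, dtype=float)
    X = np.column_stack([l, np.ones_like(l)])  # columns: LD score, intercept
    (slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
    return slope, intercept

def estimate_correlation_matrix(Z, ld_scores):
    """Z is an M x K array of Z scores (SNPs in rows, phenotypes in columns).

    Returns the K x K matrix of intercepts; s == k gives the diagonal.
    """
    Z = np.asarray(Z, dtype=float)
    K = Z.shape[1]
    R = np.empty((K, K))
    for s in range(K):
        for k in range(s, K):
            _, R[s, k] = ldsc_intercept(Z[:, s], Z[:, k], ld_scores)
            R[k, s] = R[s, k]  # the intercept matrix is symmetric
    return R
```

Because the matrix is estimated from all $M$ SNPs at once, it is computed a single time and then reused for every per-SNP test.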

Secondly, similar to CLC18, we use hierarchical clustering with similarity matrix $R$ and dissimilarity matrix $1-R$ to partition the original $K$ phenotypes into $L$ disjoint clusters ($L=1,2,\dots,K$). Agglomerative hierarchical clustering starts with each phenotype as a singleton cluster ($L=K$) and then successively merges the pair of clusters with the smallest distance (highest similarity) until all clusters have been merged into a single cluster that contains all phenotypes ($L=1$)24. Because we consider a single SNP $j$ and multiple phenotypes at a time, we simplify the notation $Z_j$ to $Z$. For a partition of the $K$ phenotypes into $L$ disjoint clusters, we define a $K\times L$ matrix $B$ whose $(k,l)$th element equals 1 if the $k$th phenotype belongs to the $l$th cluster and 0 otherwise. Then the CLC test statistic for testing the association between the $K$ phenotypes and a SNP with $L$ clusters is given by

$$T_{CLC}^{L}=(WZ)^T\left(WRW^T\right)^{-1}(WZ),$$

where $W=B^TR^{-1}$. Under the null hypothesis, $T_{CLC}^{L}$ follows a $\chi^2$ distribution with $L$ degrees of freedom. We denote the p-value of $T_{CLC}^{L}$ by $p_L$ for $1\le L\le K$.
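
The clustering step and the per-$L$ statistic can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: average linkage is assumed because the linkage criterion is not specified above, the function name `clc_pvalues` is hypothetical, and SciPy's `linkage` and `cut_tree` are used to build and cut the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial.distance import squareform
from scipy.stats import chi2

def clc_pvalues(z, R, method="average"):
    """Return p_1, ..., p_K, one CLC p-value per number of clusters L."""
    z = np.asarray(z, dtype=float)
    R = np.asarray(R, dtype=float)
    K = len(z)
    D = 1.0 - R                      # dissimilarity matrix 1 - R
    D = (D + D.T) / 2.0
    np.fill_diagonal(D, 0.0)         # condensed form needs a zero diagonal
    tree = linkage(squareform(D, checks=False), method=method)
    R_inv = np.linalg.inv(R)
    pvals = np.empty(K)
    for L in range(1, K + 1):
        raw = cut_tree(tree, n_clusters=L).ravel()
        _, labels = np.unique(raw, return_inverse=True)  # labels 0..L-1
        n_clusters = labels.max() + 1
        B = np.zeros((K, n_clusters))
        B[np.arange(K), labels] = 1.0                    # K x L indicator matrix
        W = B.T @ R_inv
        Wz = W @ z
        stat = Wz @ np.linalg.solve(W @ R @ W.T, Wz)
        pvals[L - 1] = chi2.sf(stat, df=n_clusters)
    return pvals
```

Two limiting cases are useful sanity checks: with $L=K$ the matrix $B$ is a permutation matrix and the statistic reduces to $Z^TR^{-1}Z$, and with $L=1$ it reduces to $(\mathbf{1}^TR^{-1}Z)^2/(\mathbf{1}^TR^{-1}\mathbf{1})$.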

Finally, we use the Cauchy combination25,26 to integrate the p-values obtained in the second step over all possible numbers of clusters, $p_L$ for $1\le L\le K$. The test statistic of sCLC for a SNP is defined as the sum of the transformed p-values divided by $K$ (the number of possible cluster counts), which is given by

$$T_{sCLC}=\frac{1}{K}\sum_{L=1}^{K}\tan\{(0.5-p_L)\pi\}.$$

Under the null hypothesis, $p_L$ follows a standard uniform distribution, so $\tan\{(0.5-p_L)\pi\}$ has a standard Cauchy distribution. Because $p_1,\dots,p_K$ correspond to the possible numbers of clusters for the same $K$ phenotypes, they are correlated. Liu et al.25,26 showed that a weighted sum of “correlated” standard Cauchy variables still has an approximately Cauchy tail, and that the influence of the correlation structure on the tail is quite limited because of the heaviness of the Cauchy tail. Therefore, $T_{sCLC}$ is approximately standard Cauchy distributed. Based on the cumulative distribution function of the standard Cauchy distribution, the p-value of $T_{sCLC}$ can be approximated by $0.5-\arctan(T_{sCLC})/\pi$.
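
The combination step is short enough to write out directly. The sketch below uses only the Python standard library; the function name `sclc_pvalue` is hypothetical, and the input p-values are assumed to lie strictly between 0 and 1 (the tangent transform is undefined at the endpoints).

```python
import math

def sclc_pvalue(pvals):
    """Equal-weight Cauchy combination of p_1, ..., p_K.

    Assumes every p-value lies strictly inside (0, 1).
    """
    K = len(pvals)
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / K
    return 0.5 - math.atan(t) / math.pi
```

A quick consistency check: when all $K$ p-values are equal to the same value $p$, the combined p-value is again $p$, and a single very small $p_L$ dominates the sum because of the heavy Cauchy tail.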

Comparison of methods

To better demonstrate the performance of the sCLC approach, we compare sCLC with five other methods for multiple phenotype association studies using GWAS summary statistics: SSU14,15, Hom13, PCFisher16, Wald16, and aMAT17. Below, we briefly summarize these five methods, where the Z score vector $Z$ and the phenotypic correlation matrix $R$ are the same as defined previously.

SSU

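The test statistic of SSU is $T_{SSU}=Z^TZ$, and the distribution of $T_{SSU}$ can be well approximated by $a\chi_d^2+b$ with

$$a=\frac{\sum_{i=1}^{K}c_i^3}{\sum_{i=1}^{K}c_i^2},\quad b=\sum_{i=1}^{K}c_i-\frac{\left(\sum_{i=1}^{K}c_i^2\right)^2}{\sum_{i=1}^{K}c_i^3},\quad d=\frac{\left(\sum_{i=1}^{K}c_i^2\right)^3}{\left(\sum_{i=1}^{K}c_i^3\right)^2},$$

where the $c_i$ are the eigenvalues of $R$. The p-value of $T_{SSU}$ can be obtained as $P\left(\chi_d^2>(T_{SSU}-b)/a\right)$. Note that the degrees of freedom $d$ may be less than $K$ when the phenotypes are highly correlated.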
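
The moment-matching approximation can be sketched as below. The function name `ssu_pvalue` is hypothetical; SciPy's `chi2.sf` is used because the approximating degrees of freedom $d$ are generally not an integer.

```python
import numpy as np
from scipy.stats import chi2

def ssu_pvalue(z, R):
    """P-value of T_SSU = z'z from the scaled and shifted chi-square approximation."""
    z = np.asarray(z, dtype=float)
    c = np.linalg.eigvalsh(np.asarray(R, dtype=float))  # eigenvalues of R
    s1, s2, s3 = c.sum(), (c ** 2).sum(), (c ** 3).sum()
    a = s3 / s2
    b = s1 - s2 ** 2 / s3
    d = s2 ** 3 / s3 ** 2
    stat = float(z @ z)
    return chi2.sf((stat - b) / a, df=d)
```

When $R$ is the identity matrix, all eigenvalues equal 1, so $a=1$, $b=0$, and $d=K$, and the approximation coincides with the exact $\chi_K^2$ test.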

Hom

Assume that there are GWAS summary statistics from $J$ cohorts with $K$ traits. Let $T_{ijk}$ be a summary statistic for the $i$th SNP, $j$th cohort, and $k$th trait, and let $T_i=(T_{i11},\dots,T_{iJ1},\dots,T_{i1K},\dots,T_{iJK})^T$. For simplicity, we omit the SNP index, so that $T=(T_{11},\dots,T_{J1},\dots,T_{1K},\dots,T_{JK})^T$ represents a vector of test statistics for single SNP-trait association tests. The test statistic of Hom is

$$S_{Hom}=\frac{e^T(RV)^{-1}T\left(e^T(RV)^{-1}T\right)^T}{e^T(VRV)^{-1}e},$$

which follows a $\chi^2$ distribution with one degree of freedom, where $e^T=(1,\dots,1)$ is a vector of length $J\times K$ with all elements equal to 1, $V$ is a diagonal matrix of weights $w_{jk}=\sqrt{n_j}$, and $n_j$ is the sample size of the $j$th cohort. In this study, we consider $J=1$ cohort to compare Hom with the other methods.
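
A direct transcription of the Hom statistic is sketched below. The function name `hom_test` is hypothetical, and the diagonal of $V$ is passed in by the caller. The p-value for one degree of freedom is computed with the identity $P(\chi_1^2>x)=\mathrm{erfc}(\sqrt{x/2})$, so only NumPy and the standard library are needed.

```python
import math
import numpy as np

def hom_test(t, R, weights):
    """Return (S_Hom, p-value). `weights` holds the diagonal of V."""
    t = np.asarray(t, dtype=float)
    R = np.asarray(R, dtype=float)
    V = np.diag(np.asarray(weights, dtype=float))
    e = np.ones(len(t))
    num = e @ np.linalg.solve(R @ V, t)        # e'(RV)^{-1} T
    den = e @ np.linalg.solve(V @ R @ V, e)    # e'(VRV)^{-1} e
    stat = float(num ** 2 / den)
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

With a single cohort every phenotype shares the same weight, so the weight cancels between numerator and denominator and the statistic reduces to $(e^TR^{-1}T)^2/(e^TR^{-1}e)$.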

PCFisher

Assume that the spectral decomposition of $R$ is $R=\sum_{m=1}^{K}\lambda_m u_m u_m^T$, where $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_K>0$ are the eigenvalues of $R$, and $u_m$ is the eigenvector corresponding to the $m$th largest eigenvalue $\lambda_m$. We assume that the $K$-dimensional vector of summary statistics satisfies $Z\sim N(\mu,R)$. It can be shown16 that $PC_m=u_m^TZ\sim N(u_m^T\mu,\lambda_m)$, $1\le m\le K$. The non-centrality parameter (ncp) of $PC_m$ under the alternative hypothesis is $ncp_m=(u_m^T\mu)^2/\lambda_m$. PCFisher16 combines the p-values $p_m$ of all $K$ independent principal components using Fisher’s method; the test statistic and its null distribution are given by $PCFisher=-2\sum_{m=1}^{K}\log(p_m)\sim\chi_{2K}^2$.
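
The following sketch illustrates the PC-based Fisher combination. It assumes, as a labeled assumption of this example, that each $p_m$ is the two-sided p-value of $PC_m^2/\lambda_m$ against $\chi_1^2$; the function name `pcfisher_test` is hypothetical.

```python
import math
import numpy as np
from scipy.stats import chi2

def pcfisher_test(z, R):
    """Return (Fisher statistic, p-value) from the principal components of z."""
    z = np.asarray(z, dtype=float)
    lam, U = np.linalg.eigh(np.asarray(R, dtype=float))
    pcs = U.T @ z                                   # PC_m = u_m' z
    # Assumed per-PC p-value: P(chi2_1 > PC_m^2 / lambda_m)
    p_m = np.array([math.erfc(abs(pc) / math.sqrt(2.0 * l))
                    for pc, l in zip(pcs, lam)])
    stat = float(-2.0 * np.sum(np.log(p_m)))
    return stat, chi2.sf(stat, df=2 * len(z))
```

The order in which `eigh` returns the eigenvalues does not matter here, because Fisher’s statistic sums over all $K$ components.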

Wald

The test statistic of the Wald test is defined as $T_{Wald}=Z^TR^{-1}Z$. Assume that the spectral decomposition of $R$ is $R=U\Lambda U^T=\sum_{m=1}^{K}\lambda_m u_m u_m^T$; then the test statistic can be written as $T_{Wald}=Z^TR^{-1}Z=(U^TZ)^T\Lambda^{-1}(U^TZ)=\sum_{m=1}^{K}PC_m^2/\lambda_m\sim\chi_K^2$. So, the Wald test is a special quadratic PC-based test16.
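
The equivalence between the quadratic form and the sum over principal components can be checked numerically. In this sketch the function names `wald_statistic` and `wald_from_pcs` are hypothetical.

```python
import numpy as np

def wald_statistic(z, R):
    """Quadratic form z' R^{-1} z."""
    z = np.asarray(z, dtype=float)
    return float(z @ np.linalg.solve(np.asarray(R, dtype=float), z))

def wald_from_pcs(z, R):
    """Same statistic written as the sum of PC_m^2 / lambda_m."""
    lam, U = np.linalg.eigh(np.asarray(R, dtype=float))
    pcs = U.T @ np.asarray(z, dtype=float)
    return float(np.sum(pcs ** 2 / lam))
```

Each component is divided by its eigenvalue, so directions with small $\lambda_m$ contribute strongly; this is why the Wald test can become unstable when $R$ is nearly singular.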

aMAT

The method was developed to deal with the potential (near) singularity of $R$. The singular value decomposition (SVD) of $R$ is $R=U\Sigma U^T$. A modified pseudoinverse $R_\gamma^+$ is calculated as $R_\gamma^+=U\Sigma_\gamma^+U^T$, where $\Sigma_\gamma^+$ is formed from $\Sigma$ by taking the reciprocal of the largest $m$ singular values $\sigma_1,\dots,\sigma_m$ and setting all other elements to zero, and $m$ is the largest integer that satisfies $\sigma_1/\sigma_m<\gamma$. The test statistic of MAT($\gamma$) is defined as $T_{MAT(\gamma)}=Z^TR_\gamma^+Z$. Because the optimal value of $\gamma$ is unknown, aMAT combines the results from a class of MAT tests, $T_{aMAT}=\min_{\gamma\in\Gamma}p_{MAT(\gamma)}$, where $p_{MAT(\gamma)}$ is the p-value of MAT($\gamma$) and $\Gamma=\{1,10,30,50\}$. Finally, a Gaussian copula approximation is applied to calculate the p-value of aMAT. Therefore, aMAT is analogous to a PC-based method that restricts the analysis to the top $m$ axes of largest variation17.
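
The truncated pseudoinverse behind a single MAT($\gamma$) test can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the cutoff is implemented as $\sigma_1/\sigma_m\le\gamma$, so that $\gamma=1$ keeps the leading component (a strict inequality would keep none), and the p-value uses $\chi_m^2$, which follows because $Z^TR_\gamma^+Z$ is a sum of $m$ squared independent standard normal variables under the null. The Gaussian copula step of aMAT is not reproduced, and the function name `mat_test` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def mat_test(z, R, gamma):
    """Return (statistic, m, p-value) for one value of gamma."""
    z = np.asarray(z, dtype=float)
    U, sigma, _ = np.linalg.svd(np.asarray(R, dtype=float))  # sigma is descending
    m = int(np.sum(sigma[0] / sigma <= gamma))   # assumed cutoff rule (<=)
    inv = np.zeros_like(sigma)
    inv[:m] = 1.0 / sigma[:m]                    # reciprocal of the top m values
    R_plus = (U * inv) @ U.T                     # U diag(inv) U'
    stat = float(z @ R_plus @ z)
    return stat, m, chi2.sf(stat, df=m)
```

Larger values of $\gamma$ keep more components, and once $\gamma$ exceeds the condition number of $R$ the statistic equals the Wald statistic.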

Results

Simulation design

Based on a widely used simulation procedure17,27, we generate Z scores from a multivariate normal distribution $N(\mu,R)$. We consider two different correlation matrix structures: (1) $R$ is the sample correlation matrix of 70 related musculoskeletal system and connective tissue phenotypes in the UK Biobank (details of the 70 phenotypes are described in the Application to UK Biobank summary statistics); and (2) $R$ is generated based on the autoregressive model (AR(1) model)28 for 40 phenotypes, where $R=\mathrm{Bdiag}(R_1,R_2,R_3,R_4)$ is a block diagonal matrix with $R_1=R_3=(r_{sk})=(\rho^{|s-k|})$ and $R_2=R_4=((-\rho)^{|s-k|})$. We use $\rho=0.1$ in the simulation studies.
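
The second correlation structure and the sampling of Z scores can be sketched as below. The sketch assumes four equally sized blocks of 10 phenotypes (the block sizes are not stated above) and takes the alternating blocks to have entries $(-\rho)^{|s-k|}$, which keeps the diagonal equal to 1. The function names are hypothetical.

```python
import numpy as np

def ar1_block(size, rho):
    """AR(1) correlation block with entries rho ** |s - k|."""
    idx = np.arange(size)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def block_ar1_matrix(n_blocks=4, size=10, rho=0.1):
    """Block diagonal matrix whose blocks alternate between rho and -rho."""
    K = n_blocks * size
    R = np.zeros((K, K))
    for b in range(n_blocks):
        r = rho if b % 2 == 0 else -rho
        R[b * size:(b + 1) * size, b * size:(b + 1) * size] = ar1_block(size, r)
    return R

def simulate_z(mu, R, n, seed=0):
    """Draw n Z score vectors from N(mu, R); returns an n x K array."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.asarray(mu, dtype=float), R, size=n)
```

Setting `mu` to the zero vector gives draws under the null hypothesis, and a non-zero `mu` gives draws under an alternative.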

To investigate how the estimation error of $R$ may affect the testing results, similar to Wu17, we consider two cases for the correlation matrix structure of the 70 phenotypes. In the first case, we suppose that $R$ is known and perform our proposed method, sCLC, and all competing methods based on $R$. In the second case, we suppose that $R$ is unknown and the estimated phenotypic correlation matrix is approximated by $R$ plus a small white noise $N(0,\delta)$, denoted by $R(\delta)$. We choose $\delta=10^{-5}$ and $\delta=10^{-4}$ in the simulation studies, and use $R(\delta)$ in the association tests for all the methods.

To evaluate the Type I error rate of sCLC, we generate $10^8$ Z score vectors under the null hypothesis ($\mu=0$) and consider different significance levels. To evaluate power, we generate $10^4$ Z score vectors under an alternative with a different effect size vector $\mu$ in each of four scenarios. In the first two scenarios, we assume that the SNP affects the phenotypes in the same direction. Scenario 3 considers different directions of effects on the phenotypes. Scenario 4 is a sparse simulation model, in which a SNP affects only a small proportion of the phenotypes. The significance level of $5\times10^{-8}$ is chosen for the power evaluation.

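Scenario 1: Generate $\mu=\beta(1/K,2/K,\dots,1)^T$.

Scenario 2: Generate $\mu=(\underbrace{0,\dots,0}_{K/2},\underbrace{\beta,\dots,\beta}_{K/2})^T$.

Scenario 3: Generate $\mu=(\beta_{11},\dots,\beta_{1k},\beta_{21},\dots,\beta_{2k},\beta_{31},\dots,\beta_{3k},\beta_{41},\dots,\beta_{4k},\beta_{51},\dots,\beta_{5k})^T$, where $\beta_{11}=\dots=\beta_{1k}=\beta_{21}=\dots=\beta_{2k}=0$, $\beta_{31}=\dots=\beta_{3k}=\beta_{41}=\dots=\beta_{4k}=\beta$, $(\beta_{51},\dots,\beta_{5k})=-\frac{2\beta}{k+1}(1,\dots,k)$, and $k=K/5$.

Scenario 4: Generate $\mu=(\beta_{11},\dots,\beta_{1k},\beta_{21},\dots,\beta_{2k},\beta_{31},\dots,\beta_{3k},\dots,\beta_{14,1},\dots,\beta_{14,k})^T$, where $\beta_{11}=\dots=\beta_{1k}=\beta_{21}=\dots=\beta_{2k}=\dots=\beta_{13,1}=\dots=\beta_{13,k}=0$, $(\beta_{14,1},\dots,\beta_{14,k})=\frac{2\beta}{k+1}(1,\dots,k)$, and $k=K/14$.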
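
The four effect size vectors can be generated with a small helper such as the one sketched below. The function name `effect_vector` is hypothetical, and the sketch assumes that $K$ is divisible by the number of blocks in each scenario (2, 5, or 14); it raises an error otherwise.

```python
import numpy as np

def _block_size(K, parts):
    """Size of each block; the sketch requires K to divide evenly."""
    if K % parts != 0:
        raise ValueError(f"K={K} is not divisible by {parts}")
    return K // parts

def effect_vector(scenario, K, beta):
    """Effect size vector mu for scenarios 1 to 4."""
    if scenario == 1:
        return beta * np.arange(1, K + 1) / K
    if scenario == 2:
        half = _block_size(K, 2)
        return np.concatenate([np.zeros(half), np.full(half, beta)])
    if scenario == 3:
        k = _block_size(K, 5)
        ramp = -2.0 * beta / (k + 1) * np.arange(1, k + 1)
        return np.concatenate([np.zeros(2 * k), np.full(2 * k, beta), ramp])
    if scenario == 4:
        k = _block_size(K, 14)
        ramp = 2.0 * beta / (k + 1) * np.arange(1, k + 1)
        return np.concatenate([np.zeros(13 * k), ramp])
    raise ValueError("scenario must be 1, 2, 3, or 4")
```

The ramp $\frac{2\beta}{k+1}(1,\dots,k)$ has mean $\beta$, so the last block in scenarios 3 and 4 has an average effect of $-\beta$ and $\beta$, respectively.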

Simulation results

Type I error rates

Table 1 shows the estimated Type I error rates at different significance levels for all six methods with the phenotypic correlation matrix $R$ of 70 phenotypes. The Type I error rates with the correlation matrices $R(10^{-5})$ and $R(10^{-4})$ of 70 phenotypes are recorded in Tables S1 and S2. From these tables, we can see that the sCLC approach controls the Type I error rates very well at the different significance levels $\alpha$, which indicates that it is a valid test. Among the five competing methods, SSU yields inflated Type I error rates when $\alpha$ is smaller, and the other four methods control Type I error rates very well. Table S3 shows the estimated Type I error rates at different significance levels for all six methods with the phenotypic correlation structure for the 40 phenotypes. We observe that all methods control Type I error rates well.

Table 1.

The estimated Type I error rates at different significance levels for the six methods with the phenotypic correlation structure for the 70 phenotypes.

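| $\alpha$ | $1\times10^{-3}$ | $1\times10^{-4}$ | $1\times10^{-5}$ | $1\times10^{-6}$ | $1\times10^{-7}$ |
|---|---|---|---|---|---|
| SSU | $1.05\times10^{-3}$ | $1.13\times10^{-4}$ | $1.25\times10^{-5}$ | $1.61\times10^{-6}$ | $2.29\times10^{-7}$ |
| sCLC | $1.07\times10^{-3}$ | $1.05\times10^{-4}$ | $1.06\times10^{-5}$ | $1.17\times10^{-6}$ | $7.98\times10^{-8}$ |
| Hom | $1.00\times10^{-3}$ | $9.82\times10^{-5}$ | $1.01\times10^{-5}$ | $9.47\times10^{-7}$ | $9.97\times10^{-8}$ |
| Wald | $1.01\times10^{-3}$ | $1.00\times10^{-4}$ | $9.98\times10^{-6}$ | $1.17\times10^{-6}$ | $1.7\times10^{-7}$ |
| aMAT | $9.97\times10^{-4}$ | $1.00\times10^{-4}$ | $1.02\times10^{-5}$ | $1.17\times10^{-6}$ | $1.3\times10^{-7}$ |
| PCFisher | $1.00\times10^{-3}$ | $9.90\times10^{-5}$ | $1.01\times10^{-5}$ | $1.09\times10^{-6}$ | $1.5\times10^{-7}$ |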

The bold-faced values indicate that the type I error rates cannot be controlled.

Power comparisons

Power comparison results of the six methods under the four scenarios with the phenotypic correlation matrix $R$ of 70 phenotypes are presented in Fig. 1. Figures S1 and S2 show the power comparisons of the six methods with the correlation matrices $R(10^{-5})$ and $R(10^{-4})$ of 70 phenotypes, respectively. From these figures, we can observe the following. (1) When SNPs have homogeneous effects on the phenotypes (scenarios 1 and 2), our proposed method sCLC, as well as Hom and SSU, has higher power than the three PC-based methods (Wald, aMAT, and PCFisher), whereas all the methods except Hom have comparable power when the SNP affects the phenotypes in different directions. (2) The power of Hom drops dramatically and is almost zero in scenario 3, while sCLC and SSU are robust to the direction of the genetic effect on the phenotypes. (3) sCLC and SSU are more powerful than the other methods when a SNP affects a small proportion of the phenotypes (scenario 4), and Hom is less powerful in this case. (4) In all four scenarios, the power patterns observed in Figs. S1 and S2 are very close to those in Fig. 1, indicating that the estimation errors (noise $\delta$) of $R$ have little influence on the power of all the methods. Figure S3 shows the power comparisons of the six methods with the phenotypic correlation structure for the 40 phenotypes. sCLC is still more powerful than the other five methods under all four scenarios.

Figure 1.

Power comparisons of the six methods, SSU, sCLC, Hom, Wald, aMAT, and PCFisher, for the phenotypic correlation structure of the 70 phenotypes at a significance level of $5\times10^{-8}$.

Application to UK Biobank summary statistics

Connective tissue dysplasia (CTD) and musculoskeletal disorders29–31, such as Systemic Lupus Erythematosus (SLE), Sjögren Syndrome (SS), and Rheumatoid Arthritis (RA), may influence the physical activity or movement of patients. These kinds of diseases seriously affect the quality of life of people and have been reported to be potentially affected by genetic factors32. In this paper, we consider the GWAS summary statistics in the XIII category of UK Biobank with 70 musculoskeletal system and connective tissue phenotypes to detect potential genetic factors.

The UK Biobank is a large long-term biobank study that has recruited almost half a million participants in the UK, enrolled at ages 40 to 69 years33. Sequenced genotypes for 488,377 participants with 784,256 variants on autosomal chromosomes were extracted from the UK Biobank dataset34. Similar to Liang et al.28, we first perform quality control (QC) on genotypes and individuals by using PLINK 1.935. We remove SNPs with missing rates larger than 5%, p-values from the Hardy–Weinberg equilibrium exact test less than $10^{-6}$, or minor allele frequency (MAF) less than 5%. In addition, we screen out individuals with a missing genotype rate larger than 5% or without sex information. After this preprocessing, 466,580 individuals with 288,647 genetic variants remain.

The phenotypes considered in our study are those coded by the International Classification of Diseases, 10th Revision (ICD-10) codes. We truncate the full ICD-10 code to the UK Biobank ICD-10 level 3 code (http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=41202) to define Electronic Health Record (EHR)-derived phenotypes. When an individual has the truncated ICD-10 code recorded for a specific phenotype, the corresponding EHR-derived phenotype for that individual is coded as 1; otherwise it is coded as 0 (1 for cases and 0 for controls). In the XIII category, we only consider phenotypes with more than 200 cases, which gives a total of 72 unique phenotypes, such as rheumatoid arthritis (M06.9) and Systemic Lupus Erythematosus (M32.9). Table S4 lists the ICD-10 code, the name of the disease, heritability, and case–control ratio for each of the 72 phenotypes. Since our proposed method is a population-based method and cannot be applied to a mixed population due to population stratification, we analyze 409,672 individuals with white British ancestry. Similar to Liang et al.28, we also exclude individuals who are marked as outliers for heterozygosity or have been identified to have more than ten third-degree relatives or closer, etc. The final dataset includes $N=322,607$ individuals with $M=288,647$ common variants across $K=72$ phenotypes for analyses. All the phenotypes are adjusted by 13 covariates, including age, sex, genotyping array, and the first 10 genetic principal components (PCs).

To apply our method, we first calculate the GWAS summary statistics for the 72 phenotypes based on 288,647 SNPs. We observe that all 72 phenotypes have extremely unbalanced case–control ratios: the largest case–control ratio is 0.03937 for Gonarthrosis (M17.9) and the smallest is 0.000658 for Lumbar and other intervertebral disk disorders with myelopathy (M51.0). Therefore, we use the saddlepoint approximation (SPA)36 to calculate the adjusted Z scores. For the $j$th SNP and $k$th phenotype ($j=1,\dots,M$; $k=1,\dots,K$), we calculate the score test statistic37 $S_{kj}=\sum_{i=1}^{N}(Y_{ik}-\bar{Y}_k)G_{ij}$, where $\bar{Y}_k=\sum_{i=1}^{N}Y_{ik}/N$, $Y_{ik}$ denotes the $k$th phenotype for the $i$th individual, and $G_{ij}$ denotes the genotype of the $j$th SNP for the $i$th individual ($i=1,\dots,N$). The adjusted Z score is defined as $Z_{kj}=\mathrm{sign}(S_{kj})\sqrt{F_{Chi}^{-1}(1-p_{kj})}$, where $F_{Chi}(\cdot)$ denotes the cumulative distribution function of $\chi_1^2$ and $p_{kj}$ is the p-value of $S_{kj}$ obtained using SPA36. Based on the adjusted Z scores, we then apply LDSC to estimate the correlation matrix among phenotypes. We run the single-trait LDSC21 to estimate the diagonal element for each phenotype, and the off-diagonal elements are estimated by the cross-trait LDSC20. Two phenotypes, M79.6 (Enthesopathy of lower limb) and M67.8 (Other specified disorders of synovium and tendon), are excluded in this procedure because the estimators of their heritability are out of bounds. Therefore, there are a total of 70 phenotypes in the simulation studies and real data analysis. The phenotypic correlation matrix only needs to be estimated once for all SNPs. Finally, we apply our proposed sCLC method and the other five methods to test the association between each of the 288,647 SNPs and the 70 phenotypes, using the commonly used genome-wide significance level $\alpha=5\times10^{-8}$.
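
The score statistic and the conversion of a p-value into a signed Z score can be sketched as follows. The SPA p-value itself is taken as an input and is not computed here. The sketch maps the p-value to a Z score whose square equals the $\chi_1^2$ quantile at $1-p$, using the equivalent normal form $|Z|=-\Phi^{-1}(p/2)$, which avoids the loss of precision in $1-p$ for very small p-values. The function names are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np

def score_statistics(Y, G):
    """S[k, j] = sum_i (Y[i, k] - mean_k) * G[i, j] for N x K `Y` and N x M `G`."""
    Y = np.asarray(Y, dtype=float)
    G = np.asarray(G, dtype=float)
    return (Y - Y.mean(axis=0)).T @ G

def adjusted_z(score, p_value):
    """Signed Z score whose square is the chi-square(1) quantile at 1 - p."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    magnitude = -NormalDist().inv_cdf(p_value / 2.0)
    return math.copysign(magnitude, score)
```

For example, a p-value of 0.05 with a positive score maps to a Z score of about 1.96, and the same p-value with a negative score maps to about -1.96.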

Among the six methods, sCLC identifies the largest number of SNPs (969), whereas Hom identifies 74 SNPs, SSU identifies 872 SNPs, the Wald test identifies 654 SNPs, aMAT identifies 622 SNPs, and PCFisher identifies 585 SNPs. Figure 2A shows the Venn diagram for the five methods other than SSU, since SSU cannot control Type I error rates in our simulation studies. There are 33 SNPs identified by all five methods, and 318 SNPs identified only by sCLC. Figure 3 shows the Manhattan plot of the sCLC test results, in which 947 out of 969 SNPs are located on chromosome 6. To evaluate the 969 SNPs identified by sCLC, we first map those SNPs to genes using the commonly used UCSC reference gene file (https://hgdownload-test.gi.ucsc.edu/goldenPath/hg19/bigZips/genes/). Each gene has a position interval. A SNP is mapped to a gene if its position is within the interval or within 20 kb downstream or upstream of the interval. These 969 SNPs can be mapped to 235 genes. From the results, we find that 746 out of 969 SNPs can be matched to genes that have been reported to be associated with the Chapter XIII phenotypes in the GWAS catalog. Moreover, among the 318 SNPs identified only by sCLC, 229 SNPs can be mapped to genes that have been reported to be associated with those phenotypes.
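
The mapping rule (inside the gene interval, or within 20 kb on either side) can be expressed as a simple interval test. In the sketch below, the function name `map_snp_to_genes` and the gene records are hypothetical; real coordinates would come from the reference gene file.

```python
def map_snp_to_genes(chrom, pos, genes, window=20_000):
    """Return names of genes whose interval, extended by `window` bp
    on both sides, contains the SNP position.

    `genes` is an iterable of (chromosome, start, end, name) tuples.
    """
    return [
        name
        for g_chrom, start, end, name in genes
        if g_chrom == chrom and start - window <= pos <= end + window
    ]

# Hypothetical gene records, for illustration only.
toy_genes = [
    ("6", 100_000, 150_000, "GENE_A"),
    ("6", 160_000, 200_000, "GENE_B"),
    ("7", 100_000, 150_000, "GENE_C"),
]
```

A SNP can map to more than one gene when the extended intervals overlap, as with a position of 155,000 on chromosome 6 in the toy records above.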

Figure 2.

Venn diagram. (A) The number of significant SNPs identified by the five methods. (B) The number of lead SNPs identified by sCLC, Wald, aMAT, and PCFisher.

Figure 3.

Manhattan plot of the sCLC results for the multiple phenotypes in the UK Biobank XIII category. SNPs are ordered by genomic position on the x-axis, and the association strength, expressed as the transformed p-value $-\log_{10}p$, is shown on the y-axis.

However, SNPs within the same LD block are highly correlated and are more likely to be mapped to the same gene. For example, 205 out of the 969 identified SNPs are mapped to gene TSBP1-AS1, which is associated with 10 phenotypes in the XIII category; other genes such as NOTCH4, HLA-DRA, and HLA-DRB1 also have many identified SNPs mapped to them. Hence, we are also interested in the independent lead SNPs associated with those phenotypes. We use the Functional Mapping and Annotation (FUMA)38 platform to obtain independent lead SNPs and distinct risk loci. Here, independent lead SNPs are defined by $r^2<0.1$, and distinct loci are > 250 kb apart. The 969 SNPs identified by sCLC are represented by 13 lead SNPs located in 8 distinct risk loci; the 654 SNPs identified by Wald are represented by 10 lead SNPs located in 6 distinct risk loci; the 622 SNPs identified by aMAT are represented by 10 lead SNPs located in 7 distinct risk loci; and the 585 SNPs identified by PCFisher are represented by 10 lead SNPs located in 6 distinct risk loci. Since the MHC region is excluded by FUMA38, Hom has no lead SNPs. Figure 2B shows the Venn diagram of the lead SNPs for sCLC, Wald, aMAT, and PCFisher. There are 5 lead SNPs identified by all four methods, and 4 lead SNPs identified only by sCLC. Table 2 shows the details of the summary statistics for all 18 independent lead SNPs identified by those four methods. The grayed-out rows indicate that the SNPs/matched genes have been reported in the GWAS catalog. There are 5 out of 13 lead SNPs for sCLC that have not been reported in the GWAS catalog, which may provide new insight into the potential genetic factors of the musculoskeletal system and connective tissue phenotypes. Among those 5 SNPs, SNP rs13107325 has a Combined Annotation-Dependent Depletion (CADD) score39 greater than 20, which indicates a high probability of a deleterious variant effect.

In addition, we compare the p-values of the 13 independent lead SNPs obtained by sCLC with the minimum p-value (MinP) among the 70 p-values for testing the association between a SNP and each of the 70 phenotypes. Table S5 shows the comparison results. There are 6 out of 13 SNPs (grayed out) with MinP $>5\times10^{-8}$, indicating that univariate association tests do not detect an association between these six SNPs and any of the 70 phenotypes at the genome-wide significance level. However, by jointly analyzing the 70 phenotypes, sCLC identifies these six SNPs, suggesting that they may have pleiotropic effects on the phenotypes.

Table 2.

Summary statistics of the independent lead SNPs identified by sCLC, Wald, aMAT, PCFisher.

Chr SNP BP A1 A2 sCLC P Wald P aMAT P PCFisher P Mapped gene Reported trait
1 rs4846567 219,750,717 G T 2.88E−09 ZC3H11B M19.9; M85.8
4 rs4148157 89,020,934 A G 1.67E−16 6.54E−14 ABCG2 M10.9
4 rs2231142 89,052,323 G T 5.16E−17 3.96E−16 ABCG2 M10.9
4 rs13107325 103,188,709 C T 6.70E−09 7.46E−09 SLC39A8 M19.9
6 rs13212534 25,983,010 A G 9.47E−09 TRIM38
6 rs13195040 27,413,924 A G 9.00E−09 1.80E−08 ZNF184
6 rs13207082 27,251,379 A G 1.08E−10 2.31E−08 POM121L2 M85.8
6 rs67340775 28,304,384 A G 3.78E−12 ZKSCAN3
6 rs3117425 29,260,431 C T 1.46E−08 2.92E−08 OR14J1 M72.9
6 rs404240 29,523,957 A G 1.91E−11 GABBR1 M32.9; M85.8
7 rs2598104 37,977,249 C T 5.00E−16 1.07E−13 2.14E−13 5.81E−14 EPDR1 M72.0; M85.8
7 rs2290221 37,987,632 A G 5.32E−20 4.69E−19 EPDR1 M72.0; M85.8
7 rs118028828 38,026,155 C T 5.55E−17 2.22E−16
8 rs655028 70,049,047 A G 2.22E−16 7.08E−16 1.44E−15 4.31E−15
19 rs34945782 57,678,336 C T 1.34E−11 2.16E−08 4.32E−08 2.42E−08 DUXA M72.0; M85.9
22 rs62228062 46,381,234 A G 1.74E−35 2.88E−32 WNT7B M85.9
22 rs28698504 46,403,715 A G 6.23E−12 1.24E−09 2.48E−09 2.06E−08
22 rs9627391 46,447,097 C T 3.27E−13 2.50E−12 4.99E−12 1.50E−11 LINC00899 M72.0

Rows in bold indicate that the SNPs/mapped genes have been reported in the GWAS Catalog. “–” represents that the SNP is not an independent lead SNP for the corresponding method.

To better understand the biological meaning of the 235 mapped genes identified by sCLC, similar to Cao et al.40, we use the DAVID functional annotation software for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis41,42. There are 29 significantly enriched pathways identified by sCLC with FDR < 0.05 and enriched gene count > 2 (Fig. 4). From Fig. 4, we can observe that two related pathways are significantly enriched: systemic lupus erythematosus (hsa05322; FDR $=2.9\times10^{-32}$) and rheumatoid arthritis (hsa05323; FDR $=3.7\times10^{-7}$). In particular, there are 32 genes enriched in the systemic lupus erythematosus pathway, including eight genes in the HLA family (HLA-DMA, HLA-DMB, HLA-DOB, HLA-DQA2, HLA-DQA1, HLA-DRA, HLA-DRB1, HLA-DQB1), 20 genes in the four core histones (H2A(6): H2AC6, H2AC13, H2AC14, H2AC15, H2AC16, H2AC17; H2B(6): H2BC3, H2BC4, H2BC13, H2BC14, H2BC15, H2BC17; H3(4): H3C3, H3C10, H3C11, H3C12; H4(4): H4C3, H4C11, H4C12, H4C13), as well as four other genes (C2, C4B, C4A, TNF). For the rheumatoid arthritis pathway, sCLC identifies 104 SNPs mapped to 11 genes that are enriched in this pathway, including HLA-DMA, HLA-DMB, ATP6V1G2, HLA-DRA, LTB, TNF, HLA-DOB, HLA-DQA2, HLA-DRB1, HLA-DQA1, and HLA-DQB1.

Figure 4.

The KEGG pathway enrichment analysis is based on the genes identified by sCLC and the KEGG database. The pathways in red denote the pathways that are related to the diseases of the musculoskeletal system and connective tissue.

Discussion

In this paper, we propose sCLC, a multiple-phenotype association test based on GWAS summary statistics. Through a variety of simulation studies and an application to the UK Biobank XIII category summary statistics, we observed that sCLC is a valid and powerful approach. Specifically, sCLC detected some novel signals associated with the musculoskeletal system and connective tissue phenotypes, which provides further evidence that these diseases are potentially affected by genetic factors. The sCLC method is also computationally efficient. Because the estimation of the phenotypic correlation matrix R is independent of the association test for each SNP, R needs to be estimated only once, using LDSC, for all SNPs. In the real data analysis with 288,647 SNPs and 70 phenotypes, after estimation of R, the running time of sCLC on a computer with 4 Intel cores @ 3.60 GHz and 16 GB of memory was about 4 min 40 s.

sCLC, like many other multiple-phenotype association methods, including the methods compared in this article, tests the null hypothesis that a given variant does not contribute to any of the analyzed phenotypes. Therefore, a genetic variant will be identified by these methods even if it is associated with only one phenotype. Hence, the genetic variants identified by these methods may not be pleiotropic, and further analyses are required to assess the possibility of pleiotropy43. This is a limitation of the proposed method in identifying pleiotropic effects. Recently, several methods43–45 have been proposed to evaluate pleiotropic effects. For example, Schaid et al.43 proposed a statistical method to evaluate pleiotropy using a sequential testing framework. This approach can determine the number of phenotypes associated with a genetic variant and which phenotypes are associated, while accounting for correlations among the phenotypes. SHAHER44, a novel framework for analysis of the shared genetic background of correlated phenotypes, can identify genetic factors common to all analyzed phenotypes and genetic factors specific to each phenotype using genetic correlations between phenotypes. PolarMorphism46 is a summary-statistic-based framework to map and interpret pleiotropic loci in a joint analysis of multiple phenotypes. It identifies horizontally pleiotropic SNPs by converting the trait-specific SNP effect sizes to polar coordinates.
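
The limitation described above, that a test of the global null can flag a variant associated with only one phenotype, can be illustrated with a small numerical sketch. The example below does not implement sCLC; it uses a generic omnibus chi-square statistic z^T R^{-1} z as a stand-in, and the correlation matrix `R_example`, the Z-scores, and the function name `omnibus_pvalue` are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.stats import chi2


def omnibus_pvalue(z, R):
    """P-value of the statistic z^T R^{-1} z under the global null.

    Under H0 (no association with any phenotype), z ~ N(0, R), so the
    statistic follows a chi-square distribution with len(z) degrees of freedom.
    """
    z = np.asarray(z, dtype=float)
    stat = float(z @ np.linalg.solve(R, z))
    return float(chi2.sf(stat, df=z.size))


# Hypothetical phenotypic correlation matrix for three phenotypes.
R_example = np.array([
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

# Hypothetical Z-scores of a SNP showing signal for the first phenotype only.
z_single = np.array([6.0, 0.0, 0.0])
# Z-scores of a SNP with no signal for any phenotype.
z_null = np.zeros(3)

p_single = omnibus_pvalue(z_single, R_example)
p_null = omnibus_pvalue(z_null, R_example)

print(f"single-phenotype SNP: p = {p_single:.2e}")
print(f"null SNP:             p = {p_null:.2f}")
```

In this sketch the first SNP falls below the conventional genome-wide threshold of 5 × 10^-8 although only one of its three Z-scores is non-zero, so rejecting the global null does not by itself establish pleiotropy.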

On the other hand, the hierarchical clustering approach in sCLC clusters multiple phenotypes based on the phenotypic correlation matrix R. Therefore, the phenotypes in the same cluster may be affected by non-genetic factors, which may influence the power for disease variant discovery. Instead of the phenotypic correlation matrix, the genetic correlation matrix among multiple phenotypes20,21 can also be used in the hierarchical clustering. Furthermore, considering only the phenotypes with a significant non-zero heritability in the estimation of the genetic correlation matrix may also improve the statistical power of multiple phenotype association studies. Therefore, in our future work, we would like to consider using the genetic correlation matrix estimated by LDSC regression20 or using network-based approaches47 to cluster phenotypes based on shared genetic architectures.
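
The clustering step discussed above can be sketched as follows. This is a minimal illustration, not the sCLC implementation: the dissimilarity 1 − r, the average linkage, the fixed number of clusters, the function name `cluster_phenotypes`, and the toy matrix `corr_example` are all assumptions made for the example. The same function accepts either a phenotypic or a genetic correlation matrix, which is the substitution considered in this paragraph.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_phenotypes(corr, n_clusters):
    """Cluster phenotypes hierarchically from a correlation matrix.

    Assumptions for this sketch: dissimilarity is 1 - correlation and
    clusters are merged with average linkage.
    """
    corr = np.asarray(corr, dtype=float)
    dissim = 1.0 - corr
    np.fill_diagonal(dissim, 0.0)  # guard against rounding on the diagonal
    condensed = squareform(dissim, checks=False)  # square -> condensed form
    tree = linkage(condensed, method="average")
    # Cut the tree so that at most n_clusters clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical correlation matrix: phenotypes 0 and 1 are strongly
# correlated, as are phenotypes 2 and 3; the two pairs are weakly correlated.
corr_example = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])

labels = cluster_phenotypes(corr_example, n_clusters=2)
print(labels)
```

Replacing `corr_example` with an estimated genetic correlation matrix would leave the rest of the procedure unchanged, so the choice of matrix can be compared without altering the clustering code.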

Supplementary Information

Supplementary Information. (964.5KB, docx)

Acknowledgements

Part of this research was conducted using the UK Biobank Resource under application number 41722 and the NHGRI-EBI GWAS Catalog. X.C. was partially funded by the Michigan Technological University Health Research Institute Fellowship program and the Portage Health Foundation Graduate Assistantship. The High-Performance Computing Shared Facility (Superior) at Michigan Technological University was used to obtain the results presented in this publication.

Author contributions

Formal analysis: M.W.; research design: M.W., S.Z., and Q.S.; real data processing: X.C.; visualization: M.W. and X.C.; writing original draft: M.W., X.C., and Q.S.; writing review and editing: M.W., S.Z., X.C., and Q.S.

Data availability

UK Biobank data can be accessed by application through http://www.ukbiobank.ac.uk. UK Biobank has approval from the Research Ethics Committee (REC) under approval number 16/NW/0274. UK Biobank obtained participants' consent for the data to be used for health-related research, and all methods were performed in accordance with the relevant guidelines and regulations.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-023-30415-3.

References

  • 1. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am. J. Hum. Genet. 2012;90:7–24. doi: 10.1016/j.ajhg.2011.11.029.
  • 2. Lutz SM, Fingerlin TE, Hokanson JE, Lange C. A general approach to testing for pleiotropy with rare and common variants. Genet. Epidemiol. 2017;41:163–170. doi: 10.1002/gepi.22011.
  • 3. Pei G, et al. Investigation of multi-trait associations using pathway-based analysis of GWAS summary statistics. BMC Genomics. 2019;20:43–54. doi: 10.1186/s12864-018-5373-7.
  • 4. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet. 2017;18:117–127. doi: 10.1038/nrg.2016.142.
  • 5. Kwak I-Y, Pan W. Gene-and pathway-based association tests for multiple traits with GWAS summary statistics. Bioinformatics. 2017;33:64–71. doi: 10.1093/bioinformatics/btw577.
  • 6. Guo B, Wu B. Statistical methods to detect novel genetic variants using publicly available GWAS summary data. Comput. Biol. Chem. 2018;74:76–79. doi: 10.1016/j.compbiolchem.2018.02.016.
  • 7. Liang X, Wang Z, Sha Q, Zhang S. An adaptive Fisher’s combination method for joint analysis of multiple phenotypes in association studies. Sci. Rep. 2016;6:1–10. doi: 10.1038/srep34323.
  • 8. Deng Y, Pan W. Conditional analysis of multiple quantitative traits based on marginal GWAS summary statistics. Genet. Epidemiol. 2017;41:427–436. doi: 10.1002/gepi.22046.
  • 9. Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat. Methods. 2014;11:407–409. doi: 10.1038/nmeth.2848.
  • 10. Liang X, Sha Q, Rho Y, Zhang S. A hierarchical clustering method for dimension reduction in joint analysis of multiple phenotypes. Genet. Epidemiol. 2018;42:344–353. doi: 10.1002/gepi.22124.
  • 11. Jiang C, Zeng Z-B. Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics. 1995;140:1111–1127. doi: 10.1093/genetics/140.3.1111.
  • 12. Stephens M. A unified framework for association analysis with multiple related phenotypes. PLoS ONE. 2013;8:e65245. doi: 10.1371/journal.pone.0065245.
  • 13. Zhu X, et al. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am. J. Hum. Genet. 2015;96:21–36. doi: 10.1016/j.ajhg.2014.11.011.
  • 14. Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet. Epidemiol. 2009;33:497–507. doi: 10.1002/gepi.20402.
  • 15. Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. J. Probab. Stat. 2012;2012.
  • 16. Liu Z, Lin X. A geometric perspective on the power of principal component association tests in multiple phenotype studies. J. Am. Stat. Assoc. 2019.
  • 17. Wu C. Multi-trait genome-wide analyses of the brain imaging phenotypes in UK Biobank. Genetics. 2020;215:947–958. doi: 10.1534/genetics.120.303242.
  • 18. Sha Q, Wang Z, Zhang X, Zhang S. A clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. Bioinformatics. 2019;35:1373–1379. doi: 10.1093/bioinformatics/bty810.
  • 19. Wang M, Zhang S, Sha Q. A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. PLoS ONE. 2022;17:e0260911. doi: 10.1371/journal.pone.0260911.
  • 20. Bulik-Sullivan B, et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 2015;47:1236–1241. doi: 10.1038/ng.3406.
  • 21. Bulik-Sullivan BK, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211.
  • 22. Turley P, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 2018;50:229–237. doi: 10.1038/s41588-017-0009-4.
  • 23. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56. doi: 10.1038/nature11632.
  • 24. Li X, Zhang S, Sha Q. Joint analysis of multiple phenotypes using a clustering linear combination method based on hierarchical clustering. Genet. Epidemiol. 2020;44:67–78. doi: 10.1002/gepi.22263.
  • 25. Liu Y, Xie J. Cauchy combination test: A powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 2020;115:393–402. doi: 10.1080/01621459.2018.1554485.
  • 26. Liu Y, et al. ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 2019;104:410–421. doi: 10.1016/j.ajhg.2019.01.002.
  • 27. Guo B, Wu B. Integrate multiple traits to detect novel trait–gene association using GWAS summary data with an adaptive test approach. Bioinformatics. 2019;35:2251–2257. doi: 10.1093/bioinformatics/bty961.
  • 28. Liang X, Cao X, Sha Q, Zhang S. HCLC-FC: A novel statistical method for phenome-wide association studies. PLoS ONE. 2022;17(11):e0276646. doi: 10.1371/journal.pone.0276646.
  • 29. Mosca M, Tani C, Vagnani S, Carli L, Bombardieri S. The diagnosis and classification of undifferentiated connective tissue diseases. J. Autoimmun. 2014;48:50–52. doi: 10.1016/j.jaut.2014.01.019.
  • 30. Nikolenko V, et al. Morphological signs of connective tissue dysplasia as predictors of frequent post-exercise musculoskeletal disorders. BMC Musculoskelet. Disord. 2020;21:1–7. doi: 10.1186/s12891-020-03698-0.
  • 31. Mosca M, Neri R, Bombardieri S. Undifferentiated connective tissue diseases (UCTD): A review of the literature and a proposal for preliminary classification criteria. Clin. Exp. Rheumatol. 1999;17:615–620.
  • 32. Iudici M, Cuomo G, Vettori S, Avellino M, Valentini G. Quality of life as measured by the short-form 36 (SF-36) questionnaire in patients with early systemic sclerosis and undifferentiated connective tissue disease. Health Qual. Life Outcomes. 2013;11:1–6. doi: 10.1186/1477-7525-11-23.
  • 33. Sudlow C, et al. UK biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
  • 34. McGuirl MR, Smith SP, Sandstede B, Ramachandran S. Detecting shared genetic architecture among multiple phenotypes by hierarchical clustering of gene-level association statistics. Genetics. 2020;215:511–529. doi: 10.1534/genetics.120.303096.
  • 35. Chang CC, et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience. 2015;4:s13742-015-0047-8.
  • 36. Daniels HE. Saddlepoint approximations in statistics. Ann. Math. Stat. 1954:631–650.
  • 37. Sha Q, Zhang Z, Zhang S. Joint analysis for genome-wide association studies in family-based designs. PLoS ONE. 2011;6:e21957. doi: 10.1371/journal.pone.0021957.
  • 38. Watanabe K, Taskesen E, Van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 2017;8:1–11. doi: 10.1038/s41467-017-01261-5.
  • 39. Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
  • 40. Cao X, Liang X, Zhang S, Sha Q. Gene selection by incorporating genetic networks into case-control association studies. Eur. J. Hum. Genet. 2022.
  • 41. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
  • 42. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
  • 43. Schaid DJ, et al. Multivariate generalized linear model for genetic pleiotropy. Biostatistics. 2019;20:111–128. doi: 10.1093/biostatistics/kxx067.
  • 44. Svishcheva GR, et al. A novel framework for analysis of the shared genetic background of correlated traits. Genes. 2022;13:1694. doi: 10.3390/genes13101694.
  • 45. Lee CH, Shi H, Pasaniuc B, Eskin E, Han B. PLEIO: A method to map and interpret pleiotropic loci with GWAS summary statistics. Am. J. Hum. Genet. 2021;108:36–48. doi: 10.1016/j.ajhg.2020.11.017.
  • 46. von Berg J, ten Dam M, van der Laan SW, de Ridder J. PolarMorphism enables discovery of shared genetic variants across multiple traits from GWAS summary statistics. Bioinformatics. 2022;38:i212–i219. doi: 10.1093/bioinformatics/btac228.
  • 47. Barabási A-L, Gulbahce N, Loscalzo J. Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 2011;12:56–68. doi: 10.1038/nrg2918.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
