Bioinformatics. 2023 Nov 22;39(12):btad707. doi: 10.1093/bioinformatics/btad707

Joint analysis of multiple phenotypes for extremely unbalanced case–control association studies using multi-layer network

Hongjing Xie, Xuewei Cao, Shuanglin Zhang, Qiuying Sha
Editor: Russell Schwartz
PMCID: PMC10697735  PMID: 37991852

Abstract

Motivation

Genome-wide association studies (GWAS) are an essential tool for analyzing associations between phenotypes and single nucleotide polymorphisms (SNPs). Most binary phenotypes in large biobanks are extremely unbalanced, which leads to inflated type I error rates for many widely used association tests for joint analysis of multiple phenotypes. In this article, we first propose a novel method to construct a Multi-Layer Network (MLN) using individuals with at least one case status among all phenotypes. Then, we introduce a computationally efficient community detection method to group phenotypes into disjoint clusters based on the MLN. Finally, we propose a novel approach, MLN with Omnibus (MLN-O), to jointly analyse the association between phenotypes and a SNP. MLN-O uses the score test to test the association between the merged phenotype of each cluster and a SNP, then uses the Omnibus test to obtain an overall test statistic for the association between all phenotypes and a SNP.

Results

We conduct extensive simulation studies to show that the proposed approach controls type I error rates and is more powerful than some existing methods. We also apply the proposed method to a real data set from the UK Biobank. Using phenotypes in Chapter XIII (Diseases of the musculoskeletal system and connective tissue) in the UK Biobank, we find that MLN-O identifies more significant SNPs than the other methods we compare it with.

Availability and implementation

https://github.com/Hongjing-Xie/Multi-Layer-Network-with-Omnibus-MLN-O.

1 Introduction

A genome-wide association study (GWAS) is defined by the National Institutes of Health as an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease (National Institutes of Health 2007). Common statistical methods usually test the association between a single phenotype and multiple single nucleotide polymorphisms (SNPs); that is, only one phenotype is tested at a time. Compared with analyzing a single phenotype and a SNP, the joint analysis of multiple phenotypes and a SNP can increase the power of detecting significant SNPs and improve computational efficiency (Ferreira and Purcell 2009, Im et al. 2012, O’Reilly et al. 2012, Aschard et al. 2014, Kim et al. 2015, Zhu et al. 2015, Sha et al. 2019). Recently, an increasing number of joint analysis methods have been put forward to analyse the relationship between multiple related phenotypes and a SNP.

Previous studies revealed that some SNPs are significantly associated with multiple phenotypes (Lobo 2008, Kent Jr 2009, Stearns 2010, Sivakumaran et al. 2011, Solovieff et al. 2013, Liang et al. 2018, Geiler-Samerotte et al. 2020, Liu et al. 2021), and joint analysis of multiple phenotypes is in accordance with biology (Van der Sluis et al. 2013). Therefore, many statistical methods for joint analysis of multiple phenotypes have been proposed (Galesloot et al. 2014). Briefly, these methods can be divided into three main categories. The first category is based on regression methods, such as the generalized linear mixed effects model (Fitzmaurice and Laird 1993), generalized estimating equations (Liang and Zeger 1986), MultiPhen (O’Reilly et al. 2012), and multivariate analysis of variance (MANOVA) (Cole et al. 1994), which is suitable for categorical independent variables. However, MANOVA might lose power when a SNP is associated with all phenotypes. Ray et al. (2016) proposed a unified score-based test statistic, USAT, that performs better than MANOVA in such situations and nearly as well as MANOVA elsewhere.

The second category combines test statistics from univariate analyses, which is simpler and more flexible than other methods, especially when the dependent variables include continuous, discrete, and survival data (Yang and Wang 2012). For example, Van der Sluis et al. (2013) developed the Trait-based Association Test that uses the Extended Simes procedure (TATES). TATES combines P-values obtained from standard univariate GWAS to acquire one trait-based P-value while correcting for correlations between components (Van der Sluis et al. 2013). Zhang et al. (2014) presented a class of sum of powered score (SPU) tests that incorporate all score test statistics from univariate tests. Xu et al. (2003) developed a Wald Chi-squared type statistic by taking a quadratic form of the vector of univariate association test statistics.

The third category is based on variable reduction methods, which are, in general, only applicable to multiple continuous phenotypes that are approximately normally distributed (Yang and Wang 2012). Examples include the popular dimension reduction approach, principal component analysis of phenotypes (Aschard et al. 2014), and canonical correlation analysis (Tang and Ferreira 2012). Because the multiple phenotypes are related, dimension reduction of the dependent variables becomes more essential. More dimension reduction methods have emerged recently, such as the clustering linear combination (CLC) method (Sha et al. 2019), computationally efficient CLC (ceCLC) (Wang et al. 2022), the hierarchical clustering method (HCM) (Liang et al. 2018), and an agglomerative nesting clustering algorithm for phenotypic dimension reduction in the joint analysis of multiple phenotypes (AGNEP) (Liu et al. 2021). However, many of these methods are time consuming, and CLC, ceCLC, HCM, and AGNEP are not suitable for extremely unbalanced case–control studies (Cao et al. 2023).

In this article, we propose a novel dimension reduction method called MLN with Omnibus (MLN-O), which uses a Multi-Layer Network (MLN) and Omnibus test statistics to test the association between multiple phenotypes and a SNP. The MLN is designed for dimension reduction of correlated and extremely unbalanced case–control phenotypes. To build the MLN, we only consider individuals with at least one case status among all phenotypes, which can significantly reduce the running time of MLN construction and community detection when the case–control ratios of phenotypes are extremely unbalanced.

2 Materials and methods

Suppose there are $n$ unrelated individuals with $K$ correlated phenotypes and a SNP. The $k$th ($k=1,2,\dots,K$) binary phenotype of the $i$th individual is denoted by $y_{ik}$, where $y_{ik}=0$ or $1$ represents the control or case status of the individual, respectively. The SNP genotype of the $i$th individual is represented by $x_i$, which takes one of the three values 0, 1, and 2 to indicate the number of minor alleles at the SNP. Our proposed approach, MLN-O, involves the following three steps.

2.1 Step 1. Construct an MLN based on phenotypes

In the first step, we introduce a novel method to construct an MLN using individuals with at least one case status among all phenotypes. For each layer (individual) $i$, the $(j,k)$th element of the adjacency matrix $A^i$ is given by

$$A^i_{jk}=\begin{cases}1, & y_{ij}\times y_{ik}=1\\ 0, & \text{otherwise,}\end{cases} \tag{1}$$

with $i=1,\dots,n$ and $j,k=1,\dots,K$. The element $A^i_{jk}=1$ indicates that individual $i$ has a case status for both the $j$th and $k$th phenotypes. Then, we combine $A^i$ for all $n$ individuals to obtain the overall adjacency matrix $A$ with multiple layers, i.e. $A=\sum_{i=1}^{n}A^i$, where $A_{jk}=a$ means that there are $a$ individuals with a case status for both the $j$th and $k$th phenotypes. We then define the transformed similarity matrix $W$ as $W=\operatorname{diag}(A)^{-1/2}A\,\operatorname{diag}(A)^{-1/2}$.

Note that constructing the MLN using all individuals is equivalent to constructing it using only individuals who have at least one case status among all phenotypes. If the $i$th individual has a control status for all phenotypes, then $y_{ik}=0$ for $k=1,2,\dots,K$ and $A^i_{jk}=0$ for all pairs $(j,k)$, so the $i$th individual does not contribute to the construction of the MLN. Constructing the MLN without the all-control individuals can therefore significantly reduce the computational time.
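As an illustration of Step 1, the following minimal sketch builds the overall adjacency matrix and the transformed similarity matrix for a toy phenotype matrix, and checks that dropping all-control individuals leaves the MLN unchanged. The function name `build_mln` and the toy data are hypothetical, and the sketch assumes every phenotype has at least one case so that the diagonal of the adjacency matrix is non-zero.

```python
from math import sqrt

def build_mln(Y):
    """Return (A, W) for a 0/1 phenotype matrix Y (one row per individual)."""
    K = len(Y[0])
    A = [[0] * K for _ in range(K)]
    for row in Y:
        cases = [k for k, y in enumerate(row) if y == 1]
        # An all-control individual has no cases, so its layer adds nothing.
        for j in cases:
            for k in cases:
                A[j][k] += 1
    # W = diag(A)^{-1/2} A diag(A)^{-1/2}; assumes A[j][j] > 0 for every phenotype.
    W = [[A[j][k] / sqrt(A[j][j] * A[k][k]) for k in range(K)] for j in range(K)]
    return A, W

# Toy data (hypothetical): 6 individuals, 3 phenotypes, last two are all-control.
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
A_all, W_all = build_mln(Y)
A_cases, W_cases = build_mln([row for row in Y if any(row)])
print(A_all)
print(A_all == A_cases)
```

The diagonal entry for phenotype j is the number of cases of phenotype j, so each off-diagonal entry of W is the number of shared cases scaled by the geometric mean of the two case counts.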

The MLN has a clear advantage for analyzing extremely unbalanced case–control phenotypes, and its construction is intuitive. Each individual has a single-layer network, and we combine the networks of all individuals to obtain the MLN. The MLN uses the number of edges between different phenotypes, rather than the correlation of the phenotypes across all individuals in a dataset, to represent the strength of the connection among phenotypes, which enhances the connectivity of phenotypes. It only considers individuals with at least one case status and ignores individuals without any disease, because the latter carry no information about the clustering structure among phenotypes.

2.2 Step 2. Cluster phenotypes by a community detection method based on the transformed similarity matrix

Based on the transformed similarity matrix $W$, we propose the following community detection method based on the modularity measure (Fortunato 2010, Malliaros and Vazirgiannis 2013). We divide the $K$ phenotypes into $k_0$ clusters ($k_0=1,2,\dots,K$) using a complete hierarchical clustering method with similarity matrix $W$ and build a $K\times K$ connectivity matrix $C^{(k_0)}$, where the $(j,k)$th element is given by

$$C^{(k_0)}_{j,k}=\begin{cases}1, & \text{if } j \text{ and } k \text{ are in the same community}\\ 0, & \text{otherwise.}\end{cases}$$

Then we calculate the modularity of the network with $k_0$ clusters, denoted by $Q_{k_0}$, which is given by

$$Q_{k_0}=\frac{1}{2D}\sum_{j,k=1}^{K}\left(W_{j,k}-\frac{d_j d_k}{2D}\right)C^{(k_0)}_{j,k}, \tag{2}$$

where $d_j=\sum_{k}W_{j,k}$ represents the total degree of node $j$, and $D=\sum_{j}d_j/2$ stands for the total number of edges in the MLN.

Modularity is a measure of the structure of a network that quantifies how well the network is divided into different modules or clusters. A network with high modularity has dense connections between the phenotypes within clusters but sparse connections between phenotypes in different clusters. Using the modularity metric to decide the optimal number of clusters is straightforward and computationally efficient. We determine the optimal number of clusters as $L=\arg\max_{l=1,\dots,K}\{Q_l\}$.
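The clustering step can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: it assumes complete linkage is applied directly to the similarities in W (merging the two clusters whose weakest pairwise similarity is largest), and the function names and the toy matrix `W` are hypothetical.

```python
def complete_linkage_partitions(W):
    """Return {k0: labels} for k0 = K, ..., 1 using complete linkage on similarity W."""
    K = len(W)
    clusters = [[j] for j in range(K)]
    partitions = {}
    while True:
        labels = [0] * K
        for c, members in enumerate(clusters):
            for j in members:
                labels[j] = c
        partitions[len(clusters)] = labels
        if len(clusters) == 1:
            return partitions
        # Complete linkage: the similarity of two clusters is their weakest pairwise link.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = min(W[j][k] for j in clusters[a] for k in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

def modularity(W, labels):
    """Modularity Q of Equation (2) for a given cluster labelling."""
    K = len(W)
    d = [sum(W[j]) for j in range(K)]   # node degrees d_j
    D = sum(d) / 2                      # total edge weight D
    q = 0.0
    for j in range(K):
        for k in range(K):
            if labels[j] == labels[k]:  # C_{j,k} = 1
                q += W[j][k] - d[j] * d[k] / (2 * D)
    return q / (2 * D)

# Toy similarity matrix (hypothetical) with two visible blocks.
W = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
partitions = complete_linkage_partitions(W)
best_k = max(partitions, key=lambda k0: modularity(W, partitions[k0]))
print(best_k, partitions[best_k])
```

Because the hierarchy is built once and modularity is evaluated on each of its K levels, choosing the number of clusters costs only K modularity evaluations.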

After we cluster the $K$ phenotypes into $L$ clusters, where $1\le L\le K$, the phenotypes in the same cluster are merged into a single dichotomous phenotype. Suppose the $l$th cluster contains $K_l$ phenotypes, the $l_1$th, $\dots$, and $l_{K_l}$th phenotypes. Then, for the $i$th individual, the merged phenotype is defined as $Y_{il}=\max(y_{il_1},y_{il_2},\dots,y_{il_{K_l}})$ for $i=1,\dots,n$ and $l=1,\dots,L$.
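The merging rule amounts to taking, for each individual, the maximum over the phenotypes in a cluster. A minimal sketch, assuming cluster labels are coded 0, ..., L-1 with no empty cluster (the function name and toy data are hypothetical):

```python
def merge_phenotypes(Y, labels):
    """Merge binary phenotypes cluster-wise: Y_il = max of y_ik over phenotypes k in cluster l."""
    L = max(labels) + 1
    merged = []
    for row in Y:
        merged.append([max(y for y, c in zip(row, labels) if c == l) for l in range(L)])
    return merged

# Four phenotypes grouped into two clusters (hypothetical labels).
labels = [0, 0, 1, 1]
Y = [
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
Y_merged = merge_phenotypes(Y, labels)
print(Y_merged)
```

An individual is a case for the merged phenotype if it is a case for any phenotype in the cluster, so the number of cases of a merged phenotype is never smaller than that of any of its component phenotypes.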

2.3 Step 3. Test the association between phenotypes and a SNP

We first assume that there are no covariates. To test the association between the phenotypes in each cluster and a SNP, we assume that the merged phenotype of a cluster and the SNP follow the generalized linear model (Wu et al. 2011, Pan et al. 2014)

$$g\left(E[Y_{il}\mid x_i]\right)=\beta_{0l}+\beta_{1l}x_i \quad \text{for } i=1,2,\dots,n;\ l=1,2,\dots,L, \tag{3}$$

where $g(\cdot)$ is a monotonic link function (Li et al. 2020). The link function is the identity function under the linear regression model framework for quantitative or continuous phenotypes, and the logit function under the logistic regression model framework for qualitative or binary phenotypes.

If there are $p$ covariates $z_{i1},z_{i2},\dots,z_{ip}$ for the $i$th individual, we can adjust all phenotypes and the genotype for the covariates via the following linear regression models (Price et al. 2010):

$$x_i=\alpha_0+\alpha_1 z_{i1}+\dots+\alpha_p z_{ip}+\varepsilon_i,$$
$$Y_{il}=\gamma_{0l}+\gamma_{1l}z_{i1}+\gamma_{2l}z_{i2}+\dots+\gamma_{pl}z_{ip}+\tau_{il}.$$

That is, we use the residuals of the above linear models instead of the original phenotypes and genotype in the downstream analysis. Note that after we adjust the phenotypes for the covariates, $Y_{il}$ becomes a continuous phenotype.

We propose the following Omnibus statistic based on the MLN (MLN-O) to test the association between phenotypes and a SNP. To test the association between the $l$th merged phenotype and a SNP under model (3), we use the score test statistic given by

$$T_l=\frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i-\bar{x})(Y_{il}-\bar{Y}_l)}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(Y_{il}-\bar{Y}_l)^2}},$$

where $\bar{x}=\sum_{i=1}^{n}x_i/n$ and $\bar{Y}_l=\sum_{i=1}^{n}Y_{il}/n$ for $l=1,\dots,L$. To test the association between phenotypes in all clusters and a SNP, we use the Omnibus statistic to combine the score test statistics $T_1,\dots,T_L$ (Sha et al. 2019). The Omnibus statistic is given by

$$T_{\text{MLN-O}}=(T_1,T_2,\dots,T_L)\,\Sigma^{-1}\,(T_1,T_2,\dots,T_L)^{T}, \tag{4}$$

where $\Sigma=(\Sigma_{kl})$ is the correlation matrix of $(T_1,T_2,\dots,T_L)$. Based on Sha et al. (2019), we can use the correlation matrix of the $L$ merged phenotypes to estimate $\Sigma$, where the $(k,l)$th element of $\Sigma$ is estimated by $\hat{\Sigma}_{kl}=\sum_{i=1}^{n}(Y_{ik}-\bar{Y}_k)(Y_{il}-\bar{Y}_l)\big/\sqrt{\sum_{i=1}^{n}(Y_{ik}-\bar{Y}_k)^2\sum_{i=1}^{n}(Y_{il}-\bar{Y}_l)^2}$. Then, under the null hypothesis, $T_{\text{MLN-O}}$ follows a Chi-square distribution with $L$ degrees of freedom.
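To make Step 3 concrete, the sketch below computes the per-cluster score statistics and the Omnibus statistic of Equation (4) using only the standard library. It assumes each score statistic is the square root of n times the sample correlation between genotype and merged phenotype, so that it is approximately standard normal under the null; the helper names and toy data are hypothetical, and the linear system is solved by plain Gaussian elimination rather than an explicit matrix inverse.

```python
from math import sqrt

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def corr(u, v):
    """Pearson correlation of two equal-length, non-constant sequences."""
    cu, cv = center(u), center(v)
    num = sum(a * b for a, b in zip(cu, cv))
    return num / sqrt(sum(a * a for a in cu) * sum(b * b for b in cv))

def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][c] * z[c] for c in range(i + 1, n))
        z[i] = s / aug[i][i]
    return z

def mln_o_statistic(x, Y_merged):
    """Return (score statistics T_1..T_L, Omnibus statistic T' Sigma^{-1} T)."""
    n, L = len(x), len(Y_merged[0])
    cols = [[row[l] for row in Y_merged] for l in range(L)]
    T = [sqrt(n) * corr(x, col) for col in cols]
    Sigma = [[corr(cols[k], cols[l]) for l in range(L)] for k in range(L)]
    z = solve(Sigma, T)                      # z = Sigma^{-1} T
    return T, sum(t * zi for t, zi in zip(T, z))

# Toy data (hypothetical): 8 individuals, 2 merged phenotypes.
x = [0, 1, 2, 0, 1, 2, 0, 1]
Ym = [[0, 0], [0, 1], [1, 0], [0, 0], [1, 0], [1, 1], [0, 0], [0, 1]]
T, stat = mln_o_statistic(x, Ym)
print(T, stat)
```

A P-value would then be the upper tail of a Chi-square distribution with L degrees of freedom evaluated at the statistic, for example via `scipy.stats.chi2.sf(stat, L)` if SciPy is available.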

3 Comparison methods

We compare MLN-O with the following four commonly used methods for multiple phenotypes association studies.

MANOVA (Cole et al. 1994): MANOVA tests the association between categorical independent variables and multiple response variables simultaneously. The test statistic of MANOVA (Wilks' Lambda), $T_{\text{MANOVA}}$, measures the proportion of variance explained by the dependent variable and is equivalent to the likelihood ratio test, which can be approximated by an asymptotic $\chi^2$ distribution with $K$ degrees of freedom, where $K$ is the number of phenotypes.

USAT (Ray et al. 2016): USAT is a weighted sum of the Sum of Squared Score (SSU) test statistic (Yang and Wang 2012) and the MANOVA test statistic. Let $T_{\text{SSU}}$ denote the SSU test statistic, which follows a shifted and scaled $\chi^2$ distribution with 1 degree of freedom. USAT is defined as $T_{\text{USAT}}=\min_{0\le w\le 1}p_w$, where $p_w$ is the P-value of the test statistic $T_w=wT_{\text{MANOVA}}+(1-w)T_{\text{SSU}}$ and $w$ is a tuning parameter.

TATES (Van der Sluis et al. 2013): TATES integrates the P-values obtained from traditional univariate GWAS of a SNP with each of the multiple correlated phenotypes while adjusting for the correlations between phenotypes. The integrated phenotype-based P-value $P_{\text{TATES}}$ is obtained from the extended Simes procedure as $P_{\text{TATES}}=\min_{i}\left(K_e\,p_{(i)}/K_{ei}\right)$, where $K_e$ is the effective number of all $K$ P-values for a given SNP, $K_{ei}$ is the effective number of P-values among the top $i$ ($i=1,2,\dots,K$) P-values, and $p_{(i)}$ is the $i$th ordered P-value.

MultiPhen (O’Reilly et al. 2012): MultiPhen applies proportional odds logistic regression to regress the genotype of a SNP on multiple phenotypes and tests whether the effect sizes of all phenotypes are significantly different from zero. The resulting test statistic is based on the likelihood ratio test and asymptotically follows a $\chi^2$ distribution with $K$ degrees of freedom, where $K$ is the number of phenotypes.

4 Simulation studies

To evaluate the type I error rate and power of MLN-O, we generate genotypes according to the minor allele frequency (MAF) under Hardy–Weinberg equilibrium. To generate a dichotomous disease affection status, we first generate quantitative phenotypes similar to those in Wang et al. (2018). Then, we use a liability threshold model based on these quantitative phenotypes to define disease affection status. That is, for each phenotype, the individuals are ordered by their quantitative phenotype values in decreasing order and the top $nr$ individuals are defined as affected, where $n$ is the number of samples and $r$ is the case–control ratio. To generate binary phenotypes with extremely unbalanced case–control ratios, we use $r=0.001$ and $0.002$. The quantitative phenotypes are generated by the factor model

$$y=\lambda x+c\gamma f+\sqrt{1-c^2}\,\varepsilon,$$

where $y=(y_1,y_2,\dots,y_K)^T$ is a vector of phenotypes; $x$ is the genotype at a SNP of interest; $\lambda=(\lambda_1,\lambda_2,\dots,\lambda_K)^T$ is the vector of effect sizes of the SNP on the phenotypes; $f=(f_1,f_2,\dots,f_R)^T$ is a vector of factors with $R$ elements and $f\sim \text{MVN}(0,\Sigma)$, where $\Sigma=(1-\rho)I+\rho J$, $J$ is an $R\times R$ matrix with all elements equal to 1, $I$ is the identity matrix of the same size as $J$, and $\rho$ is the correlation between factors; $\gamma$ is a $K\times R$ matrix; $c$ is a constant; and $\varepsilon=(\varepsilon_1,\varepsilon_2,\dots,\varepsilon_K)^T$ is a vector of residuals, where $\varepsilon_1,\varepsilon_2,\dots,\varepsilon_K$ are independent and $\varepsilon_k\sim N(0,1)$ for $k=1,2,\dots,K$.

Based on the factor model, we consider the following four models, in which the within-factor correlation is $c^2$ and the between-factor correlation is $\rho c^2$. The phenotypic correlation configuration mimics that of UK10K (UK10K Consortium 2015); that is, the phenotypes are split into several phenotype blocks (phenotype factors), and the within-factor correlation is larger than the between-factor correlation (Sha et al. 2019).
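The stated within-factor and between-factor correlations follow directly from the factor model. As a short check, assume the block-diagonal $\gamma$ used in the models (each phenotype loads on exactly one factor with loading 1) and consider the phenotypes conditional on the genotype $x$; for two distinct phenotypes $y_j$ and $y_k$:

```latex
\begin{aligned}
\operatorname{Var}(y_j \mid x) &= c^2\operatorname{Var}(f_r) + (1-c^2)\operatorname{Var}(\varepsilon_j) = c^2 + (1-c^2) = 1,\\
\operatorname{Cov}(y_j, y_k \mid x) &= c^2\operatorname{Var}(f_r) = c^2
  && \text{if } j \text{ and } k \text{ load on the same factor } r,\\
\operatorname{Cov}(y_j, y_k \mid x) &= c^2\operatorname{Cov}(f_r, f_s) = \rho c^2
  && \text{if } j \text{ loads on } r \text{ and } k \text{ on } s \neq r.
\end{aligned}
```

Since each phenotype has unit conditional variance, these covariances are also the correlations, and $0\le\rho<1$ guarantees that the within-factor correlation exceeds the between-factor correlation.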

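Model 1: There are five factors ($R=5$) and genotypes impact on two factors with the same effect size. That is, $\lambda=(\beta_{11},\dots,\beta_{1k},\dots,\beta_{41},\dots,\beta_{4k},\beta_{51},\dots,\beta_{5k})$, where $k=K/5$, $\gamma=\text{Bdiag}(1_{K/5},\dots,1_{K/5})$ is a block diagonal matrix, $\beta_{11}=\dots=\beta_{1k}=\beta_{21}=\dots=\beta_{2k}=\beta_{31}=\dots=\beta_{3k}=0$, and $\beta_{41}=\dots=\beta_{4k}=\beta_{51}=\dots=\beta_{5k}=\beta$.

Model 2: There are five factors ($R=5$) and genotypes impact on two factors with the same effect size, but opposite directions. That is, $\gamma=\text{Bdiag}(1_{K/5},\dots,1_{K/5})$ is a block diagonal matrix, and $\lambda=(\beta_{11},\dots,\beta_{1k},\dots,\beta_{41},\dots,\beta_{4k},\beta_{51},\dots,\beta_{5k})$, where $k=K/5$, $\beta_{11}=\dots=\beta_{1k}=\beta_{21}=\dots=\beta_{2k}=\beta_{31}=\dots=\beta_{3k}=0$, $\beta_{41}=\dots=\beta_{4k}=-\beta$, and $\beta_{51}=\dots=\beta_{5k}=\beta$.

Model 3: There are five factors ($R=5$) and genotypes impact on two factors with different effect sizes and opposite directions. That is, $\gamma=\text{Bdiag}(1_{K/5},\dots,1_{K/5})$ is a block diagonal matrix, and $\lambda=(\beta_{11},\dots,\beta_{1k},\dots,\beta_{41},\dots,\beta_{4k},\beta_{51},\dots,\beta_{5k})$, where $k=K/5$, $\beta_{11}=\dots=\beta_{1k}=\beta_{21}=\dots=\beta_{2k}=\beta_{31}=\dots=\beta_{3k}=0$, $\beta_{41}=\dots=\beta_{4k}=-\beta$, and $(\beta_{51},\dots,\beta_{5k})=\frac{2\beta}{k+1}(1,2,\dots,k)$.

Model 4: There are ten factors ($R=10$) and genotypes impact on four factors. That is, $\gamma=\text{Bdiag}(1_{K/10},\dots,1_{K/10})$, $k=K/10$, and $\lambda=(\beta_{11},\dots,\beta_{1k},\beta_{21},\dots,\beta_{2k},\dots,\beta_{91},\dots,\beta_{9k},\beta_{10,1},\dots,\beta_{10,k})$, where $\beta_{11}=\dots=\beta_{1k}=\beta_{21}=\dots=\beta_{2k}=\dots=\beta_{61}=\dots=\beta_{6k}=0$, $(\beta_{71},\dots,\beta_{7k})=\frac{\beta}{k/2+1}(1,2,\dots,k/2,k/2,\dots,2,1)$, $(\beta_{81},\dots,\beta_{8k})=\frac{\beta}{k/2+1}(1,2,\dots,k/2,k/2,\dots,2,1)$, $\beta_{91}=\dots=\beta_{9k}=\beta$, and $\beta_{10,1}=\dots=\beta_{10,k}=\beta$.

In the simulation studies, we set MAF $=0.3$, the between-factor correlation $\rho c^2=0.24$, and the within-factor correlation $c^2=0.4$. To evaluate the type I error rates of the proposed method, we let $\beta=0$. To evaluate the power, we let $\beta\neq 0$. To generate phenotypes with extremely unbalanced case–control ratios, we use two different case–control ratios, 0.001 and 0.002.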
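The data-generating procedure can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' simulation code: the function name `simulate`, the effect size 0.1, and the reduced sample size and larger ratio (chosen so the example runs quickly) are hypothetical. Correlated factors are drawn by mixing a shared normal variable with independent ones, which yields unit variances and pairwise correlation $\rho$; the value $\rho=0.6$ follows from $\rho c^2=0.24$ and $c^2=0.4$.

```python
import random
from math import sqrt

def simulate(n, K, R, maf, lam, c2, rho, ratio, rng):
    """Draw genotypes and binary phenotypes from the factor model plus a liability threshold."""
    c = sqrt(c2)
    block = K // R                  # phenotypes per factor (block-diagonal gamma)
    x, Yq = [], []
    for _ in range(n):
        # Hardy-Weinberg equilibrium: genotype is Binomial(2, MAF).
        xi = sum(rng.random() < maf for _ in range(2))
        g0 = rng.gauss(0, 1)
        # f_r = sqrt(rho)*g0 + sqrt(1-rho)*g_r has variance 1 and pairwise correlation rho.
        f = [sqrt(rho) * g0 + sqrt(1 - rho) * rng.gauss(0, 1) for _ in range(R)]
        row = [lam[k] * xi + c * f[k // block] + sqrt(1 - c2) * rng.gauss(0, 1)
               for k in range(K)]
        x.append(xi)
        Yq.append(row)
    # Liability threshold: the top n*ratio individuals per phenotype are cases.
    n_cases = round(n * ratio)
    Y = [[0] * K for _ in range(n)]
    for k in range(K):
        top = sorted(range(n), key=lambda i: Yq[i][k], reverse=True)[:n_cases]
        for i in top:
            Y[i][k] = 1
    return x, Y

rng = random.Random(1)
beta = 0.1                           # hypothetical effect size
lam = [0.0] * 6 + [beta] * 4         # Model 1 pattern with K = 10, R = 5
x, Y = simulate(n=2000, K=10, R=5, maf=0.3, lam=lam,
                c2=0.4, rho=0.6, ratio=0.01, rng=rng)
print([sum(row[k] for row in Y) for k in range(10)])
```

Setting `lam` to all zeros gives data under the null hypothesis for type I error evaluation, while the other models only change the pattern of `lam` and the number of factors.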

5 Results

To evaluate the type I error rate, we set $\beta=0$ and the number of phenotypes $K=100$. Table 1 displays the type I error rates of the five methods, MLN-O, USAT, MultiPhen, MANOVA, and TATES, under model 1 with two case–control ratios (0.001 and 0.002), two sample sizes (20 000 and 30 000), and four significance levels (0.05, 0.01, 0.001 and 0.0001). For significance levels of 0.05, 0.01, 0.001 and 0.0001, the 95% confidence intervals (CIs) are (0.04865, 0.05135), (0.00938, 0.01062), (0.0008, 0.0012), and (0.00004, 0.00016) under 100 000 replicates, respectively. The values in boldface indicate that the type I error rates are out of control. From Table 1, we observe that our proposed method MLN-O, USAT, and MANOVA can control type I error rates, and USAT is conservative in all scenarios. MultiPhen cannot control type I error rates in all scenarios except at significance level $\alpha=0.0001$. Meanwhile, TATES cannot control type I error rates in some scenarios. We observe similar results under models 2–4 in Supplementary Tables S1–S3.
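The quoted 95% CIs are consistent with the usual normal approximation for a binomial proportion, $\alpha\pm1.96\sqrt{\alpha(1-\alpha)/N}$ with $N=100\,000$ replicates. The paper does not state the formula, so the sketch below is an assumption that happens to reproduce the reported bounds after rounding; the function name is hypothetical.

```python
from math import sqrt

def type1_ci(alpha, replicates, z=1.96):
    """Normal-approximation 95% CI for an estimated type I error rate at nominal level alpha."""
    half = z * sqrt(alpha * (1 - alpha) / replicates)
    return alpha - half, alpha + half

for alpha in (0.05, 0.01, 0.001, 0.0001):
    lo, hi = type1_ci(alpha, 100_000)
    print(f"alpha={alpha}: ({lo:.5f}, {hi:.5f})")
```

An estimated type I error rate above the upper bound suggests inflation, while one below the lower bound suggests a conservative test.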

Table 1.

The estimated type I error rates for two extremely unbalanced case–control ratios (ratio = 0.001 and ratio = 0.002) and two different sample sizes (n = 20 000 and n = 30 000) under model 1 at different significance levels with 100 000 replicates.a

Ratio   Sample   α-level   MLN-O     USAT      MANOVA    MultiPhen   TATES
0.001   20 000   0.05      0.04979   0.03131   0.04911   0.07791     0.04361
0.001   20 000   0.01      0.00959   0.00594   0.00946   0.01770     0.00887
0.001   20 000   0.001     0.00099   0.00057   0.00079   0.00199     0.00227
0.001   20 000   0.0001    0.00007   0.00003   0.00006   0.00022     0.00015
0.001   30 000   0.05      0.05115   0.03114   0.04991   0.06089     0.04901
0.001   30 000   0.01      0.01009   0.00622   0.00962   0.01252     0.01049
0.001   30 000   0.001     0.00099   0.00060   0.00088   0.00128     0.00127
0.001   30 000   0.0001    0.00011   0.00004   0.00007   0.00008     0.00012
0.002   20 000   0.05      0.04964   0.03191   0.04997   0.06590     0.04435
0.002   20 000   0.01      0.01001   0.00648   0.00960   0.01452     0.01169
0.002   20 000   0.001     0.00110   0.00066   0.00086   0.00135     0.00124
0.002   20 000   0.0001    0.00008   0.00014   0.00013   0.00016     0.00016
0.002   30 000   0.05      0.04952   0.03082   0.04909   0.06780     0.04376
0.002   30 000   0.01      0.00999   0.00627   0.00934   0.01475     0.01372
0.002   30 000   0.001     0.00092   0.00059   0.00095   0.00185     0.00172
0.002   30 000   0.0001    0.00011   0.00007   0.00007   0.00014     0.00010

a The bold-faced values indicate the estimated type I error rates beyond the upper bound of the corresponding 95% CIs.

We compare the power of MLN-O with the four comparison methods, USAT, MultiPhen, MANOVA, and TATES. We consider the same settings for the power comparison as for the type I error evaluations, namely two case–control ratios (0.001 and 0.002) and two sample sizes (20 000 and 30 000), with $\beta\neq 0$. Figure 1 shows the power comparisons of the five tests (MLN-O, USAT, MultiPhen, MANOVA, TATES) for 100 binary phenotypes. The sample size is 20 000 and the case–control ratio is 0.002. We observe that the proposed method, MLN-O, has the highest power and TATES has the lowest power in all settings under all models. The other three methods, USAT, MultiPhen and MANOVA, have similar power, but USAT outperforms MultiPhen and MANOVA in most of the settings, and MANOVA has slightly higher power than MultiPhen. Supplementary Fig. S1 shows the power comparisons of all methods for 100 binary phenotypes with the sample size 20 000 and the case–control ratio 0.001. Supplementary Figs S2 and S3 display the power comparison results for the sample size 30 000 under the case–control ratios 0.001 and 0.002, respectively. The power patterns shown in Supplementary Figs S1–S3 are similar to what we observe in Fig. 1.

Figure 1.


Power comparisons of the five tests (MLN-O, USAT, MANOVA, MultiPhen, and TATES) for 100 binary phenotypes. The sample size is 20 000 and the case–control ratio is 0.002. The between-factor correlation is 0.24 and the within-factor correlation is 0.4.

6 Applications to UK Biobank

The UK Biobank cohort contains about 500 000 participants in the United Kingdom (UK) (Bycroft et al. 2018), and genome-wide genotyping was performed using the UK Biobank Axiom Array. Approximately 850 000 variants were directly measured, with more than 90 million variants imputed using the Haplotype Reference Consortium and the UK10K and 1000 Genomes reference panels (https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/genetic-data). The UK Biobank has already inspired many researchers to explore the associations between human genetic variation and disease, and their connection with a wide range of environmental and lifestyle factors (Bycroft et al. 2018). Almost 3000 papers have been published using the UK Biobank data so far (https://www.ukbiobank.ac.uk/enable-your-research/publications). Various studies using UK Biobank data have successfully identified thousands of genetic variants, such as SNPs, that are associated with human traits and diseases (Solovieff et al. 2013).

The principal limitation of the UK Biobank data is that most of the binary phenotypes are extremely unbalanced, and many of the current methods for association studies are not suitable for this kind of phenotype. Our proposed MLN-O method is applicable to extremely unbalanced binary phenotypes. Following the phenotype preprocessing introduced by Liang et al. (2022), we consider 72 level 3 phenotypes with more than 200 cases in Chapter XIII (Diseases of the musculoskeletal system and connective tissue) (Field ID = 41202) in the UK Biobank. The phenotype codes are based on the International Classification of Diseases, 10th Revision (ICD-10). We also apply the quality controls (QCs) for both SNPs and individuals described in Liang et al. (2022), using PLINK 1.9 (https://www.cog-genomics.org/plink/1.9/), to select 288 647 SNPs. After preprocessing the phenotype data and performing QCs, 322 607 individuals are considered together with 72 phenotypes and 13 covariates: the first 10 principal components for adjusting for population stratification, genotype array, sex, and age. These phenotypes are extremely unbalanced. The minimum case–control ratio among these 72 phenotypes is 0.0006354481 (M25.4) and the maximum case–control ratio is 0.0378727058 (M17.9). We first construct the MLN of these 72 phenotypes. After applying the proposed community detection method to the MLN, these 72 phenotypes are partitioned into 19 disjoint clusters (Supplementary Table S4).

We apply the five methods to the UK Biobank data set described above. Figure 2 is the Manhattan plot of MLN-O, which shows that most of the significant SNPs are on chromosome 6. In the Manhattan plot, the vertical axis is the negative logarithm of the association P-value for each SNP and the horizontal axis represents the genomic coordinates of the 22 autosomes. The extent of each chromosome is shown by a block with a different color. At the significance level $\alpha=5\times10^{-8}$, the total number of SNPs identified by MLN-O is 1030; these are the dots above the horizontal line in Fig. 2. For comparison in the real data analysis, we also apply the other four methods, USAT, TATES, MANOVA, and MultiPhen, to the same UK Biobank data set. MANOVA detects 648 significant SNPs, followed by USAT (627). The other two methods, MultiPhen and TATES, identify 610 and 619 significant SNPs, respectively. The Manhattan plots of these four methods are shown in Supplementary Fig. S4, from which we can see that the majority of the significant SNPs identified by each method are on chromosome 6.

Figure 2.


Manhattan plot of the negative log-transformed P-values in the real data analysis by MLN-O against base pair positions for 22 autosomes. The horizontal line represents the GWAS significance level.

To visualize the intersecting sets of significant SNPs identified by the five methods, we present a Venn diagram in Fig. 3. From the Venn diagram, we can see that 330 significant SNPs are found by all five methods and 321 SNPs are identified only by MLN-O. Among the 321 SNPs detected only by MLN-O, two SNPs are reported in the GWAS Catalog as significantly associated with phenotypes in Chapter XIII (Diseases of the musculoskeletal system and connective tissue): rs3130340 is associated with bone mineral density (spine) (M85.8), as reported by Styrkarsdottir et al. (2008), and rs3130320 is associated with systemic lupus erythematosus (M32.9), as reported by Chung et al. (2011). We also map the 321 SNPs to gene regions (±0 kb window), and 248 SNPs are mapped to 119 genes. Among these 119 genes, 50 genes are reported in the GWAS Catalog as associated with phenotypes in Chapter XIII. Supplementary Table S5 shows the P-values of the five methods for the SNPs identified by MLN-O, the mapped genes, and references for the mapped genes that are reported in the GWAS Catalog as associated with the diseases in Chapter XIII.

Figure 3.


Venn diagram of significant SNPs identified by the five methods for Chapter XIII (Diseases of the musculoskeletal system and connective tissue) in UK Biobank.

7 Discussions

Statistical methods for joint analysis of multiple phenotypes have become an essential tool for increasing the statistical power to detect significant associations between multiple phenotypes and SNPs. In this paper, we propose a method, MLN-O, for testing the association between a SNP and multiple correlated binary phenotypes with extremely unbalanced case–control ratios. Through extensive simulation studies, we show that MLN-O correctly controls type I error rates. In the power comparison in the simulation studies, MLN-O has the highest power among all the methods we compared in all simulation settings under the four models. In the real data analysis, MLN-O identifies many more significant SNPs associated with the 72 correlated diseases of the musculoskeletal system and connective tissue than the other four methods. MLN-O detects 321 unique significant SNPs and half of them are reported in the GWAS Catalog to be associated with the corresponding diseases. We also find that a large proportion of the significant SNPs associated with the 72 correlated diseases of the musculoskeletal system and connective tissue are located on chromosome 6.

MLN-O has several advantages. First, the construction of the MLN is based on each individual’s case–control status and reveals the relationships among all phenotypes considered. The MLN uses the number of edges between different phenotypes, rather than the correlation of the phenotypes across all individuals in a dataset, to represent the strength of the connection among phenotypes. Second, community detection based on the MLN is computationally efficient, and phenotype clustering using the MLN is more robust and time efficient because of the dimension reduction. Third, clustering based on the MLN is suitable for extremely unbalanced case–control phenotypes no matter how small the case–control ratio is, as long as the sample size is large enough, since only the individuals with a case status are considered when clustering the correlated phenotypes. In fact, the smaller the case–control ratio, the faster the clustering. Fourth, after clustering and merging the correlated phenotypes, the case–control ratios of the merged phenotypes are increased. The application of MLN-O to the UK Biobank shows that MLN-O identifies many more significant SNPs than the comparison methods. From the simulation studies and real data analysis, we can see that clustering phenotypes based on a network is a competitive method. In the future, we can consider integrating genetic information to cluster phenotypes.

If the number of clusters in the MLN is relatively large, the omnibus test has a large number of degrees of freedom, which may affect the validity of the test. To assess the validity of MLN-O in a scenario involving a large number of clusters, we simulate an additional model (Model 5) with $K=100$ binary phenotypes in 50 groups. There are 50 factors and genotypes impact on ten factors with the same effect size, but opposite directions. That is, $\lambda=(\beta_{1,1},\beta_{1,2},\beta_{2,1},\beta_{2,2},\dots,\beta_{50,1},\beta_{50,2})$ and $\gamma=\text{Bdiag}(1_{K/50},\dots,1_{K/50})$ is a block diagonal matrix, where $\beta_{1,1}=\beta_{1,2}=\dots=\beta_{40,1}=\beta_{40,2}=0$; $\beta_{41,1}=\beta_{41,2}=\dots=\beta_{45,1}=\beta_{45,2}=-\beta$; and $\beta_{46,1}=\beta_{46,2}=\dots=\beta_{50,1}=\beta_{50,2}=\beta$. Supplementary Table S6 shows the estimated type I error rates of the five methods under Model 5 at different significance levels with 100 000 replicates. Similar to models 1–4, our proposed method MLN-O, USAT, and MANOVA can control type I error rates, but MultiPhen cannot control type I error rates in all scenarios and TATES cannot control type I error rates in some scenarios. The results show that the omnibus test remains valid with a large number of degrees of freedom.

In our proposed MLN-O, we use the omnibus statistic to test the association between the merged phenotypes and a SNP in step 3. In fact, any multiple-phenotype association test, such as MANOVA or USAT, can be substituted for the omnibus statistic. It is also important to emphasize that the proposed method, MLN-O, is specifically designed for extremely unbalanced case–control association studies. In scenarios where binary phenotypes are balanced, the majority of the merged phenotypes would take the value 1, so the proposed method is not suitable for such cases. Therefore, we recommend applying MLN-O only to multiple binary phenotypes with extremely unbalanced case–control ratios.
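A rough calculation illustrates why merging breaks down for balanced phenotypes. As a simplifying assumption that the paper does not make (phenotypes within a cluster are correlated, so the true probabilities are somewhat smaller), suppose the $K_l$ phenotypes in a cluster are independent with a common case proportion $r$; the cluster size $K_l=5$ is a hypothetical value:

```latex
\begin{aligned}
P(Y_{il}=1) &= 1-\prod_{k=1}^{K_l}\bigl(1-P(y_{il_k}=1)\bigr) = 1-(1-r)^{K_l},\\
r=0.002,\ K_l=5:&\quad 1-0.998^{5}\approx 0.00996,\\
r=0.5,\ K_l=5:&\quad 1-0.5^{5}=0.96875.
\end{aligned}
```

In the unbalanced case the merged phenotype remains rare but has roughly five times as many cases as a single phenotype, whereas in the balanced case almost every individual becomes a case and the merged phenotype carries little information.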

Acknowledgements

Part of this research has been conducted using the UK Biobank Resource under application numbers 41722 and 102999, and the NHGRI-EBI GWAS Catalog.

Contributor Information

Hongjing Xie, Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, United States.

Xuewei Cao, Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, United States.

Shuanglin Zhang, Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, United States.

Qiuying Sha, Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, United States.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

X.C. was in part funded by the Portage Health Foundation Graduate Assistantship.

Data availability

The data underlying this article were provided by the UK Biobank under applications. Data will be shared on request to the corresponding author with permission of the UK Biobank.

References

  1. Aschard H, Vilhjálmsson BJ, Greliche N et al. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet 2014;94:662–76.
  2. Bycroft C, Freeman C, Petkova D et al. The UK biobank resource with deep phenotyping and genomic data. Nature 2018;562:203–9. doi: 10.1038/s41586-018-0579-z
  3. Cao X, Zhang S, Sha Q. A novel method for multiple phenotype association studies based on genotype and phenotype network. bioRxiv, doi: 10.1101/2023.02.23.529687, 2023, preprint: not peer reviewed.
  4. Chung SA, Taylor KE, Graham RR et al.; SLEGEN. Differential genetic associations for systemic lupus erythematosus based on anti-dsDNA autoantibody production. PLoS Genet 2011;7:e1001323. doi: 10.1371/journal.pgen.1001323
  5. Cole DA, Maxwell SE, Arvey R et al. How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychol Bull 1994;115:465–74.
  6. Ferreira MA, Purcell SM. A multivariate test of association. Bioinformatics 2009;25:132–3.
  7. Fitzmaurice GM, Laird NM. A likelihood-based method for analysing longitudinal binary responses. Biometrika 1993;80:141–51.
  8. Fortunato S. Community detection in graphs. Phys Rep 2010;486:75–174.
  9. Galesloot TE, Van Steen K, Kiemeney LA et al. A comparison of multivariate genome-wide association methods. PLoS ONE 2014;9:e95923.
  10. Geiler-Samerotte KA, Li S, Lazaris C et al. Extent and context dependence of pleiotropy revealed by high-throughput single-cell phenotyping. PLoS Biol 2020;18:e3000836.
  11. Im HK, Gamazon ER, Nicolae DL et al. On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. Am J Hum Genet 2012;90:591–8.
  12. Kent JW Jr. Analysis of multiple phenotypes. Genet Epidemiol 2009;33:S33–S39.
  13. Kim J, Bai Y, Pan W. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genet Epidemiol 2015;39:651–63.
  14. Li X, Zhang S, Sha Q. Joint analysis of multiple phenotypes using a clustering linear combination method based on hierarchical clustering. Genet Epidemiol 2020;44:67–78.
  15. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
  16. Liang X, Cao X, Sha Q et al. HCLC-FC: a novel statistical method for phenome-wide association studies. PLoS ONE 2022;17:e0276646. doi: 10.1371/journal.pone.0276646
  17. Liang X, Sha Q, Rho Y et al. A hierarchical clustering method for dimension reduction in joint analysis of multiple phenotypes. Genet Epidemiol 2018;42:344–53.
  18. Liu F, Zhou Z, Cai M et al. AGNEP: an agglomerative nesting clustering algorithm for phenotypic dimension reduction in joint analysis of multiple phenotypes. Front Genet 2021;12:648831.
  19. Lobo I. Pleiotropy: one gene can affect multiple traits. Nat Educ 2008;1:10.
  20. Malliaros FD, Vazirgiannis M. Clustering and community detection in directed networks: a survey. Phys Rep 2013;533:95–142.
  21. National Institutes of Health. Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS). USA: National Institutes of Health. 2007.
  22. O'Reilly PF, Hoggart CJ, Pomyen Y et al. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE 2012;7:e34861.
  23. Pan W, Kim J, Zhang Y et al. A powerful and adaptive association test for rare variants. Genetics 2014;197:1081–95.
  24. Price AL, Kryukov GV, de Bakker PI et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 2010;86:832–8.
  25. Ray D, Pankow JS, Basu S. USAT: a unified score‐based association test for multiple phenotype‐genotype analysis. Genet Epidemiol 2016;40:20–34.
  26. Sha Q, Wang Z, Zhang X et al. A clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. Bioinformatics 2019;35:1373–9.
  27. Sivakumaran S, Agakov F, Theodoratou E et al. Abundant pleiotropy in human complex diseases and traits. Am J Hum Genet 2011;89:607–18.
  28. Solovieff N, Cotsapas C, Lee PH et al. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet 2013;14:483–95.
  29. Stearns FW. One hundred years of pleiotropy: a retrospective. Genetics 2010;186:767–73. doi: 10.1534/genetics.110.122549
  30. Styrkarsdottir U, Halldorsson BV, Gretarsdottir S et al. Multiple genetic loci for bone mineral density and fractures. N Engl J Med 2008;358:2355–65. doi: 10.1056/NEJMoa0801197
  31. Tang CS, Ferreira MA. A gene-based test of association using canonical correlation analysis. Bioinformatics 2012;28:845–50.
  32. UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature 2015;526:82.
  33. Van der Sluis S, Posthuma D, Dolan CV. TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet 2013;9:e1003235.
  34. Wang M, Zhang S, Sha Q. A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. PLoS ONE 2022;17:e0260911.
  35. Wang Z, Sha Q, Fang S et al. Testing an optimally weighted combination of common and/or rare variants with multiple traits. PLoS ONE 2018;13:e0201186.
  36. Wu MC, Lee S, Cai T et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011;89:82–93.
  37. Xu X, Tian L, Wei L. Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics 2003;4:223–9.
  38. Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. J Probab Stat 2012;2012:652569.
  39. Zhang Y, Xu Z, Shen X et al.; Alzheimer's Disease Neuroimaging Initiative. Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. NeuroImage 2014;96:309–25.
  40. Zhu H, Zhang S, Sha Q. Power comparisons of methods for joint association analysis of multiple phenotypes. Hum Hered 2015;80:144–52.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
