Statistical Impact of Sample Size and Imbalance on Multivariate Analysis in silico and A Case Study in the UK Biobank

Xinyuan Zhang; Ruowang Li; Marylyn D Ritchie

2021 Jan 25;2020:1383–1391.

PMCID: PMC8075427 PMID: 33936514

Abstract

Large-scale biobank cohorts coupled with electronic health records offer unprecedented opportunities to study genotype-phenotype relationships. Genome-wide association studies have uncovered disease-associated loci through univariate methods that focus on one trait at a time. With genetic variants being identified for thousands of traits, researchers found that 90% of human genetic loci are associated with more than one trait, highlighting the ubiquity of pleiotropy. Recently, multivariate methods have been proposed to effectively identify pleiotropy. However, their statistical performance in natural biomedical data, which often have unbalanced case-control sample sizes, is largely unknown. In this work, we designed 21 scenarios of real-data informed simulations to thoroughly evaluate the statistical characteristics of univariate and multivariate methods. Our results can serve as a reference guide for the application of multivariate methods. We also investigated potential pleiotropy across type II diabetes, Alzheimer’s disease, atherosclerosis of arteries, depression, and atherosclerotic heart disease in the UK Biobank.

Introduction

Understanding the genetic factors that contribute to disease susceptibility is central to human genetics research. Genome-wide association studies (GWAS) have uncovered thousands of genetic variants that are associated with complex diseases. A recent study found that 90% of these GWAS significant loci are associated with multiple diseases, suggesting widespread pleiotropy in the human genome¹. Pleiotropy describes a variant or a gene that influences more than one phenotype and plays a critical role in many aspects of biology^2-4. Univariate and multivariate methods are two types of statistical methods that can be applied to detect genetic associations with multiple diseases⁵. Univariate methods, such as GWAS, model one phenotype at a time, while multivariate methods jointly model the association across multiple phenotypes. Previous studies demonstrated that multivariate methods have higher power than univariate methods, which makes them promising for discovering pleiotropy. However, previous simulations were based on quantitative traits or balanced sample sizes (equal numbers of cases and controls)^6-8. Before these methods are applied to natural biomedical data, it is useful to know the type I error and power to expect under unbalanced sample size scenarios.

Sample size imbalance is a key feature of natural biomedical data. Because disease prevalence varies widely in the population, case-control sample sizes differ across the human phenome. For instance, phenome-wide association studies evaluate genetic associations across hundreds to thousands of diseases obtained from electronic health records^9,10, with varying case-control sample sizes. Most statistical methods are developed under the assumption of balanced cases and controls. When statistical methods are applied to natural biomedical data, it is crucial to understand their statistical characteristics under real-world scenarios. The role of sample size imbalance has been previously studied for univariate methods for both common and rare variants^11,12. However, to our knowledge, the impact of sample size imbalance on multivariate analyses is largely unknown.

Here, we conducted a simulation study informed by natural biomedical data to evaluate univariate and multivariate methods in identifying pleiotropy for binary phenotypes with different sample sizes. We designed 21 scenarios with various degrees of sample size imbalance and characterized type I error and power for logistic regression and MultiPhen¹³. MultiPhen was chosen for our study because it is designed for studying binary traits and has sufficient statistical power². The correlation structure used in the simulation was obtained from selected traits with different case sample sizes from the UK Biobank³. Our simulation results provide the landscape of type I error and power of univariate and multivariate methods under various scenarios, thus providing a potential reference guide for the application of these methods to natural biomedical data. Furthermore, it has been previously suggested that studying pleiotropy in large biobank cohorts coupled with electronic health records provides novel insights into biology^10,14-17. As a case study, we applied logistic regression (univariate method) and MultiPhen (multivariate method) to investigate potential pleiotropy across type II diabetes, Alzheimer’s disease, atherosclerosis of arteries, depression, and atherosclerotic heart disease in the UK Biobank.

Methods

Simulation Design

We designed 5 balanced and 16 unbalanced case sample size scenarios (Table 1), each with a total sample size of 10,000. In the balanced design, all five traits have the same case sample size, e.g. 100 cases for each of the five traits (Table 1). Our simulation was performed with the ‘bindata’ R package²⁶, a tool for generating multivariate binary phenotypes. An example of our simulation code is provided at the end of this manuscript, and we also deposited our simulation code on GitHub [https://github.com/blairzhang126/Multivariate-Sim]. We simulated 10 replicates for each scenario, with 100 independent datasets per replicate. We simulated one common genetic variant per dataset, with a minor allele frequency of 0.05. Genotypes were simulated under Hardy-Weinberg equilibrium. The genetic effect size was set to 0 for type I error simulations and 0.3 for power evaluations. The disease prevalence was set to achieve the desired case sample size. Phenotype correlation was estimated from five selected phenotypes given their case sample sizes (Table 2) from European individuals in the UK Biobank³ based on the following ICD-10 codes: severe depression episode without psychotic symptoms (F32.2), adjustment disorders (F43.2), other forms of angina pectoris (I20.8), other forms of chronic ischaemic heart disease (I25.8) and unspecified cardiomyopathy (I42.9).
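The link between disease prevalence and case sample size can be made concrete with a small sketch. It assumes the logistic model used in the simulation code at the end of this manuscript, and the helper `expected_cases` and the variable names are illustrative rather than part of the deposited code. For 100 cases out of 10,000, the logit of the prevalence is about -4.595, close to the intercept of -4.6 used in the code example; the refinement step accounts for the extra cases contributed by a risk allele with effect 0.3.

```r
# Sketch: choose a logistic intercept for a target case count (base R only).
n <- 10000          # total sample size, from the simulation design
maf <- 0.05         # minor allele frequency, from the simulation design
beta <- 0.3         # per-allele log-odds effect used for power evaluations

# Genotype probabilities (0, 1, 2 minor alleles) under Hardy-Weinberg equilibrium
geno_prob <- c((1 - maf)^2, 2 * maf * (1 - maf), maf^2)

# Expected number of cases for a given intercept (hypothetical helper)
expected_cases <- function(beta0, beta, n, geno_prob) {
  n * sum(geno_prob * plogis(beta0 + beta * 0:2))
}

# First approximation: logit of the target prevalence (ignores the SNP effect)
target_cases <- 100
beta0_approx <- qlogis(target_cases / n)

# Refinement: solve for the intercept whose expected case count hits the target
beta0_exact <- uniroot(
  function(b0) expected_cases(b0, beta, n, geno_prob) - target_cases,
  interval = c(-10, 0)
)$root

print(round(c(approx = beta0_approx, exact = beta0_exact), 3))
```

Because phenotypes are drawn at random, the realized case count in any one dataset will vary around the expected value.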

Table 1.

Case Sample Size Design

Balanced case sample size (the same for each of the five traits; one value per scenario)					Labels in plot
100	200	300	400	500	Scenario1-5
Unbalanced case sample size across five traits
Trait1	Trait2	Trait3	Trait4	Trait5	Labels in plot
100	100	100	100	500	Scenario6
100	100	100	500	500	Scenario7
100	100	500	500	500	Scenario8
100	500	500	500	500	Scenario9
200	200	200	200	500	Scenario10
200	200	200	500	500	Scenario11
200	200	500	500	500	Scenario12
200	500	500	500	500	Scenario13
300	300	300	300	500	Scenario14
300	300	300	500	500	Scenario15
300	300	500	500	500	Scenario16
300	500	500	500	500	Scenario17
400	400	400	400	500	Scenario18
400	400	400	500	500	Scenario19
400	400	500	500	500	Scenario20
400	500	500	500	500	Scenario21
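The design in Table 1 follows a regular pattern, so it can be rebuilt programmatically. The sketch below is an illustration, not the authors' code, and the object names (`balanced`, `unbalanced`, `scenarios`) are hypothetical. It assumes only what the table shows: scenarios 1-5 use one case count for all traits, and scenarios 6-21 raise one to four traits to 500 cases for each base size of 100 to 400.

```r
# Sketch: rebuild the 21 case-sample-size scenarios of Table 1 (base R only).
n_traits <- 5

# Scenarios 1-5: every trait has the same number of cases
balanced <- t(sapply(seq(100, 500, by = 100), function(k) rep(k, n_traits)))

# Scenarios 6-21: for each base size, 1 to 4 traits are raised to 500 cases
# (expand.grid varies n_large fastest, which matches the table order)
grid <- expand.grid(n_large = 1:4, base = seq(100, 400, by = 100))
unbalanced <- t(mapply(
  function(base, n_large) c(rep(base, n_traits - n_large), rep(500, n_large)),
  grid$base, grid$n_large
))

scenarios <- rbind(balanced, unbalanced)
dimnames(scenarios) <- list(paste0("Scenario", 1:21), paste0("Trait", 1:n_traits))
print(scenarios)
```

Each row of `scenarios` can then be converted to per-trait prevalences by dividing by the total sample size of 10,000.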


Table 2.

Phenotypes and Case Sample Size from UK Biobank

ICD10	Description	Broad disease category	Case sample size (after quality control)
E11.9	Type II diabetes without complications	Endocrine, nutritional and metabolic diseases	16,516
F32.3	Severe depressive episode with psychotic symptoms	Mental, behavioral and neurodevelopmental disorders	236
G30.9	Alzheimer’s disease	Diseases of the nervous system	325
I70.2	Atherosclerosis of arteries of the extremities	Diseases of the circulatory system	501
I25.1	Atherosclerotic heart disease	Diseases of the circulatory system	16,932


Type I error and Power calculation

For each replicate, we simulated 100 independent datasets. For MultiPhen, type I error and power were calculated as the proportion of the 100 datasets with a p-value less than 0.05. The p-value threshold for logistic regression was 0.01, corrected for the multiple testing burden across five traits (calculated as 0.05/5). Each bar in the bar plots in the Results section represents the type I error or power obtained from 10 replicates. The plots of simulation results were generated using the ggplot2 R package¹⁸.
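The calculation above amounts to counting rejections among the datasets of a replicate. The following sketch uses placeholder p-values drawn from a uniform distribution, not results from this study. It also makes one assumption the text does not state: for logistic regression, a dataset is counted as a rejection when any of its five trait-level tests falls below 0.01.

```r
# Sketch: turn per-dataset p-values into a type I error or power estimate.
set.seed(1)                      # arbitrary seed for the placeholder data
n_datasets <- 100                # datasets per replicate, from the text
n_traits <- 5

# Placeholder p-values; real ones would come from the fitted models
p_multiphen <- runif(n_datasets)
p_logistic <- matrix(runif(n_datasets * n_traits), nrow = n_datasets)

alpha_multiphen <- 0.05
alpha_logistic <- 0.05 / n_traits   # Bonferroni over five traits, i.e. 0.01

# Proportion of datasets in which the joint MultiPhen test rejects
rate_multiphen <- mean(p_multiphen < alpha_multiphen)

# ASSUMPTION: a dataset is a rejection if any of its five trait tests rejects
rate_logistic <- mean(apply(p_logistic, 1, min) < alpha_logistic)

print(c(MultiPhen = rate_multiphen, logistic = rate_logistic))
```

With an effect size of 0 these proportions estimate type I error, and with an effect size of 0.3 they estimate power.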

Quality Control in the UK Biobank

Our analyses were performed on white British individuals from the UK Biobank. We followed the quality control procedure described in the previous literature¹⁹. We excluded poor quality samples that had a sample missing rate higher than 5% or unusual heterozygosity¹⁹, and individuals who were related more closely than second degree. We further removed samples with sex mismatches. Among the remaining samples, we included individuals for whom phenotype and covariate information was available. For imputed genotype data, we performed our analysis on common variants with a minor allele frequency of ≥ 0.01 and an imputation info score of ≥ 0.3. We applied linkage disequilibrium filtering to select independent SNPs with “--indep-pairwise 1000 80 0.1” in PLINK²⁰. In total, 214,318 SNPs and 295,423 white British individuals were included in our subsequent analyses.

Association Analyses in the UK Biobank

We defined our phenotypes based on ICD-10 codes and selected five traits with unbalanced case sample sizes (Table 2). We performed logistic regression and MultiPhen on the individuals and genetic variants that passed quality control. All association models were adjusted for age, genetically inferred sex, genotyping array and the first 20 principal components. In total, 1,071,590 tests were performed for logistic regression, and the Bonferroni correction threshold is 4.67 × 10^-8 (calculated as 0.05/(214318 × 5)). For MultiPhen, the Bonferroni threshold is 2.33 × 10^-7 (calculated as 0.05/214318).
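The two thresholds follow directly from the number of tests each method performs, as this short calculation shows. The variable names are illustrative; the inputs are the SNP count and number of traits given above.

```r
# Sketch: Bonferroni thresholds for the two methods (base R only).
n_snps <- 214318     # independent SNPs after quality control
n_traits <- 5        # traits tested one at a time by logistic regression
alpha <- 0.05

# Logistic regression runs one test per SNP per trait
n_tests_logistic <- n_snps * n_traits
threshold_logistic <- alpha / n_tests_logistic

# MultiPhen runs one joint test per SNP
threshold_multiphen <- alpha / n_snps

print(n_tests_logistic)
print(signif(c(logistic = threshold_logistic, MultiPhen = threshold_multiphen), 3))
```

MultiPhen carries a five-fold lighter multiple testing burden because it tests all five traits jointly.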

Results

Type I error was controlled overall across all simulation scenarios (Figure 1). Type I error rates were comparable for logistic regression and MultiPhen, and most of the values were less than 0.1. The mean type I error across 10 replicates was around 0.05 for all simulation scenarios (Figure 1). Even with varying degrees of case sample size imbalance across the five traits, we did not observe an obvious trend between sample size imbalance and type I error under our simulation settings.
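Some spread in the per-replicate values is expected from sampling noise alone, since each estimate is a proportion over 100 datasets. The sketch below is a back-of-the-envelope calculation, not an analysis from this study. It assumes a test whose true rejection rate is exactly 0.05 and independent datasets, and it uses the normal approximation to the binomial.

```r
# Sketch: Monte Carlo noise in a rejection rate estimated from 100 datasets.
n_datasets <- 100    # datasets per replicate, from the Methods
n_replicates <- 10   # replicates per scenario, from the Methods
alpha <- 0.05        # ASSUMPTION: the true rejection rate equals the nominal level

# Binomial standard error of a single replicate's estimated rate
se_rate <- sqrt(alpha * (1 - alpha) / n_datasets)

# Approximate upper end of the range a single replicate may reach by chance
upper_approx <- alpha + 1.96 * se_rate

# Averaging over replicates shrinks the standard error by sqrt(n_replicates)
se_mean <- se_rate / sqrt(n_replicates)

print(round(c(se_rate = se_rate, upper_approx = upper_approx, se_mean = se_mean), 4))
```

Under these assumptions the standard error of a single replicate is about 0.022, so individual values approaching 0.1 are compatible with a well-calibrated test, while the mean over 10 replicates is considerably more stable.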

For balanced case sample size settings (scenarios 1-5), power increased with case sample size (Figure 2). Case sample sizes of more than 200 (scenarios 3-5) yielded a mean statistical power of >60%. For unbalanced case sample size scenarios (6-21), power increased as more traits with larger case sample sizes were added (refer to Table 1). The baseline power for each set (scenarios 6, 10, 14 and 18) also increased with the base case sample size. Interestingly, MultiPhen had higher power than logistic regression in most of the simulation scenarios (Figure 2).

We present our univariate and multivariate results from the UK Biobank in a Hudson plot (Figure 3)²⁷. The SNPs evaluated in our study are independent of each other, with an R-squared of less than 0.1 (see Methods). We observed very similar patterns of associations identified by logistic regression and MultiPhen (Figure 3). In total, we observed 22 Bonferroni significant variants identified by MultiPhen and 32 Bonferroni significant variants identified by logistic regression. Interestingly, all Bonferroni significant variants identified by MultiPhen were also identified by logistic regression (Figure 4).

Figure 4. Venn diagram of the Bonferroni significant variants identified by logistic regression and MultiPhen

We observed a common missense variant, rs11591147, located in the PCSK9 gene on chromosome 1, which is associated with atherosclerotic heart disease (p-value: 6.029 × 10^-11). The PCSK9 protein regulates cholesterol in the bloodstream and has been suggested to play a role in atherosclerosis²¹. SNP rs10738609 on chromosome 9 is an intron variant located in the CDKN2B-AS1 gene, which is a known hot spot gene for cardiovascular diseases²². We observed its significant association with atherosclerotic heart disease in our study (univariate p-value: 3.252 × 10^-76). We further looked at its association with the other tested diseases and observed an association with type II diabetes (univariate p-value: 1.461 × 10^-5) and a moderate level of association with atherosclerosis of arteries (univariate p-value: 0.0003428).

There are 15 Bonferroni significant variants associated with type II diabetes. We identified SNP rs8047395, located on chromosome 16 near the FTO gene (univariate p-value: 5.607 × 10^-12), which is a previously known variant associated with type II diabetes²³. We also found the known SNP rs76895963 to be associated with type II diabetes²⁸. SNP rs2673142 showed a moderately significant association with depression (p-value: 0.00034) in addition to type II diabetes.

For depression, we identified one novel associated variant, rs548613298. For Alzheimer’s disease, both methods identified three Bonferroni significant genetic variants located on chromosome 19 near the APOC1/APOE region (rs12691088, rs79701229 and rs60049679). This region is known to have a strong association with Alzheimer’s disease^24,25. These three genetic variants also showed a moderately significant association with atherosclerotic heart disease (with univariate p-values around 0.005). As for atherosclerosis of arteries, we did not observe any Bonferroni significant variant.

Discussion

Type I error was mostly controlled under 0.1 in our simulation scenarios. We did not observe an obvious impact of sample size imbalance on type I error (Figure 1). We found that statistical power increases as the number of phenotypes with larger case sample sizes increases (Figure 2). In particular, for unbalanced case sample sizes, power rose as more phenotypes with 500 cases were added. MultiPhen outperformed logistic regression in many of the sample size imbalance simulation settings (Figure 2). Multivariate methods previously demonstrated higher power than logistic regression¹³ under balanced sample sizes, and our work demonstrates the same trend in sample size imbalance scenarios.

For our case study in the UK Biobank, we performed logistic regression and MultiPhen analyses across type II diabetes, atherosclerotic heart disease, depression, Alzheimer’s disease and atherosclerosis of arteries. We identified many previously known genetic variants as well as novel variants. We demonstrated the effectiveness of applying both methods in identifying pleiotropy. MultiPhen identified 22 Bonferroni significant variants, all of which were also identified by logistic regression. MultiPhen may have identified fewer significant variants because of its limited power in scenarios where the genetic effect is consistent with the phenotypic correlation¹³. Applying both methods helps limit false positives in the discovery of pleiotropy and aids the interpretation of the results.

One of the limitations is that we evaluated only risk-increasing genetic effects; future work on protective genetic effects and on mixtures of both effect directions would give a more comprehensive understanding of the power of these methods. Evaluating additional scenarios that may provide more understanding of the inflation of type I error, which likely also leads to higher power for MultiPhen, would be warranted. While type I error was controlled at a rate of 0.10 or less, it would be beneficial to control it at 0.05 or less if possible. As for the case study, we only investigated independent SNPs. Future studies with broader coverage of genetic variants would shed more light on the biology.

In this work, we conducted a simulation study informed by natural biomedical data to characterize the statistical performance of univariate and multivariate methods in detecting genetic associations with multiple phenotypes. Our design of sample size imbalance offers a new perspective on the statistical performance of these methods, which should assist future discovery of pleiotropy. Our case study showcases the effectiveness of applying univariate and multivariate methods to identify pleiotropy in large-scale biobank cohorts.

Acknowledgement

We would like to thank Yogasudha Veturi and William Bone for the discussion on this project. This project is under UK Biobank application ID 32133.

Simulation Code Example

# R code for simulating one dataset with a balanced case sample size
# (about 100 cases per trait) for power evaluation.
# This code simulates 5 traits.

library(bindata)
library(MultiPhen)

n <- 10000
maf <- 0.05

# Users can specify different beta0 values to control the case sample size
beta0 <- c(-4.6, -4.6, -4.6, -4.6, -4.6)

# Simulate one SNP (0, 1 or 2 minor alleles) under Hardy-Weinberg equilibrium
x <- sample(c(0, 1, 2), n, replace = TRUE,
            prob = c((1 - maf) * (1 - maf), 2 * maf * (1 - maf), maf * maf))
x <- as.matrix(x)

# Users can specify different beta values to control the effect sizes of the SNP
beta <- c(0.3, 0.3, 0.3, 0.3, 0.3)

# Users can supply their own phenotype correlation matrix for the simulated traits.
# This example is the correlation matrix (b_cor) among the 5 traits described
# in the manuscript; the diagonal of a correlation matrix is 1.
b_cor <- matrix(c(
  1,             0.0415276512, 0.0007543885, 0.001951613, -0.001077797,
  0.0415276512,  1,            0.0008421039, 0.005441721,  0.002168689,
  0.0007543885,  0.0008421039, 1,            0.098728472,  0.003179557,
  0.0019516132,  0.0054417214, 0.0987284719, 1,            0.029784037,
  -0.0010777969, 0.0021686888, 0.0031795574, 0.029784037,  1
), nrow = 5, ncol = 5, byrow = TRUE)

# Per-individual disease probability for each trait from the logistic model;
# plogis(z) is exp(z) / (1 + exp(z))
prob <- matrix(nrow = n, ncol = 5)
for (j in 1:5) {
  prob[, j] <- plogis(beta0[j] + beta[j] * x[, 1])
}

# Draw one vector of 5 correlated binary phenotypes per individual
y <- t(apply(prob, 1, function(m) rmvbin(1, margprob = m, bincorr = b_cor)))
colnames(y) <- c("Trait_1", "Trait_2", "Trait_3", "Trait_4", "Trait_5")

# Univariate logistic regression for each trait; keep the SNP coefficient row
# (estimate, standard error, z value, p-value)
tmp_t <- t(sapply(1:5, function(j) {
  coef(summary(glm(y[, j] ~ x[, 1], family = binomial)))[2, ]
}))
rownames(tmp_t) <- paste0("tmp", 1:5)

write.table(tmp_t, file = "run1.logistic.output", quote = FALSE,
            row.names = TRUE, col.names = TRUE, sep = "\t")

# MultiPhen expects matching row names on the genotype and phenotype matrices
rownames(y) <- seq_len(n)
rownames(x) <- seq_len(n)

mPhen_out <- mPhen(x[, 1, drop = FALSE], y, phenotypes = "all", resids = NULL,
                   covariates = NULL, strats = NULL,
                   opts = mPhen.options(c("regression", "pheno.input")))

# Joint-model p-value across the five traits
mPhen_jointp <- mPhen_out$Results[, , , 2][6]

write.table(mPhen_jointp, file = "run1.multiphen.output", col.names = TRUE,
            row.names = TRUE, sep = "\t", quote = FALSE)

References

1. Watanabe K., et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat Genet. 2019;51:1339–1348. doi: 10.1038/s41588-019-0481-0.
2. Paaby A. B., Rockman M. V. The many faces of pleiotropy. Trends in Genetics. 2013;29:66–73. doi: 10.1016/j.tig.2012.10.010.
3. Stearns F. W. One hundred years of pleiotropy: A retrospective. Genetics. 2010;186:767–773. doi: 10.1534/genetics.110.122549.
4. Solovieff N., Cotsapas C., Lee P. H., Purcell S. M., Smoller J. W. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14:483–495. doi: 10.1038/nrg3461.
5. Hackinger S., Zeggini E. Statistical methods to detect pleiotropy in human complex traits. Open Biol. 2017;7:170125–13. doi: 10.1098/rsob.170125.
6. Galesloot T. E., van Steen K., Kiemeney L. A. L. M., Janss L. L., Vermeulen S. H. A comparison of multivariate genome-wide association methods. PLoS ONE. 2014;9:e95923–8. doi: 10.1371/journal.pone.0095923.
7. Majumdar A., Haldar T., Witte J. S. Determining which phenotypes underlie a pleiotropic signal. Genet. Epidemiol. 2016;40:366–381. doi: 10.1002/gepi.21973.
8. Porter H. F., O’Reilly P. F. Multivariate simulation framework reveals performance of multi-trait GWAS methods. Nature Publishing Group. 2017;7:1–12. doi: 10.1038/srep38837.
9. Ritchie M. D. Large-Scale analysis of genetic and clinical patient data. Annu. Rev. Biomed. Data Sci. 2018;1:263–274.
10. Pendergrass S. A., et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the population architecture using genomics and epidemiology (PAGE) Network. PLoS Genet. 2013;9:e1003087–26. doi: 10.1371/journal.pgen.1003087.
11. Verma A., et al. A simulation study investigating power estimates in phenome-wide association studies. BMC Bioinformatics. 2018;19:1–8. doi: 10.1186/s12859-018-2135-0.
12. Zhang X., Basile A. O., Pendergrass S. A., Ritchie M. D. Real world scenarios in rare variant association analysis - the impact of imbalance and sample size on the power in silico. BMC Bioinformatics. 2019;20:124. doi: 10.1186/s12859-018-2591-6.
13. O’Reilly P. F., et al. MultiPhen: Joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE. 2012;7:e34861–12. doi: 10.1371/journal.pone.0034861.
14. Verma A., et al. PheWAS and beyond: The landscape of associations with medical diagnoses and clinical measures across 38,662 individuals from Geisinger. The American Journal of Human Genetics. 2018;102:592–608. doi: 10.1016/j.ajhg.2018.02.017.
15. Zhang X., et al. Detecting potential pleiotropy across cardiovascular and neurological diseases using univariate, bivariate, and multivariate methods on 43,870 individuals from the eMERGE network. PSB. 2019;24:272–283.
16. Hall M. A., et al. Detection of pleiotropy through a phenome-wide association study (PheWAS) of epidemiologic data as part of the environmental architecture for genes linked to environment (EAGLE) study. PLoS Genet. 2014;10:e1004678–33. doi: 10.1371/journal.pgen.1004678.
17. Bush W. S., Oetjens M. T., Crawford D. C. Unravelling the human genome-phenome relationship using phenome-wide association studies. Nature Publishing Group. 2016;17:129–145. doi: 10.1038/nrg.2015.36.
18. Wickham H., Wickham M. H. The ggplot package. 2007.
19. Bycroft C., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
20. Purcell S., et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795.
21. Shapiro M. D., Fazio S. PCSK9 and atherosclerosis - lipids and beyond. JAT. 2017;24:RV17003–472. doi: 10.5551/jat.RV17003.
22. Kathiresan S., Srivastava D. Genetics of human cardiovascular disease. Cell. 2012;148:1242–1257. doi: 10.1016/j.cell.2012.03.001.
23. Wang T., et al. The association between common genetic variation in the FTO gene and metabolic syndrome in Han Chinese. Chinese Medical Journal. 2010;123:1852.
24. Kim J., Basak J. M., Holtzman D. M. The role of Apolipoprotein E in Alzheimer’s disease. Neuron. 2009;63:287–303. doi: 10.1016/j.neuron.2009.06.026.
25. Liu C.-C., Kanekiyo T., Xu H., Bu G. Apolipoprotein E and Alzheimer disease: risk, mechanisms and therapy. Nat Rev Neurol. 2013;9:106–118. doi: 10.1038/nrneurol.2012.263.
26. Leisch F., et al. bindata: Generation of artificial binary data. R package. 2005.
27. Hudson R package is freely available on GitHub: https://github.com/anastasia-lucas/hudson.
28. MacArthur J., et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133.

