Author manuscript; available in PMC: 2023 Apr 1.
Published in final edited form as: Trends Genet. 2022 Jan 3;38(4):353–363. doi: 10.1016/j.tig.2021.12.002

Maturation and application of phenome-wide association studies

Shiying Liu 1, Dana C Crawford 1,2,3
PMCID: PMC8930498  NIHMSID: NIHMS1763151  PMID: 34991903

Abstract

In the ten years since their introduction, phenome-wide association studies (PheWAS) have uncovered novel genotype-phenotype relationships. Along the way, PheWAS have evolved in many aspects as a study design, driven by the expanded availability of large data repositories with genome-wide data linked to detailed phenotypic data. Advances in methods, including algorithms, software, and publicly available integrated resources, make it feasible to more fully realize the potential of PheWAS by overcoming previous computational and analytical limitations. We review here the most recent improvements and notable applications of PheWAS from the second half of the decade since its inception. We also note the challenges that remain along the entire PheWAS analytical pipeline and that necessitate further development of tools and resources to advance the understanding of the complex genetic architecture underlying human diseases and traits.

Introduction

Phenome-wide association studies (PheWAS) are high-throughput studies originally designed to test one or a handful of disease- or trait-associated genetic variants for associations with other phenotypes (Figure 1). PheWAS identifies cross-phenotype associations, shared genetic architectures, pleiotropy1, and potentially novel genetic associations. Since its introduction into the literature2 in 2010, PheWAS has become an established approach now widely applied in genome-wide association studies (GWAS) alongside fine-mapping and other post-GWAS approaches. PheWAS has also evolved in its methodology, with advances in high-throughput phenotyping approaches and consideration of rare variation from whole exome and genome sequencing data. We review here the notable applications of PheWAS as well as advances in PheWAS methods since the last review3 in 2016.

Figure 1. Common units for tests of association in phenome-wide association studies.

PheWAS have evolved from testing one or a handful of genetic variants for association with multiple phenotypes to phenome-wide scans involving a variety of ‘omic variables as the defined unit for tests of association. The single nucleotide variant (A) remains the most popular unit in PheWAS. PheWAS was developed using common single nucleotide polymorphisms, and more recent PheWAS have expanded to include rare single nucleotide variants from newer generation genome-wide arrays and whole exome and whole genome sequencing efforts. Recent gene unit (B) approaches have been used in PheWAS owing to the availability of genome-wide rare and common variants. Shown here is a schematic of a gene, including the transcriptional start site (arrow), two exons (the boxes) with the untranslated regions denoted by their smaller box size, an intron (between the two exons), and flanking sequence. Definitions of gene units are user-specified and can vary widely, ranging from coding region variants only (which would exclude the flanking region, untranslated regions, and the intron, for example) to all genetic variants within the user-specified gene boundary. PheWAS units for tests of association also now include polygenic risk scores (PRS; C). Polygenic risk scores are typically calculated on a per-individual basis as the number of risk alleles for a trait or outcome weighted by each risk allele’s effect size estimated by large genome-wide association studies. Like gene units, PRS definitions are user-specified and can vary widely. PheWAS in recent years has expanded beyond traditional genomics to include transcriptomic data (D). The Genotype-Tissue Expression data resource is a popular reference dataset for expression quantitative trait loci used in PheWAS. Future PheWAS units for tests of association will undoubtedly expand to include novel variant annotation and gene unit definitions, variants beyond single nucleotide variants (e.g., copy number variants), and other ‘omic data.

Notable Applications of PheWAS

Some of the most notable applications in the last five years have been PheWAS performed in large biobanks linked to electronic health records (EHRs). These datasets are ideal for PheWAS given the breadth of phenotypes available, depth of the sample sizes, and availability of genome-wide data. The clinical data also allow for automated, high-throughput phenotyping using convenient and accessible tools available in R to generate phecodes from International Classification of Diseases (ICD) codes, for example.

The most comprehensive PheWAS results available to date include those from the UK Biobank and the Million Veteran Program (MVP), two of the largest biobanks with genome-wide data linked to EHRs and survey data. Established in 2006, the UK Biobank now consists of ~500,000 participants ages 40-69 living in the United Kingdom with clinical, epidemiological, and genomic (array-based and whole exome sequencing) data available for study4. The MVP began ascertainment in 2011 and today has approximately 600,000 United States Veterans with clinical, epidemiological, and array-based genome-wide data available5,6. While the UK Biobank is predominantly of European descent, approximately 30% of MVP participants self-identify as a member of a racial or ethnic group currently underrepresented in genomic studies. The UK Biobank individual-level data are accessible to investigators for research purposes7; in contrast, the equivalent data for MVP are currently accessible only to VA investigators. For both resources, summary statistics for precomputed genotype-phenotype associations are available for thousands of clinical and epidemiologically collected phenotypes, and these have been aggregated and made available in a searchable web browser known as the Global Biobank Engine (GBE8). With the addition of the Biobank Japan, the GBE is a large resource of ~750,000 participants with genotype-phenotype association summary data useful for transpopulation and population-specific PheWAS look-ups and hypothesis generation (Table). Individual-level data are not available through GBE, limiting the possibility of downstream analyses such as fine-mapping.

Table.

Resources for phenome-wide association studies (PheWAS)

Resource | Description | Website | Citation
Global Biobank Engine (GBE) | Summary statistics for precomputed genotype-phenotype associations available for ~750,000 participants from the Million Veteran Program, UK Biobank, and Biobank of Japan. | https://biobankengine.stanford.edu/ | 8
PhenomeXcan | A resource organizing results from the PheWAS catalog into gene-trait associations by leveraging available gene expression and regulatory data. | http://apps.hakyimlab.org/phenomexcan/ | 56
PheWAS Knowledgebase (PheKB) | An online repository that allows for collaborative efforts to build and share phenotyping algorithms applicable to EHRs that can be validated across multiple institutions with different EHR systems. | http://phekb.org | 65
PheCode Map | A resource consisting of beta versions of phecode maps translating ICD-10 and ICD-10-CM codes in EHRs to phecodes. | https://phewascatalog.org/phecodes_icd10 ; https://phewascatalog.org/phecodes_icd10cm | 38
PheMap | An online knowledge base that enables high-throughput phenotyping for EHRs via quantified concepts constructed from open-access resources with NLP. | https://www.vumc.org/cpm/phemap | 39
Open Targets Genetics | A platform that integrates GWAS findings with functional genomic data and enables robust systematic analysis with data visualization. | https://genetics.opentargets.org | 58
Cancer PRSweb | An integrated repository containing PRS for 35 major cancer traits constructed from previous GWAS findings, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs, and providing summary statistics for PRS-PheWAS related to those traits. | https://prsweb.sph.umich.edu | 60

Apart from the UK Biobank and MVP juggernauts, PheWAS has been conducted in several other biobanks representing single medical centers or collaborations across medical centers such as the electronic Medical Records & Genomics (eMERGE) Network. Among the single medical center PheWAS conducted since 2016, Geisinger representing clinical populations from Central Pennsylvania conducted its analyses for ~630,000 common SNPs (minor allele frequency >1%) genotyped by Regeneron on almost 39,000 patients participating in the MyCode Community Health Initiative9. This PheWAS in the DiscovEHR collaboration used the cross-phenotype associations of ICD codes and clinical laboratory measures to construct a disease-disease network where diseases and SNPs formed an edge in the bipartite network if they were associated9. The disease-disease network offered more formal examination of the cross-phenotype associations within and across disease classifications, and both novel putative disease-disease and disease-SNP associations were identified in this population of European-descent9.

Comprehensive PheWAS is appealing given the potential for discovery but is still limited in practice as it requires genome-wide and phenome-wide data. While limited to specific variants, focused PheWAS can be equally informative, revealing pleiotropic relationships for established, often functionally characterized variants already known to be important in human health. As an example, a PheWAS of APOE in the UK Biobank confirmed the strong, known relationships with dementia (Alzheimer's disease), hyperlipidemia, and ischemic heart disease with APOE alleles conferring either risk (e4) or protection (e2) compared with the e3/e3 referent10. New from this PheWAS are potential relationships between the risk-conferring APOE e4 allele-containing genotypes and associations with fewer cases of obesity, type 2 diabetes, chronic airway obstruction, gallstones, and liver disease compared with APOE e3/e3 genotypes10. The APOE e2/e2 genotype was associated with more cases of peripheral thromboembolism, aneurysms, peptic ulcers, cervical disorders, and bunions (hallux valgus) compared with e3/e3 genotypes10. Some of these same associations have been inconsistently reported in the literature, particularly the low frequency e2/e2 genotype, underscoring the statistical power afforded by the large and uniformly genotyped and phenotyped UK Biobank.

Another recent focused PheWAS features the UK Biobank and variants that tag the ABO blood groups. ABO, located on chromosome 9, encodes a glycosyltransferase that catalyzes the transfer of activated sugar molecules to the proper acceptor. Specific genetic variation in ABO dictates the resulting ABO blood group for an individual11. The most common blood group phenotypes in human populations are A, B, AB (co-dominant), and O (recessive, non-functional allele). ABO locus variation is suggestive of natural selection12 and individual variants that tag specific ABO blood groups are associated with thromboembolic13 and arterial disease outcomes and specific cancers such as pancreatic cancer14. Variants that tag ABO blood groups also associate with susceptibility to infectious disease15, including COVID-19. In a targeted UK Biobank PheWAS of four variants that tag ABO blood groups, several associations were noted including the corroboration of previously reported associations with thromboembolic and arterial disease outcomes16. Sex-stratified analyses, made possible by publicly available UK Biobank summary statistics, suggested several associations may be stronger in one sex compared with the other16.

The UK Biobank was also the main resource for a PheWAS of cirrhosis-associated genetic variants. Cirrhosis is characterized by impaired liver function caused by fibrosis or scarring of the liver. Like other complex human diseases, risk of cirrhosis is associated with environmental and genetic factors. In a combined GWAS/PheWAS effort involving both the UK Biobank and the Michigan Genomics Initiative17, several unexpected associations were identified such as opposing effects of cirrhosis-associated variants in TM6SF2 and SERPINA1 on peripheral (trunk) fat18. Cirrhosis-associated variants in PNPLA3 and TM6SF2 were associated unexpectedly with decreased hemoglobin and platelet traits as well as neutrophil count18. HFE rs1800562, an established risk variant for hemochromatosis and cirrhosis of the liver, was associated with risk of multiple sclerosis in this PheWAS. This latter finding highlights once again that the breadth and depth of today's biobanks enables the identification of independent and pleiotropic variants for relatively uncommon diseases such as cirrhosis and multiple sclerosis in human populations. Collectively, the combined GWAS-PheWAS for cirrhosis reveals a complex genetic architecture that complicates the development of treatments intended to reverse liver fibrosis18.

Other notable PheWAS applications have considered genetic variation in and around candidate genes as opposed to candidate variants. In one such PheWAS19, 26 genes linked to Alagille, Marfan, Noonan, and DiGeorge syndromes were identified using Online Mendelian Inheritance in Man (OMIM20) and Human Phenotype Ontology (HPO21). All four syndromes are rare in the general population and linked to specific Mendelian mutations in these 26 genes as reported in these databases. In this targeted PheWAS, both common and rare variants in these same genes were tested for association in the UK Biobank with individual phenotypes characteristic of components of each of the rare syndromes to characterize variable expressivity of these Mendelian genes as well as to identify pleiotropic relationships19. Several significant associations were identified in this targeted PheWAS conducted in a general population, potentially identifying genetic variants that modify the complex constellation of phenotypes characteristic of these four rare syndromes.

Targeted PheWAS are also being conducted across genes or genetic variants in the form of genetic risk scores. Genetic risk scores are typically calculated as the sum of the number of risk alleles an individual carries for any given trait or phenotype. Risk alleles are identified by GWAS, and genetic risk scores can be calculated on a per-individual basis as unweighted or weighted by the genetic effect of each genetic variant estimated by the GWAS22. Well-studied human traits such as body mass index, a common measure to categorize individuals for healthy or unhealthy weight, are ripe for genetic risk score PheWAS approaches. In one such targeted PheWAS, genetic risk scores based on 76 variants associated with body mass index were constructed and tested for associations with phecodes in the UK Biobank23. Identified associations were then assessed for causal association using Mendelian randomization. This PheWAS demonstrated causal associations between genetically determined body mass index and a variety of health outcomes, such as osteoarthrosis and knee derangement, that are associated with obesity in non-genetic observational studies.
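The unweighted and weighted genetic risk scores described above can be sketched in a few lines. This is a minimal illustration only: the variant identifiers, risk-allele counts, and effect sizes are hypothetical placeholders, not values from any cited study.

```python
from typing import Dict, Optional


def genetic_risk_score(dosages: Dict[str, float],
                       weights: Optional[Dict[str, float]] = None) -> float:
    """Sum risk-allele counts, optionally weighted by per-variant effect size."""
    score = 0.0
    for variant, dosage in dosages.items():
        # Unweighted score: every risk allele contributes 1.
        weight = 1.0 if weights is None else weights[variant]
        score += weight * dosage
    return score


# Hypothetical individual: risk-allele counts (0, 1, or 2) at three variants.
person = {"rsA": 2, "rsB": 0, "rsC": 1}
# Hypothetical per-allele effect sizes, as would be estimated by a GWAS.
betas = {"rsA": 0.10, "rsB": 0.25, "rsC": 0.05}

unweighted = genetic_risk_score(person)        # count of risk alleles
weighted = genetic_risk_score(person, betas)   # effect-size-weighted sum
```

In a PheWAS, a score computed this way for every participant would then serve as the single predictor tested against each phenotype.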

Targeted PheWAS is an attractive method under a broad umbrella encompassing drug development efforts. Successful drug development requires identifying effective targets while avoiding adverse events. Human genetic studies have identified several novel drug targets which have resulted in recent clinical trials that target low-density lipoprotein cholesterol levels for heart attack and stroke prevention, for example24. As an extension of these recent past successes, PheWAS have been conducted to identify possible adverse events associated with GWAS-identified variants mapped to 19 candidate drug targets25. This multi-biobank targeted PheWAS identified several associations between GWAS-identified genetic variants in potential drug target genes and phenotypes considered adverse events such as severe acne (associated with PNPLA3 rs738409 previously associated with ALT) and asthma (IFIH1 rs1990760 previously associated with type 1 diabetes25). An independent study leveraging PheWAS results from the UK Biobank demonstrated the utility of incorporating additional genetic features such as gene expression in the prediction of potential adverse drug events26.

Advances in PheWAS Phenotyping

Extracting phenotype information and accurately classifying phenotypes from EHRs is critical for genetic association studies in general. Most, but not all27-30, PheWAS leverage structured data available in EHRs such as ICD codes and ICD-Clinically Modified (CM) codes31. ICD codes or ICD-CM codes are primarily designed for clinical care and/or billing purposes32, and while their direct use in PheWAS is possible, it leads to unnecessary tests of association and potentially reduced statistical power due to multiple testing. Also, some billing codes do not fully capture the intended phenotype or represent meaningful disease groups. To achieve robust phenotyping, a common strategy is the use of a variety of phenotyping algorithms together with the classification software to group ICD codes31 (Table).

The most common automated ICD/ICD-CM groupings employed for PheWAS to date are phecodes. Phecodes were developed by a small group of physicians at a single institution who manually reviewed ICD-9-CM codes to establish the groups or bins by which cases of disease outcomes can be classified33 (Figure 2). While phecodes remain popular in part due to the availability of software that can be used to generate them34, they are not the only ICD/ICD-CM groupings available for PheWAS. For example, clinical classification software (CCS), developed as part of the Healthcare Cost and Utilization Project resources35, can collapse ICD-9-CM codes into various categories with clinical significance. Compared to phecodes, however, CCS does not offer the granularity required for PheWAS. Several new phecode groups have emerged since 2015, the year the ICD-9 based coding system was replaced by ICD-10 in the United States. ICD-10 offers more codes compared with ICD-9, including codes associated with detailed information for both the diagnosis (e.g., laterality and severity) and the lab test results36. From this change emerged methods that map ICD-10 codes to ICD-9 codes, and vice versa37; the converted ICD-9 codes can then be mapped to ICD-9-based phecodes for the subsequent association studies.
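As a rough sketch of the grouping idea, the snippet below collapses several billing codes into a single phenotype label per patient. The mapping table and the phenotype labels (`PHE_CKD`, `PHE_T2D`) are illustrative assumptions and do not reproduce the published phecode map; real analyses would load the curated map instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Toy ICD-9-CM -> phenotype-group map (illustrative only; not the published phecode map).
TOY_CODE_MAP: Dict[str, str] = {
    "585.3": "PHE_CKD",
    "585.4": "PHE_CKD",
    "585.9": "PHE_CKD",
    "250.00": "PHE_T2D",
    "250.02": "PHE_T2D",
}


def assign_groups(records: Iterable[Tuple[str, str]],
                  code_map: Dict[str, str]) -> Dict[str, Set[str]]:
    """Collapse (patient_id, icd_code) rows into the set of groups each patient has."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for patient_id, icd_code in records:
        group = code_map.get(icd_code)
        if group is not None:  # codes without a mapping are ignored
            groups[patient_id].add(group)
    return dict(groups)


# Hypothetical billing records.
records = [
    ("p1", "585.3"),
    ("p1", "585.9"),   # second kidney code collapses into the same group
    ("p2", "250.00"),
    ("p3", "V70.0"),   # unmapped code
]
patient_groups = assign_groups(records, TOY_CODE_MAP)
```

Grouping in this way means that closely related codes yield one test of association rather than several, which is the motivation given above for avoiding direct use of raw billing codes.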

Figure 2. Current and future trends in phenotyping for phenome-wide association studies (PheWAS).

PheWAS use high-throughput approaches to perform tests of association in large datasets such as clinical data linked to genome-wide data. Current PheWAS rely on phenotyping approaches that, while automated in the assignment of case-control status, are based on minimal curation by content experts who create the rules for case-control status. In the top half of this figure, a PheWAS of clinical data leverages the availability of structured data such as International Classification of Diseases (ICD) codes, 9th and 10th editions, clinically modified (CM). These data are minimally curated by a small group of physicians so that redundant codes and closely related codes are collapsed into a single phenotype or phecode for, in this example, chronic kidney disease. Future PheWAS phenotyping approaches (bottom half of the figure) are being developed to informatically extract both structured and unstructured data (clinical notes) available in electronic health records to allow for more data-driven approaches to case and control groupings (in this example, using unsupervised clustering approaches).

While the aforementioned ICD-9 phecode strategy to include newer ICD-10 data is now common practice, the conversion process is imperfect and loses information and granularity due to the differences in structure and semantics between the two coding systems. To address these limitations, new phecode maps designed to use ICD-10 and ICD-10-CM codes specifically have been developed38. The new phecode mappings were evaluated in two independent study populations, UK Biobank and a single medical center (Vanderbilt University Medical Center), in their phecode coverage, phenotype reproducibility, and detection of known PheWAS genotype-phenotype relationships. The evaluations demonstrated that the ICD-10/ICD-10-CM phecode map performs well regarding the phecode coverage and representativeness while yielding similar genotype-phenotype associations well established from GWAS and PheWAS.

In addition to the efforts focused on billing codes, there are efforts that integrate resources across different platforms to facilitate the automated phenotyping process. PheMap39 is one such effort that leverages natural language processing (NLP) to extract medical concepts from multiple public resources and subsequently maps the quantified concepts to phenotypes. The knowledge base that PheMap builds allows for efficient assignment of 841 unique phenotypes, supported by a large body of medical information. The continuous scores provided by PheMap also promise to improve the statistical power of downstream analyses required of PheWAS. PheMap, however, is limited in generalizability and in its ability to handle fine-scale phenotype definitions. There remains a need for phenotyping algorithms that can accurately identify distinct diseases and traits with an improved level of automation and freedom from manual curation while incorporating information from different sources.

More emphasis has been laid on developing automated approaches to further promote the efficiency and robustness of phenotyping. The enhanced scalability embedded in the automated annotation algorithms suits high-throughput phenotyping and promotes standardization across institutions. PheNorm40, developed in 2018, is an automated algorithm mainly leveraging the ICD-9-CM codes and tackles the time-consuming step of annotation. Using PheNorm, the informative features are obtained via domain knowledge or automated curation with the surrogate-assisted feature extraction method41 and then incorporated into the normal mixture model to generate the final prediction score. The requirement for converting the prediction score into the predicted probability and the subsequent manual selection for a threshold to create the binary trait further limits the application of PheNorm. The Multimodal automated phenotyping (MAP) algorithm42 can directly yield a final predicted probability leveraging the latent mixture models integrating the ICD features with the NLP features in an automated manner. The process involves fitting the Poisson mixture models to the two individual features and the combination of the two, respectively, while fitting the normal mixture models to the logarithm transformed features and feature combinations mentioned above. The disease of interest could be treated as a continuous trait43 for the subsequent association studies, leading to improved statistical power and accurate estimates. Moreover, the estimated prevalence generated in the unsupervised learning process can serve as the threshold for the binary disease classification and is more tailored to specific phenotypes as well as platforms.
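To make the mixture-model idea concrete, here is a deliberately simplified sketch: a two-component normal mixture fitted by expectation-maximization to log-transformed code counts. It is not the PheNorm or MAP implementation (MAP, as described above, combines several Poisson and normal mixtures over ICD and NLP features); the counts and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normal_mixture(values: Sequence[float],
                           n_iter: int = 200) -> Tuple[float, List[float]]:
    """EM for a two-component normal mixture.

    Returns the weight of the higher-mean ("case-like") component, read here as
    an estimated prevalence, and each observation's posterior probability of
    belonging to that component.
    """
    lo, hi = min(values), max(values)
    means = [lo, hi]
    sds = [max((hi - lo) / 4.0, 1e-3)] * 2
    prevalence = 0.5
    posteriors = [0.5] * len(values)
    for _ in range(n_iter):
        # E-step: posterior probability of the high component.
        posteriors = []
        for x in values:
            a = (1.0 - prevalence) * normal_pdf(x, means[0], sds[0])
            b = prevalence * normal_pdf(x, means[1], sds[1])
            posteriors.append(b / (a + b) if a + b > 0.0 else 0.5)
        n_high = sum(posteriors)
        n_low = len(values) - n_high
        if n_high < 1e-9 or n_low < 1e-9:
            break
        # M-step: weighted means, standard deviations, and mixing weight.
        means = [
            sum((1.0 - p) * x for p, x in zip(posteriors, values)) / n_low,
            sum(p * x for p, x in zip(posteriors, values)) / n_high,
        ]
        sds = [
            max(math.sqrt(sum((1.0 - p) * (x - means[0]) ** 2
                              for p, x in zip(posteriors, values)) / n_low), 1e-3),
            max(math.sqrt(sum(p * (x - means[1]) ** 2
                              for p, x in zip(posteriors, values)) / n_high), 1e-3),
        ]
        prevalence = n_high / len(values)
    return prevalence, posteriors


# Hypothetical per-patient counts of a disease-specific billing code.
icd_counts = [0, 0, 0, 1, 0, 1, 0, 0, 2, 0, 8, 12, 9, 15, 10]
log_counts = [math.log(1.0 + c) for c in icd_counts]
prevalence, posteriors = fit_two_normal_mixture(log_counts)
```

The posterior probabilities can be carried forward as a continuous trait, or the estimated prevalence can be used to choose how many of the highest-scoring patients to label as cases, mirroring the two uses described above.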

Emerging Statistical Considerations for PheWAS

EHR-based PheWAS face several statistical challenges. For example, the frequency of the target phenotype can be low, leading to unbalanced or extremely unbalanced case-control comparisons. Analysis of sparse data using standard logistic regression tests, such as the Wald test, in the presence of extremely unbalanced data can substantially inflate the type I error, particularly for genetic variants with low frequency44. Similarly, the score test tends to be more computationally efficient but cannot control the type I error when the case-control ratio is highly unbalanced. In addition to the inflation of the type I error, the nature of PheWAS demands a very large number of tests of association, requiring new methods to scale accordingly. The state-of-the-art strategy for sparse data is Firth's penalized maximum likelihood method45, which largely controls the bias introduced by rare events and works well under complete separation but is computationally inefficient.
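The behavior of Firth's penalty under complete separation can be illustrated with a small sketch. The code below fits a logistic model with an intercept and a single covariate by iterating on the Firth-modified score; it is a teaching example under those assumptions, not a substitute for established implementations, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def firth_logistic(x: Sequence[float], y: Sequence[int],
                   max_iter: int = 100, tol: float = 1e-8) -> List[float]:
    """Firth-penalized logistic regression with an intercept and one covariate.

    Returns [intercept, slope].
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information I = X^T W X for the design matrix with rows [1, x_i].
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Inverse of the 2x2 information matrix.
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det
        # Hat-matrix diagonals h_i = w_i * [1, x_i] I^{-1} [1, x_i]^T.
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]
        # Firth-modified score: ordinary residuals shifted by h_i * (0.5 - p_i).
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Scoring step using the unpenalized information matrix.
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return [b0, b1]


# Hypothetical, completely separated data: every carrier (x = 1) is a case and
# every non-carrier (x = 0) is a control, so ordinary maximum likelihood diverges.
carrier = [0, 0, 0, 0, 1, 1, 1, 1]
case = [0, 0, 0, 0, 1, 1, 1, 1]
intercept, slope = firth_logistic(carrier, case)
```

For this saturated two-group layout the penalized estimates are finite; the hat-matrix computation required at every iteration is one reason the method becomes costly when repeated across many variants and phenotypes.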

To improve computational efficiency, a single-variant test that can be scaled to the PheWAS level of tests for binary phenotypes was developed46. This single-variant test, known as SPA, leverages saddlepoint approximation47 in place of the normal approximation that is generally used in score tests and that performs poorly when case-control ratios are unbalanced. Combining saddlepoint approximation with normal approximation can further speed up the analysis without sacrificing control of the type I error, yielding fastSPA. To further account for sample relatedness embedded in many large association studies, a novel method known as Scalable and Accurate Implementation of GEneralized mixed model (SAIGE48) has been developed. SAIGE leverages logistic mixed models with an average information restricted maximum likelihood (AI-REML49) algorithm to deal with relatedness while incorporating SPA to account for unbalanced case-control ratios. Notably, SAIGE also employs a series of optimization methods in the estimation, outperforming other approaches, such as the generalized mixed model association test (GMMAT50), in controlling population structure and relatedness with improved computational efficiency.
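A hedged sketch of the saddlepoint idea follows. It approximates the upper-tail probability of a score statistic S = sum of g_i (Y_i - mu_i), where Y_i is Bernoulli with null probability mu_i, using the cumulant generating function of S. Simplifying assumptions: the null probabilities are treated as known (no covariate adjustment), no continuity correction is applied, and only a positive score is handled. This is not the SPA or SAIGE software, and the data are hypothetical.

```python
import math
from typing import Sequence


def normal_upper_tail(z: float) -> float:
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def spa_upper_tail(g: Sequence[float], mu: Sequence[float], y: Sequence[int]) -> float:
    """Saddlepoint approximation to P(S >= s_obs) for S = sum g_i (Y_i - mu_i)."""
    s = sum(gi * (yi - mi) for gi, yi, mi in zip(g, y, mu))
    if s <= 0.0:
        raise ValueError("this sketch only handles a positive observed score")

    def k0(t: float) -> float:  # cumulant generating function of S
        return sum(math.log(1.0 - mi + mi * math.exp(gi * t)) - t * gi * mi
                   for gi, mi in zip(g, mu))

    def k1(t: float) -> float:  # first derivative
        total = 0.0
        for gi, mi in zip(g, mu):
            e = mi * math.exp(gi * t)
            total += gi * e / (1.0 - mi + e) - gi * mi
        return total

    def k2(t: float) -> float:  # second derivative
        total = 0.0
        for gi, mi in zip(g, mu):
            e = mi * math.exp(gi * t)
            total += gi * gi * (1.0 - mi) * e / (1.0 - mi + e) ** 2
        return total

    # Solve k1(t) = s by bisection; k1 is increasing and k1(0) = 0.
    lo, hi = 0.0, 1.0
    while k1(hi) < s:
        hi *= 2.0
        if hi > 64.0:
            raise ValueError("score is at or beyond its attainable maximum")
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if k1(mid) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    w = math.sqrt(2.0 * (t * s - k0(t)))
    v = t * math.sqrt(k2(t))
    return normal_upper_tail(w + math.log(v / w) / w)


# Hypothetical unbalanced setting: 1% null case probability, 20 carriers of a
# rare variant among 2,000 people, and 3 of the carriers are cases.
g = [1] * 20 + [0] * 1980
mu = [0.01] * 2000
y = [1] * 3 + [0] * 17 + [1] * 17 + [0] * 1963

p_spa = spa_upper_tail(g, mu, y)

# Normal approximation to the same tail, for comparison.
s_obs = sum(gi * (yi - mi) for gi, yi, mi in zip(g, y, mu))
variance = sum(gi * gi * mi * (1.0 - mi) for gi, mi in zip(g, mu))
p_normal = normal_upper_tail(s_obs / math.sqrt(variance))

# Exact binomial tail, available here only because all carriers share one mu.
p_exact = 1.0 - sum(math.comb(20, k) * 0.01 ** k * 0.99 ** (20 - k) for k in range(3))
```

In this toy setting the normal approximation yields a tail probability many orders of magnitude smaller than the exact binomial tail, while the saddlepoint value stays within an order of magnitude of it, which illustrates why the normal approximation is problematic for rare variants and unbalanced phenotypes.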

Genetic variants with low frequency impose further challenges on PheWAS design and implementation. Current methods primarily focus on combining and/or binning rare variants into region- or gene-based units, and these bins then serve as the unit for burden tests or dispersion tests51. Multiple binning strategies have been proposed, but it is unclear which approaches are the most powerful or robust to inappropriate binning. In an attempt to address this particular challenge, the BioBin framework was developed to aggregate rare variants into multiple levels based on biological knowledge. BioBin couples binning with various association test methods such as SKAT52, and the resulting "Bio-KAT"53 can carry out multiple tests in parallel with the improved computational efficiency required of PheWAS.
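The binning step can be illustrated with a simple gene-level burden count. This sketch is not BioBin; the variant-to-gene assignments, allele frequencies, and the 1% frequency cutoff are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict


def gene_burden(genotypes: Dict[str, Dict[str, int]],
                variant_to_gene: Dict[str, str],
                maf: Dict[str, float],
                maf_threshold: float = 0.01) -> Dict[str, Dict[str, int]]:
    """Sum rare-allele counts per gene for each individual.

    Variants at or above the frequency threshold are left out of the bins.
    Returns {gene: {individual: burden}}.
    """
    burden: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for individual, calls in genotypes.items():
        for variant, allele_count in calls.items():
            if maf[variant] >= maf_threshold:
                continue  # common variant: not part of a rare-variant bin
            burden[variant_to_gene[variant]][individual] += allele_count
    return {gene: dict(per_person) for gene, per_person in burden.items()}


# Hypothetical inputs.
genotypes = {
    "ind1": {"v1": 1, "v2": 0, "v3": 2},
    "ind2": {"v1": 0, "v2": 1, "v3": 0},
}
variant_to_gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_B"}
maf = {"v1": 0.002, "v2": 0.004, "v3": 0.20}

bins = gene_burden(genotypes, variant_to_gene, maf)
```

Each gene-level burden would then replace the single variant as the predictor in the association test; changing the bin definition (for example, coding variants only) changes which variants contribute, which is the sensitivity to binning noted above.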

Integrating 'Omics in PheWAS

With the rapidly increasing volume of multi-omic data, whether from humans or model organisms (e.g., mice54 and zebrafish55), it is now feasible to incorporate these data into the systematic analytic stream of PheWAS. Human gene expression databases such as the Genotype-Tissue Expression database (GTEx) provide expression data and genome-wide data, enabling both the targeted (expression quantitative trait loci or eQTLs) and exhaustive (ePheWAS) tests of association to identify functional variants. PhenomeXcan56, a new resource developed in 2020, integrates ePheWAS with human phenotype data, creating an exhaustive database that integrates GWAS-identified variants with transcriptomic data. The integrative functional framework constructed in 2018 also demonstrates the potential for increasing the statistical power in PheWAS by systematically leveraging regulatory data, such as transcription factor-based motifs, promoters, enhancers, and expression quantitative trait loci57. In general, PheWAS fed by integrated 'omic data demand tools for systematic and efficient data processing, integration, and analysis. Open Targets Genetics58 serves as an example that aggregates data from both GWAS and functional genetics studies into a publicly accessible integrated system, offering interactive tools to prioritize genes associated with diseases.

Other Analysis Trends in PheWAS

As demonstrated in the emerging PheWAS literature, PheWAS is not limited to single variants/genes or structured phenotypic data (Figures 1 and 2). A major anticipated analysis trend is the phenome-wide PRS association study (PRS-PheWAS). PRS have been developed for many human diseases with extensive GWAS findings available, ranging from psychiatric disorders to cardiovascular disease to various common cancers. As an example of an applied PRS-PheWAS, the Michigan Genomics Initiative identified PRS associated traits as well as introduced exclusion PRS PheWAS, an association study that excludes those affected with diagnosis of cancer of interest17. With more PRS and phenotypic data available, tools and resources for comparing PRS construction methods and performing standardized PRS-PheWAS are now available. PRSweb59 is a visual catalog constructed from publicly available resources to provide comprehensive PRS-PheWAS results for several skin cancers. Cancer PRSweb60 is an online repository that enables PRS-PheWAS for 35 common cancer traits that also provides the results for secondary trait associations identified via the exclusion PRS-PheWAS. In addition to cancer, more PRS are being constructed for human diseases given their potential for clinical utility, making it possible to perform PRS-PheWAS for other disease traits, such as psychiatric disorders61.

In addition to the structured phenotypic data as previously discussed (e.g., ICD codes), many other kinds of phenotypes could also be incorporated into the PheWAS analytic framework. Neuroimaging data, which better capture brain dysfunctions and elucidate the underlying etiology of neurological disorders, is rapidly growing in volume. The complexity and unique properties of neuroimaging data when coupled with genomic data can offer more direct brain-genotype relationships compared with the less specific binary phenotypes for brain associated outcomes in GWAS. To capitalize on emerging availability of brain-level data linked to genome-wide data, Neuroimaging PheWAS62 offers an accessible web-based system that enables the analysis and query of these complex data.

Other notable trends include emerging statistical approaches applied in parallel to or post PheWAS. Many of these methods aim to distinguish and interpret the novel cross-phenotype associations initially identified in PheWAS3. Cross-trait linkage disequilibrium (LD) score (LDSC) regression63,64 is one popular approach that does not demand individual genotype data to quantify the genetic correlation, thus providing valuable insights into the shared genetic architecture for the traits of interest. In the future, more sophisticated functional analyses following PheWAS will be required to more fully understand and potentially translate results from these complex genome-phenome studies.

Conclusions

With the advances in data volume and computational power, PheWAS has matured as a study design valuable in biomedical sciences with the potential for translational significance relevant to clinical practice. Still, challenges and many outstanding questions remain for PheWAS. For both EHR-based and cohort or cross-sectional PheWAS, additional methods and tools for phenotyping that incorporate information from multiple resources with enhanced scalability and automation can improve the power of the application. From targeted to genome-wide, the statistical analyses demanded of PheWAS require increasingly sophisticated algorithms and the support of publicly available data and computational resources. Machine learning approaches have demonstrated their ability to facilitate PheWAS through improvements in the phenotyping process and statistical power, and machine learning will undoubtedly continue to contribute to the evolution of the PheWAS analytical pipeline. Overall, after these first ten years, PheWAS has proven to be a valuable approach in human genetics and genomics and is now a staple in a suite of study designs and statistical approaches deemed useful and oftentimes necessary in the understanding of complex human disease and traits.

Outstanding questions.

How can phenotyping be more detailed or accurate yet still automated?

Are unsupervised approaches to phenotyping meaningful in PheWAS?

How can multiple data types inform PheWAS phenotyping or interpretation?

How can the balance between multiple testing and type I error be improved?

What are the optimal statistical approaches for PheWAS beyond a single genetic variant at a time?

How can PheWAS output be mined and visualized for better comprehension and interpretation?

How can computational resources be improved to enable comprehensive genome-wide PheWAS in today’s setting of massive biobanks linked to clinical and epidemiologic data?

What are the best in silico, in vitro, and in vivo experiments for downstream biological corroboration and confirmation of statistically identified cross-phenotype associations from PheWAS?

Highlights.

Pleiotropy, the concept that a gene or genetic variant affects more than one phenotype or trait, is at least a century old. In contrast, the phenome-wide association study (PheWAS), an approach used to identify cross-phenotype associations, was introduced only in the last decade. Still relatively young, PheWAS has rapidly matured into a widely used study design and analytical approach that can still benefit from improvements in its pipeline as genome-wide datasets expand in their breadth and depth, challenging the computational and statistical limits of today’s PheWAS.

Acknowledgements

This work was funded by 1R01 GM126249-03. This publication was also made possible by the Clinical and Translational Science Collaborative of Cleveland, UL1TR002548 from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Hackinger S, Zeggini E. Statistical methods to detect pleiotropy in human complex traits. Open Biol. 2017;7(11):170125. doi: 10.1098/rsob.170125
  • 2. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26(9):1205–1210. doi: 10.1093/bioinformatics/btq126
  • 3. Bush WS, Oetjens MT, Crawford DC. Unravelling the human genome-phenome relationship using phenome-wide association studies. Nat Rev Genet. 2016;17(3):129–145. doi: 10.1038/nrg.2015.36
  • 4. Sudlow C, Gallacher J, Allen N, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779. doi: 10.1371/journal.pmed.1001779
  • 5. Gaziano JM, Concato J, Brophy M, et al. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214–223. doi: 10.1016/j.jclinepi.2015.09.016
  • 6. Hunter-Zinck H, Shi Y, Li M, et al. Genotyping Array Design and Data Quality Control in the Million Veteran Program. Am J Hum Genet. 2020;106(4):535–548. doi: 10.1016/j.ajhg.2020.03.004
  • 7. Conroy M, Sellors J, Effingham M, et al. The advantages of UK Biobank’s open-access strategy for health research. J Intern Med. 2019;286(4):389–397. doi: 10.1111/joim.12955
  • 8. McInnes G, Tanigawa Y, DeBoever C, et al. Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics. 2019;35(14):2495–2497. doi: 10.1093/bioinformatics/bty999
  • 9. Verma A, Lucas A, Verma SS, et al. PheWAS and Beyond: The Landscape of Associations with Medical Diagnoses and Clinical Measures across 38,662 Individuals from Geisinger. Am J Hum Genet. 2018;102(4):592–608. doi: 10.1016/j.ajhg.2018.02.017
  • 10. Lumsden AL, Mulugeta A, Zhou A, Hyppönen E. Apolipoprotein E (APOE) genotype-associated disease risks: a phenome-wide, registry-based, case-control study utilising the UK Biobank. EBioMedicine. 2020;59:102954. doi: 10.1016/j.ebiom.2020.102954
  • 11. Yamamoto F, Clausen H, White T, Marken J, Hakomori S. Molecular genetic basis of the histo-blood group ABO system. Nature. 1990;345(6272):229–233. doi: 10.1038/345229a0
  • 12. Calafell F, Roubinet F, Ramírez-Soriano A, Saitou N, Bertranpetit J, Blancher A. Evolutionary dynamics of the human ABO gene. Hum Genet. 2008;124(2):123–135. doi: 10.1007/s00439-008-0530-8
  • 13. Vasan SK, Rostgaard K, Majeed A, et al. ABO Blood Group and Risk of Thromboembolic and Arterial Disease: A Study of 1.5 Million Blood Donors. Circulation. 2016;133(15):1449–1457; discussion 1457. doi: 10.1161/CIRCULATIONAHA.115.017563
  • 14. Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, et al. Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet. 2009;41(9):986–990. doi: 10.1038/ng.429
  • 15. Severe Covid-19 GWAS Group, Ellinghaus D, Degenhardt F, et al. Genomewide Association Study of Severe Covid-19 with Respiratory Failure. N Engl J Med. 2020;383(16):1522–1534. doi: 10.1056/NEJMoa2020283
  • 16. Li S, Schooling CM. A phenome-wide association study of ABO blood groups. BMC Med. 2020;18(1):334. doi: 10.1186/s12916-020-01795-4
  • 17. Fritsche LG, Gruber SB, Wu Z, et al. Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. Am J Hum Genet. 2018;102(6):1048–1061. doi: 10.1016/j.ajhg.2018.04.001
  • 18. Chen VL, Chen Y, Du X, Handelman SK, Speliotes EK. Genetic variants that associate with cirrhosis have pleiotropic effects on human traits. Liver Int. 2020;40(2). doi: 10.1111/liv.14321
  • 19. Tcheandjieu C, Aguirre M, Gustafsson S, et al. A phenome-wide association study of 26 mendelian genes reveals phenotypic expressivity of common and rare variants within the general population. PLoS Genet. 2020;16(11):e1008802. doi: 10.1371/journal.pgen.1008802
  • 20. Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res. 2019;47(D1):D1038–D1043. doi: 10.1093/nar/gky1151
  • 21. Groza T, Köhler S, Moldenhauer D, et al. The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease. Am J Hum Genet. 2015;97(1):111–124. doi: 10.1016/j.ajhg.2015.05.020
  • 22. Igo RP, Kinzy TG, Cooke Bailey JN. Genetic Risk Scores. Curr Protoc Hum Genet. 2019;104(1):e95. doi: 10.1002/cphg.95
  • 23. Hyppönen E, Mulugeta A, Zhou A, Santhanakrishnan VK. A data-driven approach for studying the role of body mass in multiple diseases: a phenome-wide registry-based case-control study in the UK Biobank. Lancet Digit Health. 2019;1(3):e116–e126. doi: 10.1016/S2589-7500(19)30028-7
  • 24. Heilbron K, Mozaffari SV, Vacic V, et al. Advancing drug discovery using the power of the human genome. J Pathol. 2021;254(4):418–429. doi: 10.1002/path.5664
  • 25. Diogo D, Tian C, Franklin CS, et al. Phenome-wide association studies across large population cohorts support drug target validation. Nat Commun. 2018;9(1):4285. doi: 10.1038/s41467-018-06540-3
  • 26. Duffy Á, Verbanck M, Dobbyn A, et al. Tissue-specific genetic features inform prediction of drug side effects in clinical trials. Sci Adv. 2020;6(37):eabb6242. doi: 10.1126/sciadv.abb6242
  • 27. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genet Epidemiol. 2011;35(5):410–422. doi: 10.1002/gepi.20589
  • 28. Pendergrass SA, Brown-Gentry K, Dudek S, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet. 2013;9(1):e1003087. doi: 10.1371/journal.pgen.1003087
  • 29. Hall MA, Verma A, Brown-Gentry KD, et al. Detection of pleiotropy through a Phenome-wide association study (PheWAS) of epidemiologic data as part of the Environmental Architecture for Genes Linked to Environment (EAGLE) study. PLoS Genet. 2014;10(12):e1004678. doi: 10.1371/journal.pgen.1004678
  • 30. Pendergrass SA, Buyske S, Jeff JM, et al. A phenome-wide association study (PheWAS) in the Population Architecture using Genomics and Epidemiology (PAGE) study reveals potential pleiotropy in African Americans. PLoS One. 2019;14(12). doi: 10.1371/journal.pone.0226771
  • 31. Pendergrass SA, Crawford DC. Using Electronic Health Records To Generate Phenotypes For Research. Curr Protoc Hum Genet. 2019;100(1):e80. doi: 10.1002/cphg.80
  • 32. Kohane IS. Using electronic health records to drive discovery in disease genomics. Nat Rev Genet. 2011;12(6):417–428. doi: 10.1038/nrg2999
  • 33. Wei WQ, Bastarache LA, Carroll RJ, et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS One. 2017;12(7):e0175508. doi: 10.1371/journal.pone.0175508
  • 34. Carroll RJ, Bastarache L, Denny JC. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30(16). doi: 10.1093/bioinformatics/btu197
  • 35. Steiner C, Elixhauser A, Schnaier J. The Healthcare Cost and Utilization Project: An overview. Eff Clin Pract. 2002;5:143–151.
  • 36. Steindel SJ. International classification of diseases, 10th edition, clinical modification and procedure coding system: descriptive overview of the next generation HIPAA code sets. J Am Med Inform Assoc. 2010;17(3):274–282. doi: 10.1136/jamia.2009.001230
  • 37. Neuraz A, Chouchana L, Malamut G, et al. Phenome-wide association studies on a quantitative trait: application to TPMT enzyme activity and thiopurine therapy in pharmacogenomics. PLoS Comput Biol. 2013;9(12):e1003405. doi: 10.1371/journal.pcbi.1003405
  • 38. Wu P, Gifford A, Meng X, et al. Mapping ICD-10 and ICD-10-CM Codes to Phecodes: Workflow Development and Initial Evaluation. JMIR Med Inform. 2019;7(4):e14325. doi: 10.2196/14325
  • 39. Zheng NS, Feng Q, Kerchberger VE, et al. PheMap: a multi-resource knowledge base for high-throughput phenotyping within electronic health records. J Am Med Inform Assoc. 2020;27(11):1675–1687. doi: 10.1093/jamia/ocaa104
  • 40. Yu S, Ma Y, Gronsbell J, et al. Enabling phenotypic big data with PheNorm. J Am Med Inform Assoc. 2018;25(1):54–60. doi: 10.1093/jamia/ocx111
  • 41. Yu S, Chakrabortty A, Liao KP, et al. Surrogate-assisted feature extraction for high-throughput phenotyping. J Am Med Inform Assoc. 2017;24(e1):e143–e149. doi: 10.1093/jamia/ocw135
  • 42. Liao KP, Sun J, Cai TA, et al. High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. J Am Med Inform Assoc. 2019;26(11):1255–1262. doi: 10.1093/jamia/ocz066
  • 43. Sinnott JA, Dai W, Liao KP, et al. Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records. Hum Genet. 2014;133(11):1369–1382. doi: 10.1007/s00439-014-1466-9
  • 44. Ma C, Blackwell T, Boehnke M, Scott LJ, GoT2D investigators. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet Epidemiol. 2013;37(6):539–550. doi: 10.1002/gepi.21742
  • 45. Firth D. Bias Reduction of Maximum Likelihood Estimates. Biometrika. 1993;80(1):27–38. doi: 10.2307/2336755
  • 46. Dey R, Schmidt EM, Abecasis GR, Lee S. A Fast and Accurate Algorithm to Test for Binary Phenotypes and Its Application to PheWAS. Am J Hum Genet. 2017;101(1):37–49. doi: 10.1016/j.ajhg.2017.05.014
  • 47. Daniels HE. Saddlepoint Approximations in Statistics. Ann Math Stat. 1954;25(4):631–650.
  • 48. Zhou W, Nielsen JB, Fritsche LG, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50(9):1335–1341. doi: 10.1038/s41588-018-0184-y
  • 49. Gilmour AR, Thompson R, Cullis BR. Average Information REML: An Efficient Algorithm for Variance Parameter Estimation in Linear Mixed Models. Biometrics. 1995;51(4):1440–1450. doi: 10.2307/2533274
  • 50. Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. Accessed November 12, 2021. https://www.sciencedirect.com/science/article/pii/S000292971600063X
  • 51. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014
  • 52. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. Am J Hum Genet. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029
  • 53. Basile AO, Wallace JR, Peissig P, McCarty CA, Brilliant M, Ritchie MD. Knowledge driven binning and PheWAS analysis in Marshfield Personalized Medicine Research Project using BioBin. In: Biocomputing 2016. World Scientific; 2015:249–260. doi: 10.1142/9789814749411_0024
  • 54. Li H, Wang X, Rukina D, et al. An Integrated Systems Genetics and Omics Toolkit to Probe Gene Function. Cell Syst. 2018;6(1):90–102.e4. doi: 10.1016/j.cels.2017.10.016
  • 55. Unlu G, Qi X, Gamazon ER, et al. Phenome-based approach identifies RIC1-linked Mendelian syndrome through zebrafish models, biobank associations and clinical studies. Nat Med. 2020;26(1):98–109. doi: 10.1038/s41591-019-0705-y
  • 56. Pividori M, Rajagopal PS, Barbeira A, et al. PhenomeXcan: Mapping the genome to the phenome through the transcriptome. Sci Adv. 2020;6(37):eaba2083. doi: 10.1126/sciadv.aba2083
  • 57. Zhao J, Cheng F, Jia P, Cox N, Denny JC, Zhao Z. An integrative functional genomics framework for effective identification of novel regulatory variants in genome-phenome studies. Genome Med. 2018;10(1):7. doi: 10.1186/s13073-018-0513-x
  • 58. Ghoussaini M, Mountjoy E, Carmona M, et al. Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 2021;49(D1):D1311–D1320. doi: 10.1093/nar/gkaa840
  • 59. Fritsche LG, Beesley LJ, VandeHaar P, et al. Exploring various polygenic risk scores for skin cancer in the phenomes of the Michigan genomics initiative and the UK Biobank with a visual catalog: PRSWeb. PLoS Genet. 2019;15(6):e1008202. doi: 10.1371/journal.pgen.1008202
  • 60. Fritsche LG, Patil S, Beesley LJ, et al. Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks. Am J Hum Genet. 2020;107(5):815–836. doi: 10.1016/j.ajhg.2020.08.025
  • 61. Leppert B, Millard LAC, Riglin L, et al. A cross-disorder PRS-pheWAS of 5 major psychiatric disorders in UK Biobank. PLoS Genet. 2020;16(5). doi: 10.1371/journal.pgen.1008185
  • 62. Zhao L, Batta I, Matloff W, O’Driscoll C, Hobel S, Toga AW. Neuroimaging PheWAS (Phenome-Wide Association Study): A Free Cloud-Computing Platform for Big-Data, Brain-Wide Imaging Association Studies. Neuroinformatics. 2021;19(2):285–303. doi: 10.1007/s12021-020-09486-4
  • 63. Bulik-Sullivan BK, Loh PR, Finucane HK, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–295. doi: 10.1038/ng.3211
  • 64. Bulik-Sullivan B, Finucane HK, Anttila V, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015;47(11):1236–1241. doi: 10.1038/ng.3406
  • 65. Kirby JC, Speltz P, Rasmussen LV, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc. 2016;23(6):1046–1052. doi: 10.1093/jamia/ocv202

RESOURCES