To the editor.
Despite great strides in the development and wide acceptance of standards for exchanging structured information about genomic variants, progress in standards for computational phenotype analysis for translational genomics has lagged behind. Phenotypic features (signs, symptoms, laboratory and imaging findings, results of physiological tests, etc.) are of high clinical importance, yet exchanging them in conjunction with genomic variation is often overlooked or even neglected. In the clinical domain, substantial work has been dedicated to the development of computational phenotypes.1 Traditionally, these approaches have largely relied on rule-based methods and large sources of clinical data to identify cohorts of patients with or without a specific disease.2–5 However, they were not developed to enable deep phenotyping of abnormalities, to facilitate computational analysis of interpatient phenotypic similarity, or to enable computational decision support. To address this gap, the Global Alliance for Genomics and Health6 (GA4GH) has developed the Phenopacket schema, which supports exchange of computable longitudinal case-level phenotypic information for diagnosis of and research on all types of disease, including Mendelian and complex genetic diseases, cancer, and infectious diseases. A Phenopacket characterizes an individual person or biosample, linking that individual to detailed phenotypic descriptions, genetic information, diagnoses, and treatments (Fig. 1). The Phenopacket software is available at https://github.com/phenopackets/.
Figure 1. Phenopacket schema overview.

The GA4GH Phenopacket schema consists of several optional elements, each of which contains information about a certain topic, such as phenotype, variant, or pedigree. An element can contain other elements, which allows a hierarchical representation of data. For instance, a Phenopacket contains elements of type Individual, PhenotypicFeature, Biosample, and so on. The elements can therefore be regarded as building blocks that are combined to create larger structures. Colors represent the major themes of elements within the schema.
The ‘PhenotypicFeature’ is the central element of the Phenopacket schema. A ‘PhenotypicFeature’ can be used to describe any phenotypic characteristic (often, but not necessarily, clinical abnormalities), including signs and symptoms, laboratory findings, histopathology findings, imaging, and electrophysiological results, along with modifier and qualifier concepts. Each phenotypic feature is described using an ontology term. Although the Phenopacket schema does not mandate which ontology to use, it provides recommendations, such as the Human Phenotype Ontology7 (HPO) for rare diseases and the National Cancer Institute Thesaurus (NCIT) for transmission of information about a cancer specimen (e.g., pathological staging or more detailed information about histology or tumor markers).8 Within the schema, it is possible to indicate whether an abnormality was excluded during the diagnostic process (e.g., whether a morphological cardiac defect was excluded by echocardiography) or to use other optional HPO terms to denote the severity, frequency (e.g., number of occurrences of seizures per week), laterality (e.g., unilateral), or other pattern of a phenotypic feature in the patient being described. Finally, the onset (and if applicable the resolution) of specific features can be indicated.
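The description above can be made concrete with a minimal sketch of two phenotypic features, one observed (with severity and onset) and one explicitly excluded. This is an illustration rather than the reference implementation: the helper functions are hypothetical, the field names (`type`, `excluded`, `severity`, `modifiers`, `onset`, `iso8601duration`) follow our reading of the JSON form described in the online documentation, and the ontology identifiers are illustrative and should be verified against the current HPO release.

```python
import json
from typing import Iterable, Optional


def ontology_class(term_id: str, label: str) -> dict:
    """Build an ontology-term reference: an identifier plus a readable label."""
    return {"id": term_id, "label": label}


def phenotypic_feature(
    term: dict,
    excluded: bool = False,
    severity: Optional[dict] = None,
    modifiers: Iterable[dict] = (),
    onset_age: Optional[str] = None,
) -> dict:
    """Assemble a PhenotypicFeature-like dict (hypothetical helper).

    Field names are assumptions based on the schema documentation.
    """
    feature = {"type": term}
    if excluded:
        # The abnormality was looked for and ruled out.
        feature["excluded"] = True
    if severity is not None:
        feature["severity"] = severity
    modifier_list = list(modifiers)
    if modifier_list:
        feature["modifiers"] = modifier_list
    if onset_age is not None:
        # Age at onset expressed as an ISO 8601 duration, e.g. "P2Y" for 2 years.
        feature["onset"] = {"age": {"iso8601duration": onset_age}}
    return feature


# Observed feature with severity and age of onset (illustrative term IDs).
seizure = phenotypic_feature(
    ontology_class("HP:0001250", "Seizure"),
    severity=ontology_class("HP:0012828", "Severe"),
    onset_age="P2Y",
)

# Feature excluded during the diagnostic process, e.g. by echocardiography.
excluded_defect = phenotypic_feature(
    ontology_class("HP:0001627", "Abnormal heart morphology"),
    excluded=True,
)

print(json.dumps([seizure, excluded_defect], indent=2))
```

Omitting `excluded` for observed features keeps the serialized record compact; a consumer would then treat a missing flag as "observed".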
Other key elements of the schema are ‘Measurement’, which is used to capture quantitative (i.e., numerical), ordinal (e.g., absent/present), or categorical measurements; ‘Biosample’, a description of biological material obtained from the individual represented in the Phenopacket and used for phenotypic, genotypic, or other -omics analysis; and ‘MedicalAction’, a hierarchical representation of medical actions, including medications, procedures, and other actions taken for clinical management. The ‘Treatment’ element is a subelement of ‘MedicalAction’ and represents administration of a pharmaceutical agent, broadly defined to include prescription and over-the-counter medicines, vaccines, and other therapeutic agents, such as monoclonal antibodies or chimeric antigen receptor (CAR) T-cell therapy.
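As a rough sketch of how the elements just described might look in practice, the following builds a quantitative measurement, a categorical measurement, and a medical action wrapping a treatment. The helper functions are hypothetical, the field names (`assay`, `value`, `quantity`, `ontologyClass`, `treatment`, `agent`) reflect our reading of the online documentation and should be checked against it, and every `EXAMPLE:` identifier is a placeholder rather than a real ontology term.

```python
from typing import Optional


def ontology_class(term_id: str, label: str) -> dict:
    """Build an ontology-term reference: an identifier plus a readable label."""
    return {"id": term_id, "label": label}


def measurement(
    assay: dict,
    quantity: Optional[dict] = None,
    category: Optional[dict] = None,
) -> dict:
    """Assemble a Measurement-like dict holding exactly one kind of value.

    `quantity` carries a number with a unit; `category` carries an ontology
    term for ordinal or categorical results. Field names are assumptions.
    """
    if (quantity is None) == (category is None):
        raise ValueError("provide exactly one of quantity or category")
    if quantity is not None:
        value = {"quantity": quantity}
    else:
        value = {"ontologyClass": category}
    return {"assay": assay, "value": value}


def treatment_action(agent: dict) -> dict:
    """Wrap a pharmaceutical agent as a Treatment inside a MedicalAction."""
    return {"treatment": {"agent": agent}}


# Quantitative measurement: numeric value plus unit (placeholder identifiers).
platelet_count = measurement(
    ontology_class("EXAMPLE:assay-platelets", "Platelet count in blood"),
    quantity={
        "unit": ontology_class("EXAMPLE:unit-per-uL", "cells per microliter"),
        "value": 24000.0,
    },
)

# Categorical (absent/present) measurement expressed as an ontology term.
nitrite_test = measurement(
    ontology_class("EXAMPLE:assay-nitrite", "Nitrite in urine"),
    category=ontology_class("EXAMPLE:present", "Present"),
)

# Medical action representing administration of a therapeutic agent.
action = treatment_action(ontology_class("EXAMPLE:agent-1", "Example agent"))
```

Rejecting a measurement that has both or neither kind of value mirrors the either/or nature of the value described in the text.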
The ‘Interpretation’ element specifies interpretations of genomic findings. This element leverages complementary resources developed by the GA4GH Genomic Knowledge Standards Work Stream: the Variation Representation Specification (VRS) and VRS Added Tools for Interoperable Loquacious Exchange (VRSATILE).6 Further information on this and other elements is available in the online documentation (https://phenopacket-schema.readthedocs.io/).
The Phenopacket schema was designed to support several use cases. Phenotype-driven rare-disease genomic diagnostic software has previously used bespoke formats to represent phenotypic data (generally in the form of a list of HPO terms) and information about the pedigree. The Phenopacket provides a standard input format for these tools that will simplify computational analysis pipelines, and the additional clinical information will enable pipelines and algorithms to leverage other data, such as age of onset and excluded abnormalities. A number of databases have adopted the standard to represent the clinical data of individuals in the context of rare-disease genomics (European Genome-phenome Archive), registries (European Joint Programme on Rare Diseases and Western Australian Register of Developmental Anomalies), biosamples (EMBL-EBI BioSamples database), and biobanks (the Japanese Agency for Medical Research and Development Tohoku Medical Megabank project and National Center Biobank Network). In addition, Phenopackets can be used to store a computational representation of a case report, and we envision that authors could submit representations of the patients as Phenopackets to accompany published case reports and descriptions of genotype-phenotype correlations.
Beyond these use cases, the Phenopacket schema is designed to interact with electronic health record (EHR) data. A long-standing challenge has been that computational phenotype analysis is poorly connected with the EHR and that EHRs are not standardized across countries or even across institutions within a country. To enable precision medicine, standards and tools are needed to improve machine-readable phenotypic characterization of patients beyond current standard EHR billing and clinical encounter data capture. To address this, we have implemented a Fast Healthcare Interoperability Resources (FHIR) implementation guide for representing a Phenopacket within EHR systems (Supplementary Table 1).
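To illustrate the standard-input use case for phenotype-driven diagnostic tools, the sketch below reads a small Phenopacket-like JSON document and separates observed from excluded HPO terms, the two lists such tools typically consume. The function name `split_features` is hypothetical, the JSON keys (`phenotypicFeatures`, `type`, `excluded`) follow our reading of the online documentation, and the sample record and its identifiers are invented for illustration.

```python
import json
from typing import List, Tuple

# Invented example record; keys are assumed from the schema documentation.
EXAMPLE_JSON = """
{
  "id": "example-case-1",
  "subject": {"id": "patient-1"},
  "phenotypicFeatures": [
    {"type": {"id": "HP:0001250", "label": "Seizure"}},
    {"type": {"id": "HP:0001627", "label": "Abnormal heart morphology"},
     "excluded": true}
  ]
}
"""


def split_features(phenopacket: dict) -> Tuple[List[str], List[str]]:
    """Return (observed, excluded) term identifiers from a Phenopacket dict.

    A feature without an "excluded" flag is treated as observed.
    """
    observed: List[str] = []
    excluded: List[str] = []
    for feature in phenopacket.get("phenotypicFeatures", []):
        term_id = feature["type"]["id"]
        if feature.get("excluded", False):
            excluded.append(term_id)
        else:
            observed.append(term_id)
    return observed, excluded


packet = json.loads(EXAMPLE_JSON)
observed_terms, excluded_terms = split_features(packet)
print("observed:", observed_terms)
print("excluded:", excluded_terms)
```

A bespoke list of HPO terms carries only the first of these lists; keeping the excluded terms alongside it is what lets an algorithm penalize candidate diagnoses whose hallmark features were ruled out.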
Requirements and specifications for the standard were established through a community effort under the auspices of the GA4GH; version 1.0 of the GA4GH standard was released in 2019 to elicit feedback from the community. Version 2.0, which is described here, was developed on the basis of this feedback and expanded the data model to include a better representation of temporality, medical actions, and quantitative measures. The Phenopacket schema (version 2.0) was formally reviewed and approved as a GA4GH standard6 in 2021. It is designed to be interoperable with other relevant standards, including the traditional PED (pedigree format) file as well as the GA4GH pedigree standard, the GA4GH Beacon,9 and the GA4GH Variation Representation Specification. The GA4GH has committed to coordinating its activities and future roadmaps with those of other standards development organizations, including the International Organization for Standardization (ISO) Technical Subcommittee for Genomics Informatics (ISO/TC215/SC1) and HL7 Clinical Genomics. Consequently, a FHIR implementation guide for Phenopacket interoperability has been developed, and the Phenopacket schema is at the approval stage of the ISO certification process (Supplementary Table 2).
The variant call format (VCF) standard for storing genotyping data allowed a wide range of research groups to write software for analyzing such data.10 The GA4GH Phenopacket schema aspires to be similarly transformative in the landscape of genome analysis using phenotype data. Phenotypic data come from multiple providers, including patients and clinicians, through a variety of mechanisms, including clinical notes and electronic health records, interfaces such as FHIR, app-based entry, and mobile devices. The Phenopacket schema acts as a common model that can capture data from many sources with a unified software representation and in turn can be used by multiple receivers of the phenotypic information, including journals, databases, registries, and clinical laboratories. We anticipate that the Phenopacket schema will encourage the development of a collection of software for the analysis of genomic data in the context of clinical information that will accelerate innovation and discovery. Genomic data will become ever more important in translational research and clinical care in the coming years and decades. The Phenopacket schema represents a standard for capturing clinical data and integrating them with genomic data that will help to obtain the maximal utility of these data for understanding disease and developing precision medicine approaches to therapy.
Supplementary Material
Acknowledgements
The authors gratefully acknowledge insight and feedback from Marian H. Adly, Pier Luigi Buttigieg, Janine Lewis, Manuel Posada de la Paz, and Maria Taboada. This work was supported by 7RM1HG010860-02 (NHGRI). Additional funding was as follows. PNR was supported by NLM contract #75N97019P00280, NIH NHGRI RM1HG010860, NIH OD R24OD011883, NIH NICHD 1R01HD103805-01. HH was supported by NIH OD R24OD011883. GIS was supported by ELIXIR, the research infrastructure for life-science data. CGC was supported by NIH NCATS U24TR002306. KCL was supported by NIH OD 5UM1OD023221. MB was supported by BioMedIT Network project of Swiss Institute of Bioinformatics (SIB) and Swiss Personalized Health Network (SPHN). AHW was supported by NIH NHGRI K99HG010157, NIH NHGRI R00HG010157. CJM, MAH, MCM-T, JAM, DD were supported by NIH NHGRI RM1HG010860, NIH OD R24OD011883. AM-J was supported by Australian Genomics. Australian Genomics is supported by the National Health and Medical Research Council (GNT1113531). DS, JOBJ were supported by NIH NHGRI RM1HG010860, NIH OD R24OD011883, NIH NICHD 1R01HD103805-01. MD was supported by NIH NHGRI U54HG004028, NIH NHGRI 5U01HG008473-03, NIH NCATS OT2TR003434-01S1U54HG008033-01. GSB was supported by Roy Hill Community Foundation, Angela Wright Bennett Foundation, McCusker Charitable Foundation, Borlaug Foundation, Stan Perron Charitable Foundation. LB was supported by NIH NHGRI U41HG006834 (Clinical Genome Resource). MC was supported by EMBL-EBI Core Funds and Wellcome Trust GA4GH award number 201535/Z/16/Z. AH was supported by NIH NHGRI 1U41HG006627, NIH NHGRI 1U54HG006542, NIH NHGRI 1RM1HG010860. PNS was supported by The Alan Turing Institute. NLH was supported by NIH NHGRI RM1HG010860, NIH OD R24OD011883, U.S. Department of Energy Contract DE-AC02-05CH11231. NP was supported by Moorfields Eye Charity. NQ-R was supported by EU Horizon 2020 research and innovation programme grant agreement 825575 (EJP-RD). 
OE was supported by NIH grants UL1TR002384, R01CA194547, and P01CA214274; LLS SCOR grants 180078-01 and 7021-20; and Starr Cancer Consortium Grant I11-0027. HL was supported by CIHR Foundation Grant on Precision Health for Neuromuscular Diseases FDN-167281. RT was supported by CIHR postdoctoral fellowship award MFE-171275. LDS was supported by Genome Canada and NIH NHGRI U24HG011025. SO was supported by AMED. DP, LM, AP, SB, MR, RK were supported by EU Horizon 2020 research and innovation programme grant agreements 779257 (Solve-RD) and 825575 (EJP-RD). RRF was supported by NLM contract #75N97019P00280.
Competing interests
SK is an employee of Ada Health GmbH. NP is a director of Phenopolis Ltd. OE is supported by Janssen, Johnson and Johnson, Volastra Therapeutics, AstraZeneca and Eli Lilly research grants. He is scientific advisor and equity holder in Freenome, Owkin, Volastra Therapeutics and One Three Biotech. ARM is an employee of Philips Research North America. OJB is an employee of PhenoTips. MA is an editor employed by Wiley. AS is an employee of Lifebit Biotech Ltd.
The GA4GH Phenopacket Modeling Consortium
Consortia authors
Myles Axton49, Lawrence Babb50, Cornelius F. Boerkoel51, Bimal P. Chaudhari44,45, Hui-Lin Chin52,53, Michel Dumontier54, Nour Gazzaz53,55, David P. Hansen29, Harry Hochheiser56, Veronica A. Kinsler57,58, Hanns Lochmüller59,60,61, Alexander R. Mankovich62, Gary I. Saunders63, Panagiotis I. Sergouniotis64, Rachel Thompson59 & Andreas Zankl65,66,67
49Wiley, Inc, Research, Hoboken 07030, NJ, USA
50Broad Institute of MIT and Harvard, Cambridge 02142, MA, USA
51University of British Columbia, Medical Genetics, Vancouver V6H3N1, CA
52Khoo Teck Puat-National University Children’s Medical Institute, National University Hospital, Department of Paediatrics, Singapore 119074, Singapore
53Women’s Hospital of British Columbia, Provincial Medical Genetics, Vancouver V6H3N1, CA
54Maastricht University, Institute of Data Science, Maastricht 6229 EN, NL
55King Abdulaziz University Hospital, Department of Pediatrics, Jeddah, SA
56University of Pittsburgh, Biomedical Informatics, Pittsburgh 15206, PA, USA
57Great Ormond St Hospital for Children, Paediatric Dermatology, London WC1N 3JH, UK
58Francis Crick Institute, Mosaicism and Precision Medicine Laboratory, London NW1 1AT, UK
59Children’s Hospital of Eastern Ontario Research Institute, Molecular Biomedicine, Ottawa K1H 8L1, CA
60University of Ottawa, Brain and Mind Research Institute, Department of Cellular and Molecular Medicine, Ottawa K1H 8M5, CA
61The Ottawa Hospital, Neuromuscular Centre, Ottawa K1Y 4E9, CA
62Philips Research North America, Precision Diagnosis & Image-Guided Therapy, Cambridge 02141, MA, USA
63ELIXIR, ELIXIR Hub, Cambridge CB10 1SD, UK
64University of Manchester, Division of Evolution, Infection and Genomics, Manchester M13 9PT, UK
65The University of Sydney, Faculty of Medicine and Health, Sydney 2006, AU
66The Children’s Hospital at Westmead, Department of Clinical Genetics, Westmead 2145, AU
67Garvan Institute of Medical Research, Kinghorn Centre for Clinical Genomics and Bone Division, Darlinghurst 2010, AU
References
- 1. Richesson R & Smerek M Electronic health records-based phenotyping. Rethinking clinical trials: A living textbook of pragmatic clinical trials 2016, (2014).
- 2. Hripcsak G & Albers DJ Next-generation phenotyping of electronic health records. J. Am. Med. Inform. Assoc 20, 117–121 (2013).
- 3. Shivade C et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. J. Am. Med. Inform. Assoc 21, 221–230 (2014).
- 4. Wei W-Q & Denny JC Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 7, 41 (2015).
- 5. Richesson RL, Sun J, Pathak J, Kho AN & Denny JC Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Artif. Intell. Med 71, 57–61 (2016).
- 6. Rehm HL et al. GA4GH: International policies and standards for data sharing across genomic research and healthcare. Cell Genom 1, (2021).
- 7. Köhler S et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).
- 8. Sioutos N et al. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J. Biomed. Inform 40, 30–43 (2007).
- 9. Fiume M et al. Federated discovery and sharing of genomic data using Beacons. Nat. Biotechnol 37, 220–224 (2019).
- 10. Danecek P et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
