AMIA Summits on Translational Science Proceedings. 2020 May 30;2020:750–759.

Developing a FHIR-based Framework for Phenome Wide Association Studies: A Case Study with A Pan-Cancer Cohort

Nansu Zong 1, Deepak K Sharma 1, Yue Yu 1, Jan B Egan 2, Jaime I Davila 1, Chen Wang 1*, Guoqian Jiang 1*
PMCID: PMC7233075  PMID: 32477698

Abstract

Phenome-Wide Association Studies (PheWAS) enable phenome-wide scans to discover novel associations between genotypes and clinical phenotypes by linking available genomic reports with large-scale Electronic Health Record (EHR) data. Data heterogeneity across EHR systems and genetic reports has been a critical challenge that hinders meaningful validation. To address this, we propose a FHIR-based framework to model a PheWAS study in a standard manner. We developed a FHIR-based data model profile to enable the standard representation of the data elements from genetic reports and EHR data that are used in a PheWAS study. As a proof of concept, we implemented the proposed method using a cohort of 1,595 pan-cancer patients with genetic reports from Foundation Medicine as well as the corresponding lab tests and diagnoses from Mayo EHRs. A PheWAS study was conducted and 81 significant genotype-phenotype associations were identified, of which 36 significant associations for cancers were validated based on a literature review.

Introduction

Genome-wide association studies (GWAS) perform a broad range of statistical tests over participants with and without a disease to investigate the relation between a genome-wide set of genetic alterations and the disease (1). While thousands of SNPs have been successfully identified as associated with diseases (e.g., type 2 diabetes and osteoporosis (2)), GWAS focus on a restricted phenotypic domain defined by the presence or absence of a particular disease, and neglect the benefit of using other phenotypes (including sub-phenotypes, biomarkers, and endophenotypes). As genomic information and investment in large-scale Electronic Health Record (EHR) systems grow, a new trend of linking the rich phenotypic data accessible in EHR systems to genomic data offers a complementary alternative, the phenome-wide association study (PheWAS) (3, 4). PheWAS offers phenome-wide scans for exploring new associations between genotypes and clinical phenotypes and gives insights into underlying biological associations. PheWAS has been effectively employed as a proof of concept to discover both expected associations (3, 5) and possible new associations (6).

The main idea behind PheWAS is to connect genetic data with phenotypic data in a variety of medical systems to discover the genotype-phenotype associations. Integrating heterogeneous medical records with genetic data can provide quantitative measurements (e.g., laboratory tests) along with detailed disease conditions (e.g., diagnosis), which increases power for statistical analysis. However, data heterogeneity is commonly encountered when integrating data from multiple sources thus hindering EHR data usage (7). In addition, PheWAS studies are limited by data biases in single-center research (8, 9). Those issues can be solved by the validation of PheWAS conducted across multiple institutions, which utilize heterogeneous medical records. Population Architecture using Genomics and Epidemiology (PAGE) network is an example of such endeavors (5, 8).

A standard input data format, together with a clear standardized definition of what data is present and what values are acceptable, offers plug-and-play functionality for conducting PheWAS studies: it enables replication across institutions as long as the data is generated in that standardized format. The model-driven approach for standardizing phenotypic data has been increasingly adopted by the cancer research informatics community. For example, cancer profiles for clinical applications, such as breast cancer, colorectal cancer, and prostate cancer, have been developed by the Clinical Data Interchange Standards Consortium (CDISC) (10), the Clinical Information Modeling Initiative (CIMI) (11), and the Royal College of Pathologists of Australasia (RCPA) (12). Nevertheless, a profile that represents patient genotypes (e.g., from the genetic report) and phenotypes (e.g., from lab tests and diagnoses) for a PheWAS study has yet to be studied. Furthermore, to support precision medicine, promoting standards that facilitate the exchange of data, such as clinical and genomic information, between parties is a primary focus of healthcare communities such as HL7 Clinical Genomics (13). There are many standardized data models, including Informatics for Integrating Biology and the Bedside (i2b2) (14), the National Quality Forum (NQF) Quality Data Model (QDM) (15), the OHDSI Common Data Model (CDM) (16), and the HL7 Consolidated Clinical Document Architecture (CDA) (17). The Fast Healthcare Interoperability Resources (FHIR) standard enables the rapid exchange of EHR data and is considered a next-generation standards framework (18). FHIR is widely adopted by the major modern EHR vendors (e.g., Epic) and healthcare providers (e.g., Mayo Clinic and Intermountain Healthcare) (19).

By creating a standardized data model profile to facilitate the PheWAS study across diverse research institutions, we proposed a framework to populate the data of a PheWAS in a standard manner. We developed a FHIR-based data model profile to enable the standard representation of the data elements used in PheWAS studies. As a proof of concept, we implemented the proposed method and the data model profile using 1,595 genetic reports from Foundation Medicine as well as the corresponding lab tests and diagnosis codes from Mayo Clinic’s clinical data warehouse known as the Unified Data Platform (UDP). An aggregated PheWAS study was conducted based on the data matrices generated from the proposed data profile, and 81 significant genotype-phenotype associations were identified, of which 36 significant associations for cancers were validated based on a literature review.

Methods

The framework is designed to enable the results of PheWAS to be validated across multiple organizations via the adoption of a FHIR-based data profile. There are four main modules in the framework, as shown in Figure 1: 1) data preparation and preprocessing, where the data is generated from diverse EHR systems and databases, 2) FHIR-based data model profiling, where a FHIR-based profile is developed based on the PheWAS study criteria, 3) mapping to the FHIR-based data profile, where the local data schema is mapped to the standardized profile, and 4) data population and PheWAS, where the local data is used to populate the matrices for PheWAS based on the FHIR-based profile.

Figure 1.

The framework of FHIR-based PheWAS study

Data preparation and preprocessing

Two sources, genetic reports and EHR data, are used in this study. For the genetic reports, we utilized the 1,595 reports generated by Foundation Medicine, whose clinically available test provides actionable information based on the individual genomic profile of each patient’s cancer. Every test result provides microsatellite instability (MSI) and tumor mutational burden (TMB) to assist immunotherapy decisions. For the diagnoses and lab tests, we extracted the EHR data from Mayo Clinic’s UDP (20). The UDP is a clinical data warehouse that provides a combined view of heterogeneous data across multiple databases, e.g., the Epic-based EHR. To integrate genetic reports and EHR data, we mapped the patients based on three data elements: 1) patient clinic number, 2) names (first and last name), and 3) date of birth (DOB). In practice, if only the names and DOB were matched for a patient, a manual review was conducted to ensure accurate mapping. Even though such a method provides mappings with high precision and recall, it may not be feasible on larger datasets. A customized matching strategy that uses more features, such as race, sex, and zip code, may provide an automated solution with an acceptable recall rate in other cases. The report issue time was also recorded for the population of the matrices described in the “Data population and PheWAS” section. The diagnoses and lab tests were extracted from the EHR for the mapped patients. For the diagnoses, all diagnosed diseases across the entirety of visits were collected. The diseases were encoded with International Classification of Disease (ICD-9/10) codes and phecodes (3), a custom grouping of ICD9 billing codes that approximates the clinical disease phenome. Similarly, lab test records were collected from all visits. The Logical Observation Identifiers Names and Codes (LOINC) vocabulary was adopted to encode the lab test items. The values were normalized to remove noise, e.g., “Neg”, “N”, and “Negative” are all represented as “negative”. The top 10 elements in each dataset can be found in Table 1.
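The linkage rule and value normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Patient` record, the `match_status` labels, and the synonym table are hypothetical, and the handling of a clinic-number match without a name and DOB match is an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table; the text only gives the "Neg"/"N"/"Negative" example.
VALUE_SYNONYMS = {"neg": "negative", "n": "negative", "negative": "negative"}


def normalize_lab_value(raw: str) -> str:
    """Collapse known spelling variants of a categorical result to one label."""
    cleaned = raw.strip()
    return VALUE_SYNONYMS.get(cleaned.lower(), cleaned)


@dataclass(frozen=True)
class Patient:
    clinic_number: Optional[str]
    first_name: str
    last_name: str
    dob: str  # ISO date string, e.g. "1970-01-01"


def match_status(report: Patient, ehr: Patient) -> str:
    """Classify a genetic-report / EHR pair using the three linkage elements."""
    same_name_dob = (
        report.first_name.lower() == ehr.first_name.lower()
        and report.last_name.lower() == ehr.last_name.lower()
        and report.dob == ehr.dob
    )
    same_number = (
        report.clinic_number is not None
        and report.clinic_number == ehr.clinic_number
    )
    if same_number and same_name_dob:
        return "matched"
    if same_name_dob:
        return "manual_review"  # names and DOB only: flagged for a human check
    return "unmatched"  # assumption: anything else is not linked automatically
```

Pairs flagged as `manual_review` correspond to the cases the text says were checked by hand.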

Table 1.

Distribution of the top 10 elements in each report

| ID | Gene | # Records (%) | Phecode | Diagnosis description | # Records (%) | LOINC | Lab test description | # Records (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | TP53 | 818 (51.29%) | 198 | Secondary malignant neoplasm | 512 (32.10%) | 777-3 | Platelets [#/volume] in Blood by Automated count | 931 (58.37%) |
| 2 | KRAS | 389 (24.39%) | 401 | Hypertension | 381 (23.89%) | 2160-0 | Creatinine [Mass/volume] in Serum or Plasma | 928 (58.18%) |
| 3 | CDKN2A/B | 194 (12.16%) | 401.1 | Essential hypertension | 373 (23.39%) | 4544-3 | Hematocrit [Volume Fraction] of Blood by Automated count | 927 (58.12%) |
| 4 | PIK3CA | 165 (10.34%) | 272 | Disorders of lipoid metabolism | 343 (21.50%) | 6690-2 | Leukocytes [#/volume] in Blood by Automated count | 927 (58.12%) |
| 5 | APC | 155 (9.72%) | 272.1 | Hyperlipidemia | 341 (21.38%) | 787-2 | MCV [Entitic volume] by Automated count | 927 (58.12%) |
| 6 | PTEN | 143 (8.97%) | 285 | Other anemias | 317 (19.87%) | 718-7 | Hemoglobin [Mass/volume] in Blood | 927 (58.12%) |
| 7 | ARID1A | | 427 | Cardiac dysrhythmias | 305 (19.12%) | 788-0 | Erythrocyte distribution width [Ratio] by Automated count | 927 (58.12%) |
| 8 | CDKN2A | 126 (7.90%) | 512 | Other symptoms of respiratory system | 273 (17.12%) | 789-8 | Erythrocytes [#/volume] in Blood by Automated count | 927 (58.12%) |
| 9 | RB1 | 120 (7.52%) | 276 | Disorders of fluid | 269 (16.87%) | 742-7 | Monocytes [#/volume] in Blood by Automated count | 923 (57.87%) |
| 10 | TERT | 111 (6.96%) | 198.1 | Secondary malignancy of lymph nodes | 269 (16.87%) | 751-8 | Neutrophils [#/volume] in Blood by Automated count | 923 (57.87%) |

FHIR-based data model profile

We proposed a data model profile using the FHIR modeling mechanism to standardize the representation of genetic reports, lab test results, and diagnoses for the PheWAS study. The genetic entries were modeled by the developed profile “PheWASGeneticReport”. This profile was derived from the existing profile “Observation-genetics”, which is based on the resource “Observation”. “PheWASGeneticReport” models observations about a mutated gene based on the extension “Observation-geneticsGene”. The alterations were modeled with the extension “Observation-geneticsVariant”, where three types of alterations were encoded: 1) disease-relevant genomic alteration, 2) Variants of Unknown Significance (VUS), and 3) disease-relevant gene with no reportable alterations identified. The lab test entries were modeled by the developed profile “PheWASLabTest”, based on the resource “Observation”. “PheWASLabTest” constrains ‘code’ and ‘value’ to model the test items. A test item was encoded with LOINC using one of the three scale types used in PheWAS: numeric, ordinal, and categorical. Of note, for the Quantitative (Qn) type in LOINC, relational operators (e.g., <, >, and =) were removed so that the values became numeric or categorized (e.g., 1-10). We defined standard units of measurement for normalizing numeric values; however, selecting custom units for localized studies is also supported. If the results are the only element that requires validation across different organizations, then, from a statistical perspective, using consistent units of measurement within the organization is sufficient for conducting a localized PheWAS study. The types Narrative (Nar), Multi, Document (Doc), and Set were ignored. We modeled the diagnosis entries with multiple diagnosis reports across the entirety of visits. While the lab test entries only modeled the latest visit relative to the issue date of the genetic report, the profile could be easily extended by adding more elements to represent multiple values, e.g., minimum, maximum, mean, and median, for the entries from multiple visits. The diagnosis and lab test values for each visit can also be recorded based on the same extension mechanism.
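The treatment of Quantitative (Qn) values can be illustrated with a small sketch. The function name and the exact set of operators handled are assumptions for illustration; the text states only that relational operators were removed so that a value becomes numeric or a category such as a range.

```python
import re
from typing import Union

# Leading relational operators to strip, longest alternatives first.
_OPERATOR_PREFIX = re.compile(r"^\s*(<=|>=|<|>|=)\s*")


def parse_qn_value(raw: str) -> Union[float, str]:
    """Drop a leading relational operator, then return a number if the
    remainder parses as one, otherwise keep it as a category label."""
    stripped = _OPERATOR_PREFIX.sub("", raw).strip()
    try:
        return float(stripped)
    except ValueError:
        return stripped  # e.g. a range such as "1-10" stays categorical


print(parse_qn_value("<5"))      # 5.0
print(parse_qn_value(">= 3.2"))  # 3.2
print(parse_qn_value("1-10"))    # 1-10
```

Note that stripping the operator discards censoring information (a result reported as "<5" is treated as 5), which is a simplification inherent to this approach.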

The diagnosis entries were modeled by “Condition” with a developed extension, “PheWASDiagnosis”, where each disease was encoded with ICD 9/10 and Phecode vocabularies along with a frequency counter. The logical model in UML can be found in Figure 2.
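To make the profile description more concrete, the sketch below builds FHIR-style JSON for one lab test and one diagnosis as plain Python dictionaries. It is an illustration only: the profile base URL, the sub-extension names `phecode` and `count`, and the sample values are hypothetical, and the published profile in the project repository may structure these elements differently. The `http://loinc.org` and `http://hl7.org/fhir/sid/icd-10-cm` code system URIs are the standard FHIR identifiers.

```python
import json

# Hypothetical canonical base for the PheWAS profiles and extensions.
PROFILE_BASE = "http://example.org/fhir/StructureDefinition"


def lab_test_observation(patient_id, loinc_code, display, value, unit):
    """An Observation claiming conformance to the PheWASLabTest profile."""
    return {
        "resourceType": "Observation",
        "meta": {"profile": [f"{PROFILE_BASE}/PheWASLabTest"]},
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org",
                             "code": loinc_code, "display": display}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": value, "unit": unit},
    }


def diagnosis_condition(patient_id, icd_system, icd_code, phecode, count):
    """A Condition carrying a PheWASDiagnosis extension (phecode + counter)."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {"coding": [{"system": icd_system, "code": icd_code}]},
        "extension": [{
            "url": f"{PROFILE_BASE}/PheWASDiagnosis",
            "extension": [  # sub-extension names are assumptions
                {"url": "phecode", "valueString": phecode},
                {"url": "count", "valueInteger": count},
            ],
        }],
    }


# Illustrative values only, not taken from the study data.
obs = lab_test_observation(
    "example-1", "777-3",
    "Platelets [#/volume] in Blood by Automated count", 250.0, "10*3/uL")
cond = diagnosis_condition(
    "example-1", "http://hl7.org/fhir/sid/icd-10-cm", "I10", "401.1", 3)
print(json.dumps(obs, indent=2))
```

Keeping the phecode and frequency counter inside a single complex extension mirrors the description of “PheWASDiagnosis” as one extension on “Condition”.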

Figure 2.

FHIR-based PheWAS data model profile based on FHIR resources

In practice, we used FHIR Release 4 (R4) (21) for laying out the model elements. In the absence of a UML symbol for the profile, we reused the generalization symbol (inheritance) and distinguished profiles from extensions by showing the class namespaces. The PheWAS model extensions and profiles were created using the Forge editor (22). The UML Model was put together by manually extending and profiling the imported FHIR entities. A detailed model report document and its web rendering are available at (https://github.com/BD2KOnFHIR/phewas-on-fhir).

Mapping to FHIR-based data model profile

To populate the data for PheWAS with FHIR resources, we established a mapping between the FHIR resources and the local data, as shown in Figure 3. For the general information, “identifiers”, “status”, “subjects”, “cohort”, and “reported” were mapped to the corresponding elements in “Condition”. Each genetic report of a patient was mapped to the profile “PheWASGeneticReport”. Specifically, the item “tumorType” was mapped to “bodySite” and the remaining items were mapped to the corresponding defined extensions. Of note, unlike the cardinality of “gene” (1..*) and “variant” (1..*) in the genetic report, the corresponding elements of “PheWASGeneticReport” have a cardinality of “0..1” in “Observation”; therefore, one data entry of the genetic report is represented by multiple data entries based on “PheWASGeneticReport”. The lab test entries were mapped to “PheWASLabTest”, where “code”, “value”, and “unit” were mapped correspondingly. The diagnosis entries were mapped to “PheWASDiagnosis”, where “Count” and “ICD9”/“Phecode” were mapped correspondingly.
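The cardinality mismatch described above means one local report fans out into several profile instances. A minimal sketch of that step follows; the local record layout (`patient_id`, `genes`, `variants`) and the output field names are hypothetical, and emitting one entry per gene-variant pair is an assumption, since the text states only that one report becomes multiple entries.

```python
from typing import Dict, List


def split_genetic_report(report: Dict) -> List[Dict]:
    """Flatten a report with many genes and variants into single-gene,
    single-variant entries shaped after PheWASGeneticReport."""
    entries = []
    for gene in report["genes"]:
        for variant in gene["variants"]:
            entries.append({
                "profile": "PheWASGeneticReport",
                "subject": report["patient_id"],
                "bodySite": report["tumorType"],  # tumorType maps to bodySite
                "gene": gene["name"],
                "variant": variant["name"],
                "alterationType": variant["type"],
            })
    return entries


# Illustrative report, not taken from the study data.
example_report = {
    "patient_id": "example-1",
    "tumorType": "colon",
    "genes": [
        {"name": "KRAS", "variants": [{"name": "G12D", "type": "alteration"}]},
        {"name": "APC", "variants": [{"name": "R1450*", "type": "alteration"},
                                     {"name": "E1317Q", "type": "VUS"}]},
    ],
}
for entry in split_genetic_report(example_report):
    print(entry["gene"], entry["variant"], entry["alterationType"])
```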

Figure 3.

The mapping between PheWAS patient profile and the FHIR-based PheWAS data model profile. The elements of PheWAS patient profile are in green. The FHIR resources are represented in purple and the items are in blue. The newly generated items based on FHIR “Extension” are in yellow

Data population and PheWAS

Three matrices were populated based on the patient profile modeled in the “FHIR-based data model profile” section: patient-genetic, patient-lab test, and patient-diagnosis. To form the matrices, the elements in the diagnoses and lab tests (i.e., diseases for the patient-diagnosis matrix and test items for the patient-lab test matrix) and the elements in the genetic report (i.e., genes for the patient-genetic matrix) were extracted as the columns. Each patient record was considered a row in the matrices. For a patient in the patient-genetic and patient-diagnosis matrices, each cell indicates the presence or absence of the reported gene variant or disease diagnosis. For a patient in the patient-lab test matrix, each cell holds the value of the corresponding lab test.
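The matrix construction can be sketched in a few lines of standard-library Python. The input layout (dictionaries keyed by patient identifier) and the function names are assumptions for illustration; missing lab values are represented as `None`, a choice the text does not specify.

```python
from typing import Dict, List, Optional, Set, Tuple


def presence_matrix(
    records: Dict[str, Set[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Patient x item matrix of 1/0 flags (genes or phecodes as columns)."""
    rows = sorted(records)
    columns = sorted(set().union(*records.values()))
    matrix = [[1 if col in records[pid] else 0 for col in columns]
              for pid in rows]
    return rows, columns, matrix


def value_matrix(
    records: Dict[str, Dict[str, float]]
) -> Tuple[List[str], List[str], List[List[Optional[float]]]]:
    """Patient x lab test matrix of values (LOINC codes as columns)."""
    rows = sorted(records)
    columns = sorted({code for values in records.values() for code in values})
    matrix = [[records[pid].get(col) for col in columns] for pid in rows]
    return rows, columns, matrix


# Illustrative inputs, not taken from the study data.
genes = {"p1": {"TP53", "KRAS"}, "p2": {"KRAS"}, "p3": set()}
labs = {"p1": {"777-3": 250.0}, "p2": {"777-3": 180.0, "718-7": 13.1}}

print(presence_matrix(genes))
print(value_matrix(labs))
```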

We conducted two kinds of tests, one for each pairing of the patient-genetic matrix with a phenotype matrix, in an aggregated PheWAS for gene mutations. For the patient-diagnosis cohorts, a case was a patient record with a valid phecode, while all other records were labeled as controls. We calculated the allelic p-value for cases versus controls using the chi-square distribution. We selected only those phecodes that occurred in a minimum of 10 cases as a threshold of clinical interest. For the patient-lab test cohorts, since all the lab test variables in this study are numerical, we conducted the Kolmogorov–Smirnov (KS) test on the value distributions for each gene-lab test pair. If the observed cell counts for a lab test fell below a 10% threshold, the lab test was filtered out of the study. Since the conventional Bonferroni correction is considered conservative for PheWAS (3, 4, 23), we adjusted all the p-values by FDR (24).
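The statistical steps above can be sketched with the standard library alone. Several choices here are assumptions rather than details confirmed by the text: the chi-square test is the Pearson test on a 2x2 table without continuity correction, the FDR adjustment is taken to be the Benjamini-Hochberg procedure, and only the two-sample KS statistic is computed (a p-value would normally come from a statistics library such as SciPy's `scipy.stats.ks_2samp`).

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple

MIN_CASES = 10  # threshold of clinical interest stated in the text


def has_enough_cases(mutated_cases: int, other_cases: int) -> bool:
    """Keep a phecode only if it occurs in at least MIN_CASES patients."""
    return mutated_cases + other_cases >= MIN_CASES


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square for [[a, b], [c, d]] =
    [[mutated cases, mutated controls], [other cases, other controls]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: treated as no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def ks_statistic(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in sorted(set(xs) | set(ys)):
        f_x = bisect_right(xs, v) / len(xs)
        f_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(f_x - f_y))
    return gap


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """FDR-adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative counts, not taken from the study data.
chi2, p = chi_square_2x2(20, 5, 5, 20)
print(round(chi2, 3), p < 0.001)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In a full analysis, one p-value per gene-phecode or gene-lab test pair would be collected first and the adjustment applied once across all of them.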

In addition, we also conducted a literature review to validate whether a significant association identified from the PheWAS studies was a known association or not.

Experiment and Results

Based on the two kinds of data matrices, we conducted two PheWAS studies, gene vs. lab test and gene vs. phecode, which revealed well-established associations with significant p-values. As shown in Figure 4, four associations were identified between the genes CDKN2A/B, CDKN2A, TERT, and STK11 and the lab tests. CDKN2A/B was found to be significantly related to blood monocyte count (p-value=0.0269 in Figure 4), as it is related to the regulation of monocyte–macrophage function. For example, the CDKN2A/B locus acts as a modifier of atherosclerosis. Atherogenesis is enhanced by the transplantation of heterozygous CDKN2A-deficient bone marrow, which increases the circulation of pro-inflammatory Ly6Chi monocytes and the proliferation of peritoneal monocytes/macrophages (25, 26). CDKN2A is also responsible for the activation of the D-CDK4/6-INK4-Rb pathway. The pharmacodynamic decrease in neutrophil counts is related to increased exposure to palbociclib, which is a treatment for CDKN2A mutation (27). Our study showed a correlation between CDKN2A and neutrophil counts in blood (p-value=0.003). In addition, neutrophil counts are also correlated with STK11 (p-value=0.0269), a commonly inactivated tumor suppressor gene. Genetic ablation of STK11/LKB1 results in the accumulation of neutrophils in non-small cell lung cancer (NSCLC) (28, 29). Myeloproliferative neoplasms (MPN) are a group of diseases that produce excess cells in the bone marrow. They can develop into myelodysplastic syndrome and acute myeloid leukemia. We showed a correlation between TERT and erythrocyte count in blood (p-value=0.0379), where TERT mutations increase the proliferation of common myeloid progenitors and thereby affect hematopoiesis (30, 31).

Figure 4.

Heatmap of the correlation of genes and lab tests

We identified 58 significant associations between the genes and phecodes, shown in Figure 5. The top five genes with the most phecode-related associations are KRAS (9 associations), STK11 (7 associations), TP53 (5 associations), APC (5 associations), and BRAF (5 associations). The number of associations per gene is detailed in Figure 5. We further validated 32 associations for the genes related to carcinomas by a literature review, in which KRAS, APC, and TP53 are the genes with the most cancer-related associations. The rest of the associations, which are potentially novel, are listed in Table 3.

Figure 5.

Heatmap of the correlation of genes and diagnosis (phecode)

Table 3.

Potential correlations without validation

| Gene | Diagnosis | P-value | Gene | Diagnosis | P-value |
|---|---|---|---|---|---|
| APC | Ileostomy status | 5.49E-05 | MLL2 | Bone marrow or stem cell transplant | 1.70E-02 |
| BRAF | Diseases of the larynx and vocal cords | 9.87E-06 | MLL2 | Herpes simplex | 1.70E-02 |
| BRAF | Nontoxic multinodular goiter | 2.30E-05 | MLL2 | Non-Hodgkins lymphoma | 1.21E-12 |
| BRAF | Nontoxic nodular goiter | 2.04E-04 | STK11 | Chronic airway obstruction | 1.65E-02 |
| BRAF | Secondary hypothyroidism | 9.55E-05 | STK11 | Degenerative skin conditions and other dermatoses | 1.60E-02 |
| CCND1 | Acquired absence of breast | 3.69E-03 | STK11 | Dizziness and giddiness (Light-headedness and vertigo) | 3.48E-02 |
| CCNE1 | Cancer of other female genital organs | 1.02E-06 | STK11 | Emphysema | 3.74E-03 |
| EGFR | Postmenopausal atrophic vaginitis | 4.86E-02 | STK11 | Keratoderma, acquired | 8.44E-03 |
| KRAS | Diseases of pancreas | 3.95E-04 | STK11 | Peripheral or central vertigo | 5.47E-05 |
| KRAS | Obstruction of bile duct | 1.05E-03 | STK11 | Vertiginous syndromes and other disorders of vestibular system | 1.78E-03 |
| KRAS | Other biliary tract disease | 5.87E-04 | TERT | Nontoxic multinodular goiter | 9.75E-05 |
| KRAS | Other disorders of biliary tract | 2.59E-02 | TERT | Secondary hypothyroidism | 7.77E-06 |
| MYC | Encounter for long-term (current) use of antibiotics | 1.04E-02 | TERT | Swelling, mass, or lump in head and neck [Space-occupying lesion, intracranial NOS] | 1.45E-02 |

In its normal function, KRAS acts as a switch for cell signaling and controls cell proliferation. However, when KRAS mutates, it disrupts negative signaling and causes cells to proliferate and grow into cancer. The effect of KRAS mutations depends on the order of the mutations. If the KRAS mutation occurs after an APC mutation, it often develops into cancers such as colorectal (p-value=6.62E-07), colon (p-value=1.65E-04), and rectum cancers (p-value=4.51E-04) (32, 33).

In addition, somatic KRAS mutations are commonly found in pancreatic cancer (p-value=4.51E-04) (9.57E-16) (34). According to a previous study (35), the most frequent metastatic sites of 468 lung adenocarcinoma patients were lung (45.6%), bone (26.2%) (p-value=1.12E-02), adrenal gland (17.4%), brain (16.8%), pleura (15.6%), and liver (11%). APC is regarded as a tumor suppressor gene, as it can prevent the uncontrolled cell growth that leads to cancerous tumors. The protein encoded by the APC gene plays a key role in determining whether or not a cell will grow into a tumor. APC mutation and inactivation is a critical event in malignant rectum neoplasm (p-value=6.9E-23), colorectal (p-value=2.04E-44), and colon tumorigenesis (p-value=1.24E-29) (36-38). In addition, liver metastasis (p-value=1.57E-04) could also be caused by combinations including APC, KRAS, and TGFB2 mutations (39). TP53 encodes the tumor protein p53, which is critical for tumor suppression in multicellular organisms. TP53 is the most prevalent mutated gene in human cancers (> 50%), implying that the TP53 gene plays a vital role in the prevention of cancer formation. The accumulation of genetic mutations in driver genes contributes to colorectal cancer development and malignant progression. APC, KRAS, and TP53 (p-value= 2.6 2E-) are often observed as driver genes. TP53 mutations, for instance, are found in 60% of colorectal cancers (40, 41). TP53 is also regarded as a significant genetic variant of human ovarian epithelial and genital cancer (p-value=1.45E-02 for malignant neoplasm of the ovary, p-value=1.03E-02 for malignant neoplasm of the ovary and other uterine adnexa, and p-value=3.47E-03 for cancer of other female genital organs) (42-44). TP53 is also implicated in multiple cancer metastases, such as metastasis of gastrointestinal cancer (p-value=1.45E-02 for secondary malignant neoplasm of digestive systems) (45).

Discussion and conclusion

To facilitate the validation of PheWAS-based studies across different research organizations, we proposed a FHIR-based PheWAS data model profile to enable the standard representation of the data elements from genetic reports and EHR data that are used in a PheWAS study. A framework to automate the data population for PheWAS was introduced. As a proof of concept, we implemented the proposed method based on 1,595 genetic reports from FoundationOne CDx as well as the corresponding lab tests and diagnoses from Mayo Clinic’s UDP. A PheWAS study was conducted and 81 significant genotype-phenotype associations were obtained. We validated 36 significant associations for cancers based on a literature review.

This study makes several contributions. First, we demonstrated that it is feasible to represent PheWAS study data using FHIR. To the best of our knowledge, this is the first study to apply standardization to model a PheWAS study, with the ultimate goal of facilitating cross-validation of PheWAS studies. Second, we developed a FHIR-based data model profile that represents the data elements needed for PheWAS. The model uses the resources and profiles from FHIR, for which a number of open-source validation mechanisms and tools, such as the FHIR specification and implementation guides, are supported by the FHIR community for ensuring data quality. Our data model profile adopts the FHIR specifications to enable the modification of constraints and rules to accommodate real data, so it can be easily adapted and extended.

There are a number of limitations in this study that we would like to tackle in future work. First, the genetic reports are from FoundationOne CDx. As FoundationOne did not provide information on whether the data was generated from tumor or normal samples, we were unable to separate germline mutations from somatic mutations, whose values differ between initial diagnosis and progression and are critical for cancer studies. The failure to capture these differences in the genetic data weakens our contribution to cancer studies. In addition, since the datasets used in this study are from multiple sources, an extra mapping effort between the proposed data profile and the schema of the local datasets is needed to enable the data population. Nevertheless, with the FHIR-based APIs under development through the HL7 Argonaut project (46), such mapping between the local data and the proposed data profile will no longer be needed, which greatly promotes the flexibility and adaptability of the proposed framework. Second, by further exploring dependent phenotypes related to the same genetic alteration (e.g., KRAS and colorectal cancer vs. KRAS and malignant neoplasm of rectum, rectosigmoid junction, and anus), we noticed the limitations of testing individual genotype-phenotype associations without taking phenotype dependence and ontology structure into consideration. The biological classification of the phenotypes reveals hidden connections between cancers. Therefore, a more sophisticated PheWAS methodology can be designed to leverage the structure of genetic and phenotype ontologies to enhance the power of discovery. In addition, mapping lab tests in LOINC codes to the more comprehensive Human Phenotype Ontology (HPO) (47, 48) will greatly advance the design of a refined methodology for PheWAS. Third, the ultimate objective of the proposed method is to facilitate validation of PheWAS studies across multiple organizations. Due to a lack of resources, we could not demonstrate this use case in this study. We plan to reach out to other institutions or research networks (e.g., the eMERGE research network (49)) to conduct PheWAS studies based on the proposed framework for a comprehensive evaluation. In addition, the potential new associations identified (see Table 3) can be validated further. Fourth, this study only considers the values of the diagnosis and lab test for the most recent visit. However, since the diagnosis and lab test values may change over time, the temporal aspect needs to be taken into consideration to build more sophisticated models that enable unconfounded findings. Our future work will develop such a model based on the proposed framework. Fifth, our PheWAS study is carried out on common variants. It will be valuable to explore how the proposed FHIR-based framework will perform with common variants and non-cancer phenotypes in our future work. Lastly, FHIR was selected as the data standard underlying the proposed method for two major reasons: 1) FHIR is widely adopted among modern EHR vendors and data providers and can be easily adopted, so representing data from the original sources with the proposed data model requires less Extract, Transform, Load (ETL) effort; and 2) FHIR is not a data standardization model for data storage and management but rather a data communication method for efficiently exchanging medical data among organizations, which fits the original purpose of facilitating the validation of PheWAS studies across different organizations. Despite these benefits of adopting FHIR for PheWAS, we agree that adopting other standardization data models, such as the OHDSI CDM, can be necessary in cases where a data storage and management system is needed for long-term scientific needs. For such cases, comprehensive data modeling strategies based on diverse standardization models require further study.

Acknowledgments

This study is supported by the NIH BD2K grant U01 HG009450 and the Center for Individualized Medicine at Mayo Clinic.

References

  • 1.Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T. Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to myocardial infarction. Nature genetics. 2002;32(4):650. doi: 10.1038/ng1047. [DOI] [PubMed] [Google Scholar]
  • 2.Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences. 2009;106(23):9362–7. doi: 10.1073/pnas.0903103106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26(9):1205–10. doi: 10.1093/bioinformatics/btq126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Hebbring SJ. The challenges, advantages and future of phenome‐wide association studies. Immunology. 2014;141(2):157–65. doi: 10.1111/imm.12195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Pendergrass S, Brown‐Gentry K, Dudek S, Torstenson E, Ambite J, Avery C. The use of phenome‐ wide association studies (PheWAS) for exploration of novel genotype‐phenotype relationships and pleiotropy discovery. Genetic epidemiology. 2011;35(5):410–22. doi: 10.1002/gepi.20589. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hebbring SJ, Schrodi SJ, Ye Z, Zhou Z, Page D, Brilliant MH. A PheWAS approach in studying HLA-DRB1* 1501. Genes and immunity. 2013;14(3):187. doi: 10.1038/gene.2013.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lee C, Luo Z, Ngiam KY, Zhang M, Zheng K, Chen G. Big healthcare data analytics: Challenges and applications. Handbook of Large-Scale Distributed Computing in Smart Healthcare: Springer. 2017:11–41. [Google Scholar]
  • 8.Pendergrass SA, Brown-Gentry K, Dudek S, Frase A, Torstenson ES, Goodloe R. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS genetics. 2013;9(1):e1003087. doi: 10.1371/journal.pgen.1003087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Diogo D, Tian C, Franklin CS, Alanne-Kinnunen M, March M, Spencer CCA. Phenome-wide association studies across large population cohorts support drug target validation. Nature Communications. 2018;9(1):4285. doi: 10.1038/s41467-018-06540-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. CDISC Published User Guides: CDISC. 2019 [Available from: https://www.cdisc.org/standards/therapeutic-areas]
  • 11. HL7 FHIR Implementation Guide: Breast Cancer Data, Release 1 - US Realm (Draft for Comment 2) HL7 FHIR. 2019 [Available from: http://build.fhir.org/ig/HL7/us-breastcancer/]
  • 12. HL7 Australia Implementation Guide. HL7 FHIR. 2014 [Available from: http://fhir.hl7.org.au/fhir/rcpa/index.html]
  • 13. HL7 Clinical Genomics. HL7 International. 2007-2019 [Available from: http://www.hl7.org/special/committees/clingenomics/]
  • 14. Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. Journal of the American Medical Informatics Association. 2011;19(2):181–5. doi: 10.1136/amiajnl-2011-000492.
  • 15. Thompson WK, Rasmussen LV, Pacheco JA, Peissig PL, Denny JC, Kho AN. An evaluation of the NQF Quality Data Model for representing Electronic Health Record driven phenotyping algorithms. AMIA Annual Symposium Proceedings; 2012: American Medical Informatics Association.
  • 16. Stang PE, Ryan PB, Racoosin JA, Overhage JM, Hartzema AG, Reich C. Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Annals of internal medicine. 2010;153(9):600–6. doi: 10.7326/0003-4819-153-9-201011020-00010.
  • 17. Dolin RH, Alschuler L, Beebe C, Biron PV, Boyer SL, Essin D. The HL7 clinical document architecture. Journal of the American Medical Informatics Association. 2001;8(6):552–69. doi: 10.1136/jamia.2001.0080552.
  • 18. Bender D, Sartipi K. HL7 FHIR: An Agile and RESTful approach to healthcare information exchange. Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems. 2013 IEEE.
  • 19. HL7 FHIR Argonaut Project. HL7 International. 2019 [Available from: https://argonautwiki.hl7.org/Main_Page]
  • 20. Kaggal VC, Elayavilli RK, Mehrabi S, Pankratz JJ, Sohn S, Wang Y. Toward a learning health-care system–knowledge delivery at the point of care empowered by big data and NLP. Biomedical informatics insights. 2016;8(BII):S37977. doi: 10.4137/BII.S37977.
  • 21. HL7.org. HL7 FHIR R4. 2018 [Available from: http://hl7.org/fhir/R4/]
  • 22. FORGE
  • 23. Verma A, Lucas A, Verma SS, Zhang Y, Josyula N, Khan A. PheWAS and beyond: the landscape of associations with medical diagnoses and clinical measures across 38,662 individuals from Geisinger. The American Journal of Human Genetics. 2018;102(4):592–608. doi: 10.1016/j.ajhg.2018.02.017.
  • 24. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological). 1995;57(1):289–300.
  • 25. Hannou SA, Wouters K, Paumelle R, Staels B. Functional genomics of the CDKN2A/B locus in cardiovascular and metabolic disease: what have we learned from GWASs? Trends in Endocrinology & Metabolism. 2015;26(4):176–84. doi: 10.1016/j.tem.2015.01.008.
  • 26. Fuster JJ, Molina-Sánchez P, Jovaní D, Vinué Á, Serrano M, Andrés V. Increased gene dosage of the Ink4/Arf locus does not attenuate atherosclerosis development in hypercholesterolaemic mice. Atherosclerosis. 2012;221(1):98–105. doi: 10.1016/j.atherosclerosis.2011.12.013.
  • 27. Hamilton E, Infante JR. Targeting CDK4/6 in patients with cancer. Cancer treatment reviews. 2016;45:129–38. doi: 10.1016/j.ctrv.2016.03.002.
  • 28. Koyama S, Akbay EA, Li YY, Aref AR, Skoulidis F, Herter-Sprie GS. STK11/LKB1 deficiency promotes neutrophil recruitment and proinflammatory cytokine production to suppress T-cell activity in the lung tumor microenvironment. Cancer research. 2016;76(5):999–1008. doi: 10.1158/0008-5472.CAN-15-1439.
  • 29. Zhang H, Brainson CF, Koyama S, Redig AJ, Chen T, Li S. Lkb1 inactivation drives lung cancer lineage switching governed by Polycomb Repressive Complex 2. Nature communications. 2017;8:14922. doi: 10.1038/ncomms14922.
  • 30. Oddsson A, Kristinsson S, Helgason H, Gudbjartsson D, Masson G, Sigurdsson A. The germline sequence variant rs2736100_C in TERT associates with myeloproliferative neoplasms. Leukemia. 2014;28(6):1371. doi: 10.1038/leu.2014.48.
  • 31. Kamatani Y, Matsuda K, Okada Y, Kubo M, Hosono N, Daigo Y. Genome-wide association study of hematological and biochemical traits in a Japanese population. Nature genetics. 2010;42(3):210. doi: 10.1038/ng.531.
  • 32. Yamauchi M, Morikawa T, Kuchiba A, Imamura Y, Qian ZR, Nishihara R. Assessment of colorectal cancer molecular features along bowel subsites challenges the conception of distinct dichotomy of proximal versus distal colorectum. Gut. 2012;61(6):847–54. doi: 10.1136/gutjnl-2011-300865.
  • 33. Rosty C, Young JP, Walsh MD, Clendenning M, Walters RJ, Pearson S. Colorectal carcinomas with KRAS mutation are associated with distinctive morphological and molecular features. Modern Pathology. 2013;26(6):825. doi: 10.1038/modpathol.2012.240.
  • 34. Krasinskas AM, Moser AJ, Saka B, Adsay NV, Chiosea SI. KRAS mutant allele-specific imbalance is associated with worse prognosis in pancreatic cancer and progression to undifferentiated carcinoma of the pancreas. Modern Pathology. 2013;26(10):1346. doi: 10.1038/modpathol.2013.71.
  • 35. Lohinai Z, Klikovits T, Moldvay J, Ostoros G, Raso E, Timar J. KRAS-mutation incidence and prognostic value are metastatic site-specific in lung adenocarcinoma: poor prognosis in patients with KRAS mutation and bone metastasis. Scientific reports. 2017;7:39721. doi: 10.1038/srep39721.
  • 36. Zhang L, Shay JW. Multiple roles of APC and its therapeutic implications in colorectal cancer. JNCI: Journal of the National Cancer Institute. 2017;109(8). doi: 10.1093/jnci/djw332.
  • 37. Schell MJ, Yang M, Teer JK, Lo FY, Madan A, Coppola D. A multigene mutation classification of 468 colorectal cancers reveals a prognostic role for APC. Nature communications. 2016;7:11743. doi: 10.1038/ncomms11743.
  • 38. Kwong LN, Dove WF. APC and its modifiers in colon cancer. APC Proteins: Springer. 2009:85–106. doi: 10.1007/978-1-4419-1145-2_8.
  • 39. Sakai E, Nakayama M, Oshima H, Kouyama Y, Niida A, Fujii S. Combined mutation of Apc, Kras, and Tgfbr2 effectively drives metastasis of intestinal cancer. Cancer research. 2018;78(5):1334–46. doi: 10.1158/0008-5472.CAN-17-3303.
  • 40. Nakayama M, Oshima M. Mutant p53 in colon cancer. Journal of Molecular Cell Biology. 2018;11(4):267–76. doi: 10.1093/jmcb/mjy075.
  • 41. Iacopetta B. TP53 mutation in colorectal cancer. Human mutation. 2003;21(3):271–6. doi: 10.1002/humu.10175.
  • 42. Zhang Y, Cao L, Nguyen D, Lu H. TP53 mutations in epithelial ovarian cancer. Transl Cancer Res. 2016;5(6):650–63. doi: 10.21037/tcr.2016.08.40.
  • 43. Schuijer M, Berns EM. TP53 and ovarian cancer. Human mutation. 2003;21(3):285–91. doi: 10.1002/humu.10181.
  • 44. Costa MJ, Vogelsan J, Young L. p53 gene mutation in female genital tract carcinosarcomas (malignant mixed müllerian tumors): a clinicopathologic study of 74 cases. Modern pathology: an official journal of the United States and Canadian Academy of Pathology, Inc. 1994;7(6):619–27.
  • 45. Powell E, Piwnica-Worms D, Piwnica-Worms H. Contribution of p53 to metastasis. Cancer discovery. 2014;4(4):405–14. doi: 10.1158/2159-8290.CD-13-0136.
  • 46. International HLS. HL7 Argonaut project. 2019 [Available from: https://argonautwiki.hl7.org/Main_Page]
  • 47. Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. The American Journal of Human Genetics. 2008;83(5):610–5. doi: 10.1016/j.ajhg.2008.09.017.
  • 48. Zhang XA, Yates A, Vasilevsky N, Gourdine JP, Callahan TJ, Carmody LC. Semantic integration of clinical laboratory tests from electronic health records for deep phenotyping and biomarker discovery. NPJ Digit Med. 2019;2:32. doi: 10.1038/s41746-019-0110-4.
  • 49. Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, Manolio TA. The electronic medical records and genomics (eMERGE) network: past, present, and future. Genetics in Medicine. 2013;15(10):761. doi: 10.1038/gim.2013.72.

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association
